State Aggregation Considered Necessary

I have been meaning to send a message about this for some time, since I'm partly the source of the counter-opinion about the manageability of per-flow state growth, but I kept putting off sinking the time into it.

I think it is true that the growth in the amount of flow state (for any system using per-flow state, be it RSVP or anything else) in a given switch *will* be bounded by the growth of the bandwidth through the switch, and not by the size of the network as a whole. (This assumes that per-flow bandwidth does not decrease over time, and no, I don't consider most WWW requests to be "flows", or at any rate flows which I would bother setting up state for, so I think this assumption is pretty good.)

So, the amount of memory needed per switch for flow-state will not grow as O(N^2), or anything like that (which is the canard you used to hear). It will only grow as *local* conditions change - which is good, it means you have control over that growth, so that if you have a big fast switch, you need more memory, but the two will grow together.
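To make the bound concrete, here is a minimal sketch. The function name and the numbers are made up for illustration; the only assumption is the one stated above, that every flow worth setting up state for uses at least some minimum bandwidth.

```python
def max_flow_entries(switch_bandwidth_bps: int, min_flow_bandwidth_bps: int) -> int:
    """Upper bound on the number of flow-state entries a switch can need.

    Assumes every flow with state uses at least min_flow_bandwidth_bps.
    Note that the size of the network does not appear anywhere.
    """
    if switch_bandwidth_bps < 0 or min_flow_bandwidth_bps <= 0:
        raise ValueError("bandwidths must be positive")
    return switch_bandwidth_bps // min_flow_bandwidth_bps


# Hypothetical figures: a 64 kbit/s minimum flow through two switch sizes.
small = max_flow_entries(155_000_000, 64_000)
large = max_flow_entries(622_000_000, 64_000)
print(small, large)
```

The bound moves only when the switch's own bandwidth (or the minimum flow size) changes, which is the sense in which the growth is local.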

(It's also the case that memory capacity is growing at a higher rate than speeds are, since memory capacity goes as the inverse square of the device feature size, whereas device speed is generally inversely linear in feature size - another nice feature.)

In realizing all that, though, I missed something very important: what matters is the *per-flow cost* of the state. Although the total *amount* of per-flow memory needed goes up linearly with the amount of traffic, the total *cost* of that memory goes up at a *higher* rate, since faster switches generally need higher-speed state memory, and memory cost is generally somewhat proportional to speed.

In other words, the unit (i.e. per-flow) cost of per-flow memory is higher in high-speed switches than in low-speed switches. This is a negative economy of scale on high-speed switches, which is the inverse of what you want.
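A rough sketch of that effect, with made-up numbers: the fast switch is assumed to carry 10 times the flows, and its state memory is assumed to cost 5 times as much per entry. Neither figure comes from a real product.

```python
def total_state_cost(flows: int, cost_per_entry: float) -> float:
    """Total cost of flow-state memory: one entry per flow."""
    return flows * cost_per_entry


# Hypothetical slow and fast switches (all numbers are illustrative).
slow_flows, slow_entry_cost = 1_000, 1.0
fast_flows, fast_entry_cost = 10_000, 5.0

traffic_ratio = fast_flows / slow_flows
cost_ratio = (total_state_cost(fast_flows, fast_entry_cost)
              / total_state_cost(slow_flows, slow_entry_cost))

# The amount of memory tracks traffic, but the cost grows faster than traffic.
print(f"traffic x{traffic_ratio:g}, state memory cost x{cost_ratio:g}")
```

With these inputs the traffic goes up by 10 while the bill for state memory goes up by 50, so the cost per flow is 5 times worse on the fast switch.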

The implications of this for flow aggregation are pretty obvious. I had previously opposed flow aggregation on the grounds that i) it was extra complexity, and ii) it didn't buy the users any additional capability. (Believe it or not, I don't at all like "kitchen sink" designs, although I expect many people who look at Nimrod don't believe that! :-)

However, that simplistic analysis hits the complex hard rock of reality, which is that you have to have aggregation for economic reasons. (And a tip of the Hatlo hat to the people at BBN who disagreed with me about putting aggregation in Nimrod! You guys were right, I was wrong. So there! :-)

So, in a large network, one using a hierarchy of switch speeds to construct the mesh (as seems the most workable real-world design), you cannot in fact support per-flow state all the way through the network, but need to aggregate flows.

Note that you don't have to have a *lot* of aggregation as you go on up - just enough to keep the unit cost of per-flow memory declining at an appropriate rate.

E.g. (using made-up numbers) if the faster memory for your higher-speed switch is 5 times as expensive, and your faster switch is otherwise 3 times more cost-effective per unit traffic flow, you only need to aggregate by a factor of 15 to match i) the decrease in per-flow unit costs for memory for per-flow state, and ii) the overall decrease in per-flow unit costs elsewhere in the switch.
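The arithmetic behind that example can be sketched as follows. The function names are hypothetical, and the model is the simple one used above: aggregating by a factor A means one state entry is shared by A flows, so the per-flow memory cost is divided by A.

```python
def required_aggregation(memory_cost_ratio: float, switch_efficiency_gain: float) -> float:
    """Aggregation factor at which per-flow memory cost falls as fast as
    the per-flow cost of the rest of the switch.

    Solves: memory_cost_ratio / A == 1 / switch_efficiency_gain
    """
    return memory_cost_ratio * switch_efficiency_gain


def per_flow_memory_cost(base_cost: float, memory_cost_ratio: float,
                         aggregation: float) -> float:
    """Per-flow state cost on the faster switch, relative to base_cost."""
    return base_cost * memory_cost_ratio / aggregation


# The made-up numbers from the text: memory 5x dearer, switch 3x more cost-effective.
factor = required_aggregation(5, 3)
print(factor)
print(per_flow_memory_cost(1.0, 5, 1))       # no aggregation: 5x the per-flow cost
print(per_flow_memory_cost(1.0, 5, factor))  # aggregated: about one third
```

Without aggregation the per-flow memory cost is 5 times higher on the faster switch; aggregating by 15 brings it down to a third of the original, in step with the 3-fold improvement elsewhere in the switch.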

To sum all this up, the issue with growth of state, as routers get larger (and therefore faster), is not simply the *amount* of memory, but also the *unit cost* of that memory. If the amount needed is proportional to the amount of traffic, but the speed (a.k.a. unit cost :-) is larger, you have a negative economy of scale in your routers.

There's an interesting corollary (one I don't have time to explore here) to this notion that you *have* to have aggregation of flows in the network. My initial analysis was all in terms of unicast flows, not multicast. However, the same argument holds for multicast flows.

If we have many small, topologically-widespread, multicast groups, we are going to have a problem with multicast state growth (for things like routing information *as well as* RSVP) in the core routers. So, we are going to have to figure out how to aggregate multicast flows (i.e. groups, since all groups have per-group state in current multicast routing) before multicast will fully scale.
