PC Hardware for High-Performance Network Experiments

Last Updated Jan 17, 1997. Still some obsolete information, though.

This page holds a collection of opinions, and the occasional fact, developed while trying to convince some high-end PC's to be useful as networking research platforms. Our experiments cover PC's used both as hosts and IP routers. For hosts, our interest is primarily in human-communication capabilities - video, audio, and interactive graphics, rather than the more traditional data throughput measures. Our Fast PC Routers are experimental platforms designed for whizzo performance and flexibility in an experimental testbed environment.

This page is colored by our using BSD-derived Unix as the basic operating system environment. Some things would look very different from a Windows perspective. More on that at a later date.

General Information

Motherboards and PCI Chipsets

ATM Interfaces

Fast Ethernet Controllers

T1 Interface Cards

Display (Graphics) Cards

Video Frame Capture Cards

Audio Input and Output

Configurations

Motherboards and PCI Chipsets

The advent of the PCI bus has finally brought reasonable I/O capabilities to the PC world. Unfortunately, your choice of motherboard and PCI chipset can make a huge difference in performance for typical networked applications. This is doubly true because the "PCI chipset" manages the CPU's connection with system memory, as well as controlling and arbitrating the PCI bus.

Performance Factors

The first thing everyone hears about the PCI bus is that it has 132MB/s of I/O bandwidth. This fact is of almost no relevance to network system designers. The performance actually achieved by a PCI-bus computer depends on a number of more subtle factors. Among these are:

Memory Timing: In many PC applications, L2 cache hit rates are quite high. This is less true for the applications of interest to us. Both the uncached data transfer rate and the cached data burst-fill rate can have substantial performance impact. Current and near-future PCI chipsets support new memory technologies such as EDO and synchronous DRAM, which have substantially faster burst-fill rates.
Interface Card PCI Implementation: The level of attention paid to performance in the interface card's PCI implementation is becoming increasingly important. Use of Memory Read Multiple commands, attention to cache alignment, and similar issues make a huge difference. Unfortunately, it's difficult to know how a particular interface will perform unless you either have access to technical design information or perform some exhaustive characterization tests.
Bus Arbitration Overhead: Early PCI chipset designs were oriented towards large block data transfers, and were not well suited to supporting small transfers from several different bus masters. Near-future designs will improve on this situation. Related issues, including correct handling of the PCI latency timers and low-overhead interrupt handling, are also important.
CPU, memory, PCI and ISA bus decoupling: Early designs had minimal buffering between these system components. This resulted in little if any concurrency between the actions of the CPU and its I/O system, a problem which is particularly noticeable on IP routers with both slow and fast interfaces.
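
As a rough illustration of why the 132MB/s figure says so little, the sketch below compares the peak rate of a 33MHz, 32-bit PCI bus with the rate left over when every burst pays a fixed number of overhead clocks (arbitration, address phase, target latency). The overhead value of 8 clocks is purely an assumption chosen for illustration, not a measurement of any chipset.

```python
# Toy model of PCI throughput versus burst size.
# Assumptions (not from any chipset datasheet): a fixed per-burst overhead
# in bus clocks, and one 32-bit data phase per clock once the burst starts.

PCI_CLOCK_HZ = 33_000_000      # nominal 33 MHz PCI clock
BYTES_PER_DATA_PHASE = 4       # 32-bit bus


def peak_bandwidth():
    """Headline rate: one data phase on every clock, forever."""
    return PCI_CLOCK_HZ * BYTES_PER_DATA_PHASE


def effective_bandwidth(burst_bytes, overhead_clocks):
    """Bytes per second when each burst also costs overhead_clocks."""
    data_clocks = -(-burst_bytes // BYTES_PER_DATA_PHASE)  # ceiling division
    total_clocks = data_clocks + overhead_clocks
    return burst_bytes * PCI_CLOCK_HZ / total_clocks


if __name__ == "__main__":
    print("peak:", peak_bandwidth() / 1e6, "MB/s")
    for burst in (16, 64, 256, 1024):
        rate = effective_bandwidth(burst, overhead_clocks=8)
        print(burst, "byte bursts:", round(rate / 1e6, 1), "MB/s")
```

Under these assumptions a device moving 16-byte bursts sees about a third of the headline number, which is the kind of gap the factors above are describing.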

PCI

Jan 17, 1997: This section is basically one generation out of date. I left it in for background until I have a chance to fix it.

PCI chipset design for the Pentium processor has evolved through several cycles. The most recent chips from Intel have very good performance potential when compared with earlier devices. For the Pentium Pro, a second generation chipset has recently become available, but does not totally replace the original version.

Pentium

For some time Intel's Triton chipset was the best choice for Pentium PCI motherboards. Now called the 430FX, this chipset was the first to support running a PCI bus at close to full speed. However, this chipset is now obsolete. Its manufacture will shortly be discontinued by Intel, although motherboards built using it will probably be in the supply pipeline for a while. Also, early versions had a bug in the CPU to PCI logic which drastically limited throughput. The most recent revision works well. Motherboards using this design wouldn't be the choice for a new purchase, but if you have one it will work OK for most things.

The follow-on to the Triton chipset (sometimes called the Triton II, though not officially) supports Intel's new Concurrent PCI enhancements, which improve PCI's capabilities for both fast packet I/O and multimedia data management. It should be well suited for audio and ATM applications in particular. This chipset has two variants. The 430HX has an efficient controller for current fast memory designs (EDO), a PCI bus controller tuned well for short-burst traffic, and supports ECC memory and other "server" requirements. The 430VX supports Synchronous DRAM for significantly improved main-memory access timings, but appears to have a slightly less capable PCI controller.

Motherboards built around the Triton II chipsets have been available since late April of 1996. Unfortunately (again) the first steppings of this chipset had a number of problems, but the pipeline has now cleared, and boards purchased today should be OK. The Configurations section gives some specific examples of systems which have worked for us.

We are currently using motherboards constructed around the 430HX chipset for desktop workstations and mbone terminals, with very good results. However, the 430VX is more of a puzzle. When used with Synchronous DRAM (SDRAM), it should be a great design for networked applications, because SDRAM's fast cycle times should reduce memory access overhead with the high data cache miss rates typical of messaging applications. However, real-life experience with this chip is mixed. The fact that some people report excellent performance suggests that the difficulty may lie with board or system configuration rather than the chipset itself. We don't have any local experience to report yet.

Pentium Pro

On the Pentium Pro front, until recently the only choice was the 450GX/KX Orion chipset. This design works well for the typical server machine, but appears to have somewhat higher bus arbitration overhead than the Triton, and can achieve its best main memory access timing only with four-way interleaving, which implies a lot of SIMMs. The chip supports only the older Fast-Page-Mode memory. Again (there is a theme here) the first revisions of the chipset had a bug which limited I/O throughput, but the current version is OK.

The 440FX "Natoma" chipset for the Pentium Pro has several advantages, and one or two disadvantages. This chipset supports Intel's Concurrent PCI enhancements, which, as with the new Pentium chipsets, improve PCI's capabilities for both fast packet I/O and multimedia data support. The chipset also supports faster main memory timings through the use of EDO memory, and is somewhat cheaper than the Orion. On the other side of the coin, the Orion GX, originally intended for server use, supports four-processor SMP, while the Natoma supports only two processors. When you do use four-way interleaved memory, the overall performance of an Orion GX memory system is frequently better than a Natoma design.

The Orion KX, which was designed for workstations rather than servers, has been completely replaced by the Natoma. Motherboards with the Orion KX, including Intel's widespread Aurora design, are now essentially obsolete and should not be purchased new. They still work well, of course, as long as you have one with a non-buggy revision of the chipset. Earlier buggy versions suffer from either terrible I/O performance or occasional system hangs, depending on configuration.

ATM Interfaces

We are currently using the Efficient Networks PCI ATM interface exclusively. This card consists of a fiber or UTP physical interface running at 155Mb/s, a basic ATM cell processor with 8 peak-rate-controlled sending queues, an AAL5 segmentation and reassembly engine, a 512KB or 2MB onboard memory buffer, and a PCIbus DMA controller.
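
To give a feel for what the AAL5 segmentation and reassembly engine does to a packet, here is a minimal sketch of the standard AAL5 arithmetic: the packet plus an 8-byte trailer is padded to a multiple of 48 bytes and carried in 53-byte cells. The function names are ours, and the sketch ignores any LLC/SNAP encapsulation and all SONET framing overhead.

```python
# AAL5 cell arithmetic (sketch; names are illustrative, not a card API).
# Ignores LLC/SNAP encapsulation and physical-layer (SONET) overhead.

CELL_BYTES = 53           # 5-byte header + 48-byte payload
CELL_PAYLOAD_BYTES = 48
AAL5_TRAILER_BYTES = 8    # UU, CPI, length, CRC-32


def aal5_cells(packet_bytes):
    """Number of ATM cells needed to carry one packet over AAL5."""
    pdu = packet_bytes + AAL5_TRAILER_BYTES
    return -(-pdu // CELL_PAYLOAD_BYTES)   # ceiling division


def aal5_wire_bytes(packet_bytes):
    """Bytes actually sent on the link, cell headers and padding included."""
    return aal5_cells(packet_bytes) * CELL_BYTES


def aal5_efficiency(packet_bytes):
    """Fraction of the cell stream that is packet data."""
    return packet_bytes / aal5_wire_bytes(packet_bytes)


if __name__ == "__main__":
    for size in (40, 41, 576, 1500):
        print(size, aal5_cells(size), round(aal5_efficiency(size), 3))
```

Note the step between 40 and 41 bytes: one extra byte costs a whole extra cell, which matters a good deal for small-packet traffic.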

This card has a number of useful features and a couple of limitations. The most interesting feature is that the DMA controller is completely independent of the ATM and AAL5 SAR functions. The choice of how to get a packet into or out of the buffer memory is completely independent from the ATM and AAL processing. Depending on the circumstances, the CPU might choose to copy the packet to host memory itself, DMA parts of the packet to host memory, or read the packet header and DMA the packet body directly to another PCI bus device without ever touching host memory. This flexibility offers the potential for a very high performance software design.

A tradeoff of this flexibility is that the CPU may have to think about each packet twice; once when the packet is received (to set up the DMA operation) and once when DMA is complete. For our needs, the value of the capability appears to outweigh the costs, but this may not be true in a more conventional end-node-only situation.

Another limitation of the ENI card for our purposes is that it supports only a 10-bit VCI space and a zero-bit VPI space (that is, only VPI zero is available). This is reasonable for a host-oriented design, but quite limiting for use in a router.
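
The size of that restriction is easy to see against the standard UNI cell header, which carries an 8-bit VPI and a 16-bit VCI. The sketch below packs and unpacks those fields and checks a connection against a card's usable VPI/VCI widths. The helper names are hypothetical and have nothing to do with the ENI driver interface.

```python
# UNI cell header fields versus a card's usable VPI/VCI space (sketch).
# First four header bytes: GFC(4) VPI(8) VCI(16) PT(3) CLP(1).
# Function names are hypothetical, not part of any driver.


def build_uni_header(vpi, vci, pt=0, clp=0):
    """Pack VPI/VCI (GFC = 0) into the first four header bytes."""
    if not (0 <= vpi < 256 and 0 <= vci < 65536):
        raise ValueError("VPI/VCI out of range for a UNI header")
    word = (vpi << 20) | (vci << 4) | ((pt & 0x7) << 1) | (clp & 0x1)
    return word.to_bytes(4, "big")


def parse_uni_header(header):
    """Return (vpi, vci) from the first four header bytes."""
    word = int.from_bytes(header[:4], "big")
    return (word >> 20) & 0xFF, (word >> 4) & 0xFFFF


def usable_on_card(vpi, vci, vpi_bits=0, vci_bits=10):
    """True if the card can address this VC; defaults model the limits above."""
    return vpi < (1 << vpi_bits) and vci < (1 << vci_bits)


if __name__ == "__main__":
    print(parse_uni_header(build_uni_header(0, 100)))
    print(usable_on_card(0, 1023), usable_on_card(0, 1024), usable_on_card(1, 32))
```

With the defaults, 1024 of the 2^24 possible VPI/VCI pairs are reachable, which is plenty for a host but not for a router terminating many virtual paths.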

PCI ATM cards from Adaptec are closely related to the Efficient cards. They use the same basic SAR engine but differ in that Adaptec has modified the SAR ASIC to incorporate a PCI interface directly, rather than using an external interface chip between the (originally Sun-SBUS) SAR ASIC and the PCI bus. The Adaptec cards also have a full SONET physical interface, rather than the SONET-LITE interface used by ENI. This is probably irrelevant to those connecting these cards to a private switch, but might be important for someone wanting to connect directly to a public carrier.

It appears to us that the Adaptec PCI bus interface has some nice features, but we don't currently have any significant experience with using these cards.

A number of other PCIbus ATM interface cards have become available in the past couple of years. Like the ENI and Adaptec designs, these interfaces are targeted at the end-node (rather than the router) market. This usually implies that the interfaces support a small or non-existent VPI space, and may not have adequate SAR and cell rate control to handle hundreds or thousands of simultaneously active VC's.

A product we have considered using in the past comes from Zeitnet. This card reassembles ATM cells into packets in host memory rather than onboard buffer memory. Reassembling directly into host memory reduces the latency from wire to application, and may reduce the CPU's overhead in a traditional protocol implementation framework. However, it removes the possibility of pre-analyzing a reassembled packet so that it can be moved directly to the appropriate outgoing interface or local memory buffers.

Fast Ethernet Controllers

Of the many 100Mb/s Ethernet (100BASE-TX) controller chips available, we have the most experience with those from Digital Equipment Corp. and Intel. Both of these chips are widely available; the Intel chip in the company's PRO-100/B PCI Fast Ethernet Interface (note: the "B" is important!), and the DEC chip in the DEC DE-500 card and several third-party products.

The DEC 21140 chip is a marvelous design with a single annoying flaw. It combines a 10 and 100Mb/s Ethernet controller with a DMA-master PCI bus interface. The DMA interface is efficient and simple to program. The Ethernet controller is capable of excellent performance, and supports a full-duplex mode when connected directly to a matching interface.

At present, this chip exhibits the highest small-packet forwarding performance we have seen with our software. It is capable of maintaining the theoretical 100BASE-TX packet rate at packet sizes ranging down to the legal minimum.
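
For reference, the theoretical packet rate follows directly from the Ethernet framing rules: every frame on the wire is preceded by 8 bytes of preamble and start delimiter and followed by a 12-byte interframe gap, and the legal minimum frame is 64 bytes including the FCS. A small sketch of the calculation (the function name is ours):

```python
# Theoretical maximum frame rate on a 100 Mb/s Ethernet link (sketch).

LINK_BPS = 100_000_000
PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames
MIN_FRAME_BYTES = 64        # destination address through FCS
MAX_FRAME_BYTES = 1518


def max_frames_per_second(frame_bytes, link_bps=LINK_BPS):
    """Back-to-back frame rate for frames of the given size."""
    if not MIN_FRAME_BYTES <= frame_bytes <= MAX_FRAME_BYTES:
        raise ValueError("not a legal Ethernet frame size")
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits


if __name__ == "__main__":
    for size in (64, 600, 1518):
        print(size, "bytes:", int(max_frames_per_second(size)), "frames/s")
```

At the 64-byte minimum this works out to roughly 148,800 frames per second in each direction, which is the rate a router's per-packet path has to sustain to keep such a link full.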

DEC has recently extended their family of Ethernet chips to include designs with improved PCI bus handling and lower CPU overhead. We haven't had a chance to try these modified versions yet.

The Intel 82557, like the DEC design, integrates a 10/100Mb/s Ethernet controller and PCI bus interface. The chip can be operated with very low CPU overhead, and provides a number of parameters to control its use of the CPU and PCI bus. It is used in the Intel Pro-100/B Fast Ethernet card, as well as others.

A previous version of this page said:

Our current experience with the Intel chip is somewhat surprising. We've measured excellent large-packet performance with this chip, easily saturating the data link with very low CPU usage. However, with our current software, the chip's performance falls off rapidly as packet sizes go down. At present, the chip will not saturate a 100MB link with packet sizes below 600 bytes. However, we have not yet put enough effort into tuning the driver for this chip to judge whether this limitation is inherent or simply a result of poor software.

It now seems likely that this was in fact a hardware mismatch between the Intel ethernet chip and the Intel PCI chipset in use at the time, and that with slightly different hardware the Intel design would give excellent performance, along with lower system overhead than the Digital design. We have not yet had a chance to repeat our experiments, but expect to shortly.

A definite problem with the Intel design, though, is that the interrupt control structure is not well suited to a combined interrupt-polling driver. This adds overhead not needed by the Digital chip, particularly in many-small-packet situations. In contrast, the Intel chip's ability to align received ethernet payload information on longword boundaries measurably improves software performance, particularly on the P6, which is sensitive to such things.
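
The alignment issue comes from the 14-byte Ethernet header: if a received frame is stored starting on a longword boundary, the IP header that follows lands two bytes off. The sketch below shows the arithmetic, with the usual remedy of starting the frame at a small offset into the buffer. This illustrates the general problem only. We are not describing how the Intel chip itself achieves the alignment, and the function names are made up.

```python
# Why received Ethernet payloads end up misaligned, and the offset that
# fixes it (sketch; not a description of any particular chip's mechanism).

ETHERNET_HEADER_BYTES = 14   # 6 dst + 6 src + 2 type
LONGWORD_BYTES = 4


def payload_address(frame_start):
    """Address of the first payload byte (e.g. the IP header)."""
    return frame_start + ETHERNET_HEADER_BYTES


def is_longword_aligned(address):
    return address % LONGWORD_BYTES == 0


def frame_start_for_aligned_payload(buffer_start):
    """First address at or after buffer_start that aligns the payload."""
    pad = (-(buffer_start + ETHERNET_HEADER_BYTES)) % LONGWORD_BYTES
    return buffer_start + pad


if __name__ == "__main__":
    buf = 4096
    print(is_longword_aligned(payload_address(buf)))
    start = frame_start_for_aligned_payload(buf)
    print(start - buf, is_longword_aligned(payload_address(start)))
```

For a longword-aligned buffer the required offset is two bytes, so hardware that can only deposit frames at aligned addresses leaves the driver to copy the packet or live with unaligned accesses.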

The bottom line is that at present we see no obvious "best" choice for a 100Mb/s Ethernet card. The DEC and Intel chips appear to have complementary characteristics.

T1 Interface Cards

Much of our current testbed infrastructure is glued together with T1 lines. We are currently using T1 interface cards from Niwot Networks. The Niwot cards are ISA-bus devices. They use the fairly common MK5025 synchronous line interface chip, and support direct packet DMA to host memory (no shared memory). Each card can support two T1 lines operating at full rate.

These cards originally came in two flavors; one with onboard DSU/CSU circuitry, and one which requires an external DSU/CSU. Currently, only the external-DSU variant is available in small quantities. Each design has advantages. The external-DSU approach gives the full error monitoring and electrical isolation of a separate DSU/CSU. However, the internal-CSU card has a number of neat features. The most interesting is its ability to use its two serial framers to drive two separate logical channels (sets of DS0 slots) on the same T1 line, giving you two logically separate "wires" between the source and destination. Niwot has expressed a willingness to make more of the internal-CSU cards if demand appears.
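
To make the two-logical-channel idea concrete: a T1 line carries 24 DS0 slots of 64kb/s each, and a logical channel is simply a set of those slots. The sketch below computes the rate of a channel and checks that two channels sharing a line don't overlap. The slot assignments shown are just an example, not a Niwot configuration format.

```python
# Splitting one T1 into logical channels made of DS0 slots (sketch).
# Slot assignments are illustrative only.

DS0_BPS = 64_000
T1_SLOTS = 24
T1_FRAMING_BPS = 8_000   # one framing bit per 193-bit frame, 8000 frames/s


def channel_rate(slots):
    """Payload rate in bits/s of a channel built from the given DS0 slots."""
    slot_set = set(slots)
    if len(slot_set) != len(list(slots)):
        raise ValueError("duplicate slot in channel")
    if any(s < 1 or s > T1_SLOTS for s in slot_set):
        raise ValueError("DS0 slots are numbered 1..24")
    return len(slot_set) * DS0_BPS


def channels_disjoint(a, b):
    """Two channels can share a line only if they use different slots."""
    return not (set(a) & set(b))


if __name__ == "__main__":
    chan_a = list(range(1, 13))    # slots 1-12
    chan_b = list(range(13, 25))   # slots 13-24
    print(channel_rate(chan_a), channel_rate(chan_b))
    print(channels_disjoint(chan_a, chan_b))
    print(channel_rate(range(1, 25)) + T1_FRAMING_BPS)
```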

An alternative to Niwot is the family of cards from Emerging Technology. Like the Niwot cards, these use MK5025's to operate the serial lines. Rather than direct host memory DMA, they use a shared memory design. ET's high end card uses the high-performance MK50H25 chip variant, and can operate private lines at 7-10Mb/s rates. A similar class of cards, although using a different chipset, is available from SDL Communications.

Unfortunately for those building high performance routers from PC's, the ISA bus is something of a disaster. This is because, at least until recently, performing ISA I/O more or less brought the rest of the machine to a stop, and things are only slightly better with the newest PCI chipsets. Ideally a multiport PCI T1 card is called for. We are currently developing code for a four-channel PCI T1 interface sold by SBE, Inc. This card uses the Motorola "QUICC" communications controller (unfortunately not the PowerPC version), together with onboard buffer memory and a PCI interface ASIC. The mechanical design and packaging of this card is simply beautiful. No performance information available as yet.

Display Cards

PC display card capabilities are changing at lightning speed. Until recently, the focus has been entirely on two-dimensional graphics accelerators. With the advent of the PCI bus and fast processors, interest has shifted to live-video acceleration and 3-D capabilities, and a high-end market willing to pay for decent analog design has developed.

These capabilities could easily drive a new generation of Internet applications. Unfortunately, the fast-moving market has created tremendous pressure to keep design and programming information proprietary, and the sheer speed of product development makes it difficult for the lone software hacker at a research institution to keep up. This, more than anything else, may be the thing that pushes us towards Windows, and its vendor-supported drivers, for end-node application development.

(Of course, the vendors can't keep up either; Microsoft has gone through several "high-performance" graphics API's in the past few years, and many vendors' software drivers are well behind the theoretical capabilities of their chips.)

The basic point of all this is that anything factual in this section probably will be wrong before you see it...

That said, three basic options stand out: cards based on the ATI Mach64 accelerators, those based on the Tseng Labs ET6000 chip, and those based on the S3 accelerators. We are currently using ATI Graphics Pro Turbo cards with 4MB of video memory in most of our machines. These cards are generally well supported by the freely available X implementation (XFree86). Beyond this general support, the latest beta version of XF86 supports a direct-video extension for these cards, allowing the application to write directly into the card's framebuffer memory. We have not yet made use of this capability in a running application, but it is of obvious use for live-video development.

ATI is rapidly developing extensions to the Mach64 chip family to support video processing (colorspace conversion and scaling) and 3D graphics. Their high-end cards also support a daughtercard plugin slot for MPEG accelerators and the like. Technical information about these developments is available from ATI under non-disclosure. We have not yet gotten around to figuring out how restrictive ATI's NDA actually is.

The Tseng Labs ET6000 chip is a strong 2D performer with easily available programming information. Its key trick is being able to simultaneously manage several different types of visual regions, for example easily overlaying a window displaying YUV422 coded video scaled 2 to 1 over a traditional 8-bit colormapped RGB image or a 24-bit RGB truecolor photographic background. The ET6000 uses "Multibank DRAM", which supports close to 1GB/sec of bandwidth between the display memory and the DAC's, and has a reasonably good bitblt engine for good X (and Windows) performance. It does not, however, have any 3D acceleration capability.

Another alternative is any of a number of products based on the S3 display chipsets. Unlike ATI, S3 does not itself make boards. Rather, they design and sell chips to a number of board manufacturers (Number9, Diamond, ELSA, many others), giving you a wider choice of price-quality-performance points within the same basic programming framework. S3 is also moving aggressively into the video-processing and 3D accelerator market, and has added significant video capability to their latest products. S3's market strategy leaves them more used to dealing with third-party software developers than ATI. Perhaps unfortunately, they are a bit more formal about it; not only is a fairly standard non-disclosure agreement required but they want to know some things about what you plan to do with the information.

Video Frame Capture Cards

Our video frame grabber card of choice is the Matrox Meteor. This card uses a straightforward design based on the well known Philips digital video chips. The chipset, and thus the Meteor, is well documented and capable of good performance - with the correct motherboard and software it can digitize and capture 640x480 video images at 30 frames per second. The quality of the analog circuit design is adequate if not wonderful - the digitized data can be a bit noisy at times, but is probably as good as can be expected from commercial-grade equipment installed inside a running PC.
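
It is worth working out what 640x480 at 30 frames per second means in bus terms. The sketch below computes the sustained data rate for a few pixel formats and compares it with the 132MB/s PCI headline figure. The bytes-per-pixel values are assumptions about how each format is packed in memory, not taken from the Meteor documentation.

```python
# Sustained data rate of uncompressed video capture (sketch).
# Bytes-per-pixel values are assumed packings, not Meteor specifics.

PCI_PEAK_BYTES_PER_SEC = 132_000_000

BYTES_PER_PIXEL = {
    "YUV 4:2:2 packed": 2,
    "RGB 24-bit": 3,
    "RGB 32-bit": 4,
}


def capture_rate(width, height, fps, bytes_per_pixel):
    """Bytes per second written to host memory by the capture card."""
    return width * height * fps * bytes_per_pixel


if __name__ == "__main__":
    for name, bpp in BYTES_PER_PIXEL.items():
        rate = capture_rate(640, 480, 30, bpp)
        share = rate / PCI_PEAK_BYTES_PER_SEC
        print(name, round(rate / 1e6, 1), "MB/s,", round(100 * share), "% of PCI peak")
```

Even the most compact of these formats is a steady stream of about 18MB/s that has to coexist with network and display traffic, which is why the motherboard matters so much here.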

The Meteor can generate a variety of digital output formats, including straight RGB and a number of YUV variants. Further, the device can selectively digitize only even or odd fields, perform some useful image enhancement and filtering operations, and arrange the digitized data in memory in a number of useful ways, such as separating the even and odd fields or leaving extra space at row and column start/finish points as required by later signal processing operations.

The Meteor is available in a couple of versions, which vary in the input sources supported. The standard card accepts PAL/NTSC composite video and S/Video input. The Meteor/RGB additionally accepts three-channel RGB input. A third version with a built-in TV tuner is apparently available, but we have no further information about it.

The Meteor is supported by a FreeBSD device driver which implements simple single-frame capture, continuous asynchronous capture, and a pipelined synchronous mode which captures into a multi-frame circular buffer with low-water / high-water flow control. The driver provides control over most of the video settings supported by the chipset, but does not currently support an ideal set of DMA and memory management operations - some further development in this area should enable applications to exhibit higher video quality.
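
The pipelined synchronous mode is easiest to picture as a ring of frame slots with two thresholds: capture pauses when the number of unread frames reaches the high-water mark and resumes once the application has drained the ring down to the low-water mark. The class below is our own toy model of that idea. It is not the FreeBSD driver's interface, and every name in it is hypothetical.

```python
# Toy model of a multi-frame circular buffer with low-/high-water flow
# control. Hypothetical names; this is not the Meteor driver's API.


class FrameRing:
    def __init__(self, nframes, low_water, high_water):
        if not 0 <= low_water < high_water <= nframes:
            raise ValueError("need 0 <= low_water < high_water <= nframes")
        self.slots = [None] * nframes
        self.head = 0            # next slot the capture side writes
        self.tail = 0            # next slot the application reads
        self.filled = 0          # frames captured but not yet released
        self.low_water = low_water
        self.high_water = high_water
        self.capturing = True

    def capture(self, frame):
        """Capture side: store a frame; returns False if capture is paused."""
        if not self.capturing:
            return False
        self.slots[self.head] = frame
        self.head = (self.head + 1) % len(self.slots)
        self.filled += 1
        if self.filled >= self.high_water:
            self.capturing = False      # application has fallen behind
        return True

    def release(self):
        """Application side: take the oldest frame, or None if empty."""
        if self.filled == 0:
            return None
        frame = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % len(self.slots)
        self.filled -= 1
        if not self.capturing and self.filled <= self.low_water:
            self.capturing = True       # enough room again, resume capture
        return frame
```

The gap between the two marks provides hysteresis: an application that falls behind loses a run of frames and then catches up, rather than toggling capture on and off around a single threshold.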

The currently distributed version of the MBONE application VIC supports this card/driver combination. Patches are also available for NV.

Many folks have reported bad interactions between the Meteor and Intel's 440FX "Natoma" P6 PCI chipset. The general symptom is that using the board locks the machine up. There appear to be two separate problems at work, one involving a small timing error in the Philips PCI interface in some DMA modes, and one (perhaps) involving a bandwidth limitation in the PCI chipset. Matrox is apparently building a special "slow" version of the Meteor to work around this problem, a slightly less than ideal solution. More on this soon.

We have little experience with alternatives to the Matrox video cards. Given the popularity and simplicity of the Philips chips, it seems likely that additional cards based on this design will appear. The crucial factor in such cards will be the quality of the board layout and design of the analog video circuitry. If you come across one with ferrite-bead power supply filtering and metal shielding of the input section, you are probably onto something. We'll list such things here if we ever see one.

There are a number of video framegrabber cards on the market which use chipsets other than the Philips design. Some of these are based on high-quality chips such as the Brooktree video DACs, and perform quite well. They may be excellent choices if adequate documentation is available and someone develops the necessary software support.

Audio Input and Output

To date, most PC Audio development has focused on the requirements of game players and CD-ROM multimedia authors. Within the Internet research community the focus has been placed on interactive communication and conferencing. This leads to a mismatch.

The typical PC audio card has a variety of sound creation mechanisms, including a digitizer (ADC) and some form of synthesizer. It also has a multi-channel mixer and a digital-to-analog converter (DAC) to generate the final output signal. What it often does not have is a way to simultaneously move digitized data from the microphone to the computer and from the computer to the speakers. It can only do one or the other at a time; it is a half-duplex device.

Unfortunately this is a serious limitation for interactive teleconferencing applications. If at all possible, you should avoid this limitation by using one of the few soundcards with full-duplex capabilities. Among these are the Gravis Ultrasound and the Turtle Beach Tropez. Note that the industry-standard Creative Labs Soundblaster family, which includes both plugin cards and the most common on-motherboard sound chips, is not full duplex.

The second big problem with soundcards is that, as with display cards, the field is hotly competitive, and many manufacturers consider their hardware designs proprietary and do not publish the low-level programming information. This allows them to update the hardware frequently without creating compatibility problems, but makes it very difficult to use the products in non-Windows environments. As an example, the Turtle Beach Tropez, mentioned above, has been replaced by the Tropez+, which uses an entirely different set of components.

The other important characteristic of a sound card is analog circuit component and layout quality. There is a wide variation among different cards in this area. Fortunately, the problems resulting from bad analog design can be measured and evaluated, and the results are documented in any competent product review.

Currently we are using cards purchased some time ago in our laboratory machines. As time progresses we will try to accumulate more information about specific current products on this page.

Configurations

We've moved our information about specific configurations to a separate page. Our machines come from a small local system integrator with whom we have an ongoing relationship. Similar machines are available from any number of local and mail-order suppliers. Within the CAIRN project, the ISI-East folks are investigating the possibility of arranging for one of these suppliers to sell the listed machines as package deals.

This work is supported by the Defense Advanced Research Projects Agency under contract DABT63-94-C-0072, and by the Intel Corporation.

John Wroclawski / MIT Lab for Computer Science / jtw@lcs.mit.edu