Enabling the 25GHz PC

The burning issue for the computer industry is that buses such as PCI-X, AGP, DDR and Pentium MCI have become information bottlenecks that are strangling data throughput. Now, Acuid Corporation has come up with a solution.

Between 1974 and 2002, processor clock rates went from 1MHz to the 4GHz at which Pentiums have been demonstrated, a 4000-fold increase, but I/O bandwidth has not kept pace, as shown in <a href=’http://www.e4engineering.com/content_images/acuidone11.gif’>Figure 1</a>. New I/O technologies have emerged that remove this bottleneck for several more generations of processor.

To prevent the performance of new processors being completely strangled by data I/O limitations, each processor vendor has invested in the development of its own bus. AMD has led HyperTransport, Motorola has led RapidIO and Intel has defined 3GIO, all to address this problem.

However, the improvement provided by these new buses is still grossly insufficient. Consider the example of a RapidIO bus, 32 bits wide, operating at the maximum rate contemplated by the specification, namely 2Gbps per signal. This bus would be clogged by the data from just two RLDRAM chips, and it would carry that data only at a huge cost in protocol latency and system complexity to run the RapidIO protocol.
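
The arithmetic behind this example can be sketched as follows. It uses only figures quoted in this article (the RLDRAM rate of 800Mbps, 32 bits wide, appears later on) and assumes that the 2Gbps RapidIO figure is a per-signal rate and that protocol overhead is ignored; the function and constant names are illustrative.

```python
# Back-of-envelope check using figures quoted in the article.
# Assumption: 2Gbps is the rate per signal, and protocol overhead is ignored.
RAPIDIO_WIDTH_BITS = 32
RAPIDIO_RATE_GBPS = 2.0   # per signal
RLDRAM_WIDTH_BITS = 32
RLDRAM_RATE_GBPS = 0.8    # 800Mbps per signal


def aggregate_gbps(width_bits, rate_gbps_per_signal):
    """Raw aggregate bandwidth of a parallel bus in Gbps."""
    return width_bits * rate_gbps_per_signal


bus = aggregate_gbps(RAPIDIO_WIDTH_BITS, RAPIDIO_RATE_GBPS)
chip = aggregate_gbps(RLDRAM_WIDTH_BITS, RLDRAM_RATE_GBPS)
utilisation = 2 * chip / bus  # fraction of the bus taken by two RLDRAM chips

print(f"RapidIO bus:     {bus:.1f} Gbps raw")
print(f"One RLDRAM chip: {chip:.1f} Gbps")
print(f"Two chips use {utilisation:.0%} of the raw bus bandwidth")
```

Under these assumptions two chips take most of the raw bandwidth before any protocol overhead is counted.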

Many companies, including Intel, have demonstrated logic circuits running at 100GHz using fully depleted Silicon on Insulator. Of what use will 25, 50 or even 100GHz processors be, given the limits to data transfer? What is needed are order of magnitude improvements in bus speeds.

When a processor requires new blocks of code or data, it has to wait until they have been loaded from memory. Predictive fetches and multiple level caches are used in the processor to take advantage of instruction loops and to reuse data, thus reducing the number of external transfers and the time the processor spends waiting. However, high speed on-chip cache costs much more than external memory modules. The ideal situation would be for the processor to load instructions and data directly from external memory without having to wait, thus reducing, or even removing, the need for a cache. Early microprocessors all operated this way because the I/O bandwidth was sufficient.

Computer buses, such as DDR and Rambus for memory, AGP for graphics, and PCI-X, are incapable of supplying data fast enough to keep any modern processor fully loaded. Although all of these buses continue to be enhanced, they are nearing the limits of their potential performance. Providing data to allow processors to operate at full speed requires a new data transfer technology.

Acuid have developed a solution to this problem, with each signal operating at up to 25Gbps. This is compared in <a href=’http://www.e4engineering.com/content_images/acuidtwo22.gif’>Figure 2</a> with the fastest serial interfaces, which achieve 3.2Gbps, or really 2.5Gbps once the layer of coding these interfaces require is taken away. Even 10Gbps Ethernet requires four signals at 2.5Gbps.

The performance standard for parallel buses is being set by the memory vendors with DDR II achieving 400Mbps (64 and 72 bits wide) and RLDRAM (Reduced Latency DRAM) achieving 800Mbps (32 bits wide). These figures are the limits of current CMOS processes, unless Acuid technology is used.

What is it that allows Acuid ports to run ten to a hundred times faster than other ports?

Acuid’s key step was to discover a way to measure extremely accurately the time at which events occur inside a chip. The importance of this can be understood by analogy with an op amp. An op amp on its own is unusable as an amplifier: it has a high gain, but this varies enormously with changes in temperature, process and power supply.

The addition of feedback using just two resistors transforms the op amp into an idealised component; one with a predetermined gain that does not vary.
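
The op amp analogy can be illustrated numerically. The sketch below uses the standard closed-loop gain formula for a non-inverting amplifier, A / (1 + A·β); the resistor values and open-loop gains are hypothetical, chosen only to show how little the closed-loop gain moves when the open-loop gain varies by a factor of four.

```python
def closed_loop_gain(open_loop_gain, r_feedback, r_ground):
    """Gain of a non-inverting amplifier with a finite open-loop gain."""
    # Fraction of the output fed back to the inverting input.
    beta = r_ground / (r_ground + r_feedback)
    return open_loop_gain / (1.0 + open_loop_gain * beta)


# Hypothetical values: the ideal gain is 1 + 9000/1000 = 10.
R_FEEDBACK = 9_000.0
R_GROUND = 1_000.0

open_loop_gains = (50_000.0, 100_000.0, 200_000.0)
gains = [closed_loop_gain(a, R_FEEDBACK, R_GROUND) for a in open_loop_gains]

for a, g in zip(open_loop_gains, gains):
    print(f"open-loop gain {a:>9.0f} -> closed-loop gain {g:.4f}")
```

The closed-loop gain stays within a fraction of a percent of 10 in every case, because it is set almost entirely by the two resistors.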

By measuring time and using feedback, Acuid obtained a tool both to understand the physics of high speed links and to control them. Problems of skew, setup and hold times, intersymbol interference (ISI) and varying track lengths are all resolved.

The fundamental effect this technology is having on I/O design is evident from the comparison in <a href=’http://www.e4engineering.com/content_images/acuidthree33.gif’>Table 1</a>.

This table compares the equivalent Acuid interface with the highest performance silicon serial I/O available today, namely a typical OC192 SPI-4 implementation, and with the highest performance production parallel interface available, namely an RLDRAM at 800Mbps, 32 bits wide. There are several striking differences. Consider first the most basic parameter: the sampling rate. The RLDRAM simply takes a single sample per bit period using a precision clock and precise PCB layout. In OC192, the interface takes a sample using the encoded clock and then tries to predict when to take the next sample.

The Acuid interface samples the data at 16 times the data rate and then determines which sample to use: as it were, sampling with hindsight. Predicting the future is always more difficult than taking decisions with 20/20 hindsight, and this process of sampling with hindsight is a major advance in the Acuid interfaces.
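
The idea of choosing the sample after the fact can be illustrated with a generic oversampling receiver. This is not Acuid's circuit, whose details the article does not give. It is a minimal sketch that assumes a noise-free line, a fixed but unknown phase offset, and no frame alignment; the functions `transmit` and `recover` are hypothetical.

```python
from collections import Counter

OVERSAMPLE = 16  # samples per bit period, the figure quoted in the article


def transmit(bits, phase_offset):
    """Hold each bit for 16 samples, then delay the waveform by phase_offset."""
    wave = [b for b in bits for _ in range(OVERSAMPLE)]
    return [bits[0]] * phase_offset + wave


def recover(samples):
    """Pick the sampling phase from samples that have already been captured."""
    # Positions, modulo the bit period, at which the line changed level.
    edges = Counter(
        i % OVERSAMPLE
        for i in range(1, len(samples))
        if samples[i] != samples[i - 1]
    )
    edge_phase = edges.most_common(1)[0][0]
    # Use the sample half a bit period after the edges, mid-way between them.
    pick = (edge_phase + OVERSAMPLE // 2) % OVERSAMPLE
    return pick, samples[pick::OVERSAMPLE]


bits = [1, 0, 1, 1, 0, 0, 1, 0]
pick, recovered = recover(transmit(bits, phase_offset=5))
print("chosen sample phase:", pick)
print("recovered matches sent:", recovered == bits)
```

The receiver never has to predict where the next bit boundary will fall: it looks back over the captured samples, sees where the edges were, and keeps the sample furthest from them.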

Another striking contrast is the timing precision, in particular, the jitter. Jitter is a key parameter that determines the level of bit errors in a communication channel, and from that, the maximum data rate. A state of the art ultra low jitter clock, fabricated on thick film, has a jitter of 3ps, while typical clocks have a jitter of tens of ps. The Acuid timing control system, which continuously measures the timing variation and then uses that measurement in a feedback loop, allows the jitter to be controlled to 1ps, which is a lower level than even the best clock sources.
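
To see why jitter limits the data rate, it helps to express it as a fraction of the bit period. The sketch below does this for the jitter figures quoted above at the 25Gbps signal rate. The 20ps value stands in for "tens of ps" and is an assumption, and the article does not say whether its jitter figures are RMS or peak values, so the percentages are indicative only.

```python
def bit_period_ps(rate_gbps):
    """Duration of one bit in picoseconds at the given signal rate."""
    return 1000.0 / rate_gbps


def jitter_fraction(jitter_ps, rate_gbps):
    """Jitter expressed as a fraction of the bit period."""
    return jitter_ps / bit_period_ps(rate_gbps)


RATE_GBPS = 25.0  # per-signal rate quoted in the article

jitter_cases = {
    "typical clock (20ps, assumed)": 20.0,
    "ultra low jitter clock (3ps)": 3.0,
    "Acuid timing control (1ps)": 1.0,
}

fractions = {name: jitter_fraction(j, RATE_GBPS) for name, j in jitter_cases.items()}

print(f"bit period at {RATE_GBPS:.0f}Gbps: {bit_period_ps(RATE_GBPS):.0f}ps")
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of the bit period")
```

At 25Gbps the bit period is only 40ps, so a clock with tens of picoseconds of jitter would consume a large part of it.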

For a fairer comparison with serial links, coding should be taken into account. Serial links using Clock Data Recovery to achieve 3.2Gbps need 8b/10b coding to embed the clock into the data stream. This reduces the useful data rate carried by the channel to 2.5Gbps. Acuid can implement 13Gbps serial or parallel links without coding. For serial links only, Acuid also uses 8b/13b coding to achieve coded data rates of 25Gbps, raising the useful data rate by 25% to 16Gbps. The 8b/13b coding scheme reduces the distortion caused by the limited channel bandwidth, i.e. the ESD structures, the packaging, the circuit board, etc.
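
The effect of line coding on useful data rate is a simple ratio, sketched below for the two schemes mentioned. The helper name is illustrative. The straight ratios come out close to, but not exactly at, the rounded figures quoted in the text: slightly above 2.5Gbps for 8b/10b at 3.2Gbps, and slightly below 16Gbps for 8b/13b at 25Gbps.

```python
def payload_rate_gbps(line_rate_gbps, data_bits, coded_bits):
    """Useful data rate left once coding overhead is removed."""
    return line_rate_gbps * data_bits / coded_bits


cdr_8b10b = payload_rate_gbps(3.2, 8, 10)     # serial link with clock data recovery
acuid_8b13b = payload_rate_gbps(25.0, 8, 13)  # coded Acuid serial link
acuid_uncoded = 13.0                          # uncoded Acuid link, all payload

print(f"8b/10b at 3.2Gbps: {cdr_8b10b:.2f} Gbps useful")
print(f"8b/13b at 25Gbps:  {acuid_8b13b:.2f} Gbps useful")
print(f"gain over uncoded 13Gbps: {acuid_8b13b / acuid_uncoded - 1:.0%}")
```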

The Acuid links can be combined to form parallel buses, which act as a Synchronous Conveyor of data. Data is introduced at one end of the Synchronous Conveyor, and after the wire and port delay, appears at the other end of the link. Combining the links in parallel causes no degradation in the speed of data transfer, nor is there any increase in latency. Acuid links require no protocol at the speed of 13Gbps per signal, and therefore have a very low latency.

Getting back to the PC application, the Acuid ports provide low latency channels that achieve improvements in speed of a factor of 10 to 100, providing a sound foundation for another decade of growth in the computer industry. This is achieved by using innovation rather than trying to push manufacturing technology into uncharted territory. In the Acuid parallel buses, every signal transfers data at 13Gbps, which is multiplied by the full width of the bus.

These ports have very low bit errors, so there is no need to incorporate error correction or protocol framing. This means the full bandwidth can be used for data transfer.

Using Acuid buses only 6 bits wide for input and for output gives a total data transfer rate of 156Gbps. This would be sufficient data transfer capability to fully load a multi-GHz processor, thus maximising the performance of the computing system.
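
The 156Gbps total follows directly from the per-signal rate, as this short check shows. It assumes one 6-bit bus in each direction, every signal running at 13Gbps with no protocol overhead, as the article describes; the function name is illustrative.

```python
SIGNAL_RATE_GBPS = 13  # uncoded rate per signal, from the article


def bus_total_gbps(width_bits, directions=2):
    """Total transfer rate for buses of the given width in each direction."""
    return width_bits * SIGNAL_RATE_GBPS * directions


per_direction = bus_total_gbps(6, directions=1)
total = bus_total_gbps(6)

print(f"per direction: {per_direction} Gbps")
print(f"input plus output: {total} Gbps")
```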

<a href=’http://www.e4engineering.com/content_images/acuidfour44.gif’>Figure 3</a> shows a computer architecture example using Acuid buses to eliminate data bottlenecks.

In contrast, the Intel P4 MCI bus, despite using 20 times more wires, provides only a tenth of this performance. An Acuid bus with the same number of signals as the Intel P4 MCI bus would convey many terabits per second of data, far more than is needed even by processors beyond the end of the roadmaps projected by any processor vendor.
