The first UltraSPARC processor with on-chip cryptographic accelerators was the UltraSPARC T1; each of the processor's eight cores has an associated crypto accelerator that is targeted at offloading and accelerating public-key cryptography. This accelerator, termed the modular arithmetic unit (MAU), performs the modular exponentiation operations that lie at the heart of algorithms such as RSA and Diffie-Hellman.
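To make "modular exponentiation" concrete, here is a minimal software sketch of the operation using the textbook square-and-multiply method. This is only an illustration of the arithmetic; the function name `mod_exp` and the toy RSA numbers are my own assumptions, and nothing here describes how the MAU is actually implemented in hardware.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Compute (base ** exponent) % modulus by square-and-multiply.

    Illustrative only: real RSA uses operands of 1024 bits or more,
    which is what makes the operation expensive in software.
    """
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")

    result = 1 % modulus          # handles the modulus == 1 corner case
    base %= modulus
    while exponent > 0:
        if exponent & 1:          # current low bit set: multiply it in
            result = (result * base) % modulus
        base = (base * base) % modulus   # square for the next bit
        exponent >>= 1
    return result


# Toy RSA round trip with deliberately tiny, insecure numbers (assumed
# for illustration): n = 61 * 53, public exponent 17, private exponent 2753.
n, e, d = 3233, 17, 2753
message = 65
ciphertext = mod_exp(message, e, n)
recovered = mod_exp(ciphertext, d, n)
```

Both the encrypt and decrypt steps are a single modular exponentiation, which is why offloading this one primitive covers most of the public-key workload.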
With the UltraSPARC T2 processor, each core's crypto accelerator retains its MAU, but is also enhanced by the introduction of a cipher/hash unit:
- UltraSPARC T1 accelerators (MAUs):
  - Target modular arithmetic operations
  - Accelerate public-key cryptography (e.g. RSA, DSA, Diffie-Hellman)
- UltraSPARC T2 accelerators also accelerate:
  - Bulk encryption (RC4, DES, 3DES, AES)
  - Secure hash (MD5, SHA-1, SHA-256)
  - Additional public-key algorithms (Elliptic Curve Cryptography)
On the T2, the two sub-units that constitute the accelerator can operate in parallel, so each core's accelerator can be performing an RSA operation and an AES operation at the same time, as illustrated in the following figure:
[Figure: the MAU and the cipher/hash unit of a core's accelerator operating in parallel]

Communication with the cipher/hash unit is via a memory-based control word queue. To offload an operation to the accelerator, it is necessary to generate a control word that provides the accelerator with the information required to perform the operation, e.g. pointers to the source and destination buffers, keys and IVs. Because every control word carries everything its operation needs, the accelerator is essentially stateless, which is extremely important in application spaces where there can be literally thousands of simultaneous connections (e.g. Secure Web, Secure VoIP). Additionally, given this lightweight interface, the overhead of offloading an operation to the accelerator can be minimal, allowing even short-duration operations to be offloaded cost-effectively.
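The control word idea can be sketched in a few lines of software. Everything below is hypothetical: the `ControlWord` fields, the `ControlWordQueue` class and the `"sha256"` opcode are names I made up for illustration and do not reflect the real T2 control-word layout. The point is only that each queue entry is self-describing, so the consumer keeps no per-connection state.

```python
import hashlib
from collections import deque
from dataclasses import dataclass


@dataclass
class ControlWord:
    """Hypothetical, simplified control word (not the real T2 format)."""
    op: str              # assumed opcode; only "sha256" is modelled here
    src: bytearray       # stands in for a pointer to the source buffer
    dst: bytearray       # stands in for a pointer to the destination buffer
    length: int          # number of source bytes to process
    done: bool = False   # completion flag written by the "accelerator"


class ControlWordQueue:
    """Bounded FIFO standing in for the memory-based control word queue."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.entries = deque()

    def enqueue(self, cw: ControlWord) -> None:
        if len(self.entries) >= self.capacity:
            raise BufferError("control word queue is full")
        self.entries.append(cw)

    def service_one(self):
        """Software stand-in for the hardware consuming the oldest entry."""
        if not self.entries:
            return None
        cw = self.entries.popleft()
        if cw.op != "sha256":
            raise ValueError(f"unsupported op in this sketch: {cw.op}")
        digest = hashlib.sha256(bytes(cw.src[:cw.length])).digest()
        cw.dst[:len(digest)] = digest   # write the result where dst points
        cw.done = True
        return cw


# Usage: two unrelated "connections" share one queue with no shared state.
cwq = ControlWordQueue()
out_a, out_b = bytearray(32), bytearray(32)
cwq.enqueue(ControlWord("sha256", bytearray(b"connection A"), out_a, 12))
cwq.enqueue(ControlWord("sha256", bytearray(b"connection B"), out_b, 12))
first = cwq.service_one()
second = cwq.service_one()
```

Because nothing about connection A lingers in the queue or the consumer once its entry is serviced, interleaving operations from thousands of connections costs nothing extra.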
It is possible to interact with the accelerator in a synchronous or asynchronous manner, so that, if desired, the core can go off and perform other useful processing while the crypto operation is being performed in parallel on the accelerator. This provides an additional level of parallelism that is not achieved when ISA customization is used to achieve crypto acceleration.
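The difference between the two styles of interaction can be sketched with a worker thread standing in for the accelerator. This is an analogy under stated assumptions, not the T2 programming interface: `accelerator_hash`, `hash_synchronously` and `hash_asynchronously` are hypothetical names, and the "accelerator" is just `hashlib` running on another thread.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def accelerator_hash(data: bytes) -> bytes:
    """Stand-in for an offloaded hash operation (assumed, not the T2 API)."""
    return hashlib.sha256(data).digest()


def hash_synchronously(data: bytes) -> bytes:
    """Synchronous style: the caller waits until the result is ready."""
    return accelerator_hash(data)


def hash_asynchronously(executor, data: bytes, other_work):
    """Asynchronous style: submit, do other work, then collect the result."""
    future = executor.submit(accelerator_hash, data)   # hand off the operation
    side_result = other_work()                         # core stays busy meanwhile
    return future.result(), side_result                # block only at the end


payload = b"x" * 4096
with ThreadPoolExecutor(max_workers=1) as accel:
    digest, side = hash_asynchronously(accel, payload, lambda: sum(range(1000)))
```

In the asynchronous case the caller only blocks when it finally needs the digest, which is the extra level of parallelism described above.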
Why on-chip accelerators?
In comparison to using on-chip crypto accelerators, the use of off-chip, look-aside accelerators will tend to increase CPU utilization, consume additional I/O bandwidth and introduce additional latency. This tends to make off-chip cards problematic for the effective acceleration of bulk ciphers, especially for small or moderately sized packets. While recent announcements of "high-performance" off-chip accelerators, using HT or FSB connectivity, may help reduce some of these issues, repeatedly ping-ponging the data off and on chip is inevitably less efficient than using on-chip accelerators that are tightly coupled with the processor cores.
Zero-cost Security?
In today's environment, security is becoming ever more essential, whether the focus be web servers, databases, file systems or networking. However, the high cost associated with security is problematic: when a system that is capable of performing X operations per second in a non-secure mode is transitioned to secure operation, the throughput that it can sustain will fall drastically. 2X slowdowns are commonplace, and 5X or even 10X slowdowns are not that uncommon.
As a result of this high cost, there is often significant reluctance to develop and deploy the comprehensive security strategies that are required in today's world, leading to the serious consequences that we read about all too frequently.
So what is typically done to remedy this situation?
Looking at these security overheads, the vast majority of the overhead is frequently attributable to the cryptographic operations that underpin the security protocols. However, general-purpose processors are ill suited to performing cryptographic operations. As a result, there are significant advantages to offloading the cryptographic processing to custom hardware that can perform the operations orders of magnitude faster than can be achieved on the processor.
Accordingly, accelerators should allow the conversion of the significant security overheads into virtually negligible overheads. Essentially, accelerators should allow us to achieve zero-cost security! (By which I mean that there should be a negligible performance impact associated with going secure.)
Unfortunately, accelerators have largely failed to deliver on this.
This is basically a result of the way we have architected and deployed accelerators; we have a system, and then, almost as an afterthought, we add in the PCI-based accelerator card. With this architecture, the cost of offloading an operation to the accelerator can be very high, significantly limiting the type of cryptographic operation that can be offloaded cost-effectively; it's frequently more cost-effective to just perform the processing on the processor!
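A back-of-the-envelope model shows why the offload cost matters so much. Suppose processing n bytes on the processor takes n / cpu_rate, while offloading takes a fixed overhead plus n / accel_rate; offloading only wins once n exceeds the break-even size computed below. The function name and every number here are made-up placeholders chosen purely for illustration, not measurements of any real card or of the T2.

```python
def break_even_bytes(overhead_s: float,
                     cpu_bytes_per_s: float,
                     accel_bytes_per_s: float) -> float:
    """Smallest payload (in bytes) for which offloading beats the CPU.

    Model: cpu_time(n)     = n / cpu_bytes_per_s
           offload_time(n) = overhead_s + n / accel_bytes_per_s
    Setting the two equal and solving for n gives the break-even size.
    """
    if accel_bytes_per_s <= cpu_bytes_per_s:
        return float("inf")   # accelerator is no faster: never worth offloading
    per_byte_saving = 1.0 / cpu_bytes_per_s - 1.0 / accel_bytes_per_s
    return overhead_s / per_byte_saving


# Purely illustrative, assumed numbers (NOT measurements):
CPU_RATE = 100e6      # bytes/s the processor manages in software
ACCEL_RATE = 1e9      # bytes/s the accelerator manages once it has the data

pci_card = break_even_bytes(20e-6, CPU_RATE, ACCEL_RATE)   # 20 us offload cost
on_chip = break_even_bytes(0.5e-6, CPU_RATE, ACCEL_RATE)   # 0.5 us offload cost
```

With these assumed inputs the break-even payload for the high-overhead case is around 2 KB, while for the low-overhead case it is a few tens of bytes. The absolute values mean nothing; the point is that the break-even size scales directly with the offload overhead, so a high overhead rules out exactly the small operations that dominate many secure workloads.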
With the UltraSPARC T2 processor, we have moved the crypto accelerators on-chip and tightly coupled them with the processor cores. As a result, it has been possible to radically reduce the overheads associated with offloading an operation to the accelerators. In turn, this allows the T2 accelerators to cost-effectively handle a much broader range of cryptographic operations than traditional off-chip accelerators and enables the UltraSPARC T2 processor to deliver zero-cost security in a wide variety of application spaces.