The performance benefits of the hardware cryptographic accelerators can be considered at a microbenchmark level or an application level.
Microbenchmark performance benefits
The peak cryptographic performance for the various supported ciphers and cryptographic hashes is illustrated in the following tables:
| Symmetric cipher | Modes of operation | Chip-wide performance (Gb/s [8-cores, 1.4GHz]) |
|---|---|---|
| RC4 | - | 83 |
| DES | ECB, CBC, CFB64 | 83 |
| 3DES | ECB, CBC, CFB64 | 27 |
| AES-128 | ECB, CBC, CTR | 44 |
| AES-192 | ECB, CBC, CTR | 36 |
| AES-256 | ECB, CBC, CTR | 31 |
| Cryptographic hash | Chip-wide performance (Gb/s [8-cores, 1.4GHz]) |
|---|---|
| MD5 | 41 |
| SHA-1 | 32 |
| SHA-256 | 41 |
| Public-key algorithm | Chip-wide performance (private-key ops/sec [8-cores, 1.4GHz]) |
|---|---|
| RSA-1024 | 37,000 |
| RSA-2048 | 6,000 |
| ECCp-160 | 52,000 |
| ECCb-163 | 92,000 |
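To put the chip-wide figures in per-core terms, the arithmetic can be sketched as below. This is only a rough illustration: it assumes that throughput divides evenly across the 8 cores, which the tables do not state, and the names `CHIP_WIDE_GBPS` and `per_core` are made up for this example.

```python
# Chip-wide peak throughput in Gb/s, copied from the tables above
# (8 cores at 1.4 GHz).
CHIP_WIDE_GBPS = {
    "RC4": 83,
    "DES": 83,
    "3DES": 27,
    "AES-128": 44,
    "AES-192": 36,
    "AES-256": 31,
    "MD5": 41,
    "SHA-1": 32,
    "SHA-256": 41,
}

CORES = 8  # assumption: the chip-wide rate splits evenly over 8 cores


def per_core(gbps, cores=CORES):
    """Return (Gb/s, MB/s) per core under the even-split assumption."""
    per_core_gbps = gbps / cores
    per_core_mbytes = per_core_gbps * 1000 / 8  # 1 Gb/s = 125 MB/s
    return per_core_gbps, per_core_mbytes


if __name__ == "__main__":
    for name, gbps in CHIP_WIDE_GBPS.items():
        g, m = per_core(gbps)
        print(f"{name:8s} {g:6.2f} Gb/s per core  ({m:7.1f} MB/s)")
```

Under that assumption, AES-128 at 44 Gb/s chip-wide works out to 5.5 Gb/s per core.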
Performance is dependent on the size of the object being processed, although the interface to the hardware is very efficient, as illustrated in the following figure: 
Additionally, the hardware accelerator is capable of sustaining multiple outstanding read and write requests, such that data can be sourced from DRAM without impacting performance.
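The dependence on object size can be reasoned about with a simple fixed-overhead model: each request pays a constant cost before data streams at the peak rate. The sketch below is not a description of the actual hardware interface. The per-request overheads are hypothetical placeholders rather than measurements, and the per-core peak assumes the chip-wide AES-128 figure divides evenly over 8 cores.

```python
def effective_gbps(size_bytes, peak_gbps, overhead_s):
    """Throughput of one request when a fixed cost is paid per request."""
    bits = size_bytes * 8
    transfer_s = bits / (peak_gbps * 1e9)  # time spent at the peak rate
    return bits / (overhead_s + transfer_s) / 1e9


# Per-core AES-128 peak derived from the table (44 Gb/s over 8 cores);
# the even split across cores is an assumption.
PEAK_GBPS = 44 / 8

# Hypothetical per-request overheads in seconds; NOT measured values.
OVERHEADS_S = {"lean interface": 1e-6, "heavy software stack": 50e-6}

if __name__ == "__main__":
    for label, overhead in OVERHEADS_S.items():
        for size in (64, 1024, 8192, 65536):
            rate = effective_gbps(size, PEAK_GBPS, overhead)
            print(f"{label:22s} {size:6d} B  {rate:5.2f} Gb/s")
```

In this model, the smaller the object, the larger the share of time spent on the fixed cost, so throughput approaches the peak only for large objects or very small overheads.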
When compared against other processors, the performance delivered by the UltraSPARC T2 processor is significant, as illustrated in the following figure: [In this figure, AES-128-CBC processing of 8KB objects is measured. On the x86 systems, processing is performed in software via OpenSSL. On the UltraSPARC T2, processing is performed in hardware, with the offload occurring via the Solaris userland cryptographic framework or via the Solaris kernel cryptographic framework.]
From the previous figure it is apparent that, when focused on pure AES performance, the UltraSPARC T2 processor is capable of significantly outperforming competitive processors: 2 T2 cores deliver more performance than 8 x86 cores.
RSA performance
For RSA, the performance observed rapidly approaches the hardware peak performance, even using a limited number of requesting threads, as illustrated in the following figure: 
Accordingly, the T2 is capable of delivering up to 37K RSA-1024 sign operations per second, while still over 50% idle!
ECC performance
The T2 MAU provides hardware support for both prime and binary curves. Given the hardware support for Galois field operations, performance for binary curves is especially impressive compared to the performance delivered by traditional processors, as illustrated in the following figure (ECDSA sign operations):
Application-level performance benefits
When looking at application-level benefits, it is important to ensure that an apples-to-apples comparison is undertaken. It is also important to ensure that the set-up is not cherry-picked to showcase particular strengths. As a result, industry-standard benchmarks are the best basis for these comparisons.
One such benchmark that has a significant focus on cryptography is the banking workload from SPECweb2005. In this workload, clients interact with a bank's web server and, as would be expected, all communication is secured using HTTPS. In the following table the performance of the UltraSPARC T2 processor is compared with other systems:
| Processor | SPECweb2005 Banking |
|---|---|
| 1 x T2 [1.4GHz] | 70,000 |
| 2 x Quad-core Opteron Processor (2356) [2.3GHz] | 50,856 |
| 2 x Quad-core Xeon Processor X5460 [3.2GHz] | 51,840 |
| 4 x Quad-core Xeon Processor X7350 [3.0GHz] | 71,104 |
In the T2 system, the RSA, MD5 and RC4 operations are offloaded to the on-chip cryptographic accelerators, whereas the Opteron and Xeon processors perform this crypto processing in software. It is apparent that a single-socket UltraSPARC T2 processor provides performance equivalent to 4-socket x64 systems containing quad-core processors. Alternatively, on a per-socket basis, T2 outperforms the competition by over 2.7X.
While this performance leadership is not solely attributable to the hardware crypto support (on-chip NICs and an abundance of threads help too), the cryptographic overheads associated with HTTPS are significant: RSA operations for session establishment, and then RC4 and MD5 operations (the algorithms used for SPECweb2005) to secure and authenticate the subsequent traffic, as illustrated in the following figure:
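The per-socket comparison can be reproduced from the table with a few lines of arithmetic. The sketch below simply divides each SPECweb2005 Banking score by its socket count; the names `RESULTS`, `per_socket` and `t2_advantage` are illustrative only.

```python
# (system, sockets, SPECweb2005 Banking score), copied from the table above.
RESULTS = [
    ("1 x T2 [1.4GHz]", 1, 70000),
    ("2 x Quad-core Opteron 2356 [2.3GHz]", 2, 50856),
    ("2 x Quad-core Xeon X5460 [3.2GHz]", 2, 51840),
    ("4 x Quad-core Xeon X7350 [3.0GHz]", 4, 71104),
]


def per_socket(score, sockets):
    """Score normalised by the number of processor sockets."""
    return score / sockets


def t2_advantage(results):
    """Ratio of the T2 per-socket score to every other system's."""
    _, t2_sockets, t2_score = results[0]
    t2 = per_socket(t2_score, t2_sockets)
    return {
        name: t2 / per_socket(score, sockets)
        for name, sockets, score in results[1:]
    }


if __name__ == "__main__":
    for name, ratio in t2_advantage(RESULTS).items():
        print(f"T2 vs {name}: {ratio:.2f}x per socket")
```

The ratios come out at roughly 2.75x against the Opteron system, 2.70x against the 2-socket Xeon system and 3.94x against the 4-socket Xeon system.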

It is therefore not surprising that providing hardware support to accelerate cryptographic processing provides a significant performance advantage to the UltraSPARC T2 processor on SPECweb2005 Banking.
Comments (1)
Jul 23, 2008
cfreese says:
Why is there so much overhead in using the accelerators? Is there some other API that wouldn't be bound by overhead? A quick benchmark using OpenSSL "speed" shows that the T2 (hardware) performance drops drastically if one uses smaller than 8K blocks.
Example: RC4
DL380= HP DL380, 2x Xeon 5450, 16GB, RedHat 5.2 server
openssl speed rc4
T5120= Sun T5120, 1x T2 @ 1.167GHz, 16GB, Solaris 05/08
openssl speed rc4
openssl speed -engine pkcs11 -evp rc4
Example: aes-128-cbc
DL380= HP DL380, 2x Xeon 5450, 16GB, RedHat 5.2 server
openssl speed aes-128-cbc
T5120= Sun T5120, 1x T2 @ 1.167GHz, 16GB, Solaris 05/08
openssl speed aes-128-cbc
openssl speed -engine pkcs11 -evp aes-128-cbc