LDoms Community Cookbook
Section Introduction

This section provides details on the hardware platforms that can run Logical Domains and their capabilities. It also covers platform-specific LDoms tasks, such as Split-PCI procedures and other advanced approaches to making the most of the systems' hardware.

LDoms Capable Systems Details

Systems Overview

In this section we examine the topologies of the systems covered in this document and their effect on various aspects of Logical Domains functionality. We pay particular attention to the bus architectures and their implications for partitioning physical devices with LDoms. Because systems are available in various combinations and quantities of components, we cannot cover every permutation in detail; instead we cover the most common configurations of each system.

Sun Logical Domains is designed to run on sun4v-based systems, often referred to as "Chip Multi-Threading" or CMT systems. These systems use the UltraSPARC "T series" chips (code named "Niagara"), which contain multiple cores with many threads per core. The various generations of systems and processors differ in configuration and hence in their Logical Domains capabilities. The following is a listing of some of the currently available systems and their maximum LDoms capabilities (not an exhaustive list):
(Maximum number of domains may be dependent upon LDoms version)

UltraSPARC-T1 Systems

Sun SPARC Enterprise T1000

Overview

The Sun Fire/SPARC Enterprise T1000 (codenamed Erie) is the entry-level UltraSPARC-T1 (Niagara) processor based server. It is optimized for heavily threaded workloads including, but not limited to, web servers, service oriented architectures, application development and portals. The Sun Fire T1000 also offers extensive RAS features and unique power efficiency. The Sun Fire T1000 server was renamed to Sun SPARC Enterprise T1000.

Sun SPARC Enterprise T2000

Overview

The Sun SPARC Enterprise T2000 Server (also known and marketed as Sun Fire T2000 Server) is the first UltraSPARC T1 (Niagara) based system, optimized for multi-threaded workloads combined with maximum power efficiency. The marketing name for this family of systems is Sun Fire CoolThreads Server. The system has two PCI buses, so you can create a maximum of two I/O domains. The device layout differs subtly between versions of the Sun Fire T2000.

Older Sun Fire T2000 PCI buses

(Contributed by Alex Chartre)

The buses are identified as pci@780 (or bus_a) and pci@7c0 (or bus_b). On the older version of the Sun Fire T2000 the internal disks and DVD-ROM are connected to a disk controller card in one of the PCI-X slots, so they are on bus pci@7c0 (bus_b). The following describes the device assignments:
As you can see, both buses have two network interfaces, but the other resources are not so evenly spread: pci@7c0 (bus_b) has all the internal disks, the DVD-ROM and 4 PCI slots, while pci@780 (bus_a) has only one PCI slot. Creating an I/O domain with bus pci@7c0 (bus_b) is therefore straightforward, because it provides all the basic hardware resources you need (i.e. a disk and a network interface). With bus pci@780 (bus_a), however, you get network interfaces but no disk. To create an I/O domain with pci@780 (bus_a) you will have to add a PCI-E card (either a Fibre Channel or a SCSI host adapter) in PCI-E slot 0 to get access to storage devices. You also have to ensure that the card you add can be used to boot the system.

Older Sun Fire T2000 Split-PCI setup

To have more than one I/O domain you need to remove one of the buses from the control domain. This is often described as "Split-PCI".
Configuration of the primary domain
Check that disks are on bus pci@7c0 (bus_b)
You have to ensure that the system disk is on bus pci@7c0 (bus_b) and that no disk on bus pci@780 (bus_a) is being used by the primary domain.

Check that network interfaces are on bus pci@7c0 (bus_b)

You have to ensure that the network interfaces you are using (especially the primary network interface) are on bus pci@7c0 (bus_b). If your primary network interface (for example e1000g0) is not on bus pci@7c0 (bus_b), you will have to reconfigure your system to use another interface (for example e1000g2) that is on bus pci@7c0 (bus_b). If you change the network interface, don't forget to reconnect the network cables accordingly (for example, move the network cable from e1000g0 to e1000g2).

Remove the appropriate bus from the control domain
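The original command listing for this step is not reproduced here. As a hedged sketch only, assuming the `ldm remove-io` and `ldm add-config` subcommands of the LDoms 1.x Logical Domains Manager (the subcommand for saving a configuration may be named differently in other releases) and a hypothetical configuration name `split-cfg`, releasing bus pci@780 (bus_a) from the control domain could look like this. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: release one PCI bus from the control domain (older T2000 layout).
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: bus to give up (for example pci@780)
# $2: configuration name to save (hypothetical, chosen for illustration)
release_bus() {
    bus=$1
    config=$2
    run ldm remove-io "$bus" primary   # detach the bus from the primary domain
    run ldm add-config "$config"       # save the new configuration
}

release_bus pci@780 split-cfg
```

The primary domain normally has to be rebooted before the bus is actually released, so run this only after the disk and network checks above have passed.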
Configuration of the alternate I/O Domain
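The original command listing for this step is not reproduced here. The following is a minimal sketch, not the author's exact procedure, of creating a second I/O domain that owns the released bus. The VCPU count and memory size are illustrative assumptions; the domain name "alternate" matches the text. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: create an I/O domain that owns a PCI bus released by the primary.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: domain name, $2: bus, $3: VCPU count, $4: memory size (for example 4g)
create_io_domain() {
    dom=$1
    bus=$2
    vcpus=$3
    mem=$4
    run ldm add-domain "$dom"
    run ldm add-vcpu "$vcpus" "$dom"
    run ldm add-memory "$mem" "$dom"
    run ldm add-io "$bus" "$dom"       # give the domain direct ownership of the bus
    run ldm bind-domain "$dom"
    run ldm start-domain "$dom"
}

# Illustrative values: 8 VCPUs and 4 GB of memory are assumptions.
create_io_domain alternate pci@780 8 4g
```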
You may now connect to the console of the domain "alternate" to install it. The installation can be done over the network with a "boot net", as when installing a regular SPARC system.

Newer Sun SPARC Enterprise T2000 PCI buses

(Contributed by Alex Chartre)

The buses are identified as pci@780 (or bus_a) and pci@7c0 (or bus_b). The newer versions of the Sun SPARC Enterprise T2000 have an on-board disk controller on bus pci@780 (bus_a). The internal disks are connected to that on-board controller, so they are now on bus pci@780 (bus_a), but the DVD-ROM is still on bus pci@7c0 (bus_b). So for these versions of the Sun Fire T2000, we have this layout:
So if you want to set up a split PCI configuration on a newer Sun Fire T2000, you have to add either a Fibre Channel or a SCSI host adapter in one of the PCI-E or PCI-X slots on bus pci@7c0 (bus_b), like this:
Creating an I/O domain with bus pci@780 (bus_a) is therefore straightforward, because it provides all the basic hardware resources you need (i.e. a disk and a network interface). With bus pci@7c0 (bus_b), however, you get network interfaces but no disk. To create an I/O domain with pci@7c0 (bus_b) you will have to add a card (either a Fibre Channel or a SCSI host adapter) in one of the PCI-E or PCI-X slots on that bus to get access to storage devices. You also have to ensure that the card you add can be used to boot the system.

Newer Sun SPARC Enterprise T2000 Split-PCI setup

To have more than one I/O domain you need to remove one of the buses from the control domain. This is often described as "Split-PCI".
Configuration of the primary domain
Check that disks are on bus pci@780 (bus_a)

You have to ensure that the system disk is on bus pci@780 (bus_a) and that no disk on bus pci@7c0 (bus_b) is being used by the primary domain.

Check that network interfaces are on bus pci@780 (bus_a)

You have to ensure that the network interfaces you are using (especially the primary network interface) are on bus pci@780 (bus_a). If your primary network interface (for example e1000g2) is not on bus pci@780 (bus_a), you will have to reconfigure your system to use another interface (for example e1000g0) that is on bus pci@780 (bus_a). If you change the network interface, don't forget to reconnect the network cables accordingly (for example, move the network cable from e1000g2 to e1000g0).
Configuration of the alternate I/O Domain
You may now connect to the console of that domain to install it. The installation can be done over the network with a "boot net", as when installing a regular SPARC system.

Sun Blade T6300 Server Module

(Contributed by Sudhir Bhole)

Overview

The Sun Blade T6300 Server Module is the first SPARC blade (based on UltraSPARC T1) for the Constellation family of blade products. The system has two PCI buses, so you can create a maximum of two I/O domains.

PCI Buses

The buses are identified as pci@780 (or bus_a) and pci@7c0 (or bus_b). The following describes the device assignments:
As you can see, bus_a has two network interfaces but no disks, while bus_b has four disks but no network interfaces. To create an I/O domain with bus_a, you will therefore have to add a PCI Express Module (a Fibre Channel or SCSI host adapter) in PCI-EM Slot 0. You also have to ensure that the card you add can be used to boot the system. Likewise, to create an I/O domain with bus_b, you will have to add another PCI Express Module (with network ports) in PCI-EM Slot 1.

UltraSPARC T2 systems

Sun SPARC Enterprise T5120/T5220

Overview

The Sun SPARC Enterprise T5120 and T5220 are the first systems based on the UltraSPARC T2 chip. The UltraSPARC T2 chip doubles the number of threads from 32 to 64 compared with the UltraSPARC T1 chip. It also eliminates the floating point limitation of the UltraSPARC T1 chip by providing one FP unit per core.

Bus Topology

The T5120 and T5220 systems have a single PCI bus, to which the disk, USB, terminal and 4 Ethernet ports are connected. The system also contains an on-chip 10Gb networking function (see below). You can see the I/O bus of the system with the following command:

primary# ldm list-bindings primary

Sun Blade T6320 Server Module

(Contributed by Sudhir Bhole)

Overview

The Sun Blade T6320 Server Module is an UltraSPARC T2 based blade.

Bus Topology

It has a single PCI bus, to which the disk, USB, terminal and two network interfaces are connected. The system also contains an on-chip 10Gb networking function (see below).

Network Interface Unit (NIU) on T5120/T5220/T6320

Overview

The Network Interface Unit (NIU) is a component of the UltraSPARC-T2 CPU which connects a pair of on-chip 10 Gb/s Ethernet MACs to the rest of the system; in effect it provides two network interfaces managed by the nxge driver. Because the NIU is part of the CPU, it appears as a device independent of any PCI bus, and so it can be assigned to any domain independently of the assignment of the PCI bus.
In addition, a network interface managed by the NIU can be physically shared with other domains, so that these domains can use the network interface directly without having to go through a virtual switch or a service domain; this is what we call network hybrid I/O. With network hybrid I/O, the physical network interface is still owned by an I/O domain and associated with a virtual switch. However, if a virtual network interface of a guest domain is connected to such a virtual switch and uses the network hybrid I/O mode, then that guest domain has some direct access (dedicated DMA channels, interrupts, registers) to the physical network interface owned by the I/O domain. The guest domain can therefore send and receive network packets directly to and from the physical network interface. The guest domain then only uses the virtual switch when it initially sets up its virtual network interface or when it sends broadcast or multicast packets. The figure below illustrates the behavior of a virtual network interface using network hybrid I/O:

Why Use Hybrid NIU?
System Requirements to use Hybrid NIU mode

To use network hybrid I/O, there are some hardware and software requirements:

A single XAUI adapter (i.e. nxge interface) can be shared with a maximum of 3 guest domains. So if you have 2 XAUI adapters, you can have up to 6 guest domains using a virtual network interface in hybrid mode.

Hybrid NIU procedure
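The original procedure listing is not reproduced here. As a hedged sketch, assuming the `mode=hybrid` option of `ldm add-vnet` and hypothetical names (`vnet0`, `primary-vsw0`, `ldg1`) chosen purely for illustration, the capacity rule above and the hybrid request could be expressed as follows. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: hybrid-mode capacity and a hybrid virtual network interface request.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Each XAUI adapter (nxge interface) can be shared with at most 3 guest domains.
# $1: number of XAUI adapters
hybrid_capacity() {
    echo $(( $1 * 3 ))
}

# $1: virtual network interface name, $2: virtual switch, $3: guest domain
add_hybrid_vnet() {
    run ldm add-vnet mode=hybrid "$1" "$2" "$3"
}

hybrid_capacity 2
add_hybrid_vnet vnet0 primary-vsw0 ldg1
```

The virtual switch is assumed to exist already and to be backed by an nxge interface owned by the I/O domain.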
Note that "mode=hybrid" is just a hint to the system that it should try to use hybrid mode. If the system cannot use hybrid mode, for example because the interface does not support it or because the maximum number of domains using hybrid mode has already been reached, then the system will configure the virtual network interface with the traditional virtual I/O mode.

UltraSPARC T2-Plus Systems

Sun SPARC Enterprise T5140/T5240

Overview

The Sun SPARC Enterprise T5140 server begins the next wave of high-efficiency systems based on the second generation of Sun's Chip Multi-threaded Technology (CMT). It follows the previously announced Sun SPARC Enterprise T5120 and T5220 servers, with which it shares numerous features. The Sun SPARC Enterprise T5140 server uses multiple UltraSPARC T2 Plus processors. As the system has multiple chips, it uses cache coherency protocols over connections between the chips to manage the sharing of data.

Bus Topology

The T5140 has two PCI-E buses, one for each of the two processors of the system. They can be split via Logical Domains functions, providing two discrete I/O systems that can each be dedicated to a domain.

Sun SPARC Enterprise T5440

T5440 Overview

The T5440 is the largest member of the UltraSPARC T2+ processor based SMP T5x40 product family. The server is optimized for heavily threaded workloads including, but not limited to, web servers, service oriented architectures, application servers, portals and databases. The T5440 is housed in the 4RU brushed aluminum chassis also used in the x86 product line. The server has 4 SAS disks (73 or 146GB 10k RPM or 73GB 15k RPM drives), offers a total of 256GB memory (update Fall 2008: 512GB total is now available using 8GB DIMMs) in its 64 DIMM sockets, and has 8 PCI-Express slots. Four of the PCI-Express slots are x8, two slots are x8 with a x16 connector, and two slots are x8 and shared with the XAUI 10Gbps Ethernet interfaces. The server also has four Gigabit Ethernet interfaces on the motherboard and offers redundant power and cooling via its four hot-pluggable power supplies. The T5440 is expected to support 128 LDOMs. A T5440 comes with 4 UltraSPARC-T2 processors installed. Each UltraSPARC-T2 is directly connected to one quarter of the entire system memory, with 1 Gigabyte memory interleaving, and owns a PCIe Root-Complex.

Bus Topology

The T5440 has 4 PLX units, allowing all processors to access devices in a single partition, and can provide up to 4 partitions, 1 for each of the physical sockets. Each partition has access to only certain devices such as disks and networks, so combinations may be required.

T5440 LDoms Architectures

(By Pallab Bhattacharya)

Creating Domains

When creating domains, the I/O and CPU requirements of the applications that will run in the virtualized environment should be estimated. The I/O performance of a virtualized 1 Gigabit network and virtualized disk is very close to native. Compared with native I/O, however, virtualized I/O consumes more CPU cycles, often in the range of 20%-50%, depending on the size and frequency of the I/O. Hence, when doing resource planning for an LDOMs environment, a couple of points should be considered to get the best performance from the T5440. Each of these points is discussed below in more detail:
Allocating VCPUs to a Domain

The number of VCPUs that need to be allocated to a domain depends largely on the ability of the application to make good use of the VCPUs. Applications often fail to scale up on a large system, which in itself makes them good candidates to run in a virtualized environment. However, if an application is also very I/O intensive, it is worth running it in an I/O domain instead of a proxied-I/O based guest domain.

In addition to the VCPUs needed by the application, care must be taken to allocate VCPUs to handle interrupts. For I/O intensive applications, the best performance is obtained if each interrupt source is managed by a dedicated VCPU. Having a few spare VCPUs to manage interrupt processing is always desirable, even in a proxy-I/O based guest domain. For most practical purposes, a single VCPU should be able to handle interrupts from multiple sources as long as the number of interrupts managed by that VCPU does not exceed 6K/sec. The number of interrupts managed by each VCPU should be monitored using tools such as 'mpstat 1'.

Another point to bear in mind is that VCPUs should be allocated to a domain in multiples of at least 4, and preferably in multiples of 8 where possible. A single UltraSPARC-T2 processor has 8 cores, each core with 8 hardware threads spread over 2 execution units. Studies using micro-benchmarks show that if the strands (VCPUs) of a core are shared between 2 or more domains, the loss in performance can vary between 5% and 20% compared with domains that do not share strands. The variation depends on how CPU and memory intensive the application is.

Creating I/O Domains

There are several factors that might motivate the creation of I/O domains. Applications such as
From the diagram in the previous section, it is evident that if we create 4 I/O domains, the two I/O domains configured out of PCIe-2 and PCIe-3 would need to rely on the other two I/O domains for primary network and boot disk. The other two domains, configured out of PCIe-0 and PCIe-1, would have a cross-dependency on each other for primary network and boot disk. If it is necessary to avoid inter-domain dependency for primary network and boot disk, it should be possible to add a network card to each root complex that has no directly attached network devices (PCIe-0, PCIe-2 and PCIe-3), and an HBA with external storage to each root complex that has no directly attached disks (PCIe-1, PCIe-2 and PCIe-3).

When creating only 2 I/O domains, it is better to have the first I/O domain comprise PCIe-0 and PCIe-1 and the second I/O domain comprise PCIe-2 and PCIe-3. Again, if the second I/O domain needs to be independent of the first for primary network and boot disk, then a network card and an HBA with external storage must be attached to the respective PCIe root complexes (PCIe-2 and PCIe-3). As noted earlier, the number of VCPUs that need to be allocated to an I/O domain depends on the number of interrupt sources that the domain must manage, in addition to the number of VCPUs needed by the application.

CPU Affinity Consideration in IO Domains

Using micro-benchmarks, it has been possible to demonstrate that if the VCPUs for an I/O domain are allocated from the same chip that directly attaches to the PCIe Root-Complex for that I/O domain, then latency improves by as much as 5%-10% compared with random VCPU allocation. For example (from the picture on the first page), if an I/O domain is created to manage devices connected to PCIe-2, then the best performance is obtained if the VCPUs for the I/O domain are allocated from T2.
Currently LDOMs supports neither affinity-aware CPU allocation nor explicit VCPU allocation. With a little effort, however, affinity-aware I/O domain creation is possible, as shown in the next section.

Creating IO Domain with Cross-Domain Dependency

The internal drives of the T5440 are connected to PCIe-0. Hence it is not possible to remove PCIe-0 from the Primary (or Control) Domain; otherwise the Primary Domain would fail to boot and no further configuration could be done. It is possible, however, to remove PCIe-1, PCIe-2 and PCIe-3 from the Primary Domain. To create an I/O domain using PCIe-1, that bus has to be removed from the Primary Domain. This causes the Primary Domain to lose its primary network interface if it has been using the on-board NICs. If a network card is available on PCIe-0, the primary network for the Primary Domain can be switched to the ports on that card before PCIe-1 is removed from the Primary Domain. If an additional network card is not available, it should still be possible to remove PCIe-1 from the Primary Domain and create another I/O domain (let us call it the Secondary Domain) managing the devices off PCIe-1. In that case, the Primary Domain provides the boot-disk service to the Secondary Domain and the Secondary Domain provides the primary network service to the Primary Domain. The pseudo-steps below outline how this can be done.

In the Primary Domain
(This would cause the Primary Domain to lose its network after reboot.)

Reboot the Primary Domain and log back into the Primary Domain from the console.

To cause the VCPUs for the Secondary Domain to be allocated from T1, create a dummy domain with the remaining 56 VCPUs from T0. Bind the dummy domain.
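The dummy-domain step can be sketched as follows. This is an illustration under stated assumptions, not the author's exact commands: it assumes the Primary Domain holds 8 VCPUs on T0 (inferred from 64 strands per chip minus the 56 mentioned above) and uses the hypothetical domain name `dummy`. The source does not say whether the dummy domain needs other resources, such as memory, before it will bind. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: occupy the unused strands of chip T0 with a placeholder domain so
# that the next domain's VCPUs are expected to come from T1.
DRY_RUN=${DRY_RUN:-1}
STRANDS_PER_CHIP=64   # 8 cores x 8 hardware threads per chip

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: VCPUs already held on the chip (by the Primary Domain)
remaining_on_chip() {
    echo $(( $STRANDS_PER_CHIP - $1 ))
}

# $1: VCPUs held by the Primary Domain, $2: placeholder domain name
reserve_rest_of_chip() {
    count=$(remaining_on_chip "$1")
    run ldm add-domain "$2"
    run ldm add-vcpu "$count" "$2"
    run ldm bind-domain "$2"
}

reserve_rest_of_chip 8 dummy
```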
Now the Primary Domain should have the primary network available.
The only issue with the above technique is that when the Primary Domain is rebooted, the Secondary Domain may seem to pause until the Primary Domain has booted again. Similarly, when the Secondary Domain is rebooted, the Primary Domain's primary network may appear to freeze until the Secondary Domain comes back online.

10 Gigabit Networking

If an application setup needs to use 10 Gigabit interfaces, then line-rate performance can only be achieved if the application runs in an I/O domain which has direct access to one or more 10 Gigabit interfaces. If the application is moderately CPU intensive, then allocating 12 VCPUs to such a domain will be sufficient in most cases. The number of VCPUs should be adjusted after monitoring CPU usage with commands such as 'mpstat 1'.

Cryptographic Accelerators

The UltraSPARC-T2 processor is equipped with 8 Modular Arithmetic Units (MAUs) and 8 Stream Processing Units as cryptographic accelerators for use by cryptographic applications. If an application running in a domain needs to offload cryptographic operations to the accelerators, then MAU resources must be bound to the domain. The number of MAUs bound to a domain should not exceed the number of cores assigned to the domain as part of VCPU allocation; it can, however, be less than the number of cores assigned to that domain. Allocating and binding the MAUs alone causes both the MAUs and the corresponding Stream Processing Units to be bound to the target domain. After the domain boots, the normal procedures for accessing the cryptographic accelerators should be followed.

Floating Point Units

Each core on an UltraSPARC-T2 processor is equipped with a Floating Point Unit. Because Floating Point Units are not devices in the traditional sense, no special allocations are done as part of LDOMs configuration. When necessary, the strands executing floating point instructions get scheduled on the FPU associated with the core on which the strand is executing. This is true even if the core is shared between two different domains.

T5440 Split-PCI Procedure

Initially all PCI buses are assigned to the primary domain.
Determine which busses contain devices needed by the control domain
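One way to make this check concrete is to extract the root bus from a physical device path, such as the target of a /dev/dsk symbolic link or an entry in /etc/path_to_inst. The helper below is a minimal sketch; the function name `bus_of` and the sample paths are illustrative assumptions, not output captured from a real T5440.

```shell
#!/bin/sh
# Sketch: report the root PCI bus (first pci@... path component) of a device path.

# $1: a physical device path, or any line of text containing one
bus_of() {
    printf '%s\n' "$1" | awk -F/ '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^pci@/) { print $i; exit }
    }'
}

# Illustrative paths only; real paths depend on the installed hardware.
bus_of "../../devices/pci@400/pci@0/pci@1/scsi@0/sd@0,0:a"
bus_of "/pci@500/pci@0/pci@c/network@0"
```

On a live system the path would typically come from something like `ls -l` on the boot disk's /dev/dsk entry, or from the line for the network driver in /etc/path_to_inst.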
We can see the network is located on bus pci@500
We can see that the disk is located on bus pci@400. By referring to the topology diagram shown previously, this configuration allows us to remove either or both of the following buses:
Sun Blade T6340 Server Module

(Contributed by Sudhir Bhole)

Overview

The Sun Blade T6340 Server Module is an UltraSPARC T2-Plus based blade. It uses multiple UltraSPARC T2 Plus processors. As the system has multiple chips, it uses cache coherency protocols over connections between the chips to manage the sharing of data.

Bus Topology

The T6340 has two PCI-E buses, one for each of the two processors of the system. They can be split via Logical Domains functions, providing two discrete I/O systems that can each be dedicated to a domain.



