Clusters with big.LITTLE Processors

Beowulf clusters are generally composed of nodes costing thousands of dollars each. These expensive systems deliver high performance, and they are almost always configured with standard operating systems and networking infrastructure for shared use by a community of users. Thus, any application-specific tuning or optimization must generally be confined to user space so that the system can be shared effectively by all users. While this configuration works well for coarse-grained scientific computing, the service delivered to finer-grained applications, or to applications better served by non-conventional O/S services, is generally lacking. Fortunately, the commercial market is now delivering inexpensive, small form factor hardware that enables the construction of low-cost Beowulf clusters that can be tuned for specific applications. Harnessing the power in such systems can, however, be challenging.

One interesting prospect for building low-cost Beowulf clusters is the emerging family of ARM solutions with big.LITTLE processors. Development boards built around these components can currently be purchased for as little as $100-$400 each, so a 64-node cluster, for example, could be assembled for only a few thousand dollars. Despite their low cost, these systems offer significant computing power. The current odroid-xu, for example, comes in a quad-quad configuration of big.LITTLE processors for just under $200. While these processors are currently configured so that only the big quad or the LITTLE quad can operate at a time, Samsung has announced plans to release a software patch to enable full quad-quad processing on these devices.

Experiments

We have initiated a series of experiments with these systems to customize the run-time O/S services for high-performance computing of fine-grained applications. In particular, we are targeting parallel discrete-event simulation using the Time Warp mechanism.

Network Latency

One of the more significant challenges to fine-grained parallel computing on clusters is network latency. These latencies (over Ethernet) are often in the tens to hundreds of microseconds, even for small messages. We have already shown that reducing this latency can have a dramatic impact on the run-time performance of Time Warp synchronized parallel simulations. In one study on an x86 cluster, we reduced latency by 60%, which yielded a 40% decrease in the run time of our Time Warp applications. Key to the latency reduction was replacing the TCP/IP protocol stack with an InfiniBand-over-Ethernet (IBoE) driver. We also configured the IBoE driver to poll the network card rather than being triggered by hardware interrupts. While polling is not generally acceptable in a shared cluster environment, it is a viable option in a dedicated low-cost cluster.
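
The study above modified a kernel-level driver, but the polling-versus-interrupt trade-off it exploited can be illustrated with a user-space sketch against the standard libibverbs completion-queue API. The function names below (reap_polling, reap_event_driven) are hypothetical, and the queue pair, completion queue, and completion channel are assumed to have been created elsewhere; this is a sketch of the two completion-handling modes, not the driver code from the study.

    /* Sketch (user-space analogue): two ways to reap receive completions
     * with the libibverbs completion-queue API.  The queue pair, the
     * completion queue `cq`, and the completion channel `channel` are
     * assumed to have been created elsewhere. */
    #include <infiniband/verbs.h>

    /* Polling mode: spin on the CQ; lowest latency, occupies one core. */
    static int reap_polling(struct ibv_cq *cq, struct ibv_wc *wc)
    {
        int n;
        do {
            n = ibv_poll_cq(cq, 1, wc);      /* 0 means nothing ready yet */
        } while (n == 0);
        return n;                            /* 1 on success, <0 on error */
    }

    /* Interrupt-driven mode: block until the adapter raises a completion
     * event, then drain the CQ. */
    static int reap_event_driven(struct ibv_cq *cq,
                                 struct ibv_comp_channel *channel,
                                 struct ibv_wc *wc)
    {
        struct ibv_cq *ev_cq;
        void *ev_ctx;

        if (ibv_req_notify_cq(cq, 0))                    /* arm notification */
            return -1;
        if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))  /* sleeps here */
            return -1;
        ibv_ack_cq_events(ev_cq, 1);
        if (ibv_req_notify_cq(ev_cq, 0))                 /* re-arm */
            return -1;
        return ibv_poll_cq(ev_cq, 1, wc);
    }

The polling variant avoids the interrupt and wakeup path entirely, at the cost of dedicating a core to the spin loop; in a dedicated low-cost cluster that cost is easier to justify than in a shared system.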

In this study, we are developing an IBoE network driver for the odroid-xu platform. Our experiments will explore configuring the driver in both polling and interrupt-driven modes of execution. Furthermore, we plan to explore binding the polling driver to a dedicated LITTLE core.
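
The driver itself is kernel code, but the intent of the binding experiment can be sketched in user space with the standard pthread affinity interface: keep the spin loop pinned to one LITTLE core and off the big cores entirely. The core index below is an assumption (and, until the quad-quad patch is available, only one cluster is visible to Linux at a time), so the actual big/LITTLE numbering must be confirmed under /sys/devices/system/cpu on the target board.

    /* Sketch: pin a network-polling thread to one (assumed) LITTLE core.
     * ASSUMED_LITTLE_CORE is hypothetical; verify the actual core
     * numbering on the target board before use. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    #define ASSUMED_LITTLE_CORE 0    /* assumption: one Cortex-A7 core */

    static void *poll_loop(void *arg)
    {
        (void)arg;
        /* ... spin on the NIC / completion queue here ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        cpu_set_t mask;
        int rc;

        CPU_ZERO(&mask);
        CPU_SET(ASSUMED_LITTLE_CORE, &mask);

        pthread_create(&tid, NULL, poll_loop, NULL);
        rc = pthread_setaffinity_np(tid, sizeof(mask), &mask);
        if (rc != 0)
            fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));

        pthread_join(tid, NULL);
        return 0;
    }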

Task Allocation across the big.LITTLE Processors

Several studies have developed fully custom operating systems for executing Time Warp based parallel simulations. While we will avoid designing and developing a fully custom O/S, we do anticipate customizing some key services of the Linux and Time Warp kernels for our needs. In particular, we are currently studying the customization of five main services, namely: (i) the aforementioned network services, (ii) Time Warp specific memory management and fossil collection, (iii) global time management and termination detection, (iv) pending event set management and scheduling, and (v) process management and the migration (load balancing) of LPs to remote nodes for processing. A critical element in these studies is the allocation of the various processing components to the big.LITTLE cores. Initially, we plan to bind these services to the LITTLE cores, leaving the big cores dedicated to event processing; other deployments will also be explored.
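
As a first illustration of this initial placement, the sketch below partitions threads by role with CPU affinity masks: Time Warp kernel services on the (assumed) LITTLE cluster and event processing on the (assumed) big cluster. The role names and the cluster-to-CPU-number mapping are assumptions made for illustration only; the real topology must be read from the running system.

    /* Sketch: an initial big.LITTLE placement for a quad-quad node.
     * Assumption: LITTLE cores appear as CPUs 0-3 and big cores as
     * CPUs 4-7 once both clusters are exposed to Linux; verify the
     * actual topology before relying on this mapping. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    enum role {
        ROLE_SERVICES,          /* network, GVT, fossil collection, event-set mgmt */
        ROLE_EVENT_PROCESSING   /* LP event execution */
    };

    /* Build the CPU mask for a role under the assumed cluster layout. */
    static void role_mask(enum role r, cpu_set_t *mask)
    {
        int first = (r == ROLE_SERVICES) ? 0 : 4;   /* assumed offsets */
        CPU_ZERO(mask);
        for (int cpu = first; cpu < first + 4; cpu++)
            CPU_SET(cpu, mask);
    }

    /* Pin the calling thread to the cores assigned to its role. */
    static int bind_self(enum role r)
    {
        cpu_set_t mask;
        int rc;

        role_mask(r, &mask);
        rc = pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
        if (rc != 0)
            fprintf(stderr, "bind_self: %s\n", strerror(rc));
        return rc;
    }

    int main(void)
    {
        /* e.g., the caller that will process events: */
        bind_self(ROLE_EVENT_PROCESSING);
        /* ... service threads would call bind_self(ROLE_SERVICES) ... */
        return 0;
    }

Under this scheme, each event-processing worker calls bind_self(ROLE_EVENT_PROCESSING) at startup and each kernel-service thread calls bind_self(ROLE_SERVICES); exploring alternative deployments then reduces to changing role_mask.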