<font style="font-variant: small-caps" size=+1>warped</font>2: A Time Warp Simulation Kernel

My studies with Parallel Simulation are primarily with distributed optimistic synchronization using the Time Warp mechanism. The Time Warp mechanism structures a simulation model as a collection of concurrently executing discrete-event simulations that operate without explicit synchronization. That is, the main event processing functions of the simulation model are partitioned into Logical Processes (LP) that can execute events in parallel. The LPs process events and, when necessary, exchange timestamped event messages with other LPs. The LPs maintain their own local clock and execute events aggressively without explicit synchronization to the other LPs. Thus, should an LP receive an event with an event with a timestamp earlier than the LP's current simulation time, a rollback occurs to reprocess events in their proper causal order. My studies with parallel discrete-event simulation and the Time Warp mechanism are currently focused on execution on multi-core and many-core processors and clusters. As part of these studies we have developed an open source time warp simulation kernel called warped2. The kernel is written in C++ and is freely available. The warped2 kernel is highly configurable and provides multiple Time Warp optimizations for exploration. A few warped2 simulation models are also available in the warped-models github repository.

Optimizing warped2 for Multi-Core/Many-Core Processors

Initially, the software architecture of our parallel simulator was called warped and it contained implementations for many of the Time Warp sub-algorithms (lazy cancellation, etc). It was extensively optimized for efficient execution on clusters with single-core processors (available here). We conducted preliminary experiments with extensions for threaded execution. Unfortunately the base software architecture of that kernel was ill-suited to support threaded and process based parallelism and it was decided to re-architect and implement a completely new design solution.

The design, called warped2, integrates multi-threaded execution with process-based parallelism and is suitable for execution on a single SMP node or on a cluster with SMP nodes. Currently we are focusing on optimizations to the pending event set and the protection mechanisms used to access the shared data structures. A 2-level data structure for the pending event sets helps minimize sorting costs and enables the deployment of multiple strategies to manage synchronization and access costs. As a result of this 2-level pending event set structure, the worker (event processing) threads on a single node can generally follow the critical path of event execution to result in an optimistic execution of events with nearly zero rollback events.

The Pending Event Set

The pending event set in warped2 is deployed so that the lowest time-stamped events from every LP are placed into one or more event scheduling queues (called Least Time-Stamp First or LTSF queues) for execution. The initial solution had only one LTSF queue, but lock contention for the LTSF queue became a detriment to performance once the number of worker threads exceeded 5-7. The current design permits the instantiation of multiple LTSF queues and assigns unique subsets of the LPs to each LTSF queue. The system then binds a collection of one or more "worker" threads to a specific LTSF queue that process events only from that queue. This organization alleviate contention for the LTSF queue and provides a highly efficient schedule of event execution that will closely follow the critical path of event execution. As a result, when executing on a single multi-core SMP platform, the warped2 kernel will experience few if any rollback events.

The LTSF Queue: Implementation Data Structures

We are also examining the data structure used to implement the LTSF queue. Initially warped2 contain configurations to use either an STL multi-set or splay tree for the LTSF queue. The system has recently been extended to also support the use of a ladder queue (ladder queues are a variant of calendar queues that have self-balancing capabilities built into the sizing of the calendar months). Experiments show that the ladder queue data structure performs better than either of the other two data structures. While the ladder queue has the obvious advantage of reducing the number of elements (events) sorted (at any one time) for execution, we believe it also has a number of additional properties that make it attractive for further optimization. For example the bottom rung of the latter (containing the window of events containing the smallest timestamps) is normally maintained as a sorted (by timestamp) list. Given that the LTSF queue contains events from different LPs, we postulate that these events are likely to be mostly causally independent. Should this hypothesis be mostly true, it should be possible to implement a sort-free solution to access and manage the LTSF queues. This should allow us to maintains most of the performance benefits of a least time-stamp first scheduling policy without the sorting costs. Preliminary studies show this to be true and using an unsorted bottom does indeed provide speedup with little impact on rollback frequency.

The LTSF Queue: Concurrent Access and Contention

The standard implementation of the LTSF queue is with standard mutex locks (spin locks were also explored). Experiments with lock-free algorithms and transactional memory have also been performed. Unfortunately while the lock-free algorithms provided some nominal speedups, their implementation complexity is high. Likewise, some types of transactional memory also provide nominal speedups, but we now believe that better solutions lie with alternative structures/algorithms than either of these approaches. Therefore, for now I am abandoning further studies in these directions.

In general, most discrete-event simulation models exhibit fine-grained parallelism. Measurements from the DESMetics project have shown us that the average event execution time on modern processors is somewhere between 10ns and 10ms. As a result, lock contention for the pending event set can be a significant issue. In fact, our experiments using multiple LTSF queues have shown that the best general performance occurs when we have one LTSF queue per worker thread. This is acceptable whenever the LP workload is stable and partitioning can adequately balance the LP workload across the LTSF queues. Fortunately, for many of the simulation models we have studied, this has been true. However, I would still like to discover techniques to further reduce synchronization costs (locks are expensive). Toward this end, we have started looking at redesigning the LTSF queue so that it holds bags of events rather than a singleton event in each bucket of rungs. Ideally then, each worker thread will dequeue a bag containing multiple events for execution. While this might invite higher rollback frequencies, our previous observations on the relative causal independence of events in the bottom rung suggest that this approach may have merit.