I run my seminar sessions pretty wide open to my and the students' interests.
Ideally I like to see the students performing independent reading/studying to
help find papers and select topics under the general theme of the term's study.
There are no hard prereqs for this class except that you should be skilled in
the general computing field and ready for independent work/learning. If you
took this class last year and would like to take it again, we can arrange for
you to register for independent study to get suitable credit and appease the
university rule-makers.
This term is even more wide open than my usual offerings. In this case, I know
I want to think unconventionally about parallelism, but I am not entirely
certain where I expect to be at the end of this quarter's studies or what
particular papers are best suited to help me get there. I am really just hoping
to get us all working to develop the background and perspective that will help
us attack the problem of harnessing parallelism more effectively. I will
definitely need your help to get there. Please come to class prepared
with suggestions. Help me!! My mind is completely open for this class
offering.
Theme for the Term
In this session, I am planning to look at parallel and distributed systems with
a special emphasis on non-traditional parallelization techniques.
Last fall while preparing for a public lecture that I gave to the local
professional section of the IEEE, I had time to reflect on where we are going
with parallel and distributed computing. I decided that, in addition to my
conventional work with parallelism, I needed to break with my (and many
others') conventional approach to parallelism (take a sequential algorithm, or
programming solution, and parallelize it). This class is part of my quest to
broaden my thinking. Fortunately there are some interesting works that I think
might help us along the way. I hope you will join me in this quest.
Grading
The class will be organized as readings and discussions; no projects,
homework, or exams will be assigned. My expectation is that students
will read and explore this problem space on their own and bring
interesting papers to the class for review and discussion.
Possible Topics/Readings
These pages will change throughout the class as we decide
which papers to study. Check these pages regularly.
Introduction: A Call to the Arms of Parallelism
Processor clock speeds have hit the wall. Conventional air-cooled x86
processors topped out somewhere between 3 and 4 GHz, and processor designers
have focused on riding the wave of multi-core solutions.
K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer,
J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel,
and K. Yelick, "A View of the Parallel Computing Landscape,"
Communications of the ACM, 52, 10, 56-67, Oct 2009.
Hardware: Over the years, parallel hardware has remained
mostly unchanged. The basic categories are SIMD (now reborn as the
ever-popular GPGPU), tightly coupled MIMD (shared memory
multi-core/many-core), and loosely coupled MIMD (Beowulf clusters, IBM Blue
Gene machines). While these architectures are classic, the computation vs
communication cost models have changed substantially in recent years. More
on this later. We have also seen studies of massive parallelism (more than
1K processors/processing elements). In this class we will focus on
affordable parallelism. Thus, for the most part, I am concerned with GPGPU
and smaller-scale MIMD style processing (8-64 core single-chip systems and
16-128 node [possibly multi-core/node] clusters).
Applications: My focus is generally not
on embarrassingly parallel applications. That said, I can see
some benefit to considering unique approaches that might apply to
embarrassingly parallel applications. In particular, the cloning technique
described below might allow us to remove much of the redundant work
from an embarrassingly parallel application (simulation) and lead to
substantial speedups.
Amdahl's Law: see Wikipedia's
page. Be sure to examine the speedup graph and think about it. Also
take a hard look at some of the external links.
Here is a zip archive of the gnuplot
scripts that I used to show the Amdahl's Law related curves in class.
It's important that you carefully read these papers
with some skepticism. Everybody is selling their snake oil. As you read
any article, you should ask what the author is holding behind his back.
Often nothing, but sometimes: D. H. Bailey, "Misleading Performance
in the Supercomputing Field," Proceedings of 1992 ACM/IEEE
Conference on Supercomputing 155-158, 1992.
Message latencies are much larger than they appear.... As you consider
the next two bullet items, think about the fact that the average processor
has a clock rate above 2 GHz. If you assume a 2 GHz clock rate (this is
obviously low) and a CPI (cycles per instruction) of 2 (this is probably
high), then you will see an average of one instruction executed every
nanosecond (assuming an instruction mix strongly dominated by fast integer
operations). As you look at the message latency numbers, think about the
number of machine cycles that would be completed in the time it takes the
message to be sent.
Here are some OpenOffice/LibreOffice
spreadsheets/drawings that I shared with you in class. The most
interesting (to me) is the spreadsheet labeled messageLatencies.ods. The
data in there compares two different NIC cards, a cheap Realtek RTL-9169
and an expensive Intel 82541PI (e1000). The tabs at the bottom are
labeled according to the charts that I was putting together. The
measurement unit on the y-axis is microseconds and the unit on the x-axis
is message length.
The cost of locks and atomic instructions....not as small as you
might think....
The following table shows the measured runtime cost of the atomic
instruction fetch_and_add(index,1) vs the measured cost
of a non-atomic integer add (given in how many times slower):
Processor           Cost
------------------------
Intel i7 920         19x
AMD Phenom           24x
Intel Xeon E5410     39x
The following table shows the cost of several pthreads synchronization
primitives:
pthread function        Total Instructions   Atomic Instructions   Weighted Number of Instructions
                        Executed             Executed              Intel i7 920  AMD Phenom  Intel Xeon E5410
---------------------------------------------------------------------------------------------------------
pthread_mutex_lock             38                    5                  133          158         233
pthread_mutex_unlock            6                    1                   25           30          45
pthread_mutex_init             70                    0                   70           70          70
pthread_rwlock_wrlock          11                    4                   87          107         167
pthread_rwlock_rdlock          22                    4                   98          118         178
pthread_rwlock_unlock          24                    3                   81           96         141
pthread_spin_init               4                    0                    4            4           4
pthread_spin_lock               8                    1                   27           32          47
pthread_spin_unlock             3                    0                    3            3           3

(Total instruction counts were collected with callgrind; atomic instruction
counts were extracted by examining the generated code.)
Think about these numbers long enough and despair sets in; then you get
determined to fix it: "How do I change our method of attack?"....and now you
know the basis for my trying to start this class.
Parallel & Distributed Simulation Overview:
In fact, the Time Warp synchronization protocol that I have studied for 20
years is a fairly interesting starting point for the topic of this class. In
Time Warp, tasks are not strictly synchronized and processing can occur
out-of-order. In general, the processing is organized as a conventional
translation of components from a sequential simulation into parallel units;
however, the synchronization is somewhat non-traditional. The papers below
are very good foundation material for our leap into the first
non-traditional approaches to parallel simulation.
This first paper is a fundamental manuscript formalizing order and
dependencies in parallel computation. This paper is deceptively complex.
Generally you can read it multiple times and learn more each time. I
believe that this is a good place for us to begin our studies.
L. Lamport, "Time, Clocks, and the Ordering
of Events in a Distributed System," Communications of the
ACM, 21, 7, 558-565, July 1978.
A soup to nuts overview of parallel simulation. Mostly focused on
distributed synchronization and that suits our needs.
R. Fujimoto, "Parallel Discrete Event
Simulation," Communications of the ACM, 33, 10, 30-53, October
1990.
So Lamport does a good job of defining happens before and then
"causally dependent". At first this appears to define the correct
constraints from which we should implement our distributed programming
models. However, what actually happens is that we tend to impose
"artificial/false" happens before constraints on the problem (based
on our sequential model of synthesizing programming solutions).
Optimistic synchronization is one mechanism that begins to weaken this
overly strict viewpoint of causality. There are two main methods proposed
for optimistic synchronization: (i) the Time Warp mechanism, and
(ii) Space-Time.
The Time Warp mechanism uses a virtual time framework to record the local
simulation time in the distributed simulation. It was first introduced by
D. Jefferson, "Virtual
Time,"ACM Transactions on Programming Languages and Systems
(TOPLAS), Vol 7, Issue 3, 404-425, July 1985.
Weakening our perspective of causal order even further is the concept of
lazy reevaluation. Lazy reevaluation was originally developed and presented
by Darrin West as part of his master's thesis in 1988. The interesting point
of lazy reevaluation is that it allows one to violate causal order, roll
back and quickly repair the damage, and then jump forward again to the head
of the premature computation. Thus, in a sense, we are able to further relax
the happens before relation. However, it is remarkably difficult to know
when you can jump back, and thus most people do not follow up on this idea.
Oddly enough, I cannot find an online copy of Mr. West's thesis, but I have a
copy of the LaTeX source. Unfortunately, I could not process the old
formatting, so this is reformatted by me to approximate his thesis document.
For the limited application of digital logic simulation, we discovered an
efficient way to deploy lazy reevaluation. Our work is described in this
paper: A. C. Palaniswamy, S. Aji, and P. A. Wilsey, "An Efficient
Implementation of Lazy Reevaluation," Proc. of the 25th Annual
Simulation Symposium, 140-146, April 1992.
Space-time simulation is not widely studied or explored. I believe that
this mostly has to do with our inability to fully understand how we might go
about deploying a space-time simulation (although I am frequently wrong in
these matters). In any event, here are some citations on Space-time
simulation:
Returning to Time Warp, Maria Hybinette has developed simulation cloning:
M. Hybinette and R M. Fujimoto, "Cloning Parallel
Simulations," ACM Transactions on Modeling and Computer
Simulation, Vol 11, No 4, 378-407, October 2001.
M. Hybinette, "Just-In-Time
Cloning," 18th Workshop on Parallel and Distributed Simulation
(PADS-2004), May 2004.
A. Agarwal and M. Hybinette, "Merging
Parallel Simulation Programs," 19th Workshop on Parallel
and Distributed Simulation (PADS-2005), May 2005.
Transparent Parallelism
These studies follow classic modes of parallelism, but they bring to light an
interesting approach to injecting parallelism into an application to achieve
(small) speedups. The interesting aspect of these studies is that they inject
parallelism without requiring modifications to the original application
program. These projects are somewhat related to the notion of "program
futures" (google it, an idea that came up 20-30 years ago, but never really
caught on). The third manuscript outlines the broader idea (function
pre-computation) exploited in the first two manuscripts (the specific
application of function pre-computation for dynamic memory management).
So let's take a look at some alternate programming models. Classically,
people argue that functional languages will solve our parallelism problem.
Doubtful. While functional languages often have desirable properties for
easily uncovering parallelism (single, or no, assignment, stateless
functions, side-effect free operation, etc.), most of them retain the chief
problem of serial organization: the statements and solutions in the program
are presented in an organization derived from serial (imperative)
thinking/planning. That
said, there are some interesting programming models that we should look at
more closely. Feel free to suggest others.
A popular (text based) dataflow programming language is SISAL (see
citation below). However, many graphical languages were also designed. We
see graphical programming languages fairly regularly today but mostly in
tightly constrained application environments. LabView is a good example of
this. Here are some papers on dataflow programming languages:
J. T. Feo, D. C. Cann, and R. R. Oldehoeft, "A Report on the SISAL Language Project,"
Journal of Parallel and Distributed Computing, vol 10, no 4,
349-366, December 1990.
P. G. Whiting and R. S. V. Pascoe, "A
History of Data-Flow Languages." IEEE Annals of the History of
Computing, vol 16, no 4, 38-59, December 1994.
Systolic
Arrays: I will let you dig into this area on your own.
Prolog (prepared by
Patrick Putnam, edited by Wilsey)
My reason for introducing this language is not to have you think about
Prolog per se; rather, I encourage you to think more abstractly about the
processing model of Prolog. Go beyond the simple functional abstraction;
think instead of the idea that the rules read "the left term is true when
the right term is true". Not classic programming, but assertions of
relational truths. I share Prolog with you in hopes that it will help you
to think a bit more abstractly about computation; it certainly hit me when I
first encountered Prolog (many years ago). Other functional languages may
hit you this way as well; I chose Prolog only because it was
where I first understood this. --paw
Prolog falls into the paradigm of logic programming languages. So, what are
logic programming languages? Logic programming languages rely on expressing
computations in the form of mathematical logic formulas. What sets Prolog
apart from the rest is that it is a declarative programming
language. Computation is expressed as first-order logic formulas which
express rules or facts. For example, a statement (fact) like "Tom is a cat."
can be expressed in the following syntax: cat(tom) :- true.
Typically, the idea of a Prolog program is to establish all of the facts and
rules of the system, load them into a knowledge base, then query the system
to determine if your query is true (a logical conclusion from the knowledge
base), or false. Basically, the goal of a query is to determine whether
there is a satisfying assignment of set of variables such that the query
evaluates to true. So, assuming we had put "cat(tom) :- true." into our
knowledge base, by querying 'cat(X).' we can answer the question: is
there an assignment for X such that 'cat(X)' is a true statement.
A note on Prolog syntax: variables must always start with a capital
letter or an underscore. Lists are represented with square brackets around
them. Square brackets with nothing inside represent an empty list. A list
can be separated into head and tail portions by using a pipe '|'. The head
is as many elements as are comma separated (though it is strongly encouraged
to use as few comma-separated elements as possible), and the tail is the rest
of the list. So [X, Y | Xs] will separate the first two elements of a list
into X and Y respectively, and the remainder of the list will be in Xs.
I have attached implementations of two sorting algorithms
(quicksort and mergesort). I found the quicksort
implementation on the web, and the mergesort was something I wrote. Both of
these algorithms expect lists of arithmetically comparable elements
(numbers, characters, ...) as the two parameters. The functions follow the
typical execution patterns for the algorithms. Quicksort uses the head of
the list as the pivot point. After loading the algorithms into the knowledge
base we can query to determine (a) whether a list is the sorted form of
another (e.g., mergesort([1,3,4,2], [1,2,3,4]).), or (b) the sorted form
of a list (e.g., mergesort([1,3,4,2], X).).
A further example of an implementation of a Prolog program can be found in:
Martin Erwig, "Escape from Zurg: An Exercise in Logic
Programming," Journal of Functional Programming, Vol. 14,
No. 3, 253-261, 2004.
This paper is a description of how one professor went about teaching logic
programming to students who understood functional programming. It is
primarily an implementation level comparison of functional programming
(Haskell) and logic programming (Prolog), trying to show that a problem
(state space search) which is typically thought of as easier to solve
(express) in logic programming is just as easy in functional programming.
The use of Prolog has for the most part been in research,
specifically AI and Natural Language Processing. Research into
parallel Prolog has been limited. The ACE research group at New
Mexico State University implemented a system which took advantage of the
inherent parallelism available in Prolog.
Throughout my Googling I kept finding references to the functional
programming paradigm and languages like Haskell. It seems that there have
been more attempts to express parallelism using this paradigm than with
logic programming. I think this stems from the direct relationship to the
lambda calculus (buzzword bingo) and concepts like lazy evaluation (delaying
computation until it is absolutely necessary). This might be another avenue
to look down in the last weeks of the quarter.
Where are we? Discussions of Software Environments and Languages
So in the last two weeks of this class I would like to wrap up with a review
of the major parallel (i) software tools, programming languages, and
programming techniques, and (ii) hardware/architecture platforms that are
currently available for you to use as you move forward in your studies.
While these
will enable your studies, I hope that you will take advantage of our
discussions about non-traditional techniques to discover new and innovative
ways to apply parallelism to computation. I hope you enjoy your explorations
and if I can ever be of further assistance to you, please do not hesitate to
contact me.
The POSIX
Threads (or pthreads) API is the backbone of multi-core parallel
programming. Ubiquitous and easy to use; you should almost certainly
understand something about pthreads if you are in parallel computing.
Lock-free and wait-free algorithms. Wikipedia lumps this all under
the topic of Non-Blocking
Algorithms. Basically the goal is to create shared data structures
with lightweight access mechanisms. The locking calls of pthreads can be
expensive (hundreds of machine cycles) and there is a temptation to lock
more of the data structure than absolutely necessary. In contrast, atomic
operations will require only 20-40 machine cycles. However, the proper
construction of lock-free data structures can prove to be quite difficult.
Beyond using pthreads/MPI/PVM, contemporary programming languages
with actual parallelism built into them are quite limited. Several use
versions of threads (Java, Ada). Probably one of the more interesting
parallel languages in actual use is Erlang, which uses the
actor model of concurrency. However, I am not sure that the Erlang model
is well suited to generalized parallel computing (and I do not
appear to be alone in this perspective). Another active project is the
Cilk programming language.
Cilk is basically ANSI C with a few keywords thrown in for exposing
parallelism.
For GPGPU programming we have CUDA
developed by Nvidia for generalized programming of the Nvidia graphics
cards. Going one step beyond CUDA, the OpenCL (Open
Computing Language) framework is attempting to develop a programming
framework for building software solutions that make effective use of
heterogeneous hardware platforms. However, OpenCL really just targets
heterogeneity between CPUs and GPUs.
I've not got much optimism to report with regard to operating
systems research over the past 20 years. Much like x86's dominance of CPU
architecture, Unix/Windows dominate O/S services. There is some hope: we
now see accelerating developments in this area (possibly assisted by the
widespread availability of virtualization). Interesting developments are
happening in lightweight hypervisors, micro-kernels (L4), and
uber-lightweight O/Ses (Kitten, IBM's Blue Gene O/S). More recently, work
is also expanding to heterogeneous platforms (Barrelfish, although I
question whether we will actually see widespread use of truly heterogeneous
processing platforms). Potentially very interesting in this space is what
emerges from Intel's studies with many-core processors. I would also
watch Sandia's work with Palacios and Kitten. (If you are interested you
might pop over to my Spring 2010 offering of this class for some links to
papers.)
Design patterns for parallelism. Not sure I think this will be
useful or that we are really ready for this step, but it is nonetheless
being pursued.
Performance Analysis: Likewise, it is mostly a black art.
Important tools: for parallel programming, debugging
and analysis exist from Intel and IBM. If I have time later this summer,
I'll look them up and put pointers here.
Hardware Systems
We've been all over this space throughout the quarter. Today, for most of
the world, this space primarily breaks down to the x86 multi-core processors
(and Beowulf clusters) and the AMD/Nvidia GPGPUs. For those with access to
the hardware there are a few more exotic systems out there. One of the most
interesting options is the IBM Blue Gene machines (currently scaling up to
64K processors), but those just aren't available to regular people.
Sun (now Oracle) has an interesting lead in this space with its
UltraSparc T1, T2, and T3 products.
An x86 Many-Core Processor, the Intel SCC: Slides used by Karthik. You might also want to
look at last year's course webpages for additional links on the SCC platform.
Multi-core Beowulf (no links here, but doesn't seem necessary).
Really Big Iron: IBM's Blue Gene machines. You
will have to dig to find this paper (I have a copy, but IBM has asked me
not to post it).
A. Gara, M. A. Blumrich, D. Chen, G. L.-T. Chiu, P. Coteus,
M. E. Giampapa, R. A. Haring, P. Heidelberger, D. Hoenicke,
G. V. Kopcsay, T. A. Liebsch, M. Ohmacht, B. D. Steinmacher-Burow,
T. Takken, P. Vranas, "Overview of the Blue Gene/L System
Architecture," IBM J. Research and Dev, Vol 49, No 2/3, 195-212,
March/May 2005.