I run my seminar sessions pretty wide open to my and the students' interests.
Ideally I like to see the students performing independent reading/studying to
help find papers and select topics under the general theme of the term's study.
There are no hard prereqs for this class except that you should be skilled in
the general computing field and ready for independent work/learning. If you
took this class last year and would like to take it again, we can arrange for
you to register for independent study to get suitable credit and appease the
university rule-makers.
This term is even more wide open than my usual offerings. In this case, I know
I want to think unconventionally about parallelism, but I am not entirely
certain where I expect to be at the end of this quarter's studies or what
particular papers are best suited to help me get there. I am really just hoping
to get us all working to develop the background and perspective that will help
us attack the problem of harnessing parallelism more effectively. I will
definitely need your help to get there. Please come to class prepared
with suggestions. Help me!! My mind is completely open for this class
offering.
Theme for the Term
In this session, I am planning to look at parallel and distributed systems with
a special emphasis on non-traditional parallelization techniques.
Last fall while preparing for a public lecture that I gave to the local
professional section of the IEEE, I had time to reflect on where we are going
with parallel and distributed computing. I decided that, in addition to my
conventional work with parallelism, I needed to break with my (and many
others') conventional approach to parallelism (take a sequential algorithm, or
programming solution, and parallelize it). This class is part of my quest to
broaden my thinking. Fortunately there are some interesting works that I think
might help us along the way. I hope you will join me in this quest.
Grading
The class will be organized as readings and discussions; no projects,
homework, or exams will be assigned. My expectation is that students
will read and explore this problem space on their own and bring
interesting papers to the class for review and discussion.
Possible Topics/Readings
These pages will change throughout the class as we decide
which papers to study. Check these pages regularly.
Introduction: A Call to the Arms of Parallelism
Processor clock speeds have hit the wall. Conventional air-cooled x86
processors topped out somewhere between 3 and 4 GHz, and processor designers
have focused on riding the wave of multi-core solutions.
K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer,
J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel,
and K. Yelick, "A View of the Parallel Computing Landscape,"
Communications of the ACM, 52, 10, 56-67, Oct 2009.
Hardware: Over the years, parallel hardware has remained
mostly unchanged. The basic categories are SIMD (now reborn as the
ever-popular GPGPU), tightly coupled MIMD (shared memory
multi-core/many-core), and loosely coupled MIMD (Beowulf clusters, IBM Blue
Gene machines). While these architectures are classic, the computation vs
communication cost models have changed substantially in recent years. More
on this later. We have also seen studies of massive parallelism (more than
1K processors/processing elements). In this class we will focus on
affordable parallelism. Thus, for the most part, I am concerned with GPGPU
and smaller-scale MIMD style processing (8-64 core single-chip systems and
16-128 node [possibly multi-core/node] clusters).
Applications: My focus is generally not
on embarrassingly parallel applications. That said, I can see
some benefit to considering unique approaches that might apply to
embarrassingly parallel applications. In particular, the cloning technique
described below might allow us to remove much of the redundant work
from an embarrassingly parallel application (simulation) and lead to
substantial speedups.
Amdahl's Law: see Wikipedia's
page. Be sure to examine the speedup graph and think about it. Also
take a hard look at some of the external links.
Here is a zip archive of the gnuplot
scripts that I used to show the Amdahl's Law related curves in class.
It's important that you carefully read these papers
with some skepticism. Everybody is selling their snake oil. As you read
any article, you should ask what the author is holding behind his back.
Often nothing, but sometimes: D. H. Bailey, "Misleading Performance
in the Supercomputing Field," Proceedings of 1992 ACM/IEEE
Conference on Supercomputing 155-158, 1992.
Message latencies are much larger than they appear.... As you consider
the next two bullet items, think about the fact that the average processor
has a clock rate above 2 GHz. If you assume a 2 GHz clock rate (this is
obviously low) and a CPI (cycles per instruction) of 2 (this is probably
high), then you will see an average of one instruction executed every
nanosecond (assuming an instruction mix strongly dominated by fast integer
operations). As you look at the message latency numbers, think about the
number of machine cycles that would be completed in the time it takes the
message to be sent.
Here are some OpenOffice/LibreOffice
spreadsheets/drawings that I shared with you in class. The most
interesting (to me) is the spreadsheet labeled messageLatencies.ods. The
data in there compares two different NIC cards, a cheap Realtek RTL-9169
and an expensive Intel 82541PI (e1000). The tabs at the bottom are
labeled according to the charts that I was putting together. The
measurement unit on the y-axis is microseconds and the unit on the x-axis
is message length.
The cost of locks and atomic instructions....not as small as you
might think....
The following table shows the measured runtime cost of the atomic
instruction fetch_and_add(index,1) vs the measured cost
of a non-atomic integer add (given in how many times slower):
Processor           Cost
------------------------
Intel i7 920         19x
AMD Phenom           24x
Intel Xeon E5410     39x
The following table shows the cost of several pthreads synchronization
primitives:
pthread function        Total Instructions   Atomic Instructions   Weighted Number of Instructions
                        Executed             Executed              Intel i7 920  AMD Phenom  Intel Xeon E5410
---------------------------------------------------------------------------------------------------------
pthread_mutex_lock             38                    5                  133          158         233
pthread_mutex_unlock            6                    1                   25           30          45
pthread_mutex_init             70                    0                   70           70          70
pthread_rwlock_wrlock          11                    4                   87          107         167
pthread_rwlock_rdlock          22                    4                   98          118         178
pthread_rwlock_unlock          24                    3                   81           96         141
pthread_spin_init               4                    0                    4            4           4
pthread_spin_lock               8                    1                   27           32          47
pthread_spin_unlock             3                    0                    3            3           3

(Total instruction counts were collected with callgrind; atomic instruction
counts were extracted by examining the generated code.)
Think about these numbers long enough and despair sets in; then you get
determined to fix it: "How do I change our method of attack?"....and now you
know the basis for my trying to start this class.
Parallel & Distributed Simulation Overview:
In fact, the Time Warp synchronization protocol that I have studied for 20
years is a fairly interesting starting point for the topic of this class. In
Time Warp, tasks are not strictly synchronized and processing can occur
out-of-order. In general, the processing is organized as a conventional
translation of components from a sequential simulation into parallel units;
however, the synchronization is somewhat non-traditional. The papers below
are very good foundation material for our leap into the first
non-traditional approaches to parallel simulation.
This first paper is a fundamental manuscript formalizing order and
dependencies in parallel computation. This paper is deceptively complex.
Generally you can read it multiple times and learn more each time. I
believe that this is a good place for us to begin our studies.
L. Lamport, "Time, Clocks, and the Ordering
of Events in a Distributed System," Communications of the
ACM, 21, 7, 558-565, July 1978.
A soup to nuts overview of parallel simulation. Mostly focused on
distributed synchronization and that suits our needs.
R. Fujimoto, "Parallel Discrete Event
Simulation," Communications of the ACM, 33, 10, 30-53, October
1990.
So Lamport does a good job of defining happens before and then
"causally dependent". At first this appears to define the correct
constraints from which we should implement our distributed programming
models. However, what actually happens is that we tend to impose
"artificial/false" happens before constraints on the problem (based
on our sequential model of synthesizing programming solutions).
Optimistic synchronization is one mechanism that begins to weaken this
overly strict viewpoint of causality. There are two main methods proposed
for optimistic synchronization: (i) the Time Warp mechanism, and
(ii) Space-Time.
The Time Warp mechanism uses a virtual time framework to record the local
simulation time in the distributed simulation. It was first introduced by
D. Jefferson, "Virtual
Time,"ACM Transactions on Programming Languages and Systems
(TOPLAS), Vol 7, Issue 3, 404-425, July 1985.
Weakening our perspective of causal order even further is the concept of
lazy reevaluation. Lazy reevaluation was originally developed and presented
by Darrin West as part of his master's thesis in 1988. The interesting point
of lazy reevaluation is that it allows one to violate causal order, roll
back and quickly repair the damage, and then jump forward again to the head
of the premature computation. Thus, in a sense, we are able to further relax
the happens before relation. However, it is remarkably difficult to know
when you can jump back, and thus most people do not follow up on this idea.
Oddly enough, I cannot find an online copy of Mr. West's thesis, but I have a
copy of the LaTeX source. Unfortunately, I could not process the old
formatting, so this is reformatted by me to approximate his thesis document.
For the limited application of digital logic simulation, we discovered an
efficient way to deploy lazy reevaluation. Our work is described in this
paper: A. C. Palaniswamy, S. Aji, and P. A. Wilsey, "An Efficient
Implementation of Lazy Reevaluation," Proc. of the 25th Annual
Simulation Symposium, 140-146, April 1992.
Space-time simulation is not widely studied or explored. I believe that
this mostly has to do with our inability to fully understand how we might go
about deploying a space-time simulation (although I am frequently wrong in
these matters). In any event, here are some citations on Space-time
simulation:
Returning to Time Warp, Maria Hybinette has developed simulation cloning:
M. Hybinette and R M. Fujimoto, "Cloning Parallel
Simulations," ACM Transactions on Modeling and Computer
Simulation, Vol 11, No 4, 378-407, October 2001.
M. Hybinette, "Just-In-Time
Cloning," 18th Workshop on Parallel and Distributed Simulation
(PADS-2004), May 2004.
A. Agarwal and M. Hybinette, "Merging
Parallel Simulation Programs," 19th Workshop on Parallel
and Distributed Simulation (PADS-2005), May 2005.
Transparent Parallelism
These studies follow classic modes of parallelism, but they bring to light an
interesting approach to injecting parallelism into an application to achieve
(small) speedups. The interesting aspect of these studies is that they inject
parallelism without requiring modifications to the original application
program. These projects are somewhat related to the notion of "program
futures" (google it, an idea that came up 20-30 years ago, but never really
caught on). The third manuscript outlines the broader idea (function
pre-computation) exploited in the first two manuscripts (the specific
application of function pre-computation for dynamic memory management).
So let's take a look at some alternate programming models. Classically,
people argue that functional languages will solve our parallelism problem.
Doubtful. While functional languages often have desirable properties for
easily uncovering parallelism (single, or no, assignment, stateless
functions, side-effect free operation, etc.), most of them retain the chief
problem of serial organization: the statements and solutions in the program
are presented in an organization derived from serial (imperative)
thinking/planning. That
said, there are some interesting programming models that we should look at
more closely. Feel free to suggest others.
A popular (text based) dataflow programming language is SISAL (see
citation below). However, many graphical languages were also designed. We
see graphical programming languages fairly regularly today but mostly in
tightly constrained application environments. LabView is a good example of
this. Here are some papers on dataflow programming languages:
J. T. Feo, D. C. Cann, and R. R. Oldehoeft, "A Report on the SISAL Language Project,"
Journal of Parallel and Distributed Computing, vol 10, no 4,
349-366, December 1990.
P. G. Whiting and R. S. V. Pascoe, "A
History of Data-Flow Languages." IEEE Annals of the History of
Computing, vol 16, no 4, 38-59, December 1994.
Systolic
Arrays: I will let you dig into this area on your own.
Prolog (prepared by
Patrick Putnam, edited by Wilsey)
My reason for introducing this language is not to have you think about
Prolog per se; rather, I encourage you to think more abstractly about the
processing model of Prolog. Go beyond the simple functional abstraction;
think instead of the idea that the rules read "the left term is true when
the right term is true". Not classic programming, but assertions of
relational truths. I share Prolog with you in hopes that it will help you
to think a bit more abstractly about computation; it certainly hit me when I
first encountered Prolog (many years ago). Other functional languages may
hit you this way as well; I chose Prolog only because it was
where I first understood this. --paw
Prolog falls into the paradigm of logic programming languages. So, what are
logic programming languages? Logic programming languages rely on expressing
computations in the form of mathematical logic formulas. What sets Prolog
apart from the rest is that it is a declarative programming
language. Computation is expressed as first-order logic formulas which
express rules or facts. For example, a statement (fact) like "Tom is a cat."
can be expressed in the following syntax: cat(tom) :- true.
Typically, the idea of a Prolog program is to establish all of the facts and
rules of the system, load them into a knowledge base, then query the system
to determine if your query is true (a logical conclusion from the knowledge
base), or false. Basically, the goal of a query is to determine whether
there is a satisfying assignment of set of variables such that the query
evaluates to true. So, assuming we had put "cat(tom) :- true." into our
knowledge base, by querying 'cat(X).' we can answer the question: is
there an assignment for X such that 'cat(X)' is a true statement.
A note on Prolog syntax: variables must always start with a capital
letter or an underscore. Lists are represented with square brackets around
them. Square brackets with nothing inside represent an empty list. A list
can be separated into head and tail portions by using a pipe '|'. The head
is as many elements as are comma separated (though it is strongly encouraged
to use as few comma-separated elements as possible), and the tail is the rest
of the list. So [X, Y | Xs] will separate the first two elements of a list
into X and Y respectively, and the remainder of the list will be in Xs.
I have attached implementations of two sorting algorithms
(quicksort and mergesort). I found the quicksort
implementation on the web, and the mergesort was something I wrote. Both of
these algorithms expect lists of arithmetically comparable elements
(numbers, characters, ...) as the two parameters. The functions follow the
typical execution patterns for the algorithms. Quicksort uses the head of
the list as the pivot point. After loading the algorithms into the knowledge
base we can query to determine (a) whether a list is the sorted form of
another (e.g., mergesort([1,3,4,2], [1,2,3,4]).), or (b) the sorted form
of a list (e.g., mergesort([1,3,4,2], X).).
A further example of an implementation of a Prolog program can be found in:
Martin Erwig, "Escape from Zurg: An Exercise in Logic
Programming," Journal of Functional Programming, Vol. 14,
No. 3, 253-261, 2004.
This paper is a description of how one professor went about teaching logic
programming to students who understood functional programming. It is
primarily an implementation level comparison of functional programming
(Haskell) and logic programming (Prolog), trying to show that a problem
(state space search) which is typically thought of as easier to solve
(express) in logic programming is just as easy in functional programming.
The use of Prolog has for the most part been in research,
specifically AI and Natural Language Processing. Research into
parallel Prolog has been limited. The ACE research group at New
Mexico State University implemented a system which took advantage of the
inherent parallelism available in Prolog.
Throughout my Googling I kept finding references to the functional
programming paradigm and languages like Haskell. It seems that there have
been more attempts to express parallelism using this paradigm than with
logic programming. I think this stems from the direct relationship to the
lambda calculus (buzzword bingo) and concepts like lazy evaluation (delaying
computation until it is absolutely necessary). This might be another avenue
to look down in the last weeks of the quarter.
Where are we? Discussions of Software Environments and Languages
So in the last two weeks of this class I would like to wrap up with a review
of the major parallel (i) software tools, programming languages, and
programming techniques, and (ii) hardware/architecture platforms that are
currently available for you to use as you move forward in your studies.
While these
will enable your studies, I hope that you will take advantage of our
discussions about non-traditional techniques to discover new and innovative
ways to apply parallelism to computation. I hope you enjoy your explorations
and if I can ever be of further assistance to you, please do not hesitate to
contact me.
The POSIX
Threads (or pthreads) API is the backbone of multi-core parallel
programming. Ubiquitous and easy to use; you should almost certainly
understand something about pthreads if you are in parallel computing.
Lock-free and wait-free algorithms. Wikipedia lumps this all under
the topic of Non-Blocking
Algorithms. Basically the goal is to create shared data structures
with lightweight access mechanisms. The locking calls of pthreads can be
expensive (hundreds of machine cycles) and there is a temptation to lock
more of the data structure than absolutely necessary. In contrast, atomic
operations will require only 20-40 machine cycles. However, the proper
construction of lock-free data structures can prove to be quite difficult.
Beyond using pthreads/MPI/PVM, contemporary programming languages
with actual parallelism built into them are quite limited. Several use
versions of threads (Java, Ada). Probably one of the more interesting
parallel languages in actual use is Erlang, which uses the
actor model of concurrency. However, I am not sure that the Erlang model
is well suited to generalized parallel computing (and I do not
appear to be alone in this perspective). Another active project is the
Cilk programming language.
Cilk is basically ANSI C with a few keywords thrown in for exposing
parallelism.
For GPGPU programming we have CUDA
developed by Nvidia for generalized programming of the Nvidia graphics
cards. Going one step beyond CUDA, the OpenCL (Open
Computing Language) framework is attempting to develop a programming
framework for building software solutions that make effective use of
heterogeneous hardware platforms. However, OpenCL really just targets
heterogeneity between CPUs and GPUs.
I've not got much optimism to report with regard to operating
systems research over the past 20 years. Much like x86's dominance of CPU
architecture, Unix/Windows dominate O/S services. There is some hope: we
now see accelerating developments in this area (possibly assisted by the
widespread availability of virtualization). Interesting developments are
happening in lightweight hypervisors, micro-kernels (L4), and
uber-lightweight O/Ses (Kitten, IBM's Blue Gene O/S). More recently, work
is also expanding to heterogeneous platforms (Barrelfish, although I
question whether we will actually see widespread use of truly heterogeneous
processing platforms). Potentially very interesting in this space is what
emerges from Intel's studies with many-core processors. I would also
watch Sandia's work with Palacios and Kitten. (If you are interested you
might pop over to my Spring 2010 offering of this class for some links to
papers.)
Design patterns for parallelism. Not sure I think this will be
useful or that we are really ready for this step, but it is nonetheless
being pursued.
Performance Analysis: Likewise, it is mostly a black art.
Important tools: for parallel programming, debugging
and analysis exist from Intel and IBM. If I have time later this summer,
I'll look them up and put pointers here.
Hardware Systems
We've been all over this space throughout the quarter. Today, for most of
the world, this space primarily breaks down to the x86 multi-core processors
(and Beowulf clusters) and the AMD/Nvidia GPGPUs. For those with access to
the hardware there are a few more exotic systems out there. One of the most
interesting options is the IBM Blue Gene machines (currently scaling up to
64K processors), but those just aren't available to regular people.
Sun (now Oracle) has an interesting lead in this space with its
UltraSparc T1, T2, and T3 products.
An x86 Many-Core Processor, the Intel SCC: Slides used by Karthik. You might also want to
look at last year's course webpages for additional links on the SCC platform.
Multi-core Beowulf (no links here, but doesn't seem necessary).
Really Big Iron: IBM's Blue Gene machines. You
will have to dig to find this paper (I have a copy, but IBM has asked me
not to post it).
A. Gara, M. A. Blumrich, D. Chen, G. L.-T. Chiu, P. Coteus,
M. E. Giampapa, R. A. Haring, P. Heidelberger, D. Hoenicke,
G. V. Kopcsay, T. A. Liebsch, M. Ohmacht, B. D. Steinmacher-Burow,
T. Takken, P. Vranas, "Overview of the Blue Gene/L System
Architecture," IBM J. Research and Dev, Vol 49, No 2/3, 195-212,
March/May 2005.