Programming Languages & Software Engineering


A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 2: Analysis of Fork-Join Parallel Programs
Steve Wolfman, based on work by Dan Grossman

LICENSE: This file is licensed under a Creative Commons Attribution 3.0 Unported License; see http://creativecommons.org/licenses/by/3.0/. The materials were developed by Steve Wolfman, Alan Hu, and Dan Grossman.

Learning Goals
Define work, the time it would take one processor to complete a parallelizable computation; span, the time it would take an infinite number of processors to complete the same computation; and Amdahl's Law, which relates the speedup in a program to the proportion of the program that is parallelizable.
Use work, span, and Amdahl's Law to analyse the speedup available for a particular approach to parallelizing a computation.
Judge appropriate contexts for and apply the parallel map, parallel reduce, and parallel prefix computation patterns.
Sophomoric Parallelism and Concurrency, Lecture 2 2

Outline
Done:
How to use fork and join to write a parallel algorithm
Why using divide-and-conquer with lots of small tasks is best: combines results in parallel
Some C++11 and OpenMP specifics
More pragmatics (e.g., installation) in separate notes
Now:
More examples of simple parallel programs
Other data structures that support parallelism (or not)
Asymptotic analysis for fork-join parallelism
Amdahl's Law
Sophomoric Parallelism and Concurrency, Lecture 2 3

Exponential speed-up using Divide-and-Conquer
Counting matches (lecture) and summing (reading) went from O(n) sequential to O(log n) parallel (assuming lots of processors!)
An exponential speed-up in theory (in what sense?)

(Diagram: a binary combining tree, with a + at each combining node, merging array elements pairwise in parallel.)
Many other operations can also use this structure.
Sophomoric Parallelism and Concurrency, Lecture 2 4

Other Operations?
(Diagram: the same combining tree, with + marks at each combining node.)
What's an example of something else we can put at the + marks?
Sophomoric Parallelism and Concurrency, Lecture 2 5

What else looks like this?
(Diagram: the same combining tree.)
What's an example of something we cannot put there (and have it work the same as a for loop would)?
Sophomoric Parallelism and Concurrency, Lecture 2 6

Reduction: a single answer aggregated from a list
(Diagram: the same combining tree.)
What are the basic requirements for the reduction operator?
Note: The single answer can be a list or other collection.
Sophomoric Parallelism and Concurrency, Lecture 2 7
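(Not from the original slides.) One way to see the key requirement: the combining tree groups partial results differently than a left-to-right loop would, so the operator should at least be associative. A minimal C++ sketch, using subtraction as a counterexample:

#include <iostream>

int main() {
    int a[4] = {8, 4, 2, 1};
    // Sequential left-to-right grouping: ((8 - 4) - 2) - 1 = 1
    int sequential = ((a[0] - a[1]) - a[2]) - a[3];
    // Tree grouping, the way a parallel reduction combines: (8 - 4) - (2 - 1) = 3
    int tree = (a[0] - a[1]) - (a[2] - a[3]);
    // With an associative operator like +, both groupings agree;
    // with -, they differ, so subtraction is not a safe reduction operator.
    std::cout << "sequential: " << sequential << ", tree: " << tree << std::endl;
    return 0;
}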

Is Counting Matches Really a Reduction?
Count matches:
FORALL array elements:
    score = (if element == target then 1 else 0)
    total_score += score
Is this really a reduction?
Sophomoric Parallelism and Concurrency, Lecture 2 8

Even easier: Maps (Data Parallelism)
A map operates on each element of a collection independently to create a new collection of the same size
No combining results
For arrays, this is so trivial some hardware has direct support
One we already did: counting matches becomes mapping each element to 1 if it matches, else 0, and then reducing with +

void equals_map(int result[], int array[], int len, int target) {
  FORALL(i=0; i < len; i++) {
    result[i] = (array[i] == target) ? 1 : 0;
  }
}
Sophomoric Parallelism and Concurrency, Lecture 2 9

Another Map Example: Vector Addition
Note: if <1, 2, 3, 4, 5> and <2, 5, 3, 3, 1> are vectors, then their vector sum is the sum of corresponding elements: <3, 7, 6, 7, 6>.

void vector_add(int result[], int arr1[], int arr2[], int len) {
  FORALL(i=0; i < len; i++) {
    result[i] = arr1[i] + arr2[i];
  }
}
Sophomoric Parallelism and Concurrency, Lecture 2 10

Maps in OpenMP (w/ explicit Divide & Conquer)

void vector_add(int result[], int arr1[], int arr2[], int lo, int hi) {
  const int SEQUENTIAL_CUTOFF = 1000;
  if (hi - lo <= SEQUENTIAL_CUTOFF) {
    for (int i = lo; i < hi; i++)
      result[i] = arr1[i] + arr2[i];
    return;
  }

#pragma omp task untied
  vector_add(result, arr1, arr2, lo, lo + (hi-lo)/2);
  vector_add(result, arr1, arr2, lo + (hi-lo)/2, hi);
#pragma omp taskwait
}

Even though there is no result-combining, it still helps with load balancing to create many small tasks
Maybe not for vector-add, but for more compute-intensive maps
The forking is O(log n), whereas theoretically other approaches to vector-add are O(1)
Sophomoric Parallelism and Concurrency, Lecture 2 11

Maps and reductions
These are by far the two most important and common patterns.
Learn to recognize when an algorithm can be written in terms of maps and reductions! They make parallel programming simple.
Sophomoric Parallelism and Concurrency, Lecture 2 12
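(Not from the slides.) Counting matches as a map followed by a reduce can be sketched in the same explicit divide-and-conquer OpenMP style as vector_add above. The helper name sum_reduce is made up for illustration, and as with the other examples the top-level call is assumed to run inside an omp parallel / omp single region:

int sum_reduce(int arr[], int lo, int hi) {
  const int SEQUENTIAL_CUTOFF = 1000;
  if (hi - lo <= SEQUENTIAL_CUTOFF) {
    int sum = 0;
    for (int i = lo; i < hi; i++)
      sum += arr[i];
    return sum;
  }
  int left, right;
#pragma omp task untied shared(left)
  left = sum_reduce(arr, lo, lo + (hi-lo)/2);
  right = sum_reduce(arr, lo + (hi-lo)/2, hi);
#pragma omp taskwait
  return left + right;
}

// Count matches = map, then reduce:
//   equals_map(result, array, len, target);
//   int matches = sum_reduce(result, 0, len);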

Digression: MapReduce on clusters
You may have heard of Google's map/reduce or the open-source version Hadoop
Idea: Perform maps/reduces on data using many machines
The system distributes the data and manages fault tolerance
Your code just operates on one element (map) or combines two elements (reduce)
An old functional programming idea applied to big data / distributed computing
Sophomoric Parallelism and Concurrency, Lecture 2 13

Exercise: find the ten largest numbers
Given an array of positive integers, return the ten largest in the list.
How is this a map and/or reduce?
Sophomoric Parallelism and Concurrency, Lecture 2 14

Exercise: count prime numbers
Given an array of positive integers, count the number of prime numbers.
How is this a map and/or reduce?
Sophomoric Parallelism and Concurrency, Lecture 2 15

Exercise: find first substring match
Given a (small) substring and a (large) text, find the index where the first occurrence of the substring starts in the text.
How is this a map and/or reduce?
Sophomoric Parallelism and Concurrency, Lecture 2 16
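(A possible answer sketch, not from the slides.) The count-primes exercise fits the pattern directly: map each element to 1 or 0 with a hypothetical is_prime helper, then reduce with +. Here it is written with OpenMP's built-in parallel for and reduction clause, which the lecture mentions again later:

// Hypothetical helper: trial-division primality test.
bool is_prime(int n) {
  if (n < 2) return false;
  for (int d = 2; d * d <= n; d++)
    if (n % d == 0) return false;
  return true;
}

// Map each element to 0/1, then sum-reduce; OpenMP combines the per-thread counts.
int count_primes(int array[], int len) {
  int count = 0;
#pragma omp parallel for reduction(+:count)
  for (int i = 0; i < len; i++)
    count += is_prime(array[i]) ? 1 : 0;
  return count;
}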

Outline
Done:
How to use fork and join to write a parallel algorithm
Why using divide-and-conquer with lots of small tasks is best: combines results in parallel
Some C++11 and OpenMP specifics
More pragmatics (e.g., installation) in separate notes
Now:
More examples of simple parallel programs
Other data structures that support parallelism (or not)
Asymptotic analysis for fork-join parallelism
Amdahl's Law
Sophomoric Parallelism and Concurrency, Lecture 2 17

On What Other Structures Can We Use Divide-and-Conquer Map/Reduce?
A linked list?
A binary tree? Any? Heap? Binary search tree? AVL? B+?
A hash table?
Sophomoric Parallelism and Concurrency, Lecture 2 18
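(A hedged sketch, not from the slides.) A binary tree supports divide-and-conquer map/reduce naturally because the two subtrees are independent subproblems; the Node struct below is made up for illustration. The span is proportional to the tree's height, so this only pays off when the tree is reasonably balanced, and a linked list offers no comparable split:

struct Node {
  int value;
  Node* left;
  Node* right;
};

// Sum all values in the tree; each subtree is an independent task.
// (In practice, add a sequential cutoff as in the slides' other examples.)
int tree_sum(Node* t) {
  if (t == nullptr) return 0;
  int left_sum, right_sum;
#pragma omp task untied shared(left_sum)
  left_sum = tree_sum(t->left);
  right_sum = tree_sum(t->right);
#pragma omp taskwait
  return t->value + left_sum + right_sum;
}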

Outline
Done:
How to use fork and join to write a parallel algorithm
Why using divide-and-conquer with lots of small tasks is best: combines results in parallel
Some C++11 and OpenMP specifics
More pragmatics (e.g., installation) in separate notes
Now:
More examples of simple parallel programs
Other data structures that support parallelism (or not)
Asymptotic analysis for fork-join parallelism
Amdahl's Law
Sophomoric Parallelism and Concurrency, Lecture 2 19

Analyzing Parallel Algorithms
We'll set aside analyzing for correctness for now. (Maps are obvious? Reductions are correct if the operator is associative?)
How do we analyze the efficiency of our parallel algorithms?
We want asymptotic bounds

We want our bound to incorporate the number of processors (so we know how our algorithm will improve as chips do!)
But how hard is it to reason about the number of processors?
Sophomoric Parallelism and Concurrency, Lecture 2 20

Digression, Getting Dressed
(Diagram: a dependency graph of getting-dressed tasks: pants, shirt, socks, coat, under roos, shoes, belt, watch.)
Here's a graph representation for parallelism.
Nodes: (small) tasks that are potentially executable in parallel
Edges: dependencies (the target of the arrow depends on its source)
(Note: costs are on nodes, not edges.)
Sophomoric Parallelism and Concurrency, Lecture 2 21

Digression, Getting Dressed (1)
(Diagram: the same getting-dressed dependency graph.)
Assume it takes me 5 seconds to put on each item, and I cannot put on more than one item at a time: How long does it take me to get dressed?
Sophomoric Parallelism and Concurrency, Lecture 2 22

Digression, Getting Dressed (∞)
(Diagram: the same getting-dressed dependency graph.)
Assume it takes my robotic wardrobe 5 seconds to put me into each item, and it can put on up to 20 items at a time. How long does it take me to get dressed?
Sophomoric Parallelism and Concurrency, Lecture 2 23

Digression, Getting Dressed (2)
(Diagram: the same getting-dressed dependency graph.)
Assume it takes me 5 seconds to put on each item, and I can use my two hands to put on 2 items at a time. How long does it take me to get dressed?
Sophomoric Parallelism and Concurrency, Lecture 2 24

Un-Digression, Getting Dressed: Work, AKA T1
(Diagram: the same getting-dressed dependency graph.)
What mattered when I could put on only one item at a time? How do we count it?
Sophomoric Parallelism and Concurrency, Lecture 2 25

Un-Digression, Getting Dressed: Span, AKA T∞
(Diagram: the same getting-dressed dependency graph.)
What mattered when I could put on an infinite number of items at a time? How do we count it?
Sophomoric Parallelism and Concurrency, Lecture 2 26

Un-Digression, Getting Dressed: Performance for P processors, AKA TP
(Diagram: the same getting-dressed dependency graph.)
What mattered when I could put on 2 items at a time? Was it as easy as work or span to calculate?
Sophomoric Parallelism and Concurrency, Lecture 2 27
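(Not on the slides.) The two quantities just introduced can be made concrete with a small sketch: given node costs and dependency edges, work is the total cost of all nodes and span is the cost of the most expensive dependency chain. The graph encoding below is made up for illustration and assumes the nodes are already numbered in topological order:

#include <iostream>
#include <vector>
#include <algorithm>

// cost[v] is node v's cost; succ[v] lists the nodes that depend on v (edges v -> w).
void work_and_span(const std::vector<long long>& cost,
                   const std::vector<std::vector<int>>& succ,
                   long long& work, long long& span) {
  int n = (int)cost.size();
  work = 0;
  span = 0;
  std::vector<long long> best(n, 0);   // best[v]: costliest path ending just before v
  for (int v = 0; v < n; v++) {
    work += cost[v];                   // T1: one processor must do every node
    best[v] += cost[v];                // costliest path ending at v
    span = std::max(span, best[v]);    // T-infinity: longest chain of dependencies
    for (int w : succ[v])
      best[w] = std::max(best[w], best[v]);
  }
}

int main() {
  // Toy getting-dressed-style DAG: 0 -> 1 -> 2, node 3 independent, every node costs 5.
  std::vector<long long> cost = {5, 5, 5, 5};
  std::vector<std::vector<int>> succ = {{1}, {2}, {}, {}};
  long long work, span;
  work_and_span(cost, succ, work, span);
  std::cout << "work = " << work << ", span = " << span << std::endl;  // work = 20, span = 15
  return 0;
}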

Asymptotically Optimal TP
T1 and T∞ are easy, but we want to understand TP in terms of P
But can TP beat:
T1 / P? Why or why not?
T∞? Why or why not?
Architectural issues (e.g., memory hierarchy) can throw this off. We'll ignore that.
Sophomoric Parallelism and Concurrency, Lecture 2 28

Asymptotically Optimal TP
T1 and T∞ are easy, but we want to understand TP in terms of P
But TP cannot beat:
T1 / P, because otherwise we didn't do all the work!
T∞, because we still don't have ∞ processors!
So an asymptotically optimal execution would be: TP = O((T1 / P) + T∞)
First term dominates for small P, second for large P
Sophomoric Parallelism and Concurrency, Lecture 2 29

Asymptotically Optimal TP
When T∞ is small compared to T1/P, we get perfect linear speedup.
As the marginal benefit of more processors bottoms out, we get performance proportional to T∞.
(Plot: TP versus P; the curve follows T1/P for small P and levels off near T∞ for large P.)
Sophomoric Parallelism and Concurrency, Lecture 2 30
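(A rough worked example, not from the slides, using the map/reduce DAG analyzed later in the lecture.) With n = 1,000,000 elements:

\[
T_1 \approx n = 10^6, \qquad T_\infty \approx \lg n \approx 20, \qquad
T_8 \approx \frac{10^6}{8} + 20 \approx 125{,}020 \;\Rightarrow\; \text{speedup} \approx 8.
\]

Only when P approaches n / lg n does the span term start to dominate.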

Getting an Asymptotically Optimal Bound
Good OpenMP implementations guarantee O((T1 / P) + T∞) as their expected bound. (Expected b/c of use of randomness.)
Or they do if we do our job as OpenMP users:
Pick a good algorithm
Make each node's work small, constant, and very roughly equal
(Better yet, use OpenMP's built-in reductions, parallel for loops, etc. But understand how they work from this course!)
Sophomoric Parallelism and Concurrency, Lecture 2 31

Analyzing Code, Not Clothes
Reminder, in our DAG representation:
Each node: one piece of constant-sized work
Each edge: source must finish before destination starts
Cost is per node, not edge.

Say each node has 1 unit cost. What is T∞ in this graph? Sophomoric Parallelism and Concurrency, Lecture 2 32 Where the DAG Comes From pseudocode C++11 main: a = fork task1 b = fork task2 O(1) work join a join b int main(..) { std::thread t1(&task1); std::thread t2(&task2); // O(1) work

t1.join(); t2.join(); return 0; } task1: O(1) work void task1() { // O(1) work } task2: c = fork task1 O(1) work join c void task2() { std::thread t(&task1); // O(1) work t.join(); }

We start with just one thread. (Using C++11 not OpenMP syntax to make things cleaner.) Sophomoric Parallelism and Concurrency, Lecture 2 33 Where the DAG Comes From fork! main: a = fork task1 b = fork task2 O(1) work join a join b int main(..) { std::thread t1(&task1); std::thread t2(&task2); // O(1) work t1.join(); t2.join(); return 0;

} task1: O(1) work void task1() { // O(1) work } task2: c = fork task1 O(1) work join c void task2() { std::thread t(&task1); // O(1) work t.join(); } A fork ends a node and generates two new ones Sophomoric Parallelism and Concurrency, Lecture 2

34 Where the DAG Comes From main: a = fork task1 b = fork task2 O(1) work join a join b int main(..) { std::thread t1(&task1); std::thread t2(&task2); // O(1) work t1.join(); t2.join(); return 0; } task1: O(1) work void task1() {

// O(1) work } task2: c = fork task1 O(1) work join c void task2() { std::thread t(&task1); // O(1) work t.join(); } the new task/thread and the continuation of the current one. Sophomoric Parallelism and Concurrency, Lecture 2 35 Where the DAG Comes From fork! main:

a = fork task1 b = fork task2 O(1) work join a join b int main(..) { std::thread t1(&task1); std::thread t2(&task2); // O(1) work t1.join(); t2.join(); return 0; } task1: O(1) work void task1() { // O(1) work } task2:

c = fork task1 O(1) work join c void task2() { std::thread t(&task1); // O(1) work t.join(); } Again, we fork off a task/thread. Meanwhile, the left (blue) task finished. Sophomoric Parallelism and Concurrency, Lecture 2 36 Where the DAG Comes From join! main: a = fork task1 b = fork task2

O(1) work join a join b int main(..) { std::thread t1(&task1); std::thread t2(&task2); // O(1) work t1.join(); t2.join(); return 0; } task1: O(1) work void task1() { // O(1) work } task2: c = fork task1 O(1) work

join c void task2() { std::thread t(&task1); // O(1) work t.join(); } Let's look at main's join next. (Either way we get the same DAG.) The join makes a node that waits on (has an arrow from) the last node of the joining task and of the joined one. 37 Where the DAG Comes From main: a = fork task1 b = fork task2 O(1) work join a join b int main(..) { std::thread t1(&task1);

std::thread t2(&task2); // O(1) work t1.join(); t2.join(); return 0; } task1: O(1) work void task1() { // O(1) work } task2: c = fork task1 O(1) work join c void task2() { std::thread t(&task1); // O(1) work t.join();

} The next join isn't ready to go yet. The task/thread it's joining isn't finished. So, it waits and so do we. Sophomoric Parallelism and Concurrency, Lecture 2 38 Where the DAG Comes From main: a = fork task1 b = fork task2 O(1) work join a join b fork! int main(..) { std::thread t1(&task1); std::thread t2(&task2); // O(1) work t1.join(); t2.join();

return 0; } task1: O(1) work void task1() { // O(1) work } task2: c = fork task1 O(1) work join c void task2() { std::thread t(&task1); // O(1) work t.join(); } Meanwhile, task2 also forks a task1. (The DAG describes dynamic execution. We can run the same code many times!)

Sophomoric Parallelism and Concurrency, Lecture 2 39 Where the DAG Comes From main: a = fork task1 b = fork task2 O(1) work join a join b int main(..) { std::thread t1(&task1); std::thread t2(&task2); // O(1) work t1.join(); t2.join(); return 0; } task1: O(1) work

void task1() { // O(1) work } task2: c = fork task1 O(1) work join c void task2() { std::thread t(&task1); // O(1) work t.join(); } task1 and task2 both chugging along. Sophomoric Parallelism and Concurrency, Lecture 2 40 Where the DAG Comes From main:

a = fork task1 b = fork task2 O(1) work join a join b int main(..) { std::thread t1(&task1); std::thread t2(&task2); // O(1) work t1.join(); t2.join(); return 0; } task1: O(1) work void task1() { // O(1) work } task2:

join! c = fork task1 O(1) work join c void task2() { std::thread t(&task1); // O(1) work t.join(); } task2 joins task1. Sophomoric Parallelism and Concurrency, Lecture 2 41 Where the DAG Comes From join! main: a = fork task1 b = fork task2 O(1) work

join a join b int main(..) { std::thread t1(&task1); std::thread t2(&task2); // O(1) work t1.join(); t2.join(); return 0; } task1: O(1) work void task1() { // O(1) work } task2: c = fork task1 O(1) work join c

void task2() { std::thread t(&task1); // O(1) work t.join(); } Task2 (the right, green task) is finally done. So, the main task joins with it. (Arrow from the last node of the joining task and of the joined one.) 42 Where the DAG Comes From main: a = fork task1 b = fork task2 O(1) work join a join b int main(..) { std::thread t1(&task1); std::thread t2(&task2); // O(1) work

t1.join(); t2.join(); return 0; } task1: O(1) work void task1() { // O(1) work } task2: c = fork task1 O(1) work join c void task2() { std::thread t(&task1); // O(1) work t.join(); }

Sophomoric Parallelism and Concurrency, Lecture 2 43

Analyzing Real Code (Not Fake Code or Clothes)
fork/join are very flexible, but divide-and-conquer maps and reductions (like count-matches) use them in a very basic way: a tree on top of an upside-down tree (divide, then base cases, then combine results).
Sophomoric Parallelism and Concurrency, Lecture 2 44

Map/Reduce DAG: Work?
Asymptotically, what's the work in this DAG?

Sophomoric Parallelism and Concurrency, Lecture 2 45

Map/Reduce DAG: Span?
Asymptotically, what's the span in this DAG?
Sophomoric Parallelism and Concurrency, Lecture 2 46

Map/Reduce DAG
Work: O(n)
Span: O(lg n)
So: TP = O(n/P + lg n)
LOTS of room for perfect linear speedup with large n!
Sophomoric Parallelism and Concurrency, Lecture 2 47
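(A quick sanity check on these bounds, not spelled out on the slides.) A divide-and-conquer reduce over n elements has roughly n leaf nodes plus another O(n) divide and combine nodes, each of constant cost, and the divide tree and combine tree each have height about lg n, so

\[
T_1 = \Theta(n), \qquad
T_\infty = \Theta(2\lg n) = \Theta(\lg n), \qquad
T_P = O\!\left(\frac{n}{P} + \lg n\right).
\]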

Loop (not Divide-and-Conquer) DAG: Work/Span?

int divs = 4; /* some number of divisions */
std::thread workers[divs];
int results[divs];
for (int d = 0; d < divs; d++)
  // count matches in 1/divs sized part of the array
  workers[d] = std::thread(&cm_helper_seql, ...);
int matches = 0;
for (int d = 0; d < divs; d++) {
  workers[d].join();
  matches += results[d];
}
return matches;

(Diagram: the resulting DAG; black nodes take constant time, while each of the four red worker nodes takes n/4 time.)
Sophomoric Parallelism and Concurrency, Lecture 2 48

Loop (not Divide-and-Conquer) DAG: Work/Span?

int divs = k; /* some number of divisions */
std::thread workers[divs];
int results[divs];
for (int d = 0; d < divs; d++)
  // count matches in 1/divs sized part of the array
  workers[d] = std::thread(&cm_helper_seql, ...);
int matches = 0;
for (int d = 0; d < divs; d++) {
  workers[d].join();
  matches += results[d];
}
return matches;

(Diagram: the same DAG with k red worker nodes, each taking n/k time.)
So, what's the right choice of k? What work/span/expected performance does it give us?
Sophomoric Parallelism and Concurrency, Lecture 2 49

Outline
Done:
How to use fork and join to write a parallel algorithm
Why using divide-and-conquer with lots of small tasks is best: combines results in parallel
Some C++11 and OpenMP specifics
More pragmatics (e.g., installation) in separate notes
Now:
More examples of simple parallel programs
Other data structures that support parallelism (or not)
Asymptotic analysis for fork-join parallelism
Amdahl's Law
Sophomoric Parallelism and Concurrency, Lecture 2 50

Amdahl's Law (mostly bad news)
Work/span is great, but real programs typically have:
parts that parallelize well, like maps/reduces over arrays/trees
parts that don't parallelize at all, like reading a linked list, getting input, doing computations where each needs the previous step, etc.

"Nine women can't make a baby in one month."
Sophomoric Parallelism and Concurrency, Lecture 2 51

Amdahl's Law (mostly bad news)
Let T1 = 1 (measured in weird but handy units)
Let S be the portion of the execution that can't be parallelized
Then: T1 = S + (1-S) = 1
Suppose we get perfect linear speedup on the parallel portion
Then: TP = S + (1-S)/P
So the overall speedup with P processors is (Amdahl's Law): T1 / TP =
How good/bad is this?

Sophomoric Parallelism and Concurrency, Lecture 2 52

Amdahl's Law (mostly bad news)
Let T1 = 1 (measured in weird but handy units)
Let S be the portion of the execution that can't be parallelized
Then: T1 = S + (1-S) = 1
Suppose we get perfect linear speedup on the parallel portion
Then: TP = S + (1-S)/P
So the overall speedup with P processors is (Amdahl's Law): T1 / TP =
And the parallelism (infinite processors) is: T1 / T∞ =
Sophomoric Parallelism and Concurrency, Lecture 2 53

Why such bad news
T1 / TP = 1 / (S + (1-S)/P)
T1 / T∞ = 1 / S
Suppose 33% of a program is sequential
How much speed-up do you get from 2 processors?
How much speed-up do you get from 1,000,000 processors?
Sophomoric Parallelism and Concurrency, Lecture 2 54
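(A worked answer to the questions above, assuming the 33% is exactly S = 1/3.)

\[
\frac{T_1}{T_2} = \frac{1}{1/3 + (2/3)/2} = \frac{1}{2/3} = 1.5, \qquad
\frac{T_1}{T_{1{,}000{,}000}} \approx \frac{T_1}{T_\infty} = \frac{1}{1/3} = 3.
\]

So a million processors buy only about twice what two processors do.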

Why such bad news
T1 / TP = 1 / (S + (1-S)/P)
T1 / T∞ = 1 / S
Suppose 33% of a program is sequential
How much speed-up do you get from 2 processors?
How much speed-up do you get from 1,000,000 processors?
Suppose you miss the good old days (1980-2005) where 12-ish years was long enough to get 100x speedup
Now suppose in 12 years, clock speed is the same but you get 256 processors instead of 1
For 256 processors to get at least 100x speedup, we need 100 ≤ 1 / (S + (1-S)/256)
What do we need for S?
Sophomoric Parallelism and Concurrency, Lecture 2 55

Plots you have to see
1. Assume 256 processors
x-axis: sequential portion S, ranging from .01 to .25
y-axis: speedup T1 / TP (will go down as S increases)
2. Assume S = .01 or .1 or .25 (three separate lines)
x-axis: number of processors P, ranging from 2 to 32
y-axis: speedup T1 / TP (will go up as P increases)
Do this! Chance to use a spreadsheet or other graphing program
Compare against your intuition
A picture is worth 1000 words, especially if you made it
Sophomoric Parallelism and Concurrency, Lecture 2 56
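(A possible starting point for the exercise above, not from the slides.) This small program prints the numbers behind both plots in comma-separated form, ready to paste into a spreadsheet or graphing tool:

#include <cstdio>

// Amdahl's Law: speedup = T1 / TP = 1 / (S + (1 - S) / P)
double speedup(double S, double P) {
  return 1.0 / (S + (1.0 - S) / P);
}

int main() {
  // Plot 1: P = 256, sequential portion S from 0.01 to 0.25.
  std::printf("S, speedup with P = 256\n");
  for (int i = 1; i <= 25; i++) {
    double S = i / 100.0;
    std::printf("%.2f, %.2f\n", S, speedup(S, 256));
  }

  // Plot 2: S = 0.01, 0.10, 0.25; number of processors P from 2 to 32.
  std::printf("\nP, S = 0.01, S = 0.10, S = 0.25\n");
  for (int P = 2; P <= 32; P++)
    std::printf("%d, %.2f, %.2f, %.2f\n",
                P, speedup(0.01, P), speedup(0.10, P), speedup(0.25, P));
  return 0;
}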

All is not lost. Parallelism can still help!
In our maps/reduces, the sequential part is O(1) and so becomes trivially small as n scales up.
We can find new parallel algorithms. Some things that seem sequential are actually parallelizable!
We can change the problem we're solving or do new things
Example: Video games use tons of parallel processors
They are not rendering 10-year-old graphics faster
They are rendering more beautiful(?) monsters
Sophomoric Parallelism and Concurrency, Lecture 2 57

Learning Goals
Define work, the time it would take one processor to complete a parallelizable computation; span, the time it would take an infinite number of processors to complete the same computation; and Amdahl's Law, which relates the speedup in a program to the proportion of the program that is parallelizable.
Use work, span, and Amdahl's Law to analyse the speedup available for a particular approach to parallelizing a computation.

Judge appropriate contexts for and apply the parallel map, parallel reduce, and parallel prefix computation patterns.
Sophomoric Parallelism and Concurrency, Lecture 2 58

Moore and Amdahl
Moore's Law is an observation about the progress of the semiconductor industry
Transistor density doubles roughly every 18 months
Amdahl's Law is a mathematical theorem
Diminishing returns of adding more processors
Both are incredibly important in designing computer systems
Sophomoric Parallelism and Concurrency, Lecture 2 59
