aaron276h.github.io

aaron276h.github.io

Agile Elasticity in ML: How to do it and how to take advantage of it Aaron Harlap, Alexey Tumanov, Greg Ganger, Phil Gibbons PARALLEL DATA LABORATORY Carnegie Parallel Data Mellon Laboratory Carnegie Mellon University Overview Motivation for elasticity in ML How to make Parameter Servers Elastic - TierML How to take advantage of Elasticity - BidBrain

Evaluations of cost benefits and runtime benefits of elasticity for ML Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 Dynamic Resource Availability Revocable resources are common in clusters - Best effort resource that can be preempted - Yarn, Borg, Mesos, etc Adding the element of cost savings in clouds - Preemptible Instances in Google Compute Engine - Spot Instances in Amazon EC2

Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 Big $$$ Saving Often 75-85% cheaper to use Spot Instances Low Cost Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/

November 16 How do you save $$$ Need agile elasticity - Scale in and out efficiently and quickly Handle bulk revocations/evictions Dont lose progress Little/no overhead Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16

PS are great at Solving Iterative ML Parameter Servers shard solution state across machines Current architecture has servers and workers on all machines Used by IterStore, MxNet, Bosen Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16

Current Options for PS Elasticity Exploiting checkpointing support - Need to checkpoint, stop, start up new resources, load the checkpoint - Expensive and slow - Run-time overhead Model Parameter Replication - Doesn't tolerate bulk resizing - Run-time overhead Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16

TierML: New Approach to Elasticity Use tiers of reliable and un-reliable resources - Revocable resources are un-reliable (transient) Maintain all state on reliable resources - Parameter Servers only on On-demand Instances - Spot Instances run workers only (initially) 3 architecture stages - stage set by ratio of transient to reliable resources Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 TierML Architecture Stage #1

On-Demand Instances (Reliable) Elasticity Controller Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ ParamServ ParamServ ParamServ Worke r

Worke r Worke r ParamServ ParamServ ParamServ Worke r Worke r Worke

r Worke r Worke r Worke r Worke r Worke r Worke r

Spot Instances (Cheap) November 16 Stage #1 has a Weakness ParamServs = Instances running server processes 64 c4.2x EC2 Machines MF on Netflix Dataset On-Demand Instances (Reliable) Spot Instances (Cheap) Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ CostlyCheaperCheap November 16

Solving the Server Bottleneck Our solution is to shift load from few reliable resource to the many spot resources Primary - Backup mechanism Primary copies live on spot instances - ActivePS Online Backups live on reliable resource - BackupPS Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16

ActivePS on Spot Instances (Stage 2) On-Demand Instances (Reliable) Elasticity Controller BackupPS Worke r Worker Worker Worker Worker ActivePS

ActivePS ActivePS ActivePS Worker Worker Worker Worker Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/

BackupPS Worke r Spot Instances (Cheap) November 16 ActivePS Helps a Lot 64 c4.2x Machines 4 On-demand Machines MF on Netflix Dataset Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/

Costly Cheap Cheap November 16 Becomes Slow at High Ratios 64 c4.2x EC2 Machines 1 Server MF on Netflix Dataset Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ Costly Cheap November 16

No Workers on Reliable (Stage #3) Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 High Ratios no Longer a Problem 64 c4.2x EC2 Machines 1 On-Demand MF on Netflix Dataset Carnegie Parallel Data Mellon

Laboratory http://www.pdl.cmu.edu/ Costly Cheap Cheap November 16 Transitioning between Stages Stage #1 Stage #2 Stage #3 On-Demand Instances (Reliable) Spot Instances (Cheap) Transition between stages at run-time - Little/No overhead for transitions

- Transitions based on ratios Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 Efficiency of our Elasticity 4 machines ==> 64 machines ==> 4 machines Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/

November 16 So now we have Agile Elasticity How do we take advantage of it? Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 Rules of the Amazon Amazon EC2 Spot Market Rules: - User specifies bid, number of instances and instance type

- Until Market Price rises about Bid Price you have resources - Billed Market Price at the beginning of every hour - If evicted, partial hour is free - Each instance type has its own market in each zone Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 BidBrain knows how to save you $$$ $ BidBrain acts as resource manager Acquires resources from AWS and allocates

them to TierML BidBrain creates allocations: - allocation: (instance type, bid price, number of instances) Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 BidBrain TierML Implementation Spot Market Historical Data 2

BidBrain 3 4 1 6 TierML 5 AWS 1) Application Characteristics 3) Feed Spot Market Price into BidBrian

5) AWS provides resources to BidBrain 2) Feed Historic Spot Market into BidBrian 4) BidBrain makes allocation request 6) BidBrain provides TierML with resources Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16

Goal is to minimize cost per work Computes cost per work for current allocations Builds a set of possible allocations - Allocation = (Bid, Machine Type, # of Instances) For each possible allocation: - compute cost per work of current allocation + possible allocation If it finds a lower cost per work, start the new allocation Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 Compute Expected Cost

Current Market Price Historic Market Price - Bid Delta = Bid Price - Market Price c4.2xlarge instance type in zone us-east-1a: Bid Delta Evicted within Hour Expected Time to Eviction $0.0005 55% 42 Min $0.01

5.5% 738 Min Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 Compute Expected Work TierML provides this information to BidBrain - how long after startup do resources become productive - Scalability

- Scale in/out overhead - Eviction overhead Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 When to Consider Allocatons Every 2 minutes After any eviction event End of billing hour - If not evicted - At 58 minutes after allocation decision decide if we want to re-up or terminate allocation

Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 Evaluation TierML vs Checkpointing [Sharma et al ..] BidBrain vs Bid On-Demand Policy Bid On-Demand Policy (standard) - Choose cheapest resource - Bid On-demand Price (user bid = on-demand price) - On eviction repeat Carnegie Parallel Data Mellon

Laboratory http://www.pdl.cmu.edu/ November 16 Combination of BidBrain & TierML Wins Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ Standard + CKPts BidBrain+ TierML

November 16 BidBrain + TierML is also Faster Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ Standard + CKPts BidBrain+ TierML November 16 Importance of BidBrain

Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ Standard StandardBidBrain+ +CKPts +TierML TierML November 16 Importance of Elasticity Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/

Standard BidBrain BidBrain+ +CKPts +CKPts TierML November 16 Summary Agile Elasticity (TierML) + Smart Bidding (BidBrain) take advantage of dynamic resource availability Future Work - Generalizing BidBrain for other workloads Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16

Backup Slides References Sharma, Prateek, et al. "Flint: batch-interactive data-intensive processing on transient servers." Proceedings of the Eleventh European Conference on Computer Systems. ACM, 2016. Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 BidBrain TierML Implementation

Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16 How much free compute do we get? Carnegie Parallel Data Mellon Laboratory http://www.pdl.cmu.edu/ November 16

Recently Viewed Presentations

  • FFT in Hardware and Software - University of Calgary

    FFT in Hardware and Software - University of Calgary

    FFT in Hardware and Software Background Core Algorithm Original Algorithm, the DFT, O(n2) complexity New Algorithm, the FFT (Fast Fourier Transform), O(nlog2(n)) depending on implementation.
  • Παρουσίαση του PowerPoint - eclass.unipi.gr

    Παρουσίαση του PowerPoint - eclass.unipi.gr

    ΣΤΑΤΙΣΤΙΚΑ ΣΤΟΙΧΕΙΑ ΠΛΗΘΥΣΜΟΥ Σxi, (i=1-N) = μ Μέση τιμή πληθυσμού N Σ(xi -μ)2 Διακύμανση πληθυσμού
  • ISD's processing of postcode files - ISD Scotland

    ISD's processing of postcode files - ISD Scotland

    Geography fields added to unlinked SMR00, SMR01, SMR04 monthly. Part A only. MRL file Other Databases (e.g. Prescribing Information System (PIS), linked to geography via postcode on CHI) L_GEOGRAPHY in NSS Corporate Data Warehouse (CDW) Produced every week. Only part...
  • Unidad temática 3 Análisis de esfuerzos en un punto

    Unidad temática 3 Análisis de esfuerzos en un punto

    Unidad temática 3 Análisis de esfuerzos en un punto Método Grafico. Circulo de Mohr 3a.Parte * 3.5 Método grafico. Circulo de Mohr Existe una interpretación grafica de las ecuaciones anteriores hecha por el ingeniero alemán Otto Mohr (1882) a partir...
  • Chapter 8 Cellular Transport and the Cell Cycle

    Chapter 8 Cellular Transport and the Cell Cycle

    * Length of Cell Cycle Each cell type has a specific growth and reproduction time table Frog embryo cell cycle of less than one hour Cells lining your intestine 24-48 hours Mature nerve cells do not divide * Normal Control...
  • Future Research in Genomics

    Future Research in Genomics

    Future Research in Genomics Curt Van Tassell AIPL Centennial Celebration November 28, 2008 USDA Genomics Blueprint Science to Practice Genome-enabled animal selection Science to Practice Genome-enabled animal selection Precision mating systems Embryonic selection Non-additive genetic effects Cow data Precision Management...
  • LOL TITLE - Lick Observatory

    LOL TITLE - Lick Observatory

    The Current Hardware My Project Design a system to input any time into the Image Mount Control (IMC) Independent of the IMC Doesn't change code in IMC Tricks telescope The Time Setting System Linux PC Timer Card C Code TCL/TK...
  • Colorados Unique Tax Code Carol Hedges Executive Director

    Colorados Unique Tax Code Carol Hedges Executive Director

    TABOR limits how big government spending can grow. If it grows faster than the cap allows, we have to issue tax rebates. Explanation: TABOR (Article X Section 20)Major Provisions. Constrains the Legislature's Choices. ... robert brocker ...