HPC in cloud: state report

Cloud Versus In-house Cluster: Evaluating Amazon Cluster Compute Instances for Running MPI Applications
Yan Zhai, Mingliang Liu, Jidong Zhai, Xiaosong Ma, Wenguang Chen
Tsinghua University & NCSU & ORNL

HPC in cloud?
  Cloud service viable for HPC applications?
    Yes
    Mostly for loosely-coupled codes
  Has cloud won over the majority of HPC users?
    No
    For tightly-coupled codes, performance is still a major concern
    Lower performance -> higher cost

Amazon EC2 CCI
  Emergence of high-performance clouds such as Amazon EC2 CCI (Cluster Compute Instance)
    High-end computation hardware
    Exclusive resource usage
    Upgraded interconnect (10GbE network)
  Has CCI changed the cloud HPC landscape?

Our work

  Several months of evaluating EC2 CCI
  Comprehensive performance and cost evaluations
  Focused on tightly coupled MPI programs
  Micro, macro benchmarks, and real world applications
  Exploring IO configurability issues

Outline
  Background & Motivation
  Evaluation and observations
  Will HPC cloud save you money?
  Application performance results
  Wish list to cloud service providers
  Conclusion

Will HPC cloud save you money?
  Cost: driving factor for going for cloud
  Cloud vs. in-house cluster
    Pay-as-you-go vs. fixed hardware investment
  Workload-dependent decision
    Relative performance of individual applications
    Mixture of applications
    Expected utilization level of in-house cluster

Runtime performance
  Cloud and 16-node in-house cluster configuration:

                    Cloud                         Local
    CPU             Xeon X5570 (8 cores each)     Xeon X5670 (12 cores each)
    Memory          23GB                          48GB
    Network         10GbE                         QDR Infiniband
    FS              NFS                           NFS
    OS              Amazon Linux AMI 2011.02.1    RHEL 5.5
    Virtualization  Para-virtualization           No

Selected applications
  GRAPES [1] (weather simulation)
    CPU- and memory-intensive
    Moderate communication
  MPI-Blast [2] (biological sequence matching)

    Large input
    Relatively little communication
  POP (ocean modeling) [3]
    Communication-intensive
    Large number of small messages

GRAPES results
  [Figure: time (s) vs. process number (32, 64, 128), LOCAL and CLOUD]

MPI-Blast results
  [Figure: time (s) vs. process number (18, 34, 66, 130), LOCAL and CLOUD]

POP results
  [Figure: time (s) vs. process number (16, 32, 64, 128), LOCAL and CLOUD]

Performance summary
  Cloud offers performance close to in-house cluster
    For some applications
  Communication still a severe concern
    For communication-heavy apps
    Major problem: large latency
  Similar observation from benchmarking results
    NPB class C and D

    Intel MPI Benchmarks
    STREAM memory benchmark [4]

Coming back to cost
  Issue
    Local cluster: cost depends on actual utilization level

    $$ U = \frac{T_{eff}}{T_{life}} $$

  For a given application A, cloud is more cost-effective if

    $$ P_{cloud} \cdot N \cdot T_{cloud}(A) < \frac{C_{local}}{T_{eff}} \cdot T_{local}(A) $$

    U: utilization level of the local cluster
    T_eff: effective time used to run applications
    T_life: time period before the local cluster becomes out of date
    P_cloud: cost of cloud per instance, $1.60/(hour * instance)
    N: number of cloud instances used by the job
    T_cloud(A): time to finish one job of A in cloud
    C_local: cost to buy and deploy the local cluster
    T_local(A): time to finish one job of A on the local cluster

  The right side is the cost for one job of application A on the local side. If the right side is larger, then cloud is more cost-effective.

Parameters used in local cluster

    Expense item                           Amount
    Dell 5670 servers (include service)    $6508/node
    Infiniband NIC                         $612/node
    Infiniband switch                      $6891
    SAN with NFS server and RAID5          $36753
    Hosting (energy included)              $15251/rack/year
    Assumed life span                      3 years

Utilization rate threshold for applications
  [Figure: utilization rate threshold (%) for GRAPES, MPIBLAST, and POP]
  This means that if you use the local cluster more than about 25% of the time per year to run GRAPES, you'd better stay local

Further considerations in cost
  Calculation biased toward local cluster
    Assumes 24x7 availability in 3 years
    No failures, maintenance, holidays
    Labor cost not counted
  Cloud provides continuous hardware upgrades
  Yesterday: Amazon announced

    New CCI instances
    Lowered price for current configuration: $1.60 -> $1.30
  Heavy HPC users may get further cloud discount
    Reserved instances on AWS

Reduced pricing effect
  [Figure: utilization rate threshold (%) for GRAPES, MPIBLAST, and POP; bars show the old utilization rate and the increment for the new cloud price]

Reserved Instance discount
  Use reserved instances for 3 years:
    $5053 first payment is required
    $0.45/(hour * instance) can be enjoyed
  Cloud is more effective for application A if:

    $$ N \cdot \left( 5053 + 0.45 \cdot T_{need}(A) \right) < C_{local} $$

    T_need(A): under a certain utilization rate, the time (in hours, out of the 3 x 365 x 24 hours) required for cloud to produce the same amount of jobs as local
    N: number of cloud instances
    C_local: cost to buy and deploy the local cluster

Reserved instance discount effect
  [Figure: utilization rate threshold (%) for GRAPES, MPIBLAST, and POP; bars show the old utilization rate, the increment for the new cloud price, and the increment for the reserved instance discount]

Summary to cost
  Rough steps to evaluate cost effectiveness
    Estimate local utilization rate

    Short-term run to acquire per-job time
    Calculate threshold utilization rate
    If estimated utilization rate > calculated threshold
      Local is more cost-effective
    Else
      Cloud is more cost-effective

Our wish list to cloud service providers
  Improved network latency
  Pre-configured OS image
  Optimized library for specific cloud platform
  More flexible charging
    Current model designed for commercial servers
    Fine-grained accounting for clusters
      To allow large-scale development and testing
  System scale
    Current upper limit: dozens of nodes

Conclusion
  Amazon EC2 CCI becoming a competitive choice for HPC
    Even when running tightly-coupled simulations
    May deliver performance similar to in-house clusters
    Except for codes with heavy communication
  Flexibility and elasticity valuable
    Users may try out different resource types
    No up-front hardware investment
    Per-user, per-application system software
      M. Liu et al., One Optimized I/O Configuration per HPC Application: Leveraging the Configurability of Cloud, APSys 2011

Acknowledgment
  Research sponsored by Intel
  Collaborators: Bob Kuhn, Scott Macmillan, Nan Qiao
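The cost comparison from the "Coming back to cost", "Reserved Instance discount", and "Summary to cost" slides can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' own calculation. The expense items, prices, 16-node size, and 3-year life span come from the slides. The single rack, the reading of the $5053 first payment as per instance, the function names, and the example slowdown of 1.2 are assumptions introduced here for illustration.

```python
"""Sketch of the cloud-vs-local utilization threshold (assumptions labeled)."""

HOURS_3_YEARS = 3 * 365 * 24  # assumed life span of the local cluster, in hours

# Local cluster expense items from the slides (USD).
SERVER_PER_NODE = 6508
IB_NIC_PER_NODE = 612
IB_SWITCH = 6891
SAN_NFS_RAID5 = 36753
HOSTING_PER_RACK_YEAR = 15251


def local_cluster_cost(nodes=16, racks=1, years=3):
    """Cost to buy and deploy the local cluster (racks=1 is an assumption)."""
    hardware = nodes * (SERVER_PER_NODE + IB_NIC_PER_NODE) + IB_SWITCH + SAN_NFS_RAID5
    return hardware + HOSTING_PER_RACK_YEAR * racks * years


def on_demand_threshold(local_cost, slowdown, price_per_hour=1.60, instances=16):
    """Utilization rate below which on-demand cloud is cheaper.

    slowdown is T_cloud / T_local for one job of the application (hypothetical input).
    Cloud is cheaper when
        price * instances * T_cloud < local_cost * T_local / (U * T_life).
    """
    cloud_full_time = price_per_hour * instances * HOURS_3_YEARS
    return local_cost / (cloud_full_time * slowdown)


def reserved_threshold(local_cost, slowdown, upfront=5053, price_per_hour=0.45,
                       instances=16):
    """Same threshold with 3-year reserved instances (upfront assumed per instance)."""
    budget_per_instance = local_cost / instances - upfront
    return budget_per_instance / (price_per_hour * HOURS_3_YEARS * slowdown)


def cheaper_option(estimated_utilization, threshold):
    """Decision rule from the 'Summary to cost' slide."""
    return "local" if estimated_utilization > threshold else "cloud"


if __name__ == "__main__":
    cost = local_cluster_cost()
    slowdown = 1.2  # hypothetical: one job takes 20% longer in the cloud
    cases = [
        ("on-demand at $1.60", on_demand_threshold(cost, slowdown)),
        ("on-demand at $1.30", on_demand_threshold(cost, slowdown, price_per_hour=1.30)),
        ("reserved", reserved_threshold(cost, slowdown)),
    ]
    for label, value in cases:
        choice = cheaper_option(0.40, value)
        print(f"{label}: threshold = {value:.1%}, at 40% utilization -> {choice}")
```

With the listed expense items and one rack, `local_cluster_cost()` evaluates to $203,317. The slowdown would in practice come from the short-term run that measures per-job time on both platforms.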

References
  [1] D. Chen, J. Xue, X. Yang, H. Zhang, X. Shen, J. Hu, Y. Wang, L. Ji, and J. Chen. New generation of multi-scale NWP system (GRAPES): general scientific design. Chinese Science Bulletin, 53(22):3433-3445, 2008.
  [2] A. Darling, L. Carey, and W. Feng. The design, implementation, and evaluation of mpiBLAST. In Proceedings of the ClusterWorld Conference and Expo, in conjunction with the 4th International Conference on Linux Clusters: The HPC Revolution, 2003.
  [3] LANL. Parallel ocean program (POP). http://climate.lanl.gov/Models/POP, April 2011.
  [4] T. University. Technique report. http://www.hpctest.org.cn/resources/cloud.pdf.

Thanks!
