
Data Warehousing and Analytics Infrastructure at Facebook

Ashish Thusoo, Zheng Shao, Suresh Anthony, Dhruba Borthakur, Namit Jain, Joydeep Sen Sarma, Raghotham Murthy, Hao Liu
Facebook

ABSTRACT
Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers, to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever increasing amount of data, a flexible infrastructure that scales up in a cost effective manner is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive, which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations, and the future capabilities and improvements that we are working on.

Categories and Subject Descriptors
H.m [Information Systems]: Miscellaneous.

General Terms
Management, Measurement, Performance, Design, Reliability, Languages.

Keywords
Data warehouse, scalability, data discovery, resource sharing, distributed file system, Hadoop, Hive, Facebook, Scribe, log aggregation, analytics, map-reduce, distributed systems.

1. INTRODUCTION
A number of applications at Facebook rely on processing large quantities of data. These applications range from simple reporting and business intelligence applications that generate aggregated measurements across different dimensions to the more advanced machine learning applications that build models on training data sets. At the same time there are users who want to carry out ad hoc analysis on data to test different hypotheses or to answer one-time questions posed by different functional parts of the company. On any day about 10,000 jobs are submitted by the users. These jobs have very diverse characteristics such as degree of parallelism, execution time, resource needs, and data delivery deadlines.
This diversity in turn means that the data processing infrastructure has to be flexible enough to support different service levels as well as optimal algorithms and techniques for the different query workloads.

What makes this task even more challenging is the fact that the data under consideration continues to grow rapidly as more and more users use Facebook as a ubiquitous social network and as more and more instrumentation is added to the site. As an example of this tremendous data growth, while today we load between 10-15TB of compressed data every day, just 6 months back this number was in the 5-6TB range. Note that these sizes are the sizes of the data after compression - the uncompressed raw data would be in the 60-90TB range (assuming a compression factor of 6). Needless to say, such rapid growth places very strong scalability requirements on the data processing infrastructure. Strategies that are reliant on systems that do not scale horizontally are completely ineffective in this environment. The ability to scale using commodity hardware is the only cost effective option that enables us to store and process such large data sets.

In order to address both of these challenges - diversity and scale - we have built our solutions on technologies that support these characteristics at their core. On the storage and compute side we rely heavily on Hadoop[1] and Hive[2] - two open source technologies that we have significantly contributed to, and in the case of Hive a technology that we have developed at Facebook. While the former is a popular distributed file system and map-reduce platform inspired by Google's GFS[4] and map-reduce[5] infrastructure, the latter brings traditional data warehousing tools and techniques such as SQL, metadata, partitioning etc. to the Hadoop ecosystem. The availability of such familiar tools has tremendously improved developer/analyst productivity and created new use cases and usage patterns for Hadoop. With Hive, the same tasks that would take hours if not days to program can now be expressed in minutes, and a direct consequence of this has been the fact that we see more and more users using Hive and Hadoop for ad hoc analysis at Facebook - a usage pattern that is not easily supported by just Hadoop. This expanded usage has also made data discovery and collaboration on analysis that much more important. In the following sections we will also touch upon some systems that we have built to address those important requirements.
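As a rough illustration of this productivity gain, the minimal sketch below (the ad_clicks table and its columns are hypothetical, not from the paper) computes a per-advertisement click count in a few lines of HiveQL submitted through the Hive command line client, where the equivalent hand-written map-reduce program would run to pages of Java:

    import subprocess

    # Hypothetical table and columns, for illustration only.
    query = """
    SELECT ad_id, COUNT(1) AS clicks
    FROM ad_clicks
    WHERE ds = '2010-06-01'
    GROUP BY ad_id
    """

    # The Hive CLI's -e flag executes a query string directly;
    # Hive compiles it into one or more map-reduce jobs on Hadoop.
    subprocess.run(["hive", "-e", query], check=True)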

Just as Hive and Hadoop are core to our storage and data processing strategies, Scribe[3] is core to our log collection strategy. Scribe is an open source technology created at Facebook that acts as a service that can aggregate logs from thousands of web servers. It acts as a distributed and scalable data bus and, by combining it with Hadoop's distributed file system (HDFS)[6], we have come up with a log aggregation solution that can scale with the increasing volume of logged data.

The data from the web servers is pushed to a set of Scribe-Hadoop (scribeh) clusters. These clusters comprise Scribe servers running on Hadoop clusters. The Scribe servers aggregate the logs coming from different web servers and write them out as HDFS files in the associated Hadoop cluster. Note that since the data is passed uncompressed from the web servers to the scribeh clusters, these clusters are generally bottlenecked on network. Typically more than 30TB of data is transferred to the scribeh clusters every day - mostly within the peak usage hours. In order to reduce cross data center traffic, the scribeh clusters are located in the data centers hosting the web tiers. While we are exploring possibilities of compressing this data on the web tier before pushing it to the scribeh cluster, there is a trade-off between compression and the latencies that it can introduce, especially for low volume log categories. If a log category does not have enough volume to fill up the compression buffers on the web tier, the data therein can experience a lot of delay before it becomes available to the users, unless the compression buffer sizes are reduced or periodically flushed. Both of those possibilities would in turn lead to lower compression ratios.
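As a rough sketch of what the web tier side of this path looks like, the snippet below sends a single log entry to a scribeh aggregator using the Thrift-generated Python bindings that ship with Scribe (the host name, category and payload are hypothetical; 1463 is Scribe's customary port):

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from scribe import scribe  # Thrift-generated Scribe bindings

    # Connect to a scribeh aggregator (hypothetical host); Scribe expects
    # a framed transport and the non-strict binary protocol.
    socket = TSocket.TSocket("scribeh.example.com", 1463)
    transport = TTransport.TFramedTransport(socket)
    protocol = TBinaryProtocol.TBinaryProtocol(
        transport, strictRead=False, strictWrite=False)
    client = scribe.Client(protocol)
    transport.open()

    # The category determines where the aggregators write the entry in
    # HDFS; the message is an opaque payload.
    entry = scribe.LogEntry(category="ad_clicks", message="ad_id=42\tviewer_id=314")
    result = client.Log(messages=[entry])
    assert result == scribe.ResultCode.OK
    transport.close()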
In the following sections we present how these different systems come together to solve the problems of scale and job diversity at Facebook. The rest of the paper is organized as follows. Section 2 describes how the data flows from the source systems to the data warehouse. Section 3 talks about storage systems, formats and optimizations done for storing large data sets. Section 4 describes some approaches taken towards making this data easy to discover, query and analyze. Section 5 talks about some challenges that we encounter due to different SLA expectations from different users using the same shared infrastructure. Section 6 discusses the statistics that we collect to monitor cluster health, plan out provisioning and give usage data to the users. We conclude in Section 7.

2. DATA FLOW ARCHITECTURE
In Figure 1, we illustrate how the data flows from the source systems to the data warehouse at Facebook. As depicted, there are two sources of data - the federated mysql tier that contains all the Facebook site related data, and the web tier that generates all the log data. An example of a data set that originates in the former is the information describing advertisements - their category, their name, the advertiser information etc. The data sets originating in the latter mostly correspond to actions such as viewing an advertisement, clicking on it, fanning a Facebook page etc. In traditional data warehousing terminology, more often than not the data in the federated mysql tier corresponds to dimension data and the data coming from the web servers corresponds to fact data.

[Figure 1: Data Flow Architecture]

Periodically, the data in the scribeh clusters is compressed by copier jobs and transferred to the Hive-Hadoop clusters as shown in Figure 1. The copiers run at 5-15 minute intervals and copy out all the new files created in the scribeh clusters. In this manner the log data gets moved to the Hive-Hadoop clusters. At this point the data is mostly in the form of HDFS files. It gets published either hourly or daily in the form of partitions in the corresponding Hive tables through a set of loader processes, and then becomes available for consumption.

The data from the federated mysql tier gets loaded to the Hive-Hadoop clusters through daily scrape processes. The scrape processes dump the desired data sets from the mysql databases, compressing them on the source systems and finally moving them into the Hive-Hadoop cluster. The scrapes need to be resilient to failures and also need to be designed such that they do not put too much load on the mysql databases. The latter is accomplished by running the scrapes on a replicated tier of mysql databases, thereby avoiding extra load on the already loaded masters. At the same time, any notion of strong consistency in the scraped data is sacrificed in order to avoid locking overheads. The scrapes are retried on a per database server basis in the case of failures, and if a database cannot be read even after repeated tries, the previous day's scraped data from that particular server is used, as sketched below.
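A minimal sketch of that retry-with-fallback logic (the function and exception names are illustrative placeholders, not the actual scrape code):

    from datetime import timedelta

    class ScrapeError(Exception):
        """Raised when a replica cannot be dumped (placeholder)."""

    def scrape_with_fallback(server, day, max_tries=3):
        # Retry the dump on this replica a few times before giving up.
        for _ in range(max_tries):
            try:
                return dump_tables(server, day)
            except ScrapeError:
                continue
        # Server unreachable after repeated tries: reuse the previous
        # day's dump, trading freshness for a complete daily snapshot.
        return previous_dump(server, day - timedelta(days=1))

    def dump_tables(server, day):
        # Placeholder for: dump from a mysql replica, compress on the
        # source system, copy into the Hive-Hadoop cluster.
        raise NotImplementedError

    def previous_dump(server, day):
        # Placeholder for: locate that day's existing dump for this
        # server in the warehouse.
        raise NotImplementedError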

With thousands of database servers, there are always some servers that may not be reachable by the scrapes, and by a combination of using retries and scraping stale data, a daily dump of the dimension data is created in the Hive-Hadoop clusters. These dumps are then converted to top level Hive tables.

As shown in Figure 1, there are two different Hive-Hadoop clusters where the data becomes available for consumption by the downstream processes. One of these clusters - the production Hive-Hadoop cluster - is used to execute jobs that need to adhere to very strict delivery deadlines, whereas the other cluster - the ad hoc Hive-Hadoop cluster - is used to execute lower priority batch jobs as well as any ad hoc analysis that the users want to do on historical data sets. The ad hoc nature of user queries makes it dangerous to run production jobs in the same cluster. A badly written ad hoc job can hog the resources in the cluster, thereby starving the production jobs, and in the absence of sophisticated sandboxing techniques, the separation of the clusters for ad hoc and production jobs has become the practical choice for us in order to avoid such scenarios.

Some data sets are needed only in the ad hoc cluster, whereas others are needed in both clusters. The latter are replicated from the production cluster to the ad hoc cluster. A Hive replication process checks for any changes made to the Hive tables in the production cluster. For all changed tables, the replication process copies over the raw data and then reapplies the metadata information from the production Hive-Hadoop cluster to the ad hoc Hive-Hadoop cluster. It is important to load the data to the production cluster first, as opposed to the ad hoc cluster, both from the point of view of making the data available earlier to the critical production jobs and from the point of view of not putting the less reliable ad hoc cluster on the path of the data arriving at the production cluster. The replication process relies on the logging of all the Hive commands submitted to the production cluster. This is achieved by implementing a "command logger" as a pre-execution hook in Hive - the pre-execution hook API enables Hive users to plug in any kind of programming logic that gets executed before the execution of the Hive command.

Finally, these published data sets are transformed by user jobs or queried by ad hoc users. The results are either held in the Hive-Hadoop cluster for future analysis or may even be loaded back to the federated mysql tier in order to be served to the Facebook users from the Facebook site. As an example, all the reports generated for the advertisers about their advertisement campaigns are generated in the production Hive-Hadoop cluster, then loaded to a tier of federated mysql databases and served to the advertisers from the advertiser Insights web pages.

2.1 Data Delivery Latency
As is evident from this discussion, the data sets arrive in the Hive-Hadoop clusters with latencies ranging from 5-15 minutes in the case of logs to more than a day in the case of the data scraped from the federated mysql databases. Moreover, even though the logs are available in the form of raw HDFS files within 5-15 minutes of generation (unless there are failures), the loader processes load the data into native Hive tables only at the end of the day. In order to give users more immediate access to these data sets, we use Hive's external table feature[7] to create table metadata on the raw HDFS files.
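For instance, a sketch along the following lines (the table, category and path names are hypothetical) can expose an hour of raw log files through an external table partition without moving the data:

    import subprocess

    # Layer Hive metadata over raw files already landed by the copiers;
    # dropping an external table later removes only the metadata, never
    # the underlying HDFS files.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS ad_clicks_raw (line STRING)
    PARTITIONED BY (ds STRING, hr STRING);

    ALTER TABLE ad_clicks_raw ADD IF NOT EXISTS
    PARTITION (ds='2010-06-01', hr='05')
    LOCATION '/scribeh_copied/ad_clicks/2010-06-01/05';
    """
    subprocess.run(["hive", "-e", ddl], check=True)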