Go Big (with Data Lake Architecture) or Go Home!

Microsoft Machine Learning & Data Science Summit
September 26-27 | Atlanta, GA

BR008: Go Big (with Data Lake Architecture) or Go Home!
Omid Afnan

Session objectives and key takeaways
Objective: Understand how the traditional data landscape is changing, what opportunities big data presents, and what architectures allow you to maximize the benefits to your organization.
Key takeaways:
  • Data lake architectures can be additive to your data warehouse
  • Azure Data Lake makes building big data

The traditional data warehouse
"Data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing."
Gartner, The State of Data Warehousing in 2012*
* Donald Feinberg, Mark Beyer, Merv Adrian, Roxane Edjlali (Gartner), The State of Data Warehousing in 2012 (Stamford, CT: Gartner, 2012)
Diagram layers: BI and analytics (dashboards, reporting); data warehouse; ETL; data sources (OLTP, ERP, CRM, LOB).

Big Data is driving transformative changes
"Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."
Gartner, Big Data Definition*
* Gartner, Big Data (Stamford, CT: Gartner, 2016), URL: http://www.gartner.com/it-glossary/big-data/

Big Data is driving transformative changes
Traditional vs. Big Data:
  • Data characteristics: Traditional is relational (with highly modeled schema); Big Data is all data (with schema agility).
  • Cost: Traditional is expensive (storage and compute capacity); Big Data is commodity (storage and compute capacity).
  • Culture: Traditional is rear-view reporting (using relational algebra); Big Data is intelligent action (using relational algebra AND ML, graph, streaming, image processing).

Example: Culture of experimentation
Tangerine instantly adapts to customer feedback to offer customers what they want, when they want it.
Scenario:
  • Lack of insight for targeted campaigns
  • Inability to support data growth
Solution: Azure HDInsight (Hadoop-as-a-service) with the Analytics Platform System enables instant analysis of social sentiment and customer feedback across digital, face-to-face and phone interactions.
Result:
  • Reduced time to customer insight
  • Ability to make changes to campaigns or adjust product rollouts based on real-time customer reactions
  • Ability to offer incentives and new services to retain and grow its customer base
"I can see us creating predictive, context-aware financial services applications that give information based on the time and where the customer is."
Billy Lo, Head of Enterprise Architecture

Why Data Lakes?

Traditional business analytics process
1. Start with end-user requirements to identify desired reports and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to target schema (schema-on-write)
5. Create reports, analyze data
Diagram: LOB applications; ETL pipeline (dedicated ETL tools, e.g. SSIS); relational store with defined schema; queries; results.
Cycle diagram: new requirements; identify data schema and queries; identify data sources; create ETL pipeline; create reports, do analytics.
All data not immediately required is discarded or archived.

New big data thinking: All data has value
  • All data has potential value
  • Data hoarding
  • No defined schema: stored in native format
  • Schema is imposed and transformations are done at query time (schema-on-read)
  • Apps and users interpret the data as they see fit
Iterate: gather data from all sources, store indefinitely, analyze, see results.
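The contrast between schema-on-write and schema-on-read can be made concrete with a small sketch. This is only an illustration in Python, not anything shown in the session: the sample events, the `WAREHOUSE_COLUMNS` tuple, and the `etl_load` and `query_lake` helpers are all hypothetical names invented for the example.

```python
import json

# Hypothetical raw events, kept in their native format (one JSON document per line).
RAW_EVENTS = [
    '{"user": "u1", "action": "buy", "amount": 30}',
    '{"user": "u2", "action": "review", "stars": 4}',
    '{"user": "u1", "action": "buy", "amount": 12, "coupon": "FALL"}',
]

# Schema-on-write: the target schema is fixed before loading,
# and any field outside it is discarded at load time.
WAREHOUSE_COLUMNS = ("user", "action", "amount")


def etl_load(lines):
    """Transform each raw record to the fixed warehouse schema (curation)."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append(tuple(record.get(col) for col in WAREHOUSE_COLUMNS))
    return rows


# Schema-on-read: keep the raw lines untouched and let each query
# impose the structure it needs when it runs.
def query_lake(lines, project):
    """Apply a caller-supplied projection to every raw record at query time."""
    return [project(json.loads(line)) for line in lines]


warehouse_rows = etl_load(RAW_EVENTS)
coupons = query_lake(RAW_EVENTS, lambda r: r.get("coupon"))
stars = query_lake(RAW_EVENTS, lambda r: r.get("stars"))
```

In this sketch the warehouse rows no longer carry the `coupon` or `stars` fields, so a question about them would need a schema change and a reload. The raw lines can still answer it, at the cost of parsing at query time.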

The data lake and warehouse
Diagram labels: data sources (devices, sensors, social, web, clickstream, video); data lake workloads (batch queries, interactive queries, real-time analytics, machine learning); cooked data, meta-data and joins; relational warehouse with defined schema, fed by an ETL pipeline from LOB applications; queries and results; dashboards, reports, exploration.

However, Big Data is not easy
  • Obtaining skills and capabilities
  • Determining how to get value
  • Integrating with existing IT investments
* Gartner, Survey Analysis: Hadoop Adoption Drivers and Challenges (Stamford, CT: Gartner, 2015)

Microsoft made the transition to Big Data
We wanted to build better products based on real usage and experimentation, so we built:
  • A data lake for everyone to put their data
  • Tools approachable by any developer
  • Machine learning tools for collaborating across large experiment models
Result:
  • Used at Microsoft across Office, Xbox Live, Azure, Windows, Bing and Skype
  • 10K+ developers running diverse workloads and scenarios
  • Exabytes of data under management
Chart (data stored): Xbox Live, Office365, LCA, Live, Yammer, SMSG, Bing, CRM/Dynamics, Skype, Exchange, Windows, Malware Protection, Microsoft Stores, Commerce Risk.

Patterns for Big Data

Big Data analytics data flow
Diagram (Data, Intelligence, Action): sources (business apps, custom apps, sensors and devices); ingestion (bulk ingestion, event ingestion); discovery (Azure Data Catalog); Azure Data Lake Store; preparation, analytics and machine learning; visualization (Power BI); people.

Event ingestion patterns
Diagram labels: events from business apps, custom apps, and sensors and devices; event collection (Azure Event Hubs, Kafka); stream processing (Azure Stream Analytics, Spark Streaming); transformed data; Power BI real-time dashboards; raw events; Azure Data Lake Store.

Lambda architecture
Stages: data sources, ingest, prepare, analyze, publish, consume.
Diagram labels:
  • Data sources: sensors (IoT, devices, mobile); logs (CSV, JSON, XML)
  • Ingest: Event Hubs
  • Hot path: ASA job rule; flatten & metadata join; reference data; Machine Learning real-time scoring
  • Cold path: archived data and aggregated data in Data Lake Store; hourly, daily, monthly roll-ups with Data Lake Analytics; offline training; Machine Learning batch scoring
  • Publish and consume: Power BI, Azure SQL Data Warehouse, Cortana, web/LOB dashboards
  • Data Factory: move data, orchestrate, schedule, and monitor
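The hot path and cold path of a lambda architecture can be sketched in a few lines. This is a minimal illustration, assuming a made-up event shape of `(hour, device_id, reading_count)`; the `batch_view`, `SpeedView` and `serve` names are hypothetical and do not correspond to any Azure API or to code from the session.

```python
from collections import Counter

# Hypothetical events: (hour, device_id, reading_count).
ARCHIVED = [(0, "d1", 3), (0, "d2", 1), (1, "d1", 2)]  # already landed in the store
RECENT = [(2, "d1", 4), (2, "d2", 5)]                  # arrived since the last batch run


def batch_view(events):
    """Cold path: recompute totals per device from the full archive."""
    totals = Counter()
    for _hour, device, count in events:
        totals[device] += count
    return totals


class SpeedView:
    """Hot path: update totals incrementally as each event arrives."""

    def __init__(self):
        self.totals = Counter()

    def on_event(self, event):
        _hour, device, count = event
        self.totals[device] += count


def serve(batch, speed):
    """Serving step: combine the batch view with the real-time increments."""
    return dict(batch + speed.totals)


speed = SpeedView()
for event in RECENT:
    speed.on_event(event)
merged = serve(batch_view(ARCHIVED), speed)
```

The design point is that the batch view can be rebuilt from the archive at any time, while the speed view only has to cover events that arrived after the last batch run.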

Leading Computer Manufacturer / Retailer: Clickstream, Recommendation

How They Did It: Analyzing Clickstream to Provide Real-time Recommendations
Diagram labels:
  • Website: user clicks captured; web logs; Omniture; visitor information service; product catalog; user segment info
  • HDInsight cluster: collect clickstream data; Blob Storage
  • AzureML: training/validation data; data to be scored; scored data; NRT
  • Event Hub; Azure SQL; cached data; NoSQL storage; persisted storage
  • Azure service for recommendations: deterministic and non-deterministic recommendations; click feedback
  • Azure service for targeted email: template data; email server (IaaS VM); targeted email; email to user
How they did it:
  • Web logs arrive in tab-separated text files
  • Adding 22 new files per hour, ~5-18 MB/file
  • Currently 1 TB and growing
  • Spin up Hadoop
  • Use Hive scripts because of SQL-like syntax
  • Extracts click behavior like buys, additions to carts, reviews etc. and assigns scores
  • Jobs run hourly
  • Currently 8 nodes, with plans to move to 16

Making Big Data Easy

Setting up for big data the hard way
  • Select an open source stack
  • Select several components from dozens available based on workloads
  • Set up an installation and management service like Ambari
  • Install core big data components
  • Size hardware requirements
  • Install optional big data services
  • Procure hardware, rack space, networking
  • Add authentication and security services
  • Configure and tune cluster
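Returning to the clickstream case study above: the hourly jobs are described as extracting click behavior (buys, additions to carts, reviews) from tab-separated logs and assigning scores. The customer did this with Hive scripts; the sketch below only illustrates the same idea in Python. The column layout, the `WEIGHTS` values and the `score_clicks` helper are assumptions made for the example, since the deck does not give the real log format or scores.

```python
import csv
import io
from collections import defaultdict

# Hypothetical weights per click behavior; the deck does not give the real scores.
WEIGHTS = {"buy": 5, "add_to_cart": 3, "review": 2}

# Hypothetical tab-separated log lines: user_id, product_id, action.
SAMPLE_LOG = "u1\tp9\tbuy\nu1\tp9\treview\nu2\tp9\tadd_to_cart\nu2\tp7\tview\n"


def score_clicks(tsv_text):
    """Sum a weight for each recognized action, keyed by (user, product)."""
    scores = defaultdict(int)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for user, product, action in reader:
        # Unrecognized actions (such as a plain view) contribute zero.
        scores[(user, product)] += WEIGHTS.get(action, 0)
    return dict(scores)


scores = score_clicks(SAMPLE_LOG)
```

For sizing, the figures quoted in the case study (22 files per hour at roughly 5 to 18 MB each) work out to about 110 to 396 MB per hour, or roughly 2.6 to 9.5 GB per day.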

Demo: Setting up a big data environment in Azure

Introducing Cortana Intelligence Suite
Diagram (Data, Intelligence, Action):
  • Data sources: apps, sensors and devices
  • Information management: Data Factory, Data Catalog, Event Hubs
  • Big data stores: Data Lake Store, SQL Data Warehouse
  • Machine learning and analytics: Machine Learning, Data Lake Analytics, HDInsight (Hadoop and Spark), Stream Analytics
  • Intelligence: Cognitive Services, Bot Framework, Cortana
  • Dashboards & visualizations: Power BI
  • Action: people, apps (web, mobile, bots), automated systems

Where Big Data is a cornerstone
(The Cortana Intelligence Suite diagram is repeated on this slide.)

Bringing Big Data for everybody
Built for the cloud to accelerate the pace of innovation through a state-of-the-art platform.
Options, ranging from control to ease of use and user adoption:
  • IaaS Hadoop: HDP | CDH | MapR (Azure Marketplace). Any Hadoop technology.
  • Managed Hadoop: Azure HDInsight. Workload optimized, managed clusters.
  • Big Data as-a-service: Azure Data Lake Analytics. Specific applications in a multi-tenant form factor.
  • Storage: Azure Storage, Data Lake Store.

Azure HDInsight
Hadoop and Spark as a Service on Azure
  • Fully-managed Hadoop and Spark for the cloud
  • 100% open source Hortonworks Data Platform
  • Clusters up and running in minutes
  • Managed, monitored and supported by Microsoft with the industry's best SLA
  • Familiar BI tools for analysis, or open source notebooks for interactive data science
  • 63% lower TCO than deploying your own Hadoop on-premises*
* IDC study, The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight

Azure Data Lake Store
A hyper-scale repository for Big Data analytics workloads
  • Hadoop File System (HDFS) for the cloud
  • No limits to scale
  • Store any data in its native format
  • Enterprise-grade access control, encryption at rest
  • Optimized for analytic workload performance

Azure Data Lake Analytics
A new distributed analytics service
  • Distributed analytics service built on Apache YARN
  • Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
  • Elastic scale per query lets users focus on business goals, not configuring hardware
  • Integrates with Visual Studio to develop, debug, and tune code faster
  • Federated query across Azure data sources
  • Enterprise-grade role-based access control

Demo: Running a U-SQL job in ADLA

Highest availability guarantee in the industry for peace of mind
  • Managed, monitored and supported by Microsoft
  • Enterprise-leading SLA: 99.9% uptime*
  • No IT resources needed for upgrades and patching
  • Microsoft monitors your deployment so you don't have to
* Applies to HDInsight only

Runs in the most datacenters worldwide
  • Americas: North Central US (Illinois), Central US (Iowa), West US (California), South Central US (Texas), East US (Virginia), East US 2 (Virginia), Brazil South (Sao Paulo State)
  • Europe: North Europe (Ireland), West Europe (Netherlands)
  • Asia Pacific: China North* (Beijing), China South* (Shanghai), Japan East (Tokyo, Saitama), Japan West (Osaka), India Central (Pune), East Asia (Hong Kong), SE Asia (Singapore), Australia East (New South Wales), Australia South East (Victoria)
Azure doubling compute and storage every 6 months
* Applies to HDInsight only

Manage and secure your data by leveraging existing IT investments
  • Auditing, alerting, access control, all from within a single web-based portal

  • Azure Active Directory integration for identity and access management
  • Leverage existing investment in Active Directory on-premises

Lower total cost of ownership
  • No hardware
  • Hadoop support included with Azure support
  • Pay only for what you use
  • Independently scale storage and compute
  • No need to hire specialized operations team
  • 63% lower total cost of ownership than on-premises*
* IDC study, The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight

Get started now
Learn more on the Data Lake website: http://azure.com/datalake
Watch videos on Azure Data Lake: https://channel9.msdn.com/Series/AzureDataLake
Take courses and read documentation on Azure Data Lake:
http://aka.ms/hditraining
http://aka.ms/adlanalytics
http://aka.ms/adlstore

Copyright Microsoft Corporation. All rights reserved.
