Outside-In Project Planning & Management


From Development to Deployment (ESaaS 12.1)
2013 Armando Fox & David Patterson, all rights reserved

Outline
  Continuous integration & continuous deployment
  Upgrades & feature flags
  Availability & responsiveness
  Monitoring
  Relieving pressure on the database
  Defending customer data

Development vs. Deployment
  Development: Testing to make sure your app works as designed
  Deployment: Testing to make sure your app works when used in ways it was not designed to be used

Bad News
  Users are a terrible thing
  Some bugs only appear under stress
  Production environment != development environment
  The world is full of evil forces
  And idiots

Good News: PaaS Makes Deployment Way Easier
  Without PaaS, you would have to:
    Get Virtual Private Server (VPS), maybe in cloud
    Install & configure Linux, Rails, Apache, mysqld, openssl, sshd, ipchains, squid, qmail, logrotate, ...
    Fix almost-weekly security vulnerabilities
    Find yourself in Library Hell (version control)
    Tune all moving parts to get most bang for buck
    Figure out how to automate horizontal scaling

Our Goal: Stick with PaaS!

  PaaS handles                          We handle
  Easy tiers of horizontal scaling      Minimize load on database
  Component-level performance tuning    Application-level performance tuning (e.g. caching)
  Infrastructure-level security         Application-level security

  Is this really feasible?
    Pivotal Tracker & Basecamp each run on a single DB (128GB commodity box <$10K)
    Many SaaS apps are not world-facing (internal or otherwise limited interest)

Performance & Security Defined
  Performance stability
    Availability or Uptime: What % of time is site up & accessible?
    Responsiveness: How long after a click does user get response?
    Scalability: As # users increases, can you maintain responsiveness without increasing cost/user?
  Security
    Privacy: Is data access limited to the appropriate users?
    Authentication: Can we trust that user is who s/he claims to be?
    Data integrity: Is user's sensitive data tamper-evident?

Quantifying Availability and Responsiveness (ESaaS 12.2)

Availability and Response Time
  Gold standard: US public phone system, 99.999% uptime (five nines)
  Rule of thumb: 5 nines ~5 minutes/year
  Since each nine is an order of magnitude, 4 nines ~50 minutes/year, etc.
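The rule of thumb above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function name is made up for this example, and a year is taken to be 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year for an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.1f} minutes/year")
```

Under these assumptions five nines allows a little over 5 minutes of downtime per year, and each nine removed multiplies that budget by ten.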

  Good Internet services get 3-4 nines
  Response time: how long after I interact with site do I perceive response?
    For small content on fast network, dominated by latency (not bandwidth)

Is Response Time Important?
  How important is response time?*
    Amazon: +100ms => 1% drop in sales
    Yahoo!: +400ms => 5-9% drop in traffic
    Google: +500ms => 20% fewer searches
  Classic studies (Miller 1968, Bhatti 2000)
    <100 ms is instantaneous
    >7 sec is abandonment time
  Jeff Dean, Google Fellow: Speed is a feature
  http://developers.google.com/speed
  * Nicole Sullivan (Yahoo! Inc.), Design Fast Websites, http://www.slideshare.net/stubbornella/designing-fast-websites-presentation

Simplified (& False!) View of Performance
  For standard normal distribution of response times: 2 standard deviations around mean is 95% confidence interval
  Average response time T means:
    95%ile users get T+2σ
    99.7% users get T+3σ

A Real Response Distribution

  [Figure: response-time distribution with markers at 25%, 50%, 75%, Mean (median), 95%]
  Courtesy Bill Kayser, Distinguished Engineer, New Relic. http://blog.newrelic.com/breaking-down-apdex Used with permission of the author.

Service Level Objective (SLO)
  Time to satisfy user request (latency or response time)
  SLO: Instead of worst case or average: what % of users get acceptable performance
  Specify %ile, target response time, time window
    e.g., 99% < 1 sec, over a 5 minute window
    Why is time window important?
  Service level agreement (SLA) is an SLO to which provider is contractually obligated
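One way to see why the time window matters is to evaluate the same SLO over two window sizes. The sketch below is an assumption-laden illustration, not a prescribed method: the function names, the nearest-rank percentile definition, and the synthetic samples are all invented for this example.

```python
import math
from typing import Dict, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_violations(samples: List[Tuple[float, float]], pct: float,
                   target_s: float, window_s: float) -> List[int]:
    """samples are (timestamp_s, latency_s) pairs.

    Returns the indices of the time windows in which the pct-th percentile
    latency was NOT below target_s.
    """
    windows: Dict[int, List[float]] = {}
    for timestamp, latency in samples:
        windows.setdefault(int(timestamp // window_s), []).append(latency)
    return sorted(w for w, lats in windows.items()
                  if percentile(lats, pct) >= target_s)


# Synthetic data: one request every 3 s for 10 minutes, all fast (0.2 s)
# except two slow ones (3 s) in the second 5-minute window.
samples = [(i * 3.0, 0.2) for i in range(200)]
samples[150] = (450.0, 3.0)
samples[160] = (480.0, 3.0)

print(slo_violations(samples, pct=99, target_s=1.0, window_s=300))
print(slo_violations(samples, pct=99, target_s=1.0, window_s=600))
```

With a 5-minute window the second window violates "99% < 1 sec"; with a single 10-minute window the same two slow requests are diluted below 1% and no violation is reported.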

Apdex: Simplified SLO
  Given a threshold latency T for user satisfaction:
    Satisfactory requests: t ≤ T
    Tolerable requests: T < t ≤ 4T
    Apdex = (#satisfactory + 0.5(#tolerable)) / #reqs
  0.85 to 0.93 generally good
  Warning! Can hide systematic outliers if not used carefully!
    e.g. critical action occurs once in every 15 clicks but takes 10x as long => (14+0)/15 = 0.93

[Figure: Apdex Visualization, T=1500ms, Apdex = 0.7]
[Figure: Apdex Visualization, T=1000ms, Apdex = 0.49]
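The Apdex formula and its outlier warning can be sketched as follows. The function name, the action names, and the latencies are hypothetical; the tolerable band is taken as T < t ≤ 4T.

```python
from typing import Dict, List


def apdex(latencies: List[float], threshold: float) -> float:
    """Apdex = (#satisfactory + 0.5 * #tolerable) / #requests."""
    if not latencies:
        raise ValueError("no requests")
    satisfactory = sum(1 for t in latencies if t <= threshold)
    tolerable = sum(1 for t in latencies if threshold < t <= 4 * threshold)
    return (satisfactory + 0.5 * tolerable) / len(latencies)


T = 1.0  # assumed satisfaction threshold, in seconds

# Hypothetical traffic: 14 ordinary clicks at 0.5 s, plus one critical
# action that takes 10x as long (5 s, beyond 4T, so it counts for nothing).
by_action: Dict[str, List[float]] = {
    "browse": [0.5] * 14,
    "checkout": [5.0],
}

all_requests = [t for lats in by_action.values() for t in lats]
print(f"overall: {apdex(all_requests, T):.2f}")
for action, lats in by_action.items():
    print(f"{action}: {apdex(lats, T):.2f}")
```

The overall score looks healthy, while scoring each action separately exposes that the critical action never satisfies anyone.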

What to Do If Site is Slow?
  Small site: overprovision
    Applies to presentation & logic tier
    Before cloud computing, this was painful
    Today, it's largely automatic (e.g. Rightscale)
  Large site: worry
    Overprovision 1,000-computer site by 10% = 100 idle computers
  Insight: same problems that push us out of PaaS-friendly tier are the ones that will dog us when larger!

Continuous Integration & Continuous Deployment (ESaaS 12.3)

Releases Then and Now: Windows 95 Launch Party

Releases Then and Now
  Facebook: master branch pushed once a week, aiming for once a day (Bobby Johnson, Dir. of Eng., in late 2011)
  Amazon: several deploys per week
  StackOverflow: multiple deploys per day (Jeff Atwood, co-founder)
  GitHub: tens of deploys per day (Zach Holman)
  Rationale: risk == # of engineer-hours invested in product since last deployment!
  Like development and feature check-in, deployment should be a non-event that happens all the time

Successful Deployment
  Automation: consistent deploy process
    PaaS sites like Heroku, CloudFoundry already do this
    Use tool like Capistrano for self-hosted Rails site

  Continuous Integration (CI): integration-testing the app beyond what each developer does
    Pre-release code check-in triggers CI
    Since frequent check-ins, CI always running
    Common strategy: integrate with GitHub

Why CI?
  Differences between dev & production envs
  Cross-browser or cross-version testing
  Testing SOA integration when remote services act wonky
  Hardening: protection against attacks
  Stress testing/longevity testing of new features/code paths
  Example: Salesforce.com CI runs 150K+ tests and automatically opens bug report when test fails

Continuous Deployment
  Push => CI => deploy several times per day
  deploy may be auto-integrated with CI runs

So are releases meaningless?
  Still useful as customer-visible milestones
  Tag specific commits with release names
    git tag 'happy-hippo' HEAD
    git push --tags
  Or just use Git commit ID to identify release

Upgrades & Feature Flags (ESaaS 12.4)

The Trouble With Upgrades
  What if upgraded code is rolled out to many servers?
    During rollout, some will have version n and others version n+1: will that work?

  What if upgraded code goes with schema migration?
    Schema version n+1 breaks current code
    New code won't work with current schema

Naïve Update
  1. Take service offline
  2. Apply destructive migration, including data copying
  3. Deploy new code
     http://pastebin.com/5dj9k1cj
  4. Bring service back online
  May result in unacceptable downtime

Incremental Upgrades with Feature Flags
  1. Do nondestructive migration
     http://pastebin.com/TYx5qaSB
  2. Deploy method protected by feature flag
     http://pastebin.com/qqrLfuQh
  3. Flip feature flag on; if disaster, flip it back
  4. Once all records moved, deploy new code without feature flag
  5. Apply migration to remove old columns
  Feature flag is a design pattern

Undoing an Upgrade
  Disaster strikes: use down-migration?
    Is down-migration thoroughly tested?
    Is migration reversible?
    Are you sure someone else didn't apply an irreversible migration?

  Use feature flags instead
    Down-migrations are primarily for development
    But upgrades are common source of SaaS outages!
    Always have a plan to back out of an upgrade

Other Uses for Feature Flags
  Preflight checking: gradual rollout of feature to increasing numbers of users
    To scope for performance problems
  A/B testing
    Different users get different features/implementations to test them
  Complex feature whose code spans multiple deploys
  rollout gem (on GitHub) covers these cases and more
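A minimal sketch of the feature-flag pattern, covering both a flag-protected method and a gradual percentage rollout. Everything here is hypothetical: the `FeatureFlags` class, the `split_name` flag, and the `name`/`first`/`last` columns are invented for illustration and are not the rollout gem's API or the code behind the pastebin links.

```python
import hashlib
from typing import Dict


class FeatureFlags:
    """Hypothetical in-memory flag store; a real app would keep this outside the code."""

    def __init__(self) -> None:
        self._percent: Dict[str, int] = {}  # flag name -> % of users who get the feature

    def set_percentage(self, flag: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._percent[flag] = percent

    def enabled_for(self, flag: str, user_id: int) -> bool:
        # Hash flag + user into a stable bucket 0..99, so a given user keeps
        # the feature as the percentage is raised.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self._percent.get(flag, 0)


def display_name(record: dict, flags: FeatureFlags, user_id: int) -> str:
    """Flag-protected read: use the new columns only when the flag is on
    and this record has already been moved; otherwise use the old column."""
    if flags.enabled_for("split_name", user_id) and record.get("first") is not None:
        return f"{record['first']} {record['last']}"
    return record["name"]


flags = FeatureFlags()
record = {"name": "Lovelace, Ada", "first": "Ada", "last": "Lovelace"}

flags.set_percentage("split_name", 100)  # flip the flag on for everyone
print(display_name(record, flags, user_id=42))

flags.set_percentage("split_name", 0)    # disaster: flip it back, no redeploy
print(display_name(record, flags, user_id=42))
```

Because the old column is still present after the nondestructive migration, turning the flag off restores the old behavior without a down-migration.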

Monitoring (ESaaS 12.5)

Kinds of Monitoring
  If you're not monitoring it, it's probably broken
  At development time (profiling)
    Identify possible performance/stability problems before they get to production
  In production
    Internal: instrumentation embedded in app and/or framework (Rails, Rack, etc.)
    External: active probing by other site(s)

Why Use External Monitoring?
  Detect if site is down
  Detect if site is slow for reasons outside measurement boundary of internal monitoring
  Get user's view from many different places on the Internet
  Example: Pingdom

Internal Monitoring
  Pre-SaaS/PaaS: local
    Info collected & stored locally, e.g. Nagios
  Today: hosted
    Info collected in your app but stored centrally

    Info available even when app is down
  Example: New Relic
    Conveniently, has both a development mode and production mode
    Basic level of service is free for Heroku apps

Kinds of monitoring: Sampling of Monitoring Tools

  What is monitored                                        Level    Example tool                      Hosted
  Availability                                             site     pingdom.com                       Yes
  Unhandled exceptions                                     site     airbrake.com                      Yes
  Slow controller actions or DB queries                    app      newrelic.com (also has dev mode)  Yes
  Clicks, think times                                      app      Google Analytics                  Yes
  Process health & telemetry (MySQL server, Apache, etc.)  process  god, monit, nagios                No

  Interesting: Customer-readable monitoring features with cucumber-newrelic
  http://pastebin.com/TaecHfND

What to Measure?
  Stress testing or load testing: how far can I push my system...

    ...before performance becomes unacceptable?
    ...before it gasps and dies?
  Usually, one component will be bottleneck
    A particular view, action, query, ...
  Load testers can be simple or sophisticated
    Bang on a single URI over and over
    Do a fixed sequence of URIs over and over
    Play back a log file
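The two simplest load-tester styles above (one URI repeatedly, or a fixed sequence of URIs repeatedly) can be sketched in a few lines. This is a toy, single-client illustration: `run_load` and `fake_fetch` are hypothetical names, the fetch function is injected so the example runs without a network, and a real load tester would issue many requests concurrently.

```python
import time
from typing import Callable, List, Sequence


def run_load(fetch: Callable[[str], object], uris: Sequence[str],
             repetitions: int) -> List[float]:
    """Request each URI in order, `repetitions` times over, and return
    the observed latency in seconds of every request."""
    latencies: List[float] = []
    for _ in range(repetitions):
        for uri in uris:
            start = time.perf_counter()
            fetch(uri)
            latencies.append(time.perf_counter() - start)
    return latencies


def fake_fetch(uri: str) -> str:
    """Stand-in for a real HTTP request; pretends every page takes ~1 ms."""
    time.sleep(0.001)
    return f"response for {uri}"


# Bang on a single URI over and over.
single = run_load(fake_fetch, ["/movies"], repetitions=20)

# Do a fixed sequence of URIs over and over.
sequence = run_load(fake_fetch, ["/login", "/movies", "/movies/1"], repetitions=5)

print(f"single URI: {len(single)} requests, slowest {max(single) * 1000:.1f} ms")
print(f"sequence:   {len(sequence)} requests, slowest {max(sequence) * 1000:.1f} ms")
```

To aim this at a real site, one option is to pass a function built on the standard library, such as `lambda uri: urllib.request.urlopen(uri).read()`, with full URLs in the list.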

Longevity Bugs
  Resource leak (RAM, file buffers, sessions table) is classic example
  Some infrastructure software such as Apache already does rejuvenation, aka rolling reboot
  Related: running out of sessions
    Solution: store whole session[] in cookie (Rails 3 does this by default)
