Transcription

Copyright 2016 Splunk Inc.
Architecting Splunk for Epic Performance at Blizzard Entertainment
Mason Morales, Sr. Security Engineer, Blizzard Entertainment

Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

Agenda
- Introduction
- Performance Issues
- Splunk Re-Design Project
- Q&A
- Bonus Content

Introduction

About Me
- SplunkTrust Member 2015-2016
- Splunk Certified Architect
- Experience
  – 5 years of Splunk XP
  – Sole developer of the Utilization Monitor for Splunk App
  – Published 126 Splunk training videos with Skillsoft
- Community
  – Splunk Answers: @masonmorales
  – #Splunk IRC: @Mason
- Started at Blizzard Entertainment in October 2015

About Blizzard

Blizzard Use Cases
- Security
- Game Fraud Detection
- IT Reporting
- Alerting
- Provide Data to Other Internal Applications

History of Splunk at Blizzard
- Three separate Splunk deployments
- Nobody owned Splunk; no SME
- Serious performance issues on-prem
- Indexes with default settings
- Forwarders dating back to v4.3.3
- Indexers and search heads running v5.0.1
- Largest user group moved to Splunk Cloud because the on-prem deployment wasn't being maintained

October 2015
Lok'tar! Blizzard Seeks Your Splunk Guidance

Battle Plan
(* = what we'll cover in this talk)
1. Upgrade all the things, implement DS, train users, and more
2. * Fix performance issues with existing deployment
3. * Implement new infrastructure to meet business needs
4. Migrate forwarders and users to the new Splunk instance
5. Continue to add more data, users, and apps to Splunk

Performance Issues

Historical Causes at Blizzard
1. Too many accelerated searches
2. Too many real-time searches
3. Bad search schedules
4. Inefficient searches
5. IOPS-constrained hardware
6. Too many users in the same role

Splunk Performance
Addressing Performance Issues
Reactive
- Delete orphaned and unused scheduled searches
- Revoke search acceleration and real-time capabilities from role(s)
- Modify scheduled searches
  – Disable search acceleration
  – Disable real-time
  – Convert fixed search schedules to the new "schedule window"
Proactive
- Perform capacity planning
- Implement role-based access control
- On-board data to different indexes
- Change default time range for time picker
- User training

Using Roles
Blizzard creates separate Splunk roles for each department (a configuration sketch follows below).
Advantages
1. Limit concurrent jobs
  – User-level
  – Role-level
2. Limit disk usage on SHs
3. Enforce search restrictions
4. Separate knowledge objects when each role also has its own app
5. Limit capabilities for each role
Disadvantages
– Slightly more administrative overhead
Docs: /Security/Rolesandcapabilities
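In authorize.conf terms, a per-department role along these lines might look like the following sketch. The role name, quotas, and index lists are illustrative assumptions, not Blizzard's actual values:

  # authorize.conf - hypothetical per-department role (all values illustrative)
  [role_game_fraud]
  importRoles = user                 # inherit the base user capabilities
  srchJobsQuota = 3                  # concurrent search jobs per user in this role
  rtSrchJobsQuota = 0                # no real-time searches for this role
  srchDiskQuota = 500                # MB of dispatch disk per user on the search head
  srchIndexesAllowed = fraud;fraud_summary
  srchIndexesDefault = nothing       # see the empty index trick on the next slide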

Tips for Configuring Roles
Empty Index Trick (configuration sketch below)
1. Create an empty index (e.g., an index named "nothing")
2. Assign the "nothing" index as the index searched by default for every role
3. Inform users that they must always specify an index in their searches
Limit Advanced Capabilities
– We do not give these capabilities to all users: accelerate_search, accelerate_datamodel, rtsearch, schedule_rtsearch
– Evaluate the need for each capability on a case-by-case basis
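A sketch of the empty index trick in configuration form, assuming an index literally named "nothing" (the index and role names are illustrative):

  # indexes.conf - an index that no data is ever sent to
  [nothing]
  homePath   = $SPLUNK_DB/nothing/db
  coldPath   = $SPLUNK_DB/nothing/colddb
  thawedPath = $SPLUNK_DB/nothing/thaweddb

  # authorize.conf - make it the index searched by default for a role
  [role_some_department]
  srchIndexesAllowed = *
  srchIndexesDefault = nothing

A search that omits index= then returns zero events, which quickly trains users to scope every search explicitly.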

Default Time Range
The default time range in the time picker for search is "All Time". Users often forget to specify a time range, but we can limit the damage:
– Edit ui-prefs.conf under $SPLUNK_HOME (sketch below): dispatch.earliest_time = -5m, dispatch.latest_time = now
– Or configure it through Splunk Web: Server settings → Search preferences → Default search time range
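In configuration form, that would look something like this sketch (assuming the system-wide ui-prefs.conf; the [search] stanza applies the default to the Search app):

  # $SPLUNK_HOME/etc/system/local/ui-prefs.conf
  [search]
  dispatch.earliest_time = -5m   # time picker defaults to the last 5 minutes
  dispatch.latest_time   = now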

Time Picker Customization
Customize the time picker:
– Copy $SPLUNK_HOME/etc/system/default/times.conf to $SPLUNK_HOME/etc/system/local/times.conf
– Edit as desired!
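For example, a custom range added to the local times.conf might look like this sketch (the stanza name, label, and order are made-up values):

  # $SPLUNK_HOME/etc/system/local/times.conf
  [last_3_days]
  label = Last 3 days
  earliest_time = -3d@d          # snap to the start of the day, 3 days back
  latest_time = now
  order = 10                     # position in the time picker menu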

Scheduled Searches
Long-running scheduled searches can waste system resources:
– Can cause the concurrent search limit to be hit
– System-wide impact if everyone has the same role!
Tip: Limit the amount of time searches can run for at the role level in authorize.conf
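One candidate setting for this is srchMaxTime in authorize.conf; verify it against your Splunk version. A sketch with an illustrative role name and a one-hour cap:

  # authorize.conf - cap how long this role's searches may run
  [role_power_user]
  srchMaxTime = 1h               # searches from this role are stopped after one hour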

Scheduled Searches
How long are your scheduled searches really running for?
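One way to answer that is to query Splunk's own scheduler logs; a sketch using the standard scheduler.log fields (verify the field names against your version):

  index=_internal sourcetype=scheduler status=success
  | stats count avg(run_time) AS avg_runtime_s max(run_time) AS max_runtime_s BY savedsearch_name
  | sort - max_runtime_s

Searches that consistently run for a large fraction of their scheduling interval are the first candidates for rewriting or rescheduling.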

Indexes
Many of our source types get their own index.
– Why? Efficiency. Each index has its own directory and buckets on the file system.
– When should you separate sourcetypes into additional indexes?
  - Different retention times (see the sketch below)
  - Different access requirements
  - Different applications generating the data
  - One set of data searched more often than another set
Tip: Create a data catalog for users
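As an illustration of the retention case, two indexes with different frozenTimePeriodInSecs values (the index names and periods are hypothetical):

  # indexes.conf - different retention per index
  [web_proxy]
  homePath   = $SPLUNK_DB/web_proxy/db
  coldPath   = $SPLUNK_DB/web_proxy/colddb
  thawedPath = $SPLUNK_DB/web_proxy/thaweddb
  frozenTimePeriodInSecs = 7776000     # 90 days

  [security_audit]
  homePath   = $SPLUNK_DB/security_audit/db
  coldPath   = $SPLUNK_DB/security_audit/colddb
  thawedPath = $SPLUNK_DB/security_audit/thaweddb
  frozenTimePeriodInSecs = 31536000    # 1 year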

User Training
Blizzard has an internal Splunk User Group that does training at least once/month, along with recurring workshops to help users learn Splunk.
When Blizzard on-boards new users to Splunk, they are invited to the User Group and given the following list of learning resources:
- Splunk Cheat Sheet: http://docs.splunk.com/images/4/4f/Splunk_Quick_Reference_Guide_6.x.pdf
- Community Forum: https://answers.splunk.com/
- Free Splunk eBook: http://www.splunk.com/web_assets/v5/book/Exploring_Splunk.pdf
- Free Splunk Course: http://www.splunk.com/view/SP-CAAAHSM
- Splunk Education Videos: AGB6
- Splunk Docs: http://docs.splunk.com/Documentation/Splunk
- Splunk Wiki: https://wiki.splunk.com/Main_Page
- Splunk Apps: https://splunkbase.splunk.com/
- Splunk YouTube Channel: VyyvAw
- Internal Wiki links

Tips for User Training
- Document examples of good searches on your internal Wiki
- Ask your rep for Splunk swag, like query mugs!
- Distribute Splunk quick reference cards
- Hold your own SPLing bee to encourage hands-on practice with Splunk
- Look at use cases in your environment and help users implement things like summary indexing, accelerated data models, and report acceleration

Splunk Re-Design Project

Project Summary
Goals
– "One Splunk" experience at Blizzard
– Awesome performance
– High availability
– 1-year data retention
Bonus
– Retire the two on-prem Splunk instances
– Level-up configuration management
– Standardize on one hardware platform

Approach
1. Determine hardware requirements
2. Procure hardware
3. Benchmark various configurations
4. Deploy new Splunk cluster

Hardware Selection
Storage Requirements
– Various use cases required fast random read
– 1-year data retention + indexer clustering = MANY DISKS!! NOW HANDLE IT!
Cost of SSD evaluated against 15k HDD
– SAS 15K Enterprise Drive: $0.81/GB
– SAS SSD Enterprise Drive: $0.91/GB
– SATA SSD Enterprise Drive: $0.49/GB
– SATA SSD was the clear winner in terms of cost
– Additionally, SSD drives had 640% more storage density than the 15k drives
Splunk Performance with SSDs: …ng-the-benefits-of-splunk-with-ssds/

Evaluating SSDs for Splunk
SATA vs SAS Technical Comparison

                Enterprise SATA 3.84 TB SSD    SAS 3.84 TB SSD
  Seq. Read     540 MB/s                       1,500 MB/s
  Seq. Write    480 MB/s                       750 MB/s
  Random Read   99,000 IOPS                    270,000 IOPS
  Random Write  18,000 IOPS                    22,000 IOPS
  MTTF          2,000,000 Hours                2,000,000 Hours
  Cost          $1,900                         $3,500

Blizzard Conclusion: SAS was 84% more expensive ($3,500 / $1,900 ≈ 1.84) for the same amount of storage, while Splunk would likely be CPU-constrained with a sufficient quantity of either drive.

CPU and Memory
Reference machine for distributed deployments:
– 16 cores @ 2 GHz/core
– 12 GB RAM (really more like 64 GB)
Ultimately a business decision:
– Memory is cheap; better too much than too little
– Many options for CPU, just stay within your cost constraint

Blizzard Indexer Hardware
– 1U dual-socket enterprise servers
– Dual Intel Xeon E5 v4 @ 3.4 GHz
– 256 GB ECC DDR4 2400 MHz
– 20x 2.5" external hot-swap bays (data)
– 2x 2.5" internal bays (OS) (RAID 1)
– 2x HBAs, on-board storage controller

Scaling Splunk for Performance
Splunk scales horizontally, so we distributed pretty heavily.
Tip: Always add indexers before adding search heads!
– More indexers → greater search distribution → faster search completion time
– Faster search completion time → less search concurrency
– Each search uses one core on each indexer
– Frequency-optimized CPUs can offer better search performance, but at the cost of less concurrency (since they typically have a lower core count)

Doubling Performance
Blizzard deployed twice as many indexers with only 20% additional cost by purchasing SSDs with half the capacity of the max available. This gave us twice the compute and the same amount of storage.
Other benefits:
– Double the available disk throughput
– Lower CPU contention
– Lower memory contention
– Reduction in concurrency factor
– Substantially better search performance

System Configurations
Settings
– BIOS
  - Enabled hyper-threading
  - Disabled CPU power saving in BIOS
– OS (see the shell sketch below)
  - Partitions were aligned to erase blocks on SSDs
  - Swap file was disabled
  - Linux IO scheduler was set to deadline
  - Queue depth was set to 32 for each drive
  - Disabled Transparent Huge Pages (THP)
– ulimit
  - Core file size (ulimit -c) to unlimited
  - Data segment size (ulimit -d) to unlimited
  - Max open files (ulimit -n) to 65536
  - Max user processes (ulimit -u) to 258048
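On Ubuntu, the OS portion of that tuning might look roughly like the following sketch; the device name sdb is an example, and the sysfs paths should be verified for your kernel:

  # run as root; repeat the per-device lines for each data drive
  echo deadline > /sys/block/sdb/queue/scheduler                # IO scheduler
  echo 32 > /sys/block/sdb/device/queue_depth                   # per-drive queue depth
  echo never > /sys/kernel/mm/transparent_hugepage/enabled      # disable THP
  echo never > /sys/kernel/mm/transparent_hugepage/defrag
  swapoff -a                                                    # also remove swap from /etc/fstab

  # /etc/security/limits.conf entries for the splunk user
  splunk  -  core    unlimited
  splunk  -  data    unlimited
  splunk  -  nofile  65536
  splunk  -  nproc   258048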

Testing Methodology
Tested random read, sequential read, and Splunk search performance.
– Scope included different file systems, RAID levels, and Splunk journal compression algorithms (GZIP vs LZ4)
– Goal was to determine the best performing configuration
RAID
– mdadm used for the EXT4 and XFS tests
– BTRFS used built-in RAID functionality
Same indexer used for all testing.

Testing Process
Synthetic benchmarks performed with FIO on Ubuntu 14.04.4 (x64)
– Flexible I/O (FIO) is available at https://github.com/axboe/fio
– Syntax at https://github.com/axboe/fio/blob/master/HOWTO
– Disk cache invalidated at the start of each test; used non-buffered IO
Splunk benchmarks performed using a large static data set
– Splunk v6.4.1 with parallelization settings enabled (see the sketch below)
– Ran the same searches under each configuration
– Recorded search completion times
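The deck does not list the exact parallelization settings used; for reference, the knobs available in Splunk 6.3/6.4 look something like this sketch (the values are illustrative, not Blizzard's):

  # limits.conf on the indexers - batch-mode search parallelization
  [search]
  batch_search_max_pipeline = 2      # run two pipelines for batch-mode searches

  # server.conf on the indexers - ingestion parallelization
  [general]
  parallelIngestionPipelines = 2     # two independent ingestion pipelines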

Synthetic Benchmark Results
Sequential Read at 1M Block Size

  RAID 10                       RAID 5
  FS      Throughput            FS      Throughput
  BTRFS   4,594 MB/s            BTRFS   5,346 MB/s
  EXT4    10,266 MB/s           EXT4    10,345 MB/s
  XFS     10,310 MB/s           XFS     10,390 MB/s

fio --time_based --name=4k_benchmark --size=100G --runtime=30 --filename=/splunkdata/test --ioengine=libaio --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=12 --rw=read --blocksize=1M --group_reporting

Synthetic Benchmark Results
Random Read at 4k Block Size

  RAID 10                                   RAID 5
  FS      IOPS        Throughput            FS      IOPS        Throughput
  BTRFS   443,000     1,733 MB/s            BTRFS   427,000     1,670 MB/s
  EXT4    389,000     1,533 MB/s            EXT4    448,000     1,750 MB/s
  XFS     1,228,000   5,032 MB/s            XFS     2,794,000   10,915 MB/s

fio --time_based --name=4k_benchmark --size=100G --runtime=30 --filename=/splunkdata/test --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=12 --rw=randread --blocksize=4k --group_reporting

Synthetic Benchmark Results
The numbers for XFS on RAID 5 seemed "too good"
– Retested without a time limit and set FIO to random read 1 TB per process
Final result was 1,295,400 IOPS and 5,058 MB/s at a 4k block size:

  benchmark: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
  fio-2.1.3
  Starting 12 processes
  benchmark: (groupid=0, jobs=12):
    read : io=12000GB, bw=5058.8MB/s, iops=1295.4K, runt=2429074msec

Sequential read throughput was 10,294 MB/s at a 1M block size:

  fio-2.1.3
  Starting 64 processes
  Run status group 0 (all jobs):
    READ: io=617713MB, aggrb=10294MB/s, minb=10294MB/s, maxb=10294MB/s

Splunk Benchmark Results
Single Indexer (chart not captured in transcription)

Splunk Benchmark Results
Dense Search Test / Rare Search Test (charts not captured in transcription)

Splunk Benchmark Conclusion
– Search speed was nearly identical in all tests
– CPU will always be the bottleneck for ad-hoc searches in Splunk once you have a sufficiently fast disk subsystem
– Bonus finding: LZ4 does not yield any substantial gains in performance that would be worth the tradeoff in extra storage vs. GZIP

Wrap-up41

Performance Comparison

Splunk Features for Faster Searching
– Summary indexing (see the sketch below): /Knowledge/Usesummaryindexing
– Data model acceleration: /Knowledge/Acceleratedatamodels
– Report acceleration: /Report/Acceleratereports
– Post-process searches: /Viz/Savedsearches#Post-process_searches
– Batch mode search parallelization: /Knowledge/Configurebatchmodesearch
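As a small illustration of the first item, a summary-indexing saved search might be configured like this sketch (the search, schedule, and summary index name are hypothetical):

  # savedsearches.conf - populate a summary index once an hour
  [Hourly Web Error Counts]
  search = index=web sourcetype=access_combined status>=500 | sistats count BY host
  enableSched = 1
  cron_schedule = 0 * * * *
  action.summary_index = 1
  action.summary_index._name = web_summary   # target summary index

Reports can then run against index=web_summary with | stats count BY host, scanning far fewer events than the raw data.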

Q&A

What Now?
Related breakout sessions and activities:
– How Splunkd Works
– Notes on Optimizing Splunk Performance
– Architecting Splunk for High Availability and Disaster Recovery
– Architecting and Sizing Your Splunk Deployment
– Harnessing Performance and Scalability in the Next Version of Splunk
– Onboarding Data Into Splunk
– Splunk User Groups: More Than Pints and Pizza

THANK YOU

Bonus Content
Optimizations for Data On-Boarding
Splunk's flexibility to perform autom