SCIENTIFIC DATA MANAGEMENT
Arie Shoshani, Computational Research Division, Lawrence Berkeley National Laboratory, February 2007

Outline
- Problem areas in managing scientific data: motivating examples, requirements
- The DOE Scientific Data Management Center: a three-layer architectural approach
- Some results of technologies (details in the mini-symposium)
- Specific technologies from LBNL:
  - FastBit: innovative bitmap indexing for very large datasets
  - Storage Resource Managers: providing uniform access to storage systems

Motivating Example 1: Optimizing Storage Management and Data Access for High Energy and Nuclear Physics Applications

Experiment | # members / institutions | Date of first data | # events / year | Data volume / year (TB)
STAR       | 350/35   | 2001 | 10^8 - 10^9 | 500
PHENIX     | 350/35   | 2001 | 10^9        | 600
BABAR      | 300/30   | 1999 | 10^9        | 80
CLAS       | 200/40   | 1997 | 10^10       | 300
ATLAS      | 1200/140 | 2007 | 10^10       | 5000

STAR: Solenoidal Tracker At RHIC; RHIC: Relativistic Heavy Ion Collider; LHC: Large Hadron Collider (includes ATLAS). [Figure: a mockup of an event]

Typical Scientific Exploration Process
- Generate large amounts of raw data: large simulations; collection from experiments
- Post-processing of data: analyze data (find particles produced, tracks); generate summary data, e.g. momentum, number of pions, transverse energy; the number of properties is large (50-100)
- Analyze data: use summary data as a guide; extract subsets from the large dataset
- Need to access events based on partial property specifications (range queries), e.g. ((0.1 < AVpT < 0.2) AND (10 < Np < 20)) OR (N > 6000), then apply analysis code
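To make the range-query step concrete, here is a minimal sketch (illustrative only, not the STAR analysis framework; the property names AVpT, Np, and N are taken from the example above) of selecting events from tabular summary data with NumPy boolean masks:

    import numpy as np

    # Hypothetical summary table: one row per event, one column per property.
    events = np.zeros(1_000_000, dtype=[("AVpT", "f4"), ("Np", "i4"), ("N", "i4")])

    # Compound range predicate: ((0.1 < AVpT < 0.2) AND (10 < Np < 20)) OR (N > 6000)
    mask = ((events["AVpT"] > 0.1) & (events["AVpT"] < 0.2) &
            (events["Np"] > 10) & (events["Np"] < 20)) | (events["N"] > 6000)

    selected = np.nonzero(mask)[0]   # event IDs to fetch from the large dataset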

Motivating Example 2
- Combustion simulation: 1000x1000x1000 mesh with 100s of chemical species over 1000s of time steps - 10^14 data values
- Astrophysics simulation: 1000x1000x1000 mesh with 10s of variables per cell over 1000s of time steps - 10^13 data values
[Figure: an image of a single variable]
- What is needed is search over multiple variables, such as: Temperature > 1000 AND pressure > 10^6 AND HO2 > 10^-7 AND HO2 > 10^-6
- Combining multiple single-variable indexes efficiently is a challenge
- Solution: specialized bitmap indexes

Motivating Example 3: Earth System Grid
- Accessing large distributed stores by 100s of scientists
- Problems: different storage systems, security procedures, file streaming, lifetime of requests, garbage collection
- Solution: Storage Resource Managers (SRMs)

Motivating Example 4: Fusion Simulation - Coordination between Running Codes (XGC-ET)

[Workflow diagram: the XGC-ET and M3D-L (linear stability) codes are coupled through mesh/interpolation steps and distributed stores (TBs, GBs, MBs); stability checks ("Stable?", "B healed?", "Need more flights?") decide how the run proceeds; downstream steps include blob detection, noise detection, computing puncture plots, island/feature detection, and out-of-core isosurface methods, with results delivered through a web portal (ElVis) over a wide-area network.]

Motivating Example 5: Data Entry and Browsing tool for entering and linking metadata from multiple data sources
- Metadata problem for microarray analysis: microarray schemas are quite complex, with many objects (experiments, samples, arrays, hybridizations, measurements) and many associations between them
- Data is generated and processed in multiple locations that participate in the data pipeline. In this project (Synechococcus sp. WH8102 whole genome): microbes are cultured at the Scripps Institution of Oceanography (SIO); the sample pool is then sent to The Institute for Genomics Research (TIGR); images are then sent to Sandia Lab for hyperspectral imaging and analysis
- Metadata needs to be captured and LINKED
- Generating specialized user interfaces is expensive and time-consuming to build and change
- Data is collected on various systems, spreadsheets, notebooks, etc.

The kind of technology needed - DEB: Data Entry and Browsing Tool
1) The microbes are cultured at SIO

2) Microarray hybridization at TIGR
3) Hyperspectral imaging at Sandia
Features:
- Interface based on a lab-notebook look and feel
- Tools are built on top of a commercial DBMS
- Schema-driven automatic screen generation
[Schema diagram: HS_Experiment, HS_Slide, HS-Scan, MA-Scan, Probe (probe1, probe2), Nucleotide Pool, Probe Source, Study, Hybridization, Slide, Analysis, MCR-Analysis, LCS-Analysis; links from TIGR to SIO and to MA-Scan]

Storage Growth is Exponential
- Unlike compute and network resources, storage resources are not reusable unless data is explicitly removed
- Need to use storage wisely: checkpointing, removing replicated data - time-consuming, tedious tasks
- Data growth scales with compute scaling; storage will grow even with good practices (such as eliminating unnecessary replicas) - not necessarily on supercomputers, but on user/group machines and archival storage
- Storage cost is a consideration: it has to be part of the cost of science growth, but storage costs are going down at a rate similar to data growth; continued investment in new storage technologies is needed
[Charts: storage growth 1998-2006 at ORNL (rate: 2x/year) and at NERSC-LBNL (rate: 1.7x/year)]
The challenges are in managing the data.

Data and Storage Challenges End-to-End: 3 Phases of Scientific Investigation
- Data production phase: data movement; I/O to the parallel file system; moving data out of supercomputer storage; sustaining data rates of GB/sec; observing data during production; automatic generation of metadata
- Post-processing phase: large-scale (entire-dataset) data processing; summarization / statistical properties; reorganization / transposition; generating data at different granularities; on-the-fly data processing; computations for visualization / monitoring
- Data extraction / analysis phase: automated data distribution / replication; synchronizing replicated data; data lifetime management to unclog storage; extracting subsets efficiently (avoid reading unnecessary data); efficient indexes for fixed-content data; automated use of metadata; parallel analysis tools; statistical analysis tools; data mining tools

The Scientific Data Management Center (Center for Enabling Technologies - CET)
- PI: Arie Shoshani, LBNL; annual budget: $3.3 million
- Established 5 years ago (SciDAC-1); successfully re-competed for the next 5 years (SciDAC-2)
- Featured in the second issue of SciDAC Review magazine
- Laboratories: ANL, ORNL, LBNL, LLNL, PNNL; Universities: NCSU, NWU, SDSC, UCD, U of Utah

http://www.scidacreview.org/0602/pdf/data.pdf

Scientific Data Management Center
- Scientific simulations and experiments: climate modeling, astrophysics, genomics and proteomics, high energy physics, fusion (petabytes on tapes, terabytes on disks)
- Currently, data manipulation takes ~80% of a scientist's time and scientific analysis and discovery only ~20%; the goal is to reverse that ratio by using SDM Center technology
- Data manipulation today: getting files from tape archive; extracting subsets of data from files; reformatting data; getting data from heterogeneous, distributed systems; moving data over the network
- SDM-ISIC technology: optimizing shared access from mass storage systems; parallel I/O for various file formats; feature extraction techniques; high-dimensional cluster analysis; high-dimensional indexing; parallel statistics

A Typical SDM Scenario
[Diagram, by layer:
- Control flow layer: Task A: generate time-steps; Task B: move time-step data (Data Mover); Task C: analyze time-steps (post-processing); Task D: visualize time-steps
- Applications & software tools layer: simulation program; SRM; subset extraction + Parallel R; Terascale Browser
- I/O system layer: Parallel NetCDF; HDF5 libraries
- Storage & network resources layer: PVFS; file system]

Approach
Use an integrated framework that:
- Provides a scientific workflow capability
- Supports data mining and analysis tools
- Accelerates storage and access to data
- Simplifies data management tasks for the application scientist
- Hides details of the underlying parallel and indexing technology
- Permits assembly of modules using a simple graphical workflow description tool
[SDM Framework diagram]

The SDM framework layers, from top to bottom: Scientific Process Automation layer, Data Mining & Analysis layer, Storage Efficient Access layer - leading to scientific understanding.

Technology Details by Layer
- Scientific Process Automation (SPA) layer: workflow management engine and tools; scientific workflow components; web-wrapping tools
- Data Mining & Analysis (DMA) layer: ASPECT statistical analysis framework (Parallel R integration); data analysis and feature identification tools (PCA, ICA); efficient indexing (bitmap index); efficient parallel visualization (pVTK)
- Storage Efficient Access (SEA) layer: Storage Resource Manager (SRM, to HPSS); Parallel NetCDF; parallel I/O (ROMIO / MPI-IO); Parallel Virtual File System; hardware, OS, and MSS (HPSS)

Data Generation
[Stack diagram: Scientific Process Automation layer (workflow design and execution); Data Mining and Analysis layer (simulation run); Storage Efficient Access layer (Parallel netCDF, MPI-IO, PVFS2); OS and hardware (disks, mass store)]

Parallel NetCDF vs. HDF5 (ANL + NWU); Parallel Virtual File System: interprocess communication, enhancements and deployment
- Developed Parallel netCDF: enables high-performance parallel I/O to netCDF datasets; achieves up to a 10-fold performance improvement over HDF5
  [Diagram: before, processes P0-P3 funnel through a serial netCDF layer to the parallel file system; after, P0-P3 write through Parallel netCDF directly to the parallel file system]
- Enhanced ROMIO: provides MPI access to PVFS2; advanced parallel file system interfaces for more efficient access
- Developed PVFS2: in production use at ANL, Ohio Supercomputer Center, and the Univ. of Utah HPC center; offered on Dell clusters; being ported to the IBM BG/L system
[Chart: FLASH I/O benchmark performance (8x8x8 block sizes)]
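As a rough illustration of what parallel I/O to a shared netCDF file looks like from application code, here is a minimal sketch using the netCDF4-python bindings with mpi4py (one possible route, assuming a parallel-enabled build; the center's own work was the PnetCDF C library, and the file, dimension, and variable names below are hypothetical):

    from mpi4py import MPI
    import numpy as np
    from netCDF4 import Dataset

    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()

    # All ranks open the same file; metadata operations are collective.
    nc = Dataset("timestep.nc", "w", parallel=True, comm=comm, info=MPI.Info())
    nc.createDimension("x", 1024)
    nc.createDimension("y", 1024 * nprocs)
    var = nc.createVariable("temperature", "f4", ("y", "x"))

    # Each rank writes its own contiguous block of rows.
    lo, hi = rank * 1024, (rank + 1) * 1024
    var[lo:hi, :] = np.random.rand(1024, 1024).astype("f4")
    nc.close()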

Technology Details by Layer (SPA / DMA / SEA layer diagram, as above)

Statistical Computing with R
About R (http://www.r-project.org/):
- R is an Open Source (GPL) programming environment for statistical analysis and graphics, the most widely used of its kind; similar to S
- Good support for both users and developers; highly extensible via dynamically loadable add-on packages
- Originally developed by Robert Gentleman and Ross Ihaka

Example R usage (from the slide):

    > library(mva)
    > pca <- prcomp(data)
    > summary(pca)

    > dyn.load("foo.so")
    > .C("foobar")
    > dyn.unload("foo.so")

    > library(rpvm)
    > .PVM.start.pvmd()
    > .PVM.addhosts(...)
    > .PVM.config()

Providing Task and Data Parallelism in pR
- Task-parallel analyses: likelihood maximization; re-sampling schemes (bootstrap, jackknife, etc.); animations; Markov Chain Monte Carlo (MCMC) with multiple chains; simulated tempering (running parallel chains at different temperatures to improve mixing)
- Data-parallel analyses: k-means clustering; Principal Component Analysis (PCA); hierarchical (model-based) clustering; distance matrix, histogram, etc. computations

Parallel R (pR) Distribution
- Release history: http://www.ASPECT-SDM.org/Parallel-R
- pR (version 1.8.1) enables both data and task parallelism (includes task-pR and RScaLAPACK)
- RScaLAPACK (release 0.5.1) provides an R interface to ScaLAPACK, with its scalability in terms of problem size and number of processors, using data parallelism
- task-pR (release 0.2.7) achieves parallelism by performing out-of-order execution of tasks; with its intelligent scheduling mechanism it attains significant gains in execution times

- pMatrix provides a parallel platform to perform major matrix operations in parallel using ScaLAPACK and PBLAS Level II & III routines
- Also available for download from R's CRAN web site (www.R-Project.org), with 37 mirror sites in 20 countries

Technology Details by Layer (SPA / DMA / SEA layer diagram, as above)

Piecewise Polynomial Models for Classification of Puncture (Poincaré) Plots
- Classify each of the nodes: quasiperiodic, islands, separatrix; connections between the nodes
- Want accurate and robust classification, valid even when there are few points in each node
- Application: National Compact Stellarator Experiment
[Figures: quasiperiodic, island, and separatrix examples]

Polar Coordinates
Transform the (x, y) data to polar coordinates (r, theta). Advantages of polar coordinates:
- Radial exaggeration reveals some features that are hard to see otherwise
- Automatically restricts analysis to the radial band containing data, ignoring inside and outside
- Easy to handle rotational invariance

Piecewise Polynomial Fitting: Computing Polynomials

In each interval, compute the polynomial coefficients to fit one polynomial to the data. If the error is high, split the data into an upper and a lower group and fit two polynomials, one to each group. [Figure: blue: data; red: polynomials; black: interval boundaries.]

Classification
The number of polynomials needed to fit the data and the number of gaps give the information needed to classify the node:

Number of polynomials | Zero gaps     | More than zero gaps
one                   | Quasiperiodic | Islands
two                   | Separatrix    | Islands

[Figures: 2 polynomials, 2 gaps - islands; 2 polynomials, 0 gaps - separatrix]
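A minimal sketch of the split-and-refit idea (illustrative only; the degree, error threshold, and splitting rule below are hypothetical stand-ins, not the values used in the actual classifier):

    import numpy as np

    def fit_interval(theta, r, degree=3, max_err=0.05):
        """Fit one polynomial to the points in an angular interval; if the
        residual is too large, split the points into an upper and a lower
        group and fit one polynomial to each group."""
        coeffs = np.polyfit(theta, r, degree)
        resid = r - np.polyval(coeffs, theta)
        if np.max(np.abs(resid)) <= max_err:
            return [coeffs]                      # one polynomial fits
        upper, lower = resid >= 0, resid < 0     # split by side of the fit
        return [np.polyfit(theta[g], r[g], degree) for g in (upper, lower)]

    # A node fit by one polynomial with no gaps is quasiperiodic; two
    # polynomials with gaps suggest islands; two without gaps, a separatrix.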

Technology Details by Layer (SPA / DMA / SEA layer diagram, as above)

Example Data Flow in the Terascale Supernova Initiative (courtesy: John Blondin)
[Diagram: the highly parallel compute platform produces ~500x500 output files, aggregated to ~500 files (< 2 to 10+ GB each) and to one file (1+ TB each); data moves over the Logistical Network (L-Bone) to a data depot, an archive, and local mass storage (14+ TB); a local 44-processor data cluster holds data on its nodes for weeks; visualization software drives a viz wall and viz clients.]

Original TSI Workflow Example (with John Blondin, NCSU)
Automate data generation, transfer, and visualization of a large-scale simulation at ORNL.

Top-Level TSI Workflow
Automate data generation, transfer, and visualization of a large-scale simulation at ORNL:
- Submit the job to the Cray at ORNL; check whether a time slice is finished
- If so, aggregate it all into one large file and save it to HPSS, and split it into 22 files stored in an XRaid at ORNL
- Notify the head node at NC State; the head node submits the schedule to SGE, which schedules the transfer for 22 nodes (Node-0 .. Node-21), each retrieving a file from the XRaid
- Start EnSight on the head node to generate video files

Using the Scientific Workflow Tool (Kepler), Emphasizing Dataflow (SDSC, NCSU, LLNL)
Automate data generation, transfer, and visualization of a large-scale simulation at ORNL.

New actors in the Fusion workflow to support automated data movement:

[Kepler workflow diagram: two independent processes. Process 1 (software component): the simulation program (MPI) writes to a disk cache on Seaborg at NERSC. Process 2, driven by the Kepler workflow engine: an OTP login actor logs in at ORNL (one-time password); a file-watcher actor detects when files are generated; an scp file-copier actor moves the files to a disk cache on Ewok at ORNL; a tarring actor tars the files; a local archiving actor archives them to HPSS at ORNL.]
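The actor chain above (watch, copy, tar, archive) can be sketched as a simple polling loop; this illustrates the pattern only - it is not Kepler actor code, and the paths and destination below are hypothetical:

    import subprocess, tarfile, time
    from pathlib import Path

    WATCH_DIR, DEST = Path("/scratch/sim/output"), "user@remote:/cache/incoming/"
    seen = set()

    while True:                                   # file-watcher actor
        for f in WATCH_DIR.glob("*.nc"):
            if f.name in seen:
                continue
            subprocess.run(["scp", str(f), DEST], check=True)       # scp file-copier actor
            with tarfile.open(f.with_suffix(".tar"), "w") as tar:   # tarring actor
                tar.add(str(f), arcname=f.name)
            # a local archiving actor would now push the tar file into HPSS
            seen.add(f.name)
        time.sleep(30)                            # poll for newly generated files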

Re-applying Technology
SDM technology, developed for one application, can be effectively targeted at many other applications:

Technology                 | Initial Application | New Applications
Parallel NetCDF            | Astrophysics        | Climate
Parallel VTK               | Astrophysics        | Climate
Compressed bitmaps         | HENP                | Combustion, Astrophysics
Storage Resource Managers  | HENP                | Astrophysics
Feature Selection          | Climate             | Fusion (exp. & simulation)
Scientific Workflow        | Biology             | Astrophysics

Broad Impact of the SDM Center
- Astrophysics: high-speed storage technology, parallel NetCDF, and integration software used for the Terascale Supernova Initiative (TSI) and FLASH simulations (Tony Mezzacappa, ORNL; Mike Zingale, U of Chicago; Mike Papka, ANL; John Blondin, NCSU; Doug Swesty and Eric Myra, Stony Brook) - scientific workflow; ASCI FLASH parallel NetCDF
- Climate: high-speed storage technology, parallel NetCDF, and ICA technology used for climate modeling projects (Ben Santer, LLNL; John Drake, ORNL; John Michalakes, NCAR) - dimensionality reduction

- Combustion: compressed bitmap indexing used for fast generation of flame regions and tracking of their progress over time (Wendy Koegler, Jacqueline Chen, Sandia Lab) - region growing

Broad Impact (cont.)
- Biology: Kepler workflow system and web-wrapping technology used for executing complex, highly repetitive workflow tasks for processing microarray data (Matt Coleman, LLNL) - building a scientific workflow
- High Energy Physics: compressed bitmap indexing and Storage Resource Managers used for locating desired subsets of data (events) and automatically retrieving the data from HPSS (Doug Olson, LBNL; Eric Hjort, LBNL; Jerome Lauret, BNL) - dynamic monitoring of HPSS file transfers
- Fusion: a combination of PCA and ICA technology used to identify the key parameters that are relevant to the presence of edge harmonic oscillations in a tokamak (Keith Burrell, General Atomics; Scott Klasky, PPPL) - identifying key parameters for the DIII-D tokamak

Technology Details by Layer (SPA / DMA / SEA layer diagram, as above)

FastBit: An Efficient Indexing Technology for Accelerating Data-Intensive Science
Outline: overview, searching technology, applications. http://sdm.lbl.gov/fastbit

Searching Problems in Data-Intensive Sciences
- Find the collision events with the most distinct signature of Quark Gluon Plasma; find the ignition kernels in a combustion simulation; track a layer of an exploding supernova
- These are not typical database searches:
  - Large high-dimensional data sets (1000 time steps x 1000 x 1000 x 1000 cells x 100 variables)
  - No modification of individual records during queries, i.e., append-only data
  - Complex questions: 500 < Temp < 1000 && CH3 > 10^-4 && ...
  - Large answers (hits on thousands or millions of records)
  - Seeking collective features such as regions of interest, beyond the typical average or sum

Common Indexing Strategies Are Not Efficient
Task: searching high-dimensional append-only data with ad hoc range queries
- Most tree-based indices are designed to be updated quickly (e.g. the family of B-Trees); they sacrifice search efficiency to permit dynamic updates
- Hash-based indices are efficient for finding a small number of records, but not efficient for ad hoc multi-dimensional queries
- Most multi-dimensional indices suffer from the curse of dimensionality (e.g. R-trees, Quad-trees, KD-trees); they do not scale to high dimensions (fewer than ~20 in practice) and are inefficient if some dimensions are not queried

Our Approach: An Efficient Bitmap Index
Bitmap indices sacrifice update efficiency to gain more search efficiency

- Bitmap indices are efficient for multi-dimensional queries and scale linearly with the number of dimensions actually used in a query
- Bitmap indices may demand too much space; we solve the space problem with an efficient compression method that reduces the index size (typically 30% of the raw data, vs. 300% for some B+-tree indices) and improves operational efficiency (10X speedup)
- We have applied FastBit to speed up a number of DOE-funded applications

FastBit in a Nutshell
- FastBit is designed to search multi-dimensional append-only data, conceptually in table format: rows are objects, columns are attributes
- FastBit uses a vertical (column-oriented) organization for the data, which is efficient for searching
- FastBit uses bitmap indices with a specialized compression method, proven in analysis to be optimal for single-attribute queries and superior to others because they are also efficient for multi-dimensional queries

Bit-Sliced Index
- Take advantage of the fact that the index only needs to support appends (append-only data)
- Partition each property into bins (e.g. ranges of values), one bit vector per bin
- Compress each bit vector (some version of run-length encoding)
[Figure: columns of bit vectors, one set of bins per attribute, for columns 1, 2, ..., n]

Basic Bitmap Index (first commercial version: Model 204, P. O'Neil, 1987)
Data values: 0 1 5 3 1 2 0 4 1; one bitmap per distinct value (=0 .. =5):

value | b0 (=0) | b1 (=1) | b2 (=2) | b3 (=3) | b4 (=4) | b5 (=5)
  0   |    1    |    0    |    0    |    0    |    0    |    0
  1   |    0    |    1    |    0    |    0    |    0    |    0
  5   |    0    |    0    |    0    |    0    |    0    |    1
  3   |    0    |    0    |    0    |    1    |    0    |    0
  1   |    0    |    1    |    0    |    0    |    0    |    0
  2   |    0    |    0    |    1    |    0    |    0    |    0
  0   |    1    |    0    |    0    |    0    |    0    |    0
  4   |    0    |    0    |    0    |    0    |    1    |    0
  1   |    0    |    1    |    0    |    0    |    0    |    0

- Easy to build: faster than building B-trees
- Efficient for querying: only bitwise logical operations, e.g. A < 2 -> b0 OR b1; A > 2 -> b3 OR b4 OR b5
- Efficient for multi-dimensional queries: use bitwise operations to combine the partial results
- Size: one bit per distinct value per object (cardinality == number of distinct values); compact for low-cardinality attributes only (say, < 100), so the size must be controlled for high-cardinality attributes
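A minimal sketch of the equality-encoded bitmap index and bitwise query evaluation described above (illustrative only, using plain Python integers as bit vectors rather than FastBit's compressed WAH bitmaps):

    # Build one bitmap per distinct value: bit i is set if row i holds that value.
    values = [0, 1, 5, 3, 1, 2, 0, 4, 1]
    bitmaps = {v: 0 for v in range(6)}
    for i, v in enumerate(values):
        bitmaps[v] |= 1 << i

    def rows(bits):
        return [i for i in range(len(values)) if bits >> i & 1]

    # A < 2  ->  b0 OR b1
    print(rows(bitmaps[0] | bitmaps[1]))   # rows holding value 0 or 1

    # A multi-dimensional query would AND such partial results together,
    # e.g. (A < 2) AND (B > 3) for a second attribute B indexed the same way.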

Compressing the bitmaps (run-length encoding example)
[Figure: a sparse bitmap encoded as alternating literal and fill words, e.g. [literal] [31 0-words] [literal] [31 0-words] -> [00 0F 00 00] [80 00 00 1F] [02 01 F0 00] [80 00 00 0F]]
Other ideas: repeated byte patterns with counts - the well-known method used in Oracle, Byte-aligned Bitmap Code (BBC).
Advantage: logical operations such as AND, OR, NOT, XOR, and COUNT can be performed directly on compressed data.

FastBit Compression Method is Compute-Efficient
Example: a bitmap of 2015 bits: 10000000000000000000011100000000 ... (many 0s) ... 00000001111111111111111111111111
Main idea: use run-length encoding, but
- Partition the bits into 31-bit groups (on 32-bit machines)
- Merge neighboring groups with identical bits (in the example, the 63 all-zero groups in the middle merge into one fill word with count = 63)
- Encode each group using one word
Name: Word-Aligned Hybrid (WAH) code (US patent 6,831,575)
Key features: WAH is compute-efficient because it uses run-length encoding (simple), allows operations directly on compressed bitmaps, and never breaks any words into smaller pieces during operations.
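A minimal sketch of the WAH idea - run-length encoding over 31-bit groups (illustrative only; the real WAH word layout, counter widths, and the operations performed directly on compressed words are more involved):

    def wah_encode(bits):
        """Encode a list of 0/1 bits as words:
        ('fill', bit, count) for runs of identical 31-bit groups,
        ('literal', group)   for mixed groups."""
        words = []
        for i in range(0, len(bits), 31):
            group = bits[i:i + 31]
            if len(group) == 31 and all(b == group[0] for b in group):
                if words and words[-1][0] == 'fill' and words[-1][1] == group[0]:
                    words[-1] = ('fill', group[0], words[-1][2] + 1)  # extend run
                else:
                    words.append(('fill', group[0], 1))
            else:
                words.append(('literal', group))
        return words

    # Roughly the 2015-bit example above: a mixed first group, a long 0-fill,
    # and trailing 1s.
    bits = [1] + [0] * 20 + [1, 1, 1] + [0] * (2015 - 24 - 25) + [1] * 25
    print(wah_encode(bits)[:3])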

Compute-Efficient Compression Method: 10 times faster than the best-known method
[Chart: operation time vs. selectivity, showing a 10X gap between WAH and BBC]

Time to Evaluate a Single-Attribute Range Condition in FastBit is Optimal
- Evaluating a single-attribute range condition may require ORing multiple bitmaps in an index
- Both analysis and timing measurements confirm that the query processing time is at worst proportional to the number of hits
[Charts: worst case (uniform random data) and realistic case (Zipf data); BBC (Byte-aligned Bitmap Code) is the best-known bitmap compression]

Processing Multi-Dimensional Queries
[Figure: the bitmaps for each attribute's range condition are ORed, and the partial results are ANDed to answer the multi-attribute query]
- Merging results from tree-based indices is slow, because sorting and merging are slow
- Merging results from bitmap indices is fast, because bitwise operations on bitmaps are efficient

Multi-Attribute Range Queries
[Charts: 2-D and 5-D query times]
- Results are based on the 12 most-queried attributes (2.2 million records) from the STAR High-Energy Physics Experiment, with average attribute cardinality equal to 222,000
- WAH-compressed indices are 10X faster than bitmap indices from a DBMS, and 5X faster than our own implementation of BBC
- The size of WAH-compressed indices is only 30% of the raw data size (a popular DBMS system uses 3-4X for B+-tree indices)

The Case for Query-Driven Visualization

- Support compound range queries, e.g. get all cells where Temperature > 300 K AND Pressure < 200 millibars
- Subsetting: only load data that corresponds to the query - get rid of visual clutter and reduce the load on the data analysis pipeline
- Quickly find and label connected regions
- Do it really fast!

Architecture Overview: Query-Driven Visualization Pipeline
[Diagram: data and the FastBit index feed a query step, which feeds visualization/analysis and display] [Stockinger, Shalf, Bethel, Wu 2005]

DEX Visualization Pipeline
[Diagram: query, data, and the Visualization Toolkit (VTK); 3D visualization of a supernova explosion] [Stockinger, Shalf, Bethel, Wu 2005]

Extending FastBit to Find Regions of Interest
- Comparison to what VTK is good at: single-attribute iso-contouring
- But FastBit also does well on: multi-attribute search; region finding (produces a whole volume rather than a contour); region tracking
- Proved to have the same theoretical efficiency as the best iso-contouring algorithms; measured to be 3X faster than the best iso-contouring algorithms
- Implemented in the Dexterous Data Explorer (DEX), jointly with the Vis group
[Chart: time in seconds, DEX vs. vtkKitwareContourFilter, showing a 3X speedup] [Stockinger, Shalf, Bethel, Wu 2005]

Combustion: Flame Front Tracking

Need to perform (over time steps T1, T2, T3, T4):
- Cell identification: identify all cells that satisfy user-specified conditions, such as 600 < Temperature < 700 AND HO2 concentration > 10^-7
- Region growing: connect neighboring cells into regions
- Region tracking: track the evolution of the regions (i.e., features) through time
All steps are performed with bitmap structures.
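A minimal sketch of the cell-identification and region-growing steps on a regular mesh (illustrative only; FastBit performs these steps on compressed bitmaps rather than dense arrays, and the fields and thresholds below are hypothetical stand-ins for the example above):

    import numpy as np
    from scipy import ndimage

    # Hypothetical per-cell fields on a 3D mesh
    temp = np.random.uniform(300, 900, (64, 64, 64))
    ho2 = np.random.uniform(0, 1e-6, (64, 64, 64))

    # Cell identification: compound range condition
    mask = (temp > 600) & (temp < 700) & (ho2 > 1e-7)

    # Region growing: label connected components of the selected cells
    labels, nregions = ndimage.label(mask)
    print(nregions, "regions at this time step")

    # Region tracking would match labeled regions across consecutive time
    # steps, e.g. by the overlap of their cell sets.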

[Chart: region-growing time (sec) vs. number of line segments - linear cost with the number of segments; time required to identify regions in a 3D supernova simulation (LBNL)]
On 3D data with over 110 million records, region finding takes less than 2 seconds. [Wu, Koegler, Chen, Shoshani 2003]

Extending FastBit to Compute Conditional Histograms
- Conditional histograms are common in data analysis, e.g. finding the number of malicious network connections in a particular time window
- Top left: a histogram of the number of connections to port 5554 of machines in the LBNL IP address space (two horizontal axes), with time on the vertical axis; two sets of scans are visible as two sheets
- Bottom left: FastBit computes conditional histograms much faster than common data analysis tools - 10X faster than ROOT, and 2X faster than ROOT with FastBit indices
[Chart: time (sec) vs. number of processors for FastBit, ROOT, and ROOT-FastBit] [Stockinger, Bethel, Campbell, Dart, Wu 2006]
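A minimal sketch of a conditional histogram with NumPy (illustrative only; the column names and condition are hypothetical stand-ins for the network-traffic example above):

    import numpy as np

    # Hypothetical connection records: timestamp, destination port, destination address
    n = 1_000_000
    ts = np.random.uniform(0, 86_400, n)          # seconds within one day
    port = np.random.choice([22, 80, 443, 5554], n)
    addr = np.random.randint(0, 1 << 16, n)       # last two octets of the address

    # Condition: only connections to port 5554 within a one-hour window
    cond = (port == 5554) & (ts > 3_600) & (ts < 7_200)

    # Histogram of the selected records over the address space
    counts, edges = np.histogram(addr[cond], bins=256)
    print(counts.sum(), "connections matched the condition")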

A Nuclear Physics Example: STAR
- STAR: Solenoidal Tracker At RHIC; RHIC: Relativistic Heavy Ion Collider
- 600 participants / 50 institutions / 12 countries / in production since 2000
- ~100 million collision events a year, ~5 MB of raw data per event, several levels of summary data
- Has generated 3 petabytes and 5 million files
- Append-only data, a.k.a. write-once read-many (WORM) data

Grid Collector
Benefits of the Grid Collector: transparent object access; selection of objects based on their attribute values; improved analysis-system throughput; interactive analysis of data distributed on the Grid

Finding Needles in STAR Data
- One of the primary goals of STAR is to search for Quark Gluon Plasma (QGP)
- A small number (~hundreds) of collision events may contain the clearest evidence of QGP
- Using high-level summary data, researchers found 80 special events whose track distributions are indicative of QGP
- Further analysis needs to access more detailed data; the detailed data are large (terabytes), reside on HPSS, and may take many weeks to migrate to disk manually
- We located and retrieved the 80 events in 15 minutes

Grid Collector Speeds up Analyses

[Charts: speedup vs. selectivity for three samples, on logarithmic and linear selectivity scales; test machine: 2.8 GHz Xeon, 27 MB/s read speed]
- When searching for rare events, say selecting one event out of 1000, using the Grid Collector is 20 to 50 times faster
- Using the Grid Collector to read 1/2 of the events gives a speedup > 1.5; reading 1/10 of the events gives a speedup > 2

Summary: Applications Involving FastBit

Application                      | Use of FastBit                                                                     | Partner
STAR                             | Search for rare events with special significance                                   | BNL (STAR collaboration)
Combustion data analysis         | Finding and tracking ignition kernels                                              | Sandia (Combustion Research Facility)
Dexterous Data Explorer (DEX)    | Interactive exploration of large scientific data (visualize regions of interest)   | LBNL Vis group
Network traffic analysis         | Interactive analysis of network traffic data, for forensic and live stream data    | LBNL Vis group, NERSC/ESnet security
DNA sequencing anomaly detection | Finding anomalies in raw DNA sequencing data to diagnose sequencing machine operations and DNA sample preparations | JGI

FastBit implements an efficient, patented compression technique to speed up searches in data-intensive scientific applications.

Technology Details by Layer (SPA / DMA / SEA layer diagram, as above)

What is SRM?
Storage Resource Managers (SRMs) are middleware components whose function is to provide:

- Dynamic space allocation
- Dynamic file management in space
- For shared storage components on the WAN

Motivation
Suppose you want to run a job on your local machine:
- Need to allocate space, bring in all input files, ensure correctness of the files transferred, and monitor and recover from errors
- What if the files don't fit in the space? Need to manage file streaming and remove files to make space for more files
Now suppose that the machine and storage space are a shared resource:
- Need to do the above for many users, enforce quotas, and ensure fairness in scheduling users

Motivation (cont.)
Now suppose you want to do that on a WAN:
- Need to access a variety of storage systems, mostly remote, and to have access permission
- Need special software to access mass storage systems
Now suppose you want to run distributed jobs on the WAN:
- Need to allocate remote spaces, move (stream) files to remote sites, and manage file outputs and their movement to destination site(s)

Ubiquitous and Transparent Data Access and Sharing

[Diagram: data analysis clients sharing petabytes on tapes (e.g. HPSS) and terabytes on disks at multiple sites.]

Interoperability of SRMs
[Diagram: a client (user/applications) goes through Grid middleware to SRMs fronting different storage systems - Enstore, JASMine, dCache, Castor, an SE, and Unix-based disks (CCLRC RAL).]

SDSC Storage Resource Broker - Grid Middleware
[Diagram, taken from one of Reagan Moore's talks: applications and clients go through the SRB server and its metadata catalog (MCAT, with Dublin Core metadata) to resources such as UniTree, HPSS, DB2, Oracle, and local Unix storage.]

SRM vs. SRB
- Storage Resource Broker (SRB): a very successful product from SDSC with a long history; a centralized solution in which all requests go to a central server that includes a metadata catalog (MCAT); developed by a single institution
- Storage Resource Management (SRM): based on an open standard; developed by multiple institutions for their own storage systems; designed for interoperation of heterogeneous storage systems
- Features of SRM that SRB does not deal with: managing storage space dynamically based on a client's request; managing the content of a space based on lifetimes controlled by the client; support for file streaming by pinning and releasing files
- Several institutions now ask for an SRM interface to SRB; in GGF there is activity to bridge these technologies

GGF GIN-Data SRM Interoperability Testing (GGF: Global Grid Forum; GIN: Grid Interoperability Now)

[Diagram: a client initiates the SRM-TESTER, which tests storage sites according to spec versions 1.1 and 2.2 - SRMs at CERN (LCG), IC.UK (EGEE), UIO (ARC), SDSC (OSG), LBNL (STAR), APAC, Grid.IT, FNAL (CMS), and VU - over GridFTP, HTTP(s), and FTP services, and publishes the test results to the web.]

Testing Operations Results

Site            | ping | put  | get  | Advisory delete | Copy (SRMs) | Copy (gsiftp)
ARC (UIO.NO)    | pass | fail | pass | fail            | pass        | fail
EGEE (IC.UK)    | pass | pass | pass | pass            | pass        | pass
CMS (FNAL.GOV)  | pass | pass | pass | pass            | pass        | pass
LCG/EGEE (CERN) | pass | pass | pass | pass            | N.A.        | N.A.
OSG (SDSC)      | pass | pass | pass | pass            | pass        | fail
STAR (LBNL)     | pass | pass | pass | pass            | pass        | pass

Peer-to-Peer Uniform Interface

[Diagram: command-line clients and client programs at the client's site use a uniform SRM interface over the network to Storage Resource Managers with disk caches at Site 1 .. Site N; one site also fronts an MSS.]

Earth Science Grid Analysis Environment
[Diagram: LBNL (HRM storage resource management in front of HPSS, gridFTP server), NCAR (openDAPg server, striped gridFTP servers, MyProxy server, Tomcat servlet engine, disk, MSS), LLNL (MCS and RLS clients, DRM storage resource management, gridFTP server, disk), ISI (MCS metadata cataloguing services, RLS replica location services, MyProxy and CAS clients, DRM, GRAM gatekeeper, gridFTP server, CAS community authorization services), and ORNL (HRM in front of HPSS, gridFTP server, disk), connected via SOAP, RMI, and gridFTP.]

History and Partners in the SRM Collaboration
- 5 years of Storage Resource Management (SRM) activity; experience with SRM system implementations
- Mass storage systems: HRM-HPSS (LBNL, ORNL, BNL), Enstore (Fermi), JasMINE (JLab), Castor (CERN), MSS (NCAR), Castor (RAL)
- Disk systems: DRM (LBNL), jSRM (JLab), DPM (CERN), universities
- Combination systems: dCache (Fermi), a sophisticated multi-storage system; L-Store (Vanderbilt U), based on Logistical Networking; StoRM, to parallel file systems (ICTP, Trieste, Italy)

Standards for Storage Resource Management
Main concepts (see the sketch below):
- Allocate spaces
- Get/put files from/into spaces
- Pin files for a lifetime; release files and spaces
- Get files into spaces from remote sites
- Manage directory structures in spaces
- SRMs communicate peer-to-peer and negotiate transfer protocols
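To make the concepts above concrete, here is a minimal sketch of what a client-side interaction with an SRM-like service could look like. The interface below is hypothetical and written for illustration only; it is not the actual SRM v1.1/v2.2 web-service API (whose operations include calls such as srmPrepareToGet and srmReleaseFiles), and the endpoints and file URLs are invented:

    # Hypothetical client-side sketch of the SRM concepts (not the real SRM API).
    class SRMClient:
        def __init__(self, endpoint):
            self.endpoint = endpoint

        def reserve_space(self, size_bytes, lifetime_s):
            """Ask the SRM to allocate space with a lifetime; returns a space token."""
            ...

        def bring_online(self, src_url, space_token):
            """Stage a (possibly remote or tape-resident) file into the space and pin it."""
            ...

        def release(self, src_url):
            """Release the pin so the SRM may garbage-collect the file."""
            ...

    srm = SRMClient("srm://storage.example.org:8443")
    token = srm.reserve_space(50 * 2**30, lifetime_s=86_400)   # 50 GB for one day
    srm.bring_online("srm://remote.site.org/data/run1234/file.root", token)
    # ... run the analysis against the pinned copy ...
    srm.release("srm://remote.site.org/data/run1234/file.root")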

DataMover
Perform "rcp -r directory" on the WAN.

[Figure, based on the Grid architecture paper by the Globus team: SRMs support data movement between storage systems. The layers, from top to bottom: application-specific and general collective services (request interpretation and planning, workflow or request management, data federation, data discovery, data filtering or transformation, consistency services such as update subscription, versioning, and master copies, database management, community authorization, compute scheduling/brokering, monitoring and auditing); collective services for coordinating single resources (File Transfer Service (GridFTP), data movement, the Storage Resource Manager, compute resource management, resource monitoring/auditing); connectivity (communication protocols such as the TCP/IP stack, authentication and authorization protocols such as GSI, networks); and the fabric (Mass Storage System (HPSS), other storage systems, compute systems).]

Massive Robust File Replication
Multi-file replication - why is it a problem?
- Tedious task: many files, repetitious
- Lengthy task: can take hours, even days
- Error prone: need to monitor transfers
- Error recovery: need to restart file transfers
- Staging and archiving from MSS: limited concurrency, down time, transient failures
- Commercial MSS: HPSS at NERSC, ORNL, ...; legacy MSS: the MSS at NCAR; independent MSS: Castor (CERN), Enstore (Fermilab), JasMINE (JLab)

DataMover: SRMs Used in ESG for Robust Multi-File Replication
- DataMover recovers from file transfer failures, archiving failures, and staging failures
[Diagram: from anywhere, the client creates an identical directory and issues SRM-COPY for thousands of files; the SRM at LBNL/ORNL gets the list of files from the directory and issues SRM-GET one file at a time to the SRM at NCAR, which stages the files from NCAR-MSS and performs the reads; the files are transferred over the network via GridFTP GET (pull mode), written to the local disk cache, and archived; a web-based file monitoring tool tracks progress.]

Web-Based File Monitoring Tool
- Shows files already transferred, files during transfer, and files still to be transferred
- Also shows for each file: source URL, target URL, transfer rate
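The recovery behaviour described above can be sketched as a retry loop around per-file transfers. This is an illustration of the pattern only - DataMover does this through SRM requests and GridFTP rather than the hypothetical transfer() call below:

    import time

    def transfer(src_url, dst_url):
        """Placeholder for a single staged, pulled, and archived file transfer."""
        ...

    def replicate(files, max_retries=5):
        pending = list(files)
        for attempt in range(max_retries):
            failed = []
            for src, dst in pending:
                try:
                    transfer(src, dst)              # stage, pull (GridFTP), archive
                except Exception:
                    failed.append((src, dst))       # transient failure: retry later
            if not failed:
                return
            pending = failed
            time.sleep(60 * (attempt + 1))          # back off before retrying
        raise RuntimeError(f"{len(pending)} files could not be replicated")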

File tracking helps to identify bottlenecks [chart: shows that archiving is the bottleneck].
File tracking shows recovery from transient failures [chart: total of 45 GBs transferred].

Robust Multi-File Replication: Main Results
- DataMover has been used in production for over three years; it currently moves about 10 TB a month
- Averages 8 MB/s (64 Mb/s) over the WAN
- Eliminated the person-time needed to monitor transfers and recover from failures
- Reduced error rates from about 1% to 0.02% (a 50-fold reduction)*
  * http://www.ppdg.net/docs/oct04/ppdg-star-oct04.doc

Summary: Lessons Learned
- Scientific workflow is an important paradigm: coordination of tasks AND management of data flow; managing repetitive steps; tracking and estimation
- Efficient I/O is often the bottleneck: technology is essential for efficient computation; mass storage needs to be seamlessly managed
- Opportunities to interact with math packages:

  general analysis tools are useful; parallelization is key to scaling; visualization is an integral part of analysis
- Data movement is complex: the network infrastructure alone is not enough and can be unreliable; robust software is needed to manage failures; space allocation must be managed; managing format mismatch is part of the data flow
- Metadata is emerging as an important need: description of experiments/simulations; provenance; use of hints / access patterns

Data and Storage Challenges: Still to be Overcome
Fundamental technology areas and general open problems, from the report of the DOE Office of Science Data-Management Workshops (March-May 2004):
- Multiple parallel file systems; a common data model
- Efficient access and queries, data integration
- Distributed data management, data movement, networks

- Storage and caching
- Data analysis, visualization, and integrated environments
- Metadata, data description, logical organization
- Workflow, data flow, data transformation
- Coordinated scheduling of resources; reservations and workflow management
- Multiple data formats; running coupled codes
- Coordinated data movement (not just files)
- Data reliability / monitoring / recovery; tracking data for long-running jobs
- Security: authentication + authorization
URL: http://www-user.slac.stanford.edu/rmount/dm-workshop-04/Final-report.pdf

SDM Mini-Symposium
- High-Performance Parallel Data and Storage Management - Alok Choudhary (NWU), Rob Ross (ANL), Robert Latham (ANL)
- Mining Science Data - Chandrika Kamath (LLNL)
- Accelerating Scientific Exploration with Workflow Automation Systems - Ilkay Altintas (SDSC), Terence Critchlow (LLNL), Scott Klasky (ORNL), Bertram Ludaescher (UC Davis), Steve Parker (U of Utah), Mladen Vouk (NCSU)
- High Performance Statistical Computing with Parallel R and Star-P - Nagiza Samatova (ORNL), Alan Edelman (MIT)
