Web Algorithmics - Dipartimento di Informatica

Web Algorithmics - Dipartimento di Informatica

Data structures: time, I/Os, entropy, joules Paolo Ferragina Dipartimento di Informatica Universit di Pisa Our driving moral... Big steps come from theory ... but do NOT forget practice ;-) Strings... why? Ubiquitous: any datum is a sequence of bits, hence a string Spur new problems in many areas: Geometry String-similarity search Points in high-dim space and NN- search Lower/upper-bounds to indexing via reductions to geo- problem

Graphs Doc-doc similarity graph Query-log graphs page ubiquitous in Text/Web mining edge iff 2 queries clicked on the same res- (String-)Dictionary Problem Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix-searches for a pattern P. Exact search Hashing Mitzenmacher, ESA invited 09 Dominated the string-matching [Fredkin, CACMscene

1960] in the 80s-90s with its suffix-version: the Suffix Tree Trie (Compacted) Performance: 0 Search O(|P|) time s Space O(K + N) 1 y 2 stile zyg (2; 3,5) 5 1 etic ial 2

3 z aibelyite 2 omo 7 5 czecin ygy 4 systile syzygetic syzygial syzygy 6 szaibelyite szczecin szomo Timeline: theory and practice... What about Software Engineers ??

x 9 0 -8 0 7 0 6 0 ie uffi r T S ee r T Dominated the string-matching [Fredkin, CACMscene 1960] in the 80s-90s with its suffix-version: the Suffix Tree Trie (Compacted) Performance:

0 Search O(|P|) time s Space O(K + N) 1 y .. But in practice 2 Search: random memory accesses stile zyg (2; 3,5) Space: len + pointers + strings 5 1 etic ial 2

3 z aibelyite 2 omo 7 5 czecin ygy 4 systile syzygetic syzygial syzygy 6 szaibelyite szczecin szomo What did systems implement? Used the Compacted trie, of course, but with 2 other concerns because of large data

1 issue: space concern .systile syzygetic syzygial syzygy. 2 5 3345% http://checkmate.com/All_Natural/ http://checkmate.com/All_Natural/Applied.html http://checkmate.com/All_Natural/Aroma.html http://checkmate.com/All_Natural/Aroma1.html http://checkmate.com/All_Natural/Aromatic_Art.html http://checkmate.com/All_Natural/Ayate.html http://checkmate.com/All_Natural/Ayer_Soap.html http://checkmate.com/All_Natural/Ayurvedic_Soap.html http://checkmate.com/All_Natural/Bath_Salt_Bulk.html http://checkmate.com/All_Natural/Bath_Salts.html http://checkmate.com/All/Essence_Oils.html http://checkmate.com/All/Mineral_Bath_Crystals.html http://checkmate.com/All/Mineral_Bath_Salt.html http://checkmate.com/All/Mineral_Cream.html Front Coding 5 0 33 34 38 38 34

35 35 33 42 25 25 38 33 http://checkmate.com/All_Natural/ Applied.html roma.html 1.html tic_Art.html yate.html er_Soap.html urvedic_Soap.html Bath_Salt_Bulk.html s.html Essence_Oils.html Mineral_Bath_Crystals.html Salt.html Cream.html 0 http://checkmate.com/All/Natural/Washcloth.html... http://checkmate.com/All/Natural/Washcloth.html ...

Bender et al., PODS 2006 Ferragina et al., PODS 2008 2 issue: hierarchical memory track M Spatial locality or Temporal locality caching: less I/Os Less and Faster I/Os CPU 1 Internal Memory Count I/Os B HD 2-level indexing 2 advantages: Search typically 1 I/O Space Front-coding over buckets Internal Memory

CT on a sample systile szaielyite 2 limitations: Sampling rate lengths of sampled strings Trade-off speed vs space (because of bucket si (Prefix) B-tree Disk .0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo. Timeline: theory and practice... Do we need to trade space by I/Os ? Space + Hierarchical Memory e

S n tri 95 v e l 2- de n li g 19 x 9 0 ffi u S -8 0

7 0 6 0 T rie ee r T n xi g t re B e [Morrison, J.ACM 1968] An old idea: Patricia Trie 0 s 1

y 2 stile zyg aibelyite etic 2 5 ygy ial 3 2 omo 7 5 1 z 4 czecin

6 [Ferragina-Grossi, J.ACM 1999] A new search 0 Search(P): Phase 1: tree navigation Phase 2: Compute LCP Phase 3: tree navigation Three-phase search: P = syzyyea s y 1 2 s z z g < ya Ps position 5 Only 1 string is checked e y

2 o 0 1 2 5 c Trie Space #strings, NOT their i length .systile syzygetic syzygial syzygy szaibelyite szczecin szomo. [Ferragina-Grossi, J.ACM 1999] > 15 US-patents cite it !! The String B-tree [Handbook of Comp. Biology, 2009 + Search(P)

O((p/B) logB K) I/Os O(occ/B) I/Os 1 string checked : O(p/B) PT 29 13 20 18 3 O(logB K) levels 23 It is dynamic... PT 29 2 PT 26 13 PT 29 1

9 20 25 PT 5 2 26 10 4 PT 6 PT 7 13 20 16 Lexicographic position of P 28 18 3 14 PT

8 25 6 12 15 22 18 21 23 PT 3 27 24 11 PT 14 21 17 23 Knuth, vol 3, pag. 489: elegant I/O-aware algorithms & data structures

I/Os was the main concern [CACM 1988] [2006] Huge literature !! Timeline: theory and practice... Not just 2 memory levels g e 99 19 t re B Space CPU registers

Parameter-free solutions n tri 95 S g 19 le 2- 9 0 -8 0 7 0 6 0 ie uffi r T S

e e d e in Tr l x ve n xi L1 Cache Anywhere, anytime, anyway... I/O- L2 RAM HD net Cache-oblivious Algo. and Data Str. See chap by Arge, Brodal, Fagerberg Some precious achievements... Cache-oblivious trie

Static dictionary of strings [Brodal et al, SODA 2006] Cache-oblivious String B-tree Dynamic dictionary of strings [Bender et al, PODS 2006] Cache-oblivious tree mapping Patricia Trie Split-and-Refine that applies to any B-fixed tree partitioning [Alstrup et al, manuscript 2003] Worst-case solution [Demaine et al, manuscript 2004] Timeline: theory and practice... Not just 2 memory levels g Compressed data structures 99 g B-

Cache-oblivious data structures e e r t 19 n tri 95 S 19 le 2- 9 0 -8 0 7 0 6 0

ie uffi r T S e e d e in Tr l x ve n xi Space A challenging question [Ken Church, AT&T 1995] Soft. Eng. use many squeezing heuristics that compress data and still support fast access to them Can we automate and guarantee the process

Aka: Compressed self-indexes Opportunistic Data Structures with Applications P. Ferragina, G. Manzini ...now, J.ACM 2005 Space for text + (full-text) index compressed text ( Hk) [Burrows-Wheeler, 1994] The big (unconscious) step... Let us given a text T = mississippi# mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss ssippi#missi sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi Sort the rows #

i i i i m p p s s s s mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missi issippi#mi sippi#miss sissippi#m i p s s m # p

i s s i i Can we compress it ? [Burrows-Wheeler, 1994] The big (unconscious) step... bwt(T) Let us given a text T = mississippi# mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss ssippi#missi sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi Sort the rows # i i

i i m p p s s s s mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missi issippi#mi sippi#miss sissippi#m i p s s m # p i s

s i i T bzip2 = BWT + other simple compressors [Burrows-Wheeler, 1994] The big (unconscious) step... bwt(T) Let us given a text T = mississippi# mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss ssippi#missi sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi Sort the rows Suffix Array #

i i i i m p p s s s s mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missi issippi#mi sippi#miss sissippi#m i p s s m # p

i s s i i T bzip2 = BWT + other simple compressors From practice to theory... [Ferra gina-M a bwt(T) # i i i i m p p s s s s mississipp #mississip ppi#missis

ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missi issippi#mi sippi#miss sissippi#m i p s s m # p i s s i i nzini, Focs 0 0] FM-index = BWT is searchable ...or Suffix Array is compressible

Space = |T| Hk + o(|T|) bits Search(P) = O(p + occ * polylog(|T|)) Nowadays tons of papers: theory & experiment [Navarro-Makinen, ACM Comp. Surveys 2007] Compressed & Searchable data formats Texts Trees Graphs SODA 2002 FOCS 2008 STACS 2009 SODA 2002 SODA 2007 ICALP 2007 SWAT 2008 ICALP 2009 SODA 2010 DCC 2001 WWW 2004 ISAAC 2007 ESA 2008

FOCS 2009 Functions Point Sets ICALP 2003, 04 SODA 2004 ICALP 2008 ESA 2009 LATIN 2010 SODA 2003 TALG 2007 WADS 2009 SODA 2009 FOCS 2000 SODA 2003, 04 SODA 2007 SPIRE 2007 CPM 2008 CPM 2010 ICALP 2010 Integer Sets Labeled Trees SODA 2002 FOCS 2005 WWW 2006 SODA 2007

ICDE 2010 Images DCC 2008 [December 2003] [January 2005] ACM J. on Experimental Algorithmics, 2009 > 103 faster than Smith-W. >102 faster than SOAP & Maq What about the Web ? [Ferragina-Manzini, ACM WSDM 2010] Where we are nowadays g e e r t 99 g B-

19 n tri 95 S 19 le 2- 9 0 -8 0 7 0 6 0 ie uffi r T S e e

d e in Tr l x ve n xi Cache-oblivious data structures Compressed data structures Something is known... yet very preliminary [PODS 08, Navarro, Vitter, ...] Bellazougui et al, this ESA What else... [E. Gal, S. Toledo. ACM Comp. Surv., 2005] [Ajwani et al, WEA 2009] Solid-state disks: no mechanical parts ... very fast reads, but slow writes & wear leveling Self-adjusting or Weighted design Time ops depend on some (un/known) distribution Challenging : no pointers, self-adjust (perf) vs compression (space)

A bigger challenge: from micro to macro ! IEEE Computer, 2007 Approach #1 (engineering oriented) News: Proper system components + specific algorithms Sanders & Meyers groups, IEEE Conf. on Green Comp. 2010 [SSDisks + Atom + Approach #2 (Manage resources) Goal: Develop on-line algorithms that dynamically manage power by trading off performance, energy and reliability Susanne Albers, Comm. ACM 2010 Approach #3 (models and algorithms) IEEE Computer, 2009

Algorithmics offers benefits that Wo IEE rksh o Gr E Co p in ee n C nf. o n omfar extend p. 20 10 beyond TCS into the design of systems. Sometimes energy is a primary resource! Energy-aware Algo+Ds ? Memory-level impacts Lo ca lit yp ay so ff

I/Os and compression are obviously important BUT here there is a new twist MIPS per Watt ? Battery life !! Ap pro ac pri Who cares whether your application: ncipl h in a ed 1.is y% slower than optimal, but it is more energy efficient ? way 2.occupies x% more space than optimal, but decompr is faster ? MIPS per Watt ? Idea: Multi-objective optimization in data-structure design Battery life !! Stay tuned: Algorithm Library for Mobile Phones

v Hbase - Hadoop BigTable, 2006 Cosmos HyperTable Cassandra Real-time search Q&A social search Knowledge search Many ingredients Items are graphs, vectors, strings, Number and size are VERY large Involve many resources to be optimized: Time (speed/patience) Space (#disks/management costs) Bandwidth (speed/) Multi-objective optimization Energy ()

in data-structure design! Thats all ! Look at my paper in the

Recently Viewed Presentations

  • CHEST TRAUMA Victor Politi,M.D., FACP Medical Director, SVCMC

    CHEST TRAUMA Victor Politi,M.D., FACP Medical Director, SVCMC

    CHEST TRAUMA Victor Politi,M.D., FACP Medical Director, SVCMC Physician Assistant Program Statistics Each year there are nearly 150,000 accidental deaths in the United States 25% of these deaths are a direct result of thoracic trauma An additional 25% of traumatic...
  • Early Childhood Mental Health Consultation: How Do We Know It ...

    Early Childhood Mental Health Consultation: How Do We Know It ...

    Georgetown University National Technical Assistance Center for Children's Mental Health * Purpose of Today's Call To highlight findings from Georgetown's new study on effective early childhood mental health consultation (ECMHC) To hear from two of the study sites about their...
  • DBQ Muslim vs. Christian Trade - Quia

    DBQ Muslim vs. Christian Trade - Quia

    The letter from mother to son shows how commerce was adverse to Christianity and undermined it more than anything. All of these documents show that Christians disapproved of trade, unlike Muslims who made it a large part of their lives...
  • Kyle - St. Johns County School District

    Kyle - St. Johns County School District

    My oldest son, Kyle Cubbedge, works as a deputy for St. Johns County Sheriff's Office. He received his degree in Criminology from UF and is now a full time police officer. He has already received a heroic award for saving...
  • PowerPoint-presentasjon

    PowerPoint-presentasjon

    Vi kan kallar slike personar «satte», lite endringsvillige, forstokka. I dag er ikkje den typiske vaksne slik, og om han/ ho er det, er det som regel med dårlegsamvit. Slike «satte» vaksne har i dag lav status både i arbeidslivet...
  • Folie 1 - PROJECT CONSULT

    Folie 1 - PROJECT CONSULT

    ECM Enterprise Content Management wird sich nur einführen lassen, wenn der wirtschaftliche Nutzen deutlicher gemacht wird! Nehmen wir einmal an: Ihr Versicherungsunternehmen hat 1000 Mitarbeiter und arbeitet mit Papier, Fachanwendung und E-Mail 1000 Mitarbeiter arbeiten 8 Stunden an 240 Tagen...
  • Natural Selection - Western Oregon University

    Natural Selection - Western Oregon University

    Natural Selection The Darwin-Wallace theory of organic change over time. Thinking Question Some evidence exists that over several centuries, the number of people born with small wisdom teeth or no wisdom teeth has increased. Using your best understanding of Natural...
  • Military Health Institute Briefing to the Council of

    Military Health Institute Briefing to the Council of

    The University of Texas Health Science Center at San Antonio and the VA National Center for PTSD lead the consortium. The Consortium to Alleviate PTSD (CAP) will provide an array of cutting-edge clinical treatment trials and biological studies for active...