Transcription

Data Warehousing
Data Warehouse Design & Multi-dimensional Models
FS 2020
Dr. Andreas Geppert ([email protected])

Andreas Geppert, FS 2020, Slide 1

Outline of the Course
– Introduction
– DWH Architecture
– DWH Design and Multi-dimensional Data Models
– Extract, Transform, Load (ETL)
– Metadata
– Data Quality
– Analytic Applications and Business Intelligence
– Implementation and Performance
– (Security and Privacy)

Content
1. Application Development in Data Warehouses
2. Schema Design for Data Warehouses
3. Multi-dimensional Data Models
4. Logical Design for Data Marts
5. Appendix: Conceptual Design for Data Marts

Motivation
– complex facts are represented in the DWH
– the DWH supports integration and analysis
  – databases must be modeled accordingly
  – without complete and adequate data models, the DWH will not be a success
– analytic databases (data marts) use multi-dimensional concepts
  – how do you systematically design these databases?
  – what is the analogue to the Entity Relationship Model and to the ER → RM mapping?
– notations and conceptual (meta) data models
– modeling approaches

Reference Architecture V4
[Architecture diagram: data flows from sources through Enrichment; Integration and Historization; Reusable Selection, Aggregation, Calculation; to Reporting, OLAP, Data Mining via Web/App Servers and GUI applications]
– build a "seed" integration layer containing the most-needed data, and integrate the most important sources

DWH Application Development
– start with the requirements of the analysis
[Layer diagram: Landing Zone → Staging Area → Integration, Historization (DWH Data Universe) → Reusable Measures & Dimensions → Subject Matter Areas → Reporting, OLAP, Data Mining (Web/App Servers, GUI); plus Metadata. Legend: database, file, logic; extract/transform/load logic; logic (no ETL); data flow]

General Approach & Data Model Hierarchy
– modeling-driven approach
– model/specify on an abstract level and derive (or even generate) lower-level constructs
– distinguish conceptual, logical, and physical models
– the same hierarchy is applied to ETL processes / mappings
[Diagram: Analysis → Requirements → Conceptual design → Conceptual model → Logical design → Logical model → Physical design → Physical model]

Data Models: Conceptual
The Conceptual Data Model serves the following purposes:
– Unambiguously represent business information structures and rules, enabling communication of this understanding to the entire development team
– Provide an implementation-independent set of requirements as input to the logical data model, and to the physical data model
– Clearly and uniquely identify all business entities in the system
Note that the conceptual data model should not be considered an intermediate design document to be disregarded after logical and physical design; rather, it should remain part of the database specifications, organized with a variety of documents that also describe in detail the requirements acquisition and design process.
Finally, one of the most important advantages of conceptual design shows up during the operation of the database, when the conceptual model and its documentation ease the understanding of data schemas and of the applications that use them, and thus facilitate their transformation and maintenance.

Data Models: Logical
The Logical Data Model (LDM) is a database-near data model that hides details of data storage and DBMS-specific idiosyncrasies, but can nevertheless be implemented straightforwardly on a computer system.
Its main purpose is to ensure a proper mapping from a high-level conceptual data model (e.g., an Entity Relationship Model) that focuses exclusively on business entities and their relationships, to the (principal) schema constructs used by a class of DBMSs (e.g., relational DBMSs). In other words, logical design is conducted in the same way for all relational DBMSs (e.g., Oracle, DB2, etc.) because they all implement the relational data model. As a consequence, a specific relational logical model can be used "as is" to design the physical data model for DB2, Oracle, SQL Server, etc., whereas it cannot be used to design the physical data model for an IMS system.

Data Models: Physical
The purpose of Physical Database Design is to ensure that database transactions and queries meet performance requirements while reflecting the semantics of the Logical Data Model.
While the Logical Data Model contains the information requirements of the system in a normalized form, a direct implementation of the model is unlikely to meet performance requirements.
Physical Database Design takes into account data and transaction volumes as well as typical queries to produce a schema and environment that will meet the necessary performance requirements.

Application Development Processes: Big Picture
[Diagram of three application development processes:
– analysis application development: analyze information requirements, determine reuse potential, derive data requirements, determine technology, model mappings, model data mart, design logical schema, implement data mart
– enrichment application development: analyze information requirements, derive data requirements, model RMDA, model mappings, design logical schema, implement logical mappings
– integration application development: analyze data sources, design logical schema, define incoming interfaces, extend staging area, implement SMA, define outgoing interfaces, implement ETL process, define quality reports]

Content
1. Application Development in Data Warehouses
2. Schema Design for Data Warehouses
3. Multi-dimensional Data Models
4. Logical Design for Data Marts
5. Appendix: Conceptual Design for Data Marts

Schema Design for Data Warehouses
– database design must support the goals of the data warehouse
– integration
  – schema integration
  – target of ETL processes
– historization
– data quality
– granularity
– in most architectural styles (hub-and-spoke, etc.) the DWH is implemented relationally on the logical level (3NF)
– the ER Model can/should be used for conceptual modeling

Schema Design for Data Warehouses
– requirements for the data warehouse result from the requirements analysis of the analysis/reporting application
– are the data that an application under construction needs already in the data warehouse (integration layer), so that they can be sourced from there?
  – if not, those data have to be sourced from one or more data sources / operational systems
– it is absolutely crucial to maintain accurate, timely, and complete information about the data in the data warehouse
  – data in the DWH must be modeled (→ conceptual schemas!)
  – the semantics of the data must be understood
  – data ownership must be defined

Integration
– the data warehouse integrates data from different sources
  – different aspects of the same business entities are managed in different business processes with disjoint databases
  – different business processes over the same business entity operate on disjoint databases
  – the same business process is implemented by multiple applications (e.g., because of mergers and acquisitions), e.g., multiple CRM systems
– "vertical" integration: integrate attributes from different sources into the same entity
– "horizontal" integration: integrate entities from different sources into the same entity collection (logically: relation)

Key Generation and Integration
– different sources usually maintain different kinds of primary keys
  – different types of keys (e.g., for securities)
  – overlapping sets of key values (no globally unique identifiers)
– business keys vs. technical keys
– use artificial/technical keys in the data warehouse ("surrogate keys")
  – map business keys / source keys onto surrogates
  – maintain business keys as attributes
  – surrogates are not visible to users, and applications should not rely on them
[Diagram: mapping of business keys onto surrogates — e.g., CSN (Valorennr.), Apple Inc. (CUSIP), BEA Systems (ISIN) mapped into a table with columns Surrogate | VNr | CUSIP | ISIN]
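The surrogate-key mapping described above can be sketched as follows. This is a minimal, illustrative implementation (class and method names are invented, and the key values are only examples), showing how business keys of different types from different sources are folded onto one technical surrogate while the business keys are kept as attributes:

```python
class SurrogateKeyMap:
    """Assigns one surrogate per business entity and remembers every
    business key (Valorennr, CUSIP, ISIN, ...) that maps onto it."""

    def __init__(self):
        self._next = 1
        self._by_business_key = {}   # (key_type, value) -> surrogate
        self._business_keys = {}     # surrogate -> {key_type: value}

    def register(self, key_type, value, same_as=None):
        """Map a business key onto a surrogate. If `same_as` names an
        already-known business key, reuse its surrogate (integration)."""
        if (key_type, value) in self._by_business_key:
            return self._by_business_key[(key_type, value)]
        if same_as is not None and same_as in self._by_business_key:
            sk = self._by_business_key[same_as]
        else:
            sk = self._next
            self._next += 1
            self._business_keys[sk] = {}
        self._by_business_key[(key_type, value)] = sk
        # keep the business key as a plain attribute of the entity
        self._business_keys[sk][key_type] = value
        return sk

keys = SurrogateKeyMap()
sk_apple = keys.register("CUSIP", "037833100")          # value is illustrative
# the same security arrives from a second source under another key type:
sk_again = keys.register("Valorennr", "908440", same_as=("CUSIP", "037833100"))
```

Applications downstream would join on the surrogate only; the business keys stay queryable as ordinary attributes.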

Historization
– the DWH must represent the historical evolution of objects
– differing states of objects on the timeline
– bi-temporal notion of time (temporal databases)
– Validity time: interval during which an object has been in a specific state (e.g., during which an attribute has had a certain value)
– Transaction time: point in time when the state of an object changed (e.g., an attribute was modified); could also be an interval

Historization (2)
– Customer Calvin has been living in Basel for a long time.
– On April 1, he moved to Bern. He announced his move one week in advance.
  – This change is reflected by an update (closing the validity interval of the old address) and an insert (containing the new address with an infinite valid-until date).
– On November 11, we learned that Calvin was deleted from the customer database as per November 1.
  – This change is reflected by an update (update of the transaction time and closing of the validity interval).

Customer | Name   | Address | TX Time    | Valid From | Valid Until
12345    | Calvin | Basel   | 2004-04-04 | 2004-04-01 | 2019-04-01 (was 9999-12-31)
12345    | Calvin | Bern    | 2019-11-11 | 2019-04-01 | 2019-11-01 (was 9999-12-31)
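Calvin's history above can be replayed as a small sketch. This is an illustrative in-memory model (function names and the move announcement date are assumptions, not from the slides); the point is the two operations: an announced move closes the open validity interval and inserts a new open-ended row, while a late-arriving deletion only updates the transaction time and closes the interval:

```python
from datetime import date

OPEN_END = date(9999, 12, 31)   # "infinite" valid-until date

rows = [dict(customer=12345, name="Calvin", address="Basel",
             tx_time=date(2004, 4, 4), valid_from=date(2004, 4, 1),
             valid_until=OPEN_END)]

def current_row(rows, customer):
    return next(r for r in rows
                if r["customer"] == customer and r["valid_until"] == OPEN_END)

def move(rows, customer, new_address, tx_time, valid_from):
    """Announced move: update (close the old validity interval) + insert
    (new address with an open-ended interval)."""
    old = current_row(rows, customer)
    old["valid_until"] = valid_from
    old["tx_time"] = tx_time
    rows.append(dict(old, address=new_address, tx_time=tx_time,
                     valid_from=valid_from, valid_until=OPEN_END))

def delete_customer(rows, customer, tx_time, as_of):
    """Late-arriving deletion: update transaction time and close the
    validity interval; nothing is physically deleted."""
    cur = current_row(rows, customer)
    cur["tx_time"] = tx_time
    cur["valid_until"] = as_of

# announced one week in advance (assumed announcement date), effective April 1
move(rows, 12345, "Bern", tx_time=date(2019, 3, 25), valid_from=date(2019, 4, 1))
# learned on Nov 11 that the customer was deleted as per Nov 1
delete_customer(rows, 12345, tx_time=date(2019, 11, 11), as_of=date(2019, 11, 1))
```

After both operations the table holds the closed Basel interval and the Bern interval closed per November 1, matching the narrative above.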

Modeling of "Time"
– date and time are important properties in many applications
– in addition to date and time, further properties are important, depending on the application (e.g., holidays)
– different calendars exist (we normally use the Gregorian calendar)
– in addition to the calendar year, other notions of "year" are common (e.g., the fiscal year)
– in many cases, the explicit modeling of date and time is recommended over simply using the database system's calendar
[Diagram: Date dimension with Day, Month, Year and Fiscal Month, Fiscal Year attributes]
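Explicitly modeling the time dimension, as recommended above, typically means materializing one row per calendar date with all derived attributes. A minimal sketch (the fiscal-year start in July and the attribute names are assumptions for illustration; a real date dimension would also carry holidays, week numbers, etc.):

```python
from datetime import date, timedelta

def date_dimension(start, end, fiscal_start_month=7):
    """Build explicit date-dimension rows (one per day) instead of relying
    on the DBMS calendar. Fiscal year assumed to start in July."""
    rows = []
    d = start
    while d <= end:
        fiscal_year = d.year + 1 if d.month >= fiscal_start_month else d.year
        rows.append({
            "date": d,
            "day": d.day,
            "month": d.month,
            "year": d.year,
            "fiscal_month": (d.month - fiscal_start_month) % 12 + 1,
            "fiscal_year": fiscal_year,
            "is_weekend": d.weekday() >= 5,  # holidays would be flagged similarly
        })
        d += timedelta(days=1)
    return rows

# four days spanning a fiscal-year boundary
dim = date_dimension(date(2020, 6, 29), date(2020, 7, 2))
```

June 2020 falls into fiscal month 12 of fiscal year 2020, while July 1 starts fiscal month 1 of fiscal year 2021 under this assumed convention.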

Data Vault Modeling
– the Data Vault technique was introduced in the 1990s
– today it is used in many DWH projects
– previous techniques (3NF-based data models) have issues with changing sources; Data Vault modeling has been designed to cope better with such changes
– the main Data Vault components:
  – Hubs
  – Link Tables
  – Satellites

Data Vault Main Elements: Hub Entities
– represent an entity in the subject area of interest
– carry at a minimum a unique list of business keys
  – invoice number, customer number, employee PID, …
– in case the business uses multiple business keys, the hub contains multiple rows (one for each business key)
– surrogate key: optional (see surrogates above)
– load date timestamp: records when the key was first loaded into the data warehouse
– record source: reference to the source system where the business key came from, for traceability reasons

Data Vault Main Elements: Link Entities
– represent the relationship between two or more business components
– hub keys represent the relationship between the hubs
– surrogate key: optional component
– load date timestamp: indicates when the relationship was first created in the warehouse
– record source: indicates from which data source the relationship was loaded; for traceability reasons

Data Vault Main Elements: Satellite Entities
– carry context (descriptive) information for a hub or link key
– satellite data are subject to change over time; therefore the structure must be capable of storing new or altered data
– satellite primary key: hub or link primary key
– load date timestamp: indicates when the context information was first created in the warehouse
– sequence surrogate number: optional; useful for satellites that have multiple values
– record source: indicates from which data source the satellite was loaded; for traceability reasons
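The three element types just described can be sketched as plain record structures. This is an illustrative sketch only (field and value names are invented, not a normative Data Vault schema); note how a change in context produces a new satellite row rather than an update, which is how Data Vault absorbs changing sources:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(frozen=True)
class Hub:
    business_key: str        # e.g. customer number, invoice number
    load_dts: datetime       # when the key was first loaded
    record_source: str       # traceability

@dataclass(frozen=True)
class Link:
    hub_keys: tuple          # keys of the hubs this relationship connects
    load_dts: datetime       # when the relationship was first created
    record_source: str

@dataclass
class Satellite:
    parent_key: str          # hub or link primary key
    load_dts: datetime       # when this context version was created
    record_source: str
    attributes: dict = field(default_factory=dict)  # descriptive context

customer = Hub("C-4711", datetime(2020, 1, 1), "CRM")
invoice = Hub("INV-1", datetime(2020, 1, 2), "Billing")
placed = Link((customer.business_key, invoice.business_key),
              datetime(2020, 1, 2), "Billing")

address_v1 = Satellite("C-4711", datetime(2020, 1, 1), "CRM", {"city": "Basel"})
# context changed: a new satellite row is appended, the old one is kept
address_v2 = Satellite("C-4711", datetime(2020, 4, 1), "CRM", {"city": "Bern"})
```

Adding a new source or a new descriptive attribute touches only satellites (or adds a new satellite), leaving hubs and links stable.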

Data Vault Example
[Diagram omitted; source: ics/]

Content
1. Application Development in Data Warehouses
2. Schema Design for Data Warehouses
3. Multi-dimensional Data Models
4. Logical Design for Data Marts
5. Appendix: Conceptual Design for Data Marts

Multi-dimensional Data Models
– "classical" relations:
  – one-dimensional (not in the mathematical sense)
  – a relation maps a key onto attributes
– however, in many cases in data warehousing one is interested in multiple perspectives ("dimensions")
  – example: sales by product, time, region, customer, store, manager/employee
– this cannot be represented with normal relations
  → multi-dimensional data models
  → multi-dimensional database systems

Comparison: Relational vs. Multi-dimensional Data Models

Relational model:
– simple, little semantics
– application-neutral
– less ostensive modeling (→ ERM)
– standardized

Multi-dimensional data models:
– more complex, more semantics
– well-suited only for specific applications
– ostensive modeling, user-friendly
– caution: "the" multi-dimensional data model does not exist
  – no uniform query language
  – no standards
  – no uniform formalization

Content
1. Application Development in Data Warehouses
2. Schema Design for Data Warehouses
3. Multi-dimensional Data Models
   – Dimensions, Measures, Facts, Cubes
   – Operators
4. Logical Design for Data Marts
5. Appendix: Conceptual Design for Data Marts

Multi-dimensional Data: Cubes
– multi-dimensional data are seen and represented as data cubes (more precisely: hypercubes)
– distinction into
  – qualifying information: identifies a cell or sub-cube
  – quantifying information: the content of a cell
[Diagram: a cube with dimensions such as Product and Region; the cells hold quantifying values]
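The distinction between qualifying and quantifying information can be made concrete with a toy cube: qualifying coordinates (one member per dimension) address a cell, and the cell holds the quantifying value. All dimension members and numbers below are made up for illustration:

```python
dimensions = {
    "product": ["A", "B"],
    "region": ["North", "South"],
    "quarter": ["Q1", "Q2"],
}

# cube: qualifying coordinates (product, region, quarter) -> quantifying value
sales = {
    ("A", "North", "Q1"): 28, ("A", "North", "Q2"): 23,
    ("A", "South", "Q1"): 24, ("A", "South", "Q2"): 9,
    ("B", "North", "Q1"): 4,  ("B", "North", "Q2"): 11,
    ("B", "South", "Q1"): 6,  ("B", "South", "Q2"): 1,
}

def slice_cube(cube, axis, value, dims=dimensions):
    """Fix one dimension to a single member: the qualifying information
    selects a sub-cube with one dimension fewer."""
    i = list(dims).index(axis)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}

q1 = slice_cube(sales, "quarter", "Q1")   # 2-D sub-cube: product x region
total_q1 = sum(q1.values())               # aggregate the quantifying values
```

Slicing by one qualifying value yields a sub-cube, which is exactly the "identifies a cell or sub-cube" role of qualifying information noted above.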

Multi-dimensional Data: Representation

Mult