WHITE PAPER
Data Modeling in Apache Cassandra
Five Steps to an Awesome Data Model

CONTENTS

Introduction
Data Modeling in Cassandra
    Why data modeling is critical
    Differences between Cassandra and relational databases
    How Cassandra stores data
    High-level goals for a Cassandra data model
Five Steps to an Awesome Data Model
    A hypothetical application
    Step 1: Build the application workflow
    Step 2: Model the queries required by the application
    Step 3: Create the tables
        Chebotko diagrams
        Denormalization is expected
    Step 4: Get the primary key right
        Creating unique keys
        Be careful with custom naming schemes
    Step 5: Use data types effectively
        Collections in Cassandra
        User-defined data types
Things to Keep in Mind
    Thinking in relational terms can cause problems
    Secondary indexes
    Materialized views
    Using batches effectively
Learning More
Summary
References
About DataStax

INTRODUCTION

For web-scale applications, Apache Cassandra is a favorite choice among architects and developers. It offers many advantages including performance, scalability, continuous availability, geographic distribution, and ease of management. Today, Cassandra is among the most successful NoSQL databases. It is used in countless applications, from online retail to internet portals to time-series databases to mobile application backends.

While Cassandra is powerful and easy to use, having a well-designed data model is essential to meeting application performance and scalability goals. In this paper, aimed at technical people experienced with relational databases, we discuss five useful steps to realizing a high-quality data model for your Cassandra application.

DATA MODELING IN CASSANDRA

Why data modeling is critical

Having the correct data model is important for any application, but it becomes especially critical as applications scale. An application may work with thousands of records and a few hundred concurrent users, but what happens when record and user counts are in the millions or billions?

Regardless of the database, if the data model isn’t right, or doesn’t suit the underlying database architecture, users can experience poor performance, downtime, and even data loss or data corruption. Fixing a poorly designed data model after an application is in production is an experience that nobody wants to go through. It’s better to take some time upfront and use a proven methodology to design a data model that will be scalable, extensible, and maintainable over the application lifecycle.

Differences between Cassandra and relational databases

Most readers will be familiar with relational databases such as Oracle, MySQL, and PostgreSQL. Before jumping into data modeling with Cassandra, it’s helpful to explain a few differences between Cassandra and relational databases.

/ In Cassandra, denormalization is expected – With relational databases, designers are usually encouraged to store data in a normalized[i] form to save disk space and ensure data integrity. In Cassandra, storing the same data redundantly in multiple tables is not only expected, but it also tends to be a feature of a good data model.

/ In Cassandra, writes are (almost) free – Owing to Cassandra’s architecture, writes are shockingly fast compared to relational databases. Write latency may be in the hundreds of microseconds, and large production clusters can support millions of writes per second.[ii]

/ No joins – Relational database users assume that they can reference fields from multiple tables in a single query, joining tables on the fly. With Cassandra, this functionality doesn’t exist, so developers need to think about how they will structure their data model to provide equivalent functionality.

/ Developers need to think about consistency – Relational databases are ACID compliant (Atomicity, Consistency, Isolation, Durability), characteristics that help guarantee data validity with multiple reads and writes. Cassandra supports the notion of tunable consistency, balancing between Consistency, Availability, and Partition Tolerance (CAP).[iii] Cassandra gives developers the flexibility to manage trade-offs between data consistency, availability, and application performance when formulating queries.

/ Indexing – In a relational database, queries can be optimized by simply creating an index on a field. While secondary indexes exist in Cassandra, they are not a “silver bullet” as they are in a relational database management system (RDBMS). In Cassandra, tables are usually designed to support specific queries, and secondary indexes are useful only in specific circumstances.

How Cassandra stores data

Understanding how Cassandra stores data is essential to developing a good data model. Readers wishing to get a better understanding of Cassandra’s internal architecture can read the DataStax Apache Cassandra Architecture whitepaper available at che-cassandra-architecture.

Cassandra clusters have multiple nodes running in local data centers or public clouds. Data is typically stored redundantly across nodes according to a configurable replication factor so that the database continues to operate even when nodes are down or unreachable.

Tables in Cassandra are much like RDBMS tables. Physical records in the table are spread across the cluster at a location determined by a partition key. The partition key is hashed to a 64-bit token that identifies the Cassandra node where data and replicas are stored.[iv] The Cassandra cluster is conceptually represented as a ring, as shown in Figure 1, where each cluster node is responsible for storing tokens in a range.

Queries that look up records based on the partition key are extremely fast because Cassandra can immediately determine the host holding the required data using the partitioning function. Since clusters can potentially have hundreds or even thousands of nodes, Cassandra can handle many simultaneous queries because queries and data are distributed across cluster nodes.

Figure 1 – How Cassandra stores data

Partition keys can be single columns or can be composed of multiple columns. Cassandra also supports clustering columns (discussed shortly) that control how data records are grouped and organized within each partition. Records in Cassandra are stored as lists of key-value pairs where the column name is the key.
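The token-ring lookup described in this section can be illustrated with a small sketch. It is not Cassandra's implementation: the `TokenRing` class, the node names, and the node tokens are hypothetical, a truncated MD5 digest stands in for the cluster's real partitioner so that the example runs with only the Python standard library, and replicas are assumed to be placed on the next nodes clockwise around the ring.

```python
import bisect
import hashlib
from typing import Dict, List


def token_for(partition_key: str) -> int:
    """Hash a partition key to a signed 64-bit token.

    Assumption: the first 8 bytes of an MD5 digest stand in for the
    cluster's real partitioner; only the 64-bit token idea is the point.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class TokenRing:
    """Hypothetical ring: each node owns the token range ending at its token."""

    def __init__(self, node_tokens: Dict[str, int], replication_factor: int = 3):
        # Sort nodes by token so the ring can be searched with bisect.
        self._ring = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._ring]
        self.replication_factor = min(replication_factor, len(self._ring))

    def replicas_for(self, partition_key: str) -> List[str]:
        """Return the owning node followed by its clockwise successors."""
        token = token_for(partition_key)
        # First node whose token is >= the key's token; wrap past the end.
        start = bisect.bisect_left(self._tokens, token) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]


if __name__ == "__main__":
    ring = TokenRing(
        {"node-a": -2**62, "node-b": 0, "node-c": 2**62, "node-d": 2**63 - 1},
        replication_factor=3,
    )
    for key in ("alice@example.com", "bob@example.com"):
        print(key, token_for(key), ring.replicas_for(key))
```

Because the node list is found by hashing the key and searching a sorted list, no other node needs to be consulted to locate the data, which is why lookups by partition key are fast.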

High-level goals for a Cassandra data model

/ Spread data evenly around the cluster – For Cassandra to work optimally, data should be spread as evenly as possible across cluster nodes. Distributing data evenly depends on selecting a good partition key.

/ Minimize the number of partitions to read – When Cassandra reads data, it’s best to read from as few partitions as possible since each partition potentially resides on a different cluster node. If a query involves multiple partitions, the coordinator node responsible for the query needs to interact with many nodes, thereby reducing performance.

/ Anticipate how data will grow, and think about potential bottlenecks in advance – A particular data model might make sense when you have a few hundred transactions per user, but what would happen to performance if there were millions of transactions per user? It’s a good idea to always “think big” when building data models for Cassandra and avoid knowingly introducing bottlenecks.

FIVE STEPS TO AN AWESOME DATA MODEL

A hypothetical application

To illustrate how to build Cassandra applications, DataStax offers a reference application called KillrVideo,[v] a fictitious company operating a video service like YouTube.

Figure 2 – KillrVideo: Sample Cassandra Application

When discussing data modeling, it’s helpful to focus on a concrete example. In the sections that follow, data modeling will be discussed in the context of the KillrVideo application.

In KillrVideo, users perform activities such as creating profiles, submitting, tagging, and sharing videos, searching for and retrieving videos submitted by others, viewing them, rating them, and commenting on them. KillrVideo serves as a useful example because it needs to operate at internet scale, supporting millions of users and transaction volumes that would be all but impossible to achieve using a relational database.

Step 1: Build the application workflow

When building applications using relational databases, developers often start with the data model, thinking about the data items that need to be stored and how they relate to one another. With Cassandra, just the opposite is recommended. The best practice is to start with the application workflow, an approach referred to as “query-first design.”

Before thinking about how data will be stored, designers need to know what types of queries the database will need to support. Figure 3 presents a simplified application workflow for KillrVideo.

[Figure 3 – Simplified application workflow. Steps shown: user logs into the site; show basic information about the user; show videos added by a user; show comments posted by a user; show video ratings by a user; search for videos by a tag; show latest videos added to the site; show a video and its details; show comments for a video; show ratings for a video.]

The sequence of workflow steps matters because it helps us determine what data is available and required for each query. For example, before we can show basic information about a user (step 2 above), a userid is required. The user first needs to log in to the site (step 1), supplying an email address and password in exchange for the required userid.

A userid might also be obtained by finding a video (step 6 or 7), showing the comments for that video (step 9), and looking up details about the user who commented. Similarly, before the application can display details about a video (step 8), the application needs a videoid obtained by selecting from a list of the latest videos (step 7) or by searching videos by tag (step 6).
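One way to make these dependencies explicit is to record, for each workflow step, which identifiers it needs and which it yields, and then check that a path through the workflow never asks for data it does not yet have. The sketch below is only an illustration: the `Step` and `check_path` names are hypothetical, only the steps whose numbers are stated in the text (1, 2, 6, 7, 8, and 9) are included, and the `requires` and `provides` sets are one reading of the text rather than part of KillrVideo itself.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Step:
    number: int
    name: str
    requires: FrozenSet[str]  # data the query needs as input
    provides: FrozenSet[str]  # identifiers the step makes available


# Assumed inputs and outputs, based on the workflow description in the text.
WORKFLOW: Dict[int, Step] = {
    1: Step(1, "User logs into the site",
            frozenset({"email", "password"}), frozenset({"userid"})),
    2: Step(2, "Show basic information about the user",
            frozenset({"userid"}), frozenset()),
    6: Step(6, "Search for videos by a tag",
            frozenset({"tag"}), frozenset({"videoid"})),
    7: Step(7, "Show latest videos added to the site",
            frozenset(), frozenset({"videoid"})),
    8: Step(8, "Show a video and its details",
            frozenset({"videoid"}), frozenset()),
    9: Step(9, "Show comments for a video",
            frozenset({"videoid"}), frozenset({"userid"})),
}


def check_path(path: Iterable[int], initial: Iterable[str] = ()) -> Set[str]:
    """Walk a sequence of steps and fail if a step lacks a required input."""
    available: Set[str] = set(initial)
    for number in path:
        step = WORKFLOW[number]
        missing = step.requires - available
        if missing:
            raise ValueError(
                f"step {number} ({step.name}) is missing {sorted(missing)}"
            )
        available |= step.provides
    return available


if __name__ == "__main__":
    # Log in, then show the user's profile.
    print(check_path([1, 2], initial={"email", "password"}))
    # Browse latest videos, read comments, then look up a commenter.
    print(check_path([7, 9, 2]))
```

Calling `check_path([2])` with no initial data raises a `ValueError`, which mirrors the point that a userid must be obtained before user details can be shown.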

Step 2: Model the queries required by the application

Even at the design stage, developers can think through the sequence of tasks required, mock up what each screen will look like, and decide what data will be required at each stage.

Figure 4 shows a simplified entity relationship diagram (ERD) for the KillrVideo application. The application needs to be able to keep track of entities such as users, videos, and comments. Users can perform activities such as adding videos, rating videos, and posting comments. Users can comment on multiple videos, and each video can have multiple associated user comments, but each video has only one owner.

Figure 4 – Entity relationship diagram (ERD) for KillrVideo

It’s a good idea to iterate between the application workflow and ERD, updating both as new data items and relationships required by the application are identified. Once developers have a clear idea of the application workflow and the key data objects required, it’s possible to start identifying the queries that the application needs to support. A diagram showing key queries and how they relate to data domains is shown in Figure 5.

Users
    User logs into the site: Find a user by an email
    Show basic information about a user: Find a user by an id

Comments
    Show comments posted by a user: Find comments by a user (latest first)
    Show comments for a video: Find comments by a video (latest first)

Ratings
    Show video ratings by a user: Find ratings by a user (latest first)
    Show ratings for a video: Find ratings by a video (latest first)

Videos
    Show videos added by a user: Find videos by a user (latest first)
    Search for videos by a tag: Find videos by a tag
    Show latest videos added to the site: Find videos by a date added (latest first)
    Show a video and its details: Find a video by an id

Figure 5 – Identify the queries required to support the application workflow

Step 3: Create the tables

The next step is to think about how tables should be structured to support que