Apache Cassandra Blog

A Practical Introduction To Cassandra Query Language

Cassandra Query Language (CQL) Tutorial is an updated version of this tutorial. It is relevant to Apache Cassandra version 3.x and above. Continue with this tutorial if you are using Apache Cassandra version 2.x.x.

In two earlier posts, An Introduction To Apache Cassandra and Apache Cassandra Architecture, I provided a theoretical overview of Cassandra. In this post I aim to introduce CQL and the key underlying storage structures using a practical approach. This will allow readers to get a good understanding of CQL fundamentals.

In order to follow along please install Apache Cassandra. Five Methods of Installing Apache Cassandra provides you with a number of options. I recommend the single node option for this tutorial.

Cassandra Query Language (CQL)

Cassandra Query Language or CQL is a declarative language that enables users to query Cassandra using a language similar to SQL. CQL was introduced in Cassandra version 0.8 and is now the preferred way of retrieving data from Cassandra. Prior to the introduction of CQL, Thrift, an RPC-based API, was the preferred way of retrieving data from Cassandra. A major advantage of CQL is its similarity to SQL, which helps lower the Cassandra learning curve. CQL is SQL minus the complicated bits. Think of CQL as a simple API over Cassandra’s internal storage structures.

CQL Basics

Let’s understand some basic CQL constructs before we jump into a practical example.

Keyspace – A keyspace is similar to an RDBMS database. It is a container for your application data. Like a database, a keyspace must have a name and a set of associated attributes. Two important attributes that must be set when defining a keyspace are the replication factor and the replication strategy.

Column Families/Tables – A column family/table is similar to an RDBMS table. A keyspace is made up of a number of column families/tables. For the rest of this article, I will use the terms column family and table interchangeably.

Primary Key – A primary key allows users to uniquely identify an “internal row” of data. A primary key is made up of two parts: a row/partition key and a cluster key. The row/partition key determines the node that the data is stored on, while the cluster key determines the sort order of the data within a particular row.

If you are coming from the SQL world, the first thing you will notice is that CQL heavily limits the predicates that can be applied to a query. This is primarily to prevent bad queries and to force the user to think carefully about their data model. The following is a list of things that are frequently used in SQL but are not available in CQL:

  1. No arbitrary WHERE clauses – In CQL your predicate can only contain columns specified in your primary key.
  2. No JOIN construct – There is no way to join data across column families. Joining data between two column families is discouraged and thus there is no JOIN construct in CQL.
  3. No GROUP BY – You cannot group identical data.
  4. No arbitrary ORDER BY clauses – ORDER BY can only be applied to a cluster column.

CQL is fairly simple and won’t take long to learn. The best way to learn CQL is by writing CQL queries. CQL is a very simple way of interacting with Cassandra but can easily be misused if one doesn’t understand the inner workings of the underlying layers. Understanding the underlying structures is the key to mastering CQL.

Practical CQL

Let’s begin by starting our instance of Apache Cassandra. In order to start Apache Cassandra please open a new terminal window and navigate to

$ApacheCassandraInstallDir/bin

and execute

./cassandra -f

This will start Apache Cassandra in the foreground. Now open a new terminal window, navigate to the same directory and execute

./cqlsh

This will open up a cqlsh prompt. Let’s begin by creating a keyspace. As mentioned before, a keyspace is similar to a schema/database in the RDBMS world. To create a keyspace execute the following CQL:

CREATE KEYSPACE animalkeyspace
WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 1 };

Take special note of the “WITH REPLICATION” part of the command. This states that the animal keyspace should use a simple replication strategy and will only have one replica for all data inserted into the keyspace. This is fine for demonstration purposes but is not a practical option for any kind of test or production environment.

Next let’s create a column family. In order to create a column family, you will need to navigate to the animal keyspace with the help of the USE command. The USE command allows a client to connect to a particular keyspace, i.e. all further CQL commands will be executed in the context of the chosen keyspace. Execute the following command at your cqlsh prompt to connect your current client to the animalkeyspace.

use animalkeyspace;

Notice your cqlsh prompt will change from just “cqlsh>” to “cqlsh:animalkeyspace>”, which will visually remind you of the keyspace you are currently connected to.

Now let’s create the column family/table to house monkey related data. To define a table we must use the CREATE TABLE command. Please take special note of the primary key. The primary key is made up of two parts, i.e. the partition/row key and the cluster key. The first column of the primary key is your partition key. The remaining columns are used to determine the cluster key. A composite partition key, a partition key made up of multiple columns, can be defined by using an extra set of parentheses before the clustering columns. The row key helps distribute data across the cluster while the cluster key determines the order of the data stored within a row. Thus when designing a table, think of the row key as a tool used to spread data evenly across a cluster, while the cluster key helps determine the order of that data within a row. Your query patterns will heavily influence the cluster key as it is used to sort data stored within a row. Note the cluster key is optional.

Let’s create the Monkey table by executing the following command in your cqlsh prompt.

CREATE TABLE Monkey (
identifier uuid,
species text,
nickname text,
population int,
PRIMARY KEY ((identifier), species));

In the above table we have chosen identifier as our partition key and species as our cluster key.
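To make the two roles of the primary key concrete, here is a toy Python sketch. It is not how Cassandra places data: a real partitioner assigns token ranges to nodes, whereas this sketch simply hashes the key and takes a modulo. The node names and the `node_for` and `sort_within_partition` helpers are hypothetical and exist only for illustration.

```python
import hashlib
import uuid

# Hypothetical three-node cluster; the names are made up for illustration.
NODES = ["node-a", "node-b", "node-c"]


def node_for(partition_key: uuid.UUID) -> str:
    """Map a partition key to a node by hashing it (toy stand-in for a partitioner)."""
    digest = hashlib.md5(partition_key.bytes).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


def sort_within_partition(rows):
    """Order the rows of one partition by their cluster key (species)."""
    return sorted(rows, key=lambda row: row["species"])


identifier = uuid.UUID("5132b130-ae79-11e4-ab27-0800200c9a66")
rows = [
    {"identifier": identifier, "species": "Small Capuchin monkey"},
    {"identifier": identifier, "species": "Capuchin monkey"},
]

# Every row with the same identifier lands on the same node...
print(node_for(identifier))
# ...and is kept in species order inside that partition.
print([row["species"] for row in sort_within_partition(rows)])
```

The point of the sketch is the division of labour: the partition key alone decides where a row lives, and the cluster key alone decides its position among the rows that share that partition key.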

Let’s insert a row into the above column family using the following insert statement:

INSERT INTO monkey (identifier, species, nickname, population)
VALUES ( 5132b130-ae79-11e4-ab27-0800200c9a66,
'Capuchin monkey', 'cute', 100000);

Now let’s look at what happened as a result of creating the monkey table and inserting a row into it. This will require us to flush data from a memtable to disk, thus creating an SSTable on disk. We’ll use a utility called nodetool to help us flush data to disk. To flush the memtable navigate to

$ApacheCassandraInstallDir/bin

and execute

./nodetool flush animalkeyspace

Next open up your $ApacheCassandraInstallDir/conf/cassandra.yaml file in your favorite editor. Search for the “data_file_directories” property. In my case the directory specified is /home/akhil/cas_data/data. All keyspace and SSTable related data is stored in this directory. An SSTable is made up of many components which are spread across separate files. The directory structure and component files have the following structure:

  • Keyspace
    • Column Family
      • Data.db – This is the base data file for the SSTable. All other SSTable related files can be generated from this file.
      • CompressionInfo.db – Holds information about the uncompressed data length.
      • Filter.db – The serialized bloom filter.
      • Index.db – An index of the row keys with pointers to their position in the data file.
      • Summary.db – SSTable index summary.
      • Statistics.db – Statistical metadata about the content of the SSTable.
      • TOC.txt – A file which contains a list of the files outputted for each SSTable.

SSTable related data is spread across seven files, each of which is named using the following structure:

  • KeyspaceName-ColumnFamilyName-CassandraVersion-UniqueNodeLevelTableNumber-TypeOfFile.db

Thus in my /home/akhil/cas_data/data/animalkeyspace/monkey directory I see the following files:

  • animalkeyspace-monkey-jb-1-TOC.txt
  • animalkeyspace-monkey-jb-1-CompressionInfo.db
  • animalkeyspace-monkey-jb-1-Statistics.db
  • animalkeyspace-monkey-jb-1-Data.db
  • animalkeyspace-monkey-jb-1-Summary.db
  • animalkeyspace-monkey-jb-1-Filter.db
  • animalkeyspace-monkey-jb-1-Index.db
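
The naming structure above can be taken apart with a few lines of Python. This is only a sketch for the file names shown in this post; the `SSTableFileName` type and the `parse_sstable_file_name` helper are hypothetical names, and the code assumes that keyspace and table names contain no hyphens.

```python
from typing import NamedTuple


class SSTableFileName(NamedTuple):
    keyspace: str
    column_family: str
    version: str
    generation: int  # the "unique node level table number"
    component: str   # e.g. "Data.db" or "TOC.txt"


def parse_sstable_file_name(file_name: str) -> SSTableFileName:
    """Split a file name of the form keyspace-table-version-number-component."""
    keyspace, column_family, version, generation, component = file_name.split("-", 4)
    return SSTableFileName(keyspace, column_family, version, int(generation), component)


parsed = parse_sstable_file_name("animalkeyspace-monkey-jb-1-Data.db")
print(parsed.generation, parsed.component)  # prints: 1 Data.db
```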

Now let’s look at the data file. This will give us a better understanding of how data is actually stored on disk. To have a point of comparison, let’s first run a simple SELECT command:

SELECT * FROM monkey;

You should see output similar to the following screenshot. Please note the above query is for demonstration purposes and you will rarely run a query without at least part of the primary key in your predicate.

Now let’s see what the underlying format looks like. sstable2json is a utility that can be used to convert a binary SSTable file into JSON. This is a useful tool that enables users to understand and visualize SSTables.

Let’s convert the data inserted into our Monkey table into JSON. In order to do so navigate to the $ApacheCassandraInstallDir/bin directory and execute the following command.

./sstable2json $YourDataDirectory/data/animalkeyspace/monkey/animalkeyspace-monkey-jb-1-Data.db

On running the above command I get the following output.

[
{
"key": "5132b130ae7911e4ab270800200c9a66", // The row/partition key
"columns": [ // All Cassandra internal columns
[
"Capuchin monkey:", // The cluster key. Note the cluster key does not have any data associated with it. The key and the data are the same.
"",
1423586894518000 // Timestamp which records when this internal column was created.
],
[
"Capuchin monkey:nickname", // Header for the nickname internal column. Note the cluster key is always prefixed for every additional internal column.
"cute", // Actual data
1423586894518000
],
[
"Capuchin monkey:population", // Header for the population internal column
"100000", // Actual data
1423586894518000
] ] } ]

As mentioned in previous articles, one should try to visualize a Cassandra column family as a map of sorted maps, i.e. Map<RowKey, SortedMap<ColumnKey, ColumnValue>>. The data inserted into the Monkey table can be visualized as the map below.

Cassandra Monkey Sorted Map Visualization

Note the partition key 5132b130ae7911e4ab270800200c9a66 is the row key and the key for our outer map. “Capuchin monkey:” is our cluster key and the first entry in the inner sorted map. The first entry of the sorted map does not have any data as the key and the data are the same. Subsequent map entries create their key by suffixing the column name to the cluster key. The “Capuchin monkey:nickname” key is a result of the cluster key + the column header nickname. The data part contains the actual data for the column.
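
The same map of sorted maps can be modelled in a few lines of Python. This is a rough sketch of the logical view only, not of Cassandra’s storage code: the `storage` dictionary and the `insert` helper are hypothetical, and plain string sorting stands in for Cassandra’s own column ordering.

```python
storage = {}  # outer map: partition key -> inner map of internal columns


def insert(partition_key, cluster_key, **columns):
    """Add one CQL row to the toy store as a group of internal columns."""
    inner = storage.setdefault(partition_key, {})
    inner[f"{cluster_key}:"] = ""  # marker entry: key only, no data
    for name, value in columns.items():
        inner[f"{cluster_key}:{name}"] = str(value)
    # Python dicts keep insertion order, so rebuild the inner map in sorted key order.
    storage[partition_key] = dict(sorted(inner.items()))


insert(
    "5132b130ae7911e4ab270800200c9a66",
    "Capuchin monkey",
    nickname="cute",
    population=100000,
)

for key, value in storage["5132b130ae7911e4ab270800200c9a66"].items():
    print(repr(key), "->", repr(value))
```

One CQL row becomes three entries in the inner map, each keyed by the cluster key followed by a colon and the column name.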

The image below visually depicts the linkage between the CQL row, the resulting SSTable and a logical map of sorted maps.

Cassandra CQL To SSTable To Logical View

Now let’s insert two more CQL rows. The first row inserted will have the same partition key but a different cluster key. The second row inserted will have a new partition key and a new cluster key.

INSERT INTO monkey (identifier, species, nickname, population)
VALUES ( 5132b130-ae79-11e4-ab27-0800200c9a66,
'Small Capuchin monkey', 'very cute', 100);

INSERT INTO monkey (identifier, species, nickname, population)
VALUES ( 7132b130-ae79-11e4-ab27-0800200c9a66,
'Rhesus Monkey', 'Handsome', 100000);

Let’s once again first run a simple SELECT command:

SELECT * FROM monkey;

You should see output similar to the following screenshot:

CQL Query

Next navigate to

$ApacheCassandraInstallDir/bin

and execute

./nodetool flush animalkeyspace ## Second nodetool flush of the exercise
./nodetool compact animalkeyspace ## Never do this on a production system

Your data will now be in animalkeyspace-monkey-jb-3-Data.db. Notice the unique node level table number is set to 3. This is because the second nodetool flush would have created a data file with the unique node level table number 2. The compact command would then have combined the data in files 1 and 2 and created a new file 3 with the aggregated data. More on compaction in later blog posts.
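
As a rough illustration of why a third file appears, the following Python sketch merges the contents of two flushed files into one. It is a simplification under stated assumptions: the `compact` helper is hypothetical, the population columns are left out for brevity, and real compaction also has to deal with deletes and expiring data, which are ignored here.

```python
def compact(*sstables):
    """Merge SSTables into one: per partition, keep the newest cell for each column name."""
    merged = {}
    for sstable in sstables:
        for partition_key, cells in sstable.items():
            target = merged.setdefault(partition_key, {})
            for name, (value, timestamp) in cells.items():
                if name not in target or timestamp > target[name][1]:
                    target[name] = (value, timestamp)
    # Keep the columns of every partition in sorted order, as in an SSTable.
    return {key: dict(sorted(cells.items())) for key, cells in merged.items()}


# Contents after the first flush (file 1) and the second flush (file 2).
sstable_1 = {
    "5132b130ae7911e4ab270800200c9a66": {
        "Capuchin monkey:": ("", 1424557973603000),
        "Capuchin monkey:nickname": ("cute", 1424557973603000),
    },
}
sstable_2 = {
    "5132b130ae7911e4ab270800200c9a66": {
        "Small Capuchin monkey:": ("", 1424558013115000),
        "Small Capuchin monkey:nickname": ("very cute", 1424558013115000),
    },
    "7132b130ae7911e4ab270800200c9a66": {
        "Rhesus Monkey:": ("", 1424558014339000),
        "Rhesus Monkey:nickname": ("Handsome", 1424558014339000),
    },
}

# File 3 holds the aggregated data of files 1 and 2.
sstable_3 = compact(sstable_1, sstable_2)
print(list(sstable_3["5132b130ae7911e4ab270800200c9a66"]))
```

If the same column name appears in both inputs, the cell with the larger timestamp is the one that survives the merge.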

Let’s convert the data in our newly created SSTable into JSON. Execute the following command to extract the JSON representation of animalkeyspace-monkey-jb-3-Data.db:

./sstable2json $YourDataDirectory/data/animalkeyspace/monkey/animalkeyspace-monkey-jb-3-Data.db

On running the above command I get the following output:

[
{
"key": "5132b130ae7911e4ab270800200c9a66",
"columns": [
[
"Capuchin monkey:",
"",
1424557973603000
],
[
"Capuchin monkey:nickname",
"cute",
1424557973603000
],
[
"Capuchin monkey:population",
"100000",
1424557973603000
],
[
"Small Capuchin monkey:",
"",
1424558013115000
],
[
"Small Capuchin monkey:nickname",
"very cute",
1424558013115000
],
[
"Small Capuchin monkey:population",
"100",
1424558013115000
] ] },
{
"key": "7132b130ae7911e4ab270800200c9a66",
"columns": [
[
"Rhesus Monkey:",
"",
1424558014339000
],
[
"Rhesus Monkey:nickname",
"Handsome",
1424558014339000
],
[
"Rhesus Monkey:population",
"100000",
1424558014339000
] ] } ]

The above data can be visualized as depicted in the following image.

Cassandra CQL To SSTable To Logical View Part Two

Note the second insert statement simply appends to an existing row key with new cluster keys. Thus the three new columns with keys Small Capuchin monkey:, Small Capuchin monkey:nickname and Small Capuchin monkey:population are added to the 5132b130ae7911e4ab270800200c9a66 row. The third insert statement created a whole new row with the partition/row key 7132b130ae7911e4ab270800200c9a66.
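
Going the other way, the internal columns of one storage row can be grouped back into CQL rows by splitting each column key at its last colon. The sketch below assumes column entries shaped like the sstable2json output shown earlier; the `to_cql_rows` helper is a hypothetical name and not part of any Cassandra tool.

```python
def to_cql_rows(partition_key, internal_columns):
    """Group internal columns by cluster key to recover the CQL rows of one partition."""
    rows = {}
    for name, value, _timestamp in internal_columns:
        # Split at the last colon: "<cluster key>:<column name>".
        cluster_key, column = name.rsplit(":", 1)
        row = rows.setdefault(
            cluster_key, {"identifier": partition_key, "species": cluster_key}
        )
        if column:  # the marker entry "<cluster key>:" has no column name
            row[column] = value
    return list(rows.values())


columns = [
    ["Capuchin monkey:", "", 1424557973603000],
    ["Capuchin monkey:nickname", "cute", 1424557973603000],
    ["Capuchin monkey:population", "100000", 1424557973603000],
    ["Small Capuchin monkey:", "", 1424558013115000],
    ["Small Capuchin monkey:nickname", "very cute", 1424558013115000],
    ["Small Capuchin monkey:population", "100", 1424558013115000],
]

for row in to_cql_rows("5132b130ae7911e4ab270800200c9a66", columns):
    print(row)
```

Six internal columns in a single storage row collapse back into the two CQL rows that share the partition key.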

If you have reached this far, congratulations. I hope you now have a basic understanding of CQL and its effects on the underlying structures.