One-Page Tutorial for NoSQL – mongoDB

On this page, I’ll try to make learning NoSQL and MongoDB as simple as possible: it’s a one-chapter tutorial!

By the end of this one-page tutorial, we’ll have covered:

Brief Introduction to NoSQL Databases
Overview of MongoDB, The MongoDB Data Model and Designing the MongoDB Data Model
Installing MongoDB, Working with MongoDB and CRUD Operations
Indexes, Aggregation, Sharding and Replica Set

Brief Introduction to NoSQL Databases

What are NoSQL databases?

The term “NoSQL” (“Not Only SQL”) is used to categorize all database technologies that do not use the relational model.

Applications in the past predominantly used RDBMS databases such as Oracle, MS SQL Server, MySQL, IBM DB2, etc.

All of these databases are characterized by the following key features:

Data model is relational
Data model is pre-defined, and applying major changes requires the database to be offline
Data manipulation is done using SQL
Compliant with ACID (Atomicity, Consistency, Isolation and Durability) properties

NoSQL, on the other hand, is characterized by the following key features:

Data model is non-relational (could be key-value, hierarchical, columnar)
Data model is not pre-defined but can be implicitly defined by the code. Hence it is extensible.
Data manipulation does not use SQL – some NoSQL databases have SQL-like languages but most provide custom APIs
More compliant with BASE (Basically Available, Soft state, Eventual consistency) properties

Why do we need NoSQL?

NoSQL databases have existed for a long time now.

Several dedicated applications implemented custom database technologies because RDBMS technologies either failed to meet the requirements satisfactorily or the cost of doing so was very high. Such applications are becoming increasingly mainstream in a lot of industries and use cases. This increased the need for adopting these technologies and in turn has increased the availability of commercial software that has been built on these technologies.

Some of the requirements that NoSQL technologies address are:

Meet the performance needs of internet-scale applications, which are typically characterized by a large user base, multi-geography access, large volumes of data and 24×7 availability.
Can store and process semi-structured data such as JSON and unstructured data such as images, videos and documents at the required performance levels.
Have a flexible data model that can be defined implicitly by the application code, which means there is no need to drop and recreate database tables and rewrite a whole lot of code. This reduces the development time and enables incremental development (as done in Agile development).
Support non-relational data models such as key-value, hierarchical, etc., which are better suited for several use cases.

Which are the NoSQL databases?

There are several of them out there. They are broadly classified into four categories based on the data model supported.

Some of the most popular ones are listed below.

Key-Value Store: Redis, Riak, Memcached

Document Store: MongoDB, CouchDB, Couchbase

Column Family Store: Cassandra, HBase

Graph Database: Neo4j

Overview of MongoDB

Key Features:

Flexible and Extensible Data Model that uses the BSON structure

Data is stored in BSON (a binary format similar to JSON), which is essentially a group of key-value pairs. A document can contain embedded documents and arrays as values.

Unlike in an RDBMS, where every record in a table must have the same pre-defined set of fields, each MongoDB ‘record’ (called a document) can have a different set of fields, which need not be pre-defined.

Distributed Data Store with Sharding and Replication

An instance of MongoDB can be deployed on a cluster of servers. Data can be distributed across the cluster for high performance.

Replication and Sharding

In order to recover from hardware failure/network failure on a node, multiple copies of data can be maintained on multiple nodes. This also serves as a backup copy of data.

Sharding distributes data across the cluster based on a partitioning key, called the shard key. This enables distributed processing of high volumes of data.

Automatic Horizontal Scalability

With an increase in data volumes or transactions, additional nodes can be added dynamically without bringing down/migrating the existing system (servers).

If sharding or replication has been implemented, data will be redistributed (in the case of sharding) and replicated (in the case of replication) to the new node/s automatically.

Advantages

Data structure maps to native data types in most programming languages

Embedded documents and arrays eliminate the need for joins

Automatic replication and sharding enable dynamic addition of nodes to a cluster

The MongoDB Data Model

MongoDB stores data in the form of documents in collections.

To understand this better, let us look at the RDBMS model and compare it with equivalent components in the MongoDB model.

In RDBMS:

Table: Similar data is stored in a table. A database can have many tables
Row: Row refers to each data record in a table. A row consists of several data fields called columns. A table must have a fixed set of columns and every row must have the same set of columns.
Primary Key: A column or group of columns that uniquely identifies a record in a table
Foreign Key: A column or group of columns in one table that references a data record in another table

Example
CREATE TABLE book (book_id VARCHAR2(10), language VARCHAR2(20), edition VARCHAR2(10), author VARCHAR2(30));

In MongoDB:

Collection: Similar data is stored in collections. A database can have several collections.
Document: Each data record is referred to as a document. A document includes multiple key->value pairs. A “key” is equivalent to the column name in RDBMS and a “value” is the column data. A document in MongoDB is similar to a JSON document:
• Data is represented in key/value pairs
• Curly braces hold objects, each key is followed by a ‘:’ (colon), and the key/value pairs are separated by a ‘,’ (comma).
Object ID: This uniquely identifies a document in a collection. This is the default key generated by MongoDB, stored in the “_id” field.
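To make the comparison concrete, here is a small sketch of how rows of the book table could look as documents. It is plain JavaScript rather than real BSON, and every value, plus the extra authors and publisher fields, is made up for illustration.

```javascript
// Hypothetical documents for a "book" collection. Field names mirror the
// columns of the CREATE TABLE example; all values are invented.
const books = [
  {
    _id: "B001",                 // stands in for the generated Object ID
    language: "English",
    edition: "2nd",
    author: "A. Author"
  },
  {
    _id: "B002",
    language: "French",
    // No "edition" field here: documents need not share the same fields.
    authors: ["B. Writer", "C. Writer"],                  // an array value
    publisher: { name: "Example Press", city: "Paris" }   // an embedded document
  }
];

// Each document is a set of key/value pairs, so the keys can differ per document.
const fieldSets = books.map(function (doc) { return Object.keys(doc).sort(); });
console.log(fieldSets);
```

The second document shows the two things a table row cannot do directly: skip a column and nest a structure inside a field.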

Designing the MongoDB Data Model

Understanding Data Relationships

Identifying Data Entities and the relationships between them

To understand how to design the MongoDB data model, let us consider a simple use case of an Orders database for an online store.

At a minimum, the database would contain the following main entities.

• Customer – that identifies each customer who made a purchase on the online store

• Address – the shipping address of the customer

• Order – the purchase details

• Item – the details of the product that was purchased

Understanding how these entities are related to each other helps in defining the most suitable document model.

Embedding and Referencing – Defining the Document Model

Based on the relationship between the data entities and the manner in which data is typically written and fetched by the application, the document model could be defined using one or more of these techniques.

For a One-to-One relationship

Embedded Document: Store the child document within the parent document. For example, the customer address inside the customer document

For a One-to-Many relationship

Embedded Array of Documents: Store the child documents within the parent document. For example, the orders of a customer within the customer document

Embedded Array of References: Store the orders in an Orders collection. Store the array of Order Object Ids that reference the Orders collection within the customer document in the Customer collection.

For a Many-to-Many relationship

Embedded Array of References: Store the orders and items in separate collections. Store the array of references to the Items within each Order document. You could also store the array of references to the Orders within each Item document.
Collection of References: Store the orders and items in separate collections. Define a new collection with an array of references to Order and Item mappings
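The difference between embedding and referencing is easier to see with data. Below is a minimal sketch in plain JavaScript (no database needed) for the online store example. The field names orderIds and itemIds and the helper resolveOrders are my own illustrative choices, not MongoDB features.

```javascript
// Hypothetical data for the online store example; all names and values are illustrative.
const orders = [
  { _id: 101, total: 40, itemIds: [1, 2] },
  { _id: 102, total: 15, itemIds: [2] }
];

const customer = {
  _id: 1,
  name: "Smith",
  // One-to-one: the address is embedded inside the customer document.
  address: { street: "1 Main St", city: "Springfield" },
  // One-to-many by reference: only the order ids are stored here.
  orderIds: [101, 102]
};

// Resolving references needs an extra lookup in the other collection,
// whereas embedded data arrives together with the parent document.
function resolveOrders(cust, orderCollection) {
  return cust.orderIds.map(function (id) {
    return orderCollection.find(function (o) { return o._id === id; });
  });
}

const customerOrders = resolveOrders(customer, orders);
console.log(customer.address.city);                                // embedded: no lookup
console.log(customerOrders.map(function (o) { return o.total; })); // referenced: looked up
```

As a rough rule of thumb, embedding suits data that is read together with its parent, while references suit data that is shared or grows without bound.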

Installing MongoDB

Installation

On Windows:

• Interactive installation -> download the .msi file and the wizard will guide you through the installation process. You may specify the installation directory in “custom” installation mode.
• Command line installation -> install MongoDB unattended from the command line using msiexec.exe. This will install the following components:
1. Server – includes mongod.exe
2. Client – includes mongo.exe
3. MonitoringTools – includes mongostat.exe and mongotop.exe
4. ImportExportTools – includes mongodump.exe, mongorestore.exe, mongoexport.exe, and mongoimport.exe

On Ubuntu:

To install on Ubuntu Linux systems, use the .deb packages. Installing the mongodb-org metapackage will automatically install the four component packages listed below.
1. mongodb-org-server: This package contains the mongod daemon and associated configuration and init scripts.
2. mongodb-org-mongos: This package contains the mongos daemon.
3. mongodb-org-shell: This package contains the mongo shell.
4. mongodb-org-tools: This package contains the following MongoDB tools: mongoimport, bsondump, mongodump, mongoexport, mongofiles, mongooplog, mongoperf, mongorestore, mongostat, and mongotop.

Working with MongoDB

Command Line Access

• The Mongo shell is a JavaScript tool that is included with all MongoDB distributions

• Supports data manipulation using MongoDB query language

Programmatic Access

• MongoDB provides native drivers for most popular programming languages and frameworks. These include C, C++, C#, Java, Node.js, Perl, PHP, Python (plus Motor, an asynchronous Python driver), Ruby and Scala

• Data manipulation is performed using the language specific APIs as opposed to using a standard SQL

CRUD Operations – Write Operations

Insert

• The db.collection.insert() method is available to add new documents to a collection

• The MongoDB method db.users.insert( { "name" : "smith", "age" : 26, "status" : "A" } ) is equivalent to the SQL INSERT INTO users (name, age, status) VALUES ('smith', 26, 'A')

• If you insert a document without the “_id” field, MongoDB adds a “_id” field with a unique Object Id

• MongoDB implicitly creates a collection during an Insert if one doesn’t already exist

Update

• The db.collection.update() method modifies existing documents in a collection

• The MongoDB method db.users.update( { "age" : { $gt : 18 } }, { $set : { status : "A" } }, { multi : true } ) is equivalent to the SQL UPDATE users SET status = 'A' WHERE age > 18

• By default, the update method updates only one document unless the multi option is set to true

• This method can update specific fields in a document or the entire document

• The upsert option is available to insert a document if one doesn’t exist

Remove

• The db.collection.remove() method is available to delete documents from a collection

• The MongoDB method db.users.remove( { "status" : "D" } ) is equivalent to the SQL DELETE FROM users WHERE status = 'D'

• By default, the remove method deletes all the documents that match the criteria

Other Considerations for Write Operations

• A write operation is atomic at the level of a single document. So when one operation is updating multiple documents, other operations can interleave and act on some of the same documents

• Write concern levels determine when an application receives an acknowledgment that a write has been successful. Write concern levels are configurable on MongoDB
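The default behaviour of update (one document unless multi is set) is easy to get wrong, so here is a small in-memory sketch of it. This is not MongoDB code: the update function below is a hypothetical stand-in that takes a plain predicate instead of a query document and only imitates $set.

```javascript
// In-memory imitation of update(query, { $set: ... }, { multi: ... }).
// The "matches" predicate stands in for the query document.
function update(collection, matches, setFields, options) {
  const opts = options || {};
  let modified = 0;
  for (const doc of collection) {
    if (!matches(doc)) continue;
    Object.assign(doc, setFields);   // like $set: change only the named fields
    modified += 1;
    if (!opts.multi) break;          // default: stop after the first match
  }
  return modified;
}

const users = [
  { name: "smith", age: 26, status: "B" },
  { name: "jones", age: 31, status: "B" },
  { name: "kim", age: 17, status: "B" }
];
const isAdult = function (u) { return u.age > 18; };

const firstRun = update(users, isAdult, { status: "A" });                    // first match only
const secondRun = update(users, isAdult, { status: "C" }, { multi: true });  // every match
console.log(firstRun, secondRun, users.map(function (u) { return u.status; }));
```

The first call touches only one document even though two match, which mirrors why the SQL-equivalent example above needs multi set to true.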

CRUD Operations – Read Operations

Find

• The db.collection.find() method is available to fetch documents from a collection

• The MongoDB method db.users.find( { "status" : "D" }, { name : 1 } ) is equivalent to the SQL SELECT name FROM users WHERE status = 'D', except that find also returns the _id field unless it is excluded

• Data across collections cannot be joined using the find method

• The result set of a find is provided as a cursor and you must iterate the cursor to fetch each document in the result set

Query Types

Unlike some other NoSQL databases, MongoDB is not limited to simple key-value operations. A query may return a document, a subset of specific fields within the document or complex aggregations against many documents.

Key-value queries return results based on any field in the document, often the primary key.
Range queries return results based on values defined as inequalities (e.g., greater than, less than or equal to, between).
Geospatial queries return results based on proximity criteria, intersection and inclusion as specified by a point, line, circle or polygon.
Text Search queries return results in relevance order based on text arguments using Boolean operators (e.g., AND, OR, NOT).
Aggregation Framework queries return aggregations of values returned by the query (e.g., count, min, max, average, similar to a SQL GROUP BY statement).
MapReduce queries execute complex data processing that is expressed in JavaScript and executed across data in the database.

Comparison of a SQL SELECT with MongoDB find() method

In SQL SELECT:

SELECT emp_id, name, salary -> projection
FROM employee -> table
WHERE salary > 1000 -> select criteria/filter
LIMIT 5 -> cursor modifier

In MongoDB find() method:

db.employee.find( -> collection
{ "salary" : { $gt : 1000 } }, -> query criteria
{ emp_id : 1, name : 1, salary : 1 } -> projection
).limit(5) -> cursor modifier

Other Examples

1. To exclude one field from a result set

db.records.find( { "user_id" : { $lt : 42 } }, { "history" : 0 } )

This query selects documents in the records collection that match the condition { "user_id" : { $lt : 42 } }, and uses the projection { "history" : 0 } to exclude the history field from the documents in the result set.

2. To return two fields and the _id field

db.records.find( { "user_id" : { $lt : 42 } }, { "name" : 1, "email" : 1 } )

This query selects documents in the records collection that match the query { "user_id" : { $lt : 42 } } and uses the projection { "name" : 1, "email" : 1 } to return just the _id field (implicitly included), the name field, and the email field in the documents in the result set.
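If the projection rules feel abstract, the sketch below imitates them on an in-memory array. It is a simplified model written for this tutorial, not how MongoDB is implemented: the find function here is hypothetical, takes a predicate instead of a query document, and handles only top-level fields. The sample records are made up.

```javascript
// In-memory imitation of find(filter, projection).limit(n).
function find(collection, matches, projection, limit) {
  const keys = Object.keys(projection);
  // A projection that lists any field with 1 is an "inclusion" projection.
  const including = keys.some(function (k) { return projection[k] === 1; });
  const out = [];
  for (const doc of collection) {
    if (out.length === limit) break;
    if (!matches(doc)) continue;
    const result = {};
    for (const field of Object.keys(doc)) {
      const keep = including
        ? (field === "_id" ? projection._id !== 0 : projection[field] === 1)
        : projection[field] !== 0;
      if (keep) result[field] = doc[field];
    }
    out.push(result);
  }
  return out;
}

const records = [
  { _id: 1, user_id: 7, name: "Ann", email: "ann@example.com", history: ["a", "b"] },
  { _id: 2, user_id: 40, name: "Bob", email: "bob@example.com", history: [] },
  { _id: 3, user_id: 99, name: "Cy", email: "cy@example.com", history: ["c"] }
];
const underFortyTwo = function (d) { return d.user_id < 42; };

const withoutHistory = find(records, underFortyTwo, { history: 0 });
const nameAndEmail = find(records, underFortyTwo, { name: 1, email: 1 }, 1);
console.log(withoutHistory);
console.log(nameAndEmail);
```

Note how the inclusion projection still returns _id, while the exclusion projection returns everything except the named field.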

Indexes

In MongoDB, indexes are similar to indexes in other database systems.

If no appropriate index is available, MongoDB performs a scan of all documents in a collection.
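Why does a missing index force a full scan? Here is a toy illustration in plain JavaScript. It uses a Map as the index, which is only an analogy (MongoDB's own indexes are B-tree structures), and the product data and the buildIndex helper are invented for this sketch.

```javascript
// A toy single-field index: a Map from field value to the matching documents.
function buildIndex(collection, field) {
  const index = new Map();
  for (const doc of collection) {
    const key = doc[field];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc);
  }
  return index;
}

const products = [
  { _id: 1, product_name: "lamp", product_type: "home" },
  { _id: 2, product_name: "desk", product_type: "office" },
  { _id: 3, product_name: "chair", product_type: "office" }
];

// Without an index: every document has to be inspected (a collection scan).
let scanned = 0;
const byScan = products.filter(function (p) {
  scanned += 1;
  return p.product_type === "office";
});

// With an index: one lookup by key, no per-document comparison at query time.
const typeIndex = buildIndex(products, "product_type");
const byIndex = typeIndex.get("office") || [];
console.log(scanned, byScan.length, byIndex.length);
```

Both paths return the same documents; the difference is how many documents had to be looked at to find them.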

Index Types:

Default _id: Index on _id field that exists by default

Single field: User defined ascending/descending index on single field of a document

Compound Index: User-defined index on multiple fields of a document

Multi-key index: Indexes the content stored in array fields of a document

Geospatial Index: Index to support queries of geospatial coordinate data

Text Index: For searching string content in a collection

Hash Index: To support hash-based sharding, MongoDB provides a hashed index, which indexes the hash of the value of a field

Index Properties

Unique: Rejects documents with duplicate values for the indexed field
Create a unique index on product_name as db.product.createIndex( { "product_name" : 1 }, { unique : true } )

Sparse: Omits references to documents that do not include the indexed field.
Create a sparse index on the warranty field in the product collection as db.product.createIndex( { "warranty" : 1 }, { sparse : true } )

TTL: In some cases, data should expire out of the system automatically. Time to Live (TTL) indexes allow the user to specify a period of time after which the data will automatically be deleted from the database. A common use of TTL indexes is applications that maintain a rolling window of history (e.g., most recent 100 days) for user actions such as click streams.

Aggregation

In database terms, aggregation means summarizing values across a number of records and returning the computed results.

Aggregation function:

Just as an RDBMS has aggregate functions such as SUM, AVG and COUNT, MongoDB has methods such as count, distinct and group for aggregation.

For example
db.product.count() -> gives the count of documents in the product collection
db.product.distinct("product_type") -> gives an array of the distinct values of product_type

Aggregation map-reduce

We can also perform aggregation using the mapReduce database command. Below are the steps of execution of map-reduce processing:

Fetch the documents in the collection that match the query condition.
The “map” phase outputs key-value pairs from these documents.
For keys that have multiple values, the “reduce” phase performs the actual aggregation.
The reduced output is written to a collection and, if required, a “finalize” function is called to process it further.

For example
db.orders.mapReduce(
function() { emit( this.cust_id, this.amount ); },
function( key, values ) { return Array.sum( values ); },
{ query : { status : "A" }, out : "order_totals" } )

– After applying the query filter (i.e. WHERE status = 'A'), we get 3 documents: 2 with cust_id A123 and 1 with cust_id B212

– Map then emits a key/value pair of cust_id and amount for each document, because the map function above emits this.cust_id as the key

– The output of map is grouped by cust_id, and keys with more than one amount (here A123) are sent to reduce for aggregation

– After reducing, the output collection order_totals holds the total amount per cust_id
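To trace those steps without a database, here is a small plain JavaScript imitation of the flow. The mapReduce function below is my own simplified stand-in, not the MongoDB command: emit is passed in as a parameter instead of being a global, and the order amounts are made up.

```javascript
// Made-up orders: three with status "A" (two for A123, one for B212) and one with status "D".
const orderDocs = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" }
];

function mapReduce(docs, query, map, reduce) {
  const emitted = new Map();                      // key -> list of emitted values
  const emit = function (key, value) {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // Query first, then run map with each matching document bound to "this".
  docs.filter(query).forEach(function (doc) { map.call(doc, emit); });

  const results = [];
  emitted.forEach(function (values, key) {
    // Reduce is only needed for keys that collected more than one value.
    const value = values.length > 1 ? reduce(key, values) : values[0];
    results.push({ _id: key, value: value });
  });
  return results;
}

const orderTotals = mapReduce(
  orderDocs,
  function (doc) { return doc.status === "A"; },
  function (emit) { emit(this.cust_id, this.amount); },
  function (key, values) { return values.reduce(function (a, b) { return a + b; }, 0); }
);
console.log(orderTotals);
```

The order with status "D" never reaches map, and B212 never reaches reduce because it has a single value.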

Aggregation pipeline

This feature allows the output of one operation to be provided as input to another operation. MongoDB’s aggregate() method is used to achieve this.

For example: We have a collection “domain” with documents as below

{ "_id" : 1, "domainName" : "test1.com", "hosting" : "hostgator.com" }
{ "_id" : 2, "domainName" : "test2.com", "hosting" : "aws.amazon.com" }
{ "_id" : 3, "domainName" : "test3.com", "hosting" : "aws.amazon.com" }
{ "_id" : 4, "domainName" : "test4.com", "hosting" : "hostgator.com" }
{ "_id" : 5, "domainName" : "test5.com", "hosting" : "aws.amazon.com" }
{ "_id" : 6, "domainName" : "test6.com", "hosting" : "cloud.google.com" }

1. Execute the command below in the mongo shell – it groups the documents by the “hosting” field and outputs the number of documents in each group

db.domain.aggregate( [ { $group : { _id : "$hosting", total : { $sum : 1 } } } ] ); -> used only 1 stage i.e. $group

Output:
{ "result" : [{ "_id" : "cloud.google.com", "total" : 1 },
{ "_id" : "aws.amazon.com", "total" : 3 },
{ "_id" : "hostgator.com", "total" : 2 }
], "ok" : 1 }

2. Execute the same command with multiple stages – first $group, then $sort

db.domain.aggregate( [ { $group : { _id : "$hosting", total : { $sum : 1 } } }, -> 1st stage $group
{ $sort : { total : -1 } } ] ); -> 2nd stage $sort

Output:
{ "result" : [{ "_id" : "aws.amazon.com", "total" : 3 },
{ "_id" : "hostgator.com", "total" : 2 },
{ "_id" : "cloud.google.com", "total" : 1 }
], "ok" : 1 }
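The idea of a pipeline, where each stage consumes the previous stage's output, can be sketched in plain JavaScript as below. The two stage functions are hypothetical imitations of $group with { $sum : 1 } and $sort with { total : -1 }; they are hard-coded for this example rather than a general aggregation engine.

```javascript
const domains = [
  { _id: 1, domainName: "test1.com", hosting: "hostgator.com" },
  { _id: 2, domainName: "test2.com", hosting: "aws.amazon.com" },
  { _id: 3, domainName: "test3.com", hosting: "aws.amazon.com" },
  { _id: 4, domainName: "test4.com", hosting: "hostgator.com" },
  { _id: 5, domainName: "test5.com", hosting: "aws.amazon.com" },
  { _id: 6, domainName: "test6.com", hosting: "cloud.google.com" }
];

// Each stage is a function from an array of documents to a new array.
function groupByHosting(docs) {
  const totals = new Map();
  for (const doc of docs) {
    totals.set(doc.hosting, (totals.get(doc.hosting) || 0) + 1);   // like { $sum : 1 }
  }
  return Array.from(totals, function (entry) {
    return { _id: entry[0], total: entry[1] };
  });
}

function sortByTotalDesc(docs) {
  return docs.slice().sort(function (a, b) { return b.total - a.total; });  // like { total : -1 }
}

// The pipeline feeds the output of one stage into the next.
const pipeline = [groupByHosting, sortByTotalDesc];
const result = pipeline.reduce(function (docs, stage) { return stage(docs); }, domains);
console.log(result);
```

Adding another stage is just a matter of appending one more function to the pipeline array, which is the same mental model as appending a stage to aggregate().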

Sharding

A Shard stores the data

The Shard key is either an indexed field or indexed compound field that exists in every document in the collection. MongoDB divides the data into chunks based on shard key and distributes the chunks evenly across shards.

A Sharded Cluster has the following components:

Shards: Store the data
Query routers: Identify the target shards, execute queries on them and return the result sets
Config servers: Contain the metadata of the cluster. The query router uses this metadata to direct each query to the specific shard that holds the data

Types of Sharding

1. Range-based sharding suits collections where we fetch data using range queries (such as > or <). Here chunks are created based on the lower limit and upper limit of each chunk.

For example: Chunk1 will store documents where key value > 0 and key value < 25, and Chunk2 will store documents where key value >= 25 and key value < 75

2. Hash-based sharding suits collections where we fetch data for a specific value using the equality operator (=)

Here the hash value of the field is computed first and used to determine which chunk the document should be stored in.

For example, let’s say we have 0.5 million items in the database and want to shard the order collection or sales collection.

If we have queries where we are fetching data based on Item_No then we should implement hash sharding.
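The two sharding types can be compared with a small plain JavaScript sketch. Everything here is illustrative: the chunk bounds are made up, and toyHash is a simple stand-in that I wrote for this example, not the hash function MongoDB uses.

```javascript
// Range-based: each chunk covers keys from min (inclusive) to max (exclusive).
const chunks = [
  { name: "chunk1", min: 0, max: 25 },
  { name: "chunk2", min: 25, max: 75 },
  { name: "chunk3", min: 75, max: Infinity }
];
function rangeChunk(key) {
  const hit = chunks.find(function (c) { return key >= c.min && key < c.max; });
  return hit ? hit.name : null;
}

// Hash-based: hash the key first, then place the document by its hash.
function toyHash(key) {
  let h = 0;
  for (const ch of String(key)) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0;   // keep it an unsigned 32-bit value
  }
  return h;
}
function hashChunk(key, chunkCount) {
  return "chunk" + (toyHash(key) % chunkCount + 1);
}

const neighbours = [10, 11, 12];
const byRange = neighbours.map(rangeChunk);
const byHash = neighbours.map(function (k) { return hashChunk(k, 3); });
console.log(byRange);   // neighbouring keys land in the same chunk
console.log(byHash);    // neighbouring keys are scattered across chunks
```

This is the trade-off in miniature: range placement keeps nearby keys together, which helps range queries, while hash placement spreads them out, which helps distribute equality lookups and writes.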

Replication

This provides redundancy and high availability through a group of mongod processes.

Replication between nodes is managed automatically based on Replica Set configurations.

A group of nodes that need to have the same data must be defined within a Replica Set.

Members of a Replica Set are:

Primary: The only member that receives write operations
Secondaries: Replicate operations from the primary to maintain an identical data set
Arbiters: Can also be part of a replica set. They don’t keep a copy of the data. Arbiters vote in the election of a new primary if the current primary is unavailable

That’s it. Thank you for reading this post.
