Wrangling Data with MongoDB 🇰🇪

In [39]:
__author__ = "Donald Ghazi"
__email__ = "donald@donaldghazi.com"
__website__ = "donaldghazi.com"
In [40]:
from pprint import PrettyPrinter
from pymongo import MongoClient
import pandas as pd

In this project, I will be working with MongoDB database (NoSQL), which is very common for websites and censored data.

GOALS

  • Connect to a MongoDB server
  • Explore the database and locate PM2.5 readings
  • Import database query into DataFrame

Machine Learning Workflow

  1. Prepare Data
  • Connect: Connect to MongoDB server
    • Client, server
  1. Explore
  • What's the data model for a MongoDB databse
  • What's semi-structured data?
  • How do you query a database using pygmongo?
    • Collection, document, iterator, metadata, primary key, schema
  1. Import
  • Import query results into a DataFrame
In [41]:
# Instantiate a PrettyPrinter, and assign it to the variable pp
pp = PrettyPrinter(indent=2) 

Prepare Data¶

Connect¶

I want to create a client that connects to MongoDB. In my case, Jupyter/Python to MongoDB so it brings the data back. In order to connect, it needs to know the location of that server. There are two pieces of information it needs to know.

First, it needs to know on what computer is this running. This project is done on the same computer so I'm using localhost. Second, the piece of information is the exact address or port. My MongoDB server is running on port "27017". This happens to be the default port number (the convention). Once my client knows where to look for the server and where that server is connecting on which port, it can then establish that connection. And so once that's done, my client can bring queries to my server and my server can send data back.

In [42]:
# Create a client that connects to the database running at localhost on port 27017
client = MongoClient(host="localhost", port=27017)

Explore¶

List Databases

In this task, I'm going to see what databases are available to me on the server that I just connected to.

In [43]:
# Get the names of the databases on this client
client.list_databases()
Out[43]:
<pymongo.command_cursor.CommandCursor at 0x7fec6d134ee0>

Above, I don't get a list back with database names, rather, I get an iterator.

In [44]:
# Give a task of printing out all the integers between 0 and 4, inclusive
from sys import getsizeof
my_list = [0, 1, 2, 3, 4] # Create a list (my_list)

# Create with a for loop and I need something to look through 
for i in my_list:
    print(i) # Print the integers
0
1
2
3
4
In [45]:
# Create a range 
from sys import getsizeof
my_list = [0, 1, 2, 3, 4]
my_range = range(0,5) # Iterator 

for i in my_range:
    print(i)
0
1
2
3
4

The outputs are exactly the same.

In [46]:
# Look at the size of my_list
getsizeof(my_list)
Out[46]:
120
In [47]:
# Look at the size of my_range
getsizeof(my_range)
Out[47]:
48

my_list takes up more space than my_range . Let's try to make my_range bigger this time.

In [48]:
from sys import getsizeof
my_list = [0, 1, 2, 3, 4, 5, 6, 7, 8]
my_range = range(0,9)

for i in my_list:
    print(i)
0
1
2
3
4
5
6
7
8
In [49]:
getsizeof(my_list)
Out[49]:
152
In [50]:
getsizeof(my_range)
Out[50]:
48
In [51]:
from sys import getsizeof
my_list = [0, 1, 2, 3, 4, 5, 6, 7, 8]
my_range = range(0,8_000_000)
In [52]:
getsizeof(my_range)
Out[52]:
48
In [53]:
# Generate everything in iterator and turn it in to a list
list(client.list_databases())
Out[53]:
[{'name': 'admin', 'sizeOnDisk': 40960, 'empty': False},
 {'name': 'air-quality', 'sizeOnDisk': 6938624, 'empty': False},
 {'name': 'config', 'sizeOnDisk': 94208, 'empty': False},
 {'name': 'local', 'sizeOnDisk': 73728, 'empty': False}]
In [54]:
pp.pprint(list(client.list_databases()))
[ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
  {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 6938624},
  {'empty': False, 'name': 'config', 'sizeOnDisk': 94208},
  {'empty': False, 'name': 'local', 'sizeOnDisk': 73728}]

Assign Database

I've connected to the MongoDB server. I've also looked at what databases are available to me on that server. And decided that I want to focus on the air-quality database (the largest). So the next step is to pull that out and assign it to a variable name so I can interact with it directly.

In [55]:
# Assign the "air-quality" database to the variable db
db = client["air-quality"]

Now interact with that database directly.

List Collections

I've connected to my server, and found which database I'm interested in and assigned it to a variable name. Now I can look into the data.

I have my server which I connected to and on that server, there's a database air-quality. Inside that database, the first thing I have, is a collection. I can have multiple collections in a database and like the name implies, a collection is a group of observations that I want to put together for some reason. Probably because they have similar attributes. A good way to think of collection is a DataFrame. With a DataFrame, each observation is a row. For a collection, each observation is called a document.

To reiterate, I've connected to my server, I've found the database I am interested in. And now, I am going to look at the different collections in that database and different documents in that collection.

SERVER -> Database -> Collection -> Document

In [56]:
# Use the list_collections method to print a list of the collections available in db
db.list_collections()
Out[56]:
<pymongo.command_cursor.CommandCursor at 0x7fec65899d60>
In [57]:
# Turn it in to a list and look at the first item 
list(db.list_collections())[0]
Out[57]:
{'name': 'lagos',
 'type': 'timeseries',
 'options': {'timeseries': {'timeField': 'timestamp',
   'metaField': 'metadata',
   'granularity': 'seconds',
   'bucketMaxSpanSeconds': 3600}},
 'info': {'readOnly': False}}

Getting information in a form of dictionary (JSON). I just want the name of these collections.

In [58]:
# I just want the name so create a for loop
for c in db.list_collections(): # c is collection 
    print(c["name"]) # I just want the value associated with the "name" key & c is collection 
lagos
system.buckets.lagos
dar-es-salaam
system.buckets.dar-es-salaam
nairobi
system.buckets.nairobi
system.views

Assign Collection

I'm interested in the nairobi collection so let's assign that to its own variable name so it's easier to work with. I'm working with dictionaries here.

In [59]:
# Assign the "nairobi" collection in db to the variable name nairobi
nairobi = ["nairobi"]
In [60]:
nairobi = db["nairobi"]

Count Documents

Count the number of documents in the nairobi collection to see what I'm working with.

In [61]:
nairobi.count_documents({})
Out[61]:
202212

Find One Document

DataFrame has an index column. And inside that index column, each row or observations have unique labels.

In [62]:
# Use the find_one method to retrieve one document from the nairobi collection, and assign it to the variable name result
result = nairobi.find_one({})
pp.pprint(result)
{ 'P1': 39.67,
  '_id': ObjectId('62a1510e5e1eb913597171ee'),
  'metadata': { 'lat': -1.3,
                'lon': 36.785,
                'measurement': 'P1',
                'sensor_id': 57,
                'sensor_type': 'SDS011',
                'site': 29},
  'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
In [65]:
type(nairobi['sensor_id'])
Out[65]:
pymongo.collection.Collection

First thing I can see is I have an _id which is the primary key. Each record in MongoDB has a unique identifier (which is _id).

Distinct Censor Sites

In [36]:
# Use the distinct method to determine how many sensor sites are included in the nairobi collection
nairobi.distinct("metadata.site")
Out[36]:
[29, 6]

Count Documents by Site

In [37]:
# Use the count_documents method to determine how many readings there are for each site in the nairobi collection
nairobi.count_documents({"metadata.site": 6})
Out[37]:
70360
In [38]:
print("Documents from site 6:", nairobi.count_documents({"metadata.site": 6}))
print("Documents from site 29:", nairobi.count_documents({"metadata.site": 29}))
Documents from site 6: 70360
Documents from site 29: 131852

There's almost twice as much data in site 29 then site 6. So I'm focusing on a site, looks like I'm going to be focusing on site 29.

Aggregate Documents

In [71]:
# Use the aggregate method to determine how many readings there are for each site in the nairobi collection
result = nairobi.aggregate(
    [
        {"$group": {"_id": "$metadata.site"}}
    ]
)
pp.pprint(list(result))
[{'_id': 29}, {'_id': 6}]
In [72]:
result = nairobi.aggregate(
    [
        {"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}
    ]

)
pp.pprint(list(result))
[{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]

Distinct Measurements

In [73]:
# Use the distinct method to determine how many types of measurements have been taken in the nairobi collection
nairobi.distinct("metadata.measurement")
Out[73]:
['P1', 'temperature', 'P2', 'humidity']

Find PM2.5 Readings

In [75]:
# Use the find method to retrieve the PM 2.5 readings from all sites
result = nairobi.find({"metadata.measurement": "P2"}).limit(3) # Limit results to 3 records only
pp.pprint(list(result))
[ { 'P2': 34.43,
    '_id': ObjectId('62a1510f5e1eb9135971f279'),
    'metadata': { 'lat': -1.3,
                  'lon': 36.785,
                  'measurement': 'P2',
                  'sensor_id': 57,
                  'sensor_type': 'SDS011',
                  'site': 29},
    'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
  { 'P2': 30.53,
    '_id': ObjectId('62a1510f5e1eb9135971f27a'),
    'metadata': { 'lat': -1.3,
                  'lon': 36.785,
                  'measurement': 'P2',
                  'sensor_id': 57,
                  'sensor_type': 'SDS011',
                  'site': 29},
    'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
  { 'P2': 22.8,
    '_id': ObjectId('62a1510f5e1eb9135971f27b'),
    'metadata': { 'lat': -1.3,
                  'lon': 36.785,
                  'measurement': 'P2',
                  'sensor_id': 57,
                  'sensor_type': 'SDS011',
                  'site': 29},
    'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]

Aggregate by Measurement I

In [77]:
# Use the aggregate method to calculate how many readings there are for each type ("humidity", "temperature", "P2", and "P1") in site 6
result = nairobi.aggregate(
    [
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}}
    ]
)
pp.pprint(list(result))
[ {'_id': 'P1', 'count': 51076},
  {'_id': 'P2', 'count': 51076},
  {'_id': 'temperature', 'count': 50030},
  {'_id': 'humidity', 'count': 50030}]
In [78]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 6}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}}
    ]
)
pp.pprint(list(result))
[ {'_id': 'P1', 'count': 18169},
  {'_id': 'temperature', 'count': 17011},
  {'_id': 'P2', 'count': 18169},
  {'_id': 'humidity', 'count': 17011}]

Aggregate by Measurement II

In [80]:
# Use the aggregate method to calculate how many readings there are for each type ("humidity", "temperature", "P2", and "P1") in site 29
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}}
    ]
)
pp.pprint(list(result))
[ {'_id': 'P1', 'count': 32907},
  {'_id': 'temperature', 'count': 33019},
  {'_id': 'P2', 'count': 32907},
  {'_id': 'humidity', 'count': 33019}]

Just like I had more documents overall at site 29, I have almost twice as many P2 readings in site 29. I know that will definitely be used to build my model to predict PM 2.5 readings, so I should use the information at site 29.

Import¶

Find by Site and Measurement

In [81]:
# Use the find method to retrieve the PM 2.5 readings from site 29
result = nairobi.find(  # First thing: pass in a dicitonary with the criteria for my search
    {"metadata.site": 29, "metadata.measurement": "P2"} 
)
pp.pprint(result.next())
{ 'P2': 34.43,
  '_id': ObjectId('62a1510f5e1eb9135971f279'),
  'metadata': { 'lat': -1.3,
                'lon': 36.785,
                'measurement': 'P2',
                'sensor_id': 57,
                'sensor_type': 'SDS011',
                'site': 29},
  'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}

I don't want all of the metadata and the id. So I need to change the projection.

In [69]:
result = nairobi.find(
    {"metadata.site": 29, "metadata.measurement": "P2"}, 
    projection={"P2": 1, "timestamp": 1, "_id": 0} # Use the projection argument to limit the results to the "P2" and "timestamp" keys only
)
pp.pprint(result.next())
{'P2': 34.43, 'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}

Read Query Into DataFrame

In [70]:
# Read records from result into the DataFrame df
result = nairobi.find(
    {"metadata.site": 29, "metadata.measurement": "P2"},
    projection={"P2": 1, "timestamp": 1, "_id": 0} # Set the index to "timestamp"
)
In [71]:
df = pd.DataFrame(result)
df.head()
Out[71]:
timestamp P2
0 2018-09-01 00:00:02.472 34.43
1 2018-09-01 00:05:03.941 30.53
2 2018-09-01 00:10:04.374 22.80
3 2018-09-01 00:15:04.245 13.30
4 2018-09-01 00:20:04.869 16.57
In [73]:
result = nairobi.find(
    {"metadata.site": 29, "metadata.measurement": "P2"},
    projection={"P2": 1, "timestamp": 1, "_id": 0} # Set the index to "timestamp"
)
In [74]:
df = pd.DataFrame(result).set_index("timestamp")
df.head()
Out[74]:
P2
timestamp
2018-09-01 00:00:02.472 34.43
2018-09-01 00:05:03.941 30.53
2018-09-01 00:10:04.374 22.80
2018-09-01 00:15:04.245 13.30
2018-09-01 00:20:04.869 16.57

I connected a MongoDB server, explored data in that server and found the data in that server. I also did aggregation and querying and now have the dataset to make different models.