Wrangling Data with MongoDB 🇰🇪
__author__ = "Donald Ghazi"
__email__ = "donald@donaldghazi.com"
__website__ = "donaldghazi.com"
from pprint import PrettyPrinter
from pymongo import MongoClient
import pandas as pd
In this project, I will be working with MongoDB database (NoSQL), which is very common for websites and censored data.
GOALS
Machine Learning Workflow
# Instantiate a PrettyPrinter, and assign it to the variable pp
pp = PrettyPrinter(indent=2)
I want to create a client that connects to MongoDB. In my case, Jupyter/Python to MongoDB so it brings the data back. In order to connect, it needs to know the location of that server. There are two pieces of information it needs to know.
First, it needs to know on what computer is this running. This project is done on the same computer so I'm using localhost. Second, the piece of information is the exact address or port. My MongoDB server is running on port "27017". This happens to be the default port number (the convention). Once my client knows where to look for the server and where that server is connecting on which port, it can then establish that connection. And so once that's done, my client can bring queries to my server and my server can send data back.
# Create a client that connects to the database running at localhost on port 27017
client = MongoClient(host="localhost", port=27017)
List Databases
In this task, I'm going to see what databases are available to me on the server that I just connected to.
# Get the names of the databases on this client
client.list_databases()
<pymongo.command_cursor.CommandCursor at 0x7fec6d134ee0>
Above, I don't get a list back with database names, rather, I get an iterator.
# Give a task of printing out all the integers between 0 and 4, inclusive
from sys import getsizeof
my_list = [0, 1, 2, 3, 4] # Create a list (my_list)
# Create with a for loop and I need something to look through
for i in my_list:
print(i) # Print the integers
0 1 2 3 4
# Create a range
from sys import getsizeof
my_list = [0, 1, 2, 3, 4]
my_range = range(0,5) # Iterator
for i in my_range:
print(i)
0 1 2 3 4
The outputs are exactly the same.
# Look at the size of my_list
getsizeof(my_list)
120
# Look at the size of my_range
getsizeof(my_range)
48
my_list
takes up more space than my_range
. Let's try to make my_range
bigger this time.
from sys import getsizeof
my_list = [0, 1, 2, 3, 4, 5, 6, 7, 8]
my_range = range(0,9)
for i in my_list:
print(i)
0 1 2 3 4 5 6 7 8
getsizeof(my_list)
152
getsizeof(my_range)
48
from sys import getsizeof
my_list = [0, 1, 2, 3, 4, 5, 6, 7, 8]
my_range = range(0,8_000_000)
getsizeof(my_range)
48
# Generate everything in iterator and turn it in to a list
list(client.list_databases())
[{'name': 'admin', 'sizeOnDisk': 40960, 'empty': False}, {'name': 'air-quality', 'sizeOnDisk': 6938624, 'empty': False}, {'name': 'config', 'sizeOnDisk': 94208, 'empty': False}, {'name': 'local', 'sizeOnDisk': 73728, 'empty': False}]
pp.pprint(list(client.list_databases()))
[ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960}, {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 6938624}, {'empty': False, 'name': 'config', 'sizeOnDisk': 94208}, {'empty': False, 'name': 'local', 'sizeOnDisk': 73728}]
Assign Database
I've connected to the MongoDB server. I've also looked at what databases are available to me on that server. And decided that I want to focus on the air-quality
database (the largest). So the next step is to pull that out and assign it to a variable name so I can interact with it directly.
# Assign the "air-quality" database to the variable db
db = client["air-quality"]
Now interact with that database directly.
List Collections
I've connected to my server, and found which database I'm interested in and assigned it to a variable name. Now I can look into the data.
I have my server which I connected to and on that server, there's a database air-quality
. Inside that database, the first thing I have, is a collection. I can have multiple collections in a database and like the name implies, a collection is a group of observations that I want to put together for some reason. Probably because they have similar attributes. A good way to think of collection is a DataFrame. With a DataFrame, each observation is a row. For a collection, each observation is called a document.
To reiterate, I've connected to my server, I've found the database I am interested in. And now, I am going to look at the different collections in that database and different documents in that collection.
SERVER -> Database -> Collection -> Document
# Use the list_collections method to print a list of the collections available in db
db.list_collections()
<pymongo.command_cursor.CommandCursor at 0x7fec65899d60>
# Turn it in to a list and look at the first item
list(db.list_collections())[0]
{'name': 'lagos', 'type': 'timeseries', 'options': {'timeseries': {'timeField': 'timestamp', 'metaField': 'metadata', 'granularity': 'seconds', 'bucketMaxSpanSeconds': 3600}}, 'info': {'readOnly': False}}
Getting information in a form of dictionary (JSON). I just want the name of these collections.
# I just want the name so create a for loop
for c in db.list_collections(): # c is collection
print(c["name"]) # I just want the value associated with the "name" key & c is collection
lagos system.buckets.lagos dar-es-salaam system.buckets.dar-es-salaam nairobi system.buckets.nairobi system.views
Assign Collection
I'm interested in the nairobi
collection so let's assign that to its own variable name so it's easier to work with. I'm working with dictionaries here.
# Assign the "nairobi" collection in db to the variable name nairobi
nairobi = ["nairobi"]
nairobi = db["nairobi"]
Count Documents
Count the number of documents in the nairobi
collection to see what I'm working with.
nairobi.count_documents({})
202212
Find One Document
DataFrame has an index column. And inside that index column, each row or observations have unique labels.
# Use the find_one method to retrieve one document from the nairobi collection, and assign it to the variable name result
result = nairobi.find_one({})
pp.pprint(result)
{ 'P1': 39.67, '_id': ObjectId('62a1510e5e1eb913597171ee'), 'metadata': { 'lat': -1.3, 'lon': 36.785, 'measurement': 'P1', 'sensor_id': 57, 'sensor_type': 'SDS011', 'site': 29}, 'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
type(nairobi['sensor_id'])
pymongo.collection.Collection
First thing I can see is I have an _id
which is the primary key. Each record in MongoDB has a unique identifier (which is _id
).
Distinct Censor Sites
# Use the distinct method to determine how many sensor sites are included in the nairobi collection
nairobi.distinct("metadata.site")
[29, 6]
Count Documents by Site
# Use the count_documents method to determine how many readings there are for each site in the nairobi collection
nairobi.count_documents({"metadata.site": 6})
70360
print("Documents from site 6:", nairobi.count_documents({"metadata.site": 6}))
print("Documents from site 29:", nairobi.count_documents({"metadata.site": 29}))
Documents from site 6: 70360 Documents from site 29: 131852
There's almost twice as much data in site 29 then site 6. So I'm focusing on a site, looks like I'm going to be focusing on site 29.
Aggregate Documents
# Use the aggregate method to determine how many readings there are for each site in the nairobi collection
result = nairobi.aggregate(
[
{"$group": {"_id": "$metadata.site"}}
]
)
pp.pprint(list(result))
[{'_id': 29}, {'_id': 6}]
result = nairobi.aggregate(
[
{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}
]
)
pp.pprint(list(result))
[{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]
Distinct Measurements
# Use the distinct method to determine how many types of measurements have been taken in the nairobi collection
nairobi.distinct("metadata.measurement")
['P1', 'temperature', 'P2', 'humidity']
Find PM2.5 Readings
# Use the find method to retrieve the PM 2.5 readings from all sites
result = nairobi.find({"metadata.measurement": "P2"}).limit(3) # Limit results to 3 records only
pp.pprint(list(result))
[ { 'P2': 34.43, '_id': ObjectId('62a1510f5e1eb9135971f279'), 'metadata': { 'lat': -1.3, 'lon': 36.785, 'measurement': 'P2', 'sensor_id': 57, 'sensor_type': 'SDS011', 'site': 29}, 'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}, { 'P2': 30.53, '_id': ObjectId('62a1510f5e1eb9135971f27a'), 'metadata': { 'lat': -1.3, 'lon': 36.785, 'measurement': 'P2', 'sensor_id': 57, 'sensor_type': 'SDS011', 'site': 29}, 'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)}, { 'P2': 22.8, '_id': ObjectId('62a1510f5e1eb9135971f27b'), 'metadata': { 'lat': -1.3, 'lon': 36.785, 'measurement': 'P2', 'sensor_id': 57, 'sensor_type': 'SDS011', 'site': 29}, 'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
Aggregate by Measurement I
# Use the aggregate method to calculate how many readings there are for each type ("humidity", "temperature", "P2", and "P1") in site 6
result = nairobi.aggregate(
[
{"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}}
]
)
pp.pprint(list(result))
[ {'_id': 'P1', 'count': 51076}, {'_id': 'P2', 'count': 51076}, {'_id': 'temperature', 'count': 50030}, {'_id': 'humidity', 'count': 50030}]
result = nairobi.aggregate(
[
{"$match": {"metadata.site": 6}},
{"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}}
]
)
pp.pprint(list(result))
[ {'_id': 'P1', 'count': 18169}, {'_id': 'temperature', 'count': 17011}, {'_id': 'P2', 'count': 18169}, {'_id': 'humidity', 'count': 17011}]
Aggregate by Measurement II
# Use the aggregate method to calculate how many readings there are for each type ("humidity", "temperature", "P2", and "P1") in site 29
result = nairobi.aggregate(
[
{"$match": {"metadata.site": 29}},
{"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}}
]
)
pp.pprint(list(result))
[ {'_id': 'P1', 'count': 32907}, {'_id': 'temperature', 'count': 33019}, {'_id': 'P2', 'count': 32907}, {'_id': 'humidity', 'count': 33019}]
Just like I had more documents overall at site 29, I have almost twice as many P2 readings in site 29. I know that will definitely be used to build my model to predict PM 2.5 readings, so I should use the information at site 29.
Find by Site and Measurement
# Use the find method to retrieve the PM 2.5 readings from site 29
result = nairobi.find( # First thing: pass in a dicitonary with the criteria for my search
{"metadata.site": 29, "metadata.measurement": "P2"}
)
pp.pprint(result.next())
{ 'P2': 34.43, '_id': ObjectId('62a1510f5e1eb9135971f279'), 'metadata': { 'lat': -1.3, 'lon': 36.785, 'measurement': 'P2', 'sensor_id': 57, 'sensor_type': 'SDS011', 'site': 29}, 'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
I don't want all of the metadata and the id. So I need to change the projection.
result = nairobi.find(
{"metadata.site": 29, "metadata.measurement": "P2"},
projection={"P2": 1, "timestamp": 1, "_id": 0} # Use the projection argument to limit the results to the "P2" and "timestamp" keys only
)
pp.pprint(result.next())
{'P2': 34.43, 'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
Read Query Into DataFrame
# Read records from result into the DataFrame df
result = nairobi.find(
{"metadata.site": 29, "metadata.measurement": "P2"},
projection={"P2": 1, "timestamp": 1, "_id": 0} # Set the index to "timestamp"
)
df = pd.DataFrame(result)
df.head()
timestamp | P2 | |
---|---|---|
0 | 2018-09-01 00:00:02.472 | 34.43 |
1 | 2018-09-01 00:05:03.941 | 30.53 |
2 | 2018-09-01 00:10:04.374 | 22.80 |
3 | 2018-09-01 00:15:04.245 | 13.30 |
4 | 2018-09-01 00:20:04.869 | 16.57 |
result = nairobi.find(
{"metadata.site": 29, "metadata.measurement": "P2"},
projection={"P2": 1, "timestamp": 1, "_id": 0} # Set the index to "timestamp"
)
df = pd.DataFrame(result).set_index("timestamp")
df.head()
P2 | |
---|---|
timestamp | |
2018-09-01 00:00:02.472 | 34.43 |
2018-09-01 00:05:03.941 | 30.53 |
2018-09-01 00:10:04.374 | 22.80 |
2018-09-01 00:15:04.245 | 13.30 |
2018-09-01 00:20:04.869 | 16.57 |
I connected a MongoDB server, explored data in that server and found the data in that server. I also did aggregation and querying and now have the dataset to make different models.