Coding Stories: Data Science

When I was ten years old I was given a Commodore 64 and I learned to code in BASIC. When I was in college I learned C to develop a Monte Carlo simulation for a Physics professor. As a science teacher I saw programming as a way to help students explore the natural world. Now I work with students and teachers to find the intersection of coding and learning. This blog documents my coding stories.

Data Science for K-12

What does data science mean in K-12? As a science teacher I have spent a lot of time working with data. For example, when I taught middle school science I had students conduct plant experiments in small groups. They would focus on a particular variable such as fertilizer, light, or space and record their data in their lab notebooks. There were always some anomalies in the data and sometimes dubious conclusions as a result. When I got smarter as a teacher, I had students pool their data across the grade to increase the size of their data set. This led to a marked improvement in the accuracy of their conclusions. I had been telling students for years that repeating an experiment is a good idea, but it was not until I experienced it on a large scale that I internalized the power of this effect. While I was excited to expose my students to larger data sets, I don’t think I did enough to communicate their significance.

There is a lot of hype these days about big data. There are tutorials posted online detailing how to analyze data for trends and use machine learning algorithms to make accurate predictions. There is not as much written about how students should be grappling with data. It is only when they engage with the data that students really learn what data is about. There are four aspects of data work that I try to highlight with students. Data invariably contains corrupt or inaccurate records that need to be addressed. Asking the right questions about data takes practice and intuition. Working with data also requires the use of various tools that need to be learned and mastered. Drawing conclusions from data requires a conceptual understanding of the mathematics of data as well as knowledge about the context from which the data was collected.

It takes time, patience, and expertise to work with data. There has been a movement lately to make data more open and available to the public. However, access to data is about more than just posting it. To give people real access we need to give them the skills and competencies required to make sense of it. This is why I feel strongly about engaging with data in schools.

I was in the midst of thinking about these issues when Chris, the math coach at my school, emailed me about a project idea for 4th grade focusing on rats. Recently the New York Times published an article titled "Rats are Taking Over New York City," featuring some evocative images of rats around town. Forbes followed suit a few days later with a more data-focused article on the same topic. Speaking of big data, both stories mentioned the OpenTheBooks.com website, which features data visualization tools. I was at first unsure about the level of rat interest amongst 4th graders, but I saw an opportunity to explore data with children so I jumped at the chance to meet with him.

Chris and I looked at the article and the OpenTheBooks.com site. We found that the data came from the Open Data NYC site which has a dataset containing all the 311 calls made in NYC from 2010 to the present. Chris and I started talking about what the students could do with this data. We realized that we wanted the students to experience some of the challenges of working with larger data sets. I pointed out that when you start working with data you are forced to adopt different mindsets and ways of thinking.

Chris said, “This reminds me of a technique we use in pushing mathematical thinking. If a child is always relying on one strategy, such as counting up for subtraction, then we find numbers that make that strategy inconvenient, forcing the student to find new strategies.” Using this analogy, we decided we should give the students a dataset that was big enough to push their thinking and support them in figuring out how to organize, represent, and ask questions of the data. Before our next meeting I decided I should spend some time exploring the dataset myself to see what I could find.

This reminded me of another important lesson from my years of teaching science: if you are going to do an investigation with students, make sure you explore the materials yourself first. In this case the materials were the 311 call dataset, and I needed to have a go at it myself.

When we eventually present this to the children we will probably use a spreadsheet to support their data work, but recently I have been using Jupyter notebooks and Python to look at data and I wanted to see if I could use those tools to better understand the 311 service request dataset.

My Data Science Investigation

First I created a notebook in Colaboratory, a free Google-hosted Jupyter notebook platform that saves you the work of installing packages and setting up your computer. The notebooks are conveniently stored in your Google Drive. Best of all, it is quick to get started. To access the data I went to the web page for the dataset.

You can look at the data online and export it to CSV, but I decided to use the API to query and load the data directly from the web. There is documentation that includes sample code for Python. Below is the code I used. The first line installs the sodapy package, which is not included in Colaboratory.

!pip install -q sodapy                             # sodapy is not preinstalled in Colaboratory
from sodapy import Socrata
import pandas as pd

client = Socrata("data.cityofnewyork.us", None)    # unauthenticated client for NYC Open Data
results = client.get("fhrw-4uyv", limit=1000)      # fhrw-4uyv is the 311 service request dataset
raw_data = pd.DataFrame.from_records(results)
raw_data.info()
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 43 columns):
:@computed_region_92fq_4b7q 244 non-null float64
:@computed_region_efsh_h5xi 243 non-null float64
:@computed_region_f5dn_yrer 244 non-null float64
:@computed_region_sbqj_enih 244 non-null float64
:@computed_region_yeji_bk3q 244 non-null float64
address_type 851 non-null object
agency 1000 non-null object
agency_name 1000 non-null object
bbl 340 non-null float64
borough 1000 non-null object
bridge_highway_direction 2 non-null object
bridge_highway_name 2 non-null object
bridge_highway_segment 3 non-null object
city 442 non-null object
closed_date 947 non-null object
community_board 1000 non-null object
complaint_type 1000 non-null object
created_date 1000 non-null object
cross_street_1 367 non-null object
cross_street_2 313 non-null object
descriptor 1000 non-null object
due_date 360 non-null object
facility_type 999 non-null object
incident_address 402 non-null object
incident_zip 431 non-null float64
intersection_street_1 366 non-null object
intersection_street_2 366 non-null object
latitude 244 non-null float64
location 244 non-null object
location_type 357 non-null object
longitude 244 non-null float64
open_data_channel_type 1000 non-null object
park_borough 1000 non-null object
park_facility_name 1000 non-null object
resolution_action_updated_date 957 non-null object
resolution_description 983 non-null object
road_ramp 2 non-null object
status 1000 non-null object
street_name 402 non-null object
taxi_pick_up_location 1 non-null object
unique_key 1000 non-null int64
x_coordinate_state_plane 244 non-null float64
y_coordinate_state_plane 244 non-null float64
dtypes: float64(11), int64(1), object(31)
memory usage: 336.0+ KB

Pandas is a Python package that provides powerful tools to analyze data. The .info() method lists all the columns in the data set and how many non-null values each column has. To query the dataset effectively you need to use a modified SQL syntax (SoQL) that is described in the documentation linked above. To be honest, I skipped the documentation the first time around and tried to query the database using the pd.read_json() method. This works similarly to what I have above, but you can’t have any spaces in the query string, so I had to learn about percent encoding to get my queries to work. It ultimately wasted a lot of time and confused me. The big takeaway here is: read the documentation. I felt properly humbled making the very mistake I often warn my students against.
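For the curious, the read_json() approach looks roughly like this (shown with the where clause I ended up using later); the %20 sequences are the percent-encoded spaces that tripped me up:

# a sketch of my first attempt: query the SODA endpoint directly with pd.read_json()
# every space in the $where clause has to be percent encoded as %20
url = ("https://data.cityofnewyork.us/resource/fhrw-4uyv.json"
       "?$where=complaint_type='Rodent'%20and%20descriptor='Rat%20Sighting'"
       "&$limit=1000")
raw_data = pd.read_json(url)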

Now that I had read the documentation and had some practice with the SQL-style syntax as well as the pandas package, I was ready to roll. I revised my query (the client.get() method) to get only the Rodent complaints and to select the columns that I deemed important to the search. I encourage you to go through each variable listed above, using the .value_counts() method for categorical data and the .describe() method for continuous data, to get a sense of what each variable represents and which ones you might want to include in your data set. There is also documentation on each variable on the Open Data NYC site. Finally, I increased the limit so that I could fetch all the relevant records.
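A couple of quick one-liners are enough for that kind of exploration; for example (the column names come from the .info() listing above):

raw_data['descriptor'].value_counts()   # tallies for a categorical column
raw_data['latitude'].describe()         # summary statistics for a continuous column

With that done, here is the new code that I ran.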

client = Socrata("data.cityofnewyork.us", None)
results = client.get("fhrw-4uyv",
                     select="borough, created_date, incident_zip, descriptor",
                     where="complaint_type='Rodent'",   # SoQL filter: rodent complaints only
                     limit=300000)
raw_data = pd.DataFrame.from_records(results)
raw_data.shape
OUTPUT:
(254628, 4)

As you can see there are 254,628 rows in the table. The next step was to try to visualize the data. For this I used Seaborn, a library built on top of matplotlib. Once you get some practice with it, the syntax looks simple to use. Of course, as with all things computing, it is not quite so simple, and it took me some time to get all of this working. For example, I had to import a few more things.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline   # display plots inside the notebook
sns.set()            # use seaborn's default plot styling

Then I ran the following code to create my first bar graph. I used the .countplot() method to represent the categorical data in the ‘descriptor’ column. The first line rotates the x-axis labels by 40 degrees so they do not overlap.

plt.xticks(rotation=-40)
ts = sns.countplot(x='descriptor',data=raw_data).set_title('Rodent Call Tallies')

This graph helped me see that I only want the rows where descriptor='Rat Sighting', so I adjusted the where clause in the query listed above accordingly.

where="complaint_type='Rodent' and descriptor=’Rat Sighting’",

This left me with 131,399 records of rat sightings. Each sighting was a row in my raw_data pandas DataFrame. The next step was to unpack the information in the created_date column to make it easier to read and analyze. The created_date is a string that contains the year, month, day, and time of the call. Using the datetime functionality in pandas, I created new columns for year, month, day, day-of-week, and hour. Here is the code:

raw_data['date'] = pd.to_datetime(raw_data['created_date'])
raw_data['year'] = raw_data['date'].dt.year
raw_data['month'] = raw_data['date'].dt.month
raw_data['day-of-week'] = raw_data['date'].dt.weekday   # Monday=0 ... Sunday=6
raw_data['hour'] = raw_data['date'].dt.hour
raw_data['day'] = raw_data['date'].dt.day
cleaned = raw_data[['borough', 'incident_zip', 'year', 'month', 'day', 'day-of-week', 'hour']]
cleaned.head()

The last line prints out the first five rows of the DataFrame.

The new DataFrame has 7 columns to investigate. Now I was finally ready to produce some interesting graphs.

monthly = sns.countplot(x="month", data=cleaned).set_title('rat reporting by month')

It seems that during warmer months more rats are sighted. I realized that this finding might say as much about the people observing the rats as about the behavior and population of the rats. Are there more rats in the summer, or more people out observing rats?

cl = cleaned[cleaned.year.isin([2017, 2018, 2019])]   # .dt.year gives integers, so filter with ints
monthly = sns.countplot(x="month", hue='year', data=cl).set_title("Let's break down month and year")

I like this second graph because you can see where the 2019 data ends. I realize now that I should have thrown out the 2019 data for the previous graph because it favored the first five months of the year.
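Dropping the partial year is a one-line filter; redoing that first monthly graph on complete years only would look something like this (using the cleaned DataFrame from above):

complete_years = cleaned[cleaned.year < 2019]   # keep only years with all twelve months of data
monthly = sns.countplot(x="month", data=complete_years).set_title('rat reporting by month, 2010-2018')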

yearly = sns.countplot(x="year", data=cleaned).set_title('rat reporting by year')

The last bar in this yearly graph further illustrates that the 2019 data is incomplete. In the articles listed above the author makes a projection for the 2019 column. I also notice that rat reporting rose from 2013 to 2017 but then dropped in 2018. The articles connected this to de Blasio’s campaign against rats, but returning to the second graph you can see that through May of 2019 the rate of rat reports is not declining.
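A naive projection is just arithmetic: scale the partial-year count by the fraction of the year recorded. Something like this, assuming the 2019 data runs through the end of May, gives a rough number to compare against the articles' figure:

calls_2019 = (cleaned.year == 2019).sum()   # rat sightings recorded so far in 2019
projected_2019 = calls_2019 * 12 / 5        # crude linear scaling from five months to a full year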

dow = sns.countplot(x="day-of-week", data=cleaned).set_title('rat reporting by day of week')
plt.xticks(range(7),['M','Tu','W','Th','F','Sa','Su'])

From this graph we can see that people don’t report rats as much on the weekend.

Next I wanted to see what time of day people report rats, but it turns out that most of the calls are listed at midnight. When I looked more closely at the data I realized that most of the calls had a time component of 0. I guess that data was not consistently recorded, so we will have to ignore the hour column.
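You can spot that kind of gap quickly with .value_counts(); a check along these lines (on the cleaned DataFrame) shows the pile-up at hour 0:

cleaned['hour'].value_counts().head()   # the vast majority of rows report an hour of 0

Finally I turned to the location of the calls. The borough graph was not that helpful except to show the relative sizes/populations of the boroughs. Below is the code I used to generate the borough graph.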

plt.xticks(rotation=30)
top3 = cleaned[cleaned.borough.isin(['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND'])]   # keep only the five named boroughs
top3 = top3[top3.year.isin([2017, 2018, 2019])]
borough = sns.countplot(x='borough', hue='year', data=top3).set_title('Calls by borough and year')

Next I focused on zip codes.

plt.xticks(rotation=30)
zips = sns.countplot(x='incident_zip', hue='year', data=cleaned,
                     order=cleaned.incident_zip.value_counts().iloc[:10].index).set_title('zips with the most calls')

This graph was exciting, especially since I used to live in the 3rd ranked zip code in the city. But this did not seem like the best way to visualize the data. For that I wanted to put the data on a map, and it turned out that this was going to require even more work. First I created a new DataFrame with the row count for each zip code.

for_map = cleaned.groupby('incident_zip').size().reset_index()
for_map = for_map.rename(index=str, columns={"incident_zip": 'zip', 0: 'count'})

Next I needed a geospatial package that could read in the geometry of each zip code, and I also needed a file with all the zip code geometries. Luckily NYC Open Data had all this information for me.

!pip install -q geopandas
!pip install -q descartes
from geopandas import GeoDataFrame
import geopandas as gpd

map = gpd.read_file('ZIP_CODE_040114.shp')                          # zip code boundary shapefile from NYC Open Data
merged = map.set_index("ZIPCODE").join(for_map.set_index("zip"))    # attach call counts to each zip code's geometry
merged = merged.fillna(0)                                           # zip codes with no calls get 0 instead of NaN

geopandas is the package I used to read the geometry, and ZIP_CODE_040114.shp is the file with the geometry information that I got from NYC Open Data. Make sure you download the whole folder, because the .shp file needs the companion files in that folder to load. Finally, after playing around, I replaced the NaN values in merged with zeros using the last line of code above to avoid scaling issues. The code below creates a choropleth figure of the number of calls by zip code. I mostly just copied it from the geopandas documentation.

vmin, vmax = 0, 4000                                 # range for the color scale
fig, ax = plt.subplots(1, figsize=(10, 6))
merged.plot(column='count', cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8')
ax.axis('off')
ax.set_title('number of rat 311 calls by zipcode')
ax.annotate('Source: NYC Open Data, 2019', xy=(0.1, .08), xycoords='figure fraction',
            horizontalalignment='left', verticalalignment='top', fontsize=12, color='#555555')
sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm._A = []                                           # dummy array so the colorbar can be drawn
cbar = fig.colorbar(sm)

This made the data look beautiful. I was proud that I had figured this out. I started to wonder what stories I could find in the data. I kept investigating by going back to the original 311 service request query and looking at the noise complaints by zip code. Using a similar set of steps I created the choropleth below.
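For reference, the noise query is just a different where clause. Here is a sketch of what it could look like, assuming the SoQL like operator and that complaint types such as 'Noise - Residential' all begin with 'Noise':

noise_results = client.get("fhrw-4uyv",
                           select="incident_zip",
                           where="complaint_type like 'Noise%'",   # matches the various Noise complaint types
                           limit=1000000)                          # noise complaints far outnumber rodent complaints
noise_data = pd.DataFrame.from_records(noise_results)
noise_map = noise_data.groupby('incident_zip').size().reset_index()
noise_map = noise_map.rename(index=str, columns={"incident_zip": 'zip', 0: 'count'})

From there the join and plotting steps are the same as for the rat data.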

Comparing the figures, they are more similar than I expected. I started to wonder if some zip codes just report more complaints to 311, so that the rat sightings are an indication of the reporting rate of that zip code rather than a data point about rat infestation or gentrification, as the article above argued.

I could keep exploring these ideas, as I have more questions than answers. In fact, the number of questions generated might be a good metric for the success of a data project. However, I decided that I have enough knowledge to support a 4th grader's inquiry into rat sightings, and I need to move on to the next project.

In conclusion, it is worth noting all the steps that were involved in my analysis of the data. It required a lot of time and the learning of numerous skills. I also had to draw on many previous experiences at each step in the process, which would make all of this work extremely challenging for a beginner. All of this must be considered when we talk about what it means to have access to data. I want students to have that access, because if someone else is doing the analysis and you don't understand how the analysis works, then you are beholden to their choices and biases. Also, exploring data can be a lot of fun.
