Coding Stories: Data Science

When I was ten years old I was given a Commodore 64 and I learned to code in BASIC. When I was in college I learned C to develop a Monte Carlo simulation for a Physics professor. As a science teacher I saw programming as a way to help students explore the natural world. Now I work with students and teachers to find the intersection of coding and learning. This blog documents my coding stories.

Data Science for K-12

What does data science mean in K-12? As a science teacher I have spent a lot of time working with data. For example, when I taught middle school science I had students conduct plant experiments in small groups. They would focus on a particular variable such as fertilizer, light, or space and record their data in their lab notebooks. There were always some anomalies in the data and sometimes dubious conclusions as a result. When I got smarter as a teacher I had students pool their data across the grade to increase the size of their data set. This led to a marked improvement in the accuracy of their conclusions. I had been telling students for years that repeating an experiment is a good idea, but it was not until I experienced it on a large scale that I internalized the power of this effect. While I was excited to expose my students to larger data sets, I don’t think I did enough to communicate why the larger sample mattered.
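The effect of pooling can be sketched with a small simulation. Every number here is invented for illustration (a true plant height of 20 cm, a plant-to-plant spread of 4 cm, groups of 5 plants, 30 groups in a grade); none of it comes from my classroom data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

TRUE_MEAN = 20.0  # hypothetical true plant height in cm
SPREAD = 4.0      # hypothetical plant-to-plant variation in cm


def sample_mean(n_plants):
    """Average height measured from n_plants simulated plants."""
    heights = [random.gauss(TRUE_MEAN, SPREAD) for _ in range(n_plants)]
    return statistics.mean(heights)


def typical_error(n_plants, trials=2000):
    """Average distance between a sample's mean and the true mean."""
    errors = [abs(sample_mean(n_plants) - TRUE_MEAN) for _ in range(trials)]
    return statistics.mean(errors)


small_group_error = typical_error(5)   # one lab group on its own
pooled_error = typical_error(150)      # 30 groups of 5, pooled

print(f"one group of 5 plants: off by about {small_group_error:.2f} cm")
print(f"pooled 150 plants:     off by about {pooled_error:.2f} cm")
```

The pooled estimate should land much closer to the true value, which is the same improvement my students saw when they combined their notebooks.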

My Data Science Investigation

First I created a Colaboratory notebook. Colaboratory is a free, Google-hosted Jupyter notebook platform that saves you the work of installing packages and setting up your computer. The notebooks are conveniently stored in your Google Drive, and best of all, it is quick to get started. To access the data I went to the web page for the dataset, New York City's 311 calls on NYC Open Data.

!pip install -q sodapy
from sodapy import Socrata
import pandas as pd
client = Socrata("data.cityofnewyork.us", None)
results = client.get("fhrw-4uyv", limit=1000)
raw_data = pd.DataFrame.from_records(results)
raw_data.info()
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 43 columns):
:@computed_region_92fq_4b7q 244 non-null float64
:@computed_region_efsh_h5xi 243 non-null float64
:@computed_region_f5dn_yrer 244 non-null float64
:@computed_region_sbqj_enih 244 non-null float64
:@computed_region_yeji_bk3q 244 non-null float64
address_type 851 non-null object
agency 1000 non-null object
agency_name 1000 non-null object
bbl 340 non-null float64
borough 1000 non-null object
bridge_highway_direction 2 non-null object
bridge_highway_name 2 non-null object
bridge_highway_segment 3 non-null object
city 442 non-null object
closed_date 947 non-null object
community_board 1000 non-null object
complaint_type 1000 non-null object
created_date 1000 non-null object
cross_street_1 367 non-null object
cross_street_2 313 non-null object
descriptor 1000 non-null object
due_date 360 non-null object
facility_type 999 non-null object
incident_address 402 non-null object
incident_zip 431 non-null float64
intersection_street_1 366 non-null object
intersection_street_2 366 non-null object
latitude 244 non-null float64
location 244 non-null object
location_type 357 non-null object
longitude 244 non-null float64
open_data_channel_type 1000 non-null object
park_borough 1000 non-null object
park_facility_name 1000 non-null object
resolution_action_updated_date 957 non-null object
resolution_description 983 non-null object
road_ramp 2 non-null object
status 1000 non-null object
street_name 402 non-null object
taxi_pick_up_location 1 non-null object
unique_key 1000 non-null int64
x_coordinate_state_plane 244 non-null float64
y_coordinate_state_plane 244 non-null float64
dtypes: float64(11), int64(1), object(31)
memory usage: 336.0+ KB
client = Socrata("data.cityofnewyork.us", None)
# ask the server for only the columns and rows we need
results = client.get("fhrw-4uyv",
                     select="borough, created_date, incident_zip, descriptor",
                     where="complaint_type='Rodent'",
                     limit=300000)
raw_data = pd.DataFrame.from_records(results)
raw_data.shape
OUTPUT:
(254628, 4)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
plt.xticks(rotation=-40)
ts = sns.countplot(x='descriptor',data=raw_data).set_title('Rodent Call Tallies')
where="complaint_type='Rodent' and descriptor='Rat Sighting'",
raw_data['date'] = pd.to_datetime(raw_data['created_date'])
raw_data['year'] = raw_data['date'].dt.year
raw_data['month'] = raw_data['date'].dt.month
raw_data['day-of-week'] = raw_data['date'].dt.weekday
raw_data['hour']= raw_data['date'].dt.hour
raw_data['day']= raw_data['date'].dt.day
cleaned = raw_data[['borough','incident_zip','year','month','day','day-of-week','hour']]
cleaned.head()
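To make the new date columns concrete, here is a small sketch that uses only Python's standard library on one made-up timestamp. I am assuming created_date arrives as an ISO-style string like the one below. pandas' .dt.weekday follows the same convention as datetime.weekday(): Monday is 0 and Sunday is 6.

```python
from datetime import datetime

# Made-up example value, not a real record from the dataset
created_date = "2019-03-04T14:05:00.000"

stamp = datetime.fromisoformat(created_date)

# The same pieces the notebook pulls out with the .dt accessor
parts = {
    "year": stamp.year,
    "month": stamp.month,
    "day": stamp.day,
    "day-of-week": stamp.weekday(),  # Monday=0 ... Sunday=6
    "hour": stamp.hour,
}
print(parts)
```

March 4, 2019 was a Monday, so day-of-week comes out as 0. Note that the year is an integer, not a string, which matters when filtering on it later.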
monthly = sns.countplot(x="month", data=cleaned).set_title('rat reporting by month')
# dt.year produces integers, so filter on integer years rather than strings
cl = cleaned[cleaned.year.isin([2017, 2018, 2019])]
monthly = sns.countplot(x="month", hue='year', data=cl).set_title("Let's break down month and year")
yearly = sns.countplot(x="year", data=cleaned).set_title('rat reporting by year')
dow = sns.countplot(x="day-of-week", data=cleaned).set_title('rat reporting by day of week')
plt.xticks(range(7),['M','Tu','W','Th','F','Sa','Su'])
plt.xticks(rotation=30)
# keep only the five named boroughs; any other borough value is dropped
by_borough = cleaned[cleaned.borough.isin(['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND'])]
by_borough = by_borough[by_borough.year.isin([2017, 2018, 2019])]  # integer years, matching dt.year
borough = sns.countplot(x='borough', hue='year', data=by_borough).set_title('Calls by borough and year')
plt.xticks(rotation=30)
# order= limits the chart to the ten zip codes with the most calls
zips = sns.countplot(x='incident_zip', hue='year', data=cleaned,
                     order=cleaned.incident_zip.value_counts().iloc[:10].index).set_title('zips with the most calls')
# one row per zip code with the number of calls from that zip
for_map = cleaned.groupby('incident_zip').size().reset_index()
for_map = for_map.rename(index=str, columns={"incident_zip": 'zip', 0: 'count'})
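The groupby step above boils down to counting how many rows share each zip code. Here is a plain-Python sketch of the same idea on a handful of made-up zip codes, using collections.Counter in place of pandas:

```python
from collections import Counter

# Made-up zip codes standing in for the incident_zip column
incident_zips = ["11221", "10025", "11221", "11238", "10025", "11221"]

# Tally how many times each zip code appears
counts = Counter(incident_zips)

# Same shape as for_map: one row per zip with its call count
rows = [{"zip": z, "count": n} for z, n in sorted(counts.items())]
for row in rows:
    print(row)
```

Each dictionary plays the role of one row in for_map, ready to be matched against the zip code boundaries on the map.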
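!pip install -q geopandas
!pip install -q descartes
import geopandas as gpd
# zip code boundary shapefile; named nyc_map to avoid shadowing Python's built-in map()
nyc_map = gpd.read_file('ZIP_CODE_040114.shp')
# attach the call counts to each zip code shape
merged = nyc_map.set_index("ZIPCODE").join(for_map.set_index("zip"))
merged = merged.fillna(0)  # bye bye NaN: zip codes with no calls get 0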
vmin, vmax = 0, 4000  # color scale limits shared by the map and the colorbar
fig, ax = plt.subplots(1, figsize=(10, 6))
# pass vmin/vmax so the map colors match the colorbar drawn below
merged.plot(column='count', cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8', vmin=vmin, vmax=vmax)
ax.axis('off')
ax.set_title('number of rat 311 calls by zipcode')
ax.annotate('Source: NYC Open Data, 2019', xy=(0.1, .08), xycoords='figure fraction', horizontalalignment='left', verticalalignment='top', fontsize=12, color='#555555')
# build a colorbar by hand from the same colormap and limits
sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm.set_array([])
cbar = fig.colorbar(sm, ax=ax)

