In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:
Table | Description |
---|---|
crime | Crime reports dating back to 2010. |
mobile_food_locations | List of all locations where mobile food vendors sell. |
mobile_food_permits | List of all mobile food vendor permits. More details here. |
mobile_food_schedule | Schedules for mobile food vendors. |
noise | Noise complaints dating back to August 2015. |
parking | List of all parking lots. |
parks | List of all parks. |
schools | List of all schools. |
zillow | Zillow rent and housing statistics dating back to 1996. More details here. |
The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.
Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.
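One possible way to turn latitude/longitude pairs into postal codes is a point-in-polygon spatial join against the ZIP code shapefile. The sketch below is not part of the assignment's provided code; it assumes a recent geopandas is installed, that the shapefile has been downloaded locally, and that the code column is named ZCTA5CE10 (the usual name in the Census ZCTA shapefile) — the path and column name are assumptions.

import geopandas as gpd

# Load the ZIP code (ZCTA) polygons; the filename is an assumed local path
# for the shapefile downloaded from the link above.
zips = gpd.read_file("tl_2010_us_zcta510.shp")

def latlon_to_zip(df, lon_col="Lon", lat_col="Lat"):
    """Attach a ZIP-like code to each row of df via a spatial join."""
    points = gpd.GeoDataFrame(
        df.copy(),
        geometry=gpd.points_from_xy(df[lon_col], df[lat_col]),
        crs="EPSG:4326",
    ).to_crs(zips.crs)
    # ZCTA5CE10 is the assumed name of the ZIP-like code column in the shapefile.
    return gpd.sjoin(points, zips[["ZCTA5CE10", "geometry"]], how="left", predicate="within")

Any table with Lat/Lon columns (parking, noise, and most crime records) could then be grouped by the joined code and lined up with the zillow table's RegionName.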
Shapefiles for San Francisco Neighborhoods are available here.
Exercise 1.1. Which mobile food vendor(s) sells at the most locations?
import sqlite3 as sql
db = sql.connect("sf_data.sqlite")
cursor = db.execute("SELECT * FROM sqlite_master")
cursor.fetchall()
import pandas as pd
pd.read_sql("SELECT * FROM sqlite_master", db)
mob_loc = pd.read_sql("SELECT * FROM mobile_food_locations", db)
mob_per = pd.read_sql("SELECT * FROM mobile_food_permits", db)
mob_sch = pd.read_sql("SELECT * FROM mobile_food_schedule", db)
mob_loc.head()
mob_per.head()
mob_sch.head()
print(len(mob_loc.index), len(mob_sch.index), len(mob_per.index))
newdf = pd.read_sql("""
    SELECT *
    FROM mobile_food_locations
    INNER JOIN mobile_food_schedule
        ON mobile_food_locations.locationid = mobile_food_schedule.locationid
    INNER JOIN mobile_food_permits
        ON mobile_food_permits.permit = mobile_food_schedule.permit
""", db)
newdf.head()
# Vendor (Applicant) with the most rows in the joined schedule/location data
newdf['Applicant'].value_counts().idxmax()
May Catering sells at the most locations.
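Because a vendor can appear in the schedule several times at the same location, counting distinct locations directly is a useful cross-check on the row count above. This is a sketch of one way to do it in SQL, assuming the Applicant column comes from the permits table (as the joined frame above suggests).

# Cross-check: distinct locations per vendor
per_vendor = pd.read_sql("""
    SELECT p.Applicant, COUNT(DISTINCT s.locationid) AS n_locations
    FROM mobile_food_schedule AS s
    INNER JOIN mobile_food_permits AS p ON p.permit = s.permit
    GROUP BY p.Applicant
    ORDER BY n_locations DESC
""", db)
per_vendor.head()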
Exercise 1.2. Ask 5 questions about San Francisco and use the database to analyze them. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.
You should try to come up with some questions on your own, but these are examples of reasonable questions:
Please make sure to clearly state each of your questions in your submission.
from mpl_toolkits.basemap import Basemap
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
Which types of schools does San Francisco have most of and least of?
schools = pd.read_sql("SELECT * FROM schools", db)
schools.head()
# Most common grade range among San Francisco schools
schools['GradeRange'].value_counts().idxmax()
A majority of schools in San Francisco cover the K-5 through K-8 ranges, so most of them are elementary schools. Schools with unusual grade ranges (outside the typical elementary, middle, and high school ranges) are in the minority.
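To support this with a plot, a bar chart of the grade-range counts shows both the most and least common types; this is a minimal sketch using only the GradeRange column already loaded above.

# Bar chart of how many schools fall into each grade range
grade_counts = schools['GradeRange'].value_counts()
plt.figure(figsize=(12, 5))
grade_counts.plot(kind='bar')
plt.xlabel('Grade range')
plt.ylabel('Number of schools')
plt.title('San Francisco schools by grade range')
plt.show()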
Which areas in San Francisco are the best areas to find parking?
parking = pd.read_sql("SELECT * FROM parking", db)
parking.head()
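Before picking a capacity cutoff, it helps to look at how lot capacities are distributed. This is a rough sketch using only the RegCap column; the pd.to_numeric cast is a precaution in case the column is stored as text.

# Distribution of registered parking capacity across all lots
regcap = pd.to_numeric(parking['RegCap'], errors='coerce').dropna()
plt.figure(figsize=(10, 5))
plt.hist(regcap, bins=50)
plt.xlabel('Registered capacity (spaces)')
plt.ylabel('Number of lots')
plt.title('Distribution of parking lot capacities')
plt.show()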
lotspace = parking.loc[parking['RegCap'] > 500]
plt.figure(figsize=(15,10))
displaymap = Basemap(llcrnrlon=-122.5,llcrnrlat=37.7,urcrnrlon=-122.4,urcrnrlat=37.8, resolution = "h", projection ="merc")
# Convert longitude and latitude columns to lists of floats
longitude = lotspace['Lon'].astype(float).tolist()
latitude = lotspace['Lat'].astype(float).tolist()
displaymap.drawmapboundary()
displaymap.drawcountries()
displaymap.drawcoastlines()
lons,lats = displaymap(longitude, latitude)
displaymap.plot(lons, lats, 'bo', color = "blue")
I defined the best parking areas as lots where RegCap, the registered parking capacity, is greater than 500 spaces. From the map of San Francisco, most of those lots are in the northeastern part of the city.
Which areas in San Francisco have the most noise? Which type of noise is reported most often, and which areas have the most complaints of that type?
noise = pd.read_sql("SELECT * FROM noise", db)
noise.head()
# Drop complaints with missing coordinates
noiseknown = noise.loc[noise['Lat'].notna()]
plt.figure(figsize=(15,10))
displaymap = Basemap(llcrnrlon=-122.5,llcrnrlat=37.7,urcrnrlon=-122.4,urcrnrlat=37.8, resolution = "h", projection ="merc")
# Convert longitude and latitude columns to lists of floats
longitude = noiseknown['Lon'].astype(float).tolist()
latitude = noiseknown['Lat'].astype(float).tolist()
displaymap.drawmapboundary()
displaymap.drawcountries()
displaymap.drawcoastlines()
lons,lats = displaymap(longitude, latitude)
displaymap.plot(lons, lats, 'bo', color = "blue")
From the map, noise complaints occur throughout San Francisco, but the eastern part of the city appears to have considerably more than the western part.
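As a rough quantitative check on the map, the complaints can be split at a longitude threshold and counted on each side; the -122.44 dividing line below is an arbitrary assumption, not something given by the data.

# Count complaints east vs west of an (assumed) dividing longitude
lon = noiseknown['Lon'].astype(float)
print((lon > -122.44).sum(), "complaints east of -122.44 vs", (lon <= -122.44).sum(), "west of it")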
# Most common noise complaint type
noise['Type'].value_counts().idxmax()
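The single most common label can hide how the categories compare, so a bar chart of complaint counts by type is a useful companion plot; this sketch uses only the Type column already loaded above.

# Complaint counts by noise type
type_counts = noise['Type'].value_counts()
plt.figure(figsize=(12, 5))
type_counts.plot(kind='bar')
plt.xlabel('Noise type')
plt.ylabel('Number of complaints')
plt.title('Noise complaints by type')
plt.show()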
noisecon = noiseknown[noiseknown['Type'] == 'construction_private_property']
plt.figure(figsize=(15,10))
displaymap = Basemap(llcrnrlon=-122.5,llcrnrlat=37.7,urcrnrlon=-122.4,urcrnrlat=37.8, resolution = "h", projection ="merc")
# Convert longitude and latitude columns to lists of floats
longitude = noisecon['Lon'].astype(float).tolist()
latitude = noisecon['Lat'].astype(float).tolist()
displaymap.drawmapboundary()
displaymap.drawcountries()
displaymap.drawcoastlines()
lons,lats = displaymap(longitude, latitude)
displaymap.plot(lons, lats, 'bo', color = "blue")
Aside from the catch-all category for other types of excessive noise, the type that generates the most complaints is construction on private property. The points for construction noise are spread throughout the map of San Francisco, but most of them are concentrated in the northeastern area of the city.
Which parts of the city are the least expensive and which are the most expensive?
zillow = pd.read_sql("SELECT * FROM zillow", db)
zillow.head()
# Keep rows where the median sold price per square foot is known
priceknown = zillow.loc[zillow['MedianSoldPricePerSqft_AllHomes'].notna()]
maxprice = priceknown['MedianSoldPricePerSqft_AllHomes'].max()
expensive = priceknown.loc[priceknown['MedianSoldPricePerSqft_AllHomes'] == maxprice]
minprice = priceknown['MedianSoldPricePerSqft_AllHomes'].min()
cheap = priceknown.loc[priceknown['MedianSoldPricePerSqft_AllHomes'] == minprice]
print(expensive.iloc[0]['RegionName'], cheap.iloc[0]['RegionName'])
I based my answer on the median sold price per square foot for all homes in each region. The most expensive homes are in the 94104 ZIP code and the cheapest are in the 94124 ZIP code.
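To see the full ordering rather than only the two extremes, the regions can be ranked by a summary of MedianSoldPricePerSqft_AllHomes. This sketch pools all dates together (so it understates recent prices) and uses only the columns already referenced above.

# Rank ZIP codes by their typical price per square foot over the whole series
price_by_region = (priceknown
                   .groupby('RegionName')['MedianSoldPricePerSqft_AllHomes']
                   .median()
                   .sort_values())
plt.figure(figsize=(12, 5))
price_by_region.plot(kind='bar')
plt.xlabel('ZIP code')
plt.ylabel('Median sold price per sq ft ($)')
plt.title('Typical price per square foot by ZIP code')
plt.show()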
Which types of violent crimes occur the most in San Francisco and which parts do they occur in the most?
crime = pd.read_sql("SELECT * FROM crime", db)
crime.head()
# Most frequent crime category overall
crime['Category'].value_counts().idxmax()
assault = crime[crime['Category'] == 'ASSAULT']
assault.head()
# Police district with the most assault reports
assault['PdDistrict'].value_counts().idxmax()
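A bar chart of assault counts by police district shows how the top districts compare; this is a minimal sketch using only the PdDistrict column from the assault subset above.

# Assault counts per police district
district_counts = assault['PdDistrict'].value_counts()
plt.figure(figsize=(12, 5))
district_counts.plot(kind='bar')
plt.xlabel('Police district')
plt.ylabel('Number of assault reports')
plt.title('Assault reports by police district')
plt.show()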
I defined violent crimes as those that cause bodily harm and found that assault is the most common of these. The Southern, Mission, and Bayview districts have the largest numbers of these crimes.
db.close()