I want to see what types of universities have students in debt. We will be using the college scorecard API from data.gov. I am interested in a number of variables, these are admissions rate, tuition, median household income, student size, etc.
It seems that the UC office of president claims that the UC schools are a good deal based on data from the college scorecard.
You should go begin by getting an API key for data.gov. There are many data.gov APIs that you can use this key for and they have fairly relaxed limits on the number of requests.
The documentation for the API can be found here...
https://github.com/RTICWDT/open-data-maker/blob/master/API.md
https://collegescorecard.ed.gov/data/
Let's import some packages.
from urllib2 import Request, urlopen
import pandas as pd
import json
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
Here are some variable names that we will be interested in.
varnames = """id
school.name
school.degrees_awarded.predominant
2014.admissions.admission_rate.overall
2014.cost.tuition.in_state
2014.cost.tuition.out_of_state
2014.aid.median_debt.completers.overall
2014.aid.median_debt.noncompleters
2005.student.demographics.median_hh_income
2005.student.demographics.poverty_rate
2014.aid.federal_loan_rate
2014.aid.pell_grant_rate
2014.student.size"""
fieldnames = varnames.replace("\n", ",")
Exercise (1) I would like to make a complicated query from the API. An example query from the API is
https://api.data.gov/ed/collegescorecard/v1/schools.json?school.degrees_awarded.predominant=3&_fields=school.name, 2014.aid.pell_grant_rate&_sort=2014.student.size:desc&_page=1 &_per_page=100&api_key=YOURKEY
Which can be parsed as
You can find a more complete description: https://github.com/RTICWDT/open-data-maker/blob/master/API.md
Fill in the following def that fits the following docstring, and gives the shown output.
def query_url(key,fieldnames,filterstring,sortvar,numpages): """ This takes all of the necessary parameters to form a query to college scorecard Input: key (data.gov API key, string), fieldnames (list of variable names), filterstring (string for filtering), sortvar (variable to sort with by descending), numpages (number of pages to return) Output: query url """ query_url(key,fieldnames,'school.degrees_awarded.predominant=3', '2014.student.size',2) https://api.data.gov/ed/collegescorecard/v1/schools.json?school.degrees_awarded.predominant=3&_fields=id,school.name,school.degrees_awarded.predominant,school.degree_urbanization,2014.admissions.admission_rate.overall,2014.cost.tuition.in_state,2014.cost.tuition.out_of_state,2014.aid.median_debt.completers.overall,2014.aid.median_debt.noncompleters,2005.student.demographics.median_hh_income,2005.student.demographics.poverty_rate,2014.aid.federal_loan_rate,2014.aid.pell_grant_rate,2014.student.size&_page=2&_per_page=100&_sort=2014.student.size:desc&api_key=YOURKEYHERE
Try to reproduce the above behavior.
def query_url(key,fieldnames,filterstring,sortvar,numpages):
base_url = 'https://api.data.gov/ed/collegescorecard/v1/schools.json?'
numpages = str(numpages)
url = base_url + filterstring + '&_fields=' + fieldnames + '&_page=' + numpages + '&_per_page=100' + '&_sort=' + sortvar + ':desc&api_key=' + key
return url
searchurl = query_url('A8tjHzrA9pJQTudS94B6XSGMB3BzXeSSwrgUGlix', fieldnames, 'school.degrees_awarded.predominant=3', '2014.student.size', 1)
searchurl
Exercise (2) Make the query using Request and load the results into a DataFrame that makes sense.
request = Request(searchurl)
response = urlopen(request)
rr = response.read()
print rr[0:400]
js = json.loads(rr)
type(js)
js.keys()
observations = js[u'results']
college = pd.DataFrame(data = observations)
college
searchurl = query_url('A8tjHzrA9pJQTudS94B6XSGMB3BzXeSSwrgUGlix', fieldnames, 'school.degrees_awarded.predominant=3', '2014.student.size', 2)
request = Request(searchurl)
response = urlopen(request)
rr = response.read()
js = json.loads(rr)
observations = js[u'results']
collegedfs = pd.DataFrame(data = observations)
college = college.append(collegedfs)
college
college = college.rename(columns=lambda i: ['Median Income', 'Poverty Rate', 'Admission Rate', 'Federal Loan Rate', 'Mean Debt Completers', 'Mean Debt Noncompleters', 'Pell Grant Rate', 'In State Tuition', 'Out State Tuition', 'Size', 'ID', 'School Degrees', 'School Name'][i])
print len(college.columns), len(college.index)
Exercise (3) Produce scatterplots to demonstrate the associations between student debt (for 'completers') and median household income, pell grant rates, student poverty rate, tuition, and federal loan rate. Do you have an explanation for the associations that you are seeing? Are there some surprising associations.
from pandas.tools.plotting import scatter_matrix
debt = college[['2014.aid.median_debt.completers.overall']]
income = college[['2005.student.demographics.median_hh_income']]
plt.scatter(income, debt)
plt.show
scatter_matrix(college[['2014.aid.median_debt.completers.overall', '2005.student.demographics.median_hh_income', '2014.aid.pell_grant_rate', '2005.student.demographics.poverty_rate', '2014.cost.tuition.in_state', '2014.aid.federal_loan_rate']], figsize = (16,16))