In this assignment, you'll combine the assignment 3 data set with nutrition data from the USDA Food Composition Databases. The CSV file fresh.csv
contains the fresh fruits and vegetables data you extracted in assignment 3.
The USDA Food Composition Databases have a documented web API that returns data in JSON format . You need a key in order to use the API. Only 1000 requests are allowed per hour, so it would be a good idea to use caching.
Sign up for an API key here. The key will work with any Data.gov API. You may need the key again later in the quarter, so make sure you save it.
These modules may be useful:
import pandas as pd
fresh = pd.read_csv("fresh.csv")
fresh
fresh.loc[fresh.form == 'Fresh red cabbage1', 'food'] = 'cabbage_red'
fresh.loc[fresh.form == 'Fresh, consumed with peel1', 'food'] = 'cucumber_with_peel'
fresh.loc[fresh.form == 'Fresh, peeled1', 'food'] = 'cucumber_peeled'
fresh
Exercise 1.1. Read the search request documentation, then write a function called ndb_search()
that makes a search request. The function should accept the search term as an argument. The function should return the search result items as a list (for 0 items, return an empty list).
Note that the search url is: https://api.nal.usda.gov/ndb/search
As an example, a search for "quail eggs"
should return this list:
[{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'CHAOKOH, QUAIL EGG IN BRINE, UPC: 044738074186',
u'ndbno': u'45094707',
u'offset': 0},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'L&W, QUAIL EGGS, UPC: 024072000256',
u'ndbno': u'45094890',
u'offset': 1},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'BUDDHA, QUAIL EGGS IN BRINE, UPC: 761934535098',
u'ndbno': u'45099560',
u'offset': 2},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'GRAN SABANA, QUAIL EGGS, UPC: 819140010103',
u'ndbno': u'45169279',
u'offset': 3},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u"D'ARTAGNAN, QUAIL EGGS, UPC: 736622102630",
u'ndbno': u'45178254',
u'offset': 4},
{u'ds': u'SR',
u'group': u'Dairy and Egg Products',
u'name': u'Egg, quail, whole, fresh, raw',
u'ndbno': u'01140',
u'offset': 5}]
As usual, make sure you document and test your function.
from urllib2 import Request, urlopen
import json
# Gets search URL and gets request function
def ndb_search(key, search):
search = list(search)
for i in range (0, len(search)):
if search[i] == ' ':
search[i] = '_'
search = "".join(search)
search = str(search)
searchurl = 'https://api.nal.usda.gov/ndb/search?format=json&q=' + search + '&sort=n&max=25&offset=0&api_key=' + key
request = Request(searchurl)
response = urlopen(request)
rr = response.read()
js = json.loads(rr)
return js[u'list']
results = ndb_search('A8tjHzrA9pJQTudS94B6XSGMB3BzXeSSwrgUGlix', 'quail eggs')
results
Exercise 1.2. Use your search function to get NDB numbers for the foods in the fresh.csv
file. It's okay if you don't get an NDB number for every food, but try to come up with a strategy that gets most of them. Discuss your strategy in a short paragraph.
Hints:
ndb_search()
to a data frame with pd.DataFrame()
.pd.merge()
.# Get list of foods and appends raw next the them
foods = (fresh[['food']])
len(foods.index)
foodlist = []
for i in range(0, len(foods.index)):
foodlist.append(foods.iloc[i]['food'])
foodlist[i] = foodlist[i] + ' raw'
foodlist[0:4]
# Gets list of observations
observations = []
for i in range(0, len(foodlist)):
try:
observations.append(ndb_search('A8tjHzrA9pJQTudS94B6XSGMB3BzXeSSwrgUGlix', foodlist[i]))
except KeyError:
continue
numbers = pd.DataFrame(observations)
numbers.head()
# Delete raw and make new dataframe
freshfood = (numbers[['q']])
len(freshfood.index)
freshlist = []
for i in range(0, len(freshfood.index)):
freshlist.append(str(freshfood.iloc[i]['q']))
freshlist[i] = freshlist[i].rstrip('_raw')
freshlist[0:4]
# Get numbers in dataframe
num = (numbers[['item']])
len(freshfood.index)
# The [14:19] of the string returns the number for each observation so I get that into a list
str(num.iloc[0]['item'])[14:19]
numlist = []
for i in range(0, len(num.index)):
numlist.append(str(num.iloc[i]['item'])[14:19])
numlist[0:4]
# Make new dataframe with only the numbers and foods
numdf = pd.DataFrame({'food': freshlist, 'number': numlist}, index = range(0, len(numlist)))
numdf.head()
# Merge the two dataframes together
newdf = pd.merge(fresh, numdf, on = 'food')
newdf.head()
For this exercise, I first extracted the foods from the fresh dataframe and appended the string ' raw' to them so that I could use them for my search. I searched through the whole list of values and got the item object which contains the number in them in a new data frame. I extracted the numbers and foods and stripped the raw from them back so that I can make a new dataframe with only the foods and numbers. Then I merged the two dataframes together using the food as my key. I also converted some of the duplicate foods such as cucumbers and cabbage to their specific names to make this more accurate.
Exercise 1.3. Read the food reports V2 documentation, then write a function called ndb_report()
that requests a basic food report. The function should accept the NDB number as an argument and return the list of nutrients for the food.
Note that the report url is: https://api.nal.usda.gov/ndb/V2/reports
For example, for "09279"
(raw plums) the first element of the returned list should be:
{u'group': u'Proximates',
u'measures': [{u'eqv': 165.0,
u'eunit': u'g',
u'label': u'cup, sliced',
u'qty': 1.0,
u'value': u'143.93'},
{u'eqv': 66.0,
u'eunit': u'g',
u'label': u'fruit (2-1/8" dia)',
u'qty': 1.0,
u'value': u'57.57'},
{u'eqv': 151.0,
u'eunit': u'g',
u'label': u'NLEA serving',
u'qty': 1.0,
u'value': u'131.72'}],
u'name': u'Water',
u'nutrient_id': u'255',
u'unit': u'g',
u'value': u'87.23'}
Be sure to document and test your function.
# Gets search URL and gets request function
def ndb_report(key, search):
searchurl = 'https://api.nal.usda.gov/ndb/V2/reports?ndbno=' + search + '&type=f&format=json&api_key=' + key
request = Request(searchurl)
response = urlopen(request)
rr = response.read()
js = json.loads(rr)
return js[u'foods'][0][u'food'][u'nutrients'][0]
report = ndb_report('A8tjHzrA9pJQTudS94B6XSGMB3BzXeSSwrgUGlix', '09279')
report
Exercise 1.4. Which foods provide the best combination of price, yield, and nutrition? You can use kilocalories as a measure of "nutrition" here, but more a detailed analysis is better. Use plots to support your analysis.
# Gets nutrients
nutrients = []
for i in range(0, len(newdf.index)):
try:
nutrients.append(ndb_report('A8tjHzrA9pJQTudS94B6XSGMB3BzXeSSwrgUGlix', newdf.iloc[i]['number']))
except KeyError:
continue
nutrients
nutdf = pd.DataFrame(nutrients)
nutdf.head()
# Get water values in dataframe
nutvalue = (nutdf[['value']])
nutvalue.head()
# Since I ran the basic report function from the foods I originally got, the each value should correspond with the correct food
newdf = newdf.join(nutvalue)
newdf.head()
price_per_lb = newdf[['price_per_lb']]
price_per_lb.head()
yields = newdf[['yield']]
yields.head()
water = newdf[['value']]
water.head()
from matplotlib import pyplot as plt
plt.style.use('ggplot')
plt.hist(price_per_lb['price_per_lb'], label = "Price")
plt.show()
plt.hist(yields['yield'], label = "Yield")
plt.show()
plt.hist(water['value'], label = "Water Value")
plt.show()
ratio = price_per_lb['price_per_lb']/yields['yield']
newdf['ratio'] = ratio
ratio = newdf[['ratio']]
plt.hist(ratio['ratio'], label = "Price and Yield Ratio")
plt.show()
cheapfoods = newdf.loc[newdf['price_per_lb'] < 1]
highyields = newdf.loc[newdf['yield'] > 0.9]
highratio = newdf.loc[newdf['ratio'] > 5]
highwater = newdf.loc[newdf['value'] > 90]
print cheapfoods.iloc[0:len(cheapfoods.index)]['food']
print highyields.iloc[0:len(highyields.index)]['food']
print highwater.iloc[0:len(highwater.index)]['food']
print highratio.iloc[0:len(highratio.index)]['food']
Based on the foods I found with the cheapest prices, high yields, and high nutrition value based on water value, I think that the foods with the best value are watermelons, cantaloupes, nectarines, red cabbages, cucumbers with peels, and asparagus. This may not be accurate since I'm only basing it off of the water value instead of kilocalories. I chose water level instead of kilocalories because having too much kilocalorie in one food may not be beneficial to the body. This could be more accurate if other foods and other nutrients were taken into account but having too much of one nutrients is not beneficial to the body so this could be found by perhaps setting a healthy range of nutrients in order to analyze this. I interpreted "best value" as having a good amount of all of these things so I found the prices, yields, and ratio of price and yields along with the chosen nutrient value in and found a list of commonly shared items in the list. I used a histogram in order to find the ranges of the prices, yield, ratios, and nutrition value but I think in this case, a 3D plot would be better to show which foods have a lower price, high yield and high nutrition value in order to find the points that fit in those area. Another thing that would be better if I could find a list of foods using an and statement that fit all the criteria where I got a value error when I tried comparing in multiple columns.