The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.
The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip
file contains the fruit and vegetable files in the directories fruit
and vegetables
, respectively.
Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:
type | food | form | price_per_lb | yield | lb_per_cup | price_per_cup |
---|---|---|---|---|---|---|
fruit | watermelon | Fresh1 | 0.333412 | 0.52 | 0.330693 | 0.212033 |
fruit | cantaloupe | Fresh1 | 0.535874 | 0.51 | 0.374786 | 0.3938 |
vegetables | onions | Fresh1 | 1.03811 | 0.9 | 0.35274 | 0.406868 |
... |
It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:
str
methodsos
os.path
read_excel()
, concat()
, .fillna()
, .str
, plotting methodsAsk questions and search the documentation/web to find the functions you need.
from os import listdir
import pandas as pd
import numpy as np
'''Gets all file names'''
fruits = listdir('C:/Python27/mod1/fruit')
allfiles = []
for i in range(0, len(fruits)):
allfiles.append('C:/Python27/mod1/fruit/' + fruits[i])
'''Gets all the names of the fruits and strips the .xlsx which is the last 5 characters
For whatever reason, the dot has trouble being stripped together'''
allfruits = []
for i in range(0, len(fruits)):
allfruits.append(fruits[i])
allfruits[i] = allfruits[i].rstrip(allfruits[i][-4:])
allfruits[i] = allfruits[i].rstrip(allfruits[i][-1])
'''Creates empty lists
I extract the fresh column in each file which happens to be all in the same place and stores each value I need into a different list'''
price_per_lb =[]
yields = []
lb_per_cup = []
price_per_cup = []
for i in range(0, len(fruits)):
tempfile = pd.read_excel(allfiles[i])
fresh = tempfile.loc[2, :]
price_per_lb.append(float(fresh[1]))
yields.append(float(fresh[3]))
lb_per_cup.append(float(fresh[4]))
price_per_cup.append(float(fresh[6]))
'''Copies the fruit text the number of fruits there are for data frame creation'''
category = []
for i in range(0, len(fruits)):
category.append("fruit")
'''Creates fruit dataframe'''
dffruit = pd.DataFrame({'Type': category, 'Name': allfruits, 'price/lb': price_per_lb, 'yield': yields, 'lb/cup': lb_per_cup, 'price/cup': price_per_cup})
dffruit
Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.
Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?
You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.
from os import listdir
import pandas as pd
import numpy as np
'''Gets all file names'''
vegetables = listdir('C:/Python27/mod1/vegetables')
allfiles2 = []
for i in range(0, len(vegetables)):
allfiles2.append('C:/Python27/mod1/vegetables/' + vegetables[i])
'''Gets all the names of the fruits and strips the .xlsx which is the last 5 characters'''
allveggies = []
for i in range(0, len(vegetables)):
allveggies.append(vegetables[i])
allveggies[i] = allveggies[i].rstrip(allveggies[i][-4:])
allveggies[i] = allveggies[i].rstrip(allveggies[i][-1])
'''Creates empty lists
I extract the fresh column in each file which happens to be all in the same place and stores each value I need into a different list'''
price_per_lb2 =[]
yields2 = []
lb_per_cup2 = []
price_per_cup2 = []
for i in range(0, len(vegetables)):
tempfile2 = pd.read_excel(allfiles2[i])
fresh2 = tempfile2.loc[2, :]
'''I noticed there was an error with one of the prices as a string so I converted it to NaN string instead'''
try:
price_per_lb2.append(float(fresh2[1]))
except ValueError:
price_per_lb2.append(np.nan)
'''Same with yields and other values'''
try:
yields2.append(float(fresh2[3]))
except ValueError:
yields2.append(np.nan)
try:
lb_per_cup2.append(float(fresh2[4]))
except ValueError:
lb_per_cup2.append(np.nan)
price_per_cup2.append(float(fresh2[6]))
'''Copies the fruit text the number of fruits there are for data frame creation'''
category2 = []
for i in range(0, len(vegetables)):
category2.append("vegetable")
'''Creates fruit dataframe'''
dfveg = pd.DataFrame({'Type': category2, 'Name': allveggies, 'price/lb': price_per_lb2, 'yield': yields2, 'lb/cup': lb_per_cup2, 'price/cup': price_per_cup2})
dfveg
Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).
import numpy as np
'''Drops cocktail'''
newdffruit = dffruit.loc[np.isnan(dffruit['price/lb']) == False]
newdffruit
'''Drop all the NaN vegetable observations'''
newdfveg = dfveg.loc[np.isnan(dfveg['price/lb']) == False]
newdfveg
'''Merge dataframes'''
foods = [newdffruit, newdfveg]
dffoods = pd.concat(foods)
dffoods
Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.
fruit = dffoods.loc[dffoods['Type'] == "fruit"]
maxprice = fruit['price/lb'].max()
maxfruit = dffoods.loc[dffoods['price/lb'] == maxprice]
'''I do not know why this won't return a string if I only grab the name. Had to look this up on Stack Overflow
http://stackoverflow.com/questions/16729574/how-to-get-a-value-from-a-cell-of-a-data-frame'''
maxfruit = maxfruit.iloc[0]['Name']
minprice = fruit['price/lb'].min()
minfruit = dffoods.loc[dffoods['price/lb'] == minprice]
minfruit = minfruit.iloc[0]['Name']
print maxfruit, maxprice, minfruit, minprice
I separated the dataset back into the two parts, one containing all the fruits and one containing all the vegetables. From there I got the minimum and maximum prices and found which fruits and vegetables they belonged to. Then I found that the most expensive fruit per pound are raspberries and the least expensive fruit per pound are watermelons. I did take some of the observations out containing N/A for the prices in the dataset so this could possibly be wrong and that not all the types of fruits and vegetables are in the dataset.
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
vegetable = dffoods.loc[dffoods['Type'] == "vegetable"]
plt.hist(fruit['price/lb'], bins = 20, alpha = 0.4, label = "Fruits")
plt.hist(vegetable['price/lb'], bins = 20, alpha = 0.4, label = "Vegetables")
plt.legend()
plt.show()
I plotted a histogram for both fruits and vegetables to get a sense of the price distributions and the number of fruits or vegetables at each price. From this plot, it looks like fruits are more expensive than vegetables in general. Most fruits and vegetables cost around 0 to less than 4 dollars in terms of price per pound and based on the plot, only fruits costs more than 5 dollars per pound. Most of the vegetables seem to cost around 50 cents to 1.5 dollars per pound. There could possibly be more vegetables that are as expensive as the fruits in the outlier parts of the graph as some observations are taken out of the dataset.
ratio = dffoods['price/lb']/dffoods['yield']
dffoods['ratio'] = ratio
dffoods
maxratio = dffoods['ratio'].max()
bestvalue = dffoods.loc[dffoods['ratio'] == maxratio]
bestvalue = bestvalue.iloc[0]['Name']
minratio = dffoods['ratio'].min()
worstvalue = dffoods.loc[dffoods['ratio'] == minratio]
worstvalue = worstvalue.iloc[0]['Name']
print bestvalue, maxratio, worstvalue, minratio
I interpreted "best value" for the price as having the best price/yield ratio. I first took a ratio of all the prices and yields and divided them then appended that to a column in my data frame which from there, I got the foods with best and worst ratio. The food with the best value seems to be raspberries and worst is lentils.
maxvegprice = vegetable['price/lb'].max()
maxveg = dffoods.loc[dffoods['price/lb'] == maxvegprice]
maxveg = maxveg.iloc[0]['Name']
minvegprice = vegetable['price/lb'].min()
minveg = dffoods.loc[dffoods['price/lb'] == minvegprice]
minveg = minveg.iloc[0]['Name']
print maxveg, maxvegprice, minveg, minvegprice
expveg = dfveg.loc[dfveg['price/lb'] > 4]
expfruit = dffruit.loc[dffruit['price/lb'] > 4]
cheapveg = dfveg.loc[dfveg['price/lb'] < 1]
cheapfruit = dffruit.loc[dffruit['price/lb'] < 1]
print expveg, expfruit
print cheapveg, cheapfruit
Something that surprised me in the dataset is that there are more fruits that cost a lot than vegetables are. I expected berries to cost more than 4 dollars per pound but did not expect olives, dates, and figs to cost so much too, especially figs which is $5.74 per pound. I found that pineapples costs less than 1 dollar per pound as well which made me surprised because they usually cost a lot more in the market from what I've seen. I'm also surprised that corn isn't in the list of the cheaper vegetables because I've learned from previous classes that producers generally produce a lot of corn, potatoes, and beans so I'd expect them to have cheaper prices. From what I just saw for the more expensive and cheaper foods, most beans and potato varieties are in the list of cheaper vegetables I've got but corn isn't one of them.
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
fruit = dffoods.loc[dffoods['Type'] == "fruit"]
vegetable = dffoods.loc[dffoods['Type'] == "vegetable"]
plt.hist(fruit['ratio'], bins = 20, alpha = 0.4, label = "Fruits")
plt.hist(vegetable['price/lb'], bins = 20, alpha = 0.4, label = "Vegetables")
plt.legend()
plt.show()
bval = dffoods.loc[dffoods['ratio'] > 4]
wval = dffoods.loc[dffoods['ratio'] < 1]
print bval
print wval
From the histogram above, I think that foods with the best values include blackberries, blueberries, cranberries, dates, figs, raspberries, okra, olives, sweet corn, asparagus, and artichokes and the foods with the worst values are bananas, watermelons, potatoes, lentils, and cabbages. I took ratios of the price per yield to determine this since this is what I interpreted as best combination for the yield and price.