Assignment 3¶

The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.

The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip file contains the fruit and vegetable files in the directories fruit and vegetables, respectively.

Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:

type	food	form	price_per_lb	yield	lb_per_cup	price_per_cup
fruit	watermelon	Fresh1	0.333412	0.52	0.330693	0.212033
fruit	cantaloupe	Fresh1	0.535874	0.51	0.374786	0.3938
vegetables	onions	Fresh1	1.03811	0.9	0.35274	0.406868
...

It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:

str methods
os
os.path
pandas: read_excel(), concat(), .fillna(), .str, plotting methods

Ask questions and search the documentation/web to find the functions you need.

from os import listdir
import pandas as pd
import numpy as np
'''Gets all file names'''
fruits = listdir('C:/Python27/mod1/fruit')
allfiles = []
for i in range(0, len(fruits)):
    allfiles.append('C:/Python27/mod1/fruit/' + fruits[i])
'''Gets all the names of the fruits and strips the .xlsx which is the last 5 characters
For whatever reason, the dot has trouble being stripped together'''
allfruits = []
for i in range(0, len(fruits)):
    allfruits.append(fruits[i])
    allfruits[i] = allfruits[i].rstrip(allfruits[i][-4:])
    allfruits[i] = allfruits[i].rstrip(allfruits[i][-1])
'''Creates empty lists
I extract the fresh column in each file which happens to be all in the same place and stores each value I need into a different list'''
price_per_lb =[]
yields = []
lb_per_cup = []
price_per_cup = []
for i in range(0, len(fruits)):
    tempfile = pd.read_excel(allfiles[i])
    fresh = tempfile.loc[2, :]
    price_per_lb.append(float(fresh[1]))
    yields.append(float(fresh[3]))
    lb_per_cup.append(float(fresh[4]))
    price_per_cup.append(float(fresh[6]))
'''Copies the fruit text the number of fruits there are for data frame creation'''
category = []
for i in range(0, len(fruits)):
    category.append("fruit")
'''Creates fruit dataframe'''
dffruit = pd.DataFrame({'Type': category, 'Name': allfruits, 'price/lb': price_per_lb, 'yield': yields, 'lb/cup': lb_per_cup, 'price/cup': price_per_cup})
dffruit

Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.

Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?

You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.

from os import listdir
import pandas as pd
import numpy as np
'''Gets all file names'''
vegetables = listdir('C:/Python27/mod1/vegetables')
allfiles2 = []
for i in range(0, len(vegetables)):
    allfiles2.append('C:/Python27/mod1/vegetables/' + vegetables[i])
'''Gets all the names of the fruits and strips the .xlsx which is the last 5 characters'''
allveggies = []
for i in range(0, len(vegetables)):
    allveggies.append(vegetables[i])
    allveggies[i] = allveggies[i].rstrip(allveggies[i][-4:])
    allveggies[i] = allveggies[i].rstrip(allveggies[i][-1])
'''Creates empty lists
I extract the fresh column in each file which happens to be all in the same place and stores each value I need into a different list'''
price_per_lb2 =[]
yields2 = []
lb_per_cup2 = []
price_per_cup2 = []
for i in range(0, len(vegetables)):
    tempfile2 = pd.read_excel(allfiles2[i])
    fresh2 = tempfile2.loc[2, :]
    '''I noticed there was an error with one of the prices as a string so I converted it to NaN string instead'''
    try:
        price_per_lb2.append(float(fresh2[1]))
    except ValueError:
        price_per_lb2.append(np.nan)
    '''Same with yields and other values'''
    try:
        yields2.append(float(fresh2[3]))
    except ValueError:
        yields2.append(np.nan)
    try:
        lb_per_cup2.append(float(fresh2[4]))
    except ValueError:
        lb_per_cup2.append(np.nan)
    price_per_cup2.append(float(fresh2[6]))
'''Copies the fruit text the number of fruits there are for data frame creation'''
category2 = []
for i in range(0, len(vegetables)):
    category2.append("vegetable")
'''Creates fruit dataframe'''
dfveg = pd.DataFrame({'Type': category2, 'Name': allveggies, 'price/lb': price_per_lb2, 'yield': yields2, 'lb/cup': lb_per_cup2, 'price/cup': price_per_cup2})
dfveg

Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).

import numpy as np
'''Drops cocktail'''
newdffruit = dffruit.loc[np.isnan(dffruit['price/lb']) == False]
newdffruit
'''Drop all the NaN vegetable observations'''
newdfveg = dfveg.loc[np.isnan(dfveg['price/lb']) == False]
newdfveg
'''Merge dataframes'''
foods = [newdffruit, newdfveg]
dffoods = pd.concat(foods)
dffoods

Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.

What kinds of fruits are the most expensive (per pound)? What kinds are the least expensive?
How do the price distributions compare for fruit and vegetables?
Which foods are the best value for the price?
What's something surprising about this data set?
Which foods do you expect to provide the best combination of price, yield, and nutrition? A future assignment may combine this data set with another so you can check your hypothesis.

fruit = dffoods.loc[dffoods['Type'] == "fruit"]
maxprice = fruit['price/lb'].max()
maxfruit = dffoods.loc[dffoods['price/lb'] == maxprice]
'''I do not know why this won't return a string if I only grab the name. Had to look this up on Stack Overflow
http://stackoverflow.com/questions/16729574/how-to-get-a-value-from-a-cell-of-a-data-frame'''
maxfruit = maxfruit.iloc[0]['Name']
minprice = fruit['price/lb'].min()
minfruit = dffoods.loc[dffoods['price/lb'] == minprice]
minfruit = minfruit.iloc[0]['Name']
print maxfruit, maxprice, minfruit, minprice

raspberries 6.97581079886 watermelon 0.333412035323

I separated the dataset back into the two parts, one containing all the fruits and one containing all the vegetables. From there I got the minimum and maximum prices and found which fruits and vegetables they belonged to. Then I found that the most expensive fruit per pound are raspberries and the least expensive fruit per pound are watermelons. I did take some of the observations out containing N/A for the prices in the dataset so this could possibly be wrong and that not all the types of fruits and vegetables are in the dataset.

from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
vegetable = dffoods.loc[dffoods['Type'] == "vegetable"]
plt.hist(fruit['price/lb'], bins = 20, alpha = 0.4, label = "Fruits")
plt.hist(vegetable['price/lb'], bins = 20, alpha = 0.4, label = "Vegetables")
plt.legend()
plt.show()

I plotted a histogram for both fruits and vegetables to get a sense of the price distributions and the number of fruits or vegetables at each price. From this plot, it looks like fruits are more expensive than vegetables in general. Most fruits and vegetables cost around 0 to less than 4 dollars in terms of price per pound and based on the plot, only fruits costs more than 5 dollars per pound. Most of the vegetables seem to cost around 50 cents to 1.5 dollars per pound. There could possibly be more vegetables that are as expensive as the fruits in the outlier parts of the graph as some observations are taken out of the dataset.

ratio = dffoods['price/lb']/dffoods['yield']
dffoods['ratio'] = ratio
dffoods
maxratio = dffoods['ratio'].max()
bestvalue = dffoods.loc[dffoods['ratio'] == maxratio]
bestvalue = bestvalue.iloc[0]['Name']
minratio = dffoods['ratio'].min()
worstvalue = dffoods.loc[dffoods['ratio'] == minratio]
worstvalue = worstvalue.iloc[0]['Name']
print bestvalue, maxratio, worstvalue, minratio

raspberries 7.26646958214 lentils 0.509936423192

I interpreted "best value" for the price as having the best price/yield ratio. I first took a ratio of all the prices and yields and divided them then appended that to a column in my data frame which from there, I got the foods with best and worst ratio. The food with the best value seems to be raspberries and worst is lentils.

maxvegprice = vegetable['price/lb'].max()
maxveg = dffoods.loc[dffoods['price/lb'] == maxvegprice]
maxveg = maxveg.iloc[0]['Name']
minvegprice = vegetable['price/lb'].min()
minveg = dffoods.loc[dffoods['price/lb'] == minvegprice]
minveg = minveg.iloc[0]['Name']
print maxveg, maxvegprice, minveg, minvegprice

olives 4.18683171458 potatoes 0.564319813929

expveg = dfveg.loc[dfveg['price/lb'] > 4]
expfruit = dffruit.loc[dffruit['price/lb'] > 4]
cheapveg = dfveg.loc[dfveg['price/lb'] < 1]
cheapfruit = dffruit.loc[dffruit['price/lb'] < 1]
print expveg, expfruit
print cheapveg, cheapfruit

      Name       Type    lb/cup  price/cup  price/lb  yield
32  olives  vegetable  0.297624   1.246102  4.186832    1.0             Name   Type    lb/cup  price/cup  price/lb  yield
4   blackberries  fruit  0.319670   1.922919  5.774708   0.96
5    blueberries  fruit  0.319670   1.593177  4.734622   0.95
8    cranberries  fruit  0.123163   0.589551  4.786741   1.00
9          dates  fruit  0.165347   0.792234  4.791351   1.00
10          figs  fruit  0.165347   0.990068  5.748318   0.96
25   raspberries  fruit  0.319670   2.322874  6.975811   0.96
                    Name       Type    lb/cup  price/cup  price/lb     yield
5          blackeye_peas  vegetable  0.374786   0.524954  0.910441  0.650000
6            black_beans  vegetable  0.385809   0.582025  0.980580  0.650000
10               cabbage  vegetable  0.330693   0.245944  0.579208  0.778797
17  great_northern_beans  vegetable  0.385809   0.548392  0.923916  0.650000
22          kidney_beans  vegetable  0.385809   0.535194  0.901680  0.650000
30            navy_beans  vegetable  0.385809   0.575997  0.970423  0.650000
34           pinto_beans  vegetable  0.385809   0.514129  0.866190  0.650000
35              potatoes  vegetable  0.264555   0.184017  0.564320  0.811301
41        sweet_potatoes  vegetable  0.440925   0.499400  0.918897  0.811301           Name   Type    lb/cup  price/cup  price/lb  yield
2      bananas  fruit  0.330693   0.292965  0.566983   0.64
6   cantaloupe  fruit  0.374786   0.393800  0.535874   0.51
12  grapefruit  fruit  0.462971   0.848278  0.897802   0.49
14    honeydew  fruit  0.374786   0.649077  0.796656   0.46
22   pineapple  fruit  0.363763   0.447686  0.627662   0.51
28  watermelon  fruit  0.330693   0.212033  0.333412   0.52

Something that surprised me in the dataset is that there are more fruits that cost a lot than vegetables are. I expected berries to cost more than 4 dollars per pound but did not expect olives, dates, and figs to cost so much too, especially figs which is $5.74 per pound. I found that pineapples costs less than 1 dollar per pound as well which made me surprised because they usually cost a lot more in the market from what I've seen. I'm also surprised that corn isn't in the list of the cheaper vegetables because I've learned from previous classes that producers generally produce a lot of corn, potatoes, and beans so I'd expect them to have cheaper prices. From what I just saw for the more expensive and cheaper foods, most beans and potato varieties are in the list of cheaper vegetables I've got but corn isn't one of them.

from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
fruit = dffoods.loc[dffoods['Type'] == "fruit"]
vegetable = dffoods.loc[dffoods['Type'] == "vegetable"]
plt.hist(fruit['ratio'], bins = 20, alpha = 0.4, label = "Fruits")
plt.hist(vegetable['price/lb'], bins = 20, alpha = 0.4, label = "Vegetables")
plt.legend()
plt.show()

bval = dffoods.loc[dffoods['ratio'] > 4]
wval = dffoods.loc[dffoods['ratio'] < 1]
print bval
print wval

            Name       Type    lb/cup  price/cup  price/lb     yield     ratio
4   blackberries      fruit  0.319670   1.922919  5.774708  0.960000  6.015321
5    blueberries      fruit  0.319670   1.593177  4.734622  0.950000  4.983812
8    cranberries      fruit  0.123163   0.589551  4.786741  1.000000  4.786741
9          dates      fruit  0.165347   0.792234  4.791351  1.000000  4.791351
10          figs      fruit  0.165347   0.990068  5.748318  0.960000  5.987831
25   raspberries      fruit  0.319670   2.322874  6.975811  0.960000  7.266470
1      artichoke  vegetable  0.385809   2.274967  2.213050  0.375309  5.896615
2      asparagus  vegetable  0.396832   2.582272  3.213494  0.493835  6.507216
15    corn_sweet  vegetable  0.363763   1.812497  2.690623  0.540000  4.982635
31          okra  vegetable  0.352740   1.473146  3.213552  0.769474  4.176299
32        olives  vegetable  0.297624   1.246102  4.186832  1.000000  4.186832
          Name       Type    lb/cup  price/cup  price/lb     yield     ratio
2      bananas      fruit  0.330693   0.292965  0.566983  0.640000  0.885912
28  watermelon      fruit  0.330693   0.212033  0.333412  0.520000  0.641177
10     cabbage  vegetable  0.330693   0.245944  0.579208  0.778797  0.743722
23     lentils  vegetable  0.385809   0.196738  1.385036  2.716095  0.509936
35    potatoes  vegetable  0.264555   0.184017  0.564320  0.811301  0.695574

From the histogram above, I think that foods with the best values include blackberries, blueberries, cranberries, dates, figs, raspberries, okra, olives, sweet corn, asparagus, and artichokes and the foods with the worst values are bananas, watermelons, potatoes, lentils, and cabbages. I took ratios of the price per yield to determine this since this is what I interpreted as best combination for the yield and price.

	Name	Type	lb/cup	price/cup	price/lb	yield
0	apples	fruit	0.242508	0.422373	1.567515	0.90
1	apricots	fruit	0.363763	1.189102	3.040072	0.93
2	bananas	fruit	0.330693	0.292965	0.566983	0.64
3	berries_mixed	fruit	0.330693	1.127735	3.410215	1.00
4	blackberries	fruit	0.319670	1.922919	5.774708	0.96
5	blueberries	fruit	0.319670	1.593177	4.734622	0.95
6	cantaloupe	fruit	0.374786	0.393800	0.535874	0.51
7	cherries	fruit	0.341717	1.334548	3.592990	0.92
8	cranberries	fruit	0.123163	0.589551	4.786741	1.00
9	dates	fruit	0.165347	0.792234	4.791351	1.00
10	figs	fruit	0.165347	0.990068	5.748318	0.96
11	fruit_cocktail	fruit	NaN	NaN	NaN	NaN
12	grapefruit	fruit	0.462971	0.848278	0.897802	0.49
13	grapes	fruit	0.330693	0.721266	2.093827	0.96
14	honeydew	fruit	0.374786	0.649077	0.796656	0.46
15	kiwi	fruit	0.385809	1.037970	2.044683	0.76
16	mangoes	fruit	0.363763	0.705783	1.377563	0.71
17	nectarines	fruit	0.319670	0.618667	1.761148	0.91
18	oranges	fruit	0.407855	0.578357	1.035173	0.73
19	papaya	fruit	0.308647	0.646174	1.298012	0.62
20	peaches	fruit	0.341717	0.566390	1.591187	0.96
21	pears	fruit	0.363763	0.590740	1.461575	0.90
22	pineapple	fruit	0.363763	0.447686	0.627662	0.51
23	plums	fruit	0.363763	0.707176	1.827416	0.94
24	pomegranate	fruit	0.341717	1.326342	2.173590	0.56
25	raspberries	fruit	0.319670	2.322874	6.975811	0.96
26	strawberries	fruit	0.319670	0.802171	2.358808	0.94
27	tangerines	fruit	0.407855	0.759471	1.377962	0.74
28	watermelon	fruit	0.330693	0.212033	0.333412	0.52

	Name	Type	lb/cup	price/cup	price/lb	yield
0	acorn_squash	vegetable	0.451948	1.155360	1.172248	0.458554
1	artichoke	vegetable	0.385809	2.274967	2.213050	0.375309
2	asparagus	vegetable	0.396832	2.582272	3.213494	0.493835
3	avocados	vegetable	0.319670	0.964886	2.235874	0.740753
4	beets	vegetable	0.374786	0.586555	1.017275	0.650000
5	blackeye_peas	vegetable	0.374786	0.524954	0.910441	0.650000
6	black_beans	vegetable	0.385809	0.582025	0.980580	0.650000
7	broccoli	vegetable	NaN	NaN	NaN	NaN
8	brussels_sprouts	vegetable	0.341717	0.890898	2.763553	1.060000
9	butternut_squash	vegetable	0.451948	0.787893	1.244737	0.714000
10	cabbage	vegetable	0.330693	0.245944	0.579208	0.778797
11	carrots	vegetable	NaN	NaN	NaN	NaN
12	cauliflower	vegetable	NaN	NaN	NaN	NaN
13	celery	vegetable	NaN	NaN	NaN	NaN
14	collard_greens	vegetable	0.286601	0.650001	2.630838	1.160000
15	corn_sweet	vegetable	0.363763	1.812497	2.690623	0.540000
16	cucumbers	vegetable	0.264555	0.353448	1.295931	0.970000
17	great_northern_beans	vegetable	0.385809	0.548392	0.923916	0.650000
18	green_beans	vegetable	0.275578	0.696606	2.139972	0.846575
19	green_peas	vegetable	0.352740	0.549769	1.013069	0.650000
20	green_peppers	vegetable	0.264555	0.455022	1.410363	0.820000
21	kale	vegetable	0.286601	0.766262	2.807302	1.050000
22	kidney_beans	vegetable	0.385809	0.535194	0.901680	0.650000
23	lentils	vegetable	0.385809	0.196738	1.385036	2.716095
24	lettuce_iceberg	vegetable	0.242508	0.309655	1.213039	0.950000
25	lettuce_romaine	vegetable	NaN	NaN	NaN	NaN
26	lima_beans	vegetable	0.374786	0.797757	1.383569	0.650000
27	mixed_vegetables	vegetable	NaN	NaN	NaN	NaN
28	mushrooms	vegetable	NaN	NaN	NaN	NaN
29	mustard_greens	vegetable	0.308647	0.944032	2.569235	0.840000
30	navy_beans	vegetable	0.385809	0.575997	0.970423	0.650000
31	okra	vegetable	0.352740	1.473146	3.213552	0.769474
32	olives	vegetable	0.297624	1.246102	4.186832	1.000000
33	onions	vegetable	0.352740	0.406868	1.038107	0.900000
34	pinto_beans	vegetable	0.385809	0.514129	0.866190	0.650000
35	potatoes	vegetable	0.264555	0.184017	0.564320	0.811301
36	pumpkin	vegetable	0.540133	0.730411	1.352280	1.000000
37	radish	vegetable	0.275578	0.401618	1.311629	0.900000
38	red_peppers	vegetable	0.264555	0.734926	2.277940	0.820000
39	spinach	vegetable	NaN	NaN	NaN	NaN
40	summer_squash	vegetable	0.396832	0.845480	1.639477	0.769500
41	sweet_potatoes	vegetable	0.440925	0.499400	0.918897	0.811301
42	tomatoes	vegetable	NaN	NaN	NaN	NaN
43	turnip_greens	vegetable	0.319670	1.053526	2.471749	0.750000