Assignment 3

The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.

The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip file contains the fruit and vegetable files in the directories fruit and vegetables, respectively.

Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:

type food form price_per_lb yield lb_per_cup price_per_cup
fruit watermelon Fresh1 0.333412 0.52 0.330693 0.212033
fruit cantaloupe Fresh1 0.535874 0.51 0.374786 0.3938
vegetables onions Fresh1 1.03811 0.9 0.35274 0.406868
...

It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:

Ask questions and search the documentation/web to find the functions you need.

In [1]:
from os import listdir
import pandas as pd
import numpy as np
'''Gets all file names'''
fruits = listdir('C:/Python27/mod1/fruit')
allfiles = []
for i in range(0, len(fruits)):
    allfiles.append('C:/Python27/mod1/fruit/' + fruits[i])
'''Gets all the names of the fruits and strips the .xlsx which is the last 5 characters
For whatever reason, the dot has trouble being stripped together'''
allfruits = []
for i in range(0, len(fruits)):
    allfruits.append(fruits[i])
    allfruits[i] = allfruits[i].rstrip(allfruits[i][-4:])
    allfruits[i] = allfruits[i].rstrip(allfruits[i][-1])
'''Creates empty lists
I extract the fresh column in each file which happens to be all in the same place and stores each value I need into a different list'''
price_per_lb =[]
yields = []
lb_per_cup = []
price_per_cup = []
for i in range(0, len(fruits)):
    tempfile = pd.read_excel(allfiles[i])
    fresh = tempfile.loc[2, :]
    price_per_lb.append(float(fresh[1]))
    yields.append(float(fresh[3]))
    lb_per_cup.append(float(fresh[4]))
    price_per_cup.append(float(fresh[6]))
'''Copies the fruit text the number of fruits there are for data frame creation'''
category = []
for i in range(0, len(fruits)):
    category.append("fruit")
'''Creates fruit dataframe'''
dffruit = pd.DataFrame({'Type': category, 'Name': allfruits, 'price/lb': price_per_lb, 'yield': yields, 'lb/cup': lb_per_cup, 'price/cup': price_per_cup})
dffruit
Out[1]:
Name Type lb/cup price/cup price/lb yield
0 apples fruit 0.242508 0.422373 1.567515 0.90
1 apricots fruit 0.363763 1.189102 3.040072 0.93
2 bananas fruit 0.330693 0.292965 0.566983 0.64
3 berries_mixed fruit 0.330693 1.127735 3.410215 1.00
4 blackberries fruit 0.319670 1.922919 5.774708 0.96
5 blueberries fruit 0.319670 1.593177 4.734622 0.95
6 cantaloupe fruit 0.374786 0.393800 0.535874 0.51
7 cherries fruit 0.341717 1.334548 3.592990 0.92
8 cranberries fruit 0.123163 0.589551 4.786741 1.00
9 dates fruit 0.165347 0.792234 4.791351 1.00
10 figs fruit 0.165347 0.990068 5.748318 0.96
11 fruit_cocktail fruit NaN NaN NaN NaN
12 grapefruit fruit 0.462971 0.848278 0.897802 0.49
13 grapes fruit 0.330693 0.721266 2.093827 0.96
14 honeydew fruit 0.374786 0.649077 0.796656 0.46
15 kiwi fruit 0.385809 1.037970 2.044683 0.76
16 mangoes fruit 0.363763 0.705783 1.377563 0.71
17 nectarines fruit 0.319670 0.618667 1.761148 0.91
18 oranges fruit 0.407855 0.578357 1.035173 0.73
19 papaya fruit 0.308647 0.646174 1.298012 0.62
20 peaches fruit 0.341717 0.566390 1.591187 0.96
21 pears fruit 0.363763 0.590740 1.461575 0.90
22 pineapple fruit 0.363763 0.447686 0.627662 0.51
23 plums fruit 0.363763 0.707176 1.827416 0.94
24 pomegranate fruit 0.341717 1.326342 2.173590 0.56
25 raspberries fruit 0.319670 2.322874 6.975811 0.96
26 strawberries fruit 0.319670 0.802171 2.358808 0.94
27 tangerines fruit 0.407855 0.759471 1.377962 0.74
28 watermelon fruit 0.330693 0.212033 0.333412 0.52

Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.

Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?

You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.

In [2]:
from os import listdir
import pandas as pd
import numpy as np
'''Gets all file names'''
vegetables = listdir('C:/Python27/mod1/vegetables')
allfiles2 = []
for i in range(0, len(vegetables)):
    allfiles2.append('C:/Python27/mod1/vegetables/' + vegetables[i])
'''Gets all the names of the fruits and strips the .xlsx which is the last 5 characters'''
allveggies = []
for i in range(0, len(vegetables)):
    allveggies.append(vegetables[i])
    allveggies[i] = allveggies[i].rstrip(allveggies[i][-4:])
    allveggies[i] = allveggies[i].rstrip(allveggies[i][-1])
'''Creates empty lists
I extract the fresh column in each file which happens to be all in the same place and stores each value I need into a different list'''
price_per_lb2 =[]
yields2 = []
lb_per_cup2 = []
price_per_cup2 = []
for i in range(0, len(vegetables)):
    tempfile2 = pd.read_excel(allfiles2[i])
    fresh2 = tempfile2.loc[2, :]
    '''I noticed there was an error with one of the prices as a string so I converted it to NaN string instead'''
    try:
        price_per_lb2.append(float(fresh2[1]))
    except ValueError:
        price_per_lb2.append(np.nan)
    '''Same with yields and other values'''
    try:
        yields2.append(float(fresh2[3]))
    except ValueError:
        yields2.append(np.nan)
    try:
        lb_per_cup2.append(float(fresh2[4]))
    except ValueError:
        lb_per_cup2.append(np.nan)
    price_per_cup2.append(float(fresh2[6]))
'''Copies the fruit text the number of fruits there are for data frame creation'''
category2 = []
for i in range(0, len(vegetables)):
    category2.append("vegetable")
'''Creates fruit dataframe'''
dfveg = pd.DataFrame({'Type': category2, 'Name': allveggies, 'price/lb': price_per_lb2, 'yield': yields2, 'lb/cup': lb_per_cup2, 'price/cup': price_per_cup2})
dfveg
Out[2]:
Name Type lb/cup price/cup price/lb yield
0 acorn_squash vegetable 0.451948 1.155360 1.172248 0.458554
1 artichoke vegetable 0.385809 2.274967 2.213050 0.375309
2 asparagus vegetable 0.396832 2.582272 3.213494 0.493835
3 avocados vegetable 0.319670 0.964886 2.235874 0.740753
4 beets vegetable 0.374786 0.586555 1.017275 0.650000
5 blackeye_peas vegetable 0.374786 0.524954 0.910441 0.650000
6 black_beans vegetable 0.385809 0.582025 0.980580 0.650000
7 broccoli vegetable NaN NaN NaN NaN
8 brussels_sprouts vegetable 0.341717 0.890898 2.763553 1.060000
9 butternut_squash vegetable 0.451948 0.787893 1.244737 0.714000
10 cabbage vegetable 0.330693 0.245944 0.579208 0.778797
11 carrots vegetable NaN NaN NaN NaN
12 cauliflower vegetable NaN NaN NaN NaN
13 celery vegetable NaN NaN NaN NaN
14 collard_greens vegetable 0.286601 0.650001 2.630838 1.160000
15 corn_sweet vegetable 0.363763 1.812497 2.690623 0.540000
16 cucumbers vegetable 0.264555 0.353448 1.295931 0.970000
17 great_northern_beans vegetable 0.385809 0.548392 0.923916 0.650000
18 green_beans vegetable 0.275578 0.696606 2.139972 0.846575
19 green_peas vegetable 0.352740 0.549769 1.013069 0.650000
20 green_peppers vegetable 0.264555 0.455022 1.410363 0.820000
21 kale vegetable 0.286601 0.766262 2.807302 1.050000
22 kidney_beans vegetable 0.385809 0.535194 0.901680 0.650000
23 lentils vegetable 0.385809 0.196738 1.385036 2.716095
24 lettuce_iceberg vegetable 0.242508 0.309655 1.213039 0.950000
25 lettuce_romaine vegetable NaN NaN NaN NaN
26 lima_beans vegetable 0.374786 0.797757 1.383569 0.650000
27 mixed_vegetables vegetable NaN NaN NaN NaN
28 mushrooms vegetable NaN NaN NaN NaN
29 mustard_greens vegetable 0.308647 0.944032 2.569235 0.840000
30 navy_beans vegetable 0.385809 0.575997 0.970423 0.650000
31 okra vegetable 0.352740 1.473146 3.213552 0.769474
32 olives vegetable 0.297624 1.246102 4.186832 1.000000
33 onions vegetable 0.352740 0.406868 1.038107 0.900000
34 pinto_beans vegetable 0.385809 0.514129 0.866190 0.650000
35 potatoes vegetable 0.264555 0.184017 0.564320 0.811301
36 pumpkin vegetable 0.540133 0.730411 1.352280 1.000000
37 radish vegetable 0.275578 0.401618 1.311629 0.900000
38 red_peppers vegetable 0.264555 0.734926 2.277940 0.820000
39 spinach vegetable NaN NaN NaN NaN
40 summer_squash vegetable 0.396832 0.845480 1.639477 0.769500
41 sweet_potatoes vegetable 0.440925 0.499400 0.918897 0.811301
42 tomatoes vegetable NaN NaN NaN NaN
43 turnip_greens vegetable 0.319670 1.053526 2.471749 0.750000

Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).

In [3]:
import numpy as np
'''Drops cocktail'''
newdffruit = dffruit.loc[np.isnan(dffruit['price/lb']) == False]
newdffruit
'''Drop all the NaN vegetable observations'''
newdfveg = dfveg.loc[np.isnan(dfveg['price/lb']) == False]
newdfveg
'''Merge dataframes'''
foods = [newdffruit, newdfveg]
dffoods = pd.concat(foods)
dffoods
Out[3]:
Name Type lb/cup price/cup price/lb yield
0 apples fruit 0.242508 0.422373 1.567515 0.900000
1 apricots fruit 0.363763 1.189102 3.040072 0.930000
2 bananas fruit 0.330693 0.292965 0.566983 0.640000
3 berries_mixed fruit 0.330693 1.127735 3.410215 1.000000
4 blackberries fruit 0.319670 1.922919 5.774708 0.960000
5 blueberries fruit 0.319670 1.593177 4.734622 0.950000
6 cantaloupe fruit 0.374786 0.393800 0.535874 0.510000
7 cherries fruit 0.341717 1.334548 3.592990 0.920000
8 cranberries fruit 0.123163 0.589551 4.786741 1.000000
9 dates fruit 0.165347 0.792234 4.791351 1.000000
10 figs fruit 0.165347 0.990068 5.748318 0.960000
12 grapefruit fruit 0.462971 0.848278 0.897802 0.490000
13 grapes fruit 0.330693 0.721266 2.093827 0.960000
14 honeydew fruit 0.374786 0.649077 0.796656 0.460000
15 kiwi fruit 0.385809 1.037970 2.044683 0.760000
16 mangoes fruit 0.363763 0.705783 1.377563 0.710000
17 nectarines fruit 0.319670 0.618667 1.761148 0.910000
18 oranges fruit 0.407855 0.578357 1.035173 0.730000
19 papaya fruit 0.308647 0.646174 1.298012 0.620000
20 peaches fruit 0.341717 0.566390 1.591187 0.960000
21 pears fruit 0.363763 0.590740 1.461575 0.900000
22 pineapple fruit 0.363763 0.447686 0.627662 0.510000
23 plums fruit 0.363763 0.707176 1.827416 0.940000
24 pomegranate fruit 0.341717 1.326342 2.173590 0.560000
25 raspberries fruit 0.319670 2.322874 6.975811 0.960000
26 strawberries fruit 0.319670 0.802171 2.358808 0.940000
27 tangerines fruit 0.407855 0.759471 1.377962 0.740000
28 watermelon fruit 0.330693 0.212033 0.333412 0.520000
0 acorn_squash vegetable 0.451948 1.155360 1.172248 0.458554
1 artichoke vegetable 0.385809 2.274967 2.213050 0.375309
... ... ... ... ... ... ...
5 blackeye_peas vegetable 0.374786 0.524954 0.910441 0.650000
6 black_beans vegetable 0.385809 0.582025 0.980580 0.650000
8 brussels_sprouts vegetable 0.341717 0.890898 2.763553 1.060000
9 butternut_squash vegetable 0.451948 0.787893 1.244737 0.714000
10 cabbage vegetable 0.330693 0.245944 0.579208 0.778797
14 collard_greens vegetable 0.286601 0.650001 2.630838 1.160000
15 corn_sweet vegetable 0.363763 1.812497 2.690623 0.540000
16 cucumbers vegetable 0.264555 0.353448 1.295931 0.970000
17 great_northern_beans vegetable 0.385809 0.548392 0.923916 0.650000
18 green_beans vegetable 0.275578 0.696606 2.139972 0.846575
19 green_peas vegetable 0.352740 0.549769 1.013069 0.650000
20 green_peppers vegetable 0.264555 0.455022 1.410363 0.820000
21 kale vegetable 0.286601 0.766262 2.807302 1.050000
22 kidney_beans vegetable 0.385809 0.535194 0.901680 0.650000
23 lentils vegetable 0.385809 0.196738 1.385036 2.716095
24 lettuce_iceberg vegetable 0.242508 0.309655 1.213039 0.950000
26 lima_beans vegetable 0.374786 0.797757 1.383569 0.650000
29 mustard_greens vegetable 0.308647 0.944032 2.569235 0.840000
30 navy_beans vegetable 0.385809 0.575997 0.970423 0.650000
31 okra vegetable 0.352740 1.473146 3.213552 0.769474
32 olives vegetable 0.297624 1.246102 4.186832 1.000000
33 onions vegetable 0.352740 0.406868 1.038107 0.900000
34 pinto_beans vegetable 0.385809 0.514129 0.866190 0.650000
35 potatoes vegetable 0.264555 0.184017 0.564320 0.811301
36 pumpkin vegetable 0.540133 0.730411 1.352280 1.000000
37 radish vegetable 0.275578 0.401618 1.311629 0.900000
38 red_peppers vegetable 0.264555 0.734926 2.277940 0.820000
40 summer_squash vegetable 0.396832 0.845480 1.639477 0.769500
41 sweet_potatoes vegetable 0.440925 0.499400 0.918897 0.811301
43 turnip_greens vegetable 0.319670 1.053526 2.471749 0.750000

63 rows × 6 columns

Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.

  • What kinds of fruits are the most expensive (per pound)? What kinds are the least expensive?
  • How do the price distributions compare for fruit and vegetables?
  • Which foods are the best value for the price?
  • What's something surprising about this data set?
  • Which foods do you expect to provide the best combination of price, yield, and nutrition? A future assignment may combine this data set with another so you can check your hypothesis.
In [6]:
fruit = dffoods.loc[dffoods['Type'] == "fruit"]
maxprice = fruit['price/lb'].max()
maxfruit = dffoods.loc[dffoods['price/lb'] == maxprice]
'''I do not know why this won't return a string if I only grab the name. Had to look this up on Stack Overflow
http://stackoverflow.com/questions/16729574/how-to-get-a-value-from-a-cell-of-a-data-frame'''
maxfruit = maxfruit.iloc[0]['Name']
minprice = fruit['price/lb'].min()
minfruit = dffoods.loc[dffoods['price/lb'] == minprice]
minfruit = minfruit.iloc[0]['Name']
print maxfruit, maxprice, minfruit, minprice
raspberries 6.97581079886 watermelon 0.333412035323

I separated the dataset back into the two parts, one containing all the fruits and one containing all the vegetables. From there I got the minimum and maximum prices and found which fruits and vegetables they belonged to. Then I found that the most expensive fruit per pound are raspberries and the least expensive fruit per pound are watermelons. I did take some of the observations out containing N/A for the prices in the dataset so this could possibly be wrong and that not all the types of fruits and vegetables are in the dataset.

In [5]:
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
vegetable = dffoods.loc[dffoods['Type'] == "vegetable"]
plt.hist(fruit['price/lb'], bins = 20, alpha = 0.4, label = "Fruits")
plt.hist(vegetable['price/lb'], bins = 20, alpha = 0.4, label = "Vegetables")
plt.legend()
plt.show()

I plotted a histogram for both fruits and vegetables to get a sense of the price distributions and the number of fruits or vegetables at each price. From this plot, it looks like fruits are more expensive than vegetables in general. Most fruits and vegetables cost around 0 to less than 4 dollars in terms of price per pound and based on the plot, only fruits costs more than 5 dollars per pound. Most of the vegetables seem to cost around 50 cents to 1.5 dollars per pound. There could possibly be more vegetables that are as expensive as the fruits in the outlier parts of the graph as some observations are taken out of the dataset.

In [18]:
ratio = dffoods['price/lb']/dffoods['yield']
dffoods['ratio'] = ratio
dffoods
maxratio = dffoods['ratio'].max()
bestvalue = dffoods.loc[dffoods['ratio'] == maxratio]
bestvalue = bestvalue.iloc[0]['Name']
minratio = dffoods['ratio'].min()
worstvalue = dffoods.loc[dffoods['ratio'] == minratio]
worstvalue = worstvalue.iloc[0]['Name']
print bestvalue, maxratio, worstvalue, minratio
raspberries 7.26646958214 lentils 0.509936423192

I interpreted "best value" for the price as having the best price/yield ratio. I first took a ratio of all the prices and yields and divided them then appended that to a column in my data frame which from there, I got the foods with best and worst ratio. The food with the best value seems to be raspberries and worst is lentils.

In [19]:
maxvegprice = vegetable['price/lb'].max()
maxveg = dffoods.loc[dffoods['price/lb'] == maxvegprice]
maxveg = maxveg.iloc[0]['Name']
minvegprice = vegetable['price/lb'].min()
minveg = dffoods.loc[dffoods['price/lb'] == minvegprice]
minveg = minveg.iloc[0]['Name']
print maxveg, maxvegprice, minveg, minvegprice
olives 4.18683171458 potatoes 0.564319813929
In [24]:
expveg = dfveg.loc[dfveg['price/lb'] > 4]
expfruit = dffruit.loc[dffruit['price/lb'] > 4]
cheapveg = dfveg.loc[dfveg['price/lb'] < 1]
cheapfruit = dffruit.loc[dffruit['price/lb'] < 1]
print expveg, expfruit
print cheapveg, cheapfruit
      Name       Type    lb/cup  price/cup  price/lb  yield
32  olives  vegetable  0.297624   1.246102  4.186832    1.0             Name   Type    lb/cup  price/cup  price/lb  yield
4   blackberries  fruit  0.319670   1.922919  5.774708   0.96
5    blueberries  fruit  0.319670   1.593177  4.734622   0.95
8    cranberries  fruit  0.123163   0.589551  4.786741   1.00
9          dates  fruit  0.165347   0.792234  4.791351   1.00
10          figs  fruit  0.165347   0.990068  5.748318   0.96
25   raspberries  fruit  0.319670   2.322874  6.975811   0.96
                    Name       Type    lb/cup  price/cup  price/lb     yield
5          blackeye_peas  vegetable  0.374786   0.524954  0.910441  0.650000
6            black_beans  vegetable  0.385809   0.582025  0.980580  0.650000
10               cabbage  vegetable  0.330693   0.245944  0.579208  0.778797
17  great_northern_beans  vegetable  0.385809   0.548392  0.923916  0.650000
22          kidney_beans  vegetable  0.385809   0.535194  0.901680  0.650000
30            navy_beans  vegetable  0.385809   0.575997  0.970423  0.650000
34           pinto_beans  vegetable  0.385809   0.514129  0.866190  0.650000
35              potatoes  vegetable  0.264555   0.184017  0.564320  0.811301
41        sweet_potatoes  vegetable  0.440925   0.499400  0.918897  0.811301           Name   Type    lb/cup  price/cup  price/lb  yield
2      bananas  fruit  0.330693   0.292965  0.566983   0.64
6   cantaloupe  fruit  0.374786   0.393800  0.535874   0.51
12  grapefruit  fruit  0.462971   0.848278  0.897802   0.49
14    honeydew  fruit  0.374786   0.649077  0.796656   0.46
22   pineapple  fruit  0.363763   0.447686  0.627662   0.51
28  watermelon  fruit  0.330693   0.212033  0.333412   0.52

Something that surprised me in the dataset is that there are more fruits that cost a lot than vegetables are. I expected berries to cost more than 4 dollars per pound but did not expect olives, dates, and figs to cost so much too, especially figs which is $5.74 per pound. I found that pineapples costs less than 1 dollar per pound as well which made me surprised because they usually cost a lot more in the market from what I've seen. I'm also surprised that corn isn't in the list of the cheaper vegetables because I've learned from previous classes that producers generally produce a lot of corn, potatoes, and beans so I'd expect them to have cheaper prices. From what I just saw for the more expensive and cheaper foods, most beans and potato varieties are in the list of cheaper vegetables I've got but corn isn't one of them.

In [32]:
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
fruit = dffoods.loc[dffoods['Type'] == "fruit"]
vegetable = dffoods.loc[dffoods['Type'] == "vegetable"]
plt.hist(fruit['ratio'], bins = 20, alpha = 0.4, label = "Fruits")
plt.hist(vegetable['price/lb'], bins = 20, alpha = 0.4, label = "Vegetables")
plt.legend()
plt.show() 
In [33]:
bval = dffoods.loc[dffoods['ratio'] > 4]
wval = dffoods.loc[dffoods['ratio'] < 1]
print bval
print wval
            Name       Type    lb/cup  price/cup  price/lb     yield     ratio
4   blackberries      fruit  0.319670   1.922919  5.774708  0.960000  6.015321
5    blueberries      fruit  0.319670   1.593177  4.734622  0.950000  4.983812
8    cranberries      fruit  0.123163   0.589551  4.786741  1.000000  4.786741
9          dates      fruit  0.165347   0.792234  4.791351  1.000000  4.791351
10          figs      fruit  0.165347   0.990068  5.748318  0.960000  5.987831
25   raspberries      fruit  0.319670   2.322874  6.975811  0.960000  7.266470
1      artichoke  vegetable  0.385809   2.274967  2.213050  0.375309  5.896615
2      asparagus  vegetable  0.396832   2.582272  3.213494  0.493835  6.507216
15    corn_sweet  vegetable  0.363763   1.812497  2.690623  0.540000  4.982635
31          okra  vegetable  0.352740   1.473146  3.213552  0.769474  4.176299
32        olives  vegetable  0.297624   1.246102  4.186832  1.000000  4.186832
          Name       Type    lb/cup  price/cup  price/lb     yield     ratio
2      bananas      fruit  0.330693   0.292965  0.566983  0.640000  0.885912
28  watermelon      fruit  0.330693   0.212033  0.333412  0.520000  0.641177
10     cabbage  vegetable  0.330693   0.245944  0.579208  0.778797  0.743722
23     lentils  vegetable  0.385809   0.196738  1.385036  2.716095  0.509936
35    potatoes  vegetable  0.264555   0.184017  0.564320  0.811301  0.695574

From the histogram above, I think that foods with the best values include blackberries, blueberries, cranberries, dates, figs, raspberries, okra, olives, sweet corn, asparagus, and artichokes and the foods with the worst values are bananas, watermelons, potatoes, lentils, and cabbages. I took ratios of the price per yield to determine this since this is what I interpreted as best combination for the yield and price.