Assignment 5

In this assignment, you'll scrape text from The California Aggie and then analyze the text.

The Aggie is organized by category into article lists. For example, there's a Campus News list, Arts & Culture list, and Sports list. Notice that each list has multiple pages, with a maximum of 15 articles per page.

The goal of exercises 1.1 - 1.3 is to scrape articles from the Aggie for analysis in exercise 1.4.

In [1]:
import numpy as np
import nltk
from nltk import corpus
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
import requests
import requests_cache
import requests_ftp
from urllib2 import Request, urlopen
import json

Exercise 1.1. Write a function that extracts all of the links to articles in an Aggie article list. The function should:

  • Have a parameter url for the URL of the article list.

  • Have a parameter page for the number of pages to fetch links from. The default should be 1.

  • Return a list of aricle URLs (each URL should be a string).

Test your function on 2-3 different categories to make sure it works.

Hints:

  • Be polite to The Aggie and save time by setting up requests_cache before you write your function.

  • Start by getting your function to work for just 1 page. Once that works, have your function call itself to get additional pages.

  • You can use lxml.html or BeautifulSoup to scrape HTML. Choose one and use it throughout the entire assignment.

In [2]:
requests_cache.install_cache('cache')
In [3]:
from bs4 import BeautifulSoup
import lxml.html as lx
In [4]:
# Scrapes a page to get all article URLs
def get_links(url, page):
    urllist = []
    for i in range(1, page + 1):
        num = str(i)
        urllist.append(url + 'page/' + num)
    for i in range(1, len(urllist) + 1): 
        request = requests.get(urllist[i - 1])
        html = request.text
        text = BeautifulSoup(html, 'html.parser')
        sets = text.find_all(name='h2',attrs={'class':'entry-title'})
        links = []
        for j in range(0, len(sets)):
            links.append(sets[j].find_all('a'))
        if i == 1:
            linklist = []
        else:
            linklist = linklist
        for j in range(0, len(links)):
            linklist.append(links[j][0]['href'])
    return linklist
In [5]:
get_links('https://theaggie.org/campus/', 2)
Out[5]:
[u'https://theaggie.org/2017/02/23/asucd-president-alex-lee-vetoes-amendment-for-creation-of-judicial-council/',
 u'https://theaggie.org/2017/02/22/senate-candidate-zaki-shaheen-withdraws-from-race/',
 u'https://theaggie.org/2017/02/21/uc-davis-experiences-several-recent-hate-based-crimes/',
 u'https://theaggie.org/2017/02/21/uc-president-selects-gary-may-as-new-uc-davis-chancellor/',
 u'https://theaggie.org/2017/02/20/katehi-controversy-prompts-decline-of-uc-administrators-seeking-profitable-subsidiary-board-positions/',
 u'https://theaggie.org/2017/02/20/asucd-senate-passes-resolution-submitting-comments-on-lrdp/',
 u'https://theaggie.org/2017/02/20/uc-releases-2016-annual-report-on-sustainable-practices/',
 u'https://theaggie.org/2017/02/19/uc-davis-global-affairs-holds-discussion-on-president-donald-trumps-executive-orders-on-immigration/',
 u'https://theaggie.org/2017/02/19/trumps-immigration-ban-affects-uc-davis-community/',
 u'https://theaggie.org/2017/02/17/uc-davis-students-participate-in-uc-wide-nodapl-day-of-action/',
 u'https://theaggie.org/2017/02/17/uc-davis-holds-first-mental-health-conference/',
 u'https://theaggie.org/2017/02/16/last-week-in-senate-6/',
 u'https://theaggie.org/2017/02/16/2017-asucd-winter-elections-meet-the-candidates/',
 u'https://theaggie.org/2017/02/14/shields-library-hosts-new-exhibit-for-davis-centennial/',
 u'https://theaggie.org/2017/02/14/student-health-and-counseling-services-hosts-step-up-to-the-plate-campaign/',
 u'https://theaggie.org/2017/02/16/2017-asucd-winter-elections-meet-the-candidates/',
 u'https://theaggie.org/2017/02/14/shields-library-hosts-new-exhibit-for-davis-centennial/',
 u'https://theaggie.org/2017/02/14/student-health-and-counseling-services-hosts-step-up-to-the-plate-campaign/',
 u'https://theaggie.org/2017/02/13/pe-classes-may-charge-additional-fees/',
 u'https://theaggie.org/2017/02/12/11-new-chancellor-fellows-honored-for-2016/',
 u'https://theaggie.org/2017/02/12/muslim-students-respond-to-recent-political-events/',
 u'https://theaggie.org/2017/02/12/sexcessful-campaign-launched-in-time-for-valentines-day/',
 u'https://theaggie.org/2017/02/10/michael-chan-sworn-in-as-interim-senator/',
 u'https://theaggie.org/2017/02/09/university-of-california-regents-meet-approve-first-tuition-raise-in-six-years/',
 u'https://theaggie.org/2017/02/09/last-week-in-senate-5/',
 u'https://theaggie.org/2017/02/09/uc-davis-receives-2-2-million-from-assembly-bill-2664/',
 u'https://theaggie.org/2017/02/06/senator-bill-dodd-visits-uc-davis/',
 u'https://theaggie.org/2017/02/05/ab-1887-prevents-use-of-state-funds-including-uc-funds-for-travel-to-states-with-anti-lgbt-laws/',
 u'https://theaggie.org/2017/02/02/uc-system-hires-title-ix-coordinator/',
 u'https://theaggie.org/2017/02/02/last-week-in-senate-4/']

Exercise 1.2. Write a function that extracts the title, text, and author of an Aggie article. The function should:

  • Have a parameter url for the URL of the article.

  • For the author, extract the "Written By" line that appears at the end of most articles. You don't have to extract the author's name from this line.

  • Return a dictionary with keys "url", "title", "text", and "author". The values for these should be the article url, title, text, and author, respectively.

For example, for this article your function should return something similar to this:

{
    'author': u'Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org',
    'text': u'Davis residents create financial model to make city\'s financial state more transparent To increase transparency between the city\'s financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design. Jeff Miller and Matt Williams, who are members of Davis\' Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager. "City staff appreciate the efforts that have gone into this, and the interest in trying to look at the city\'s potential financial position over the long term," Stachowicz said in an email interview. "We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond." Project Toto complements the city\'s effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the city\'s financial situation and make the information more accessible and easier to understand. The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the city\'s financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developments "This really isn\'t a budget, it is a forecast to see the intervention of these decisions," Williams said in an interview with The Davis Enterprise. "What happens if we extend the sales tax? What does it do given the other numbers that are in?" Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables. The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto. "It\'s a model that very easily lends itself to visual representation," Mayor Robb Davis said. "You can see the impacts of decisions the council makes on the fiscal health of the city." Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the city\'s finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances. There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget. "It\'s something I have been very much supportive of," Davis said. "Transparency is not just something that I have been supportive of but something we have stated as a city council objective [ ] this fits very well with our attempt to inform the public of our challenges with our fiscal situation." ',
    'title': 'Project Toto aims to address questions regarding city finances',
    'url': 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'
}

Hints:

  • The author line is always the last line of the last paragraph.

  • Python 2 displays some Unicode characters as \uXXXX. For instance, \u201c is a left-facing quotation mark. You can convert most of these to ASCII characters with the method call (on a string)

    .translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })

    If you're curious about these characters, you can look them up on this page, or read more about what Unicode is.

In [6]:
# I think it's better if I separate this into 4 functions
# This one gets the text of the function from strong and span attrs
def get_content(url):
    request = requests.get(url)
    html = request.text
    text = BeautifulSoup(html, 'html.parser')
    strong = text.find_all(name = 'strong')
    try:
        strong = strong[0].text.strip()
        strong = strong.translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })
    except IndexError:
        strong = ''
    sets = text.find_all(name='span',attrs={'style':'font-weight: 400;'})
    content = []
    for i in range(0, len(sets)):
        content.append(sets[i].text)
    content = "".join(content)
    content = content.translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })
    body = strong + ' ' + content
    return body
In [7]:
# Gets author, I think the names are always on the third to last p observation
def get_author(url):
    request = requests.get(url)
    html = request.text
    text = BeautifulSoup(html, 'html.parser')
    author = text.find_all(name='p')
    try:
        author = author[len(author) - 2].text.strip()
    except IndexError:
        author = "Unknown"
    return author
In [48]:
# Gets title and then strips the part that says 'The Aggie'
def get_title(url):
    request = requests.get(url)
    html = request.text
    text = BeautifulSoup(html, 'html.parser')
    title = text.find_all(name='title')
    title = title[0].text.strip()
    title = title.translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })
    title = title.rstrip(' | The Aggie')
    return title
In [49]:
# Extracts title, text, url, and author by finding corresponding places, and appending them to one observation
def extract(url):
    author = get_author(url)
    body = get_content(url)
    title = get_title(url)
    result = 'author: ' + author + '\n' + 'text: ' + body + '\n' + 'title: ' + title + '\n' + 'url: ' + url
    return result
In [50]:
extract('https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/')
Out[50]:
u'author: Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org\ntext: Davis residents create financial model to make city\'s financial state more transparent To increase transparency between the city\'s financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design. Jeff Miller and Matt Williams, who are members of Davis\' Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager. "City staff appreciate the efforts that have gone into this, and the interest in trying to look at the city\'s potential financial position over the long term," Stachowicz said in an email interview. "We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond."Project Toto complements the city\'s effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the city\'s financial situation and make the information more accessible and easier to understand.The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the city\'s financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developments"This really isn\'t a budget, it is a forecast to see the intervention of these decisions," Williams said in an interview with The Davis Enterprise. "What happens if we extend the sales tax? What does it do given the other numbers that are in?"Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables. The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto."It\'s a model that very easily lends itself to visual representation," Mayor Robb Davis said. "You can see the impacts of decisions the council makes on the fiscal health of the city."Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the city\'s finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances. There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget. "It\'s something I have been very much supportive of," Davis said. "Transparency is not just something that I have been supportive of but something we have stated as a city council objective [ ] this fits very well with our attempt to inform the public of our challenges with our fiscal situation."city@theaggie.org\ntitle: Project Toto aims to address questions regarding city finances\nurl: https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'

Exercise 1.3. Use your functions from exercises 1.1 and 1.2 to get a data frame of 60 Campus News articles and a data frame of 60 City News articles. Add a column to each that indicates the category, then combine them into one big data frame.

The "text" column of this data frame will be your corpus for natural language processing in exercise 1.4.

In [51]:
# get all 120 links
campus = get_links('https://theaggie.org/campus/', 4)
city = get_links('https://theaggie.org/city/', 4)
links = campus + city
In [52]:
# get column of type of article
camplist = ['campus'] * 60
citylist = ['city'] * 60
types = camplist + citylist
In [53]:
# gets author, title, and content
authorlist = []
bodylist = []
titlelist = []
for i in range(0, len(links)):
    bodylist.append(get_content(links[i]))
    titlelist.append(get_title(links[i]))
    authorlist.append(get_author(links[i]))
In [54]:
import pandas as pd
In [55]:
articles = pd.DataFrame({'type':types, 'title':titlelist, 'author':authorlist, 'content':bodylist, 'url':links})
In [56]:
articles.head()
Out[56]:
author content title type url
0 Written by: Ivan Valenzuela — campus@theaggie.org Veto included revision abandoning creation of ... ASUCD President Alex Lee vetoes amendment for ... campus https://theaggie.org/2017/02/23/asucd-presiden...
1 Written by: Alyssa Vandenberg  — campus@theagg... Shaheen's name to remain on ballot, his votes ... Senate candidate Zaki Shaheen withdraws from rac campus https://theaggie.org/2017/02/22/senate-candida...
2 Written by: Aaron Liss  — campus@theaggie.org Students receive email warnings from UC Davis ... UC Davis experiences several recent hate-based... campus https://theaggie.org/2017/02/21/uc-davis-exper...
3 Written by: Alyssa Vandenberg  — campus@theagg... UC Board of Regents to vote on the appointment... UC President selects Gary May as new UC Davis ... campus https://theaggie.org/2017/02/21/uc-president-s...
4 The UC Davis and UC Office of the President’s ... Tighter policies require greater approval of o... Katehi controversy prompts decline of UC admin... campus https://theaggie.org/2017/02/20/katehi-controv...

Exercise 1.4. Use the Aggie corpus to answer the following questions. Use plots to support your analysis.

  • What topics does the Aggie cover the most? Do city articles typically cover different topics than campus articles?

  • What are the titles of the top 3 pairs of most similar articles? Examine each pair of articles. What words do they have in common?

  • Do you think this corpus is representative of the Aggie? Why or why not? What kinds of inference can this corpus support? Explain your reasoning.

Hints:

  • The nltk book and scikit-learn documentation may be helpful here.

  • You can determine whether city articles are "near" campus articles from the similarity matrix or with k-nearest neighbors.

  • If you want, you can use the wordcloud package to plot a word cloud. To install the package, run

    conda install -c https://conda.anaconda.org/amueller wordcloud

    in a terminal. Word clouds look nice and are easy to read, but are less precise than bar plots.

In [21]:
stemmer = nltk.stem.porter.PorterStemmer()
tokenize = nltk.word_tokenize
# stem and lemma functions
def stem(tokens,stemmer = PorterStemmer().stem):
    return [stemmer(w.lower()) for w in tokens] 
def lemmatize(text):
    return stem(tokenize(text))
In [135]:
def dictionaries(df):
    textd = {}
    for n in range (0, len(articles.index)):
        s = set(lemmatize(articles['content'][n]))
        try:
            toks = toks | s
        except NameError:
            toks = s
        for tok in s:
            try:
                textd[tok].append(n)
            except KeyError:
                textd[tok] = [n]
    # ids for each article
    artids = {}
    N = len(articles.index)
    for i in xrange(N):
        artids[articles['content'][i]] = i
    # word ids
    tokids = {}
    tok_list = list(toks)
    m = len(tok_list)
    for j in xrange(m):
        tokids[tok_list[j]] = j
    return textd, artids, tokids
In [136]:
textd, artids, tokids = dictionaries(articles)
In [137]:
# smoothed idf
numd = {key:len(set(val)) for key,val in textd.items()}
logN = np.log(len(articles.index))
idf_smooth = {key:logN - np.log(1 + val) for key, val in numd.items() if val > 1}
In [138]:
plt.hist(idf_smooth.values(),bins=20)
Out[138]:
(array([  23.,    8.,   22.,   15.,   26.,   37.,   37.,   59.,   69.,
          51.,  125.,   84.,  157.,   91.,  200.,  135.,  193.,  253.,
         349.,  869.]),
 array([-0.0082988 ,  0.17656011,  0.36141902,  0.54627794,  0.73113685,
         0.91599576,  1.10085467,  1.28571359,  1.4705725 ,  1.65543141,
         1.84029033,  2.02514924,  2.21000815,  2.39486706,  2.57972598,
         2.76458489,  2.9494438 ,  3.13430272,  3.31916163,  3.50402054,
         3.68887945]),
 <a list of 20 Patch objects>)
In [131]:
# appends indexes with highest values
lowval = []
for i in range(0, len(idf_smooth.values())):
    if idf_smooth.values()[i] < 0.5:
        lowval.append(i)
In [139]:
for i in range(0, len(lowval)):
    print(list(tokids.keys())[list(tokids.values()).index(lowval[i])])
slate
bone
healthcar
renter
reengag
threw
sad
spawn
format
hartwel
aafreen
just
nowaday
insecur
roll
examin
imper
torn
plaqu
irrepar
mediat
'sit
offspr
conclus
fill
rp
dollar
emili
mnd
delet
plane
found
sew
fung
implement
up
awestruck
send
next
uneth
alloc
1:50
placement
portfolio
redirect
drum
competit
flashlight
lake
arraign
46

I found the words that have the lowest idf values because by the definition of inverse document frequency, I believe that those are the words with the highest importance in the collection of articles. Of those words, it seems like they cover a lot of issues having to do with the university, competitions, and possibly health and crime.

In [140]:
# splits dfs
campusdf = articles.loc[articles['type'] == "campus"]
citydf = articles.loc[articles['type'] == "city"]
In [141]:
textd, artids, tokids = dictionaries(campusdf)
numd = {key:len(set(val)) for key,val in textd.items()}
logN = np.log(len(articles.index))
idf_smooth = {key:logN - np.log(1 + val) for key, val in numd.items() if val > 1}
plt.hist(idf_smooth.values(),bins=20)
Out[141]:
(array([  23.,    8.,   22.,   15.,   26.,   37.,   37.,   59.,   69.,
          51.,  125.,   84.,  157.,   91.,  200.,  135.,  193.,  253.,
         349.,  869.]),
 array([-0.0082988 ,  0.17656011,  0.36141902,  0.54627794,  0.73113685,
         0.91599576,  1.10085467,  1.28571359,  1.4705725 ,  1.65543141,
         1.84029033,  2.02514924,  2.21000815,  2.39486706,  2.57972598,
         2.76458489,  2.9494438 ,  3.13430272,  3.31916163,  3.50402054,
         3.68887945]),
 <a list of 20 Patch objects>)
In [142]:
lowval = []
for i in range(0, len(idf_smooth.values())):
    if idf_smooth.values()[i] < 0.5:
        lowval.append(i)
for i in range(0, len(lowval)):
    print(list(tokids.keys())[list(tokids.values()).index(lowval[i])])
slate
bone
healthcar
renter
reengag
threw
sad
spawn
format
hartwel
aafreen
just
nowaday
insecur
roll
examin
imper
torn
plaqu
irrepar
mediat
'sit
offspr
conclus
fill
rp
dollar
emili
mnd
delet
plane
found
sew
fung
implement
up
awestruck
send
next
uneth
alloc
1:50
placement
portfolio
redirect
drum
competit
flashlight
lake
arraign
46
In [143]:
textd, artids, tokids = dictionaries(citydf)
numd = {key:len(set(val)) for key,val in textd.items()}
logN = np.log(len(articles.index))
idf_smooth = {key:logN - np.log(1 + val) for key, val in numd.items() if val > 1}
plt.hist(idf_smooth.values(),bins=20)
Out[143]:
(array([  23.,    8.,   22.,   15.,   26.,   37.,   37.,   59.,   69.,
          51.,  125.,   84.,  157.,   91.,  200.,  135.,  193.,  253.,
         349.,  869.]),
 array([-0.0082988 ,  0.17656011,  0.36141902,  0.54627794,  0.73113685,
         0.91599576,  1.10085467,  1.28571359,  1.4705725 ,  1.65543141,
         1.84029033,  2.02514924,  2.21000815,  2.39486706,  2.57972598,
         2.76458489,  2.9494438 ,  3.13430272,  3.31916163,  3.50402054,
         3.68887945]),
 <a list of 20 Patch objects>)
In [144]:
lowval = []
for i in range(0, len(idf_smooth.values())):
    if idf_smooth.values()[i] < 0.5:
        lowval.append(i)
for i in range(0, len(lowval)):
    print(list(tokids.keys())[list(tokids.values()).index(lowval[i])])
slate
bone
healthcar
renter
reengag
threw
sad
spawn
format
hartwel
aafreen
just
nowaday
insecur
roll
examin
imper
torn
plaqu
irrepar
mediat
'sit
offspr
conclus
fill
rp
dollar
emili
mnd
delet
plane
found
sew
fung
implement
up
awestruck
send
next
uneth
alloc
1:50
placement
portfolio
redirect
drum
competit
flashlight
lake
arraign
46

I split the dataframes based on the types of articles and found the inverse document frequency words for both. They both have similar words so the topics they cover may be similar but I think it's very hard to tell because most of the words that differ from the two doesn't seem relevant to any particular topic.

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=lemmatize, stop_words = "english", smooth_idf = True, norm = None)
tfs = vectorizer.fit_transform(articles['content'])
In [47]:
sim = tfs.dot(tfs.T)
In [80]:
campuslabel = [1 for i in range (0,60)] + [0 for i in range (0,60)]
citylabel = [0 for i in range (0,60)] + [1 for i in range (0,60)]
In [81]:
# knn
k_max = 20
nbrs = NearestNeighbors(n_neighbors=k_max).fit(tfs)
err_list = []

for k in xrange(1,k_max+1,2):
    neighmat = nbrs.kneighbors_graph(n_neighbors=k)
    pred_lab = (neighmat.dot(campuslabel) > k/2.)*1 pred
    err =  np.mean(pred_lab != campuslabel)
    err_list.append(err)
In [83]:
plt.plot(range(1,k_max + 1,2),err_list)
plt.xlabel('k')
plt.ylabel('Error Rate')
Out[83]:
<matplotlib.text.Text at 0xc496828>
In [161]:
values = []
for i in range(0, len(pred_lab)):
    if pred_lab[i] == 1:
        values.append(i)
In [162]:
values
Out[162]:
[12, 15, 46, 50, 55, 115]
In [168]:
# Find similar article titles
# Title list correspond with observations
for i in range(0, len(values)):
    print titlelist[values[i]]
2017 ASUCD Winter Elections — Meet the Candidates
2017 ASUCD Winter Elections — Meet the Candidates
Former chancellor turns down feminist leadership role at UC Davis
UC-wide walkout, teach-ins on Trump's inauguration day
Student regent recruitment for the 2018-2019 school year begins
Nov. 8 2016: An Election Day many may never forget

The most similar articles are 2017 ASUCD Winter Elections - Meet the Candidates articles, Former chanecellor turns down feminist leadership role in UC Davis, UC-Wide walkout, teachins on Trump's inauguration day, Student regent recruitment for the 2018-2019 school year begins, and Nov. 8 2016: An Election Day many may never forget.

I think that some parts of the Aggie may represent the corpus I got because I think that the Aggie should cover issues that are school related such as health and crime which benefits and alerts students and election and school official topics that informs students of what goes on in their school. The corpus is implying that what the Aggie covers is mostly about election and school official issues along with other health, crime, and safety issues which benefits students. A lot of the words I see are used in a lot of everyday articles so this may require further analysis of the corpus.