Methods Used
Web Scraper
A Python script was developed to scrape headline articles from CNBC every hour, from 9 a.m. to 4 p.m., on weekdays. These specific times were chosen to capture headlines approximately every hour while the stock market is open. The script utilized the Python library BeautifulSoup for web scraping, and a DigitalOcean server with crontab was employed to automate the process. The script generated a JSON file containing all the collected data. The data was collected from April 1 to April 18.
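The collection schedule and JSON output described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual script: the helper names in_collection_window and append_headlines, the record fields, and the file layout are hypothetical, and the fetching and BeautifulSoup parsing step is left out. A crontab entry such as `0 9-16 * * 1-5` would trigger one run per hour from 9 a.m. to 4 p.m. on weekdays.

```python
import json
from datetime import datetime
from pathlib import Path


def in_collection_window(moment: datetime) -> bool:
    """Return True on weekdays when the hour is between 9 and 16 inclusive."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and 9 <= moment.hour <= 16


def append_headlines(path: Path, headlines: list[str], moment: datetime) -> int:
    """Append scraped headlines to a JSON file and return the total record count."""
    # Hypothetical record layout: one object per headline with a timestamp.
    records = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    for headline in headlines:
        records.append({"headline": headline, "scraped_at": moment.isoformat()})
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return len(records)
```

Keeping every run's records in a single JSON array makes later duplicate removal straightforward, at the cost of rewriting the file on each run.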
Data Cleaning
The data was cleaned manually to ensure quality. This process involved removing duplicate articles and adding the corresponding date to each record. A human-assigned sentiment label was added to each article to support future analysis. Additionally, the JSON data was converted to XML so that it could be analyzed with XML tools.
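The cleaning was done by hand, but the same steps could be scripted. The following is a minimal sketch, assuming each record is a dictionary with hypothetical headline, date, and sentiment fields. The key element with a name attribute and a date child mirrors what the XQuery and XSL code expects; the results root element and the sentiment child are assumptions.

```python
import xml.etree.ElementTree as ET


def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record for each headline, preserving the original order."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = record["headline"].strip().lower()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


def to_xml(records: list[dict]) -> str:
    """Serialize records as key elements with date and sentiment children."""
    root = ET.Element("results")  # hypothetical root element name
    for record in records:
        key = ET.SubElement(root, "key", name=record["headline"])
        ET.SubElement(key, "date").text = record["date"]
        ET.SubElement(key, "sentiment").text = record["sentiment"]  # assumed field
    return ET.tostring(root, encoding="unicode")
```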
Text Processing
We processed the corpus with XSLT. The stylesheet separates the corpus into its individual articles, writing each one to its own text file.
Data Aggregation
To decide which keywords to start with, we used Voyant to find the most frequently used terms in the corpus. The terms returned were:
-Inflation
-Year
-Market
-Rates
With these terms as a starting point, we used XQuery to get the word count for other keywords of interest. We further categorized the results by separating the word count for each article, to see which articles used particular words more often than others.
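The per-article counting idea can also be illustrated outside XQuery. The following is a minimal sketch, assuming the articles are available as a mapping from a hypothetical article id to its text; the function names and the simple alphabetic tokenizer are assumptions and are not how Voyant or the project's query work internally.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_counts(articles: dict[str, str], keywords: list[str]) -> dict[str, dict[str, int]]:
    """Map each keyword to {article id: count}, omitting articles with no hits."""
    result: dict[str, dict[str, int]] = {keyword: {} for keyword in keywords}
    for article_id, text in articles.items():
        counts = Counter(tokenize(text))
        for keyword in keywords:
            if counts[keyword] > 0:
                result[keyword][article_id] = counts[keyword]
    return result
```

Matching is on whole tokens, so a keyword such as rate does not count occurrences of rates; a stemmed or prefix match would be a different design choice.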
XQuery Code
(: Count how often each word of interest occurs in each article. :)
declare variable $results := doc('results.xml');
declare variable $wordsOfInterest := ('inflation', 'market', 'rate', 'march', 'prices', 'tesla');
for $w in $wordsOfInterest
(: Select articles whose text contains the word, ignoring case. :)
let $matches := $results//key[contains(lower-case(@name), $w)]
for $m in $matches
let $article := lower-case($m/@name)
(: Split on non-word characters so punctuation does not block a match. :)
let $tokens := tokenize($article, '\W+')[. = $w]
let $countPerArticle := count($tokens)
let $date := $m/date ! xs:date(.)
where $countPerArticle > 0
order by $date descending
return ($w || ': ' || $date || ': ' || $countPerArticle || ' ')
XSL Code
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:math="http://www.w3.org/2005/xpath-functions/math"
exclude-result-prefixes="xs math"
version="3.0">
<xsl:output method="text"/>
<xsl:variable name="results" as="document-node()" select="doc('results.xml')"/>
<xsl:template match="/">
<xsl:apply-templates select="//key"/>
</xsl:template>
<xsl:template match="key">
<xsl:variable name="filename" as="xs:string" select="@name ! substring(., 1, 20) ! replace(., '\W', '') ! lower-case(.)"/>
<xsl:value-of select="$filename"/>
<xsl:result-document method="text" href="textCorpus/{date}-{$filename}.txt">
<xsl:value-of select="@name"/>
</xsl:result-document>
</xsl:template>
</xsl:stylesheet>
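To make the file naming in the stylesheet concrete, the following Python sketch approximates the filename expression: take the first 20 characters of the name attribute, strip non-word characters, and lowercase the result. The function name article_filename is hypothetical, and the match is approximate because Python's `\W` differs slightly from the XPath class (for example, Python keeps underscores and removes symbols such as `$`).

```python
import re


def article_filename(date: str, headline: str) -> str:
    """Build 'textCorpus/<date>-<slug>.txt' from the first 20 headline characters."""
    prefix = headline[:20]  # corresponds to substring(., 1, 20)
    # Corresponds to replace(., '\W', '') followed by lower-case(.)
    slug = re.sub(r"\W", "", prefix).lower()
    return f"textCorpus/{date}-{slug}.txt"
```

Because only the first 20 characters are used, two articles published on the same date with the same opening words would map to the same file name.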