Methods Used

Web Scraper

A Python script was developed to scrape headline articles from CNBC every hour from 9 a.m. to 4 p.m. on weekdays, a schedule chosen to capture headlines roughly once an hour while the stock market is open. The script used the Python library BeautifulSoup for web scraping, and crontab on a DigitalOcean server automated the runs. The script generated a JSON file containing all the collected data. The data was collected from April 1 to April 18.
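The scraping step can be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: it uses Python's standard-library `html.parser` in place of BeautifulSoup so that it runs without extra installs, and the class name `headline-link`, the JSON layout, and the crontab path are all assumptions made for the example rather than CNBC's real markup or our real configuration.

```python
import json
from html.parser import HTMLParser

# Hypothetical crontab entry for the schedule described above
# (minute 0 of hours 9 through 16, Monday to Friday; the path is made up):
#   0 9-16 * * 1-5 python3 /path/to/scrape_headlines.py

HEADLINE_CLASS = "headline-link"  # assumed class name, not CNBC's real markup


class HeadlineParser(HTMLParser):
    """Collect the text of every <a> tag that carries HEADLINE_CLASS."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._inside = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and HEADLINE_CLASS in classes:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            # Collapse runs of whitespace inside the headline text.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headlines.append(text)
            self._inside = False


def extract_headlines(html):
    """Return the headline strings found in an HTML page."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


def headlines_to_json(headlines, scraped_at):
    """Serialize one scrape run; the field names are illustrative."""
    return json.dumps({"scraped_at": scraped_at, "headlines": headlines}, indent=2)
```

In a real run, the HTML would come from an HTTP request to the news page, and the resulting JSON would be appended to the file that collects every run.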

Data Cleaning

The data underwent manual cleaning to ensure quality. This process involved removing duplicate articles and adding corresponding dates to the records. Human sentiment was appended to each article to facilitate future analysis. Additionally, the JSON data was converted to XML, providing more format options for flexible analysis.
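Although the cleaning was done by hand, the same steps could be scripted. The sketch below is only an illustration under stated assumptions: the `key` element, its `name` attribute, and the `date` child match what the XQuery and XSL listings expect, while the root element name `results`, the `sentiment` element, the record field names, and any dates passed in are hypothetical.

```python
import xml.etree.ElementTree as ET


def dedupe(records):
    """Keep the first record for each headline, ignoring case and outer spaces."""
    seen = set()
    unique = []
    for rec in records:
        fingerprint = rec["headline"].strip().lower()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(rec)
    return unique


def to_xml(records):
    """Convert cleaned records to an XML string.

    Each article becomes <key name="..."> with a <date> child. The root
    name "results" and the <sentiment> element are assumptions.
    """
    root = ET.Element("results")
    for rec in records:
        key = ET.SubElement(root, "key", name=rec["headline"])
        ET.SubElement(key, "date").text = rec["date"]
        ET.SubElement(key, "sentiment").text = rec["sentiment"]
    return ET.tostring(root, encoding="unicode")
```

Storing dates in ISO form (`YYYY-MM-DD`) keeps them castable to `xs:date` in the later XQuery step.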

Text Processing

We processed our corpus with XSL. We ran a stylesheet that separates the corpus into its individual articles, writing each one to its own text file.

Data Aggregation

To get a hint about which keywords to start with, we used Voyant to find the most frequently used terms. The words it returned were:

-Inflation

-Year

-Market

-Rates

With a good starting point, we used XQuery to get word counts for other keywords. We further categorized the results by separating the word count for each article, to see which articles used particular words more often than others.
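For readers more comfortable with Python, the per-article counting can be approximated as below. This is a sketch, not the query we ran: it assumes the same `key`/`name`/`date` structure, and unlike the XQuery it lowercases the text and splits on non-word characters, so a token such as "Market," counts toward "market". As in the XQuery, only exact tokens match, so "rate" does not count "rates".

```python
import re
import xml.etree.ElementTree as ET

WORDS_OF_INTEREST = ("inflation", "market", "rate", "march", "prices", "tesla")


def count_keywords(xml_text, words=WORDS_OF_INTEREST):
    """Return (word, date, count) rows for each article that contains the word."""
    root = ET.fromstring(xml_text)
    rows = []
    for key in root.iter("key"):
        tokens = re.findall(r"\w+", key.get("name", "").lower())
        date = key.findtext("date", default="")
        for word in words:
            n = tokens.count(word)
            if n:
                rows.append((word, date, n))
    # Two stable sorts: newest date first, then grouped in keyword order.
    rows.sort(key=lambda row: row[1], reverse=True)
    rows.sort(key=lambda row: words.index(row[0]))
    return rows
```

Each row mirrors one line of the XQuery output: the keyword, the article date, and how often the keyword appears in that article.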

XQuery Code

                        
                            declare variable $results := doc('results.xml');
                            declare variable $wordsOfInterest := ('inflation', 'market', 'rate', 'march', 'prices', 'tesla');
                            
                            for $w in $wordsOfInterest
                            let $matches := $results//key[@name ! string()[contains(., $w)]]
                            let $count := count($matches)
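                            (: cast to xs:date so min() and max() compare dates rather than failing on untyped values :)
                            let $dates := $matches/date ! xs:date(.)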
                            let $minDate := min($dates)
                            let $maxDate := max($dates)
                            
                            for $m in $matches
                            let $article := $m/@name ! string()
                            let $tokens := tokenize($article, ' ')[. = $w]
                            let $countPerArticle := $tokens => count()
                            let $dates := $m/date ! xs:date(.)
                            where $countPerArticle > 0
                            order by $dates descending
                            return ($w || ': ' || $dates || ': '|| $countPerArticle || '&#10;')
                        
                    

XSL Code

                        
                            <?xml version="1.0" encoding="UTF-8"?>
                            <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                            xmlns:xs="http://www.w3.org/2001/XMLSchema"
                            xmlns:math="http://www.w3.org/2005/xpath-functions/math"
                            exclude-result-prefixes="xs math"
                            version="3.0">
                            
                            <xsl:output method="text"/>
                            
                            <xsl:variable name="results" as="document-node()" select="doc('results.xml')"/>
                            
                            <xsl:template match="/">
                            <xsl:apply-templates select="//key"/>
                            </xsl:template>
                            
                            <xsl:template match="key">
                            <xsl:variable name="filename" as="xs:string" select="@name ! substring(., 1, 20)
                            ! replace(., '\W', '') ! lower-case(.)"/>
                            <xsl:value-of select="$filename"/>
                            <xsl:result-document method="text" href="textCorpus/{date}-{$filename}.txt">
                            <xsl:value-of select="@name"/>
                            </xsl:result-document>
                            </xsl:template>
                            
                            </xsl:stylesheet>