PuddleNet—it was to be our answer to a question we had not yet found. This project started with a desire to understand the political climate and the language being consumed by the masses. We wanted to be explorers of a polarized lexicon that seemed to be dividing this country.
As our project unfolded, we identified 20 sources from which to gather language. We chose an equal number of popular left- and right-leaning news media sites to scrape, and began our process of gathering human thought that has been recorded in sentences.
As the data began rolling in, millions of words per site, we needed a way to start making sense of it all. We chose a two-pronged approach. First, we would group and inspect the text to look for patterns, find relationships, and further direct our exploration. Second, we wanted a machine learning model to learn what left- or right-leaning text looks like, so that we could use it to identify the political leaning of a text, or even build text blocks with perspective.
And so we got to work...
Word clouds are very subjective and require data cleaning before insights can be extracted from them. They rely heavily on the human brain to identify patterns and relationships. In the simplest form, the more frequent a word is, the larger it is drawn in the cloud. Common grammar words are typically removed so "THE" does not steal the show.
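To make the mechanics concrete, here is a minimal sketch of the counting step behind a word cloud, using only the Python standard library. It is not the code we ran for PuddleNet. The tiny stop-word list, the function names, and the font-size range are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "on", "for"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each word appears, ignoring case and stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


def font_sizes(freqs, min_size=10, max_size=60):
    """Scale counts linearly so the most frequent word gets max_size."""
    if not freqs:
        return {}
    top = max(freqs.values())
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in freqs.items()
    }


sample = "The senate passed the bill and the senate adjourned"
freqs = word_frequencies(sample)
sizes = font_sizes(freqs)
```

In this sample, "the" and "and" drop out entirely, and "senate" is drawn largest because it is the only word that appears twice.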
In looking at the individual word clouds (select from drop downs), we began inspecting the language. We began to have an inkling that there was definitely something different between the lexicons used by each side. However, we were unable to put a finger on it. We then rolled the word clouds up into groups of left and right, and the pattern began to take shape.
The Conservative Lexicon seemed to be a little more focused on political terms, whereas the Liberal Lexicon almost seemed a little erratic.
This was surprising. So, we needed to explore the anatomy of the articles…
The simplest possible box plot displays the full range of variation (from min to max), the likely range of variation (first and third quartile), and a typical value (the median). Upper and lower fences help identify outliers in the data set. In short, the box plot provides a simple visual of the dimensions of the data.
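For readers who want to see the numbers behind the picture, the sketch below computes the values a box plot draws. It assumes the common 1.5 × IQR rule for the fences and the "inclusive" quartile method from Python's statistics module; charting libraries differ on both, so treat this as one reasonable convention. The sample word counts are made up.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the numbers a basic box plot is drawn from."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumption: fences sit 1.5 * IQR beyond the quartiles.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Made-up article lengths, in hundreds of words, with one unusually long article.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

Here the single very long article falls above the upper fence and is reported as an outlier, while the median is unaffected by it.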
Our box plots were very insightful. We noticed almost immediately that the article anatomies were very different. Liberal media tended to run longer, with nearly twice as many words per article, on average, as the Conservative media. This began to explain some of the disparity between the word clouds, but we simply needed more information.
We wanted to explore why longer articles, which clearly carried more information, seemed not to be heard as loudly in the political arena.
And so we looked for a way to visualize the interconnectedness of the sites, to see if there was a proportionate flow between the sides...
A chord plot is used to display the inter-relationships between data. It is arranged in a circle; arcs show links to other sites, and the thickness of each link represents the volume of interconnectedness.
We went back to our raw data and began scouring the pages for links to external sites. This would show definitively how frequently the sites referred to each other, and whether the sites crossed the political chasm that seemed to lie between the sides.
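The link-counting step can be sketched roughly as follows. This is an illustration using Python's standard library, not the scraper we actually ran. The site names are placeholder domains, and the rule of stripping a leading "www." to identify a site is a simplifying assumption.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def site_of(url):
    """Reduce a URL to its host name, dropping a leading 'www.' (assumption)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def count_links(pages, tracked_sites):
    """Count links from each source site to every other tracked site.

    pages is an iterable of (source_site, html) pairs.
    """
    counts = Counter()
    for source, html in pages:
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            target = site_of(href)
            if target in tracked_sites and target != source:
                counts[(source, target)] += 1
    return counts


# Placeholder domains standing in for scraped sites.
tracked = {"left.example", "right.example"}
pages = [
    ("left.example", '<a href="https://www.right.example/story">x</a><a href="/local">y</a>'),
    ("right.example", '<a href="https://left.example/a">a</a>'
                      '<a href="https://left.example/b">b</a>'
                      '<a href="https://other.example/">o</a>'),
]
links = count_links(pages, tracked)
```

Each (source, target) count becomes the thickness of one arc in the chord plot. Relative links and links to untracked sites are ignored.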
We found there is a chasm. The left and right do not tend to share. This was not too surprising, though still disappointing.
However, we did find what we call “hub” sites. Only three seemed to cross the divide heavily, and in both directions.
So, there we were, looking at two all but insulated sides. We noticed there was simply more volume on the liberal side: they wrote longer articles, and they covered more topics of interest…
At this point, our assumption would be the liberal media should be dominating the political ecosphere.
This was discordant with what we see every day, so we went back to the data…
This cloud spanned both sides of the stage. It is clearly more similar to the Conservative Word Cloud than to the Liberal Word Cloud. Despite the volume and quality we had discovered on the Left, the Right-leaning media was dominating the space.
Our exploration of the media lexicon has taught us four very important lessons: