Data Processes

Here, we describe the process of finding and cleaning our content sets. 

Effects and Impact

To analyze the impacts and effects of the suffragettes, I built a new content set using the search term “suffragette”. I filtered the results by document type (articles) and by publication date, from August 1920 to the present day. I chose August 18, 1920 as the starting point because that is when women were given the right to vote, after the major events of women’s suffrage. I also filtered by OCR confidence (above 65%) and by subject: “feminism”, “voting rights”, and “gender equality”. From the filtered results, I chose 24 documents that I thought could help answer my research question. I experimented with the NGrams and Sentiment Analysis tools, but found that their output was not relevant to my research question.
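The filters described above were applied through the search interface, but the same selection logic can be sketched in code. This is only an illustration: the `Document` record, its field names, and the `filter_documents` helper are hypothetical and are not part of any tool I used, and the cutoff date assumes the filter starts exactly on August 18, 1920.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    # Hypothetical metadata record; field names are assumptions.
    title: str
    doc_type: str
    published: date
    ocr_confidence: float  # percent, 0-100
    subjects: tuple


# Subjects and start date taken from the description above.
SUBJECTS = {"feminism", "voting rights", "gender equality"}
START = date(1920, 8, 18)


def matches_filters(doc, min_ocr=65.0):
    """Return True if a document passes all four filters."""
    return (
        doc.doc_type == "article"
        and doc.published >= START
        and doc.ocr_confidence > min_ocr
        and any(s.lower() in SUBJECTS for s in doc.subjects)
    )


def filter_documents(docs):
    """Keep only the documents that pass every filter."""
    return [d for d in docs if matches_filters(d)]
```

The final choice of 24 documents was a manual reading step, so it is not represented in this sketch.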

Reading over each document in my new content set, I chose six documents spaced from 1920 to recent times that represent continued advances toward gender equality. 


To glean more information about the criminalization of the suffragettes, I removed correspondence and other data that seemed irrelevant to the topic. I first filtered out “correspondence” and then manually looked through the categories of government documents. The search term I used was “Criminal”. I kept the license numbers of the women to see whether any would show up repeatedly (when the NGram was run with more words, their numbers came up).

I then looked at the public records and legal documents and eventually narrowed the data down to the point where I could run NGrams (Voyant) and Sentiment Analysis (Gale Digital Scholar). Although the results were still cluttered because of the high number of documents, the two tools gave me a general picture of what the data represented. I used these two rather than the other analysis tools because they best served my purpose: showing that the suffragettes' actions were criminalized.
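To make the NGram step more concrete, here is a minimal sketch of counting word sequences in a text. It is not how Voyant or Gale Digital Scholar work internally; `tokenize` and `ngram_counts` are hypothetical helpers written for illustration. The tokenizer keeps digits, on the assumption that identifiers such as license numbers should stay countable.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word and number tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(text, n=2):
    """Count every run of n consecutive tokens in the text."""
    tokens = tokenize(text)
    # Zip the token list against shifted copies of itself to form n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)
```

Calling `ngram_counts(text, 1).most_common(20)` would list the twenty most frequent single tokens, which is where a repeated number would surface.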

Media Portrayal

To understand how the movement was portrayed, I created a new content set.

I conducted two searches in the DSL to find documents. My goal was to find current documents (from the last 20 years), specifically articles, about feminist movements so I could compare them to documents from the suffragette movement. I also searched for articles from the period of the suffragette dataset to supplement it, because that dataset mostly contains legal documents.

For my first search, I looked up ‘feminism’ and used the filters to select articles from 2000-2020 published in England. I selected the top 23 articles with OCR ratings above 50%. Next, I searched for suffragette articles from 1087-1952 published in England and selected 53 under the same criteria.

I then cleaned this content set. I used the English stop words list, added the word 'said' to it, and excluded a few special characters. I had originally decided to add 'women' to the stop words list as well, feeling it was obviously going to be a frequent word and irrelevant to the analysis. However, its absence was noticeable in every analysis I conducted, so I removed it from the stop words list.
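The cleaning itself was configured in the tool's interface, but the effect of the stop words decision can be sketched roughly as follows. The short `BASE_STOPWORDS` set is only a stand-in for the full English list, the rule for which characters to strip is an assumption (the exact characters I excluded are not reproduced here), and `clean_tokens` is a hypothetical helper.

```python
import re

# Stand-in for the full English stop words list (assumption: abbreviated).
BASE_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was",
                  "for", "that", "is", "it"}
# The one word added to the list in the final configuration.
CUSTOM_STOPWORDS = {"said"}
# Assumption: treat anything that is not a letter, space, or apostrophe
# as a special character to exclude.
STRIP_CHARS = re.compile(r"[^a-z\s']")


def clean_tokens(text, extra_stopwords=frozenset()):
    """Lowercase, strip special characters, and drop stop words."""
    stopwords = BASE_STOPWORDS | CUSTOM_STOPWORDS | set(extra_stopwords)
    lowered = STRIP_CHARS.sub(" ", text.lower())
    return [t for t in lowered.split() if t not in stopwords]
```

Passing `extra_stopwords={"women"}` reproduces my first attempt: the word disappears from every downstream count, which is the gap that led me to take it back off the list.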

A screenshot of the cleaning configuration used for the Media Portrayal content set.