Pre-processing text data with tm, quanteda & tidytext packages

Suppose you start with some sentences / passages / documents, and you want to pre-process the corpus before generating a document-term matrix (DTM), called a document-feature matrix (DFM) in quanteda.

This post will outline how text can be cleaned with some of the popular packages in R. I'm writing this for two reasons:

  • I noticed that sometimes people convert objects when it's not really necessary. For example, there isn't necessarily a need to change your text object into a tidy dataset just to remove stop words.
  • Some actions are taken by default (quanteda::dfm will convert terms to lower case, and so will tm::DocumentTermMatrix unless the user deliberately prevents that), but it's not always clear which actions happen "in the background" and what pre-processing options are available. The main take-away here is that the provided list of stop words can be extended, and you will often want to add items to the list in order to remove URLs or other strings (a small sketch follows this list).

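For example, the stop word list returned by stopwords() (available in both tm and quanteda) is just a character vector, so it can be concatenated with any extra strings you want to drop; the extra tokens below are only illustrative:

my_stopwords <- c(stopwords("english"), "https", "http")

# my_stopwords can then be passed wherever a stop word list is expected,
# e.g. as the remove argument of quanteda's dfm() below, or in
# tm_map(documents, removeWords, my_stopwords) with tm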
Let's look at 3 popular packages:

1. I need/want to use quanteda

Consider Barack Obama's first address to a joint session of Congress (24 February 2009), included in the State of the Union corpus:

library(quanteda)
speeches <- quanteda.corpora::data_corpus_sotu
obama2009 <- corpus_subset(speeches, Date == "2009-02-24")

A document-feature matrix can be set up with the code below, and if you explicitly specify the main pre-processing steps there will be no surprises.

bo_dfm_2009 <- dfm(obama2009,
                   remove_punct = TRUE,
                   tolower = TRUE,
                   remove = stopwords("english"),
                   stem = FALSE)

The last two options (stop word removal and stemming) are typically more consequential than the other decisions.

If you want to compare or detect authorship, keep the stop words: function words are often among the most informative features for that task.
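
Note that in more recent quanteda releases (version 3 and later) these pre-processing choices are applied at the tokens stage rather than inside dfm(); a rough equivalent of the call above would look like this:

library(quanteda)
toks <- tokens(obama2009, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, pattern = stopwords("english"))
# add toks <- tokens_wordstem(toks) here if you do want stemming
bo_dfm_2009 <- dfm(toks)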

2. I need/want to use the tm package

Suppose you have a tm corpus of documents.
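If you are starting from a plain character vector, such a corpus can be built along the following lines (the texts are illustrative):

library(tm)
texts <- c("The first document mentions https://example.com",
           "A second, much shorter document.")
documents <- VCorpus(VectorSource(texts))

The corpus can then be cleaned step by step before building the document-term matrix: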

# Convert to lower case first, so that capitalised stop words are also caught
documents <- tm_map(documents, content_transformer(tolower))

# Remove stop words
documents <- tm_map(documents, removeWords, stopwords("english"))

# Or use the longer SMART list: documents <- tm_map(documents, removeWords, stopwords("SMART"))

# Remove punctuation
documents <- tm_map(documents, removePunctuation)

dtm <- DocumentTermMatrix(documents)

Alternatively, following the package documentation, tf-idf weighting and stop word removal can be specified directly in the DocumentTermMatrix() call:

dtm <- DocumentTermMatrix(documents,
                          control = list(
                            weighting = function(x) weightTfIdf(x, normalize = FALSE),
                            stopwords = TRUE))
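
The control list can also carry most of the other cleaning steps, so the pre-processing can be folded into a single call; a sketch using the options documented for termFreq():

dtm <- DocumentTermMatrix(documents,
                          control = list(tolower = TRUE,
                                         removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE,   # default English list
                                         stemming = TRUE))   # requires the SnowballC package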

3. I need/want to use tidytext

library(dplyr)
library(stringr)
library(tidytext)

# Here `documents` is a data frame (tibble) with a `text` column.
# Extra "stop words" to drop, e.g. fragments left behind by URLs:
url_words <- tibble(
  word = c("https", "http"))

tidy_dataset <- documents %>%
   unnest_tokens(output = word, input = text) %>%
   filter(!str_detect(word, "^[0-9]*$")) %>% # remove numbers
   anti_join(get_stopwords()) %>%    # remove snowball stop words
   anti_join(url_words) %>%    # remove the url fragments defined above
   mutate(word = SnowballC::wordStem(word))    # apply a stemming procedure
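
From here the tidy data can be cast back into a document-term matrix; a short sketch, assuming documents also carries a doc_id column identifying each document (the column name is illustrative):

dtm <- tidy_dataset %>%
   count(doc_id, word) %>%
   cast_dtm(document = doc_id, term = word, value = n)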

How can I make the matrix less sparse?

Using the tm package, you would run:

dtm <- removeSparseTerms(dtm, 0.95) # drop terms absent from more than 95% of documents

Using quanteda, you would run:

dfm <- dfm_trim(dfm, 
                min_termfreq = 10,   # keep features occurring at least 10 times in total
                min_docfreq = 10)    # and appearing in at least 10 documents

For explanations and more options, see the quanteda documentation for dfm_trim().
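
dfm_trim() can also drop features that are too common to be informative; for example, a sketch using proportional document frequencies:

dfm <- dfm_trim(dfm,
                max_docfreq = 0.90,      # drop features appearing in more than 90% of documents
                docfreq_type = "prop")   # interpret the threshold as a proportion of documents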
