Pre-processing text data with tm, quanteda & tidytext packages
Suppose you start with some sentences / passages / documents, and you want to pre-process the corpus before generating a document-term matrix (DTM, or DFM).
This post will outline how text can be cleaned with some of the popular packages in R. I'm writing this for two reasons:

1. Some pre-processing steps happen by default (quanteda::dfm will convert terms to lower case, and so will tm::DocumentTermMatrix unless the user deliberately prevents that), but it's not always clear whether an action is taken "in the background" or what pre-processing options are available.
2. The provided list of stop words can be extended, and you will often want to add items to the list in order to remove URLs or other strings.
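As a quick illustration of the second point, extending a built-in stop word list is just a matter of concatenating extra strings onto the character vector. A minimal sketch (the added tokens here are purely illustrative):

library(quanteda)
# append custom strings (e.g. URL fragments) to the built-in Snowball list
my_stopwords <- c(stopwords("english"), "https", "http")

The extended vector can then be passed anywhere a stop word list is expected.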
Let's look at three popular packages: quanteda, tm, and tidytext.
Consider Barack Obama's first address to Congress (24 February 2009), taken from the State of the Union corpus in quanteda.corpora:
library(quanteda)
# quanteda.corpora is installed from GitHub: remotes::install_github("quanteda/quanteda.corpora")
speeches <- quanteda.corpora::data_corpus_sotu
obama2009 <- corpus_subset(speeches, Date == "2009-02-24")
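It's worth confirming that the subset contains exactly one document; ndoc() is quanteda's document-count helper:

ndoc(obama2009)  # should return 1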
A document feature matrix can be set up with the code below, and if you explicitly specify the main pre-processing steps then there will be no surprises.
bo_dfm_2009 <- obama2009 %>%
  tokens(remove_punct = TRUE) %>%          # drop punctuation
  tokens_tolower() %>%                     # lower-case (dfm() also does this by default)
  tokens_remove(stopwords("english")) %>%  # drop Snowball stop words
  dfm()                                    # note: no stemming step
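A quick way to sanity-check the matrix is to look at its most frequent features:

topfeatures(bo_dfm_2009, 10)  # the ten most frequent terms after cleaning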
The stop word and stemming decisions are typically more consequential than the other choices. If you want to compare or detect authorship, keep the stop words: function words are among the strongest stylometric signals.
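Stemming is consequential in the other direction: it can shrink the vocabulary substantially. A minimal sketch of the comparison (the exact counts will depend on your quanteda version):

toks <- tokens(obama2009, remove_punct = TRUE)
nfeat(dfm(toks))                   # vocabulary size without stemming
nfeat(dfm(tokens_wordstem(toks)))  # typically noticeably smaller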
Now for the tm package. Suppose you have a corpus of documents stored in an object called documents.
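If you are starting from plain text, such a corpus can be built from a character vector. A minimal sketch (my_texts is a hypothetical vector standing in for your own data):

library(tm)
my_texts <- c("First example document.", "A second document, with more text.")
documents <- VCorpus(VectorSource(my_texts))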
# Lower-case first so that capitalised stop words (e.g. "The") are caught
documents <- tm_map(documents, content_transformer(tolower))
# Remove stop words
documents <- tm_map(documents, removeWords, stopwords("english"))
# Or, to use the larger SMART list instead:
# documents <- tm_map(documents, content_transformer(removeWords), stopwords("SMART"))
documents <- tm_map(documents, content_transformer(removePunctuation))
dtm <- DocumentTermMatrix(documents)
Or, just following the documentation:
dtm <- DocumentTermMatrix(
  documents,
  control = list(
    weighting = function(x) weightTfIdf(x, normalize = FALSE),
    stopwords = TRUE
  )
)
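Either way, findFreqTerms() gives a quick check of which terms survived the cleaning (the threshold of 5 here is arbitrary, and with tf-idf weighting it applies to the weights rather than raw counts):

findFreqTerms(dtm, lowfreq = 5)  # terms with (weighted) frequency of at least 5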
The tidytext approach works on a data frame rather than a corpus object; here documents is a data frame with one row per document and a text column:

library(dplyr)
library(stringr)
library(tibble)
library(tidytext)

url_words <- tibble(word = c("https", "http"))
tidy_dataset <- documents %>%
  unnest_tokens(output = word, input = text) %>%
  filter(!str_detect(word, "^[0-9]*$")) %>%  # remove numbers
  anti_join(get_stopwords()) %>%             # remove Snowball stop words
  anti_join(url_words) %>%                   # remove some URL fragments
  mutate(word = SnowballC::wordStem(word))   # apply a stemming procedure
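From a tidy table like this, a document-term matrix can be produced by counting terms per document and casting; doc_id is a hypothetical document-identifier column:

tidy_dtm <- tidy_dataset %>%
  count(doc_id, word) %>%                               # term counts per document
  cast_dtm(document = doc_id, term = word, value = n)   # a tm-style DocumentTermMatrix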
Finally, very rare terms can be dropped from the matrix. With the tm package, you would run:
dtm <- removeSparseTerms(dtm, 0.95)  # drop terms absent from more than 95% of documents
With quanteda, you would run:
dfm <- dfm_trim(dfm, min_termfreq = 10, min_docfreq = 10)  # keep terms occurring at least 10 times, in at least 10 documents
For explanations and more options, see the dfm_trim reference page ("Trim a dfm using frequency threshold-based feature selection") in the quanteda documentation.