I've been using the tm package to create a DocumentTermMatrix as follows:
library(tm)
library(RWeka)
library(SnowballC)
src <- DataframeSource(data.frame(data3$JobTitle))
# create a corpus and transform data
# Sets the default number of threads to use
options(mc.cores=1)
c_copy <- c <- Corpus(src)
c <- tm_map(c, content_transformer(tolower), mc.cores=1)
c <- tm_map(c,content_transformer(removeNumbers), mc.cores=1)
c <- tm_map(c,removeWords, stopwords("english"), mc.cores=1)
c <- tm_map(c,content_transformer(stripWhitespace), mc.cores=1)
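# BigramTokenizer is not defined in the post; this is the usual RWeka definition (an assumption)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# make DTM
dtm <- DocumentTermMatrix(c, control = list(tokenize = BigramTokenizer))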
Now, the DTM comes out fine – what I want to do is get the frequencies of the frequent terms within the DTM. Obviously, I can use findFreqTerms to get the terms themselves, but not the actual frequencies. termFreq only works on TextDocument, not a DTM or TDM – any ideas?
Output from str – the frequent terms are in $ Terms:
> str(dtm)
List of 6
$ i : int [1:190] 1 2 3 4 5 6 7 8 9 10 ...
$ j : int [1:190] 1 2 3 4 5 6 7 8 9 10 ...
$ v : num [1:190] 1 1 1 1 1 1 1 1 1 1 ...
$ nrow : int 119
$ ncol : int 146
$ dimnames:List of 2
..$ Docs : chr [1:119] "1" "2" "3" "4" ...
..$ Terms: chr [1:146] "account administrator" "account assistant" "account director" "account executive" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
Best Answer
Thanks to NicE for the advice - it works well. Adding the weighting argument to the control list lets me see the term frequencies when I inspect the DTM. After that, it is a simple matter of summing each column.
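For anyone landing here later, here is a minimal sketch of the "sum per column" step. It is not the exact code I ran: the `titles` vector is hypothetical toy data standing in for `data3$JobTitle`, and it uses tm's default tokenizer rather than the bigram tokenizer so that it runs without RWeka. `weightTf` is tm's raw-count weighting, spelled out explicitly here.

```r
library(tm)

# Hypothetical toy data standing in for data3$JobTitle
titles <- c("account manager", "account director", "sales manager")
corpus <- VCorpus(VectorSource(titles))

# Raw term counts per document (weightTf), stated explicitly
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))

# Sum each column to get the corpus-wide frequency of every term
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Look up the counts for the terms that findFreqTerms() reports
frequent <- findFreqTerms(dtm, lowfreq = 2)
freq[frequent]
```

On a large DTM, `as.matrix()` creates a dense matrix and may use a lot of memory; `slam::col_sums(dtm)` should give the same per-term totals while keeping the sparse representation.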