I have several TermDocumentMatrixes created with the tm package in R.
I want to find the 10 most frequent terms in each set of documents to ultimately end up with an output table like:
corpus1    corpus2
"beach"    "city"
"sand"     "sidewalk"
...        ...
[10th most frequent word]
By definition, findFreqTerms(corpus1, N) returns all of the terms that appear N or more times. To do this by hand I could adjust N until I got about 10 terms back, but the output of findFreqTerms is listed alphabetically, so unless I picked exactly the right N I wouldn't actually know which were the top 10. I suspect this involves manipulating the internal structure of the TDM that you can see with str(corpus1), as in R tm package create matrix of N most frequent terms, but the answer there was very opaque to me, so I wanted to rephrase the question.
Thanks!
Best Answer
Here's one way to find the top N terms in a document term matrix. Briefly, you convert the dtm to a matrix, then sort by row sums.
Here's the method from your question, which returns words in alphabetical order; as you note, that's not always very useful...
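A self-contained sketch of that approach (the two toy documents are invented for illustration; findFreqTerms and TermDocumentMatrix are from tm):

```r
library(tm)

# Build a tiny term-document matrix from two toy documents
docs <- Corpus(VectorSource(c("beach sand beach sun",
                              "beach sand surf")))
tdm <- TermDocumentMatrix(docs)

# Every term appearing at least twice, returned in alphabetical order;
# you can't tell from this output which term is the most frequent
findFreqTerms(tdm, lowfreq = 2)
```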
And here's what you can do to get the top N words in order of their abundance:
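A sketch using a toy matrix in place of as.matrix(tdm) (the terms and counts are made up); the technique is just rowSums plus sort:

```r
# Rows are terms, columns are documents -- the shape as.matrix()
# gives you for a TermDocumentMatrix
m <- matrix(c(5, 3,    # beach
              4, 2,    # sand
              1, 0),   # sun
            nrow = 3, byrow = TRUE,
            dimnames = list(c("beach", "sand", "sun"),
                            c("doc1", "doc2")))

# Total count of each term across all documents, most frequent first
freq <- sort(rowSums(m), decreasing = TRUE)

# Take however many you need; with real data this would be head(freq, 10)
head(freq, 2)
```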
For several document term matrices, you could do something like this:
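For example, with a named list of matrices standing in for your converted TDMs (the names and counts are made up), lapply applies the same sort to each:

```r
# One single-column matrix per corpus, in place of lapply(tdms, as.matrix)
mats <- list(
  corpus1 = matrix(c(5, 4, 1), ncol = 1,
                   dimnames = list(c("beach", "sand", "sun"), NULL)),
  corpus2 = matrix(c(6, 3, 2), ncol = 1,
                   dimnames = list(c("city", "sidewalk", "taxi"), NULL))
)

# Top-2 terms per corpus; use [1:10] on real data
top_terms <- lapply(mats, function(m)
  names(sort(rowSums(m), decreasing = TRUE))[1:2])
top_terms
```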
Is that what you want to do?
Hat-tip to Ian Fellows' wordcloud package where I first saw this method.
UPDATE: following the comment below, here's some more detail...
Here's some data to make a reproducible example with multiple corpora:
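For instance, two small character vectors, one per corpus (the sentences are invented):

```r
# Each element is one document
texts1 <- c("the beach was covered in sand",
            "sand and surf at the beach",
            "sun on the beach all day")
texts2 <- c("the city sidewalk was crowded",
            "a taxi on every city street",
            "sidewalk cafes in the city")
```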
Now let's process the example text a little, in the usual way. First convert the character vectors to corpora.
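Assuming character vectors like the ones above (redefined here so the snippet runs on its own), Corpus plus VectorSource does the conversion:

```r
library(tm)

texts1 <- c("the beach was covered in sand",
            "sand and surf at the beach")
texts2 <- c("the city sidewalk was crowded",
            "a taxi on every city street")

# One corpus per character vector; each element becomes a document
corpus1 <- Corpus(VectorSource(texts1))
corpus2 <- Corpus(VectorSource(texts2))
```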
Now remove stopwords, numbers, punctuation, etc.
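A sketch of the usual tm_map cleanup chain, applied to a toy corpus built inline (the transformations shown are the standard ones shipped with tm):

```r
library(tm)

corpus1 <- Corpus(VectorSource(c("The beach had sand, sun and 3 surfers.",
                                 "Sand and surf at the beach!")))

corpus1 <- tm_map(corpus1, content_transformer(tolower))       # lower-case
corpus1 <- tm_map(corpus1, removeNumbers)                      # drop digits
corpus1 <- tm_map(corpus1, removePunctuation)                  # drop punctuation
corpus1 <- tm_map(corpus1, removeWords, stopwords("english"))  # drop stopwords
```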
Convert processed corpora to term document matrix:
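Continuing with a toy cleaned corpus (built inline so this runs on its own):

```r
library(tm)

corpus1 <- Corpus(VectorSource(c("beach sand sun", "sand surf beach")))

# Terms become rows, documents become columns
tdm1 <- TermDocumentMatrix(corpus1)
inspect(tdm1)
```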
Get the most frequently occurring words in each corpus:
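Using the same rowSums trick on each TDM (toy corpus built inline; the toy data only has a few terms, but on real data head(..., 10) gives your top 10):

```r
library(tm)

tdm1 <- TermDocumentMatrix(
  Corpus(VectorSource(c("beach sand sun beach", "sand surf beach"))))

# Convert to an ordinary matrix, total each term, sort descending
freq1 <- sort(rowSums(as.matrix(tdm1)), decreasing = TRUE)
head(freq1, 10)
```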
And reshape it into a data frame in the form you specified:
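For example, with the top terms from each corpus as character vectors (the values here are made up), data.frame lines them up column by column; the vectors must be the same length, i.e. top-10 each in your case:

```r
top1 <- c("beach", "sand", "sun")
top2 <- c("city", "sidewalk", "taxi")

# One column per corpus, one row per rank
result <- data.frame(corpus1 = top1, corpus2 = top2)
result
```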
Can you adapt that to work with your data? If not, please edit your question to more accurately show what your data look like.