I'm trying to make a word cloud of the Jeopardy dataset found here: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
My code is as follows:
library(tm)
library(SnowballC)
library(wordcloud)

jeopQ <- read.csv('JEOPARDY_CSV.csv', stringsAsFactors = FALSE)
jeopCorpus <- Corpus(VectorSource(jeopQ$Question))
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, removeWords, c('the', 'this', stopwords('english')))
jeopCorpus <- tm_map(jeopCorpus, stemDocument)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)
The words 'the' and 'this' still appear in the word cloud. Why is this happening, and how can I fix it?
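The problem is that you never convert the text to lowercase. Many of the questions start with "The", while the stop words are all lowercase, e.g. "the" and "this". Since "The" != "the", removeWords does not match "The", so it stays in the corpus.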
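To see the case-sensitivity issue in isolation, here is a minimal base-R sketch that needs neither tm nor the dataset. The helper `remove_words` and the sample sentence are made up for illustration; the helper only approximates whole-word removal with a regular expression and is not meant to be tm's actual implementation.

```r
# Hypothetical helper: delete whole-word, case-sensitive matches of `words`.
# This is an illustrative stand-in, not the code used inside tm.
remove_words <- function(text, words) {
  # Build a pattern such as \b(the|this)\b
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", text, perl = TRUE)
}

# Made-up example sentence that starts with a capitalised stop word
question   <- "The capital of this country is Paris"
stop_words <- c("the", "this")

# Without lowercasing: "this" is removed, but "The" does not match "the"
as_is <- remove_words(question, stop_words)

# With lowercasing first: both stop words are removed
lowered <- remove_words(tolower(question), stop_words)

print(as_is)
print(lowered)
```

The capitalised "The" is still present in `as_is`, but no "the" remains in `lowered`. That is the same effect you are seeing in the word cloud.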
If you use the code below instead, it should work:
# Lowercase first so that "The" becomes "the" and matches the stop-word list
jeopCorpus <- tm_map(jeopCorpus, content_transformer(tolower))
# stopwords('english') already contains 'the' and 'this'
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, stemDocument)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)