Updated 2017-12-21 for newer versions of quanteda
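Glad to see you are using the package! I think you are struggling with two issues. The first is how to apply feature selection before forming ngrams. The second is how to define the feature selection (using quanteda).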
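The first issue: how to apply feature selection before forming ngrams. Here you have defined a dictionary to do this. (As I will show below, that is not necessary here.) You want to remove all terms that are not in your selection list and then form bigrams. quanteda does not do this by default, because the result is not a standard form of "bigram": the words are no longer collocated according to a window defined strictly by adjacency. In your expected result, for example, law capital is not a pair of adjacent terms, which is the usual definition of a bigram.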
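To make the adjacency point concrete, here is a minimal base R sketch (no quanteda needed). It uses the tokens of text1 and the selection list from this answer; the helper bigrams() is a hypothetical name introduced only for illustration.

```r
# Tokens of text1 as shown in this answer, already lower-cased
toks <- c("the", "new", "law", "included", "a", "capital", "gains", "tax",
          "and", "an", "inheritance", "tax")

# Selection list taken from the answer
keep <- c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")

# Hypothetical helper: paste each token to its right-hand neighbour
bigrams <- function(x) {
  if (length(x) < 2) return(character(0))
  paste(head(x, -1), tail(x, -1))
}

adjacent <- bigrams(toks)                  # bigrams in the usual, adjacency-based sense
filtered <- bigrams(toks[toks %in% keep])  # bigrams formed after feature selection

# Pairs that exist only because the tokens between them were removed
setdiff(filtered, adjacent)
```

The pairs returned by setdiff() never appeared side by side in the original text, which is why selecting first and pairing second is not quanteda's default.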
However, we can override this behaviour by building the document-feature matrix more "manually".
First, tokenize the texts.

    library(quanteda)
    library(magrittr)  # provides the %>% pipe used below

    # tokenize the original
    # (remove_punct / remove_numbers were spelled removePunct / removeNumbers
    # in older quanteda releases)
    toks <- tokens(ZcObj, remove_punct = TRUE, remove_numbers = TRUE) %>%
      tokens_tolower()
    toks
    ## tokens object from 2 documents.
    ## text1 :
    ## [1] "the" "new" "law" "included" "a" "capital" "gains" "tax" "and" "an" "inheritance" "tax"
    ##
    ## text2 :
    ## [1] "new" "york" "city" "has" "raised" "a" "taxes" "an" "income" "tax" "and" "a" "sales" "tax"

Now we apply the dictionary mydict to the tokenized texts using tokens_select():

    (toksDict <- tokens_select(toks, mydict, selection = "keep"))
    ## tokens object from 2 documents.
    ## text1 :
    ## [1] "the" "new" "law" "capital" "gains" "tax" "inheritance" "tax"
    ##
    ## text2 :
    ## [1] "new" "city" "tax" "tax"

From this set of selected tokens, we can now form bigrams (or we could feed toksDict directly into dfm()):

    (toks2 <- tokens_ngrams(toksDict, n = 2, concatenator = " "))
    ## tokens object from 2 documents.
    ## text1 :
    ## [1] "the new" "new law" "law capital" "capital gains" "gains tax" "tax inheritance" "inheritance tax"
    ##
    ## text2 :
    ## [1] "new city" "city tax" "tax tax"

    # now create the dfm
    (myDfm2 <- dfm(toks2))
    ## Document-feature matrix of: 2 documents, 10 features.
    ## 2 x 10 sparse Matrix of class "dfm"
    ##        features
    ## docs    the new new law law capital capital gains gains tax tax inheritance inheritance tax new city city tax tax tax
    ##   text1       1       1           1             1         1               1               1        0        0       0
    ##   text2       0       0           0             0         0               0               0        1        1       1

    topfeatures(myDfm2)
    #         the new         new law     law capital   capital gains       gains tax tax inheritance inheritance tax        new city        city tax         tax tax
    #               1               1               1               1               1               1               1               1               1               1

The feature list is now very close to what you wanted.
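For readers who want to run the steps end to end, the following is a self-contained sketch rather than the exact code above. Two things in it are assumptions: the texts in txt are stand-ins rebuilt from the token output shown earlier (the original ZcObj is not reproduced in this answer), and a plain character vector named keep is used in place of mydict. The names txt, keep, toks_kept, toks_bi and dfm_bi are illustrative.

```r
library(quanteda)

# Assumption: stand-in texts rebuilt from the token output shown in this answer;
# the asker's actual ZcObj may differ.
txt <- c(
  text1 = "The new law included a capital gains tax and an inheritance tax.",
  text2 = "New York City has raised a taxes: an income tax and a sales tax."
)

# Plain character vector of tokens to keep (used here instead of mydict)
keep <- c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")

# 1. tokenize and lower-case
toks <- tokens_tolower(tokens(txt, remove_punct = TRUE, remove_numbers = TRUE))

# 2. keep only the selected tokens
toks_kept <- tokens_select(toks, pattern = keep, selection = "keep")

# 3. form bigrams from what is left, joined by a space
toks_bi <- tokens_ngrams(toks_kept, n = 2, concatenator = " ")

# 4. build the document-feature matrix from the bigram tokens
dfm_bi <- dfm(toks_bi)

featnames(dfm_bi)
```

Selecting before calling tokens_ngrams() is what produces pairs such as "law capital"; reversing steps 2 and 3 would give only truly adjacent bigrams.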
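The second issue is why your dictionary approach seems inefficient. This is because you are creating a dictionary to perform feature selection without really using it as a dictionary. In other words, a dictionary in which each key has only itself as its value is not really a dictionary. Just supply a character vector of the tokens to select and it works fine, for example: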

    # keptFeatures is the argument name from older quanteda releases;
    # later releases do this kind of feature selection with dfm_select()
    (myDfm1 <- dfm(ZcObj, verbose = FALSE,
                   keptFeatures = c("the", "new", "law", "capital", "gains",
                                    "tax", "inheritance", "city")))
    ## Document-feature matrix of: 2 documents, 8 features.
    ## 2 x 8 sparse Matrix of class "dfm"
    ##        features
    ## docs    the new law capital gains tax inheritance city
    ##   text1   1   1   1       1     1   2           1    0
    ##   text2   0   1   0       0     0   2           0    1

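As a rough illustration of why the identity dictionary adds nothing, here is a base R sketch. The object identity_dict is a hypothetical stand-in for a dictionary whose keys map to themselves (the real mydict is not shown in this answer), and toks holds the tokens of text2 from above.

```r
keep <- c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")

# Hypothetical identity "dictionary": every key maps only to itself
identity_dict <- setNames(as.list(keep), keep)

# The values carry no information beyond the keys
same <- identical(unlist(identity_dict, use.names = FALSE), names(identity_dict))
same

# Tokens of text2 as shown in this answer
toks <- c("new", "york", "city", "has", "raised", "a", "taxes", "an",
          "income", "tax", "and", "a", "sales", "tax")

# Selection needs only the character vector
kept <- toks[toks %in% keep]
kept
```

Because the keys and the values are the same strings, the character vector alone already expresses the whole selection.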