
Combining low-frequency counts


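I am trying to collapse a nominal categorical vector by combining its low-frequency values into an "Other" category.

The data (a column of a data frame) looks like this and contains information for all 50 states: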

California
Florida
Alabama
...

table(colname)/length(colname) correctly returns the frequencies. What I want to do is lump together everything below a given threshold (say f = 0.02). What is the correct approach?



1> A5C1D2H2I1M1..:

From the sound of it, something like the following should work for you:

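condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  # Values whose relative frequency falls below the threshold
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  # Replace every occurrence of those values with the catch-all label
  vector[vector %in% toCondense] <- newName
  vector
}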

Try it out:

## Sample data
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))

round(prop.table(table(a)), 2)
# a
#    a    A    b    B    c    C    d    D    e    E    f    g    h 
# 0.07 0.02 0.07 0.02 0.10 0.02 0.10 0.02 0.12 0.02 0.07 0.12 0.13 
#    i    j 
# 0.08 0.07 

a
#  [1] "c" "d" "d" "e" "j" "h" "c" "h" "g" "i" "g" "d" "f" "D" "g" "h"
# [17] "h" "a" "b" "h" "e" "g" "h" "b" "d" "e" "e" "g" "i" "f" "d" "e"
# [33] "g" "c" "g" "a" "B" "i" "i" "b" "i" "j" "f" "d" "c" "h" "E" "j"
# [49] "j" "c" "C" "e" "f" "a" "a" "h" "e" "c" "A" "b"

condenseMe(a)
#  [1] "c"     "d"     "d"     "e"     "j"     "h"     "c"     "h"    
#  [9] "g"     "i"     "g"     "d"     "f"     "Other" "g"     "h"    
# [17] "h"     "a"     "b"     "h"     "e"     "g"     "h"     "b"    
# [25] "d"     "e"     "e"     "g"     "i"     "f"     "d"     "e"    
# [33] "g"     "c"     "g"     "a"     "Other" "i"     "i"     "b"    
# [41] "i"     "j"     "f"     "d"     "c"     "h"     "Other" "j"    
# [49] "j"     "c"     "Other" "e"     "f"     "a"     "a"     "h"    
# [57] "e"     "c"     "Other" "b"   

Note, however, that if you are working with factors, you should convert them with as.character first.
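The conversion matters because assigning a label that is not already one of a factor's levels produces NA (with a warning) instead of the new value. Below is a minimal sketch of a factor-aware variant. The name condense_factor and the sample states vector are hypothetical, made up for illustration, and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): collapse rare levels of a factor.
condense_factor <- function(f, threshold = 0.02, new_name = "Other") {
  x <- as.character(f)                   # work on the labels, not the integer codes
  freq <- prop.table(table(x))           # relative frequency of each value (NA is excluded)
  rare <- names(freq)[freq < threshold]  # values that fall below the threshold
  x[x %in% rare] <- new_name             # replace them with the catch-all label
  factor(x)                              # rebuild the factor; unused levels are dropped
}

# Made-up sample: 100 observations, two of which are rare (1% each).
states <- factor(c(rep("California", 60), rep("Florida", 38), "Alabama", "Wyoming"))
condensed <- condense_factor(states)
levels(condensed)
# [1] "California" "Florida"    "Other"
```

Rebuilding with factor() at the end is one possible design choice; it drops the levels that were collapsed, so later calls to table() or a model fit do not show empty categories.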
