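I am trying to collapse a nominal categorical vector by combining its low-frequency levels into an "Other" category.
The data (a column of a data frame) looks like this, and contains information for all 50 states: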
California Florida Alabama ...
table(colname)/length(colname)
correctly returns the frequencies. What I want to do is lump together anything below a given threshold (say f = 0.02). What is the correct approach?
From the sound of it, something like the following should work for you:
condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  vector[vector %in% toCondense] <- newName
  vector
}
Try it out:
## Sample data
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
round(prop.table(table(a)), 2)
# a
#    a    A    b    B    c    C    d    D    e    E    f    g    h
# 0.07 0.02 0.07 0.02 0.10 0.02 0.10 0.02 0.12 0.02 0.07 0.12 0.13
#    i    j
# 0.08 0.07
a
#  [1] "c" "d" "d" "e" "j" "h" "c" "h" "g" "i" "g" "d" "f" "D" "g" "h"
# [17] "h" "a" "b" "h" "e" "g" "h" "b" "d" "e" "e" "g" "i" "f" "d" "e"
# [33] "g" "c" "g" "a" "B" "i" "i" "b" "i" "j" "f" "d" "c" "h" "E" "j"
# [49] "j" "c" "C" "e" "f" "a" "a" "h" "e" "c" "A" "b"
condenseMe(a)
#  [1] "c"     "d"     "d"     "e"     "j"     "h"     "c"     "h"
#  [9] "g"     "i"     "g"     "d"     "f"     "Other" "g"     "h"
# [17] "h"     "a"     "b"     "h"     "e"     "g"     "h"     "b"
# [25] "d"     "e"     "e"     "g"     "i"     "f"     "d"     "e"
# [33] "g"     "c"     "g"     "a"     "Other" "i"     "i"     "b"
# [41] "i"     "j"     "f"     "d"     "c"     "h"     "Other" "j"
# [49] "j"     "c"     "Other" "e"     "f"     "a"     "a"     "h"
# [57] "e"     "c"     "Other" "b"
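One detail in the output above may look odd: the uppercase letters are displayed with a proportion of 0.02, yet they are still condensed even though the comparison is a strict "less than 0.02". That is only a rounding effect. A small check, assuming the same sample vector as above (the name rare_freq is introduced here purely for illustration; the lowercase counts can vary with the R version's random number generator, but each uppercase letter always occurs exactly once in 60):

```r
# Rebuild the sample vector: 5 uppercase letters (once each) plus
# 55 randomly drawn lowercase letters, 60 values in total.
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))

# Unrounded proportions for the uppercase letters.
freq <- prop.table(table(a))
rare_freq <- freq[c("A", "B", "C", "D", "E")]

as.vector(rare_freq)            # each is 1/60, roughly 0.0167
all(rare_freq < 0.02)           # TRUE: strictly below the threshold
round(as.vector(rare_freq), 2)  # displayed as 0.02 after rounding
```

So the threshold is applied to the unrounded proportions, and a level sitting exactly at the threshold would be kept.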
Note, however, that if you are dealing with factors, you should convert them with as.character first.
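The conversion matters because assigning a value that is not an existing level to a factor produces NA (with a warning) rather than the new label. Below is a minimal sketch of a factor-aware variant. The function name condenseFactor and the sample states vector are hypothetical, made up for illustration, and are not part of the answer above:

```r
# Hypothetical helper: same idea as condenseMe(), but it converts a factor
# to character before relabelling and converts back to a factor afterwards.
condenseFactor <- function(x, threshold = 0.02, newName = "Other") {
  wasFactor <- is.factor(x)
  x <- as.character(x)                    # avoid "invalid factor level" NAs
  freq <- prop.table(table(x))            # relative frequency of each value
  rare <- names(freq)[freq < threshold]   # values strictly below the threshold
  x[x %in% rare] <- newName
  if (wasFactor) factor(x) else x         # rebuild levels from the new values
}

# Made-up data: 100 values, "Alabama" occurs once (proportion 0.01).
states <- factor(c(rep("California", 60), rep("Florida", 39), "Alabama"))
condensed <- condenseFactor(states)
levels(condensed)   # "California" "Florida" "Other"
table(condensed)
```

Rebuilding the factor at the end also drops the levels that were merged, so the result has no empty levels left over.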