以下是用于此问题的代码:
set.seed(1337) myDT <- data.table(Key1 = sample(letters, 500, replace = TRUE), Key2 = sample(LETTERS[1:5], 500, TRUE), Data = sample(1:26, 500, replace = TRUE)) setkey(myDT, Key1, Key2) # showing what myDT looks like > myDT Key1 Key2 Data 1: a A 6 2: a A 3 3: a B 2 4: a B 20 5: a B 13 --- 496: z D 23 497: z E 3 498: z E 18 499: z E 11 500: z E 2
我想配对myDT
,只为每个Key1,Key2对获取最大的数据值.例如(使用(Key1,Key2)表示一对)(a,A)我想摆脱Data为3的行并保持Data为6的行.对于(z,E)我想要仅保留Data为18的行.
在输入这个问题时,我找到了一个解决方案(我将在下面发布),但请帮助我知道如何处理这个问题.
我的答案
myDT[order(-Data), head(.SD, 1), by = .(Key1, Key2)] # if you are on 1.9.6 or lower use this one myDT[order(-Data), .SD[1], by = .(Key1, Key2)]
或者来自评论
unique(myDT[order(-Data)], by = c("Key1", "Key2"))
基准50M行.
library(dplyr) library(data.table) library(microbenchmark) set.seed(1337) n = 5e7 myDT <- data.table(Key1 = sample(letters, n, replace = TRUE), Key2 = sample(LETTERS[1:5], n, TRUE), Data = sample(1:26, n, replace = TRUE)) setkey(myDT, Key1, Key2) microbenchmark(times = 10L, CathG = myDT[, .SD[which.max(Data)], by = .(Key1, Key2)], jangorecki = myDT[order(-Data), head(.SD, 1), by = .(Key1, Key2)], jangorecki.keeporder = myDT[order(-Data), head(.SD, 1), keyby = .(Key1, Key2)], nist = myDT %>% group_by(Key1,Key2) %>% summarise(Data = max(Data)), David = unique(myDT[order(-Data)], by = c("Key1", "Key2"))) #Unit: milliseconds # expr min lq mean median uq max neval # CathG 659.6150 689.3035 733.9177 739.795 780.0075 811.1456 10 # jangorecki 2844.7565 3026.3385 3089.6764 3097.332 3219.1951 3343.9919 10 # jangorecki.keeporder 2935.3733 3194.1606 3232.9297 3214.581 3308.0735 3411.4319 10 # nist 803.1921 844.5002 1011.7878 1007.755 1188.6127 1228.3869 10 # David 3410.4853 3501.5918 3590.2382 3590.190 3652.8091 3803.9038 10
以前发布的小数据基准显示了很多不同的结果,所以我认为它在很大程度上取决于数据,不仅仅是数量,还有基数(唯一值的数量) - 在某些情况下甚至更多.