I've noticed some strange behavior in data.table, and I'm hoping someone who understands data.table better than I do can explain it.
Say I have this data.table:
    library(data.table)

    DT <- data.table(
      C1  = c(rep("A", 4), rep("B", 4), rep("C", 4)),
      C2  = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("d", 3)),
      Val = c(1:5, NaN, NaN, 8, 9, 10, NaN, 12))

    DT
        C1 C2 Val
     1:  A  a   1
     2:  A  a   2
     3:  A  a   3
     4:  A  b   4
     5:  B  b   5
     6:  B  b NaN
     7:  B  c NaN
     8:  B  c   8
     9:  C  c   9
    10:  C  d  10
    11:  C  d NaN
    12:  C  d  12
Now, in my mind, the following two approaches should produce the same result, but they don't.
    # Approach 1: add the group minimum by reference with ":=", then deduplicate
    TEST1 <- DT[, agg := min(Val, na.rm = TRUE), by = c('C1', 'C2')]
    TEST1 <- data.table(unique(TEST1[, c('C1', 'C2', 'agg'), with = FALSE]))

    # Approach 2: aggregate directly with list()
    TEST2 <- DT[, list(agg = min(Val, na.rm = TRUE)), by = c('C1', 'C2')]

    TEST1
       C1 C2 agg
    1:  A  a   1
    2:  A  b   4
    3:  B  b   5
    4:  B  c   8
    5:  C  c   9
    6:  C  d  10

    TEST2
       C1 C2 agg
    1:  A  a   1
    2:  A  b   4
    3:  B  b   5
    4:  B  c NaN
    5:  C  c   9
    6:  C  d  10
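As you can see, using ":=" gives a minimum of 8 for (C1 = B, C2 = c), whereas the list command gives NaN. Interestingly, for (C1 = B, C2 = b) and (C1 = C, C2 = d), which also contain NaNs, the list command does produce a value. I believe this is because the result is NaN when the NaN comes first, before any value, within a given C1/C2 combination, whereas in the other two groups the NaN comes after a value.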
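For reference, here is a minimal sketch of what base R's `min` does with the same three groups of values once `na.rm = TRUE` is set. It only illustrates the order-independent behavior I expected; the variable names are made up for this example.

```r
# Base R drops NaN under na.rm = TRUE regardless of where it sits in the vector
nan_first  <- min(c(NaN, 8), na.rm = TRUE)       # values of group (B, c)
nan_last   <- min(c(5, NaN), na.rm = TRUE)       # values of group (B, b)
nan_middle <- min(c(10, NaN, 12), na.rm = TRUE)  # values of group (C, d)

print(c(nan_first, nan_last, nan_middle))
```

So in plain R the position of the NaN makes no difference, which is why the NaN in TEST2 looked wrong to me.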
Why does this happen?
I've also noticed that if the NaNs are replaced with NA, the values are generated without any problem.
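The NA observation can be turned into a workaround for affected versions. The following is only a sketch, assuming it is acceptable to overwrite the NaN entries in `Val` by reference; `res` is a name chosen for this example.

```r
library(data.table)

DT <- data.table(
  C1  = c(rep("A", 4), rep("B", 4), rep("C", 4)),
  C2  = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("d", 3)),
  Val = c(1:5, NaN, NaN, 8, 9, 10, NaN, 12))

# is.nan() is TRUE only for NaN (not for NA), so only those rows are replaced
DT[is.nan(Val), Val := NA_real_]

# Same list() aggregation as TEST2, now on NA instead of NaN
res <- DT[, list(agg = min(Val, na.rm = TRUE)), by = c('C1', 'C2')]
print(res)
```

If the distinction between NaN and NA matters later on, run this on a `copy(DT)` instead, since `:=` modifies the table in place.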
This issue, #1461, has just been fixed in the development version, v1.9.7, with commit 2080.
    require(data.table) # v1.9.7, commit 2080+

    DT <- data.table(
      C1  = c(rep("A", 4), rep("B", 4), rep("C", 4)),
      C2  = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("d", 3)),
      Val = c(1:5, NaN, NaN, 8, 9, 10, NaN, 12))

    DT[, list(agg = min(Val, na.rm = TRUE)), by = c('C1', 'C2')]
    #    C1 C2 agg
    # 1:  A  a   1
    # 2:  A  b   4
    # 3:  B  b   5
    # 4:  B  c   8
    # 5:  C  c   9
    # 6:  C  d  10