这就是我的数据框架的样子.最右边的两列是我想要的列.我计算每行的唯一FundTypes的累计数量.第4列是所有"ActivityType"的累积唯一计数,第5列是仅"ActivityType =="Sale"的累积唯一计数.
dt <- read.table(text=' Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly) John Email a 1 0 John Sale a;b 2 2 John Webinar c;d 4 2 John Sale b 4 2 John Webinar e 5 2 John Conference b;d 5 2 John Sale b;e 5 3 Tom Email a 1 0 Tom Sale a;b 2 2 Tom Webinar c;d 4 2 Tom Sale b 4 2 Tom Webinar e 5 2 Tom Conference b;d 5 2 Tom Sale b;e;f 6 4 ', header=T, row.names = NULL)
我试过dt[, UniqueFunds := cumsum(!duplicated(FundType)& !FundType=="") ,by = Name]
但是例如它将a&a; b&c; d计为3个唯一值而不是所需的4个唯一值,因为因子用分号分隔.请告诉我一个解决方案.
更新:我的真实数据集看起来更像是这样的:
dt <- read.table(text=' Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly) John Email "" 0 0 John Conference "" 0 0 John Email a 1 0 John Sale a;b 2 2 John Webinar c;d 4 2 John Sale b 4 2 John Webinar e 5 2 John Conference b;d 5 2 John Sale b;e 5 3 John Email "" 5 3 John Webinar "" 5 3 Tom Email a 1 0 Tom Sale a;b 2 2 Tom Webinar c;d 4 2 Tom Sale b 4 2 Tom Webinar e 5 2 Tom Conference b;d 5 2 Tom Sale b;e;f 6 4 ', header=T, row.names = NULL)
唯一累积向量需要考虑缺失值.
nrussell建议编写自定义函数的简洁解决方案.让我放弃我得到的东西.我尝试使用cumsum()
和duplicated()
你尝试过.我做了两个主要的操作.一个为alltype
另一个为saleonly
.首先,我为每个名字创建了索引.然后,我FundType
使用cSplit()
splitstackshape包以长格式拆分和格式化数据.然后,我为每个名称的每个索引号选择了最后一行.最后,我只选了一列alltype
.
library(splitstackshape) library(zoo) library(data.table) setDT(dt)[, ind := 1:.N, by = "Name"] cSplit(dt, "FundType", sep = ";", direction = "long")[, alltype := cumsum(!duplicated(FundType)), by = "Name"][, .SD[.N], by = c("Name", "ind")][, list(alltype)] -> alltype
第二次操作是saleonly.基本上,我对待销售的子集数据重复了相同的方法,即ana
.我还创建了一个没有销售的数据集,这是ana2
.然后,我创建了一个包含两个数据集(即l
)的列表并绑定它们.我用Name
和改变了数据集的顺序ind
,取每个名称和索引号的最后一行,处理NAs(填充NA并用0替换每个Name的第一个NA),最后选择一列.最后的操作是原来的结合dt
,alltype
和saleonly
.
# data for sale only cSplit(dt, "FundType", sep = ";", direction = "long")[ ActivityType == "Sale"][, saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana # Data without sale cSplit(dt, "FundType", sep = ";", direction = "long")[ ActivityType != "Sale"] -> ana2 # Combine ana and ana2 l <- list(ana, ana2) rbindlist(l, use.names = TRUE, fill = TRUE) -> temp setorder(temp, Name, ind)[, .SD[.N], by = c("Name", "ind")][, saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][, saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly cbind(dt, alltype, saleonly) Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly 1: John Email a 1 0 1 1 0 2: John Sale a;b 2 2 2 2 2 3: John Webinar c;d 4 2 3 4 2 4: John Sale b 4 2 4 4 2 5: John Webinar e 5 2 5 5 2 6: John Conference b;d 5 2 6 5 2 7: John Sale b;e 5 3 7 5 3 8: Tom Email a 1 0 1 1 0 9: Tom Sale a;b 2 2 2 2 2 10: Tom Webinar c;d 4 2 3 4 2 11: Tom Sale b 4 2 4 4 2 12: Tom Webinar e 5 2 5 5 2 13: Tom Conference b;d 5 2 6 5 2 14: Tom Sale b;e;f 6 4 7 6 4
编辑
对于新数据集,我尝试了以下内容.基本上,我使用我的方法将saleonly数据用于这个新数据集.修订仅在alltype
部分内容中进行.首先,我添加了索引,用NA替换了"",并用具有非NA值的行对数据进行了子集化.这是temp
.其余部分与之前的答案相同.现在我想在FundType中使用NA的数据集,所以我使用了setdiff()
.使用rbindlist()
,我合并了两个数据集并创建了temp
.其余部分与之前的答案相同.销售部分没有任何变化.我希望这对你的真实数据有用.
### all type setDT(dt)[, ind := 1:.N, by = "Name"][, FundType := replace(FundType, which(FundType == ""), NA)][FundType != ""] -> temp cSplit(temp, "FundType", sep = ";", direction = "long")[, alltype := cumsum(!duplicated(FundType)), by = "Name"] -> alltype whatever <- list(setdiff(dt, temp), alltype) rbindlist(whatever, use.names = TRUE, fill = TRUE) -> temp setorder(temp, Name, ind)[,.SD[.N], by = c("Name", "ind")][, alltype := na.locf(alltype, na.rm = FALSE), by = "Name"][, alltype := replace(alltype, is.na(alltype), 0)][, list(alltype)] -> alltype ### sale only cSplit(dt, "FundType", sep = ";", direction = "long")[ ActivityType == "Sale"][, saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana cSplit(dt, "FundType", sep = ";", direction = "long")[ ActivityType != "Sale"] -> ana2 l <- list(ana, ana2) rbindlist(l, use.names = TRUE, fill = TRUE) -> temp setorder(temp, Name, ind)[, .SD[.N], by = c("Name", "ind")][, saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][, saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly cbind(dt, alltype, saleonly) Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly 1: John Email NA 0 0 1 0 0 2: John Conference NA 0 0 2 0 0 3: John Email a 1 0 3 1 0 4: John Sale a;b 2 2 4 2 2 5: John Webinar c;d 4 2 5 4 2 6: John Sale b 4 2 6 4 2 7: John Webinar e 5 2 7 5 2 8: John Conference b;d 5 2 8 5 2 9: John Sale b;e 5 3 9 5 3 10: John Email NA 5 3 10 5 3 11: John Webinar NA 5 3 11 5 3 12: Tom Email a 1 0 1 1 0 13: Tom Sale a;b 2 2 2 2 2 14: Tom Webinar c;d 4 2 3 4 2 15: Tom Sale b 4 2 4 4 2 16: Tom Webinar e 5 2 5 5 2 17: Tom Conference b;d 5 2 6 5 2 18: Tom Sale b;e;f 6 4 7 6 4
我认为这是实现目标的一种方式.首先添加一个辅助索引变量来维护输入顺序; 并key
继续Name
:
Dt <- copy(dt[, 1:3, with = FALSE])[, gIdx := 1:.N, by = "Name"] setkeyv(Dt, "Name")
为清楚起见,我使用了这个功能
n_usplit <- function(x, spl = ";") length(unique(unlist(strsplit(x, split = spl))))
而不是在运行中输入正文的表达式 - 下面的操作足够密集,因为它没有一堆嵌套函数调用卷积的东西.
最后,
Dt[Dt, allow.cartesian = TRUE][ gIdx <= i.gIdx, .("UniqueFunds(AllTypes)" = n_usplit(FundType), "UniqueFunds(SaleOnly)" = n_usplit(FundType[ActivityType == "Sale"])), keyby = "Name,i.gIdx,i.ActivityType,i.FundType"][,-2, with = FALSE] # Name i.ActivityType i.FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly) # 1: John Email a 1 0 # 2: John Sale a;b 2 2 # 3: John Webinar c;d 4 2 # 4: John Sale b 4 2 # 5: John Webinar e 5 2 # 6: John Conference b;d 5 2 # 7: John Sale b;e 5 3 # 8: Tom Email a 1 0 # 9: Tom Sale a;b 2 2 # 10: Tom Webinar c;d 4 2 # 11: Tom Sale b 4 2 # 12: Tom Webinar e 5 2 # 13: Tom Conference b;d 5 2 # 14: Tom Sale b;e;f 6 4
我觉得我可以用SQL解释这个问题,但是我们在这里:
Dt
自己加入(by Name
)
使用额外的索引列(gIdx
),只考虑序列中的前一行(包含) - 这会产生一种累积效应(缺少更好的术语)
计算UniqueFunds(...)
列 - 注意在第二种情况下完成的额外子集 -n_usplit(FundType[ActivityType == "Sale"])
删除无关的索引列(i.gIdx
).
我不确定这会因为使用笛卡尔连接而缩放,所以希望你的真实数据集不是数百万行.
数据:
library(data.table) ## dt <- fread(' Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly) John Email a 1 0 John Sale a;b 2 2 John Webinar c;d 4 2 John Sale b 4 2 John Webinar e 5 2 John Conference b;d 5 2 John Sale b;e 5 3 Tom Email a 1 0 Tom Sale a;b 2 2 Tom Webinar c;d 4 2 Tom Sale b 4 2 Tom Webinar e 5 2 Tom Conference b;d 5 2 Tom Sale b;e;f 6 4 ', header = TRUE)