当前位置:  开发笔记 > 数据库 > 正文

计算由分号按名称分组的累积唯一因子

如何解决《计算由分号按名称分组的累积唯一因子》经验,为你挑选了2个好方法。

这就是我的数据框架的样子.最右边的两列是我想要的列.我计算每行的唯一FundTypes的累计数量.第4列是所有"ActivityType"的累积唯一计数,第5列是仅"ActivityType =="Sale"的累积唯一计数.

dt <- read.table(text='

Name      ActivityType     FundType  UniqueFunds(AllTypes) UniqueFunds(SaleOnly)         

John       Email               a            1                     0
John       Sale                a;b          2                     2 
John       Webinar             c;d          4                     2
John       Sale                b            4                     2
John       Webinar             e            5                     2
John       Conference          b;d          5                     2
John       Sale                b;e          5                     3
Tom        Email               a            1                     0
Tom        Sale                a;b          2                     2 
Tom        Webinar             c;d          4                     2
Tom        Sale                b            4                     2
Tom        Webinar             e            5                     2
Tom        Conference          b;d          5                     2
Tom        Sale                b;e;f        6                     4                    

                         ', header=T, row.names = NULL)

我试过dt[, UniqueFunds := cumsum(!duplicated(FundType)& !FundType=="") ,by = Name]但是例如它将a&a; b&c; d计为3个唯一值而不是所需的4个唯一值,因为因子用分号分隔.请告诉我一个解决方案.

更新:我的真实数据集看起来更像是这样的:

dt <- read.table(text='

    Name      ActivityType     FundType  UniqueFunds(AllTypes) UniqueFunds(SaleOnly)         
    John       Email               ""           0                     0
    John       Conference          ""           0                     0
    John       Email               a            1                     0
    John       Sale                a;b          2                     2 
    John       Webinar             c;d          4                     2
    John       Sale                b            4                     2
    John       Webinar             e            5                     2
    John       Conference          b;d          5                     2
    John       Sale                b;e          5                     3
    John       Email               ""           5                     3
    John       Webinar             ""           5                     3
    Tom        Email               a            1                     0
    Tom        Sale                a;b          2                     2 
    Tom        Webinar             c;d          4                     2
    Tom        Sale                b            4                     2
    Tom        Webinar             e            5                     2
    Tom        Conference          b;d          5                     2
    Tom        Sale                b;e;f        6                     4                    

                             ', header=T, row.names = NULL)

唯一累积向量需要考虑缺失值.



1> jazzurro..:

nrussell建议编写自定义函数的简洁解决方案.让我放弃我得到的东西.我尝试使用cumsum()duplicated()你尝试过.我做了两个主要的操作.一个为alltype另一个为saleonly.首先,我为每个名字创建了索引.然后,我FundType使用cSplit()splitstackshape包以长格式拆分和格式化数据.然后,我为每个名称的每个索引号选择了最后一行.最后,我只选了一列alltype.

library(splitstackshape)
library(zoo)
library(data.table)

setDT(dt)[, ind := 1:.N, by = "Name"]
cSplit(dt, "FundType", sep = ";", direction = "long")[,
    alltype := cumsum(!duplicated(FundType)), by = "Name"][,
    .SD[.N], by = c("Name", "ind")][, list(alltype)] -> alltype

第二次操作是saleonly.基本上,我对待销售的子集数据重复了相同的方法,即ana.我还创建了一个没有销售的数据集,这是ana2.然后,我创建了一个包含两个数据集(即l)的列表并绑定它们.我用Name和改变了数据集的顺序ind,取每个名称和索引号的最后一行,处理NAs(填充NA并用0替换每个Name的第一个NA),最后选择一列.最后的操作是原来的结合dt,alltypesaleonly.

# data for sale only
cSplit(dt, "FundType", sep = ";", direction = "long")[
    ActivityType == "Sale"][,
    saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana

# Data without sale
cSplit(dt, "FundType", sep = ";", direction = "long")[
    ActivityType != "Sale"] -> ana2 

# Combine ana and ana2
l <- list(ana, ana2)
rbindlist(l, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,
    .SD[.N], by = c("Name", "ind")][,
    saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][,
    saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly

cbind(dt, alltype, saleonly)

    Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly
 1: John        Email        a                     1                     0   1       1        0
 2: John         Sale      a;b                     2                     2   2       2        2
 3: John      Webinar      c;d                     4                     2   3       4        2
 4: John         Sale        b                     4                     2   4       4        2
 5: John      Webinar        e                     5                     2   5       5        2
 6: John   Conference      b;d                     5                     2   6       5        2
 7: John         Sale      b;e                     5                     3   7       5        3
 8:  Tom        Email        a                     1                     0   1       1        0
 9:  Tom         Sale      a;b                     2                     2   2       2        2
10:  Tom      Webinar      c;d                     4                     2   3       4        2
11:  Tom         Sale        b                     4                     2   4       4        2
12:  Tom      Webinar        e                     5                     2   5       5        2
13:  Tom   Conference      b;d                     5                     2   6       5        2
14:  Tom         Sale    b;e;f                     6                     4   7       6        4

编辑

对于新数据集,我尝试了以下内容.基本上,我使用我的方法将saleonly数据用于这个新数据集.修订仅在alltype部分内容中进行.首先,我添加了索引,用NA替换了"",并用具有非NA值的行对数据进行了子集化.这是temp.其余部分与之前的答案相同.现在我想在FundType中使用NA的数据集,所以我使用了setdiff().使用rbindlist(),我合并了两个数据集并创建了temp.其余部分与之前的答案相同.销售部分没有任何变化.我希望这对你的真实数据有用.

### all type

setDT(dt)[, ind := 1:.N, by = "Name"][,
    FundType := replace(FundType, which(FundType == ""), NA)][FundType != ""] -> temp
cSplit(temp, "FundType", sep = ";", direction = "long")[,
    alltype := cumsum(!duplicated(FundType)), by = "Name"] -> alltype


whatever <- list(setdiff(dt, temp), alltype)
rbindlist(whatever, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,.SD[.N], by = c("Name", "ind")][,
    alltype := na.locf(alltype, na.rm = FALSE), by = "Name"][,
    alltype := replace(alltype, is.na(alltype), 0)][, list(alltype)] -> alltype


### sale only
cSplit(dt, "FundType", sep = ";", direction = "long")[
    ActivityType == "Sale"][,
    saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana

cSplit(dt, "FundType", sep = ";", direction = "long")[
    ActivityType != "Sale"] -> ana2

l <- list(ana, ana2)
rbindlist(l, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,
    .SD[.N], by = c("Name", "ind")][,
    saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][,
    saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly

cbind(dt, alltype, saleonly)


    Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly
 1: John        Email       NA                     0                     0   1       0        0
 2: John   Conference       NA                     0                     0   2       0        0
 3: John        Email        a                     1                     0   3       1        0
 4: John         Sale      a;b                     2                     2   4       2        2
 5: John      Webinar      c;d                     4                     2   5       4        2
 6: John         Sale        b                     4                     2   6       4        2
 7: John      Webinar        e                     5                     2   7       5        2
 8: John   Conference      b;d                     5                     2   8       5        2
 9: John         Sale      b;e                     5                     3   9       5        3
10: John        Email       NA                     5                     3  10       5        3
11: John      Webinar       NA                     5                     3  11       5        3
12:  Tom        Email        a                     1                     0   1       1        0
13:  Tom         Sale      a;b                     2                     2   2       2        2
14:  Tom      Webinar      c;d                     4                     2   3       4        2
15:  Tom         Sale        b                     4                     2   4       4        2
16:  Tom      Webinar        e                     5                     2   5       5        2
17:  Tom   Conference      b;d                     5                     2   6       5        2
18:  Tom         Sale    b;e;f                     6                     4   7       6        4



2> nrussell..:

我认为这是实现目标的一种方式.首先添加一个辅助索引变量来维护输入顺序; 并key继续Name:

Dt <- copy(dt[, 1:3, with = FALSE])[, gIdx := 1:.N, by = "Name"]
setkeyv(Dt, "Name") 

为清楚起见,我使用了这个功能

n_usplit <- function(x, spl = ";") length(unique(unlist(strsplit(x, split = spl)))) 

而不是在运行中输入正文的表达式 - 下面的操作足够密集,因为它没有一堆嵌套函数调用卷积的东西.

最后,

Dt[Dt, allow.cartesian = TRUE][
  gIdx <= i.gIdx, 
  .("UniqueFunds(AllTypes)" = n_usplit(FundType),
    "UniqueFunds(SaleOnly)" = n_usplit(FundType[ActivityType == "Sale"])),
  keyby = "Name,i.gIdx,i.ActivityType,i.FundType"][,-2, with = FALSE]
#      Name i.ActivityType i.FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
# 1:   John          Email          a                     1                     0
# 2:   John           Sale        a;b                     2                     2
# 3:   John        Webinar        c;d                     4                     2
# 4:   John           Sale          b                     4                     2
# 5:   John        Webinar          e                     5                     2
# 6:   John     Conference        b;d                     5                     2
# 7:   John           Sale        b;e                     5                     3
# 8:    Tom          Email          a                     1                     0
# 9:    Tom           Sale        a;b                     2                     2
# 10:   Tom        Webinar        c;d                     4                     2
# 11:   Tom           Sale          b                     4                     2
# 12:   Tom        Webinar          e                     5                     2
# 13:   Tom     Conference        b;d                     5                     2
# 14:   Tom           Sale      b;e;f                     6                     4

我觉得我可以用SQL解释这个问题,但是我们在这里:

    Dt自己加入(by Name)

    使用额外的索引列(gIdx),只考虑序列中的前一行(包含) - 这会产生一种累积效应(缺少更好的术语)

    计算UniqueFunds(...)列 - 注意在第二种情况下完成的额外子集 -n_usplit(FundType[ActivityType == "Sale"])

    删除无关的索引列(i.gIdx).


我不确定这会因为使用笛卡尔连接而缩放,所以希望你的真实数据集不是数百万行.


数据:

library(data.table)
##
dt <- fread('
Name      ActivityType     FundType  UniqueFunds(AllTypes) UniqueFunds(SaleOnly)         
John       Email               a            1                     0
John       Sale                a;b          2                     2 
John       Webinar             c;d          4                     2
John       Sale                b            4                     2
John       Webinar             e            5                     2
John       Conference          b;d          5                     2
John       Sale                b;e          5                     3
Tom        Email               a            1                     0
Tom        Sale                a;b          2                     2 
Tom        Webinar             c;d          4                     2
Tom        Sale                b            4                     2
Tom        Webinar             e            5                     2
Tom        Conference          b;d          5                     2
Tom        Sale                b;e;f        6                     4                     
            ', header = TRUE)

推荐阅读
重庆制造漫画社
这个屌丝很懒,什么也没留下!
DevBox开发工具箱 | 专业的在线开发工具网站    京公网安备 11010802040832号  |  京ICP备19059560号-6
Copyright © 1998 - 2020 DevBox.CN. All Rights Reserved devBox.cn 开发工具箱 版权所有