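I'm trying to gather data for two different variables, each spread across several columns, grouped by two other variables. Here's the problem. I have several genes and several samples. Each sample has three different possible genotypes, and each genotype has an associated frequency. I want to tidy this into single columns for gene, sample, genotype, and frequency.
I have a hackjob solution that involves creating list columns, gathering them, and then using purrr::map functions to extract the columns. It's ugly, it doesn't really scale, and the frequencies get converted to character before being converted back to numeric, which isn't ideal.
Is there a better way to go about this?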
library(tidyverse)
# or, separately load dplyr, tibble, tidyr, purrr
# Here's what I have
# tibble() replaces the deprecated data_frame()
have <- tibble(gene=rep(c("gX", "gY"), each=2),
               sample=rep(c("s1", "s2"), 2),
               genotype1=c("AA", "AA", "GG", "GG"),
               genotype2=c("AC", "AC", "GT", "GT"),
               genotype3=c("CC", "CC", "TT", "TT"),
               freq1=c(.8, .9, .7, .6),
               freq2=c(.15, .1, .2, .35),
               freq3=c(.05, 0, .1, .05))
have
#> # A tibble: 4 × 8
#> gene sample genotype1 genotype2 genotype3 freq1 freq2 freq3
#>
#> 1 gX s1 AA AC CC 0.8 0.15 0.05
#> 2 gX s2 AA AC CC 0.9 0.10 0.00
#> 3 gY s1 GG GT TT 0.7 0.20 0.10
#> 4 gY s2 GG GT TT 0.6 0.35 0.05
# Here's what I want.
# Do a multicolumn gather grouped by gene and sample
want <- have %>%
  group_by(gene, sample) %>%
  summarize(x1=list(c(genotype=genotype1, freq=freq1)),
            x2=list(c(genotype=genotype2, freq=freq2)),
            x3=list(c(genotype=genotype3, freq=freq3))) %>%
  ungroup() %>%
  gather(key, value, x1, x2, x3) %>%
  mutate(genotype=map_chr(value, "genotype"),
         freq=map_chr(value, "freq") %>% as.numeric()) %>%
  select(-key, -value) %>%
  arrange(gene, sample, genotype)
want
#> # A tibble: 12 × 4
#> gene sample genotype freq
#>
#> 1 gX s1 AA 0.80
#> 2 gX s1 AC 0.15
#> 3 gX s1 CC 0.05
#> 4 gX s2 AA 0.90
#> 5 gX s2 AC 0.10
#> 6 gX s2 CC 0.00
#> 7 gY s1 GG 0.70
#> 8 gY s1 GT 0.20
#> 9 gY s1 TT 0.10
#> 10 gY s2 GG 0.60
#> 11 gY s2 GT 0.35
#> 12 gY s2 TT 0.05
Daniel.. 6
You can use to_long() from the sjmisc package, which gathers multiple columns at once:
to_long(have, keys = "genos", values = c("genotype", "freq"),
        c("genotype1", "genotype2", "genotype3"),
        c("freq1", "freq2", "freq3"))
## A tibble: 12 × 5
##     gene sample     genos genotype  freq
##
## 1     gX     s1 genotype1       AA  0.80
## 2     gX     s2 genotype1       AA  0.90
## 3     gY     s1 genotype1       GG  0.70
## 4     gY     s2 genotype1       GG  0.60
## 5     gX     s1 genotype2       AC  0.15
## 6     gX     s2 genotype2       AC  0.10
## 7     gY     s1 genotype2       GT  0.20
## 8     gY     s2 genotype2       GT  0.35
## 9     gX     s1 genotype3       CC  0.05
## 10    gX     s2 genotype3       CC  0.00
## 11    gY     s1 genotype3       TT  0.10
## 12    gY     s2 genotype3       TT  0.05
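to_long() takes the name of the key column and the names of the value columns, followed by one vector of column names for each set of columns that should be gathered.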
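For comparison, here is a minimal sketch of the same reshape using tidyr's pivot_longer() with the special ".value" name. This is not part of the answer above; it assumes tidyr 1.0.0 or later, and the helper column name "set" and the regular expression are illustrative choices. The pattern splits each column name into a stem (genotype or freq), which becomes an output column, and a trailing number, which identifies the pair.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
have <- tibble(gene = rep(c("gX", "gY"), each = 2),
               sample = rep(c("s1", "s2"), 2),
               genotype1 = c("AA", "AA", "GG", "GG"),
               genotype2 = c("AC", "AC", "GT", "GT"),
               genotype3 = c("CC", "CC", "TT", "TT"),
               freq1 = c(.8, .9, .7, .6),
               freq2 = c(.15, .1, .2, .35),
               freq3 = c(.05, 0, .1, .05))

long <- have %>%
  pivot_longer(
    cols = -c(gene, sample),
    # ".value" turns the stem (genotype / freq) into output columns;
    # "set" (a hypothetical name) holds the trailing number
    names_to = c(".value", "set"),
    names_pattern = "([a-z]+)([0-9]+)"
  ) %>%
  select(-set) %>%                  # drop the helper column
  arrange(gene, sample, genotype)   # same ordering as `want`

long
```

Because genotype and freq are never combined into a single vector, freq stays numeric throughout and needs no round trip through character.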