I have a data frame that looks like this:

    V1                      V2
    peanut butter sandwich  2 slices of bread 1 tablespoon peanut butter

My goal is:

    V1                      V2
    peanut butter sandwich  2 slices of bread
    peanut butter sandwich  1 tablespoon peanut butter
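I tried splitting the string with strsplit(df$v2, " "), but that only lets me split on " ", which breaks the string at every space. I'm not sure whether it's possible to split the string at the first number only and then take the characters up to the next number.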
You can split the string as follows:

    txt <- "2 slices of bread 1 tablespoon peanut butter"
    strsplit(txt, " (?=\\d)", perl=TRUE)[[1]]
    #[1] "2 slices of bread"          "1 tablespoon peanut butter"

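The regular expression used here looks for a space that is followed by a digit. It uses a zero-width positive lookahead, (?=), to say that if the space is followed by a digit (\\d), then it is the kind of space we want to split on. Why a zero-width lookahead? Because we don't want the digit to be consumed as part of the split delimiter; we only want to match any space that is followed by a digit.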
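
To make the role of the lookahead concrete, here is a small sketch that contrasts it with a pattern that matches the digit directly. The variable names and the "10 grams butter" input are made up for illustration and are not from the question.

```r
# Illustrative input (hypothetical, chosen to include a multi-digit number).
txt <- "2 slices of bread 10 grams butter"

# With the lookahead, only the space is consumed, so each piece keeps its number.
keep_digit <- strsplit(txt, " (?=\\d)", perl = TRUE)[[1]]
print(keep_digit)
# [1] "2 slices of bread" "10 grams butter"

# Without the lookahead, the first digit is part of the delimiter and is lost.
lose_digit <- strsplit(txt, " \\d", perl = TRUE)[[1]]
print(lose_digit)
# [1] "2 slices of bread" "0 grams butter"
```

Note that the pattern splits before every space-preceded digit, so a quantity written as "1 1/2 cups" would presumably be split in two; the data in the question does not contain such cases.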
To use that idea and build a data frame, see the following example:

    item <- c("peanut butter sandwich", "onion carrot mix", "hash browns")
    txt <- c("2 slices of bread 1 tablespoon peanut butter", "1 onion 3 carrots", "potato")
    df <- data.frame(item, txt, stringsAsFactors=FALSE)

    # thanks to Ananda for recommending setNames
    split.strings <- setNames(strsplit(df$txt, " (?=\\d)", perl=TRUE), df$item)
    # alternately:
    #split.strings <- strsplit(df$txt, " (?=\\d)", perl=TRUE)
    #names(split.strings) <- df$item

    stack(split.strings)
    #                      values                    ind
    #1          2 slices of bread peanut butter sandwich
    #2 1 tablespoon peanut butter peanut butter sandwich
    #3                    1 onion       onion carrot mix
    #4                  3 carrots       onion carrot mix
    #5                     potato            hash browns

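
If you want the result in the V1/V2 column order from the question rather than the values/ind layout that stack produces, one possible base R variation is to repeat each item once per piece. This is only a sketch; the names pieces and long are illustrative, and lengths() requires R 3.2.0 or later.

```r
item <- c("peanut butter sandwich", "onion carrot mix", "hash browns")
txt <- c("2 slices of bread 1 tablespoon peanut butter", "1 onion 3 carrots", "potato")

# One character vector of pieces per row of the input.
pieces <- strsplit(txt, " (?=\\d)", perl = TRUE)

# Repeat each item as many times as it has pieces, then flatten the pieces.
long <- data.frame(
  V1 = rep(item, lengths(pieces)),
  V2 = unlist(pieces),
  stringsAsFactors = FALSE
)
print(long)
```

Rows with nothing to split, such as "potato", pass through as a single row.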
Let's imagine that what you're dealing with looks like this:

    mydf <- data.frame(
      V1 = c("peanut butter sandwich", "peanut butter and jam sandwich"),
      V2 = c("2 slices of bread 1 tablespoon peanut butter",
             "2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam"))
    mydf
    ##                               V1
    ## 1         peanut butter sandwich
    ## 2 peanut butter and jam sandwich
    ##                                                              V2
    ## 1                  2 slices of bread 1 tablespoon peanut butter
    ## 2 2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam

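You can first insert into "V2" a delimiter that you don't expect to occur in the data, and then use cSplit from my "splitstackshape" package to get a "long" dataset.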

    library(splitstackshape)
    mydf$V2 <- gsub(" (\\d+)", "|\\1", mydf$V2)
    cSplit(mydf, "V2", "|", "long")
    ##                                V1                         V2
    ## 1:         peanut butter sandwich          2 slices of bread
    ## 2:         peanut butter sandwich 1 tablespoon peanut butter
    ## 3: peanut butter and jam sandwich          2 slices of bread
    ## 4: peanut butter and jam sandwich 1 tablespoon peanut butter
    ## 5: peanut butter and jam sandwich          1 tablespoon jam

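
Because this approach relies on the delimiter not already appearing in the data, it may be worth checking that first. The following base R sketch does so and shows what the marked strings look like before any splitting; the names v2, delim, marked and pieces are illustrative only.

```r
v2 <- c("2 slices of bread 1 tablespoon peanut butter",
        "2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam")
delim <- "|"

# Stop early if the chosen delimiter already occurs somewhere in the data.
stopifnot(!any(grepl(delim, v2, fixed = TRUE)))

# Replace each space that precedes a number with the delimiter, keeping the number.
marked <- gsub(" (\\d+)", paste0(delim, "\\1"), v2)
print(marked)

# Splitting on the literal delimiter (fixed = TRUE, since "|" is a regex metacharacter).
pieces <- strsplit(marked, delim, fixed = TRUE)
print(lengths(pieces))
```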
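The following aren't enough to post as an answer on their own, since they are variations on @Jota's approach, but I'm sharing them here for completeness:
Use strsplit inside "data.table"; the list produced by the split is flattened into a single column....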

    library(data.table)
    as.data.table(mydf)[, list(
      V2 = unlist(strsplit(as.character(V2), '\\s(?=\\d)', perl=TRUE))),
      by = V1]

You can use unnest from "tidyr" to expand a list column into long format....

    library(dplyr)
    library(tidyr)
    mydf %>%
      mutate(V2 = strsplit(as.character(V2), " (?=\\d)", perl=TRUE)) %>%
      unnest(V2)