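I have a large dataset with text in its cells. Some of the text is just an earlier cell with more appended to it, and I don't want to include those cells in my analysis unless the date is different. Here is an example of what it looks like:
10-01-17 | Hi, how are you?
10-01-17 | Hi, how are you? Oh, I'm just fine.
11-01-17 | Hi, how are you? Oh, I'm just fine. The weather is nice today.
If 1 is contained in 2 and the dates are the same, I want to remove 1. If 2 is contained in 3, I want to remove 2 only if the dates are the same. Here the only rows I want to keep are two and three.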
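You can use grepl on the whole column, with each observation as the pattern. If the sum of the resulting Boolean vector is greater than 1, the row matches more than just itself and is a duplicate.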
df[mapply(function(d, t) {
    sum(grepl(t, df$text, fixed = TRUE) & d == df$date) == 1
}, df$date, df$text), ]
##       date                                                           text
## 2 10-01-17                            Hi, how are you? Oh, I'm just fine.
## 3 11-01-17 Hi, how are you? Oh, I'm just fine. The weather is nice today.
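To see why the test is `== 1`, it can help to look at the intermediate logical vectors for a single row. The sketch below rebuilds the sample data with `data.frame()` and introduces the names `in_text`, `same_date`, `matches_row1`, and `matches_row2` purely for illustration; they are not part of the code above.

```r
# Sample data, rebuilt here so the snippet runs on its own
df <- data.frame(
  date = c("10-01-17", "10-01-17", "11-01-17"),
  text = c("Hi, how are you?",
           "Hi, how are you? Oh, I'm just fine.",
           "Hi, how are you? Oh, I'm just fine. The weather is nice today."),
  stringsAsFactors = FALSE
)

# Row 1 as the pattern: which rows contain its text, and which share its date?
in_text   <- grepl(df$text[1], df$text, fixed = TRUE)  # TRUE TRUE TRUE
same_date <- df$date[1] == df$date                     # TRUE TRUE FALSE
matches_row1 <- in_text & same_date                    # TRUE TRUE FALSE
sum(matches_row1)  # 2: row 1 matches itself and row 2, so it is dropped

# Row 2 as the pattern: row 3 contains it, but on a different date
matches_row2 <- grepl(df$text[2], df$text, fixed = TRUE) & df$date[2] == df$date
sum(matches_row2)  # 1: row 2 matches only itself, so it is kept
```

`fixed = TRUE` makes grepl treat the pattern as a literal string, which matters here because the text contains `?` and `.`, both of which are regular expression metacharacters.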
Or in dplyr,
library(dplyr)

df %>%
    rowwise() %>%
    filter(sum(grepl(text, .$text, fixed = TRUE) & date == .$date) == 1)
## Source: local data frame [2 x 2]
## Groups: <by row>
##
## # A tibble: 2 × 2
##       date                                                           text
##      <chr>                                                          <chr>
## 1 10-01-17                            Hi, how are you? Oh, I'm just fine.
## 2 11-01-17 Hi, how are you? Oh, I'm just fine. The weather is nice today.
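Since the question mentions a large dataset, one possible variation is to group by date first, so each text is only compared against other texts from the same date rather than against the whole column. This is a sketch, not something from the answer itself: `is_unique_within` is a hypothetical helper name, and the approach assumes that containment only matters within a date, as the question describes.

```r
library(dplyr)

# Sample data, rebuilt here so the snippet runs on its own
df <- data.frame(
  date = c("10-01-17", "10-01-17", "11-01-17"),
  text = c("Hi, how are you?",
           "Hi, how are you? Oh, I'm just fine.",
           "Hi, how are you? Oh, I'm just fine. The weather is nice today."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each element of `texts`, TRUE when it is contained
# in exactly one element of `texts` (that is, only in itself)
is_unique_within <- function(texts) {
  vapply(
    texts,
    function(t) sum(grepl(t, texts, fixed = TRUE)) == 1,
    logical(1),
    USE.NAMES = FALSE
  )
}

# Inside a grouped filter(), `text` is the vector of texts for one date only
result <- df %>%
  group_by(date) %>%
  filter(is_unique_within(text)) %>%
  ungroup()
```

On the sample data this keeps the same two rows as the versions above. As with those versions, two rows with identical text on the same date match each other, so both would be dropped; if that can occur in your data, deduplicating first (for example with `distinct()`) may be worth considering.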
Data:
df <- structure(list(
    date = c("10-01-17", "10-01-17", "11-01-17"),
    text = c("Hi, how are you?",
             "Hi, how are you? Oh, I'm just fine.",
             "Hi, how are you? Oh, I'm just fine. The weather is nice today.")),
    class = "data.frame", row.names = c(NA, -3L), .Names = c("date", "text"))