My data frame looks like this:
Input
one<-c("Rainy and sunny;thundering;lightning","dismal and dreary;thundering")
two<-c("Overcast;lightning","Overcast;dismal and dreary")
df2<-data.frame(one,two)
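I want to compare the strings in the two columns row by row and extract the items they share, as well as the items that differ, into new columns.
My expected output is: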
same<-c("lightning","dismal and dreary")
different_Incol1ButNot2<-c("Rainy and sunny;thundering","thundering")
different_Incol2ButNot1<-c("Overcast","Overcast")
df2<-data.frame(one,two,same,different_Incol1ButNot2,different_Incol2ButNot1,stringsAsFactors=F)
which should print:
one                                   two                         same               different_Incol1ButNot2     different_Incol2ButNot1
Rainy and sunny;thundering;lightning  Overcast;lightning          lightning          Rainy and sunny;thundering  Overcast
dismal and dreary;thundering          Overcast;dismal and dreary  dismal and dreary  thundering                  Overcast
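So my first idea was to split each string into a list:
df2$one<-strsplit(as.character(df2$one), ";")
df2$two<-strsplit(as.character(df2$two), ";")
But now I don't know how to compare the lists I have created inside the data frame. So I suppose the question is: how do I do these row-wise comparisons between lists of strings in a data frame, or is there a simpler way to do this?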
Answer 0 (score: 5)
Here is a dplyr approach,
library(dplyr)
df2 %>%
  # split every column on ";" so each cell becomes a character vector
  mutate_all(function(x) strsplit(as.character(x), ';')) %>%
  rowwise() %>%
  mutate(same = toString(intersect(one, two)),
         differs_1 = toString(setdiff(one, two)),
         differs_2 = toString(setdiff(two, one)))
which gives,
Source: local data frame [2 x 5]
Groups: <by row>

# A tibble: 2 x 5
        one       two              same                   differs_1 differs_2
     <list>    <list>             <chr>                       <chr>     <chr>
1 <chr [3]> <chr [2]>         lightning Rainy and sunny, thundering  Overcast
2 <chr [2]> <chr [2]> dismal and dreary                  thundering  Overcast
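To see what the rowwise step computes for a single row, the same set operations can be run by hand in base R. This is only a minimal sketch using the first row's values from the question; the variable names a, b, only_in_a and only_in_b are made up for illustration.

```r
# Split the first row's two cells on ";" (values copied from the question).
a <- strsplit("Rainy and sunny;thundering;lightning", ";", fixed = TRUE)[[1]]
b <- strsplit("Overcast;lightning", ";", fixed = TRUE)[[1]]

same      <- intersect(a, b)  # items present in both cells
only_in_a <- setdiff(a, b)    # items in column one but not in column two
only_in_b <- setdiff(b, a)    # items in column two but not in column one

# toString() joins the pieces with ", " so each result fits in one cell.
toString(only_in_a)
# [1] "Rainy and sunny, thundering"
```

Note that setdiff is not symmetric, which is why the two "differs" columns swap the argument order.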
Answer 1 (score: 3)
First, you should use character columns rather than factors (stringsAsFactors=TRUE was the default before R 4.0.0), i.e.:
one <- c("Rainy and sunny;thundering;lightning","dismal and dreary;thundering")
two <- c("Overcast;lightning","Overcast;dismal and dreary")
df2 <- data.frame(one,two, stringsAsFactors = FALSE)
You can use set operations here, namely intersect and setdiff. You could do this inline, but a function is handy.
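compare_strings <- function(x){
  # x is a single row of the data.frame; split each cell on ";"
  l <- lapply(x, function(s) strsplit(s, ";")[[1]])
  list(one=x$one,
       two=x$two,
       # collapse every result so each component is a single string
       same=paste(intersect(l[[1]], l[[2]]), collapse=";"),
       different_Incol1ButNot2=paste(setdiff(l[[1]], l[[2]]), collapse=";"),
       different_Incol2ButNot1=paste(setdiff(l[[2]], l[[1]]), collapse=";")
  )
}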
Applied to a single row of df2, it returns a named list with the required components.
> compare_strings(df2[1, ])
$one
[1] "Rainy and sunny;thundering;lightning"
$two
[1] "Overcast;lightning"
$same
[1] "lightning"
$different_Incol1ButNot2
[1] "Rainy and sunny;thundering"
$different_Incol2ButNot1
[1] "Overcast"
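If we apply this to each row of the data.frame and rbind the resulting list of lists, we get the table you want (note that rbind on lists returns a matrix rather than a data.frame):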
do.call("rbind", lapply(seq_len(nrow(df2)), function(i) compare_strings(df2[i, ])))
one two
[1,] "Rainy and sunny;thundering;lightning" "Overcast;lightning"
[2,] "dismal and dreary;thundering" "Overcast;dismal and dreary"
same different_Incol1ButNot2 different_Incol2ButNot1
[1,] "lightning" "Rainy and sunny;thundering" "Overcast"
[2,] "dismal and dreary" "thundering" "Overcast"
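Because the rbind result prints as a matrix, here is one possible way to add the three columns to df2 directly so that it stays a data.frame. It is a sketch, not the only way to do it: set_op is a hypothetical helper name introduced here, and it assumes the cells are plain character strings separated by ";".

```r
one <- c("Rainy and sunny;thundering;lightning", "dismal and dreary;thundering")
two <- c("Overcast;lightning", "Overcast;dismal and dreary")
df2 <- data.frame(one, two, stringsAsFactors = FALSE)

# Hypothetical helper: apply a set operation element-wise to two
# ";"-separated character vectors and re-join each result with ";".
set_op <- function(x, y, op) {
  xs <- strsplit(x, ";", fixed = TRUE)
  ys <- strsplit(y, ";", fixed = TRUE)
  mapply(function(a, b) paste(op(a, b), collapse = ";"), xs, ys,
         USE.NAMES = FALSE)
}

df2$same <- set_op(df2$one, df2$two, intersect)
df2$different_Incol1ButNot2 <- set_op(df2$one, df2$two, setdiff)
df2$different_Incol2ButNot1 <- set_op(df2$two, df2$one, setdiff)
```

When a row has nothing in common (or nothing different), the corresponding cell is an empty string, since paste with collapse on an empty vector returns "".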
Does this solve your problem?