Question

My data frame looks like this:

Input

one<-c("Rainy and sunny;thundering;lightning","dismal and dreary;thundering")
two<-c("Overcast;lightning","Overcast;dismal and dreary")
df2<-data.frame(one,two)

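I want to compare the strings in the two columns row by row and extract, into new columns, the pieces they have in common as well as the pieces that differ.

The output I expect is: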

same<-c("lightning","dismal and dreary")
different_Incol1ButNot2<-c("Rainy and sunny;thundering","thundering")
different_Incol2ButNot1<-c("Overcast","Overcast")

df2<-data.frame(one,two,same,different_Incol1ButNot2,different_Incol2ButNot1,stringsAsFactors=F)

which should print:

    one                                  two                        same               different_Incol1ButNot2  different_Incol2ButNot1
 Rainy and sunny;thundering;lightning   Overcast;lightning          lightning          Rainy and sunny;thundering      Overcast
 dismal and dreary;thundering           Overcast;dismal and dreary  dismal and dreary  thundering                      Overcast

My first idea was to split each string into a list of pieces:

df2$one <- strsplit(as.character(df2$one), ";")
df2$two <- strsplit(as.character(df2$two), ";")

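But now I don't know how to compare the lists I created inside the data frame. So I guess the question is: how do I do these row-wise comparisons between lists of strings in a data frame, or is there a simpler way to do this?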

Answer 1

Here is an idea using dplyr:

library(dplyr)

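# note: funs() is deprecated since dplyr 0.8.0
df2 %>% 
 mutate_all(funs(strsplit(as.character(.), ';'))) %>%   # split every column on ';'
 rowwise() %>%                                          # compare one row at a time
 mutate(same = toString(intersect(one, two)), 
        differs_1 = toString(setdiff(one, two)), 
        differs_2 = toString(setdiff(two, one)))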

which gives:

Source: local data frame [2 x 5]
Groups: <by row>

# A tibble: 2 x 5
        one       two              same                   differs_1 differs_2
     <list>    <list>             <chr>                       <chr>     <chr>
1 <chr [3]> <chr [2]>         lightning Rainy and sunny, thundering  Overcast
2 <chr [2]> <chr [2]> dismal and dreary                  thundering  Overcast
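On dplyr 1.0.0 or later, where `across()` is available, the same idea could be sketched roughly as follows. This is only a sketch under that version assumption; `res` is an illustrative name, and `paste(..., collapse = ";")` is swapped in for `toString()` so the pieces are joined with ";" as in the expected output.

```r
library(dplyr)

one <- c("Rainy and sunny;thundering;lightning", "dismal and dreary;thundering")
two <- c("Overcast;lightning", "Overcast;dismal and dreary")
df2 <- data.frame(one, two, stringsAsFactors = FALSE)

res <- df2 %>%
  # turn every column into a list column of split pieces
  mutate(across(everything(), ~ strsplit(as.character(.x), ";", fixed = TRUE))) %>%
  # inside rowwise(), `one` and `two` refer to the character vector of the current row
  rowwise() %>%
  mutate(same      = paste(intersect(one, two), collapse = ";"),
         differs_1 = paste(setdiff(one, two), collapse = ";"),
         differs_2 = paste(setdiff(two, one), collapse = ";")) %>%
  ungroup()

res
```

The `one` and `two` columns stay list columns here; only the three new columns are plain character vectors.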

Answer 2

First, you should use character columns rather than factors (stringsAsFactors = TRUE is the default in R versions before 4.0.0), i.e.:

one <- c("Rainy and sunny;thundering;lightning","dismal and dreary;thundering")
two <- c("Overcast;lightning","Overcast;dismal and dreary")
df2 <- data.frame(one,two, stringsAsFactors = FALSE)

You can use set operations here, namely intersect and setdiff. You could do this inline, but wrapping it in a function is convenient.
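For a sense of what these two set operations return, here is a minimal sketch on the first row's values. The names `a`, `b`, `common`, `only_a` and `only_b` are only illustrative and do not appear in the question.

```r
# The first row's two strings, split on ";" into character vectors
a <- strsplit("Rainy and sunny;thundering;lightning", ";", fixed = TRUE)[[1]]
b <- strsplit("Overcast;lightning", ";", fixed = TRUE)[[1]]

common <- intersect(a, b)  # pieces present in both vectors
only_a <- setdiff(a, b)    # pieces of a that are not in b
only_b <- setdiff(b, a)    # pieces of b that are not in a

# Re-join with ";" so a result with several pieces becomes one string again
paste(only_a, collapse = ";")
```

The function below applies the same steps to one row of the data frame.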

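compare_strings <- function(x){
  # x is a single row of the data frame; split each column on ";"
  l <- sapply(x, strsplit, ";")
  list(one=x$one,
       two=x$two,
       # collapse so every component is one string, even with several matches
       same=paste(intersect(l[[1]], l[[2]]), collapse=";"),
       different_Incol1ButNot2=paste(setdiff(l[[1]], l[[2]]), collapse=";"),
       different_Incol2ButNot1=paste(setdiff(l[[2]], l[[1]]), collapse=";")
  )
}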

Applied to a single row of df2, it returns a named list with the required components.

> compare_strings(df2[1, ])
$one
[1] "Rainy and sunny;thundering;lightning"

$two
[1] "Overcast;lightning"

$same
[1] "lightning"

$different_Incol1ButNot2
[1] "Rainy and sunny;thundering"

$different_Incol2ButNot1
[1] "Overcast"

If we apply this to every row of the data.frame and rbind the resulting list of lists, we get the final table you are after (returned as a matrix):

do.call("rbind", lapply(seq_len(nrow(df2)), function(i) compare_strings(df2[i, ])))
     one                                    two                         
[1,] "Rainy and sunny;thundering;lightning" "Overcast;lightning"        
[2,] "dismal and dreary;thundering"         "Overcast;dismal and dreary"
     same                different_Incol1ButNot2      different_Incol2ButNot1
[1,] "lightning"         "Rainy and sunny;thundering" "Overcast"             
[2,] "dismal and dreary" "thundering"                 "Overcast"
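If a real data.frame with plain character columns is preferred over a matrix, one possible base R variant is to split each column once and apply the set operations pairwise with mapply. This is only a sketch; `join_setop`, `s1` and `s2` are hypothetical names introduced for illustration.

```r
one <- c("Rainy and sunny;thundering;lightning", "dismal and dreary;thundering")
two <- c("Overcast;lightning", "Overcast;dismal and dreary")
df2 <- data.frame(one, two, stringsAsFactors = FALSE)

# Split each column once; every list element is a character vector of pieces
s1 <- strsplit(df2$one, ";", fixed = TRUE)
s2 <- strsplit(df2$two, ";", fixed = TRUE)

# Hypothetical helper: apply set operation f to each pair of elements
# and join the result back into a single ";"-separated string
join_setop <- function(f, x, y) {
  mapply(function(a, b) paste(f(a, b), collapse = ";"), x, y, USE.NAMES = FALSE)
}

df2$same                    <- join_setop(intersect, s1, s2)
df2$different_Incol1ButNot2 <- join_setop(setdiff, s1, s2)
df2$different_Incol2ButNot1 <- join_setop(setdiff, s2, s1)

df2
```

Rows with nothing in common end up with an empty string in `same`, since paste() on a zero-length vector with `collapse` returns "".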

Does this solve your problem?
