Merging multiple text data frames to see which words correspond to each other

Asked: 2014-08-23 23:10:50

Tags: r text merge dataframe

Summary: I have the three data frames shown below. I want the final result to be final_merge_df.

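Details: I have many tables like the ones below. I want to merge all of them into a single table (shown below as "final_merge_df"). Every table has the same format but different data. Each table has two columns. In the first column, each row holds one word. All tables use the same set of first-column words, but each table may contain any number of rows for a given word. Note also that a table may contain zero rows for a particular word. The second column holds a word that corresponds (for whatever reason) to the word in the first column. Each second-column cell holds exactly one word, which may be the same as or different from the word in the first column. The second column of any table may contain words that do not appear in the second column of any other table.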

df1 = data.frame( 
  x1=c("bus","bus","cat","cat"),
  df1=c("bus","driver","mouse","dog"),
  stringsAsFactors = FALSE)

>df1
   x1      df1
1 bus      bus
2 bus   driver
3 cat    mouse
4 cat      dog

df2 = data.frame(
  x1=c("bus","bus","bus","cat","cat"),
  df2=c("car","driver","bus","dog","paw"),
  stringsAsFactors = FALSE)

>df2
   x1      df2
1 bus      car
2 bus   driver
3 bus      bus
4 cat      dog
5 cat      paw

df3 = data.frame(
  x1=c("bus","bus","cat","cat","cat","cat"),
  df3=c("bus","autobus","dog","bed","paw","tree"),
  stringsAsFactors = FALSE)

>df3
  x1         df3
1 bus        bus
2 bus    autobus
3 cat        dog
4 cat        bed
5 cat        paw
6 cat       tree

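I want a single table that is the merge of all the other tables (see below). Its first column again contains the same words as the first column of each original table. Its second column contains the words from the second column of the first table, its third column contains the words from the second table, its fourth column the words from the third table, and so on. In each of columns 2 through N, a word is written if it corresponds to the word in the first column (as in the original table); if it does not, "" is written (shown as <NA> in the example below).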

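For example, look at the first row of the output. All three original tables contain the word "bus" corresponding to the word "bus". Now look at the second row: tables 1 and 2 contain the word "driver" corresponding to the word "bus", whereas table 3 does not contain "driver", so we write "".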

# This is an example of how the result should look for the example tables above
final_merge_df = data.frame(
  x1  = c(rep("bus",4), rep("cat",5)),
  df1 = c("bus","driver","<NA>","<NA>",   "mouse","dog","<NA>","<NA>","<NA>"),
  df2 = c("bus","driver","car", "<NA>",   "<NA>", "dog","paw", "<NA>","<NA>"),
  df3 = c("bus","<NA>",  "<NA>","autobus","<NA>", "dog","paw", "bed", "tree"))

>final_merge_df
  x1       df1      df2        df3
1 bus      bus      bus        bus
2 bus   driver   driver       <NA>
3 bus     <NA>      car       <NA>
4 bus     <NA>     <NA>    autobus
5 cat    mouse     <NA>       <NA>
6 cat      dog      dog        dog
7 cat     <NA>      paw        paw
8 cat     <NA>     <NA>        bed
9 cat     <NA>     <NA>       tree

I have tried many things, including:

df = merge( df1, df2, by.x="df1", by.y="df2", all=T)

>df
      df1  x1.x x1.y
1     bus   bus  bus
2     car  <NA>  bus
3     dog   cat  cat
4  driver   bus  bus
5   mouse   cat <NA>
6     paw  <NA>  cat

Based on the output above, I wrote a short function that converts df into:

   x1       df1      df2
1 bus       bus      bus
4 bus    driver   driver
2 bus      <NA>      car
3 cat       dog      dog
5 cat     mouse     <NA>
6 cat      <NA>      paw

This is exactly what I want, but it only works for two tables. I need a method that can handle more than two tables.
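The short function itself is not shown in the question. As a rough illustration only, a conversion of this kind could look like the sketch below. The helper name `to_wide` is hypothetical, and the logic is an assumption about what such a function might do, not the asker's actual code.

```r
# Example tables from the question
df1 <- data.frame(x1  = c("bus", "bus", "cat", "cat"),
                  df1 = c("bus", "driver", "mouse", "dog"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x1  = c("bus", "bus", "bus", "cat", "cat"),
                  df2 = c("car", "driver", "bus", "dog", "paw"),
                  stringsAsFactors = FALSE)

# Full outer join on the second-column words, as in the attempt above
df <- merge(df1, df2, by.x = "df1", by.y = "df2", all = TRUE)

# Hypothetical helper (not the asker's function): rebuild the
# x1 / df1 / df2 layout from the merged columns df1, x1.x and x1.y
to_wide <- function(df) {
  in_first  <- !is.na(df$x1.x)   # word occurs in the first table
  in_second <- !is.na(df$x1.y)   # word occurs in the second table
  out <- data.frame(
    x1  = ifelse(in_first, df$x1.x, df$x1.y),
    df1 = ifelse(in_first,  df$df1, NA_character_),
    df2 = ifelse(in_second, df$df1, NA_character_),
    stringsAsFactors = FALSE
  )
  out[order(out$x1), ]           # group rows by the first-column word
}

two_table_result <- to_wide(df)
two_table_result
```

Because the join key here is the second-column word alone, this sketch would conflate a word that appears under different first-column words in the two tables. That does not occur in the example data.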

I also tried some frequency-table conversions and created a TermDocumentMatrix (using the tm package), but without success.

I would greatly appreciate any help. Thanks.

2 answers:

Answer 0 (score: 0)

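May I give it a try? If I have got this wrong, please let me know and I will happily withdraw my answer. If I am not mistaken, you want your approach, which works for two data frames, to work for three. I thought about doing something like Reduce(function(x,y) merge(x,y, all = TRUE), list(df1,df2,df3)), but I could not work it out. (I expect some of the experts here will be able to offer something along those lines.) So I decided to carry out the merge in the following way. It is a problem-specific approach and probably not how the experts here would tackle your challenge. At least it gives you a data frame to which you can apply your function and get the result you want.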

library(dplyr)
### I follow your script.
df = merge(df1, df2, by.x="df1", by.y="df2", all=T)
df <- arrange(df, df1)

### I want to repeat the same procedure, but two columns with bus and cat
### won't help. So I drop x1.y in df, which comes from df2.

### Separate the df2 part (x1.y)
foo <- df$x1.y

### Create df1 (new version)
ana <- select(df, df1, x1.x)

### This is merge with the new version of df1 and df3
bob = merge(ana, df3, by.x="df1", by.y="df3", all=T)

### There are three new items (i.e., autobus, bed, and tree).
### They are in df3, but not df2.
### So, I added NA in the positions of the items in df2.

foo2 <- c(NA, NA, foo, NA)

### Now add the df2 part.
cathy <- cbind(bob, foo2)
names(cathy) <- c("whatever", "df1", "df3", "df2")

### Reorder columns
david <- cathy[,c(1,2,4,3)]

#> david
#  whatever  df1  df2  df3
#1  autobus <NA> <NA>  bus
#2      bed <NA> <NA>  cat
#3      bus  bus  bus  bus
#4      car <NA>  bus <NA>
#5      dog  cat  cat  cat
#6   driver  bus  bus <NA>
#7    mouse  cat <NA> <NA>
#8      paw <NA>  cat  cat
#9     tree <NA> <NA>  cat
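The Reduce idea mentioned in this answer can be made to work if every table is first given a shared key. With the default `by`, merge() joins on the only common column, x1, and so pairs every word of one table with every word of the other. The sketch below, offered as one possible approach rather than a definitive one, copies each table's second column into a key column and merges on both columns. The column name `word` is an assumption introduced for illustration, and each (x1, word) pair is assumed to occur at most once per table.

```r
# Example tables from the question
df1 <- data.frame(x1  = c("bus", "bus", "cat", "cat"),
                  df1 = c("bus", "driver", "mouse", "dog"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x1  = c("bus", "bus", "bus", "cat", "cat"),
                  df2 = c("car", "driver", "bus", "dog", "paw"),
                  stringsAsFactors = FALSE)
df3 <- data.frame(x1  = c("bus", "bus", "cat", "cat", "cat", "cat"),
                  df3 = c("bus", "autobus", "dog", "bed", "paw", "tree"),
                  stringsAsFactors = FALSE)

tables <- list(df1, df2, df3)

# Add a shared key column ("word" is a hypothetical name) that copies
# the second column, so every table can be joined on (x1, word)
keyed <- lapply(tables, function(d) {
  d$word <- d[[2]]
  d
})

# Full outer join across all tables, two at a time
merged <- Reduce(function(x, y) merge(x, y, by = c("x1", "word"), all = TRUE),
                 keyed)

# The helper key is no longer needed once the merge is done
merged$word <- NULL
merged
```

merge() sorts by the key columns, so the rows come out in alphabetical order of (x1, word) rather than in the order shown in final_merge_df, and missing entries are real NA values rather than the string "<NA>".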

Answer 1 (score: 0)

I am short on time, so this is not a very elegant solution, but it works.

df1 = data.frame( 
  x1=c("bus","bus","cat","cat"),
  df1=c("bus","driver","mouse","dog"),
  stringsAsFactors = FALSE)
df2 = data.frame(
  x1=c("bus","bus","bus","cat","cat"),
  df2=c("car","driver","bus","dog","paw"),
  stringsAsFactors = FALSE)
df3 = data.frame(
  x1=c("bus","bus","cat","cat","cat","cat"),
  df3=c("bus","autobus","dog","bed","paw","tree"),
  stringsAsFactors = FALSE)

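# Split the second column of a table by the words in its first column
s <- function(df) {
  split(df[,2], df[,1])
}

# One list per table: first-column word -> corresponding words
l <- lapply(list(df1, df2, df3), s)

# All first-column words that occur in any table
n <- unique(unlist(lapply(l, names)))

# For each first-column word, list every corresponding word once and
# flag (TRUE/FALSE) which tables contain it
m <- do.call(rbind, lapply(n, function(i) {
  tmp <- lapply(l, "[[", i)
  u <- unique(unlist(tmp))
  # do.call(cbind, ...) keeps one row per word even when length(u) == 1,
  # where sapply() would return a plain vector instead of a matrix
  cbind(rep(i, length(u)), u, do.call(cbind, lapply(tmp, function(x) u %in% x)))
}))

m

# Replace each TRUE flag with the word itself and each FALSE flag with NA
m2 <- t(apply(m, 1, function(i) ifelse(i[3:length(i)], i[2], NA)))

as.data.frame(cbind(m[,1], m2))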
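The same idea can be wrapped in a function that accepts any number of tables. The sketch below is a minimal variant, not the answerer's code: the name `merge_word_tables` is hypothetical, output column names are taken from each table's second column name, and the words are assumed not to contain tab characters, since a tab is used to build the lookup key.

```r
# Example tables from the question
df1 <- data.frame(x1  = c("bus", "bus", "cat", "cat"),
                  df1 = c("bus", "driver", "mouse", "dog"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x1  = c("bus", "bus", "bus", "cat", "cat"),
                  df2 = c("car", "driver", "bus", "dog", "paw"),
                  stringsAsFactors = FALSE)
df3 <- data.frame(x1  = c("bus", "bus", "cat", "cat", "cat", "cat"),
                  df3 = c("bus", "autobus", "dog", "bed", "paw", "tree"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: one output column per input table
merge_word_tables <- function(tables) {
  # Output column names come from each table's second column name
  col_names <- vapply(tables, function(d) names(d)[2], character(1))

  # Every distinct (x1, word) pair, in order of first appearance
  pairs <- unique(do.call(rbind, lapply(tables, function(d) {
    data.frame(x1 = d[[1]], word = d[[2]], stringsAsFactors = FALSE)
  })))

  # Group rows by first-column word, keeping first-appearance order
  pairs <- pairs[order(match(pairs$x1, unique(pairs$x1))), ]
  pair_id <- paste(pairs$x1, pairs$word, sep = "\t")

  out <- data.frame(x1 = pairs$x1, stringsAsFactors = FALSE)
  for (k in seq_along(tables)) {
    d <- tables[[k]]
    # Write the word where table k contains this pair, NA otherwise
    present <- pair_id %in% paste(d[[1]], d[[2]], sep = "\t")
    out[[col_names[k]]] <- ifelse(present, pairs$word, NA_character_)
  }
  out
}

result <- merge_word_tables(list(df1, df2, df3))
result
```

Unlike the matrix-based version, this keeps every column as character and returns real NA values, so the result can be filtered with is.na() directly.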