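Summary: I have the three data frames shown below. I want the final result to be final_merge_df.
Details: I have many tables like the ones below, and I want to merge them all into a single table (shown below as "final_merge_df"). Every table has the same format but different data. Each table has two columns. In the first column, each row holds one word. All tables share the same set of first-column words, but each table may have any number of rows for each word. Note also that a table may have zero rows for a particular word. The second column holds a word that corresponds to the first-column word (for whatever reason). Each second-column entry is a single word, and it may or may not be the same as the first-column word. A table's second column may contain words that do not appear in the second column of any other table.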
df1 = data.frame(
x1=c("bus","bus","cat","cat"),
df1=c("bus","driver","mouse","dog"),
stringsAsFactors = FALSE)
>df1
x1 df1
1 bus bus
2 bus driver
3 cat mouse
4 cat dog
df2 = data.frame(
x1=c("bus","bus","bus","cat","cat"),
df2=c("car","driver","bus","dog","paw"),
stringsAsFactors = FALSE)
>df2
x1 df2
1 bus car
2 bus driver
3 bus bus
4 cat dog
5 cat paw
df3 = data.frame(
x1=c("bus","bus","cat","cat","cat","cat"),
df3=c("bus","autobus","dog","bed","paw","tree"),
stringsAsFactors = FALSE)
>df3
x1 df3
1 bus bus
2 bus autobus
3 cat dog
4 cat bed
5 cat paw
6 cat tree
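I want one table that is the merge of all the others (see below). Its first column again contains the same words as the first column of each original table. Its second column contains the words from the second column of the first table, its third column the words from the second table, its fourth column the words from the third table, and so on. In each of columns 2 through N, if a word in the corresponding table is paired with the first-column word (as in the original table), write the word; otherwise, write "<NA>".
For example, look at the first row of the output. All three original tables pair the word "bus" with the word "bus". Now look at the second row: tables 1 and 2 contain the word "driver" paired with "bus", whereas table 3 does not contain "driver", so we write "<NA>" there.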
# This is an example of how the result should look for the example tables above
final_merge_df = data.frame(
x1 = c(rep("bus",4), rep("cat",5)),
df1 = c("bus","driver","<NA>","<NA>", "mouse","dog","<NA>","<NA>","<NA>"),
df2 = c("bus","driver","car", "<NA>", "<NA>", "dog","paw", "<NA>","<NA>"),
df3 = c("bus","<NA>", "<NA>","autobus","<NA>", "dog","paw", "bed", "tree"))
>final_merge_df
x1 df1 df2 df3
1 bus bus bus bus
2 bus driver driver <NA>
3 bus <NA> car <NA>
4 bus <NA> <NA> autobus
5 cat mouse <NA> <NA>
6 cat dog dog dog
7 cat <NA> paw paw
8 cat <NA> <NA> bed
9 cat <NA> <NA> tree
I have tried many things, including:
df = merge( df1, df2, by.x="df1", by.y="df2", all=T)
>df
df1 x1.x x1.y
1 bus bus bus
2 car <NA> bus
3 dog cat cat
4 driver bus bus
5 mouse cat <NA>
6 paw <NA> cat
Based on the output above, I wrote a short function that converts df into:
x1 df1 df2
1 bus bus bus
4 bus driver driver
2 bus <NA> car
3 cat dog dog
5 cat mouse <NA>
6 cat <NA> paw
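This is exactly what I want, but it only works for two tables. I need an approach that can handle more than two tables.
I also tried some frequency-table conversions, and I tried creating a TermDocumentMatrix (using the tm package), but without success.
I would greatly appreciate any help. Thanks.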
Answer 0 (score: 0)
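My first thought was something like
Reduce(function(x, y) merge(x, y, all = TRUE), list(df1, df2, df3))
but I could not get it to work. (I expect some experts will be able to offer something along the lines of this one-liner.) So I decided to do the merge in the following way. This is a problem-specific approach, and it may not be how the experts here would tackle your challenge. But at least it gives you a data frame to which you can apply your function and get the result you want.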
library(dplyr)
### I follow your script.
df = merge(df1, df2, by.x="df1", by.y="df2", all=T)
df <- arrange(df, df1)
### I want to repeat the same procedure, but two columns with bus and cat
### won't help. So I drop x1.y in df, which comes from df2.
### Separate the df2 part (x1.y)
foo <- df$x1.y
### Create df1 (new version)
ana <- select(df, df1, x1.x)
### This is merge with the new version of df1 and df3
bob = merge(ana, df3, by.x="df1", by.y="df3", all=T)
### There are three new items (i.e., autobus, bed, and tree).
### They are in df3, but not df2.
### So, I added NA in the positions of the items in df2.
foo2 <- c(NA, NA, foo, NA)
### Now add the df2 part.
cathy <- cbind(bob, foo2)
names(cathy) <- c("whatever", "df1", "df3", "df2")
### Reorder columns
david <- cathy[,c(1,2,4,3)]
#> david
# whatever df1 df2 df3
#1 autobus <NA> <NA> bus
#2 bed <NA> <NA> cat
#3 bus bus bus bus
#4 car <NA> bus <NA>
#5 dog cat cat cat
#6 driver bus bus <NA>
#7 mouse cat <NA> <NA>
#8 paw <NA> cat cat
#9 tree <NA> <NA> cat
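For reference, the Reduce idea mentioned at the start of this answer can be made to work if each table first gets a copy of its word column under a shared name, so that merge() has a common key besides x1. The following is only a sketch under that assumption; the helper add_key and the key column name "word" are made up for illustration and are not part of the question.

```r
# Example tables copied from the question.
df1 <- data.frame(x1  = c("bus", "bus", "cat", "cat"),
                  df1 = c("bus", "driver", "mouse", "dog"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x1  = c("bus", "bus", "bus", "cat", "cat"),
                  df2 = c("car", "driver", "bus", "dog", "paw"),
                  stringsAsFactors = FALSE)
df3 <- data.frame(x1  = c("bus", "bus", "cat", "cat", "cat", "cat"),
                  df3 = c("bus", "autobus", "dog", "bed", "paw", "tree"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: copy the second column into a shared key column
# named "word" (an assumed name) so every table can be matched on it.
add_key <- function(d) {
  d$word <- d[[2]]
  d
}
tables <- lapply(list(df1, df2, df3), add_key)

# Full outer join on (x1, word), folded over the list of tables.
merged <- Reduce(function(a, b) merge(a, b, by = c("x1", "word"), all = TRUE),
                 tables)

# Drop the helper key; the remaining columns are x1, df1, df2, df3.
final <- merged[, names(merged) != "word"]
final
```

Because merge() sorts on the key columns by default, the rows come out ordered by x1 and then by word, which is not the row order shown in the question. The same fold should work for any number of tables, as long as the second columns have distinct names.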
Answer 1 (score: 0)
I am short on time, so this is not a very elegant solution, but it does work.
df1 = data.frame(
x1=c("bus","bus","cat","cat"),
df1=c("bus","driver","mouse","dog"),
stringsAsFactors = FALSE)
df2 = data.frame(
x1=c("bus","bus","bus","cat","cat"),
df2=c("car","driver","bus","dog","paw"),
stringsAsFactors = FALSE)
df3 = data.frame(
x1=c("bus","bus","cat","cat","cat","cat"),
df3=c("bus","autobus","dog","bed","paw","tree"),
stringsAsFactors = FALSE)
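# Split the second column (the words) by the first column (the key).
s <- function(df) {
  split(df[,2], df[,1])
}
# One named list of word vectors per table.
l <- lapply(list(df1, df2, df3), s)
# All keys that occur in any table.
n <- unique(unlist(lapply(l, names)))
# For each key, list every distinct word and flag which tables contain it.
m <- do.call(rbind, lapply(n, function(i) {
  tmp <- lapply(l, "[[", i)
  u <- unique(unlist(tmp))
  cbind(rep(i, length(u)), u, sapply(tmp, function(x) u %in% x))
}))
m
# Replace each flag with the word itself (TRUE) or NA (FALSE).
m2 <- t(apply(m, 1, function(i) ifelse(i[3:length(i)], i[2], NA)))
as.data.frame(cbind(m[,1], m2))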
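The same membership-test idea can also be wrapped in a function that takes a named list of tables and returns named columns. This is a rough sketch, not a tested general solution: the function name merge_word_tables and the internal column name "word" are invented here, and it assumes that no word contains the unit-separator character used to build the lookup keys.

```r
# Example tables copied from the question.
df1 <- data.frame(x1  = c("bus", "bus", "cat", "cat"),
                  df1 = c("bus", "driver", "mouse", "dog"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x1  = c("bus", "bus", "bus", "cat", "cat"),
                  df2 = c("car", "driver", "bus", "dog", "paw"),
                  stringsAsFactors = FALSE)
df3 <- data.frame(x1  = c("bus", "bus", "cat", "cat", "cat", "cat"),
                  df3 = c("bus", "autobus", "dog", "bed", "paw", "tree"),
                  stringsAsFactors = FALSE)

# Hypothetical helper; `tables` is a named list of two-column data frames
# (key in column 1, word in column 2).
merge_word_tables <- function(tables) {
  # Collect every distinct (key, word) pair seen in any table.
  pairs <- unique(do.call(rbind, lapply(tables, function(d) {
    data.frame(x1 = d[[1]], word = d[[2]], stringsAsFactors = FALSE)
  })))
  # Group rows by key; ties keep their order of first appearance.
  pairs <- pairs[order(pairs$x1), ]

  # Assumption: "\x1f" never occurs inside a key or a word.
  pair_id <- function(key, word) paste(key, word, sep = "\x1f")

  out <- data.frame(x1 = pairs$x1, stringsAsFactors = FALSE)
  for (nm in names(tables)) {
    d <- tables[[nm]]
    present <- pair_id(pairs$x1, pairs$word) %in% pair_id(d[[1]], d[[2]])
    # Keep the word where this table has the pair, NA otherwise.
    out[[nm]] <- ifelse(present, pairs$word, NA_character_)
  }
  out
}

res <- merge_word_tables(list(df1 = df1, df2 = df2, df3 = df3))
res
```

For the three example tables, the rows of res follow the order in which each pair first appears within its key, which matches the layout of final_merge_df in the question.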