Merging datasets on the basis of two columns

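Time: 2015-12-07 04:29:59

Tags: regex r

There are n files. Each file has several columns, of which I only need the first two. I have to merge these n files on the basis of those two columns and append one more column. Its value is a string whose length depends on the number of files. For example, suppose there are 4 files. File 1: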

cat  dog 
lion ele
mice hello
new  lion
ele  that

File 2:

 cat lion
 mice hello
 cub  pet
 old  lion

File 3:

new    lion
cub    pet
cat    dog
hello  cat

File 4:

ele  that
hello cat
new   old

I want to generate a new file:

cat    dog     PAPA
lion   ele     PAAA
mice   hello   PPAA
new    lion    PAPA
ele    that    PAAP
cat    lion    APAA
cub    pet     APPA
old    lion    APAA
hello  cat     AAPP
new    old     AAAP

The character at position i of the string is 'A' if the pair does not occur in the i-th file, and 'P' if it does. That is how the string is formed.

2 answers:

Answer 0: (score: 0)

If you have a small dataset, you can do this by reshaping:

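library(dplyr)
library(tidyr)

list_of_file_names = c(...)

# Assumes each file is a CSV whose first two columns are named column1 and column2
data_frame(file = list_of_file_names) %>%
  group_by(file) %>%
  do(read.csv(.$file)[1:2]) %>%          # read each file, keep only the first two columns
  ungroup %>%
  distinct %>%                           # one row per (file, pair)
  mutate(present = "P") %>%
  spread(file, present, fill = "A") %>%  # one column per file, "A" where the pair is missing
  # first_file_name and last_file_name stand for the first and last of the file columns
  gather(file, present_absent, first_file_name:last_file_name) %>%
  group_by(column1, column2) %>%
  summarize(present_absent_string = 
              present_absent %>%
              paste(collapse = "") )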
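
The reshaping idea (long format with one row per file and pair, then one column per file) can also be tried without tidyr. The following is only a sketch in base R, not part of the answer above: it builds the question's four sample files in memory instead of reading them from disk, and the names `files`, `long`, `counts` and `flags` are made up for illustration. It also assumes the values contain no spaces, since a pair is keyed by pasting its two values together.

```r
# Sample files from the question, built inline instead of read from disk.
files <- list(
  file1 = data.frame(column1 = c("cat", "lion", "mice", "new", "ele"),
                     column2 = c("dog", "ele", "hello", "lion", "that")),
  file2 = data.frame(column1 = c("cat", "mice", "cub", "old"),
                     column2 = c("lion", "hello", "pet", "lion")),
  file3 = data.frame(column1 = c("new", "cub", "cat", "hello"),
                     column2 = c("lion", "pet", "dog", "cat")),
  file4 = data.frame(column1 = c("ele", "hello", "new"),
                     column2 = c("that", "cat", "old"))
)

# Long format: one row per (file, pair) occurrence.
long <- do.call(rbind, lapply(names(files), function(f) {
  data.frame(file = f, files[[f]])
}))
long$pair <- paste(long$column1, long$column2)

# Wide format: pairs in rows, files in columns, occurrence counts in the cells.
# The factor levels fix the row order (first appearance) and the column order.
counts <- table(factor(long$pair, levels = unique(long$pair)),
                factor(long$file, levels = names(files)))

# Turn counts into "P"/"A" flags and paste each row into one string.
flags <- ifelse(unclass(counts) > 0, "P", "A")
present_absent_string <- apply(flags, 1, paste, collapse = "")
print(present_absent_string)
```

The result is a character vector named by pair, so `present_absent_string[["cat dog"]]` is "PAPA" for the sample data. Because `table` counts occurrences, a pair repeated within one file is still reported once.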

Answer 1: (score: 0)

  

I'm having trouble installing the tidyr package. Is there any other way?

Here is one that needs no additional libraries.

#!/usr/bin/Rscript --vanilla
# data input - filenames are to be provided as command line arguments:
t = lapply(commandArgs(TRUE), read.table, col.names=1:2, flush=TRUE)  # only 2 columns
t = mapply('[<-', t, 3, value="P", SIMPLIFY=FALSE)  # mark the values as "present"
t = Reduce(function(x, y) merge(x, y, 1:2, all=TRUE, suffixes=ncol(x)), t) # merge
t[is.na(t)] = "A"           # mark the not present values as "absent"
t[3] = Reduce(function(...) paste(..., sep=''), t[-(1:2)])  # concatenate P&A
# data output - write the desired output format
write.table(format(t[1:3], justify="l"), quote=FALSE, row.names=FALSE, col.names=FALSE)
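
To follow the merge-based idea step by step, the sketch below applies it to the question's sample data held in memory. It is a variant of the script rather than a transcription: each flag column gets an explicit name (`f1`, `f2`, ...) instead of relying on `suffixes`, and the helper `mk` and the names `flagged`, `merged` and `flag_cols` are invented here for illustration.

```r
# Sample files from the question as two-column data frames.
mk <- function(a, b) data.frame(X1 = a, X2 = b, stringsAsFactors = FALSE)
files <- list(
  mk(c("cat", "lion", "mice", "new", "ele"), c("dog", "ele", "hello", "lion", "that")),
  mk(c("cat", "mice", "cub", "old"),         c("lion", "hello", "pet", "lion")),
  mk(c("new", "cub", "cat", "hello"),        c("lion", "pet", "dog", "cat")),
  mk(c("ele", "hello", "new"),               c("that", "cat", "old"))
)

# Add one flag column per file, named f1, f2, ..., and set it to "P".
flagged <- lapply(seq_along(files), function(i) {
  df <- unique(files[[i]])          # drop pairs repeated within a file
  df[[paste0("f", i)]] <- "P"
  df
})

# Full outer join on the two key columns; cells without a match become NA.
merged <- Reduce(function(x, y) merge(x, y, by = c("X1", "X2"), all = TRUE),
                 flagged)

# NA means the pair was absent from that file.
merged[is.na(merged)] <- "A"

# Concatenate the flag columns row by row, in file order.
flag_cols <- paste0("f", seq_along(files))
merged$string <- Reduce(paste0, merged[flag_cols])
print(merged[c("X1", "X2", "string")])
```

Note that `merge` sorts the rows by the key columns by default, so the rows come out in a different order than in the question's desired output, while the strings themselves are the same.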