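I have n files. Each file has several columns, of which I only need the first two. I need to merge these n files on those two columns and append one extra column. Its value is a string whose length equals the number of files. For example, suppose there are 4 files.
File 1: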
cat dog
lion ele
mice hello
new lion
ele that
File 2:
cat lion
mice hello
cub pet
old lion
File 3:
new lion
cub pet
cat dog
hello cat
File 4:
ele that
hello cat
new old
I want to generate a new file:
cat dog PAPA
lion ele PAAA
mice hello PPAA
new lion PAPA
ele that PAAP
cat lion APAA
cub pet APPA
old lion APAA
new lion AAPA
hello cat AAPP
new old AAAP
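The character at position i should be 'A' if the pair is not present in the i-th file; otherwise it should be 'P'. That is how the string is formed.
Answer 0 (score: 0)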
If you have a small dataset, you can do this by reshaping:
library(dplyr)
library(tidyr)
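list_of_file_names = c(...)  # fill in the paths of the n input files
data_frame(file = list_of_file_names) %>%
group_by(file) %>%
do(read.csv(.$file) ) %>%  # read each file; rows stay tagged with their file name
distinct %>%  # drop duplicate rows
mutate(present = "P") %>%  # every row that was read is present in its file
spread(file, present, fill = "A") %>%  # one column per file; "A" where a pair is missing
# first_file_name:last_file_name is a placeholder for the range of per-file columns
gather(file, present_absent, first_file_name:last_file_name) %>%
# column1 and column2 are placeholders for the names of the two key columns
group_by(column1, column2) %>%
summarize(present_absent_string =
present_absent %>%
paste(collapse = "") )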
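The same long-to-wide-to-string idea can be tried on the question's four files without any packages. The following is only a minimal sketch in base R, not the code from this answer: it assumes the data is already in memory, and the names `files`, `V1`, `V2`, `long`, `wide` and `result` are made up for illustration. It also joins the two key columns with a space, which assumes the values themselves contain no spaces.

```r
# The four example files from the question as in-memory data frames (illustrative names).
files <- list(
  data.frame(V1 = c("cat", "lion", "mice", "new", "ele"),
             V2 = c("dog", "ele", "hello", "lion", "that"),
             stringsAsFactors = FALSE),
  data.frame(V1 = c("cat", "mice", "cub", "old"),
             V2 = c("lion", "hello", "pet", "lion"),
             stringsAsFactors = FALSE),
  data.frame(V1 = c("new", "cub", "cat", "hello"),
             V2 = c("lion", "pet", "dog", "cat"),
             stringsAsFactors = FALSE),
  data.frame(V1 = c("ele", "hello", "new"),
             V2 = c("that", "cat", "old"),
             stringsAsFactors = FALSE)
)

# Long format: one row per (pair, file index), duplicates within a file removed.
long <- do.call(rbind, lapply(seq_along(files), function(i) {
  cbind(unique(files[[i]][, 1:2]), file = i)
}))

# Wide format: rows are pairs (in order of first appearance), columns are files,
# cells count how often the pair occurs in that file (0 or 1 here).
long$key <- paste(long$V1, long$V2)
wide <- table(factor(long$key, levels = unique(long$key)),
              factor(long$file, levels = seq_along(files)))

# Collapse each row of counts into a string of P (present) and A (absent).
result <- data.frame(
  pair = rownames(wide),
  present_absent = apply(wide, 1, function(r) {
    paste(ifelse(r > 0, "P", "A"), collapse = "")
  }),
  stringsAsFactors = FALSE
)
print(result, row.names = FALSE)
```

Each distinct pair appears once in `result`; for the question's data, `cat dog` gets `PAPA` and `mice hello` gets `PPAA`.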
Answer 1 (score: 0)
"I'm having trouble installing the tidyr package. Is there another way?"
Here is one that needs no additional libraries.
#!/usr/bin/Rscript --vanilla
# data input - filenames are to be provided as command line arguments:
t = lapply(commandArgs(T), read.table, col.names=1:2, flush=T) # only 2 columns
t = mapply('[<-', t, 3, value="P", SIMPLIFY=F) # mark the values as "present"
t = Reduce(function(x, y) merge(x, y, 1:2, all=T, suffixes=ncol(x)), t) # merge
t[is.na(t)] = "A" # mark the not present values as "absent"
t[3] = Reduce(function(...) paste(..., sep=''), t[-(1:2)]) # concatenate P&A
# data output - write the desired output format
write.table(format(t[1:3], justify="l"), quote=F, row.names=F, col.names=F)
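To see what the merge step in the script does, it may help to run it on just two small tables. This is a hedged sketch rather than part of the answer: the data frames `a` and `b` hold the first three rows of the question's files 1 and 2, and the column names `X1`, `X2`, `P1`, `P2` and `flags` are assumptions chosen for illustration.

```r
# First three rows of files 1 and 2, each with a marker column set to "P" (illustrative names).
a <- data.frame(X1 = c("cat", "lion", "mice"),
                X2 = c("dog", "ele", "hello"),
                P1 = "P", stringsAsFactors = FALSE)
b <- data.frame(X1 = c("cat", "mice", "cub"),
                X2 = c("lion", "hello", "pet"),
                P2 = "P", stringsAsFactors = FALSE)

# Full outer join on the first two columns: a pair missing from one side
# gets NA in that side's marker column.
m <- merge(a, b, by = 1:2, all = TRUE)

# Turn the NAs into "A" and concatenate the marker columns into one string.
m[is.na(m)] <- "A"
m$flags <- paste0(m$P1, m$P2)
print(m[c("X1", "X2", "flags")], row.names = FALSE)
```

Here `mice hello` occurs in both tables and ends up with `PP`, while `cub pet` occurs only in the second and ends up with `AP`. The script above repeats this join across all n files with `Reduce`.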