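I have several files (20) that share the same column structure but have different rows. Each has two columns: the first is a factor and the second is an integer. I want to sum the integer column for factors that repeat across files, and simply append factors that are new. How can I merge the files and sum the repeated entries?
I thought about combining cbind and tapply, but I really don't know how to implement it.
A simple example of the file structure: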
Shop Clients Shop Clients Shop Clients
A 9 D 8 A 5
B 7 A 4 R 4
C 4 F 3 C 3
D 2 B 1 B 2
Desired output:
Shop Clients
A 18
B 10
C 7
D 10
F 3
R 4
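I read the different files in a loop and create one dataset per file, so that each dataset exposes, for example, City1$Shop and City1$Clients. That is manageable with only 20 files, but I would like to know how to handle more files (e.g. 100). How can I solve this when the datasets are read this way?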
# Read one tab-delimited file; x is the file name
f <- function(x) {
  read.delim2(x, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
}
for (i in x) {  # x: character vector of file names
  total <- f(i)
  # Here I suppose I would combine and sum the datasets
}
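Answer 0 (score: 1)
We can melt the data into 'long' format by specifying the measure columns with patterns on the column names ('Shop' and 'Clients'), then group by 'Shop' and take the sum of 'Clients'.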
library(data.table)
melt(setDT(df1), measure = patterns("^Shop", "^Clients"),
value.name = c("Shop", "Clients"))[, .(Clients = sum(Clients)), by = Shop]
# Shop Clients
#1: A 18
#2: B 10
#3: C 7
#4: D 10
#5: F 3
#6: R 4
Or using tidyverse:
library(tidyverse)
map_dfc(list(Shop = "Shop", Clients = "Clients"), ~
df1 %>%
select(matches(.x)) %>%
unlist) %>%
group_by(Shop) %>%
summarise(Clients = sum(Clients))
# A tibble: 6 x 2
# Shop Clients
# <chr> <int>
#1 A 18
#2 B 10
#3 C 7
#4 D 10
#5 F 3
#6 R 4
Or using rowsum from base R:
i1 <- grepl("^Shop", names(df1))
rowsum(unlist(df1[!i1]), group = unlist(df1[i1]))
Data used in the examples above:
df1 <- structure(list(Shop = c("A", "B", "C", "D"), Clients = c(9L,
7L, 4L, 2L), Shop.1 = c("D", "A", "F", "B"), Clients.1 = c(8L,
4L, 3L, 1L), Shop.2 = c("A", "R", "C", "B"), Clients.2 = 5:2),
class = "data.frame", row.names = c(NA, -4L))
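Note that rowsum returns a one-column matrix with the shops as row names rather than the two-column layout shown in the question. A minimal sketch of converting it, assuming the same df1 as above (the names sums and out are only illustrative):

```r
# Same example data as df1 above, rebuilt so the snippet runs on its own
df1 <- data.frame(
  Shop = c("A", "B", "C", "D"), Clients = c(9L, 7L, 4L, 2L),
  Shop.1 = c("D", "A", "F", "B"), Clients.1 = c(8L, 4L, 3L, 1L),
  Shop.2 = c("A", "R", "C", "B"), Clients.2 = 5:2,
  stringsAsFactors = FALSE
)

# Logical index of the Shop* columns; the rest are the Clients* columns
i1 <- grepl("^Shop", names(df1))

# One-column matrix of sums, with the shop names as row names
sums <- rowsum(unlist(df1[!i1]), group = unlist(df1[i1]))

# Move the row names into a proper Shop column
out <- data.frame(Shop = rownames(sums), Clients = sums[, 1],
                  row.names = NULL)
print(out)
```

The result has one row per shop, sorted by shop name, which matches the desired output in the question.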
Answer 1 (score: 1)
After reshaping the data to long format, you can use aggregate.
inx <- grep("Shop", names(df1))
long <- do.call(rbind, lapply(inx, function(i) df1[i:(i + 1)]))
aggregate(Clients ~ Shop, long, sum)
# Shop Clients
#1 A 18
#2 B 10
#3 C 7
#4 D 10
#5 F 3
#6 R 4
Edit.
After the edit to the question, I believe the following is what is being asked for. I will use aggregate again.
fnames <- list.files(pattern = "\\.txt")
df_list <- lapply(fnames, read.table, header = TRUE)
df_all <- do.call(rbind, df_list)
aggregate(Clients ~ Shop, data = df_all, sum)
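To try this end to end without real files, here is a self-contained sketch. The temporary directory, the file names p01.txt to p03.txt, and the tab-delimited layout with a header row are assumptions based on the example in the question, not something the question confirms for every file.

```r
# Assumption: each file is tab-delimited with a header row "Shop<TAB>Clients".
# The three small data frames mirror the example in the question.
dir <- tempfile("shops")
dir.create(dir)
samples <- list(
  data.frame(Shop = c("A", "B", "C", "D"), Clients = c(9L, 7L, 4L, 2L)),
  data.frame(Shop = c("D", "A", "F", "B"), Clients = c(8L, 4L, 3L, 1L)),
  data.frame(Shop = c("A", "R", "C", "B"), Clients = c(5L, 4L, 3L, 2L))
)
for (k in seq_along(samples)) {
  write.table(samples[[k]], file.path(dir, sprintf("p%02d.txt", k)),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# Read every .txt file in the directory, however many there are
fnames <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
df_list <- lapply(fnames, read.delim, stringsAsFactors = FALSE)

# Stack the files on top of each other and sum Clients per Shop
df_all <- do.call(rbind, df_list)
result <- aggregate(Clients ~ Shop, data = df_all, FUN = sum)
print(result)
```

Because list.files picks up whatever matches the pattern, the same code should work unchanged for 20 or 100 files, provided they all share the two-column layout.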
Answer 2 (score: 1)
One tidyverse possibility could be:
df %>%
select_at(vars(contains("Shop"))) %>%
gather(var1, val1) %>%
bind_cols(df %>%
select_at(vars(contains("Client"))) %>%
gather(var2, val2)) %>%
group_by(Shop = val1) %>%
summarise(Clients = sum(val2))
Shop Clients
<chr> <int>
1 A 18
2 B 10
3 C 7
4 D 10
5 F 3
6 R 4
The same with base R:
long_df <- data.frame(Shop = stack(df[, grepl("Shop", names(df))])[, 1],
Clients = stack(df[, grepl("Client", names(df))])[, 1])
aggregate(Clients ~ Shop, long_df, sum)
Shop Clients
1 A 18
2 B 10
3 C 7
4 D 10
5 F 3
6 R 4