How to summarize a column across multiple datasets with different structures?

Time: 2019-05-20 16:48:58

Tags: r

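I have several files (20) with the same column structure but different rows. Each consists of two columns: the first is a factor, the second an integer. I want to sum the integer column for factors that repeat across files, and simply append the factors that are new. How can I combine the files and sum up the repeated entries?

I have thought about combining cbind and tapply, but I don't really know how to go about it.

A simple example of the file structure: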

Shop   Clients     Shop  Clients     Shop Clients
 A        9          D      8          A     5
 B        7          A      4          R     4
 C        4          F      3          C     3
 D        2          B      1          B     2

The output I would like:

Shop Clients
A      18 
B      10
C       7
D      10
F       3
R       4

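I read the different files in a loop and created one dataset per file, so that each dataset exposes, for example, City1$Shop and City1$Clients. That is workable for 20 files, but I would like to know how to handle more files (for example, 100). How can I solve the problem when the datasets are read this way?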

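# Read one tab-delimited file; x is the file name
f <- function(x) {
  read.delim2(x, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
}
# x is assumed to be a character vector of file names such as "p01.txt"
for (i in x) {
  total <- f(i)
  # Here I suppose I would combine and sum the datasets
}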

3 Answers:

Answer 0: (score: 1)

We can melt the data into 'long' format by giving measure the patterns of the column names, 'Shop' and 'Clients', and then, grouped by 'Shop', take the sum of 'Clients'

library(data.table)
melt(setDT(df1), measure = patterns("^Shop", "^Clients"), 
  value.name = c("Shop", "Clients"))[, .(Clients = sum(Clients)), by = Shop]
#    Shop Clients
#1:    A      18
#2:    B      10
#3:    C       7
#4:    D      10
#5:    F       3
#6:    R       4

Or using tidyverse

library(tidyverse)
map_dfc(list(Shop = "Shop", Clients = "Clients"), ~
    df1 %>%
       select(matches(.x)) %>% 
       unlist) %>% 
  group_by(Shop) %>% 
  summarise(Clients = sum(Clients))
# A tibble: 6 x 2
#  Shop  Clients
#  <chr>   <int>
#1 A          18
#2 B          10
#3 C           7
#4 D          10
#5 F           3
#6 R           4

Or using rowsum from base R

i1 <- grepl("^Shop", names(df1))
rowsum(unlist(df1[!i1]), group =  unlist(df1[i1]))
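
Note that rowsum returns a one-column matrix with the shop names as row names, not a data frame. Below is a minimal sketch of turning that result into the two-column layout the question asks for. It assumes the same example data as df1 (rebuilt here so the block runs on its own); the names res and out are only illustrative.

```r
# Same example data as df1 in this answer
df1 <- data.frame(
  Shop = c("A", "B", "C", "D"), Clients = c(9L, 7L, 4L, 2L),
  Shop.1 = c("D", "A", "F", "B"), Clients.1 = c(8L, 4L, 3L, 1L),
  Shop.2 = c("A", "R", "C", "B"), Clients.2 = 5:2,
  stringsAsFactors = FALSE
)

i1 <- grepl("^Shop", names(df1))
# rowsum() gives a matrix with one row per shop, sorted by shop name
res <- rowsum(unlist(df1[!i1]), group = unlist(df1[i1]))

# Move the row names into a regular column (out is an illustrative name)
out <- data.frame(Shop = rownames(res), Clients = unname(res[, 1]),
                  stringsAsFactors = FALSE)
out
```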

Data

df1 <- structure(list(Shop = c("A", "B", "C", "D"), Clients = c(9L, 
 7L, 4L, 2L), Shop.1 = c("D", "A", "F", "B"), Clients.1 = c(8L, 
 4L, 3L, 1L), Shop.2 = c("A", "R", "C", "B"), Clients.2 = 5:2), 
  class = "data.frame", row.names = c(NA, -4L))

Answer 1: (score: 1)

After reshaping the data to long format, you can use aggregate

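inx <- grep("Shop", names(df1))
# Give every Shop/Clients pair the same column names so rbind() can stack them
long <- do.call(rbind, lapply(inx, function(i)
  setNames(df1[i:(i + 1)], c("Shop", "Clients"))))
aggregate(Clients ~ Shop, long, sum)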
#  Shop Clients
#1    A      18
#2    B      10
#3    C       7
#4    D      10
#5    F       3
#6    R       4
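
The question mentions tapply, so for comparison here is a small sketch of the same sum done with tapply on the long data. It assumes the df1 example data from answer 0 (rebuilt here so the block is self-contained) and renames each Shop/Clients pair before binding; totals is an illustrative name. Unlike aggregate, tapply returns a named vector rather than a data frame.

```r
# Example data, same values as df1 in answer 0
df1 <- data.frame(
  Shop = c("A", "B", "C", "D"), Clients = c(9L, 7L, 4L, 2L),
  Shop.1 = c("D", "A", "F", "B"), Clients.1 = c(8L, 4L, 3L, 1L),
  Shop.2 = c("A", "R", "C", "B"), Clients.2 = 5:2,
  stringsAsFactors = FALSE
)

# Stack the Shop/Clients pairs into one long two-column data frame
inx <- grep("Shop", names(df1))
long <- do.call(rbind, lapply(inx, function(i) {
  setNames(df1[i:(i + 1)], c("Shop", "Clients"))
}))

# Sum Clients within each Shop; the result is named by shop
totals <- tapply(long$Clients, long$Shop, sum)
totals
```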

Edit.
After the edit to the question, I believe the following does what is asked. I will use aggregate again

fnames <- list.files(pattern = "\\.txt")
df_list <- lapply(fnames, read.table, header = TRUE)
df_all <- do.call(rbind, df_list)
aggregate(Clients ~ Shop, data = df_all, sum)
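
To try the file-based approach without the real data, the sketch below first writes the three example tables from the question into a temporary directory and then applies the same list.files / lapply / rbind / aggregate steps. The file names p01.txt to p03.txt and the temporary directory are assumptions made for illustration, and read.delim is used here because the question's files are tab-separated.

```r
# Write three small tab-separated files into a temporary directory
# (hypothetical stand-ins for the real input files)
tmp <- tempfile("shops")
dir.create(tmp)
example <- list(
  data.frame(Shop = c("A", "B", "C", "D"), Clients = c(9L, 7L, 4L, 2L)),
  data.frame(Shop = c("D", "A", "F", "B"), Clients = c(8L, 4L, 3L, 1L)),
  data.frame(Shop = c("A", "R", "C", "B"), Clients = c(5L, 4L, 3L, 2L))
)
for (k in seq_along(example)) {
  write.table(example[[k]], file.path(tmp, sprintf("p%02d.txt", k)),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# Read every .txt file, stack the rows, then sum Clients per Shop
fnames <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)
df_list <- lapply(fnames, read.delim, stringsAsFactors = FALSE)
df_all <- do.call(rbind, df_list)
result <- aggregate(Clients ~ Shop, data = df_all, FUN = sum)
result
```

Because list.files picks up whatever matches the pattern, the same lines should work unchanged whether the directory holds 20 files or 100.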

Answer 2: (score: 1)

One tidyverse possibility could be:

df %>%
 select_at(vars(contains("Shop"))) %>%
 gather(var1, val1) %>%
 bind_cols(df %>%
 select_at(vars(contains("Client"))) %>%
 gather(var2, val2)) %>%
 group_by(Shop = val1) %>%
 summarise(Clients = sum(val2))

  Shop  Clients
  <chr>   <int>
1 A          18
2 B          10
3 C           7
4 D          10
5 F           3
6 R           4

The same with base R:

long_df <- data.frame(Shop = stack(df[, grepl("Shop", names(df))])[, 1], 
Clients = stack(df[, grepl("Client", names(df))])[, 1])
aggregate(Clients ~ Shop, long_df, sum)

  Shop Clients
1    A      18
2    B      10
3    C       7
4    D      10
5    F       3
6    R       4