Question

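I have a data frame with information on the activities of certain organizations in different countries. The column orga contains the name of the organization, c1 to c4 are country columns holding the number of activities the organization carries out in that country, and home is the organization's home country. The values in home correspond to the numbers in the column names c1 to c4.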

orga <- c("AA", "AB", "AC", "BA", "BB", "BC", "BD")
c1 <- c(3,1,0,0,2,0,1)
c2 <- c(0,2,2,0,1,0,1)
c3 <- c(1,0,0,1,0,2,0)
c4 <- c(0,1,1,0,0,0,0)
home <- c(1,2,3,2,1,3,1)
df <- data.frame(orga, c1, c2, c3, c4, home)

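I now want to add an additional column foreign, which contains the organization's foreign activities: the sum of all activities listed in c1 to c4, excluding the home country column. So the function should not sum all the country columns, only those that are not the home country. For example, if home = 1 it should omit c1, if home = 2 it should omit c2, and so on.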

In the example, foreign should look like this:

df$foreign <- c(1,2,3,1,1,0,1)

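Is there a way to sum columns per group, leaving out a different column for each group, and add the sum as a new column to the data frame?

I have looked at group_by in the dplyr package, as well as aggregate and tapply in base R, but could not come up with a solution. I would therefore really appreciate your help. Thanks!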

Answer 1

One way to do this, using rowSums:

diag(as.matrix(rowSums(df[2:5])- df[2:5][df$home]))
#[1] 1 2 3 1 1 0 1
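
The one-liner above builds an n-by-n intermediate result and keeps only its diagonal. As a rough sketch of the same idea without that intermediate, the home-country value can be picked per row with a two-column index matrix and subtracted from the row total. The names country_cols, activities and home_activity are illustrative and not part of the original answer, and the sketch assumes that home holds the position of the home country among c1 to c4, as described in the question.

```r
# Rebuild the example data from the question
df <- data.frame(
  orga = c("AA", "AB", "AC", "BA", "BB", "BC", "BD"),
  c1   = c(3, 1, 0, 0, 2, 0, 1),
  c2   = c(0, 2, 2, 0, 1, 0, 1),
  c3   = c(1, 0, 0, 1, 0, 2, 0),
  c4   = c(0, 1, 1, 0, 0, 0, 0),
  home = c(1, 2, 3, 2, 1, 3, 1)
)

# Illustrative helper names (assumptions, not from the original answer)
country_cols <- paste0("c", 1:4)
activities   <- as.matrix(df[country_cols])

# A two-column (row, column) index matrix picks one cell per row:
# the activity count in each organization's home country
home_activity <- activities[cbind(seq_len(nrow(df)), df$home)]

# Foreign activities = all activities minus the home-country activities
foreign <- rowSums(activities) - home_activity
foreign
```

For the example data this should reproduce the foreign values requested in the question; assign it with df$foreign <- foreign to add the column.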

Answer 2

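Here is another option using rowSums. Using row/column indexing, we replace the home-country values with NA in a copy of the dataset, then call rowSums with na.rm=TRUE so that the row sums exclude the "home" column.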

df1 <- df
df1[-1][cbind(1:nrow(df), df$home)] <- NA
df$foreign <- rowSums(df1[2:5],na.rm=TRUE) 
df$foreign
#[1] 1 2 3 1 1 0 1

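Or use apply:

# Select the columns by name so the result is unaffected if df already has a foreign column
df$foreign <- apply(df[c("c1", "c2", "c3", "c4", "home")], 1, function(x) sum(x[1:4][-x[5]]))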
df$foreign
#[1] 1 2 3 1 1 0 1

Answer 3

Here is a solution using the dplyr and tidyr packages.

library(dplyr)
library(tidyr)

df2 <- df %>%
  # Change the home column from number to character
  # so that its values (c1, c2, c3, c4) match the column names c1 to c4
  mutate(home = paste0("c", home)) %>%
  # Convert the data frame from wide format to long format
  # activity contains the columns names from c1 to c4 as labels
  # number is the original number for each
  gather(activity, number, -orga, -home) %>%
  # Remove rows when home and activity number are the same
  filter(home != activity) %>%
  # Group by the organization
  group_by(orga) %>%
  # Calculate the total number of activities, call it foreign
  summarise(foreign = sum(number)) %>%
  # Join the results back with df by organization
  left_join(df, by = "orga") %>%
  # Reorder the columns
  select(orga, c1:home, foreign)

Here is the final result. The information you need is in the foreign column of the data frame df2.

df2
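
gather() still works but is marked as superseded in current tidyr releases. A sketch of the same reshape-filter-summarise idea with pivot_longer() could look like the following. It assumes tidyr 1.0.0 or later, and the names foreign_by_orga and result are illustrative rather than taken from the answer.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(
  orga = c("AA", "AB", "AC", "BA", "BB", "BC", "BD"),
  c1   = c(3, 1, 0, 0, 2, 0, 1),
  c2   = c(0, 2, 2, 0, 1, 0, 1),
  c3   = c(1, 0, 0, 1, 0, 2, 0),
  c4   = c(0, 1, 1, 0, 0, 0, 0),
  home = c(1, 2, 3, 2, 1, 3, 1)
)

foreign_by_orga <- df %>%
  # Wide to long: one row per organization and country column
  pivot_longer(cols = c1:c4, names_to = "activity", values_to = "number") %>%
  # Drop the row that belongs to the organization's home country
  filter(activity != paste0("c", home)) %>%
  # Sum the remaining (foreign) activities per organization
  group_by(orga) %>%
  summarise(foreign = sum(number))

# Join onto the original data so the row order of df is kept
result <- left_join(df, foreign_by_orga, by = "orga")
result
```

Joining with df on the left keeps the original row order, which matters if the organizations are not already sorted by name.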

Summing different columns for different groups

3 answers: