Question

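I have a dataset with more than 100,000 rows. For each row, I want to count how many times its value in a specific column appears in that column, and save the count to another column (see the example below).

I could scan the whole dataset once for every row, but that would be 100k * 100k iterations. Is there a more efficient way?

Input dataset

A B
1 6
3 1
2 6
4 2
1 4
9 1

Output dataset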

A B number_of_appearances (based on column B)
1 6    2
3 1    2
2 6    2
4 2    1
1 4    1
9 1    2

Answer 1

You can use dplyr:

library(dplyr)

a <- c(2,1,2,3,4,3,2,1,4)
b <- c(3,2,1,2,3,4,3,2,1)

df <- data.frame(a, b)

df %>%
  group_by(b) %>%
  mutate(appearances_in_b = n())

Source: local data frame [9 x 3]
Groups: b [4]

     a     b appearances_in_b
   <dbl> <dbl>            <int>
1     2     3                3
2     1     2                3
3     2     1                2
4     3     2                3
5     4     3                3
6     3     4                1
7     2     3                3
8     1     2                3
9     4     1                2
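As a possible shortcut, dplyr also has add_count(), which adds the group size as a column without leaving the result grouped. This is only a sketch on the same vectors as above; the column name appearances_in_b is chosen here for illustration, and it assumes a dplyr version recent enough to support the name argument.

```r
library(dplyr)

# Same example vectors as in the answer above
df <- data.frame(a = c(2, 1, 2, 3, 4, 3, 2, 1, 4),
                 b = c(3, 2, 1, 2, 3, 4, 3, 2, 1))

# add_count() counts rows per value of b and appends the count as a new column
res <- df %>%
  add_count(b, name = "appearances_in_b")

print(res)
```

Unlike the group_by() version, the result here is not a grouped data frame, so later steps do not need an ungroup().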

Answer 2

Without dplyr:

# create the data frame (random example data)
x <- sample(1:3, 10, TRUE)
y <- sample(c("a", "b", "c"), 10, TRUE)
d <- data.frame(x, y)

# get the frequencies of y
tb <- table(d$y)
tb <- as.data.frame(tb)

# SQL-join-like merge of the two data frames; the count ends up in column "Freq"
res <- merge(d, tb, by.x = "y", by.y = "Var1", sort = FALSE)
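One caveat with merge(): with sort = FALSE the row order of the result is unspecified, so the rows may not come back in their original order. If the order matters, a variation that might help is to index the frequency table by each row's value instead of merging. The sketch below uses the question's example data; the data frame name d and the column names are assumptions taken from the question's tables.

```r
# Question's example data (column names A and B as shown in the question)
d <- data.frame(A = c(1, 3, 2, 4, 1, 9), B = c(6, 1, 6, 2, 4, 1))

# Count each distinct value of B once
counts <- table(d$B)

# Look up every row's count by name; the rows of d stay in place
d$number_of_appearances <- as.integer(counts[as.character(d$B)])

print(d)
```

This counts the values once and then does one lookup per row, rather than rescanning the column for every row.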

Answer 3

We can use ave from base R:

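# df1 is assumed to hold the question's example data
df1 <- data.frame(A = c(1, 3, 2, 4, 1, 9), B = c(6, 1, 6, 2, 4, 1))
df1$appearance_in_b <- with(df1, ave(B, B, FUN=length))
df1$appearance_in_b
#[1] 2 2 2 1 1 2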

Answer 4

Just to add a data.table approach:

library(data.table)
dt <- data.table(A = c(1, 3, 2, 4, 1, 9), B = c(6, 1, 6, 2, 4, 1))

dt[, number_of_appearances := .N, by = "B"]

print(dt)
   A B number_of_appearances
1: 1 6                     2
2: 3 1                     2
3: 2 6                     2
4: 4 2                     1
5: 1 4                     1
6: 9 1                     2

R: count duplicates and store the count in another column
