I have input data like this:
df <- data.frame(id = c(1,2,3,4,5,6), stocks = c("google stock, yahoo product stock", "google stock, yahoo product stock","amazon, yahoo product","yahoo product, amazon","yahoo product stock", "google stock"))
I would like to get a result like this:
df <- data.frame(id = c(1,2,3,4,5,6), stocks = c("google stock, yahoo product stock", "google stock, yahoo product stock","amazon, yahoo product stock","yahoo product stock, amazon","yahoo product stock", "google stock"))
  combination                        frequency
1 google stock - yahoo product stock         2
2 amazon - yahoo product stock               2
3 yahoo product stock                        1
4 google stock                               1
I tried:
library(tidyverse)
df %>%
  separate_rows(stocks, sep = ",") %>%
  full_join(df %>%
              separate_rows(stocks), by = c("id" = "id")) %>%
  filter(stocks.x != stocks.y) %>%
  count(stocks.x, stocks.y) %>%
  transmute(stocks = paste(pmax(stocks.x, stocks.y), pmin(stocks.x, stocks.y), sep = "-"),
            n) %>%
  distinct(stocks, .keep_all = TRUE)
But I got this result:
# A tibble: 16 x 2
   stocks                        n
   <chr>                     <int>
 1 amazon- yahoo product         2
 2 product- yahoo product        2
 3 yahoo- yahoo product          2
 4 google- yahoo product stock   2
 5 product- yahoo product stock  2
 6 stock- yahoo product stock    4
 7 yahoo- yahoo product stock    2
 8 product-amazon                2
 9 yahoo-amazon                  2
10 google stock-google           3
11 product-google stock          2
12 stock-google stock            5
13 yahoo-google stock            2
14 yahoo product stock-product   1
15 yahoo product stock-stock     1
16 yahoo product stock-yahoo     1
Using table() is not the best solution in my case, because my real dataset contains much more data.
Answer 0 (score: 1)
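Are you looking for something like the code below? If so, I will comment each step. Basically, it splits each string on the comma, trims the whitespace, sorts the split pieces, collapses them back together with " - ", and uses dplyr functions to get the counts. I made a lot of assumptions, so please let me know if it does not work for you. Also, depending on the number of groups, doing this in data.table might be faster, but I stuck with dplyr because that is what you were using. Good luck!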
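library(dplyr)
# Split on commas, trim whitespace, and sort the pieces within each row
split_stock <- lapply(strsplit(as.character(df1$stocks), ",", fixed = TRUE), function(x) sort(trimws(x)))
# Collapse the sorted pieces back into one "a - b" label per row
df1$stocks2 <- sapply(split_stock, paste0, collapse = " - ")
# Count how often each combination occurs, most frequent first
df1 %>%
  group_by(stocks2) %>%
  count() %>%
  arrange(desc(n))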
# A tibble: 4 x 2
stocks2 n
<chr> <int>
1 amazon - yahoo product 2
2 google stock - yahoo product stock 2
3 google stock 1
4 yahoo product stock 1
Data:
df1 <- data.frame(id = c(1,2,3,4,5,6), stocks = c("google stock, yahoo product stock", "google stock, yahoo product stock","amazon, yahoo product","yahoo product, amazon","yahoo product stock", "google stock"))
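Since data.table was mentioned above as a possibly faster option, here is a minimal sketch of the same split, trim, sort, and collapse idea in data.table. It is not benchmarked, so whether it is actually faster on your real data is an assumption. The helper name normalise_combo and the column names combination and frequency are made up for illustration.

```r
library(data.table)

# Same sample data as df1, rebuilt here so the sketch is self-contained.
dt <- data.table(
  id = 1:6,
  stocks = c("google stock, yahoo product stock",
             "google stock, yahoo product stock",
             "amazon, yahoo product",
             "yahoo product, amazon",
             "yahoo product stock",
             "google stock")
)

# Hypothetical helper: split on commas, trim whitespace, sort the pieces,
# and glue them back together so that "a, b" and "b, a" get the same label.
normalise_combo <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)
  vapply(parts,
         function(p) paste(sort(trimws(p)), collapse = " - "),
         character(1))
}

# Add the normalised label by reference, then count rows per label.
dt[, combination := normalise_combo(stocks)]
res <- dt[, .(frequency = .N), by = combination][order(-frequency)]
res
```

The label is built once per row and the counting is a single grouped .N, so no join is needed.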
Answer 1 (score: 1)
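You don't need full_join().
Use separate_rows() to identify all the companies in stocks for each id, then use group_by() / summarise() with paste(collapse = ' ') to concatenate, in sorted order, the different possibilities in your stocks variable. Finally, use count() as needed.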
df %>%
  separate_rows(stocks) %>%
  filter(!stocks %in% c('stock', 'product')) %>%
  group_by(id) %>%
  summarise(group_stocks = paste(sort(stocks), collapse = ' ')) %>%
  count(group_stocks)
# group_stocks n
# <chr> <int>
# 1 amazon yahoo 2
# 2 google 1
# 3 google yahoo 2
# 4 yahoo 1
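If you would rather keep the full names (for example "google stock" instead of "google"), a variant of the same pipeline is sketched below. It assumes the items are always separated by a comma followed by optional whitespace, and passes that to separate_rows() as a regular expression through sep. The output column names combination and frequency are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt so the sketch is self-contained.
df <- data.frame(
  id = 1:6,
  stocks = c("google stock, yahoo product stock",
             "google stock, yahoo product stock",
             "amazon, yahoo product",
             "yahoo product, amazon",
             "yahoo product stock",
             "google stock"),
  stringsAsFactors = FALSE
)

res <- df %>%
  # Split only on commas (plus any following spaces), so multi-word names survive.
  separate_rows(stocks, sep = ",\\s*") %>%
  group_by(id) %>%
  # Sort within each id so the order of the items in a row does not matter.
  summarise(combination = paste(sort(stocks), collapse = " - ")) %>%
  # Count each distinct combination, most frequent first.
  count(combination, name = "frequency", sort = TRUE)

res
```

Because the split happens only at commas, the filter() on 'stock' and 'product' is no longer needed.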