Output differs from the expected result

Time: 2019-05-13 14:12:33

Tags: r tidyverse

I have input data like this:

df <- data.frame(id = c(1,2,3,4,5,6), stocks = c("google stock, yahoo product stock", "google stock, yahoo product stock","amazon, yahoo product","yahoo product, amazon","yahoo product stock", "google stock"))

I would like to get a result like this:

df <- data.frame(id = c(1,2,3,4,5,6), stocks = c("google stock, yahoo product stock", "google stock, yahoo product stock","amazon, yahoo product stock","yahoo product stock, amazon","yahoo product stock", "google stock"))
                         combination frequency
1 google stock - yahoo product stock         2
2       amazon - yahoo product stock         2
3                yahoo product stock         1
4                       google stock         1

I tried:

library(tidyverse)
 df %>%
     separate_rows(stocks, sep = ",") %>%
     full_join(df %>%
                   separate_rows(stocks), by = c("id" = "id")) %>%
     filter(stocks.x != stocks.y) %>%
     count(stocks.x, stocks.y) %>%
     transmute(stocks = paste(pmax(stocks.x, stocks.y), pmin(stocks.x, stocks.y), sep = "-"),
               n) %>%
     distinct(stocks, .keep_all = TRUE)

But I got this result:

# A tibble: 16 x 2
   stocks                           n
   <chr>                        <int>
 1 amazon- yahoo product            2
 2 product- yahoo product           2
 3 yahoo- yahoo product             2
 4 google- yahoo product stock      2
 5 product- yahoo product stock     2
 6 stock- yahoo product stock       4
 7 yahoo- yahoo product stock       2
 8 product-amazon                   2
 9 yahoo-amazon                     2
10 google stock-google              3
11 product-google stock             2
12 stock-google stock               5
13 yahoo-google stock               2
14 yahoo product stock-product      1
15 yahoo product stock-stock        1
16 yahoo product stock-yahoo        1

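Using table() is not the best solution in my case, because my real dataset contains far more data.

2 Answers:

Answer 0: (score: 1)

Are you looking for something like this (below)? If so, I will annotate each step. Basically, it splits each string on commas, trims the whitespace, sorts the split pieces, collapses them back together with " - ", and uses dplyr functions to get the counts. I made a number of assumptions, so let me know if it does not work for you. Also, depending on the number of groups, doing this in data.table might be faster, but I stuck with dplyr because that is what you used. Good luck!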

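library(dplyr)

# Split each row on commas, trim whitespace, and sort the pieces
split_stock <- lapply(strsplit(as.character(df1$stocks), ",", fixed = TRUE), function(x) sort(trimws(x)))

# Collapse the sorted pieces into one order-independent key per row
df1$stocks2 <- sapply(split_stock, paste0, collapse = " - ")

# Count each key, most frequent first
df1 %>%
  group_by(stocks2) %>%
  count() %>%
  arrange(desc(n))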

# A tibble: 4 x 2
  stocks2                                n
  <chr>                              <int>
1 amazon - yahoo product                 2
2 google stock - yahoo product stock     2
3 google stock                           1
4 yahoo product stock                    1

Data

df1 <- data.frame(id = c(1,2,3,4,5,6), stocks = c("google stock, yahoo product stock", "google stock, yahoo product stock","amazon, yahoo product","yahoo product, amazon","yahoo product stock", "google stock"))
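
The answer mentions that data.table might be faster on larger data. A minimal sketch of the same steps (split on commas, trim, sort, collapse, count) in data.table could look like the following. It assumes the data.table package is installed; the names dt, combination, frequency, and freq are illustrative and do not come from the original answer.

```r
library(data.table)

# Same sample data as df1 above; stringsAsFactors = FALSE keeps stocks as character
df1 <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  stocks = c("google stock, yahoo product stock",
             "google stock, yahoo product stock",
             "amazon, yahoo product",
             "yahoo product, amazon",
             "yahoo product stock",
             "google stock"),
  stringsAsFactors = FALSE
)

dt <- as.data.table(df1)

# Build an order-independent key per row: split on commas, trim, sort, rejoin
dt[, combination := vapply(
  strsplit(stocks, ",", fixed = TRUE),
  function(x) paste(sort(trimws(x)), collapse = " - "),
  character(1)
)]

# Count rows per key, most frequent first
freq <- dt[, .(frequency = .N), by = combination][order(-frequency)]
print(freq)
```

Whether this is actually faster than the dplyr version depends on the data, so it is worth timing both on a realistic sample.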

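Answer 1: (score: 1)

You don't need to use full_join().

Use separate_rows() to identify every company in stocks for each id, then use group_by() / summarise() together with paste(collapse = ' ') to concatenate, in sorted order, the different possibilities in your stocks variable. Finally, use count() as needed.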

df %>% 
  separate_rows(stocks) %>% 
  filter(!stocks %in% c('stock', 'product')) %>% 
  group_by(id) %>% 
  summarise(group_stocks = paste(sort(stocks), collapse = ' ')) %>% 
  count(group_stocks)

#   group_stocks     n
#   <chr>        <int>
# 1 amazon yahoo     2
# 2 google           1
# 3 google yahoo     2
# 4 yahoo            1
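
The filter on 'stock' and 'product' is needed because separate_rows() splits on any run of non-alphanumeric characters by default (sep = "[^[:alnum:].]+"), which includes spaces. A possible variant, sketched here as an assumption rather than as part of the original answer, splits only on commas so that the full names are kept. The column names combination and frequency are illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data from the question; stringsAsFactors = FALSE keeps stocks as character
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  stocks = c("google stock, yahoo product stock",
             "google stock, yahoo product stock",
             "amazon, yahoo product",
             "yahoo product, amazon",
             "yahoo product stock",
             "google stock"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Split on a comma plus any following whitespace, so names keep their inner spaces
  separate_rows(stocks, sep = ",\\s*") %>%
  group_by(id) %>%
  # Sort within each id so "a, b" and "b, a" produce the same key
  summarise(combination = paste(sort(stocks), collapse = " - ")) %>%
  # Count identical keys, most frequent first
  count(combination, name = "frequency", sort = TRUE)

print(result)
```

With the sample data this gives four combinations: "amazon - yahoo product" and "google stock - yahoo product stock" with a frequency of 2 each, and "google stock" and "yahoo product stock" with a frequency of 1 each.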