Frequency between pairs of words

Time: 2019-05-10 12:44:14

Tags: r

Given a data frame like this:

df <- data.frame(id = c(1,2,3,4,5), keywords = c("google, yahoo, air, cookie", "cookie, air", "air, cookie", "google", "yahoo, google"))

how can I extract this table

df_binary_exist <- data.frame(id = c(1,2,3,4,5), google = c(1,0,0,1,1), yahoo = c(1,0,0,0,1), air = c(1,1,1,0,0), cookie = c(1,1,1,0,0))
df_binary_exist
  id google yahoo air cookie
1  1      1     1   1      1
2  2      0     0   1      1
3  3      0     0   1      1
4  4      1     0   0      0
5  5      1     1   0      0

and, from this table, find the most frequent couples (pairs of keywords that occur together)?

df_frequency <- data.frame(couple = c("yahoo-google", "cookie-air"), freq = c(2,3))
df_frequency
        couple freq
1 yahoo-google    2
2   cookie-air    3

3 answers:

Answer 0: (score: 2)

The first part can be achieved using separate_rows, count and spread.

library(dplyr)
library(tidyr)

df1 <- df %>% separate_rows(keywords)

df1 %>%
  dplyr::count(id, keywords) %>%
  spread(keywords, n, fill = 0)

#     id   air cookie google yahoo
#  <dbl> <dbl>  <dbl>  <dbl> <dbl>
#1     1     1      1      1     1
#2     2     1      1      0     0
#3     3     1      1      0     0
#4     4     0      0      1     0
#5     5     0      0      1     1

For the second part I used base R methods: we first split the keywords by id, form the combinations of every two elements, and then paste each pair together.
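A minimal sketch of the second part as described above, in base R only. It assumes combn() to build the pairs and table() to count them; those two calls and the helper name pair_labels are assumptions for illustration, not necessarily what the answer used.

```r
# Same data as in the question; keep keywords as character, not factor
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  keywords = c("google, yahoo, air, cookie", "cookie, air", "air, cookie",
               "google", "yahoo, google"),
  stringsAsFactors = FALSE
)

# Long form: one row per (id, keyword)
words <- strsplit(df$keywords, ", ", fixed = TRUE)
long <- data.frame(
  id = rep(df$id, lengths(words)),
  keywords = unlist(words),
  stringsAsFactors = FALSE
)

# split() the keywords by id, giving one character vector per id
by_id <- split(long$keywords, long$id)

# Hypothetical helper: all unordered pairs within one id, pasted as "a-b".
# Sorting first gives a pair the same label whatever order it was typed in.
pair_labels <- function(x) {
  if (length(x) < 2) return(character(0))  # a single keyword forms no pair
  combn(sort(x), 2, FUN = paste, collapse = "-")
}

pairs <- unlist(lapply(by_id, pair_labels), use.names = FALSE)

# Frequency of each pair, most frequent first
pair_freq <- sort(table(pairs), decreasing = TRUE)
pair_freq
```

On the sample data this counts "air-cookie" three times and "google-yahoo" twice, matching the frequencies asked for in the question; every other pair occurs once.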

Answer 1: (score: 2)

One tidyverse possibility could be:

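library(dplyr)
library(tidyr)

df %>%
 # as.character() guards against keywords being a factor (the default before R 4.0)
 mutate(keywords = strsplit(as.character(keywords), ", ", fixed = TRUE)) %>%
 unnest(keywords) %>%
 # self-join on id to get every ordered pair of keywords within an id
 full_join(df %>%
            mutate(keywords = strsplit(as.character(keywords), ", ", fixed = TRUE)) %>%
            unnest(keywords), by = c("id" = "id")) %>%
 filter(keywords.x != keywords.y) %>%
 count(keywords.x, keywords.y) %>%
 transmute(keywords = paste(pmax(keywords.x, keywords.y), pmin(keywords.x, keywords.y), sep = "-"),
           n) %>%
 distinct(keywords, .keep_all = TRUE)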

  keywords          n
  <chr>         <int>
1 cookie-air        3
2 google-air        1
3 yahoo-air         1
4 google-cookie     1
5 yahoo-cookie      1
6 yahoo-google      2

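First, it splits the "keywords" column on ", " and then full-joins the result with itself on id. Second, it filters out the rows where both values of a pair are the same, since the OP is after pairs of different keywords. Third, it counts the number of occurrences of each pair. Finally, it creates an ordered label for each pair and keeps only the distinct rows based on that label.

Or the same using separate_rows():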

df %>%
 separate_rows(keywords) %>%
 full_join(df %>%
            separate_rows(keywords), by = c("id" = "id")) %>%
 filter(keywords.x != keywords.y) %>%
 count(keywords.x, keywords.y) %>%
 transmute(keywords = paste(pmax(keywords.x, keywords.y), pmin(keywords.x, keywords.y), sep = "-"),
           n) %>%
 distinct(keywords, .keep_all = TRUE)
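The ordering step shared by both pipelines above can be looked at in isolation. A small sketch, using made-up vectors x and y (hypothetical names, not from the answer), of how pmax() and pmin() give both orientations of a pair the same label:

```r
# Two orientations of the same pairs, as the self-join produces them
x <- c("air", "cookie", "google", "yahoo")
y <- c("cookie", "air", "yahoo", "google")

# pmax()/pmin() compare element-wise, so the alphabetically later word
# always comes first, whichever side of the join it started on
label <- paste(pmax(x, y), pmin(x, y), sep = "-")
label

# Both orientations collapse to a single label per pair
unique(label)
```

Because the self-join yields each pair twice with equal counts, keeping one row per label with distinct() loses no information.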

Answer 2: (score: 1)

We can do this easily with mtabulate:

library(qdapTools)
cbind(df[1],  mtabulate(strsplit(as.character(df$keywords), ", ")))
#  id air cookie google yahoo
#1  1   1      1      1     1
#2  2   1      1      0     0
#3  3   1      1      0     0
#4  4   0      0      1     0
#5  5   0      0      1     1
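An indicator table like this one also gives the pair frequencies directly, which covers the second half of the question. A base R sketch, assuming the df_binary_exist table from the question; the use of crossprod() and the names m, co and pair_freq are my own illustration rather than part of this answer:

```r
# Indicator table as given in the question
df_binary_exist <- data.frame(
  id = c(1, 2, 3, 4, 5),
  google = c(1, 0, 0, 1, 1),
  yahoo  = c(1, 0, 0, 0, 1),
  air    = c(1, 1, 1, 0, 0),
  cookie = c(1, 1, 1, 0, 0)
)

# crossprod(m) is t(m) %*% m: entry [i, j] counts the ids in which
# keyword i and keyword j both appear
m <- as.matrix(df_binary_exist[-1])  # drop the id column
co <- crossprod(m)

# Keep each unordered pair once: upper triangle, diagonal excluded
idx <- which(upper.tri(co), arr.ind = TRUE)
pair_freq <- data.frame(
  couple = paste(rownames(co)[idx[, "row"]], colnames(co)[idx[, "col"]], sep = "-"),
  freq = co[idx],
  stringsAsFactors = FALSE
)

# Most frequent couples first
pair_freq[order(-pair_freq$freq), ]
```

The diagonal of co holds each keyword's own frequency, so it is left out when listing couples.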