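R: How can I quickly extract the common words or identical numbers from 2 columns of a large table?

Time: 2018-09-03 02:28:16

Tags: r dataframe data-science

I have a large table (1,000,000 x 20) to process, and it needs to be done quickly.

For example, my table has 2 columns, X2 and X3: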


    X1  X2                                          X3
c1  1   100020003001, 100020003002, 100020003003    100020003001, 100020003002, 100020003004
c2  2   100020003001, 100020004002, 100020004003    100020003001, 100020004007, 100020004009
c3  3   100050006003, 100050006001, 100050006001    100050006011, 100050006013, 100050006021

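Now I want to create 2 new columns containing:

1) the common words or identical numbers

For example: [1] "100020003001" "100020003002"

2) the count of common words or identical numbers

For example: [1] 2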

I tried the approach from the thread below, but because I applied it row by row in a for loop, processing is slow:

Count common words in two strings

 library(stringi)
 Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+"))

Thanks for your help! I am really struggling here...

1 Answer:

Answer 0: (score: 1)

We can split the 'X2' and 'X3' columns on ", ", use map2 with intersect to get the elements common to each pair of corresponding list elements, and then use lengths to count the number of elements in each list element:

    library(tidyverse)
    df1 %>%
      mutate(common_words = map2(strsplit(X2, ", "), strsplit(X3, ", "), intersect),
             count = lengths(common_words))
    #  X1                                       X2                                       X3
    #1  1 100020003001, 100020003002, 100020003003 100020003001, 100020003002, 100020003004
    #2  2 100020003001, 100020004002, 100020004003 100020003001, 100020004007, 100020004009
    #3  3 100050006003, 100050006001, 100050006001 100050006011, 100050006013, 100050006021
    #                common_words count
    #1 100020003001, 100020003002     2
    #2               100020003001     1
    #3                                0
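To make the pairwise step concrete, here is a minimal sketch that works through a single row by hand. The strings are copied from row c1 of the question's table; the variable names (`x2`, `x3`, `ids2`, `ids3`, `common`) are only illustrative.

```r
# Values copied from row c1 of the question's table.
x2 <- "100020003001, 100020003002, 100020003003"
x3 <- "100020003001, 100020003002, 100020003004"

# strsplit() returns a list with one character vector per input string,
# so [[1]] pulls out the vector for this single string.
ids2 <- strsplit(x2, ", ")[[1]]
ids3 <- strsplit(x3, ", ")[[1]]

# intersect() keeps the values present in both vectors, without duplicates.
common <- intersect(ids2, ids3)
common          # "100020003001" "100020003002"
length(common)  # 2
```

Applied to whole columns, the same split and intersect steps run once per row, and lengths plays the role of length for every row at once.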

Or, using base R:

    df1$common_words <- Map(intersect, strsplit(df1$X2, ", "), strsplit(df1$X3, ", "))
    df1$count <- lengths(df1$common_words)
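The base R code assumes a data frame `df1` whose X2 and X3 columns are character vectors. The sketch below rebuilds such a data frame from the question's example table (an assumed reconstruction, not the answerer's original data) and applies the same approach with two optional tweaks that are not part of the answer: `fixed = TRUE`, which splits on the literal ", " rather than a regular expression and often speeds up splitting, and `vapply` with `paste`, which flattens the list of matches into an ordinary character column. The speed gain is worth benchmarking on the full table before relying on it.

```r
# Assumed reconstruction of df1 from the question's example table.
# stringsAsFactors = FALSE keeps X2/X3 as character, which strsplit() requires.
df1 <- data.frame(
  X1 = 1:3,
  X2 = c("100020003001, 100020003002, 100020003003",
         "100020003001, 100020004002, 100020004003",
         "100050006003, 100050006001, 100050006001"),
  X3 = c("100020003001, 100020003002, 100020003004",
         "100020003001, 100020004007, 100020004009",
         "100050006011, 100050006013, 100050006021"),
  row.names = c("c1", "c2", "c3"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats ", " as a literal separator instead of a regex.
s2 <- strsplit(df1$X2, ", ", fixed = TRUE)
s3 <- strsplit(df1$X3, ", ", fixed = TRUE)

# Row-wise intersection: one character vector of shared values per row.
common <- Map(intersect, s2, s3)
df1$count <- lengths(common)

# Collapse each vector of matches into one string so the column is plain
# character; rows with no match become the empty string "".
df1$common_words <- vapply(common, paste, character(1), collapse = ", ")
```

Keeping `common` as a list is convenient if the matches are processed further; the collapsed string column is easier to print or write to a file.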