I have a very large table (1,000,000 x 20) to process, and it needs to run quickly.
For example, my table has two columns, X2 and X3:
   X1 X2                                       X3
c1  1 100020003001, 100020003002, 100020003003 100020003001, 100020003002, 100020003004
c2  2 100020003001, 100020004002, 100020004003 100020003001, 100020004007, 100020004009
c3  3 100050006003, 100050006001, 100050006001 100050006011, 100050006013, 100050006021
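Now I want to create 2 new columns containing:
1) the common words, i.e. the numbers that appear in both X2 and X3
e.g.: [1] "100020003001" "100020003002"
2) the count of those common words
e.g.: [1] 2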
I tried the approach from the thread below, but because I applied it row by row in a for loop, it is very slow:
Count common words in two strings
library(stringi)
Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+"))
Thanks for your help! I'm really struggling here...
Answer 0 (score: 1)
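We can split the 'X2' and 'X3' columns on ", " with strsplit, get the intersect of the corresponding list elements with map2, and use lengths to count the number of elements in each element of the resulting list.
library(tidyverse)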
df1 %>%
mutate(common_words = map2(strsplit(X2, ", "),
strsplit(X3, ", "),
intersect),
count = lengths(common_words))
# X1 X2 X3
#1 1 100020003001, 100020003002, 100020003003 100020003001, 100020003002, 100020003004
#2 2 100020003001, 100020004002, 100020004003 100020003001, 100020004007, 100020004009
#3 3 100050006003, 100050006001, 100050006001 100050006011, 100050006013, 100050006021
# common_words count
#1 100020003001, 100020003002 2
#2 100020003001 1
#3 0
Or using base R:
df1$common_words <- Map(intersect, strsplit(df1$X2, ", "), strsplit(df1$X3, ", "))
df1$count <- lengths(df1$common_words)
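For anyone who wants to try the base R version end to end, here is a minimal self-contained sketch. The data frame is rebuilt from the sample rows in the question. The `fixed = TRUE` argument and the final `vapply` step, which collapses the list column into a plain character column, are additions that are not part of the original answer, and the names `x2_parts`, `x3_parts` and `common` are only illustrative.

```r
# Sample data rebuilt from the question. stringsAsFactors = FALSE keeps X2 and
# X3 as character vectors, which strsplit() requires.
df1 <- data.frame(
  X1 = 1:3,
  X2 = c("100020003001, 100020003002, 100020003003",
         "100020003001, 100020004002, 100020004003",
         "100050006003, 100050006001, 100050006001"),
  X3 = c("100020003001, 100020003002, 100020003004",
         "100020003001, 100020004007, 100020004009",
         "100050006011, 100050006013, 100050006021"),
  stringsAsFactors = FALSE
)

# Split each column once for all rows, then intersect the pieces row by row.
# fixed = TRUE treats ", " as a literal separator instead of a regex.
x2_parts <- strsplit(df1$X2, ", ", fixed = TRUE)
x3_parts <- strsplit(df1$X3, ", ", fixed = TRUE)
common <- Map(intersect, x2_parts, x3_parts)

# Optional: collapse each set of common values into a single string so the
# column is an ordinary character vector rather than a list.
df1$common_words <- vapply(common, paste, character(1), collapse = ", ")
df1$count <- lengths(common)

print(df1[, c("X1", "common_words", "count")])
```

Keeping `common` as a list (as in the answer above) is more convenient for further set operations, while the collapsed string form may be easier to write out to a file. Which one is faster on a 1,000,000-row table would need to be measured on the real data.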