Matching text between two different data frames in R

Time: 2015-07-25 20:48:37

Tags: r text matching

I have the following data in a data frame:

structure(list(`head(ker$text)` = structure(1:6, .Label = c("@_rpg_17 little league travel tourney. These parents about to be wild.", 
"@auscricketfan @davidwarner31 yes WI tour is coming soon", "@keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR", 
"@NWAWhatsup tour of duty in NWA considered a dismal assignment?  Companies send in their best ppl and then those ppl don't want to leave", 
"Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy", 
"Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO"
), class = "factor")), .Names = "head(ker$text)", row.names = c(NA, 
-6L), class = "data.frame")

I have another data frame that contains the hashtags extracted from the data frame above. It looks like this:

structure(list(destination = c("#topstation", "#destination", "#munnar", 
"#Kerala", "#Delhi", "#beach")), .Names = "destination", row.names = c(NA, 
6L), class = "data.frame")

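I want to create a new column in my first data frame that contains only the tags that match the second data frame. For example, the first row of df1 has no hashtags, so that cell in the new column would be blank. The third row, however, contains 4 hashtags, three of which match the second data frame. I have tried using the

str_match
str_extract

functions. I came very close to getting this with code given in one of the posts.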

new_col <- ker[unlist(lapply(destn$destination, agrep, ker$text)), ]

Although I understand that I get a list as output, I receive an error saying

replacement has 1472 rows, data has 644

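I tried setting max.distance to different values, and each one gave a different error. Can someone help me solve this? Another option I thought of is to put each hashtag in a separate column, but I am not sure whether that would help me analyze the data further together with other variables. The output I am looking for is as follows: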

text          new_col          new_col2    new_col3
statement1    
statement2
statement3    #destination     #munnar     #topstation
statement4
statement5    #Kerala
statement6    #Kerala
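
A likely reason for the row-count error, shown as a minimal sketch on toy data (the vectors `text` and `tags` are hypothetical stand-ins for `ker$text` and `destn$destination`): `agrep()` returns row indices per tag, and `unlist()` concatenates them, so the result has one entry per (row, tag) hit rather than one entry per row.

```r
# Toy data standing in for ker$text and destn$destination (hypothetical values).
text <- c("no tags here",
          "#munnar #topstation #Kerala",
          "#Kerala trip")
tags <- c("#topstation", "#munnar", "#Kerala", "#beach")

# For each tag, agrep() returns the indices of the elements of `text`
# that (approximately) contain that tag.
hits <- lapply(tags, agrep, text)

# unlist() concatenates the per-tag index vectors: a row shows up once per
# matching tag, and rows without a match do not show up at all.
idx <- unlist(hits)

length(idx)   # number of (row, tag) hits, not the number of rows
length(text)  # number of rows
```

Subsetting with `idx` therefore yields a data frame whose row count generally differs from the original, which would explain why it cannot be assigned back as a column.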

2 Answers:

Answer 0: (score: 0)

You can do this:

library(stringr)
results <- sapply(df$`head(ker$text)`, 
                  function(x) { str_match_all(x, paste(df2$destination, collapse = "|")) })

df$matches <- results

If you want to split the results into separate columns, you can use:

df <- cbind(df, do.call(rbind, lapply(results, `[`, 1:max(sapply(results, length)))))
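
One caveat with an alternation pattern built by `paste(..., collapse = "|")` is that it also matches a tag that is only the prefix of a longer hashtag. The following is a minimal base-R sketch of that effect and of one possible guard, a negative lookahead; the toy vectors `text` and `tags` are made up for illustration, and the tags are assumed to contain no regex metacharacters.

```r
# Hypothetical toy data: "#Keralatourism" should not count as "#Kerala".
text <- c("#Keralatourism rocks", "visit #Kerala, #munnar")
tags <- c("#Kerala", "#munnar")

# Plain alternation: matches "#Kerala" inside "#Keralatourism" as well.
loose <- paste(tags, collapse = "|")
loose_hits <- regmatches(text, gregexpr(loose, text))

# Require that no word character follows the tag, so only whole hashtags match.
strict <- paste0("(?:", loose, ")(?!\\w)")
strict_hits <- regmatches(text, gregexpr(strict, text, perl = TRUE))

loose_hits
strict_hits
```

The lookahead needs `perl = TRUE` in base R; with stringr the same pattern should work as is, since its regex engine supports lookaheads.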

Answer 1: (score: 0)

library(stringi);
m <- sapply(stri_extract_all(df1[[1]],regex='#\\w+'),function(x) x[x%in%df2[[1]]]);
df1 <- cbind(df1,do.call(rbind,lapply(m,`[`,1:max(sapply(m,length)))));
df1;
##                                                                                                                             head(ker$text)            1       2           3
## 1                                                                   @_rpg_17 little league travel tourney. These parents about to be wild.         <NA>    <NA>        <NA>
## 2                                                                                 @auscricketfan @davidwarner31 yes WI tour is coming soon         <NA>    <NA>        <NA>
## 3                                                       @keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR #destination #munnar #topstation
## 4 @NWAWhatsup tour of duty in NWA considered a dismal assignment?  Companies send in their best ppl and then those ppl don't want to leave         <NA>    <NA>        <NA>
## 5     Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy      #Kerala    <NA>        <NA>
## 6   Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO      #Kerala    <NA>        <NA>
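
For readers without stringi, the same extract-then-filter idea can be sketched in base R. This is an illustrative equivalent rather than the answerer's code: the small `df1` and `df2` below are made-up stand-ins, and the column names `new_col1`, `new_col2` are chosen here to mirror the layout asked for in the question.

```r
# Hypothetical stand-ins for the question's two data frames.
df1 <- data.frame(text = c("no tags",
                           "#favourite #destination #munnar",
                           "#Kerala-prime"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(destination = c("#destination", "#munnar", "#Kerala"),
                  stringsAsFactors = FALSE)

# Extract every hashtag per row, then keep only those listed in df2.
all_tags <- regmatches(df1$text, gregexpr("#\\w+", df1$text))
m <- lapply(all_tags, function(x) x[x %in% df2$destination])

# Pad each row with NA to a common width and bind the rows into a matrix.
width <- max(1L, lengths(m))
wide <- do.call(rbind, lapply(m, `[`, seq_len(width)))
colnames(wide) <- paste0("new_col", seq_len(width))

result <- cbind(df1, wide, stringsAsFactors = FALSE)
result
```

Using `lapply` for the filtering step keeps `m` a list even when every row happens to have the same number of matches, a case in which `sapply` would simplify the result to a vector or matrix.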

Edit: If you want a separate column for each tag:

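
One way to get one column per tag, as a minimal base-R sketch and not the answerer's original snippet: build a logical column for every tag in `df2`, set to TRUE where the row's text contains that tag. The data frames below are made-up stand-ins, and the choice of logical flags (rather than repeating the tag text) is an assumption.

```r
# Hypothetical stand-ins for the question's two data frames.
df1 <- data.frame(text = c("no tags",
                           "#destination #munnar",
                           "#Kerala-prime"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(destination = c("#destination", "#munnar", "#Kerala"),
                  stringsAsFactors = FALSE)

# All hashtags found in each row.
found <- regmatches(df1$text, gregexpr("#\\w+", df1$text))

# One logical column per tag: TRUE where that row's text contains the tag.
flags <- sapply(df2$destination, function(tag) {
  vapply(found, function(x) tag %in% x, logical(1))
})

per_tag <- cbind(df1, flags)
per_tag
```

Because the column names start with `#`, they have to be accessed with backticks or double brackets, for example `per_tag[["#munnar"]]`.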