R: extracting repeated words from strings

Date: 2016-09-27 05:15:02

Tags: r string duplicates

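I have two string columns, a and b, that make up my data. My goal is to get a new variable containing the words that are repeated across them.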

    a = c("the red house av", "the blue sky", "the green grass")
    b = c("the house built", " the sky of the city", "the grass in the garden")

    data = data.frame(a, b)

Based on this answer, I can use duplicated() to get logical flags marking which words are repeated:

    library(dplyr)
    data = data %>% mutate(c = paste(a, b, sep = " "),
                           d = vapply(lapply(strsplit(c, " "), duplicated), paste, character(1L), collapse = " "))

But I cannot get the words themselves. The data I want should look like this:

> data.1
                 a                       b         d
1 the red house av         the house built the house
2     the blue sky     the sky of the city   the sky
3  the green grass the grass in the garden the grass

Any help with the above would be highly appreciated.

2 answers:

Answer 0: (score: 5)

a = c("the red house av", "the blue sky", "the green grass")
b = c("the house built", " the sky of the city", "the grass in the garden")

data <-  data.frame(a, b, stringsAsFactors = FALSE)

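# For a one-row data frame, collect the words that occur in both a and b
func <- function(dta) {
    words <- intersect( unlist(strsplit(dta$a, " ")), unlist(strsplit(dta$b, " ")) )
    dta$c <- paste(words, collapse = " ")
    return( as.data.frame(dta, stringsAsFactors = FALSE) )
}

library(dplyr)
# Apply func to each row and bind the results back together
data %>% rowwise() %>% do( func(.) )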

Result:

#Source: local data frame [3 x 3]
#Groups: <by row>
#
## A tibble: 3 x 3
#                 a                       b         c
#*            <chr>                   <chr>     <chr>
#1 the red house av         the house built the house
#2     the blue sky     the sky of the city   the sky
#3  the green grass the grass in the garden the grass
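
The same intersect() idea can also be written without rowwise() and do(). The following is only a minimal sketch in base R: the helper name shared_words is hypothetical (it does not appear in the answer above), and the whitespace handling with trimws() and "\\s+" is an added assumption so that leading or doubled spaces do not produce empty tokens.

```r
# Hypothetical helper (not from the original answer): words found in both
# strings, kept in the order in which they first appear in x.
shared_words <- function(x, y) {
  # trimws() and the "\\s+" pattern avoid empty tokens caused by
  # leading or doubled spaces such as in " the sky of the city"
  wx <- strsplit(trimws(x), "\\s+")[[1]]
  wy <- strsplit(trimws(y), "\\s+")[[1]]
  paste(intersect(wx, wy), collapse = " ")
}

a <- c("the red house av", "the blue sky", "the green grass")
b <- c("the house built", " the sky of the city", "the grass in the garden")
data <- data.frame(a, b, stringsAsFactors = FALSE)

# mapply() walks both columns in parallel, one pair of strings per row
data$c <- mapply(shared_words, data$a, data$b, USE.NAMES = FALSE)
print(data)
```

Because the helper works on a pair of strings rather than on a one-row data frame, it can be reused inside mutate() or on plain character vectors.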

Answer 1: (score: 1)

Here is another attempt using base R (no packages needed):

df$c <- apply(df,1,function(x) 
               paste(Reduce(intersect, strsplit(x, " ")), collapse = " "))

#                  a                       b         c
# 1 the red house av         the house built the house
# 2     the blue sky     the sky of the city   the sky
# 3  the green grass the grass in the garden the grass

Data

df <- structure(list(a = c("the red house av", "the blue sky", "the green grass"
), b = c("the house built", " the sky of the city", "the grass in the garden"
)), .Names = c("a", "b"), row.names = c(NA, -3L), class = "data.frame")
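
For comparison, the duplicated() attempt from the question can be made to return words instead of logical flags by using the flags as an index. This is a minimal sketch under stated assumptions: the helper name repeated_words is hypothetical, and splitting on "\\s+" after trimws() is an added choice to cope with the doubled space that pasting a and b together can create.

```r
# Hypothetical helper (not from the question or answers): words that occur
# more than once in a single string.
repeated_words <- function(s) {
  w <- strsplit(trimws(s), "\\s+")[[1]]
  # duplicated() flags the second and later occurrences of each word;
  # indexing w with those flags returns the words themselves, and
  # unique() drops words that were repeated more than twice
  paste(unique(w[duplicated(w)]), collapse = " ")
}

a <- c("the red house av", "the blue sky", "the green grass")
b <- c("the house built", " the sky of the city", "the grass in the garden")

# Paste the two columns together row by row, then look for repeats
d <- vapply(paste(a, b), repeated_words, character(1L), USE.NAMES = FALSE)
print(d)
```

Unlike the intersect() approaches, this variant also reports a word that is repeated within one column only, so the two methods can differ on other data even though they agree on this example.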