Match words in two files and extract the matched words

Time: 2015-02-05 14:21:01

Tags: regex r pattern-matching

I have the following data frame:

dataFrame <- data.frame(sent = c(1,1,2,2,3,3,3,4,5), word = c("good printer", "wireless easy", "just right size",
                                                          "size perfect weight", "worth price", "website great tablet",
                                                          "pan nice tablet", "great price", "product easy install"), val = c(1,2,3,4,5,6,7,8,9))

The data frame "dataFrame" looks like this:

sent                word  val
  1         good printer   1
  1        wireless easy   2
  2      just right size   3
  2  size perfect weight   4
  3          worth price   5
  3 website great tablet   6
  3      pan nice tablet   7
  4          great price   8
  5 product easy install   9

Then I have a vector of words:

nouns <- c("printer", "wireless", "weight", "price", "tablet")

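I need to extract only these words (nouns) from dataFrame, and only the extracted words should be added to a new column of dataFrame (e.g. extract).

I would greatly appreciate any help or suggestions. Many thanks in advance.

Desired output: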

  sent                word  val   extract
    1         good printer   1    printer
    1        wireless easy   2    wireless
    2      just right size   3    size
    2  size perfect weight   4    weight
    3          worth price   5    price
    3 website great tablet   6    tablet
    3      pan nice tablet   7    tablet
    4          great price   8    price
    5 product easy install   9    remove this row (no match)

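3 answers:

Answer 0: (score: 2)

Here is a simple solution using the stringi package (note that size appears in your desired output but is not in the nouns list).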

library(stringi)
transform(dataFrame, 
          extract = stri_extract_all(word, 
          regex = paste(nouns, collapse = "|"), 
          simplify = TRUE))

#   sent                 word val  extract
# 1    1         good printer   1  printer
# 2    1        wireless easy   2 wireless
# 3    2      just right size   3     <NA>
# 4    2  size perfect weight   4   weight
# 5    3          worth price   5    price
# 6    3 website great tablet   6   tablet
# 7    3      pan nice tablet   7   tablet
# 8    4          great price   8    price
# 9    5 product easy install   9     <NA>
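If the rows without a match should also be dropped, as in the desired output, a base R variant could look like the sketch below. It is only one possible approach, not part of the stringi answer: it assumes each row contains at most one noun of interest (only the first match is kept), it wraps the alternation in word boundaries so that only whole words match, and the names `pattern`, `m` and `result` are introduced here purely for illustration.

```r
# Same data as in the question, with `word` kept as character
dataFrame <- data.frame(
  sent = c(1, 1, 2, 2, 3, 3, 3, 4, 5),
  word = c("good printer", "wireless easy", "just right size",
           "size perfect weight", "worth price", "website great tablet",
           "pan nice tablet", "great price", "product easy install"),
  val = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  stringsAsFactors = FALSE
)
nouns <- c("printer", "wireless", "weight", "price", "tablet")

# Alternation wrapped in word boundaries: \b(printer|wireless|...)\b
pattern <- paste0("\\b(", paste(nouns, collapse = "|"), ")\\b")

# regexpr() locates the first match in each string (-1 when there is none)
m <- regexpr(pattern, dataFrame$word, perl = TRUE)

# regmatches() returns the matched text only for strings that matched,
# so fill those positions and leave the rest as NA
extract <- rep(NA_character_, nrow(dataFrame))
extract[m != -1] <- regmatches(dataFrame$word, m)

dataFrame$extract <- extract

# Keep only the rows where a noun was found
result <- dataFrame[!is.na(dataFrame$extract), ]
```

With the sample data, `result` keeps rows 1, 2, 4, 5, 6, 7 and 8; rows 3 and 9 are dropped because none of their words is in `nouns`.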

Answer 1: (score: 0)

Here is another solution. It is a bit more complicated, but it also removes the rows where there is no match between nouns and dataFrame$word.

# Only base R functions are used below, so no extra package is required.
dataFrame <- data.frame(sent = c(1,1,2,2,3,3,3,4,5),
                        word = c("good printer", "wireless easy", "just right size",
                                 "size perfect weight", "worth price", "website great tablet",
                                 "pan nice tablet", "great price", "product easy install"),
                        val = c(1,2,3,4,5,6,7,8,9))

nouns <- c("printer", "wireless", "weight", "price", "tablet")

test <- character()   # matched noun for each row that is kept
df.del <- integer()   # indices of rows without any match

for (i in seq_len(nrow(dataFrame))) {
    # nouns that occur as whole words in row i
    hits <- intersect(nouns, unlist(strsplit(as.character(dataFrame$word[i]), " ")))
    if (length(hits) == 0) {
        df.del <- c(df.del, i)
    } else {
        # keep only the first hit so that every kept row gets exactly one value
        test <- c(test, hits[1])
    }
}

# Drop the unmatched rows; the guard avoids dropping everything when all rows matched
if (length(df.del) > 0) dataFrame <- dataFrame[-df.del, ]
dataFrame$extract <- test

Output:

  sent                 word val  extract
1    1         good printer   1  printer
2    1        wireless easy   2 wireless
4    2  size perfect weight   4   weight
5    3          worth price   5    price
6    3 website great tablet   6   tablet
7    3      pan nice tablet   7   tablet
8    4          great price   8    price
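The same split-and-intersect idea can also be written without an explicit loop. The following is a minimal sketch rather than a drop-in replacement: it assumes that `word` is a character column, that words are separated by single spaces, and that only the first matching noun per row is wanted. The names `tokens`, `first_noun` and `result` are hypothetical and chosen here for illustration.

```r
dataFrame <- data.frame(
  sent = c(1, 1, 2, 2, 3, 3, 3, 4, 5),
  word = c("good printer", "wireless easy", "just right size",
           "size perfect weight", "worth price", "website great tablet",
           "pan nice tablet", "great price", "product easy install"),
  val = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  stringsAsFactors = FALSE
)
nouns <- c("printer", "wireless", "weight", "price", "tablet")

# Split every entry into its words (one character vector per row)
tokens <- strsplit(dataFrame$word, " ", fixed = TRUE)

# For each row, take the first noun that appears among its words, or NA
first_noun <- vapply(tokens, function(tok) {
  hit <- intersect(nouns, tok)
  if (length(hit) == 0) NA_character_ else hit[1]
}, character(1))

dataFrame$extract <- first_noun

# Remove the rows that contain none of the nouns
result <- dataFrame[!is.na(first_noun), ]
```

Because `intersect()` compares whole tokens, a noun such as "price" would not be picked up from a longer word like "priceless".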

Answer 2: (score: 0)

Here is another solution using for loops and an if statement.

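word <- dataFrame$word
n <- length(word)
m <- length(nouns)
# Every row starts as "remove"; rows with a match are overwritten below
extract <- rep("remove", n)

for (i in seq_len(n)) {
  g <- as.character(word[i])
  for (j in seq_len(m)) {
    # grepl() uses nouns[j] as a regular expression and matches substrings
    if (grepl(nouns[j], g)) {
      extract[i] <- nouns[j]  # if several nouns match, the last one wins
    }
  }
}

dataFrame$extract <- extract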

#  sent                 word val  extract
#1    1         good printer   1  printer
#2    1        wireless easy   2 wireless
#3    2      just right size   3   remove
#4    2  size perfect weight   4   weight
#5    3          worth price   5    price
#6    3 website great tablet   6   tablet
#7    3      pan nice tablet   7   tablet
#8    4          great price   8    price
#9    5 product easy install   9   remove
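One thing to keep in mind with grepl() is that it matches substrings, so a noun can also be found inside a longer word. The short sketch below illustrates the difference with a made-up string ("priceless gift" is not part of the question's data) and shows how word boundaries (`\\b`) restrict the match to whole words. The variable names are hypothetical.

```r
# Plain pattern: "price" is found inside "priceless"
substring_hit <- grepl("price", "priceless gift")

# With word boundaries, only the standalone word "price" matches
whole_word_miss <- grepl("\\bprice\\b", "priceless gift", perl = TRUE)
whole_word_hit  <- grepl("\\bprice\\b", "great price", perl = TRUE)

print(c(substring_hit, whole_word_miss, whole_word_hit))
# TRUE FALSE TRUE
```

In the loop above, the same effect could be obtained by building the pattern as paste0("\\b", nouns[j], "\\b") before calling grepl().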