Question

I am new to R. Apologies in advance for asking a basic question.

I have a data frame that looks like this:

head(df)
  cr_id                      description       region       type   status
1     1 Grant system adminstrator rights         EMEA      audit approved
2     2     grant access to all products           UK     system  pending
3     3                 change in design Asia Pacific      audit approved
4     4                 change in design           UK regulatory  pending
5     5      More robust system required Asia Pacific     system  pending
6     6  Volume productivity for NA 2016           UK      audit approved

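Now suppose I capture a new description entered by the user in the variable new_cr. I can use the following to get the similarity between any two descriptions: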

library(fuzzywuzzyR)
init = SequenceMatcher$new(string1 = df$description, string2 = new_cr)
init$ratio()

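However, can anyone help me put this into a loop, or suggest another efficient approach, so that I get a list of all similar descriptions across the whole data frame whose ratio is above some threshold (0.8) for further processing?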

Answer 1

With a for loop you can do the following:

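# Preallocate one similarity score per row of df
ratios <- numeric(nrow(df))
for (ind in seq_len(nrow(df))) {
  # Compare each existing description against the new description
  init <- SequenceMatcher$new(string1 = df$description[ind], string2 = new_cr)
  ratios[ind] <- init$ratio()
}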

Another way to get ratios:

ratios <- sapply(df$description, function(x) 
                SequenceMatcher$new(string1 = x, string2 = new_cr)$ratio())

Now keep only the rows you need:

new_df <- df[which(ratios > 0.8), ]

If you only want the similar descriptions themselves, you can do the following:

df$description[ratios > 0.8]
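
The steps above could be wrapped into one reusable function. This is only a minimal sketch: the name similar_descriptions, its threshold argument, and the added ratio column are hypothetical and not part of fuzzywuzzyR. It assumes df has a description column and that fuzzywuzzyR and its Python dependencies are installed.

```r
library(fuzzywuzzyR)

# Hypothetical helper (not part of fuzzywuzzyR): returns the rows of df whose
# description has a similarity ratio to new_cr above the given threshold.
similar_descriptions <- function(df, new_cr, threshold = 0.8) {
  # as.character() guards against description being stored as a factor
  ratios <- vapply(
    as.character(df$description),
    function(x) SequenceMatcher$new(string1 = x, string2 = new_cr)$ratio(),
    numeric(1),
    USE.NAMES = FALSE
  )
  # which() drops any NA comparisons instead of returning NA rows
  keep <- which(ratios > threshold)
  out <- df[keep, , drop = FALSE]
  out$ratio <- ratios[keep]
  # Most similar descriptions first
  out[order(-out$ratio), , drop = FALSE]
}

# Example call, assuming df and new_cr exist as in the question:
# matches <- similar_descriptions(df, new_cr, threshold = 0.8)
```

vapply is used here instead of sapply so that the result is always a numeric vector, even when df has zero rows. If differences in letter case should not lower the score, one option is to pass both strings through tolower() before comparing.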
