Question

I need to search for a list of words in another list of phrases. I am currently using str_detect:

Example (50 and 53 rows): this finds a name only in the first and last rows of catalog and stores it in the name column:

rm(list = ls())
library(stringr)
library(stringdist)
library(data.table)

list <- data.table(listofnames = c("Hedy", "Eloise", "Lakeshia",  "Coleen", "Tawny", "Yolando", "Alida", "Jin", "Brigida", "Wendell",  "Elissa", "Evangeline", "Madison", "Napoleon", "Norah", "Mariana",  "Ella", "Marissa", "Jan", "Anya", "Eleanor", "Roderick", "Gillian",  "Carla", "Melva", "Tommie", "Eliana", "Cristal", "Hui", "Alycia",  "Vonnie", "Lala", "Cleveland", "Barbera", "Rosetta", "Meg", "Divina",  "Christy", "Dia", "Edna", "Foster", "Pa", "Tennille", "Renato",  "Ethelene", "Annemarie", "Jazmine", "Adela", "Aleida", "Alyse"))
catalog <- data.table(name ="", msg = c(
  "The turn solicits Foster the wasteful metal.", "The comfort licenses the river.",
  "The well-made stone evaluates the noise.", "The page indexs the amazing peace.",
  "The note drafts the gold.", "The taste exchanges the deranged thing.",
  "The snobbish reason compiles the roll.", "The structure installs the current.",
  "The letter broadens the wide winter.", "The lackadaisical argument comforts the detail.",
  "The fear nurses the learned fiction.", "The heat convinces the luxuriant soup.",
  "The long-term edge tends the competition.", "The puzzled stretch formulates the glass.",
  "The disease interprets the utter morning.", "The abashed country gauges the size.",
  "The steam adapts the mountainous burst.", "The tacit color derives the prose.",
  "The way exchanges the slim cough.", "The moldy force ranks the room.",
  "The river discovers the expert.", "The devilish experience converts the development.",
  "The lewd weather directs the friend.", "The thought furnishs the half stone.",
  "The tart degree minimizes the doubt.", "The deadpan color exercises the protest.",
  "The point inspires the shock.", "The damp expansion acts the ice.",
  "The overconfident judge dealt withs the secretary.", "The food relates the tacit market.",
  "The doubt troubleshots the scintillating smile.", "The ink inventorys the pale invention.",
  "The kindly competition directs the error.", "The feigned doubt writes the sand.",
  "The kick pilots the expert.", "The meal nurses the delightful morning.",
  "The form traces the seat.", "The reward conveys the loss.",
  "The belief troubleshots the building.", "The growth details the mountain.",
  "The ambiguous kick centralizes the crack.", "The system programs the wacky morning.",
  "The paste rehabilitates the gainful night.", "The jumpy silver experiments the driving.",
  "The silk maximizes the trouble.", "The testy doubt qualifys the level.",
  "The journey revitalizes the military decision.", "The cough demonstrates the pleasure.",
  "The high-pitched debt employs the argument.", "The noxious credit chairs the slip.",
  "The lift Renato monitors Tennille the daughter.", "The fight insures the gratis sound.",
  "The zesty Annemarie credit navigates the mother."))

distnames <- as.character(sort(unique(list$listofnames[list$listofnames != ""])))
for(i in 1:length(catalog$msg)){
  names <- str_detect(catalog$msg[i], distnames)
  if (sum(names == TRUE) == 1){
    catalog$name[i] <- distnames[which(names == TRUE)]
  }
}

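The problem is that this is too slow compared with grep. I cannot run a foreach over the names instead, because there are multiple messages (msg), and I would also have to handle this case: if a name has already been stored for a message and another name is found, remove it. If 2 or more names are found in a message, I do not want to save anything (that is the IF in the code).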

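I do not know whether data.table has a function like str_detect that returns only the indices of the TRUE values. I think that would speed the process up a little, because it would not have to return an array of a million TRUE or FALSE values that then has to be searched.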

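This example runs fast, but my list of names has 7 million rows, so I do not think building a single pattern with paste is an option, and with a single pattern I could not tell which name was found. catalog has 5 million rows.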

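Each time it is created, the names variable holds 50 TRUE or FALSE values. I am looking for something faster that returns only the indices of the TRUE matches, for example a vector with the value 34, meaning that the match is distnames[34].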
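To illustrate the kind of result being asked for: wrapping the logical vector in `which()` keeps only the positions of the TRUE values. This is only a sketch with a shortened names vector taken from the example data. It still computes the full logical vector internally, so it shows the desired output, not a speed-up.

```r
library(stringr)

# Shortened stand-in for distnames (sorted, as in the question)
distnames <- c("Annemarie", "Foster", "Renato", "Tennille")
msg <- "The turn solicits Foster the wasteful metal."

# which() reduces the logical vector to the positions of its TRUE values
hits <- which(str_detect(msg, distnames))
hits             # 2
distnames[hits]  # "Foster"
```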

Answer 1

Depending on the size of your names vector, you could build one large regular expression from it:

names <-  c("Joss", "Mery", "Manson", "Tom")

catalog <- c("My name is Joss", "I hate you", 
  "I cant see if her name its Mery or Manson")

library(stringr)

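# Join all names into one alternation pattern: "Joss|Mery|Manson|Tom"
regexpr <- paste0(names, collapse="|")
# For each catalog entry, extract every name that occurs in it
matches <- str_extract_all(catalog, regexpr) 

# Keep a name only where exactly one match was found
sel <- sapply(matches, function(d) length(d) == 1)
matched_names <- sapply(matches, function(d) d[1])
matched_names[!sel] <- NA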

I do not know how this performs compared with your loop.
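To get a rough answer on your own data, both approaches can be timed side by side. The following is only a sketch under assumptions: the helper names `match_loop` and `match_regex` are made up for illustration, the `\\b` word boundaries are an addition (so that a short name such as "Pa" cannot match inside a longer word), and the names are assumed to contain no regex metacharacters.

```r
library(stringr)

# Small stand-ins for the real data (names and messages from the question)
distnames <- c("Annemarie", "Foster", "Pa", "Renato", "Tennille")
msgs <- c("The turn solicits Foster the wasteful metal.",
          "The paste rehabilitates the gainful night.",
          "The lift Renato monitors Tennille the daughter.",
          "The zesty Annemarie credit navigates the mother.")

# Hypothetical helper: loop as in the question, one str_detect() call per message
match_loop <- function(msgs, distnames) {
  patterns <- paste0("\\b", distnames, "\\b")
  out <- rep(NA_character_, length(msgs))
  for (i in seq_along(msgs)) {
    hit <- which(str_detect(msgs[i], patterns))
    if (length(hit) == 1) out[i] <- distnames[hit]
  }
  out
}

# Hypothetical helper: single alternation pattern, vectorised over all messages.
# Unlike the loop, the same name occurring twice in one message counts as
# two matches here and therefore yields NA.
match_regex <- function(msgs, distnames) {
  pattern <- paste0("\\b(?:", paste0(distnames, collapse = "|"), ")\\b")
  matches <- str_extract_all(msgs, pattern)
  sel <- lengths(matches) == 1
  out <- rep(NA_character_, length(msgs))
  out[sel] <- vapply(matches[sel], function(m) m[1], character(1))
  out
}

# Repeat the messages to get a measurable run time, then time both versions
big <- rep(msgs, 2500)
system.time(res_loop <- match_loop(big, distnames))
system.time(res_regex <- match_regex(big, distnames))
identical(res_loop, res_regex)
```

The timings depend heavily on the number of names, so it is worth rerunning this with a realistic sample of both tables before drawing conclusions.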

Text matching in R - foreach with a word vector (data.table$row[i])
