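I want to look up information in one column based on another column. One column holds some words (company names) and the other holds full sentences, and I want to know whether each word can be found in its sentence. The spelling sometimes differs, so I can't use the SQL LIKE function. I think fuzzy matching combined with some kind of LIKE-style search would be useful, because the data looks like this: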
Names           Sentences
Airplanes Sarl  Airplanes-Sàrl is part of Airplanes-Group Sarl.
Kidco Ltd.      100% ownership of Kidco.Ltd. is the mother company.
Popsi Co.       Cola Inc. is 50% share of PopsiCo which is part of LaLo.
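The data has about 2,000 rows. I need logic that decides whether Airplanes Sarl really is in the sentence, and that also works for Kidco Ltd., which appears in its sentence as 'Kidco.Ltd'.
To simplify the problem: I don't need to search every sentence in the column. It only needs to take a name such as Kidco Ltd. and search for it in the same row of the data frame.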
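I have already tried this in Python: df.apply(lambda s: fuzz.ratio(s['Names'], s['Sentences']), axis=1)
But I got a lot of unicode/ascii errors, so I gave up and would like to try R instead. Any suggestions on how to do this in R? The answers I found on Stack Overflow fuzzy-match a name against all the sentences in the column, which is different from what I want.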
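For reference, the row-wise comparison described above can be sketched in Python with the standard library alone. This is only an illustration under a few assumptions: it uses `difflib` rather than the `fuzz` package from the attempt, strips accents before comparing (which sidesteps mixing accented and plain text), and slides a window the length of the name across the sentence. The helper names `normalize` and `best_window_ratio` are made up here, and any cutoff applied to the score would have to be tuned on the real data.

```python
import difflib
import unicodedata


def normalize(text):
    """Lowercase the text and strip accents (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def best_window_ratio(name, sentence):
    """Best difflib similarity between the name and any same-length slice of the sentence."""
    name, sentence = normalize(name), normalize(sentence)
    width = len(name)
    if width == 0:
        return 0.0
    # If the sentence is shorter than the name, compare against the whole sentence once.
    last_start = max(len(sentence) - width, 0)
    return max(
        difflib.SequenceMatcher(None, name, sentence[start:start + width]).ratio()
        for start in range(last_start + 1)
    )


rows = [
    ("Airplanes Sarl", "Airplanes-Sàrl is part of Airplanes-Group Sarl."),
    ("Kidco Ltd.", "100% ownership of Kidco.Ltd. is the mother company."),
    ("Popsi Co.", "Cola Inc. is 50% share of PopsiCo which is part of LaLo."),
]
for name, sentence in rows:
    print(name, round(best_window_ratio(name, sentence), 2))
```

With a pandas data frame, the same function could be applied per row in the style of the original attempt, for example `df.apply(lambda s: best_window_ratio(s['Names'], s['Sentences']), axis=1)`.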
Answer 0 (score: 2)
Maybe try tokenization + phonetic matching:
library(RecordLinkage)
library(quanteda)
df <- read.table(header=T, sep=";", text="
Names ;Sentences
Airplanes Sarl ;Airplanes-Sàrl is part of Airplanes-Group Sarl.
Kidco Ltd. ;Airplanes-Sàrl is part of Airplanes-Group Sarl.
Kidco Ltd. ;100% ownership of Kidco.Ltd. is the mother company.
Popsi Co. ;Cola Inc. is 50% share of PopsiCo which is part of LaLo.
Popsi Co. ;Cola Inc. is 50% share of Popsi Co which is part of LaLo.")
f <- soundex
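# Note: tokenize() comes from older quanteda releases; newer releases provide tokens() and tokens_ngrams() instead.
tokens <- tokenize(as.character(df$Sentences), ngrams = 1:2) # 2-grams to catch "Popsi Co"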
tokens <- lapply(tokens, f)
mapply(is.element, soundex(df$Names), tokens)
#  A614  K324  K324  P122  P122
#  TRUE FALSE  TRUE  TRUE  TRUE
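The same idea (turn the sentence into 1-grams and 2-grams, then compare phonetic codes) can be sketched in Python with the standard library only. This is a simplified Soundex written for illustration, not the `RecordLinkage` implementation, and it splits sentences on whitespace instead of using quanteda's tokenizer, so results may differ from the R output on other inputs. `simple_soundex` and `phonetic_match` are hypothetical helper names.

```python
import unicodedata

# Letter-to-digit table of classic Soundex; vowels, H, W and Y have no digit.
SOUNDEX_DIGITS = {}
for digit, letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                       ("4", "L"), ("5", "MN"), ("6", "R")):
    for letter in letters:
        SOUNDEX_DIGITS[letter] = digit


def simple_soundex(text):
    """Four-character Soundex code; everything except the letters A-Z is ignored."""
    decomposed = unicodedata.normalize("NFKD", text).upper()
    letters = [ch for ch in decomposed if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_DIGITS.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_DIGITS.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters that share a digit
            prev = digit
    return (code + "000")[:4]


def phonetic_match(name, sentence, max_ngram=2):
    """True if the name's code equals the code of any word n-gram in the sentence."""
    words = sentence.split()
    codes = set()
    for n in range(1, max_ngram + 1):
        for i in range(len(words) - n + 1):
            codes.add(simple_soundex(" ".join(words[i:i + n])))
    codes.discard("")  # tokens without letters, such as "100%"
    return simple_soundex(name) in codes


print(phonetic_match("Kidco Ltd.", "100% ownership of Kidco.Ltd. is the mother company."))
print(phonetic_match("Kidco Ltd.", "Airplanes-Sàrl is part of Airplanes-Group Sarl."))
```

Because a Soundex code keeps only the first letter and three digits, long names that share a prefix collapse to the same code, so this check is permissive by design.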
Answer 1 (score: 1)
Here is a solution using the approach I suggested in the comments; it works well on this example:
library("stringdist")
df <- as.data.frame(matrix(c("Airplanes Sarl","Airplanes-Sàrl is part of Airplanes-Group Sarl.",
"Kidco Ltd.","100% ownership of Kidco.Ltd. is the mother company.",
"Popsi Co.","Cola Inc. is 50% share of PopsiCo which is part of LaLo.",
"some company","It is a truth universally acknowledged...",
"Hello world",list(NULL)),
ncol=2,byrow=TRUE,dimnames=list(NULL,c("Names","Sentences"))),stringsAsFactors=FALSE)
null_elements <- which(sapply(df$Sentences,is.null))
df$Sentences[null_elements] <- "" # replacing NULLs to avoid errors
df$dist <- mapply(stringdist,df$Names,df$Sentences)
df$n2 <- nchar(df$Sentences)
df$n1 <- nchar(df$Names)
df$match_quality <- df$dist-(df$n2-df$n1)
cutoff <- 2
df$match <- df$match_quality <= cutoff
df$Sentences[null_elements] <- list(NULL) # setting null elements back to initial value
df$match[null_elements] <- NA # optional, set to FALSE otherwise, as it will prevent some false positives if Names is shorter than cutoff
#            Names                                                Sentences dist n2 n1 match_quality match
# 1 Airplanes Sarl          Airplanes-Sàrl is part of Airplanes-Group Sarl.   33 47 14             0  TRUE
# 2     Kidco Ltd.      100% ownership of Kidco.Ltd. is the mother company.   42 51 10             1  TRUE
# 3      Popsi Co. Cola Inc. is 50% share of PopsiCo which is part of LaLo.   48 56  9             1  TRUE
# 4   some company                It is a truth universally acknowledged...   36 41 12             7 FALSE
# 5    Hello world                                                     NULL   11  0 11            22    NA
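A note on why subtracting the length difference works: turning the short name into the long sentence takes at least n2 - n1 insertions, so dist - (n2 - n1) counts only the edits beyond that minimum. The Python sketch below recomputes that quantity for the first four rows, assuming that plain Levenshtein distance is an acceptable stand-in for the default method of `stringdist` (the two can differ when adjacent characters are swapped). `levenshtein` and `match_quality` are illustrative names, not part of any library.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def match_quality(name, sentence):
    """Edits needed beyond the insertions forced by the length difference."""
    return levenshtein(name, sentence) - (len(sentence) - len(name))


rows = [
    ("Airplanes Sarl", "Airplanes-Sàrl is part of Airplanes-Group Sarl."),
    ("Kidco Ltd.", "100% ownership of Kidco.Ltd. is the mother company."),
    ("Popsi Co.", "Cola Inc. is 50% share of PopsiCo which is part of LaLo."),
    ("some company", "It is a truth universally acknowledged..."),
]
cutoff = 2  # same cutoff as in the R code above
for name, sentence in rows:
    quality = match_quality(name, sentence)
    print(name, quality, quality <= cutoff)
```

One caveat that follows from this: the score is 0 whenever the characters of the name appear in the right order anywhere in the sentence, even if they are scattered rather than contiguous, so short names tested against long sentences can produce false positives.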