Calculate the transpositions needed on a string so that it can be found in another string

Time: 2015-04-03 16:53:09

Tags: r string-comparison string-matching stringdist

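Here is what I am trying to do: when the term I am analysing is "apples", I want to know how many transpositions "apples" needs so that it can be found in a string.

"buy apples now" => 0 transpositions needed (apples is present).

"cheap aples online" => 1 transposition needed (apples to aples).

"find your ap ple here" => 2 transpositions needed (apples to ap ple).

"aple" => 2 transpositions needed (apples to aple).

"bananas" => 5 transpositions needed (apples to bananas).

The stringdist and adist functions don't work here, because they tell me how many transpositions are needed to transform one whole string into the other. Anyway, here is what I have written so far: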

#build matrix
a <- c(rep("apples",5),rep("bananas",3))
b <- c("buy apples now","cheap aples online","find your ap ple here","aple","bananas","cherry and bananas","pumpkin","banana split")
d<- data.frame(a,b)
colnames(d)<-c("term","string")

#count transpositions needed
d$transpositions <- mapply(adist,d$term,d$string)
print(d)
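
As a minimal illustration of why the whole-string distance overshoots: when the term is contained in the string, every surrounding character still counts as one edit. The variable names below are made up for this sketch; adist() comes from the utils package, which R attaches by default.

```r
# adist() compares the two strings in full, so the characters around
# the term are counted as edits even though the term itself is present.
term   <- "apples"
string <- "buy apples now"

whole <- drop(adist(term, string))      # whole-string edit distance
extra <- nchar(string) - nchar(term)    # characters surrounding the term

print(whole)
print(extra)
```

Here both numbers come out equal, because the only edits needed are the insertions of the surrounding characters.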

2 answers:

Answer 0: (score: 0)

You need to check whether the term (e.g. apples) is present first, and only then count the transpositions

a <- c(rep("apples",5),rep("bananas",3))
b <- c("buy apples now","cheap aples online","find your ap ple here","aple","bananas","cherry and bananas","pumpkin","banana split")
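d <- data.frame(a, b, stringsAsFactors = FALSE)
colnames(d) <- c("term", "string")

#check first whether each row's term occurs verbatim in its string
d$found <- mapply(grepl, d$term, d$string, MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

#count transpositions needed (0 when the term is already present)
d$transpositions <- ifelse(d$found, 0, mapply(adist, d$term, d$string))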
print(d)
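
A possible alternative, offered only as a hedged sketch: adist() has a `partial` argument, and according to its documentation `partial = TRUE` measures the distance from the term to substrings of the second argument (the approximate matching that agrep uses). That would make the separate exact-match check unnecessary. The helper name `partial_dist` is made up for illustration, and for strings that share little with the term (such as "bananas") the result may not agree with the hand-counted values in the question.

```r
# Hypothetical helper: distance from each term to its best-matching
# substring of the corresponding string, via adist(partial = TRUE).
partial_dist <- function(term, string) {
  mapply(function(x, y) drop(adist(x, y, partial = TRUE)),
         term, string, USE.NAMES = FALSE)
}

terms   <- c("apples", "apples")
strings <- c("buy apples now", "cheap aples online")

res <- partial_dist(terms, strings)
print(res)
```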

Answer 1: (score: 0)

So, here is a dirty solution I have come up with so far:

library(stringr) #str_trim() used below comes from stringr

#create a data.frame
a <- c(rep("apples",5),rep("banana split",3))
b <- c("buy apples now","cheap aples online","find your ap ple here","aple","bananas","cherry and bananas","pumpkin","banana split")
d <- data.frame(a,b)
colnames(d) <- c("term","string")

#split the string into sequences of consecutive characters whose length is
#equal to the length of the term on the same row. Calculate the similarity
#to the term of each sequence of characters and identify the most relevant
#piece of string for each row.

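mostrelevantpiece <- character(nrow(d))

for (j in seq_len(nrow(d))){
  #as.character() guards against factor columns
  term <- as.character(d$term[j])
  string <- as.character(d$string[j])
  #number of window start positions (at least one, for short strings)
  nstarts <- max(nchar(string)-nchar(term)+1,1)
  pieces <- character(nstarts)
  piecesdist <- numeric(nstarts)
  for (i in seq_len(nstarts)){
    addpiece <- substr(string,i,i+nchar(term)-1)
    pieces[i] <- str_trim(addpiece)
    piecesdist[i] <- adist(addpiece,term)
  }
  #pick the closest piece once all windows of this row have been scored
  mostrelevantpiece[j] <- pieces[which.min(piecesdist)]
}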

#calculate the number of transpositions needed to transform the "most relevant piece of string" into the term.

d$transpositionsneeded <- mapply(adist,mostrelevantpiece,d$term)
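
The loop above can also be written as a small vectorised function in base R. This is only a sketch of the same sliding-window idea, with two assumptions: the name `window_dist` is made up, and the windows are trimmed with trimws() before the distance is computed (the loop above scores the untrimmed piece and trims afterwards).

```r
# Hypothetical helper: slide a window of nchar(term) characters over the
# string and return the smallest edit distance between a window and the term.
window_dist <- function(term, string) {
  n <- nchar(term)
  # at least one window, even when the string is shorter than the term
  starts <- seq_len(max(nchar(string) - n + 1, 1))
  windows <- substring(string, starts, starts + n - 1)
  min(adist(trimws(windows), term))
}

res <- c(window_dist("apples", "buy apples now"),
         window_dist("apples", "cheap aples online"),
         window_dist("apples", "aple"))
print(res)
```

Because the window length is fixed to the length of the term, a match that needs an extra inserted character is only scored through a window that cuts part of it off.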