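I am working with hashtags. There are small differences between the strings: some hashtags are verbs or nouns, plural or singular, but they all have the same meaning. For a statistical study, I have to count "sortie", "sorties", "sortir", ... together. I am trying to keep only one word for each variant of a word.
For example:
sortir (to go out):
sortir (verb), sortie (noun, singular), sorties (noun, plural)
Japan (English)
Japon (French)
Here is my data: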
data <- as.character (c("Brest", "Nantes", "sortir", "sortir", "sortie", "sorties", "icones", "icones", "icone", "Icone", "japan", "japon"))
print(data)
My result:
"Brest"
"Nantes"
"sortir"
"sortir"
"sortie"
"sorties"
"icones"
"icones"
"icone"
"Icone"
"japan"
"japon"
What I want: I want to clean my data and keep only one word for each variant of a hashtag.
"Brest"
"Nantes"
"sortir"
"sortir"
"sortir"
"sortir"
"icones"
"icones"
"icones"
"icones"
"japan"
"japan"
What I did:
# convert text to lower case
library (stringr)
data_lowercase <- tolower(data)
print (data_lowercase)
# converting to data frame object
data_lowercase_df <- as.data.frame(data_lowercase)
data_lowercase_df
# calculating string distance with the Levenshtein method
library(stringdist)
distance <- stringdistmatrix(data_lowercase,
                             useNames = "strings",
                             method = "lv")
# Creating a matrix
distance2 <- as.matrix(distance)
distance2
# Converting to a data.frame object
library(reshape2)
distance_df <- unique(melt(distance2))
print (distance_df)
# Keeping pairs with a text distance < 3 as "good" matches
library (dplyr)
distance_df_2 <- distance_df %>%
filter (value>=0 & value<3)
print (distance_df_2)