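I am trying to clean up my survey data. My data frame contains several values that should be identical; however, variations in spelling, spacing, and capitalization lead to more levels than expected.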
str(data.frame$race)
"American Indian and Alaska Native"
"Asian"
"Black of African American"
"Black or African American"
"Other"
"Unknown"
"white or Caucasian"
"White or Caucasian"
"White or Caucasion"
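How can I find and replace these values to create one uniform spelling, and then convert the column back to a factor with the appropriate number of levels?
Answer 0 (score: 0)
It is hard to find a one-size-fits-all solution, because strings that look similar may describe very different things (for example, Granada versus Grenada). The comments on the original post are worth studying.
See "Approximate string matching" on Wikipedia (sometimes also called "fuzzy matching"). There are many ways to define "similar" for strings.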
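The most basic tool is the R function adist. It computes the so-called edit distance: the minimum number of insertions, deletions, and substitutions needed to turn one string into another.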
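To get a feel for what the edit distance measures, here is a minimal sketch on single pairs of strings. The variable names d_typo and d_places are only illustrative. It also shows why a small distance alone does not prove that two strings mean the same thing.

```r
# adist() returns a matrix of distances; for two single strings it is 1 x 1.

# A genuine typo: one substitution (a -> o)
d_typo <- adist("White or Caucasian", "White or Caucasion")

# Two different places that are also just one substitution apart (a -> e)
d_places <- adist("Granada", "Grenada")

d_typo
d_places
```

Both distances are 1, so the distance by itself cannot tell a misspelling from a legitimately different value.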
x <- c("American Indian and Alaska Native" ,
"Asian" ,
"Black of African American" ,
"Black or African American" ,
"Other" ,
"Unknown" ,
"white or Caucasian" ,
"White or Caucasian" ,
"White or Caucasion" )
u <- unique(x)
# compare all strings against each other
d <- adist(u)
# Do not list combinations of similar words twice
d[lower.tri(d)] <- NA
# Say your threshold below which you want to consider strings similar is
# 2 edits:
a <- which(d > 0 & d < 2, arr.ind = TRUE)
a
## row col
## [1,] 3 4
## [2,] 7 8
## [3,] 8 9
pairs <- cbind(u[a[,1]], u[a[,2]])
pairs
## [,1] [,2]
## [1,] "Black of African American" "Black or African American"
## [2,] "white or Caucasian" "White or Caucasian"
## [3,] "White or Caucasian" "White or Caucasion"
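If the list of pairs gets long, it can help to collect chained pairs into candidate groups. The following is only a sketch of one possible way to do that, not part of the approach above: it assumes case differences should be ignored (ignore.case = TRUE), uses single-linkage clustering from base R, and the cut height of 1 is an arbitrary threshold chosen for this example.

```r
# Same example strings as above
x <- c("American Indian and Alaska Native", "Asian",
       "Black of African American", "Black or African American",
       "Other", "Unknown",
       "white or Caucasian", "White or Caucasian", "White or Caucasion")
u <- unique(x)

# Edit distances with upper/lower case treated as equal
d_ci <- adist(u, ignore.case = TRUE)
dimnames(d_ci) <- list(u, u)

# These two differ only in case, so their distance is now 0
d_ci["white or Caucasian", "White or Caucasian"]

# Single linkage joins strings that are connected through a chain of
# close neighbours; cutting at h = 1 keeps only links of distance <= 1
# (assumed threshold, tune it for your data)
groups <- cutree(hclust(as.dist(d_ci), method = "single"), h = 1)

# Candidate groups to review by hand
split(u, groups)
```

The groups are only suggestions; each one still needs a manual check before any values are merged.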
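In the end, though, you have to curate the results yourself to avoid accidentally merging categories that are genuinely different.
You can make this reproducible by using a named vector as a translation dictionary. For example, looking at the output above, I can create the following dictionary: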
dict <- c(
# incorrect spellings         correct spellings
# -------------------------   ----------------------------
"Black of African American" = "Black or African American",
"white or Caucasian" = "White or Caucasian",
"White or Caucasion" = "White or Caucasian"
)
# The correct levels need to be included, too, so that they map to
# themselves. Names that occur twice are resolved to the first entry,
# so the corrections above take precedence.
dict <- c(dict, setNames(u, u))
Then convert your factor column to character with as.character and apply the dictionary to it, just as I do here with the original character vector x:
xcorrected <- dict[x]
# show without names, but the result is also correct if you just use
# xcorrected alone (remove as.character here to see the difference).
as.character(xcorrected)
## [1] "American Indian and Alaska Native" "Asian"
## [3] "Black or African American" "Black or African American"
## [5] "Other" "Unknown"
## [7] "White or Caucasian" "White or Caucasian"
## [9] "White or Caucasian"
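To close the loop on the original question, here is a minimal sketch of applying such a dictionary to a factor column and turning the result back into a factor. The data frame survey and its contents are made up for illustration. Note that the factor is converted with as.character first: indexing a named vector with a factor directly would use the integer codes rather than the labels.

```r
# Hypothetical survey data; `survey` is an illustrative name
survey <- data.frame(
  race = factor(c("Asian", "Black of African American",
                  "Black or African American", "white or Caucasian",
                  "White or Caucasian", "White or Caucasion"))
)

# Curated corrections: misspelling = correct spelling
dict <- c(
  "Black of African American" = "Black or African American",
  "white or Caucasian"        = "White or Caucasian",
  "White or Caucasion"        = "White or Caucasian"
)

# Every level without a correction maps to itself
keep <- setdiff(levels(survey$race), names(dict))
dict <- c(dict, setNames(keep, keep))

# factor -> character -> corrected character -> factor
survey$race <- factor(unname(dict[as.character(survey$race)]))

levels(survey$race)
nlevels(survey$race)
```

Rebuilding the factor with factor() drops the misspelled levels, so only the three corrected levels remain.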