Question

I have this data frame:

data <- data.frame(countries=c(rep('UK', 5),
                           rep('Netherlands 1a', 5),
                           rep('Netherlands', 5),
                           rep('USA', 5), 
                           rep('spain', 5), 
                           rep('Spain', 5),
                           rep('Spain 1a', 5),
                           rep('spain 1a', 5)),
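               var=rnorm(40),
               # countries must be a factor for levels() to work;
               # R >= 4.0 no longer converts strings to factors by default
               stringsAsFactors=TRUE)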

            countries          var
1              UK  0.506232270
2              UK  0.976348808
3              UK -0.752151769
4              UK  1.137267199
5              UK -0.363406715
6  Netherlands 1a -0.800835463
7  Netherlands 1a  1.767724231
8  Netherlands 1a  0.810757929
9  Netherlands 1a -1.188975114
10 Netherlands 1a -0.763144245
11    Netherlands  0.428511920
12    Netherlands  0.835184425
13    Netherlands -0.198316780
14    Netherlands  1.108191193
15    Netherlands  0.946819500
16            USA  0.226786121
17            USA -0.466886468
18            USA -2.217910876
19            USA -0.003472937
20            USA -0.784264921
21          spain -1.418014562
22          spain  1.002412706
23          spain  0.472621627
24          spain -1.378960222
25          spain -0.197020702
26          Spain  1.197971896
27          Spain  1.227648883
28          Spain -0.253083684
29          Spain -0.076562960
30          Spain  0.338882352
31       Spain 1a  0.074459521
32       Spain 1a -1.136391220
33       Spain 1a -1.648418916
34       Spain 1a  0.277264011
35       Spain 1a -0.568411569
36       spain 1a  0.250151646
37       spain 1a -1.527885883
38       spain 1a -0.452190849
39       spain 1a  0.454168927
40       spain 1a  0.889401396

I would like to find the levels of countries that appear more than once in different forms. The forms in which a level of countries can appear are:

lowercase, e.g. "spain"
title case, e.g. "Spain"
lowercase with a different word attached, e.g. "spain 1a"
title case with a different word attached, e.g. "Spain 1a"

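So I need something I can run that returns a vector listing the levels of countries that appear more than once. For data, the vector that should be returned is: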

"Netherlands 1a", "Netherlands", "spain", "Spain", "spain 1a", "Spain 1a"

Is it possible to create a function that returns this vector?

Answer 1

A quick solution that should meet all the requirements (assuming the country name is always the first element of an entry in data$countries):

# Country substrings
country.substr <- sapply(strsplit(tolower(levels(data$countries)), " "), "[[", 1)
# Duplicated country substrings
country.substr.dupl <- duplicated(country.substr)

# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
  levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))

[1] "Netherlands"    "Netherlands 1a" "spain"          "Spain"          "spain 1a"       "Spain 1a"
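
Since the question asks for a function, the steps above could be wrapped up roughly as follows. This is only a sketch under the same assumption (the country name is the first word); the name find_variant_levels is made up for illustration, and unlike the grep call above it compares the lowercased first word exactly rather than as a substring pattern.

```r
# Hypothetical helper: group levels by their lowercased first word and
# return every level whose key is shared with at least one other level.
find_variant_levels <- function(x) {
  lv <- levels(factor(x))                   # works for factors and character vectors
  key <- tolower(sub(" .*$", "", lv))       # first word of each level, lowercased
  shared <- key %in% key[duplicated(key)]   # TRUE where the key occurs more than once
  lv[shared]
}

countries <- c("UK", "Netherlands 1a", "Netherlands", "USA",
               "spain", "Spain", "Spain 1a", "spain 1a")
find_variant_levels(countries)
```

The result contains the six Netherlands and Spain variants; their order follows the sorted factor levels and therefore depends on the locale.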

Update

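If the country name is not always found in the first position, you need a different approach, which I took from here. Note that I slightly modified your sample data to make clear what I am doing: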

data <- data.frame(countries=c(rep('United Kingdom', 5),
                               rep('united kingdom', 5),
                               rep('Netherlands', 5), 
                               rep('Netherlands 1a', 5),
                               rep('1a Netherlands', 5),
                               rep('USA', 5), 
                               rep('spain', 5), 
                               rep('Spain', 5),
                               rep('Spain 1a', 5),
                               rep('spain 1a', 5)),
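                   var=rnorm(50),
                   # countries must be a factor for levels() to work;
                   # R >= 4.0 no longer converts strings to factors by default
                   stringsAsFactors=TRUE)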

Now let's identify all country substrings that do not contain any digits. The subsequent steps remain the same. Is this what you need?

# Remove mixed numeric/alphabetic parts from country names
country.substr <- lapply(strsplit(tolower(levels(data$countries)), " "), function(i) {
  # Identify, paste and return alphabetic-only components
  tmp <- grep("^[[:alpha:]]*$", i)

  if (length(tmp) == 1)
    return(i[tmp])
  else
    return(paste(i[tmp], collapse = " ")) 
})

# Identify duplicated country names
country.substr.dupl <- duplicated(country.substr)

# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
  levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))

[1] "1a Netherlands" "Netherlands"    "Netherlands 1a" "spain"          "Spain"          "spain 1a"       "Spain 1a"       "united kingdom" "United Kingdom"
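
The updated approach can also be sketched as a single function. The name find_variant_levels_alpha is hypothetical, and the sketch assumes that the purely alphabetic words of a level, lowercased and joined by spaces, identify the country. It matches on that key exactly instead of using grep, and it skips levels that have no alphabetic word at all.

```r
# Hypothetical helper: build a lowercased key from the purely alphabetic
# words of each level, then return the levels whose key is shared.
find_variant_levels_alpha <- function(x) {
  lv <- levels(factor(x))                   # works for factors and character vectors
  key <- vapply(strsplit(tolower(lv), " ", fixed = TRUE), function(words) {
    # Keep alphabetic-only words, e.g. drop "1a"
    paste(words[grepl("^[[:alpha:]]+$", words)], collapse = " ")
  }, character(1))
  # A level with no alphabetic word gets an empty key and is ignored
  lv[nzchar(key) & key %in% key[duplicated(key)]]
}

countries2 <- c("United Kingdom", "united kingdom", "Netherlands",
                "Netherlands 1a", "1a Netherlands", "USA",
                "spain", "Spain", "Spain 1a", "spain 1a")
find_variant_levels_alpha(countries2)
```

For this input every level except "USA" is returned, again in the locale-dependent order of the sorted levels.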

Answer 2

Why not use grep? The ignore.case argument is exactly what you need.

> uch <- unique(as.character(data$countries))
> found <- sapply(seq(uch), function(i){
      if(!grepl("\\s|[0-9]", uch[i]))
          grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
  })
> ff <- found[sapply(found, function(x) length(x) > 1)]
> unique(unlist(ff))
# [1] "Netherlands 1a" "Netherlands"    "spain" 
# [4] "Spain"          "Spain 1a"       "spain 1a"

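Here is my logic: take the unique factor levels of the column as a character vector. Then compare that vector against itself, using as patterns only those levels that contain neither whitespace nor digits. grep will catch the longer variants from those, but going the other way around would be much harder. Finally, keep only the unique matches. So here is a function and a test run: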

find.matches <- function(column)
{
    uch <- unique(as.character(column))
    found <- sapply(seq(uch), function(i){
        if(!grepl("\\s|[0-9]", uch[i]))
            grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
        })
    ff <- found[sapply(found, function(x) length(x) > 1)]
    unique(unlist(ff))
}

> dat <- data.frame(x = c("a", "a1", "a 1b", "c", "d"),
                    y = c("fac", "tor", "fac 1a", "tor1a", "fac"))
> sapply(dat, find.matches)
# $x
# [1] "a"    "a1"   "a 1b"
# 
# $y
# [1] "fac"    "fac 1a" "tor"    "tor1a"
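
One thing to keep in mind is that find.matches passes each level to grep as a regular expression, so a level containing characters such as "." or "(" would be interpreted as a pattern. A possible variant, sketched here under the assumption that variants always start with the base name, compares lowercased prefixes with startsWith instead. The name find_matches_prefix is made up for illustration, and unlike grep it does not match in the middle of a string.

```r
# Hypothetical variant of find.matches(): compare lowercased prefixes with
# startsWith() instead of passing each level to grep() as a regex pattern.
find_matches_prefix <- function(column) {
  uch <- unique(as.character(column))
  low <- tolower(uch)
  base <- !grepl("\\s|[0-9]", uch)          # same filter as in find.matches()
  hits <- lapply(which(base), function(i) uch[startsWith(low, low[i])])
  # Keep only base names that matched more than just themselves
  unique(unlist(hits[lengths(hits) > 1]))
}

find_matches_prefix(c("a", "a1", "a 1b", "c", "d"))
# [1] "a"    "a1"   "a 1b"
```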

Find the levels of a factor that appear more than once
