Question

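I have a table (Table 1) containing a list of cities (punctuation, capitalization, and spaces have been removed).

I want to scan through a second table (Table 2) and pull out any record (the first one) that either matches a city exactly or contains that string anywhere within it.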


The result would be a third table.

# Table 1
  city1    
1 waterloo 
2 kitchener
3 toronto  
4 guelph   
5 ottawa


# Table 2
  city2
1 waterlookitchener  
2 toronto  
3 hamilton  
4 cityofottawa

Answer 1

I'm sure there are more sophisticated ways to do this, but here is a simple approach using the tidyverse.

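library(tidyverse)

# Note: read_table2() is deprecated as of readr 2.0.0 in favour of read_table().
df <- read_table2("city1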
waterloo
kitchener
toronto
guelph
ottawa")

df2 <- read_table2("city2
waterlookitchener
toronto
hamilton
cityofottawa")

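# For each city1 value, find the city2 entries that contain it, then keep
# the first match, or NA when nothing matches.
df3 <- df$city1 %>% 
  lapply(grep, df2$city2, value = TRUE) %>%
  lapply(function(x) if (length(x) == 0) NA_character_ else x[1]) %>%
  unlist()

# Append the matches to the city list as a new column.
df3 <- cbind(df, df3)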

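Search df2$city2 for each element of df$city1 (partial or exact match) and return the matching element of df2$city2. See ?grep for more information.
Replace character(0) (no match found) with NA. See "How to convert character(0) to NA in a list with R language?" for details.
Convert the list to a vector (unlist).
Append the result to the list of cities (cbind).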
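
The steps above can also be traced one at a time in base R. This is only a sketch under a few assumptions: the vectors are typed in by hand rather than read with readr, fixed = TRUE is added so each city name is treated as a literal string instead of a regular expression, and only the first match is kept. The names hits, first_hit, and result are illustrative.

```r
city1 <- c("waterloo", "kitchener", "toronto", "guelph", "ottawa")
city2 <- c("waterlookitchener", "toronto", "hamilton", "cityofottawa")

# Step 1: for each city1 value, collect the city2 entries that contain it.
hits <- lapply(city1, function(p) grep(p, city2, value = TRUE, fixed = TRUE))

# Step 2: keep the first match, or NA when grep() returned character(0).
# vapply() also flattens the list to a character vector (the unlist step).
first_hit <- vapply(
  hits,
  function(h) if (length(h) == 0) NA_character_ else h[1],
  character(1)
)

# Step 3: attach the matches to the original cities.
result <- data.frame(city1 = city1, city2 = first_hit, stringsAsFactors = FALSE)
print(result)
```

Taking h[1] keeps the result the same length as city1 even if a city appears in several city2 entries, so the final data frame always lines up row for row.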

Answer 2

You could also try fuzzyjoin. In this case, you can use the stri_detect_fixed function from the stringi package to detect at least one occurrence of a fixed pattern in a string.

library(fuzzyjoin)
library(stringi)
library(dplyr)

fuzzy_right_join(table2, table1, by = c("city2" = "city1"), match_fun = stri_detect_fixed) %>% 
  select(city1, city2)

Output

      city1             city2
1  waterloo waterlookitchener
2 kitchener waterlookitchener
3   toronto           toronto
4    guelph              <NA>
5    ottawa      cityofottawa
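
For readers who want to see roughly what the join is doing, here is a base-R sketch that produces the same table for this data. It is not how fuzzyjoin is implemented. It simply tests every city1/city2 pair for literal containment with grepl(fixed = TRUE), keeps the matching pairs, and then adds back the city1 values that matched nothing. The names contains, pairs, matched, and unmatched are illustrative.

```r
table1 <- data.frame(
  city1 = c("waterloo", "kitchener", "toronto", "guelph", "ottawa"),
  stringsAsFactors = FALSE
)
table2 <- data.frame(
  city2 = c("waterlookitchener", "toronto", "hamilton", "cityofottawa"),
  stringsAsFactors = FALSE
)

# Logical matrix: rows are city1 values, columns are city2 values.
# TRUE means the city2 string contains the city1 string.
contains <- sapply(table2$city2, function(s) {
  vapply(table1$city1, function(p) grepl(p, s, fixed = TRUE), logical(1))
})

# Turn the TRUE cells into (row, col) index pairs and look up the strings.
pairs <- which(contains, arr.ind = TRUE)
matched <- data.frame(
  city1 = table1$city1[pairs[, "row"]],
  city2 = table2$city2[pairs[, "col"]],
  stringsAsFactors = FALSE
)

# Add city1 values with no match (the "right join" part), paired with NA.
unmatched <- setdiff(table1$city1, matched$city1)
result <- rbind(
  matched,
  data.frame(
    city1 = unmatched,
    city2 = rep(NA_character_, length(unmatched)),
    stringsAsFactors = FALSE
  )
)

# Restore the original order of table1.
result <- result[order(match(result$city1, table1$city1)), ]
rownames(result) <- NULL
print(result)
```

Unlike the first-match approach, this keeps every matching pair, so a city1 value that occurs in several city2 entries would appear on several rows.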

Data

table1 <- structure(list(city1 = c("waterloo", "kitchener", "toronto", 
"guelph", "ottawa")), class = "data.frame", row.names = c(NA, 
-5L))

table2 <- structure(list(city2 = c("waterlookitchener", "toronto", "hamilton", 
"cityofottawa")), class = "data.frame", row.names = c(NA, -4L
))

Lookup value based on partial string match in R
