Figuring out why character matching does not work

Time: 2019-03-25 08:37:51

Tags: r merge tidyverse

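For a project at hand, I have to join multiple data frames by character columns. Sometimes this causes problems because of (for example) trailing whitespace, but that is easily fixed. In this case, however, the join does not work properly, and I cannot figure out how the character values in the join column differ.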

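Since the problem cannot be reproduced from pasted text alone, here is a link where the problematic data can be downloaded. It looks like this: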

sourceA <- readRDS("path/sourceA")
sourceA
# A tibble: 1 x 2
  Name                  category
  <chr>                    <dbl>
1 Grundschule Kronsberg        1

sourceB <- readRDS("path/sourceB")
sourceB
# A tibble: 1 x 2
  Name                  value
  <chr>                 <dbl>
1 Grundschule Kronsberg     2

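I want to join these data frames on the common id variable Name. As you can see, the values in the two frames look exactly the same. However, when I apply any join procedure, this happens: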

library(tidyverse)
joined.df <- full_join(sourceA, sourceB, by = "Name")

joined.df
# A tibble: 2 x 3
  Name                  category value
  <chr>                    <dbl> <dbl>
1 Grundschule Kronsberg        1    NA
2 Grundschule Kronsberg       NA     2

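While trying to solve this, I tried to remove the whitespace from the Name column, but the standard procedure only works for sourceA. For sourceB, it does not seem to cut out the whitespace between "Grundschule" and "Kronsberg".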

joined.df %>%
  mutate(Name_test = stringr::str_replace_all(Name, fixed(" "), ""))

# A tibble: 2 x 4
  Name                  category value Name_test            
  <chr>                    <dbl> <dbl> <chr>                
1 Grundschule Kronsberg        1    NA GrundschuleKronsberg 
2 Grundschule Kronsberg       NA     2 Grundschule Kronsberg

Strangely, it works when I use stringr::str_replace_all(Name, "\\p{WHITE_SPACE}", "") instead:

joined.df %>%
  mutate(Name_test = stringr::str_replace_all(Name, "\\p{WHITE_SPACE}", ""))

# A tibble: 2 x 4
  Name                  category value Name_test           
  <chr>                    <dbl> <dbl> <chr>               
1 Grundschule Kronsberg        1    NA GrundschuleKronsberg
2 Grundschule Kronsberg       NA     2 GrundschuleKronsberg

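I do not know how matching on "\\p{WHITE_SPACE}" differs from matching on fixed(" ") behind the scenes, but I think this might be a useful lead for someone who does.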

1 answer:

Answer 0 (score: 0)

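After quite a bit of discussion in the comments, I was able to solve the problem. Although the Name variables look identical (and dput() prints them identically), there is a subtle difference once the characters are converted to ASCII codes: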

library(gtools)

asc(sourceA$Name)
      Grundschule Kronsberg
 [1,]                    71
 [2,]                   114
 [3,]                   117
 [4,]                   110
 [5,]                   100
 [6,]                   115
 [7,]                    99
 [8,]                   104
 [9,]                   117
[10,]                   108
[11,]                   101
[12,]                    32
[13,]                    75
[14,]                   114
[15,]                   111
[16,]                   110
[17,]                   115
[18,]                    98
[19,]                   101
[20,]                   114
[21,]                   103

asc(sourceB$Name)
      Grundschule Kronsberg
 [1,]                    71
 [2,]                   114
 [3,]                   117
 [4,]                   110
 [5,]                   100
 [6,]                   115
 [7,]                    99
 [8,]                   104
 [9,]                   117
[10,]                   108
[11,]                   101
[12,]                   194
[13,]                   160
[14,]                    75
[15,]                   114
[16,]                   111
[17,]                   110
[18,]                   115
[19,]                    98
[20,]                   101
[21,]                   114
[22,]                   103
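Compared with sourceA, sourceB has one extra code, and the values at positions 12 and 13 differ. Using chr() (also from gtools), I was able to convert the ASCII codes back to characters: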

chr(asc(sourceA$Name))
 [1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"

chr(asc(sourceB$Name))
 [1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" "Â" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
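
The same comparison can be sketched in base R, without gtools. This is only a minimal sketch: name_a and name_b are hypothetical stand-ins for sourceA$Name and sourceB$Name (the original .rds files are not reproduced here), and it assumes the difference is a no-break space (U+00A0), which is what the codes 194 and 160 suggest. charToRaw() shows the bytes of the string, while utf8ToInt() shows one code point per character.

```r
# Hypothetical stand-ins for the two Name values (not the original data).
name_a <- enc2utf8("Grundschule Kronsberg")        # plain ASCII space (U+0020)
name_b <- enc2utf8("Grundschule\u00a0Kronsberg")   # no-break space (U+00A0)

# Bytes of the UTF-8 representation, as integers.
bytes_a <- as.integer(charToRaw(name_a))
bytes_b <- as.integer(charToRaw(name_b))

# Unicode code points, one per character.
cp_a <- utf8ToInt(name_a)
cp_b <- utf8ToInt(name_b)

bytes_a[12]      # 32: the ASCII space is a single byte
bytes_b[12:13]   # 194 160: U+00A0 takes two bytes in UTF-8
cp_b[12]         # 160: but it is a single character

length(cp_b)     # 21 characters, same as name_a
length(bytes_b)  # 22 bytes, one more than name_a
```

Looking at code points rather than bytes avoids the impression that there is an extra character in the string.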

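In sourceB, chr() shows an extra Â (decimal code 194), and the space is encoded as decimal 160 instead of 32. These two values are bytes rather than ASCII codes: the pair 194 160 (0xC2 0xA0) is the UTF-8 encoding of U+00A0, the no-break space, which prints like a regular space but is not equal to " ". I was able to solve the problem by simply replacing all whitespace with " ":
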
sourceB <- sourceB %>%
  mutate(Name = stringr::str_replace_all(Name, "\\p{WHITE_SPACE}", " "))

full_join(sourceA, sourceB, by = "Name")
# A tibble: 1 x 3
  Name                  category value
  <chr>                    <dbl> <dbl>
1 Grundschule Kronsberg        1     2
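
Why fixed(" ") failed while "\\p{WHITE_SPACE}" worked can be illustrated with a small sketch. It assumes the offending character is a no-break space (U+00A0), and nbsp_name is a hypothetical stand-in for sourceB$Name. fixed(" ") matches only the literal ASCII space, whereas \p{WHITE_SPACE} is a Unicode property class that also covers U+00A0.

```r
library(stringr)

# Hypothetical stand-in for sourceB$Name, with U+00A0 between the words.
nbsp_name <- "Grundschule\u00a0Kronsberg"

# Literal comparison against U+0020 only: the no-break space is not matched.
has_ascii_space <- str_detect(nbsp_name, fixed(" "))

# Unicode White_Space property: the no-break space is matched.
has_unicode_space <- str_detect(nbsp_name, "\\p{WHITE_SPACE}")

# Replacing every whitespace character with an ASCII space makes the
# value equal to the plain-ASCII spelling.
cleaned <- str_replace_all(nbsp_name, "\\p{WHITE_SPACE}", " ")

has_ascii_space                        # FALSE
has_unicode_space                      # TRUE
cleaned == "Grundschule Kronsberg"     # TRUE
```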

This replaces the two codes 194 and 160 with a single 32, so the codes of the two strings now line up:

chr(asc(sourceB$Name))
 [1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
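
When several data frames are joined on character keys, it may be convenient to normalise the key on both sides before joining. The following is a sketch under stated assumptions, not part of the original answer: normalize_ws() is a hypothetical helper, and the two tibbles are reconstructed stand-ins for the downloaded data, with a trailing space added to sourceB to show trimming as well.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: collapse every run of Unicode whitespace
# (including U+00A0) into one ASCII space and trim both ends.
normalize_ws <- function(x) {
  str_trim(str_replace_all(x, "\\p{WHITE_SPACE}+", " "))
}

# Reconstructed stand-ins for the original data (assumption).
sourceA <- tibble(Name = "Grundschule Kronsberg", category = 1)
sourceB <- tibble(Name = "Grundschule\u00a0Kronsberg ", value = 2)

# Normalise the key in both frames, then join.
joined <- full_join(
  mutate(sourceA, Name = normalize_ws(Name)),
  mutate(sourceB, Name = normalize_ws(Name)),
  by = "Name"
)

nrow(joined)   # 1: both rows now share the same key
```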