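For a current project, I have to join several data frames on a character column. Sometimes this causes problems because of, for example, trailing whitespace, but that is easy to fix. In this case, however, the join does not work properly, and I cannot figure out how the character values in the join column differ.
Since the data cannot be reproduced outside its original format, here is a link where the problematic data can be downloaded. It looks like this: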
readRDS("path/sourceA") -> sourceA
sourceA
# A tibble: 1 x 2
Name category
<chr> <dbl>
1 Grundschule Kronsberg 1
readRDS("path/sourceB") -> sourceB
sourceB
# A tibble: 1 x 2
Name value
<chr> <dbl>
1 Grundschule Kronsberg 2
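I want to join these data frames on the common ID variable Name. As you can see, the values in the two frames look exactly the same. However, when I apply any join function, this happens: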
library(tidyverse)
joined.df <- full_join(sourceA, sourceB, by = "Name")
joined.df
# A tibble: 2 x 3
Name category value
<chr> <dbl> <dbl>
1 Grundschule Kronsberg 1 NA
2 Grundschule Kronsberg NA 2
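While trying to solve this, I tried removing the whitespace from the Name column, but the standard approach only works for the row that comes from sourceA. For the row from sourceB, it does not seem to remove the space between "Grundschule" and "Kronsberg".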
joined.df %>%
mutate(Name_test = stringr::str_replace_all(Name, fixed(" "), ""))
# A tibble: 2 x 4
Name category value Name_test
<chr> <dbl> <dbl> <chr>
1 Grundschule Kronsberg 1 NA GrundschuleKronsberg
2 Grundschule Kronsberg NA 2 Grundschule Kronsberg
Strangely, it works when using stringr::str_replace_all(Name, "\\p{WHITE_SPACE}", ""):
joined.df %>%
mutate(Name_test = stringr::str_replace_all(Name, "\\p{WHITE_SPACE}", ""))
# A tibble: 2 x 4
Name category value Name_test
<chr> <dbl> <dbl> <chr>
1 Grundschule Kronsberg 1 NA GrundschuleKronsberg
2 Grundschule Kronsberg NA 2 GrundschuleKronsberg
I have no idea how matching "\\p{WHITE_SPACE}" differs from fixed(" ") behind the scenes, but I thought this might be a useful hint for someone who does.
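One way such a difference can arise, sketched on a hand-built string: fixed(" ") searches for the literal ordinary space (U+0020) only, whereas \p{WHITE_SPACE} is a Unicode property class that also covers other space characters, such as U+00A0 NO-BREAK SPACE. The string x below is an assumption made for illustration, not the original data.

```r
library(stringr)

# Hypothetical stand-in for the sourceB value: the separator is U+00A0
# (NO-BREAK SPACE), not the ordinary space U+0020.
x <- "Grundschule\u00a0Kronsberg"

# fixed(" ") looks for the literal character U+0020 only, so nothing matches
# and the string comes back unchanged.
has_plain_space <- str_detect(x, fixed(" "))
after_fixed     <- str_replace_all(x, fixed(" "), "")

# \p{WHITE_SPACE} is a Unicode property class; U+00A0 belongs to it,
# so the separator is found and removed.
has_unicode_space <- str_detect(x, "\\p{WHITE_SPACE}")
after_class       <- str_replace_all(x, "\\p{WHITE_SPACE}", "")

has_plain_space
has_unicode_space
after_class
```

If the real data behaves the same way, that would point to a non-standard space character in sourceB rather than to a problem with the join itself.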
Answer 0 (score: 0)
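After quite a bit of discussion in the comments, I was able to solve the problem. Although the Name variables look identical (and dput() prints them identically), there is a subtle difference once the characters are converted to numeric codes with asc() (strictly speaking these are byte values, since 194 and 160 lie outside the ASCII range):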
library(gtools)
asc(sourceA$Name)
Grundschule Kronsberg
[1,] 71
[2,] 114
[3,] 117
[4,] 110
[5,] 100
[6,] 115
[7,] 99
[8,] 104
[9,] 117
[10,] 108
[11,] 101
[12,] 32
[13,] 75
[14,] 114
[15,] 111
[16,] 110
[17,] 115
[18,] 98
[19,] 101
[20,] 114
[21,] 103
asc(sourceB$Name)
Grundschule Kronsberg
[1,] 71
[2,] 114
[3,] 117
[4,] 110
[5,] 100
[6,] 115
[7,] 99
[8,] 104
[9,] 117
[10,] 108
[11,] 101
[12,] 194
[13,] 160
[14,] 75
[15,] 114
[16,] 111
[17,] 110
[18,] 115
[19,] 98
[20,] 101
[21,] 114
[22,] 103
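The byte pair 194 160 (hex c2 a0) is the UTF-8 encoding of a single character, U+00A0 NO-BREAK SPACE, which would account for the extra code. A minimal base R sketch of the same check at the level of code points rather than bytes; the strings a and b are hand-built stand-ins that assume this is what sourceB contains, since the original data is not available here.

```r
# Hand-built stand-ins for the two Name values (assumption: the separator in
# the sourceB value is U+00A0, which is what the bytes 194 160 decode to).
a <- "Grundschule Kronsberg"
b <- "Grundschule\u00a0Kronsberg"

# Same number of characters, different number of bytes.
n_chars <- nchar(c(a, b), type = "chars")
n_bytes <- nchar(c(a, b), type = "bytes")

# Code points instead of bytes: both vectors have the same length,
# so they can be compared position by position.
cp_a <- utf8ToInt(a)
cp_b <- utf8ToInt(b)
diff_pos <- which(cp_a != cp_b)

# The raw bytes of the separator alone, as decimal values.
sep_bytes <- as.integer(charToRaw("\u00a0"))

n_chars
n_bytes
diff_pos
cp_b[diff_pos]
sep_bytes
```

Comparing code points with utf8ToInt() keeps multi-byte characters in one piece, so the positions of the two strings stay aligned, which is not the case for a byte-wise listing.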
Compared with sourceA, sourceB has one extra code and different values at positions 12 and 13. Using chr() (also from gtools), I was able to convert the codes back to characters:
chr(asc(sourceA$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
chr(asc(sourceB$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" "Â" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
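In sourceB, there is an extra Â in the string (decimal code 194), and the space is encoded as decimal 160 instead of 32. I still do not know why the combination of these two codes is displayed as a regular space, but I was able to solve the problem by simply replacing the whitespace in sourceB with " ":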
sourceB <- sourceB %>%
mutate(Name = stringr::str_replace_all(Name, "\\p{WHITE_SPACE}", " "))
full_join(sourceA, sourceB, by = "Name")
# A tibble: 1 x 3
Name category value
<chr> <dbl> <dbl>
1 Grundschule Kronsberg 1 2
This (somehow) changed the codes so that they now match:
chr(asc(sourceB$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
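A narrower variant of the same fix, as a sketch in base R: instead of replacing every character in the whitespace class, it maps only U+00A0 to an ordinary space and trims the ends, then joins. The helper name normalize_key and the data frames demoA and demoB are hypothetical and built by hand for illustration; they are not the original data.

```r
# Hypothetical helper: make join keys comparable by mapping U+00A0
# (NO-BREAK SPACE) to a plain space and trimming leading/trailing whitespace.
normalize_key <- function(x) {
  x <- gsub("\u00a0", " ", x, fixed = TRUE)
  trimws(x)
}

# Hand-built stand-ins for the two data frames in the question; demoB uses
# U+00A0 as separator and carries a trailing space.
demoA <- data.frame(Name = "Grundschule Kronsberg", category = 1,
                    stringsAsFactors = FALSE)
demoB <- data.frame(Name = "Grundschule\u00a0Kronsberg ", value = 2,
                    stringsAsFactors = FALSE)

demoA$Name <- normalize_key(demoA$Name)
demoB$Name <- normalize_key(demoB$Name)

# all = TRUE makes merge() behave like a full join.
joined <- merge(demoA, demoB, by = "Name", all = TRUE)
joined
```

Normalizing both key columns with the same function before joining keeps the cleanup in one place, which may help when several data frames from different sources have to be joined on the same column.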