Question

In R, I have strings that contain encoding garbage, for example

"based on the unique spectral \xfc\xbe\x8e\x93\xa0\xbc\xfc\xbe\x98\xa6\x90\xbc\xfc\xbe\x99\xa6\x8c\xbcfingerprints\xfc\xbe\x8e\x93\xa0\xbc of their biochemical composition"

Is there a simple way to strip the encoding garbage from a string, whatever the garbage is?

Answer 1

Use gsub:

x <- "based on the unique spectral \xfc\xbe\x8e\x93\xa0\xbc\xfc\xbe\x98\xa6\x90\xbc\xfc\xbe\x99\xa6\x8c\xbcfingerprints\xfc\xbe\x8e\x93\xa0\xbc of their biochemical composition"
gsub("[^[:print:]]", "", x)
# [1] "based on the unique spectral fingerprints of their biochemical composition"
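The meaning of `[[:print:]]` depends on the current locale, and R's regex functions can reject input that is not valid in the session encoding. As a hedged alternative, here is a minimal byte-level sketch that assumes only the plain ASCII text should survive. The variable names `clean_iconv` and `clean_bytes` are illustrative, and the sample string is a shortened version of the one in the question.

```r
x <- "based on the unique spectral \xfc\xbe\x8e\x93\xa0\xbcfingerprints\xfc\xbe\x8e\x93\xa0\xbc of their biochemical composition"

# Option 1: treat every byte as a latin1 character and drop whatever
# cannot be represented in ASCII (sub = "" replaces it with nothing).
clean_iconv <- iconv(x, from = "latin1", to = "ASCII", sub = "")

# Option 2: byte-wise regex that deletes every byte >= 0x80.
# useBytes = TRUE makes the match ignore the string's encoding entirely.
clean_bytes <- gsub("[\\x80-\\xff]", "", x, perl = TRUE, useBytes = TRUE)

print(clean_iconv)
print(clean_bytes)
```

Both calls should leave only the ASCII part of the sentence. Note that this also deletes legitimate non-ASCII characters, so it fits only when the expected text is pure ASCII.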

Answer 2

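I had the same problem. I received data from a meteostation in .dta format, which is similar to a .csv with metadata. I did not know the file's encoding, but in R (running in UTF-8) I got the same garbage as you. In it I recognized Czech characters, the Czech Republic being where the station operates. I used code like this, for example: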

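# base gsub() does not need stringr's regex(); match the literal bytes instead
gsub(pattern = "\xfc\xbe\x8c\x96\x94\xbc", replacement = "a", x = data, fixed = TRUE, useBytes = TRUE)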

All of the wrongly encoded characters share the same pattern: \xfc\xbe, then three varying bytes, then \xbc. In the code above, the sequence being replaced stands for the long a (á).
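The pattern described here can be matched in one pass instead of one replacement per character. The following is a minimal sketch, assuming the three middle bytes always fall in the range 0x80 to 0xbf (true for every sequence shown in this thread). The vector `data` and the names `junk_pattern` and `cleaned` are hypothetical.

```r
# Hypothetical sample data: one value with a junk sequence, one clean value.
data <- c("Pl\xfc\xbe\x8c\x96\x94\xbcn", "ok")

# \xfc \xbe, three bytes in 0x80-0xbf, then \xbc, matched byte by byte.
junk_pattern <- "\\xfc\\xbe[\\x80-\\xbf]{3}\\xbc"

# Delete every junk sequence; useBytes = TRUE avoids encoding validation.
cleaned <- gsub(junk_pattern, "", data, perl = TRUE, useBytes = TRUE)
print(cleaned)
```

Unlike a blanket removal of all non-ASCII bytes, this touches only sequences of that exact shape, so correctly encoded text elsewhere in the data is left alone.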

If you just want to get rid of it, the function str_extract from the stringr package worked well for me.

R: remove all encoding text from a string
