Question

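I recently wrote a function that uses grep and regular expressions to find invalid UTF-8 code points (I work on a Mac and my locale is UTF-8). The input does not have to be UTF-8, since the function looks for bytes that are invalid in UTF-8. I wrote it for work and would like tips on generalizing it, or on any red flags in the code that I have not noticed (for example, places where base R would be better than dplyr). Feel free to use any of the code if it is useful to you.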

enc_check <- function(data) {
library(dplyr)
library(magrittr)

# Create vector of all possible 2-digit hexadecimal numbers (2 digits is the length of a byte)
allBytes <- list(x_esc = '\\x',
               hex1 = as.character(c(seq(0,9),
                                     c('a','b','c','d','e','f'))),
               hex2 = as.character(c(seq(0,9),
                                     c('a','b','c','d','e','f')))
               ) %$%
expand.grid(x_esc, hex1, hex2) %>%
apply(1, paste, collapse = '')

# Valid mixed alphanumeric bytes

validBytes1 <- list(x_esc = '\\x',
                 hexNum = as.character(c(seq(2,7))),     
                 hexAlpha = c('a','b','c','d','e','f')
                 ) %$%
expand.grid(x_esc, hexNum, hexAlpha) %>%
apply(1, paste, collapse = '') %>%
extract(. != '\\x7f')

# Valid purely numeric bytes

validBytes2 <- list(x_esc = '\\x',
                 hexNum2 = as.character(seq(20,79))
                 ) %$%
expand.grid(x_esc, hexNum2) %>%
apply(1, paste, collapse = '')

# New-line byte
validBytes3 <- '\\x0a'
# charToRaw('\n')
# [1] 0a

# Filter all possible combinations down to only invalid bytes
validBytes <- c(validBytes1, validBytes2, validBytes3)
invalidBytes <- allBytes %>%
  extract(not(is_in(., validBytes)))

# Create list of data.frame columns with invalid bytes
a_vector <- vector()
matches <- list()
for (i in seq_len(ncol(data))) {  # seq_len() also handles a zero-column data.frame
  a_vector <- data[[i]]
  matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
}

# Get rid of empty list elements
matches %<>%
  lapply(length) %$%
  extract(matches, . > 0)
# matches <- matches[lapply(matches,length) > 0]

return(matches)
}

Edit: here is the updated code with the suggestions implemented.

enc_check <- function(dataset) {
library(dplyr)
library(magrittr)

rASCII <- c( '\n', '\r', '\t','\b',
           '\a', '\f', '\v', '\\', '\'', '\"', '\`')

validBytes <- paste0("\\x",
                     c(as.character(as.hexmode(32:126)),
                       sapply(rASCII, charToRaw))) %>%
  extract(not(duplicated(.)))

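# allBytes was not defined in this version; build all 256 zero-padded byte patterns
allBytes <- sprintf("\\x%02x", 0:255)

invalidBytes <- allBytes %>%
  extract(not(is_in(., validBytes)))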

a_vector <- vector()
matches <- list()
for (i in seq_len(ncol(dataset))) {
  a_vector <- dataset[[i]]
  matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
} # sapply() is preferable to lapply due to USE.NAMES = TRUE
names(matches) <- names(dataset)

matches %<>%
  lapply(length) %$%
  extract(matches, . > 0)

return(matches)
}

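Second edit: a better strategy is to use iconv. Suppose you have a file or object that is mostly UTF-8 but contains some invalid bytes. This is often the case on Macs, whose default locale appears to be UTF-8. In addition, RStudio on the Mac seems to use UTF-8 internally, and that cannot be changed even if you set the machine's locale to a different encoding. In any case, you can use iconv to replace every invalid byte (usually displayed as a hex escape such as "\x8f") with the Unicode replacement character. You can then search for that character and return a list of the unique values in each data.frame column that contain it. Based on that list, you can use sub() to replace those characters with whatever you want.

One caveat: if invalid bytes are present, converting the file to another encoding (for example latin-1) can give unexpected results. When I did this, some invalid bytes were converted to Unicode control characters, while others apparently matched valid latin-1 bytes and showed up as nonsense characters.

Either way, I wrote a package that searches data.frames for these characters, returns a list, and then performs some replacements. The package is not as polished as one on CRAN, but if anyone is interested, the repository is here: https://github.com/jkroes/FixEncoding. Note that the "stable" version of the package is not on the "master" branch; it is on the "iconv" branch. After installing the correct branch, you can look up the documentation in R with "?FixEncoding" and then look up the functions listed there.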
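The iconv approach described above can be sketched roughly as follows. This is a minimal illustration and not the code from the FixEncoding package: the data frame `df`, the helper names `marked`, `hits` and `cleaned`, and the choice of "?" as the final replacement are all assumptions made for the example. How invalid input is handled can also depend on the platform's iconv implementation, so the result is worth checking on your own machine.

```r
# Hypothetical example data: column "a" contains a stray 0xff byte.
df <- data.frame(a = c("ok", "bad\xffbyte"),
                 b = c("fine", "also fine"),
                 stringsAsFactors = FALSE)

# U+FFFD is the Unicode replacement character.
repl <- "\uFFFD"

# Ask iconv to substitute every non-convertible byte with the replacement character.
marked <- lapply(df, iconv, from = "UTF-8", to = "UTF-8", sub = repl)

# Unique values per column that now contain the replacement character.
hits <- lapply(marked, function(col) unique(col[grepl(repl, col, fixed = TRUE)]))
hits <- hits[lengths(hits) > 0]

# Optionally swap the replacement character for something else (here "?").
cleaned <- lapply(marked, function(col) gsub(repl, "?", col, fixed = TRUE))
```

Inspecting `hits` before running the final gsub() makes it easier to decide what each bad value should become.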

Answer 1

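This constructs the character versions of all hexadecimal numbers up to "ff" (format() keeps them zero-padded to two digits, which as.character() does not do in recent versions of R):

allBytes <- format(as.hexmode(0:255), width = 2)

Or, in the pattern form you seem to prefer:

allBytes <- paste0("\\x", format(as.hexmode(0:255), width = 2))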

The "special" characters that R recognizes include the "\n" that you already handle, but there are several more listed on the ?Quotes help page:

rASCII <- c( '\n', '\r', '\t','\b',
             '\a', '\f', '\v', '\\', '\'', '\"', '\`')

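You can build a vector of valid grep patterns for the characters from space to tilde ("~") with the following:

# 32:126 are the decimal codes for space through tilde (hex 20 to 7e)
validBytes1 <- c(rASCII, paste0("\\x", format(as.hexmode(32:126))))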
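Putting the pieces above together, the set of byte patterns to search for can be derived with a set difference. This is only a sketch in base R: the snake_case names are made up for the example, and it treats every byte outside printable ASCII and the ?Quotes escapes as suspect, which is stricter than "invalid UTF-8" because bytes 0x80 and above can be part of valid multi-byte sequences.

```r
# All 256 byte values as zero-padded "\xNN" regex patterns.
all_bytes <- sprintf("\\x%02x", 0:255)

# Escapes listed on ?Quotes, as in rASCII above.
r_ascii <- c("\n", "\r", "\t", "\b", "\a", "\f", "\v", "\\", "'", "\"", "`")

# Byte values of the escapes plus printable ASCII (space through tilde).
escape_codes <- as.integer(unlist(lapply(r_ascii, charToRaw)))
valid_codes  <- sort(union(32:126, escape_codes))
valid_bytes  <- sprintf("\\x%02x", valid_codes)

# Everything else is treated as a byte worth flagging.
invalid_bytes <- setdiff(all_bytes, valid_bytes)
```

Working from integer codes and formatting once with sprintf() avoids mixing padded and unpadded hex strings.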

I would be wary of this strategy, because my R throws errors when it attempts grep-style matching against input strings that it considers invalid.

> txt <- "ttt\nuuu\tiii\xff"
> dfrm <- data.frame(a = txt)
> lapply(dfrm, grep, patt = "\\xff")
$a
integer(0)

Warning message:
In FUN(X[[i]], ...) : input string 1 is invalid in this locale
> lapply(dfrm, grep, patt = "\\\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
> lapply(dfrm, grep, patt = "\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale

You may want to switch to grepRaw, since it does not raise the same errors:

> grepRaw("\xff", txt)
[1] 12
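If the goal is a per-column report for a data.frame, a byte-level check that avoids regular expressions altogether may be enough. The sketch below is an alternative to the grepRaw call and not something taken from the question: it uses validUTF8() from base R (available since R 3.3.0) to flag whole strings and charToRaw() to locate non-ASCII bytes. The data frame contents and the helper name `non_ascii_pos` are assumptions for illustration.

```r
# Hypothetical example data; column "a" contains a stray 0xff byte.
dfrm <- data.frame(a = c("ttt\nuuu\tiii\xff", "plain"),
                   b = c("ok", "also ok"),
                   stringsAsFactors = FALSE)

# Row indices, per column, of strings that are not valid UTF-8.
bad_rows <- lapply(dfrm, function(col) which(!validUTF8(as.character(col))))
bad_rows <- bad_rows[lengths(bad_rows) > 0]

# Byte offsets of non-ASCII bytes (values above 0x7f) within a single string.
non_ascii_pos <- function(s) which(as.integer(charToRaw(s)) > 127L)
non_ascii_pos(dfrm$a[1])
```

Because validUTF8() inspects bytes directly, it does not produce the "invalid in this locale" warnings and errors shown in the grep transcript above.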

Alternatively, use tools::showNonASCII, as Duncan Murdoch suggested when this came up on R-help four years ago:

 ?tools::showNonASCII
 # and the help page has a reproducible example of its use:

out <- c(
"fa\xE7ile test of showNonASCII():",
"\\details{",
"   This is a good line",
"   This has an \xfcmlaut in it.",
"   OK again.",
"}")
f <- tempfile()
cat(out, file = f, sep = "\n")

tools::showNonASCIIfile(f)
#-------output appears in red----
1: fa<e7>ile test of showNonASCII():
4:    This has an <fc>mlaut in it.

Generalizing a function to return a list of data.frame columns with invalid UTF-8 bytes/code points
