Question

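I am trying to read a fairly large file with the read.table.ffdf method from the ff library. Unfortunately, the table's column names contain spaces, tabs, and other special characters. It looks roughly like this (but with about 400 columns):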

attribute_1;next attribute;who creates, these horrible) column&nämes
198705;RXBR ;2017-07-05 00:00:00

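This is not nice, I know, but I am forced to work with it, so I have to set check.names to FALSE.

In addition, I generate a list containing the column class types, like this: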

path <- 'path_to_csv-file'
headset <- read.csv(path, sep= ';', dec= '.', header = TRUE, nrows = 2, check.names = FALSE)
#print(headset)
headclasses <- vector(mode = 'character', length = 0)


#heavily simplified version - switch_statement is  in an extra function
for(i in colnames(headset)){
  headclasses[[i]] <- switch (i,
                              'attribute_1' = 'numeric',
                              'next attribute' = 'factor',
                              'who creates, these horrible) column&nämes' = 'POSIXct'
                              )
}
#print(colnames(headset))
#print(headclasses)

Now, if I call:

df <- read.table.ffdf(file=path, levels = NULL, appendLevels = TRUE, FUN = 'read.table', na.strings = c('\\N',''), sep= ';', dec= '.', colClasses = headclasses, check.names = FALSE , header = TRUE, nrows = 1e4, VERBOSE = TRUE)

I get the following error:

Error in repnam(colClasses, colnames(x), default = NA) : the following argument names do not match 'next attribute', 'who creates, these horrible) column&nämes'

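Why does this error occur? How can I fix it so that I keep the uglier strings as column names?

Note that check.names is set to FALSE in the last call.

What I have tried so far: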

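1. Using the corrected names, but with the wrong check.names option, when calling read.table.ffdf

If I let R pick the corrected column names (i.e. check.names = TRUE in the first read call) and adjust the switch statement accordingly, there is no error (only a warning), even though I set check.names = FALSE in the read.table.ffdf call: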

headset <- read.csv(path, sep= ';', dec= '.', header = TRUE, nrows = 2)
print(headset)
headclasses <- vector(mode = 'character', length = 0)


#heavily simplified version - switch_statement is  in an extra function
for(i in colnames(headset)){
  headclasses[[i]] <- switch (i,
                              'attribute_1' = 'numeric',
                              'next.attribute' = 'factor',
                              'who.creates..these.horrible..column.nämes' = 'POSIXct'
                              )
}
print(colnames(headset))
print(headclasses)

my_df <- read.table.ffdf(file=path, levels = NULL, appendLevels = TRUE, FUN = 'read.table', na.strings = c('\\N',''), sep= ';', dec= '.', colClasses = headclasses, check.names = FALSE , header = TRUE, nrows = 2, VERBOSE = TRUE)
print(my_df)
print(colnames(my_df))

"attribute_1" "next.attribute" "who.creates..these.horrible..column.nämes"

Warning message:
In read.table(na.strings = c("\\N", ""), sep = ";", dec = ".", colClasses = list( :
  not all columns named in 'colClasses' exist

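So this works, when it shouldn't? Apparently check.names is dropped somewhere along the way when calling read.table.ffdf, so it gets lost.

2. Inspecting the read.table.ffdf source code

I went to the rdrr.io website (read.table.ffdf-source-code) to look at the source code and tried to understand what I was doing wrong. In short, this is what happens with my file: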

rt.args <- list(na.strings = c('\\N',''), sep= ';', dec= '.', colClasses = headclasses, check.names = FALSE , header = TRUE, nrows = 2)
rt.args$file <- path
asffdf_args <- list()

FUN <- 'read.table'
dat <- do.call(FUN, rt.args)
x <- do.call("as.ffdf", c(list(dat), asffdf_args))
#print(colnames(dat))
#print(colnames(x))

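which yields

"attribute_1" "next attribute" "who creates, these horrible) column&nämes"

"attribute_1" "next.attribute" "who.creates..these.horrible..column.nämes"

OK, so this is where it goes wrong: the names survive read.table but are altered by as.ffdf.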
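For illustration, the renaming in the second line of output looks like what base R's make.names() does to syntactically invalid names. The snippet below is a minimal sketch using base R only; that as.ffdf goes through make.names() internally is an assumption here, suggested by the output above but not proven by this code.

```r
# Column names as they appear in the file header
orig_names <- c("attribute_1",
                "next attribute",
                "who creates, these horrible) column&nämes")

# make.names() replaces characters that are invalid in R names with "."
# (how non-ASCII letters such as "ä" are treated depends on the locale)
safe_names <- make.names(orig_names, unique = TRUE)

print(safe_names[1:2])
# "attribute_1" is already valid and stays unchanged;
# "next attribute" becomes "next.attribute"
```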

I don't know which asffdf_args to pass, because I am new to R and I am not sure what to use other than some kind of check.names. I have looked at the as.ffdf.data.frame method via

getAnywhere(as.ffdf.data.frame)

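but that does not help me understand what I should pass in. So, how do I get read.table.ffdf to work with the uglier column names? Which asffdf_args do I have to pass so that check.names = FALSE takes effect in the method above?

I could adjust my switch statement (for roughly 400 columns), read the file with check.names = TRUE, and set the column names to the desired ones once read.table.ffdf has finished (because I need the uglier names later on). But that feels like a workaround and does not satisfy me at all.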
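The workaround described above does not necessarily require editing the switch statement by hand. A minimal sketch, under stated assumptions: it uses base R's read.csv on a hypothetical temporary file (with an ASCII-only variant of the third column name) instead of read.table.ffdf, and the variable names orig_names and safe_classes are made up for illustration. The idea is to keep the class vector keyed by the original names and translate the keys with make.names().

```r
# Hypothetical stand-in for the real file (ASCII-only third column name)
path <- tempfile(fileext = ".csv")
writeLines(c(
  "attribute_1;next attribute;who creates, these horrible) column&names",
  "198705;RXBR ;2017-07-05 00:00:00"
), path)

# Read just the header, keeping the original (unsanitized) names
orig_names <- colnames(read.csv(path, sep = ";", header = TRUE, nrows = 1,
                                check.names = FALSE))

# Column classes keyed by the ORIGINAL names, as the switch statement returns
headclasses <- c("numeric", "factor", "POSIXct")
names(headclasses) <- orig_names

# Translate the keys to the names that check.names = TRUE will produce
safe_classes <- headclasses
names(safe_classes) <- make.names(orig_names, unique = TRUE)

# Read with sanitized names, then restore the original names afterwards
dat <- read.csv(path, sep = ";", dec = ".", header = TRUE,
                colClasses = safe_classes, check.names = TRUE)
names(dat) <- orig_names
print(colnames(dat))
```

With read.table.ffdf, the same safe_classes vector would presumably be passed as colClasses and the names restored on the resulting ffdf afterwards; that part is not exercised by this sketch.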

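This is my first question here, so please be gentle with me, and feel free to point me in the right direction if I have overlooked something important.

Thanks in advance for your help.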

Answer 1

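As it stands, you probably cannot pass the argument in the way you would like.

as.ffdf.data.frame() calls ffdf() in its last line, and ffdf() in turn calls make.names() several times without checking any argument.

If you edit ffdf() and comment out the line vnam <- make.names(vnam, unique = TRUE) near the end of the function, then as.ffdf.data.frame() will be able to keep your funky column names.
I am not providing the modified version of ffdf here, because the function is over 300 lines long.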

I tested this with a new function, ffdf_new, injected as follows:

# save original version
orig <- ff::ffdf

# devtools::install_github("miraisolutions/godmode")
godmode:::assignAnywhere("ffdf", ffdf_new)

# simple test below
DF <- data.frame(
  'attribute_1' = 1:10,
  'next attribute' = 3:12,
  'who creates, these horrible) column&nämes' = 11:20,
  check.names = FALSE
)

as.ffdf.data.frame(DF)[["who creates, these horrible) column&nämes"]]
## ff (open) integer length=10 (10)
##  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] 
##   11   12   13   14   15   16   17   18   19   20 

# switch back
godmode:::assignAnywhere("ffdf", orig)
