Question

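I am developing a script to aggregate CSV files generated by multiple systems. Below is the error I am getting. I believe it occurs because the CSV files are generated with 14 standardized column headers, but every so often a row carries data in additional columns that have no header.

I am stuck on how to concatenate the data from the header-less columns into column 14, since they appear to be extra memos that need to be preserved.

Error: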

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : line 521 did not have 14 elements

Data on line 522:

> scan("1428477090.csv", "character", skip=521, n=1, sep="\n")
Read 1 item
[1] "207.4,64.6,1.6,70,0.970,169.50,281,0.4,68,175.40,0.37,2015/04/08,04:33:20,BIT DEPTH CHANGED TO 116.0 FEET,HOLE DEPTH CHANGED TO 116.0 FEET"

Code:

serverPath = "C:/Users/*****/Desktop/Pason/"
filenames = list.files(path = serverPath, pattern = '[.]csv')
idx=1
df = read.table(file = paste(serverPath, filenames[idx], sep = ""), header = T, sep =",", na.strings = "-999.25", check.names=F)

CSV format and data:

Hole Depth,Hook Load,Weight on Bit,Rotary RPM,Convertible Torque,On Bottom ROP,Total Pump Output,Differential Pressure,Standpipe Pressure,Rate Of Penetration,Time Of Penetration,YYYY/MM/DD,HH:MM:SS,Memos
2531.4,42.6,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:40:00,
2531.4,42.5,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:40:20,
2531.4,42.5,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:40:40,
2531.4,42.8,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:41:00,

Answer 1

1) I saved your CSV data to a file "a.csv" in the "data" folder. The function read.csv works fine for me, except that the last column is filled with NA:

read.csv("./data/a.csv")

# Hole.Depth Hook.Load Weight.on.Bit Rotary.RPM Convertible.Torque On.Bottom.ROP Total.Pump.Output
# 1     2531.4      42.6             0          0                  0             0                 0
# 2     2531.4      42.5             0          0                  0             0                 0
# 3     2531.4      42.5             0          0                  0             0                 0
# 4     2531.4      42.8             0          0                  0             0                 0
# Differential.Pressure Standpipe.Pressure Rate.Of.Penetration Time.Of.Penetration YYYY.MM.DD HH.MM.SS
# 1               -1141.7                  0                   0                   0 2015/04/08 01:40:00
# 2               -1141.7                  0                   0                   0 2015/04/08 01:40:20
# 3               -1141.7                  0                   0                   0 2015/04/08 01:40:40
# 4               -1141.7                  0                   0                   0 2015/04/08 01:41:00
# Memos
# 1    NA
# 2    NA
# 3    NA
# 4    NA

# Warning message:
#     In read.table(file = file, header = header, sep = sep, quote = quote,  :
#                       incomplete final line found by readTableHeader on './data/a.txt'

2) I added your string as the last line of the file a.csv:

bad_string <- "207.4,64.6,1.6,70,0.970,169.50,281,0.4,68,175.40,0.37,2015/04/08,04:33:20,BIT DEPTH CHANGED TO 116.0 FEET,HOLE DEPTH CHANGED TO 116.0 FEET"

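The following code also works for me:

serverPath = "./data/"
# store the file list so that filenames[idx] below is defined
filenames = list.files(path = serverPath, pattern = '[.]csv')
idx=1
df = read.csv(file = paste(serverPath, filenames[idx], sep = ""),
              na.strings = "-999.25")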

The only problem is that, in the "Memos" column, the part of the string after the 14th comma in the row is missing.
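Before repairing anything, it can help to see which lines actually carry extra fields. The sketch below uses count.fields from base R on a small made-up file (three columns instead of 14; the file contents and the names tmp, n_fields and bad_lines are illustrative assumptions, not taken from the original data). Quoting and comment handling are switched off because memo text is free-form.

```r
# Illustrative data only: a 3-column header, one regular row,
# and one row with an extra, un-headed field.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,Memos",
             "1,2,",
             "3,4,first memo,second memo"), tmp)

# Number of comma-separated fields on each line (header included).
n_fields <- count.fields(tmp, sep = ",", quote = "", comment.char = "")

# Lines with more fields than the header carry the extra memo data.
bad_lines <- which(n_fields > n_fields[1])
print(bad_lines)
```

For the real files the same call can be pointed at each CSV; the reported line numbers count the header as line 1.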

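3) The "Memos" field should contain no commas, so every line in the file (let's call such a line "bad_string") should contain exactly 13 commas (since you have 14 columns). I suggest either replacing every comma beyond the 13th in a line with a semicolon (or some other symbol), or incorporating the code below into your analysis. Something more efficient could probably be written, but this works: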

CtoS <- function(bad_string) {
    # CtoS (comma to semicolon): if bad_string contains more than 13 commas,
    # replace the 14th and every later comma with a semicolon.
    chars <- strsplit(bad_string, "")[[1]]    # split the line into single characters
    indices_of_commas <- which(chars == ",")  # positions of all commas
    number_of_commas  <- length(indices_of_commas)
    if (number_of_commas >= 14) {
        # the extra commas belong to the "Memos" section, so replace them
        chars[indices_of_commas[14:number_of_commas]] <- ";"
        good_string <- paste0(chars, collapse = "")
    } else {
        good_string <- bad_string
    }
    return(good_string)  # returned explicitly in both branches
}


lines_from_file <- scan("./data/a.csv", "character", sep="\n")
# replace unnecessary commas by using function CtoS():
corrected_lines <- unlist(lapply(lines_from_file,CtoS))
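As a possible shorter variant of the same idea, the sketch below locates the commas with gregexpr instead of splitting the line into single characters. The function name merge_extra_commas and its parameters n_cols and memo_sep are hypothetical, introduced only for this illustration; it assumes, as above, that every comma after the (n_cols - 1)-th belongs to the memo text.

```r
# Hypothetical helper: keep the first (n_cols - 1) commas as field
# separators and turn every later comma into memo_sep.
merge_extra_commas <- function(line, n_cols = 14, memo_sep = ";") {
    comma_pos <- gregexpr(",", line, fixed = TRUE)[[1]]
    # gregexpr returns -1 when there is no match at all
    if (comma_pos[1] == -1 || length(comma_pos) < n_cols) {
        return(line)  # nothing to merge
    }
    cut <- comma_pos[n_cols - 1]  # last genuine separator
    head_part <- substr(line, 1, cut)
    memo_part <- substr(line, cut + 1, nchar(line))
    paste0(head_part, gsub(",", memo_sep, memo_part, fixed = TRUE))
}

# Small 3-column demonstration (made-up data).
demo_lines <- c("a,b,Memos", "1,2,", "3,4,first memo,second memo")
fixed_lines <- vapply(demo_lines, merge_extra_commas, character(1),
                      n_cols = 3, USE.NAMES = FALSE)
print(fixed_lines)
```

With the default n_cols = 14 it can be applied to lines_from_file in the same way as CtoS.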

There should be a way to convert these strings into a data frame directly. Unfortunately, that is beyond my knowledge. My solution is here:

# NOTE!!! Always keep a copy of your original files in
# a separate directory to prevent overwriting.

dir.create("./data copy/") # a new directory for processed files.

# Give the new file a name that differs from the original filename.
# I used a different extension (.txt instead of .csv) and created a new folder.
fileConnection<-file("./data copy/a.txt") # save to a new file.
writeLines(corrected_lines, fileConnection)
close(fileConnection)

Load the new file as a data frame:

df = read.csv(file = "./data copy/a.txt", na.strings = "-999.25")
print(df)

The "Memos" column after this procedure:

                                                             Memos
1                                                                 
2                                                                 
3                                                                 
4                                                                 
5 BIT DEPTH CHANGED TO 116.0 FEET;HOLE DEPTH CHANGED TO 116.0 FEET
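On the question of turning the corrected strings into a data frame without an intermediate file: read.csv accepts a text argument, so one possible sketch is the following. The three-column corrected_lines vector here is made-up stand-in data; in the workflow above it would be the output of the comma-replacement step.

```r
# Stand-in for the corrected lines produced earlier (illustrative data).
corrected_lines <- c("a,b,Memos",
                     "1,2,",
                     "3,4,first memo;second memo")

# Join the lines into one string and let read.csv parse it in memory.
df <- read.csv(text = paste(corrected_lines, collapse = "\n"),
               na.strings = "-999.25",
               stringsAsFactors = FALSE)
print(df)
```

This skips the write-then-read round trip, which may be convenient when looping over many files.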

Answer 2

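Thank you all for the input. It was determined that the additional comments do not need to be kept. I used the following code to omit the extra rows that were created by the scan error from the header-less columns: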

#Remove rows with NA values
dfAllData <- na.omit(dfAllData)
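Note that na.omit drops every row with an NA in any column, which also includes rows where a reading was -999.25 and was turned into NA by na.strings. If that is not wanted, one possible sketch is to test only selected columns with complete.cases. The data frame and the choice of Hole.Depth as the key column below are assumptions made purely for illustration.

```r
# Made-up data: row 2 stands in for a spill-over row, row 3 is a valid
# record that merely has one missing reading.
dfAllData <- data.frame(
    Hole.Depth = c(2531.4, NA, 207.4),
    Hook.Load  = c(42.6, NA, NA),
    Memos      = c("", "", "BIT DEPTH CHANGED"),
    stringsAsFactors = FALSE
)

# na.omit removes rows 2 and 3, because both contain at least one NA.
all_dropped <- na.omit(dfAllData)

# Checking only the key column(s) keeps row 3.
key_cols <- "Hole.Depth"  # assumed key column
kept <- dfAllData[complete.cases(dfAllData[, key_cols, drop = FALSE]), ]
print(nrow(all_dropped))
print(nrow(kept))
```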
