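I am developing a script to aggregate CSV files generated by several systems. Below is the error I am hitting. I believe it occurs because the CSV files are generated with 14 standardized column headers, but every so often data shows up in additional columns that have no header.
I am stuck on how to concatenate the data from the headerless columns into the 14th column, since they appear to be additional memos that need to be kept.
Error: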
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 521 did not have 14 elements
Data on line 522 of the file:
> scan("1428477090.csv", "character", skip=521, n=1, sep="\n")
Read 1 item
[1] "207.4,64.6,1.6,70,0.970,169.50,281,0.4,68,175.40,0.37,2015/04/08,04:33:20,BIT DEPTH CHANGED TO 116.0 FEET,HOLE DEPTH CHANGED TO 116.0 FEET"
Code:
serverPath = "C:/Users/*****/Desktop/Pason/"
filenames = list.files(path = serverPath, pattern = '[.]csv')
idx=1
df = read.table(file = paste(serverPath, filenames[idx], sep = ""), header = T, sep =",", na.strings = "-999.25", check.names=F)
CSV format and data:
Hole Depth,Hook Load,Weight on Bit,Rotary RPM,Convertible Torque,On Bottom ROP,Total Pump Output,Differential Pressure,Standpipe Pressure,Rate Of Penetration,Time Of Penetration,YYYY/MM/DD,HH:MM:SS,Memos
2531.4,42.6,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:40:00,
2531.4,42.5,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:40:20,
2531.4,42.5,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:40:40,
2531.4,42.8,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:41:00,
Answer 0 (score: 0)
1) I saved your CSV data to the file "a.csv" in the "data" folder. The function read.csv works fine for me, except that the last column is filled with NA:
read.csv("./data/a.csv")
# Hole.Depth Hook.Load Weight.on.Bit Rotary.RPM Convertible.Torque On.Bottom.ROP Total.Pump.Output
# 1 2531.4 42.6 0 0 0 0 0
# 2 2531.4 42.5 0 0 0 0 0
# 3 2531.4 42.5 0 0 0 0 0
# 4 2531.4 42.8 0 0 0 0 0
# Differential.Pressure Standpipe.Pressure Rate.Of.Penetration Time.Of.Penetration YYYY.MM.DD HH.MM.SS
# 1 -1141.7 0 0 0 2015/04/08 01:40:00
# 2 -1141.7 0 0 0 2015/04/08 01:40:20
# 3 -1141.7 0 0 0 2015/04/08 01:40:40
# 4 -1141.7 0 0 0 2015/04/08 01:41:00
# Memos
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# Warning message:
# In read.table(file = file, header = header, sep = sep, quote = quote, :
# incomplete final line found by readTableHeader on './data/a.txt'
2) I added your string as the last line of the file a.csv:
bad_string <- "207.4,64.6,1.6,70,0.970,169.50,281,0.4,68,175.40,0.37,2015/04/08,04:33:20,BIT DEPTH CHANGED TO 116.0 FEET,HOLE DEPTH CHANGED TO 116.0 FEET"
The following code also works for me:
serverPath = "./data/"
filenames = list.files(path = serverPath, pattern = '[.]csv')
idx=1
df = read.csv(file = paste(serverPath, filenames[idx], sep = ""),
na.strings = "-999.25")
The only problem is that, in the "Memos" field, the part of the string after the 14th comma in the row is missing.
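To find out which lines carry extra fields before deciding how to handle them, one option is count.fields from the utils package. The sketch below is only an illustration: it uses a small hypothetical three-column file written to a temporary path (not your real 14-column data) and treats the header line as the expected width.

```r
# Hypothetical sample: a 3-column header, one normal row, and one row
# whose memo contains an extra comma (so it splits into 4 fields).
tmp <- tempfile(fileext = ".csv")
writeLines(c("Depth,Time,Memos",
             "2531.4,01:40:00,",
             "207.4,04:33:20,BIT DEPTH CHANGED,HOLE DEPTH CHANGED"), tmp)

# Count comma-separated fields on every line; quote and comment handling
# are switched off so that memo text is not interpreted specially.
n_fields <- count.fields(tmp, sep = ",", quote = "", comment.char = "")

# Assumption: the header line defines the expected number of columns.
expected <- n_fields[1]

# File line numbers (header included) that have more fields than expected.
bad_lines <- which(n_fields > expected)
print(bad_lines)
```

On a real file, replace tmp with the path to the CSV; the line numbers in bad_lines count the header as line 1.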
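3) There should be no commas inside the "Memos" field, so every line in the file (let's call it "bad_string") should contain exactly 13 commas (because you have 14 columns). Within a line, I suggest replacing every comma after the 13th with a semicolon (or some other symbol), or incorporating the code below into your analysis. I think it is possible to write something more efficient, but this works too: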
CtoS <- function(bad_string) {
  # CtoS (comma to semicolon): if bad_string contains more than 13 commas,
  # replace the 14th comma and every later one with a semicolon.
  chars <- strsplit(bad_string, "")[[1]]      # split the line into single characters
  indices_of_commas <- which(chars == ",")    # positions of all commas
  number_of_commas <- length(indices_of_commas)
  if (number_of_commas >= 14) {
    # The extra commas belong to the "Memos" field, so replace them.
    chars[indices_of_commas[14:number_of_commas]] <- ";"
    return(paste0(chars, collapse = ""))
  }
  bad_string                                  # 13 commas or fewer: nothing to fix
}
lines_from_file <- scan("./data/a.csv", "character", sep="\n")
# replace unnecessary commas by using function CtoS():
corrected_lines <- unlist(lapply(lines_from_file,CtoS))
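As a possible alternative to splitting every line into single characters, the same replacement can be sketched with a regular expression. The function name merge_extra_fields and its parameters n_cols and memo_sep are hypothetical, introduced only for this illustration; the idea is to capture the first n_cols - 1 comma-terminated fields and then swap any commas in the remainder.

```r
# Hypothetical helper: keep the first (n_cols - 1) separators and turn any
# later commas into memo_sep, so the memo text stays in the last column.
merge_extra_fields <- function(lines, n_cols = 14, memo_sep = ";") {
  # Group 1: the first (n_cols - 1) fields with their trailing commas.
  # Group 2: everything after them (the memo text, possibly empty).
  pattern <- sprintf("^((?:[^,]*,){%d})(.*)$", n_cols - 1)
  parts <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  vapply(seq_along(lines), function(i) {
    p <- parts[[i]]
    if (length(p) < 3) return(lines[i])  # too few commas to match: keep as is
    paste0(p[2], gsub(",", memo_sep, p[3], fixed = TRUE))
  }, character(1))
}

bad_string <- "207.4,64.6,1.6,70,0.970,169.50,281,0.4,68,175.40,0.37,2015/04/08,04:33:20,BIT DEPTH CHANGED TO 116.0 FEET,HOLE DEPTH CHANGED TO 116.0 FEET"
good_line <- "2531.4,42.6,0.0,0,0.000,0.00,0,-1141.7,0,0.00,0.00,2015/04/08,01:40:00,"
fixed <- merge_extra_fields(c(good_line, bad_string))
print(fixed)
```

Lines that already have 13 commas come back unchanged, so the function can be applied to the whole vector returned by scan, header included.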
There should be a way to convert these strings directly into a data frame. Unfortunately, that is beyond my knowledge. My workaround is this:
# NOTE!!! Always keep a copy of your original files in
# a separate directory to prevent overwriting.
dir.create("./data copy/") # a new directory for processed files.
# Give name to a new file, that is different from original filename.
# I gave the other extension (.txt instead of .csv) and created a new folder.
fileConnection<-file("./data copy/a.txt") # save to a new file.
writeLines(corrected_lines, fileConnection)
close(fileConnection)
Load the new file as a data frame:
df = read.csv(file = "./data copy/a.txt", na.strings = "-999.25")
print(df)
The "Memos" column after this procedure:
Memos
1
2
3
4
5 BIT DEPTH CHANGED TO 116.0 FEET;HOLE DEPTH CHANGED TO 116.0 FEET
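If the intermediate file is not wanted, read.csv also accepts a character vector through its text argument, which may be one way to turn the corrected strings directly into a data frame. The sketch below uses a small hypothetical three-column vector in place of the real corrected lines, so the column names here are assumptions for illustration only.

```r
# Hypothetical stand-in for the corrected lines: header plus two data rows,
# with the extra memo already joined by a semicolon.
corrected_lines <- c(
  "Depth,Time,Memos",
  "2531.4,01:40:00,",
  "207.4,04:33:20,BIT DEPTH CHANGED;HOLE DEPTH CHANGED"
)

# Each element of the vector is treated as one line of the CSV.
df <- read.csv(text = corrected_lines, na.strings = "-999.25",
               stringsAsFactors = FALSE)
print(df)
```

This skips writeLines and the second directory entirely, at the cost of holding both the raw lines and the data frame in memory at the same time.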
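Answer 1 (score: 0)
Thank you all for your input. It was determined that the extra comments do not need to be kept. I used the following code to drop the additional rows that the headerless columns caused when the file was read: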
#Remove rows with NA values
dfAllData <- na.omit(dfAllData)
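One caveat: na.omit drops every row that has an NA in any column, so rows where a sensor value was read as NA (for example via na.strings = "-999.25") are removed as well. If that is not intended, a narrower filter can be sketched with complete.cases on a few key columns. The data frame and the choice of key_cols below are hypothetical, picked only to illustrate the difference.

```r
# Hypothetical data: row 2 is a spill-over row (all NA), while row 3 is a
# real reading whose Hook.Load happens to be missing.
dfAllData <- data.frame(
  Hole.Depth = c(2531.4, NA, 207.4),
  Hook.Load  = c(42.6, NA, NA),
  HH.MM.SS   = c("01:40:00", NA, "04:33:20"),
  stringsAsFactors = FALSE
)

# Assumption: a genuine row always has a depth and a timestamp.
key_cols <- c("Hole.Depth", "HH.MM.SS")
dfKept <- dfAllData[complete.cases(dfAllData[, key_cols]), ]

print(nrow(na.omit(dfAllData)))  # rows surviving the blanket filter
print(nrow(dfKept))              # rows surviving the narrower filter
```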