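I searched Stack Overflow and the web and found a few similar solutions for what I thought would be a very simple problem, but none of them worked. Maybe I'm just not using the right "R" terminology, so here goes... please help.
I have some oddly formatted CSV files that I need to process every day.
Here is a mock-up of the input data: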
This is worthless and I want to get rid of it,,,,,,,,
This is worthless and I want to get rid of it,,,,,,,,
This line may or may not be here,,,,,,,,
This line may or may not be here,,,,,,,,
This line may or may not be here,,,,,,,,
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,
20345604,10.21.1151.12.0,Daisy,Petal,Stem,Data,Data,Data,
20345627,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,
20345600,10.21.1151.12.0,Samson,Petal,Stem,Data,Data,Data,
20345623,10.21.1151.12.0,Cloud,Petal,Stem,Data,Data,Data,
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,
20345704,10.21.1151.12.0,Simmons,Petal,Stem,Data,Data,Data,
20345677,10.21.1151.12.0,Butle,Petal,Stem,Data,Data,Data,
20347600,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,
20745623,10.21.1151.12.0,Unicorn,Petal,Stem,Data,Data,Data,
Notes on the original file:
In the end, I would like to end up with this (given the initial data above):
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,NEWFIELD
20345604,10.21.1151.12.0,Daisy,Petal,Stem,Data,Data,Data,Group1
20345627,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,Group1
20345600,10.21.1151.12.0,Samson,Petal,Stem,Data,Data,Data,Group1
20345623,10.21.1151.12.0,Cloud,Petal,Stem,Data,Data,Data,Group1
20345704,10.21.1151.12.0,Simmons,Petal,Stem,Data,Data,Data,Group2
20345677,10.21.1151.12.0,Butle,Petal,Stem,Data,Data,Data,Group2
20347600,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,Group2
20745623,10.21.1151.12.0,Unicorn,Petal,Stem,Data,Data,Data,Group2
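I tried treating the data as a connection stream, using a series of if / else statements to handle the headers, identify the groups, add the new column, and so on. But I'm having trouble getting it back into a table that I can use with the proper headers.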
Group <- "Start"
processFile = function(datafilepath) {
  con = file(datafilepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      print("EOF")
      break
    }
    if (grepl("Header1", line) & Group == "Start") {
      colnames(result) <- data.frame(paste(line,",","Group"))
      print("Initial Headers found, Switching to Group1")
      Group <- "Group1"
    } else if (grepl("Systems.Name", line) & Group == "Group1") {
      print("Switching to Group2")
      Group <- "Group2"
    } else if (Group == "Start") {print("At Start")}
    if (Group != "Start") {
      indresult <- (paste(line,",", Group))
      result <- rbind(result, indresult)
    }
  }
  return(result)
  close(con)
}
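This code doesn't load the headers properly, and I haven't found a way to load the headers first and then the data. I'm fairly sure that if the other columns can be loaded, adding the extra column should work too, but until I get past this point I can't verify that the resulting data will be treated as a complete data frame.
Main question: is this the right way to approach the problem? If so, how do I get the data into a data frame so that I can work with it?
Thanks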
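The solution I'm currently using: the earlier fread answer came closest, but I had trouble getting my head around it, and the := assignment operator was not recognized in my setup. So this is what I ended up using: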
#This line removes all rows before the first appearance of "Header1"
Data <- fread(paste(Folder, File, sep = ""), skip="Header1")
Group= "Group1"
#Add additional column to data frame to be filled in below
Data$Group= ""
#Loop through each row and add Group - I had tried using simply "Data" instead of 1:nrow(Data) but in that case R only took the initial column of Data and not each row itself.
for (dataline in 1:nrow(Data)) {
  if (Data[dataline,]$"Header1" == "Header1" & Group == "Group1") {
    #Reached second row of Headers indicating Group change
    Group <- "Group2"
    next
  }
  #Assign Group
  Data[dataline,]$Group <- Group
}
#Remove Duplicate Header rows (the column is named Header1, not Header)
Data <- Data[!(Data$Header1 == "Header1"),]
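It's slow (about 4-5 minutes to run on 50,000 rows), but at least it's automated and does what I need. If there is a way to speed it up, feel free to add it. Thanks!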
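One possible way to avoid the row-by-row loop is to label every line in a single vectorized pass: since each repeated header row starts a new group, a running count of header rows (cumsum) gives the group number. The following is only a sketch in base R, under the assumptions that every header row starts with "Header1," and that each line ends with a trailing comma as in the sample above. The function name read_grouped_csv and its header_prefix argument are made up for illustration.

```r
# Hypothetical helper (not from the original post): reads a CSV whose header
# row repeats at the start of each group and adds a NEWFIELD group label.
read_grouped_csv <- function(path, header_prefix = "Header1,") {
  lines <- readLines(path)
  is_header <- startsWith(lines, header_prefix)
  if (!any(is_header)) stop("no header row found")

  # Drop the junk lines above the first header row, and any blank lines.
  keep <- seq_along(lines) >= which(is_header)[1] & nzchar(trimws(lines))
  lines <- lines[keep]
  is_header <- is_header[keep]

  # Each header row starts a new group, so a running count labels every line.
  group <- paste0("Group", cumsum(is_header))

  # Strip the trailing comma so no empty extra column is created.
  lines <- sub(",$", "", lines)

  # Parse the first header plus all data rows; keep everything as character.
  dat <- read.csv(text = lines[c(1, which(!is_header))],
                  colClasses = "character")
  dat$NEWFIELD <- group[!is_header]
  dat
}
```

Called as read_grouped_csv(paste(Folder, File, sep = "")), this should return one data frame with a single header and a NEWFIELD column. All columns come back as character, so numeric columns would still need converting afterwards.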
Answer 0 (score: 2)
Something like this:
x = 'This is worthless and I want to get rid of it,,,,,,,,
This is worthless and I want to get rid of it,,,,,,,,
This line may or may not be here,,,,,,,,
This line may or may not be here,,,,,,,,
This line may or may not be here,,,,,,,,
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,
20345604,10.21.1151.12.0,Daisy,Petal,Stem,Data,Data,Data,
20345627,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,
20345600,10.21.1151.12.0,Samson,Petal,Stem,Data,Data,Data,
20345623,10.21.1151.12.0,Cloud,Petal,Stem,Data,Data,Data,
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,
20345704,10.21.1151.12.0,Simmons,Petal,Stem,Data,Data,Data,
20345677,10.21.1151.12.0,Butle,Petal,Stem,Data,Data,Data,
20347600,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,
20745623,10.21.1151.12.0,Unicorn,Petal,Stem,Data,Data,Data,'
require(data.table)
require(zoo) # for na.locf
o = fread(x, skip = 5,sep= ',')
# count how many headers
nh = nrow(o[grepl('Header1', V1) & grepl('Header2', V2)])
# add header id
o[grepl('Header1', V1) & grepl('Header2', V2), group := 1:nh]
# fill down header
o[, group := na.locf(group, na.rm = FALSE)]
# remove rows containing 'Header*'
o = o[!grepl('Header1', V1) & !grepl('Header2', V2) ]
o
         V1              V2      V3    V4   V5   V6   V7   V8 V9 group
1: 20345604 10.21.1151.12.0   Daisy Petal Stem Data Data Data NA     1
2: 20345627 10.21.1151.12.0    Rose Petal Stem Data Data Data NA     1
3: 20345600 10.21.1151.12.0  Samson Petal Stem Data Data Data NA     1
4: 20345623 10.21.1151.12.0   Cloud Petal Stem Data Data Data NA     1
5: 20345704 10.21.1151.12.0 Simmons Petal Stem Data Data Data NA     2
6: 20345677 10.21.1151.12.0   Butle Petal Stem Data Data Data NA     2
7: 20347600 10.21.1151.12.0    Rose Petal Stem Data Data Data NA     2
8: 20745623 10.21.1151.12.0 Unicorn Petal Stem Data Data Data NA     2
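Here x should be the path to your CSV file.
Also take a look at the documentation for data.table::fread, which has more parameters that may be useful here.
You can go further and use setnames() to change the column names, and convert columns from character to numeric where the original dataset calls for it.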
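As a rough illustration of that last step, the sketch below renames the generic V* columns and converts the ID column to integer. It uses a small hand-built stand-in for the table o with only three of the columns. The NEWFIELD label built from the group number is an assumption based on the output the question asks for.

```r
library(data.table)

# Small stand-in for the table `o` from the answer (values taken from the question).
o <- data.table(
  V1    = c("20345604", "20345704"),
  V2    = c("10.21.1151.12.0", "10.21.1151.12.0"),
  V3    = c("Daisy", "Simmons"),
  group = c(1L, 2L)
)

# Rename the generic V* columns by reference.
setnames(o, old = c("V1", "V2", "V3"),
         new = c("Header1", "Header2", "Header3"))

# Convert the ID column from character to integer.
o[, Header1 := as.integer(Header1)]

# Assumption: the question wants labels of the form "Group1", "Group2", ...
o[, NEWFIELD := paste0("Group", group)]
```

Note that := is only available once data.table has been loaded, and only inside the square brackets of a data.table.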