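TL;DR: I believe this is actually a simple problem that just needs a careful explanation to set the context. The goal is to pass through a file once and build a list of data.frames.
I have a messy .csv file like the one shown below. It contains many "junk" lines: lines holding data of little use or interest, or lines with embedded spaces, tabs, and so on. The valuable lines are:
(a) a detail line
(b) a sub-detail line
(c) a "data-frame-like" object that follows the detail and sub-detail lines.
However, the number of junk lines between (a), (b), and (c) can vary, as in the example (testing.csv). What I would like returned is a list of data.frame objects such as results below, or something very similar (for example, I have considered capturing Detail and SubDetail as additional columns in the data.frame):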
df1 <- data.frame(Item = 1:3, Val1 = c(50, 20, 30), Val2 = c(100, 30, 50))
df2 <- data.frame(Item = 1:2, Val1 = c(20, 30), Val2 = c(30, 50))
df3 <- data.frame(Item = 1:2, Val1 = c(10, 30), Val2 = c(20, 40))
df4 <- data.frame(Item = 1:3, Val1 = c(50, 30, 70), Val2 = c(30, 40, 80))
# One possible desired result structure
results <- list(list(Detail = "01", SubDetail = "ABC", data = df1),
list(Detail = "01", SubDetail = "XYZ", data = df2),
list(Detail = "02", SubDetail = "ABC", data = df3),
list(Detail = "02", SubDetail = "XYZ", data = df4))
str(results)
The sample .csv file (testing.csv) resembles this snippet:
xxx
xx
DETAIL: Detail 01
Sub-Detail: ABC
x
xxxx
x
Item, Val1, Val2
1, 50, 100
2, 20, 30
3, 30, 50
x
xx
xxx
x
DETAIL: Detail 01
Sub-Detail: XYZ
x
Item, Val1, Val2
1, 20, 30
2, 30, 50
x
x
DETAIL: Detail 02
Sub-Detail: ABC
Item, Val1, Val2
1, 10, 20
2, 30, 40
xxx
xx
x
x
DETAIL: Detail 02
Sub-Detail: XYZ
Item, Val1, Val2
1, 50, 30
2, 30, 40
3, 70, 80
x
xx
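Let's assume I already have a way to identify the "bad lines" in the file. That means I can effectively print the remaining lines like this: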
badLine <- function(line) grepl(pattern = "^$|^\\s|^\\t|^x", line)
con <- file("testing.csv", open = "r")
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
if (badLine(oneLine)) next else print(oneLine)
}
close(con)
Which yields:
# [1] "DETAIL: Detail 01"
# [1] "Sub-Detail: ABC"
# [1] "Item, Val1, Val2"
# [1] "1, 50, 100"
# [1] "2, 20, 30"
# [1] "3, 30, 50"
# [1] "DETAIL: Detail 01"
# [1] "Sub-Detail: XYZ"
# [1] "Item, Val1, Val2"
# [1] "1, 20, 30"
# [1] "2, 30, 50"
# [1] "DETAIL: Detail 02"
# [1] "Sub-Detail: ABC"
# [1] "Item, Val1, Val2"
# [1] "1, 10, 20"
# [1] "2, 30, 40"
# [1] "DETAIL: Detail 02"
# [1] "Sub-Detail: XYZ"
# [1] "Item, Val1, Val2"
# [1] "1, 50, 30"
# [1] "2, 30, 40"
# [1] "3, 70, 80"
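How can I build the results object (or something similar) without making a second pass over the file?
It is safe to assume that the following helper functions can be used to identify their respective lines: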
detailLine <- function(line) grepl(pattern = "^DETAIL: ", line)
subDetailLine <- function(line) grepl(pattern = "^Sub-Detail: ", line)
dfHeaderLine <- function(line) grepl(pattern = "^Item", line)
dfLine <- function(line) grepl(pattern = "^[[:digit:]]", line)
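Answer 0: (score: 1)
It is better to read in all the data first and then apply the filters, rather than applying the filters while reading line by line.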
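# Read in all lines at once
alltext <- readLines("testing.csv")
# Keep only the header and data lines and parse them as one table
# (every column stays character because the header rows are mixed in with the data;
# strip.white = TRUE drops the space that follows each comma)
onedf <- read.csv(text=alltext[dfHeaderLine(alltext) | dfLine(alltext)], stringsAsFactors=FALSE, header=FALSE, strip.white=TRUE)
# Split into one table per header line
alldfs <- split(onedf, cumsum(dfHeaderLine(onedf[,1])))
# Use the first row of each table as its column names, then drop that row
alldfs <- lapply(alldfs, function(x) {names(x) <- unlist(x[1,]);x[-1,]})
# Collect the detail and sub-detail lines as lists
dtl <- as.list(alltext[detailLine(alltext)])
sub <- as.list(alltext[subDetailLine(alltext)])
# Combine the three lists element by element
results <- Map(list, dtl, sub, alldfs)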
# [[1]]
# [[1]][[1]]
# [1] "DETAIL: Detail 01"
#
# [[1]][[2]]
# [1] "Sub-Detail: ABC"
#
# [[1]][[3]]
# Item Val1 Val2
# 2 1 50 100
# 3 2 20 30
# 4 3 30 50
#
#
# [[2]]
# [[2]][[1]]
# [1] "DETAIL: Detail 01"
#
# [[2]][[2]]
# [1] "Sub-Detail: XYZ"
#
# [[2]][[3]]
# Item Val1 Val2
# 6 1 20 30
# 7 2 30 50
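The result above is unnamed and keeps the "DETAIL: " and "Sub-Detail: " prefixes. The following is a minimal base R sketch of how the same read-then-filter idea could be pushed toward the named structure asked for in the question. It assumes every block contains exactly one detail line, one sub-detail line, and one header line, in that order; the shortened inline sample and the names tabtxt, block, dfs, and sbd are illustrative only and are not part of the original answer.

```r
# Shortened stand-in for readLines("testing.csv") (illustrative sample data)
alltext <- c("xxx", "DETAIL: Detail 01", "Sub-Detail: ABC", "x",
             "Item, Val1, Val2", "1, 50, 100", "2, 20, 30", "x",
             "DETAIL: Detail 01", "Sub-Detail: XYZ",
             "Item, Val1, Val2", "1, 20, 30", "2, 30, 50", "xx")

# Helper predicates from the question
detailLine    <- function(line) grepl(pattern = "^DETAIL: ", line)
subDetailLine <- function(line) grepl(pattern = "^Sub-Detail: ", line)
dfHeaderLine  <- function(line) grepl(pattern = "^Item", line)
dfLine        <- function(line) grepl(pattern = "^[[:digit:]]", line)

# Keep only header and data lines, and number the blocks by their header line
tabtxt <- alltext[dfHeaderLine(alltext) | dfLine(alltext)]
block  <- cumsum(dfHeaderLine(tabtxt))

# Parse each block on its own so read.csv() sets column names and numeric types
dfs <- lapply(split(tabtxt, block), function(lines) {
  read.csv(text = lines, strip.white = TRUE)
})

# Strip the prefixes from the detail and sub-detail lines
dtl <- sub("^DETAIL: Detail ", "", alltext[detailLine(alltext)])
sbd <- sub("^Sub-Detail: ", "", alltext[subDetailLine(alltext)])

# One named list per block; unname() drops the names Map() derives from dtl
results <- unname(Map(function(d, s, df) list(Detail = d, SubDetail = s, data = df),
                      dtl, sbd, dfs))
str(results)
```

Parsing each block separately avoids the all-character columns that arise when header rows and data rows are read into one table. If a file can contain a detail line without a matching table, the three vectors would no longer line up, so that case would need an explicit check.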
Answer 1: (score: 1)
Building on the suggestion of @PierreLafortune in the comments, and using the data.table package:
alltext <- readLines('testing.csv')
badLine <- function(line) grepl(pattern = "^$|^\\s|^\\t|^x", line)
library(data.table)
DT <- data.table(txt = alltext[!badLine(alltext)])
DT[, grp := cumsum(grepl('DETAIL', txt))
][, `:=` (detail = gsub('DETAIL: Detail ','', grep('DETAIL', txt, value = TRUE)),
subdetail = gsub('Sub-Detail: ','', grep('Sub-Detail', txt, value = TRUE))),
by = grp
][, .SD[4:.N], by = grp
][, c('Item','Val1','Val2') := tstrsplit(txt, ',', type.convert = TRUE)
][, c('grp','txt') := NULL][]
This results in the following data.table:
detail subdetail Item Val1 Val2
1: 01 ABC 1 50 100
2: 01 ABC 2 20 30
3: 01 ABC 3 30 50
4: 01 XYZ 1 20 30
5: 01 XYZ 2 30 50
6: 02 ABC 1 10 20
7: 02 ABC 2 30 40
8: 02 XYZ 1 50 30
9: 02 XYZ 2 30 40
10: 02 XYZ 3 70 80
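Explanation:
After the bad lines are removed with the badLine function, the remaining text is converted to a one-column data.table with data.table(txt = alltext[!badLine(alltext)]).
With [, grp := cumsum(grepl('DETAIL', txt))] you create a grouping variable that separates the different data blocks. grepl('DETAIL', txt) creates a logical vector that flags the DETAIL lines (which mark the start of a new data block). Applying cumsum to it creates the grouping variable.
With detail = gsub('DETAIL: Detail ','', grep('DETAIL', txt, value = TRUE)) you extract the detail number (and likewise for subdetail).
With [, .SD[4:.N], by = grp] you remove the first three lines of each group (because they contain no data, and the information needed from them was already extracted in the previous steps).
With [, c('Item','Val1','Val2') := tstrsplit(txt, ',', type.convert = TRUE)] you convert the data, which is still in text form in the txt column, into three data columns. type.convert = TRUE makes sure the data gets the right type (numeric in this case).
With [, c('grp','txt') := NULL] you remove the grp and txt columns (because they are no longer needed).
To see what each step does, you can also use the following code: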
DT[, grp := cumsum(grepl('DETAIL', txt))][]
DT[, `:=` (detail = gsub('DETAIL: Detail ','', grep('DETAIL', txt, value = TRUE)),
subdetail = gsub('Sub-Detail: ','', grep('Sub-Detail', txt, value = TRUE))),
by = grp][]
# This step returns a new table instead of updating DT by reference, so reassign it
DT <- DT[, .SD[4:.N], by = grp]
DT[]
DT[, c('Item','Val1','Val2') := tstrsplit(txt, ',', type.convert = TRUE)][]
DT[, c('grp','txt') := NULL][]
Adding [] to each line makes sure the result is printed to the console.
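The grouping step in the explanation is the core of this approach, and it can be tried in isolation with base R alone. The sketch below uses a short, made-up txt vector (an assumption for illustration, already free of bad lines) to show how cumsum turns the start-of-block flags into a block number for every line.

```r
# Illustrative lines, as they would look after the bad lines are removed
txt <- c("DETAIL: Detail 01", "Sub-Detail: ABC", "Item, Val1, Val2", "1, 50, 100",
         "DETAIL: Detail 01", "Sub-Detail: XYZ", "Item, Val1, Val2", "1, 20, 30", "2, 30, 50")

# TRUE on each line containing "DETAIL"; the match is case-sensitive,
# so the "Sub-Detail" lines are not flagged
starts <- grepl("DETAIL", txt)

# The running count of TRUE values gives every line the number of its block
grp <- cumsum(starts)
grp
# [1] 1 1 1 1 2 2 2 2 2

# Split by block and drop the three leading non-data lines of each block
dataLines <- lapply(split(txt, grp), function(b) tail(b, -3))
dataLines
```

Here tail(b, -3) plays the role of .SD[4:.N]. One difference worth knowing: tail(b, -3) returns an empty vector for a block with no data rows, whereas 4:.N counts backwards when a group has fewer than four rows.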