R Statistics: Reading a subset of an irregular file into a data frame

Time: 2018-05-27 20:01:32

Tags: r text subset read.table

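I have a text file containing 4 separate components (the source, story, usage, and actual data associated with a dataset). I want to read each component into a separate R object.

Below is an example of the file format. Every file uses the keywords SOURCE, STORY, USAGE, and DATASET as delimiters.

Example dataset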

SOURCE
Boxofficemojo.com

STORY
These lines, of variable length and number, would contain the story behind the dataset.

USAGE
"Course"    "Year"  "Section"   "Exercise"
"Course1"   5   9   "ex 3"
"Course1"   5   9   "ex 4"
"Course1"   5   9   "ex 5"
"Course2"   5   9   "ex 3"
"Course2"   5   9   "ex 4"

DATASET
Dataset with headers follows. 

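My only problem is reading the USAGE section as a data frame. I wrote a quick line-by-line parser that scans the file for the keywords USAGE and DATASET and returns their line numbers. With hard-coded values, this code works: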

Usage <- read.table(Output.File, skip= 9, nrows = 6, header = TRUE)

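but this code does not:

Usage <- read.table(Output.File, skip= Beginrow, nrows = Endrow - Beginrow, header = TRUE)

How can I make read.table() or any other function accept variables for skip and nrows? Alternatively, is there a simpler way to read the data between USAGE and DATASET into a data frame?

USAGE will always have 4 columns, with the same header names as in the file above, but the number of usage rows can range from 1 to any arbitrary number.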

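2 answers:

Answer 0: (score: 0)

The idea is that you first extract the part of the string that contains the relevant data, and then read that substring as CSV. In the solution below, strsplit is used to get the part between USAGE and DATASET, regardless of the number of rows. I basically split the string into convenient pieces. See the documentation for strsplit (?strsplit) to learn more: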

str <- 'SOURCE
Boxofficemojo.com

STORY
These lines, of variable length and number, would contain the story behind the dataset.

USAGE
"Course"    "Year"  "Section"   "Exercise"
"Course1"   5   9   "ex 3"
"Course1"   5   9   "ex 4"
"Course1"   5   9   "ex 5"
"Course2"   5   9   "ex 3"
"Course2"   5   9   "ex 4"

DATASET
Dataset with headers follows.'

# get the desired part of the string
datasetStr <- strsplit(paste0(strsplit(str, 'USAGE')[[1]][2]), 'DATASET')[[1]][1]
# read it as data frame
df <- read.csv(text = datasetStr, sep = '\t')

Output

> df
  Course....Year..Section...Exercise
1             Course1   5   9   ex 3
2             Course1   5   9   ex 4
3             Course1   5   9   ex 5
4             Course2   5   9   ex 3
5             Course2   5   9   ex 4
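
The output above has a single column because the fields in the sample are separated by runs of spaces, while sep = '\t' expects tabs. A minimal sketch, assuming the real file is whitespace-delimited like the sample, swaps in read.table with its default separator. The shortened txt string and the names usage_txt and usage_df are stand-ins for illustration only.

```r
# Shortened stand-in for the file contents (assumption: fields are
# separated by spaces, as in the sample above, not by tabs).
txt <- 'USAGE
"Course"    "Year"  "Section"   "Exercise"
"Course1"   5   9   "ex 3"
"Course2"   5   9   "ex 4"

DATASET
Dataset with headers follows.'

# Same strsplit idea: keep what lies between USAGE and DATASET.
usage_txt <- strsplit(strsplit(txt, "USAGE")[[1]][2], "DATASET")[[1]][1]

# read.table's default sep = "" splits on any run of whitespace and
# keeps quoted values such as "ex 3" together as one field.
usage_df <- read.table(text = usage_txt, header = TRUE)
str(usage_df)
```

If the real file does use tabs, passing sep = '\t' to read.table should behave the same way on this layout.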

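Answer 1: (score: 0)

Here is a somewhat extensible approach. First, read the whole file into a variable with readLines. I use textConnection here for reproducibility, but you should read from the file.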

x <- readLines(con=textConnection('
SOURCE
Boxofficemojo.com

STORY
These lines, of variable length and number, would contain the story behind the dataset.

USAGE
"Course"    "Year"  "Section"   "Exercise"
"Course1"   5   9   "ex 3"
"Course1"   5   9   "ex 4"
"Course1"   5   9   "ex 5"
"Course2"   5   9   "ex 3"
"Course2"   5   9   "ex 4"

DATASET
Dataset with headers follows.'))

Filter out the leading blank line that I introduced:

head(x)
# [1] ""                                                                                       
# [2] "SOURCE"                                                                                 
# [3] "Boxofficemojo.com"                                                                      
# [4] ""                                                                                       
# [5] "STORY"                                                                                  
# [6] "These lines, of variable length and number, would contain the story behind the dataset."
allcaps <- grep("^[A-Z]+$", x)
if (allcaps[1] > 1) x <- x[-(1:(allcaps[1]-1))]

I am inferring that lines consisting only of capital letters are "headers". This could also be done with cumsum(x %in% c("USAGE",...)).

str( x2 <- split(x, cumsum(grepl("^[A-Z]+$", x))) )
# List of 4
#  $ 1: chr [1:3] "SOURCE" "Boxofficemojo.com" ""
#  $ 2: chr [1:3] "STORY" "These lines, of variable length and number, would contain the story behind the dataset." ""
#  $ 3: chr [1:8] "USAGE" "\"Course\"    \"Year\"  \"Section\"   \"Exercise\"" "\"Course1\"   5   9   \"ex 3\"" "\"Course1\"   5   9   \"ex 4\"" ...
#  $ 4: chr [1:2] "DATASET" "Dataset with headers follows."
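
The keyword-based alternative mentioned above might look like the following sketch. The input vector is a shortened, hypothetical version of the sample, and the names keywords and section_id are introduced only for illustration.

```r
# Hypothetical shortened input; `keywords` lists the section names
# given in the question.
x <- c("SOURCE", "Boxofficemojo.com", "",
       "STORY", "A STORY line that is not a header.", "",
       "USAGE", "\"Course\" \"Year\"", "\"Course1\" 5", "",
       "DATASET", "Dataset with headers follows.")
keywords <- c("SOURCE", "STORY", "USAGE", "DATASET")

# %in% matches whole lines only, so a keyword inside a sentence is ignored.
# cumsum() increments at each keyword line, so every line gets the
# index of the section it belongs to.
section_id <- cumsum(x %in% keywords)
x2 <- split(x, section_id)
str(x2)
```

Matching an explicit keyword list is stricter than the all-caps regular expression: an all-caps line inside STORY would not start a new section.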

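(You could also remove the trailing empty strings, perhaps with something like x2 <- lapply(x2, head, n=-1), but the last element would suffer because it does not have one. Applying Filter(nchar, ...) to each element of x2 might also be useful, but that assumes there are no "intentional" blank lines. Over to you.)

The next step may be cosmetic, but it makes each "header" the name of its list element, with the lines that follow as its data: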
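
One way to sidestep the problem with the last element is to drop the final string only when it is empty. This is a minimal sketch on hypothetical data shaped like x2; drop_trailing_blank is a helper name invented here.

```r
# Hypothetical sections, shaped like x2 above; the last one has no
# trailing empty string.
x2 <- list(`1` = c("SOURCE", "Boxofficemojo.com", ""),
           `2` = c("DATASET", "Dataset with headers follows."))

# Remove the final element only when it is an empty string.
drop_trailing_blank <- function(v) {
  n <- length(v)
  if (n > 0 && !nzchar(v[n])) v[-n] else v
}
x2_trimmed <- lapply(x2, drop_trailing_blank)

# Alternative: drop every empty string within each section. This also
# removes intentional blank lines, for example inside STORY.
x2_nonblank <- lapply(x2, function(v) v[nzchar(v)])
```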

str( x3 <- setNames(lapply(x2, `[`, -1L),
                    sapply(x2, `[`, 1L)) )
# List of 4
#  $ SOURCE : chr [1:2] "Boxofficemojo.com" ""
#  $ STORY  : chr [1:2] "These lines, of variable length and number, would contain the story behind the dataset." ""
#  $ USAGE  : chr [1:7] "\"Course\"    \"Year\"  \"Section\"   \"Exercise\"" "\"Course1\"   5   9   \"ex 3\"" "\"Course1\"   5   9   \"ex 4\"" "\"Course1\"   5   9   \"ex 5\"" ...
#  $ DATASET: chr "Dataset with headers follows."

Finally, you can do whatever you like with the individual elements:

x3$USAGE <- read.table(textConnection(x3$USAGE), header=TRUE)
str(x3)
# List of 4
#  $ SOURCE : chr [1:2] "Boxofficemojo.com" ""
#  $ STORY  : chr [1:2] "These lines, of variable length and number, would contain the story behind the dataset." ""
#  $ USAGE  :'data.frame':  5 obs. of  4 variables:
#   ..$ Course  : Factor w/ 2 levels "Course1","Course2": 1 1 1 2 2
#   ..$ Year    : int [1:5] 5 5 5 5 5
#   ..$ Section : int [1:5] 9 9 9 9 9
#   ..$ Exercise: Factor w/ 3 levels "ex 3","ex 4",..: 1 2 3 1 2
#  $ DATASET: chr "Dataset with headers follows."
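
For comparison, the line-number approach from the question can also be driven by variables. The sketch below assumes exactly one blank line before DATASET and uses a shortened, hypothetical copy of the sample written to a temporary file. The names path, begin_row, end_row, usage, and usage2 are illustrative.

```r
# Write a shortened, hypothetical copy of the sample to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c("SOURCE", "Boxofficemojo.com", "",
             "USAGE",
             "\"Course\" \"Year\" \"Section\" \"Exercise\"",
             "\"Course1\" 5 9 \"ex 3\"",
             "\"Course2\" 5 9 \"ex 4\"",
             "",
             "DATASET", "Dataset with headers follows."), path)

lines <- readLines(path)
begin_row <- which(lines == "USAGE")    # line number of the USAGE keyword
end_row   <- which(lines == "DATASET")  # line number of the DATASET keyword

# Skip everything up to and including USAGE; the header is read next.
# nrows counts data rows only, so subtract the header line and the
# single blank line assumed to precede DATASET.
usage <- read.table(path, skip = begin_row,
                    nrows = end_row - begin_row - 3,
                    header = TRUE)

# Variant that does not depend on the blank-line count: pass the lines
# between the two keywords as text; blank lines are skipped by default.
usage2 <- read.table(text = lines[(begin_row + 1):(end_row - 1)],
                     header = TRUE)
```

The second variant avoids the nrows arithmetic entirely, which may be more robust if the number of blank lines varies between files.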