How do I load a space-delimited table that has spaces inside a field?
Simple example data:
Grade Area School Goals
4 Rural Elm Popular
4 Rural Elm Sports
4 Rural Elm Grades
4 Rural Elm Popular
3 Rural Brentwood Elementary Sports
3 Suburban Ridge Popular
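Note that the second-to-last row has a space inside the school name ("Brentwood Elementary" rather than a single word like "Elm").
The following call fails with "line x did not have y elements":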
dat = read.table("dat.txt",header=TRUE)
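To see which lines are likely to trigger that error, one option is count.fields from base R, which reports the number of whitespace-separated fields per line. This is only a sketch: the lines below are a hand-copied subset of the sample data, and in practice you would point count.fields at the file itself.

```r
# Subset of the sample data; in practice use count.fields("dat.txt") directly
lines <- c("Grade Area School Goals",
           "4 Rural Elm Popular",
           "3 Rural Brentwood Elementary Sports",
           "3 Suburban Ridge Popular")

# count.fields() returns the number of fields found on each line
con <- textConnection(lines)
n_fields <- count.fields(con)
close(con)

# Lines whose field count differs from the header's are the likely culprits
bad <- which(n_fields != n_fields[1])
lines[bad]
```

Any line that shows up here has more (or fewer) fields than the header and needs fixing before read.table can parse it.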
Edit: the data points are all factors with a fixed set of values.
Edit: the full data is available at http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html. Thanks, @AmandaMahto.
Answer 0: (score: 4)
Actually, if you can use the data source Ananda found, it's easy, since the <pre> section is tab-delimited:
library(rvest)
pg <- read_html("http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html")
dat <- pg %>% html_nodes("pre") %>% html_text()
dat <- read.table(text=dat, sep="\t", header=TRUE, stringsAsFactors=FALSE)
dat[245:249,]
##     Gender Grade Age  Race Urban.Rural       School   Goals Grades Sports Looks Money
## 245   girl     4   9 White       Rural         Sand  Grades      1      3     2     4
## 246   girl     4   9 White       Rural         Sand  Sports      3      2     1     4
## 247   girl     4   9 White       Rural         Sand  Sports      3      2     1     4
## 248   girl     4   9 White       Rural         Sand  Grades      2      1     3     4
## 249   girl     6  12 White       Rural Brown Middle Popular      4      2     1     3
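The reason sep="\t" works is that a space inside a field is no longer a delimiter once tabs separate the columns. Here is a small offline sketch of the same idea; the three tab-delimited lines are made up for illustration and are not taken from the DASL page.

```r
# Hypothetical tab-delimited text: the school name contains a space but no tab
txt <- paste("Grade\tArea\tSchool\tGoals",
             "3\tRural\tBrentwood Elementary\tSports",
             "4\tRural\tElm\tPopular",
             sep = "\n")

# With sep = "\t", only tabs split columns, so the space stays inside School
dat <- read.table(text = txt, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
dat$School
```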
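To actually answer your question (this is somewhat similar to Ananda's answer), you need to know where the problem column is and work around it. This one uses gsubfn and a predefined list of values for that column to join the words, read the table, and then split them again: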
library(gsubfn)
# awful.txt is here https://gist.github.com/hrbrmstr/13cee15c91fdadb10fbc
lines <- readLines("awful.txt")
schools <- c("Brentwood Elementary", "Brentwood Middle", "Brown Middle", "Westdale Middle")
expr <- paste("(", paste(schools, collapse="|"), ")", sep="")
lines <- gsubfn(expr, function(x) { gsub(" ", "_", x) }, lines)
dat <- read.table(text=paste(lines, sep="", collapse="\n"),
header=TRUE, stringsAsFactors=FALSE)
dat$School <- gsub("_", " ", dat$School)
dat[c(1,34,94,198,255,324,377,433),]
##     Gender Grade Age  Race Urban.Rural               School   Goals Grades Sports Looks Money
## 1      boy     5  11 White       Rural                  Elm  Sports      1      2     4     3
## 34     boy     4  10 White    Suburban Brentwood Elementary  Grades      2      1     3     4
## 94    girl     6  11 White    Suburban     Brentwood Middle  Grades      3      4     1     2
## 198    boy     5  10 White       Rural                Ridge  Sports      4      2     1     3
## 255   girl     6  12 Other       Rural         Brown Middle  Grades      3      2     1     4
## 324    boy     4   9 Other       Urban                 Main  Grades      4      1     3     2
## 377    boy     4   9 White       Urban              Portage Popular      4      1     2     3
## 433   girl     6  11 White       Urban      Westdale Middle Popular      4      2     1     3
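If you would rather not add a dependency, the same join-then-split idea can probably be done with base R alone. This is a sketch under the same assumption as above (you know every multi-word school name in advance); the four input lines are a made-up subset, not the contents of awful.txt.

```r
# Made-up subset of lines in the same layout as the question's sample data
lines <- c("Grade Area School Goals",
           "3 Rural Brentwood Elementary Sports",
           "4 Rural Elm Popular",
           "6 Rural Brown Middle Grades")
schools <- c("Brentwood Elementary", "Brentwood Middle", "Brown Middle", "Westdale Middle")

# Replace each known multi-word name with an underscore-joined version
for (s in schools) {
  lines <- gsub(s, gsub(" ", "_", s, fixed = TRUE), lines, fixed = TRUE)
}

# Every line now has the same number of fields, so read.table can parse it
dat <- read.table(text = lines, header = TRUE, stringsAsFactors = FALSE)

# Put the spaces back into the school names
dat$School <- gsub("_", " ", dat$School, fixed = TRUE)
dat
```

Using fixed = TRUE treats the school names as literal strings, so nothing in a name is interpreted as a regular expression.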
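Answer 1: (score: 3)
Unfortunately, the answer to this question is pretty much "it depends on how well you know your dataset."
For example, the description of the dataset lists the possible values of each variable. Here, we know that only a few schools have multi-word names, and that those names follow a predictable pattern: they end in "Elementary" or "Middle".
So you can read the data in with readLines and work out the least obtrusive way to insert a delimiter before re-reading the data with read.table.
Here is an example:
Sample data: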
cat("Grade Area School Goals Value",
"4 Rural Elm Popular 1",
"4 Rural Elm Sports 2",
"4 Rural Elm Grades 1",
"4 Rural Elm Popular 3",
"3 Rural Brentwood Elementary Sports 4",
"3 Rural Brentwood Middle Grades 3",
"3 Suburban Ridge Popular 3", sep = "\n", file = "test.txt")
Read it in as a character vector:
x <- readLines("test.txt")
Use gsub to force the multi-word school names into a single word (joined with an underscore). Then use read.table to get a data.frame.
read.table(text = gsub(" (Elementary|Middle)", "_\\1", x), header = TRUE)
#   Grade     Area               School   Goals Value
# 1     4    Rural                  Elm Popular     1
# 2     4    Rural                  Elm  Sports     2
# 3     4    Rural                  Elm  Grades     1
# 4     4    Rural                  Elm Popular     3
# 5     3    Rural Brentwood_Elementary  Sports     4
# 6     3    Rural     Brentwood_Middle  Grades     3
# 7     3 Suburban                Ridge Popular     3
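If you don't know the school names but do know that School is the only column that can contain spaces, another possible approach is to split each line by position with a regular expression. The sketch below assumes exactly two single-word fields before School and two after it, as in the sample data above; the pattern and the variable names are only illustrative.

```r
x <- c("Grade Area School Goals Value",
       "4 Rural Elm Popular 1",
       "3 Rural Brentwood Elementary Sports 4",
       "3 Suburban Ridge Popular 3")

# Two single-word fields, the free-text School field, then two single-word fields
pattern <- "^([^ ]+) ([^ ]+) (.+) ([^ ]+) ([^ ]+)$"
parts <- regmatches(x, regexec(pattern, x))

# Drop the full-match element and stack the captured groups into a matrix
m <- do.call(rbind, lapply(parts, function(p) p[-1]))

# First row holds the header; the rest are data rows
dat <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(dat) <- m[1, ]

# Convert numeric-looking columns back to numbers, keeping strings as character
dat <- type.convert(dat, as.is = TRUE)
dat
```

This keeps the spaces in the school names, so there is no underscore to clean up afterwards, but it breaks as soon as a second column can contain spaces.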