Loading a space-delimited table in R

Time: 2015-01-08 11:33:14

Tags: r statistics loading

How do I load a space-delimited table when some fields themselves contain spaces?

Simple example data:

Grade Area School Goals
4 Rural Elm Popular
4 Rural Elm Sports
4 Rural Elm Grades
4 Rural Elm Popular
3 Rural Brentwood Elementary Sports
3 Suburban Ridge Popular

Note that one school name contains a space ("Brentwood Elementary", as opposed to "Elm").

The following call fails with "line x did not have y elements":

dat = read.table("dat.txt",header=TRUE)
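
To see which lines trigger that error, one option is to count the whitespace-separated fields per line before parsing. The following is a minimal sketch using `count.fields` from base R; the temporary file and the names `tmp`, `n_fields`, and `bad_lines` are illustrative and not part of the original question.

```r
# Write the example data to a temporary file (illustrative stand-in for dat.txt)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Grade Area School Goals",
             "4 Rural Elm Popular",
             "4 Rural Elm Sports",
             "4 Rural Elm Grades",
             "4 Rural Elm Popular",
             "3 Rural Brentwood Elementary Sports",
             "3 Suburban Ridge Popular"), tmp)

# Count whitespace-separated fields on every line, header included
n_fields <- count.fields(tmp)

# Lines whose field count differs from the header's are the ones read.table rejects
bad_lines <- which(n_fields != n_fields[1])
bad_lines
```

Here the only line flagged is the one containing "Brentwood Elementary", which splits into five fields instead of four.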

Edit: the data points are all factors with a fixed set of values.

Edit: the full data is available at http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html, thanks to @AmandaMahto

2 answers:

Answer 0 (score: 4)

Actually, if you can use the data source Ananda found, it's easy, because the <pre> section is tab-delimited:

library(rvest)

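# read_html() is used here because html() was deprecated in later rvest releases
pg <- read_html("http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html")
# pull the text of the <pre> block, which holds the tab-delimited table
dat <- pg %>% html_nodes("pre") %>% html_text()
dat <- read.table(text=dat, sep="\t", header=TRUE, stringsAsFactors=FALSE)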

dat[245:249,]

##     Gender Grade Age  Race Urban.Rural       School   Goals Grades Sports Looks Money
## 245   girl     4   9 White       Rural         Sand  Grades      1      3     2     4
## 246   girl     4   9 White       Rural         Sand  Sports      3      2     1     4
## 247   girl     4   9 White       Rural         Sand  Sports      3      2     1     4
## 248   girl     4   9 White       Rural         Sand  Grades      2      1     3     4
## 249   girl     6  12 White       Rural Brown Middle Popular      4      2     1     3
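
The reason this works can be shown offline with a small, made-up tab-delimited string: once `sep="\t"` is given, a space inside a field is no longer treated as a separator. This is only a sketch; the three-column sample below is invented for illustration and is not the real dataset.

```r
# A tiny tab-delimited sample (hypothetical, mimicking the layout of the <pre> block)
txt <- "Grade\tSchool\tGoals\n4\tSand\tGrades\n6\tBrown Middle\tPopular"

# With sep = "\t", only tabs split fields, so "Brown Middle" stays in one column
dat <- read.table(text = txt, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
dat$School
```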

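To actually answer your question (this is somewhat similar to Ananda's answer), you need to know where the problem column is and work around it. This approach uses gsubfn and a predefined set of values for that column to join the words with underscores, read the table, and then split them back into separate words: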

library(gsubfn)

# awful.txt is here https://gist.github.com/hrbrmstr/13cee15c91fdadb10fbc

lines <- readLines("awful.txt")

schools <- c("Brentwood Elementary", "Brentwood Middle", "Brown Middle", "Westdale Middle")
expr <- paste("(", paste(schools, collapse="|"), ")", sep="")
lines <- gsubfn(expr, function(x) { gsub(" ", "_", x) }, lines)

dat <- read.table(text=paste(lines, sep="", collapse="\n"), 
                  header=TRUE, stringsAsFactors=FALSE)

dat$School <- gsub("_", " ", dat$School)

dat[c(1,34,94,198,255,324,377,433),]

##     Gender Grade Age  Race Urban.Rural               School   Goals Grades Sports Looks Money
## 1      boy     5  11 White       Rural                  Elm  Sports      1      2     4     3
## 34     boy     4  10 White    Suburban Brentwood Elementary  Grades      2      1     3     4
## 94    girl     6  11 White    Suburban     Brentwood Middle  Grades      3      4     1     2
## 198    boy     5  10 White       Rural                Ridge  Sports      4      2     1     3
## 255   girl     6  12 Other       Rural         Brown Middle  Grades      3      2     1     4
## 324    boy     4   9 Other       Urban                 Main  Grades      4      1     3     2
## 377    boy     4   9 White       Urban              Portage Popular      4      1     2     3
## 433   girl     6  11 White       Urban      Westdale Middle Popular      4      2     1     3
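
If adding a dependency on gsubfn is not desirable, roughly the same idea can be sketched in base R by looping over the known names and replacing each one literally. This is an assumption-laden variant, not the answer's own method: the sample `lines` and the shortened `schools` vector below are made up for illustration.

```r
# Hypothetical sample lines standing in for readLines("awful.txt")
lines <- c("Gender Grade School Goals",
           "boy 4 Brentwood Elementary Grades",
           "girl 6 Brown Middle Grades",
           "boy 5 Elm Sports")

# Known multi-word values for the problem column (shortened for the example)
schools <- c("Brentwood Elementary", "Brown Middle")

# Replace each known name with an underscore-joined version, matching literally
for (s in schools) {
  lines <- gsub(s, gsub(" ", "_", s, fixed = TRUE), lines, fixed = TRUE)
}

# Each element of the character vector is treated as one line of input
dat <- read.table(text = lines, header = TRUE, stringsAsFactors = FALSE)

# Restore the spaces once the columns have been split correctly
dat$School <- gsub("_", " ", dat$School, fixed = TRUE)
dat$School
```

Because `fixed = TRUE` matches plain strings, this would need extra care if one school name were a substring of another.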

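Answer 1 (score: 3)

Unfortunately, the answer to this question is pretty much "it depends on how well you know your dataset."

For example, the description of the dataset specifies the possible values of each variable. Here, we know that only a few schools have multi-word names, and that those names follow a predictable pattern ending in "Elementary" or "Middle".

So you can read the data in with readLines and figure out the least obtrusive way to insert a delimiter before re-reading the data with read.table.

Here is an example:

Sample data: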

cat("Grade Area School Goals Value",
    "4 Rural Elm Popular 1",
    "4 Rural Elm Sports 2",
    "4 Rural Elm Grades 1",
    "4 Rural Elm Popular 3",
    "3 Rural Brentwood Elementary Sports 4",
    "3 Rural Brentwood Middle Grades 3",
    "3 Suburban Ridge Popular 3", sep = "\n", file = "test.txt")

Read it in as a character vector:

x <- readLines("test.txt")

Use gsub to force the multi-word school names into a single word (joined by an underscore). Then use read.table to get a data.frame:

read.table(text = gsub(" (Elementary|Middle)", "_\\1", x), header = TRUE)
#   Grade     Area               School   Goals Value
# 1     4    Rural                  Elm Popular     1
# 2     4    Rural                  Elm  Sports     2
# 3     4    Rural                  Elm  Grades     1
# 4     4    Rural                  Elm Popular     3
# 5     3    Rural Brentwood_Elementary  Sports     4
# 6     3    Rural     Brentwood_Middle  Grades     3
# 7     3 Suburban                Ridge Popular     3
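
Since the question mentions that the values should end up as factors, a possible follow-up step is to turn the underscores back into spaces and then convert the column. The sketch below assumes a shortened copy of the sample data held in `x`; the conversion step is an addition and not part of the original answer.

```r
# Shortened copy of the sample data, as readLines("test.txt") would return it
x <- c("Grade Area School Goals Value",
       "4 Rural Elm Popular 1",
       "3 Rural Brentwood Elementary Sports 4",
       "3 Rural Brentwood Middle Grades 3",
       "3 Suburban Ridge Popular 3")

# Join multi-word school names with an underscore, then parse as usual
dat <- read.table(text = gsub(" (Elementary|Middle)", "_\\1", x),
                  header = TRUE, stringsAsFactors = FALSE)

# Put the spaces back and store the column as a factor
dat$School <- factor(gsub("_", " ", dat$School))
levels(dat$School)
```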