How do I read text that is not in a table format?

Date: 2013-12-04 06:17:53

Tags: r

How do I read a file into R that is not laid out as a table?

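Some values in the data are blank. A blank still needs to count as a value, so each field keeps its position.

`About` and `Name` are the only values that are always present.

For example, the text file looks like this: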

Name
Type
Color
About

Spiderman
Marvel
Red
Swings from webs

Superman
DC

Likes to fly around

Hulk 
Marvel
Green
I told you not top make him mad. 

Batman

Black
He is a good fighter and detective

Martian Manhunter
DC

He is from Mars

Deadpool

Black Red
Kinda Crazy

The first entry is the header. I would like to turn it into a data frame like this:
Name      Type      Color      About
Spiderman Marvel    Red        Swings from webs
Superman  DC                   Likes to fly around
Hulk      Marvel    Green      I told you not top make him mad. 
Batman              Black      He is a good fighter and detective
Mar...ter DC                   He is from Mars
Deadpool            Black Red  Kinda Crazy

2 Answers:

Answer 0 (score: 7)

Use scan in multi-line mode (this works for regular groups of three items separated by blank lines):

filename <- "myPath/myFile.txt"
# One record = three consecutive items; `what` is a list of three character fields.
inp <- scan(filename, what = as.list(rep("", 3)))
dinp <- as.data.frame(inp, stringsAsFactors = FALSE)
names(dinp) <- unlist(dinp[1, ])  # use the first record as the column names
dinp <- dinp[-1, ]                # then remove it from the data
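To make the multi-line behaviour concrete, here is a minimal self-contained sketch. It assumes complete records of three one-line fields (no blank values), uses a made-up in-memory sample through the `text` argument instead of a file, and adds `sep = "\n"` so that a multi-word line would stay in one field. That `sep` setting is an addition for illustration, not part of the answer above.

```r
# Hypothetical sample: complete records of three one-line fields,
# separated by blank lines (no missing values).
txt <- "Name
Type
Color

Spiderman
Marvel
Red

Hulk
Marvel
Green"

# sep = "\n" makes each line one field; blank lines are skipped by default,
# and multi.line = TRUE (the default) lets one record span several lines.
inp <- scan(text = txt, what = as.list(rep("", 3)), sep = "\n", quiet = TRUE)
dinp <- as.data.frame(inp, stringsAsFactors = FALSE)
names(dinp) <- unlist(dinp[1, ])  # first record holds the column names
dinp <- dinp[-1, ]                # drop the header record from the data
rownames(dinp) <- NULL            # renumber rows starting at 1
print(dinp)
```

Because blank lines are skipped, this only works when no field is empty; a blank field would shift every later value by one position.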

Second attempt (for the different problem, where fields can be blank):

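dat <- readLines(filename)
# Each record is five lines: four fields plus a blank separator line.
# Matrices are filled in column-major order by default, so fill by row here.
# If the file has no final blank line, matrix() warns and recycles a value
# into the separator column, which is dropped anyway.
m <- matrix(dat, ncol = 5, byrow = TRUE)
mydf <- as.data.frame(m[-1, -5], stringsAsFactors = FALSE)  # drop header row and separator column
names(mydf) <- dat[1:4]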

#-----------------------------
> mydf
               Name   Type     Color                              About
1         Spiderman Marvel       Red                   Swings from webs
2          Superman     DC                          Likes to fly around
3             Hulk  Marvel     Green  I told you not top make him mad. 
4            Batman            Black He is a good fighter and detective
5 Martian Manhunter     DC                              He is from Mars
6          Deadpool        Black Red                        Kinda Crazy
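The fixed five-line stride is what preserves the blank fields, so it may help to see it on an in-memory sample. The sketch below is only an illustration under assumptions: the vector `dat` is a shortened, hand-typed subset of the question's file, and the `stride` and `pad` names are hypothetical. It pads the last record so that `matrix()` does not need to recycle, and trims stray whitespace such as the trailing space after "Hulk".

```r
# Shortened sample in the question's layout: four fields per record plus a
# blank separator line; a blank field is simply an empty line.
dat <- c("Name", "Type", "Color", "About", "",
         "Spiderman", "Marvel", "Red", "Swings from webs", "",
         "Superman", "DC", "", "Likes to fly around", "",
         "Batman", "", "Black", "He is a good fighter and detective")

stride <- 5                       # 4 fields + 1 separator line
pad <- (-length(dat)) %% stride   # lines missing from the last record
dat <- c(dat, rep("", pad))       # pad so the length is a multiple of stride

# One row per record; the last column holds only the separator lines.
m <- matrix(trimws(dat), ncol = stride, byrow = TRUE)
mydf <- as.data.frame(m[-1, -stride, drop = FALSE], stringsAsFactors = FALSE)
names(mydf) <- m[1, -stride]      # header record supplies the column names
print(mydf)
```

Blank fields stay as empty strings here. If missing values are preferred, `mydf[mydf == ""] <- NA` would convert them afterwards.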

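Answer 1 (score: 0)

The data you listed should be readable with R's read.table without any special arguments. With the default sep = "", it treats any run of whitespace as the delimiter, and it ignores blank lines. So if you have a data file named test.txt that contains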

Name Type Color

Spiderman Marvel Red

Superman DC Blue

Hulk Marvel Green

then you would do

> read.table('test.txt',header=TRUE)
       Name   Type Color
1 Spiderman Marvel   Red
2  Superman     DC  Blue
3      Hulk Marvel Green

Note that read.table is just a wrapper around the scan function, which you can use directly if you need to get fancier when reading the data. See http://stat.ethz.ch/R-manual/R-devel/library/base/html/scan.html
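As a rough illustration of calling scan directly on the same whitespace-delimited layout, the sketch below reads the sample from a string through the `text` argument rather than from test.txt. The variable names `txt`, `cols`, and `df` are made up for the example.

```r
# Same content as the test.txt example, held in a string for reproducibility.
txt <- "Name Type Color

Spiderman Marvel Red

Superman DC Blue

Hulk Marvel Green"

# skip = 1 drops the header line; the named list in `what` sets the column
# names and types. Blank lines are skipped by default.
cols <- scan(text = txt, what = list(Name = "", Type = "", Color = ""),
             skip = 1, quiet = TRUE)
df <- as.data.frame(cols, stringsAsFactors = FALSE)
print(df)
```

Like read.table, this splits on whitespace, so it suits one-word fields on a single line, not the one-field-per-line layout with blank values.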