Question

The data set I was given is very large, so I made a sample set.

text    bool
H1  H2
exTable1    0
text    num num text
HEAD1   HEAD2   HEAD3   HEAD4
exTable2    098 987 exText1
text    bool    text
HEADER1 HEADER2 HEADER3
exTable3    1   exText2

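As you can see, the tables are tab-delimited, and each table is preceded by a line describing the data type of each column. I tried the following code to read the first table and take its headers from the second line: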

table1 <- read.table("tables.txt", sep="\t", skip=1, header=TRUE)

I got this error:

Error in read.table("tables.txt", sep = "\t",  : 
more columns than column names

I noticed that the file contains multiple tables and that the first table has fewer columns than the others.

Answer 1

The solution is not so straightforward.
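To see why a single read.table call cannot handle this file, it may help to count the fields on each line. The sketch below rebuilds the question's sample in a temporary file (the tab separators and the variable names sample_lines, tmp, fields and result are assumptions for illustration) and then repeats the failing call. read.table sizes the table from the first five lines it reads, so it expects four columns while the header row H1/H2 supplies only two names.

```r
# Sample data from the question, assumed to be tab-separated.
sample_lines <- c(
  "text\tbool",
  "H1\tH2",
  "exTable1\t0",
  "text\tnum\tnum\ttext",
  "HEAD1\tHEAD2\tHEAD3\tHEAD4",
  "exTable2\t098\t987\texText1",
  "text\tbool\ttext",
  "HEADER1\tHEADER2\tHEADER3",
  "exTable3\t1\texText2"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Number of tab-separated fields on each line of the file.
fields <- count.fields(tmp, sep = "\t")
print(fields)  # 2 2 2 4 4 4 3 3 3

# The call from the question; the error is caught so the script keeps running.
result <- tryCatch(
  read.table(tmp, sep = "\t", skip = 1, header = TRUE),
  error = function(e) conditionMessage(e)
)
print(result)
```

The changing field counts are what make it necessary to locate each table first and read the tables one at a time.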

Step 1: Use readLines

Read the whole file tables.txt

con <- file("tables.txt", "r")
tables<-readLines(con)
close(con)

Step 2: Clean it with an ad hoc function

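clean<-function(row)
{
  #Split on runs of spaces or tabs, since the file is tab-delimited
  out<-unlist(strsplit(row,split="[ \t]+"))
  #Drop any empty strings left over from the split
  return(out[nchar(out)>0])
}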

tables_cleaned<-lapply(tables,clean)

Step 3: Find the lines that identify the variable types, and therefore where each table in the file starts

find_header<-function(row,possible_types)
{
  #TRUE only for a non-empty line whose fields are all type names
  return(length(row)>0 && all(row %in% possible_types))
}

possible_types<-c("text","num","bool")
is_header<-unlist(lapply(tables_cleaned,find_header,possible_types=possible_types))

n_files<-which(is_header==1)
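As an illustration of what Step 3 produces, here is a minimal sketch run on the first six lines of the question's sample. It assumes the lines are tab-separated and uses vapply in place of unlist(lapply(...)); the variable name fields is introduced only for this example.

```r
# First six lines of the sample, as readLines() would return them.
tables <- c("text\tbool", "H1\tH2", "exTable1\t0",
            "text\tnum\tnum\ttext", "HEAD1\tHEAD2\tHEAD3\tHEAD4",
            "exTable2\t098\t987\texText1")

possible_types <- c("text", "num", "bool")

# Split each line on runs of spaces or tabs.
fields <- strsplit(tables, "[ \t]+")

# A type row is a non-empty line whose fields are all known type names.
is_header <- vapply(fields,
                    function(f) length(f) > 0 && all(f %in% possible_types),
                    logical(1))

# Line numbers at which a new table starts.
n_files <- which(is_header)
print(n_files)  # 1 4
```

One caveat of this test: a data row made up only of the words text, num and bool would also be flagged as a type row.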

Step 4: Use this information to load each table in turn

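tab<-list()
for (i in seq_along(n_files))
{
  #Reopen the file for each table so that skip counts from the top
  con <- file("tables.txt", "r")
  if(i<length(n_files))
  {
    #Skip everything up to and including the type row, then read the
    #header row plus the data rows that precede the next type row
    tab[[i]]<-read.table(con,skip=n_files[i],nrows=(n_files[i+1]-n_files[i])-2, sep="\t", header=TRUE)
  } else
  {
    #Last table: read through to the end of the file
    tab[[i]]<-read.table(con,skip=n_files[i],nrows=length(tables), sep="\t", header=TRUE)
  }
  close(con)
}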

Output

tab
[[1]]
         H1 H2
1 exTable11  0

[[2]]
     HEAD1 HEAD2 HEAD3   HEAD4
1 exTable2    98   987 exText1

[[3]]
   HEADER1 HEADER2 HEADER3
1 exTable3       1 exText2
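The same idea can also be sketched without reopening the file for every table: group the lines by the type row that precedes them and pass each group to read.table through its text argument. This is an alternative to the loop above rather than the code that produced the output shown; it assumes tab-separated input, and the names sample_lines, is_type_row and group are made up for the example.

```r
# Sample data from the question, assumed to be tab-separated.
sample_lines <- c(
  "text\tbool",
  "H1\tH2",
  "exTable1\t0",
  "text\tnum\tnum\ttext",
  "HEAD1\tHEAD2\tHEAD3\tHEAD4",
  "exTable2\t098\t987\texText1",
  "text\tbool\ttext",
  "HEADER1\tHEADER2\tHEADER3",
  "exTable3\t1\texText2"
)
possible_types <- c("text", "num", "bool")

# Flag the lines that consist only of type names.
is_type_row <- vapply(strsplit(sample_lines, "\t", fixed = TRUE),
                      function(f) length(f) > 0 && all(f %in% possible_types),
                      logical(1))

# The running count of type rows gives every line the index of its table.
group <- cumsum(is_type_row)

# Drop the type rows, split the remaining lines by table, and parse each block.
blocks <- split(sample_lines[!is_type_row], group[!is_type_row])
tab <- lapply(blocks, function(block) {
  read.table(text = block, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
})

str(tab)
```

Because each block is parsed on its own, every table keeps exactly its own number of columns.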

Answer 2

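OK, I managed to come up with a workaround, because I noticed three things: (1) the first column describes what each row contains; (2) the first row of each table describes what each column of that table contains, and starts with the word TYPE; (3) each table is followed by a row that contains only * in the first column, except for the last table, which has nothing after it. I appended a * row at the end so that every table follows the same pattern and I can get the correct indices.

Workaround code, adapted for the test data set (it produces the same result):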

#Step 1: Read full data set

tables.df <- read.table("tablesTest2SampleDataSet.txt", header=FALSE, fill = TRUE, stringsAsFactors = FALSE)

#Append a row that starts with an * to the end of the file

tables.df <- rbind(tables.df, c("*"))

#Step 2: Establish identifier for the start and ending of each table in the data set

#Gets row names of the rows that start with the name TYPE

typeRows <- which(tables.df$V1 == "TYPE")

#Gets row names of the rows that start with *

starRows <- which(tables.df$V1 == "*")

#Gets the column indices of the slots in the TYPE rows that are empty
#Therefore I can use the first item in each of these to get the last column with data

for (i in 1:length(typeRows))
{
  assign(paste("emptyColumnsT", i, sep = ""), which(tables.df[typeRows[i],] == ""))
}

#Step 3: Create the tables

for (i in seq_along(typeRows)) #One table per typeRows value
{
  emptyCols <- get(paste("emptyColumnsT", i, sep = ""))

  #Rows run from the TYPE row to the row just before the * row.
  #The parentheses matter: typeRows[i]:starRows[i]-1 is parsed as
  #(typeRows[i]:starRows[i])-1, which shifts the whole range up by one row.
  rows <- typeRows[i]:(starRows[i] - 1)

  if (length(emptyCols) == 0)
  {
    #No empty slots in the TYPE row: keep every column.
    cols <- seq_len(ncol(tables.df))
  } else
  {
    #Keep the columns before the first empty slot in the TYPE row.
    cols <- seq_len(emptyCols[1] - 1)
  }

  assign(paste("tables.df_table", i, sep = ""), tables.df[rows, cols])
}
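For comparison, here is a minimal sketch of the same workaround that collects the tables in a list instead of creating numbered variables with assign and get. It assumes the file is tab-separated, sizes the data frame from the widest line using count.fields, and reads every column as character; the names sample_lines, tmp, n_cols and tables_list are introduced only for this example.

```r
# Sample data from this answer, assumed to be tab-separated.
sample_lines <- c(
  "TYPE\ttext\tbool\tnum\tnum",
  "HEADERS\tHEAD1\tHEAD2\tHEAD3\tHEAD4",
  "DATA\tabcd\t1\t123\t456",
  "*",
  "TYPE\ttext\tnum\tnum\tnum\tnum\tnum\tnum\tnum\tnum\tbool",
  "HEADERS2\tHT1\tHN1\tHN2\tHN3\tHN4\tHN5\tHN6\tHN7\tHN8\tHB",
  "DATA\tefgh\t789\t098\t765\t432\t112\t358\t132\t134\t0",
  "*",
  "TYPE\ttext\ttext\ttext\tnum\tnum\tnum",
  "HEADERS3\tH1\tH2\tH3\tH4\tH5\tH6",
  "DATA\tijkl\tmnop\tqrst\t558\t914\t400"
)
tmp <- tempfile(fileext = ".txt")
# Append the closing * row so that every table ends the same way.
writeLines(c(sample_lines, "*"), tmp)

# Size the data frame from the widest line in the file.
n_cols <- max(count.fields(tmp, sep = "\t"))
tables.df <- read.table(tmp, sep = "\t", header = FALSE, fill = TRUE,
                        col.names = paste0("V", seq_len(n_cols)),
                        colClasses = "character")

typeRows <- which(tables.df$V1 == "TYPE")
starRows <- which(tables.df$V1 == "*")

tables_list <- lapply(seq_along(typeRows), function(i) {
  # Rows from the TYPE row to the row just before the * row.
  rows <- typeRows[i]:(starRows[i] - 1)
  # Keep only the columns that are filled in on this table's TYPE row.
  cols <- which(unlist(tables.df[typeRows[i], ]) != "")
  tables.df[rows, cols, drop = FALSE]
})

print(sapply(tables_list, ncol))  # 5 11 7
```

Each table is then available as tables_list[[i]], which is easier to loop over than a set of separately named variables.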

Here is the sample data set I used for this test:

TYPE    text    bool    num num
HEADERS HEAD1   HEAD2   HEAD3   HEAD4
DATA    abcd    1   123 456
*
TYPE    text    num num num num num num num num bool
HEADERS2    HT1 HN1 HN2 HN3 HN4 HN5 HN6 HN7 HN8 HB
DATA    efgh    789 098 765 432 112 358 132 134 0
*
TYPE    text    text    text    num num num
HEADERS3    H1  H2  H3  H4  H5  H6
DATA    ijkl    mnop    qrst    558 914 400

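In the end, I wanted the file split into as many tables as it contains; in this case, 3. Each table's rows should start at the TYPE row and end at the row just before the * row. As for the columns, none of them should have empty slots. As a result, all 3 tables in this test have different widths.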

How to get R to read multiple tables from one text file when the first table has fewer columns than the others
