How to read a text file into R when the data is not in a table

Time: 2011-12-07 21:38:08

Tags: r

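I have a long telephone log as a text file that I am trying to read into R, but it is really not working. The text has a structure, but it is certainly not a table. Its structure is as follows:

  1. Each record is made up of multiple lines, so readLines is not quite right
  2. Each line of a record is a separate field
  3. Some records have an additional field after the second field
  4. Each new record is marked by a blank line. If one could specify that records are separated by "\n\n" and fields (or columns) by "\n", then readLines or scan would work

Here is an example: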

    TheInstitute 5467
      telephone line 4125526987 x 4567
      datetime 2011110516 12:56
      blay blay blah who knows what, but anyway it may have a comma
    
    TheInstitute 5467
      telephone line 4125526987 x 4567
      datetime 2011110516 12:58
      blay blay blah who knows what
    
    TheInstitute 5467
      telephone line 412552999 x 4999
      bump phone line 4125527777
      datetime 2011110516 12:59
      blay blay blah who knows what
    
    TheInstitute 5467
      telephone line 4125526987 x 4567
      bump phone line 4125527777
      datetime 2011110516 13:51
      blay blay blah who knows what, but anyway it may have a comma
    
    TheInstitute 5467
      telephone line 4125526987 x 4567
      datetime 2011110516 14:56
      blay blay blah who knows what
    

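How can I do this in R? I have tried tricks with scan, paste, and strsplit, but I am spinning my wheels. I probably have to put it into a list, since a list can handle unequal numbers of elements. I would like all records to have the same number of fields, and for records that lack a field (here, the bump phone line) I would like that field to simply hold NA. I would appreciate help even just getting started; from there I can experiment on my own.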

1 answer:

Answer 0 (score: 15)

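When multi.line = TRUE is used in the scan function, records should end with two end-of-line characters. I do this here with textConnection wrapped around the text (txt), but you would use a valid file name: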

    > inp <- scan(textConnection(txt), multi.line=TRUE,
    +             what=list(place="character", tline1="character",
    +             cline1="character", cline2="character", cline3="character"), sep="\n")
    Read 5 records
    > str(as.data.frame(inp))
    'data.frame':   5 obs. of  5 variables:
     $ place : Factor w/ 1 level "TheInstitute 5467": 1 1 1 1 1
     $ tline1: Factor w/ 2 levels "  telephone line 4125526987 x 4567",..: 1 1 2 1 1
     $ cline1: Factor w/ 4 levels "  bump phone line 4125527777",..: 2 3 1 1 4
     $ cline2: Factor w/ 4 levels "  blay blay blah who knows what",..: 2 1 3 4 1
     $ cline3: Factor w/ 3 levels "","  blay blay blah who knows what",..: 1 1 2 3 1
    > as.data.frame(inp)
                  place                             tline1
    1 TheInstitute 5467   telephone line 4125526987 x 4567
    2 TheInstitute 5467   telephone line 4125526987 x 4567
    3 TheInstitute 5467    telephone line 412552999 x 4999
    4 TheInstitute 5467   telephone line 4125526987 x 4567
    5 TheInstitute 5467   telephone line 4125526987 x 4567
                            cline1
    1    datetime 2011110516 12:56
    2    datetime 2011110516 12:58
    3   bump phone line 4125527777
    4   bump phone line 4125527777
    5    datetime 2011110516 14:56
                                                               cline2
    1   blay blay blah who knows what, but anyway it may have a comma
    2                                   blay blay blah who knows what
    3                                       datetime 2011110516 12:59
    4                                       datetime 2011110516 13:51
    5                                   blay blay blah who knows what
                                                               cline3
    1
    2
    3                                   blay blay blah who knows what
    4   blay blay blah who knows what, but anyway it may have a comma
    5
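
Note that in the scan result the optional bump line lands in cline1 for some records while the datetime lands there for others, so the columns are not yet aligned with NA for the missing field as the question asks. One possible way to get there is sketched below. It is not part of the original answer and rests on some assumptions: every field can be recognised by its leading keyword (telephone line, bump phone line, datetime), the first line of a record is the place, and the last line is the free text. The names pick, parse_record, calls and the column names are made up for illustration, and the sample uses only the first three records from the question.

```r
# Sample input: the first three records from the question. With a real file,
# replace the textConnection() part with readLines("your_log.txt")
# (hypothetical file name).
txt <- "TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 12:56
  blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 12:58
  blay blay blah who knows what

TheInstitute 5467
  telephone line 412552999 x 4999
  bump phone line 4125527777
  datetime 2011110516 12:59
  blay blay blah who knows what
"

con <- textConnection(txt)
lines <- trimws(readLines(con))  # drop the leading indentation of each field
close(con)

# A blank line separates records, so a running count of blank lines
# gives every non-blank line the id of the record it belongs to.
is_blank  <- lines == ""
record_id <- cumsum(is_blank)[!is_blank]
records   <- split(lines[!is_blank], record_id)

# Return the first line of a record that starts with `prefix`,
# or NA when the record has no such line.
pick <- function(rec, prefix) {
  hit <- rec[startsWith(rec, prefix)]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Assumption: the place is always the first line and the note the last one.
parse_record <- function(rec) {
  data.frame(
    place     = rec[1],
    telephone = pick(rec, "telephone line"),
    bump      = pick(rec, "bump phone line"),
    datetime  = pick(rec, "datetime"),
    note      = rec[length(rec)],
    stringsAsFactors = FALSE
  )
}

calls <- do.call(rbind, lapply(records, parse_record))
rownames(calls) <- NULL
print(calls)
```

Matching on keywords rather than on position means a record without a bump line simply gets NA in that column, at the cost of having to know the keywords in advance. If the real log has fields that cannot be told apart this way, the position-based scan approach above plus a shift of the short records may be the better starting point.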