Reading fixed-width text files in R

Time: 2015-02-19 14:15:31

Tags: r

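I currently work in SAS, but I have used R for a long time. I have some fixed-width text files to read. They are easy to read in SAS, but I am really struggling to do the same in R. The file looks like this: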

    DP                  JAMES                    SILVA                    REY                                                                                                                                                             




       2014
           6
            0
                1723713652
           2
             0
DP                  ALEJANDRA                                         NARVAEZ                                                                                                                                                         




       2014
           6
            0
                1723713456
           6
             0
DP                  NANYER                                            PICHARDO                                                                                                                                                        




       2014
           6
            0
                1723713991
           1
             0
DP                  GABRIELA                 ANASI                    CASTILLO                                                                                                                                                        




       2014
           6
            0
                1723713240
           3
             0

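The layout may not be clear from the excerpt above, so I have attached the file; please see the attachment.

In SAS the file can be read easily with infile and input.

SAS code: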

infile  "filename.txt" lrecl=32767 ;

input

@001 park_cd      $5.

@006 Title $15.

@021 first_name  $25.

@046 middle_name $25.

@071 last_name   $25.

@096 suffix      $15.

@111 ADDRESS_1   $60.

@171 ADDRESS_2   $60.

@231 ADDRESS_3   $60.

@261 CITY  $30.

@291 STATE_PROVINCE    $2.

@293  ZIP    $9.

@302 Ticket_Year  $11.

@314 product_id  $12.

@327 UNIT_PRICE  $13.

@340 PURCHASE_DT $26.

@366 PURCHASE_QTY $12.

@378 TOTAL_PURCHASE_AMT $14. ;

run;

To do the same thing in R, I have tried several approaches:

1) First, read.fwf. Code:

dat1=read.fwf("D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt", 
 widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14), 
header = FALSE, sep = "\t",fill = TRUE,
 skip = 0, col.names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
   "ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
    " ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
       "PURCHASE_QTY","TOTAL_PURCHASE_AMT "), fileEncoding = "ASCII")

But it returns NA for most fields, and the few values it does return are in the wrong positions.

head(dat1) gives the output:

  park_cd           Title                first_name               middle_name
1   DP                    JAMES                     SILVA                    
2                                                                            
3                                                                        <NA>
4                    <NA>                      <NA>                      <NA>
5                                              <NA>                      <NA>
6                    2014                      <NA>                      <NA>
                  last_name          suffix
1 REY                                      
2                      <NA>            <NA>
3                      <NA>            <NA>
4                      <NA>            <NA>
5                      <NA>            <NA>
6                      <NA>            <NA>
                                                    ADDRESS_1.
1                                                             
2                                                         <NA>
3                                                         <NA>
4                                                         <NA>
5                                                         <NA>
6                                                         <NA>
                                                     ADDRESS_2 ADDRESS_3 CITY
1                                                                     NA   NA
2                                                         <NA>        NA   NA
3                                                         <NA>        NA   NA
4                                                         <NA>        NA   NA
5                                                         <NA>        NA   NA
6                                                         <NA>        NA   NA
  STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1             NA    NA          NA         NA         NA          NA           NA
2             NA    NA          NA         NA         NA          NA           NA
3             NA    NA          NA         NA         NA          NA           NA
4             NA    NA          NA         NA         NA          NA           NA
5             NA    NA          NA         NA         NA          NA           NA
6             NA    NA          NA         NA         NA          NA           NA
  TOTAL_PURCHASE_AMT.
1                  NA
2                  NA
3                  NA
4                  NA
5                  NA
6                  NA


2) Next I used the SAScii package to reuse the SAS code in R. Code:

sas_imp <- "input
@001 park_cd      $5.
@006 Title $15.
@021 first_name  $25.
@046 middle_name $25.
@071 last_name   $25.
@096 suffix      $15.
@111 ADDRESS_1   $60.
@171 ADDRESS_2   $60.
@231 ADDRESS_3   $60.
@261 CITY  $30.
@291 STATE_PROVINCE    $2.
@293  ZIP    $9.
@302 Ticket_Year  $11.
@314 product_id  $12.
@327 UNIT_PRICE  $13.
@340 PURCHASE_DT $26.
@366 PURCHASE_QTY $12.
@378 TOTAL_PURCHASE_AMT $14. ;"
sas_imp.tf <- tempfile()
writeLines (sas_imp , con = sas_imp.tf )
parse.SAScii( sas_imp.tf )
read.SAScii( "filename.txt" , sas_imp.tf ) 

It gives the same useless output as above.

3) Next I used the LaF package and its laf_open_fwf command, like this:

library(LaF)

data <- laf_open_fwf(filename="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
                     column_types=rep("character",18),
  column_names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
                  "ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
                  " ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
                  "PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
                     column_widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14))

Then I converted it like this:

library(ffbase)
my.data <- laf_to_ffdf(data) 
head(as.data.frame(my.data))

But it gives the output:

park_cd Title first_name             middle_name                      last_name
1      DP            JAMES                   SILVA                            REY
2                                             \r\n                           \r\n
3   JANDR     A                            NARVAEZ                               
4                     \r\n                      \r \n  \r\n         \r\n       20
5                 PICHARDO                                                       
6          \r\n            \r\n  \r\n         \r\n       2014\r\n           6\r\n
             suffix
1                  
2 \r\n         \r\n
3                  
4            14\r\n
5                  
6             0\r\n
                                                            ADDRESS_1.
1                                                                     
2         2014\r\n           6\r\n            0\r\n                172
3                                                                     
4 6\r\n            0\r\n                1723713456\r\n           6\r\n
5                                                                     
6                   1723713991\r\n           1\r\n             0\r\nDP
                                                           ADDRESS_2 ADDRESS_3  CITY
1                                                                           \r *\003
2 3713652\r\n           2\r\n             0\r\nDP                  A         L *\003
3                                                               \r\n           *\003
4                                    0\r\nDP                  NANYER           *\003
5                                                               \r\n           *\003
6                                     GABRIELA                 ANASI           *\003
  STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
2             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
3             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
4             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
5             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
6             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
  TOTAL_PURCHASE_AMT.
1                \001
2                \001
3                \001
4                \001
5                \001
6                \001

4) Finally, read.table.ffdf, like this:

library(ff) 
library(stringr) 
my.data1  <- read.table.ffdf(file="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt", 
                            FUN="read.fwf", 
                            widths = c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14), 
                            header=F, VERBOSE=TRUE, 
                            col.names = c("park_cd","Title","first_name","middle_name","last_name","suffix",
                                          "ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
                                          " ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
                                          "PURCHASE_QTY","TOTAL_PURCHASE_AMT "), 
                            fileEncoding = "UTF-8", 
                            transFUN=function(x){ 
                              z <- sapply(x, function(y) { 
                                y <- str_trim(y) 
                                y[y==""] <- NA 
                                factor(y)}) 
                              as.data.frame(z) 
                            } )

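But the result is the same. I found this last approach on this page: [http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html][1]

What am I doing wrong? Have I got the widths wrong, or is my whole approach mistaken? I have done a lot of work in R, and I cannot believe that something so simple in SAS is this hard in R. I must be missing something simple. If you have any ideas about files of this kind, please help. Thanks in advance.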

2 answers:

Answer 0: (score: 2)

The file you uploaded is not a fixed-width file.

I am not a SAS user, but looking at the SAS code in your post, the column widths in the code do not match the column widths in the file.
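One thing that can be checked without opening the file at all is whether the layout in the question is internally consistent. The sketch below is only an illustration: it copies the `@` start positions from the SAS `input` statement and the `widths` vector from the `read.fwf` call, and reports where one field's start plus its width does not land on the next field's start. The function name `layout_mismatches` is made up for this example. It says nothing about the file itself, only about the two layouts in the post.

```python
# Start columns (1-based) copied from the SAS INPUT statement in the question
starts = [1, 6, 21, 46, 71, 96, 111, 171, 231, 261, 291, 293,
          302, 314, 327, 340, 366, 378]
# Widths copied from the read.fwf call in the question
widths = [5, 15, 25, 25, 25, 15, 60, 60, 60, 30, 2, 9,
          11, 12, 13, 26, 12, 14]
names = ["park_cd", "Title", "first_name", "middle_name", "last_name",
         "suffix", "ADDRESS_1", "ADDRESS_2", "ADDRESS_3", "CITY",
         "STATE_PROVINCE", "ZIP", "Ticket_Year", "product_id",
         "UNIT_PRICE", "PURCHASE_DT", "PURCHASE_QTY", "TOTAL_PURCHASE_AMT"]


def layout_mismatches(starts, widths, names):
    """Return (field, next_field, diff) wherever start + width != next start.

    diff > 0 means unused columns between the fields (a gap);
    diff < 0 means the fields overlap.
    """
    problems = []
    for i in range(len(starts) - 1):
        diff = starts[i + 1] - (starts[i] + widths[i])
        if diff != 0:
            problems.append((names[i], names[i + 1], diff))
    return problems


mismatches = layout_mismatches(starts, widths, names)
for left, right, diff in mismatches:
    kind = "gap" if diff > 0 else "overlap"
    print(f"{left} -> {right}: {kind} of {abs(diff)} column(s)")
```

With the numbers from the post this flags an overlap between ADDRESS_3 and CITY and two one-column gaps around product_id. SAS jumps to each `@` position, whereas a plain widths vector is consumed left to right, so the two layouts describe different column boundaries. In `read.fwf` a negative width skips that many columns, which is one way to express such gaps.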

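In addition, some lines are completely blank.

There seem to be many carriage returns/newlines that do not belong there; in particular, they appear to be used in places as delimiters. There should be one CRLF at the end of each line, and that is all.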
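A quick way to see this for yourself is to count line lengths, blank lines and CR characters. The following is a minimal sketch rather than a polished tool; `line_length_report` is a hypothetical helper name, and the `sample` bytes are an abbreviated stand-in for the real file (in practice you would pass the result of `open(path, "rb").read()`).

```python
from collections import Counter


def line_length_report(raw):
    """Summarise physical lines in raw bytes: length histogram, blanks, CRs."""
    lines = raw.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final newline
    # Length of each line without its trailing carriage return
    lengths = Counter(len(line.rstrip(b"\r")) for line in lines)
    # Lines that contain nothing but whitespace
    blank = sum(1 for line in lines if line.strip() == b"")
    # Lines that ended in CRLF rather than a bare LF
    crlf = sum(1 for line in lines if line.endswith(b"\r"))
    return lengths, blank, crlf


# Abbreviated stand-in for the file in the question
sample = b"DP  JAMES\r\n\r\n   2014\r\nDP  ANA\r\n"
lengths, blank, crlf = line_length_report(sample)
print(lengths, blank, crlf)
```

In a genuine fixed-width file the histogram would have a single entry, and there would be no blank lines.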

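Since you say SAS opens it, I suggest saving it from SAS in CSV format and then opening that in R. Alternatively, you could use a good text editor/processor to remove the extra CRLFs, leaving only one CRLF at the end of each line. Since it looks like every "real" line starts with "DP", you could try replacing -CRLF-DP with (say) -tab-DP, then deleting all -CRLF-s, then replacing all -tab-s with -CRLF- (this relies on there being no -tab-s in the file).

Answer 1: (score: 1)

Update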
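The same clean-up can be scripted instead of done in an editor. This is a rough sketch under the assumption stated above, namely that every real record starts with "DP" in column 1 and that no continuation line does; `rejoin_records` is a name invented for the example.

```python
def rejoin_records(lines, marker="DP"):
    """Glue physical lines back into logical records.

    A new record starts at every line beginning with `marker`; every other
    line is treated as a continuation and appended to the current record
    with its line ending removed.
    """
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(marker) or not records:
            records.append(line)
        else:
            records[-1] += line
    return records


# Abbreviated stand-in for the broken file in the question
sample = ["DP  JAMES\r\n", "\r\n", "   2014\r\n", "DP  ANA\r\n", "  6\r\n"]
records = rejoin_records(sample)
for rec in records:
    print(repr(rec))
```

Each rejoined record can then be written out with a single line ending. Whether the result lines up with the intended column positions still has to be checked, because removing the stray CRLFs also shifts everything after them to the left.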

See this question for what I use for this problem nowadays:

Faster way to read fixed-width files

For posterity, the original answer is kept below as a how-to for a desperate, bootstrapped solution.


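Here is a FW -> .csv converter I wrote in Python to destroy these awful files:

It also includes a checkLength function that helps get at what @RobertLong suggested, namely that your underlying file may be faulty. If that is the case, you may be in trouble if the problem is pervasive and unpredictable (i.e., there is no consistent pattern of errors in your file that you could fix with Ctrl+H).

Note that dictfile must be formatted correctly (I wrote this for myself, so it is not necessarily as robust as it could be).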

import os
import csv

# Written for Python 3.
# Set the working directory (match the path format of your OS)
os.chdir('/home/michael/...')

def checkLength(ffile):
    """
    Check that all lines in the file have the same length
    (and so don't cause any issues below).
    Prints every line whose length differs from the first line, followed by
    its length. Returns 1 if any such line was found, otherwise 0.
    """
    with open(ffile, 'r') as ff:
        firstrow = 1
        troubles = 0
        for rows in ff:
            if firstrow:
                length = len(rows)
                firstrow = 0
            elif len(rows) != length:
                print(rows)
                print(len(rows))
                troubles = 1
    return troubles

def fixed2csv(infile, outfile, dictfile):
    """
    This function takes a file name for a fixed-width dataset as input and
    converts it to .csv format according to the widths and column names
    specified in dictfile

    Parameters
    ==========
        infile: string of input file name from which fixed-width data is to be read
                e.g. 'fixed_width.dat'
        outfile: string of output file name to which comma-separated data is to be saved
                 e.g. 'comma_separated.csv'
        dictfile: .csv-formatted dictionary file name from which to read the following:
                     * widths: field widths
                     * column names: names of columns to be written to the output .csv
                     * types: object types (character, integer, etc)
                  column order must be: col_names,widths,types
    """
    with open(dictfile, 'r', newline='') as dictf:
        fieldnames = ("col_names", "widths", "types")  # types used in R later
        ddict = csv.DictReader(dictf, fieldnames)
        slices = []
        colNames = []
        wwidths = []
        for rows in ddict:
            wwidths.append(int(rows['widths']))
            colNames.append(rows['col_names'])
        # Python slices are 0-based and half-open, so a running offset
        # turns the list of widths into column boundaries
        offset = 0
        for w in wwidths:
            slices.append(slice(offset, offset + w))
            offset += w
    with open(infile, 'r') as fixedf:
        # newline='' lets the csv module control the line endings
        with open(outfile, 'w', newline='') as csvf:
            csvfile = csv.writer(csvf)
            csvfile.writerow(colNames)
            for rows in fixedf:
                csvfile.writerow([rows[s] for s in slices])
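To make the dictfile format concrete, here is a small self-contained sketch. The three dictionary rows are an assumed example that reuses the first three columns from the question; they are not taken from any real dictionary file. `slices_from_dict` is a hypothetical helper that repeats the slice-building step of fixed2csv on an in-memory file.

```python
import csv
import io

# Assumed dictfile contents: one row per column, no header row,
# in the order col_names,widths,types
dict_text = (
    "park_cd,5,character\n"
    "Title,15,character\n"
    "first_name,25,character\n"
)


def slices_from_dict(dictf):
    """Read col_names,widths,types rows and return (names, slices)."""
    reader = csv.DictReader(dictf, fieldnames=("col_names", "widths", "types"))
    names, slices, offset = [], [], 0
    for row in reader:
        width = int(row["widths"])
        names.append(row["col_names"])
        slices.append(slice(offset, offset + width))
        offset += width
    return names, slices


names, slices = slices_from_dict(io.StringIO(dict_text))

# Build one well-formed 45-character line and cut it with the slices
line = "DP".ljust(5) + "".ljust(15) + "JAMES".ljust(25)
fields = [line[s].strip() for s in slices]
print(names)
print(fields)
```

Note that this sketch strips the padding for display, whereas fixed2csv writes each field with its padding intact.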

Good luck, and curses on whoever is proliferating these FW-format data files.