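I currently work in SAS, but I have used R for a long time. I have some fixed-width text files to read. They are easy to read in SAS, but I am really struggling to do the same in R. The file looks like this: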
DP JAMES SILVA REY
2014
6
0
1723713652
2
0
DP ALEJANDRA NARVAEZ
2014
6
0
1723713456
6
0
DP NANYER PICHARDO
2014
6
0
1723713991
1
0
DP GABRIELA ANASI CASTILLO
2014
6
0
1723713240
3
0
The layout is not clear here, so I have attached the file; please find it.
It is easy to read in SAS using infile and input.
SAS code:
infile "filename.txt" lrecl=32767 ;
input
@001 park_cd $5.
@006 Title $15.
@021 first_name $25.
@046 middle_name $25.
@071 last_name $25.
@096 suffix $15.
@111 ADDRESS_1 $60.
@171 ADDRESS_2 $60.
@231 ADDRESS_3 $60.
@261 CITY $30.
@291 STATE_PROVINCE $2.
@293 ZIP $9.
@302 Ticket_Year $11.
@314 product_id $12.
@327 UNIT_PRICE $13.
@340 PURCHASE_DT $26.
@366 PURCHASE_QTY $12.
@378 TOTAL_PURCHASE_AMT $14. ;
run;
Now, to do the same thing in R, I have been trying various things:
1) At first, read.fwf. Code:
dat1=read.fwf("D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
header = FALSE, sep = "\t",fill = TRUE,
skip = 0, col.names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "), fileEncoding = "ASCII")
But it returns NA for most fields, and the few values it does return are in the wrong positions.
head(dat1) gives the output:
park_cd Title first_name middle_name
1 DP JAMES SILVA
2
3 <NA>
4 <NA> <NA> <NA>
5 <NA> <NA>
6 2014 <NA> <NA>
last_name suffix
1 REY
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
ADDRESS_1.
1
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 <NA>
ADDRESS_2 ADDRESS_3 CITY
1 NA NA
2 <NA> NA NA
3 <NA> NA NA
4 <NA> NA NA
5 <NA> NA NA
6 <NA> NA NA
STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1 NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA
TOTAL_PURCHASE_AMT.
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
2) Next I used the SAScii package to run the SAS code from R. Code:
sas_imp <- "input
@001 park_cd $5.
@006 Title $15.
@021 first_name $25.
@046 middle_name $25.
@071 last_name $25.
@096 suffix $15.
@111 ADDRESS_1 $60.
@171 ADDRESS_2 $60.
@231 ADDRESS_3 $60.
@261 CITY $30.
@291 STATE_PROVINCE $2.
@293 ZIP $9.
@302 Ticket_Year $11.
@314 product_id $12.
@327 UNIT_PRICE $13.
@340 PURCHASE_DT $26.
@366 PURCHASE_QTY $12.
@378 TOTAL_PURCHASE_AMT $14. ;"
sas_imp.tf <- tempfile()
writeLines (sas_imp , con = sas_imp.tf )
parse.SAScii( sas_imp.tf )
read.SAScii( "filename.txt" , sas_imp.tf )
It also gives the same useless output as above.
3) Then I used the LaF package and the laf_open_fwf command, like this:
library(LaF)
data <- laf_open_fwf(filename="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
column_types=rep("character",18),
column_names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
column_widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14))
Then I converted it with:
library(ffbase)
my.data <- laf_to_ffdf(data)
head(as.data.frame(my.data))
But it gives the output:
park_cd Title first_name middle_name last_name
1 DP JAMES SILVA REY
2 \r\n \r\n
3 JANDR A NARVAEZ
4 \r\n \r \n \r\n \r\n 20
5 PICHARDO
6 \r\n \r\n \r\n \r\n 2014\r\n 6\r\n
suffix
1
2 \r\n \r\n
3
4 14\r\n
5
6 0\r\n
ADDRESS_1.
1
2 2014\r\n 6\r\n 0\r\n 172
3
4 6\r\n 0\r\n 1723713456\r\n 6\r\n
5
6 1723713991\r\n 1\r\n 0\r\nDP
ADDRESS_2 ADDRESS_3 CITY
1 \r *\003
2 3713652\r\n 2\r\n 0\r\nDP A L *\003
3 \r\n *\003
4 0\r\nDP NANYER *\003
5 \r\n *\003
6 GABRIELA ANASI *\003
STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
2 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
3 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
4 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
5 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
6 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
TOTAL_PURCHASE_AMT.
1 \001
2 \001
3 \001
4 \001
5 \001
6 \001
4) Finally, read.table.ffdf, like this:
library(ff)
library(stringr)
my.data1 <- read.table.ffdf(file="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
FUN="read.fwf",
widths = c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
header=F, VERBOSE=TRUE,
col.names = c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
fileEncoding = "UTF-8",
transFUN=function(x){
z <- sapply(x, function(y) {
y <- str_trim(y)
y[y==""] <- NA
factor(y)})
as.data.frame(z)
} )
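But the result is the same. I found this last solution on this page: http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html
What am I doing wrong? Did I get the widths wrong? Or is my whole approach wrong? I have done a lot of work in R and can't believe that something so simple in SAS is so hard in R. I must be missing something simple. If you have any ideas about this kind of file, please help. Thanks in advance.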
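Answer 0 (score: 2)
The file you uploaded is not a fixed-width file:
I'm not a SAS user, but from looking at the SAS code in your post, the column widths in the code do not match the column widths in the file.
Also, some lines are completely blank.
There seem to be many carriage returns / new lines that don't belong there. In particular, they seem to be used where a delimiter should be. There should be one CRLF at the end of each row, and that's it.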
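One part of this can be checked mechanically. The sketch below (Python, not from the original post) compares the @ start positions in the posted SAS INPUT statement with the widths vector passed to read.fwf. The function name layout_mismatches is made up for illustration, and the check only concerns the declared layout, not the contents of the file.

```python
def layout_mismatches(starts, widths):
    """Return (index, gap) pairs where field i does not end exactly where
    field i+1 starts. gap > 0 means unread columns between the two fields,
    gap < 0 means the two fields overlap."""
    problems = []
    for i in range(len(starts) - 1):
        gap = starts[i + 1] - (starts[i] + widths[i])
        if gap != 0:
            problems.append((i, gap))
    return problems

# 1-based start columns copied from the SAS INPUT statement in the question
starts = [1, 6, 21, 46, 71, 96, 111, 171, 231, 261, 291, 293,
          302, 314, 327, 340, 366, 378]
# widths copied from the read.fwf call in the question
widths = [5, 15, 25, 25, 25, 15, 60, 60, 60, 30, 2, 9,
          11, 12, 13, 26, 12, 14]

mismatches = layout_mismatches(starts, widths)
print(mismatches)
```

For the posted layout this flags ADDRESS_3 (it starts at column 231 with width 60, yet CITY starts at 261) and a one-column gap after each of Ticket_Year and product_id. So the widths vector does not describe the same layout as the SAS code, independently of the problems in the file itself. For gaps, read.fwf accepts negative widths to skip columns.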
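Since you say SAS can open it, I suggest saving it from SAS in CSV format and then opening that in R. Alternatively, you could use a good text editor/processor to strip the surplus CRLFs, leaving just one CRLF at the end of each row. Since every "real" row appears to begin with "DP", you could try replacing -CRLF-DP with (say) -tab-DP, then deleting all -CRLF-s, then replacing all -tab-s with -CRLF- (this relies on there being no -tab-s in the file).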
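The same regrouping can be scripted instead of done by hand in an editor. Here is a minimal Python sketch, assuming (as above) that every real record starts with a line beginning with "DP" and that blank lines carry no data. regroup_records is a hypothetical helper, not part of any library, and the sample text is the first two records as they appear in the question.

```python
def regroup_records(text, marker="DP "):
    """Group physical lines into logical records. A new record starts at
    every line that begins with `marker`; blank lines are dropped."""
    records = []
    for line in text.splitlines():  # splitlines() handles CRLF and LF alike
        line = line.strip()
        if not line:
            continue  # skip completely blank lines
        if line.startswith(marker) or not records:
            records.append([line])  # start a new record
        else:
            records[-1].append(line)  # continuation of the current record
    return records

sample = (
    "DP JAMES SILVA REY\r\n2014\r\n6\r\n0\r\n1723713652\r\n2\r\n0\r\n"
    "\r\n"
    "DP ALEJANDRA NARVAEZ\r\n2014\r\n6\r\n0\r\n1723713456\r\n6\r\n0\r\n"
)

records = regroup_records(sample)
# One tab-separated line per record, ready for a delimited-file reader
tab_lines = ["\t".join(fields) for fields in records]
print(tab_lines)
```

The result is a tab-delimited file rather than a fixed-width one, so it would be read with a delimited reader rather than with read.fwf. If any data value could itself begin with "DP ", the marker test would need to be stricter.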
Answer 1 (score: 1)
See this question for the approach I use these days:
Faster way to read fixed-width files
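For posterity, the original answer is kept below as a how-to for a desperate, bootstrapped solution.
Here is a FW -> .csv converter I wrote in Python to wreck these awful files:
It also includes a checkLength function, which can help get at what @RobertLong suggested, namely that your underlying file might be faulty. If that is the case, you may be in trouble if the problem is pervasive and unpredictable (i.e., there is no consistent pattern of errors in your file that you could ctrl+H to fix).
Note that dictfile must be formatted correctly (I wrote this for myself, so it is not necessarily as robust as it could be).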
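As an illustration of what that format might look like: the converter reads the dictionary with csv.DictReader and explicit field names, so the file should have no header row, just one col_names,widths,types line per column. Below is a hedged Python 3 sketch that builds such text for the first few columns of the layout in the question. dictfile_text is a made-up helper, and the "character" type labels are an assumption borrowed from the LaF call in the question.

```python
import csv
import io

def dictfile_text(names, widths, types=None):
    """Build dictionary-file contents: one 'name,width,type' row per column
    and no header row."""
    if types is None:
        types = ["character"] * len(names)  # assumed default type label
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in zip(names, widths, types):
        writer.writerow(row)
    return buf.getvalue()

# First three columns of the layout posted in the question
text = dictfile_text(["park_cd", "Title", "first_name"], [5, 15, 25])
print(text)
```

Saving that text to a file gives something that can be passed as dictfile; the types column is not used by the converter itself.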
import os
import csv

# Set the working directory (match the path format of your OS)
os.chdir('/home/michael/...')


def checkLength(ffile):
    """
    Used to check that all lines in file have the same length (and so don't cause any issues below).
    Prints every line whose length differs from the first line's; returns 1 if any were found, else 0.
    """
    with open(ffile, 'r') as ff:
        firstrow = 1
        troubles = 0
        for rows in ff:
            if firstrow:
                length = len(rows)
                firstrow = 0
            elif len(rows) != length:
                print(rows)
                print(len(rows))
                troubles = 1
    return troubles


def fixed2csv(infile, outfile, dictfile):
    """
    This function takes a file name for a fixed-width dataset as input and
    converts it to .csv format according to slices and column names specified in dictfile

    Parameters
    ==========
    infile: string of input file name from which fixed-width data is to be read
        e.g. 'fixed_width.dat'
    outfile: string of output file name to which comma-separated data is to be saved
        e.g. 'comma_separated.csv'
    dictfile: .csv-formatted dictionary file name (no header row) from which to read the following:
        * widths: field widths
        * column names: names of columns to be written to the output .csv
        * types: object types (character, integer, etc)
        column order must be: col_names,widths,types
    """
    with open(dictfile, 'r', newline='') as dictf:
        fieldnames = ("col_names", "widths", "types")  # types used in R later
        ddict = csv.DictReader(dictf, fieldnames)
        slices = []
        colNames = []
        wwidths = []
        for rows in ddict:
            wwidths.append(int(rows['widths']))
            colNames.append(rows['col_names'])
        # Turn the widths into 0-based, end-exclusive slices
        offset = 0
        for w in wwidths:
            slices.append(slice(offset, offset + w))
            offset += w

    with open(infile, 'r') as fixedf:
        # newline='' lets the csv module control line endings itself
        with open(outfile, 'w', newline='') as csvf:
            csvfile = csv.writer(csvf)
            csvfile.writerow(colNames)
            for rows in fixedf:
                csvfile.writerow([rows[s] for s in slices])
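To see the slicing step in isolation, here is a small self-contained Python 3 sketch with a made-up three-column layout (widths 5, 15, 4). make_slices and split_fixed are hypothetical names, and unlike the converter above this version also strips the padding from each field.

```python
from itertools import accumulate

def make_slices(widths):
    """Turn field widths into 0-based, end-exclusive slice objects."""
    ends = list(accumulate(widths))  # running total = end offset of each field
    starts = [0] + ends[:-1]         # each field starts where the previous ends
    return [slice(s, e) for s, e in zip(starts, ends)]

def split_fixed(line, widths):
    """Cut one fixed-width line into fields and strip the padding."""
    return [line[s].strip() for s in make_slices(widths)]

# Build a padded line for a hypothetical layout of widths 5, 15 and 4
line = "DP".ljust(5) + "JAMES".ljust(15) + "2014"
row = split_fixed(line, [5, 15, 4])
print(row)
```

Whether to strip the padding is a design choice: keeping it preserves the raw field exactly, while stripping gives cleaner values in the resulting .csv.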
Good luck, and a curse on whoever keeps proliferating these FW-format data files.