I am trying to read a fixed-width file with pandas.read_fwf. Here is a sample of the file:
0000123456700123
0001234567800045
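Say columns 0-11 hold the balance (format %12.2f) and columns 11-16 hold the interest rate (format %6.2f), so the last two digits of each field are the decimal part. The output DataFrame I expect looks like this: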
     Balance  Int_Rate
0   12345.67      1.23
1  123456.78      0.45
This is my code, which reads the file without applying any formatting:
import pandas as pd

colspecs = [(0, 11), (11, 16)]
header = ['Balance', 'Int_Rate']
df = pd.read_fwf("dataset", colspecs=colspecs, names=header)
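I have checked the documentation for pandas.read_fwf, but there does not seem to be an option for formatting columns during import. Do I need to apply the format afterwards, or is there a better way?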
Answer 0 (score: 1)
I ran into the same problem once. I parsed the file with struct first and then loaded the result into pandas:
import struct
import pandas as pd


def parse_data_file(fieldwidths, fn):
    # See https://docs.python.org/3.0/library/struct.html for formatting and other info.
    # Build a struct format such as "10s 10s 100s": "Ns" keeps an N-byte field,
    # "Nx" skips N pad bytes (used for negative widths).
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    unpack = fieldstruct.unpack_from
    # Slice one line into fields according to fieldwidths. Each line must be at
    # least sum(abs(fw)) bytes long, otherwise unpack_from raises struct.error.
    parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
    rows = []
    with open(fn, 'r') as f:
        for line in f:
            rows.append(parse(line))
    return rows
# Contents of test.txt (runs of padding spaces are not shown at full width here):
# 6332 x102340 Darwin 080007Darwin 1101
# 6332 x102342 Sydney 200001Sydney 1101
file_location = "test.txt"
fieldwidths = (10, 10, 100, 4, 2, 50, 4)  # negative widths represent ignored padding fields
column_names = ['ID', 'LocationID', 'LocationName', 'PostCode', 'StateID', 'Address', 'CountryID']
fields = parse_data_file(fieldwidths=fieldwidths, fn=file_location)

# pandas display options
pd.options.display.width = 500
pd.options.display.colheader_justify = 'left'

# Load the list of row tuples into a DataFrame
df = pd.DataFrame(fields)
df.columns = column_names
print(df)
Output
ID LocationID LocationName PostCode StateID Address CountryID
6332 x102340 Darwin 0800 07 Darwin 1101
6332 x102342 Sydney 2000 01 Sydney 1101
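
For the file in the question, converting after the read may already be enough. The following is only a sketch under assumptions, not an option that read_fwf itself provides: it assumes both fields contain nothing but digits with two implied decimal places, reads them as text, and then divides by 100. The helper name read_scaled_fwf and its decimals parameter are made up for illustration, and the two sample records are inlined through io.StringIO instead of being read from "dataset".

```python
import io

import pandas as pd

# The two sample records from the question, inlined so the sketch is self-contained.
SAMPLE = "0000123456700123\n0001234567800045\n"


def read_scaled_fwf(source, colspecs, names, decimals):
    """Hypothetical helper: read digit-only fixed-width fields, then apply implied decimals."""
    # dtype=str keeps the raw digit strings, so the scaling step below is explicit.
    df = pd.read_fwf(source, colspecs=colspecs, names=names, header=None, dtype=str)
    for col, places in zip(names, decimals):
        # "00001234567" -> 1234567 -> 12345.67 when places == 2
        df[col] = df[col].astype("int64") / 10 ** places
    return df


df = read_scaled_fwf(
    io.StringIO(SAMPLE),
    colspecs=[(0, 11), (11, 16)],
    names=["Balance", "Int_Rate"],
    decimals=[2, 2],  # assumption: two implied decimal places in each field
)
print(df)
```

The same conversion could be applied to the list of tuples returned by parse_data_file above, since those fields are also plain strings.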