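I am trying to read a table that uses variable-width whitespace as its delimiter and that also contains missing/blank values. I want to read the table in Python and produce a CSV file. I have tried NumPy, Pandas, and the csv library, but the combination of variable spacing and missing data makes the table almost impossible to read. The file I am trying to read is here: goo.gl/z7S2Mo
I would really appreciate it if anyone could help me with a solution in Python.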
Answer 0 (score: 2)
You need to set the delimiter to two or more whitespace characters (rather than one or more). Here is a solution:
import pandas as pd
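# Raw string so that \s reaches the regex engine unchanged; a regex separator requires engine='python'.
df = pd.read_csv('infotable.txt', sep=r'\s{2,}', header=None, engine='python', thousands=',')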
Result:
>>> print(df.head())
                                0             1          2     3      4   5  \
0  ISHARES MORNINGSTAR MID GROWTH           ETP  464288307  3892  41700  SH
1   ISHARES S&P MIDCAP 400 GROWTH           ETP  464287606  4700  47600  SH
2               BED BATH & BEYOND  Common Stock  075896100   870  15000  SH
3              CARBO CERAMICS INC  Common Stock  140781105   950   7700  SH
4    CATALYST HEALTH SOLUTIONS IN  Common Stock  14888B103  1313  25250  SH

      6      7  8  9
0  Sole  41700  0  0
1  Sole  47600  0  0
2  Sole  15000  0  0
3  Sole   7700  0  0
4  Sole  25250  0  0
>>> print(df.dtypes)
0 object
1 object
2 object
3 int64
4 int64
5 object
6 object
7 int64
8 int64
9 int64
dtype: object
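Since the question also asks for a CSV file, here is a minimal sketch of the full round trip. It assumes the same regex separator as above and uses a small made-up in-memory sample (the rows, `sample`, and `csv_text` are illustrative and not taken from the linked file); for the real file you would pass its path to `read_csv` and a filename to `to_csv`.

```python
import io

import pandas as pd

# Hypothetical sample: fields are separated by two or more spaces,
# while single spaces occur inside the names.
sample = (
    "ALPHA GROWTH FUND   ETP           123456789   3,892   41700\n"
    "BETA & CO           Common Stock  98765B103     870   15000\n"
)

# Split on runs of two or more whitespace characters; thousands=','
# lets values such as "3,892" be parsed as integers.
df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s{2,}",
    header=None,
    engine="python",
    thousands=",",
)

# Serialize to CSV text; pass a file path instead to write a file.
csv_text = df.to_csv(index=False, header=False)
print(csv_text)
```

One caveat with this approach: a blank field shows up as nothing more than a longer run of spaces, so the regex cannot tell it apart from an ordinary gap and the remaining values on that line shift left. If the file has blank cells in the middle of a row, a fixed-width approach is likely the safer choice.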
Answer 1 (score: 1)
The numpy module has a function that does exactly this (see the last line):
import numpy as np
path = "<insert file path here>/infotable.txt"
# read off column locations from a text editor.
# I used Notepad++ to do that.
column_locations = np.array([1, 38, 52, 61, 70, 78, 98, 111, 120, 127, 132])
# My text editor starts counting at 1, while numpy starts at 0. Fixing that:
column_locations = column_locations - 1
# Get column widths
widths = column_locations[1:] - column_locations[:-1]
data = np.genfromtxt(path, dtype=None, delimiter=widths, autostrip=True)
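Depending on your exact use case, you may want to obtain the column widths in a different way, but you get the idea. dtype=None makes numpy determine the data type of each column for you; this is very different from omitting the dtype argument, in which case every field is parsed as a float. Finally, autostrip=True removes leading and trailing whitespace.
The output (data) is a structured array.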
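To make the fixed-width idea concrete, here is a small self-contained sketch on a made-up two-row sample (the rows and the 1-based column positions are illustrative and not read from the linked file). It passes `encoding="utf-8"` explicitly, which is an addition to the call above, so that text columns come back as `str`. It also shows what happens when `dtype` is left out.

```python
import io

import numpy as np

# Hypothetical fixed-width sample: name (12 chars), type (6 chars), shares (5 chars).
sample = (
    "ALPHA CORP  ETP     100\n"
    "BETA & CO   Stock  2500\n"
)

# 1-based column start positions as a text editor would show them,
# plus one position past the end of the last column.
column_locations = np.array([1, 13, 19, 24])
widths = np.diff(column_locations - 1).tolist()  # [12, 6, 5]

# dtype=None: numpy infers a type per column and returns a structured array
# whose fields are named f0, f1, f2 by default.
data = np.genfromtxt(
    io.StringIO(sample),
    dtype=None,
    delimiter=widths,
    autostrip=True,
    encoding="utf-8",
)
print(data["f0"], data["f2"])

# dtype omitted: every field is parsed as a float, so the text columns become nan.
as_float = np.genfromtxt(
    io.StringIO(sample),
    delimiter=widths,
    autostrip=True,
    encoding="utf-8",
)
print(as_float)
```

Because the columns are cut at fixed character positions, a blank cell stays in its own column instead of shifting the rest of the row.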