Reading a table with variable whitespace delimiters in Python

Time: 2017-03-28 00:05:30

Tags: python csv numpy

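I am trying to read a table whose columns are separated by a variable amount of whitespace and which also contains missing/blank values. I want to read the table in Python and produce a CSV file. I have tried the NumPy, Pandas and csv libraries, but unfortunately the combination of variable spacing and missing data makes the table almost impossible for me to read. The file I am trying to read is attached here: goo.gl/z7S2Mo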

This is what the table looks like

If anyone can help me with a solution in Python, I would really appreciate it.

2 Answers:

Answer 0: (score: 2)

You need to set the delimiter to two or more spaces (rather than one or more spaces). Here is a solution:

import pandas as pd
# Raw-string regex: split on runs of two or more whitespace characters
df = pd.read_csv('infotable.txt', sep=r'\s{2,}', header=None, engine='python', thousands=',')

Result:

>>> print(df.head())
                                0             1          2     3      4   5  \
0  ISHARES MORNINGSTAR MID GROWTH           ETP  464288307  3892  41700  SH   
1   ISHARES S&P MIDCAP 400 GROWTH           ETP  464287606  4700  47600  SH   
2               BED BATH & BEYOND  Common Stock  075896100   870  15000  SH   
3              CARBO CERAMICS INC  Common Stock  140781105   950   7700  SH   
4    CATALYST HEALTH SOLUTIONS IN  Common Stock  14888B103  1313  25250  SH   

      6      7  8  9  
0  Sole  41700  0  0  
1  Sole  47600  0  0  
2  Sole  15000  0  0  
3  Sole   7700  0  0  
4  Sole  25250  0  0  

>>> print(df.dtypes)
0    object
1    object
2    object
3     int64
4     int64
5    object
6    object
7     int64
8     int64
9     int64
dtype: object
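As a self-contained illustration of the two-or-more-spaces separator, here is a minimal sketch that parses a small in-memory sample instead of infotable.txt and then writes it back out as CSV, which is what the question asks for. The sample rows and the names `sample` and `csv_text` are assumptions made up for this example; only the `read_csv` arguments come from the answer above.

```python
import io

import pandas as pd

# Hypothetical sample (not the real infotable.txt): columns are separated by
# two or more spaces, while single spaces occur inside the text fields.
sample = (
    "BED BATH & BEYOND             Common Stock  075896100    870  15000\n"
    "CATALYST HEALTH SOLUTIONS IN  Common Stock  14888B103  1,313  25250\n"
)

df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s{2,}",     # regex: two or more whitespace characters
    header=None,
    engine="python",   # regex separators are handled by the Python parser
    thousands=",",     # read "1,313" as the number 1313
)

# Serialize the parsed table as comma-separated text (returned as a string here;
# pass a file path to to_csv to write a file instead).
csv_text = df.to_csv(index=False, header=False)
print(csv_text)
```

One caveat, given the missing values mentioned in the question: if a field is completely blank, the surrounding spaces merge into a single long run that matches the separator once, so the remaining values in that row shift one column to the left. This approach is therefore safest when every row has all of its fields.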

Answer 1: (score: 1)

The numpy module has a function that does exactly this (see the last line):

import numpy as np

path = "<insert file path here>/infotable.txt"

# read off column locations from a text editor.
# I used Notepad++ to do that.
column_locations = np.array([1, 38, 52, 61, 70, 78, 98, 111, 120, 127, 132])

# My text editor starts counting at 1, while numpy starts at 0. Fixing that:
column_locations = column_locations - 1

# Get column widths
widths = column_locations[1:] - column_locations[:-1]

data = np.genfromtxt(path, dtype=None, delimiter=widths, autostrip=True)

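Depending on your exact use case, you may obtain the column widths in a different way, but you get the idea. dtype=None makes numpy determine the data type of each column for you; this is very different from omitting the dtype argument, which defaults to float. Finally, autostrip=True strips leading and trailing whitespace from each value.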

The output (data) is a structured array.
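To make the fixed-width approach and the effect of dtype=None concrete, here is a minimal sketch on a tiny in-memory table. The rows, the widths of 20, 14 and 6 characters, and the names `rows`, `sample` and `as_float` are assumptions invented for this example; they are not the layout of the real infotable.txt. The encoding argument is passed so that text columns come back as str.

```python
import io

import numpy as np

# Hypothetical fixed-width layout: name (20 chars), type (14 chars), shares (6 chars).
# The second row leaves the "type" field blank.
rows = [
    ("BED BATH & BEYOND", "Common Stock", 15000),
    ("CARBO CERAMICS INC", "", 7700),
]
sample = "".join(f"{name:<20}{kind:<14}{shares:>6}\n" for name, kind, shares in rows)
widths = [20, 14, 6]

# dtype=None: numpy infers one type per column and returns a structured array.
data = np.genfromtxt(
    io.StringIO(sample),
    dtype=None,
    delimiter=widths,   # a list of integers means fixed column widths
    autostrip=True,     # strip the padding around each value
    encoding="utf-8",
)

# Omitting dtype: every value is parsed as float, so text becomes nan.
as_float = np.genfromtxt(io.StringIO(sample), delimiter=widths)

print(data.dtype.names)   # field names are generated automatically
print(data["f0"])         # one column, selected by field name
print(data[1])            # one row, including the blank field
print(as_float)
```

Because the columns are cut by position rather than by a separator, a blank field stays in its own column and does not shift the values that follow it. How a blank value is filled in depends on the column type and on the missing_values and filling_values arguments of genfromtxt.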