Reading a text file generated by dataframe.to_string with pandas

Time: 2016-11-09 22:22:13

Tags: python csv pandas dataframe

I have a text file containing this table:

                   Ion  TheoWavelength         Blended_Set  
Line_Label                                                                                                                                             
H1_4340A    Hgamma_5_2        4340.471                None
He1_4472A     HeI_4471        4471.479                None
He2_4686A    HeII_4686        4685.710                None
Ar4_4711A       [ArIV]        4711.000                None
Ar4_4740A       [ArIV]        4740.000                None
H1_4861A     Hbeta_4_2        4862.683                None

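This table was generated from a pandas DataFrame with dataframe.to_string, and the resulting unicode string was then saved to a file.

I would like to create a DataFrame from this file using a pandas function: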

import pandas as pd
df = pd.read_csv('my_table_file.txt', delim_whitespace = True, header = 0, index_col = 0)

But I get this error:

Traceback (most recent call last):
  File 
    df = pd.read_csv(table, delim_whitespace = True, header = 0, index_col = 0)
  File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
  File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
  File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
  File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

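I suspect this is caused by the index column name sitting on a row of its own.

Is there a way to avoid this problem, or to export the table without this label?

P.S. I tried dataframe.to_csv, but as far as I can tell it does not let you produce an aligned, tabular column layout when the columns have different dtypes.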
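On exporting the table without the label: one option that may help is the index_names argument of DataFrame.to_string, which leaves out the separate index-name row. The sketch below rebuilds the table by hand (the DataFrame construction is an assumption made for illustration, not the original data source) and reads the text back. It relies on no cell containing whitespace, and the index name has to be restored manually because it is no longer stored in the text.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the original data frame.
df = pd.DataFrame(
    {
        "Ion": ["Hgamma_5_2", "HeI_4471", "HeII_4686", "[ArIV]", "[ArIV]", "Hbeta_4_2"],
        "TheoWavelength": [4340.471, 4471.479, 4685.710, 4711.000, 4740.000, 4862.683],
        "Blended_Set": [None] * 6,
    },
    index=pd.Index(
        ["H1_4340A", "He1_4472A", "He2_4686A", "Ar4_4711A", "Ar4_4740A", "H1_4861A"],
        name="Line_Label",
    ),
)

# index_names=False omits the separate "Line_Label" row from the text output.
text = df.to_string(index_names=False)

# The header row now has one field fewer than each data row, so read_csv
# treats the first column as the index.
df2 = pd.read_csv(StringIO(text), sep=r"\s+")

# The index name is not part of the text any more, so set it by hand.
df2.index.name = "Line_Label"
```

Here text is what would be written to the file, and StringIO(text) can be replaced by the file path when reading it back.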

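2 Answers:

Answer 0: (score: 1)

I would use the HDF5 format in this case - it will take care of your index.

Besides being much faster than CSV, it lets you select data conditionally (as with a SQL database), supports compression, and so on.

Demo: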

In [2]: df
Out[2]:
                   Ion  TheoWavelength Blended_Set
Line_Label
H1_4340A    Hgamma_5_2        4340.471        None
He1_4472A     HeI_4471        4471.479        None
He2_4686A    HeII_4686        4685.710        None
Ar4_4711A       [ArIV]        4711.000        None
Ar4_4740A       [ArIV]        4740.000        None
H1_4861A     Hbeta_4_2        4862.683        None

In [3]: df.to_hdf('d:/temp/myhdf.h5', 'df', format='t', data_columns=True)

In [4]: x = pd.read_hdf('d:/temp/myhdf.h5', 'df')

In [5]: x
Out[5]:
                   Ion  TheoWavelength Blended_Set
Line_Label
H1_4340A    Hgamma_5_2        4340.471        None
He1_4472A     HeI_4471        4471.479        None
He2_4686A    HeII_4686        4685.710        None
Ar4_4711A       [ArIV]        4711.000        None
Ar4_4740A       [ArIV]        4740.000        None
H1_4861A     Hbeta_4_2        4862.683        None

You can even query your HDF5 file as you would a SQL database:

In [20]: x2 = pd.read_hdf('d:/temp/myhdf.h5', 'df', where="TheoWavelength > 4500 and Ion == '[ArIV]'")

In [21]: x2
Out[21]:
               Ion  TheoWavelength Blended_Set
Line_Label
Ar4_4711A   [ArIV]          4711.0        None
Ar4_4740A   [ArIV]          4740.0        None

Answer 1: (score: 0)

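Consider Python's built-in StringIO (part of the io module as of Python 3; in Python 2, StringIO is its own module) to read the text from a scalar string. Call it inside pandas' read_table(), then take the headers from the first lines of the string, as in the code below.

If you need to read from a file instead, pass the file to read_table() and then read the text file separately to extract the headers.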

from io import StringIO
import pandas as pd

data = '''
                   Ion  TheoWavelength         Blended_Set
Line_Label
H1_4340A    Hgamma_5_2        4340.471                None
He1_4472A     HeI_4471        4471.479                None
He2_4686A    HeII_4686        4685.710                None
Ar4_4711A       [ArIV]        4711.000                None
Ar4_4740A       [ArIV]        4740.000                None
H1_4861A     Hbeta_4_2        4862.683                None
'''

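# Skip the leading blank line and the two header lines, then parse the whitespace-separated rows.
df = pd.read_table(StringIO(data), sep=r"\s+", header=None, skiprows=3, index_col=0)

# Collect the column names and the index name from the first three lines of the string.
headers = [item for line in data.split('\n')[0:3] for item in line.split()][0:4]
df.columns = headers[0:3]
df.index.name = headers[3]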
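
For the file-based variant mentioned above, a minimal sketch might look like the following. It assumes the file starts directly with the column-name line followed by the index-name line, as in the question, so two rows are skipped instead of three. The temporary file is only there to keep the example self-contained and stands in for my_table_file.txt.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample file with the same layout as the table in the question.
table_text = (
    "                   Ion  TheoWavelength         Blended_Set\n"
    "Line_Label\n"
    "H1_4340A    Hgamma_5_2        4340.471                None\n"
    "He1_4472A     HeI_4471        4471.479                None\n"
    "He2_4686A    HeII_4686        4685.710                None\n"
    "Ar4_4711A       [ArIV]        4711.000                None\n"
    "Ar4_4740A       [ArIV]        4740.000                None\n"
    "H1_4861A     Hbeta_4_2        4862.683                None\n"
)
path = os.path.join(tempfile.mkdtemp(), "my_table_file.txt")
with open(path, "w") as fh:
    fh.write(table_text)

# Read the two header lines by hand: column names first, then the index name.
with open(path) as fh:
    column_names = fh.readline().split()
    index_name = fh.readline().split()[0]

# Parse only the data rows; the first field of each row becomes the index.
df = pd.read_table(path, sep=r"\s+", header=None, skiprows=2, index_col=0)
df.columns = column_names
df.index.name = index_name
```

As with the string version, this only works while no cell contains whitespace, since every run of spaces is treated as a separator.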