I have a text file containing this table:
Ion TheoWavelength Blended_Set
Line_Label
H1_4340A Hgamma_5_2 4340.471 None
He1_4472A HeI_4471 4471.479 None
He2_4686A HeII_4686 4685.710 None
Ar4_4711A [ArIV] 4711.000 None
Ar4_4740A [ArIV] 4740.000 None
H1_4861A Hbeta_4_2 4862.683 None
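This table was generated from a pandas DataFrame with dataframe.to_string, and the resulting unicode variable was then saved to a file.
I would like to create a DataFrame from this file with a pandas function: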
import pandas as pd
df = pd.read_csv('my_table_file.txt', delim_whitespace = True, header = 0, index_col = 0)
But I get this error:
Traceback (most recent call last):
File
df = pd.read_csv(table, delim_whitespace = True, header = 0, index_col = 0)
File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 325, in _read
return parser.read()
File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "/home/user/anaconda/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
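I suspect this is caused by the index column name sitting on its own line.
Is there a way to avoid this problem, or to export the table without this label?
P.S. I tried dataframe.to_csv, but as far as I can tell it does not let you write an aligned, tabular column layout when the columns have different dtypes.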
Answer 0 (score: 1)
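In this case I would use the HDF5 format; it takes care of the index for you.
Besides being much faster than CSV, it lets you select data conditionally (as you would with a SQL DB), supports compression, and so on.
Demo: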
In [2]: df
Out[2]:
Ion TheoWavelength Blended_Set
Line_Label
H1_4340A Hgamma_5_2 4340.471 None
He1_4472A HeI_4471 4471.479 None
He2_4686A HeII_4686 4685.710 None
Ar4_4711A [ArIV] 4711.000 None
Ar4_4740A [ArIV] 4740.000 None
H1_4861A Hbeta_4_2 4862.683 None
In [3]: df.to_hdf('d:/temp/myhdf.h5', 'df', format='t', data_columns=True)
In [4]: x = pd.read_hdf('d:/temp/myhdf.h5', 'df')
In [5]: x
Out[5]:
Ion TheoWavelength Blended_Set
Line_Label
H1_4340A Hgamma_5_2 4340.471 None
He1_4472A HeI_4471 4471.479 None
He2_4686A HeII_4686 4685.710 None
Ar4_4711A [ArIV] 4711.000 None
Ar4_4740A [ArIV] 4740.000 None
H1_4861A Hbeta_4_2 4862.683 None
You can even query your HDF5 file, much like a SQL DB:
In [20]: x2 = pd.read_hdf('d:/temp/myhdf.h5', 'df', where="TheoWavelength > 4500 and Ion == '[ArIV]'")
In [21]: x2
Out[21]:
Ion TheoWavelength Blended_Set
Line_Label
Ar4_4711A [ArIV] 4711.0 None
Ar4_4740A [ArIV] 4740.0 None
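For comparison, the same selection can be written as a boolean mask on an in-memory DataFrame. This is only a minimal sketch that does not touch HDF5 at all: the small frame is rebuilt by hand from four of the rows shown above, and the names frame, mask and selected are illustrative.

```python
import pandas as pd

# Hand-built subset of the table above (illustrative, not read from disk)
frame = pd.DataFrame(
    {
        "Ion": ["HeII_4686", "[ArIV]", "[ArIV]", "Hbeta_4_2"],
        "TheoWavelength": [4685.710, 4711.000, 4740.000, 4862.683],
        "Blended_Set": [None, None, None, None],
    },
    index=pd.Index(
        ["He2_4686A", "Ar4_4711A", "Ar4_4740A", "H1_4861A"],
        name="Line_Label",
    ),
)

# Same condition as the where= string, expressed as a boolean mask
mask = (frame["TheoWavelength"] > 4500) & (frame["Ion"] == "[ArIV]")
selected = frame[mask]
print(selected)
```

The practical difference is that read_hdf with where= applies the condition while reading the table from disk, whereas the mask is evaluated after the whole frame is already in memory.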
Answer 1 (score: 0)
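Consider Python's built-in StringIO, available from the io module as of Python 3 (in Python 2 it is its own module), to read a text string held in a variable. Call it inside pandas' read_table(), then manipulate the string content of the first lines to build the headers.
If you need to read from a file instead, use read_table() on the file and then read the text file separately to extract the headers.
The StringIO version: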
from io import StringIO
import pandas as pd
data = '''
Ion TheoWavelength Blended_Set
Line_Label
H1_4340A Hgamma_5_2 4340.471 None
He1_4472A HeI_4471 4471.479 None
He2_4686A HeII_4686 4685.710 None
Ar4_4711A [ArIV] 4711.000 None
Ar4_4740A [ArIV] 4740.000 None
H1_4861A Hbeta_4_2 4862.683 None
'''
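# Skip the blank first line and the two header lines; parse only the data rows
df = pd.read_table(StringIO(data), sep=r"\s+", header=None, skiprows=3, index_col=0)
# Collect the three column names plus the index name from the header lines
headers = [item for line in data.split('\n')[0:3] for item in line.split()][0:4]
df.columns = headers[0:3]
df.index.name = headers[3]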
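The file-based variant mentioned in this answer might look like the sketch below. It assumes the file has exactly the layout from the question (column names on line 1, the index name alone on line 2, data from line 3 onward). The temporary file and the variable names are illustrative only; in practice you would point path at your own file.

```python
import os
import tempfile

import pandas as pd

table_text = """Ion TheoWavelength Blended_Set
Line_Label
H1_4340A Hgamma_5_2 4340.471 None
He1_4472A HeI_4471 4471.479 None
He2_4686A HeII_4686 4685.710 None
Ar4_4711A [ArIV] 4711.000 None
Ar4_4740A [ArIV] 4740.000 None
H1_4861A Hbeta_4_2 4862.683 None
"""

# Write the table to a temporary file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(table_text)
    path = tmp.name

# Read the two header lines by hand: column names, then the index name
with open(path) as fh:
    column_names = fh.readline().split()
    index_name = fh.readline().strip()

# Parse only the data rows, using the first field as the index
df = pd.read_csv(path, sep=r"\s+", header=None, skiprows=2, index_col=0)
df.columns = column_names
df.index.name = index_name

os.remove(path)
print(df)
```

Because the header lines are read separately, the parser only ever sees rows with four fields, which avoids the "Expected 3 fields in line 3, saw 4" error from the question.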