My problem is that I have several text files, each larger than 200 MB, in the following format (a very small sample):
john,smith,3;sasha,dilma,4;sofia,vergara,5;etc.
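I need to read all of these files and analyze the data: charts, sums, and so on.
I have been considering different ways to store the data and work with it in Python. However, the line terminator ';' causes problems every time I try to load the data into a database or directly in Python (I also tried the lineterminator parameter), for example: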
import pandas as pd
userHeader = ['name', 'last_name', 'number']
users = pd.read_table('C:/prueba.txt', engine='python', sep=',', header=None, names=userHeader)
# print the first 3 users
print('# 3 first users: \n%s' % users[:3])
Result:
# 3 first users:
name last_name number
0 john,smith,3 sasha,dilma,4 sofia,vergara,5
Edit: when I add lineterminator like this:
users = pd.read_table('C:/prueba.txt', engine='python', sep=',', lineterminator=';', header=None, names=userHeader)
I get the following:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-23a80631d090> in <module>()
1 import pandas as pd
2 userHeader = ['user_id', 'gender', 'age']
----> 3 users = pd.read_table('C:/prueba.txt', engine='python', sep=';', lineterminator=';', header=None, names=userHeader)
4
5 # print 5 first users
C:\Users\molmos\Anaconda\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
472 skip_blank_lines=skip_blank_lines)
473
--> 474 return _read(filepath_or_buffer, kwds)
475
476 parser_f.__name__ = name
C:\Users\molmos\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
248
249 # Create the parser.
--> 250 parser = TextFileReader(filepath_or_buffer, **kwds)
251
252 if (nrows is not None) and (chunksize is not None):
C:\Users\molmos\Anaconda\lib\site-packages\pandas\io\parsers.pyc in __init__(self, f, engine, **kwds)
564 self.options['has_index_names'] = kwds['has_index_names']
565
--> 566 self._make_engine(self.engine)
567
568 def _get_options_with_defaults(self, engine):
C:\Users\molmos\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _make_engine(self, engine)
709 elif engine == 'python-fwf':
710 klass = FixedWidthFieldParser
--> 711 self._engine = klass(self.f, **self.options)
712
713 def _failover_to_python(self):
C:\Users\molmos\Anaconda\lib\site-packages\pandas\io\parsers.pyc in __init__(self, f, **kwds)
1420 # Set self.data to something that can read lines.
1421 if hasattr(f, 'readline'):
-> 1422 self._make_reader(f)
1423 else:
1424 self.data = f
C:\Users\molmos\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _make_reader(self, f)
1495 if sep is None or len(sep) == 1:
1496 if self.lineterminator:
-> 1497 raise ValueError('Custom line terminators not supported in '
1498 'python parser (yet)')
1499
ValueError: Custom line terminators not supported in python parser (yet)
Do you know how I can read and work with all the information stored in these text files?
Thanks for your help.
Answer 0 (score: 2)
Add the parameter lineterminator=";".
import pandas as pd
import io
temp=u"""john,smith,3;sasha,dilma,4;sofia,vergara,5"""
userHeader = ['name', 'last_name', 'number']
users = pd.read_table(io.StringIO(temp), sep=',', lineterminator=";",header=None, names=userHeader)
print(users)
# name last_name number
#0 john smith 3
#1 sasha dilma 4
#2 sofia vergara 5
You must omit engine='python', because it raises this error:
ValueError: Custom line terminators not supported in python parser (yet)
Docs:
lineterminator : string (length 1), default None
Character to break file into lines. Only valid with C parser
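As a quick check of that restriction, the sketch below reads the same sample string with both engines. It assumes Python 3 and a reasonably recent pandas, and it uses pd.read_csv in place of read_table (the parameters used here are the same). The helper name try_engine is made up for this illustration. The exact error text may differ between pandas versions, so the code only records whether a ValueError was raised.

```python
import io
import pandas as pd

SAMPLE = u"john,smith,3;sasha,dilma,4;sofia,vergara,5"
USER_HEADER = ['name', 'last_name', 'number']


def try_engine(engine):
    """Hypothetical helper: parse SAMPLE with the given parser engine.

    Returns (DataFrame, None) on success, or (None, message) on ValueError.
    """
    try:
        df = pd.read_csv(io.StringIO(SAMPLE), engine=engine, sep=',',
                         lineterminator=';', header=None, names=USER_HEADER)
        return df, None
    except ValueError as exc:
        return None, str(exc)


c_df, c_err = try_engine('c')
py_df, py_err = try_engine('python')

print(c_df)
print('python engine error:', py_err)
```

If the C engine call succeeds, c_df holds one row per ';'-terminated record and c_err is None.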
Answer 1 (score: 1)
sep is the field separator. The line terminator is given in lineterminator.
users = pd.read_table('C:/prueba.txt', engine='c', sep=',', lineterminator=';', header=None, names=userHeader)
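Since the files in the question are 200 MB or more, the same call can also be made with chunksize so that only part of a file is held in memory at a time. This is only a sketch under assumptions: the chunk size of 2 and the in-memory sample standing in for 'C:/prueba.txt' are chosen for illustration, and pd.read_csv is used in place of read_table.

```python
import io
import pandas as pd

user_header = ['name', 'last_name', 'number']
# Stand-in for the real file; replace with a path such as 'C:/prueba.txt'.
source = io.StringIO(u"john,smith,3;sasha,dilma,4;sofia,vergara,5")

total = 0
row_count = 0

# With chunksize set, read_csv returns an iterator of DataFrames.
reader = pd.read_csv(source, engine='c', sep=',', lineterminator=';',
                     header=None, names=user_header, chunksize=2)
for chunk in reader:
    total += int(chunk['number'].sum())  # running sum of the numeric column
    row_count += len(chunk)              # running record count

print(row_count, total)
```

For real files a much larger chunk size would probably be more practical; the running totals are combined the same way.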
Answer 2 (score: 0)
Use lineterminator:
df = pd.read_table('C:/prueba.txt', sep=',', lineterminator=';', header=None, names=userHeader)
In [62]: df
Out[62]:
john smith 3
0 sasha dilma 4
1 sofia vergara 5
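In the output above, the first record (john smith 3) appears where the column names should be. Assuming that output came from a call made without header=None and names, this matches the pandas default of taking the first record as the header. A small sketch of the difference, using an in-memory sample and pd.read_csv in place of read_table:

```python
import io
import pandas as pd

sample = u"john,smith,3;sasha,dilma,4;sofia,vergara,5"
user_header = ['name', 'last_name', 'number']

# Default header inference: the first record becomes the column names.
inferred = pd.read_csv(io.StringIO(sample), sep=',', lineterminator=';')

# Explicit names with header=None: every record is kept as a data row.
named = pd.read_csv(io.StringIO(sample), sep=',', lineterminator=';',
                    header=None, names=user_header)

print(list(inferred.columns), len(inferred))
print(list(named.columns), len(named))
```

Passing header=None together with names keeps all three records as data rows.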