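I have a 350 MB tab-delimited text file. If I try to read it into memory, I get an out-of-memory exception. So I am trying something along these lines (i.e. reading in only a few columns):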
import pandas as pd
input_file_and_path = r'C:\Christian\ModellingData\X.txt'
column_names = [
    'X1'
    # , 'X2
]
raw_data = pd.DataFrame()
for chunk in pd.read_csv(input_file_and_path, names=column_names, chunksize=1000, sep='\t'):
    raw_data = pd.concat([raw_data, chunk], ignore_index=True)
print(raw_data.head())
Unfortunately, I get:
Traceback (most recent call last):
File "pandas\_libs\parsers.pyx", line 1134, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/xxxx/EdaDataPrepRange1.py", line 17, in <module>
for chunk in pd.read_csv(input_file_and_path, header=None, names=column_names, chunksize=1000, sep='\t'):
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
return self.get_chunk()
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
return self.read(nrows=size)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1141, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte
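Any ideas? By the way, how should I generally handle large files and impute missing values? Ultimately, I need to read everything in order to determine, for example, the median to impute.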
Answer 0 (score: 3)
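Use encoding="utf-8" when calling pd.read_csv.
The question referenced below used a different encoding; see whether it works for you: open(file_path, encoding='windows-1252')
Reference: 'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte
Working solution
Pass encoding="ISO-8859-1" to pd.read_csv.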
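The working solution can be sketched as follows. This is a minimal illustration rather than the asker's actual script: the helper name read_selected_columns, the column names, and the tiny temporary file standing in for the real 350 MB file are all assumptions made for the example. It combines encoding="ISO-8859-1" with usecols, so that only the wanted columns are kept while the file is read in chunks.

```python
import os
import tempfile

import pandas as pd


def read_selected_columns(path, all_columns, wanted,
                          encoding="ISO-8859-1", chunksize=1000):
    """Read only `wanted` columns of a headerless tab-separated file in chunks."""
    chunks = pd.read_csv(
        path,
        sep="\t",
        header=None,          # the file is assumed to have no header row
        names=all_columns,    # names for every column in the file
        usecols=wanted,       # columns actually kept in memory
        encoding=encoding,    # single-byte codec, so byte 0xae decodes cleanly
        chunksize=chunksize,
    )
    # pd.concat consumes the chunk iterator and builds one DataFrame.
    return pd.concat(chunks, ignore_index=True)


# Hypothetical stand-in for the real file: the first field contains byte 0xae.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as handle:
    handle.write(b"Acme\xae\t1\nOther\t2\n")
    sample_path = handle.name

sample = read_selected_columns(sample_path, ["X1", "X2"], ["X1"])
os.remove(sample_path)
print(sample)
```

ISO-8859-1 maps every possible byte to a character, so decoding cannot fail; whether the resulting characters are the intended ones depends on how the file was actually written.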
Answer 1 (score: 2)
Regarding the large-file problem, just use a file handle and a context manager:
with open("your_file.txt") as fileObject:
    for line in fileObject:
        do_something_with(line)
## No need to close file as 'with' automatically does that
This does not load the whole file into memory. Instead, it loads one line at a time and "forgets" earlier lines unless you keep a reference to them.
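As a rough sketch of the same line-by-line idea applied to a tab-separated file, the snippet below streams a single column using only the standard library. The function name iter_column, the ISO-8859-1 encoding, and the temporary demo file are assumptions for illustration; do_something_with from the snippet above is replaced by simply yielding the field.

```python
import csv
import os
import tempfile


def iter_column(path, column_index, encoding="ISO-8859-1"):
    """Yield one field per row; only the current row is held in memory."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline="") as file_object:
        for row in csv.reader(file_object, delimiter="\t"):
            yield row[column_index]


# Hypothetical stand-in for the real file: two rows, one containing byte 0xae.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as handle:
    handle.write(b"Acme\xae\t1\nOther\t2\n")
    demo_path = handle.name

first_column = list(iter_column(demo_path, 0))
os.remove(demo_path)
print(first_column)
```

Calling list() here keeps every value of the column, which is fine for one column of a 350 MB file; for a running count or sum, iterate over the generator directly instead.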
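Also, regarding your encoding problem, pass an explicit encoding to pd.read_csv. Since "utf-8" is already the default and is the codec that fails in the traceback, try a single-byte encoding such as encoding="ISO-8859-1" instead.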
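On the question of imputing a median without loading everything at once, one possible approach is two passes over the file: a first pass that reads only the column of interest to compute its median, and a second pass that fills the gaps chunk by chunk. The sketch below assumes a single column fits in memory; the function names, column names, and in-memory sample data are hypothetical.

```python
import io

import pandas as pd

COLUMNS = ["X1", "X2"]  # assumed column names for a headerless file


def read_chunks(source, usecols=None, chunksize=2, encoding="ISO-8859-1"):
    """Return an iterator of DataFrame chunks from a tab-separated source."""
    return pd.read_csv(source, sep="\t", header=None, names=COLUMNS,
                       usecols=usecols, encoding=encoding, chunksize=chunksize)


def column_median(source, column):
    """First pass: load only `column`, chunk by chunk, and return its median."""
    pieces = [chunk[column] for chunk in read_chunks(source, usecols=[column])]
    return pd.concat(pieces, ignore_index=True).median()  # NaN is skipped


def impute_chunks(source, column, fill_value):
    """Second pass: yield chunks with missing values in `column` filled."""
    for chunk in read_chunks(source):
        chunk[column] = chunk[column].fillna(fill_value)
        yield chunk


# Hypothetical sample data; empty fields are parsed as missing values.
raw = b"1.0\ta\n\tb\n3.0\tc\n\td\n8.0\te\n"

median_x1 = column_median(io.BytesIO(raw), "X1")
filled = pd.concat(impute_chunks(io.BytesIO(raw), "X1", median_x1),
                   ignore_index=True)
print(median_x1)
print(filled)
```

In a real run each imputed chunk would more likely be written out or aggregated than concatenated, since concatenating every chunk rebuilds the full table in memory.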