I have a very simple CSV with the following data, compressed inside a tar.gz file. I need to read it into a DataFrame using pandas.read_csv.
A B
0 1 4
1 2 5
2 3 6
import pandas as pd
pd.read_csv("sample.tar.gz",compression='gzip')
However, I get the following error:
CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2
Here are the read_csv calls I tried and the error each one produces:
pd.read_csv("sample.tar.gz",compression='gzip', engine='python')
Error: line contains NULL byte
pd.read_csv("sample.tar.gz",compression='gzip', header=0)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2
pd.read_csv("sample.tar.gz",compression='gzip', header=0, sep=" ")
CParserError: Error tokenizing data. C error: Expected 2 fields in line 94, saw 14
pd.read_csv("sample.tar.gz",compression='gzip', header=0, sep=" ", engine='python')
Error: line contains NULL byte
What's going wrong here? How can I fix this?
Answer 0 (score: 11)
df = pd.read_csv('sample.tar.gz', compression='gzip', header=0, sep=' ', quotechar='"', error_bad_lines=False)
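Note: error_bad_lines=False skips the offending rows instead of raising an error. In pandas 1.3 and later this parameter is deprecated in favor of on_bad_lines='skip', and it was removed in pandas 2.0.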
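For context on where the offending rows come from: compression='gzip' undoes only the gzip layer, so what reaches the parser is still a tar stream, with a 512-byte header block (padded with NUL bytes) in front of the file contents. The sketch below illustrates this with the standard library only. The in-memory archive, its member name sample.csv, and the space-separated payload are made up for the demonstration and are not the asker's actual file.

```python
import gzip
import io
import tarfile

# Demo payload (an assumption, not the asker's real data).
payload = b"A B\n1 4\n2 5\n3 6\n"

# Build a tiny tar.gz in memory. USTAR_FORMAT keeps the layout simple:
# one 512-byte header block followed by the file data.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo(name="sample.csv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Remove only the gzip layer, which is all that compression='gzip' does.
tar_stream = gzip.decompress(buf.getvalue())

header = tar_stream[:512]
has_nul_bytes = b"\x00" in header                      # tar headers are NUL-padded
data_after_header = tar_stream[512:512 + len(payload)]  # the CSV text starts here

print(header[:10])
print(has_nul_bytes)
print(data_after_header == payload)
```

Skipping bad lines therefore hides the tar framing rather than removing it, so some header or padding bytes may still end up in the resulting DataFrame.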
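Answer 1 (score: 0)
You can use the tarfile module to read a specific file from a tar.gz archive (as described in this resolved issue).
If the archive contains only one file, you can do the following: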
import tarfile
import pandas as pd
with tarfile.open("sample.tar.gz", "r:*") as tar:
    # Take the first (and only) member of the archive
    csv_path = tar.getnames()[0]
    df = pd.read_csv(tar.extractfile(csv_path), header=0, sep=" ")
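The read mode r:* handles the gz extension (or any other supported compression) transparently. If the compressed tar file contains several files, you can use something like csv_path = list(n for n in tar.getnames() if n.endswith('.csv'))[-1] to pick the last csv file in the archive.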
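The multi-file case can be sketched as a small helper. This is only an illustration: pick_csv_member and build_demo_archive are hypothetical names, and the two-file archive is built in memory purely so the example runs without a real sample.tar.gz. The pandas call is left as a comment so the block needs only the standard library.

```python
import io
import tarfile


def pick_csv_member(tar, suffix=".csv"):
    """Return the name of the last regular file in `tar` ending with `suffix`."""
    names = [m.name for m in tar.getmembers()
             if m.isfile() and m.name.endswith(suffix)]
    if not names:
        raise FileNotFoundError(f"no {suffix} file found in archive")
    return names[-1]


def build_demo_archive():
    """Build a small in-memory tar.gz holding two files (demo data only)."""
    files = [("notes.txt", "not a csv\n"),
             ("sample.csv", "A B\n1 4\n2 5\n3 6\n")]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in files:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_demo_archive(), mode="r:*") as tar:
    csv_name = pick_csv_member(tar)
    raw = tar.extractfile(csv_name).read().decode("utf-8")
    # With pandas installed, the member can go straight to the parser instead:
    # df = pd.read_csv(tar.extractfile(csv_name), header=0, sep=" ")

print(csv_name)
```

Filtering on m.isfile() skips directory entries, which tar.getnames() would otherwise include if they happened to match the suffix check.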