Cannot open some CSV files with pandas that were written with pandas - CParserError

Asked: 2016-02-20 23:48:07

Tags: python csv pandas

I'm using tweepy to collect tweets for analysis, and I write the data to CSV with the following:

import pandas as pd

df = pd.DataFrame(columns=['ID', 'ORIG', 'LAT', 'LONG', 'CITY', 'STATE', 'TIMESTAMP', 
                           'TWEET_ID', 'TWEET'])
# ... tweet data is extracted with tweepy here (omitted) ...
df.loc[self.tweet_counter] = [user_id, 1, lat_long[1], lat_long[0], city, state, 
                             time_stamp, tweet_id, tweet_text]
# Append to the per-day file; no header row is written
df.to_csv('orig/{0}.csv'.format(tweet_date), mode='a', encoding='utf-8', header=False, 
          index=False)

So far everything works. After collecting data for a few days, I started on the analysis and opened the files like this:

for file in files:
    print(file)
    orig = pd.read_csv('orig\\'+file,names=['ID','ORIG','LAT','LONG','CITY','STATE','DATE',
                                            'TWEET_ID','TWEET'],
                        usecols=['CITY','STATE','DATE','TWEET_ID'],
                        encoding='utf-8')

files is the list of files in the current directory that I'm trying to analyze. I got it with the os module; there are about 14 files.
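
For reference, a minimal sketch of how such a list might be built. The question doesn't show this step, so the helper name list_csv_files and the directory argument are assumptions made for illustration:

```python
import os


def list_csv_files(directory):
    """Return the sorted names of the .csv files directly inside `directory`."""
    return sorted(
        name
        for name in os.listdir(directory)
        # Keep regular files only, so a folder named "x.csv" is skipped
        if name.endswith(".csv") and os.path.isfile(os.path.join(directory, name))
    )
```

With a list like this, os.path.join('orig', file) builds the path on any platform, as an alternative to concatenating 'orig\\' by hand.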

Error when opening the files

2016-02-05.csv
2016-02-06.csv
2016-02-07.csv
2016-02-08.csv
Traceback (most recent call last):
  File "C:/Users/Leb/Desktop/Python/Houston/comparison.py", line 30, in <module>
    encoding='utf-8')
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 285, in _read
    return parser.read()
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 747, in read
    ret = self._engine.read(nrows)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 1197, in read
    data = self._reader.read(nrows)
  File "pandas\parser.pyx", line 766, in pandas.parser.TextReader.read (pandas\parser.c:7988)
  File "pandas\parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:8244)
  File "pandas\parser.pyx", line 842, in pandas.parser.TextReader._read_rows (pandas\parser.c:8970)
  File "pandas\parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8838)
  File "pandas\parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas\parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 1113330, saw 9

Below are the two lines before and after line 1113330 of 2016-02-08.csv.

#line 1113328#    2371126254,0.0,,,Rancho Cucamonga,CA,2016-02-08 07:27:09,696596108840034304,@winnjinn I miss it with youu
#line 1113329#    747612104,0.0,,,Portland,OR,2016-02-08 07:27:09,696596108907184130,"When the term hasn't even started yet, but multiple profs have already emailed you homework that's due on the first day#welcometocollege"
#line 1113330#    954897794,0.0,,,Manteca,CA,2016-02-08 07:27:09,696596108504510464,FUCKKKKKKK 6th grade was all bad https://t.co/3fShejyPv6
#line 1113331#    608779746,0.0,,,San Marcos,TX,2016-02-08 07:27:09,696596109385314305,@keeshawnnnnn  https://t.co/g9r8AX7zMn
#line 1113332#    4729317440,0.0,,,Seattle,WA,2016-02-08 07:27:09,696596109876027392,".@tedcruz aka the Antichrist CAMPAIGN GUY JEFF ROE DID SAME DIRTY TRICK N 2010, SAID CANDDTE DROPPED OUT! DW
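
One thing that may be worth checking: line 1113332 above opens a quoted field that does not close on the same line, so that tweet apparently contains a line break. When a quoted field spans several physical lines, the line number a parser reports need not match the record you expect. The sketch below uses made-up sample data (not rows from the real file) and the standard csv module to show the difference between physical lines and parsed records:

```python
import csv
import io

# Two logical records; the second tweet has a line break inside its quotes.
sample = (
    '1,0.0,,,Portland,OR,2016-02-08 07:27:09,100,plain tweet\n'
    '2,0.0,,,Seattle,WA,2016-02-08 07:27:09,101,"first half\nsecond half"\n'
)

physical_lines = sample.splitlines()
# newline="" keeps the embedded line break intact for the csv reader
records = list(csv.reader(io.StringIO(sample, newline="")))

print(len(physical_lines))  # 3 physical lines
print(len(records))         # 2 parsed records
```

This only illustrates the counting mismatch; it does not establish that embedded line breaks are what triggers the error here.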

Cause of the error

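This error normally stems from malformed data (i.e. extra delimiters). However, the data was written correctly with pandas and contains no extra delimiters. A possible workaround is to add error_bad_lines=False, as has been suggested elsewhere, but I can't simply ignore those rows: if the data isn't actually malformed, I could lose other data that I need for the analysis.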
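
Rather than dropping rows, one option is to locate the offending rows first and inspect them. The following is a minimal sketch using the standard csv module, assuming every row should have 9 fields as in the column list above; the function name find_bad_rows is hypothetical. (In newer pandas releases error_bad_lines has been replaced by on_bad_lines, but the concern about losing rows is the same.)

```python
import csv


def find_bad_rows(path, expected_fields=9, encoding="utf-8"):
    """Return (physical_line_number, field_count) for rows of unexpected width."""
    bad = []
    # newline="" lets the csv module handle line breaks inside quoted fields
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        for row in reader:
            if len(row) != expected_fields:
                # line_num is the last physical line consumed for this row
                bad.append((reader.line_num, len(row)))
    return bad
```

Calling find_bad_rows on the file that fails (2016-02-08.csv in the orig folder) would list each suspicious row with its physical line number, so the rows can be examined or repaired before pd.read_csv is run again. Blank lines are reported too, since they parse as rows with zero fields.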

How else can I solve this?

0 answers:

No answers yet