Looping through files with pandas

Time: 2015-07-13 02:48:51

Tags: python pandas

I am using code from: Comparing and replacing values inside DataFrames

import pandas as pd

main_df = pd.read_csv('main.txt', sep='|', encoding='utf-8')
data_df = pd.read_csv('data.csv', encoding='utf-8')

main_df_part = main_df[['PRIM_LAT_DEC', 'PRIM_LONG_DEC', 'FEATURE_NAME', 'STATE_ALPHA']]
main_df_part.columns = ['LAT', 'LONG', 'CITY', 'STATE']
main_df_part = main_df_part.set_index(['CITY', 'STATE'])
data_df = data_df.set_index(['CITY', 'STATE'])

data_df.update(main_df_part)

data_df.to_csv('data/new.csv', sep=',', mode='a')

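I need to run this for about 60 files in place of main_df. I tried the following:

Summary

  1. Collated the files, but kept getting pandas.parser.CParserError: Error tokenizing data. C error: out of memory
  2. Used chunksize, but this turns the DataFrame into a pandas.io.parsers.TextFileReader, which makes some of the methods I used invalid
  3. Finally, I tried iterating over each file and substituting the correct file name for main.txt, but I still get Exception: cannot handle a non-unique multi-index! when doing so
  4. Here is the code for the third method: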

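    import os

    files = [f for f in os.listdir('./data') if os.path.isfile(os.path.join('./data', f))]
    
    for w in files:
        # os.listdir returns bare file names, so prepend the directory before reading
        main_df = pd.read_csv(os.path.join('./data', w), sep='|', low_memory=False, encoding='utf-8')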
    

    How do I fix the multi-index error?

    Extended info

    Error from method 1:

    Traceback (most recent call last):
      File "C:/Users/Leb/Desktop/Python/py-script/geo_pandas.py", line 6, in <module>
        main_df = pd.read_csv('data.txt', sep='|', low_memory=False, encoding='utf-8')
      File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 474, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 260, in _read
        return parser.read()
      File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 721, in read
        ret = self._engine.read(nrows)
      File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 1170, in read
        data = self._reader.read(nrows)
      File "pandas\parser.pyx", line 772, in pandas.parser.TextReader.read (pandas\parser.c:7581)
      File "pandas\parser.pyx", line 858, in pandas.parser.TextReader._read_rows (pandas\parser.c:8532)
      File "pandas\parser.pyx", line 1742, in pandas.parser.raise_parser_error (pandas\parser.c:20715)
    pandas.parser.CParserError: Error tokenizing data. C error: out of memory
    

    Error from method 2:

    Traceback (most recent call last):
      File "C:/Users/Leb/Desktop/Python/py-script/geo_pandas.py", line 11, in <module>
        main_df_part = main_df[['PRIM_LAT_DEC', 'PRIM_LONG_DEC','FEATURE_NAME', 'STATE_ALPHA']]
    TypeError: 'TextFileReader' object is not subscriptable
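
For reference, this TypeError arises because `pd.read_csv(..., chunksize=N)` returns an iterator of DataFrames rather than a single DataFrame. Below is a minimal sketch of iterating over it. The in-memory sample rows, the extra `OTHER` column, and the chunk size of 2 are assumptions for illustration only. Passing `usecols` limits parsing to the four needed columns, which may also help with the memory error from method 1.

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the pipe-delimited main files.
sample = io.StringIO(
    "FEATURE_NAME|STATE_ALPHA|PRIM_LAT_DEC|PRIM_LONG_DEC|OTHER\n"
    "Albany|NY|42.65|-73.75|x\n"
    "Syracuse|NY|43.05|-76.15|y\n"
    "Columbia|MO|38.95|-92.33|z\n"
)

cols = ['PRIM_LAT_DEC', 'PRIM_LONG_DEC', 'FEATURE_NAME', 'STATE_ALPHA']

parts = []
# With chunksize set, read_csv yields one DataFrame per chunk.
for chunk in pd.read_csv(sample, sep='|', usecols=cols, chunksize=2):
    # Each chunk is an ordinary DataFrame, so column selection works here.
    parts.append(chunk[cols])

main_df_part = pd.concat(parts, ignore_index=True)
main_df_part.columns = ['LAT', 'LONG', 'CITY', 'STATE']
```

Each chunk could also be processed and discarded inside the loop instead of being collected, which keeps peak memory lower.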
    

    Error from method 3:

    Traceback (most recent call last):
      File "C:/Users/Leb/Desktop/Python/py-script/geo_pandas.py", line 32, in <module>
        data_df.update(main_df_part)
      File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 3416, in update
        other = other.reindex_like(self)
      File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1564, in reindex_like
        return self.reindex(**d)
      File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2511, in reindex
        **kwargs)
      File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1773, in reindex
        method, fill_value, copy).__finalize__(self)
      File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2470, in _reindex_axes
        fill_value, limit)
      File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2477, in _reindex_index
        limit=limit)
      File "C:\Python34\lib\site-packages\pandas\core\index.py", line 4929, in reindex
        "cannot handle a non-unique multi-index!")
    Exception: cannot handle a non-unique multi-index!
    

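1 answer:

Answer 0: (score: 0)

As noted in the comments, one of the files you use to update data.csv probably has duplicates in its index. I ran some sample code below. It is rather verbose, but I hope it illustrates this particular case.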

In [1]: import pandas as pd

In [2]: main = pd.read_csv('Main.csv')
   ...: target1 = pd.read_csv('Target1.csv')
   ...: target2 = pd.read_csv('Target2.csv')

In [3]: main
Out[3]: 
          City State  Lat  Long
0           NY    NY  NaN   NaN
1       Albany    NY  NaN   NaN
2     Syracuse    NY  NaN   NaN
3     Columbia    MO  NaN   NaN
4  Kansas City    MO  NaN   NaN
5  Springfield    MO  NaN   NaN

In [4]: target1
Out[4]: 
   Lat  Long      City State
0  100   200        NY    NY
1  300   400    Albany    NY
2  500   600  Syracuse    NY

In [5]: target2
Out[5]: 
   Lat  Long         City State
0  100   200     Columbia    MO
1  300   400  Kansas City    MO
2  500   600  Springfield    MO
3  700   800  Springfield    MO

In [6]: m = main.set_index(['City','State'])
   ...: t1 = target1.set_index(['City','State'])
   ...: t2 = target2.set_index(['City','State'])

In [7]: m
Out[7]: 
                   Lat  Long
City        State           
NY          NY     NaN   NaN
Albany      NY     NaN   NaN
Syracuse    NY     NaN   NaN
Columbia    MO     NaN   NaN
Kansas City MO     NaN   NaN
Springfield MO     NaN   NaN

In [8]: t1
Out[8]: 
                Lat  Long
City     State           
NY       NY     100   200
Albany   NY     300   400
Syracuse NY     500   600

In [9]: t2
Out[9]: 
                   Lat  Long
City        State           
Columbia    MO     100   200
Kansas City MO     300   400
Springfield MO     500   600
            MO     700   800

Pay particular attention to the last row of Out[9] above. Note that Springfield now has two rows of values under the same (City, State) key.
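
One way to spot this before calling `update` is to check the index for repeated keys. This is a minimal sketch that rebuilds the `target2` table by hand. The literal frame is a stand-in for the CSV file.

```python
import pandas as pd

t2 = pd.DataFrame({
    'Lat': [100, 300, 500, 700],
    'Long': [200, 400, 600, 800],
    'City': ['Columbia', 'Kansas City', 'Springfield', 'Springfield'],
    'State': ['MO', 'MO', 'MO', 'MO'],
}).set_index(['City', 'State'])

# A False value here means at least one (City, State) pair is repeated.
print(t2.index.is_unique)

# keep=False flags every occurrence of a repeated key, not just the later ones.
dupes = t2[t2.index.duplicated(keep=False)]
print(dupes)
```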

In [12]: m.update(t1)

In [13]: m
Out[13]: 
                   Lat  Long
City        State           
NY          NY     100   200
Albany      NY     300   400
Syracuse    NY     500   600
Columbia    MO     NaN   NaN
Kansas City MO     NaN   NaN
Springfield MO     NaN   NaN

In [14]: m.update(t2)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-14-f5f30165a245> in <module>()
----> 1 m.update(t2)

C:\Anaconda\Lib\site-packages\pandas\core\frame.pyc in update(self, other, join, overwrite, filter_func, raise_conflict)
   3414             other = DataFrame(other)
   3415 
-> 3416         other = other.reindex_like(self)
   3417 
   3418         for col in self.columns:

C:\Anaconda\Lib\site-packages\pandas\core\generic.pyc in reindex_like(self, other, method, copy, limit)
   1562                 method=method, copy=copy, limit=limit)
   1563 
-> 1564         return self.reindex(**d)
   1565 
   1566     def drop(self, labels, axis=0, level=None, inplace=False, errors='raise'):

C:\Anaconda\Lib\site-packages\pandas\core\frame.pyc in reindex(self, index, columns, **kwargs)
   2509     def reindex(self, index=None, columns=None, **kwargs):
   2510         return super(DataFrame, self).reindex(index=index, columns=columns,
-> 2511                                               **kwargs)
   2512 
   2513     @Appender(_shared_docs['reindex_axis'] % _shared_doc_kwargs)

C:\Anaconda\Lib\site-packages\pandas\core\generic.pyc in reindex(self, *args, **kwargs)
   1771         # perform the reindex on the axes
   1772         return self._reindex_axes(axes, level, limit,
-> 1773                                   method, fill_value, copy).__finalize__(self)
   1774 
   1775     def _reindex_axes(self, axes, level, limit, method, fill_value, copy):

C:\Anaconda\Lib\site-packages\pandas\core\frame.pyc in _reindex_axes(self, axes, level, limit, method, fill_value, copy)
   2468         if index is not None:
   2469             frame = frame._reindex_index(index, method, copy, level,
-> 2470                                          fill_value, limit)
   2471 
   2472         return frame

C:\Anaconda\Lib\site-packages\pandas\core\frame.pyc in _reindex_index(self, new_index, method, copy, level, fill_value, limit)
   2475                        limit=None):
   2476         new_index, indexer = self.index.reindex(new_index, method, level,
-> 2477                                                 limit=limit)
   2478         return self._reindex_with_indexers({0: [new_index, indexer]},
   2479                                            copy=copy, fill_value=fill_value,

C:\Anaconda\Lib\site-packages\pandas\core\index.pyc in reindex(self, target, method, level, limit)
   4927                 else:
   4928                     raise Exception(
-> 4929                         "cannot handle a non-unique multi-index!")
   4930 
   4931         if not isinstance(target, MultiIndex):

Exception: cannot handle a non-unique multi-index!

This raises the same error as yours.
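
If keeping a single row per (City, State) key is acceptable, one possible fix is to drop the repeated index entries before updating. This is a minimal sketch with the same sample data. Keeping the first occurrence is an assumption, and whether that is the right row depends on your data.

```python
import numpy as np
import pandas as pd

m = pd.DataFrame({
    'City': ['Columbia', 'Kansas City', 'Springfield'],
    'State': ['MO', 'MO', 'MO'],
    'Lat': [np.nan] * 3,
    'Long': [np.nan] * 3,
}).set_index(['City', 'State'])

t2 = pd.DataFrame({
    'Lat': [100, 300, 500, 700],
    'Long': [200, 400, 600, 800],
    'City': ['Columbia', 'Kansas City', 'Springfield', 'Springfield'],
    'State': ['MO', 'MO', 'MO', 'MO'],
}).set_index(['City', 'State'])

# Keep only the first row for each repeated (City, State) key.
t2_unique = t2[~t2.index.duplicated(keep='first')]

# update aligns on the index and modifies m in place.
m.update(t2_unique)
print(m)
```

In the loop over your files, the same filter would be applied to `main_df_part` after `set_index` and before `data_df.update(main_df_part)`.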