The code I am using comes from: Comparing and replacing values inside DataFrames
main_df = pd.read_csv('main.txt', sep='|', encoding='utf-8')
data_df = pd.read_csv('data.csv', encoding='utf-8')
main_df_part = main_df[['PRIM_LAT_DEC', 'PRIM_LONG_DEC', 'FEATURE_NAME', 'STATE_ALPHA']]
main_df_part.columns = ['LAT', 'LONG', 'CITY', 'STATE']
main_df_part = main_df_part.set_index(['CITY', 'STATE'])
data_df = data_df.set_index(['CITY', 'STATE'])
data_df.update(main_df_part)
data_df.to_csv('data/new.csv', sep=',', mode='a')
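I have about 60 files that I need to run against main_df, and I tried the following:
Summary
Method 1 fails with pandas.parser.CParserError: Error tokenizing data. C error: out of memory.
Method 2: some of the approaches I tried produce a pandas.io.parsers.TextFileReader, which did not work for main.txt.
Method 3 still fails with Exception: cannot handle a non-unique multi-index!
Here is the code for the third method: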
files = [f for f in os.listdir('./data') if os.path.isfile(os.path.join('./data', f))]
for w in files:
    main_df = pd.read_csv(w, sep='|', low_memory=False, encoding='utf-8')
How can I fix the multi-index error?
Extended information
Error from Method 1:
Traceback (most recent call last):
File "C:/Users/Leb/Desktop/Python/py-script/geo_pandas.py", line 6, in <module>
main_df = pd.read_csv('data.txt', sep='|', low_memory=False, encoding='utf-8')
File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 474, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 260, in _read
return parser.read()
File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 721, in read
ret = self._engine.read(nrows)
File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 1170, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 772, in pandas.parser.TextReader.read (pandas\parser.c:7581)
File "pandas\parser.pyx", line 858, in pandas.parser.TextReader._read_rows (pandas\parser.c:8532)
File "pandas\parser.pyx", line 1742, in pandas.parser.raise_parser_error (pandas\parser.c:20715)
pandas.parser.CParserError: Error tokenizing data. C error: out of memory
Error from Method 2:
Traceback (most recent call last):
File "C:/Users/Leb/Desktop/Python/py-script/geo_pandas.py", line 11, in <module>
main_df_part = main_df[['PRIM_LAT_DEC', 'PRIM_LONG_DEC','FEATURE_NAME', 'STATE_ALPHA']]
TypeError: 'TextFileReader' object is not subscriptable
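A note on this error: read_csv returns a TextFileReader instead of a DataFrame when it is called with chunksize or iterator=True, so the reader has to be iterated before any columns can be selected. The sketch below assumes that is what Method 2 did. The in-memory sample data, the chunk size of 2, and the rename mapping are made up for illustration.

```python
import io
import pandas as pd

# Made-up stand-in for main.txt (pipe-separated, column names from the question).
sample = io.StringIO(
    "FEATURE_NAME|STATE_ALPHA|PRIM_LAT_DEC|PRIM_LONG_DEC\n"
    "Albany|NY|42.65|-73.75\n"
    "Syracuse|NY|43.05|-76.15\n"
    "Columbia|MO|38.95|-92.33\n"
)

cols = ['PRIM_LAT_DEC', 'PRIM_LONG_DEC', 'FEATURE_NAME', 'STATE_ALPHA']
rename = {
    'PRIM_LAT_DEC': 'LAT',
    'PRIM_LONG_DEC': 'LONG',
    'FEATURE_NAME': 'CITY',
    'STATE_ALPHA': 'STATE',
}

# With chunksize set, read_csv returns an iterator of DataFrames,
# each holding at most `chunksize` rows.
reader = pd.read_csv(sample, sep='|', usecols=cols, chunksize=2)

parts = []
for chunk in reader:
    # Each chunk is an ordinary DataFrame, so column selection works here.
    part = chunk[cols].rename(columns=rename)
    parts.append(part.set_index(['CITY', 'STATE']))

# Only the four needed columns are kept when the pieces are recombined.
main_df_part = pd.concat(parts)
print(main_df_part)
```

Whether this is enough to avoid the out-of-memory error depends on the size of the real file. Calling data_df.update(part) inside the loop instead of concatenating would keep even less in memory at once.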
Error from Method 3:
Traceback (most recent call last):
File "C:/Users/Leb/Desktop/Python/py-script/geo_pandas.py", line 32, in <module>
data_df.update(main_df_part)
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 3416, in update
other = other.reindex_like(self)
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1564, in reindex_like
return self.reindex(**d)
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2511, in reindex
**kwargs)
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1773, in reindex
method, fill_value, copy).__finalize__(self)
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2470, in _reindex_axes
fill_value, limit)
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2477, in _reindex_index
limit=limit)
File "C:\Python34\lib\site-packages\pandas\core\index.py", line 4929, in reindex
"cannot handle a non-unique multi-index!")
Exception: cannot handle a non-unique multi-index!
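Answer 0 (score: 0)
As noted in the comments, one of the files you use to update data.csv probably has duplicates in its index. I ran a sample session below. It is fairly long, but I hope it demonstrates this particular situation.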
In [1]: import pandas as pd
In [2]: main = pd.read_csv('Main.csv')
...: target1 = pd.read_csv('Target1.csv')
...: target2 = pd.read_csv('Target2.csv')
In [3]: main
Out[3]:
          City State  Lat  Long
0           NY    NY  NaN   NaN
1       Albany    NY  NaN   NaN
2     Syracuse    NY  NaN   NaN
3     Columbia    MO  NaN   NaN
4  Kansas City    MO  NaN   NaN
5  Springfield    MO  NaN   NaN
In [4]: target1
Out[4]:
   Lat  Long      City State
0  100   200        NY    NY
1  300   400    Albany    NY
2  500   600  Syracuse    NY
In [5]: target2
Out[5]:
   Lat  Long         City State
0  100   200     Columbia    MO
1  300   400  Kansas City    MO
2  500   600  Springfield    MO
3  700   800  Springfield    MO
In [6]: m = main.set_index(['City','State'])
...: t1 = target1.set_index(['City','State'])
...: t2 = target2.set_index(['City','State'])
In [7]: m
Out[7]:
                   Lat  Long
City        State
NY          NY     NaN   NaN
Albany      NY     NaN   NaN
Syracuse    NY     NaN   NaN
Columbia    MO     NaN   NaN
Kansas City MO     NaN   NaN
Springfield MO     NaN   NaN
In [8]: t1
Out[8]:
                Lat  Long
City     State
NY       NY     100   200
Albany   NY     300   400
Syracuse NY     500   600
In [9]: t2
Out[9]:
                   Lat  Long
City        State
Columbia    MO     100   200
Kansas City MO     300   400
Springfield MO     500   600
            MO     700   800
Pay particular attention to the last row of Out[9] above. Notice how Springfield now has two rows of values assigned to it.
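If the duplicates are not as easy to spot as in Out[9], the index can be checked directly. This is a minimal sketch that rebuilds target2 with the same values as the session above. The name dupes is only illustrative.

```python
import pandas as pd

# Same values as target2 in the session above.
target2 = pd.DataFrame({
    'Lat': [100, 300, 500, 700],
    'Long': [200, 400, 600, 800],
    'City': ['Columbia', 'Kansas City', 'Springfield', 'Springfield'],
    'State': ['MO', 'MO', 'MO', 'MO'],
})
t2 = target2.set_index(['City', 'State'])

# is_unique is False as soon as any (City, State) pair repeats.
print(t2.index.is_unique)

# keep=False marks every member of a duplicated group,
# so both Springfield rows are returned for inspection.
dupes = t2[t2.index.duplicated(keep=False)]
print(dupes)
```

Running the same check on each of your input files should show which one carries the repeated (City, State) pairs.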
In [12]: m.update(t1)
In [13]: m
Out[13]:
                   Lat  Long
City        State
NY          NY     100   200
Albany      NY     300   400
Syracuse    NY     500   600
Columbia    MO     NaN   NaN
Kansas City MO     NaN   NaN
Springfield MO     NaN   NaN
In [14]: m.update(t2)
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-14-f5f30165a245> in <module>()
----> 1 m.update(t2)
C:\Anaconda\Lib\site-packages\pandas\core\frame.pyc in update(self, other, join, overwrite, filter_func, raise_conflict)
3414 other = DataFrame(other)
3415
-> 3416 other = other.reindex_like(self)
3417
3418 for col in self.columns:
C:\Anaconda\Lib\site-packages\pandas\core\generic.pyc in reindex_like(self, other, method, copy, limit)
1562 method=method, copy=copy, limit=limit)
1563
-> 1564 return self.reindex(**d)
1565
1566 def drop(self, labels, axis=0, level=None, inplace=False, errors='raise'):
C:\Anaconda\Lib\site-packages\pandas\core\frame.pyc in reindex(self, index, columns, **kwargs)
2509 def reindex(self, index=None, columns=None, **kwargs):
2510 return super(DataFrame, self).reindex(index=index, columns=columns,
-> 2511 **kwargs)
2512
2513 @Appender(_shared_docs['reindex_axis'] % _shared_doc_kwargs)
C:\Anaconda\Lib\site-packages\pandas\core\generic.pyc in reindex(self, *args, **kwargs)
1771 # perform the reindex on the axes
1772 return self._reindex_axes(axes, level, limit,
-> 1773 method, fill_value, copy).__finalize__(self)
1774
1775 def _reindex_axes(self, axes, level, limit, method, fill_value, copy):
C:\Anaconda\Lib\site-packages\pandas\core\frame.pyc in _reindex_axes(self, axes, level, limit, method, fill_value, copy)
2468 if index is not None:
2469 frame = frame._reindex_index(index, method, copy, level,
-> 2470 fill_value, limit)
2471
2472 return frame
C:\Anaconda\Lib\site-packages\pandas\core\frame.pyc in _reindex_index(self, new_index, method, copy, level, fill_value, limit)
2475 limit=None):
2476 new_index, indexer = self.index.reindex(new_index, method, level,
-> 2477 limit=limit)
2478 return self._reindex_with_indexers({0: [new_index, indexer]},
2479 copy=copy, fill_value=fill_value,
C:\Anaconda\Lib\site-packages\pandas\core\index.pyc in reindex(self, target, method, level, limit)
4927 else:
4928 raise Exception(
-> 4929 "cannot handle a non-unique multi-index!")
4930
4931 if not isinstance(target, MultiIndex):
Exception: cannot handle a non-unique multi-index!
This raises the same error you are seeing.
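One possible way around it is to make the index of the updating frame unique before calling update. The sketch below assumes the first row for each (City, State) pair is the one you want to keep. That is only an assumption, and your data may call for a different rule, such as keeping the last row or averaging. The frames reuse the MO rows from the session above.

```python
import numpy as np
import pandas as pd

# The MO rows of main from the session above, still without coordinates.
main = pd.DataFrame({
    'City': ['Columbia', 'Kansas City', 'Springfield'],
    'State': ['MO', 'MO', 'MO'],
    'Lat': [np.nan] * 3,
    'Long': [np.nan] * 3,
})
target2 = pd.DataFrame({
    'Lat': [100, 300, 500, 700],
    'Long': [200, 400, 600, 800],
    'City': ['Columbia', 'Kansas City', 'Springfield', 'Springfield'],
    'State': ['MO', 'MO', 'MO', 'MO'],
})

m = main.set_index(['City', 'State'])
t2 = target2.set_index(['City', 'State'])

# Assumption: keep the first row of every duplicated (City, State) pair.
t2_unique = t2[~t2.index.duplicated(keep='first')]

# With a unique index on both sides, update can align the rows.
m.update(t2_unique)
print(m)
```

Dropping rows silently discards data, so it is worth looking at the duplicated rows first and deciding which one is actually correct.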