pandas: merging on a column of collections.Counter (or even plain dict) objects?

Asked: 2015-07-01 16:24:03

Tags: python pandas merge

I need to merge two pandas DataFrames on a column that holds collections.Counter objects (https://docs.python.org/2/library/collections.html#collections.Counter). The merge raises a strange error. See the runnable code example below.

import pandas as pd
from collections import Counter
a = pd.DataFrame([(120000.0, 120000.0, 0.0, 120000.0),
 (120000.0, 280000.0, 120000.0, 120000.0),
 (280000.0, 280000.0, 120000.0, 280000.0),
 (280000.0, 420000.0, 280000.0, 280000.0),
 (420000.0, 420000.0, 280000.0, 420000.0),
 (420000.0, 500000.0, 420000.0, 420000.0),
 (500000.0, 580000.0, 420000.0, 500000.0),
 (580000.0, 820000.0, 500000.0, 580000.0),
 (820000.0, 860000.0, 580000.0, 820000.0),
 (860000.0, 1160000.0, 820000.0, 860000.0),
 (1160000.0, 1160000.0, 860000.0, 1160000.0)])
b = pd.DataFrame([(120000.0, 120000.0, 0.0, 120000.0),
 (120000.0, 280000.0, 120000.0, 120000.0),
 (280000.0, 280000.0, 120000.0, 280000.0),
 (280000.0, 440000.0, 280000.0, 280000.0),
 (440000.0, 440000.0, 280000.0, 440000.0),
 (440000.0, 520000.0, 440000.0, 440000.0),
 (520000.0, 580000.0, 440000.0, 520000.0),
 (580000.0, 820000.0, 520000.0, 580000.0),
 (820000.0, 860000.0, 580000.0, 820000.0),
 (860000.0, 1120000.0, 820000.0, 860000.0),
 (1120000.0, 1160000.0, 860000.0, 1120000.0)])
a['ID'] = [Counter(i) for i in list(a.values)]
b['ID'] = [Counter(i) for i in list(b.values)]
pd.merge(a, b, on='ID')

This returns:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 601, in runfile
    execfile(filename, namespace)
  File "/usr/local/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 73, in execfile
    builtins.execfile(filename, *where)
  File "/home/ilya/tmp/tmp_merge.py", line 33, in <module>
    pd.merge(a, b, on='ID')
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 38, in merge
    return op.get_result()
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 186, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 273, in _get_join_info
    sort=self.sort, how=self.how)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 461, in _get_join_indexers
    llab, rlab, shape = map(list, zip( * map(fkeys, left_keys, right_keys)))
TypeError: type object argument after * must be a sequence, not itertools.imap

I tried converting the Counter objects to plain dicts, i.e.

b['ID'] = [dict(Counter(i)) for i in list(b.values)]

but that did not help. Is this expected behavior? If so, how can I work around the error? Or is there another way to reach the same end result?

I am using Python 2.7 and pandas 0.16.1 (usually in an IPython notebook, but this was also tested in plain Python).

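Edit: to clarify what all this is for. I need to merge based on the values of two pairs of columns. In the real data they are Start1, End1, Start2, End2, with End2 > Start2 and End1 > Start1. The example is a subset of my real values. The catch is that, across the two datasets, it may be that (Start1_1, End1_1) == (Start2_2, End2_2) and (Start1_2, End1_2) == (Start2_1, End2_1); I want to merge such rows (the second number denotes the dataset). I figured that using Counters like this would be the simplest solution, and I am fairly sure it produces no false positives.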

1 answer:

Answer 0 (score: 4)

One way around this is to create, in each DataFrame, a column holding a version of the original data structure converted to a hashable type.

For example, add such a column to both DataFrames, then merge on that column.

Afterwards, just drop the column.
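The approach described in this answer could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper `hashable_key`, the column name `ID_key`, the column labels, and the two small DataFrames are all made up for the example, and a sorted tuple of `(value, count)` pairs is just one possible hashable stand-in for a Counter.

```python
from collections import Counter

import pandas as pd


def hashable_key(row):
    """Count the values in a row, then freeze the counts into a sorted tuple.

    A Counter is a dict subclass and therefore unhashable; a tuple of
    (value, count) pairs carries the same information and can be hashed.
    (Hypothetical helper, not part of pandas.)
    """
    return tuple(sorted(Counter(row).items()))


# Illustrative data only; the column names are assumptions.
cols = ["Start1", "End1", "Start2", "End2"]
a = pd.DataFrame(
    [(120000.0, 280000.0, 120000.0, 120000.0),
     (280000.0, 420000.0, 280000.0, 280000.0)],
    columns=cols,
)
b = pd.DataFrame(
    [(120000.0, 120000.0, 280000.0, 120000.0),  # a's first row, reordered
     (440000.0, 520000.0, 440000.0, 440000.0)],
    columns=cols,
)

# Add the hashable key column to each DataFrame.
for df in (a, b):
    keys = [hashable_key(row) for row in df[cols].to_numpy().tolist()]
    df["ID_key"] = pd.Series(keys, index=df.index, dtype=object)

# Merge on the hashable column, then drop it again.
merged = pd.merge(a, b, on="ID_key", suffixes=("_a", "_b")).drop(columns="ID_key")
print(merged)
```

Sorting the `(value, count)` pairs makes the key independent of the order in which the values appear in a row, which is the property the Counter was meant to provide in the question.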