Question

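I have two DataFrames, D1 and D2. For every pair of columns, one from D1 and one from D2, whose dtype is neither int nor float, I want to compute a distance measure using the formula

 |A intersect B| / |A union B|

I first defined a function: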

import numpy as np
import pandas as pd

def jaccard_d(series1, series2):
    if (series1.dtype is not (np.dtype(int) or np.dtype(float))) and \
       (series2.dtype is not (np.dtype(int) or np.dtype(float))):
        series1 = series1.drop_duplicates()
        series2 = series2.drop_duplicates()
        return len(set(series1).intersection(set(series2))) \
            / len(set(series1).union(set(series2)))
    else:
        return np.nan

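What I do next is loop over all the columns of D1 and, for each fixed column of D1, use jaccard_d with apply on D2. I am trying to avoid writing a two-level loop. Is there a better way that needs no loops at all?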

DC = dict.fromkeys(list(D1.columns))
INN = list(D2.columns)
for col in D1:
    DC[col] = dict(zip(INN, D2.apply(jaccard_d,D1[col])))

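First, I am not sure whether I am using apply correctly: my jaccard_d function takes two Series as input, but in each iteration I have D1[col] as one Series, and I want apply to pair D1[col] with every column of D2.

Second, I get the error "'Series' objects are mutable, thus they cannot be hashed", which I don't really understand. Any comments are appreciated.

I tried writing a two-level loop that calls my jaccard_d function. It works, but I would like to write more efficient code.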

Answer 1

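So, after wandering around, finding exactly where the error occurs, and looking at the apply documentation, I have deduced that you need to call apply like this: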

 D2.apply(jaccard_d, args=(D1[col],))

Instead, your call is equivalent to

 D2.apply(jaccard_d, axis=D1[col])

==================

I can reproduce your error message with a simple DataFrame:

In [589]: df=pd.DataFrame(np.arange(12).reshape(6,2))
In [590]: df
Out[590]: 
    0   1
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
5  10  11

Putting a Series into a set works, just as putting a list into a set does:

In [591]: set(df[0]).union(set(df[1]))
Out[591]: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}

But if I try to put a list containing a Series into a set, I get your error:

In [592]: set([df[0]])
....
TypeError: 'Series' objects are mutable, thus they cannot be hashed
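The difference between the two set calls comes down to hashability. The following is a minimal sketch with made-up data (the names s, values, and lookup are only for illustration): iterating over a Series yields its values, which are hashable, while the Series object itself is not.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# set(s) iterates over the Series and stores its individual values,
# which are hashable, so this works.
values = set(s)

# The Series object itself is not hashable, so it cannot be used as a
# set element or as a dict key.
try:
    hash(s)
    series_is_hashable = True
except TypeError:
    series_is_hashable = False

# If a Series really has to serve as a key, an immutable copy of its
# values (here a tuple) is one possible stand-in.
lookup = {tuple(s.tolist()): "ok"}
```

Any code path that uses a Series as a dictionary key or set member will therefore raise a TypeError, whatever the exact wording of the message in a given pandas version.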

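If the problem isn't in the set expression, then it must occur in the dict() expression.

You didn't say where the error occurs, nor did you provide an MCVE.

(But this turned out to be a dead end.)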

========================

OK, simulating your code:

In [606]: DC=dict.fromkeys(list(df.columns))
In [607]: DC
Out[607]: {0: None, 1: None}
In [608]: INN=list(df.columns)
In [609]: INN
Out[609]: [0, 1]
In [610]: for col in df:
     ...:     dict(zip(INN, df.apply(jaccard_d, df[col])))
    ....
----> 2     dict(zip(INN, df.apply(jaccard_d, df[col])))


/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   ...
-> 4125         axis = self._get_axis_number(axis)

/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py in _get_axis_number(self, axis)
    326 
    327     def _get_axis_number(self, axis):
--> 328         axis = self._AXIS_ALIASES.get(axis, axis)
    ....        

TypeError: 'Series' objects are mutable, thus they cannot be hashed

The problem is in

df.apply(jaccard_d, df[0])

The problem has nothing to do with jaccard_d. The same error occurs if I replace it with a simple function:

def foo(series1, series2):
    print(series1)
    print(series2)
    return 1

======================

But look at the documentation for apply:

df.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

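The second argument, if not given as a keyword, is the axis number. So we have been trying to use a Series as the axis number! No wonder it objects. That should have been obvious had I read the error traceback more carefully.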
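One way to check this without triggering the error is to ask Python how the positional arguments would be bound. This is only a sketch: count_shared is a hypothetical helper, and the exact parameter list of DataFrame.apply differs between pandas versions, although axis directly follows func in the versions I know of.

```python
import inspect

import pandas as pd

def count_shared(col, other):
    # Hypothetical helper: number of distinct values two Series share.
    return len(set(col) & set(other))

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
other = df["b"]

# Bind the arguments exactly as df.apply(count_shared, other) would,
# without actually calling apply.
sig = inspect.signature(pd.DataFrame.apply)
bound = sig.bind(df, count_shared, other)

# Find the parameter name that received the Series.
bound_name = [name for name, value in bound.arguments.items()
              if value is other][0]
print(bound_name)
```

The printed name is the parameter that silently receives the Series when it is passed positionally.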

Keep the default axis=0 and pass the other Series through args:

In [632]: df.apply(jaccard_d,args=(df[1],))
Out[632]: 
0    0.0
1    1.0
dtype: float64

Or in your loop:

In [643]: for col in df:
     ...:     DC[col] = dict(zip(INN, df.apply(jaccard_d,args=(df[col],))))  
In [644]: DC
Out[644]: {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}
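Putting the pieces together for two different frames, the loop can be written as a dict comprehension that collects one apply result per column of D1. The following is a sketch under assumptions, not the asker's exact code: the sample data and the name jaccard_similarity are invented, and the dtype test is swapped for pandas.api.types.is_numeric_dtype, because an expression like `dtype(int) or dtype(float)` evaluates to a single dtype, so only one of the two types is ever tested.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def jaccard_similarity(col_a, col_b):
    """Return |A intersect B| / |A union B|, or NaN if either column is numeric."""
    if is_numeric_dtype(col_a) or is_numeric_dtype(col_b):
        return np.nan
    a, b = set(col_a), set(col_b)
    union = a | b
    return len(a & b) / len(union) if union else np.nan

# Invented sample data: one text column and one numeric column per frame.
D1 = pd.DataFrame({"city": ["Oslo", "Rome", "Oslo"], "n": [1, 2, 3]})
D2 = pd.DataFrame({"town": ["Rome", "Lima", "Oslo"], "x": [0.5, 1.5, 2.5]})

# For each column of D1, apply the function to every column of D2,
# handing the D1 column over through args=.
result = pd.DataFrame(
    {col: D2.apply(jaccard_similarity, args=(D1[col],)) for col in D1}
)
print(result)
```

The resulting frame has the columns of D2 as its index and the columns of D1 as its columns. This still loops over D1 once; apply itself iterates over the columns of D2 internally, so the gain is mostly in readability rather than speed.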
