Question

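I am trying to do the following:

Select a random row from my pandas DataFrame
Subset the DataFrame to only the rows that match that randomly selected row on 2 specific columns, 'Type' and 'LocationID'.

Here is the relevant code snippet: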

import pandas as pd

train = pd.DataFrame(
    {'Type': ['Rad', 'Rad', 'Rad', 'Rad', 'Rad'], 
     'LocationID': ['6', '6', '6', '6', '6'], 
     'UserID': [0, 1, 2, 3, 4]})
u1 = train.sample(n=1)
group_feat = ['Type', 'LocationID']
for gf in group_feat:
    match = train[gf].apply(lambda x: x == u1[gf])
    train = train.loc[match]

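My code throws an error on the last line, at the .loc call:

ValueError: Cannot index with multidimensional key

Further investigation shows that the variable match is not a Series but a DataFrame with 1 column. I cannot work out why apply does not simply return a Series in this case. How can I get around this? I cannot use the usual tolist(), because that method is not available on DataFrames. What general intuition about pandas am I missing that led me into this error? I have used apply successfully many times before, and it returned the expected type.

Edit: train.info() (irrelevant columns removed for brevity/privacy):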

<class 'pandas.core.frame.DataFrame'>
Int64Index: 92529 entries, 0 to 92528
Data columns (total 93 columns):
Type                                              92529 non-null object
LocationID                                        92529 non-null object
UserID                                            92529 non-null int64
dtypes: float64(6), int64(55), object(32)
memory usage: 66.4+ MB
None

Answer 1

pandas.Series.apply(func) will return a Series if func returns a scalar, or will return a DataFrame if func returns a Series.

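u1[gf] is a Series (u1 is a one-row DataFrame, so selecting a column from it gives a one-element Series, not a scalar). Therefore lambda x: x == u1[gf] returns a boolean Series, and match ends up being a DataFrame.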
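A minimal sketch of this behavior, assuming a pandas version in which Series.apply expands Series results into a DataFrame as described above. The small frame and the names picked, scalar_result and series_result are made up for illustration; random_state is set only so the run is repeatable.

```python
import pandas as pd

train = pd.DataFrame(
    {'Type': ['Rad', 'Rad', 'Rad'],
     'LocationID': ['6', '6', '6'],
     'UserID': [0, 1, 2]})
u1 = train.sample(n=1, random_state=0)  # one-row DataFrame

# Selecting a column from a one-row DataFrame gives a one-element Series.
picked = u1['Type']

# func returns a scalar bool -> apply returns a Series.
scalar_result = train['Type'].apply(lambda x: x == 'Rad')

# func returns a Series -> apply stacks the results into a DataFrame.
series_result = train['Type'].apply(lambda x: x == picked)

print(type(picked).__name__)
print(type(scalar_result).__name__)
print(type(series_result).__name__, series_result.shape)
```

The only difference between the two apply calls is what the lambda returns for each element, and that alone decides the type of the result.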

When using df.loc[key], key can be a slice, a boolean Series, or a list-like indexer, but it cannot be a DataFrame. If key is a DataFrame, ValueError('Cannot index with multidimensional key') is raised.
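The following sketch reproduces the error on a small made-up frame. The last part is my own assumption rather than something from the question: it shows that pulling the single column out of match with .iloc[:, 0] yields a boolean Series that .loc accepts. Here u1 is taken with iloc as a deterministic stand-in for train.sample(n=1).

```python
import pandas as pd

train = pd.DataFrame(
    {'Type': ['Rad', 'Rad', 'Rad'],
     'LocationID': ['6', '6', '6'],
     'UserID': [0, 1, 2]})
u1 = train.iloc[[0]]  # one-row DataFrame, standing in for train.sample(n=1)

# Same construction as in the question: match is a one-column DataFrame.
match = train['Type'].apply(lambda x: x == u1['Type'])

# A DataFrame key is two-dimensional, so .loc rejects it.
try:
    train.loc[match]
    error = None
except ValueError as exc:
    error = exc
print(repr(error))

# Assumed workaround: take the single column to get a boolean Series.
match_series = match.iloc[:, 0]
subset = train.loc[match_series]
print(len(subset))
```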

To fix this, you can use

match = train[gf].apply(lambda x: x == u1[gf].item())

Since u1[gf].item() is a scalar, lambda x: x == u1[gf].item() returns a boolean scalar, so match ends up being a Series.

Or, for better performance, it is better to write

for gf in group_feat:
    train = train.loc[train[gf] == u1[gf].item()]

which avoids using apply with a lambda function entirely.
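As a quick check, here is a sketch comparing the two approaches on a small made-up frame. The names target, via_apply and vectorized are illustrative only.

```python
import pandas as pd

train = pd.DataFrame(
    {'Type': ['Rad', 'Lab', 'Rad'],
     'LocationID': ['6', '6', '7'],
     'UserID': [0, 1, 2]})
u1 = train.iloc[[0]]  # one-row DataFrame, standing in for train.sample(n=1)

# .item() unwraps the one-element Series into a plain Python value.
target = u1['Type'].item()

# Element-wise comparison through apply (slower, Python-level loop).
via_apply = train['Type'].apply(lambda x: x == target)

# Vectorized comparison (no apply, no lambda).
vectorized = train['Type'] == target

print(type(target).__name__)
print(via_apply.tolist())
print(vectorized.tolist())
```

Both masks select the same rows; the vectorized form just skips the per-element function call.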

To save memory (and improve performance), avoid forming intermediate DataFrames by replacing

group_feat = ['Type', 'LocationID']
for gf in group_feat:
    match = train[gf].apply(lambda x: x == u1[gf])
    train = train.loc[match]

with

mask = ((train['Type'] == u1['Type'].item())
        & (train['LocationID'] == u1['LocationID'].item()))
train = train.loc[mask]

or, more generally,

group_feat = ['Type', 'LocationID']
mask = np.logical_and.reduce([train[col] == u1[col].item() for col in group_feat])
train = train.loc[mask]

The latter is especially useful when group_feat is long.

For example,

import numpy as np
import pandas as pd

train = pd.DataFrame(
    {'Type': ['Rad', 'Rad', 'Rad', 'Rad', 'Rad'], 
     'LocationID': ['6', '6', '6', '6', '6'], 
     'UserID': [0, 1, 2, 3, 4]})
u1 = train.sample(n=1)
group_feat = ['Type', 'LocationID']
mask = np.logical_and.reduce([train[col] == u1[col].item() for col in group_feat])
train = train.loc[mask]
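In the example above every row has the same Type and LocationID, so the mask keeps all of them. A variant with mixed values (data made up for illustration, and with u1 pinned to the first row instead of sampled so the result is predictable) shows the filter actually dropping rows:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame(
    {'Type': ['Rad', 'Rad', 'Lab', 'Rad', 'Lab'],
     'LocationID': ['6', '7', '6', '6', '7'],
     'UserID': [0, 1, 2, 3, 4]})
u1 = train.iloc[[0]]  # one-row DataFrame, standing in for train.sample(n=1)
group_feat = ['Type', 'LocationID']

# One boolean mask per column, each comparing against a scalar.
masks = [train[col] == u1[col].item() for col in group_feat]

# AND all the masks together into a single boolean array.
mask = np.logical_and.reduce(masks)

subset = train.loc[mask]
print(subset['UserID'].tolist())  # [0, 3]
```

Only rows 0 and 3 have both Type == 'Rad' and LocationID == '6', so those are the rows that remain.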
