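My search skills must be failing me, because this has to be a common question. I have a dataframe with nested lists, and I am trying to drop the rows that do not have the longest list: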
import pandas as pd

df = pd.DataFrame(data = [["a", "b", "c", ["d", "e"]],
                          ["a", "b", "c", ["e"]],
                          ["l", "m", "n", ["o"]]],
                  columns = ["c1", "c2", "c3", "c4"])
# max doesn't evaluate length ~ this is wrong
df.groupby(by=["c1", "c2", "c3"])["c4"].apply(max)
c1  c2  c3
a   b   c     [e]
l   m   n     [o]
Name: c4, dtype: object
# but length does ~ but using an int to equate to another row isn't guaranteed
df.groupby(by=["c1", "c2", "c3"])["c4"].apply(len)
c1  c2  c3
a   b   c     2
l   m   n     1
Name: c4, dtype: int64
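A minimal sketch of what those two calls compute, assuming the same example data. Python's built-in max compares lists element by element rather than by length, and apply(len) on a grouped column receives each group's Series, so it reports the number of rows in the group rather than the length of any list. The names row_counts and list_lengths are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["a", "b", "c", ["d", "e"]],
          ["a", "b", "c", ["e"]],
          ["l", "m", "n", ["o"]]],
    columns=["c1", "c2", "c3", "c4"],
)
grouped = df.groupby(["c1", "c2", "c3"])["c4"]

# Lists are compared lexicographically: ["e"] beats ["d", "e"] because "e" > "d".
lexicographic_max = max([["d", "e"], ["e"]])

# apply(len) is called once per group with that group's Series,
# so the result is the row count of each group.
row_counts = grouped.apply(len)

# The length of each individual list has to be computed row by row.
list_lengths = df["c4"].map(len)
```

In this data the (a, b, c) group has two rows and its longest list also has two items, so the group row count only coincidentally matches the longest list length.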
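The rows have to be grouped first, because the three columns together form a unique primary key, and I need the longest list for each key. The groups also contain lists of different lengths: for most the size is 1, but for others it can be as high as 5. The end goal is a new dataframe like this:
c1  c2  c3  c4
a   b   c   ["d", "e"]
l   m   n   ["o"]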
Answer 0 (score: 3)
How about this:
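import pandas as pd

df = pd.DataFrame(data = [["a", "b", "c", ["d", "e"]],
                          ["a", "b", "c", ["e"]],
                          ["l", "m", "n", ["o"]]],
                  columns = ["c1", "c2", "c3", "c4"])
# length of the list stored in each row of c4
df['len'] = df['c4'].apply(len)
# keep the rows whose list length equals the maximum length within their group
max_groups = df[df.groupby(['c1', 'c2', 'c3'])['len'].transform('max') == df['len']]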
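We add an extra column, len, holding the length of each list in c4, then filter the dataframe down to the records whose list length equals the maximum length within their c1, c2, c3 group. This returns max_groups as: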
max_groups
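A self-contained sketch of the same filtering idea, with one possible variation. It assumes the example data from the question; the names keys, tmp, with_ties and one_per_group are illustrative. The first result keeps every row that ties for the longest list in its group, while the second uses idxmax to keep exactly one row per group.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["a", "b", "c", ["d", "e"]],
          ["a", "b", "c", ["e"]],
          ["l", "m", "n", ["o"]]],
    columns=["c1", "c2", "c3", "c4"],
)
keys = ["c1", "c2", "c3"]

# Work on a copy with a helper column "n" so the original df is left unchanged.
tmp = df.assign(n=df["c4"].map(len))

# Variant 1: compare each row's length with its group's maximum.
# Rows that tie for the maximum are all kept.
group_max = tmp.groupby(keys)["n"].transform("max")
with_ties = df[tmp["n"] == group_max]

# Variant 2: idxmax gives the index label of the first longest row per group,
# so exactly one row per key survives.
one_per_group = df.loc[tmp.groupby(keys)["n"].idxmax()]

print(with_ties)      # rows 0 and 2 of the example data remain
print(one_per_group)  # same rows here, since no group has a tie
```

Which variant fits depends on whether two equally long lists under the same key should both be kept or collapsed to one.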