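Drop rows that don't have the longest list

Date: 2018-03-22 15:29:07

Tags: python pandas

My search skills must be failing me, because this has to be a common problem. I have a DataFrame with nested lists, and I am trying to drop the rows that do not have the longest list: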

import pandas as pd

df = pd.DataFrame(data = [["a", "b", "c", ["d", "e"]],
                          ["a", "b", "c", ["e"]],
                          ["l", "m", "n", ["o"]]],
                  columns = ["c1", "c2", "c3", "c4"])

# max doesn't compare by length, so this picks the wrong list
df.groupby(by=["c1", "c2", "c3"])["c4"].apply(max)
c1  c2  c3
a   b   c        [e]
l   m   n        [o]
Name: c4, dtype: object

# len returns an int (here the number of rows in each group), but matching an int back to a row isn't guaranteed to work
df.groupby(by=["c1", "c2", "c3"])["c4"].apply(len)
c1  c2  c3
a   b   c     2
l   m   n     1
Name: c4, dtype: int64

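The rows have to be grouped first, because the three columns together form a unique primary key, and I need the longest list for each key. The list lengths also vary from group to group: for most the size is 1, for others it can be as high as 5. The end goal is a new DataFrame like this: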

c1  c2  c3  c4
a   b   c   ["d", "e"]
l   m   n   ["o"]

1 Answer:

Answer 0 (score: 3)

How about this:

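import pandas as pd

df = pd.DataFrame(data =[["a", "b", "c", ["d", "e"]],
                         ["a", "b", "c", ["e"]],
                         ["l", "m", "n", ["o"]]],
                  columns = ["c1", "c2", "c3", "c4"])

# Helper column holding the length of each list in c4
df['len'] = df['c4'].apply(len)

# Keep the rows whose list length equals the maximum length within their (c1, c2, c3) group
max_groups = df[df.groupby(['c1', 'c2', 'c3'])['len'].transform('max') == df['len']]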

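We add an extra column holding the length of the list in c4, then filter the DataFrame down to the records whose c4 length equals the maximum length within their c1, c2, c3 group. This returns max_groups as: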

max_groups
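
If you would rather not keep the helper len column, or you need exactly one row per c1, c2, c3 key even when two lists tie for longest, here is a minimal sketch using idxmax. The names keys, lengths, longest_idx and result are my own and do not come from the answer above, so treat this as one possible variant rather than the answer's method.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["a", "b", "c", ["d", "e"]],
          ["a", "b", "c", ["e"]],
          ["l", "m", "n", ["o"]]],
    columns=["c1", "c2", "c3", "c4"],
)

# Hypothetical name: the columns that together form the key.
keys = ["c1", "c2", "c3"]

# Length of each list in c4, kept as a separate Series instead of a new column.
lengths = df["c4"].map(len)

# For each key, idxmax gives the index label of the first row with the
# largest length, so a tie inside a group resolves to a single row.
longest_idx = lengths.groupby([df[k] for k in keys]).idxmax()

# Select those rows and renumber the index.
result = df.loc[longest_idx.to_numpy()].reset_index(drop=True)
print(result)
```

The difference from the transform approach is in how ties are handled: comparing against the group maximum keeps every row that reaches it, while idxmax keeps only the first one. Which behaviour you want depends on whether the key really has to stay unique in the output.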