Pandas column containing a list of objects: split the column by key name and store the values as comma-separated strings

Date: 2017-11-15 04:40:27

Tags: python json list pandas dataframe

I have a dataframe with a column:

A
[{"A": 28, "B": "abc"},{"A": 29, "B": "def"},{"A": 30, "B": "hij"}]
[{"A": 31, "B": "hij"},{"A": 32, "B": "abc"}]
[{"A": 28, "B": "abc"}]
[{"A": 28, "B": "abc"},{"A": 29, "B": "def"},{"A": 30, "B": "hij"}]
[{"A": 28, "B": "abc"},{"A": 29, "B": "klm"},{"A": 30, "B": "nop"}]
[{"A": 28, "B": "abc"},{"A": 29, "B": "xyz"}]

The output should be:

A              B
28,29,30       abc,def,hij
31,32          hij,abc
28             abc
28,29,30       abc,def,hij
28,29,30       abc,klm,nop
28,29          abc,xyz

How can I split the list of objects into columns by key name and store the values as comma-separated strings, as shown above?

3 answers:

Answer 0 (score: 5)

Using stack, then groupby:

df.A.apply(pd.Series).stack().\
     apply(pd.Series).groupby(level=0).\
        agg(lambda x :','.join(x.astype(str)))
Out[457]: 
          A            B
0  28,29,30  abc,def,hij
1     31,32      hij,abc
2        28          abc
3  28,29,30  abc,def,hij
4  28,29,30  abc,klm,nop
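On newer pandas (0.25+), a sketch of the same idea that swaps the slow `apply(pd.Series)` calls for `explode`; this is an alternative I am suggesting, not the answer's own method, shown here on a two-row subset of the question's data:

```python
import pandas as pd

df = pd.DataFrame({'A': [[{"A": 28, "B": "abc"}, {"A": 29, "B": "def"}],
                         [{"A": 31, "B": "hij"}]]})

# one dict per row after explode; the original row label survives in the index
s = df['A'].explode()

# expand the dicts into columns, then re-join per original row
out = (pd.DataFrame(s.tolist(), index=s.index)
         .astype(str)
         .groupby(level=0)
         .agg(','.join))
print(out)
```

`explode` keeps the original row label in the index, which is what lets the final `groupby(level=0)` stitch each list's values back into one row.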

Data input:

df=pd.DataFrame({'A':[[{"A": 28, "B": "abc"},{"A": 29, "B": "def"},{"A": 30, "B": "hij"}],
[{"A": 31, "B": "hij"},{"A": 32, "B": "abc"}],
[{"A": 28, "B": "abc"}],[{"A": 28, "B": "abc"},{"A": 29, "B": "def"},{"A": 30, "B": "hij"}],
[{"A": 28, "B": "abc"},{"A": 29, "B": "klm"},{"A": 30, "B": "nop"}]]})

For your other question, about reading this from a csv:
import ast
df=pd.read_csv(r'your.csv',dtype={'A':object})

df['A'] = df['A'].apply(ast.literal_eval)
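A quick sketch of why `ast.literal_eval` is needed here: `read_csv` returns each cell as one long string, and `literal_eval` safely parses it back into Python objects (unlike `eval`, it only accepts literals):

```python
import ast

# what a cell of the csv column looks like after read_csv: a string
cell = '[{"A": 28, "B": "abc"}, {"A": 29, "B": "def"}]'

parsed = ast.literal_eval(cell)

print(type(parsed))      # a list of dicts, not a string
print(parsed[0]['A'])    # 28
```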

Answer 1 (score: 4)

I'm assuming A is a list of lists of dictionaries:

A = [
    [{"A": 28, "B": "abc"},{"A": 29, "B": "def"},{"A": 30, "B": "hij"}],
    [{"A": 31, "B": "hij"},{"A": 32, "B": "abc"}],
    [{"A": 28, "B": "abc"}],
    [{"A": 28, "B": "abc"},{"A": 29, "B": "def"},{"A": 30, "B": "hij"}],
    [{"A": 28, "B": "abc"},{"A": 29, "B": "klm"},{"A": 30, "B": "nop"}],
    [{"A": 28, "B": "abc"},{"A": 29, "B": "xyz"}]
]

The first thing I'd do is use a comprehension to create a new dictionary. Then groupby and ','.join:
B = {
    (i, j, k): v
    for j, row in enumerate(A)
    for i, d in enumerate(row)
    for k, v in d.items()
}

pd.Series(B).astype(str).groupby(level=[1, 2]).apply(','.join).unstack()

          A            B
0  28,29,30  abc,def,hij
1     31,32      hij,abc
2        28          abc
3  28,29,30  abc,def,hij
4  28,29,30  abc,klm,nop
5     28,29      abc,xyz
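To see what the comprehension builds, here is a minimal sketch on two short rows: each value is keyed by (position within the row, row number, column name), which is why the subsequent groupby is on levels [1, 2]:

```python
A = [
    [{"A": 28, "B": "abc"}, {"A": 29, "B": "def"}],
    [{"A": 31, "B": "hij"}],
]

# key = (position within row, row number, column name)
B = {
    (i, j, k): v
    for j, row in enumerate(A)
    for i, d in enumerate(row)
    for k, v in d.items()
}

print(B[(0, 0, 'A')])  # first dict of row 0, key 'A'
print(B[(1, 0, 'B')])  # second dict of row 0, key 'B'
print(B[(0, 1, 'A')])  # first dict of row 1, key 'A'
```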

Answer 2 (score: 3)

I thought I'd take a crack at this. First, never use eval when you can avoid it. A better solution is ast.literal_eval:

import ast
df.A = df.A.apply(ast.literal_eval)

Next, flatten your column:

i = df.A.str.len().cumsum()   # we'll need this later
df = pd.DataFrame.from_dict(np.concatenate(df.A).tolist())
df.A = df.A.astype(str)

df
     A    B
0   28  abc
1   29  def
2   30  hij
3   31  hij
4   32  abc
5   28  abc
6   28  abc
7   29  def
8   30  hij
9   28  abc
10  29  klm
11  30  nop
12  28  abc
13  29  xyz

制作的时间间隔执行groupby
i

Got a little help from Bharath here.
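As a minimal sketch of the interval idea used next: with cumulative group sizes i = [3, 5], pd.cut maps flat positions 0-2 to the first original row and 3-4 to the second:

```python
import numpy as np
import pandas as pd

i = np.array([3, 5])               # cumulative group sizes: rows of length 3 and 2
bins = np.append([0], i)           # [0, 3, 5]

# right=False makes each interval [a, b), so position 3 starts the second group
cats = pd.cut([0, 1, 2, 3, 4], bins=bins, include_lowest=True, right=False)
print(cats.codes)                  # interval code per flat position
```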

idx = pd.cut(df.index, bins=np.append([0], i), include_lowest=True, right=False)
df = df.groupby(idx, as_index=False).agg(','.join)

df
          A            B
0  28,29,30  abc,def,hij
1     31,32      hij,abc
2        28          abc
3  28,29,30  abc,def,hij
4  28,29,30  abc,klm,nop
5     28,29      abc,xyz

A cool alternative to the IntervalIndex approach (proposed by Wen) is to use np.put:

i = df.A.str.len().cumsum()  
df = pd.DataFrame.from_dict(np.concatenate(df.A).tolist())
df.A = df.A.astype(str)

v = pd.Series(0, index=df.index)
np.put(v, i-1, [1] * len(i))

df = df.groupby(v[::-1].cumsum()).agg(','.join)[::-1].reset_index(drop=True)

df

          A            B
0  28,29,30  abc,def,hij
1     31,32      hij,abc
2        28          abc
3  28,29,30  abc,def,hij
4  28,29,30  abc,klm,nop
5     28,29      abc,xyz
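A small sketch of the np.put trick, assuming two groups of sizes 3 and 2. Note that Series.put has been removed in current pandas, so this sketch marks a plain NumPy array instead of a Series:

```python
import numpy as np
import pandas as pd

i = np.array([3, 5])               # cumulative group sizes
v = np.zeros(5, dtype=int)
np.put(v, i - 1, 1)                # mark the last flat position of each group
print(v.tolist())                  # [0, 0, 1, 0, 1]

# a reversed cumsum turns the end markers into per-row group labels
labels = pd.Series(v)[::-1].cumsum()[::-1]
print(labels.tolist())             # [2, 2, 2, 1, 1]
```

The labels are decreasing, which is why the answer reverses with [::-1] again after the groupby to restore the original row order.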
Performance

df = pd.concat([df] * 1000, ignore_index=True)

%%timeit 
df.A.apply(pd.Series).stack().\
     apply(pd.Series).groupby(level=0).\
        agg(lambda x :','.join(x.astype(str)))

1 loop, best of 3: 8.76 s per loop

%%timeit 
A = df.A.values.tolist()
B = {
    (i, j, k): v
    for j, row in enumerate(A)
    for i, d in enumerate(row)
    for k, v in d.items()
}    
pd.Series(B).astype(str).groupby(level=[1, 2]).apply(','.join).unstack()

1 loop, best of 3: 2.08 s per loop

%%timeit
i = df.A.str.len().cumsum() 
df2 = pd.DataFrame.from_dict(np.concatenate(df.A).tolist())
df2.A = df2.A.astype(str)
idx = pd.cut(df2.index, bins=np.append([0], i), include_lowest=True, right=False)
df2.groupby(idx, as_index=False).agg(','.join)

1 loop, best of 3: 810 ms per loop