Question

So I have a pandas DataFrame read from a CSV file that looks like this:

year,month,day,list
2017,09,01,"[('United States of America', 12345), (u'Germany', 54321), (u'Switzerland', 13524), (u'Netherlands', 24135), ... ]
2017,09,02,"[('United States of America', 6789), (u'Germany', 9876), (u'Switzerland', 6879), (u'Netherlands', 7968), ... ]

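The number of (country, count) pairs in the fourth column differs from row to row. I want to expand the list in the fourth column and transform the data frame into something like this: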

year,month,day,country,count
2017,09,01,'United States of America',12345
2017,09,01,'Germany',54321
2017,09,01,'Switzerland',13524
2017,09,01,'Netherlands',24135
...
2017,09,02,'United States of America',6789
2017,09,02,'Germany',9876
2017,09,02,'Switzerland',6879
2017,09,02,'Netherlands',7968
...

My idea is to generate two separate columns and then join them to the original data frame. Perhaps something like this:

country = df.apply(lambda x:[x['list'][0]]).stack().reset_index(level=1, drop=True)
count  = df.apply(lambda x:[x['list'][1]]).stack().reset_index(level=1, drop=True)
df.drop('list', axis=1).join(country).join(count)

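The code above definitely does not work (I hope it helps convey my idea), and I also don't know how to expand the date columns.
Any help or suggestions are much appreciated.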

Answer 1

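Probably the simplest way to solve this is to iterate over the tuples contained in the data frame and build a new data frame from them. You can do it with two nested for loops.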

import pandas as pd

# Collect one output row per (country, count) tuple
rows = []
for row in df.itertuples():
    for country, count in row.list:
        rows.append([row.year, row.month, row.day, country, count])

df_new = pd.DataFrame(rows, columns=['year', 'month', 'day', 'country', 'count'])

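If the fourth field is not an actual list but a string (the double quotes in your data frame sample make me suspect so), you can use the literal_eval function from the ast module: Converting a string representation of a list into an actual list object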
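As a minimal sketch of that conversion, assuming the `list` column really was read from the CSV as strings (the sample values are taken from the question, with the elided entries left out):

```python
import ast

import pandas as pd

# Assumption: the 'list' column holds string representations of lists of tuples.
df = pd.DataFrame({
    'year': [2017, 2017],
    'month': [9, 9],
    'day': [1, 2],
    'list': [
        "[('United States of America', 12345), (u'Germany', 54321)]",
        "[('United States of America', 6789), (u'Germany', 9876)]",
    ],
})

# literal_eval only parses Python literals, so it does not execute arbitrary code.
df['list'] = df['list'].apply(ast.literal_eval)

print(type(df['list'].iloc[0]))  # each cell is now a real list, not a str
print(df['list'].iloc[0][1])     # second (country, count) tuple of the first row
```

After this step the nested loops above can iterate over `row.list` directly.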

Answer 2

Use:

import ast
import pandas as pd

# convert strings to lists of tuples
df['list'] = df['list'].apply(ast.literal_eval)
# create reshaped df from column list
df1 = pd.DataFrame([dict(x) for x in df['list'].values.tolist()]).stack().reset_index(level=1)
df1.columns = ['country', 'count']
# join to original (axis must be passed by keyword in recent pandas versions)
df = df.drop('list', axis=1).join(df1).reset_index(drop=True)
print(df)
   year  month  day                   country  count
0  2017      9    1                   Germany  54321
1  2017      9    1               Netherlands  24135
2  2017      9    1               Switzerland  13524
3  2017      9    1  United States of America  12345
4  2017      9    2                   Germany   9876
5  2017      9    2               Netherlands   7968
6  2017      9    2               Switzerland   6879
7  2017      9    2  United States of America   6789
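On pandas 0.25 or later, `DataFrame.explode` offers another way to get the same shape. This is not part of the answer above, just a hedged sketch that assumes the `list` column already contains real lists of (country, count) tuples; the names `exploded`, `pairs` and `result` are made up for illustration:

```python
import pandas as pd

# Assumption: 'list' already contains real lists of (country, count) tuples.
df = pd.DataFrame({
    'year': [2017, 2017],
    'month': [9, 9],
    'day': [1, 2],
    'list': [
        [('United States of America', 12345), ('Germany', 54321)],
        [('United States of America', 6789), ('Germany', 9876)],
    ],
})

# One row per tuple; the date columns are repeated for each element.
exploded = df.explode('list').reset_index(drop=True)

# Split each (country, count) tuple into two separate columns.
pairs = pd.DataFrame(exploded['list'].tolist(), columns=['country', 'count'])
result = exploded.drop(columns='list').join(pairs)
print(result)
```

Unlike the `dict`-based reshaping, this keeps duplicate countries within one row, since nothing is keyed by country name.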

Answer 3

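So you need to convert a column containing lists of values into multiple rows. One solution is to create a new data frame and perform a left join: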

df = pd.DataFrame({'A':['a','b'],'B':['x','y'],
                   'C':[['a1', 'a2'],['b1', 'b2', 'b3']]})

df
#    A  B             C
# 0  a  x      [a1, a2]
# 1  b  y  [b1, b2, b3]

dfr=df['C'].apply(lambda k: pd.Series(k)).stack().reset_index(level=1, drop=True).to_frame('C')

dfr
#     C
# 0  a1
# 0  a2
# 1  b1
# 1  b2
# 1  b3

df[['A','B']].join(dfr, how='left')
#    A  B   C
# 0  a  x  a1
# 0  a  x  a2
# 1  b  y  b1
# 1  b  y  b2
# 1  b  y  b3

Finally, use reset_index():

df[['A','B']].join(dfr, how='left').reset_index(drop=True)
#    A  B   C
# 0  a  x  a1
# 1  a  x  a2
# 2  b  y  b1
# 3  b  y  b2
# 4  b  y  b3

Credit: https://stackoverflow.com/a/39955283/2314737

Python Pandas DataFrame: how to create columns from an existing list in a data frame?
