How do I extract nested lists from pandas DataFrame cells?

Time: 2018-10-22 00:03:16

Tags: python pandas pivot pivot-table nested-lists

I have a problem where I have nested lists in multiple columns of my DataFrame. See the image below for reference -

[image of dataframe]


I want the cells in the 'Subreddit' column to become new columns, and the cells in the 'Times Mentioned' column to fill the new cells. The 'Product Name' column would be the new index.

I tried using a pivot table -

import pandas as pd

df_final = pd.DataFrame()
df_final['Product Name'] = ('dr. jart+ mask heroes face savers',
                            'moon juice beauty shroom exfoliating acid potion',
                            'laneige kiss and make up set')

df_final['Subreddit'] = (None, ['scacjdiscussion'], ['AsianBeauty',
                         'AsianBeautyAdvice', 'SkincareAddiction',
                         'abdiscussion'])

df_final['Times Mentioned'] = (None, [1], [4, 1, 1, 1])

This successfully turned all of the nested lists in the 'Subreddit' column into new columns, but 'Times Mentioned' only repeats the first number for each column (example below)

[image: wrong cell fillers]

It should be 4, 1, 1, 1, as in the original image. Does anyone know how to fix this?

Thanks in advance!

1 Answer:

Answer 0: (score: 0)

Some of the cells in the DF contain a list

['AsianBeauty', 'AsianBeautyAdvice','SkincareAddiction', 'abdiscussion']

, i.e. a single cell that needs to be broken up into separate rows against the same Product Name. However, this has to be done while preserving the association between the Product Name column and the other 2 columns (which hold the rows that must be expanded). I used this SO post to do that while keeping the association intact. Here is the approach I used; comments are in the code, and a top-level explanation of the steps is shown separately.
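In case it helps to see the core trick from that post in isolation: passing a scalar as the Series data broadcasts it across the supplied index, which is what turns one product/list pair into one row per list element. A minimal sketch, using values from the sample data:

import pandas as pd

# The scalar product name is repeated for every entry in the index,
# so one product/list-of-subreddits pair becomes one row per subreddit.
s = pd.Series('laneige kiss and make up set',
              index=['AsianBeauty', 'AsianBeautyAdvice',
                     'SkincareAddiction', 'abdiscussion'])
print(s.reset_index())  # two columns: 'index' (subreddit) and 0 (product name)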

Original DF from the OP

import pandas as pd


df_final = pd.DataFrame()
df_final['Product Name'] = ('dr. jart+ mask heroes face savers', 
                           'moon juice beauty shroom exfoliating acid potion',
                           'laneige kiss and make up set')

df_final['Subreddit'] = (None, ['scacjdiscussion'], ['AsianBeauty', 
                       'AsianBeautyAdvice','SkincareAddiction', 
                       'abdiscussion'])

df_final['Times Mentioned'] = (None, [1], [4,1,1,1])
print(df_final)

Original data (df_final)

                                       Product Name                                                          Subreddit Times Mentioned
0                 dr. jart+ mask heroes face savers                                                               None            None
1  moon juice beauty shroom exfoliating acid potion                                                  [scacjdiscussion]             [1]
2                      laneige kiss and make up set  [AsianBeauty, AsianBeautyAdvice, SkincareAddiction, abdiscussion]    [4, 1, 1, 1]

Original data column dtypes

print(df_final.dtypes)
Product Name       object
Subreddit          object
Times Mentioned    object
dtype: object

Code to explode the rows and create the final DF

exploded_dfs = []
for _, row in df_final.iterrows():
    if all(row): # if a row does contain all non-None values
        # Put 1st pair of columns into single DF, exploding single
        # cell into multiple rows as needed
        df1 = pd.concat([pd.Series(row['Product Name'], row['Subreddit'][:])])\
                          .reset_index()
        # Assign column names
        df1.columns = ['Subreddit', 'Product Name']
        # Put 2nd pair of columns into single DF, exploding single
        # cell into multiple rows as needed
        df2 = pd.concat([pd.Series(row['Product Name'], row['Times Mentioned'][:])])\
                           .reset_index()
        # Assign column names
        df2.columns = ['Times Mentioned', 'Product Name']
        # Horizontally concatenate the DFs with exploded row contents
        # & drop the duplicated Product Name column
        merged = pd.concat([df1, df2], axis=1)
        merged = merged.loc[:,~merged.columns.duplicated()]
        # Swap 1st and 2nd columns
        cols = list(merged)
        cols.insert(0, cols.pop(cols.index('Product Name')))
        merged = merged.loc[:, cols]
    else: # if a row does not contain all non-None values
        # Create single row DF with no changes
        merged = pd.DataFrame(columns=['Product Name', 'Subreddit',
                                      'Times Mentioned'])
        # Append row to DF
        merged.loc[0] = row
    exploded_dfs.append(merged)

# Vertically concatenate DFs in list
print(pd.concat(exploded_dfs, axis=0).reset_index(drop=True))

Here is the output

                                       Product Name          Subreddit Times Mentioned
0                 dr. jart+ mask heroes face savers               None            None
1  moon juice beauty shroom exfoliating acid potion    scacjdiscussion               1
2                      laneige kiss and make up set        AsianBeauty               4
3                      laneige kiss and make up set  AsianBeautyAdvice               1
4                      laneige kiss and make up set  SkincareAddiction               1
5                      laneige kiss and make up set       abdiscussion               1

Brief explanation of the steps

  • Iterate over all the rows
    • Note that if a row contains any None values it is used as-is, on the assumption that it needs no cleaning: the row is simply appended to a single-row DF
  • For the first row in the original DF that contains no None values
    • Explode the cell of the first column holding a list (Subreddit) into separate rows, if necessary (explained in this question)
    • Horizontally combine the exploded cell (now multiple rows) with the column that has no list (Product Name); this gives the cleaned DF df1
    • Repeat the last 2 steps with the second column holding a list (Times Mentioned); this gives the cleaned DF df2
    • Horizontally concatenate the 2 cleaned DFs into a new DF named merged
    • Repeat the above process for all rows in the original DF, appending each cleaned DF to a list
    • Assemble the final DF by vertically concatenating all the DFs in the list
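As a side note, newer pandas versions can do the explode step directly with the built-in DataFrame.explode. The sketch below assumes pandas >= 1.3 (exploding two columns at once is not available in the pandas 0.23.4 used above); the names long_df and wide_df are just for illustration. It then pivots the long result into the wide layout the question originally asked for.

import pandas as pd

# Same sample data as above.
df_final = pd.DataFrame({
    'Product Name': ['dr. jart+ mask heroes face savers',
                     'moon juice beauty shroom exfoliating acid potion',
                     'laneige kiss and make up set'],
    'Subreddit': [None, ['scacjdiscussion'],
                  ['AsianBeauty', 'AsianBeautyAdvice',
                   'SkincareAddiction', 'abdiscussion']],
    'Times Mentioned': [None, [1], [4, 1, 1, 1]],
})

# Explode both list columns together (pandas >= 1.3);
# scalar cells such as None pass through unchanged.
long_df = df_final.explode(['Subreddit', 'Times Mentioned'], ignore_index=True)
print(long_df)

# Pivot to the wide layout from the question: Product Name as index,
# Subreddit values as columns, Times Mentioned as cell values.
# The product with no subreddit data is dropped before pivoting.
wide_df = (long_df.dropna(subset=['Subreddit'])
                  .pivot(index='Product Name', columns='Subreddit',
                         values='Times Mentioned'))
print(wide_df)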

Package info

pandas==0.23.4

Python version

Python 2.7.15rc1