I have a problem where lists are nested inside the cells of several columns of my DataFrame. The image below is for reference -
I want the cells in the "Subreddit" column to become new columns, with the cells from the "Times Mentioned" column filling the new cells. The "Product Name" column would become the new index.
I tried using a pivot table -
df_final['Product Name'] = ('dr. jart+ mask heroes face savers',
                            'moon juice beauty shroom exfoliating acid potion',
                            'laneige kiss and make up set')
df_final['Subreddit'] = (None, ['scacjdiscussion'],
                         ['AsianBeauty', 'AsianBeautyAdvice',
                          'SkincareAddiction', 'abdiscussion'])
df_final['Times Mentioned'] = (None, [1], [4, 1, 1, 1])
This successfully turned all the nested lists in the "Subreddit" column into new columns, but "Times Mentioned" only repeats the first number for every column (example below).
It should be 4, 1, 1, 1, as in the original image. Does anyone know how to fix this?
Thanks in advance!
Answer 0 (score: 0)
Some cells in the DF contain a list, e.g.
['AsianBeauty', 'AsianBeautyAdvice', 'SkincareAddiction', 'abdiscussion']
, and such a cell needs to be exploded into separate rows, each repeating that row's Product Name. However, this must be done while preserving the association between the Product Name column and the other 2 columns, which contain the rows that have to be expanded. I used this SO post to do that while keeping the associations intact. Here is the approach I used, with comments in the code and a high-level explanation given separately below.
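Before the full code, the core building block is worth isolating: a pandas Series is built whose value is the scalar Product Name and whose index is the list to be exploded, and reset_index() then turns it into a two-column frame with one row per list element. A minimal illustration of just that step (the variable names are made up for this sketch):
import pandas as pd

# Illustrative values taken from the third row of the OP's DataFrame
row_product = 'laneige kiss and make up set'
row_subreddits = ['AsianBeauty', 'AsianBeautyAdvice', 'SkincareAddiction', 'abdiscussion']

# The scalar value is broadcast over the list used as the index,
# so the product name is repeated once per subreddit
s = pd.Series(row_product, index=row_subreddits)
print(s.reset_index())  # two columns: the subreddit (named 'index') and the product name (named 0)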
Original DF from the OP
import pandas as pd
df_final = pd.DataFrame()
df_final['Product Name'] = ('dr. jart+ mask heroes face savers',
                            'moon juice beauty shroom exfoliating acid potion',
                            'laneige kiss and make up set')
df_final['Subreddit'] = (None, ['scacjdiscussion'],
                         ['AsianBeauty', 'AsianBeautyAdvice',
                          'SkincareAddiction', 'abdiscussion'])
df_final['Times Mentioned'] = (None, [1], [4, 1, 1, 1])
print(df_final)
Raw data (df_final)
Product Name Subreddit Times Mentioned
0 dr. jart+ mask heroes face savers None None
1 moon juice beauty shroom exfoliating acid potion [scacjdiscussion] [1]
2 laneige kiss and make up set [AsianBeauty, AsianBeautyAdvice, SkincareAddiction, abdiscussion] [4, 1, 1, 1]
Raw data column dtypes
print(df_final.dtypes)
Product Name object
Subreddit object
Times Mentioned object
dtype: object
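All three columns show dtype object simply because each cell holds an ordinary Python object (a string, a list, or None). A quick check, as a small sketch continuing from the df_final defined above:
# The cells contain plain Python lists / None values, which is why every dtype is object
print(type(df_final.loc[2, 'Subreddit']))   # <class 'list'>
print(type(df_final.loc[0, 'Subreddit']))   # <class 'NoneType'>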
Code to explode the rows and build the final DF
exploded_dfs = []
for _, row in df_final.iterrows():
    if all(row):  # if a row does contain all non-None values
        # Put 1st pair of columns into single DF, exploding single
        # cell into multiple rows as needed
        df1 = pd.concat([pd.Series(row['Product Name'], row['Subreddit'][:])])\
                .reset_index()
        # Assign column names
        df1.columns = ['Subreddit', 'Product Name']
        # Put 2nd pair of columns into single DF, exploding single
        # cell into multiple rows as needed
        df2 = pd.concat([pd.Series(row['Product Name'], row['Times Mentioned'][:])])\
                .reset_index()
        # Assign column names
        df2.columns = ['Times Mentioned', 'Product Name']
        # Perform INNER JOIN on DFs with exploded row contents
        # & drop duplicated column
        merged = pd.concat([df1, df2], axis=1)
        merged = merged.loc[:, ~merged.columns.duplicated()]
        # Swap 1st and 2nd columns
        cols = list(merged)
        cols.insert(0, cols.pop(cols.index('Product Name')))
        merged = merged.loc[:, cols]
    else:  # if a row does not contain all non-None values
        # Create single row DF with no changes
        merged = pd.DataFrame(columns=['Product Name', 'Subreddit',
                                       'Times Mentioned'])
        # Append row to DF
        merged.loc[0] = row
    exploded_dfs.append(merged)
# Vertically concatenate DFs in list
print(pd.concat(exploded_dfs, axis=0).reset_index(drop=True))
Here is the output
Product Name Subreddit Times Mentioned
0 dr. jart+ mask heroes face savers None None
1 moon juice beauty shroom exfoliating acid potion scacjdiscussion 1
2 laneige kiss and make up set AsianBeauty 4
3 laneige kiss and make up set AsianBeautyAdvice 1
4 laneige kiss and make up set SkincareAddiction 1
5 laneige kiss and make up set abdiscussion 1
Brief explanation of the steps
- If a row contains a None value, it is used as-is and assumed to need no cleaning: the row is simply appended to a single-row DF (this is the first row, the one with None).
- The cells of the first of the 2 list columns (Subreddit) are exploded into separate rows (as explained in this question), keeping the association with the Product Name column; this gives the cleaned DF df1.
- The last 2 steps are repeated for the second list column (Times Mentioned); this gives the cleaned DF df2.
- An INNER JOIN is performed on the exploded DFs and the duplicated column is dropped, putting the result into a new DF merged.
- Finally, all the per-row DFs collected in the list are concatenated vertically.
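As a side note, the OP's question asked for a wide layout (Product Name as the index, each Subreddit as a column, Times Mentioned as the values). Assuming the long-format result above is kept in a variable, a pivot could produce that shape; the variable name result and the dropping of the all-None row are assumptions made for this sketch, not part of the answer above:
# Keep the concatenated long-format output instead of only printing it
result = pd.concat(exploded_dfs, axis=0).reset_index(drop=True)

# Drop the row without a Subreddit, then pivot into the wide layout described in the question
wide = result.dropna(subset=['Subreddit']) \
             .pivot(index='Product Name', columns='Subreddit', values='Times Mentioned')
print(wide)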
Package info
pandas==0.23.4
Python version
Python 2.7.15rc1
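One last note that is not part of the original answer: on newer pandas versions, DataFrame.explode can replace the manual loop entirely. Exploding a single list column works from pandas 0.25, and exploding several aligned list columns in one call works from pandas 1.3; the answer above targets pandas 0.23.4, where explode does not exist. A sketch under that assumption, continuing from the df_final defined above:
# pandas >= 1.3: explode both list columns in parallel; scalar/None cells stay as single rows
exploded = df_final.explode(['Subreddit', 'Times Mentioned'], ignore_index=True)
print(exploded)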