Question

I have a DataFrame with the following structure:

df.columns
Index(['first_post_date', 'followers_count', 'friends_count',
       'last_post_date','min_retweet', 'retweet_count', 'screen_name',
       'tweet_count',  'tweet_with_max_retweet', 'tweets', 'uid'],
        dtype='object')

In the tweets column, each cell is itself a DataFrame containing all of that user's tweets.

df.tweets[0].columns
Index(['created_at', 'id', 'retweet_count', 'text'], dtype='object')

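I want to convert this DataFrame into a multi-indexed frame, essentially by exploding the cells that contain the tweets. One index level would be uid, and the other would be the id from the tweets.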

How can I do this?

link to sample data

Answer 1

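Starting from df, the tweets column holds a DataFrame of tweets per user. I create a tweets_df DataFrame and concatenate every DataFrame from the tweets column into it, adding a uid column so you know which uid each tweet belongs to. I then merge the per-uid information into tweets_df so it is available for further processing. Please comment if you need further changes. It was hard to fetch your sample data and convert it to JSON, so I wrote this partly by guessing; I hope it still gives you some ideas.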

import pandas as pd

df = .... #your df

tweets_df = pd.DataFrame() #create blank df to contain tweets

# explode tweets to df
## loop each uid
for uid in df['uid']:
    temp = df.loc[df['uid']==uid, :] # select df by uid
    temp = temp['tweets'].iloc[0].copy() # select tweets column -> df (copied so the nested df is not modified)
    temp['uid'] = uid # add uid column to know tweets belong to which uid
    tweets_df = pd.concat([tweets_df, temp], ignore_index=True) # concat to container df

# get a uid info df from starting df
uid_info_column = df.columns.drop('tweets') # Index has no .remove(); .drop() returns a new Index
uid_info_df = df.loc[:, uid_info_column]


# merge info on uid with tweets_df
final = pd.merge(left=tweets_df, right=uid_info_df, on='uid', how='outer')
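
The code above produces a flat frame and does not yet set the (uid, id) MultiIndex the question asks for. One possible way to get there is sketched below. It is not the answerer's method: it passes a dict to pd.concat so that the dict keys become the outer index level. The small df built here is a hypothetical stand-in for the asker's data, and the '_user' suffix is an arbitrary choice to separate the per-user retweet_count from the per-tweet one.

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame: one row per user,
# with a nested DataFrame of tweets in the 'tweets' column.
df = pd.DataFrame({
    'uid': [10, 20],
    'screen_name': ['alice', 'bob'],
    'retweet_count': [3, 5],  # per-user total, clashes with the per-tweet column
    'tweets': [
        pd.DataFrame({'id': [1, 2], 'text': ['a', 'b'], 'retweet_count': [0, 3]}),
        pd.DataFrame({'id': [3], 'text': ['c'], 'retweet_count': [5]}),
    ],
})

# Index each nested frame by tweet id; concat then adds uid as the outer level.
nested = {uid: tweets.set_index('id') for uid, tweets in zip(df['uid'], df['tweets'])}
tweets_mi = pd.concat(nested, names=['uid', 'id'])

# Optionally attach the per-user columns, then restore the two-level index.
user_info = df.drop(columns='tweets')
result = (
    tweets_mi.reset_index()
    .merge(user_info, on='uid', how='left', suffixes=('', '_user'))
    .set_index(['uid', 'id'])
)
```

With this layout, a single tweet can be selected with tweets_mi.loc[(uid, id)] and all tweets of one user with tweets_mi.loc[uid]. If the per-user columns are not needed, tweets_mi alone is already the multi-indexed frame.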

Converting a single-index pandas DataFrame to a multi-index
