Question

I have a Pandas DataFrame that looks like this:

   user_id  item_timestamp                item_cashtags                                       item_sectors                                    item_industries
0   406225      1483229353                          SPY                                          Financial                               Exchange Traded Fund
1   406225      1483229353                          ERO                                          Financial                               Exchange Traded Fund
2   406225      1483229350  CAKE|IWM|SDS|SPY|X|SPLK|QQQ  Services|Financial|Financial|Financial|Basic M...  Restaurants|Exchange Traded Fund|Exchange Trad...
3   619769      1483229422                         AAPL                                         Technology                                 Personal Computers
4   692735      1483229891                         IVOG                                          Financial                               Exchange Traded Fund

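I'd like to split the item_cashtags, item_sectors and item_industries columns on |. Each cashtag corresponds to one sector and one industry, so within a row all three columns contain the same number of |-separated values.

I'd like the output to give each cashtag, sector and industry its own row, with item_timestamp and user_id copied over, i.e.: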

   user_id  item_timestamp                item_cashtags              item_sectors                                    item_industries
2   406225      1483229350               CAKE|IWM|SDS               Services|Financial|Financial        Restaurants|Exchange Traded Fund|Exchange Traded Fund

would become:

 user_id  item_timestamp      item_cashtags         item_sectors              item_industries
406225      1483229350          CAKE                Services                    Restaurants
406225      1483229350          IWM                 Financial                   Exchange Traded Fund
406225      1483229350          SDS                 Financial                   Exchange Traded Fund

My problem is that this is a conditional split, and I'm not sure how to do it in Pandas.

Answer 1

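If the frame is not very large, one simple option is to just loop over the rows. I agree that this is not the most idiomatic pandas approach, and it is definitely not the most efficient one.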

from copy import copy

import pandas as pd

result = []
for idx, row in df.iterrows():
    d = dict(row)
    # 'cat1' and 'cat2' stand in for the names of the columns to split
    for cat1, cat2 in zip(d['cat1'].split('|'), d['cat2'].split('|')):
        # here you can add an if to filter on certain categories
        dd = copy(d)  # shallow copy, so each output row gets its own dict
        dd['cat1'] = cat1
        dd['cat2'] = cat2
        result.append(dd)
pd.DataFrame(result)  # convert back
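
For the frame in the question, the same loop could be wrapped in a small helper along these lines. This is only a sketch: `split_rows` is a made-up name, not a pandas function, and it assumes every split column holds strings with the same number of |-separated values in each row.

```python
import pandas as pd


def split_rows(df, split_cols, sep="|"):
    """Hypothetical helper: one output row per separated value in split_cols."""
    records = []
    for _, row in df.iterrows():
        parts = [row[col].split(sep) for col in split_cols]
        # zip stops at the shortest list, so rows with mismatched counts
        # would be truncated silently rather than raising an error
        for values in zip(*parts):
            record = row.to_dict()
            record.update(zip(split_cols, values))
            records.append(record)
    # keep the original column order
    return pd.DataFrame(records, columns=df.columns)


df = pd.DataFrame({
    "user_id": [406225, 619769],
    "item_timestamp": [1483229350, 1483229422],
    "item_cashtags": ["CAKE|IWM|SDS", "AAPL"],
    "item_sectors": ["Services|Financial|Financial", "Technology"],
    "item_industries": [
        "Restaurants|Exchange Traded Fund|Exchange Traded Fund",
        "Personal Computers",
    ],
})
out = split_rows(df, ["item_cashtags", "item_sectors", "item_industries"])
```

If you need to filter on certain categories, the inner loop is where an `if` on `values` would go.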

Answer 2

Well, I don't know how well it performs, but here is another approach.

import pandas as pd

# test data
df_dict = {
    "user_id": [406225, 406225],
    "item_timestamp": [1483229350, 1483229353],
    "item_cashtags": ["CAKE|IWM|SDS", "SPY"],
    "item_sectors": ["Services|Financial|Financial", "Financial"],
    "item_industries": [
        "Restaurants|Exchange Traded Fund|Exchange Traded Fund",
        "Exchange Traded Fund"
    ]
}
df = pd.DataFrame(df_dict)

# which columns to split; all others should be "copied" over
split_cols = ["item_cashtags", "item_sectors", "item_industries"]
copy_cols = [col for col in df.columns if col not in split_cols]

# for each column, split on |. This gives a list, so values is an array of lists
# summing values concatenates these into one long list
new_df_dict = {col: df[col].str.split("|").values.sum() for col in split_cols}

# n_splits tells us how many times to replicate the values from the copied columns
# so that they'll match with the new number of rows from splitting the other columns
n_splits = df.item_cashtags.str.count(r"\|") + 1
# we turn each value into a list so that we can easily replicate them the proper
# number of times, then concatenate these lists like with the split columns
for col in copy_cols:
    new_df_dict[col] = (df[col].map(lambda x: [x]) * n_splits).values.sum()

# now make a df back from the dict of columns
new_df = pd.DataFrame(new_df_dict)

# new_df
#   item_cashtags item_sectors item_industries      user_id item_timestamp
# 0 CAKE          Services     Restaurants          406225  1483229350
# 1 IWM           Financial    Exchange Traded Fund 406225  1483229350
# 2 SDS           Financial    Exchange Traded Fund 406225  1483229350
# 3 SPY           Financial    Exchange Traded Fund 406225  1483229353
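
On newer pandas versions there may be a shorter route. This is not part of the answer above, just a sketch that assumes pandas 1.3 or later, where `DataFrame.explode` accepts a list of columns. It reuses the same test data and splits each column into lists first, then explodes all three together.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [406225, 406225],
    "item_timestamp": [1483229350, 1483229353],
    "item_cashtags": ["CAKE|IWM|SDS", "SPY"],
    "item_sectors": ["Services|Financial|Financial", "Financial"],
    "item_industries": [
        "Restaurants|Exchange Traded Fund|Exchange Traded Fund",
        "Exchange Traded Fund",
    ],
})
split_cols = ["item_cashtags", "item_sectors", "item_industries"]

exploded = df.copy()
for col in split_cols:
    # turn "A|B|C" into ["A", "B", "C"]
    exploded[col] = exploded[col].str.split("|")

# Assumption: pandas >= 1.3 (multi-column explode). The lists in each row
# must have matching lengths, otherwise explode raises a ValueError.
exploded = exploded.explode(split_cols, ignore_index=True)
```

The other columns (user_id, item_timestamp) are repeated for each new row automatically, so there is no need to compute n_splits by hand.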

Pandas: conditional row split
