I have a DataFrame like this:
id_1  id_desc    cat_1  cat_2
111   ask        ele    phone
222   ask hr     ele    phone
333   ask hr dk  ele    phone
444   askh       ele    phone
If cat_1 and cat_2 are the same for multiple id_1 values, that association needs to be captured as a new column.
The output I need looks like this:
id_1  id_desc    cat_1  cat_2  id_2
111   ask        ele    phone  222
111   ask        ele    phone  333
111   ask        ele    phone  444
222   ask hr     ele    phone  111
222   ask hr     ele    phone  333
222   ask hr     ele    phone  444
333   ask hr dk  ele    phone  111
333   ask hr dk  ele    phone  222
333   ask hr dk  ele    phone  444
How can I do this in Python?
Answer 0 (score: 0)
I couldn't come up with anything particularly elegant, but this should do the job:
import pandas as pd
import numpy as np

df = pd.DataFrame([[111, 'ask', 'ele', 'phone'],
                   [222, 'ask_hr', 'ele', 'phone'],
                   [333, 'ask_hr_dk', 'ele', 'phone'],
                   [444, 'askh', 'ele', 'phone']],
                  columns=['id_1', 'id_desc', 'cat_1', 'cat_2'])

grouped = df.groupby(by=['cat_1', 'cat_2'])  # group by the columns you want to be identical
data = []  # a list to store the expanded data of every group

# In your example, this loop is not needed, but this generalizes to more than 1 pair
# of cat_1 and cat_2 values
for group in grouped.groups:
    n_rows = grouped.get_group(group).shape[0]  # how many rows (ids) in the group
    all_data = np.tile(grouped.get_group(group).values, (n_rows, 1))  # tile the data n_rows times
    ids = np.repeat(grouped.get_group(group)['id_1'].values, n_rows)  # repeat each id n_rows times
    data += [np.c_[all_data, ids]]  # append the ids as a new column and add to the list

df_2 = pd.DataFrame(np.concatenate(data), columns=['id_1', 'id_desc', 'cat_1', 'cat_2', 'id_2'])
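The basic idea is to group the data by the cat_1 and cat_2 columns (using groupby), use np.tile to make as many copies of each group as the group has id_1 values, and then attach the id_1 values as a new column, with each id repeated via np.repeat so that every copy of the group gets one id.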
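To see why tiling one array and repeating the other yields every pairing, here is a minimal sketch on a bare array of ids (the variable names are only for illustration and are not part of the code above):

```python
import numpy as np

ids = np.array([111, 222, 333])
n = len(ids)

# np.tile repeats the whole sequence: 111 222 333 111 222 333 ...
tiled = np.tile(ids, n)

# np.repeat repeats each element in place: 111 111 111 222 222 222 ...
repeated = np.repeat(ids, n)

# Stacking the two as columns gives all n * n ordered pairs,
# including the pairs where both ids are the same.
pairs = np.c_[tiled, repeated]
print(pairs)
```

Because one column cycles quickly and the other slowly, every id meets every other id exactly once, which is what the loop in the answer does per group.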
If you don't want rows where id_1 equals id_2, just select the rows where they differ:
df_2 = df_2[df_2['id_1'] != df_2['id_2']]
If you want them sorted by id_1:
df_2 = df_2.sort_values('id_1')  # reassign instead of inplace=True to avoid a SettingWithCopyWarning on the filtered frame
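As a possible alternative to the NumPy approach (this is a different technique, not the method used above), the same pairing can be sketched as a pandas self-merge on cat_1 and cat_2. The names partners and merged are made up for this example:

```python
import pandas as pd

df = pd.DataFrame(
    [[111, 'ask', 'ele', 'phone'],
     [222, 'ask_hr', 'ele', 'phone'],
     [333, 'ask_hr_dk', 'ele', 'phone'],
     [444, 'askh', 'ele', 'phone']],
    columns=['id_1', 'id_desc', 'cat_1', 'cat_2'])

# Keep only the id and the grouping keys, and rename the id so that it
# shows up as id_2 after the merge.
partners = df[['id_1', 'cat_1', 'cat_2']].rename(columns={'id_1': 'id_2'})

# Joining on the category columns pairs every row with every row that
# shares the same (cat_1, cat_2), itself included.
merged = df.merge(partners, on=['cat_1', 'cat_2'])

# Drop the self-pairs and sort for readability.
merged = merged[merged['id_1'] != merged['id_2']]
merged = merged.sort_values(['id_1', 'id_2']).reset_index(drop=True)
print(merged)
```

This keeps the original column dtypes, whereas building the frame from a concatenated NumPy array leaves every column as object dtype.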