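I have some data that tracks how company names change over time. However, rather than having each name on its own row, I would like them all concatenated into a single field.
The input data can be created with: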
#Import the modules:
import pandas as pd
import numpy as np
#Create the empty data frame:
df = pd.DataFrame(columns=['dt','old_name','new_name'])
#Populate the data frame:
df.loc[len(df)] = ['01/01/2001', 'AAA', 'BBB']
df.loc[len(df)] = ['02/02/2002', 'BBB', 'CCC']
df.loc[len(df)] = ['03/03/2003', 'CCC', 'DDD']
#View the output:
df
The desired output can be created with:
#Create the empty data frame:
end_df = pd.DataFrame(columns=['dt','name'])
#Populate:
end_df.loc[len(end_df)] = ['01/01/2001', 'AAA-BBB-CCC-DDD']
end_df.loc[len(end_df)] = ['02/02/2002', 'AAA-BBB-CCC-DDD']
end_df.loc[len(end_df)] = ['03/03/2003', 'AAA-BBB-CCC-DDD']
#View the output:
end_df
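Edit: I am running this code in PySpark 2 using pandas data frames, in case that affects the syntax. Also, my data set contains multiple groups of names. By that I mean there are further groups of name changes that are unrelated to the first group, and each group needs to be concatenated separately.
Sample grouped input: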
#Create the empty data frame:
df = pd.DataFrame(columns=['dt','old_name','new_name'])
#Populate the data frame:
df.loc[len(df)] = ['01/01/2001', 'AAA', 'BBB']
df.loc[len(df)] = ['02/02/2002', 'BBB', 'CCC']
df.loc[len(df)] = ['03/03/2003', 'CCC', 'DDD']
df.loc[len(df)] = ['02/01/2001', 'XXX', 'YYY']
df.loc[len(df)] = ['03/02/2002', 'YYY', 'ZZZ']
Sample grouped output:
#Create the empty data frame:
end_df = pd.DataFrame(columns=['dt','name'])
#Populate:
end_df.loc[len(end_df)] = ['01/01/2001', 'AAA-BBB-CCC-DDD']
end_df.loc[len(end_df)] = ['02/02/2002', 'AAA-BBB-CCC-DDD']
end_df.loc[len(end_df)] = ['03/03/2003', 'AAA-BBB-CCC-DDD']
end_df.loc[len(end_df)] = ['02/01/2001', 'XXX-YYY-ZZZ']
end_df.loc[len(end_df)] = ['03/02/2002', 'XXX-YYY-ZZZ']
Let me know if further clarification is needed.
Answer 0 (score: 3)
You need ndarray.flatten and np.unique:
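import numpy as np
import pandas as pd

# Assumes df is the input data frame from the question.
end_df = pd.DataFrame(columns=['dt', 'name'])
end_df['dt'] = df['dt'].copy()
# Flatten the old_name/new_name columns into one 1-D array.
flat = df[df.columns[1:]].values.flatten()
# np.unique removes duplicates and returns the names sorted.
end_df['name'] = '-'.join(np.unique(flat))
print(end_df)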
           dt             name
0  01/01/2001  AAA-BBB-CCC-DDD
1  02/02/2002  AAA-BBB-CCC-DDD
2  03/03/2003  AAA-BBB-CCC-DDD
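One point worth checking before relying on this: np.unique returns its result sorted, so the joined string is in alphabetical order, which only matches the rename order in the sample because AAA, BBB, CCC, DDD happen to sort that way. It also treats the whole frame as a single group. The following is a minimal sketch for a single chain, assuming first-appearance order is what you want; the names ZZZ, MMM and AAA are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical single chain whose names are NOT in alphabetical order.
df = pd.DataFrame(
    [['01/01/2001', 'ZZZ', 'MMM'],
     ['02/02/2002', 'MMM', 'AAA']],
    columns=['dt', 'old_name', 'new_name'],
)

flat = df[['old_name', 'new_name']].values.flatten()

# np.unique sorts its result, so the rename order is lost.
sorted_chain = '-'.join(np.unique(flat))

# pd.unique keeps the order of first appearance instead.
ordered_chain = '-'.join(pd.unique(flat))

print(sorted_chain)   # AAA-MMM-ZZZ
print(ordered_chain)  # ZZZ-MMM-AAA
```

This still does not separate unrelated groups of names, so it only applies when the frame holds exactly one chain.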
Answer 1 (score: 0)
Create two dicts: old_new_dict, to traverse from an old name to its new name, and old_new_dict_rev, to traverse from a new name back to its old name:
old_new_dict = {k:v for k,v in zip(df.old_name,df.new_name)}
old_new_dict_rev = {v:k for k,v in zip(df.old_name,df.new_name)}
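To make the two lookups concrete, here is a small sketch of what they contain for a cut-down version of the sample data. It uses dict(zip(...)), which should be equivalent to the comprehensions above. It assumes each name appears at most once in old_name and at most once in new_name; with duplicates, later rows would silently overwrite earlier ones.

```python
import pandas as pd

# Cut-down sample: part of the AAA chain plus one row of the XXX chain.
df = pd.DataFrame(
    [['01/01/2001', 'AAA', 'BBB'],
     ['02/02/2002', 'BBB', 'CCC'],
     ['02/01/2001', 'XXX', 'YYY']],
    columns=['dt', 'old_name', 'new_name'],
)

# Forward lookup: old name -> new name.
old_new_dict = dict(zip(df['old_name'], df['new_name']))
# Reverse lookup: new name -> old name.
old_new_dict_rev = dict(zip(df['new_name'], df['old_name']))

print(old_new_dict)      # {'AAA': 'BBB', 'BBB': 'CCC', 'XXX': 'YYY'}
print(old_new_dict_rev)  # {'BBB': 'AAA', 'CCC': 'BBB', 'YYY': 'XXX'}
```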
The function find_tree traverses in both directions from a given name and joins the two walks to build the full path of names.
def find_tree(name):
    left_list = []
    right_list = []
    name_l, name_r = name, name
    # Walk backwards through the earlier names.
    while name_l in old_new_dict_rev:
        left_list.append(old_new_dict_rev[name_l])
        name_l = old_new_dict_rev[name_l]
    # Earlier names were collected newest first, so put the oldest first.
    left_list.reverse()
    # Walk forwards through the later names.
    while name_r in old_new_dict:
        right_list.append(old_new_dict[name_r])
        name_r = old_new_dict[name_r]
    return "-".join(left_list + [name] + right_list)
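The while loops above never terminate if the data contains a cycle, for example a company that reverts to an earlier name. As a hedged sketch, here is a variant that takes the two dicts as arguments and stops when it meets a name it has already visited. The name build_chain and its parameters are my own and not part of the original answer.

```python
def build_chain(name, forward, backward, sep='-'):
    """Return the full rename chain containing `name`, oldest name first."""
    seen = {name}

    # Walk backwards to collect earlier names, stopping on a repeat.
    earlier = []
    current = name
    while current in backward and backward[current] not in seen:
        current = backward[current]
        seen.add(current)
        earlier.append(current)
    earlier.reverse()

    # Walk forwards to collect later names, stopping on a repeat.
    later = []
    current = name
    while current in forward and forward[current] not in seen:
        current = forward[current]
        seen.add(current)
        later.append(current)

    return sep.join(earlier + [name] + later)


forward = {'AAA': 'BBB', 'BBB': 'CCC', 'CCC': 'DDD', 'XXX': 'YYY', 'YYY': 'ZZZ'}
backward = {new: old for old, new in forward.items()}

print(build_chain('BBB', forward, backward))  # AAA-BBB-CCC-DDD
print(build_chain('ZZZ', forward, backward))  # XXX-YYY-ZZZ
```

With cyclic data the result is cut at the repeated name rather than being meaningful, so the guard is there to prevent a hang, not to resolve the ambiguity.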
Add the full path as the name column of the data frame df:
df['name'] = df['old_name'].apply(lambda x: find_tree(x))
end_df = df.drop(['old_name','new_name'], axis = 1)
end_df
# dt name
#0 01/01/2001 AAA-BBB-CCC-DDD
#1 02/02/2002 AAA-BBB-CCC-DDD
#2 03/03/2003 AAA-BBB-CCC-DDD
#3 02/01/2001 XXX-YYY-ZZZ
#4 03/02/2002 XXX-YYY-ZZZ
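Calling find_tree once per row re-walks the same chain for every name in it. If that becomes slow on a larger data set, one possible alternative is to build each chain once, starting from its root (an old name that never appears as a new name), and then map every row to its chain. This is only a sketch under the assumption that every chain is linear and contains no cycles; roots and name_to_chain are names I introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [['01/01/2001', 'AAA', 'BBB'],
     ['02/02/2002', 'BBB', 'CCC'],
     ['03/03/2003', 'CCC', 'DDD'],
     ['02/01/2001', 'XXX', 'YYY'],
     ['03/02/2002', 'YYY', 'ZZZ']],
    columns=['dt', 'old_name', 'new_name'],
)

forward = dict(zip(df['old_name'], df['new_name']))

# A root is an old name that never appears as a new name.
new_names = set(df['new_name'])
roots = [name for name in forward if name not in new_names]

# Walk each chain once and remember the joined string for every member.
name_to_chain = {}
for root in roots:
    chain = [root]
    while chain[-1] in forward:  # assumes no cycles in the data
        chain.append(forward[chain[-1]])
    joined = '-'.join(chain)
    for name in chain:
        name_to_chain[name] = joined

# Build the result without modifying df.
end_df = df[['dt']].assign(name=df['old_name'].map(name_to_chain))
print(end_df)
```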