我的初始数据框是这样的:
import pandas as pd
df = pd.DataFrame({'serialNo':['aaaa','aaaa','cccc','ffff'],
'Date':['2018-09-15','2018-09-16','2018-09-15','2018-09-19'],
'moduleLocation': ['face','head','stomach','legs'],
'moduleName': ['singing', 'dance','booze', 'vocals'],
'warning': [4402, 3747 ,5555,8754],
'failed':[0,3462,5161,3262]})
我已经执行了以下功能来清理数据,首先是将所有数据类型设置为字符串:
all_columns = list(df)
df[all_columns] = df[all_columns].astype(str)
后面是执行某些串联的功能:
def concatenate(diagnostics, field, target):
diagnostics.sort_values(by=['serialNo',field],inplace=True)
diagnostics.drop_duplicates(inplace=True)
diagnostics[target] = \
diagnostics.groupby(['serialNo'], as_index=False)[field].transform(lambda s: ','.join(filter(None, s)))
diagnostics.drop([field],axis=1,inplace=True)
diagnostics.drop_duplicates(inplace=True)
return diagnostics
module = concatenate(df[['serialNo','moduleName']], 'moduleName', 'Module')
Warn = concatenate(df[['serialNo','warning']], 'warning', 'Warn')
Err = concatenate(df[['serialNo','failed']], 'failed', 'Err')
Location = concatenate(df[['serialNo','moduleLocation']], 'moduleLocation', 'Location')
diag_final = pd.merge(module,Warn,on=['serialNo'],how='inner')
diag_final = pd.merge(diag_final,Err,on=['serialNo'],how='inner')
diag_final = pd.merge(diag_final,Location,on=['serialNo'],how='inner')
现在的问题是,diag_final数据框中的日期列不再存在,我希望拥有它。我不想更改现有功能,而只是确保我具有相应的日期。我应该如何实现?
答案 0 :(得分:0)
每个序列号可能有多个值。因此,您将必须连接值,类似于对moduleLocation和moduleName所做的操作。
dates = concatenate(df[['serialNo','Date']], 'Date', 'Date_cat')
diag_final = pd.merge(diag_final,dates,on=['serialNo'],how='inner')