I have a data frame with 6 columns:
Field Type Dataset Year Month Day
0 DATA_CLASS_NAME string ECKERNEL_BOL.RV_PWEL_RESULT 2019 9 5
1 OWNER_CLASS_NAME string ECKERNEL_BOL.RV_PWEL_RESULT 2019 9 5
2 PRODUCTION_DAY string ECKERNEL_BOL.RV_PWEL_RESULT 2019 9 5
3 OBJECT_ID string ECKERNEL_BOL.RV_PWEL_RESULT 2019 9 5
4 CODE string ECKERNEL_BOL.RV_PWEL_RESULT 2019 9 5
5 NAME string ECKERNEL_BOL.RV_PWEL_RESULT 2019 9 5
6 OBJECT_START_DATE string ECKERNEL_BOL.RV_PWEL_RESULT 2019 9 5
7 OBJECT_END_DATE string ECKERNEL_BOL.RV_PWEL_RESULT 2019 9 5
8 MASTER_SYS_CODE string ECKERNEL_BOL.RV_PWEL_RESULT 2019 9 5
9 MASTER_SYS_NAME string ECKERNEL_BOL.RV_PWEL_RESULT 2019 9 5
10 NODE_CLASS_NAME string ECKERNEL_BOL.RV_PWEL_RESULT 2019 9 5
11 SORT_ORDER double ECKERNEL_BOL.RV_PWEL_RESULT 2019 9 5
8000 AFE_NO string EDMREAD.PDE_QTCO_EVENTS 2019 9 27
8001 AFE_TOTAL double EDMREAD.PDE_QTCO_EVENTS 2019 9 27
8002 POLICY_ID string EDMREAD.PDE_QTCO_EVENTS 2019 9 28
8003 PROJECT_ID string EDMREAD.PDE_QTCO_EVENTS 2019 9 28
8004 EVENT_ID string EDMREAD.PDE_QTCO_EVENTS 2019 9 28
8005 EVENT_CODE string EDMREAD.PDE_QTCO_EVENTS 2019 9 28
8006 EVENT_TYPE string EDMREAD.PDE_QTCO_EVENTS 2019 9 28
8007 EQUIP_TYPE string EDMREAD.PDE_QTCO_EVENTS 2019 9 28
8008 EVENT_START_DATE string EDMREAD.PDE_QTCO_EVENTS 2019 9 28
8009 EVENT_END_DATE string EDMREAD.PDE_QTCO_EVENTS 2019 9 28
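For each dataset, I want to add three new rows whose Field values are Year, Month and Day, and then drop those three columns, because I am not interested in their values.
The result should look like this: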
Field Type Dataset
0 DATA_CLASS_NAME string ECKERNEL_BOL.RV_PWEL_RESULT
1 OWNER_CLASS_NAME string ECKERNEL_BOL.RV_PWEL_RESULT
2 PRODUCTION_DAY string ECKERNEL_BOL.RV_PWEL_RESULT
3 OBJECT_ID string ECKERNEL_BOL.RV_PWEL_RESULT
4 CODE string ECKERNEL_BOL.RV_PWEL_RESULT
5 NAME string ECKERNEL_BOL.RV_PWEL_RESULT
6 OBJECT_START_DATE string ECKERNEL_BOL.RV_PWEL_RESULT
7 OBJECT_END_DATE string ECKERNEL_BOL.RV_PWEL_RESULT
8 MASTER_SYS_CODE string ECKERNEL_BOL.RV_PWEL_RESULT
9 MASTER_SYS_NAME string ECKERNEL_BOL.RV_PWEL_RESULT
10 NODE_CLASS_NAME string ECKERNEL_BOL.RV_PWEL_RESULT
11 SORT_ORDER double ECKERNEL_BOL.RV_PWEL_RESULT
12 Year string ECKERNEL_BOL.RV_PWEL_RESULT
13 Day string ECKERNEL_BOL.RV_PWEL_RESULT
14 Month string ECKERNEL_BOL.RV_PWEL_RESULT
8000 AFE_NO string EDMREAD.PDE_QTCO_EVENTS
8001 AFE_TOTAL double EDMREAD.PDE_QTCO_EVENTS
8002 POLICY_ID string EDMREAD.PDE_QTCO_EVENTS
8003 PROJECT_ID string EDMREAD.PDE_QTCO_EVENTS
8004 EVENT_ID string EDMREAD.PDE_QTCO_EVENTS
8005 EVENT_CODE string EDMREAD.PDE_QTCO_EVENTS
8006 EVENT_TYPE string EDMREAD.PDE_QTCO_EVENTS
8007 EQUIP_TYPE string EDMREAD.PDE_QTCO_EVENTS
8008 EVENT_START_DATE string EDMREAD.PDE_QTCO_EVENTS
8009 EVENT_END_DATE string EDMREAD.PDE_QTCO_EVENTS
8010 Year string EDMREAD.PDE_QTCO_EVENTS
8011 Day string EDMREAD.PDE_QTCO_EVENTS
8012 Month string EDMREAD.PDE_QTCO_EVENTS
Answer 0 (score: 1)
You can add these rows as follows:
import pandas as pd

df1 = df[['Field', 'Type', 'Dataset']]  # drop the date columns since they hold no useful data
for i in set(df.Dataset):  # for each dataset,
    for j in ['Year', 'Month', 'Day']:  # make a row for Year, Month and Day and add it to the frame
        row = pd.DataFrame({'Field': [j], 'Type': ['string'], 'Dataset': [i]})
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df1 = pd.concat([df1, row], sort=False, ignore_index=True)
df1 = df1.sort_values(by='Dataset', kind='mergesort')  # stable sort by dataset keeps each group's row order
This only works if you do not care about the index, because ignore_index=True renumbers the rows.
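As an alternative to looping, the new rows can be built in one step and concatenated once. The following is only a minimal sketch, not the answerer's code: the three-row sample frame and the names `base`, `pairs` and `out` are assumptions made for illustration.

```python
import pandas as pd

# Small illustrative frame (hypothetical sample, not the asker's full data).
df = pd.DataFrame({
    "Field": ["CODE", "NAME", "AFE_NO"],
    "Type": ["string", "string", "string"],
    "Dataset": ["ECKERNEL_BOL.RV_PWEL_RESULT",
                "ECKERNEL_BOL.RV_PWEL_RESULT",
                "EDMREAD.PDE_QTCO_EVENTS"],
    "Year": [2019, 2019, 2019],
    "Month": [9, 9, 9],
    "Day": [5, 5, 27],
})

# Keep only the columns of interest.
base = df[["Field", "Type", "Dataset"]]

# One (Dataset, Field) pair for every dataset and every date part.
pairs = pd.MultiIndex.from_product(
    [df["Dataset"].unique(), ["Year", "Month", "Day"]],
    names=["Dataset", "Field"],
).to_frame(index=False)
pairs["Type"] = "string"

# Append all new rows at once, then group them with their dataset.
out = pd.concat([base, pairs[["Field", "Type", "Dataset"]]], ignore_index=True)
out = out.sort_values("Dataset", kind="mergesort").reset_index(drop=True)
print(out)
```

The stable sort (`kind="mergesort"`) keeps the original field order inside each dataset, so the date rows end up after the existing fields. As with the loop version, the original index is not preserved.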
Answer 1 (score: 1)
I managed to come up with another option that is slightly faster than the previous one:
import pandas as pd

df = pd.read_csv(path)
df = df.drop(columns=['Year', 'Month', 'Day'])
df = df.drop_duplicates().reset_index(drop=True)

# one entry per distinct dataset name
dataset_names = df['Dataset'].drop_duplicates().reset_index(drop=True)

def df_new_column(list_names, value):
    # build one row per dataset, with `value` as the field name
    row_list = []
    for dataset in list_names:
        row_list.append({'Field': value, 'Type': 'string', 'Dataset': dataset})
    return pd.DataFrame(row_list)

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
# The YEAR, MONTH and DAY rows are added in turn; `day` holds the final frame.
year = pd.concat([df, df_new_column(dataset_names, 'YEAR')], ignore_index=True, sort=True)
month = pd.concat([year, df_new_column(dataset_names, 'MONTH')], ignore_index=True, sort=True)
day = pd.concat([month, df_new_column(dataset_names, 'DAY')], ignore_index=True, sort=True)
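A quick sanity check can confirm that every dataset gained exactly the three date rows. This is a rough sketch only: the helper `gained_three_rows` and the two-dataset sample frames are hypothetical and not part of the answer above.

```python
import pandas as pd

def gained_three_rows(before, after, parts=("YEAR", "MONTH", "DAY")):
    """Return True if every dataset in `before` gained one row per date part."""
    before_counts = before.groupby("Dataset").size()
    after_counts = after.groupby("Dataset").size()
    # Every dataset should have grown by exactly len(parts) rows.
    same_growth = (after_counts - before_counts).eq(len(parts)).all()
    # Every dataset should contain each date part among its Field values.
    part_rows = after[after["Field"].isin(parts)]
    part_counts = (
        part_rows.groupby("Dataset")["Field"]
        .nunique()
        .reindex(before_counts.index, fill_value=0)
    )
    return bool(same_growth and part_counts.eq(len(parts)).all())

# Hypothetical sample data with two datasets, "A" and "B".
before = pd.DataFrame({
    "Field": ["CODE", "AFE_NO"],
    "Type": ["string", "string"],
    "Dataset": ["A", "B"],
})
extra = pd.DataFrame(
    [{"Field": p, "Type": "string", "Dataset": d}
     for d in ["A", "B"] for p in ["YEAR", "MONTH", "DAY"]]
)
after = pd.concat([before, extra], ignore_index=True)
print(gained_three_rows(before, after))
```

The check assumes that none of the original fields is already named YEAR, MONTH or DAY.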
Timings:
The solution above:
real 0m0,433s
user 0m0,722s
sys 0m0,605s
The previous solution:
real 0m0,546s
user 0m0,907s
sys 0m0,531s
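The real/user/sys figures look like output from the shell `time` command, which would also count interpreter start-up and the pandas import (this is an assumption about how they were measured). To time only the transformation, something like the following sketch with `timeit` could be used. The synthetic frame, its size, and the function names `concat_in_loop` and `concat_once` are hypothetical.

```python
import timeit
import pandas as pd

# Synthetic input (hypothetical sizes): 50 datasets with 40 fields each.
df = pd.DataFrame({
    "Field": [f"F{i}" for i in range(2000)],
    "Type": "string",
    "Dataset": [f"DS{i // 40}" for i in range(2000)],
})
names = df["Dataset"].unique()
parts = ["Year", "Month", "Day"]

def concat_in_loop():
    # Grow the frame one row at a time.
    out = df
    for name in names:
        for part in parts:
            row = pd.DataFrame({"Field": [part], "Type": ["string"], "Dataset": [name]})
            out = pd.concat([out, row], ignore_index=True)
    return out

def concat_once():
    # Build all new rows first, then concatenate a single time.
    rows = pd.DataFrame(
        [{"Field": part, "Type": "string", "Dataset": name}
         for name in names for part in parts]
    )
    return pd.concat([df, rows], ignore_index=True)

# Average over a few runs; absolute numbers depend on the machine and pandas version.
loop_seconds = timeit.timeit(concat_in_loop, number=3) / 3
once_seconds = timeit.timeit(concat_once, number=3) / 3
print(f"loop: {loop_seconds:.4f}s  once: {once_seconds:.4f}s")
```

Both functions return frames of the same length, so the comparison covers only how the rows are appended.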