How to add new rows based on the values in a column

Asked: 2019-09-30 14:35:05

Tags: python-3.x pandas

I have a DataFrame with 6 columns:

                Field    Type                      Dataset  Year  Month  Day
0     DATA_CLASS_NAME  string  ECKERNEL_BOL.RV_PWEL_RESULT  2019      9    5
1    OWNER_CLASS_NAME  string  ECKERNEL_BOL.RV_PWEL_RESULT  2019      9    5
2      PRODUCTION_DAY  string  ECKERNEL_BOL.RV_PWEL_RESULT  2019      9    5
3           OBJECT_ID  string  ECKERNEL_BOL.RV_PWEL_RESULT  2019      9    5
4                CODE  string  ECKERNEL_BOL.RV_PWEL_RESULT  2019      9    5
5                NAME  string  ECKERNEL_BOL.RV_PWEL_RESULT  2019      9    5
6   OBJECT_START_DATE  string  ECKERNEL_BOL.RV_PWEL_RESULT  2019      9    5
7     OBJECT_END_DATE  string  ECKERNEL_BOL.RV_PWEL_RESULT  2019      9    5
8     MASTER_SYS_CODE  string  ECKERNEL_BOL.RV_PWEL_RESULT  2019      9    5
9     MASTER_SYS_NAME  string  ECKERNEL_BOL.RV_PWEL_RESULT  2019      9    5
10    NODE_CLASS_NAME  string  ECKERNEL_BOL.RV_PWEL_RESULT  2019      9    5
11         SORT_ORDER  double  ECKERNEL_BOL.RV_PWEL_RESULT  2019      9    5
8000           AFE_NO  string  EDMREAD.PDE_QTCO_EVENTS      2019      9    27
8001        AFE_TOTAL  double  EDMREAD.PDE_QTCO_EVENTS      2019      9    27
8002        POLICY_ID  string  EDMREAD.PDE_QTCO_EVENTS      2019      9    28
8003       PROJECT_ID  string  EDMREAD.PDE_QTCO_EVENTS      2019      9    28
8004         EVENT_ID  string  EDMREAD.PDE_QTCO_EVENTS      2019      9    28
8005       EVENT_CODE  string  EDMREAD.PDE_QTCO_EVENTS      2019      9    28
8006       EVENT_TYPE  string  EDMREAD.PDE_QTCO_EVENTS      2019      9    28
8007       EQUIP_TYPE  string  EDMREAD.PDE_QTCO_EVENTS      2019      9    28
8008 EVENT_START_DATE  string  EDMREAD.PDE_QTCO_EVENTS      2019      9    28
8009   EVENT_END_DATE  string  EDMREAD.PDE_QTCO_EVENTS      2019      9    28

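For each dataset, I want to add three new rows with Year, Month and Day as separate fields, and then drop those three columns, since I am not interested in their values.

The result should look like this: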

                Field    Type                      Dataset
0     DATA_CLASS_NAME  string  ECKERNEL_BOL.RV_PWEL_RESULT
1    OWNER_CLASS_NAME  string  ECKERNEL_BOL.RV_PWEL_RESULT
2      PRODUCTION_DAY  string  ECKERNEL_BOL.RV_PWEL_RESULT
3           OBJECT_ID  string  ECKERNEL_BOL.RV_PWEL_RESULT
4                CODE  string  ECKERNEL_BOL.RV_PWEL_RESULT
5                NAME  string  ECKERNEL_BOL.RV_PWEL_RESULT
6   OBJECT_START_DATE  string  ECKERNEL_BOL.RV_PWEL_RESULT
7     OBJECT_END_DATE  string  ECKERNEL_BOL.RV_PWEL_RESULT
8     MASTER_SYS_CODE  string  ECKERNEL_BOL.RV_PWEL_RESULT
9     MASTER_SYS_NAME  string  ECKERNEL_BOL.RV_PWEL_RESULT
10    NODE_CLASS_NAME  string  ECKERNEL_BOL.RV_PWEL_RESULT
11         SORT_ORDER  double  ECKERNEL_BOL.RV_PWEL_RESULT
12               Year  string  ECKERNEL_BOL.RV_PWEL_RESULT
13                Day  string  ECKERNEL_BOL.RV_PWEL_RESULT
14              Month  string  ECKERNEL_BOL.RV_PWEL_RESULT
8000           AFE_NO  string  EDMREAD.PDE_QTCO_EVENTS
8001        AFE_TOTAL  double  EDMREAD.PDE_QTCO_EVENTS
8002        POLICY_ID  string  EDMREAD.PDE_QTCO_EVENTS
8003       PROJECT_ID  string  EDMREAD.PDE_QTCO_EVENTS
8004         EVENT_ID  string  EDMREAD.PDE_QTCO_EVENTS
8005       EVENT_CODE  string  EDMREAD.PDE_QTCO_EVENTS
8006       EVENT_TYPE  string  EDMREAD.PDE_QTCO_EVENTS
8007       EQUIP_TYPE  string  EDMREAD.PDE_QTCO_EVENTS
8008 EVENT_START_DATE  string  EDMREAD.PDE_QTCO_EVENTS
8009   EVENT_END_DATE  string  EDMREAD.PDE_QTCO_EVENTS
8010             Year  string  EDMREAD.PDE_QTCO_EVENTS
8011              Day  string  EDMREAD.PDE_QTCO_EVENTS
8012            Month  string  EDMREAD.PDE_QTCO_EVENTS

2 answers:

Answer 0 (score: 1)

You can add these rows as follows:

import pandas as pd

df1 = df[['Field', 'Type', 'Dataset']]  # drop the date columns; their values are not needed
rows = [df1]
for i in df['Dataset'].unique():  # for each dataset,
    for j in ['Year', 'Month', 'Day']:  # make one row per date field
        rows.append(pd.DataFrame({'Field': [j], 'Type': ['string'], 'Dataset': [i]}))
# DataFrame.append was removed in pandas 2.0, so concatenate all pieces at once
df1 = pd.concat(rows, sort=False, ignore_index=True)
df1 = df1.sort_values(by='Dataset')  # sort by dataset so the new rows sit with their dataset

This only works if you don't care about the index, because the original index is discarded.
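
If the order of the rows matters, one option is to group by dataset and place the three date rows directly after each group, instead of sorting afterwards. The following is only a sketch under assumptions: the helper name `add_date_rows` and the small sample frame are made up for illustration, and the index is rebuilt from 0, so it does not reproduce the index labels shown in the question.

```python
import pandas as pd


def add_date_rows(df, fields=("Year", "Month", "Day")):
    """Return Field/Type/Dataset rows with one extra row per date field
    appended after each dataset's own rows (hypothetical helper)."""
    base = df[["Field", "Type", "Dataset"]]
    pieces = []
    # sort=False keeps the datasets in order of first appearance
    for dataset, group in base.groupby("Dataset", sort=False):
        extra = pd.DataFrame(
            {"Field": list(fields), "Type": "string", "Dataset": dataset}
        )
        pieces.append(group)
        pieces.append(extra)
    # A single concat at the end; the index is renumbered from 0
    return pd.concat(pieces, ignore_index=True)


# Tiny made-up sample with the same columns as the question's frame
sample = pd.DataFrame(
    {
        "Field": ["A", "B", "C"],
        "Type": ["string", "double", "string"],
        "Dataset": ["X", "X", "Y"],
        "Year": [2019, 2019, 2019],
        "Month": [9, 9, 9],
        "Day": [5, 5, 27],
    }
)
print(add_date_rows(sample))
```

With this layout the original rows of each dataset keep their relative order, and the Year, Month and Day rows follow them.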

Answer 1 (score: 1)

I managed to find another option, which is slightly faster than the previous one:

import pandas as pd

df = pd.read_csv(path)
df = df.drop(columns=['Year', 'Month', 'Day'])
df = df.drop_duplicates().reset_index(drop=True)
# Unique dataset names, as an independent Series rather than a slice of df
dataset_names = df['Dataset'].drop_duplicates().reset_index(drop=True)


def df_new_column(list_names, value):
    """Build one row per dataset with the given Field value."""
    row_list = []
    for dataset in list_names:
        row_list.append({'Field': value,
                         'Type': 'string',
                         'Dataset': dataset})
    return pd.DataFrame(row_list)


# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# Field names are spelled as in the question's expected output.
year = df_new_column(dataset_names, 'Year')
year = pd.concat([df, year], ignore_index=True)
month = df_new_column(dataset_names, 'Month')
month = pd.concat([year, month], ignore_index=True)
day = df_new_column(dataset_names, 'Day')
day = pd.concat([month, day], ignore_index=True)  # final result
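
The code above extends the frame three times, once per date field. A possible variation is to build every extra row first and concatenate only once. This is a minimal sketch, not the answer author's method: the function name `date_field_rows` and the sample data are assumptions made for illustration.

```python
from itertools import product

import pandas as pd


def date_field_rows(df, fields=("Year", "Month", "Day")):
    """Build one row per (dataset, date field) pair (hypothetical helper)."""
    datasets = df["Dataset"].drop_duplicates()
    return pd.DataFrame(
        [(field, "string", dataset) for dataset, field in product(datasets, fields)],
        columns=["Field", "Type", "Dataset"],
    )


# Made-up sample standing in for the frame read from the CSV file
base = pd.DataFrame(
    {
        "Field": ["A", "B", "C"],
        "Type": ["string", "double", "string"],
        "Dataset": ["X", "X", "Y"],
    }
)
extra = date_field_rows(base)
result = pd.concat([base, extra], ignore_index=True)  # one concat for all new rows
print(result)
```

As in the answer, the new rows end up at the bottom of the frame, so a sort or group step is still needed if they should sit next to their dataset.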

Timings:

The solution above:

real    0m0,433s
user    0m0,722s
sys     0m0,605s

Jim Eisenberg's answer:

real    0m0,546s
user    0m0,907s
sys     0m0,531s
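
The figures above appear to be whole-process timings (real/user/sys), which would also include interpreter start-up and importing pandas. To compare only the row-adding step, it can be timed in-process with `timeit`. The sketch below uses synthetic data and a stand-in function `add_rows`; both are assumptions for illustration and are not the code that produced the numbers above.

```python
import timeit

import pandas as pd


def make_frame(n_datasets=200, n_fields=10):
    """Synthetic stand-in for the question's data (made-up names)."""
    return pd.DataFrame(
        {
            "Field": [f"F{j}" for _ in range(n_datasets) for j in range(n_fields)],
            "Type": "string",
            "Dataset": [f"DS{i}" for i in range(n_datasets) for _ in range(n_fields)],
        }
    )


def add_rows(df):
    """Stand-in for whichever approach is being measured."""
    names = df["Dataset"].drop_duplicates()
    extra = pd.DataFrame(
        {
            "Field": ["Year", "Month", "Day"] * len(names),
            "Type": "string",
            # repeat(3) yields each dataset name three times in a row
            "Dataset": names.repeat(3).to_numpy(),
        }
    )
    return pd.concat([df, extra], ignore_index=True)


frame = make_frame()
# Best of 3 runs, each run calling the function 10 times
seconds = min(timeit.repeat(lambda: add_rows(frame), number=10, repeat=3)) / 10
print(f"{seconds * 1e3:.2f} ms per call")
```

Swapping `add_rows` for each candidate implementation gives numbers that reflect the pandas work alone.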