Selecting and combining rows in a pandas DataFrame based on row values

Time: 2019-06-27 04:32:06

Tags: python pandas dataframe

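I have a very large DataFrame and want to select and combine rows from it. Specifically, I want to pair up the values in column VL from two consecutive rows whenever the STATUS of the upper row is 0 and the STATUS of the lower row is 1. In addition, both rows of a pair must have the same ID (column ID).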

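Here is my solution: (1) use groupby to get all the distinct ID values; (2) loop over each ID; (3) define a function that selects the row indexes; (4) loop over those indexes and select the rows with another function; (5) convert the result to a DataFrame.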

Here is the sample data, which only contains IDs 1 and 2.

import numpy as np
import pandas as pd

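# Only IDs 1 and 2 appear here; the real data has more than 1 million rows.
# Because the array mixes ints and strings, NumPy stores every value as a string.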
vl = np.array([[55, '1', 0],
               [55, '1', 1],
               [55, '1', 0],
               [55, '1', 1],
               [55, '1', 0],
               [55, '1', 0],
               [55, '1', 0],
               [55, '1', 1],
               [55, '1', 0],
               [55, '1', 1],
               [27, '1', 1],
               [54, '2', 0],
               [54, '2', 1],
               [54, '2', 1],
               [51, '2', 0],
               [31, '2', 1],
               [22, '2', 0],
               [22, '2', 1],
               [30, '2', 1],
               [30, '2', 0],
               [30, '2', 1],
               [30, '2', 0],
               [22, '2', 1],
               [30, '2', 0],
               [40, '2', 1]])

sample = pd.DataFrame(vl,columns=['VL','ID','STATUS'])

sample

    VL  ID  STATUS
0   55  1   0
1   55  1   1
2   55  1   0
3   55  1   1
4   55  1   0
5   55  1   0
6   55  1   0
7   55  1   1
8   55  1   0
9   55  1   1
10  27  1   1
11  54  2   0
12  54  2   1
13  54  2   1
14  51  2   0
15  31  2   1
16  22  2   0
17  22  2   1
18  30  2   1
19  30  2   0
20  30  2   1
21  30  2   0
22  22  2   1
23  30  2   0
24  40  2   1
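
The values above look numeric, but they are all strings, which is why the code further down compares STATUS against '0' and '1'. A minimal sketch, assuming numeric VL and STATUS columns are acceptable for your use, converts them with `pd.to_numeric`. The three-row frame `demo` is made up for illustration and is not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame built the same way as `sample` above.
demo = pd.DataFrame(np.array([[55, '1', 0],
                              [55, '1', 1],
                              [27, '1', 1]]),
                    columns=['VL', 'ID', 'STATUS'])

# Every value is a string at this point, e.g. '55' and '0'.
all_strings = all(isinstance(v, str) for v in demo.to_numpy().ravel())

# Convert the numeric columns; ID is left as a string label.
demo[['VL', 'STATUS']] = demo[['VL', 'STATUS']].apply(pd.to_numeric)
```

If the columns are converted this way, any later comparison has to use the integers 0 and 1 instead of the strings '0' and '1'.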

Here is the code.

bike_id= sample.groupby(by='ID').count().index
bike_id = pd.Series(bike_id)

def process_dt(df):

    for i in bike_id:
        # Rows for this bike ID, renumbered from 0 so positions and labels agree.
        sample = df[df['ID'] == i].reset_index(drop=True)

        def get_dt(ser):
            """
            Return [i, i+1] position pairs where ser is '0' at i and '1' at i+1.

            ser is a pandas Series; in this sample it is sample['STATUS'],
            whose values are strings.
            """
            ids = []         # position pairs found so far
            dt = ser.values  # underlying array of STATUS values

            # Scan once: take a pair when '0' is followed by '1',
            # otherwise advance by one row.
            i = 0
            while i < len(dt) - 1:
                if dt[i] == '0' and dt[i + 1] == '1':
                    ids.append([i, i + 1])
                    i += 2
                else:
                    i += 1

            print(ids)
            return ids  # the selected position pairs

        def get_pd(df, x):
            """Build one record per index pair: VL of the '0' row and VL of the '1' row."""
            lst = []
            for org_idx, des_idx in x:
                lst.append({'vl_org': str(df['VL'][org_idx]),
                            'vl_des': str(df['VL'][des_idx])})
            print(lst)
            return lst

        dv = pd.DataFrame(get_pd(sample, get_dt(sample['STATUS'])))
        yield dv

Concatenate the frames yielded for each ID:

dz = pd.concat(process_dt(sample))

[[0, 1], [2, 3], [6, 7], [8, 9]]
[{'vl_org': '55', 'vl_des': '55'}, {'vl_org': '55', 'vl_des': '55'}, {'vl_org': '55', 'vl_des': '55'}, {'vl_org': '55', 'vl_des': '55'}]
[[0, 1], [3, 4], [5, 6], [8, 9], [10, 11], [12, 13]]
[{'vl_org': '54', 'vl_des': '54'}, {'vl_org': '51', 'vl_des': '31'}, {'vl_org': '22', 'vl_des': '22'}, {'vl_org': '30', 'vl_des': '30'}, {'vl_org': '30', 'vl_des': '22'}, {'vl_org': '30', 'vl_des': '40'}]

This is what I want.

dz

    vl_des  vl_org
0   55      55
1   55      55
2   55      55
3   55      55
0   54      54
1   31      51
2   22      22
3   30      30
4   22      30
5   40      30

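This method is very inefficient. Is there a more efficient way?

1 answer:

Answer 0 (score: 1)

I would horizontally concatenate the DataFrame with a copy of itself shifted up by one row, then keep only the rows where both halves share the same ID, the original row has STATUS 0, and the shifted row has STATUS 1.

The code could be: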

resul = pd.concat([sample, sample.shift(-1).rename(columns=lambda x: x+'_2')],
                axis=1)
resul = resul[(resul.STATUS=='0')&(resul.STATUS_2=='1')&(resul.ID==resul.ID_2)]
resul = resul[['VL', 'VL_2']].rename(columns={'VL': 'vl_org', 'VL_2': 'vl_des'})

which gives:

   vl_org vl_des
0      55     55
2      55     55
6      55     55
8      55     55
11     54     54
14     51     31
16     22     22
19     30     30
21     30     22
23     30     40
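
A variant of the same idea, offered as a sketch rather than a benchmarked replacement: shifting within each ID group with `groupby('ID').shift(-1)` makes the explicit ID comparison unnecessary, and `groupby(...).cumcount()` can renumber the pairs per ID to match the index shown in the question's desired output. Here `vl_org` comes from the STATUS 0 row and `vl_des` from the following STATUS 1 row, as in the question's own code. The function name `pair_status_rows` and the small `demo` frame are made up for illustration, and STATUS is assumed to hold the strings '0' and '1' as in the sample.

```python
import pandas as pd

def pair_status_rows(df):
    """Pair each STATUS '0' row with the next row of the same ID when that row is '1'."""
    # VL and STATUS of the next row within the same ID; NaN at the end of each group.
    nxt = df.groupby('ID')[['VL', 'STATUS']].shift(-1)
    mask = (df['STATUS'] == '0') & (nxt['STATUS'] == '1')
    out = pd.DataFrame({'ID': df.loc[mask, 'ID'],
                        'vl_org': df.loc[mask, 'VL'],
                        'vl_des': nxt.loc[mask, 'VL']})
    # Renumber the pairs from 0 within each ID, then drop the helper column.
    out.index = out.groupby('ID').cumcount()
    return out.drop(columns='ID')

# Hypothetical data: the last row of ID 1 is '0' and the first row of ID 2 is '1',
# so a shift that ignored ID would wrongly pair them.
demo = pd.DataFrame({'VL': ['55', '55', '27', '54', '51', '31'],
                     'ID': ['1', '1', '1', '2', '2', '2'],
                     'STATUS': ['0', '1', '0', '1', '0', '1']})
pairs = pair_status_rows(demo)
print(pairs)
```

Whether this is faster than the concat-and-shift version on a million rows would need to be measured on the real data.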