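I have a very large data frame and want to perform select-and-combine operations on it. Specifically, I want to pair the values of column VL from two consecutive rows whose STATUS values are 0 in the upper row and 1 in the lower row. In addition, both rows of a pair must share the same ID (column ID).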
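Here is my solution: (1) use the groupby method to get all values of ID; (2) loop over each ID; (3) define a function that selects the row indexes; (4) loop over all indexes and select the rows with another function; (5) convert the result to a DataFrame object.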
Here is the sample data, which contains only IDs 1 and 2.
import numpy as np
import pandas as pd
# Only IDs 1 and 2 are shown here; the real data has more than 1 million rows.
vl = np.array([[55, '1', 0],
[55, '1', 1],
[55, '1', 0],
[55, '1', 1],
[55, '1', 0],
[55, '1', 0],
[55, '1', 0],
[55, '1', 1],
[55, '1', 0],
[55, '1', 1],
[27, '1', 1],
[54, '2', 0],
[54, '2', 1],
[54, '2', 1],
[51, '2', 0],
[31, '2', 1],
[22, '2', 0],
[22, '2', 1],
[30, '2', 1],
[30, '2', 0],
[30, '2', 1],
[30, '2', 0],
[22, '2', 1],
[30, '2', 0],
[40, '2', 1]])
sample = pd.DataFrame(vl,columns=['VL','ID','STATUS'])
sample
VL ID STATUS
0 55 1 0
1 55 1 1
2 55 1 0
3 55 1 1
4 55 1 0
5 55 1 0
6 55 1 0
7 55 1 1
8 55 1 0
9 55 1 1
10 27 1 1
11 54 2 0
12 54 2 1
13 54 2 1
14 51 2 0
15 31 2 1
16 22 2 0
17 22 2 1
18 30 2 1
19 30 2 0
20 30 2 1
21 30 2 0
22 22 2 1
23 30 2 0
24 40 2 1
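A side note on the sample above: np.array converts a list that mixes integers and strings into a single string type, so every value in sample is a string. That is why the code below compares against '0' and '1' rather than 0 and 1. The following is a minimal sketch of the difference, using a shortened subset of the rows; the names rows, as_strings and typed are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Shortened subset of the sample rows, for illustration only.
rows = [[55, '1', 0], [55, '1', 1], [54, '2', 0], [54, '2', 1]]

# Going through np.array turns every value into a string.
as_strings = pd.DataFrame(np.array(rows), columns=['VL', 'ID', 'STATUS'])
print(repr(as_strings['STATUS'].iloc[0]))

# Passing the nested list directly lets pandas infer a type per column.
typed = pd.DataFrame(rows, columns=['VL', 'ID', 'STATUS'])
print(typed.dtypes)
```

With integer columns, the comparisons would be written against 0 and 1, and on a frame with more than a million rows numeric columns are likely to be cheaper to compare than strings.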
Here is the code.
bike_id = sample.groupby(by='ID').count().index
bike_id = pd.Series(bike_id)

def process_dt(df):
    for i in bike_id:
        # Select one bike ID and renumber its rows from 0.
        sample = df[df['ID'] == i].reset_index(drop=True)

        def get_dt(ser):
            """
            ser is a pandas Series whose indexes are chosen according
            to its values. In this sample, it is sample['STATUS'].
            """
            ids = []         # list to store the selected index pairs
            dt = ser.values  # values of the series
            # Scan for a '0' that is immediately followed by a '1'.
            i = 0
            while i < len(ser)-1:
                try:
                    if dt[i] == '0' and dt[i+1] == '1':
                        ids.append([i, i+1])
                        i += 2
                    if dt[i] == '0' and dt[i+1] == '0':
                        i += 1
                    if dt[i] == '1':
                        i += 1
                except IndexError:
                    # i ran past the end after a jump; the loop condition ends the scan.
                    pass
            print(ids)
            return ids  # the selected index pairs

        def get_pd(df, x):
            """Select the VL values for each index pair in x."""
            lst = []
            for idsg in x:
                dt = {}
                dt['vl_org'] = '{}'.format(df['VL'][idsg[0]])
                dt['vl_des'] = '{}'.format(df['VL'][idsg[1]])
                lst.append(dt)
            print(lst)
            return lst

        dv = pd.DataFrame(get_pd(sample, get_dt(sample['STATUS'])))
        yield dv
Concatenate the yielded frames:
dz = pd.concat(process_dt(sample))
[[0, 1], [2, 3], [6, 7], [8, 9]]
[{'vl_org': '55', 'vl_des': '55'}, {'vl_org': '55', 'vl_des': '55'}, {'vl_org': '55', 'vl_des': '55'}, {'vl_org': '55', 'vl_des': '55'}]
[[0, 1], [3, 4], [5, 6], [8, 9], [10, 11], [12, 13]]
[{'vl_org': '54', 'vl_des': '54'}, {'vl_org': '51', 'vl_des': '31'}, {'vl_org': '22', 'vl_des': '22'}, {'vl_org': '30', 'vl_des': '30'}, {'vl_org': '30', 'vl_des': '22'}, {'vl_org': '30', 'vl_des': '40'}]
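The index scan performed in step (3) can also be written as a standalone function with a single if/else and no exception handling. This is only a sketch of the same scan, not a faster method; the name find_zero_one_pairs is hypothetical, and it assumes the statuses are the strings '0' and '1', as in the sample.

```python
def find_zero_one_pairs(statuses):
    """Return [i, i + 1] for every '0' that is immediately followed by a '1'."""
    pairs = []
    i = 0
    while i < len(statuses) - 1:
        if statuses[i] == '0' and statuses[i + 1] == '1':
            pairs.append([i, i + 1])
            i += 2  # both rows are used by this pair
        else:
            i += 1
    return pairs

# STATUS values of ID 1 and ID 2 from the sample, written as strings.
print(find_zero_one_pairs(list('01010001011')))
# -> [[0, 1], [2, 3], [6, 7], [8, 9]]
print(find_zero_one_pairs(list('01101011010101')))
# -> [[0, 1], [3, 4], [5, 6], [8, 9], [10, 11], [12, 13]]
```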
This is what I want.
dz
vl_des vl_org
0 55 55
1 55 55
2 55 55
3 55 55
0 54 54
1 31 51
2 22 22
3 30 30
4 22 30
5 40 30
This method is very inefficient. Is there a more efficient way?
Answer 0 (score: 1)
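I would horizontally concatenate the data frame with a copy of itself shifted up by one row, then keep only the rows where both halves share the same ID, the original row has status 0, and the shifted row has status 1.
The code could be: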
resul = pd.concat([sample, sample.shift(-1).rename(columns=lambda x: x+'_2')],
axis=1)
resul = resul[(resul.STATUS=='0')&(resul.STATUS_2=='1')&(resul.ID==resul.ID_2)]
resul = resul[['VL', 'VL_2']].rename(columns={'VL': 'vl_org', 'VL_2': 'vl_des'})
This gives:
vl_org vl_des
0 55 55
2 55 55
6 55 55
8 55 55
11 54 54
14 51 31
16 22 22
19 30 30
21 30 22
23 30 40
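For reference, here is a self-contained variant of the same idea. It is a sketch under two assumptions that differ from the question: VL and STATUS are stored as integers rather than strings, and the shift is done per ID with groupby, so the last row of one ID is never paired with the first row of the next and no explicit ID comparison is needed. The names nxt, mask and pairs are introduced here only for illustration.

```python
import pandas as pd

# The sample from the question, with VL and STATUS as integers (an assumption).
sample = pd.DataFrame({
    'VL': [55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 27,
           54, 54, 54, 51, 31, 22, 22, 30, 30, 30, 30, 22, 30, 40],
    'ID': ['1'] * 11 + ['2'] * 14,
    'STATUS': [0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1,
               0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
})

# Values of the next row within the same ID; NaN on the last row of each ID.
nxt = sample.groupby('ID')[['VL', 'STATUS']].shift(-1)

# Keep rows with status 0 whose next row (same ID) has status 1.
mask = (sample['STATUS'] == 0) & (nxt['STATUS'] == 1)

pairs = pd.DataFrame({
    'ID': sample.loc[mask, 'ID'],
    'vl_org': sample.loc[mask, 'VL'],
    'vl_des': nxt.loc[mask, 'VL'].astype(int),  # shift introduced floats
})
print(pairs)
```

The index of pairs is the position of the status-0 row in the original frame. If a per-ID numbering starting at 0 is preferred, as in the desired output of the question, pairs.groupby('ID').cumcount() should provide it.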