I am trying to speed up my code. My code is as follows:
import time

import pandas as pd

df = pd.DataFrame({
    'line': ["320000-320000, 340000-320000, 320000-340000",
             "380000-320000",
             "380000-320000,380000-310000",
             "370000-320000,370000-320000,320000-320000",
             "320000-320000, 340000-320000, 320000-340000"],
    'id': [1, 2, 3, 4, 5],
})

def most_common(lst):
    # Most frequent element; ties are resolved by set iteration order,
    # so the winner of a tie is not guaranteed
    return max(set(lst), key=lst.count)

def split_list(lines):
    # Flatten ['a-b', ' c-d'] into ['a', 'b', 'c', 'd'], stripping stray spaces
    return [s.strip() for s in '-'.join(lines).split('-')]

df['line'] = df['line'].str.split(',')
col_ix = df['line'].index.values
df['line_start'] = 0
df['line_destination'] = 0

start = time.perf_counter()  # time.clock() was removed in Python 3.8
for ix in col_ix:
    col_split = split_list(df.loc[ix, 'line'])
    # Even positions are starts, odd positions are destinations
    even_col_split_most = most_common(col_split[0::2])
    odd_col_split_most = most_common(col_split[1::2])
    # Use .loc instead of chained indexing so the assignment is reliable
    df.loc[ix, 'line_start'] = int(even_col_split_most)
    df.loc[ix, 'line_destination'] = int(odd_col_split_most)
end = time.perf_counter()
print('time\n', str(end - start))
del df['line']
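What I want to do is: first, split the column line on '-'; second, divide line into two columns according to index parity; third, find the max element of each of the two columns (in my code, the most frequent value).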
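For a single row, the three steps can be sketched as follows. This is only an illustration on the first input string, using collections.Counter instead of the most_common helper from the code above; the variable names are made up for the example.

```python
from collections import Counter

row = "320000-320000, 340000-320000, 320000-340000"

# Step 1: split on ',' and '-', stripping stray spaces
values = [v.strip() for pair in row.split(',') for v in pair.split('-')]

# Step 2: even positions are starts, odd positions are destinations
starts = values[0::2]
destinations = values[1::2]

# Step 3: most frequent value in each list
line_start = Counter(starts).most_common(1)[0][0]
line_destination = Counter(destinations).most_common(1)[0][0]
print(line_start, line_destination)  # 320000 320000
```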
Input:
df
   id  line
0   1  320000-320000, 340000-320000, 320000-340000
1   2  380000-320000
2   3  380000-320000,380000-310000
3   4  370000-320000,370000-320000,320000-320000
4   5  320000-320000, 340000-320000, 320000-340000
Split df on '-':
df
   id  line
0   1  [320000, 320000, 340000, 320000, 320000, 340000]
1   2  [380000, 320000]
2   3  [380000, 320000, 380000, 310000]
3   4  [370000, 320000, 370000, 320000, 320000, 320000]
4   5  [320000, 320000, 340000, 320000, 320000, 340000]
Split df according to index parity:
df
   id  line                                              \
0   1  [320000, 320000, 340000, 320000, 320000, 340000]
1   2  [380000, 320000]
2   3  [380000, 320000, 380000, 310000]
3   4  [370000, 320000, 370000, 320000, 320000, 320000]
4   5  [320000, 320000, 340000, 320000, 320000, 340000]

   line_start                line_destination
0  [320000, 340000, 320000]  [320000, 320000, 340000]
1  [380000]                  [320000]
2  [380000, 380000]          [320000, 310000]
3  [370000, 370000, 320000]  [320000, 320000, 320000]
4  [320000, 340000, 320000]  [320000, 320000, 340000]
Find the max element of the columns line_start and line_destination, and del line (this is also my Output):
df
   id  line_start  line_destination
0   1      320000            320000
1   2      380000            320000
2   3      380000            310000
3   4      370000            320000
4   5      320000            320000
Now I am hoping there is a faster way to accomplish this.
Answer 0 (score: 1)
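Here is one option: split and stack the values into a long DataFrame, determine each value's parity from its position within the line, then take the max of each id-parity group. Here is the code: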
import pandas as pd
#import scipy.stats as stats # if you meant 'mode'
#import numpy as np # if you meant 'mode'

# Assumes df is the original DataFrame from the question, with 'line' still a string column.
# Split on '-' or ',' and stack into long format: one value per row, where
# level_0 is the original row index and level_1 is the position within the line
df1 = df.line.str.split('-|,').apply(pd.Series).stack().reset_index()

# Determine the parity for each line
df1['level_1'] = df1.level_1 % 2

# Determine the max for each id-parity group and rename properly
df1[0] = pd.to_numeric(df1[0])  # So max works properly
df1 = df1.groupby(['level_0', 'level_1'])[0].max().reset_index()
# If you instead meant 'mode' replace the above with this:
#df1 = df1.groupby(['level_0', 'level_1'])[0].apply(lambda x: stats.mode(np.sort(x))[0][0]).reset_index()
df1['level_1'] = df1.level_1.map({0: 'line_start', 1: 'line_destination'})

# Pivot to the form you want, bring back the index
df1 = df1.pivot(index='level_0', columns='level_1', values=0)
df1['id'] = df.id  # aligns on index, which was preserved
df1.index.name = None
df1.columns.name = None
df1
The output is what you would expect, at least based on the rules you stated (taking the max):
   line_destination  line_start  id
0            340000      340000   1
1            320000      380000   2
2            320000      380000   3
3            320000      370000   4
4            340000      340000   5
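A small check of why the first row gives 340000 here while the question expects 320000: max picks the largest value, whereas the loop in the question picks the most frequent one. The series below is just the first row's start values copied from the question.

```python
import pandas as pd

# Start values (even positions) of the first row in the question
starts = pd.Series([320000, 340000, 320000])

largest = starts.max()               # largest value: 340000
most_frequent = starts.mode().iloc[0]  # most frequent value: 320000
print(largest, most_frequent)
```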
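Using mode instead of max gives the result below. Note that I had to sort before taking the mode so that, in the case of a tie, you get your desired output of 310000.
   line_destination  line_start  id
0            320000      320000   1
1            320000      380000   2
2            310000      380000   3
3            320000      370000   4
4            320000      320000   5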
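As a possible alternative, here is a minimal sketch that skips apply(pd.Series) and works on the raw strings with plain Python, breaking ties in favour of the smallest value as described above. The helper names most_common_smallest and start_and_destination are made up for this example, and it has not been benchmarked here, so any speed-up is an assumption to verify on your own data.

```python
import re
from collections import Counter

import pandas as pd

def most_common_smallest(values):
    """Most frequent value; ties go to the smallest value (hypothetical helper)."""
    counts = Counter(values)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)

def start_and_destination(line):
    # Split on '-' or ','; int() tolerates the stray spaces after commas
    values = [int(v) for v in re.split(r'[-,]', line)]
    # Even positions are starts, odd positions are destinations
    return most_common_smallest(values[0::2]), most_common_smallest(values[1::2])

df = pd.DataFrame({
    'line': ["320000-320000, 340000-320000, 320000-340000",
             "380000-320000",
             "380000-320000,380000-310000",
             "370000-320000,370000-320000,320000-320000",
             "320000-320000, 340000-320000, 320000-340000"],
    'id': [1, 2, 3, 4, 5],
})

pairs = [start_and_destination(line) for line in df['line']]
result = pd.DataFrame(pairs, columns=['line_start', 'line_destination'],
                      index=df.index)
result.insert(0, 'id', df['id'])
print(result)
```

On the sample data this gives the same line_start and line_destination values as the mode output above, including 310000 for the tie in the third row.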