I'm still at an early stage of learning Python, so apologies in advance if this question sounds silly.
I have a dataset (in tabular form) to which I want to add a few calculated columns. Basically, each row has an event lon/lat and a destination lon/lat, together with their respective timestamps, and I compute the average velocity between each pair.
Sample data looks like this:
print(data_all.head(3))
id lon_evnt lat_evnt event_time \
0 1 -179.942833 41.012467 2017-12-13 21:17:54
1 2 -177.552817 41.416400 2017-12-14 03:16:00
2 3 -175.096567 41.403650 2017-12-14 09:14:06
dest_data_generate_time lat_dest lon_dest \
0 2017-12-13 22:33:37.980 37.798599 -121.292193
1 2017-12-14 04:33:44.393 37.798599 -121.292193
2 2017-12-14 10:33:51.629 37.798599 -121.292193
address_fields_dest \
0 {'address': 'Nestle Way', 'city': 'Lathrop...
1 {'address': 'Nestle Way', 'city': 'Lathrop...
2 {'address': 'Nestle Way', 'city': 'Lathrop...
Then I zip the lon/lat pairs together:
data_all['ping_location'] = list(zip(data_all.lon_evnt, data_all.lat_evnt))
data_all['destination'] = list(zip(data_all.lon_dest, data_all.lat_dest))
Then I want to compute the distance between each pair of location pings, pull some address information out of the string (basically take a substring), and then compute the velocity:
for idx, row in data_all.iterrows():
    dist = gcd.dist(row['destination'], row['ping_location'])
    data_all.loc[idx, 'gc_distance'] = dist

    temp_idx = str(row['address_fields_dest']).find(":")
    pos_start = temp_idx + 3
    pos_end = str(row['address_fields_dest']).find(",") - 2
    data_all.loc[idx, 'destination address'] = str(row['address_fields_dest'])[pos_start:pos_end]

    ##### calculate velocity which is: v = d/t
    ## time is the difference btwn destination time and the ping creation time
    timediff = abs(row['dest_data_generate_time'] - row['event_time'])
    data_all.loc[idx, 'velocity km/hr'] = 0

    ## check if the time diff btwn destination and event ping is more than a minute long
    if timediff > datetime.timedelta(minutes=1):
        data_all.loc[idx, 'velocity km/hr'] = dist / timediff.total_seconds() * 3600.0
Well, this took almost 7 hours to run on 333k rows of data! :( I'm on a Windows 10 machine with 2 cores and 16 GB of RAM... that's not much hardware, but 7 hours is definitely not acceptable :(
How can I make this run more efficiently? One idea: since the rows and their calculations are independent of each other, I could take advantage of parallel processing.
I've read a lot of posts, but most parallel-processing approaches seem to assume I'm only applying one simple function; here, however, I'm adding multiple new columns.
Any help is much appreciated! Or tell me it's impossible to get pandas to do parallel processing (I believe I've read that claim somewhere, but I'm not entirely sure it's 100% true).
Sample posts I've read:
Large Pandas Dataframe parallel processing
python pandas dataframe to dictionary
How do I parallelize a simple Python loop?
How to do parallel programming in Python
and more that aren't on Stack Overflow...
https://homes.cs.washington.edu/~jmschr/lectures/Parallel_Processing_in_Python.html
Answer 0 (score: 0)
Here's a quick solution: I haven't tried to optimize your code at all, just fed it to a multiprocessing pool. This runs your function on each row separately, returns the row with the new attributes, and creates a new dataframe from that output.
import datetime
import multiprocessing as mp

import pandas as pd

def func(arg):
    idx, row = arg
    dist = gcd.dist(row['destination'], row['ping_location'])
    row['gc_distance'] = dist

    temp_idx = str(row['address_fields_dest']).find(":")
    pos_start = temp_idx + 3
    pos_end = str(row['address_fields_dest']).find(",") - 2
    row['destination address'] = str(row['address_fields_dest'])[pos_start:pos_end]

    ##### calculate velocity which is: v = d/t
    ## time is the difference btwn destination time and the ping creation time
    timediff = abs(row['dest_data_generate_time'] - row['event_time'])
    row['velocity km/hr'] = 0
    ## check if the time diff btwn destination and event ping is more than a minute long
    if timediff > datetime.timedelta(minutes=1):
        row['velocity km/hr'] = dist / timediff.total_seconds() * 3600.0
    return row

if __name__ == '__main__':
    # on Windows the pool must be created under this guard so that worker
    # processes can re-import the module without spawning pools of their own
    pool = mp.Pool(processes=mp.cpu_count())
    new_rows = pool.map(func, [(idx, row) for idx, row in data_all.iterrows()])
    pool.close()
    pool.join()
    # each returned item is a Series (one row), so rebuild a DataFrame from them
    data_all_new = pd.DataFrame(new_rows)
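As a side note, the biggest speed-up usually comes from dropping the Python-level row loop altogether. Below is a minimal vectorized sketch of just the time-difference/velocity part; it assumes the 'gc_distance' column has already been filled in (for example by the pool above) and that the two time columns are already datetime dtype:

import pandas as pd

# assumed: data_all['gc_distance'] already holds the great-circle distance in km
timediff = (data_all['dest_data_generate_time'] - data_all['event_time']).abs()
hours = timediff.dt.total_seconds() / 3600.0

data_all['velocity km/hr'] = 0.0
# only compute a velocity when the gap between the two timestamps exceeds one minute
mask = timediff > pd.Timedelta(minutes=1)
data_all.loc[mask, 'velocity km/hr'] = data_all.loc[mask, 'gc_distance'] / hours[mask]

Because these operations run inside pandas/NumPy rather than in a Python loop, they avoid the per-row overhead (and the per-row pickling cost that pool.map pays) and typically get through a few hundred thousand rows in seconds.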