Python: using the multiprocessing module as a possible solution to improve the speed of my function

Asked: 2013-01-07 18:59:37

Tags: python multithreading performance optimization multiprocessing

I have written a function in Python 2.7 (on 64-bit Windows) to calculate the average intersection area between a reference polygon (Ref) and one or more segmented (Seg) polygons in ESRI shapefile format. The code is very slow because I have more than 2000 reference polygons, and for each Ref polygon the function iterates over all Seg polygons (more than 7000) every time. Sorry, but the function is a prototype.

I would like to know whether multiprocessing can help me speed up my loop, or whether there are better-performing solutions. If multiprocessing is a possible solution, I would like to know the best way to optimize the following function:

import os
import numpy as np
import osgeo.ogr
from shapely.geometry import Polygon

def AreaInter(reference, segmented, outFile):
    # open shapefiles
    ref = osgeo.ogr.Open(reference)
    if ref is None:
        raise SystemExit('Unable to open %s' % reference)
    seg = osgeo.ogr.Open(segmented)
    if seg is None:
        raise SystemExit('Unable to open %s' % segmented)
    ref_layer = ref.GetLayer()
    seg_layer = seg.GetLayer()
    # create outfile
    if not os.path.split(outFile)[0]:
        file_path, file_name_ext = os.path.split(os.path.abspath(reference))
        outFile_filename = os.path.splitext(os.path.basename(outFile))[0]
        file_out = open(os.path.join(file_path, "{0}.txt".format(outFile_filename)), "w")
    else:
        file_path_name, file_ext = os.path.splitext(outFile)
        file_out = open(os.path.abspath("{0}.txt".format(file_path_name)), "w")
    # For each reference object i
    for index in xrange(ref_layer.GetFeatureCount()):
        ref_feature = ref_layer.GetFeature(index)
        # get FID (= Feature ID)
        FID = str(ref_feature.GetFID())
        ref_geometry = ref_feature.GetGeometryRef()
        pts = ref_geometry.GetGeometryRef(0)
        points = []
        for p in xrange(pts.GetPointCount()):
            points.append((pts.GetX(p), pts.GetY(p)))
        # convert to a shapely polygon
        ref_polygon = Polygon(points)
        # get the area
        ref_Area = ref_polygon.area
        # create empty lists
        seg_Area, intersect_Area = [], []
        # For each segmented object j
        for segment in xrange(seg_layer.GetFeatureCount()):
            seg_feature = seg_layer.GetFeature(segment)
            seg_geometry = seg_feature.GetGeometryRef()
            pts = seg_geometry.GetGeometryRef(0)
            points = []
            for p in xrange(pts.GetPointCount()):
                points.append((pts.GetX(p), pts.GetY(p)))
            seg_polygon = Polygon(points)
            seg_Area.append(seg_polygon.area)
            # intersection (overlap) of the reference object with the segmented object
            intersect_polygon = ref_polygon.intersection(seg_polygon)
            # area of the intersection (0 = no intersection)
            intersect_Area.append(intersect_polygon.area)
        # Average over all segmented objects (because 1 or more segmented
        # polygons can intersect the reference polygon)
        seg_Area_average = np.average(seg_Area)
        intersect_Area_average = np.average(intersect_Area)
        file_out.write(" ".join(["%s" % i for i in [FID, ref_Area, seg_Area_average, intersect_Area_average]]) + "\n")
    file_out.close()

2 Answers:

Answer 0 (score: 6)

You can use the multiprocessing package, and in particular the Pool class. First, create a function that does everything you currently do inside the for loop, taking only the index as an argument:

def process_reference_object(index):
    ref_feature = ref_layer.GetFeature(index)
    # all your code goes here
    return (" ".join(["%s" % i for i in [FID, ref_Area, seg_Area_average, intersect_Area_average]]) + "\n")

Note that this doesn't write to a file itself; that would be messy, because you would have multiple processes writing to the same file at the same time. Instead, it returns the string that needs to be written. Also note that there are objects in this function (e.g. ref_layer or ref_geometry) that will need to reach it somehow. That's up to you how to do it: you could put process_reference_object as a method of a class initialized with them, or it could be as ugly as just defining them globally.

Then, create a pool of worker processes and run all of your indices through Pool.imap_unordered (which will itself allocate each index to a different process as necessary):

from multiprocessing import Pool
p = Pool()  # run multiple processes
for l in p.imap_unordered(process_reference_object, range(ref_layer.GetFeatureCount())):
    file_out.write(l)

This will parallelize the independent processing of the reference objects across multiple processes, and write the results to the file (note: in an arbitrary order).

Answer 1 (score: 2)

Threading can help to some degree, but first you should make sure you can't simplify the algorithm. If you're checking each of the 2000 reference polygons against all 7000 segmented polygons (perhaps I misunderstood), then you should start there. Anything that runs in O(n²) is going to be slow, so maybe you can prune away pairs that definitely won't intersect, or find some other way to speed things up. Otherwise, running multiple processes or threads will only give a linear improvement while your data grows quadratically.
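One cheap pruning idea along these lines (a sketch of my own, not from the answer; the helper names are hypothetical): compare axis-aligned bounding boxes before computing the expensive polygon intersection, and skip any pair whose boxes don't overlap. Shapely polygons expose this as .bounds; here it is in pure Python on plain coordinate tuples:

```python
def bbox(points):
    # Axis-aligned bounding box (minx, miny, maxx, maxy) of a point list.
    xs = [x for x, y in points]
    ys = [y for x, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def bboxes_overlap(a, b):
    # Two boxes overlap unless one lies entirely to one side of the other.
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

ref = bbox([(0, 0), (2, 0), (2, 2), (0, 2)])
far = bbox([(10, 10), (11, 10), (11, 11)])
near = bbox([(1, 1), (3, 1), (3, 3)])
print(bboxes_overlap(ref, far))   # → False: skip the polygon intersection
print(bboxes_overlap(ref, near))  # → True: worth computing the intersection
```

The box test is a handful of comparisons, so when most polygon pairs are far apart it removes the bulk of the O(n²) shapely intersection calls; a spatial index (e.g. shapely's STRtree) takes the same idea further.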