Question

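I have HDF5 files with multiple groups, where each group contains a data set with >= 25 million rows. At each time step of the simulation, each agent outputs the other agents he/she sensed at that time step. There are about 2000 agents in the scenario and thousands of time steps; the O(n^2) nature of the output explains the huge number of rows.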

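What I'm interested in calculating is the number of unique sightings by category. For instance, agents belong to a side: red, blue, or green. I want to make a two-dimensional table where row i, column j is the number of agents in category j that were sensed by at least one agent in category i. (I'm using sides in this code example, but we could classify the agents in other ways as well, such as by the weapons they have or the sensors they carry.)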

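Here is a sample output table. Note that the simulation does not output blue/blue sensings, because they take up a lot of space and we are not interested in them. (The same goes for green/green.)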

        blue    green   red
blue    0       492     186
green   1075    0       186
red     451     498     26

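The columns are:

tick - time step
sensingAgentId - id of the agent doing the sensing
sensedAgentId - id of the agent being sensed
detRange - range in meters between the two agents
senseType - an enumerated type for what kind of sensing was done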

Here is the code I'm currently using to accomplish this:

def createHeatmap():
  h5file = openFile("someFile.h5")
  run0 = h5file.root.run0.detections

  # A dictionary of dictionaries, {'blue': {'blue':0, 'red':0, ...}
  classHeat = emptyDict(sides)

  # Interested in Per Category Unique Detections
  seenClass = {}

  # Initially each side has seen no one    
  for theSide in sides:
    seenClass[theSide] = []

  # In-kernel search filtering out many rows in file; in this instance 25,789,825 rows
  # are filtered to 4,409,176  
  classifications = run0.where('senseType == 3')

  # Iterate and filter 
  for row in classifications:
    sensedId = row['sensedAgentId']
    # side is a function that returns the string representation of the side of agent
    # with that id.
    sensedSide = side(sensedId)
    sensingSide = side(row['sensingAgentId'])

    # The side has already seen this agent before; ignore it
    if sensedId in seenClass[sensingSide]:
      continue
    else:
      classHeat[sensingSide][sensedSide] += 1
      seenClass[sensingSide].append(sensedId)


  return classHeat

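Note: I come from a Java background, so I apologize if this isn't Pythonic. Please point that out and suggest ways to improve this code; I would love to become more proficient with Python.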

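Right now this is very slow: the iteration and membership checking take about 50 seconds, and that is with the most restrictive set of membership criteria (other detection types have many more rows to iterate over).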

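My question is: is it possible to move the work out of Python and into the in-kernel search query? If so, how? Is there some obvious speedup I am missing? I need to run this function for each run in a set of runs (~30) and for each set of criteria (~5), so it would be great if this could be sped up.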

Final note: I tried using psyco, but it made barely any difference.

Answer 1

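If you have N = ~2k agents, I would suggest putting all sightings into a numpy array of size NxN. This easily fits in memory (about 16 MB for 32-bit integers). Just store a 1 wherever a sighting occurred.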
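As a rough sketch of how such an array could be filled: the function below assumes agent ids are integers in the range 0..N-1 (an assumption; otherwise they would first need to be mapped to such indices) and that the sensing and sensed ids of the filtered rows are already available as two 1-d arrays. The name build_sightings is made up for illustration, and a boolean dtype is used here instead of integers.

```python
import numpy as np

def build_sightings(sensing_ids, sensed_ids, n_agents):
    """Return an n_agents x n_agents boolean array.

    Entry [i, j] is True if agent i sensed agent j at least once.
    Assumes ids are integers in 0..n_agents-1 (hypothetical layout).
    """
    sightings = np.zeros((n_agents, n_agents), dtype=bool)
    # Fancy-index assignment marks every (sensing, sensed) pair at once;
    # repeated pairs simply set the same cell to True again.
    sightings[sensing_ids, sensed_ids] = True
    return sightings

# Tiny example: agent 0 senses agent 1 twice, agent 2 senses agent 3 once.
example = build_sightings(np.array([0, 0, 2]), np.array([1, 1, 3]), 4)
```

With PyTables, the two id columns could probably be pulled out of the filtered rows in one call (Table.read_where in PyTables 3.x, readWhere in 2.x) rather than by iterating over rows in Python, which is where most of the time in the question's loop is likely spent.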

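Assume you have an array sightings. The first coordinate is the sensing agent, the second is the sensed agent. Assume you also have 1-d index arrays listing which agents are on which side. You can then get the number of agents on side B that were sighted by side A like this: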

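sideAseesB = sightings[numpy.ix_(sideAindices, sideBindices)]
sideAseesBcount = numpy.logical_or.reduce(sideAseesB, axis=0).sum()

Note that indexing directly with two 1-d index arrays, sightings[sideAindices, sideBindices], pairs the indices element by element instead of selecting the sub-block, so the first step uses numpy.ix_. Equivalently, you can use sightings.take(sideAindices, axis=0).take(sideBindices, axis=1).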
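To get the whole table from the question rather than a single cell, the same reduction can be repeated for every pair of sides. The following is a minimal sketch, not a tuned implementation: unique_sightings_table and the side_indices dictionary (side name mapped to a 1-d integer array of agent ids) are hypothetical names, and numpy.ix_ is used to select the sub-block of sensing rows and sensed columns.

```python
import numpy as np

def unique_sightings_table(sightings, side_indices):
    """Build {sensing_side: {sensed_side: count}}.

    count is the number of agents on sensed_side that were sensed by
    at least one agent on sensing_side.
    """
    table = {}
    for a, a_idx in side_indices.items():
        table[a] = {}
        for b, b_idx in side_indices.items():
            # Rows: agents of side a (sensing); columns: agents of side b (sensed).
            block = sightings[np.ix_(a_idx, b_idx)]
            # A column counts once if any sensing agent of side a saw that agent.
            seen = np.logical_or.reduce(block, axis=0)
            table[a][b] = int(seen.sum())
    return table

# Small example with 5 agents: 0 and 1 are blue, 2, 3 and 4 are red.
sightings = np.zeros((5, 5), dtype=bool)
sightings[0, 2] = sightings[1, 2] = sightings[1, 3] = sightings[2, 0] = True
sides = {"blue": np.array([0, 1]), "red": np.array([2, 3, 4])}
table = unique_sightings_table(sightings, sides)
```

In this toy example, red agent 2 is seen by both blue agents but is only counted once, which is the "unique sightings" behaviour the question asks for.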

Python, PyTables - taking advantage of in-kernel searches
