Question

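I have a list of dicts in Python. Each element of the list corresponds to one day, and each dict holds per-minute information about a user's activity.

Example: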


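Now I want to run several different aggregations over this data. For example, I want to count activity for each hour of each day of the week. I can do that with the following code: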

list_of_dicts = [
    {u'activity': 
        {u'values': [
            [1407729600, 3.0],
            [1407729660, 2.0],
            [1407729720, 2.0],
            [1407729780, 3.0],
            [1407729840, 1.0],
            [1407729900, 4.0],
            [1407729960, 2.0],
            [1407730020, 5.0],
            [1407730080, 6.0],
            [1407730140, 2.0],
            [1407730200, 1.0],
            [1407730260, 2.0],
            [1407730320, 1.0],
            [1407730380, 2.0],
            [1407730440, 1.0]]}},
    {u'activity': 
        {u'values': [
            [1407788340, 2.0],
            [1407788400, 2.0],
            [1407788460, 3.0],
            [1407788520, 2.0],
            [1407788580, 2.0],
            [1407788640, 2.0],
            [1407788700, 2.0],
            [1407788760, 2.0],
            [1407788820, 2.0],
            [1407788880, 3.0],
            [1407788940, 2.0],
            [1407789000, 3.0],
            [1407789060, 2.0],
            [1407789120, 3.0],
            [1407789180, 3.0],
            [1407789240, 2.0],
            [1407789300, 3.0],
            [1407789360, 3.0],
            [1407789420, 2.0],
            [1407789480, 3.0],
            [1407789540, 2.0]]}}]
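
For reference, the per-hour, per-weekday count described above could be sketched roughly as follows. This is only an illustration, not the asker's code: it assumes the first number in each pair is a Unix timestamp in seconds, interprets it as UTC, and treats "activity count" as the sum of the second number. The name `hourly_counts` and the shortened `sample` list are made up for the example.

```python
from collections import Counter
from datetime import datetime, timezone

# Shortened stand-in for the list_of_dicts structure shown above.
sample = [
    {"activity": {"values": [[1407729600, 3.0], [1407729660, 2.0]]}},
    {"activity": {"values": [[1407788340, 2.0], [1407788400, 2.0]]}},
]

def hourly_counts(days):
    """Sum activity per (weekday, hour). Weekday 0 is Monday; times are UTC."""
    counts = Counter()
    for day in days:
        for ts, value in day["activity"]["values"]:
            # Assumption: timestamps are Unix seconds, read as UTC.
            dt = datetime.fromtimestamp(ts, tz=timezone.utc)
            counts[(dt.weekday(), dt.hour)] += value
    return counts

by_hour = hourly_counts(sample)
```

If the timestamps are meant to be local time, the `tz` argument would need to change accordingly.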

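That works, but I also need other aggregations, because this is part of feature-vector generation for a later ML step. For example, I want the number of weeks that had activity on all seven days, and so on. Each of these could be computed separately with its own counter by reading the list of dicts again. That would be very time-consuming, though, because the list of dicts is large and this runs for 1M+ users (via PySpark). Ideally we would not read this large list of dicts more than once. Is there a way to compute these metrics in a single pass over the list of dicts?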

Answer 1

Correct me if I'm wrong, but I would summarize your question as follows:

Is there a way to compute multiple operations in a single pass over a list of dicts?

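In the general case, this shouldn't matter. If you are computing 15 operations, you will have to perform 15 different computations on each element either way.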
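
To make that concrete, here is a minimal sketch of one loop that updates two accumulators at once. Every element still gets one update per metric; only the iteration itself is shared. The function name `aggregate_once`, the choice of ISO weeks, the UTC interpretation of timestamps, and "activity" meaning a value greater than zero are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

def aggregate_once(days):
    """Compute two metrics in a single pass over the list of dicts."""
    by_hour = Counter()             # (weekday, hour) -> summed activity
    active_days = defaultdict(set)  # (ISO year, ISO week) -> weekdays seen
    for day in days:
        for ts, value in day["activity"]["values"]:
            # Assumption: timestamps are Unix seconds, read as UTC.
            dt = datetime.fromtimestamp(ts, tz=timezone.utc)
            by_hour[(dt.weekday(), dt.hour)] += value
            if value > 0:
                iso = dt.isocalendar()
                active_days[(iso[0], iso[1])].add(dt.weekday())
    # A week counts only if all seven weekdays showed activity.
    full_weeks = sum(1 for seen in active_days.values() if len(seen) == 7)
    return by_hour, full_weeks

# One reading per day, Monday 2014-08-11 through Sunday 2014-08-17 (UTC).
week = [{"activity": {"values": [[1407715200 + i * 86400, 1.0]]}}
        for i in range(7)]
by_hour, full_weeks = aggregate_once(week)
```

Further metrics can be added as more accumulators inside the same loop, at the cost of one more update per element.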

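If you have detailed knowledge of what each metric does, you can factor out some operations to eliminate redundant work. For example, several metrics may need the per-minute mean for normalization. It is up to you to write the functions so they can share that mean: compute it once, then pass it to each function.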
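
A rough sketch of that idea follows. The two metric functions, `normalized_peak` and `above_mean_minutes`, are hypothetical stand-ins for whatever features you actually need; the point is only that the mean is computed once and handed to both.

```python
def per_minute_mean(values):
    """Mean activity per minute over a non-empty list of values."""
    return sum(values) / len(values)

def normalized_peak(values, mean):
    """Hypothetical metric: highest minute relative to the mean."""
    return max(values) / mean

def above_mean_minutes(values, mean):
    """Hypothetical metric: number of minutes above the mean."""
    return sum(1 for v in values if v > mean)

# First six per-minute values from the example data.
values = [3.0, 2.0, 2.0, 3.0, 1.0, 4.0]

mean = per_minute_mean(values)  # computed once, shared below
features = {
    "normalized_peak": normalized_peak(values, mean),
    "above_mean_minutes": above_mean_minutes(values, mean),
}
```

Passing the mean as an argument keeps each metric function independent and testable while avoiding a second pass just to recompute it.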

Different counters in a list of dicts
