Question

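I'm currently building a PHP command that updates my ElasticSearch indexes.

However, one big thing I've noticed is that serializing my entities takes far too long once my array holds more than 10000 of them. I expected the cost to be linear, but 6k or 9k entities take about a minute (with little difference between 6k and 9k), whereas beyond 10k it slows down to the point of taking up to 10 minutes.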

...
                // we iterate on the documents previously requested to the sql database
                foreach($entities as $index_name => $entity_array) {
                    $underscoreClassName = $this->toUnderscore($index_name); // elasticsearch understands underscored names
                    $camelcaseClassName = $this->toCamelCase($index_name); // sql understands camelcase names

                    // we get the serialization groups for each index from the config file
                    $groups = $indexesInfos[$underscoreClassName]['types'][$underscoreClassName]['serializer']['groups']; 

                    foreach($entity_array as $entity) {
                        // each entity is serialized as a json array
                        $data = $this->serializer->serialize($entity, 'json', SerializationContext::create()->setGroups($groups));
                        // each serialized entity as json is converted as an Elastica document
                        $documents[$index_name][] = new \Elastica\Document($entityToFind[$index_name][$entity->getId()], $data);
                    }
                }
...

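There's a class around this, but this is where most of the time is spent.

I understand that serialization is a heavy operation that takes time, but why is there almost no difference between 6, 7, 8 or 9k entities, yet so much more time once there are over 10k?

PS: I've opened an issue on GitHub for reference.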

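Edit:

To explain more precisely what I'm trying to do: we have a SQL database on a Symfony project, with Doctrine linking the two, and we use ElasticSearch (with the FOSElastica bundle and Elastica) to index the data into ElasticSearch.

The problem is that while FOSElastica takes care of updating the data that was updated in the SQL database, it doesn't update every index that contains that data. (For example, if you have an author and the two books he wrote, in ES you'll have the two books with the author embedded in them, plus the author. FOSElastica only updates the author, not the information about the author inside the two books.)

So, to work around this, I'm writing a script that listens to every update made through Doctrine, fetches every ElasticSearch document related to the updated entity, and updates them. It works, but it takes far too long in my stress test, where more than 10000 large documents have to be updated.

Edit:

To add more detail about what I've tried: I have the same problem when using the "populate" command from FOSElastica. At 9k everything is fine and smooth; at 10k it takes a really long time.

Currently I'm running tests in which I reduce the size of the array in my script and reset it; no success so far.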

Answer 1

In my opinion, you should check your memory consumption: you're building a large array holding a lot of objects.
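As a rough way to check this, memory usage can be sampled at regular intervals while the array grows. The sketch below uses PHP's built-in `memory_get_usage()`; the payload built with `json_encode` and `str_repeat` is only a stand-in for the real serialized entities, and the interval of 1000 is an arbitrary assumption.

```php
<?php
$documents = [];
$samples = [];

for ($i = 1; $i <= 3000; $i++) {
    // Stand-in for one serialized entity (the real data comes from the serializer).
    $documents[] = json_encode(['id' => $i, 'body' => str_repeat('x', 200)]);

    // Record the memory currently allocated by the script every 1000 items.
    if ($i % 1000 === 0) {
        $samples[$i] = memory_get_usage();
    }
}

foreach ($samples as $count => $bytes) {
    echo $count . ' items: ' . round($bytes / 1024) . " KiB\n";
}
```

If the sampled values keep climbing toward the configured `memory_limit`, the array itself is a likely suspect.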

You have two solutions: use generators to avoid building that array, or push your documents every "x" iterations and reset the array.
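A minimal sketch of the generator approach, under stated assumptions: `inBatches()` and `fakeEntities()` are hypothetical helpers written for this illustration (they are not part of Elastica or JMS Serializer), and `json_encode` stands in for the real serializer call.

```php
<?php
// Hypothetical helper: lazily groups any iterable into arrays of at most $size items.
function inBatches(iterable $items, int $size): Generator
{
    $batch = [];
    foreach ($items as $item) {
        $batch[] = $item;
        if (count($batch) === $size) {
            yield $batch;   // hand over a full batch
            $batch = [];    // reset so the previous items can be freed
        }
    }
    if ($batch !== []) {
        yield $batch;       // last, possibly smaller, batch
    }
}

// Stand-in for the real entity source; a generator, so nothing is built up front.
function fakeEntities(int $count): Generator
{
    for ($id = 1; $id <= $count; $id++) {
        yield ['id' => $id, 'title' => "Book $id"];
    }
}

$pushed = [];
foreach (inBatches(fakeEntities(25), 10) as $batch) {
    // In the real command, each entity would be serialized here and the
    // batch sent to Elasticsearch before moving on to the next one.
    $documents = array_map('json_encode', $batch);
    $pushed[] = count($documents);
}
// $pushed now holds the size of each batch that was processed.
```

With this shape, only one batch of documents is held in memory at a time, which covers both suggestions at once: the full array is never built, and each batch is dropped after it is pushed.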

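I hope this gives you an idea of how to handle this kind of migration.

By the way, I almost forgot to tell you to avoid using ORM/ODM repositories to retrieve your data (in a migration script). The problem is that they work with objects and hydrate them, and honestly, in a huge migration script you'll wait forever for nothing. If possible, just use the database object directly; that is probably enough for what you need.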

Answer 2

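I've changed the way my algorithm works: I first get all the ids that need updating, then fetch the entities from the database in batches of 500-1000 (I'm still running tests).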

                    /*
                    * to avoid creating arrays with too much objects, we loop on the ids and split them by DEFAULT_BATCH_SIZE
                    * this way we get them by packs of DEFAULT_BATCH_SIZE and add them by the same amount
                    */ 
                    $total = count($idsToRequest);
                    for ($i = 0; $i < $total; $i++) {
                        $currentSetOfIds[] = $idsToRequest[$i];

                        // every time we have DEFAULT_BATCH_SIZE ids or if it's the end of the loop we update the documents
                        // ($i + 1 so that the first batch is full-sized instead of holding a single id)
                        if (($i + 1) % self::DEFAULT_BATCH_SIZE == 0 || $i == $total - 1) {
                            if ($currentSetOfIds) {

                                // retrieves from the database a batch of entities
                                $entities = $thatRepo->findBy(array('id' => $currentSetOfIds)); 

                                // serialize and create documents with the entities we got earlier
                                foreach($entities as $entity) {
                                    $data = $this->serializer->serialize($entity, 'json', SerializationContext::create()->setGroups($groups));
                                    $documents[] = new \Elastica\Document($entityToFind[$indexName][$entity->getId()], $data);
                                }

                                // update all the documents serialized
                                $elasticaType->updateDocuments($documents);

                                // reset of arrays
                                $currentSetOfIds = [];
                                $documents = [];
                            }
                        }
                    }
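The batching in the loop above could also be written with PHP's built-in `array_chunk()`, which removes the modulo check and the end-of-loop special case. This is only a sketch: `BATCH_SIZE` and the id list are placeholder values, and the fetch, serialize and update steps are left as a comment.

```php
<?php
const BATCH_SIZE = 4; // stand-in for self::DEFAULT_BATCH_SIZE

$idsToRequest = range(1, 10); // placeholder ids
$batchSizes = [];

// array_chunk() splits the id list into arrays of at most BATCH_SIZE ids.
foreach (array_chunk($idsToRequest, BATCH_SIZE) as $currentSetOfIds) {
    // In the real command: fetch the entities for $currentSetOfIds,
    // serialize them, and send the resulting documents in one call.
    $batchSizes[] = count($currentSetOfIds);
}
```
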

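I'm updating them by the same amount, but this still doesn't improve the performance of the serialize method. I really don't understand what is different for the serializer, which has no idea whether I have 9k or 10k entities...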

JMS Serializer performance issue with more than 10000 entries
