Elasticsearch MapperParsingException on bulk import

Date: 2016-09-30 23:27:53

Tags: elasticsearch

I get a MapperParsingException when I try to upload a large JSON file. Here is the full error that Elasticsearch returns:

on [[sample][4]]
MapperParsingException[failed to parse]; nested: IllegalArgumentException[Malformed content, found extra data after parsing: START_OBJECT];
    at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:156)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:309)
    at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:529)
    at org.elasticsearch.index.shard.IndexShard.prepareCreateOnPrimary(IndexShard.java:506)
    at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:214)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:223)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:157)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:657)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:287)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Malformed content, found extra data after parsing: START_OBJECT
    at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:141)
    ... 17 more

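I'm trying to understand what is actually wrong with the data I'm sending. What can I do to debug this kind of situation better?

Edit: it is a large document with 200 million examples, but here is one sample data point:

{"company":"E-Corp","title":"Sith lord","people":[{"id":"12345","name":"Darth Vader","title":"The Sith Lord"}]}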
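One way to narrow this down before sending anything to Elasticsearch is to check the file locally. The sketch below is only an illustration and assumes the file is meant to hold one JSON object per line; `find_bad_lines` is a hypothetical helper name, not part of any Elasticsearch client. It reports lines that are not valid JSON or that carry a second value after the first one, which is the situation the "found extra data after parsing" message describes.

```python
import json


def find_bad_lines(lines, limit=10):
    """Return (line_number, reason) pairs for lines that are not exactly one JSON object."""
    decoder = json.JSONDecoder()
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines are ignored by this check
        try:
            # raw_decode parses the first JSON value and reports where it ended
            value, end = decoder.raw_decode(text)
        except ValueError as exc:
            problems.append((number, "invalid JSON: %s" % exc))
        else:
            if text[end:].strip():
                problems.append((number, "extra data after first JSON value"))
            elif not isinstance(value, dict):
                problems.append((number, "not a JSON object"))
        if len(problems) >= limit:
            break  # stop early; a 200-million-line file may have many problems
    return problems
```

Since a file object iterates line by line, it can be passed in directly, e.g. `find_bad_lines(open("data.json"))`, where `data.json` is a placeholder path.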

3 Answers:

Answer 0 (score: 1)

Make sure every odd line is the unique ID (action) line:

{ "index": {}}

and every even line is the data:

{"company":"E-Corp","title":"Sith lord","people":[{"id":"12345","name":"Darth Vader","title":"The Sith Lord"}]}

and use _bulk when adding to Elastic:

POST /index/type/_bulk
{ "index": {}}
{"company":"E-Corp","title":"Sith lord","people":[{"id":"12345","name":"Darth Vader","title":"The Sith Lord"}]}
{ "index": {}}
{"company":"E-Corp","title":"Sith lord","people":[{"id":"12345","name":"Darth Vader","title":"The Sith Lord"}]}
{ "index": {}}
{"company":"E-Corp","title":"Sith lord","people":[{"id":"12345","name":"Darth Vader","title":"The Sith Lord"}]}

That is my guess at the cause of the error message, judging from your log.
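To make the odd/even layout concrete, here is a minimal sketch that builds a bulk request body from a list of documents. `build_bulk_body` is a hypothetical helper, and the empty `{"index": {}}` action (which lets Elasticsearch assign the IDs) is taken from the example in this answer. The bulk API expects each JSON object on its own line and a newline after the last line.

```python
import json


def build_bulk_body(documents):
    """Interleave an action line with each document, one JSON object per line."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {}}))  # odd line: the action
        lines.append(json.dumps(doc))            # even line: the data
    # json.dumps never emits raw newlines by default, so each object stays on one line;
    # every line, including the last, is terminated with "\n"
    return "".join(line + "\n" for line in lines)
```

For 200 million documents the body would have to be sent in batches rather than as a single request; the batch size is something to tune and is not shown here.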

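Answer 1 (score: 0)

Did you specify a mapping? If not, Elasticsearch creates the mapping based on the first document. If a later document then has a value that does not fit the mapping for one of those fields, indexing it may fail.

https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-dynamic-mapping.html

For example, plotly might be mapped as a string, but a document with a number or a date in that field could raise an error.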
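A quick local check for this kind of conflict is to look at which JSON types each field takes across a sample of documents. The sketch below is only a rough hint, not what Elasticsearch does internally: it compares Python type names of top-level fields, and `field_types` and `conflicting_fields` are hypothetical helper names.

```python
from collections import defaultdict


def field_types(documents):
    """Map each top-level field to the set of Python type names seen for it."""
    seen = defaultdict(set)
    for doc in documents:
        for field, value in doc.items():
            seen[field].add(type(value).__name__)
    return dict(seen)


def conflicting_fields(documents):
    """Return only the fields that appear with more than one type."""
    return {name: types for name, types in field_types(documents).items() if len(types) > 1}
```

A field reported here is worth a closer look, although Elasticsearch may still be able to coerce some mixed values depending on the mapping.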

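You also have nested documents (people); I would look into that too. Can you try a few sample documents, say the first 10, and see whether you can index them using the bulk API?

Or you could create your own mapping for each field, since each document doesn't seem to have many fields.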
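As a rough illustration of an explicit mapping for the sample document, the sketch below builds the request body as a Python dict. It rests on several assumptions: an Elasticsearch 2.x cluster (where the field type is `string` and mappings are keyed by a type name; later versions use `text`/`keyword` and drop mapping types), a caller-supplied type name, and `nested` for `people`, which is one possible choice rather than a requirement. `build_mapping` is a hypothetical helper.

```python
import json


def build_mapping(doc_type):
    """Return an index-creation body with explicit field types (ES 2.x style, assumed)."""
    person = {
        "id": {"type": "string"},
        "name": {"type": "string"},
        "title": {"type": "string"},
    }
    return {
        "mappings": {
            doc_type: {  # type name supplied by the caller; not taken from the question
                "properties": {
                    "company": {"type": "string"},
                    "title": {"type": "string"},
                    # "nested" is an assumption; a plain object mapping may also suit
                    "people": {"type": "nested", "properties": person},
                }
            }
        }
    }


# The JSON text that would be sent as the body when creating the index
mapping_json = json.dumps(build_mapping("type"), indent=2)
```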

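Answer 2 (score: 0)

You can get this error

"Malformed content, found extra data after parsing: START_OBJECT"

sent back by ElasticSearch if your URL does not contain /_bulk at the end.

Without it, ElasticSearch does not expect to find a newline and extra data after the last properly closed curly brace, so it fails on the extra data instead of treating it as another document. This is easy to hit when making the call via curl (libcurl), i.e. using

curl_easy_setopt(curl, CURLOPT_URL, str)

where str should be well formed, for example 'http://localhost:9200/_bulk' and not 'http://localhost:9200'.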
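The same check can be made in code before the request is sent. This is a small sketch, with `ensure_bulk_endpoint` as a hypothetical helper name, that appends /_bulk to a URL whose path does not already end with it.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_bulk_endpoint(url):
    """Return url with a path that ends in /_bulk, leaving other parts untouched."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith("/_bulk"):
        path += "/_bulk"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

For instance, `ensure_bulk_endpoint("http://localhost:9200")` gives `http://localhost:9200/_bulk`, and a URL that already ends in /_bulk is returned unchanged.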