mlcp不会在目录中加载大量文件

时间:2015-07-15 14:59:16

标签: marklogic mlcp

见以下编辑

我们使用MarkLogic Content Pump将数据加载到ML8数据库中。 我们有一个开发环境,其中一切正常,并且mlcp将无法通过评估要处理的文件数量。

我们要加载210万个JSON文档。

在开发服务器(ML8 + CentOS6)上,我们看到了这一点:

15/07/13 13:19:35 INFO contentpump.ContentPump: Hadoop library version: 2.0.0-alpha
15/07/13 13:19:35 INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of the  inserted documents will be determined by the MIME  type specification configured on MarkLogic Server.
15/07/13 13:19:35 WARN util.KerberosName: Kerberos krb5 configuration not found, setting default realm to empty
15/07/13 13:23:06 INFO input.FileInputFormat: Total input paths to process : 2147329
15/07/13 13:24:08 INFO contentpump.LocalJobRunner:  completed 0%
15/07/13 13:34:43 INFO contentpump.LocalJobRunner:  completed 1%
15/07/13 13:43:42 INFO contentpump.LocalJobRunner:  completed 2%
15/07/13 13:51:15 INFO contentpump.LocalJobRunner:  completed 3%

结束了,数据加载正常。

现在我们在不同的机器上使用相同的数据我们得到的prod服务器(ML8 + CentOS 7

15/07/14 17:02:21 INFO contentpump.ContentPump: Hadoop library version: 2.6.0
15/07/14 17:02:21 INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of the  inserted documents will be determined by the MIME  type specification configured on MarkLogic Server.

除了不同的操作系统,我们在de prod服务器2.6.0上还有一个更新版本的mlcp而不是2.0.0。如果我们使用相同的命令导入只有2000个文件的目录,它就可以在prod ...

上工作

在计算要处理的文件数时,作业会卡住......

可能是什么问题?

开始编辑 我们将mlcp放入DEBUG并使用一个小的samle.zip进行测试

结果:

[ashraf@77-72-150-125 ~]$ mlcp.sh import -host localhost -port 8140 -username ashraf -password duurz44m -input_file_path /home/ashraf/sample2.zip -input_compressed true  -mode local -output_uri_replace  "\".*,''\"" -output_uri_prefix incoming/linkedin/ -output_collections incoming,incoming/linkedin -output_permissions slush-dikw-node-role,read
15/07/16 16:36:31 DEBUG contentpump.ContentPump: Command: IMPORT
15/07/16 16:36:31 DEBUG contentpump.ContentPump: Arguments: -host localhost -port 8140 -username ashraf -password duurz44m -input_file_path /home/ashraf/sample2.zip -input_compressed true -mode local -output_uri_replace ".*,''" -output_uri_prefix incoming/linkedin/ -output_collections incoming,incoming/linkedin -output_permissions slush-dikw-node-role,read 
15/07/16 16:36:31 INFO contentpump.ContentPump: Hadoop library version: 2.6.0
15/07/16 16:36:31 DEBUG contentpump.ContentPump: Running in: localmode
15/07/16 16:36:31 INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of the  inserted documents will be determined by the MIME  type specification configured on MarkLogic Server.
15/07/16 16:36:32 DEBUG contentpump.LocalJobRunner: Thread pool size: 4
15/07/16 16:36:32 INFO input.FileInputFormat: Total input paths to process : 1
15/07/16 16:36:33 DEBUG contentpump.LocalJobRunner: Thread Count for Split#0 : 4
15/07/16 16:36:33 DEBUG contentpump.CompressedDocumentReader: Starting file:/home/ashraf/sample2.zip
15/07/16 16:36:33 DEBUG contentpump.MultithreadedMapper: Running with 4 threads
15/07/16 16:36:33 DEBUG mapreduce.ContentWriter: Connect to localhost
15/07/16 16:36:33 DEBUG mapreduce.ContentWriter: Connect to localhost
15/07/16 16:36:33 DEBUG mapreduce.ContentWriter: Connect to localhost
15/07/16 16:36:33 DEBUG mapreduce.ContentWriter: Connect to localhost
15/07/16 16:36:34 INFO contentpump.LocalJobRunner:  completed 0%
15/07/16 16:36:39 INFO contentpump.LocalJobRunner:  completed 100%
2015-07-16 16:39:11.483 WARNING [19] (AbstractRequestController.runRequest): Error parsing HTTP headers: Premature EOF, partial header line read: ''
15/07/16 16:39:12 DEBUG contentpump.CompressedDocumentReader: Closing file:/home/ashraf/sample2.zip
15/07/16 16:39:12 INFO contentpump.LocalJobRunner: com.marklogic.contentpump.ContentPumpStats: 
15/07/16 16:39:12 INFO contentpump.LocalJobRunner: ATTEMPTED_INPUT_RECORD_COUNT: 1993
15/07/16 16:39:12 INFO contentpump.LocalJobRunner: SKIPPED_INPUT_RECORD_COUNT: 0
15/07/16 16:39:12 INFO contentpump.LocalJobRunner: Total execution time: 160 sec

只有第一个json文件在数据库中,其余的是丢弃/丢失了吗?

是否存在JSON文件中的换行问题?

(AbstractRequestController.runRequest): Error parsing HTTP headers: Premature EOF, partial header line read: ''

任何提示都会很棒。

雨果

1 个答案:

答案 0 :(得分:1)

我无法确切地说出发生了什么。我认为支持会对此案感兴趣。你能给他们发一封包含更多详细信息的邮件(也许还有文件)。

作为一种解决方法:在开发服务器上使用与开发服务器相同的 MLCP 版本并不困难,只需将其放在另一个(或任何您喜欢的地方)旁边),并确保你引用那个(提示:在Roxy中你有mlcp-home设置)。

您也可以考虑压缩json文档并使用-input_compressed选项。

HTH!