Question

I am using html2text to parse a local .html file, and it works fine.

However, when I run it as a Hadoop Streaming job to parse the same file stored in HDFS:

hadoop jar /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/jars/hadoop-streaming-2.6.0-cdh5.8.0.jar -D mapreduce.job.reduces=0  -input /user/root/mapreduce/input2/xxx.html -output /user/root/mapreduce/output8  -mapper html2text.py

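The second part of the result on HDFS contains the expected output. However, the first part contains some elements that should have been removed, as shown below: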

if (document.URL.indexOf(&#39;tv.sohu.com&#39;) &lt;= 0) { delete this.rules[&#34;sohu&#34;]; } var handler = this.animationsHandler.bind(this);
  document.body.addEventListener(&#39;webkitAnimationStart&#39;, handler, false); document.body.addEventListener(&#39;msAnimationStart&#39;, handler,

My question is: why does this part not appear when html2text runs in local mode, and how can I remove it?

Why do I get this result when running through Hadoop?
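
For reference, a Hadoop Streaming mapper receives its input on standard input and writes its result to standard output. The following is only a minimal sketch of a mapper that reads everything it is given on stdin and drops the contents of `<script>` and `<style>` elements. It assumes Python 3 on the cluster nodes and uses the standard library's `html.parser` instead of html2text; the names `TextOnlyParser` and `strip_markup` are hypothetical and not part of html2text or Hadoop.

```python
import sys
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def strip_markup(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.text()


def main():
    # Hadoop Streaming pipes the mapper's input to stdin; read all of it
    # before parsing so that tags are not split across separate parses.
    html = sys.stdin.read()
    sys.stdout.write(strip_markup(html) + "\n")


if __name__ == "__main__":
    main()
```

A parser like this can only skip script text when it also sees the opening `<script>` tag. For example, `strip_markup("var x = 1;</script><p>Hi</p>")` keeps `var x = 1;` because the fragment starts in the middle of a script. Whether something similar happens in the job above is an assumption that would need checking, not a confirmed cause.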

0 answers: