Question

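I have a URL (http://someurl/test.zip). The zip file is about 56 MB. First of all, I don't want to fetch or parse files larger than 5 MB. When I try to fetch this URL, I get "Aborting with 50 hung threads". I am using a small crawl script with default values.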

Output:

-activeThreads=50, spinWaiting=49, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
-activeThreads=50, spinWaiting=49, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
-activeThreads=50, spinWaiting=49, fetchQueues.totalSize=1, fetchQueues.getQueueCount=1
Aborting with 50 hung threads.
Thread #0 hung while processing https://someurl/test.zip
Thread #1 hung while processing null
Thread #2 hung while processing null
Thread #3 hung while processing null
Thread #4 hung while processing null
Thread #5 hung while processing null
Thread #6 hung while processing null
Thread #7 hung while

I have set http.content.limit to 64 KB (65536 bytes).

nutch-site.xml:

<property>
    <name>http.content.limit</name>
    <value>65536</value>
</property>

How can I exclude URLs that point to large files? And why does the fetch abort with hung threads?
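
For context on the first question, one commonly used way to keep archives like this out of a crawl is a suffix rule in Nutch's conf/regex-urlfilter.txt. The fragment below is only a minimal sketch. It assumes the urlfilter-regex plugin is enabled in plugin.includes and that the exclusion line sits above the final catch-all accept rule; the extension list is illustrative. It filters by URL suffix rather than by actual file size, and it does not by itself explain the hung threads.

```
# Skip URLs ending in common archive suffixes (illustrative list, an assumption)
-\.(zip|ZIP|gz|GZ|tgz|TGZ)$

# Accept anything else
+.
```

Rules in this file are applied top to bottom and the first matching pattern decides, so the exclusion has to come before the `+.` line.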

Nutch fetcher aborting with hung threads

0 answers: