I found this thread: How do i exclude everything but text/html from a heritrix crawl?
I changed the bean to this:
  <property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
      <property name="decision" value="ACCEPT" />
      <property name="regex" value="^application/pdf.*"/>
    </bean>
  </property>
</bean>
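But Heritrix still saves every file to the mirror directory.
Answer 0 (score: 0)
I believe you are missing a reject rule above the accept rule. The following works for me: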
<property name="shouldProcessRule">
  <bean class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="rules">
      <list>
        <bean class="org.archive.modules.deciderules.RejectDecideRule">
        </bean>
        <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
          <property name="decision" value="ACCEPT" />
          <property name="regex" value="^application/pdf.*"/>
        </bean>
      </list>
    </property>
  </bean>
</property>
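This rejects everything first, then accepts whatever matches the rules listed after it. A DecideRuleSequence applies its rules in order and the last rule that makes a decision wins, so the reject rule has to come before the accept rule.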
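For context, here is a rough sketch of how the rule sequence might sit inside the complete writer bean. The bean id `mirrorWriter` and the `MirrorWriterProcessor` class are assumptions, inferred only from the "mirror directory" mentioned in the question; keep whatever id and class your own crawler-beans.cxml already uses.

```xml
<!-- Sketch only: the bean id and processor class below are assumptions,
     not taken from the question. Adapt them to your own configuration. -->
<bean id="mirrorWriter" class="org.archive.modules.writer.MirrorWriterProcessor">
  <property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.DecideRuleSequence">
      <property name="rules">
        <list>
          <!-- 1. Start by rejecting every URI. -->
          <bean class="org.archive.modules.deciderules.RejectDecideRule" />
          <!-- 2. Then switch the decision to ACCEPT for PDF responses only. -->
          <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
            <property name="decision" value="ACCEPT" />
            <property name="regex" value="^application/pdf.*" />
          </bean>
        </list>
      </property>
    </bean>
  </property>
</bean>
```

With this layout the processor should only run for URIs whose content type starts with application/pdf. To let other types through as well, it should be enough to widen the regex or append further accept rules after the reject rule.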