~/runtime/local/bin/urls/seed.txt>>
http://nutch.apache.org/
~/runtime/local/conf/nutch-site.xml>>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
<property>
<name>http.timeout</name>
<value>99999999</value>
<description></description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|
scoring-opic|urlnormalizer-(pass|regex|basic)|index-more
</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin.
</description>
</property>
</configuration>
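To check whether these overrides are actually picked up, I printed the effective configuration with a small throwaway class (a rough sketch; ShowConfig is just a name I made up, and it assumes the Nutch and Hadoop jars from runtime/local/lib plus the conf directory are on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class ShowConfig {
    public static void main(String[] args) {
        // NutchConfiguration.create() reads nutch-default.xml first and then
        // nutch-site.xml from the conf directory on the classpath, so the
        // values in the site file override the defaults.
        Configuration conf = NutchConfiguration.create();
        System.out.println("http.agent.name = " + conf.get("http.agent.name"));
        System.out.println("http.timeout    = " + conf.get("http.timeout"));
        System.out.println("plugin.includes = " + conf.get("plugin.includes"));
    }
}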
~/runtime/local/conf/regex-urlfilter.txt>>
# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)*
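As far as I understand, urlfilter-regex tries these rules from top to bottom and the first pattern that matches decides: "-" drops the URL, "+" keeps it. To see which rule my seed URL hits first, I reproduced that ordering with plain java.util.regex (a rough sketch; FilterCheck is just a name I made up, and the rule list is shortened from the file above):

import java.util.regex.Pattern;

public class FilterCheck {
    public static void main(String[] args) {
        // Shortened copy of my regex-urlfilter.txt rules, in file order.
        // The first matching pattern wins: "-" rejects, "+" accepts.
        String[][] rules = {
            {"-", "^(http|ftp|mailto):"},
            {"-", "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|zip|ZIP)$"},
            {"-", "[?*!@=]"},
            {"+", "^http://([a-z0-9\\-A-Z]*\\.)*nutch.apache.org/([a-z0-9\\-A-Z]*/)*"}
        };
        String url = "http://nutch.apache.org/";
        for (String[] rule : rules) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                System.out.println((rule[0].equals("+") ? "accepted by " : "rejected by ") + rule[1]);
                return;
            }
        }
        System.out.println("no rule matched, URL rejected");
    }
}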
When I run the crawl, this is what it says:
/home/apache-nutch-1.4-bin/runtime/local/bin
$ ./nutch crawl urls -dir newCrawl/ -depth 3 -topN 3
cygpath: can't convert empty path
solrUrl is not set, indexing will be skipped...
crawl started in: newCrawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 3
Injector: starting at 2014-07-18 11:35:36
Injector: crawlDb: newCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2014-07-18 11:35:39, elapsed: 00:00:02
Generator: starting at 2014-07-18 11:35:39
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl
No matter what URL I use, it always says there are no URLs to fetch. I have been trying to solve this for three days. Please help!