Crawling problem in Apache Nutch

Asked: 2014-04-25 14:35:37

Tags: nutch

I am working on a web project. My aim is to take a web page and decide whether or not it is a blog. To do this I have to use a crawler, and I am using Apache Nutch. I followed all the steps on the Apache Nutch Tutorial Page, but I failed. For the bin/nutch crawl urls -dir crawl -depth 3 -topN 50 command, my output is:

solrUrl is not set, indexing will be skipped...
2014-04-25 15:29:11.324 java[4405:1003] Unable to load realm info from SCDynamicStore
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 50
Injector: starting at 2014-04-25 15:29:11
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2014-04-25 15:29:13, elapsed: 00:00:02
Generator: starting at 2014-04-25 15:29:13
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl

My urls/seed.txt file is:

http://yahoo.com/

My regex-urlfilter.txt file is:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http://([a-z0-9]*\.)*yahoo.com/
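
For reference, whether the seed survives these filters can be checked from the command line (a sketch, assuming the stock URLFilterChecker tool in Nutch 1.x, which reads URLs from stdin and prefixes accepted ones with + and rejected ones with -):

# run the seed URL through all configured URL filters
echo "http://yahoo.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined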

My nutch-site.xml file is:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>

    <property>
        <name>http.agent.name</name>
        <value>Baris Spider</value>
        <description>This crawler is used to fetch text documents from
            web pages that will be used as a corpus for Part-of-speech-tagging
        </description>
    </property>

</configuration>
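
Since the log shows one URL injected but zero selected for fetching, the crawl database itself can be inspected (a sketch using Nutch's readdb tool with the crawl/crawldb path from the log above):

# print aggregate status counts for the crawldb;
# db_unfetched should be at least 1 for the generator to select anything
bin/nutch readdb crawl/crawldb -stats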

What is the problem?

1 Answer:

Answer 0 (score: 1)

bin/nutch crawl urls -dir crawl -depth 3 is deprecated; use bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds> instead.

For example: bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/your_core 2 (the trailing 2 is the number of crawl rounds, which the script expects as its fourth argument).

Reading your log, you did not provide a Solr URL (Nutch needs it to send the crawled documents for indexing). So before running these commands, you must download and start Solr; follow this Apache Solr link to read how to set up and install Solr.
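
With recent standalone Solr releases, that is roughly the following (a minimal sketch; the exact commands vary between Solr versions, and Nutch's conf/schema.xml still has to be installed into the core's configuration as described in the Nutch tutorial):

# unpack the Solr download and start it on the default port 8983
tar xzf solr-<version>.tgz
cd solr-<version>
bin/solr start

# create the core referenced in the crawl command above
bin/solr create -c your_core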
