I have successfully followed the tutorial at the URL below up to the step "Invertlinks":
https://wiki.apache.org/nutch/NutchTutorial#Crawl_your_first_website
but I am not getting any data out of it.
I am new to this technology,
so if anyone has done this successfully, please share the steps / demo / site / example. And please do not give rough steps.
Answer 0 (score: 0)
First install Nutch, then set a crawler agent name by adding this property:
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
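This property usually goes in conf/nutch-site.xml, which overrides the defaults in conf/nutch-default.xml; a minimal sketch of that file, assuming a fresh install:

<?xml version="1.0"?>
<configuration>
  <!-- identifies your crawler to the sites it fetches; Nutch refuses to run without it -->
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>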
Under your nutch-default.xml, add:
<property>
  <name>http.robot.rules.whitelist</name>
  <value>nihilent.com</value>
  <description>Comma separated list of hostnames or IP addresses to ignore
  robot rules parsing for. Use with care and only if you are explicitly
  allowed by the site owner to ignore the site's robots.txt!
  </description>
</property>
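With the whitelist in place, a quick way to confirm that Nutch can fetch and parse the site is the parsechecker tool (it also appears in the second answer below):

bin/nutch parsechecker -dumpText http://nihilent.com/

If the fetch or parse fails, this command reports the error directly instead of dumping the parsed text.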
In conf/regex-urlfilter.txt, add:
# accept anything else
+.
+^http://([a-z0-9]*\.)*nihilent.com/
and comment out:
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
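After these edits, the relevant part of conf/regex-urlfilter.txt should look roughly like this (a sketch; nihilent.com is just this answer's example host):

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# accept anything else
+^http://([a-z0-9]*\.)*nihilent.com/
+.

Note that the filter applies the first rule that matches a URL, so keeping the catch-all +. means every URL is accepted anyway; to restrict the crawl to nihilent.com alone, remove the +. line.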
Then prepare a seed list and run the following commands.
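The inject command below reads seed URLs from a directory of text files; a minimal setup, assuming http://nihilent.com/ as the target site (matching the config above) and urls as the seed directory name:

mkdir -p urls
echo "http://nihilent.com/" > urls/seed.txt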
bin/nutch inject crawl/crawldb dmoz   # only needed if you also have a dmoz seed directory
bin/nutch inject crawl/crawldb urls   # inject the seed URLs created above
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
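A single generate/fetch/parse/updatedb pass only covers the seed URLs; a sketch of a few extra rounds so that newly discovered links get fetched as well (the -topN limit is an arbitrary example), after which the invertlinks step above can be re-run:

for round in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000   # pick the next batch of URLs
  s1=`ls -d crawl/segments/2* | tail -1`                       # newest segment
  bin/nutch fetch $s1
  bin/nutch parse $s1
  bin/nutch updatedb crawl/crawldb $s1                         # feed discovered links back into the crawldb
done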
Now check the data in the crawl/crawldb folder and the others to confirm it succeeded.
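A quick way to confirm the crawldb actually contains fetched entries is the statistics dump:

bin/nutch readdb crawl/crawldb -stats

It prints the total number of URLs and a breakdown by status (db_unfetched, db_fetched, and so on); if everything is still db_unfetched, the fetch step did not work.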
Answer 1 (score: 0)
Below are some commands that can help you work with Nutch in various ways:
bin/nutch inject crawl/crawldb dmoz
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# dump segments in the CommonCrawl JSON format (here straight to HDFS)
bin/nutch commoncrawldump -outputDir hdfs://localhost:9000/dfs -segment /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ -jsonArray -reverseKey -SimpleDateFormat -epochFilename
# dump segment contents as plain text
bin/nutch readseg -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/1
# print the stored records for a single URL from a segment
bin/nutch readseg -get /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments http://1465212304000.html -nofetch -nogenerate -noparse -noparsedata -noparsetext
# fetch and parse a single URL to check the parser output
bin/nutch parsechecker -dumpText http://nihilent.com/
# dump the link database (inlinks per URL)
bin/nutch readlinkdb /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/linkdb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn/3
# dump the crawldb to a local directory
bin/nutch readdb crawl/crawldb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn
# dump the crawldb directly to HDFS
bin/nutch readdb crawl/crawldb -dump hdfs://localhost:9000/dfs
hadoop fs -copyFromLocal /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/data/commoncrawl/com hdfs://localhost:9000/dfs
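To confirm the copy landed on HDFS, list the target directory (assuming the same NameNode address as above):

hadoop fs -ls hdfs://localhost:9000/dfs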
Added as a new answer to avoid sandwiching the data.