I have just started using Nutch 1.11 and Solr 5.3.1. I want to crawl data with Nutch, then index it and make it searchable with Solr.
I know how to crawl data from the web using Nutch's bin/crawl command, and I successfully fetched a large amount of data from my local website. I also started a new local Solr server using the bin/solr start command under the Solr root folder, and created the example files core under the example folder with the following command:

bin/solr create -c files -d example/files/conf

I can log in to the admin URL http://localhost:8983/solr/#/files and manage the files core, so I believe Solr is started correctly.
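As a quick sanity check (this curl query is only an illustration, not part of the original setup), an empty search against the core should come back with a normal response rather than an error page:

curl "http://localhost:8983/solr/files/select?q=*:*&rows=0&wt=json"

If the core is healthy, this returns a JSON body with a numFound count.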
I then started posting the Nutch data into Solr using Nutch's bin/nutch index command:
bin/nutch index crawl/crawldb \
-linkdb crawl/linkdb \
-params solr.server.url=127.0.0.1:8983/solr/files \
-dir crawl/segments
I was hoping that Solr 5's new auto-schema feature would save me the trouble of writing a schema myself. However, I got the following error (copied from the log file):

WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s1.
INFO segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s2.
INFO segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s3.
INFO indexer.IndexingJob - Indexer: starting at 2015-12-14 15:21:39
INFO indexer.IndexingJob - Indexer: deleting gone documents: false
INFO indexer.IndexingJob - Indexer: URL filtering: false
INFO indexer.IndexingJob - Indexer: URL normalizing: false
INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
INFO indexer.IndexingJob - Active IndexWriters :
SolrIndexWriter
solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
solr.server.url : URL of the Solr instance (mandatory)
solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.commit.size : buffer size when sending to Solr (default 1000)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s1
INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s2
INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s3
WARN conf.Configuration - file:/tmp/hadoop-user/mapred/staging/user117437667/.staging/job_local117437667_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
WARN conf.Configuration - file:/tmp/hadoop-user/mapred/staging/user117437667/.staging/job_local117437667_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
WARN conf.Configuration - file:/tmp/hadoop-user/mapred/local/localRunner/user/job_local117437667_0001/job_local117437667_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
WARN conf.Configuration - file:/tmp/hadoop-user/mapred/local/localRunner/user/job_local117437667_0001/job_local117437667_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
INFO solr.SolrMappingReader - source: content dest: content
INFO solr.SolrMappingReader - source: title dest: title
INFO solr.SolrMappingReader - source: host dest: host
INFO solr.SolrMappingReader - source: segment dest: segment
INFO solr.SolrMappingReader - source: boost dest: boost
INFO solr.SolrMappingReader - source: digest dest: digest
INFO solr.SolrMappingReader - source: tstamp dest: tstamp
INFO solr.SolrIndexWriter - Indexing 250 documents
INFO solr.SolrIndexWriter - Deleting 0 documents
INFO solr.SolrIndexWriter - Indexing 250 documents
WARN mapred.LocalJobRunner - job_local117437667_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre> Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>
</body>
</html>
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre> Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>
</body>
</html>
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:134)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:356)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
I remember that this org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. is usually related to the Solr URL, but I double-checked the URL I am using (127.0.0.1:8983/solr/files) and I believe it is correct. Does anyone know what the problem is? I have searched the web and found nothing useful so far.
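One detail that stands out: the 404 page in the stack trace complains about /solr/update, i.e. a path without the files core name. To see which update endpoint actually answers, something like the following can be used (these curl probes are only an illustration, assuming Solr's default handler paths):

curl -s -o /dev/null -w "%{http_code}\n" "http://127.0.0.1:8983/solr/update"        # 404: no core name in the path
curl -s -o /dev/null -w "%{http_code}\n" "http://127.0.0.1:8983/solr/files/update"  # 200 if the files core is reachable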
Note: I also tried disabling the auto-schema feature of Solr 5 by modifying examples/files/conf/solrconfig.xml and replacing examples/files/conf/managed-schema.xml with conf/schema.xml, and I still get the same error.
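For reference, a minimal sketch of the solrconfig.xml side of that change, assuming the stock Solr 5.x data-driven configuration (the managed-schema file also has to be replaced by a plain schema.xml for the classic factory to find it):

<!-- In examples/files/conf/solrconfig.xml: swap the managed schema factory -->
<!-- <schemaFactory class="ManagedIndexSchemaFactory"> ... </schemaFactory> -->
<schemaFactory class="ClassicIndexSchemaFactory"/>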
Update: After trying the DEPRECATED command bin/nutch solrindex (thanks to Thangaperumal), the previous error is gone, but another error came up.

Error message:
Answer 0 (score: 0)
Instead, try this statement to integrate Solr and Nutch:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
Answer 1 (score: 0)
Have you tried specifying the Solr URL using
-D solr.server.url=http://localhost:8983/solr/files

instead of the -params approach? At least that is the correct syntax for the crawl script. Both end up invoking the same underlying Java class to do the actual work.
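For example, a minimal sketch (assuming the Nutch 1.11 crawl script usage crawl [-i|--index] [-D "key=value"] <seedDir> <crawlDir> <numRounds>, with a hypothetical urls/ seed directory and crawl/ output directory):

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/files urls/ crawl/ 2

The same -D generic option should also be accepted by bin/nutch index, since the stack trace above shows IndexingJob being run through Hadoop's ToolRunner, which parses -D key=value pairs before the tool-specific arguments:

bin/nutch index -D solr.server.url=http://127.0.0.1:8983/solr/files crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments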