August 2017 – I heard that you have a great story about…

Even after running the following command, there are no indexes in Elasticsearch.

$ bin/nutch index elasticsearch -all

logs/hadoop.log shows the following. At first glance, the indexing job seems to have completed without any issues.

2017-08-18 11:29:59,542 INFO elasticsearch.plugins – [Behemoth] loaded [], sites []
2017-08-18 11:29:59,542 INFO elasticsearch.plugins – [Behemoth] loaded [], sites []
2017-08-18 11:29:59,564 INFO client.transport – [Behemoth] failed to get node info for [#transport#-1][BAS2019][inet[localhost/127.0.0.1:9300]], disconnecting…
org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected
2017-08-18 11:29:59,565 INFO indexer.IndexingJob – IndexingJob: done.
2017-08-18 11:32:23,894 INFO indexer.IndexingJob – IndexingJob: starting
2017-08-18 11:32:24,048 WARN util.NativeCodeLoader – Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
2017-08-18 11:32:24,123 INFO basic.BasicIndexingFilter – Maximum title length for indexing set to: 100
2017-08-18 11:32:24,123 INFO indexer.IndexingFilters – Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2017-08-18 11:32:24,125 INFO anchor.AnchorIndexingFilter – Anchor deduplication is: off
2017-08-18 11:32:24,125 INFO indexer.IndexingFilters – Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2017-08-18 11:32:24,129 INFO indexer.IndexingFilters – Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2017-08-18 11:32:24,319 INFO indexer.IndexingFilters – Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2017-08-18 11:32:25,099 WARN conf.Configuration – file:/tmp/hadoop-xxx/mapred/staging/xxx486819994/.staging/job_local486819994_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2017-08-18 11:32:25,101 WARN conf.Configuration – file:/tmp/hadoop-xxx/mapred/staging/xxx486819994/.staging/job_local486819994_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2017-08-18 11:32:25,161 WARN conf.Configuration – file:/tmp/hadoop-xxx/mapred/local/localRunner/xxx/job_local486819994_0001/job_local486819994_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2017-08-18 11:32:25,162 WARN conf.Configuration – file:/tmp/hadoop-xxx/mapred/local/localRunner/xxx/job_local486819994_0001/job_local486819994_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2017-08-18 11:32:25,236 INFO indexer.IndexWriters – Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2017-08-18 11:32:25,348 INFO elasticsearch.plugins – [Lin Sun] loaded [], sites []
2017-08-18 11:32:25,956 INFO client.transport – [Lin Sun] failed to get node info for [#transport#-1][BAS2019][inet[localhost/127.0.0.1:9300]], disconnecting…
org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected
2017-08-18 11:32:25,962 INFO basic.BasicIndexingFilter – Maximum title length for indexing set to: 100
2017-08-18 11:32:25,962 INFO indexer.IndexingFilters – Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2017-08-18 11:32:25,962 INFO anchor.AnchorIndexingFilter – Anchor deduplication is: off
2017-08-18 11:32:25,962 INFO indexer.IndexingFilters – Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2017-08-18 11:32:25,962 INFO indexer.IndexingFilters – Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2017-08-18 11:32:25,963 INFO indexer.IndexingFilters – Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2017-08-18 11:32:25,992 INFO elastic.ElasticIndexWriter – Processing remaining requests [docs = 0, length = 0, total docs = 0]
2017-08-18 11:32:25,992 INFO elastic.ElasticIndexWriter – Processing to finalize last execute
2017-08-18 11:32:26,190 INFO indexer.IndexWriters – Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2017-08-18 11:32:26,190 INFO indexer.IndexingJob – Active IndexWriters :
ElasticIndexWriter
	elastic.cluster : elastic prefix cluster
	elastic.host : hostname
	elastic.port : port (default 9300)
	elastic.index : elastic index command
	elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
	elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

2017-08-18 11:32:26,201 INFO elasticsearch.plugins – [Cloud 9] loaded [], sites []
2017-08-18 11:32:26,221 INFO client.transport – [Cloud 9] failed to get node info for [#transport#-1][BAS2019][inet[localhost/127.0.0.1:9300]], disconnecting…
org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected
2017-08-18 11:32:26,222 INFO indexer.IndexingJob – IndexingJob: done.
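
The log above ends with "IndexingJob: done.", yet it also contains a NodeDisconnectedException logged at INFO level and a flush that reports "total docs = 0". Here is a minimal sketch of how such lines could be picked out of logs/hadoop.log automatically. The function name find_warning_signs and the two patterns are my own assumptions based on this one log excerpt, not an official list of Nutch failure markers.

```python
import re

# Patterns chosen by reading the log excerpt in this post; they are
# assumptions, not an official list of Nutch or Elasticsearch markers.
EXCEPTION = re.compile(r"\b[\w.]+Exception\b")
TOTAL_DOCS = re.compile(r"total docs = (\d+)")


def find_warning_signs(log_text):
    """Return (line_number, reason) pairs for suspicious log lines."""
    findings = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        match = EXCEPTION.search(line)
        if match:
            findings.append((number, "exception: " + match.group(0)))
        docs = TOTAL_DOCS.search(line)
        if docs and int(docs.group(1)) == 0:
            findings.append((number, "writer reported total docs = 0"))
    return findings


# Three lines copied from the log in this post.
sample = "\n".join([
    "2017-08-18 11:32:25,348 INFO elasticsearch.plugins – [Lin Sun] loaded [], sites []",
    "org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected",
    "2017-08-18 11:32:25,992 INFO elastic.ElasticIndexWriter – Processing remaining requests [docs = 0, length = 0, total docs = 0]",
])

for number, reason in find_warning_signs(sample):
    print(number, reason)
```

To scan a real file, read it first (for example with open("logs/hadoop.log", encoding="utf-8").read()) and pass the text to the function.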

Then I checked Elasticsearch through Kibana, but it showed no indexes created by Nutch. This was a little concerning, since I didn’t know where things were breaking. I finally found the reason in the Elasticsearch log (under /usr/local/var/log/elasticsearch).

It says,

java.lang.IllegalStateException: Received message from unsupported version: [1.0.0] minimal compatible version is: [5.0.0]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1379) ~[elasticsearch-5.5.1.jar:5.5.1]
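
The message carries two version numbers: the one the server received from the client and the minimum it accepts. As a rough sketch, they can be pulled out and compared like this. The regular expression is written against this single message, so the assumption is that other Elasticsearch releases word it the same way; parse_version_mismatch is a name I made up.

```python
import re

# Matches the wording of the exception message quoted in this post.
MISMATCH = re.compile(
    r"unsupported version: \[(\d+)\.(\d+)\.(\d+)\] "
    r"minimal compatible version is: \[(\d+)\.(\d+)\.(\d+)\]"
)


def parse_version_mismatch(line):
    """Return (client_version, minimum_version) as tuples, or None."""
    match = MISMATCH.search(line)
    if match is None:
        return None
    numbers = tuple(int(part) for part in match.groups())
    return numbers[:3], numbers[3:]


line = ("java.lang.IllegalStateException: Received message from unsupported "
        "version: [1.0.0] minimal compatible version is: [5.0.0]")
client, minimum = parse_version_mismatch(line)

# Tuples compare element by element, so (1, 0, 0) < (5, 0, 0).
print("client", client, "minimum", minimum, "too old:", client < minimum)
```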

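It was clearly a version compatibility issue: the message sent by Nutch's Elasticsearch client carries version 1.0.0, while the Elasticsearch 5.5.1 server accepts only 5.0.0 or later. The issue has been addressed in the following JIRA ticket, but there is no mention of it on the Apache Nutch project website.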

https://issues.apache.org/jira/browse/NUTCH-2323

I think Nutch works great, though. As I write this, its indexer supports only Elasticsearch 1.x and 2.x. That looks like a deal breaker for us, since Elasticsearch is our main search engine.
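
Given that limitation, a pre-flight check before running the index command would have saved me some time. The sketch below reads the server version from the Elasticsearch root endpoint and compares its major number with the versions mentioned above. The URL http://localhost:9200 is an assumption (the default HTTP port, not the 9300 transport port seen in the Nutch log), and the function names are made up for illustration.

```python
import json
from urllib.request import urlopen

# Major versions the Nutch indexer supported at the time of writing,
# according to this post.
SUPPORTED_MAJORS = {1, 2}


def major_version(version_number):
    """Return the leading number of a version string such as '5.5.1'."""
    return int(version_number.split(".")[0])


def is_supported(version_number, supported=SUPPORTED_MAJORS):
    """True if the server's major version is one the indexer can talk to."""
    return major_version(version_number) in supported


def fetch_server_version(base_url="http://localhost:9200"):
    """Read version.number from the Elasticsearch root endpoint.

    The default URL is an assumption; adjust it to your own setup.
    """
    with urlopen(base_url, timeout=5) as response:
        info = json.load(response)
    return info["version"]["number"]


# The server in this post was 5.5.1, which is outside the supported set.
print(is_supported("5.5.1"))
```

Against a running server, is_supported(fetch_server_version()) performs the same check with the live version number.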

I’m now looking into StormCrawler as an alternative.
