Error in Apache Nutch Indexing for Elasticsearch

Even after running the following command, there are no indexes in Elasticsearch.

$ bin/nutch index elasticsearch -all

logs/hadoop.log shows the following. Everything is logged at INFO or WARN level, so it looks as if the indexing job completed without any issues.

2017-08-18 11:29:59,542 INFO  elasticsearch.plugins – [Behemoth] loaded [], sites []
2017-08-18 11:29:59,542 INFO  elasticsearch.plugins – [Behemoth] loaded [], sites []
2017-08-18 11:29:59,564 INFO  client.transport – [Behemoth] failed to get node info for [#transport#-1][BAS2019][inet[localhost/127.0.0.1:9300]], disconnecting…
org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected
2017-08-18 11:29:59,565 INFO  indexer.IndexingJob – IndexingJob: done.
2017-08-18 11:32:23,894 INFO  indexer.IndexingJob – IndexingJob: starting
2017-08-18 11:32:24,048 WARN  util.NativeCodeLoader – Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
2017-08-18 11:32:24,123 INFO  basic.BasicIndexingFilter – Maximum title length for indexing set to: 100
2017-08-18 11:32:24,123 INFO  indexer.IndexingFilters – Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2017-08-18 11:32:24,125 INFO  anchor.AnchorIndexingFilter – Anchor deduplication is: off
2017-08-18 11:32:24,125 INFO  indexer.IndexingFilters – Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2017-08-18 11:32:24,129 INFO  indexer.IndexingFilters – Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2017-08-18 11:32:24,319 INFO  indexer.IndexingFilters – Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2017-08-18 11:32:25,099 WARN  conf.Configuration – file:/tmp/hadoop-xxx/mapred/staging/xxx486819994/.staging/job_local486819994_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2017-08-18 11:32:25,101 WARN  conf.Configuration – file:/tmp/hadoop-xxx/mapred/staging/xxx486819994/.staging/job_local486819994_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2017-08-18 11:32:25,161 WARN  conf.Configuration – file:/tmp/hadoop-xxx/mapred/local/localRunner/xxx/job_local486819994_0001/job_local486819994_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2017-08-18 11:32:25,162 WARN  conf.Configuration – file:/tmp/hadoop-xxx/mapred/local/localRunner/xxx/job_local486819994_0001/job_local486819994_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2017-08-18 11:32:25,236 INFO  indexer.IndexWriters – Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2017-08-18 11:32:25,348 INFO  elasticsearch.plugins – [Lin Sun] loaded [], sites []
2017-08-18 11:32:25,956 INFO  client.transport – [Lin Sun] failed to get node info for [#transport#-1][BAS2019][inet[localhost/127.0.0.1:9300]], disconnecting…
org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected
2017-08-18 11:32:25,962 INFO  basic.BasicIndexingFilter – Maximum title length for indexing set to: 100
2017-08-18 11:32:25,962 INFO  indexer.IndexingFilters – Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2017-08-18 11:32:25,962 INFO  anchor.AnchorIndexingFilter – Anchor deduplication is: off
2017-08-18 11:32:25,962 INFO  indexer.IndexingFilters – Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2017-08-18 11:32:25,962 INFO  indexer.IndexingFilters – Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2017-08-18 11:32:25,963 INFO  indexer.IndexingFilters – Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2017-08-18 11:32:25,992 INFO  elastic.ElasticIndexWriter – Processing remaining requests [docs = 0, length = 0, total docs = 0]
2017-08-18 11:32:25,992 INFO  elastic.ElasticIndexWriter – Processing to finalize last execute
2017-08-18 11:32:26,190 INFO  indexer.IndexWriters – Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2017-08-18 11:32:26,190 INFO  indexer.IndexingJob – Active IndexWriters :
ElasticIndexWriter
    elastic.cluster : elastic prefix cluster
    elastic.host : hostname
    elastic.port : port  (default 9300)
    elastic.index : elastic index command
    elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
    elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

2017-08-18 11:32:26,201 INFO  elasticsearch.plugins – [Cloud 9] loaded [], sites []
2017-08-18 11:32:26,221 INFO  client.transport – [Cloud 9] failed to get node info for [#transport#-1][BAS2019][inet[localhost/127.0.0.1:9300]], disconnecting…
org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected
2017-08-18 11:32:26,222 INFO  indexer.IndexingJob – IndexingJob: done.
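
In hindsight, the log does contain hints: the NodeDisconnectedException lines and the "total docs = 0" line are just easy to miss because they are logged at INFO level. A minimal sketch for pulling them out of the log might look like the following. The function name scan_indexer_log is my own (it is not part of Nutch), and the three search strings are simply copied from the log above.

```shell
#!/bin/sh
# scan_indexer_log is a hypothetical helper, not a Nutch command.
# It prints the log lines that suggest the indexer never reached Elasticsearch,
# even though the job itself ends with "IndexingJob: done."
scan_indexer_log() {
    log_file="$1"
    # -F treats each pattern as a fixed string, so the brackets need no escaping.
    grep -F \
        -e 'NodeDisconnectedException' \
        -e 'failed to get node info' \
        -e 'total docs = 0]' \
        "$log_file"
}

# Usage (log path as used in this post):
#   scan_indexer_log logs/hadoop.log
```

If this prints anything, the job most likely finished without sending a single document, whatever the final "done" line says.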

Then I looked at Elasticsearch through Kibana, but it couldn’t find any indexes created by Nutch. This was a little concerning, since I didn’t know where things were breaking. I finally found the reason in the Elasticsearch log (under /usr/local/var/log/elasticsearch).

It says,

java.lang.IllegalStateException: Received message from unsupported version: [1.0.0] minimal compatible version is: [5.0.0]
    at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1379) ~[elasticsearch-5.5.1.jar:5.5.1]

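It was obviously a version compatibility issue: the Elasticsearch client used by Nutch identifies itself as version 1.0.0, while the server (5.5.1) accepts nothing older than 5.0.0. The issue is tracked in the ticket below, but there is no mention of it on the Apache Nutch project website.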

https://issues.apache.org/jira/browse/NUTCH-2323

I think Nutch works great, though. At the time of writing, however, its indexer only supports Elasticsearch 1.x and 2.x. That is a deal breaker for us, since Elasticsearch is our main search engine.
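
A check along these lines, run before indexing, would have surfaced the mismatch much earlier. This is only a sketch: the function names es_major_version and nutch_indexer_supports are made up for illustration, the supported majors (1 and 2) are taken from the statement above, and the usage comment assumes Elasticsearch answers on its default HTTP port 9200, which may not match your setup.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Nutch or Elasticsearch.

# Read the JSON that Elasticsearch returns at its HTTP root from stdin
# and print the major part of "version.number" (e.g. 5 for "5.5.1").
es_major_version() {
    sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Succeed only for the major versions the Nutch indexer is said to support here.
nutch_indexer_supports() {
    case "$1" in
        1|2) return 0 ;;
        *)   return 1 ;;
    esac
}

# Usage (assumes the default HTTP port 9200 on localhost):
#   major=$(curl -s http://localhost:9200 | es_major_version)
#   nutch_indexer_supports "$major" \
#       || echo "Elasticsearch $major.x is not supported by this Nutch indexer" >&2
```

The version is parsed with sed to avoid depending on extra tools; if jq happens to be installed, reading the version field with it would be more robust.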

I’m looking into StormCrawler.

Apache Nutch inject URLs

I am trying to crawl a website locally and collect all the seed/urls. I also want to dump all the HTML content into Elasticsearch. I am still stuck on injecting URLs into the CrawlDB, and I don’t know where it goes wrong.

$ bin/nutch inject seed/urls.txt
InjectorJob: starting at 2017-08-17 19:56:15
InjectorJob: Injecting urlDir: seed/urls.txt

It doesn’t proceed from here.