Setting up StormCrawler

Install Apache Maven

$ brew install maven

Install Zookeeper

$ brew install zookeeper

  • After the installation is complete, it will display the following message:

To have launchd start zookeeper at login:
$ ln -sfv /usr/local/opt/zookeeper/*.plist ~/Library/LaunchAgents

  • Start Zookeeper:

$ launchctl load ~/Library/LaunchAgents/homebrew.mxcl.zookeeper.plist

NOTE: If you don’t want/need launchctl, you can just run:
$ zkServer start
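
Either way, you can verify that ZooKeeper is actually up before moving on (the second command assumes the default client port 2181):

$ zkServer status
$ echo ruok | nc localhost 2181   # a healthy server replies "imok"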

Install ZeroMQ

$ brew install zeromq

Install Apache Storm

$ brew install storm

NOTE: The above command installs everything under the Homebrew Cellar folder and creates a symlink at /usr/local/opt/storm.

Setup Apache Storm and its components

  • Edit the Storm config file storm.yaml (located in the /usr/local/opt/storm/libexec/conf folder) and add the following lines:

storm.zookeeper.servers:
  - "localhost"
  # - "server2"
nimbus.host: "localhost"
nimbus.thrift.port: 6627
ui.port: 8772
storm.local.dir: "/Users/gmohr/storm/data"
java.library.path: "/usr/lib/jvm"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
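
Before starting anything, create the storm.local.dir directory referenced above (the path is just mine; point it somewhere under your own home directory):

$ mkdir -p /Users/gmohr/storm/data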

  • Make sure that whatever folders you create have the right permissions. I initially set storm.local.dir to "/home/user/storm/data" and did not realize that you cannot create directories under "/home" on macOS; this was causing issues when I started Nimbus and the Supervisor.
  • Start ZooKeeper, Storm Nimbus, the Supervisor, and the UI. The commands below show where the scripts live; run them in this order:

$ zkServer start
$ /usr/local/opt/storm/libexec/bin/storm nimbus
$ /usr/local/opt/storm/libexec/bin/storm supervisor
$ /usr/local/opt/storm/libexec/bin/storm ui

  • Check that everything is running smoothly by running the following command:

$ jps
NOTE: The output should look something like this (the PIDs will differ):
5282 supervisor
5267 nimbus
5460 core
5735 Jps
4235 QuorumPeerMain
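
You can also load the Storm UI in a browser on the ui.port configured above (8772 in my config):

$ open http://localhost:8772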

Create a StormCrawler project

  • First, run the following Maven command to generate a StormCrawler project:

$ mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.5.1
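
The archetype will prompt for your new project's groupId, artifactId, and version; you can also pass them on the command line to skip the prompts (the coordinates below are just examples, pick your own):

$ mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.5.1 -DgroupId=com.example.crawler -DartifactId=stormcrawler -Dversion=0.1 -DinteractiveMode=false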

  • Delete the following files under the StormCrawler project folder
    • src/main/java folder
    • crawler.flux file
  • Go back to the StormCrawler source folder and copy the following into your project:
    • kibana folder
    • es-conf.yaml
    • es-crawler.flux
    • ES_IndexInit.sh
    • es-injector.flux
    • README.md
    • NOTE: Also add a seeds.txt file containing the URLs to crawl (one per line; see the example after the packaging step)
  • Create a package

$ mvn clean package
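
The seeds.txt mentioned above is just a plain-text list of start URLs, one per line; for example (these URLs are placeholders):

$ cat seeds.txt
https://example.com/
https://example.org/news/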

Start crawling process

  • Inject the seed URLs so the crawler has something to fetch:

$ storm jar target/stormcrawler-0.1.jar org.apache.storm.flux.Flux --local es-injector.flux --sleep 30000

  • Begin crawling

$ storm jar target/stormcrawler-0.1.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 30000

  • After testing the crawl locally, submit the topology to Nimbus. This creates an entry in the Topology Summary section of the Storm UI dashboard, which you can use to check the crawler topology's status:

$ storm jar target/stormcrawler-0.1.jar org.apache.storm.flux.Flux --remote es-crawler.flux

[Screenshot: Storm UI dashboard with the topology summary (storm_ui.png)]

  • NOTE: For debugging, the log files are under $STORM_HOME/libexec/logs.
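
For example, to follow the logs while a topology is running (the exact file names vary by Storm version):

$ tail -f /usr/local/opt/storm/libexec/logs/*.log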

Screenshot of Elasticsearch index in Kibana

[Screenshot: Elasticsearch index in Kibana (Stormcrawler.png)]

Error in Apache Nutch Indexing for Elasticsearch

Even after running the following command, there are no indexes in Elasticsearch.

$ bin/nutch index elasticsearch -all

The logs/hadoop.log shows the following. It looks as if the indexing job completed without any issues:

2017-08-18 11:29:59,542 INFO  elasticsearch.plugins - [Behemoth] loaded [], sites []
2017-08-18 11:29:59,564 INFO  client.transport - [Behemoth] failed to get node info for [#transport#-1][BAS2019][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected
2017-08-18 11:29:59,565 INFO  indexer.IndexingJob - IndexingJob: done.
2017-08-18 11:32:23,894 INFO  indexer.IndexingJob - IndexingJob: starting
2017-08-18 11:32:24,048 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-08-18 11:32:24,123 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2017-08-18 11:32:24,123 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2017-08-18 11:32:24,125 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2017-08-18 11:32:24,125 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2017-08-18 11:32:24,129 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2017-08-18 11:32:24,319 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2017-08-18 11:32:25,099 WARN  conf.Configuration - file:/tmp/hadoop-xxx/mapred/staging/xxx486819994/.staging/job_local486819994_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2017-08-18 11:32:25,101 WARN  conf.Configuration - file:/tmp/hadoop-xxx/mapred/staging/xxx486819994/.staging/job_local486819994_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2017-08-18 11:32:25,161 WARN  conf.Configuration - file:/tmp/hadoop-xxx/mapred/local/localRunner/xxx/job_local486819994_0001/job_local486819994_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2017-08-18 11:32:25,162 WARN  conf.Configuration - file:/tmp/hadoop-xxx/mapred/local/localRunner/xxx/job_local486819994_0001/job_local486819994_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2017-08-18 11:32:25,236 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2017-08-18 11:32:25,348 INFO  elasticsearch.plugins - [Lin Sun] loaded [], sites []
2017-08-18 11:32:25,956 INFO  client.transport - [Lin Sun] failed to get node info for [#transport#-1][BAS2019][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected
2017-08-18 11:32:25,962 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2017-08-18 11:32:25,962 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2017-08-18 11:32:25,962 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2017-08-18 11:32:25,962 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2017-08-18 11:32:25,962 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2017-08-18 11:32:25,963 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2017-08-18 11:32:25,992 INFO  elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
2017-08-18 11:32:25,992 INFO  elastic.ElasticIndexWriter - Processing to finalize last execute
2017-08-18 11:32:26,190 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2017-08-18 11:32:26,190 INFO  indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
  elastic.cluster : elastic prefix cluster
  elastic.host : hostname
  elastic.port : port (default 9300)
  elastic.index : elastic index command
  elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
  elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
2017-08-18 11:32:26,201 INFO  elasticsearch.plugins - [Cloud 9] loaded [], sites []
2017-08-18 11:32:26,221 INFO  client.transport - [Cloud 9] failed to get node info for [#transport#-1][BAS2019][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected
2017-08-18 11:32:26,222 INFO  indexer.IndexingJob - IndexingJob: done.

Then I looked at Elasticsearch through Kibana, but it couldn't find any indexes posted from Nutch. This was a little concerning since I didn't know where it was breaking. I finally found the reason in the Elasticsearch log (under /usr/local/var/log/elasticsearch).

It says,

java.lang.IllegalStateException: Received message from unsupported version: [1.0.0] minimal compatible version is: [5.0.0]
    at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1379) ~[elasticsearch-5.5.1.jar:5.5.1]
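
For reference, you can check which version the server itself is running by hitting its HTTP port (the default 9200 is assumed here); the JSON response includes a version.number field:

$ curl -s http://localhost:9200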

It was obviously a version compatibility issue. The issue has already been reported, but there is no mention of it on the Apache Nutch project website:

https://issues.apache.org/jira/browse/NUTCH-2323

I think Nutch works great otherwise, but the indexer only supports Elasticsearch 1.x or 2.x (as I write this). That is a deal breaker for us, since our main search engine is Elasticsearch 5.x.

I’m looking into StormCrawler.

Apache Nutch inject URLs

I am trying to crawl a website locally and collect all the seed URLs. I also want to dump all the HTML contents into Elasticsearch. However, I am still stuck on injecting URLs into the CrawlDB, and I don't know where it is going wrong.

$ bin/nutch inject seed/urls.txt
InjectorJob: starting at 2017-08-17 19:56:15
InjectorJob: Injecting urlDir: seed/urls.txt

It doesn’t proceed from here.

Solr 4.x and Jetty install on Ubuntu

There have been a couple of configuration file changes since the early days of Solr 4.x. The first change is how multiple cores are set up: the documentation says 4.4 will be the last version that supports <core> tags in solr.xml and recommends moving to the new setup. The new format looks confusing at first, but you don't have to worry about what it does; it simply replaces the <core> tags, so you no longer have to define each core's configuration in solr.xml. I feel this is cleaner.

<solr>
  <solrcloud>
    <str name="host">127.0.0.1</str>
    <int name="hostPort">${hostPort:8086}</int>
    <str name="hostContext">${hostContext:solr}</str>
    <int name="zkClientTimeout">${solr.zkclienttimeout:30000}</int>
    <str name="shareSchema">${shareSchema:false}</str>
    <str name="genericCoreNodeNames">${genericCoreNodeNames:true}</str>
  </solrcloud>
 
  <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:120000}</int>
    <int name="connTimeout">${connTimeout:15000}</int>
  </shardHandlerFactory>
</solr>

Then, create a core.properties file under the core directory. You can put more settings in it, but the bare minimum is name=collection1, as the example file indicates.
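
A minimal core.properties can literally be one line (collection1 here matches the stock example core name):

$ cat collection1/core.properties
name=collection1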

When you look at jetty/logs you may see a log4j warning. It was kind of tricky to figure out how to set this up, but as the documentation explains (http://wiki.apache.org/solr/SolrLogging), just copying the four jar files into the jetty/lib/ext folder will do.
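
The copy itself is just something like this (the Solr and Jetty paths below are placeholders; adjust them to wherever you unpacked each one):

$ cp /path/to/solr-4.x/example/lib/ext/*.jar /path/to/jetty/lib/ext/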