Setting up StormCrawler

Install Apache Maven

$ brew install maven

Install Zookeeper

$ brew install zookeeper

  • After the installation is complete, it will display the following message:

To have launchd start zookeeper at login:
$ ln -sfv /usr/local/opt/zookeeper/*.plist ~/Library/LaunchAgents

  • Start Zookeeper:

$ launchctl load ~/Library/LaunchAgents/homebrew.mxcl.zookeeper.plist

NOTE: If you don’t want/need launchctl, you can just run:
$ zkServer start
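
Either way, you can verify that ZooKeeper is actually up before moving on (the second command assumes the default client port 2181):

$ zkServer status
$ echo ruok | nc localhost 2181   # a healthy server replies "imok"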

Install ZeroMQ

$ brew install zeromq

Install Apache Storm

$ brew install storm

NOTE: The above command installs everything under the Homebrew Cellar folder and creates a symlink at /usr/local/opt/storm.

Setup Apache Storm and its components

  • Edit the Storm config file storm.yaml (located in the /usr/local/opt/storm/libexec/conf folder) and add the following lines:

storm.zookeeper.servers:
  - "localhost"
  # - "server2"
nimbus.host: "localhost"
nimbus.thrift.port: 6627
ui.port: 8772
storm.local.dir: "/Users/gmohr/storm/data"
java.library.path: "/usr/lib/jvm"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
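
Before starting anything, create the storm.local.dir directory referenced above (the path is just mine; point it somewhere under your own home directory):

$ mkdir -p /Users/gmohr/storm/data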

  • Make sure that whatever folders you create have the right permissions. I initially set storm.local.dir to "/home/user/storm/data" and did not realize that you cannot create directories under "/home" on macOS; this was causing issues when I started Nimbus and the Supervisor.
  • Start ZooKeeper, Storm Nimbus, the Supervisor, and the UI. The commands below show where the scripts live; run them in this order:

$ zkServer start
$ /usr/local/opt/storm/libexec/bin/storm nimbus
$ /usr/local/opt/storm/libexec/bin/storm supervisor
$ /usr/local/opt/storm/libexec/bin/storm ui

  • Check that everything is running smoothly by running the following command:

$ jps
NOTE: The output should look something like this (the PIDs will differ):
5282 supervisor
5267 nimbus
5460 core
5735 Jps
4235 QuorumPeerMain
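
You can also load the Storm UI in a browser on the ui.port configured above (8772 in my config):

$ open http://localhost:8772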

Create a StormCrawler project

  • First, run the following Maven command to generate a StormCrawler project:

$ mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.5.1
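
The archetype will prompt for your new project's groupId, artifactId, and version; you can also pass them on the command line to skip the prompts (the coordinates below are just examples, pick your own):

$ mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.5.1 -DgroupId=com.example.crawler -DartifactId=stormcrawler -Dversion=0.1 -DinteractiveMode=false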

  • Delete the following files under the StormCrawler project folder
    • src/main/java folder
    • crawler.flux file
  • Go back to the StormCrawler source folder and copy the following into your project:
    • kibana folder
    • es-conf.yaml
    • es-crawler.flux
    • ES_IndexInit.sh
    • es-injector.flux
    • README.md
    • NOTE: Also add a seeds.txt file containing the URLs to crawl (one per line; see the example after the packaging step)
  • Create a package

$ mvn clean package
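
The seeds.txt mentioned above is just a plain-text list of start URLs, one per line; for example (these URLs are placeholders):

$ cat seeds.txt
https://example.com/
https://example.org/news/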

Start crawling process

  • Inject the seed URLs so the crawler has something to fetch:

$ storm jar target/stormcrawler-0.1.jar org.apache.storm.flux.Flux --local es-injector.flux --sleep 30000

  • Begin crawling

$ storm jar target/stormcrawler-0.1.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 30000

  • After testing the crawl locally, submit the topology to Nimbus. This creates an entry in the Topology Summary section of the Storm UI dashboard, which you can use to check the crawler topology's status:

$ storm jar target/stormcrawler-0.1.jar org.apache.storm.flux.Flux --remote es-crawler.flux

[Screenshot: Storm UI dashboard with the topology summary (storm_ui.png)]

  • NOTE: For debugging, the log files are under $STORM_HOME/libexec/logs.
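
For example, to follow the logs while a topology is running (the exact file names vary by Storm version):

$ tail -f /usr/local/opt/storm/libexec/logs/*.log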

Screenshot of Elasticsearch index in Kibana

[Screenshot: Elasticsearch index in Kibana (Stormcrawler.png)]

Error in Apache Nutch Indexing for Elasticsearch

Even after running the following command, there are no indexes in Elasticsearch.

$ bin/nutch index elasticsearch -all

The logs/hadoop.log shows the following. It looks as if the indexing job completed without any issues:

2017-08-18 11:29:59,542 INFO  elasticsearch.plugins - [Behemoth] loaded [], sites []
2017-08-18 11:29:59,564 INFO  client.transport - [Behemoth] failed to get node info for [#transport#-1][BAS2019][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected
2017-08-18 11:29:59,565 INFO  indexer.IndexingJob - IndexingJob: done.
2017-08-18 11:32:23,894 INFO  indexer.IndexingJob - IndexingJob: starting
2017-08-18 11:32:24,048 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-08-18 11:32:24,123 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2017-08-18 11:32:24,123 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2017-08-18 11:32:24,125 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2017-08-18 11:32:24,125 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2017-08-18 11:32:24,129 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2017-08-18 11:32:24,319 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2017-08-18 11:32:25,099 WARN  conf.Configuration - file:/tmp/hadoop-xxx/mapred/staging/xxx486819994/.staging/job_local486819994_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2017-08-18 11:32:25,101 WARN  conf.Configuration - file:/tmp/hadoop-xxx/mapred/staging/xxx486819994/.staging/job_local486819994_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2017-08-18 11:32:25,161 WARN  conf.Configuration - file:/tmp/hadoop-xxx/mapred/local/localRunner/xxx/job_local486819994_0001/job_local486819994_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2017-08-18 11:32:25,162 WARN  conf.Configuration - file:/tmp/hadoop-xxx/mapred/local/localRunner/xxx/job_local486819994_0001/job_local486819994_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2017-08-18 11:32:25,236 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2017-08-18 11:32:25,348 INFO  elasticsearch.plugins - [Lin Sun] loaded [], sites []
2017-08-18 11:32:25,956 INFO  client.transport - [Lin Sun] failed to get node info for [#transport#-1][BAS2019][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected
2017-08-18 11:32:25,962 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2017-08-18 11:32:25,962 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2017-08-18 11:32:25,962 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2017-08-18 11:32:25,962 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2017-08-18 11:32:25,962 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2017-08-18 11:32:25,963 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2017-08-18 11:32:25,992 INFO  elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
2017-08-18 11:32:25,992 INFO  elastic.ElasticIndexWriter - Processing to finalize last execute
2017-08-18 11:32:26,190 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2017-08-18 11:32:26,190 INFO  indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
  elastic.cluster : elastic prefix cluster
  elastic.host : hostname
  elastic.port : port (default 9300)
  elastic.index : elastic index command
  elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
  elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
2017-08-18 11:32:26,201 INFO  elasticsearch.plugins - [Cloud 9] loaded [], sites []
2017-08-18 11:32:26,221 INFO  client.transport - [Cloud 9] failed to get node info for [#transport#-1][BAS2019][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.NodeDisconnectedException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] disconnected
2017-08-18 11:32:26,222 INFO  indexer.IndexingJob - IndexingJob: done.

Then I looked at Elasticsearch through Kibana, but it couldn't find any indexes posted from Nutch. This was a little concerning since I didn't know where it was breaking. I finally found the reason in the Elasticsearch log (under /usr/local/var/log/elasticsearch).

It says,

java.lang.IllegalStateException: Received message from unsupported version: [1.0.0] minimal compatible version is: [5.0.0]
    at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1379) ~[elasticsearch-5.5.1.jar:5.5.1]
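
For reference, you can check which version the server itself is running by hitting its HTTP port (the default 9200 is assumed here); the JSON response includes a version.number field:

$ curl -s http://localhost:9200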

It was obviously a version compatibility issue. The issue has already been reported, but there is no mention of it on the Apache Nutch project website:

https://issues.apache.org/jira/browse/NUTCH-2323

I think Nutch works great otherwise, but the indexer only supports Elasticsearch 1.x or 2.x (as I write this). That is a deal breaker for us, since our main search engine is Elasticsearch 5.x.

I’m looking into StormCrawler.

Apache Nutch inject URLs

I am trying to crawl a website locally and collect all the seed URLs. I also want to dump all the HTML contents into Elasticsearch. However, I am still stuck on injecting URLs into the CrawlDB, and I don't know where it is going wrong.

$ bin/nutch inject seed/urls.txt
InjectorJob: starting at 2017-08-17 19:56:15
InjectorJob: Injecting urlDir: seed/urls.txt

It doesn’t proceed from here.

Solr 4.x and Jetty install on Ubuntu

There have been a couple of configuration file changes since the early days of Solr 4.x. The first change is how multiple cores are set up: the documentation says 4.4 will be the last version that supports <core> tags in solr.xml and recommends moving to the new setup. The new format looks confusing at first, but you don't have to worry about what it does; it simply replaces the <core> tags, so you no longer have to define each core's configuration in solr.xml. I feel this is cleaner.

<solr>
  <solrcloud>
    <str name="host">127.0.0.1</str>
    <int name="hostPort">${hostPort:8086}</int>
    <str name="hostContext">${hostContext:solr}</str>
    <int name="zkClientTimeout">${solr.zkclienttimeout:30000}</int>
    <str name="shareSchema">${shareSchema:false}</str>
    <str name="genericCoreNodeNames">${genericCoreNodeNames:true}</str>
  </solrcloud>
 
  <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:120000}</int>
    <int name="connTimeout">${connTimeout:15000}</int>
  </shardHandlerFactory>
</solr>

Then, create a core.properties file under the core directory. You can put more settings in it, but the bare minimum is name=collection1, as the example file indicates.
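
A minimal core.properties can literally be one line (collection1 here matches the stock example core name):

$ cat collection1/core.properties
name=collection1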

When you look at jetty/logs you may see a log4j warning. It was kind of tricky to figure out how to set this up, but as the documentation explains (http://wiki.apache.org/solr/SolrLogging), just copying the four jar files into the jetty/lib/ext folder will do.
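
The copy itself is just something like this (the Solr and Jetty paths below are placeholders; adjust them to wherever you unpacked each one):

$ cp /path/to/solr-4.x/example/lib/ext/*.jar /path/to/jetty/lib/ext/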