Setting up StormCrawler

Install Apache Maven

$ brew install maven

Install Zookeeper

$ brew install zookeeper

  • After the installation completes, Homebrew displays the following message:

To have launchd start zookeeper at login:
$ ln -sfv /usr/local/opt/zookeeper/*.plist ~/Library/LaunchAgents

  • Start Zookeeper:

$ launchctl load ~/Library/LaunchAgents/homebrew.mxcl.zookeeper.plist

NOTE: If you don’t want/need launchctl, you can just run:
$ zkServer start

Install ZeroMQ

$ brew install zeromq

Install Apache Storm

$ brew install storm

NOTE: The above command installs everything into Homebrew’s Cellar folder and creates a symlink at “/usr/local/opt/storm”.

Setup Apache Storm and its components

  • Edit the Storm config file storm.yaml, located in the /usr/local/opt/storm/libexec/conf folder. Add the following lines:

storm.zookeeper.servers:
  - "localhost"
  # - "server2"

nimbus.host: "localhost"
nimbus.thrift.port: 6627
ui.port: 8772
storm.local.dir: "/Users/gmohr/storm/data"
java.library.path: "/usr/lib/jvm"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
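The storm.local.dir above must exist and be writable by the user running Storm before the daemons start. A quick sketch to create it (STORM_DATA defaults to an example path here; point it at your own storm.local.dir value):

```shell
# Create the Storm local data directory and make sure it is writable.
# STORM_DATA is an example path; adjust it to match your storm.yaml.
STORM_DATA="${STORM_DATA:-$HOME/storm/data}"
mkdir -p "$STORM_DATA"
chmod u+rwx "$STORM_DATA"
[ -w "$STORM_DATA" ] && echo "ok: $STORM_DATA is writable"
```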

  • Make sure any folders you create have the right permissions. I initially used “/home/user/storm/data” and did not realize that on macOS you cannot create directories under “/home”; this caused problems when I started Nimbus and the Supervisor.
  • Start Zookeeper, the Storm Nimbus, the Supervisor, and the UI, in the following order:

$ zkServer start
$ /usr/local/opt/storm/libexec/bin/storm nimbus
$ /usr/local/opt/storm/libexec/bin/storm supervisor
$ /usr/local/opt/storm/libexec/bin/storm ui

  • Check that everything is running by listing the Java processes:

$ jps
NOTE: The output should look similar to the following (the PIDs will differ).
5282 supervisor
5267 nimbus
5460 core
5735 Jps
4235 QuorumPeerMain
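If you want to script this check, a small helper can grep the jps output for the expected process names (names taken from the sample output above; this is a sketch, not part of Storm):

```shell
# Check that each expected Storm/Zookeeper process appears in `jps` output.
# Expected names come from the sample jps output above.
check_storm_procs() {
  local jps_output="$1" proc ok=0
  for proc in nimbus supervisor core QuorumPeerMain; do
    if ! printf '%s\n' "$jps_output" | grep -qw "$proc"; then
      echo "missing: $proc"
      ok=1
    fi
  done
  return $ok
}

# Usage: check_storm_procs "$(jps)" && echo "all Storm processes running"
```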

Create a StormCrawler project

  • First, generate a StormCrawler project from the Maven archetype:

$ mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.5.1

  • Delete the following files under the StormCrawler project folder
    • src/main/java folder
    • crawler.flux file
  • Go back to the StormCrawler source folder and copy over:
    • kibana folder
    • es-conf.yaml
    • es-crawler.flux
    • ES_IndexInit.sh
    • es-injector.flux
    • README.md
    • NOTE: Also add a seeds.txt file containing the URLs to crawl
  • Build the package:

$ mvn clean package
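For reference, the seeds.txt mentioned above is just a plain list of start URLs, one per line. The URLs below are placeholders; use your own:

```shell
# seeds.txt: one start URL per line (placeholder URLs; replace with real targets)
cat > seeds.txt <<'EOF'
https://example.com/
https://example.org/
EOF
```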

Start crawling process

  • Inject the seed URLs into the crawler so it has something to fetch:

$ storm jar target/stormcrawler-0.1.jar org.apache.storm.flux.Flux --local es-injector.flux --sleep 30000

  • Begin crawling

$ storm jar target/stormcrawler-0.1.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 30000

  • Once the local run looks good, submit the topology to Nimbus. It will appear in the Topology Summary section of the Storm UI dashboard, where you can check the crawler topology’s status:

$ storm jar target/stormcrawler-0.1.jar org.apache.storm.flux.Flux --remote es-crawler.flux
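Besides the UI, you can check the topology from the command line with `storm list`. A small helper to pull out the status column (the row format assumed here is based on Storm 1.x output, roughly `name status num_tasks num_workers uptime_secs`; verify against your version):

```shell
# Extract the status column for a named topology from `storm list` output.
# Assumes rows of the form: "name status num_tasks num_workers uptime_secs".
topology_status() {
  local name="$1" list_output="$2"
  printf '%s\n' "$list_output" | awk -v n="$name" '$1 == n { print $2 }'
}

# Usage: topology_status crawler "$(storm list)"
```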

[Screenshot: Storm UI dashboard — storm_ui.png]

  • NOTE: For debugging, the log files are located in $STORM_HOME/libexec/logs

Screenshot of Elasticsearch index in Kibana

[Screenshot: Stormcrawler.png]