Install Apache Maven
$ brew install maven
Install Zookeeper
$ brew install zookeeper
- After the installation completes, Homebrew displays the following message:
To have launchd start zookeeper at login:
$ ln -sfv /usr/local/opt/zookeeper/*.plist ~/Library/LaunchAgents
- Start Zookeeper:
$ launchctl load ~/Library/LaunchAgents/homebrew.mxcl.zookeeper.plist
NOTE: If you don’t want/need launchctl, you can just run:
$ zkServer start
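Before moving on, it is worth confirming that Zookeeper is actually listening. The sketch below uses Zookeeper's built-in "ruok" four-letter command and assumes the default client port, 2181; it only reports status and is safe to run either way.

```shell
# Health check sketch: Zookeeper replies "imok" to the four-letter
# command "ruok" on its client port (2181 by default).
zk_health() {
  reply=$(echo ruok | nc -w 2 localhost 2181 2>/dev/null)
  if [ "$reply" = "imok" ]; then
    echo "Zookeeper is up"
  else
    echo "Zookeeper is not responding on port 2181"
  fi
}
zk_health
```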
Install ZeroMQ
$ brew install zeromq
Install Apache Storm
$ brew install storm
NOTE: The above command installs everything into Homebrew's Cellar folder and creates a symlink at "/usr/local/opt/storm".
Setup Apache Storm and its components
- Edit the Storm config file, storm.yaml, located in the /usr/local/opt/storm/libexec/conf folder. Add the following lines:
storm.zookeeper.servers:
  - "localhost"
# - "server2"
#
nimbus.host: "localhost"
nimbus.thrift.port: 6627
ui.port: 8772
storm.local.dir: "/Users/gmohr/storm/data"
java.library.path: "/usr/lib/jvm"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
- Make sure any folders you create have the right permissions. I initially set "/home/user/storm/data", not realizing that you cannot create directories under the "/home" folder on macOS; this caused issues when I started Nimbus and the Supervisor.
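To sidestep that permissions problem, you can create the local state directory up front and verify it is writable. This is a sketch: it uses $HOME/storm/data, which mirrors the storm.local.dir value shown above, so adjust the path to match your own config.

```shell
# Create the directory referenced by storm.local.dir and verify it is
# writable ($HOME/storm/data is an example path -- adjust to your setup).
STORM_DATA="$HOME/storm/data"
mkdir -p "$STORM_DATA"
chmod 750 "$STORM_DATA"
if [ -w "$STORM_DATA" ]; then
  echo "storm.local.dir OK: $STORM_DATA"
else
  echo "storm.local.dir not writable: $STORM_DATA"
fi
```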
- Start Zookeeper, Storm Nimbus, the Supervisor, and the UI. The commands below show the locations of the shell scripts; run them in this order.
$ zkServer start
$ /usr/local/opt/storm/libexec/bin/storm nimbus
$ /usr/local/opt/storm/libexec/bin/storm supervisor
$ /usr/local/opt/storm/libexec/bin/storm ui
- Check that everything is running smoothly by running the following command:
$ jps
NOTE: The above command should produce output similar to the following (PIDs will differ):
5282 supervisor
5267 nimbus
5460 core
5735 Jps
4235 QuorumPeerMain
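If you prefer a scripted check, the sketch below greps the jps output for each daemon; the process names are taken from the listing above ("core" is the Storm UI process), and the script just reports what it finds.

```shell
# Check that each daemon from the jps listing above is running
# ("core" is the Storm UI; QuorumPeerMain is Zookeeper).
check_daemons() {
  for proc in QuorumPeerMain nimbus supervisor core; do
    if jps 2>/dev/null | grep -q "$proc"; then
      echo "$proc: running"
    else
      echo "$proc: NOT running"
    fi
  done
}
check_daemons
```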
- Check the Storm UI index page at http://localhost:8772/index.html
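You can also probe the UI from the shell. This sketch assumes the ui.port value of 8772 set in the config above (Storm's stock default is 8080) and simply reports the HTTP status it gets back.

```shell
# Probe the Storm UI on the port configured via ui.port (8772 here).
ui_status() {
  code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8772/index.html) || code=000
  if [ "$code" = "200" ]; then
    echo "Storm UI is up (HTTP $code)"
  else
    echo "Storm UI not reachable (HTTP $code)"
  fi
}
ui_status
```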
Create a StormCrawler project
- First, run the following mvn command to generate a StormCrawler project:
$ mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.5.1
- Delete the following files under the StormCrawler project folder:
- src/main/java folder
- crawler.flux file
- Go back to the StormCrawler folder and copy:
- kibana folder
- es-conf.yaml
- es-crawler.flux
- ES_IndexInit.sh
- es-injector.flux
- README.md
- NOTE: Add a seeds.txt file containing the URLs to crawl
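A seeds.txt file is just a plain-text list with one URL per line. The sketch below creates one with placeholder example.com/example.org URLs; replace them with the sites you actually want to crawl.

```shell
# Create a seeds.txt with one URL per line (the URLs below are
# placeholders -- substitute your own crawl targets).
cat > seeds.txt <<'EOF'
https://example.com/
https://example.org/
EOF
echo "seeds.txt contains $(wc -l < seeds.txt | tr -d ' ') URLs"
```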
- Create a package
$ mvn clean package
Start crawling process
- Inject the seed URLs into the crawler process to create the URL seeds to crawl:
$ storm jar target/stormcrawler-0.1.jar org.apache.storm.flux.Flux --local es-injector.flux --sleep 30000
- Begin crawling
$ storm jar target/stormcrawler-0.1.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 30000
- After the local crawl finishes, submit the topology to Nimbus. This adds a Topology Summary entry to the Storm UI dashboard, which you can use to check the crawler topology's status:
$ storm jar target/stormcrawler-0.1.jar org.apache.storm.flux.Flux --remote es-crawler.flux
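Besides the UI dashboard, the topology can be checked from the shell. This sketch assumes the Homebrew-installed storm wrapper is on your PATH; `storm list` prints a summary of active topologies, and the guard just reports if the CLI is missing.

```shell
# List active topologies via the storm CLI (assumes "storm" is on PATH).
list_topologies() {
  if command -v storm >/dev/null 2>&1; then
    storm list
  else
    echo "storm CLI not found on PATH"
  fi
}
list_topologies
```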
- NOTE: For debugging, the log files are located in $STORM_HOME/libexec/logs
Screenshot of Elasticsearch index in Kibana