This article assumes you already have Solr installed and running. If you would like to install Solr on Windows, macOS or via Docker, please read Setup a Solr instance.
There are several ways to install Nutch, which are covered in the Nutch tutorial. I have written this article for those who would like to install Nutch using Docker. I searched Google but could not find any help for installing Nutch with Docker, and I spent a good amount of time fixing issues specific to it. This article is meant to save other developers that time.
Install Nutch using Docker -
1. Pull the Docker image of Nutch using the command below,
> docker pull apache/nutch
2. Once the image is pulled, run the container,
> docker run -t -i -d --name nutchcontainer apache/nutch /bin/bash
3. You should now be able to enter the container and see the bash prompt,
> docker exec -it nutchcontainer /bin/bash
Let's set up a few important settings now -
1. Go to the bin folder,
> bash-5.1# cd /nutch/bin
You should find the nutch and crawl scripts in the bin folder.
2. Check that Nutch is installed by running the command below,
> bash-5.1# ./nutch
You should get the Nutch usage details as output.
3. Create a new folder where we will add the URLs to crawl, and create the seed file inside it,
> bash-5.1# mkdir urls
> bash-5.1# touch urls/seed.txt
> bash-5.1# vi urls/seed.txt
Add the URLs you would like to crawl to seed.txt, one per line.
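If you prefer not to use vi, the seed file can also be written non-interactively. The snippet below is a minimal sketch. It assumes the seed file belongs inside the urls folder (because the crawl script is later given the urls directory through -s), and it uses https://nutch.apache.org/ purely as a placeholder URL; replace it with the sites you actually want to crawl.

```shell
# Create the seed directory if it does not exist yet
mkdir -p urls

# Write one URL per line; the URL below is only a placeholder (assumption)
cat > urls/seed.txt <<'EOF'
https://nutch.apache.org/
EOF

# Show how many seed URLs were written
wc -l < urls/seed.txt
```

Run it from the same folder in which you created urls, so that the relative path matches the one passed to the crawl script.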
> bash-5.1# ./crawl --num-threads 3 -s urls crawldb 2
This starts crawling the seed URLs from the urls folder with 3 concurrent fetcher threads, stores the results in the crawldb directory and runs 2 crawl rounds.
8. After a successful run, you will notice that crawldb, segments and linkdb folders have been created.
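A quick way to confirm the crawl output is to check for the three folders mentioned above. The following is a small sketch; check_crawl_dir is a hypothetical helper name of my own, not part of Nutch, and it assumes the folders sit directly under the crawl directory passed to the crawl script.

```shell
# Hypothetical helper: verify that a crawl directory contains the
# crawldb, segments and linkdb sub-folders
check_crawl_dir() {
  crawl_dir="$1"
  for sub in crawldb segments linkdb; do
    if [ ! -d "$crawl_dir/$sub" ]; then
      echo "missing: $crawl_dir/$sub"
      return 1
    fi
  done
  echo "ok: $crawl_dir"
}

# Usage (from the bin folder, after the crawl): check_crawl_dir crawldb
```

If the folders are present, running ./nutch readdb crawldb/crawldb -stats should additionally print a summary of the URLs known to the crawl database.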
Integrate it with Solr -
> bash-5.1# ./crawl -i -D solr.server.url=http://solr.bajajsumit.com:8983/solr/nutch_collection crawldb 1
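To see whether documents actually reached Solr, you can query the collection and look at the total count. This is only a sketch: solr_doc_count is a hypothetical helper name, it assumes curl is available where you run it, and it assumes the nutch_collection collection from the command above already exists and is reachable.

```shell
# Base URL of the Solr collection, taken from the crawl command above
SOLR_URL="http://solr.bajajsumit.com:8983/solr/nutch_collection"

# Hypothetical helper: match all documents but request zero rows,
# so the response only carries the total in "numFound"
solr_doc_count() {
  curl -s "$1/select?q=*:*&rows=0"
}

# Usage: solr_doc_count "$SOLR_URL"
```

A numFound value greater than zero in the JSON response suggests that the indexing step wrote documents into the collection.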