
Nutch crawler and integration with Solr

Before moving ahead with this article, I assume you have Solr installed and running. If you would like to install Solr on Windows, macOS, or via Docker, please read Setup a Solr instance.

There are several ways to install Nutch, described in the Nutch tutorial; however, I have written this article for those who would like to install Nutch using Docker. I searched Google but could not find any guidance on a Docker-based Nutch installation and spent a good amount of time fixing issues specific to it, so I wrote this article to save other developers that time.

Install Nutch using Docker -

1. Pull the Nutch Docker image using the below command,
    > docker pull apache/nutch
2. Once the image is pulled, run the container in detached mode,
    > docker run -t -i -d --name nutchcontainer apache/nutch /bin/bash
3. Attach to the running container; you should see a bash prompt,
    > docker exec -it nutchcontainer /bin/bash
    > bash-5.1#

Let's set up a few important settings now -

1. Go to the bin folder,
    > bash-5.1# cd /nutch/bin
        You should find the nutch and crawl scripts in this folder.
2. Verify Nutch is installed by running the below command,
    > bash-5.1# ./nutch
        You should see Nutch usage details as output.

3. Create a new folder that will hold the URLs to crawl, with a seed.txt file inside it,
    > bash-5.1# mkdir urls
    > bash-5.1# touch urls/seed.txt
    > bash-5.1# vi urls/seed.txt

         Add the URLs you would like to crawl to seed.txt, one per line.

4. The seed.txt file (in this case, I have added two URLs to crawl) should contain one URL per line.
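Steps 3 and 4 can be sketched as the following shell commands; the two URLs are placeholders, substitute the sites you actually want to crawl:

```shell
# Create the seed list inside the urls folder.
# The two URLs below are examples only - replace them with your own.
mkdir -p urls
cat > urls/seed.txt <<'EOF'
https://nutch.apache.org/
https://solr.apache.org/
EOF
```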

5. Modify the nutch-site.xml file,
    > bash-5.1# cd ../conf
    > bash-5.1# vi nutch-site.xml
    Remove the existing lines and add your configuration. At a minimum, Nutch requires the http.agent.name property to be set, otherwise the crawl will fail.
    <?xml version="1.0"?>
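A minimal nutch-site.xml looks like the sketch below; the agent name shown is an example, so choose your own:

```xml
<?xml version="1.0"?>
<!-- Minimal nutch-site.xml: http.agent.name is mandatory for crawling.
     "MyNutchCrawler" is an example value - pick a name identifying your crawler. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>
```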

6. Run the below command to inject the URLs into the crawl database,
    > bash-5.1# cd ../bin
    > bash-5.1# ./nutch inject crawldb urls
        It should inject the URLs and show a success message. Along with it, a crawldb folder will also be created.

7. Now run the below command to crawl, generate segments, and invert links,

    > bash-5.1# ./crawl --num-threads 3 -s urls crawldb 2

    This starts crawling the URLs with 3 concurrent fetcher threads and runs 2 crawl rounds.

8. After a successful run, you will notice that crawldb, segments, and linkdb folders have been created.
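To confirm the run produced the expected folders, a small check like the sketch below can help (folder names are the defaults used in the steps above; pass the directory the crawl was started from):

```shell
# Report whether each expected Nutch output folder exists under $1.
check_crawl_output() {
  local dir
  for dir in crawldb linkdb segments; do
    if [ -d "$1/$dir" ]; then
      echo "$dir: present"
    else
      echo "$dir: MISSING"
    fi
  done
}
```

For example, `check_crawl_output .` run from the bin folder reports the status of each folder.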

Integrate it with Solr - 

1. Modify the index-writers.xml file,
    > bash-5.1# cd ../conf
    > bash-5.1# vi index-writers.xml
     Update the Solr URL, where nutch_collection is the collection created in Solr.
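The relevant part of index-writers.xml looks roughly like the fragment below, based on the default file shipped with Nutch 1.x; the URL shown assumes a local Solr with the nutch_collection collection, so adjust host and collection to your setup:

```xml
<!-- Fragment of conf/index-writers.xml (sketch): point "url" at your
     Solr collection. Other writer parameters are left as shipped. -->
<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <param name="type" value="http"/>
    <param name="url" value="http://localhost:8983/solr/nutch_collection"/>
    <!-- remaining parameters unchanged -->
  </parameters>
</writer>
```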
2. Run the below command to index the crawled data into the Solr instance (set solr.server.url to your Solr collection URL),

    > bash-5.1# ./crawl -i -D solr.server.url= crawldb 1

3. After successful completion, the crawled records should appear in the Solr instance under nutch_collection.

Please feel free to ask for any help. Keep learning and keep building the community strong.

