Crawl and Index: Nutch / Elasticsearch – Partners in the making – Gen AI | Cloud

In the internet era, there is an old tech saying – “Content is King” (inspired by the old jungle sayings from The Phantom 🙂)

One of the common challenges in a content management system is extracting the latest information. In the WWW world, this is commonly known as crawling. The king of the crawler world is Apache Nutch.

Elasticsearch (no longer just the new kid in town) has already established itself as one of the top search platforms. It is only natural that companies are looking at using both platforms together to build a better content management system, specifically across the acquire, analyze, publish and search phases.

Here’s a quick and dirty guide to get them up and running.

1. Download Nutch
2. Set NUTCH_HOME
NUTCH_HOME=/Users/madheshr/tools/apache-nutch-2.2.1
export NUTCH_HOME
3. Clean build
ant clean
ant
4. Verify that a new local deployment was created under NUTCH_HOME/runtime
/Users/madheshr/tools/apache-nutch-2.2.1/runtime/local
5. Under the bin subdirectory of local, create a new directory called urls
6. In urls, create a new file called nutch.txt. Edit the file to add the URLs to crawl, one per line
7. Enable the crawler in conf/nutch-site.xml by adding the below lines within the configuration tags
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
8. Note the value and enter the same in conf/nutch-default.xml as the
value for <name>http.agent.name</name>
9. Test by running the below command in local/bin

nutch crawl urls -dir /tmp -depth 2
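
Steps 5 and 6 can be scripted roughly as below. This is only a minimal sketch: the fallback value for NUTCH_HOME and the seed URL http://nutch.apache.org/ are assumptions for illustration, so substitute your own install path and the URLs you actually want to crawl.

```shell
#!/bin/sh
# Assumption: NUTCH_HOME points at the unpacked and built Nutch tree.
# The fallback path is a placeholder so that the sketch runs on its own.
NUTCH_HOME="${NUTCH_HOME:-/tmp/apache-nutch-2.2.1}"
BIN_DIR="$NUTCH_HOME/runtime/local/bin"

# Step 5: create the urls directory under runtime/local/bin
mkdir -p "$BIN_DIR/urls"

# Step 6: write the seed list, one URL per line (example URL, replace as needed)
printf '%s\n' 'http://nutch.apache.org/' > "$BIN_DIR/urls/nutch.txt"

# Show what will be crawled
cat "$BIN_DIR/urls/nutch.txt"
```
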
Integrate Nutch and ES
1. Activate elasticsearch indexer plugin
Edit conf/nutch-site.xml

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
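
Since plugin.includes is a regular expression matched against plugin directory names, a quick way to sanity-check an edited value is to try it against the names you expect to be loaded. The sketch below uses grep -E with whole-line matching as an approximation of how Nutch applies the pattern (an assumption, since Nutch itself uses Java regular expressions). The check_plugin helper and the list of names tested are hypothetical and only there for illustration.

```shell
#!/bin/sh
# The plugin.includes value from nutch-site.xml
INCLUDES='protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)'

# Print "included" or "excluded" for a plugin directory name.
# -x requires the whole name to match, -q suppresses grep's own output.
check_plugin() {
  if printf '%s\n' "$1" | grep -Eqx "$INCLUDES"; then
    echo "included"
  else
    echo "excluded"
  fi
}

# Illustrative names: two that should be picked up, two that should not
for name in indexer-elastic parse-tika indexer-solr protocol-httpclient; do
  printf '%s: %s\n' "$name" "$(check_plugin "$name")"
done
```

Note that protocol-httpclient is reported as excluded even though protocol-http is listed, because the whole name has to match.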

2. Verify and add ES-specific properties to nutch-site.xml

<!-- Elasticsearch properties -->

<property>
<name>elastic.host</name>
<value>localhost</value>
<description>The hostname to send documents to using TransportClient. Either host
and port must be defined or cluster.</description>
</property>

<property>
<name>elastic.port</name>
<value>9300</value>
<description>The port to connect to using TransportClient.</description>
</property>

<property>
<name>elastic.cluster</name>
<value>elasticsearch</value>
<description>The cluster name to discover. Either host and port must be defined
or cluster.</description>
</property>

<property>
<name>elastic.index</name>
<value>nutch</value>
<description>Default index to send documents to.</description>
</property>

<property>
<name>elastic.max.bulk.docs</name>
<value>250</value>
<description>Maximum size of the bulk in number of documents.</description>
</property>

<property>
<name>elastic.max.bulk.size</name>
<value>2500500</value>
<description>Maximum size of the bulk in bytes.</description>
</property>

3. Create a new index in ES if it is not there already. The index name must match the elastic.index value set above (nutch)

curl -XPUT 'http://localhost:9200/nutch/'
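
To honour the "if it is not there already" part, the index can be checked first and created only when missing. The following is a rough sketch, assuming Elasticsearch answers over HTTP on localhost:9200; the ES_URL variable and the index_url helper are hypothetical names introduced here, not part of Nutch or Elasticsearch.

```shell
#!/bin/sh
# Assumption: Elasticsearch listens for HTTP on localhost:9200
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX="nutch"   # must match elastic.index in nutch-site.xml

# Build the URL of an index, e.g. http://localhost:9200/nutch/
index_url() {
  printf '%s/%s/' "${ES_URL%/}" "$1"
}

# HEAD request: -I sends HEAD, -o discards the response, -w prints only the status code
status=$(curl -s -o /dev/null -w '%{http_code}' -I "$(index_url "$INDEX")" 2>/dev/null)

case "$status" in
  200) echo "index $INDEX already exists" ;;
  404) curl -s -XPUT "$(index_url "$INDEX")" ;;   # missing, so create it
  *)   echo "Elasticsearch not reachable at $ES_URL (status: $status)" ;;
esac
```
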