How to build a Jobs Aggregation Search Engine with Nutch, Apache Solr and Views 3 in about an hour

11:00 AM - 12:00 PM
309 Microsoft

Presented by

David Stuart (dstuart)
Axis Twelve Ltd
Providing Professional Drupal Services
Succeed: Share how to design, theme, develop, host, train, manage and take your Drupal business to the next level.

Nutch is an open source web crawler that lets you do fine-grained or Internet-wide web crawling. In this session I will introduce you to the Drupal Nutch module, which will help with the setup and control of your crawls. We will combine this with some of the new features in the Apache Solr, Views 3 and Apache Solr Views modules to create a hybrid vertical search engine that interleaves your content with supporting web content.

The agenda will be:

1. An introduction to the Apache Nutch crawler
2. An introduction to the features of the Drupal Nutch module
3. Technical design decisions on combining crawled data with your Drupal data in Apache Solr
4. Bringing it all together with a demo of a jobs aggregation search engine
5. Questions

Experience: Advanced, Expert
Industry: education, entertainment, library, media

This sounds awesome!

Hope you don't run out of time - sounds like a lot to get through.

Does "about an hour" mean that from a vanilla install of Drupal (plus the required modules) and Solr/Nutch we'll see a site that aggregates the content from several other sites, built from scratch and fully functional within an hour?

If I understand this correctly, what you're going to be showing would be appropriate for any situation where I may wish to incorporate 3rd party content into my site without the content owner having to do any work on their end?

Will your demo include a means of intelligently surfacing this 3rd party content alongside relevant locally stored content? Facets? Or is the whole site going to consist solely of 3rd party content searchable by keyword?

Hi Mika,

My intention is to cover everything you've mentioned in the hour (if I get chosen). My challenge is going to be not spreading the talk and demo so thin on detail that things aren't covered sufficiently to be meaningful. The demo will include a combination of 3rd party and Drupal content.


David Stuart

Maybe you should do a 2-part session :)

Yeah, that would be great! Maybe I'll do part two in Copenhagen 2010.

Can I vote 2,3,... 10 times?
I'd really like to attend it... ;)

Nutch in about an hour... priceless!

Looking forward indeed.

What Nutch module? It only supports Drupal 4.7 right now.

Good point,

I have almost completed the D6 version and will get a copy up today, 11/04/2010.

Unless we want to do a retro-style 4.7 demo ;)

Hi David,

This is great news!

Do you know if anyone will be recording your session?

I'd love to watch it, can't make it over, sorry!

cheers from
Scotland :)

Hey Scot,

I believe all the sessions are recorded. Maybe we could get together either in London (as that's where I'm based) or Copenhagen to talk QueryPath, Lucene APIs, etc.



Hey Everyone,

Just to let you know, I am stuck in London, so no DrupalCon SF for me. I will still be doing my session via a streamed link from London, so hopefully you will still come and watch. Q and A will be available at the end.