Step 1: Create the Sitecore.Search File Crawler

Implement a crawler that indexes files in a specific folder. In this example, the crawler will index the Web site root by default. It must crawl through the contents of the file system and add specific data to the Lucene search index.

Let’s break the code down into smaller chunks to make it easier to understand.

Crawler.cs

1. Open Visual Studio. Create a new Web Application Project and copy project files to the Sitecore Web site root.

2. Create a C# class and include the following namespaces:

namespace FileCrawler.FileSystem
{
  using System;
  using System.IO;
  using System.Text;
  using System.Xml;
  using Lucene.Net.Documents;
  using Sitecore;
  using Sitecore.Search;
  using Sitecore.Search.Crawlers;

3. Choose a suitable name for your class, such as Crawler. Ensure that the class inherits from the BaseCrawler class and implements the ICrawler interface.

  public class Crawler : BaseCrawler, ICrawler

4. Declare a property called Root. This property defines the root folder to index.

        public string Root { get; set; }

5. The Add method is the entry point for the crawler. Its purpose is to populate the search index. The IndexUpdateContext passed to Add provides the methods used to add content to the search index.

    public void Add(IndexUpdateContext context)

6. Add checks that the root folder exists and then calls a recursive scanner, AddRecursive, to traverse the file system:

      string path = MainUtil.MapPath(this.Root);
      var info = new DirectoryInfo(path);
      if (!info.Exists)
      {
        return;
      }
      this.AddRecursive(context, info);

7. The AddRecursive method traverses the file system tree by navigating through all the folders and subfolders. For each file it finds, it creates an index entry (a document) using the CreateFileEntry method and adds it to the index using the AddDocument method.

    protected virtual void AddRecursive(IndexUpdateContext context, DirectoryInfo info)
    {
      foreach (FileInfo file in info.GetFiles())
      {
        context.AddDocument(this.CreateFileEntry(file));
      }
      foreach (DirectoryInfo subfolder in info.GetDirectories())
      {
        this.AddRecursive(context, subfolder);
      }
    }
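
The traversal order can be checked without Sitecore by swapping context.AddDocument for a callback. The following is a minimal sketch under that assumption; TraversalSketch, Visit and ListFiles are illustrative names and are not part of Sitecore.Search.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TraversalSketch
{
    // Visits every file under 'root': the files of a folder first,
    // then each subfolder, in the same order AddRecursive uses.
    public static void Visit(DirectoryInfo root, Action<FileInfo> onFile)
    {
        foreach (FileInfo file in root.GetFiles())
        {
            onFile(file);
        }
        foreach (DirectoryInfo subfolder in root.GetDirectories())
        {
            Visit(subfolder, onFile);
        }
    }

    // Hypothetical convenience wrapper: returns the full paths that would be indexed.
    public static List<string> ListFiles(string rootPath)
    {
        var paths = new List<string>();
        var root = new DirectoryInfo(rootPath);
        if (!root.Exists)
        {
            return paths; // same guard as Add: nothing to do for a missing folder
        }
        Visit(root, file => paths.Add(file.FullName));
        return paths;
    }
}
```

Calling TraversalSketch.ListFiles with the mapped root folder gives a quick preview of what the crawler would index, which helps when deciding whether Root points at too large a tree.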

8. The CreateFileEntry(file) method creates the file entry to be stored in the index. It calls two functions to populate the entry:

  • AddCommonFields – adds the common metadata fields of the Sitecore.Search infrastructure
  • AddContent – extracts text content from the file that will be used in the search

    protected virtual Document CreateFileEntry(FileInfo file)
    {
      var document = new Document();
      this.AddCommonFields(document, file);
      this.AddContent(document, file);
      return document;
    }

Note: Document is a term used to define the structure of a Lucene index. Each document consists of one or more fields and behaves like a row in a database.

9. AddCommonFields adds the common fields supported by Sitecore.Search.

    protected virtual void AddCommonFields(Document document, FileSystemInfo info)
    {
      document.Add(this.CreateTextField(BuiltinFields.Name, info.Name));
      document.Add(this.CreateDataField(BuiltinFields.Name, info.Name));
      document.Add(this.CreateTextField(BuiltinFields.Path, info.FullName));
      document.Add(this.CreateDataField(BuiltinFields.Url, "file://" + info.FullName));
      document.Add(this.CreateTextField(BuiltinFields.Tags, this.Tags));
      document.Add(this.CreateDataField(BuiltinFields.Tags, this.Tags));
      document.Add(this.CreateDataField(BuiltinFields.Icon, "Applications/16x16/document.png"));
      document.SetBoost(this.Boost);
    }

Description of common fields added to the Document:

| Field | Description |
|---|---|
| BuiltinFields.Name | Used to prioritize search by file name. This is also the value that is displayed when presenting search results to the user. |
| BuiltinFields.Path and BuiltinFields.Tags | Can be used to narrow down the search results. Not fully implemented in the current UI. |
| BuiltinFields.Url | Important! Used to identify and open the file associated with a result. Notice that "file://" is used as a prefix to help identify results from the file system. This field is vital for integration with Quick Search because the UI relies on the URL value to identify and activate specific results. The URL value must be sufficient to identify a specific result from this location among all results returned by Sitecore.Search. |
| BuiltinFields.Icon | Used to display an icon next to each of the search results. |
| Boost | Used to adjust the priority of results from the file system relative to other results. |
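
Because the "file://" prefix is what marks a result as coming from the file system, code that consumes the results needs a way to recognize it and recover the path. The following is a minimal sketch of that idea using plain string handling; FileResultSketch and its members are illustrative names, not part of Sitecore.Search.

```csharp
using System;

public static class FileResultSketch
{
    // Prefix the crawler puts in front of the full path when it stores BuiltinFields.Url.
    public const string Prefix = "file://";

    // Builds the URL value the same way AddCommonFields does.
    public static string ToUrl(string fullPath)
    {
        return Prefix + fullPath;
    }

    // True when a search result URL was produced by the file crawler.
    public static bool IsFileResult(string url)
    {
        return url != null && url.StartsWith(Prefix, StringComparison.OrdinalIgnoreCase);
    }

    // Recovers the file system path, or null for results from other crawlers.
    public static string ToPath(string url)
    {
        return IsFileResult(url) ? url.Substring(Prefix.Length) : null;
    }
}
```

Since the full path is part of the URL, two files with the same name in different folders still produce distinct URL values, which is the property the UI depends on.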

Description of functions used to create fields in a Document:

| Function | Description |
|---|---|
| CreateTextField(name, value) | Creates a field optimized for full-text search. The content of the field cannot be retrieved from the index. |
| CreateValueField(name, value) | Creates a field optimized for value search (such as dates, GUIDs etc). The content of the field cannot be retrieved from the index. |
| CreateDataField(name, value) | Creates a field returned in the search result. It is not possible to search for values in such fields. |

Note: These functions are just helpers, and it is also possible to use the Lucene.Net API here.

10. The AddContent function extracts text from the indexed files. It relies on two example functions for getting the content:

  • AddXmlContent to extract information from XML files (XAML controls, layouts or configuration files)
  • AddTextContent to load content from UTF-8 encoded text files (ASPX, CSS and JS files)

AddXmlContent is tried first. If it fails, for example because the file is not well-formed XML, AddTextContent is used instead.

    protected virtual void AddContent(Document document, FileInfo file)
    {
      if (this.AddXmlContent(document, file))
      {
        return;
      }
      if (this.AddTextContent(document, file))
      {
        return;
      }
    }

11. The AddTextContent function reads the file as UTF-8 text and adds the content to the search index. Notice that the content is stored under BuiltinFields.Content, which is the field that queries performed by the Sitecore.Search framework search by default.

    protected virtual bool AddTextContent(Document document, FileInfo file)
    {
      try
      {
        using (var reader = new StreamReader(file.FullName, Encoding.UTF8))
        {
          document.Add(this.CreateTextField(BuiltinFields.Content, reader.ReadToEnd()));
        }
        return true;
      }
      catch
      {
        // The file could not be read (for example, it is locked or access is denied).
        // Skip its content; the common fields are still indexed.
      }
      return false;
    }

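12. The AddXmlContent function extracts content from XML files. It parses the file as XML and adds the text content and attribute values to the index. Element and attribute names are not added to the index. This function also detects "TODO" markers in comments, text nodes and attribute values and boosts the content that contains them.

    protected virtual bool AddXmlContent(Document document, FileInfo file)
    {
      try
      {
        using (var reader = new StreamReader(file.FullName))
        {
          var xreader = new XmlTextReader(reader);
          while (xreader.Read())
          {
            if (xreader.NodeType == XmlNodeType.Element)
            {
              // Read() never stops on attribute nodes, so visit them explicitly.
              while (xreader.MoveToNextAttribute())
              {
                this.AddXmlValue(document, xreader.Value);
              }
              xreader.MoveToElement();
            }
            else if (xreader.NodeType == XmlNodeType.Text ||
                     xreader.NodeType == XmlNodeType.CDATA ||
                     xreader.NodeType == XmlNodeType.Comment)
            {
              this.AddXmlValue(document, xreader.Value);
            }
          }
        }
        return true;
      }
      catch
      {
        // Not well-formed XML, or the file could not be read: report failure
        // so that AddContent can fall back to AddTextContent.
      }
      return false;
    }

    // Helper introduced for this example: adds one value to the content field,
    // boosted when the value contains "TODO".
    private void AddXmlValue(Document document, string value)
    {
      float boost = 1.0f;
      if (value.IndexOf("TODO", StringComparison.InvariantCultureIgnoreCase) >= 0)
      {
        boost = 5.0f;
      }
      document.Add(this.CreateTextField(BuiltinFields.Content, value, boost));
    }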
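
The extraction pass can be tried outside Sitecore with a small standalone sketch. It assumes nothing beyond System.Xml; XmlTextSketch, ExtractValues and HasTodo are illustrative names and are not part of Sitecore.Search. One detail worth knowing when adapting it: XmlReader.Read() does not stop on attribute nodes, so the sketch walks attributes with MoveToNextAttribute.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlTextSketch
{
    // Returns text, CDATA, comment and attribute values in document order.
    // Element and attribute names are skipped.
    public static List<string> ExtractValues(string xml)
    {
        var values = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Read() does not stop on attributes; visit them explicitly.
                        while (reader.MoveToNextAttribute())
                        {
                            values.Add(reader.Value);
                        }
                        reader.MoveToElement();
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                    case XmlNodeType.Comment:
                        values.Add(reader.Value);
                        break;
                }
            }
        }
        return values;
    }

    // The same case-insensitive "TODO" check the crawler uses to decide on a boost.
    public static bool HasTodo(string value)
    {
        return value.IndexOf("TODO", StringComparison.InvariantCultureIgnoreCase) >= 0;
    }
}
```

Passing a string that is not well-formed XML makes ExtractValues throw an XmlException, which corresponds to the case where the crawler catches the error and falls back to reading the file as plain text.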

You have now successfully created a Sitecore.Search File Crawler.

Step 2: How to Display Search Results in the Desktop
