10 November 2013

Search with Sitecore - Article - 3 - Crawler to Crawl Media File like PDF, Document

Dear friends, this is the third article in this series. The full list of articles is given below for reference:

1) Search with Sitecore - Article-1 - Introduction 
2) Search with Sitecore - Article - 2 - Configure Advance Database Crawler and More
3) Search with Sitecore - Article - 3 - Crawler to Crawl Media File like PDF, Document  
4) Search with Sitecore - Article - 4 - Create Search API
5) Search with Sitecore - Article - 5 - Auto Complete For Search

Continuing the series, today we will see how I wrote my own custom crawler and configured it so that it can crawl PDF files, Word documents, and similar media files.

There are two places where we need to add code to our Sitecore solution. The first is a class that can crawl Sitecore media items. The second is a new index, which we will name 'Documents', configured to use this class, so that the next time you rebuild indexes with the Sitecore index-building wizard, the new index appears in the wizard's list and can be built.
The following is the typical skeleton of such a class; you may name the class anything you like.

using Sitecore.Search;
using Sitecore.Search.Crawlers;

public class FileCrawler : BaseCrawler, ICrawler
{
    // Set automatically from the values in the index configuration.
    public string Root { get; set; }
    public string Database { get; set; }
    public float Boost { get; set; }

    // Run-time counters, used for logging.
    long _totalProcessedSize;
    int _fileCount;
    int _successCount;
    int _failureCount;

    // Called once when the index is set up.
    void ICrawler.Initialize(Index index)
    {
    }

    // Called when the index is built; documents are added through 'context'.
    void ICrawler.Add(IndexUpdateContext context)
    {
    }
}

As you may note, the class above has a few properties, namely Database, Root and Boost, which are set automatically from the configuration values you specify. The remaining fields hold run-time values that help with logging and with tracking how many files were processed.

Let's assume for now that our class is ready; the next step is to configure the new index. You may configure it in the Web.config file itself, or you may create a new config file under the folder "~\App_Config\Include\Shared\" with any name you like, for example "SearchSettings.config". Sitecore reads this config file automatically. The contents of the file should be as follows.

<configuration xmlns:x="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <search>
      <configuration>
        <indexes>
          <index id="Documents" type="Sitecore.Search.Index, Sitecore.Kernel">
            <param desc="name">$(id)</param>
            <param desc="folder">Documents</param>
            <Analyzer ref="search/analyzer" />
            <locations hint="list:AddCrawler">
              <web type="YourNameSpace.Crawler.FileCrawler, YourAssemblyName">
                <Database>web</Database>
                <Root>/sitecore/media library</Root>
                <Tags>documents</Tags>
                <Boost>4.0</Boost>
              </web>
            </locations>
          </index>
        </indexes>
      </configuration>
    </search>
  </sitecore>
</configuration>

If you have configured the above index correctly and your class is in the right place, build your solution and check the Sitecore index-building wizard to see whether your newly created index appears. Attach the Visual Studio debugger, put a breakpoint on "void ICrawler.Add(IndexUpdateContext context)", and run the wizard. You will see the breakpoint being hit, and you can inspect the 'context' parameter, which is what the crawler uses to add documents to the index.

How does that happen?
As you can see in the configuration above, we tell Sitecore to read the items for index building from the 'web' database, starting at '/sitecore/media library'. You can look at the remaining settings later; each has its own significance and customizes how the index is built. So one thing is clear: we are reading media files from the Sitecore Media Library. You may have a different requirement, such as parsing files stored in the file system or in a relational database. In that case you need to modify the above configuration, but you still write the crawler class the same way.
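
To illustrate the file-system case, here is a minimal sketch, not part of the original solution, of how a crawler might collect candidate files from a folder on disk instead of the Media Library. The class name FileSystemSource and the method FindIndexableFiles are hypothetical, and the sketch assumes the Root value in the configuration would hold a folder path. It uses only System.IO and no Sitecore API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FileSystemSource
{
    // Extensions the crawler in this article knows how to parse.
    // Path.GetExtension returns the extension with its leading dot.
    static readonly HashSet<string> Indexable =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".pdf", ".docx", ".xml", ".txt" };

    // Hypothetical helper: lists the indexable files below a folder, including subfolders.
    public static IEnumerable<string> FindIndexableFiles(string rootFolder)
    {
        if (string.IsNullOrEmpty(rootFolder) || !Directory.Exists(rootFolder))
            return Enumerable.Empty<string>();

        return Directory.GetFiles(rootFolder, "*", SearchOption.AllDirectories)
                        .Where(path => Indexable.Contains(Path.GetExtension(path)));
    }
}
```

Each returned path could then be opened with File.OpenRead and handed to the same kind of parsing method that is used for media streams.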

So far we have done only the preparatory work: it gives us a place to write our code, and Sitecore is able to call it. That's great, because once you know where your code goes, you can do anything.

First, the basics. Lucene understands textual data, so for each media file we need to feed its text into a separate Lucene Document, along with whatever supporting attributes we have. But PDF and Word files are not simple text files, so we need the ability to parse them. For that we have to rely on third-party libraries, and there are several available. I will not get into the debate over which is best, as that depends on project cost, performance requirements, and so on.

For parsing PDF files I am using "iText PDF" (iTextSharp, its .NET port), and for reading Word documents "Open XML SDK 2.0 for Microsoft Office". I know of a few limitations with these two libraries: text that exists only as images inside a PDF cannot be extracted, and Office documents that do not follow the Open XML specification (those created prior to MS Office 2007) cannot be parsed.
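
Because of the second limitation, it can help to check whether a Word file is really an Open XML package before handing it to the SDK. The following is a minimal sketch under my own assumptions, not part of the original crawler: WordFormatSniffer and LooksLikeOpenXml are hypothetical names, and the stream is assumed to be seekable. It relies on the fact that a .docx file is a ZIP package and so starts with the bytes "PK" 0x03 0x04, whereas a legacy .doc file starts with 0xD0 0xCF 0x11 0xE0.

```csharp
using System.IO;

public static class WordFormatSniffer
{
    // Returns true when the stream starts with the ZIP local-file signature,
    // which is how an Open XML (.docx) package begins. A true result only means
    // "could be Open XML": any other ZIP file starts the same way.
    public static bool LooksLikeOpenXml(Stream stream)
    {
        // Assumption: a non-seekable stream cannot be rewound, so it is not inspected.
        if (stream == null || !stream.CanRead || !stream.CanSeek)
            return false;

        long start = stream.Position;
        var header = new byte[4];
        int total = 0;
        int read;
        // Read may return fewer bytes than requested, so loop until 4 bytes or end of stream.
        while (total < header.Length &&
               (read = stream.Read(header, total, header.Length - total)) > 0)
        {
            total += read;
        }
        stream.Position = start; // rewind so the real parser sees the whole file

        return total == 4
            && header[0] == 0x50 && header[1] == 0x4B
            && header[2] == 0x03 && header[3] == 0x04;
    }
}
```

A crawler could call this before opening the document and count the file as skipped when it returns false, rather than relying on an exception from the parser.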

The following is the full code for the class. Please remember to add references to the third-party and Lucene .dll files in your project.

using Lucene.Net.Documents;
using Sitecore;
using Sitecore.Search;
using Sitecore.Search.Crawlers;
using System.Diagnostics;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using Sitecore.Data.Items;
using System;
using System.Xml;
using DocumentFormat.OpenXml.Packaging;

namespace YourNameSpace.Crawler
{
    public class FileCrawler : BaseCrawler, ICrawler
    {
        public string Root { get; set; }
        public string Database { get; set; }
        public float Boost { get; set; }
        long _totalProcessedSize;
        int _fileCount;
        int _successCount;
        int _failureCount;

        void ICrawler.Add(IndexUpdateContext context)
        {
            _fileCount = 0;
            _totalProcessedSize = 0;
            _successCount = 0;
            _failureCount = 0;
            Stopwatch watch = Stopwatch.StartNew();
            ParseMediaLibraryFiles(context);
            watch.Stop();
            // Write a summary of the crawl to the Sitecore log.
            Sitecore.Diagnostics.Log.Info(string.Format("Finished parsing files -- Total files:{0}(Errors:{1}-Success:{2}) -- {3}m:{4}s:{5}ms -- Total bytes {6}", _fileCount, _failureCount, _successCount, watch.Elapsed.Minutes, watch.Elapsed.Seconds,
                watch.Elapsed.Milliseconds, _totalProcessedSize), this);
        }

        private void ParseMediaLibraryFiles(IndexUpdateContext context)
        {
            if (Root != null && Root.Length > 0)
            {
                Sitecore.Data.Database db = Sitecore.Data.Database.GetDatabase(Database);
                if (db != null)
                {
                    this.AddRecursive(db.GetItem(Root), context);
                }
            }
        }

        private void AddRecursive(Item itm, IndexUpdateContext context)
        {
            if (itm != null && itm.HasChildren)
            {
                foreach (Item singleItm in itm.Children.InnerChildren)
                {
                    //Specify the Media Folder Template ID '{FE5DD826-48C6-436D-B87A-7C4210C7413B}'
                    if (singleItm.Template.ID.ToString().Equals("{FE5DD826-48C6-436D-B87A-7C4210C7413B}", StringComparison.InvariantCultureIgnoreCase))
                    {
                        this.AddRecursive(singleItm, context);
                    }
                    else
                    {
                        var document = new Document();
                        document = AddFileContent(document, singleItm, singleItm.ID.ToString());
                        if (document != null)
                            context.AddDocument(document);
                    }
                }
            }
        }

        void ICrawler.Initialize(Index index)
        {

            //Log.Info("File Crawler Init", this);

        }

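        bool IsIndexableFile(MediaItem media)
        {
            if (media == null || string.IsNullOrEmpty(media.Extension))
                return false;

            // Compare the whole extension and ignore case, so that "PDF" is accepted
            // and extensions that merely contain a supported name are not.
            string[] supported = { "pdf", "xml", "txt", "docx" };
            return Array.Exists(supported, ext => ext.Equals(media.Extension, StringComparison.OrdinalIgnoreCase));
        }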

        private Document AddFileContent(Document document, MediaItem media, string itemID)
        {
            _fileCount++;
            try
            {
                // Returning null for unsupported items keeps empty documents out of the index.
                if (document == null || !IsIndexableFile(media))
                    return null;

                // Open the media stream once; the parse methods below also close it when they finish.
                using (Stream stream = media.GetMediaStream())
                {
                    if (stream == null || !stream.CanRead)
                    {
                        _failureCount++;
                        return null;
                    }

                    _totalProcessedSize += media.Size;
                    string extension = media.Extension.ToLowerInvariant();
                    if (extension == "pdf")
                    {
                        document.Add(this.CreateTextField(BuiltinFields.Content, this.ParsePDF(stream, media.Name)));
                    }
                    else if (extension == "xml")
                    {
                        this.AddXmlContent(document, stream);
                    }
                    else if (extension == "txt")
                    {
                        this.AddTextContent(document, stream);
                    }
                    else if (extension == "docx")
                    {
                        document.Add(this.CreateTextField(BuiltinFields.Content, this.AddWordContent(stream)));
                    }
                }

                // "?? string.Empty" guards against null values without a separate helper class.
                document.Add(this.CreateTextField(BuiltinFields.Name, media.Name ?? string.Empty));
                document.Add(this.CreateDataField(BuiltinFields.Icon, media.Icon ?? string.Empty));
                document.Add(this.CreateTextField(BuiltinFields.Tags, media.Alt ?? string.Empty));
                document.Add(this.CreateValueField(BuiltinFields.Template, media.InnerItem.Template.ID.ToString()));
                document.Add(this.CreateValueField(BuiltinFields.TemplateName, media.InnerItem.TemplateName ?? string.Empty));
                document.Add(this.CreateDataField(BuiltinFields.Tags, itemID, 5.0f));
                document.Add(this.CreateTextField(BuiltinFields.Database, this.Database));
                document.Add(this.CreateTextField(BuiltinFields.Language, Sitecore.Context.Language.ToString()));
                _successCount++;
            }
            catch
            {
                _failureCount++;
                document = null;
            }
            return document;
        }

        protected string AddWordContent(Stream mediaStream)
        {
            const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
            StringBuilder textBuilder = new StringBuilder();
            try
            {
                using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(mediaStream, false))
                {
                    // Manage namespaces to perform XPath queries.  
                    NameTable nt = new NameTable();
                    XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
                    nsManager.AddNamespace("w", wordmlNamespace);

                    // Get the document part from the package.  
                    // Load the XML in the document part into an XmlDocument instance.  
                    XmlDocument xdoc = new XmlDocument(nt);
                    xdoc.Load(wdDoc.MainDocumentPart.GetStream());

                    XmlNodeList paragraphNodes = xdoc.SelectNodes("//w:p", nsManager);
                    if (paragraphNodes != null)
                    {
                        foreach (XmlNode paragraphNode in paragraphNodes)
                        {
                            XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t", nsManager);
                            if (textNodes != null)
                            {
                                foreach (XmlNode textNode in textNodes)
                                    textBuilder.Append(textNode.InnerText);
                            }
                            textBuilder.Append(Environment.NewLine);
                        }
                    }
                }
            }
            catch (Exception)
            {
                _failureCount++;
            }
            finally
            {
                mediaStream.Close();
            }
            return textBuilder.ToString();
        }

        protected void AddTextContent(Document document, Stream mediaStream)
        {
            try
            {
                using (var reader = new StreamReader(mediaStream, Encoding.UTF8))
                {
                    document.Add(this.CreateTextField(BuiltinFields.Content, reader.ReadToEnd()));
                }
            }
            catch
            {
                _failureCount++;
            }
            finally
            {
                if (mediaStream != null)
                    mediaStream.Close();
            }
        }

        protected void AddXmlContent(Document document, Stream mediaStream)
        {
            XmlTextReader xreader = null;
            try
            {
                using (var reader = new StreamReader(mediaStream))
                {
                    xreader = new XmlTextReader(reader);
                    while (xreader.Read())
                    {
                        if (xreader.NodeType == XmlNodeType.Text ||
                        xreader.NodeType == XmlNodeType.Attribute ||
                        xreader.NodeType == XmlNodeType.CDATA ||
                        xreader.NodeType == XmlNodeType.Comment)
                        {
                            float boost = 1.0f;
                            if (xreader.Value.IndexOf("TODO",
                            StringComparison.InvariantCultureIgnoreCase) >= 0)
                            {
                                boost = 5.0f;
                            }
                            document.Add(this.CreateTextField(BuiltinFields.Content, xreader.Value, boost));
                        }
                    }
                }
            }
            catch (Exception)
            {
                _failureCount++;
            }
            finally
            {
                if (xreader != null)
                    xreader.Close();

                if (mediaStream != null)
                    mediaStream.Close();
            }
        }

        protected string ParsePDF(Stream mediaStream, string fileName)
        {
            var returnText = new StringBuilder();
            PdfReader pdfReader = null;
            try
            {
                if (mediaStream != null)
                {
                    pdfReader = new PdfReader(mediaStream);
                    // Loop through the pages; page numbers start at 1.
                    // The reader must stay open until every page has been read.
                    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                    {
                        returnText.AppendLine(PdfTextExtractor.GetTextFromPage(pdfReader, page));
                    }
                }
            }
            catch (Exception)
            {
                _failureCount++;
            }
            finally
            {
                if (pdfReader != null)
                    pdfReader.Close();
                if (mediaStream != null)
                    mediaStream.Close();
            }
            return returnText.ToString();
        }
    }
}

Once the project builds successfully, trigger the index-building wizard. When the wizard completes, you should see a new folder named 'Documents' under the index directory.
Figure 3.1

Summary: In this article we solved the problem of building indexes from Sitecore media files such as PDF and Word documents. This brings us one step closer to our end goal of enterprise search. Please let me know in the comments if you have any further questions.
In the next article I will write about how to use the Sitecore and Lucene APIs to search. Before that, we will create a few more indexes targeted at specific nodes in Sitecore, such as a product catalog.

