If you were handed this post and asked to find a certain keyword or phrase, you would have two options:
- Fully read the text until you find it
- Scan through the text and hope you come across it
As you can tell, the first option is slow but reliable, while the second is quick but undependable. And if this post grew to span several web pages? Both methods would be slow and difficult.
This logic applies to computer searches as well. Today, our searches often span millions, or even billions, of documents, which makes scanning the full text of every document at query time nearly impossible, even for computers.
Instead, the process requires dividing the problem into two tasks: indexing and searching.
What is an index?
Whether the collection is as small as a textbook, as large as an enterprise file share, or as vast as the World Wide Web, an index can improve the speed and efficiency of data retrieval. In the simplest terms, an index is a key for locating information. The indexing stage scans anything from metadata to the full text of documents or pages and builds a list of search terms, each with pointers to where the relevant material can be found.
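To make the idea concrete, here is a minimal sketch of such an index in Python. It is an illustration only, not how any particular product does it: the function names `tokenize` and `build_index` and the three sample documents are made up for this example, and real indexers also handle things like word stemming and common "stop" words.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each search term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

# Hypothetical sample documents, keyed by a made-up document ID.
documents = {
    "doc1": "An index is a key for locating information.",
    "doc2": "Full-text search looks up terms in the index.",
    "doc3": "Lucene is an open-source search library.",
}

index = build_index(documents)
print(sorted(index["index"]))   # ['doc1', 'doc2']
print(sorted(index["search"]))  # ['doc2', 'doc3']
```

Each term acts as the "key", and the set of document IDs plays the role of the pointers to where the material can be found.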
What is a full-text search?
When a search query (which could include words or phrases) is executed, the index is consulted rather than the original text of every document. Searching the index not only speeds up data retrieval, it can also return results ordered from most to least relevant, because the statistics used for ranking are collected when the index is built.
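A rough sketch of that lookup in Python might look like the following. The names (`tokenize`, `build_index`, `search`) and the sample documents are assumptions for illustration, and the ranking is deliberately naive: it simply adds up how often each query word appears in a document. Production search engines use more refined relevance formulas, but the principle of scoring from statistics stored in the index is the same.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each term to {doc_id: occurrence count}, collected at index time."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

def search(index, query):
    """Rank documents by summed term counts, consulting only the index."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken by document ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical sample documents.
documents = {
    "recipes": "Dinner recipe ideas: a quick pasta recipe and a soup recipe.",
    "legal": "Document production during trial relies on search.",
    "blog": "Search the index to find a recipe or a document.",
}

index = build_index(documents)
results = search(index, "recipe search")
print(results)  # ['recipes', 'blog', 'legal']
```

Note that `search` never touches the original text: the word counts it needs for ranking were stored when the index was built.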
Beyond the basics
A number of software products have been developed primarily to perform full-text indexing and search. One of the best known and most widely used is Apache Lucene, a free and open-source full-text indexing and search library.
Originally written in Java, Lucene has since been ported to, or wrapped for, other programming languages such as Python, Perl, and C++. Its search and indexing process isn’t much different from what I described above. Put simply, it searches an index instead of searching the text directly. Lucene can be used to add search functionality to a website, a file system, or an application such as a SQL/NoSQL database (see SQL vs. NoSQL blog for the basics). Lucene is so widely used because of its solid, well-documented technology… and it’s free.
If you haven’t heard of Lucene, you might have heard of Elasticsearch or Solr, since both are used by large organizations like Twitter, LinkedIn, and Instagram. In fact, Solr and Elasticsearch are both search servers built on top of Lucene.
What does indexing mean for you?
For some, efficient search engine indexing means quickly finding the perfect dinner recipe on Google. For others, it could mean millions of dollars in legal and staffing costs. Within the information governance space, search engine indexing is vital to the corporate eDiscovery process. From keyword selection in early case assessment (ECA) to document production during trial (if it comes to that), knowing what data exists and where it lives is just as important as being able to produce documents on demand.
The right index and search capabilities can make all the difference. See how one Fortune 500 company was able to eliminate the risks and costs of their manual eDiscovery process here.