This article asks, "What is your assessment of today's enterprise search industry?" I thought I'd chip in.
What's done right
Today's Enterprise Search products have effective answers for content ingestion and query performance.
Any product that is successful at all has an answer for content ingestion. It's a complex problem because you need to interact with many kinds of system, but it's a solved problem: a vendor who hasn't solved this problem would not be successful at all.
Query throughput is easy to handle with horizontal replication.
After that, there's a concern about latency, but the best answer to that is to have the search engine "do more with less" by optimizing algorithms and data structures. Developers oriented towards performance work can be found in the video game industry and other pockets of the software industry -- so long as you make it a priority, it's tractable in terms of both business and technology.
Enterprise search products are often built around Lucene. Lucene 3 had a lot of good traits, but also fundamental flaws.
Strings in the Java language, on which Lucene is based, are encoded as fixed-width 16-bit units (UTF-16). ASCII characters, used heavily in most market areas, get doubled in size. When you're looking at gigabytes of documents, this is a big deal. The Fedora Linux distribution rejected Lucene for a desktop search tool ten years ago because of this overhead.
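The doubling is easy to check for yourself. Here is a minimal sketch, in Python rather than Java for brevity, that counts the bytes needed for the same ASCII text in UTF-16 and in UTF-8. The sample string and the function name are made up for illustration.

```python
# Compare the storage cost of ASCII text in UTF-16 (the classic in-memory
# encoding of Java strings) and in UTF-8. The sample text is arbitrary.
def encoded_sizes(text):
    """Return the number of bytes needed to store `text` in each encoding."""
    return {
        "utf-16": len(text.encode("utf-16-le")),  # 2 bytes per ASCII character, no BOM
        "utf-8": len(text.encode("utf-8")),       # 1 byte per ASCII character
    }

sample = "enterprise search"
sizes = encoded_sizes(sample)
print(sizes)
```

For pure ASCII input the UTF-16 figure is exactly twice the UTF-8 figure; text in other scripts narrows or reverses the gap.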
Lucene 4 represents text as UTF-8, speeds up general operations by at least a factor of two, and speeds up many specific operations by hundreds of times. The design has improved dramatically, making it much easier to engineer substantial changes to the scoring algorithms.
Many organizations have a code base in Lucene 3, but from my viewpoint, it's malpractice to do maintenance work on a Lucene 3 system, because in the long term, it can't compete with a Lucene 4 system.
The science of relevance
There's a quote that circulates in the business literature, which goes something like "You can't improve what you can't measure". It's been misattributed to W. Edwards Deming and others, but I like the way it is used in J.F. Lawton's 1997 book The Selling Bible -- he talks to successful salespeople and finds that they know what percentage of customers they can sell, then talks to the "losers in the lounge" and draws a blank when he asks that question.
The best case study I can think of for relevance work is IBM Watson. When some IBMers got the idea to compete at Jeopardy, they built a demo system based on an existing search engine and got this result:
The dark line is the performance of the demo, and the cluster of dots higher up is the performance of winning Jeopardy players. Most of the players are in grey, but the dark ones to the right are from Ken Jennings, the record holder that Watson needed to beat.
The chart is intimidating: if you were up against this and chose to give up, I wouldn't blame you.
Over some years of work, IBM systematically improved the performance of Watson until it hit the target.
Now, the strategy and the software framework behind Watson had this capacity, but it couldn't have gotten close to the goal without a systematic program of evaluation.
Evaluation has many virtues, the most fundamental of which is comparing two versions and deciding which is better. You and I can think of many things which seem like they'd improve the relevance of a search engine, but if you try them, you might find things stay the same or get worse.
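To make "comparing two versions" concrete, here is a minimal sketch using mean average precision, one common metric for this kind of comparison. The queries, document ids, judgments and the two "versions" are all invented for illustration; a real evaluation would use hundreds of judged queries.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant doc ids."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant hit
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs, judgments):
    """Mean of average precision over every judged query."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in judgments.items()]
    return sum(scores) / len(scores)

# Hypothetical relevance judgments and the rankings from two engine versions.
judgments = {"q1": {"d1", "d3"}, "q2": {"d2"}}
version_a = {"q1": ["d1", "d2", "d3"], "q2": ["d1", "d2"]}
version_b = {"q1": ["d2", "d1", "d3"], "q2": ["d2", "d1"]}

map_a = mean_average_precision(version_a, judgments)
map_b = mean_average_precision(version_b, judgments)
print(f"A: {map_a:.3f}  B: {map_b:.3f}")
```

In this toy data version B is worse on q1 and better on q2, and only the aggregate number tells you it comes out ahead overall. That is exactly the kind of call that is hard to make by eyeballing results.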
Industry and academic researchers participate in the yearly TREC (Text REtrieval Conference), which is organized around a group of Kaggle-like competitions where participants try to get the best results with a specific set of documents and queries.
It's an expensive process for a few reasons. First, you need to have hundreds of queries, annotating thousands of possible search results as valid or not. You'll need to load a substantial set of documents (gigabytes if not terabytes) and then run all of the queries. You might want to try this hundreds of times with different combinations of parameters, not to mention fixing the bugs that will certainly turn up. If your culture doesn't put devops first, you'll spend a huge amount of human time running those tests.
At least if you use the artifacts that TREC creates, you get a tolerable set of judgements. You'll certainly get better results if you optimize for your own documents, but then you've got to create your own judgements.
If you talk to Enterprise Search vendors you'll find that a few of them participate in TREC or use it internally, but the overwhelming majority do not.
What they tell me, and I believe it, is that customers don't see enough value in relevant search results to pay for evaluation work. If it's good enough to make the sale, it's good enough. One objection to the mainstream TREC work is that TREC rewards the quality of the 500th search result, something that doesn't matter in some fields, like web search, where users only look at the first 10 results.
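The objection about the 500th result can be illustrated with a small sketch. It compares precision at 10, which only sees the first page, with average precision computed over a 500-deep ranking. The two systems and all document ids are invented; they agree on the first ten results and differ only in where one late relevant document lands.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k

def average_precision(ranking, relevant):
    """Average precision over the full depth of the ranking."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical data: two relevant documents on the first page, one found late.
relevant = {"r1", "r2", "late"}
fillers = [f"n{i}" for i in range(497)]            # non-relevant documents
top10 = ["r1", "r2"] + fillers[:8]
system_a = top10 + ["late"] + fillers[8:]          # "late" at rank 11
system_b = top10 + fillers[8:] + ["late"]          # "late" at rank 500

for name, run in (("A", system_a), ("B", system_b)):
    print(name, precision_at_k(run, relevant), round(average_precision(run, relevant), 4))
```

A user who never leaves the first page sees no difference between the two systems, yet the deep metric separates them. Whether that separation is worth paying for depends on how your users actually search.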
Although it's always been easy to tweak Lucene to prioritize certain fields and do other ad-hoc tricks which ought to improve relevance, it's been unusual to see Lucene-based competitors in TREC because: (i) the Lucene 3 scoring engine is nowhere near competitive on TREC, and (ii) changing the scoring engine to something better was maddeningly difficult and often resulted in terrible performance loss.
The good news is that Lucene 4 now has pluggable Similarity engines. In particular, it contains implementations of the modern Language Modelling approach, which is a dramatic improvement over the old tf*idf scoring in itself, as well as being a rational foundation to build even better systems.
So far as is publicly known, the LM similarity is little used because getting good results on it depends on choosing a "smoothing" function which addresses the poor sample size we get when we're looking at rare words. Lucene 4 currently implements two smoothing algorithms out of several that are in the literature. The successful use of LM in Lucene is a matter of trying out algorithms and their parameters to get the best result, a task that, unfortunately, nobody is doing openly.
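To show what smoothing does, here is a minimal sketch of query-likelihood scoring with two textbook smoothing methods, Jelinek-Mercer and Dirichlet, which as far as I know are the two that Lucene 4 ships. This is the textbook form, not Lucene's exact scoring formula. The tiny corpus, the function names, and the lam and mu values are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def collection_model(docs):
    """Maximum-likelihood unigram probabilities over the whole collection."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def jelinek_mercer(tf, doc_len, p_coll, lam=0.5):
    """Linear interpolation of the document model with the collection model."""
    return (1 - lam) * (tf / doc_len) + lam * p_coll

def dirichlet(tf, doc_len, p_coll, mu=2000.0):
    """Dirichlet prior: adds mu pseudo-counts drawn from the collection model."""
    return (tf + mu * p_coll) / (doc_len + mu)

def score(query, doc, p_collection, smooth):
    """Log-likelihood of the query under the smoothed model of one document.

    Query terms that never occur in the collection are skipped (an assumption
    made here to keep the sketch short).
    """
    tf = Counter(doc)
    return sum(
        math.log(smooth(tf[t], len(doc), p_collection[t]))
        for t in query
        if t in p_collection
    )

# Hypothetical three-document corpus, already tokenized.
docs = [
    ["lucene", "search", "engine"],
    ["search", "ranking", "ranking", "model"],
    ["video", "game", "console"],
]
query = ["ranking", "engine"]
p_coll = collection_model(docs)

jm_scores = [score(query, d, p_coll, jelinek_mercer) for d in docs]
dir_scores = [score(query, d, p_coll, dirichlet) for d in docs]
print(jm_scores)
print(dir_scores)
```

No document here contains both query terms, so an unsmoothed model would give every one of them a probability of zero. Smoothing keeps the scores finite and still ranks the document with no matching terms last. Tuning lam and mu against a judged query set is the evaluation work described above.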