-
One of the hard parts about organizing a conference like the 5th annual Text Analytics Summit, held last week in Boston, must be selecting the industry case studies. Text analytics is a highly specialized but broad-reaching topic that has applications in life sciences, financial services, legal, retail, government, media, and entertainment, to name a few. Any one of these industries could have filled the conference with interesting examples.
As it was, most of the case studies and vendor briefings at this conference were about Voice of the Customer or Market Intelligence. I suspect that some attendees might have preferred a little more variety in the cases presented. The absence of any government case studies, for example, was conspicuous, but understandable given the special nature of that domain. We’d all probably have needed security clearances to attend those sessions anyway. Overall, I appreciated the more commercial/consumer focus and felt that the conference organizers did a great job of finding representative examples and balancing the practical (vendor briefings and case studies) with the theoretical.
As a first time attendee to the conference, I was interested in just getting the lay of the land in text analytics, but I was also interested to learn how people were answering the “what’s next” question. It came up several times over the 2 days during Q&A and panel sessions and there were different takes, but I paid close attention to three, in particular, that resonated with my own observations looking through the lens of enterprise search.
Trend 1: ETL-like Tools
Ok, this is not really a trend in text analytics, but it is one in enterprise search that is informed by text and data analytics.
Many of the vendors at the conference demonstrated graphical tools designed to simplify the process of building text analysis “pipelines”. These tools look very much like the Extract, Transform, and Load (ETL) tools that have been around for many years in the data integration world. The difference is that the text analysis versions of these tools focus on operations for handling unstructured text. For example, named entity recognition is a common text analytics task for automatically recognizing and tagging things like person names, company names, and locations in text.
This ETL “pattern” exists in enterprise search, as well, where information must be extracted from a source repository (e.g. an email archive), transformed into an enhanced, canonical representation (e.g. annotated XML), and loaded into a database or index for searching. The demand for graphical tools to manage the ETL process for search has not been as high as for text or data analysis. I think this is partly because, for search applications, it is usually a one-time setup process and not an iterative modeling exercise as it is with text analytics. It may also be because historically the operations performed on content before it’s indexed for search have not been as sophisticated as the operations performed for in-depth text analytics.
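To make the extract, transform, and load pattern concrete, here is a minimal Python sketch of a search-style pipeline. Everything in it is an assumption made for illustration: the `Document` class, the tiny `KNOWN_COMPANIES` gazetteer, the title-based person pattern, and the in-memory index are stand-ins, not how any particular product works. A real entity recognizer would use trained models rather than a word list.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Canonical representation passed between pipeline stages."""
    source_id: str
    text: str
    annotations: dict = field(default_factory=dict)


def extract(raw_records):
    """Extract: pull raw text out of a source repository (here, a list of dicts)."""
    for record in raw_records:
        yield Document(source_id=record["id"], text=record["body"])


# Hypothetical gazetteer, for illustration only.
KNOWN_COMPANIES = {"Acme", "Globex"}


def tag_entities(doc):
    """Transform: annotate known company names and titled person names."""
    doc.annotations["companies"] = sorted(
        name for name in KNOWN_COMPANIES
        if re.search(rf"\b{re.escape(name)}\b", doc.text)
    )
    doc.annotations["people"] = re.findall(r"\b(?:Mr|Ms|Dr)\. [A-Z][a-z]+\b", doc.text)
    return doc


def load(docs, index):
    """Load: add each annotated document to an in-memory inverted index."""
    for doc in docs:
        for token in set(re.findall(r"\w+", doc.text.lower())):
            index.setdefault(token, set()).add(doc.source_id)
        # Entities get their own field-style keys so they can be searched precisely.
        for company in doc.annotations["companies"]:
            index.setdefault(f"company:{company.lower()}", set()).add(doc.source_id)
    return index


def run_pipeline(raw_records):
    """Chain the three stages: extract -> transform -> load."""
    return load((tag_entities(d) for d in extract(raw_records)), {})
```

Calling `run_pipeline` on a list of `{"id": ..., "body": ...}` records returns an index in which a key like `company:acme` finds documents that mention the company as an entity, separately from the plain word matches.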
This is changing. To start, extensible pipeline processing frameworks that incorporate advanced text analysis capabilities have become more common in enterprise search products. By now, most mainstream enterprise search platforms include entity extractors, for example. We are also seeing more ETL-like graphical consoles for managing content integration and analysis.
The adoption of these tools and techniques for enterprise search is motivated, in part, by a desire to more easily harness text analytics features that increase search precision and create richer search experiences. It’s also the case that, while text analytics shares a heritage more with information retrieval (search) than with business intelligence (BI), it includes technologies relevant to both and sits smack in the middle of the convergence between these two spaces. Sue Feldman and Hadley Reynolds of IDC reinforced this role of text analytics by describing it as a cornerstone of Unified Information Access during their Market Report at the conference. Given this, it shouldn’t be surprising to see that, as text analytic tools and concepts have found their way into BI applications, traditional BI tools and concepts, like ETL, are finding a place within enterprise search.
Trend 2: Empowering the End User
Another topic that popped up at various times during the conference was the challenge of delivering the richness of text analysis tools to users other than specially trained analysts. As with traditional BI tools, many text analysis tools assume a trained user or “analyst” capable of designing sophisticated workflows or advanced analytical models. One question posed to a speaker after he finished describing his text mining process was “when do you think you’ll be out of your job?” - meaning, when will the tools be so easy to use that your end users won’t need you to do their investigation for them?
I’m sure this exact question was asked at a conference of professional research librarians some 15-20 years ago - back when online search services and later Internet search engines were becoming easier and easier to use and obviating the need for “professional searchers”. The answer is likely the same, too. There will always be specialists and “power users”, but as the tools become easier to use, end users will become more empowered to do their own increasingly advanced analysis.
In practice, we are seeing more applications that combine conventional search with advanced text analytics in ways that bring a more powerful search experience to relatively unsophisticated end users. Silobreaker.com is a clever site that combines the richness of text analytics within what is fundamentally a news search application. Unlike other news search sites, Silobreaker offers options and tools that help to uncover and discover interesting and potentially novel connections and patterns in the news. There are still some usability challenges with a consumer site like Silobreaker, but I like it as an example of ad hoc search converging with iterative knowledge discovery.
The trend toward empowering users with more than just a search box and list of blue links also reaches into less “analytical” consumer applications. Two examples are www.oodle.com and www.globrix.com. Both sites show the power of applying analytics to both structured and textual data (classifieds in the case of Oodle, real estate postings in the case of Globrix) in what are otherwise fundamentally search applications.
Trend 3: Taking Sentiment Analysis to the next level
Sentiment analysis is the ability to recognize the mood, opinion, or intent of a writer by analyzing written text. It is sometimes called the “thumbs up, thumbs down” problem because the most common application is establishing whether a writer is positive or negative on a particular subject. In this form, it is often used to analyze written product reviews (see this example on Microsoft’s new Bing Web search).
Sentiment was a much mentioned topic at the conference. This is not surprising given the focus on Voice of the Customer and Market Intelligence – two areas where accurately establishing the sentiment of customers and consumers toward products, services, and brand is highly desirable. One of the presenters at the conference was Roddy Lindsay from Facebook. I missed that session, but it doesn’t take much imagination to appreciate the possible applications for text analytics and sentiment analysis, in particular, with the information available on Facebook and other social networking platforms.
Every vendor present had something to show or say on the subject of sentiment analysis, but all the panelists in the vendor-only panel acknowledged the difficulties of increasing the precision of sentiment classification. Currently, the number tossed around is 80%. That is, a sentiment classifier will get it right about 80% of the time compared to human judgments. This number is higher in some applications - for example, when analyzing short, strongly opinionated product reviews. It is lower when analyzing longer pieces of text where just fixing the subject can be difficult – like this blog post.
Progress is being made, though. The first step has been a shift away from “document-level” sentiment to “topic-level” sentiment. This allows sentiment classification to be more accurate when confronting documents, like this post, that touch on and offer opinion on multiple topics. It also helps with more concrete problems like the ones represented in this sentence:
“Acme’s new P40 digital camera has a good viewer, but its controls are awkward.”
While it’s relatively easy for a human, it takes some heavy linguistic lifting for a machine to recognize that the sentiment of this opinion is directed not just at Acme or at the P40 digital camera, but specifically at the viewer (positive sentiment) and the controls (negative sentiment). It’s even trickier establishing what the word “its” refers to in the 2nd part of the sentence. Is it the Acme P40 itself, or just the viewer?
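The difference between document-level and topic-level sentiment can be shown with a toy Python sketch. It is deliberately naive and rests on assumptions of my own: the `ASPECTS`, `POSITIVE`, and `NEGATIVE` word lists are hand-built, and clauses are split on commas and the word “but”. It does none of the heavy linguistic lifting described above (for instance, it makes no attempt to resolve “its”).

```python
import re

# Hypothetical, hand-built lexicons for illustration only.
ASPECTS = {"viewer", "controls", "battery"}
POSITIVE = {"good", "great", "sharp"}
NEGATIVE = {"awkward", "poor", "bad"}


def _score(text):
    """Count positive words minus negative words in a piece of text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return len(words & POSITIVE) - len(words & NEGATIVE)


def document_sentiment(text):
    """Document-level: one label for the whole text."""
    score = _score(text)
    if score == 0:
        return "mixed"
    return "positive" if score > 0 else "negative"


def topic_sentiment(text):
    """Topic-level: score each clause and attach the result to the aspects it names."""
    results = {}
    # Split on punctuation and contrastive words so each clause is scored on its own.
    for clause in re.split(r",|;|\bbut\b|\bhowever\b", text.lower()):
        words = set(re.findall(r"[a-z']+", clause))
        score = _score(clause)
        if score == 0:
            continue
        for aspect in words & ASPECTS:
            results[aspect] = "positive" if score > 0 else "negative"
    return results


sentence = "Acme's new P40 digital camera has a good viewer, but its controls are awkward."
print(document_sentiment(sentence))
print(topic_sentiment(sentence))
```

At the document level the positive and negative words cancel out and the sentence is labeled as mixed. At the topic level the same sentence yields a positive label for the viewer and a negative one for the controls, which is the more useful answer for something like a product review.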
Sentiment is admittedly a niche topic, even within text analytics, but getting it right matters a lot for enterprise search applications in eCommerce (think product reviews), Market Intelligence (reputation tracking and competitive intelligence), eDiscovery, and Government Intelligence. One presenter suggested that all the remaining hard problems in sentiment analysis will be solved, at least academically, in a couple years. It will be interesting to see how soon these improvements surface in real-life applications.
Nate
-
I recently came across Paul Nelson’s informative post on search scalability. I don’t know how long it’s been up there, but reading it made me think of customers I’ve spoken with recently who are looking to scale up their search deployments, but, due to tight budgets, want to do so without simply buying more hardware.
Paul focuses on document count as the main consideration for architecting scalable search, saying:
There is really only one dimension of size: The total count of documents in the system.
He goes on to describe several useful strategies for scaling search for “large” systems – those with document counts of >500 million. Importantly, imo, he also points out that even medium sized systems (10-100 million docs) will have special scaling needs depending on their performance requirements:
If these systems have any kind of query or index performance requirements — for example, it is a public web site with 10-30 queries per second, or that new documents arrive at a rate of 10 documents per second — then you will likely need an array of machines to handle your needs.
I mostly want to reinforce and build on this second point. Effectively scaling search means getting the most out of your search infrastructure (i.e. maximizing the number of documents per unit of hardware), but scale and performance are two sides of the same coin, and whether a system can squeeze ten thousand or ten billion documents on a machine, it must still satisfy the application’s performance requirements.
If you can’t just add hardware, what then? Well, there are still options for getting more capacity out of a search system that provides the right level of control for optimization and tuning. Understanding these options requires understanding how search system performance is measured and the associated trade-offs that exist. Paul alludes to some of these trade-offs, but it’s worth providing a few more details and examples to drive this point home.
Search System Performance Metrics
Metrics for search system performance typically fall into two categories: query performance and indexing performance. In turn, these categories each have two measures associated with them:
Query performance
· Query latency (or response time) – the time it takes for a query to be processed and results to be returned.
· Query rate – the rate at which the system can process queries. Usually measured in queries per second (or QPS).
Indexing performance*
· Indexing latency – the time it takes for a document to be indexed and made available to search.
· Indexing rate – the rate at which the system can process and index documents. Measured in documents per second.
*Indexing performance assumes systems that actually create an index or some other sort of database optimized for information retrieval. This rules out “federated search” engines, which rely on other systems to create and manage these indices.
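As a rough illustration of how these four measures might be computed from timing data, here is a small Python sketch. The `QueryEvent` record, the function names, and the idea of a fixed measurement window are all assumptions for the example; a real system would pull these numbers from its own logs or monitoring tools.

```python
from dataclasses import dataclass


@dataclass
class QueryEvent:
    start: float  # seconds since the start of the measurement window
    end: float    # time at which results were returned


def query_metrics(events, window_seconds):
    """Query latency (average and peak) and query rate (QPS) for one window."""
    latencies = [e.end - e.start for e in events]
    return {
        "avg_latency_s": sum(latencies) / len(latencies),
        "peak_latency_s": max(latencies),
        "qps": len(events) / window_seconds,
    }


def indexing_metrics(submitted_at, searchable_at, window_seconds):
    """Indexing latency and rate; both dicts map document id -> timestamp in seconds."""
    latencies = [searchable_at[d] - submitted_at[d] for d in searchable_at]
    return {
        "avg_index_latency_s": sum(latencies) / len(latencies),
        "docs_per_second": len(searchable_at) / window_seconds,
    }
```

Note that latency and rate are measured independently: a system can sustain a high indexing rate while individual documents still take a long time to become searchable, and the same goes for queries.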
There are some variations on these measurements. For example, you can track average or peak values for each. Document count per node (where a node = a Processing/Memory/Storage unit on a network) impacts all of these measures, but there’s a balance between query performance and index performance that also influences how many documents you can squeeze onto a single node. The perhaps obvious explanation is that the more system resources you allocate to serve query performance, the fewer resources you’ll have available for indexing, and vice versa.
Applications with rapidly changing content or with very time sensitive data place high demand on indexing performance. Other applications, like highly trafficked Web sites, place high demand on query performance. Different applications place different demands on scalability depending on the performance requirements across these dimensions. To take a specific example, consider an eDiscovery application that provides search across 100s of millions of archived emails. The query rate and indexing latency requirements for this type of application are typically lower than what a reasonably popular social networking site with an equivalent document count might see. As a result, eDiscovery search applications are able to squeeze more documents per node than highly trafficked Web sites – even if they serve the same total number of documents.
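A back-of-the-envelope Python sketch shows why the same document count can call for very different amounts of hardware. It assumes a simplified model that is mine, not Paul’s: the index is split into partitions sized by documents per node, and each partition is copied enough times to absorb the peak query rate. The function name and every number in the example are made up for illustration.

```python
import math


def nodes_needed(total_docs, docs_per_node, peak_qps, qps_per_node):
    """Estimate node count as index partitions times query replicas.

    docs_per_node and qps_per_node are tuned inputs: the more query and
    indexing work each node has to do, the fewer documents it can hold.
    """
    partitions = math.ceil(total_docs / docs_per_node)
    replicas = max(1, math.ceil(peak_qps / qps_per_node))
    return partitions * replicas


# Hypothetical figures: same document count, very different demands.
ediscovery = nodes_needed(200_000_000, docs_per_node=50_000_000, peak_qps=2, qps_per_node=10)
social_site = nodes_needed(200_000_000, docs_per_node=10_000_000, peak_qps=300, qps_per_node=50)
print(ediscovery, social_site)
```

With these invented inputs the eDiscovery-style deployment comes out at 4 nodes and the high-traffic site at 120, even though both serve 200 million documents.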
For another comparison, large eCommerce sites can have extreme query performance requirements - in some cases handling several thousand queries per second during peak traffic times, while still delivering sub-second responses. Even with these extreme query requirements, these sites can have relatively modest indexing performance requirements when compared to, say, financial news applications where content “freshness” and, so, low index latency are a priority.
Impact of Features
An often neglected factor that impacts performance is feature set. Features like faceted searching, results clustering, automatic query completion, and advanced query operators can each add incremental overhead to indexing performance, query performance, or both, depending on the feature and the system. For example, queries used for eDiscovery are sometimes crafted by teams of lawyers. This can result in queries made up of dozens or even hundreds of carefully selected search terms combined in a maze of (also carefully selected) Boolean, proximity, and other types of search operators.
I remember one FAST partner describing how their legacy eDiscovery tool (built on relational database technology) took up to 2 weeks to process a particularly long and complex query. Needless to say, they were delighted when we demonstrated the same query taking only a few seconds. It was not sub-second, but the point is that they would have been happy with this particular query if it came back in a few hours. In fact, our conversations on optimization included whether we could squeeze more capacity (docs per node) by relaxing the query response time requirements to 10-15 seconds for these queries in their application.
Different search systems are better (faster) than others, but parsing and evaluating very long and complex queries will generally take more cycles and resources than the usual 1 or 2 term ad hoc query. Relative to absolute document count, the individual impact on performance and scale of any one feature may be small, but taken as a whole and for certain applications, like the one in the example above, they can represent meaningful tuning options.
Know Your Options
The moral of the story is that getting enterprise search scale and performance right for large systems can be somewhat nuanced - especially if you’re on a tight budget. If you’ve embarked on, or are about to embark on, a large-scale enterprise search project, make sure you understand these performance considerations. Best-of-breed enterprise search platforms support many tuning strategies that factor in all the key dimensions of search performance and scale. Read your system’s deployment guide (if it comes with one) to understand these options.
Lastly, if you’re not sure if your project has what might be considered demanding scale or performance requirements, consider getting some expert advice. Below are some good online forums you can tap for expert advice and to get a sense for whether your system might be considered “demanding”.
http://tech.groups.yahoo.com/group/search_dev/ (Search Engine Developers group on Yahoo)
http://www.linkedin.com/groups?gid=161594 (Enterprise Search Engine Professionals on LinkedIn)
Nate
-
Day 1 at the Enterprise Search Summit in NYC is wrapping up and I’ve just listened to Lisa Denissen from Shearman & Sterling talk about Actionable Search. Actionable search is a key tenet of Microsoft’s enterprise search strategy, so it was good to see promotion of the concept.
For many organizations, just adding basic, no-frills search to an intranet can have a big impact on employee productivity, but to really create an optimal search experience it helps to understand the processes that drive users to search in the first place. Too often search is treated as an end unto itself, without consideration for the larger processes that it ultimately serves. Users care about finding relevant information, sure, but they care even more about using that information to complete tasks and achieve outcomes. These tasks and desired outcomes are what ultimately define success for an enterprise search application and, it may be argued, for any type of search app.
Understanding what motivates people to search means going beyond capturing requirements like “I need to be able to search all of Product Marketing’s PowerPoints” to addressing more precise needs like “I need to quickly assemble targeted presentations for sales prospects based on existing marketing material”. This second statement doesn’t sound like a search problem, but it speaks clearly to a desired outcome (“targeted presentations”) and the word “quickly” suggests that search may offer some help here. Importantly, the statement also focuses on the question of why the user is searching, not just what they hope to find.
The phrase “actionable search” refers to the idea that items in search results can be directly acted on in a way that moves the searcher toward completion of a specific task – an outcome. While general Web search engines have us accustomed to results sets that contain only links to relevant Web pages, the richness of applications and content types in the enterprise and on targeted Internet sites promise a bit more than just a blue link. eCommerce sites have supported actionable search for years by allowing users to directly add items from a search result to a shopping cart. Facebook provides contextual actions directly from its general search results that let you Join Groups, Add Friends, Join Events, or Send Messages. To take the earlier example, once a relevant PowerPoint presentation is found, an actionable search experience would be to offer the user help with the next steps of finding the right individual slide and then quickly incorporating that slide into their work-in-progress presentation.
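One way to picture actionable search in code is as a result list where each item carries the actions that make sense for its type. The Python sketch below is only an illustration: the `SearchResult` class, the content types, and the action labels in `ACTIONS_BY_TYPE` are hypothetical names I have made up, loosely following the examples above.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    title: str
    content_type: str  # e.g. "presentation", "product", "person"


# Hypothetical mapping from content type to task-oriented actions.
ACTIONS_BY_TYPE = {
    "presentation": ["Preview slides", "Copy slide into my deck"],
    "product": ["Add to cart"],
    "person": ["Send message"],
}


def actions_for(result):
    """Every result keeps the plain link; known types also get actions."""
    return ["Open"] + ACTIONS_BY_TYPE.get(result.content_type, [])


for hit in [SearchResult("Q3 sales pitch", "presentation"),
            SearchResult("Meeting notes", "document")]:
    print(hit.title, actions_for(hit))
```

The design point is that the blue link never goes away. Results of an unknown type fall back to “Open”, while recognized types offer the next step in the user’s task directly from the result list.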
One argument for enterprise search starts with the question “What good is enterprise content management and collaboration if you can’t easily find the information you create, manage, and share?” We might switch the question around and ask, “What good is enterprise search if you can’t easily act on the information you find?” Actionable search promises to close this gap between information access and outcomes.
Nate
-
In my first post on this subject last week, I referred to a scene in the movie “Minority Report” as a visionary example of a natural user interface (NUI) and, more to the theme of this blog, a visionary example of ad hoc search within a NUI. I realize that I didn’t offer a definition of NUIs in that post, so, before I go back to the search connection, here’s a quick primer.
NUIs Defined
Natural user interfaces or NUIs rely on natural expressions like touches and gestures to directly and intuitively control the experience of a software application. The word “natural” means that the interaction is not controlled through an artificial device, like a mouse or keyboard. (I take this to imply that a Nintendo Wii is not an example of a NUI, since there are still artificial controllers involved. Other opinions and thoughts on this are welcomed).
NUIs have been described as the next evolutionary step in human-computer interaction – the successor to graphical user interfaces (GUIs), which succeeded command line interfaces (CLIs), which succeeded physical input devices like card readers. Touch screens on hand-held devices are the most common examples of NUIs, but there are a number of other emerging NUI platforms and technologies. This article on touch computing from PC Magazine offers a catalog of some of the systems currently available.
Microsoft Surface
One of the technologies mentioned in the PC Magazine story is Microsoft Surface. Microsoft Surface is a Windows powered device in the form factor of a table - a coffee table, if you will - with a surface that supports touch and gesture interaction. There are other NUI platforms, but there are a couple things that make Microsoft Surface different and interesting.
First, the Microsoft Surface form factor and interface are designed to allow multiple users to interact with the device at the same time. The interface can detect and track dozens of touch points simultaneously. It can even recognize the orientation of fingerprints and infer, in turn, the physical orientation of a user relative to the display. Because of these capabilities, many applications created for Microsoft Surface emphasize multi-user collaboration and interaction – for example, there are multi-user games and other collaborative consumer applications for things like music and picture sharing.
Second, Microsoft Surface devices have built-in cameras that can not only track touches and gestures, but can recognize digitally tagged objects and can initiate specific actions when these objects are placed on the table. For example, Infusion Development has created an application designed to enhance the doctor patient consultation experience. By placing a tagged card on Microsoft Surface, doctors can use and access interactive cardiac images, dynamic charts and clinical documents to help explain medical conditions and procedures to their patients.
NUIs: Where’s the Search?
I was wowed by my first experience with Microsoft Surface - as many are when they first get a chance to play with one - but being a search guy, I looked for applications that included some sort of search function. Of the NUI applications I’ve seen to date, whether on Microsoft Surface or in other NUI technologies, very few provide true ad hoc search. In one or two examples I’ve seen, a virtual keyboard is used to enter search terms and traditional GUI search metaphors are used to render search results. More often, though, finding information requires the user to navigate through some pre-defined structure. Even this TouchWall demo by Bill Gates from last year’s CEO Summit focused on navigation. Where’s the search?
I’ll grant that structural navigation metaphors in NUIs are really cool and work pretty well. For example, I’ve seen a medical app that allows you to visually navigate a representation of the human body to explore different anatomical concepts. You can tap on the virtual head to explore the brain and then drill down further to learn about neurons. It looks like a fun and interesting way to explore human anatomy, but the problem with this navigation-only approach is that if you don’t happen to know that neurons are in the brain, it will take you a while to find them. It is browsing, not ad hoc search and, as we learned from the Yahoo Directory experience back in the 90s, people tend to prefer searching over browsing.
A Prototype and a Request
At our FASTforward’09 user conference in Las Vegas in February, we showed a prototype application, built in collaboration with a very sharp team of developers at EMC Consulting, which brought together ad hoc search and the natural user interface experience of Microsoft Surface. You can see a short video of this demo here, or the longer keynote presentation from the event here.
When Mark Stone, Global Enterprise Search Lead at EMC Consulting, and I first conceived this demo, we were inspired by three things:
1) The dramatic growth and potential of NUI technologies, particularly Microsoft Surface.
2) The dearth of search examples in all these NUI applications.
3) The potential for creating transformative user experiences that combine search and NUIs.
You can judge for yourself how successful the team was in combining ad hoc search with Microsoft Surface by looking at the demos, but one thing is for sure: we were in uncharted waters when building this app. The user interface patterns for search within a NUI are not well established. Even without considering search, building user interfaces in Microsoft Surface requires setting aside the old GUI models and learning brand new patterns and metaphors. As for search in a NUI, well, what are the equivalents to the search box, the search result list, navigation facets, document links, and all the interaction patterns around these “controls”? How can we use a 3rd dimension (“depth”) and what role does “zoom” play in search? Working within a NUI environment even challenges the basic containers of information. Should you first show documents, or just extracted facts and information summaries? All these questions and more came up during the development of this prototype. Some of the answers are now known, or at least we have a better feel for the right direction to go, but others require more research and experimentation.
There is an opportunity here, and a challenge to be met by the search community. NUIs are here to stay and are demanding new patterns for true ad hoc search that satisfy the intuitive and natural interaction requirements of these environments. Reverting to browsing metaphors is not the answer; nor is simply recreating the GUI patterns of keyword search boxes and lists of blue links.
I’m very interested in this topic and am on a hunt for any good examples of true search within NUIs. If you know of an example, please send whatever pointer you can - links to demo videos, screen shots, academic papers, … anything. You can respond to this post or email me directly.
I look forward to seeing your examples and will summarize what I find in a future post.
In the meantime, I feel like we need a new name for search interfaces within NUIs. I like the phrase “Natural Search Interface” used by the Microsoft Germany Partner site in reference to the Microsoft/EMC Consulting prototype. I’ll use that.
Nate
-
About five years ago or so, I participated in a conference panel where the question was asked: “What will search interfaces look like 20 years from now?”. I had just seen Steven Spielberg’s sci-fi film “Minority Report” starring Tom Cruise, so I referred to the scene where Cruise’s character is interacting with a futuristic looking visual display and using appropriately dramatic gestures to grab, spin, shrink, expand, and otherwise manipulate the various news stories and images floating on the display.
I heard later that Spielberg, while developing the script for the film, had consulted a number of futurists to create as realistic a picture of the year 2050 as possible (from the point of view of those futurists at least). Interestingly, over the past several years, that scene has become a conceptual benchmark for so-called natural user interfaces (NUIs), to the point where if you search for “minority report” in your favorite Web video search engine you’re as likely to find examples of prototype NUI products as you are trailers for the actual film. It’s not a stretch, imo, to say that the film has inspired and perhaps even accelerated advancements in NUI products and technology.
There are now many good and real examples of NUIs and even some actual products that come close to the vision in “Minority Report”, but despite the impact the film appears to have had on the development of NUIs, there is a very strong connection to search that gets overlooked. Cruise’s character in that scene is searching. His various gestures and other contortions are queries, navigation, and refinements intended to help him find answers and collect information. Granted, the depiction is not quite up to the vision of the smooth-voiced computer on Star Trek, but it’s a step beyond the keyboard and mouse and, if you look past the theatrics, I think it paints a realistic view of not just the future of natural user interfaces, but of the type of natural search-driven user interfaces we will be seeing soon… in much less than 20 years’ time.
Nate
-
After years of writing customer proposals, internal memoranda, and various stuffily formal documents, it feels like a luxury to be able to just write what I think about enterprise search. It’s actually part of my job these days and I’m looking forward to sharing a perspective from 13 years in the industry – the past 6 years with FAST and, most recently, with Microsoft.
As a reminder, it’s been more than a year since the original offer came down from Microsoft to acquire FAST. To be precise, the bid was announced on January 8th, 2008 and the deal closed on April 25th, 2008. The FAST team now makes up a large part of the new Enterprise Search Group (ESG) within the Microsoft Business Division (MBD) – the division that makes SharePoint, the Office line of products, Exchange, etc.
When I get asked about my reaction to the FAST acquisition by Microsoft, I tend to point out that, while those of us in the business have always understood the value of search, nothing says “Attaboy!” like having the largest software company in the world take notice. Maybe we could ask why it took so long, but even if you didn’t happen to work at FAST, you can’t help but feel that Microsoft’s move is validation of our growing corner of the IT industry.
I admit that the answer above, while maybe heartwarming, doesn’t get to the core of what people really want to know. Not surprisingly, folks are more interested in Microsoft’s vision for enterprise search and plans for the FAST people, products, partners, and customers than they are in my emotions. Now, with a year under the belt at Microsoft, I have a few more insights to offer than just the initial “nice validation” response.
In his keynote presentation at FASTforward’09 in February, Kirk Koenigsbauer addressed three key topics related to Microsoft’s interest in enterprise search (a transcript of Kirk’s keynote can be found here). These were:
· Commitment (to enterprise search)
· Vision
· Product Plans
These topics provide a useful framework for sharing my own observations.
Commitment
There are a number of anecdotal facts that point to Microsoft’s commitment to being a leader in enterprise search. Kirk shared a few of these in his keynote – things like the percentage of Microsoft Research investment going to search (approximately 15%), the size of the Enterprise Search Group R&D organization (several hundred engineers and growing), and of course the investment itself to acquire FAST (US$1.2B). There are other supporting data points, like the announcement of Oslo (FAST’s headquarters) as a key R&D center for business search.
Any one of these facts is a strong indication of Microsoft’s ambitions in this space, but my take is that the evidence of Microsoft’s commitment to search comes from more than these metrics or executive statements. It comes from a growing grassroots interest in search across all of Microsoft. For example, I often get a question like this from customers and partners:
“Have you guys talked with the folks over in Microsoft’s <product name> team?”
…and then…
“Man, you should, because FAST technology added to what they’re doing would be a powerful combination.”
The usual answer is, yes, we’ve talked to the <product name> team and, yes, there are some very interesting ideas and even some specific activity that we mostly can’t talk about yet. In fact, what’s been most interesting and fun for us former FAST folks is the breadth of technologies that we can now include in our conversations with customers and partners. SharePoint is the “hero SKU”, as we say here, and the combination of FAST search with the capabilities of SharePoint makes for an impressive offering for both intranet and Internet applications that are focused on helping people consume and use information. It’s not a leap to recognize that Microsoft has something to offer at almost every level of an IT solution “stack” complementing the capabilities of both SharePoint and search – from the operating system to application development tools and even cloud-based services. To put it in perspective, ask yourself how many companies offer both a world class enterprise search platform and a world class relational database.
To be honest, search is such a generally valued concept and the possibilities are so compelling when it’s combined with other Microsoft products and technology that it’s all we can do to stay focused on our main priorities. It’s a good problem.
Vision
At some point prior to the acquisition, the Microsoft enterprise search team came to a vision of search that matched what we had developed at FAST. Specifically, search is more than just a search box and a list of blue document links; it represents a set of capabilities that enable new ways to engage users by creating personalized, conversational experiences that cater to the way people prefer to consume and interact with information. This vision was behind the principal theme for the FASTforward’09 conference this past February – “Engage Your Users”.
Whether the original Microsoft team came to this vision independently or after talking to FAST folks (ego would like to think the latter) is less important than the fact that it is now a shared vision throughout the Microsoft Enterprise Search Group and is shaping how we are investing in product development. It’s also a vision that is permeating into other areas within Microsoft. For example, I recently had a chance to apply this way of thinking about search to some other very interesting Microsoft technology, Microsoft Surface, but that’s a topic for another post.
Product Plans
At FASTforward’09 we announced our plans to target enterprise search in two areas:
· Business productivity – applications inside the firewall where, in particular, SharePoint provides the framework for content management and collaboration.
· Internet business – “outside the firewall” applications for attracting, retaining, and otherwise monetizing customers.
The intention is to have a common search platform supporting both of these general markets and to include application-specific capabilities and templates that are unique to each. FAST had already started down this path. For example, FAST AdMomentum is an ad platform that interoperates with search and is relevant to monetization strategies in Internet Businesses, but is not as obvious a fit for inside-the-firewall apps.
This relatively straightforward strategy and message was very important to get out to the FAST customer base, especially given that Internet Businesses have made up well more than half of FAST’s business to date. Also, most industry pundits will tell you that the requirements for search inside the corporate firewall are simply different than those for search in consumer-facing applications. Even so, what’s so promising to me about this strategy is that there are opportunities to “bleed” capabilities between these two application spaces. We saw this “consumerization” of search features happen more than once at FAST. Features that we initially designed for consumer search found their way into intranet search deployments (one simple example is the “best bets” concept like the one found in SharePoint). The opposite has also happened. Now, consider the capabilities in SharePoint, which is already powering many consumer-facing Web sites, and you can see where this can lead.
There you have it, my first post for the Microsoft Enterprise Search Blog. Look for more posts from me in this general category of enterprise search vision and strategy. I welcome all comments on this and future entries.
Next up – Search plus Natural User Interfaces.
Nate
-
The Mirage, Las Vegas, Feb 9-11
Since its inaugural conference in 2006, FASTforward has been a venue for thought leadership and innovation in the field of search. This year, FASTforward’09 is the industry’s largest business and technology conference dedicated to search-driven innovation. Join the discussion! At FASTforward’09, we explore how businesses are responding – and evolving – in the face of rapid technological change and the growing demands for user control. As The User Revolution continues, we examine search’s critical role in helping companies engage their users. This year’s conference will also highlight Microsoft’s vision for enterprise search technology.
New this year, a SharePoint technology track covering Enterprise Search, Social Computing, Enterprise Content Management and more! Other tracks include:
- Monetization via Search (customer-facing)
- Productivity via Search (internal enterprise)
- FAST technology
- Partner Solutions
Top Ten Reasons Why You Should Attend FASTforward’09:
1. Uncover new opportunities for using search
2. Hear what others have done with search technology
3. Learn industry best practices for search
4. Hear the Microsoft vision for search and FAST
5. Learn how SharePoint and FAST products are positioned
6. Gain insight on integration plans for SharePoint and FAST products
7. Understand how partners can help
8. Obtain access to Microsoft and FAST executives and industry luminaries
9. Network with colleagues
10. Attend convenient pre-conference technical training
Come spend three days with us at the Mirage in Las Vegas learning from industry thought leaders, customers, partners, and our own Microsoft experts!
Learn more at FASTforward ‘09. Register before January 9 and receive $400 off of the full registration fee. See you there!
-
We’ve got great news to share! Last month, Gartner published the 2008 Magic Quadrant for Information Access Technology, and Microsoft was positioned in the Leaders Quadrant. Since the completion of the acquisition, we’ve worked incredibly hard to communicate and demonstrate a combined vision and strategy to our customers and partners. It’s good to know we’re heading in the right direction!
When I talk with customers about search, it’s clear that organizations have very different needs. In fact, many people tell me that even within an organization the one-size-fits-all approach just doesn’t work. So over the last year, we’ve announced some bold moves designed to create a compelling portfolio of search applications. With the addition of Search Server Express and the acquisition of FAST, we now have a product line-up designed to meet a broad range of business needs:
- Some departments or small organizations need search that is quick and easy to set up; we offer Microsoft Search Server Express as a free download so that you can get it up and running in about 30 minutes. We’re excited to see customers like St. Jude Medical and Urbis having quick successes with Express. We’re also seeing partners, such as StartReady, build solutions around Search Server Express to create a search appliance.
- Many organizations need search as an integral part of a business productivity infrastructure; Search in Microsoft Office SharePoint Server is integrated with other key SharePoint productivity workloads such as portals, collaboration, ECM, business processes and BI. Customers like McCann Worldgroup and Jones Lang LaSalle are all deriving productivity increases with better search in SharePoint. In particular, both companies are promoting collaboration and leveraging in-house experts with people search enhanced by user profiles in MySites.
- Some organizations face business problems that demand high-end search; FAST ESP offers best-in-class search with extreme scalability, query performance, and other advanced capabilities for sophisticated customer-facing or inside-the-firewall applications. For example, Aerotek and TEKsystems, two of the world’s largest staffing companies, deliver job searching to more than 1.3 million users. In more than 164 million queries, greater than 99.5% of query results came back in less than 2 seconds. For inside-the-firewall productivity, they index more than 10 million complex candidate records with low latency during high-volume index updates. We’re also excited to see Pfizer pushing the envelope with an Enterprise Collaboration Framework driven by FAST ESP on top of SharePoint.
While our “Leaders Quadrant” position in the Magic Quadrant is an important milestone, we still think of this as the very beginning of our journey. We’re continuing to combine our deep technical expertise with our broad reach to deliver exciting innovations to the market – so you can and should expect great things to come. Stay tuned!
Kirk Koenigsbauer
General Manager,
SharePoint Business Group
Magic Quadrant for Information Access Technology (Gartner Research, Sept. 30, 2008) Microsoft is positioned in the Leaders Quadrant of Gartner, Inc.'s 2008 Magic Quadrant for Information Access Technology. This report assesses vendors with capabilities that go beyond enterprise search to encompass a range of technologies. Their capabilities include search; federated search, content classification, categorization and clustering; fact and entity extraction; taxonomy creation and management; information presentation (for example, visualization) to support analysis and understanding; and desktop search to address user-controlled repositories in order to locate and "invoke" documents, data, e-mail and intelligence.
The Magic Quadrant is copyrighted 2008 by Gartner, Inc. and is reused with permission. The Magic Quadrant is a graphical representation of a marketplace at and for a specific time period. It depicts Gartner's analysis of how certain vendors measure against criteria for that marketplace, as defined by Gartner. Gartner does not endorse any vendor, product or service depicted in the Magic Quadrant, and does not advise technology users to select only those vendors placed in the "Leaders" quadrant. The Magic Quadrant is intended solely as a research tool, and is not meant to be a specific guide to action. Gartner disclaims all warranties, express or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
-
In another great blog post, Matt McDermott walks you through the steps of enabling SharePoint’s people search capability on a mobile device, with the end results looking something like this:

The post is here:
http://blogs.catapultsystems.com/matthew/archive/2008/09/27/mobile-people-search.aspx
Richard Riley
Senior Technical Product Manager
Microsoft Corp.
-
Since the release of Microsoft Office 2003, Microsoft desktop applications such as MS Word, PowerPoint, Excel, Outlook and Internet Explorer have contained an internal federated or meta-search capability known as the ‘Research Pane’. To see this in action in Office 2003 (see link for instructions for Office 2007), select (i.e. highlight) a word or phrase within MS Word or MS Outlook and, on PCs, right-click the highlighted word, pull down to the “Look Up” option and click. Another way to do this is to hold down the ‘Alt’ key while left-clicking on a highlighted word (on Macs, use a command-click). The Research Pane should then open up in the application window and execute a search on the highlighted section. Out of the box, MS Office ships with several research sources such as the Microsoft Encarta Dictionary, Microsoft Live Search, MSN Money and some third-party offerings from Factiva and Thomson Gale, among others. Here is a screenshot of content returned from three enterprise search engines as well as from some public biomedical websites.

The list of sources that can be searched from the Research Pane is expandable by adding connections to Research Pane service providers. Armed with a URL to a Research Pane “registration service”, a user can install the source into their MS applications using the “Research options…” link. This potentially gives users access to a large set of data sources to choose from. Once a source is installed, the user can select the source from a dropdown list (which causes the search to be executed) or can select a set of sources based on certain pre-defined categories.
Raritan Technologies specializes in federated search solutions and has created search connectors to web sites, web services, search engines, databases, and directory services (to name a few) using our Search Integration Framework Toolkit (SIFT) and Federation Manager. We and our partner in this effort, New Idea Engineering, have also provided a number of ways to deploy these federated search connectors to web applications and within web services such as SOAP and OpenSearch. We have recently added to this list by providing an MS Research Pane service ‘front-end’ to our federated connectors. This enables a wide range of sources to be ‘plugged in’ to any MS Office application: search engines such as Autonomy IDOL, K2 or Ultraseek, Dieselpoint, Endeca, Exalead, FAST, Lucene, Mark Logic (and others), as well as SharePoint (out of the box); SQL databases; LDAP directories; SOAP and OpenSearch web services; Z39.50 sources; Internet web sites that have search boxes (a very large list that includes general web search engines and specialized sites such as news or research sites); Content Management Systems such as Alfresco, Documentum and eRoom; and archival systems like Symantec Enterprise Vault. The modular design of the Raritan Search Integration Framework enables other connectors to be added to this list, and as this happens, these new sources will automatically be available to users of the Research Pane once configured as a service.
The ability to combine internal content sources from content management systems, enterprise search engines, databases and directory services with external content from subscription or public web sites and web services into MS Office applications provides a huge potential for search integration at the “tip of the sword” where thought and knowledge are combined to create new content.
For more information on the Raritan Technologies “Research Pane Integration” or to arrange for a trial connector please visit http://www.raritantechnologies.com/ResearchPane.shtml.
Barry Freindlich
President
Raritan Technologies, Inc.
-
The thesaurus is an XML file that provides users with a means of automatically expanding or rewriting their queries to include synonyms, acronyms, etc. For example, in a chemical company, product ID 1234, oxygen, O2 and LOX could all refer to the same item.
A SharePoint Search administrator can modify the thesaurus file to substitute all these words at search query time. This document explains how to set up a thesaurus and where to find the relevant files.
Supported Thesaurus Syntax:
To use the sample files provided by the product, you need to remove the comment beginning (<!--) and ending lines (-->) from the xml file.
Explanation of terms:
| Term | Meaning |
| --- | --- |
| thesaurus | Marks the beginning (and end) of the thesaurus. |
| diacritics_sensitive | Diacritics are marks, such as accents, that are added to letters and change their pronunciation. For example, the acute accent over an e gives you: é. Set to 0 to ignore diacritics or 1 to respect them. |
| expansion | A list of alternative forms, each marked by a <sub> element. |
| sub (in an expansion) | One of several alternatives in an expansion. |
| replacement | One or more patterns that will be replaced with a single substitution. |
| pat | A pattern to be replaced. |
| sub (in a replacement) | The item to be substituted for the patterns. |
Example:
<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <expansion>
      <sub>Internet Explorer</sub>
      <sub>IE</sub>
      <sub>IE5</sub>
    </expansion>
    <replacement>
      <pat>NT5</pat>
      <pat>W2K</pat>
      <sub>Windows 2000</sub>
    </replacement>
  </thesaurus>
</XML>
The example means:
- We have elected to ignore accents, etc. in the thesaurus.
- A query containing any one of the <sub> terms in the expansion (for example, “IE”) is expanded to also search for the others (“Internet Explorer” and “IE5”).
- If a query contains terms “NT5” or “W2K”, they will be replaced by “Windows 2000”.
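To make the expansion and replacement semantics concrete, here is a small Python sketch that reads the example thesaurus and rewrites a single query term. It is only an illustration of the behavior described above, not how SharePoint Search is implemented. The function names `load_rules` and `rewrite_term` are made up for this example, and the case-insensitive matching is an assumption.

```python
import xml.etree.ElementTree as ET

# The example thesaurus from above, closed with </XML> so it parses as XML.
THESAURUS_XML = """<XML ID="Microsoft Search Thesaurus">
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>Internet Explorer</sub>
<sub>IE</sub>
<sub>IE5</sub>
</expansion>
<replacement>
<pat>NT5</pat>
<pat>W2K</pat>
<sub>Windows 2000</sub>
</replacement>
</thesaurus>
</XML>"""

# Elements under <thesaurus> live in the namespace declared by its xmlns attribute.
NS = "{x-schema:tsSchema.xml}"


def load_rules(xml_text):
    """Return (expansion groups, pattern -> replacement map) from thesaurus XML."""
    root = ET.fromstring(xml_text)
    expansions = []
    replacements = {}
    for exp in root.iter(NS + "expansion"):
        expansions.append([s.text.strip() for s in exp.findall(NS + "sub")])
    for rep in root.iter(NS + "replacement"):
        target = rep.find(NS + "sub").text.strip()
        for pat in rep.findall(NS + "pat"):
            # Assumption: patterns are matched without regard to case.
            replacements[pat.text.strip().lower()] = target
    return expansions, replacements


def rewrite_term(term, expansions, replacements):
    """Return the list of terms that would be searched for a single query term."""
    key = term.lower()
    if key in replacements:
        # A replacement swaps the term out entirely.
        return [replacements[key]]
    for group in expansions:
        if key in (g.lower() for g in group):
            # An expansion searches for every alternative in the group.
            return list(group)
    return [term]
```

With the example file, `rewrite_term("ie", ...)` yields all three browser names, while `rewrite_term("W2K", ...)` yields only “Windows 2000”, which is the difference between an expansion and a replacement.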
How to Customize the Thesaurus:
- Find the appropriate thesaurus file in the config folder contained in the registry key: [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager]"DefaultApplicationsPath"
- Update the thesaurus file(s) for each appropriate language for each desired <expansion> or <replacement>.
- Replace the file(s) on each index, query and web frontend server for each search application path:
%programfiles%\Microsoft Office Servers\12.0\Data\Office Server\Applications\[GUID]\Config
Note: index propagation does not sync these files across the servers in the farm, so each server must be updated by hand.
- Stop and restart the search service (this is needed to load the new thesaurus files). For example, in a console window, run “net stop osearch & net start osearch” without quotes, or launch Programs\Administrative Tools\Services, right-click Office SharePoint Search Service, and choose Restart.
Notes:
See “Finding Important Files” below for a summary of where to find the key files to manage your thesaurus.
- (optional) If you want to have the same thesaurus files apply to all newly created SSPs, put your thesaurus files under the main config folder
(e.g., %programfiles%\Microsoft Office Servers\12.0\Data\config).
- If there is a syntax error in the thesaurus file, all expansions and replacements will be ignored.
- If a word in the thesaurus file matches a stop word in the stop word file, it will be ignored. To avoid this, remove it from the appropriate stop word file.
- Thesaurus terms are broken into words at query time. Add words you do not want to be broken into the custom dictionary file customLANG.lex (see Finding Important Files for more details).
- Search first applies the thesaurus, and then expands words into their alternate forms, when “stemming” functionality is turned on. Care should be taken to avoid expanding into too many unnecessary forms as this may harm search performance and accuracy.
- The “All words” option on the Advanced Search page might no longer work when using multiple term substitution with the thesaurus. This is because an implicit “+” is used between every term. For example, if we used our example thesaurus above and typed “browser ie” in the “All words” field, it would look for “+browser +ie”; it would no longer allow “Internet Explorer”.
- Ambiguous replacements will stop the thesaurus from working (this will be noted in the appropriate logs, but will not be obvious to the user). For example, if you replace a with b and a with c, this is an error. Some admins add large thesauri which are automatically populated with items such as “replace a b with c” and “replace a,b with c”. As it turns out, after wordbreaking these two expressions look exactly the same. Please check for this kind of problem in the logs if you are building a large thesaurus.
- There is a 10,000 term limit per language in the thesaurus.
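Since a syntax error or an ambiguous replacement silently disables the thesaurus, it can be worth checking a file before copying it to the servers. The Python sketch below is one way to do that, under several assumptions: `check_thesaurus` and `normalize` are hypothetical helper names, the regular expression is only a rough stand-in for the real wordbreaker, and counting every <pat> and <sub> element as a “term” is a guess at how the 10,000 limit is measured.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

TERM_LIMIT = 10_000  # per-language limit quoted in the notes above


def normalize(text):
    """Rough stand-in for wordbreaking: lowercase and split on non-word characters."""
    return " ".join(re.findall(r"\w+", (text or "").lower()))


def check_thesaurus(xml_text):
    """Return a list of human-readable problems found in a thesaurus file."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # A syntax error means every expansion and replacement is ignored.
        return [f"not well-formed XML: {err}"]

    problems = []
    seen = defaultdict(list)  # normalized pattern -> original spellings
    term_count = 0
    for node in root.iter():
        name = node.tag.rsplit("}", 1)[-1]  # drop any namespace prefix
        if name in ("pat", "sub"):
            term_count += 1  # assumption: each pat/sub counts as one term
        if name == "pat":
            seen[normalize(node.text)].append(node.text)

    for norm, raws in seen.items():
        if len(raws) > 1:
            # Covers both "a -> b and a -> c" and "a b" versus "a,b".
            problems.append(f"pattern '{norm}' is defined more than once: {raws}")
    if term_count > TERM_LIMIT:
        problems.append(f"{term_count} terms exceeds the limit of {TERM_LIMIT}")
    return problems
```

An empty list means none of these particular checks fired; the search logs remain the authoritative place to confirm that the thesaurus loaded.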
Finding Important Files:
The following are the most important files used to manage your thesaurus.
There are 50 default stop word files and 48 thesaurus sample files for the languages we support.
The search service install path can be located by examining registry key [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager]"DefaultApplicationsPath"
The default location of the thesaurus files (for each index, query and web frontend server) is:
%programfiles%\Microsoft Office Servers\12.0\Data\Office Server
When a search application is created, a copy of the thesaurus file will also be placed under: %programfiles%\Microsoft Office Servers\12.0\Data\Office Server\Applications\[GUID]\Config
Stop word files for each language can be found as noiseLANG.txt, where LANG is the 3 letter acronym for that language. For example, US English is noiseENU.txt, and the language neutral list is noiseNEU.txt.
To find the appropriate acronym for your language(s), please look them up under: http://www.microsoft.com/globaldev/nlsweb/default.mspx.
Ping Lin
Senior Test Lead
Microsoft Corp.
Victor Poznanski
Senior Program Manager
Microsoft Corp.
-
Matthew McDermott, a SharePoint MVP, has written a great four-part blog post on how to make SharePoint 2007 search (and Search Server) render image results in a way that looks very similar to http://images.live.com.
Not only does this make searching images much easier, it’s also a very thorough step-by-step tutorial on how to customize results using the built in Web Parts and XSL – it’s well worth a read.
SharePoint Image Search (Part 1)
SharePoint Image Search (Part 2)
SharePoint Image Search (Part 3)
SharePoint Image Search (Part 4)
The end result makes SharePoint Image results look like the screencap below.

Richard Riley
Senior Technical Product Manager
Microsoft Corp.
-
This article has been a long time coming, but it is finally here. In the post below I will cover how to configure the Search database to span multiple filegroups. First I'll cover a little about the benefits of doing so:
General references on what SQL file groups are:
The method that we have chosen to implement filegroups on the Search database is one of segregation. We have identified all of the tables and indexes within the database that are solely used for crawling and not used at all to satisfy end-user queries. The remaining tables and indexes are used for end-user queries. However, the nature of the Search and indexing problem still dictates that the "query" tables are written to during a crawl. The crawl only tables and indexes are isolated into their own filegroup. With the crawl and query centric filegroups identified you can now ensure that the IO intensive process of crawling has a reduced impact on the IO subsystem that is hosting the query filegroup by ensuring that these filegroups are on separate spindles.
The whole goal of using filegroups is to improve the performance of the system. This is done by providing an additional file. This file must be placed on a different set of spindles to see any kind of performance enhancement. If your SQL machine is not IO bound for the Search database then implementing filegroups will not provide you with any benefits.
To make the migration process easier we did not actually create a query filegroup. We simply created a new filegroup called "CrawlFileGroup" and moved the crawl tables out of the PRIMARY filegroup, so that PRIMARY effectively becomes the query filegroup. This migration can be quite expensive and could take hours to finish, so keep this in mind when scheduling it on your production servers. Because the move involves dropping and recreating numerous clustered indexes, you should assume that the DB is offline during the move, as many long-running locks will be taken to recreate the indexes.
Issues and concerns with using filegroups:
Back-up and Restore
One concern that you will need to be aware of in your planning for deploying filegroups on the Search database is that your restore process will be slightly impacted. Out of the box, Search restore is unaware of the filegroup that will exist within the backup image. Because of this, there is no way to indicate where this file should be restored to. As a result, the restore process is going to try to place the crawl filegroup file onto the same drive that it existed on when you ran the back-up. Once you enable filegroups, you will be committed to making sure that all future machines that you restore your back-up to have a drive with the same drive letter that you initially created the filegroup on.
Future upgrades, Service packs and Hot fixes
Each Hotfix, Service Pack and update that you apply to the server has the potential to modify an index that was moved into the CrawlFileGroup or to add a new index to one of the tables moved to the filegroup. When/if this happens, the index will be moved back to, or created in, the primary filegroup. Updates will also clean out any non-product sproc. Because of the risk of index modification, you will need to reinstall the stored proc and run the scripts again after each update that you apply.
The risk of a new index being added or modified is quite low at this time. We have confirmed that this does not occur when upgrading from RTM to SP1, but it does happen when upgrading from SP1 to the Infrastructure Update. Future updates are less likely to modify the set of indexes.
However, the risk still exists and you will want to re-run the scripts below after each update that you apply to your system. In the case when you apply an update and the index did not change running the script is a no-op and nothing gets moved. So it is very cheap to run the script on a system that already has the indexes moved.
SQL 2005 and greater
The script that is moving the indexes is utilizing new features that were released in SQL 2005. As such you cannot perform this optimization with SQL 2000.
Step-by-Step instructions for applying filegroups to your environment
To deploy this you will need to manually create a file group on the Search database. To do this execute the following steps:
a. Go to the Filegroups section of the Search database properties within SQL Server Management Studio.
b. From the Filegroups section, click Add and fill in the name "CrawlFileGroup". The scripts are written to assume the filegroup has this name; using any other name will cause the scripts to fail early.
![clip_image001[1] clip_image001[1]](http://blogs.msdn.com/blogfiles/enterprisesearch/WindowsLiveWriter/SQLFilegroupsandSearch_A612/clip_image001%5B1%5D_thumb.jpg)
c. Once you have a new filegroup with the name CrawlFileGroup, you need to add a file to this group. To do this, select the Files section of the database properties dialog and add a new file to the CrawlFileGroup. Be sure that you place this file on a separate drive with isolated spindles.
![clip_image002[1] clip_image002[1]](http://blogs.msdn.com/blogfiles/enterprisesearch/WindowsLiveWriter/SQLFilegroupsandSearch_A612/clip_image002%5B1%5D_thumb.jpg)
d. Next you need to install the stored proc that will move the indexes and tables to the new filegroup. Open the script named MoveTableToFileGroup.sql within Management Studio and execute it, ensuring that you are working with the Search database. This will create a stored proc named proc_MoveTableToFileGroup. Confirm that this sproc does indeed exist within the Search database.
e. Open and execute the second script, named MoveCrawlTablesToFileGroup.sql. This is the script that does all of the work by calling proc_MoveTableToFileGroup for each table that is dedicated to crawling.
That is all there is to it. You have now moved your crawl tables onto a separate set of spindles.
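If you prefer to script steps a through c rather than click through Management Studio, the same result can be reached with two ALTER DATABASE statements. The Python sketch below only builds those statements as text so they can be reviewed before being run. The function name `filegroup_statements`, the logical file name `CrawlData`, and the database name and path in the usage note are placeholders invented for this example, not values taken from the product.

```python
def filegroup_statements(database, file_path, logical_name="CrawlData",
                         filegroup="CrawlFileGroup"):
    """Build the T-SQL that adds the crawl filegroup and one data file to it.

    The filegroup name defaults to CrawlFileGroup because the move scripts
    assume that exact name. logical_name is a hypothetical placeholder.
    """
    def ident(name):
        # Bracket-quote an identifier, doubling any closing bracket.
        return "[" + name.replace("]", "]]") + "]"

    def literal(value):
        # Quote a Unicode string literal, doubling any single quote.
        return "N'" + value.replace("'", "''") + "'"

    return [
        # Step b: create the filegroup.
        f"ALTER DATABASE {ident(database)} ADD FILEGROUP {ident(filegroup)};",
        # Step c: add a file on the isolated drive to that filegroup.
        f"ALTER DATABASE {ident(database)} ADD FILE "
        f"(NAME = {literal(logical_name)}, FILENAME = {literal(file_path)}) "
        f"TO FILEGROUP {ident(filegroup)};",
    ]
```

For example, `filegroup_statements("Search_DB", r"E:\SearchCrawl\CrawlData.ndf")` returns the two statements for a database called Search_DB with the new file on drive E:. Steps d and e still require running the two scripts named above.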
Thank you for your time, and as always I welcome any feedback or questions.
Dan Blood
Senior Test Engineer
Microsoft Corp
-
Enterprise customers are increasingly struggling with how to apply policy and governance at the desktop. End user adoption is cited as the single most critical barrier to success in ECM and Records Management initiatives. Using Concept Searching’s unique compound term processing, conceptClassifier for SharePoint can now automatically classify content from Microsoft Office applications, upload the documents directly to SharePoint, store the metadata in SharePoint properties, and write the classifications back to the custom properties of the document for use within knowledge and workflow applications or enterprise applications such as ECM, Document Management, Records Management, or eDiscovery.
The classification can take place automatically without end user intervention. Optionally, Subject Matter Experts can be granted the authority to manually adjust the classification based on the taxonomy. A ribbon bar has been added to the familiar Office interface enabling automatic classification of content. When the end user classifies a document the system will retrieve existing concepts as an aid to the classification process as shown below. Subject Matter Experts also have the ability to add or delete classes in the taxonomy.

Documents are uploaded to SharePoint and the classification metadata is stored in the properties fields. The classification status automatically reflects the manual classification so as to not overwrite the classification classes the Subject Matter Expert entered. The systems administrator features currently enabled include the ability to edit the classifications, classify the document, a batch of documents or the full library. This metadata can now be used by Microsoft Enterprise Search to improve identification of relevant documents when searching.

For more information visit www.conceptsearching.com or click here to view a webcast demo of the integrated technology.
Martin Garland
President Concept Searching, Inc
-
Hi all, this topic is an area that has caused me much pain and work. My goal for this was to follow the recommended SQL guidelines while minimizing the impact that these maintenance jobs have on Crawling and Queries. We know from the SQL Monitoring and I/O post that Search is extremely I/O intensive. As it turns out, so is all of the regular maintenance that SQL recommends, so finding the right balance between the two is an interesting scheduling task.
As a starting point much information about SQL maintenance and MOSS is covered in the following paper:
There are some key areas from the above paper that I would like to augment here.
- The stored procedure (proc_DefragIndexes) identified in this paper will work, but it is extremely expensive to run on the Search DB as it defrags all of the indexes in the table.
- Maintenance plans generated with the Maintenance Plan Wizard in SQL Server 2005 can cause unexpected results (KB 932744). While this was fixed in SQL 2005 SP2, these maintenance plans also do more work than is necessary to have a healthy, functional system.
- Shrinking the Search DB should not be a necessary task that you need to perform. The process of Shrinking the database does not provide a performance benefit. SQL best practices for DBCC SHRINKFILE suggest that this operation is most effective after an operation that creates lots of unused space. Search does not regularly perform these types of operations. The only time that a SHRINKFILE may make sense is after you have cleaned out your index by removing a Content Source.
- Rebuilding an index can cause latency issues with SQL Mirroring if the SQL I/O subsystem is constrained. If you are using SQL Mirroring, be sure you are following the SQL best practices and the SharePoint mirroring white paper. Because Search, SQL Mirroring, and defrag are all very I/O intensive you will want to be extra cautious with your deployment plan for this defrag script and make sure you test the script prior to going into production.
DBCC CHECKDB
DBCC CHECKDB is a command used to check the logical and physical integrity of all the objects in a database. SQL best practices recommend that you run DBCC CHECKDB periodically. For a Search deployment we recommend that you run DBCC CHECKDB WITH PHYSICAL_ONLY on a regular basis. The PHYSICAL_ONLY option reduces the overhead of the command, but it is still costly, so you should schedule it during off-peak times. The frequency of execution depends on your business needs, but a good place to start is once a week just prior to your back-up. You still need to run the full DBCC CHECKDB (without PHYSICAL_ONLY), but less frequently, again based on business needs; perhaps before every other or every third back-up.
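As a rough illustration of the two flavors described above, the statements might look like the following. The database name SearchDB is a placeholder and not the name of any real deployment; substitute the name of your own Search database.

```sql
-- SearchDB is a hypothetical database name; replace it with your Search DB.

-- Regular (for example weekly) check: physical integrity only, lower overhead.
DBCC CHECKDB (SearchDB) WITH PHYSICAL_ONLY;

-- Less frequent full check: logical and physical integrity.
-- NO_INFOMSGS suppresses informational messages so only errors are reported.
DBCC CHECKDB (SearchDB) WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

Both statements only report problems; neither attempts a repair.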
When running these commands make sure that you have a monitoring process in place. DBCC only reports errors; it does not fix them unless explicitly told to by other options. You either want to archive the output of the DBCC command for post processing or make sure you have event log monitoring set up (for example MOM) to check for DBCC errors.
In very large environments you can run DBCC on an off-line (sandbox) copy of the database. This will be less intrusive to end-users and the crawl. In this scenario you would restore your back-up to a separate sandbox and run DBCC CHECKDB in the restored environment.
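A minimal sketch of the sandbox approach, assuming a full back-up file is available to the sandbox server. Every database name, logical file name, and path below is a hypothetical placeholder; use RESTORE FILELISTONLY against your own back-up to find the real logical file names.

```sql
-- All names and paths here are hypothetical placeholders.

-- 1. Restore the production back-up under a different name on the sandbox server.
RESTORE DATABASE SearchDB_Sandbox
FROM DISK = N'X:\Backups\SearchDB.bak'
WITH MOVE N'SearchDB_Data' TO N'Y:\Sandbox\SearchDB_Sandbox.mdf',
     MOVE N'SearchDB_Log'  TO N'Y:\Sandbox\SearchDB_Sandbox.ldf',
     RECOVERY;

-- 2. Run the full integrity check against the restored copy, away from end-users and the crawl.
DBCC CHECKDB (SearchDB_Sandbox) WITH NO_INFOMSGS, ALL_ERRORMSGS;
```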
Fragmentation and index statistics freshness
We started with the proc_DefragIndexes script mentioned above. After running it, it became obvious that the script was just too expensive to run on a regular basis. To reduce the load placed on the I/O system we took a look at all of our indexes in the Search DB and defragged them one-by-one, measuring query performance along the way. Doing this we were able to identify the indexes that provided a performance benefit to the system when they were defragmented. These indexes are listed below:
- IX_MSSDocProps
- IX_MSSDocSdids
- IX_AlertDocHistory
- IX_MSSDEFINITIONS_DOCID
- IX_MSSDEFINITIONS_TERM
- PK_Sdid
- IX_SDHash
- IX_DOCID
Optionally there are two additional indexes that you may want to include in your defrag maintenance plan. These indexes do not see much use in typical out of box situations and are commented out in the script. But if your environment is built on a custom UI or makes extensive use of the Advanced Search UI you will see improvements in query latencies if you defrag them.
- IX_int -- defrag this index if you have a lot of queries that use numeric properties in the property store. The classic case is date range queries.
- IX_Str -- defrag this index if you have a lot of queries that use string properties in the property store. There is not a common case for this, but if you have made changes to your managed properties and are driving your search UI off of exact matches for a string based property you will want to regularly defrag this index.
Once we knew which indexes to defrag we looked at how long it took for an index to reach a 10% fragmentation rate. From this we adjusted the FILLFACTOR so we could go longer between defrags. At this point we are seeing somewhere around 2+ weeks between defrags. Do note that by adjusting the FILLFACTOR to leave more free space on each page we did grow the size of the database slightly; the growth on SearchBeta was not that large.
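To see where your own indexes stand against that 10% mark, a query along these lines can help. It is only a sketch built on the standard SQL Server 2005 DMVs, not part of the script described in this post; run it in the context of the Search DB, preferably off-peak since the scan itself costs I/O.

```sql
-- Run in the context of the Search database.
DECLARE @db_id int;
SET @db_id = DB_ID();

SELECT OBJECT_NAME(ps.object_id)        AS table_name,
       i.name                           AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(@db_id, NULL, NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
WHERE i.name IN (N'IX_MSSDocProps', N'IX_MSSDocSdids', N'IX_AlertDocHistory',
                 N'IX_MSSDEFINITIONS_DOCID', N'IX_MSSDEFINITIONS_TERM',
                 N'PK_Sdid', N'IX_SDHash', N'IX_DOCID')
ORDER BY ps.avg_fragmentation_in_percent DESC;
```

Running this on a schedule and recording the results gives you the time-to-10% figure for your own environment.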
We then looked at the cost/benefit of doing a Reorganize versus a Rebuild. This was an interesting discovery for us. Initially we had a script in place similar to proc_DefragIndexes that would choose to Reorganize or Rebuild based on percent fragmentation, with 30% being the decision point (that is, greater than 30% would do a Rebuild). What we found was that a Reorganize took over 8 hours at a 10% fragmentation rate, and during this time end-user queries suffered dramatically. Out of curiosity and desperation we tried a Rebuild, which is supposed to be the more expensive of the two operations. The Rebuild completes in approximately 1 hour while the Reorganize takes as long as 8 hours. The Rebuild is more expensive in the sense that you will see some failed queries during the hour that it runs, whereas the Reorganize doesn't have as drastic an effect on the queries, but its overall cost is much higher since you have an 8 hour window where query performance is degraded.
UPDATE STATISTICS: In the experiments we ran we found that because the rebuild also updates statistics, it was not necessary to run this command regularly as well.
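For reference, the two operations being compared look roughly like this on the largest index. The FILLFACTOR value of 80 is taken from the test statement shown later in this post; treat it as a starting point rather than a tuned value.

```sql
-- Option A: Reorganize. Stays online, but on our hardware it ran for hours.
ALTER INDEX IX_MSSDocProps ON [dbo].[MSSDocProps]
REORGANIZE;

-- Option B: Rebuild. Offline for this index, but much shorter overall,
-- and it refreshes the statistics for the index as part of the rebuild.
ALTER INDEX IX_MSSDocProps ON [dbo].[MSSDocProps]
REBUILD WITH (FILLFACTOR = 80, ONLINE = OFF);
```

Run one or the other, not both; the second statement would simply redo the work of the first.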
Finally we deployed the script into an environment that utilized SQL Mirroring. Unfortunately this didn't work out very well. The mirror got so far behind that we eventually had to disconnect the mirror and stop the defrag. Going through an analysis of this it became clear that the root cause was that the environment was heavily I/O bound and the defrag script generated more I/O than the system could keep up with. While the mirror was behind end-user query latencies suffered dramatically. To recover from this we ultimately had to improve the hardware by increasing the number of spindles.
To mitigate this we have added a parameter to the script that allows you to reduce the MAXDOP used in the index rebuild. Setting this parameter to 1 on a SQL box that is minimally I/O bound helps, but it may not be enough depending on how constrained the system is. If you are in an environment that is I/O bound (with or without SQL Mirroring) we strongly recommend that you go through a test of the defrag before you go live with the deployment. The easiest thing to try is the following SQL statement:
ALTER INDEX IX_MSSDocProps ON [dbo].[MSSDocProps]
REBUILD WITH (MAXDOP = 1, FILLFACTOR = 80, ONLINE = OFF)
The statement above rebuilds the largest index using the lowest possible MAXDOP. This index must be rebuilt OFFLINE, so you will need to run this on a test system or during a maintenance window. While this command is running keep an eye on the following:
- The duration of the command. Will it complete within your service window? For comparison purposes this command completes in under an hour on the SearchBeta hardware.
- SQL I/O latencies
- If you have mirroring in place, the state of your mirroring
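One way to watch these from a second connection while the rebuild runs is sketched below. It uses only standard SQL Server 2005 DMVs and catalog views and is not part of the defrag script; the file statistics are cumulative since the instance started, so sample them twice and compare.

```sql
-- Run in the context of the Search database, from a separate connection.
DECLARE @db_id int;
SET @db_id = DB_ID();

-- Average I/O stall per read and per write, per database file (cumulative since startup).
SELECT file_id,
       CASE WHEN num_of_reads  = 0 THEN 0
            ELSE io_stall_read_ms  / num_of_reads  END AS avg_read_stall_ms,
       CASE WHEN num_of_writes = 0 THEN 0
            ELSE io_stall_write_ms / num_of_writes END AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(@db_id, NULL);

-- Mirroring role and state for every mirrored database on the instance.
SELECT DB_NAME(database_id) AS database_name,
       mirroring_role_desc,
       mirroring_state_desc
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;

-- How far behind the mirror is: unsent log on the principal, unapplied log on the mirror.
SELECT instance_name AS database_name,
       counter_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%Database Mirroring%'
  AND counter_name IN (N'Log Send Queue KB', N'Redo Queue KB');
```

If the send or redo queue keeps growing for the length of the rebuild, the mirror is falling behind in the way described above.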
Bottom line we feel the rebuild is a much better operation to run and recommend that you:
- Run the script on a regular basis: once a night or on the weekends, depending on your service windows.
  - Weekends or weekly - reduce the fragmentation rate (sproc parameter) to 5.0 or lower so that an index is not skipped because it falls a fraction of a percent short of the threshold (for example, 9.5%).
  - Nightly - use the defaults for fragmentation rate. The largest index (MSSDocProps) gets rebuilt approximately every 2 weeks on SearchBeta. Running the script nightly will ensure that your indexes are up to date more often, but gives you less control over the exact time that the index rebuild occurs.
- Before running the script the first time, test how your system behaves when rebuilding MSSDocProps.
  - Reduce MAXDOP - If your environment shows poor I/O response time or unacceptable durations (cannot complete a defrag inside your service window), reducing the MAXDOP value may reduce the duration of the script and put less pressure on the I/O system. Reducing the MAXDOP will not help enough if the system is very I/O bound.
  - SQL Mirroring - SQL mirroring is sensitive to I/O latencies; adding the defrag may be too much I/O for the system to handle.
  - Poor I/O latency - You should focus on improving the I/O subsystem of your SQL environment before you begin running this script.
Stored Procedure syntax:
exec proc_DefragSearchIndexes [MAXDOP value],
[fragmentation percent]
- MAXDOP value - Integer value. Default is 0 which means that all available CPUs will be used.
- Fragmentation percent - decimal value. Default is 10.0. This value was explicitly chosen because we were able to measure query latency improvements on SearchBeta when defragging at the 10% boundary.
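A few example invocations, assuming the stored procedure from the script has been created in the Search DB and takes its two parameters positionally in the order shown in the syntax above. Pick the one that matches your schedule rather than running all three.

```sql
-- Nightly schedule: rely on the defaults (MAXDOP 0, fragmentation threshold 10.0).
EXEC proc_DefragSearchIndexes;

-- Weekly or weekend schedule: lower the threshold so an index sitting at,
-- say, 9.5% is not skipped until the following week.
EXEC proc_DefragSearchIndexes 0, 5.0;

-- I/O constrained or mirrored environment: limit the rebuild to a single CPU.
EXEC proc_DefragSearchIndexes 1, 10.0;
```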
-Thanks
Dan Blood
Senior Test Engineer
Microsoft Corp