We’ve introduced improvements to search in SharePoint 2013 that make it easier to display relevant titles and authors in search results. We’ve also changed how the time of the last document modification is set, which allows more consistent and intuitive sorting and search refinement based on this time.
In this blog post we’ll tell you about these changes. They’re included in the SharePoint Server 2013 cumulative update published on October 26th, 2013.
The metadata extractor in the content processing pipeline extracts metadata from the content that you crawl. Before the changes we’ve introduced, the output of the metadata extractor was directly written to the corresponding managed properties. Now, we’ve created two brand new crawled properties: MetadataExtractorTitle and MetadataExtractorAuthor. The metadata extractor now writes extracted titles and authors from Word documents and PowerPoint presentations to these crawled properties. These new crawled properties map to the managed properties Title and Author.
We’ve also removed extraction of LastModifiedTime from the MetadataExtractor code. Dates that appear in the document body no longer influence the date of last modification.
SharePoint Server 2013:
· Install the SharePoint Server 2013 cumulative update package published on October 26th 2013.
· Perform a full crawl of all your content sources.
· What has changed to allow search to display better titles?
· How can I change which title is shown in the search results?
· What’s new with the Author mapping?
· What’s new in last saved date/time extraction?
Sometimes, people save or upload Word documents or PowerPoint presentations with titles like “Document1.docx” or “Presentation1.pptx”. Before the MetadataExtractor was introduced, this title would typically show up as the title in the search results, which was not very helpful.
To present a better title for such files in the search results, we use the MetadataExtractor in the content processing pipeline. It searches for a title in the body of Word and PowerPoint files. Currently, if the MetadataExtractor finds a good candidate for a title in the body, it writes the extracted title to the new crawled property MetadataExtractorTitle that is mapped to the managed property Title by default.
Because the title from the crawled property MetadataExtractorTitle has the first priority in the mapping to the managed property Title, there’s a good chance that the titles of Word and PowerPoint files shown in search results are more relevant.
Note: back up any custom mapping for the managed property Title before you install the October CU. Otherwise the custom mapping will be lost, because installing the update creates the new crawled properties and rolls the Title mappings back to the default.
You can change which crawled property is selected to be shown as the title in the search results. This depends on the priorities of crawled properties in the search schema. If you decide to change the priority order of the mapping, make sure that the crawled property that you give priority to is filled with useful titles.
Here’s a table that shows the default priority list for the crawled properties mapped to the managed property Title:
What kind of value does this crawled property contain?
The title extracted from the body of Word documents and PowerPoint presentations.
The title of the item in SharePoint.
The title of the item in Word or PowerPoint, etc.
Name of the SharePoint page. Ex: http://my/sites/wiki/Home.aspx
The title as picked up by the content processing component.
The subject of an email file as picked up by the content processing component.
The subject line of an email file.
Person’s first and last name
Contains file name of an Office doc
SharePoint Page Title
Contains Filename metadata associated with file properties
Contains Path metadata associated with file properties.
Even though you can change the priority order of the mapping, if one of the crawled properties is empty, the next crawled property from the priority list will be selected.
So, even though the MetadataExtractorTitle has the first priority for the title, it will only be used if a title was extracted. If that, for some reason, wasn’t possible, the TermTitle from SharePoint will be used as the title, and so on. The same mapping order is active for other document formats. However, the MetadataExtractor doesn’t work for other formats such as PDF. For file types other than Word documents and PowerPoint presentations, the MetadataExtractorTitle will be empty and the next crawled property in the priority list will be shown as the title.
Alternatively, if you want to use the SharePoint TermTitle as the title for all your documents, change the priority of the crawled property TermTitle to position 0. If, for some reason, the TermTitle has no value, the MetadataExtractorTitle will be shown as the title, and so on.
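The fallback behavior described above can be sketched in a few lines of Python. This is only an illustration of the "first non-empty value in priority order" idea, not the actual search schema code. MetadataExtractorTitle and TermTitle are the crawled properties named in this post; the `pick_title` helper, the `FileName` placeholder, and the sample values are assumptions made for the example.

```python
from typing import Mapping, Optional, Sequence


def pick_title(crawled: Mapping[str, Optional[str]],
               priority: Sequence[str]) -> Optional[str]:
    """Return the first non-empty crawled property value in priority order."""
    for name in priority:
        value = crawled.get(name)
        # Empty, missing, or whitespace-only values are skipped.
        if value is not None and value.strip():
            return value
    return None


# "FileName" is a hypothetical placeholder for the lower-priority properties.
DEFAULT_ORDER = ["MetadataExtractorTitle", "TermTitle", "FileName"]
TERMTITLE_FIRST = ["TermTitle", "MetadataExtractorTitle", "FileName"]

word_doc = {
    "MetadataExtractorTitle": "Quarterly Sales Review",
    "TermTitle": "Document1",
    "FileName": "Document1.docx",
}
pdf_doc = {
    "MetadataExtractorTitle": "",  # nothing extracted for a PDF
    "TermTitle": "Annual Report",
    "FileName": "report.pdf",
}

print(pick_title(word_doc, DEFAULT_ORDER))    # Quarterly Sales Review
print(pick_title(pdf_doc, DEFAULT_ORDER))     # Annual Report
print(pick_title(word_doc, TERMTITLE_FIRST))  # Document1
```

Swapping the order of the list is the only thing that changes between the default behavior and the "TermTitle first" configuration; the fallback rule itself stays the same.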
You can change the priority in the search schema, see Manage the search schema (TechNet, on premises) or Manage the search schema
We’ve added the MetadataExtractorAuthor crawled property. The metadata extractor extracts authors from the body of Word documents and PowerPoint presentations and keeps them in this new crawled property. This can be useful, for example, for scientific articles where all authors are listed inside the document body but are not included in any document properties.
The mapping to the Author managed property for any file format works like this:
1) All possible authors found during crawling are added to a non-prioritized list.
2) From that list, a concatenated string is created that excludes duplicates and empty values.
3) This string is mapped to the Author managed property.
The authors extracted by the metadata extractor are simply added to the list and included in the string.
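The three steps above can be sketched roughly as follows. This is a minimal illustration, not the real content processing code: the `build_author_string` name, the `;` separator, the whitespace trimming, and the case-insensitive duplicate check are all assumptions, since the post does not specify them.

```python
from typing import Iterable, Optional


def build_author_string(candidates: Iterable[Optional[str]],
                        separator: str = ";") -> str:
    """Join author candidates, dropping empty values and duplicates."""
    seen = set()
    unique = []
    for name in candidates:
        cleaned = (name or "").strip()
        if not cleaned:
            continue  # step 2: exclude empty values
        key = cleaned.casefold()  # assumption: compare names case-insensitively
        if key in seen:
            continue  # step 2: exclude duplicates
        seen.add(key)
        unique.append(cleaned)
    # Step 3: this string is what would be mapped to the Author managed property.
    return separator.join(unique)


# Step 1: a non-prioritized list of every author found during the crawl,
# including one extracted from the document body.
found = ["Ada Lovelace", "", "ada lovelace", None, "Charles Babbage"]
print(build_author_string(found))  # Ada Lovelace;Charles Babbage
```

Because the list is not prioritized, an author found only in the document body ends up in the final string on equal footing with authors taken from document properties.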
Even though the priority is not important for the Author managed property, since all authors extracted from content are included in the string, this is where the crawled properties come from:
Author as picked up by the content processing component.
The people names from the from line of an email file.
The people names associated with OneNote files.
Internal SharePoint objects
Contains metadata associated with internal SharePoint objects
The author extracted from the body of Word documents and PowerPoint presentations.
We stopped extracting the date of last modification or creation from the document body. Even though it may be useful for PowerPoint presentations where the date of the presentation is mentioned on the first slide, it introduced too much uncertainty. Imagine a presentation about the French Revolution that has the relevant dates on its first slide. It was then highly probable that your presentation would get 14.07.1789 as its creation date, which is clearly undesirable.
So, with this change you can still map crawled properties to LastModifiedTime and use the managed property in search results, but there will be no output from the MetadataExtractor in this list.
This table shows the default crawled property mapping and priority to LastModifiedTime:
The timestamp showing when the item was last saved as picked up by the content processing component.
The timestamp showing when the item was last saved in SharePoint.
The timestamp showing when the item was last accessed in OneNote.
You can now sort search results based on your preferred date of modification by changing the priority order, or you can implement more sophisticated logic, such as deleting documents that are too old from your site collection.
We hope that by adding these changes, we’ve improved the way in which you can control search results.
So first off thank you for allowing us to now remove the MetadataExtractorTitle from the Title managed property. This will be a huge improvement as we have tons of templated documents for meeting minutes, agendas and the like which all had the same titles being displayed in search.
A question though, if we remove the MetadataExtractorTitle property for Title, will this also change the _layouts/15/osssearchresults.aspx site specific search page search results as well as the Enterprise Search Center?
MetadataExtractorTitle is just a crawled property. It affects how the title of a document is extracted. Title is extended as a managed property (MP) only. If the MetadataExtractorTitle is removed from the mapping, the next crawled property in the order will be used as the title.
For the changes to take effect, a recrawl (of any type) needs to happen, so if the customer wants to affect all documents simultaneously, a full recrawl is needed.
So we tried both removing the MetadataExtractorTitle from the Title Managed Property, and moving it down in the crawled property order but the MetadataExtractorTitle was still being displayed in search results. We ran several Full Crawls and Incremental Crawls with no changes being displayed.
Finally we did an Index Reset, then a full crawl, and now the MetadataExtractorTitle is no longer being displayed in search results. This was all in our Dev environment, so is an Index Reset required for this change to work? Or is there a nightly job that perhaps will update this? We really do not want to do an Index Reset in Production if we can avoid it.
Any information would be appreciated.
This is a very informative post - thank you!
How do you add properties like "Mail:6" back if you remove them? We'd like to try some different configurations but I don't see these crawled properties listed.