Show more relevant Titles in search results in SharePoint 2013 plus some other improvements
We’ve introduced improvements to search in SharePoint 2013 that make it easier to display relevant titles and authors in search results. We’ve also changed how the time of the last document modification is set, which allows more consistent and intuitive sorting and search refinement based on this time.
In this blog post we’ll tell you about these changes. They’re included in the SharePoint Server 2013 cumulative update published on October 26th 2013.
Tell me in a few words: what has changed?
The metadata extractor in the content processing pipeline extracts metadata from the content that you crawl. Before the changes we’ve introduced, the output of the metadata extractor was directly written to the corresponding managed properties. Now, we’ve created two brand new crawled properties: MetadataExtractorTitle and MetadataExtractorAuthor. The metadata extractor now writes extracted titles and authors from Word documents and PowerPoint presentations to these crawled properties. These new crawled properties map to the managed properties Title and Author.
We’ve also removed extraction of LastModifiedTime from the MetadataExtractor code. Dates that appear in the document body no longer influence the date of last modification.
How can I benefit from these improvements and get the new properties?
SharePoint Server 2013:
· Install the SharePoint Server 2013 cumulative update package published on October 26th 2013.
· Perform a full crawl of all your content sources.
Tell me the details
What has changed to allow search to display better titles?
How can I change which title is shown in the search results?
What’s new with the Author mapping?
What’s new in last saved date/time extraction?
What has changed to allow search to display better titles?
Sometimes, people save or upload Word documents or PowerPoint presentations with titles like “Document1.docx” or “Presentation1.pptx”. Before the MetadataExtractor was introduced, this would typically show up as the title in the search results. That was not so good.
To present a better title for such files in the search results, we use the MetadataExtractor in the content processing pipeline. It searches for a title in the body of Word and PowerPoint files. Now, if the MetadataExtractor finds a good candidate for a title in the body, it writes the extracted title to the new crawled property MetadataExtractorTitle, which is mapped to the managed property Title by default.
Because the title from the crawled property MetadataExtractorTitle has the first priority in the mapping to the managed property Title, there’s a good chance that the titles of Word and PowerPoint files shown in search results are more relevant.
Note: back up any custom mappings for the Title managed property before you install the October CU. Otherwise they will be lost, because the update creates the new crawled properties and in doing so rolls Title back to its default mappings.
How can I change which crawled property is shown as the title in the search results?
You can change which crawled property is selected to be shown as the title in the search results. This depends on the priorities of crawled properties in the search schema. If you decide to change the priority order of the mapping, make sure that the crawled property that you give priority contains useful titles.
Here’s a table that shows the default priority list for the crawled properties mapped to the managed property Title:
Priority | Crawled Property | Origin | What kind of value does this crawled property contain?
0 | MetadataExtractorTitle | MetadataExtractor | The title extracted from the body of Word documents and PowerPoint presentations.
1 | TermTitle | SharePoint | The title of the item in SharePoint.
2 | Office:2 | Office | The title of the item in Word or PowerPoint, etc.
3 | Ows_BaseName | SharePoint | Name of the SharePoint page. Ex: https://my/sites/wiki/Home.aspx
4 | Title | Doc Parser | The title as picked up by the content processing component.
5 | MailSubject | Doc Parser | The subject of an email file as picked up by the content processing component.
6 | Mail:5 | | The subject line of an email file.
7 | People:PreferredName urn:schemas-microsoft-com:sharepoint:portal:profile:PreferredName | People | Person’s first and last name.
8 | Basic:displaytitle urn:schemas.microsoft.com:fulltextqueryinfo:displaytitle | Basic | Contains the file name of an Office document.
9 | ows_Title | SharePoint | SharePoint page title.
10 | Basic:10 | Basic | Contains Filename metadata associated with file properties.
11 | Basic:9 | Basic | Contains Path metadata associated with file properties.
Whatever priority order you choose, if a crawled property is empty, the next crawled property in the priority list is selected.
So, even though MetadataExtractorTitle has the first priority for the title, it is only used if a title was extracted. If that, for some reason, wasn’t possible, the TermTitle from SharePoint is used as the title, and so on. The same mapping order is active for other document formats. However, the MetadataExtractor doesn’t work for other formats such as PDF. For file types other than Word documents and PowerPoint presentations, MetadataExtractorTitle is empty and the next crawled property in the list is shown as the title.
Alternatively, if you want to use the SharePoint TermTitle as the title for all your documents, change the priority of the crawled property TermTitle to position 0. If, for some reason, the TermTitle has no value, the MetadataExtractorTitle will be shown as the title, and so on.
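To make the fallback behavior concrete, here is a minimal Python sketch of "take the first non-empty crawled property in priority order". It is only an illustration of the logic described above, not the actual search schema implementation. The function name `pick_first_non_empty`, the dictionary representation of an item, and the sample values are assumptions made for this example; the property names come from the default priority table.

```python
from typing import Mapping, Optional, Sequence

# Default priority order for the Title managed property (from the table above).
DEFAULT_TITLE_PRIORITY = [
    "MetadataExtractorTitle",
    "TermTitle",
    "Office:2",
    "Ows_BaseName",
    "Title",
    "MailSubject",
    "Mail:5",
    "People:PreferredName",
    "Basic:displaytitle",
    "ows_Title",
    "Basic:10",
    "Basic:9",
]


def pick_first_non_empty(
    crawled: Mapping[str, str], priority: Sequence[str]
) -> Optional[str]:
    """Return the first non-empty value in priority order (hypothetical helper)."""
    for name in priority:
        value = crawled.get(name)
        if value is not None and value.strip():
            return value
    return None


# Hypothetical items: a Word document with an extracted title, and a PDF
# for which the MetadataExtractor produced nothing.
word_doc = {"MetadataExtractorTitle": "Quarterly Sales Review", "TermTitle": "Document1.docx"}
pdf_doc = {"MetadataExtractorTitle": "", "TermTitle": "Annual Report"}

print(pick_first_non_empty(word_doc, DEFAULT_TITLE_PRIORITY))  # Quarterly Sales Review
print(pick_first_non_empty(pdf_doc, DEFAULT_TITLE_PRIORITY))   # Annual Report

# Moving TermTitle to position 0, as described above.
term_title_first = ["TermTitle"] + [p for p in DEFAULT_TITLE_PRIORITY if p != "TermTitle"]
print(pick_first_non_empty(word_doc, term_title_first))        # Document1.docx
```

The last call shows the trade-off of promoting TermTitle: the generic file name wins again whenever SharePoint has a value for it.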
You can change the priority in the search schema. See Manage the search schema (TechNet, on premises) or Manage the search schema.
What’s new with the Author mapping?
We’ve added the MetadataExtractorAuthor crawled property. The metadata extractor extracts authors from the body of Word documents and PowerPoint presentations and keeps them in this new crawled property. This can be useful for, for example, scientific articles where all authors are listed in the document body but do not appear in any document properties.
The mapping to the Author managed property for any file format works like this:
1) All possible authors found during crawling are added to a non-prioritized list.
2) From that list, a concatenated string is created that excludes duplicates and empty values.
3) This string is mapped to the Author managed property.
The authors extracted by the metadata extractor are simply added to the list and included in the string.
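The three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual pipeline code: the function name `build_author_string` is made up, the `;` separator is a guess (the post does not say which separator is used), and duplicates are assumed to be exact matches after trimming whitespace.

```python
from typing import Iterable


def build_author_string(candidates: Iterable[str], separator: str = ";") -> str:
    """Concatenate author candidates, skipping empty values and duplicates.

    Hypothetical helper: the separator and the exact-match duplicate rule
    are assumptions made for this sketch.
    """
    seen = set()
    unique = []
    for author in candidates:
        name = author.strip()
        if not name or name in seen:
            continue  # drop empty values and names already in the list
        seen.add(name)
        unique.append(name)
    return separator.join(unique)


# Hypothetical candidates collected from several crawled properties,
# including one name found by the metadata extractor in the document body.
candidates = ["A. Smith", "", "B. Jones", "A. Smith", "C. Lee"]
print(build_author_string(candidates))  # A. Smith;B. Jones;C. Lee
```

Because the list is not prioritized, the order of the names in the result carries no meaning here; the sketch simply keeps the order in which candidates arrive.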
Priority does not matter for the Author managed property, because all authors extracted from the content are included in the string. For reference, these are the crawled properties that feed it:
Crawled Property | Origin | What kind of value does this crawled property contain?
Author | Document Parser | Author as picked up by the content processing component.
MailFrom | | The people names from the From line of an email file.
Mail:6 | Author, MetadataAuthor |
Author | Notes | The people names associated with OneNote files.
Internal:3 | Internal SharePoint objects | Contains metadata associated with internal SharePoint objects.
Internal:105 | Internal SharePoint objects | Contains metadata associated with internal SharePoint objects.
Office:8 | Office | ModifiedBy metadata.
MetadataExtractorAuthor | MetadataExtractor | The author extracted from the body of Word documents and PowerPoint presentations.
What’s new in last saved date/time extraction?
We stopped extracting the date of last modification or creation from the document body. Even though this may be useful for PowerPoint documents where the date of the presentation is mentioned on the first slide, it introduced too much uncertainty. Imagine a presentation about the French Revolution that has the dates of the revolution on its first slide. It was highly probable that your presentation would get 14.07.1789 as its creation date, which, I believe, is undesired.
So, with this change you can still map crawled properties to LastModifiedTime and use the managed property in the search results, but the MetadataExtractor no longer contributes to this list.
This table shows the default crawled property mapping and priority for LastModifiedTime:
Priority | Crawled Property | Origin | What kind of value does this crawled property contain?
0 | LastSavedDateTime | Document Parser | The timestamp showing when the item was last saved as picked up by the content processing component.
1 | Basic:14 | Basic | LastModifiedTime metadata.
2 | Basic:16 | Basic | LastModifiedTime metadata.
3 | ows_Modified | SharePoint | The timestamp showing when the item was last saved in SharePoint.
4 | Lastaccessed | Notes | The timestamp showing when the item was last accessed in OneNote.
You can now sort search results on your preferred date of modification by changing the priority order, or implement more sophisticated logic, such as deleting documents that are too old from your site collection.
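As a rough illustration of what a reliable LastModifiedTime makes possible, the following Python sketch sorts a list of results newest first and picks out the items older than a cutoff. It works on plain dictionaries and does not call any SharePoint API; the result rows, the function names, and the cutoff date are all hypothetical.

```python
from datetime import datetime
from typing import Dict, List

Result = Dict[str, object]


def sort_newest_first(results: List[Result]) -> List[Result]:
    """Sort results by LastModifiedTime, most recent first (hypothetical helper)."""
    return sorted(results, key=lambda r: r["LastModifiedTime"], reverse=True)


def older_than(results: List[Result], cutoff: datetime) -> List[Result]:
    """Return the results last modified before the cutoff (hypothetical helper)."""
    return [r for r in results if r["LastModifiedTime"] < cutoff]


# Hypothetical search results with the LastModifiedTime managed property.
results = [
    {"Title": "French Revolution", "LastModifiedTime": datetime(2013, 9, 30)},
    {"Title": "Old budget", "LastModifiedTime": datetime(2009, 3, 12)},
    {"Title": "Team wiki", "LastModifiedTime": datetime(2013, 10, 20)},
]

for row in sort_newest_first(results):
    print(row["Title"])

# Candidates for cleanup: everything not modified since the start of 2012.
stale = older_than(results, datetime(2012, 1, 1))
print([row["Title"] for row in stale])  # ['Old budget']
```

With the old behavior, the first item could have been dated 1789 from its slide content and would have been treated as stale.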
We hope that by adding these changes, we’ve improved the way in which you can control search results.