Indexing and Searching ASPX Files

Someone who commented on one of last week’s postings lamented the trouble with indexing ASPX files, particularly because Microsoft product support says that we don’t recommend searching them.  This person’s company has a lot of content in ASPX files and doesn’t want to go back and change them to HTML or Word files.  I don’t blame them.

Good news: indexing ASPX files is fine in many circumstances.  The default settings are there to ward off a couple of worst-case scenarios.

If you’re using ASPX files as a means for storing pages, that in and of itself is fine. If the pages are being built dynamically, that’s fine, too, except for two situations:

  1. If your ASPX file contains code that causes the page to render different content every time it’s retrieved, that could be a problem — but it might not be.  If the part of the page you need to index changes all the time, what’s in the index will always be wrong.  That would be bad.

    If, on the other hand, part of the page didn’t change (or didn’t change very often), and it’s that part that needs to be indexed, that’s probably fine.  For example, consider a page that fetches the U.S. State Department’s traveler advisories for different countries: you could have a page with hard-coded content such as “Traveler Advisory for Canada**” and code that fetches the current advisory over SOAP, RSS, or HTML clipping.

    In this case, (a) it’s the fact that there’s a page with traveler advisory info for Canada that’s most important — clicking on the link within search results to get to the content of the advisory is probably fine, and (b) the advisories don’t change that frequently, so even indexing on the content of the advisory isn’t that risky.

    For extra credit, your ASPX page would keep the <META> tag that reports the page’s last modified time in sync with the date on which the fetched advisory report most recently changed.  This would free up our index gatherer process from having to re-crawl the page every time.

  2. If the ASPX page adapts its content to the person viewing the page, it’s not a good candidate for being indexed.  We index a given content source with one Windows account, ideally one with maximum privileges.  We try to retrieve the permissions for each piece of content and index those, too, so when we return query results, they’re trimmed to only display results you’re allowed to actually open.

    If, however, the page has only a little bit of content when you read it, but a lot of content when the index gatherer “reads” it, it could be returned in the search results as a false positive; when you click on it, the page you then try to read might not be relevant at all.

    If, on the other hand, you overcompensate by using a low-privilege account to index content, you’ll fail to record a lot of content, and pages that should match a query won’t be returned as results (a false negative condition).

    The only way any search architecture could get around this is to execute each page with every known user’s credentials, which is completely impractical at best, non-performant at worse, and a potential security hole at worst.  We’re not going to be doing that.
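
The false positive and false negative described in item 2 can be shown with a toy model. This is a minimal sketch in Python, not a description of how SPS works internally: the page, the group names, the users, and the functions `render_page`, `build_index`, and `search` are all hypothetical, and the "index" here is just a set of words plus the list of people allowed to open the page.

```python
# Hypothetical page whose content depends on who is viewing it.
def render_page(viewer_groups):
    sections = ["Welcome to the team site."]
    if "managers" in viewer_groups:
        sections.append("Quarterly budget forecast.")
    return " ".join(sections)


def build_index(crawler_groups, readers):
    """Index the page once, as the crawl account sees it, plus who may open it."""
    text = render_page(crawler_groups).lower().replace(".", "")
    return {"words": set(text.split()), "readers": set(readers)}


def search(index, term, user):
    """Security-trimmed query: the term must be indexed and the user must have access."""
    return term.lower() in index["words"] and user in index["readers"]


# Assumed users: both may open the page, but only bob is in "managers".
users = {"alice": {"staff"}, "bob": {"staff", "managers"}}

high = build_index({"staff", "managers"}, users.keys())  # high-privilege crawl
low = build_index({"staff"}, users.keys())               # low-privilege crawl

# False positive: alice gets a hit for "budget", but her view of the page lacks it.
false_positive = (search(high, "budget", "alice")
                  and "budget" not in render_page(users["alice"]).lower())

# False negative: bob's view contains "budget", but the low-privilege index misses it.
false_negative = (not search(low, "budget", "bob")
                  and "budget" in render_page(users["bob"]).lower())
```

Trimming by page-level permissions cannot fix either case, because both users are allowed to open the page; the mismatch is inside the page itself.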

Keep these two factors in mind, and if you can mitigate them, go into SPS’ search settings and tell it to index ASPX files with a clear conscience.
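
One mitigation for the first factor is the "extra credit" idea from item 1: stamp the page with the time the fetched advisory last changed rather than the time the page was rendered. The sketch below is in Python rather than ASP.NET and rests on assumptions: the `<meta http-equiv="Last-Modified">` form of the tag, the function names, and the `needs_recrawl` check are illustrative only, and the actual gatherer's logic is not described in this post.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def render_advisory_page(country, advisory_text, advisory_changed):
    """Render the page; the meta tag carries the advisory's change time, not the render time."""
    # HTTP-style date string in GMT (advisory_changed must be timezone-aware).
    stamp = format_datetime(advisory_changed.astimezone(timezone.utc), usegmt=True)
    return (
        "<html><head>"
        f'<meta http-equiv="Last-Modified" content="{stamp}">'
        f"<title>Traveler Advisory for {country}</title></head>"
        f"<body><h1>Traveler Advisory for {country}</h1>"
        f"<p>{advisory_text}</p></body></html>"
    )


META_RE = re.compile(r'<meta http-equiv="Last-Modified" content="([^"]+)">')


def needs_recrawl(html, last_indexed):
    """Hypothetical crawler-side check: re-index only if the page reports a newer change."""
    match = META_RE.search(html)
    if match is None:
        return True  # no timestamp on the page, so assume it may have changed
    return parsedate_to_datetime(match.group(1)) > last_indexed


changed = datetime(2005, 3, 1, 12, 0, tzinfo=timezone.utc)
page = render_advisory_page("Canada", "No current advisory.", changed)
```

Because the stamp follows the advisory and not the request, a page that is rebuilt on every hit still reports the same modified time until the advisory itself changes.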

(**Don’t misinterpret this as anything but an abstract example.  I picked the U.S. State Department because I live in the United States.  I picked Canada because I’m a citizen of Canada.)

Published Thursday, March 24, 2005 2:50 PM by MikeFitz

Comments

Friday, March 25, 2005 4:13 PM by Adil Hindistan

# re: Indexing and Searching ASPX Files

Thanks for the response, Mike. We have static .aspx pages, so I guess we should be in good shape. However, because no results from any .aspx files are returned, I am assuming that the current search settings are not correct. How can we rectify this?

Also, as I mentioned previously, when we copy and paste parts of the text in an .aspx document, or when we change the color of text, the format of the page sometimes gets weird, with HTML tags such as </font> and </B> appearing on the page. It is very difficult to fix those, as we usually end up building the whole page from scratch.

I think SharePoint pages do not really do a good job of segregating text from HTML tags, and copy the tags together with the text (e.g., when you copy a word in red to some other part of the same page, suddenly the whole paragraph turns red).

Thanks again! I appreciate your help/comments!

Friday, March 25, 2005 4:29 PM by Mike Fitzmaurice

# re: Indexing and Searching ASPX Files

How to do search inclusions/exclusions based on file extension is documented in the SPS Administrator's Guide.

It sounds like your formatting issues are the fault of your editor, not ASPX pages themselves.

Wednesday, March 30, 2005 11:32 AM by Robert Bogue

# re: Indexing and Searching ASPX Files

Indexing Guide for any web pages...