Today we are releasing the IIS Search Engine Optimization Toolkit. The IIS SEO Toolkit is a set of features that aim to help you keep your Web site and its content in good shape for both Users and Search Engines.
The features that are included in this Beta release include:
Checkout the great blog about IIS SEO Toolkit by ScottGu, or this IIS SEO simple video of some of its capabilities.
One of the problems with many similar tools out there is that they require you to publish the updates to your production sites before you can even use the tools, and of course would never be usable for Intranet or internal applications that are not exposed to the Web. The IIS Search Engine Optimization Toolkit can be used internally in your own development or staging environments giving you the ability to clean up the content before publishing to the Web. This way your users do not need to pay the price of broken links once you publish to the Web and you will not need to wait for those tools or Search Engines to crawl your site to finally discover you broke things.
For developers this means that they can now easily look at the potential impact of removing or renaming a file, easily check which files are referring to this page and which files he can remove because of only being referenced by this page.
One thing that is important to clarify is that you can target and analyze your production sites if you want to, and you can target Web applications running in any platform, whether its ASP.NET, PHP, or plain HTML text files running in your local IIS or on any other remote server.
Bottom line, try it against your Web site, look at the different features and give us feedback for additional reports, options, violations, content to parse, etc, post any comments or questions at the IIS Search Engine Optimization Forum.
The IIS SEO Toolkit documentation can be found at http://learn.iis.net/page.aspx/639/using-iis-search-engine-optimization-toolkit/, but remember this is only Beta 1 so we will be adding more features and content.
In the URL Rewrite forum somebody posted the question "are redirects bad for search engine optimization?". The answer is: not necessarily, Redirects are an important tool for Web sites and if used in the right context they actually are a required tool. But first a bit of background.
A redirect in simple terms is a way for the server to indicate to a client (typically a browser) that a resource has moved and they do this by the use of an HTTP status code and a HTTP location header. There are different types of redirects but the most common ones used are:
Below is an example on the response sent from the server when requesting http://www.microsoft.com/SQL/
One of the most important factors in SEO is the concept called organic linking, in simple words it means that your page gets extra points for every link that external Web sites have linking to your page. So now imagine the Search Engine Bot is crawling an external Web site and finds a link pointing to your page (example.com/some-page) and when it tries to visit your page it runs into a redirect to another location (say example.com/somepage). Now the Search Engine has to decide if it should add the original "some-page" into its index as well as if it should "add the extra points" to the new location or to the original location, or if it should just ignore it entirely. Well the answer is not that simple, but a simplification of it could be:
IIS Search Optimization Toolkit has a couple of rules that look for different patterns related to Redirects. The Beta version includes the following:
So how does it look like? In the image below I ran Site Analysis against a Web site and it found a few of these violations (2 and 3).
Notice that when you double click the violations it will tell you the details as well as give you direct access to the related URL's so that you can look at the content and all the relevant information about them to make the decision. From that menu you can also look at which other pages are linking to the different pages involved as well as launch it in the browser if needed.
Similarly with all the other violations it tries to explain the reason it is being flagged as well as recommended actions to follow for each of them.
IIS Search Engine Optimization Toolkit can also help you find all the different types of redirects and the locations where they are being used in a very easy way, just select Content->Status Code Summary in the Dashboard view and you will see all the different HTTP Status codes received from your Web site. Notice in the image below how you can see the number of redirects (in this case 18 temporary redirects and 2 permanent redirects). You can also see how much content they accounted for, in this case about 2.5 kb (Note that I've seen Web sites generate a large amount of useless content in redirect traffic, speaking of spending in bandwidth). You can double click any of those rows and it will show you the details of the URL's that returned that and from there you can see who links to them, etc.
So going back to the original question: "are redirects bad for Search Engine Optimization?". Not necessarily, they are an important tool used by Web application for many reasons such as:
Just make sure you don't abuse them by having redirects to redirects, unnecessary redirects, infinite loops, and use the right semantics.
Today somebody was running the IIS SEO Toolkit and using the Site Analysis feature flagged a lot of violations about "The page contains multiple canonical formats.". The reason apparently is that he uses Query String parameters to pass contextual information or other information between pages. This of course yield the question: Does that mean in general query strings are bad news SEO wise?
Well, the answer is not necessarily.
I will start by clarifying that this violation in Site Analysis means that our algorithm detected that those two URL's look like the same content, note that we make no assumptions based on the URL (including Query String parameters). This kind of situation is bad for a couple of reasons:
Query String by themselves do not pose a terrible threat to SEO, most modern Search Engines deal OK with Query Strings, however its the organic linking and the potential abuse of Query Strings that could give you headaches.
Remember, Search Engines should make no assumptions based on the fact it is a single "page" that serves tons of content through a single Absulte Path and the use of Query Strings. This is typical in many cases such as when using index.php, where pretty much every page on the site is served by the same resource and just using variations of Query Strings or path information.
Well, there are several things you could do, but probably one of the easiest is to just tell Search Engines (more specifically crawlers or bots) to not index pages that have the different Query String variations that really are meant only for the application to pass state and not to specify different content. This can be done using the Robots Exclusion Protocol and use the wildcard matching to specify to not follow any URL's that contain a '?'. Note that you should make sure you are not blocking URL's that actually are supposed to be indexed. For this you can use the Site Analysis feature to run it again and it will flag an informational message for each URL that is not visited due to the robots exclusion file.
In summary, try to keep canonical formats yourself, don't leave any guesses to Search Engines cause some of them might get it wrong. There are new ways of specifying the canonical form in your markup but it is "very recent" (as in 2009) and some Search Engines do not support it (I believe the top three do, though) using the new rel="canonical":
In the Beta 2 version of IIS SEO Toolkit we will support this tag and have better detection of this canonical issues. So stay tuned.
Other ways to solve this is to use URL Rewrite so that you can easily redirect or rewrite your URL's to get rid of the Query Strings and use more SEO friendly URL's.
The other day somebody ask me if there was a way to limit the amount of work that Site Analysis in IIS SEO Toolkit would cause to the server. This is interesting for a couple of reasons,
In Beta 1 we do not support the Crawl-delay directive in the Robots exclusion protocol; in future versions we will look at adding support this setting. The good news is that in Beta 1 we do have a configurable setting that can help you achieve this goals called Maximum Number of Concurrent Requests that you can configure.
To set it:
The other day a friend of mine who owns a Web site asked me to look at his Web site to see if I could spot anything weird since according to his Web Hosting provider it was being flagged as malware infected by Google.
My friend (who is not technical at all) talked to his Web site designer and mentioned the problem. He downloaded the HTML pages and tried looking for anything suspicious on them, however he was not able to find anything. My friend then went back to his Hosting provider and mentioned the fact that they were not able to find anything problematic and that if it could be something with the server configuration, to which they replied in a sarcastic way that it was probably ignorance on his Web site designer.
So of course I decided the first thing I would do is to start by crawling the Web site using Site Analysis in IIS SEO Toolkit. This gave me a list of the pages and resources that his Web site would have. First thing I knew is usually malware hides either in executables or scripts on the server, so I started looking for the different content types shown in the "Content Types Summary" inside the Content reports in the dashboard page.
After running the query as shown above, I got a set of HTML files which all gave a status code 404 – NOT FOUND. Double clicking in any of them and looking at the HTML markup content made it immediately obvious they were malware infected, look at the following markup:
Notice those two ugly scripts that seem to be just a random set of numbers, quotes and letters? I do not believe I've ever met a developer that writes code like that in real web applications.
Notice how both of them end up writing the actual malware script living in martuz.cn and gumblar.cn.
Now, this clearly means they are infected with malware, and it clearly seems that the problem is not in the Web Application but the infection is in the Error Pages that are being served from the Server when an error happens. Next step to be able to guide them with more specifics I needed to determine the Web server that they were using, to do that it is as easy as just inspecting the headers in the IIS SEO Toolkit which displayed something like the ones shown below:
With a big disclaimer that I know nothing about Apache, I then guided them to their .htaccess file and the httpd.conf file for ErrorDocument and that would show them which files were infected and if it was a problem in their application or the server.
Turns out that after they went back to their Hoster with all this evidence, they finally realized that their server was infected and were able to clean up the malware. IIS SEO Toolkit helped me quickly identify this based on the fact that is able to see the Web site with the same eyes as a Search Engine would, following every link and letting me perform easy queries to find information about it. In future versions of IIS SEO Toolkit you can expect to be able to find this kind of things in a lot simpler ways, but for Beta 1 for those who cares here is the query that you can save in an XML file and use "Open Query" to see if you are infected with these malware.
One easy way to enhance the experience of users visiting your Web site by increasing the perceived performance of navigating in your site is to reduce the number of HTTP requests that are required to display a page. There are several techniques for achieving this, such as merging scripts into a single file, merging images into a big image, etc, but by far the simplest one of all is making sure that you cache as much as you can in the client. This will not only increase the rendering time but will also reduce load in your server and will reduce your bandwidth consumption.
Unfortunately the different types of caches and the different ways of set it can be quite confusing and esoteric. So my recommendation is to think about one way and use that all the time, and that way is using the HTTP 1.1 Cache-Control header.
So first of all, how do I know if my application is being well behaved and sending the right headers so browsers can cache them. You can use a network monitor or tools like Fiddler or wfetch to look at all the headers and figure out if the headers are getting sent correctly. However, you will soon realize that this process won't scale for a site with hundreds if not thousands of scripts, styles and images.
To figure out if your images are sending the right headers you can follow the next steps:
Alternatively you can just save the following query as "ImagesNotCached.xml" and use the Menu "Query->Open Query" for it. This should make it easy to open the query for different Web sites or keep testing the results when making changes:
In IIS 7 this is trivial to fix, you can just drop a web.config file in the same directory where your images and scripts and CSS styles specifying the caching behavior for them. The following web.config will send the Cache-Control header so that the browser caches the responses for up to 7 days.
Furthermore, using the same query above in the Query Builder you can Group by Directory and find the directories that really worth adding this. For that is just matter of clicking the "Group by" button and adding the URL-Directory to the Group by clauses. Not surprisingly in my case it flags the App_Themes directory where I store 8 images.
One thing to note is that that even if you do not do anything most modern browsers will use conditional requests to reduce the latency if they have a copy in their cache, as an example, imagine the browser needs to display logo.gif as part of displaying test.htm and that image is available in their cache, the browser will issue a request like this
GET /logo.gif HTTP/1.1
Accept-Encoding: gzip, deflate
If-Modified-Since: Mon, 09 Jun 2008 16:58:00 GMT
Note the use of If-Modfied-Since header which tells the server to only send the actual data if it has been changed after that time. In this case it hasn't so the server responds with a status code 304 (Not Modified)
HTTP/1.1 304 Not Modified
Last-Modified: Mon, 09 Jun 2008 16:58:00 GMT
Date: Sun, 07 Jun 2009 06:33:51 GMT
Even though this helps you can imagine that this still requires a whole roundtrip to the server which even though will have a short response, it can still have a significant impact if rendering of the page is waiting for it, as in the case of a CSS file that the browser needs to resolve to display correctly the page or an <img> tag that does not include the dimensions (width and height attributes) and so requires the actual image to determine the required space (one reason why you should always specify the dimensions in markup to increase rendering performance).