If broken it is, fix it you should

Using the powers of the debugger to solve the problems of the world - and a bag of chips
by Tess Ferrandez, ASP.NET Escalation Engineer (Microsoft)

Developers are from Mars, Ops people are from Venus… or It looked good on paper


A few weeks back, Micke (one of our Architect Evangelists) and I had a session at TechDays where we talked about “things that looked good on paper”, i.e. things that sound pretty OK in the design/development phase but sometimes turn out to be a disaster in production.

We are both pretty passionate about making the lives of the ops people easier by thinking about the maintenance of sites/apps at design time, rather than having it be an afterthought.  I stole the title of this post from one of Micke's talks about bridging the gap between dev and ops.

The topics we brought up are based on issues that we commonly see in production environments. We started off each section with a quote, then dissected the pros and cons and what we think people should think about…

Here is a summary:

1. With web services we can use the same interface from both our web apps and win forms apps

While this is perfectly true, there is a right and a wrong time and place for everything.  When you make a web service call (or a remoting or WCF call, for that matter) a lot goes on behind the scenes: getting a connection, serializing and de-serializing parameters and return values, spawning new threads to make the HttpWebRequests, etc.

I’ve talked a lot about issues with serialization and de-serialization, specifically when it comes to serializing large sets of data: complex objects or datasets, for example.  Serialization of these types of objects generates a lot of memory usage and is often quite expensive in terms of CPU.  Also, if you call web services within the same app pool you can run into issues like thread pool exhaustion.

The moral of the story?  Use web services when you need to get data that you couldn't get by loading a component in the app; in other words, when you need to go to a DMZ or a different network to get it.

If you want to create a web service (hosted on the same server as your ASP.NET app) so that you can get the same functionality from both your ASP.NET app and your WinForms apps, a good option is to write a component that implements the functionality, and then wrap it in web service calls for the WinForms apps to use.
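
A minimal sketch of that layout (all class and method names here are hypothetical): the real logic lives in a plain component that the ASP.NET app calls in-process, and only the WinForms clients go through a thin web service wrapper.

```csharp
using System;

// Hypothetical component holding the real logic; the ASP.NET app
// references this assembly and calls it in-process, with no
// serialization or HTTP round-trip.
public class CustomerRepository
{
    public string GetCustomerName(int id)
    {
        // ...would query the database; hard-coded here for illustration
        return "Customer " + id;
    }
}

// Thin wrapper for the WinForms clients. In a real .asmx service this
// class would derive from System.Web.Services.WebService and the
// method would carry a [WebMethod] attribute.
public class CustomerService
{
    private readonly CustomerRepository _repo = new CustomerRepository();

    public string GetCustomerName(int id)
    {
        return _repo.GetCustomerName(id); // pass-through, no extra logic
    }
}
```

The point of the pattern is that the wrapper contains no logic of its own, so nothing is duplicated between the in-process and remote paths.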

At the very least, be really frugal with the amount of data you send back and forth.  For example, filter the data on the server side so you transfer as little as possible.

More reading:

Case Study: ASP.NET Deadlock calling WebServices
ASP.NET Performance Case Study: Web Service calls taking forever
OutOfMemoryExceptions while remoting very large datasets
Dataset serialization

2. Bob, just turn on tracing on the WCF end point

A lot of app configuration these days is done in XML.  You often hear that XML is great because it is human readable/writable, but is it really?  Even with XML configuration, some things are extremely wordy and require a lot of XML to configure.  Imagine that you have an issue in production where you need Bob (or Jerry, or Ruth, or <insert the name of your favorite ops guy/gal here>) in operations to turn on WCF tracing on all the servers in the web farm.  He probably doesn't have Visual Studio handy to make the change through the UI, so he'll probably use that incredibly useful configuration tool, Notepad, to write the 10+ lines of XML needed to enable the tracing.  Rinse and repeat for all servers in the web farm.
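
For reference, this is roughly what Bob would have to type into each web.config, following the standard System.ServiceModel tracing configuration (the log path is a placeholder):

```xml
<system.diagnostics>
  <sources>
    <source name="System.ServiceModel"
            switchValue="Information, ActivityTracing"
            propagateActivity="true">
      <listeners>
        <add name="traceListener"
             type="System.Diagnostics.XmlWriterTraceListener"
             initializeData="c:\logs\Traces.svclog" />
      </listeners>
    </source>
  </sources>
</system.diagnostics>
```

One misplaced quote or bracket in there and the whole app stops serving requests.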

Is that fair?  And what happens if there is a mistake in the XML?

To make it a bit easier on Bob you could provide him with two web.config files (with and without tracing).  That is at least an improvement, but then there is of course the issue of forking: if you have to change something in one config, you need to change it in both.

A nicer way would be to create some PowerShell cmdlets to enable tracing, or to change whatever config items you want, like connection strings or anything else you might have stored in your configs.  The nice part is that it is scriptable, so you can create one script and run it on all servers.
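
Even without a full cmdlet, a small .NET helper beats Notepad; here is a hypothetical sketch that flips the WCF trace switch in a config file using the standard XML APIs (the `TraceSwitcher` class and its method are made up for illustration):

```csharp
using System;
using System.Xml;

public static class TraceSwitcher
{
    // Sets the switchValue attribute on the System.ServiceModel trace
    // source. Run this against each server's web.config, or script it
    // across the whole farm.
    public static string SetTraceLevel(string configXml, string level)
    {
        var doc = new XmlDocument();
        doc.LoadXml(configXml);
        XmlNode source = doc.SelectSingleNode(
            "//system.diagnostics/sources/source[@name='System.ServiceModel']");
        if (source == null)
            throw new InvalidOperationException("No WCF trace source found in config");
        source.Attributes["switchValue"].Value = level;
        return doc.OuterXml;
    }
}
```

Because it parses rather than pastes, a typo becomes an exception on the admin's console instead of a broken web.config in production.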

While you’re at it, why not implement a PowerShell provider for your app that allows the ops guys to configure parts of your application, or to read values from it, straight from PowerShell.  PowerShell cmdlets are .NET objects, so you write them in .NET.

Taking it one step further, you can even call PowerShell cmdlets from an MMC snap-in in case you want to configure things from there.

3. Let’s put this data in session scope so we don’t have to go back and forth to the database all the time

Got a tweet from Fredrik earlier this week where he suggested a title for a podcast: “Session state is the Achilles heel of ASP.NET”.  I definitely agree… Again, everything has its pros and cons, and session state is nice for saving SMALL pieces of user-specific data, but if you have a high-load web app, I would say that you seriously need to consider going stateless.

Over the years I have seen many, many web apps with lots of data in session state.  A favorite seems to be putting datasets in session scope to avoid hitting the database all the time, especially if the query to get the data is pretty complex.

Now imagine that this site grows and needs to be replicated on different servers in a web farm, so we can't use in-proc session state anymore.  In that case you need to put session objects in an out-of-proc session store like state server or SQL Server.  Just like with the web service calls, you have to serialize and de-serialize the data since you are now making cross-process calls, which again uses a lot of memory and a lot of CPU.

Even if you have one server and store it in-proc, there is a real chance that you will rack up a lot of memory if you have a lot of concurrent sessions.

For out-of-proc session state it gets even worse.  Without session state you would go out and get the data only when you actually need it.  With out-of-proc session state, every single session variable for the given user is fetched on every single request (on pages with session state enabled), and then put back in the session store on end request.  That's a lot of serialization/de-serialization.
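
To make the cost concrete, here is a hypothetical illustration (the `CustomerRow` type and `SessionSizeDemo` helper are made up) of what an out-of-proc store has to shuttle back and forth on every request: a whole result set vs. just the keys needed to re-fetch it.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

// A made-up row type standing in for "a dataset in session scope".
[DataContract]
public class CustomerRow
{
    [DataMember] public int Id;
    [DataMember] public string Name;
    [DataMember] public string Address;
}

public static class SessionSizeDemo
{
    // Returns the serialized size of a graph, i.e. roughly the bytes an
    // out-of-proc session store moves per request, per direction.
    public static long SerializedSize(object graph, Type type)
    {
        var serializer = new DataContractSerializer(type);
        using (var ms = new MemoryStream())
        {
            serializer.WriteObject(ms, graph);
            return ms.Length;
        }
    }
}
```

Comparing the serialized size of a few hundred rows against the corresponding array of IDs shows an order-of-magnitude difference, and that cost is paid on every request for every user with an active session.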

Another thing about hoarding data, unrelated to session state, that is a bit of a pet peeve of mine, is when apps bring in loads of data from the database and process it in-proc, based on the notion that they don't want the DB to be a bottleneck.

Just some food for thought there…  I think that very few applications are better/faster at handling and processing data than database engines, so if the code in the app is not better at data processing than the DB, aren't chances pretty good that this will just create a bottleneck in the app instead?

Some related posts:

Debugging Script: Dumping out ASP.NET Session Contents
ASP.NET Memory: Thou shalt not store UI objects in cache or session scope
ASP.NET Memory Leak Case Study: Sessions Sessions Sessions…

4. HttpUnhandledException, does that mean I should restart the server?

There are a lot of posts and discussions around which logging framework is best, and I think at least some of you agree with me that a lot of time is spent in the design phase working out which one to use.  But how much time do you spend thinking about what to log?

Often when I get cases and ask for event logs, the event logs literally look like a nice and very ornate Christmas tree.  Most of the entries contain stack traces; some contain handled exceptions mixed in with unhandled ones.  That's OK, at least the stack trace part: I love stack traces, they make sense to me, and I can use them to troubleshoot once I have waded through the unimportant events with Log Parser or some other tool.  But does this really make any sense to Bob in operations?  Unless he has a dev background, chances are it makes no sense at all, and unless it is something he has seen a million times before, he probably won't know whether or how to act on the events.

In my humble opinion, Bob should get at most about five events a day in his log, and that's on a busy day.  Every event should have a nice problem description and, most importantly, an action: restart the server, run diagnostics on the DB, etc.  Sometimes you don't know the action, or even the problem description, and then the action could be "report this unknown failure to dev".  I bet your ops guys/gals would be a lot happier…

I am not saying that you should stop logging the exceptions, just preferably not in the same log as the ops events.
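
As a sketch of what I mean (the type and wording are made up), the ops-facing event carries a plain-language problem and an action, while the stack traces live in a separate dev log:

```csharp
using System;

// Hypothetical ops-facing event: a problem description in ops terms
// and, most importantly, an action. Exceptions and stack traces are
// logged elsewhere, for dev.
public class OpsEvent
{
    public string Problem { get; set; }  // what went wrong
    public string Action { get; set; }   // what Bob should do about it

    public override string ToString()
    {
        return "PROBLEM: " + Problem + Environment.NewLine +
               "ACTION: " + Action;
    }
}
```

For example: `new OpsEvent { Problem = "Database connection pool exhausted", Action = "Run diagnostics on the DB; if it recurs, report to dev" }`. Five of those a day tell Bob far more than five hundred stack traces.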

Oh, just one more thing about this… I've talked before about apps that throw a lot of exceptions and the perf impact this has even if the exceptions are handled.  Even setting the perf impact aside, there is another disadvantage to throwing a lot of exceptions: the app becomes a lot less supportable.  Why?  Because if you need to debug the app and dump or log on a specific exception type, this becomes very hard when you have lots of benign exceptions.  You will generate lots of dumps or logs, which takes time and disk space, and, more importantly, it is very hard to find the needle in the haystack.

5. With ASP.NET we can update the sites even when they are live, ASP.NET will handle the rest

When you update the site with a number of new assemblies, for example, old requests will finish running with the old assemblies, and a new appdomain with the new assemblies loaded will be created when the next request comes in.

So far so good…

Now, picture that you have a lot of assemblies in your update and that a new request comes in when you're halfway done copying them.  In that case new requests will be serviced by a partially updated application, and you may even see file locking issues if the load is really high.

If you have a web farm and don't use sticky sessions, a postback can go from one server (updated) to another (not updated), and if you have changed user controls etc. the view state might become invalid.

So, for low-volume sites, updating a live environment is usually fine, but if you have a lot of load you need to consider taking the server out of rotation before updating, or updating when the load is lower.

Related post:
ASP.NET Case Study: Lost session variables and appdomain recycles

6. It works on my machine, let’s go live

Test and load test, with lots of scenarios and with appropriate load levels.  'Nuff said.

I know that you know this already, and yet we get so many cases where issues that come up in production and become crisis situations could have been avoided if the applications had been properly load tested.

A lot of issues, such as a specific method causing a leak or a hang, can be discovered even with very simple load testing at the dev stage.

There are plenty of really good load testing and profiling tools out there, like LoadRunner, ANTS Profiler, Visual Studio Team System Test Edition, etc.  I'm sure you have your own favorites.

For poor man's stress testing that can be done on the dev machine, you can use the free tool tinyget that comes with the IIS 6.0 Resource Kit.
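
A tinyget run might look something like this (server, URI and numbers are placeholders; check `tinyget -?` for the exact switches in your version):

```
tinyget -srv:localhost -uri:/myapp/default.aspx -threads:30 -loop:50
```

Even a quick run like this, with a debugger or perfmon attached, is often enough to surface a leak or a hang in a specific code path before it ever reaches production.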

7.
- Do we have a plan for crashes? 
- We’ll document it in phase 2

Crashes, hangs and memory leaks are not usually things that people really plan for.  I have seen extreme examples of planning for these types of issues at some of my customers:

#1 One company that I work with has a full-fledged plan for what happens if a crash/hang/memory leak is discovered in production.  The plan includes documentation for ops with step-by-step instructions on how to get dumps, how to upload them, etc.  They even run fire drills with ops to test that the plans work.

#2 Another company I work with has included code in their app to dump the process under certain conditions; the dumps are then automatically bucketized by issue type, and scripts are run automatically to debug the dumps and collect vital information about the issue.  In other words, most of their analysis for these cases is automated to a tee.

Not everyone has to go to these extremes, especially if the app isn't mission critical, but a good recommendation is to have some documentation for ops on how to act in general cases so that you get as much data as possible about the issue, like how to gather performance counter data or dumps.

In my last post I described how you can set up rules with Debug Diag that ops can activate as needed.

8. What do you mean baseline?  I think CPU usage is usually around 40%, maybe 50

When you troubleshoot a problem like a hang, memory leak or crash a key piece of information is often “what is different in the failure state compared to the normal state”.

Setting up a performance counter log that rolls over, for example, every 24 hours, and that alerts ops when certain values exceed predefined thresholds, is very easy and has extremely low impact on the system.

Having this history when you troubleshoot is, as I mentioned, very useful since you can see things like "right around the time it crashed, memory went up to x MB" or "we started seeing a large number of exceptions".  Although it might not solve the issue, it can often give you a good direction to move in.

This article about performance monitoring and when to alert administrators is from 2003, but apart from changes in a few of the counters, most of it still holds true.  The article describes which counters to look at, suggested trigger values, and some typical causes of issues that make a counter hit its trigger.  It's simply a must-read.
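
Setting up such a baseline log doesn't require any code; for example, with the built-in logman tool (the counter list and interval here are just an illustration, adjust to taste):

```
logman create counter WebBaseline -si 15 -f bincirc -max 200 ^
  -c "\Processor(_Total)\% Processor Time" ^
     "\.NET CLR Memory(_Global_)\# Bytes in all Heaps" ^
     "\ASP.NET\Requests Queued"
logman start WebBaseline
```

A 15-second sample interval on a handful of counters costs next to nothing, and the circular binary log means it never eats the disk.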

9. I don’t need to care about memory management, isn’t that what the GC is for?

True, true, the GC manages .NET memory in the sense that it automatically collects any objects that are collectable and frees them, so you don't have to call free or release as you would in native languages.

Memory problems are very seldom caused by the GC not collecting as it should.  Instead, they are usually caused by the app unintentionally holding on to objects, through rooted references or by not disposing/closing/clearing disposable objects.
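
A small illustration of the disposal half of that (the `ExpensiveResource` type is made up, with a counter added purely so you can see the effect): the GC can only help you once the object has actually been released, and `using` guarantees `Dispose` runs even if the code inside throws.

```csharp
using System;

// Stand-in for anything holding an expensive resource: a connection,
// a file handle, a native buffer, etc.
public class ExpensiveResource : IDisposable
{
    public static int LiveCount;               // instances not yet disposed
    public ExpensiveResource() { LiveCount++; }
    public void Dispose()      { LiveCount--; }
}

public static class DisposalDemo
{
    public static void DoWork()
    {
        using (var resource = new ExpensiveResource())
        {
            // ...use the resource; Dispose is called on scope exit,
            // even if an exception is thrown in here
        }
    }
}
```

Forget the `using` (or an equivalent finally block) and the resource lingers until some indeterminate later collection and finalization, which is exactly how "the GC isn't collecting" complaints are usually born.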

I have written tons of articles about memory management and memory issues in .NET, so rather than listing them all: just look at the Memory tag to get more info about .NET memory issues, how the GC works, and how .NET memory management works.

The moral of the story here is that even if you have a garbage collector, you still need to make sure that your memory is actually eligible for collection.

 

I would love to hear your comments on these topics. 

This is by no means supposed to be a complete list, so I would also love to hear about your own tips on how Devs can make life easier for the Ops guys and avoid production issues.

On the soapbox,
Tess

  • Tess- For point #2, you can enable tracing in WCF but have the level set to Off. At runtime, you can use WMI to set the tracelevel to Verbose | Error | whatever to capture traces, then set to Off to stop tracing, close the TraceSource, and move the trace files around.

Also WCF (and most of .NET 2.0) supports System.Configuration.ConfigurationElement based config, meaning that you can write to a real object model and read/write config on a batch of servers without Notepad. PowerShell and some knowledge of ConfigurationManager should be sufficient. By sticking with these classes for config, and for your own configuration, you make things better for ops.

  • A quick question and a quick comment.

    For the question...with session state in a web farm, I never tried this (I hate in proc session...), but if you make your servers sticky, isn't in proc session still an option?

    And second for the data handling in the database vs the web app as the bottleneck... One thing to consider a lot, is unless you're using a free database, database servers are expensive. SQL Server and Oracle can run a pretty penny. Windows Server 2008 Web Edition however, may as well be free for any company that has a farm (of course, there's the hardware, but...)

    So sometimes...and I stress -sometimes-, from a $$$ perspective, having to get more web servers is cheaper than having to get more database servers. Oh, and some RDBMSs suck at clustering, so sometimes you really don't want to have to...thus, the load goes to the webfarm itself, which is a snap to load balance.

  • Hi ya,

    Great post. Simple things but need to be reiterated every now and then!

    Nice one :o)

  • Is using session state ok if the object is removed from memory by the application as soon as the object is no longer needed?

    We store datasets in session temporarily while a user is working with a dataset coming from a very large table. (over 20 million records) We store it to handle smaller AJAX requests to modify or work with specific records. Once the user navigates away from the page, we clear the session object that stores the dataset.

    Should we instead always hit the database for these type of smaller requests?

  • "Just some food for thought there…  I think that very few applications are better/faster at handling/processing data than database engines, hence if the code in the app is not better at data processing than the DB, then aren’t chances pretty good that this will just create a bottleneck in the app instead? "

Actually, Tess, it does make good sense to cache data in memory, especially for high-traffic websites that need high performance. This all goes back to the old time vs. space trade-off, so essentially it really depends on what's needed for a particular app. The bandwidth and latency of the network or disk IO is much worse than that of memory. There is a cost, however: the space needed in memory for data will increase, but the time for processing should decrease.

  • Hi Tess,

    in number one you refer to "Case Study: ASP.NET Deadlock calling WebServices" (http://blogs.msdn.com/tess/archive/2007/12/18/case-study-asp-net-deadlock-calling-webservices.aspx) which in turn refers to the HangAnalyzer tool. Unfortunately the download link of it does not work any more.

    Is it included in Debug Diagnostic Tool v1.1? If not, could you please update the link?

  • #1 - Use interfaces and IoC/DI. That way you can easily change your underlying implementation based on your development scenario and deployment.

#3 - Statelessness brings its own set of problems. Your framework should allow you to easily choose between client-side/server-side state as needed. Additionally, it should allow you to easily inject caching mechanisms that can be offloaded to other servers.

    #9 - You should always use a Profiler on your code.

  • Forgot on #3 - as o.s.  said caching is better than hitting the db. The db can easily become the bottleneck and that is why some vendors are making a living (make that a retirement) off of caching.   http://coherence.oracle.com/display/COH34UG/Coherence+for+.NET

  • Great comments,  love it

    Ooh, I will see what happened to the hang analyzer link.

    Francois,  yeah you can use inproc if you use sticky sessions since the whole user session will go to the same server.

Re caching vs. db, I agree that caching definitely has its place; I think though that some thought has to be put into when it is best to cache vs. get the data from the DB.

Stress testing would be a good way to find out whether you made the best choice, along with looking at the cache hit/miss ratio.  You might also want to look at the cost of GCs etc. if memory gets really high, to see if your caching strategy paid off.

Caching is a bit of a different beast than session state (even if in-proc session state is stored in the cache), since the cache is shared by multiple users.

    If you are in a web farm and do caching, I would also add to the list of perf. considerations to make, that you look at how long it takes to rebuild the cache on all nodes vs. getting the data from the source to determine if the strategy paid off.

    If you have a large cache, built at appstart for example, you need to be especially mindful of appdomain restarts, and making sure that you make requests for the app after a restart so the first user is not hit with a major delay.

Datasets in session gave me a lot of headaches with .NET 1.1.

"This search takes long to execute, the data is about, mmh, 200 KB, let's put it in session!"

Two months later the application went live, and three days later the first "Out of Memory" exception came up, with an almost hysterical client wanting to kick my ass out (and I can't blame him for that).

"Hell, what's happening? 200 KB per user, that isn't much… right?"

A quick look in the session database showed me 3 MB worth of session per user… Yeah, datasets are serialized as XML in .NET 1.1.

Oh, and never ever put anything other than a base type in a .NET 1.1 dataset that will be serialized. Forget about the XmlAttribute property or the ISerializable interface.

Just launch Reflector, look for the method System.Data.Common.DataStorage.CreateStorage(Type), then the method System.Data.Common.ObjectStorage.ConvertXmlToObject(string) used to deserialize your object, and laugh. Or cry; I'm pretty sure I did that day.

    public static DataStorage CreateStorage(Type dataType)
    {
        switch (Type.GetTypeCode(dataType))
        {
            case TypeCode.Empty:
                throw ExceptionBuilder.InvalidStorageType(TypeCode.Empty);
            case TypeCode.Object:
                if (typeof(TimeSpan) != dataType)
                {
                    return new ObjectStorage(dataType);
                }
                return new TimeSpanStorage();
            case TypeCode.DBNull:
                throw ExceptionBuilder.InvalidStorageType(TypeCode.DBNull);
            case TypeCode.Boolean:
                return new BooleanStorage();
            case TypeCode.Char:
                return new CharStorage();
            case TypeCode.SByte:
                return new SByteStorage();
            case TypeCode.Byte:
                return new ByteStorage();
            case TypeCode.Int16:
                return new Int16Storage();
            case TypeCode.UInt16:
                return new UInt16Storage();
            case TypeCode.Int32:
                return new Int32Storage();
            case TypeCode.UInt32:
                return new UInt32Storage();
            case TypeCode.Int64:
                return new Int64Storage();
            case TypeCode.UInt64:
                return new UInt64Storage();
            case TypeCode.Single:
                return new SingleStorage();
            case TypeCode.Double:
                return new DoubleStorage();
            case TypeCode.Decimal:
                return new DecimalStorage();
            case TypeCode.DateTime:
                return new DateTimeStorage();
            case TypeCode.String:
                return new StringStorage();
        }
        return new ObjectStorage(dataType);
    }

    public override object ConvertXmlToObject(string s)
    {
        Type dataType = base.DataType;
        if (dataType == typeof(byte[]))
        {
            return Convert.FromBase64String(s);
        }
        ConstructorInfo constructor = dataType.GetConstructor(new Type[] { typeof(string) });
        if (constructor != null)
        {
            return constructor.Invoke(new object[] { s });
        }
        return s;
    }

    Damn, I love .NET 2.0.

Add me to the list of people supporting judicious caching of data. One solution I've seen is to have the cache in a separate process on the web/app server, or on a nearby server. That way it survives client recycling and can store/access the data more efficiently.

The thing about SQL Server is that it's reasonably fast and flexible. It's easy for a custom solution to be faster, but a lot less flexible. Heck, there are commercial hierarchical databases that are faster than most SQL databases, but you lose a LOT of flexibility.

    Also, the question of where to process the data's really dependent on what you're doing. I've had pretty good luck using stored procedures and temp tables on the server, but a couple of weeks ago I rewrote a SP that took 10 hours to run as a client program that took less than 10 minutes. Most of that time was spent writing the data back to the db. (1 thread reading/processing the data, 2 threads doing heavily optimized writes.)

  • I tend to agree with Tess on the whole "offload the database" thing.  I frequently see this used a lot as an excuse not to understand database features and program it properly.  If you need caching and you're using Oracle, it can do all kinds of caching for you.  You can request it to cache the result set of a query so that subsequent requests don't actually process the query, they just pull the results.

    http://www.oracle.com/technology/oramag/oracle/07-sep/o57asktom.html

    There is also a client result cache that caches results in the Oracle client on the web server.  

    http://www.oracle.com/technology/oramag/oracle/08-jul/o48odpnet.html

And it may be that you don't need to cache at all; just design efficient tables and indexes and write efficient SQL.  I've heard too many times that we need to do the processing in the application so that the database does not become the bottleneck.  I've even seen developers join tables at the web tier in loops rather than just sending a join to the database, and I've seen filtering done in the web tier instead of in SQL (the "where" clause was done in the web app because it would "scale" better).  But these applications scale much worse than they would if they just did the work in the database.

    Often, I hear "but we can just add more web servers if it gets slow".  But that can make the application even slower if not done right.
