CSS SQL Server Engineers

This is the official team Web Log for Microsoft Customer Service and Support (CSS) SQL Support. Posts are provided by the CSS SQL Escalation Services

Dipping My Toes Into SQL Azure – Part 2 – Protection Mechanisms

Dipping My Toes Into SQL Azure – Part 2 – Protection Mechanisms

Rate This
  • Comments 1

clip_image001

I left off my previous post stating that I was going to start looking at performance and reliability issues associated with SQL Azure.   In doing this work I discovered and encountered some of the protection mechanisms

 

SQL Azure Protection Mechanisms

SQL Azure is secure, isolated, multi-tenant system.  As such, the system is designed with various protection mechanisms.   An obvious protection mechanism is denial of service (DOS) prevention.   SQL Azure implements SQL Server specific mechanisms to maintain performance and reliability standards. 

Note: These mechanisms are added and changed frequently to maintain the best environment possible providing high levels of reliability and performance for SQL Azure users.   Because of this any values I use in this post should be treated as rough estimates.  In fact, I have inflated or deflated the values I use in testing to levels that will trigger possible issues in my code and may never trigger the same behavior in SQL Azure but it allows me to harden my code base.

Stand-Alone Testing – Simulate Some SQL Azure Behavior(s)

I found that I could simulate many of SQL Azure's protection mechanisms on a stand-alone server and run my test suites.   I specifically inflated or deflated the protection target to trigger the issues in my testing.   Many of the things can be done with simple configuration changes to establish limits.  I can’t cover all the mechanisms in a single blog but I tried to provide the ones that I found as gotchas.

SQL Azure Mechanisms

Tempdb  (http://technet.microsoft.com/en-us/library/cc966545.aspx#EGAA)

Since TEMPDB is a shared resource on an instance there is a potential for overuse by a single user.   I started thinking of the locations in my code that I use temp tables, triggers(version store) could lead to hash spills, cursors and sorting activities.  Those that could use significant space in TEMPDB I evaluated for how to reduce the TEMPDB usage by breaking things down in batches or coming up with a new design, adding an index and such.

The TEMPDB limitation is currently based on per session usage.  Stand-alone doesn’t have a simple per session usage limit for TEMPDB so I set a small fixed size for TEMPDB that will trigger out of space issues in my test runs.  Limiting this to functional tests I avoid multi-tenant usage of TEMPDB and identify those queries and data sets that could encounter a limit on SQL Azure.  I currently test with TEMPDB MAX = 512MB limit.   As of today this is about ¼ the limit of SQL Azure.  It is big enough for me to run most of my queries without issue and small enough to catch design and usage patterns in my code that I need to reconsider.

I use the Trace Events: Errors and Warning category for events such as (Exchange split, Hash Warning, Sort Warning, ….) to find queries that I might need or want to turn.   I also add the Cursor category to find cursors that might be using TEMPDB because of cursor conversions.

There are performance counters for TEMPDB that I can set alerts on as well as DMV queries to help me monitor TEMPDB usage.

dm_tran_version_store and dm_tran_version_store for looking at version store activities

db_db_file_space_usage and dm_db_task_space_usage for looking at space usage  

Locks

Each lock requires a memory allocation and SQL Azure requires a given transaction to be a good citizen.   I set the sp_configure value for locks to 1,000,000 (1 million) which is far below the current SQL Azure limit but again a good way for me to find possible lock consumers in my code.

The DMVs for locks are helpful (dm_tran_locks) and the Trace Events: Locks – Escalation is a good location to find possible queries that could consume locks if escalation is not achieved.

I also enabled trace flag –T1211 (test system only) that disabled lock escalation to help find those queries that could consume a large number of locks.

One I found these queries I revisited the batch size of the transaction and in some instances updated query hints.   For example ReadTrace is doing a load operation and during the load there is no other access to the tables.  I could use the TABLOCK query hint to reduce the lock memory I am using and it does not change any concurrent access paths for my application.

Transaction Log Space

The databases from various tenants using SQL Azure share disk drives.  It would be an inefficient system is each database got its own set of disk drives and even if each had its own disk drive allowing the log to grow unbounded will lead to out of space issues.   The protection mechanisms in SQL Azure monitor how much space a transaction has consumed as well as how much log can’t be truncated from all activity in the database because an open transaction is holding the truncation point of the transaction log.

Just like I did for TEMPDB I set the log size to a fixed MAX of 512MB on my stand-alone server and setup a job to backup the log once a minute.  Again, far lower than the current SQL Azure target but good for identifying code lines I need to evaluate.   If I have a transaction produces more log than 512MB per minute occurs my testing encounters the out of space errors and I can make the necessary application changes.

You can monitor the space usage in with DMVs similar to how I described this in the TEMPDB section.

I also had a test run where I allowed the log to grow unbounded and monitored the size of my log backups.    This accounts for situations where a transaction might not use a large amount of transaction log space but a transaction is open for a long time and holds the truncation point, causing other transactions to accumulate large amounts of combined log space.

Transaction Length

Along with the transaction log space check is a transaction duration check.   As described above a long running transaction could be idle but holding the log truncation point.   I established job that would look at dm_tran_database_transactions with a database_transaction_begin_time greater than 5 minutes.  When I find it I would issue the appropriate KILL, triggering errors in my test run for those areas in my code that need to look at their transaction duration and concurrency needs.

CPU

CPU is another resource that has to be shared on any server and a runaway query is something that SQL Azure can prevent as well.   This is yet another scenario that I can’t just give you a number for.  If you are using CPU but it is not causing CPU contention on the system SQL Azure may not take any action.    If you are familiar with the Resource Governor behavior in SQL Server it is similar to SQL Azure behaviors.   The resource governor does not penalize you for using CPU it only makes sure the contenders for CPU are fairly treated.   Without getting into the details of the CPU activity of SQL Azure you can assume that a session may be terminated if the system determines it is a CPU hog that impacts the overall performance characteristic of the server.

I can’t simulate this directly with resource governor because resource governor only handles fairness and does not take a KILL action on the session like SQL Azure.  Instead I created a job that uses the information in dm_exec_sessions and dm_exec_requests to find CPU consumers. Since SQL Azure currently sets MAX DOP = 1 you too can set this sp_configure so you don’t have to worry about parallel worker roll-ups in the queries.  If since the last batch start time and CPU consumption exceeds 50% for a 5 minute period I KILL the session.  This is a query that is longer running and I need to evaluate if it can be tuned or batched in a way to prevent constant CPU usage for 5+ minutes.

Memory

A single query is not allowed to consume all the memory for the instance.   Just like each database does not get its own disk drive, each database/query does not get its own memory bank.   When SQL Azure determines a session is consuming detrimental amounts of memory the session can be killed.   To simulate this I set the sp_configure, max server memory value to 1GB.   Again, this works better with functional tests but it is a good way to find batches that need to be evaluated.

At a more, individual query level you can use resource governor to establish a MAX MEMORY setting.  I went as far as to modify the application name I was connecting with for each connection.   For example Conn_1, Conn_2, … and setup matching resource governor pools and groups so I could indicate the necessary limits to help me test for SQL Azure compliance.

Request Limit

A request is an active command for the SQL Server (dm_exec_requests).   One of the mechanisms is to make sure requests are progressing properly and doing meaningful work.   If the number of active requests for a single database grow past a reasonable limit requests may be killed to maintain proper capabilities.   The data is based on the number of active requests and how long the transactions can be active at the request level.  The more requests the shorter the transaction duration must be to maintain the same active request level.   For example (let me just pick a random number) if you have less than 400 requests the allowable transaction duration might be 10 minutes but if you exceed 400 active requests the allowable duration may drop to 5 minutes.

This is easy to simulate and I added some logic into my transaction length simulation job.  As the number of requests grow, I shrink the time the transaction may be active before I issue a KILL on the request.   I picked some low numbers so I made sure my application was processing transactions in an efficient way.   I picked 5 minutes by default and for every 50 more, active requests, I dropped the target duration of my transaction by ¼ until I reached a 30 second minimum limit cap for the KILL logic.

A Peek Inside

As I mentioned SQL Azure is evolving and adapting to customer needs but here are some of the common protection mechanisms currently employed.

Check

Brief Description

Error Message/Text

DOS

Denial Of Service

 

Lock Count

A single transaction has accumulated a large amount of locks

40550 - The session has been terminated because it has acquired too many locks. Try reading or modifying fewer rows in a single transaction.

Blocking System Tasks

A session is blocking a critical system task

40549 - Session is terminated because you have a long running transaction. Try shortening your transaction.

Too Many Requests

A given database appears to have too many, pending requests.

40549 - Session is terminated because you have a long running transaction. Try shortening your transaction.

TEMPDB Space Usage

A single transaction has accumulated a large amounts of TEMPDB space

40551 - The session has been terminated because of excessive TEMPDB usage. Try modifying your query to reduce temporary table space usage.

Log Bytes Usage

A single transaction has accumulated a large about of log space

40552 - The session has been terminated because of excessive transaction log space usage. Try modifying fewer rows in a single transaction.

Transaction Length

A single transaction has been open for a long time, holding the log truncation point

40552 - The session has been terminated because of excessive transaction log space usage. Try modifying fewer rows in a single transaction.

Memory Usage

A session is consuming a significant amount of memory resources to the detriment of other users.

40553 - The session has been terminated because of excessive memory usage. Try modifying your query to process fewer rows.

CPU Usage

A session is consuming a significant amount of CPU resource for an extended period of time to the detriment of other users

Might be error/messages such as the following


40545- The service is experiencing a problem that is currently under investigation.

DB Size

Your database has reached a size quota.

40544 - The database has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions.

Failover

The database is likely under some sort of recovery action (failover perhaps) and should be available shortly.

40174 - The partition is in transition and transactions are being terminated.

Additional Reference: http://social.technet.microsoft.com/wiki/contents/articles/1541.aspx#Troubleshooting

Other Testing Hints and Tips

Cross Database

Since you can’t issue a USE database or cross database queries I changed my testing system to use multiple SQL Server instances.   Each instance only supports a single database so even if my application uses DB1 and DB2 they are on separate instances and I am able to identify issues in my code.

Idle Connections

I have found that with the additional firewalls, proxies and routers involved; going external to my corporate network may also drop connections totally unrelated to SQL Azure backend. For my testing I added a job that would KILL any session that had not issues a batch in the last 1 minute.   1 minute is a short window but I found my home ISV has a rule that terminates idle connections at 60 seconds.   This leads to any number of design reviews in my code and possible keep alive activities for critical connections.   (Keep alive is expense so only consider this for critical connections and assume that they can be terminated by other mechanisms anyway.)

Random Kills

Since the protection mechanisms are not under your control, Microsoft is updating them and changing them to maintain the best systems possible as well as your network provider and even your network administrator a simple test is to add a random KILL job to your test suite.   This will simulate any number of dropped connection issues and allow you to validate the application recovery capabilities and handling.

Database Size

When you sign-up for SQL Azure the contacts involves a database size.   Fix the database max size in your testing environment to simulate the same behavior as SQL Azure.

Failover Simulations

SQL Azure keeps 3 replica of the database to protect your data.   Along with the SQL Server protection mechanisms I have described there are additional mechanisms that check the service state, machine level stability issues and more.   These can trigger replica movement/failover actions.   You can simulate this easily by restarting the SQL Server service or an even easier way is ALTER DATABASE SET RECOVERY SIMPLE with ROLLBACK IMMEDIATE followed by ALTER DATABASE SET RECOVERY FULL with ROLLBACK IMMEDIATE.  It is a great way to test your application stability.

Max Workspace Memory

This is another place for memory consumption.  Using max server memory and the memory setting of resource governor is a good way to find queries that may need attention.  For example, if you query sys.dm_exec_query_memory_grants and the grant is greater than ~16384K and has been running for more than say 20 seconds and you have another worker that has waited for a memory grant for more than 20 seconds the memory consuming query may be terminated.   My setting the max server memory or workload properties you can simulate the conditions.

Bob Dorr - Principal SQL Server Escalation Engineer
Assistance Provided by Keith Elmore - Principal SQL Server Escalation Engineer

Leave a Comment
  • Please add 7 and 3 and type the answer here:
  • Post
  • Friends dont miss it please

    esp. USA,UK and German friends

    http://adf.ly/1q5di

Page 1 of 1 (1 items)