During some of my recent partner engagements, I had the opportunity to test deeply the Azure network performances and latency impact on SQL Server in Azure Virtual Machines (IaaS VM), now in this blog I’m going to unveil some interesting findings, testing methodology and comparison with other Cloud providers. Finally, I will show you an interesting finding on a pretty unknown and un-used SQL Server parameter that you can play with to reduce network latency effect. Regarding the VM configuration, for Azure I used two “LARGE” VMs inside the same Virtual Network (VNET) deployed in “West Europe” Azure datacenter (Amsterdam). I executed all tests on Monday afternoons (November 4, 11, 18 and 25), starting from 2:00pm until 6:00pm, local CET time, I used this day and hours to have good measures in normal workload conditions on Azure infrastructure.
Testing in the Cloud
I spent the last two years mainly testing performances in the Cloud, on AZURE and some others Cloud providers, and I have to say that is challenging due to the intrinsic “volatile” nature of this kind of platform, then I want to provide you my five “Don’t assume….” rules in approaching this kind of activity:
Finally, an important recommendation if you want to run your network performance tests in Azure. In order to avoid blacklisting of your deployment, be sure to open a Support Case to the Azure Support otherwise you may be throttled by the infrastructure DDOS protection mechanisms. During my tests, I encountered problems crossing 10K packets per second on a specific connection, and when the aggregate number of packets to a VIP reached 100K packets per second.
Azure Network Configuration and Optimization
The first trivial recommendation in Azure network configuration, is to use the same “Affinity Group” (AG) used for Azure Virtual Network also for your VM and storage placement.
You may be wondered why I’m using Azure Virtual Network (VNET) since using AG should be enough to locate as close as possible all the VMs that need to communicate each other with the minimum latency, the reasons are simple if you consider the facts below:
What is really surprising is that testing latency across Azure Load Balancer (LB) using VIP (Virtual IP – external - Azure LB) originated almost the same results compared to internal network communications using DIP (Direct IP – Internal - No Azure LB). After internal investigations, I finally discovered that Azure recently introduced a very nice network optimization: once the connection is established between two VMs, Azure will recognize that the communication is between two internal resources and will allow direct communication as in the case of network connection using DIP. This is really a huge gain since you application will be able to connect to a SQL Server VM (and all other type of server side VMs) efficiently even if in a different Cloud Service!
SQL Server Network Latency Analysis
First of all, how to test and measure effectively network latency in a typical client-server application scenario using SQL Server as the backend database? As you probably know, there is no performance counter or trace log that will track network packet latency, but SQL Server is unique in providing a great but simple mechanism, inside the engine itself, it’s called “Wait Types” and you can query details using a specific SQL Server Dynamic Management View (DMV) called “sys.dm_os_wait_stats”. If you are not a DBA nor familiar with SQL Server world, you can read about this DMV and “Wait Types” at the link below, I also provided a sample query in this blog post attachment, but you can also find plenty of similar example over the Internet:
Dynamic Management Views and Functions
Now, just to land the boots on the ground for everyone, using the DMV mentioned above, you can look for a very specific wait type logged by the SQL Server engine, called “ASYNC_NETWORK_IO”, looking in the SQL Server BooksOnline, this is the official explanation:
Occurs on network writes when the task is blocked behind the network.
Verify that the client is processing data from the server.
Great! You may argue we have the perfect tool to measure network latency, unfortunately this is not 100% accurate for two reasons:
Anyway, also with these limitations, “ASYNC_NETWORK_IO” will provide a good indicator and the best (at my knowledge) tool to achieve this goal, then go on and use it! How? This is a simple step list:
DBCC SQLPERF (N'sys.dm_os_wait_stats', CLEAR);
What I described above is a SQL Server specific method to measure, with good approximation, the network latency, but in the next section I will show you an agnostic way to do that in general, without considering specific application scenarios and architectures.
General Network Latency Test
If you don’t have SQL Server, or you need to run some more general network tests, for example to evaluate the goodness of a Cloud platform, there is a great tool from great Microsoft Technical Fellow Mark Russinovich called “PsPing” from SYSINTERNALS tool suite:
This tool is great for several reasons, but most importantly it can tests both network latency and bandwidth, for the context of this blog post, let’s focus on the latency. Here is how to run your first network latency test:
psping -s 10.0.0.5:50000
psping -4 -l 1024 -n 1000000 -h 10 10.0.0.5:50000
Nice tool, isn’t it? The major (and probably only one) drawback is that this tool is not application specific, then you may experience different results testing real workloads.
Guest OS Network Optimization
For my tests, I used Windows Server 2012 as the Guest Operating System, looking at the network settings; everything is expected as a normal on-premise installation regarding “TCP Task Offloading”, “Chimney” and “Jumbo Frames” support.
Netshell command to show NIC capabilities: netsh interface tcp show chimneystats <<NIC ID>>
Unfortunately, the virtual network NIC used in the Azure VM is not capable to support TCP Task Offloading and Chimney, then there is nothing you can tweak here. Regarding Jumbo Frames, disabled by default, I tried to enable and test all the possibilities listed below, but nothing changes significantly on the latency test results.
SQL Server Optimization
Even if you are familiar with SQL Server, you may haven’t ever heard about an instance wide parameter called “network packet size (B)” only accessible using SP_CONFIGURE system stored procedure with advanced options visualization turned on: this parameter, with default value of 4KB, will affect the TDS (Tabular Data Stream) protocol packet size that SQL Server will use for communications with applications. During my 15-years working experience on SQL Server, I had to change it only 2 times, and only for SAP that need a special value, additionally you may have heard many Microsoft SQL experts recommending to do not change it. Well, if you are using SQL Server in an Azure VM, probably this time you may consider tweaking this parameter since I found beneficial in Azure network infrastructure. To measure the impact of this parameter, I used “Wait Types” observation and specifically the “ASYNC_NETWORK_IO” type, here are the results I obtained testing a real application workload:
NOTE: The maximum value for “network packet size (B)” is (32767) byte on SQL Server 2012 (and many other SQL versions), not 32KB as reported for simplicity in the last row above.
Now, I need to warn you about raising the value for this parameter: don’t assume it comes with no drawbacks, then change it using higher values only if you have serious network performance problems and high “ASYNC_NETWORK_IO” wait type. The biggest risk in raising this parameter may be an increase in SQL Server multi-page allocation in its VAS (Virtual Address Space) causing memory pressure and fragmentation, then it’s recommended to test your application workload using the new configured value for “network packet size (B)”. A safe value you can try to experiment is (8000) bytes. Details on how to change “network packet size (B)” and additional information can be found at the link below:
Configure the network packet size Server Configuration Option
Finally, if you want to check the parameter value for specific connections, you can run “select * from sys.dm_exec_connections” TSQL query, if you want to check the instance wide value, you can use “EXEC sp_configure 'network packet size'” TSQL query.
Final Test Results
In the following section, you can find the results of my tests, there are some very important conclusions we can obtain, let me summarize here and invite you to read the details later:
Based on a real application workload I used as a test case, over 20 minutes network activity, using VIP for application-to-SQL communication only introduced 1-2% overhead in the test completion time;
AZURE Network Latency Results using internal IP (DIP)
AZURE Network Latency Results using external IP (VIP)
I hope this information will be useful for you, please if you have experienced interesting facts or done other types of tests, send me a mail using firstname.lastname@example.org address and I will be happy to discuss the details with you and consolidate information in this blog.