America Online ("AOL") uses Global Server Load Balancing ("GSLB") to maintain their SIP gateway farm. Their configuration uses sub VIPs, one for each of their two datacenters. Whenever AOL performs maintenance (new code, hardware, etc.), they use GSLB to take one datacenter offline while they update it and point all of the SIP traffic to the secondary datacenter.
As with all GSLB systems, DNS caching on the client side can be an issue. For example, this past week (July 16th), AOL replaced some hardware at one of their sites. When doing so, they directed all traffic to their other site around 12am EDT. It takes a maximum of 30 seconds for these changes to be reflected in DNS.
Several of our Office Communications Server customers reported issues connecting via PIC to AOL (through their SIP gateways) as late as 6:30am that morning, and said the issue resolved itself in about an hour's time. This sounds very much like DNS caching at work. That is, the users began using the system at 6:30am and continued to have issues until their cached DNS entries were refreshed.
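To illustrate how a cache can keep handing out the old site's address long after the 30-second window, here is a minimal Python sketch of a caching resolver. Everything in it is hypothetical: the name sip.example.test, the two documentation-range IP addresses, and the min_ttl "floor" setting, which simply stands in for any cache that holds records longer than the published TTL. It is not a claim about where the caching actually happened in this incident.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    address: str
    expires_at: float  # simulated clock time, in seconds


class CachingResolver:
    """Toy resolver cache. min_ttl models a cache that keeps records
    longer than the TTL the authoritative side published (hypothetical)."""

    def __init__(self, authoritative, min_ttl=0):
        self.authoritative = authoritative  # callable: name -> (address, ttl)
        self.min_ttl = min_ttl
        self.cache = {}

    def resolve(self, name, now):
        record = self.cache.get(name)
        if record is not None and now < record.expires_at:
            return record.address  # still considered fresh by this cache
        address, ttl = self.authoritative(name)
        lifetime = max(ttl, self.min_ttl)
        self.cache[name] = CachedRecord(address, now + lifetime)
        return address


# Hypothetical zone data: one name, published with a 30-second TTL.
NAME = "sip.example.test"
zone = {NAME: ("192.0.2.10", 30)}          # site A (placeholder address)

honest = CachingResolver(zone.__getitem__)              # honors the 30 s TTL
sticky = CachingResolver(zone.__getitem__, min_ttl=3600)  # holds for an hour

honest.resolve(NAME, now=0)
sticky.resolve(NAME, now=0)

zone[NAME] = ("198.51.100.10", 30)         # GSLB moves traffic to site B

print("honest cache at t=30s:", honest.resolve(NAME, now=30))
print("sticky cache at t=30s:", sticky.resolve(NAME, now=30))
print("sticky cache at t=1h: ", sticky.resolve(NAME, now=3600))
```

The cache that honors the TTL picks up the new site on its first lookup after 30 seconds, while the other keeps returning the offline site until its own, longer lifetime runs out.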
While it is unclear where the caching took place (local host, ISP DNS, local network, or within the OCS topology itself), a good rule of thumb is that whenever an OCS customer is unable to reach AOL's SIP gateways via PIC, the first troubleshooting step should be to initiate a DNS flush. It is also best to use the sip.oscar.aol.com FQDN instead of a specific IP address when connecting to AOL.
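As a rough illustration of why the FQDN is preferable to a pinned IP address, the Python sketch below resolves a hostname at check time and reports whether a previously recorded address is still among the answers. The function names and the port number (5061) are assumptions made for the example, not part of any OCS or AOL tooling. Note also that socket.getaddrinfo goes through the operating system's resolver, so it shows what this host currently sees, which may itself be a cached answer.

```python
import socket


def resolve_addresses(hostname, port, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses the host currently
    resolves `hostname` to. `resolver` is injectable for offline testing."""
    infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def pinned_address_is_stale(pinned_ip, hostname, port,
                            resolver=socket.getaddrinfo):
    """True if a hard-coded or previously cached IP is no longer one of
    the addresses the FQDN resolves to."""
    return pinned_ip not in resolve_addresses(hostname, port, resolver)


# Example use (performs a real DNS lookup, so it is left commented out;
# the pinned address is a placeholder):
#
#   if pinned_address_is_stale("192.0.2.10", "sip.oscar.aol.com", 5061):
#       print("Pinned address no longer matches DNS; use the FQDN instead.")
```

If a check like this still reports the old address after a known switchover, that points at a cache somewhere between the host and the authoritative servers, which is where the DNS flush comes in.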
It sounds like they have a good setup, global load balancing across multiple VIPs, but why are they the only PIC provider I ever have issues with? Yes, flushing DNS on my Access proxies usually fixes it, but it is usually reported to me by end users :( and not proactively fixed.
I'm not sure how much insight you have into their environments, but do you know if Yahoo/MSN are doing something better/different?
Transparency certainly is a double-edged sword. Over the years, our relationship w/AOL (specific to OCS and PIC) has deepened, and as a result, they are now including us on their proactive alerts prior to any maintenance windows they have planned. I do my best to re-post them here on my blog, as well as on the Unified Communications Team’s Twitter account: http://twitter.com/ucteam.
I can’t speak to why you encounter more PIC issues w/AOL when compared to MSN or Yahoo, nor can I comment on what they do differently when compared to Windows Live/MSN or Yahoo. However, from a high-level perspective, the three entities are architecturally vastly different.
Bottom line: we love all of our children (in this example, PIC partners) equally, and we try our best to give our OCS customers proactive outage notice ahead of time regardless of the PIC partner.
Is MSN performing maintenance as well? Our MSN PIC contacts went down yesterday and still do not work.
AOL and Yahoo are fine.
Given the abnormally high temperatures here in Redmond, one of our datacenters went offline yesterday. We are currently investigating whether this correlates w/the Windows Live/MSN Messenger Servers going offline ... will follow up later today with a post-mortem.