Why I love Fridays
This morning I got a call from a flustered customer. We have been migrating them from on-premises Enterprise Vault to O365, but this week their EV server had been failing intermittently throughout the day for no obvious reason. The EV services were still running but nothing was working, and rebooting the EV server only fixed things for a short while; within half an hour it was back to failing. The logs pointed to SQL connectivity problems.
When they called, the obvious first question I asked was “what has changed since it was working?”. It turns out that quite a bit had changed. They had recently relocated their SQL instance to a new data centre. The VM that Simply Migrate is installed on had also moved but the EV server had remained in place at the old data centre. I imagine that in the mind of the engineer who made the move it wasn’t a big deal. They had just changed the IP address of the SQL Server and a lot of other databases exist on that server so why should the EV Server be the only thing having issues?
Size does matter
There are already tonnes of articles on the internet about how to determine your optimal MTU size, so I won’t waste your time rehashing all of that. What I will do is spend some time explaining what I did and how we resolved the issue.
Thanks to previous experience with issues like this, such as intermittent file transfer failures and Outlook not connecting to Exchange, I quickly reviewed the more obvious causes and moved on to some more obscure ones. We looked at TCP chimney and TCP offloading settings, and then it dawned on me that since the data centre move there was a router in between the EV and SQL servers that wasn’t there before. I wondered whether they had done anything around MTU sizes.
I quickly checked by running:
ping SERVERNAME -f -l 1473
Request timed out
Request timed out
Request timed out
Request timed out
It came back with a ping timeout which was not what I was expecting!?
What I had been expecting to see was this:
ping SERVERNAME -f -l 1473
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
When I didn’t get the result I expected I thought … Boom! Ladies and gentlemen we now have a starting point for investigation.
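For anyone wondering where the 1473 comes from, here is a minimal sketch of the arithmetic. It assumes plain IPv4 with no header options (20 bytes) plus the 8-byte ICMP header, and the function names are mine, purely for illustration. On a standard 1500-byte MTU the largest ping payload that fits in one packet is 1472 bytes, so 1473 is the smallest size that should force fragmentation.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header, assuming no options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ping payload (the -l value) that fits in one packet for this MTU."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def needs_fragmentation(payload: int, mtu: int) -> bool:
    """True if a ping with this payload cannot cross a link with this MTU unfragmented."""
    return payload > max_ping_payload(mtu)


if __name__ == "__main__":
    print(max_ping_payload(1500))           # 1472
    print(needs_fragmentation(1473, 1500))  # True
    print(needs_fragmentation(1473, 9000))  # False
```

The last line is the interesting one: an interface set to 9000 has no reason to refuse a 1473-byte ping locally, which is consistent with getting timeouts rather than the "needs to be fragmented" message.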
The next steps were rapid fire and very informative. The first thing we did was check what the MTU setting was on the NICs.
C:\>netsh int ipv4 show int

Idx  Met         MTU  State      Name
---  ---  ----------  ---------  ---------------------------
  1   50  4294967295  connected  Loopback Pseudo-Interface 1
 13   10        9000  connected  Local Area Connection 1
 14   10        9000  connected  Local Area Connection 2
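If you have more than a couple of servers to check, reading that table by eye gets old fast. Below is a rough sketch, not a polished tool, that picks out interfaces whose MTU is above a chosen limit. The sample text is the output from this post; the function name, the regular expression and the choice to skip loopback interfaces are my own assumptions, and the parsing only expects the five-column layout shown above.

```python
import re

# Output of "netsh int ipv4 show int" as seen on the customer's EV server.
SAMPLE = """\
Idx  Met         MTU  State      Name
---  ---  ----------  ---------  ---------------------------
  1   50  4294967295  connected  Loopback Pseudo-Interface 1
 13   10        9000  connected  Local Area Connection 1
 14   10        9000  connected  Local Area Connection 2
"""

# Idx, Met, MTU (all numeric), State (one word), Name (rest of the line).
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(.+?)\s*$")


def interfaces_above(text, limit=1500):
    """Return (idx, mtu, name) for non-loopback interfaces with MTU above limit."""
    hits = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, separator or blank line
        idx, _metric, mtu, _state, name = match.groups()
        if "loopback" in name.lower():
            continue  # the loopback pseudo-interface always reports a huge MTU
        if int(mtu) > limit:
            hits.append((int(idx), int(mtu), name))
    return hits


if __name__ == "__main__":
    for idx, mtu, name in interfaces_above(SAMPLE):
        print(f"Idx {idx}: {name} has MTU {mtu}")
```

Run against the sample, it flags both Local Area Connection entries, which is exactly what caught my eye.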
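And there is the money shot ladies and gents, an MTU size of 9000. My mind flicked to jumbo frames and storage devices and then back to the router that now sat between the newly relocated SQL server and the EV server. It would also explain the timeouts: with the MTU set to 9000 the server had no reason to complain about a 1473-byte ping itself, so the packet presumably went out and was dropped somewhere along the path.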
The money shot
A quick question about the purpose of the NICs confirmed it. The NIC with Idx 13 was for storage and the other was for the corporate network. Not even 2 minutes later we ran this:
netsh interface ipv4 set subinterface 13 mtu=1500 store=persistent
EV is running wonderfully well now. But not for long: we’re almost done migrating the customer away from it.
I hope this helps one of you out there.
The EV Server was losing its connection to the SQL server whenever any load was put on the server, to the point where even testing the ODBC connections would fail and the customer was constantly rebooting the server to fix the issue.
SQL Server had moved to a new data centre. Other applications had databases on the same server and they weren’t having any issues. It was the EV Server that had not moved, but was experiencing the issue.
Rebooting the EV server solved the issue for a period of time.
We fixed it. Simply.
P.S. MTU stands for Maximum Transmission Unit