How to troubleshoot MS SQL Server

I am sure the title must have piqued the interest of most DBAs since this is what we do every day. It’s also one of the most common questions I get when I conduct trainings as well.   Before I get into the details I want to call out that this post is not technical in nature so don’t expect any scripts or video. There are plenty of those out there anyway. Here I explain the process or the method I use. As in my trainings I compare the role of the DBA to that of a Doctor and hopefully that comparison will be relatable to everybody.

Some basic assumptions are listed below

DBA = Doctor

Server = Patient

Database = organ

Developers = Concerned relatives

Errors/ Offline/Shutdown = Tumors, injuries, fractures/ unconsciousness etc. (a physical manifestation of the problem)

Performance Issues = Symptoms like headaches, nausea (a side effect of the actual problem)

Trace/XE/DMV = Tools used to diagnose the problem stethoscope, MRI, XRAY etc.

Indexes, Query Hints, DBCC CHECKDB = Treatments some are bandages others are open heart surgeries

By now I guess you know where I am going with this but before you put on a white coat and roam the corridors of the data center, let’s dive a little deeper.

There are different types of Doctors and Medicine.  E.g. there are Residents / GP / Specialist. The resident is just starting out and akin to an L1 DBA using DMVs, Sp_who2 etc. to figure what’s going on.  The GP is obviously more senior but still not a specialist  and so he gets to use tools like XE and TRACE etc. to figure out what’s going on but unfortunately he isn’t an expert in all fields and so prone to misdiagnosing the issue. The specialist is the expert but since SQL is very vast you won’t find a lot of people who know performance tuning and Security and infra management and HA and DR and Development and SDLC and AGILE and DEVOPs, and CLOUD and .Net and ….. So just like in real world the GP needs to recommend the right specialist ( heart, brain , kidney) who can then use tools like DMVs , Perfmon, RML, XE, Trace, DBCC commands  etc.

Now that we have a distinction between different types of DBA let me explain give you an example of what a doctor is likely to encounter.

A patient and his concerned relatives walk into a hospital and are directed to a resident, they walk up to him and say “doctor he was fine till yesterday evening and nothing has changed since then, what’s wrong with him now?” How many DBA have heard this before “It was all working fine yesterday and we haven’t made any changes so why is the query slow?”

Here the L1 DBA needs to diagnose and suggest possible solutions however he has limited tools and understanding. As a result he is either not able to find anything wrong or misunderstands the symptoms and suggests a quick fix (aspirin) instead of fixing the actual root cause.

So what can the L1 DBA do to improve his success rate?

Ask the right questions – A doctor always starts his investigation asking a standard set of questions that eliminate 70% ( just my number, with the right question you could get 90%) of all issues right off the bat and help narrow down the problem. If you don’t ask the right questions or if the relatives are afraid of the cost (code rework) and hide info then you’re likely to go down the wrong path and implement a temporary fix.

Know your tools properly – L1 DBA only have to work with a very small subset of the tools available and there is no reason why it can’t be mastered in a few months. Sure you are going to use DMVs but not all of them so why not understand what each column means in the most common DMVs you use. An L1 DBA should know what exec_query_stats, Missing indexes and a basic understanding of execution plans are so that he can identify tempdb spills, inefficient joins, out of date stats etc.

Set expectations – Concerned relatives need to know that just because they didn’t make a change doesn’t mean that a change hasn’t occurred (for all you know the server was having an affair with an Ad hoc SQL he met a few weeks back). For servers the volume of data being inserted is growing and just like people servers age, if the hardware and software are not upgraded then naturally performance will decrease over time and additional changes like indexes, partitioning, rewrite etc. will be required. If you want performance like the first day you got the server then don’t do any deployments or add new data to the server :-p

Monitor and learn from experienceJust like in real life a major health issue shows a lot of warning signs before they finally collapse the patient. Heart patient would probably be overweight, fatigued easily or complain of minor chest pains etc. It’s not OK to say “he’s always been like that, well he is getting old or it must be gas” Relatives and family physicians have been monitoring the server long enough to understand that things are not good and should implement proactive steps early on. DBA who encounter the same issues over and over again need to ask what can be done to improve the server and if nothing can be done then call it out so that everybody understands.

You probably noticed that I have mixed my doctors and DBA analogy quite a few times above but it hasn’t really affected the meaning or context has it? For now I am stopping at this point but as a thought experiment, how would you troubleshoot a connectivity issue?

Assuming TCP/IP = eyes, Shared memory = Touch, Names Pipes = ears and VIA=??? J

If the connection are not accepted it could be because one of the senses are not working or because the server is not listening because they have earphones (firewalls) blocking the path, or your speaking a languages they don’t understand or they are ignoring you because their mom (DBA) told them not to listen ( permissions) to anything you said :-p

PS: – I hope you had as much fun reading this as I had writing it? Stayed tuned for more …

Please Consider Subscribing

Leave a Reply