Lessons Learnt – the story of a recent production issue
Yesterday we had a major production issue which left our system pretty much unusable.
Below is most of the story. It is quite long, and that is with me skipping a lot of detail I felt was unnecessary to understand the main points; I wanted to highlight the route we took to resolution and the key facts we looked at.
You can skip to the summary section at the end if you only want the bottom line, or you can follow it more or less as we did. Either way, I will start by listing the initial symptoms we encountered -
The trigger was a report from our support team that the environment was very slow and that every request was timing out.
A quick investigation revealed several facts -
- Actual load on the BizTalk server was not that high; server activity was similar to what we had seen in past months, with no significant increase or spikes.
- BizTalk ‘High Database Size’ performance counter was set to 1, as in ‘not good’.
- BizTalk was throttling its publishing rate due to database size (throttling state value of 6).
- Due to this throttling a delay of over 30 seconds on average was imposed on most published messages.
- CPU on the SQL box serving BizTalk (dedicated to the BizTalk databases) was averaging 90% (normally it is around 30%).
Upon looking into this further we noticed a few more facts -
- Our tracking database (BizTalkDtaDb) was very large, at 27GB.
- Our message box database was equally large, at just under 20GB.
- Our BAMPrimaryImport database was also very large, at around 120GB.
The state of our databases
We knew about the BAMPrimaryImport being quite large and had actually been taking steps to counter that over the past few weeks; I'll spare you the details, but we were about to reduce it to 1/7th of its current size in the coming weeks. I will mention, though, that whilst looking into the size of this database and how much of it we were going to get rid of, we noticed that many records were 'phantom records' – either records that remained active for months (which doesn't make sense in our environment) and/or records that were missing a referenced record, for example an active relationship record without an active instance. This is something we'll have to monitor over the next couple of weeks to see if it's a historical issue or a current one; we might put a process in place to check for that on a weekly basis as part of our environment health checks.
We also knew that the Dta database was large; it represents about 14 days of tracked activity for our system, which is quite a lot, but we did not think this was an issue other than the usual struggle to get enough data-centre disk space allocated… We're not backing up the data, and we do have the DTA purge and archive job running to ensure we don't keep more than 14 days of data.
In contrast, the size of the MessageBox database definitely surprised us; it was not known to be big, and our data suggests it was never even remotely this big. It had grown rapidly over the last few days, maybe no more than one or two; in addition, whilst our processes are quite complex and we get decent activity volumes, they are not long running, and the volumes we are getting do not explain such a large message box database.
Looking deeper we realised the size was bloated by records in the TrackingData tables; these kept growing, at a rate of about 3,000 records a minute when we looked.
The TrackingData tables
The TrackingData tables, of which there are two sets - TrackingData_0_n and TrackingData_1_n (where n is a running partition number from 0 to 3) – contain data intended to be moved to either the Dta database or the BAMPrimaryImport database by the TDDS service. The first set holds data destined for the BPI db and the second holds data destined for the Dta db (HAT).
The TDDS service is hosted in any instance of any BizTalk host marked to 'allow host tracking', and is the process responsible for moving that data.
Generally speaking these tables should not hold much data, and if you monitor the size of that data (something you can do using the “Tracking Data Size” counter in BizTalk’s “General Host Counters” category) you will see it fluctuates quite a lot – data gets added rapidly to the tables and is generally removed just as quickly. If you see data building up in these tables it suggests one of two things – either you don't have any instance of the tracking host running, or you have reached the capacity of your environment (data is being written faster than it can be moved).
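As a quick alternative to the performance counter you can count the rows directly; below is a sketch of the kind of query you could run against the message box database (table names as described above; NOLOCK used so the check itself doesn't add contention on a struggling system):

```sql
-- Rough row counts for each TrackingData partition in BizTalkMsgBoxDb.
-- A healthy system hovers near zero; sustained growth means TDDS is falling behind.
SELECT 'TrackingData_0_0' AS [TableName], COUNT(*) AS [Rows] FROM TrackingData_0_0 WITH (NOLOCK)
UNION ALL SELECT 'TrackingData_0_1', COUNT(*) FROM TrackingData_0_1 WITH (NOLOCK)
UNION ALL SELECT 'TrackingData_0_2', COUNT(*) FROM TrackingData_0_2 WITH (NOLOCK)
UNION ALL SELECT 'TrackingData_0_3', COUNT(*) FROM TrackingData_0_3 WITH (NOLOCK)
UNION ALL SELECT 'TrackingData_1_0', COUNT(*) FROM TrackingData_1_0 WITH (NOLOCK)
UNION ALL SELECT 'TrackingData_1_1', COUNT(*) FROM TrackingData_1_1 WITH (NOLOCK)
UNION ALL SELECT 'TrackingData_1_2', COUNT(*) FROM TrackingData_1_2 WITH (NOLOCK)
UNION ALL SELECT 'TrackingData_1_3', COUNT(*) FROM TrackingData_1_3 WITH (NOLOCK)
```

Running this a few minutes apart gives you the growth rate, which is how we arrived at the 3,000-records-a-minute figure mentioned earlier.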
In our case we did indeed have one tracking host running, and that host was pretty much dedicated to tracking, with nothing else running on it, which is the best practice; so that wasn't the issue.
Equally, we could not believe we had hit capacity: firstly because our volumes, whilst significant, are not extremely high, and certainly ones we'd expect BizTalk to cope with; but mostly because the system had been running at these volumes for a while now, and we had not seen any increase in the number of transactions or users, nor any specific spike in activity over the last day or two.
And so – whilst we couldn't explain why data in the TrackingData tables was building up, we did assume that this was what was causing the slowdown in BizTalk, and understood we had to clear this data.
Initially, wanting to avoid loss of tracking data, we decided to take the system down (it wasn't responsive anyway) and left just the tracking host running; this slowly cleared the data in the tables (it took over an hour), and when we turned the hosts back on the system was responsive again.
Looking for recovery
However – as was expected – that just left us on 'borrowed time': the data was again building up in these tables and we knew it would not be long before they filled up again; at the current rate we had only a few hours before the system would be down again.
At that point we enlisted the help of Microsoft support to understand why data wasn't flowing out of the message box, but in the meantime we took two actions –
- We started to look into the size of the BAMPrimaryImport database and began removing some of the older 'phantom records'; initially we removed some 25GB in just one pass (further cleanup would see an additional 25GB worth of records removed).
- We looked to reduce the amount of data being written to the TrackingData tables, and the easiest thing to turn off was Dta tracking, as there's a global tracking switch (turning off our BAM tracking would require us to replace pipelines used in tens if not hundreds of ports, so we preferred not to go there initially).
Turning off Global Tracking
Most BizTalkers will know that to turn off global tracking one needs to go to the adm_Group table in the BizTalk management database and set the global tracking field to 0 (the default is 1).
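For completeness, the database-side switch is a one-line update; this is a sketch only – the column name GlobalTrackingOption is what I recall from our management database, so verify it against your adm_Group schema (and take a backup) before touching anything:

```sql
-- Against BizTalkMgmtDb: 0 turns global (Dta) tracking off, 1 turns it back on.
-- Column name assumed to be GlobalTrackingOption - check your adm_Group schema first.
UPDATE dbo.adm_Group SET GlobalTrackingOption = 0
```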
What we learnt from Mandi Ohlinger in MS support is that there's actually another step, and that using the table column alone is not guaranteed to stop all Dta tracking; in addition, one has to set the GlobalTrackingOption property of the MSBTS_GroupSetting BizTalk WMI class to 0. To do that, follow these steps -
a) Go to Start, Run and type wbemtest. Click OK.
b) Click Connect.
c) In the Namespace box, type root\MicrosoftBizTalkServer, and then click Connect.
d) Click Open Class, type MSBTS_GroupSetting and then click OK.
e) In the Properties list, click GlobalTrackingOption. Click Edit Property.
f) To disable the Global Tracking option for the BizTalk Group, change the Value from 1 (0x1) to 0 (0x0), and then click Save Property. To enable, change the Value from 0 (0x0) to 1 (0x1), and then click Save Property.
g) Click Save Object.
h) Click Close and then click Exit.
Changing either setting (database or WMI) requires restarting the tracking host.
Stopping writing to BAM and clearing more data
Once we'd done that we saw that the size of the tracking data was increasing at a lower rate, but it was increasing nonetheless, so eventually we started turning off most of our BAM writing as well. Both tasks were done as troubleshooting steps (we certainly need our BAM data), but we wanted to see whether the problem was one of capacity – as in, do we write more data than the TDDS service is able to move – or whether the TDDS service had stopped working.
Reducing the size of the databases
At this point we used the Terminator tool to clear the TrackingData tables, as well as several other things highlighted by the MessageBoxViewer report, such as orphaned tracking items, extra cache instances and other such 'goodies'.
With that done the TrackingData tables were empty, and with only about 5% of the data previously written to them now coming in, the system was functioning again; we had bought ourselves a decent time window to figure out what was going wrong with the data flow.
Speaking with Mandi, she explained that with the destination databases being the sizes they were, the TDDS service simply struggles to update them, which starts the chain reaction we had been seeing and results in a system at a standstill. And so we looked for further 'phantom' records in BAM we could get rid of, as well as extracting data elsewhere, eventually reducing the size of the BAM database to a third of its original size.
Then our attention moved to the Dta database. We could have used dtasp_CleanHMData to clear all the tracking data, but Mandi explained that it is generally two tables that cause the chain effect on BizTalk – the dta_DebugTrace table, storing the start and stop times of all the orchestration shapes used for the replay mode display in HAT, and dta_MessageInOutEvents, storing the data used by the message flow view. She mentioned it has been observed that when the DebugTrace table grows beyond 1 million records significant slowdowns occur on the BizTalk server; we had 55 million.
And so we truncated the data in these tables, and the Dta database size was reduced significantly.
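The truncation itself is straightforward; a sketch of the sort of statements involved (run against the Dta database with the BizTalk hosts stopped – and if foreign keys block TRUNCATE on your version, fall back to batched DELETEs instead):

```sql
-- Against BizTalkDTADb, with BizTalk hosts stopped.
-- Removes orchestration shape timings (HAT replay) and message flow data.
TRUNCATE TABLE dbo.dta_DebugTrace
TRUNCATE TABLE dbo.dta_MessageInOutEvents
```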
Why does tracking data keep accumulating?
Now we were feeling confident that the system was functioning, and would continue to do so for a long time, but of course we still had the issue of having all of our tracking turned off, and that's what we looked at next.
As some BAM writing still occurred (we could not turn it all off easily) we saw that, whilst the inflow of data was very low, data was clearly not going out of the TrackingData tables. Even though we only saw an increase of 40-50 records per minute (compared to 3,000 previously), the numbers still kept growing. The TDDS service was definitely not doing its job.
At this point we tried several things – we turned global tracking back on, saw the inflow of data increase, and noticed data actually moving into the Dta database (confirmed by the sequence numbers increasing in the TDDS_StreamStatus table in the Dta database), so it was clear that the TDDS service was happily moving Dta data, but not BAM data.
To look further into that, we examined the sequence numbers in the TrackingData tables compared to those in the TDDS_StreamStatus table in the BPI database -
Every record written into the TrackingData tables is assigned a sequence number. This number increases with the number of records going into the table and, presumably, recycles at some point.
The TDDS service keeps track of what it had already moved to the Dta and BPI databases by keeping the last sequence number moved in a table called TDDS_StreamStatus in each database.
Whilst the TrackingData tables contain sequence numbers higher than the ones in the TDDS_StreamStatus table, the TDDS service will move data to the relevant database; if they are the same, TDDS will assume it is up to date and will do nothing until a new record arrives with a higher sequence number.
One of the theories raised was that somehow the sequence numbers in the BPI database were much higher than the numbers in the TrackingData tables, and so the TDDS service was 'frozen' until the numbers in the TrackingData tables 'caught up'; but when we checked, the situation was the reverse – the sequence numbers in the TrackingData tables were about 600,000 higher than the ones in the BPI database, and so they should all have been moved.
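The check itself is just two queries side by side; a sketch of what we looked at (the SeqNum column name here is from memory and may differ on your version, so verify against the actual schema):

```sql
-- Highest sequence number still waiting in one MsgBox partition...
SELECT MAX(SeqNum) AS PendingMax
FROM BizTalkMsgBoxDb.dbo.TrackingData_0_0 WITH (NOLOCK)

-- ...versus what TDDS believes it has already processed for BAM.
SELECT * FROM BAMPrimaryImport.dbo.TDDS_StreamStatus WITH (NOLOCK)
```

In a healthy system the two sides stay close together; in our case the pending numbers were about 600,000 ahead.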
When we discussed how we had reached such a huge gap, we agreed that the most likely reason was the use of the Terminator, but that, on its own, this shouldn't pose a problem; the TDDS service would simply have a long 'to-do list' to go through, and with the destination databases now being of a reasonable size this shouldn't be an issue. But still – the TrackingData kept growing and no data was moved to the BAMPrimaryImport.
Eventually, after looking at the source code in conjunction with traces from our servers, the clever guys at Microsoft figured it out – it turns out there's actually a known bug in BizTalk 2006 R2 (kb969870), which describes exactly what we were seeing – tracking data not being moved, with no warnings or errors.
My understanding of the nature of the bug is that whilst BizTalk is happy with a gap in the sequence numbers between the BPI and MsgBox databases if it's in the 'right' direction, when that gap is too big BizTalk spends a lot of time and effort attempting to query the TrackingData tables for data with sequence numbers that do not exist.
For illustration – let's say that BAM has a last sequence number of 1 and the TrackingData tables hold sequence numbers 100 and up; before processing sequence number 100, the TDDS service will ask for data for sequence numbers 2-99. With the gap being several hundred thousand, and the table being very big, this takes ages, and in the meantime the size of the table only keeps increasing… you can see where this is going, right?
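In other words, my mental model of the failure mode – purely illustrative, not the actual TDDS code – is a probe for every missing number in the gap:

```sql
-- Illustrative only: BAM last processed seq 1, table starts at seq 100.
-- TDDS effectively queries the (huge) table for each number in the gap:
SELECT * FROM TrackingData_0_0 WHERE SeqNum = 2   -- no matching rows
SELECT * FROM TrackingData_0_0 WHERE SeqNum = 3   -- no matching rows
-- ...and so on for every number up to 99, while new rows keep arriving.
```

Scale the gap of 98 in this illustration up to several hundred thousand, against a table holding millions of rows, and the service never gets to the real data.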
This bug was previously identified and fixed for BizTalk 2006 R2, but a fix was never issued for BizTalk 2006, which is what we're running, presumably because MS never saw the issue on this platform (it probably doesn't hit people that often); I believe they will now back-port the fix, and it should be available in the near future for BizTalk 2006 users. And so – in our case – the gap between the sequence numbers, albeit in the 'right' direction, was too big for the TDDS to handle, and it froze.
What have we concluded so far, then?
- The problem started with us letting our tracking databases (BAM and Dta) get too big; eventually they got so big that BizTalk struggled to move data into them.
- This, in turn, caused a build-up of data in the TrackingData tables, which eventually resulted in a major slowdown of the hosts themselves (as each operation needed to wait for its tracking data record to be inserted into these tables before completing).
- In the process of recovering (by reducing the sizes of the tracking databases) we hit a bug in BizTalk 2006 whereby it cannot handle a gap over a certain size between the sequence numbers in the BAMPrimaryImport database and the records in the TrackingData tables in the MessageBox database. Had it not been for this bug, once the tracking database sizes were reduced, BizTalk would have recovered nicely.
The road to recovery
As is often the case, once the problem was known the fix was relatively simple – to get the TDDS service running again we simply synchronised the sequence numbers: we took the lowest sequence number from each partition in the relevant TrackingData table and copied it into the respective field in the BPI StreamStatus table. Once that was done, the TDDS service (after being restarted) was again moving data between the databases.
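A sketch of that synchronisation for a single partition (repeat per partition; the lastSeqNum column name is an assumption from memory, and TDDS_StreamStatus holds one row per stream, so you will need a WHERE clause matching the row for that partition – I've left it out because the identifying column may vary by version; take backups and verify the schema before running anything like this):

```sql
-- Copy the lowest pending sequence number for partition 0_0 into the
-- matching BAM stream-status row, so TDDS resumes from real data.
-- NB: add a WHERE clause selecting the stream row for TrackingData_0_0.
UPDATE BAMPrimaryImport.dbo.TDDS_StreamStatus
SET lastSeqNum = (SELECT MIN(SeqNum)
                  FROM BizTalkMsgBoxDb.dbo.TrackingData_0_0)
```

As noted earlier, the tracking host must be restarted before TDDS picks up the new numbers.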
We are keeping global tracking off until we sort out our configuration to ensure we don’t track as much data as we used to (but enough to serve our support needs) and will shortly be reducing the amount of data kept in the BPI database too.
Once available we will also, of course, install the hotfix for the bug we encountered, to help us recover from similar situations should they happen again.
During the day we used the following tools -
- MessageBoxViewer – an awesome tool which collects and reports on lots of data regarding the BizTalk environment. First port of call, and well worth running periodically on each environment.
- Terminator – not for the faint-hearted, and it does require some care and attention, but a great tool to query some aspects of the state of BizTalk and to perform cleanup work that would simply be too difficult without it.
- Performance Counters – these greatly helped us focus on what was going on; without understanding these it would have taken us at least twice as long to get a grip on what was actually happening and where the issues lay.
- PSSDiag – once Microsoft support are involved, PSSDiag is the tool at hand to extract the very useful BizTalk trace they can analyse. Eventually it was this that highlighted the exact bug that affected us on the second round.
- wbemtest – we used this to turn global tracking on and off (on top of the database setting).