Yossi Dahan [BizTalk]


Wednesday, November 24, 2010

The tale of the mysterious BAM records

We’ve been using BAM to log the activity of our solution which is running in production under a reasonable load.
Most of the data comes from receive and send pipelines but some is written from orchestrations or even custom code as well, all of it is stored using the BAM API, mostly using the buffered event stream, but occasionally using the direct event stream and orchestration event stream.

Generally speaking, this had been working very well, and the log we kept proved invaluable.

A few months ago, though, we started seeing incidents where BizTalk processing would simply grind to a halt.
The first symptom was fairly easy to spot – we saw the Tracking Data performance counter climb steadily, and CPU on the SQL box supporting BizTalk would be in the high nineties constantly.

Data was being transferred from the message box TrackingData tables to the BAMPrimaryImport, so it wasn’t the previously reported sequencing error, but it was simply not going fast enough – we were suddenly writing more data than BizTalk was able to process.

Initially we thought we may have reached the breaking point for our solution on the existing infrastructure, but we quickly realised this didn’t make sense – yes – we’re running some load – but not enough to warrant this.

A closer look revealed that the active tables in the BAMPrimaryImport database were very large, and so, needing to avert the impending production issue, the team decided to truncate all the active tables, knowing that we would lose some tracking data but, hopefully, keep the system running and buy us some time.

Now – generally speaking this is not the best idea – you mustn’t touch these tables manually, but given the alternative – a non-functioning system – we decided to take the risk, and it kind of worked – once the tables were truncated, BizTalk happily cleared the TrackingData tables in the message box.

So – we knew we had a problem – but what caused it? Well – me – for the most part!

We knew the active tables kept growing, and a closer look revealed we were getting a lot of ‘orphaned’ records – these are records with very partial information, many null columns, and – crucially – a null IsVisible column.

A quick search revealed that IsVisible should really never be null – records in that state will never move from the active table to the completed table – and that this happens when a record is updated AFTER it has been completed and moved out of the active table. I must have done something wrong then!
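A quick sanity check makes these orphans easy to spot. The sketch below – with an assumed connection string and a hypothetical activity name, since BAM names each active table bam_&lt;ActivityName&gt;_Active – counts the records that will never be moved out:

```csharp
using System;
using System.Data.SqlClient;

class OrphanedRecordCheck
{
    static void Main()
    {
        // Server and activity name ("ActivityA") are assumptions --
        // substitute your own BAM activity names here.
        const string connectionString =
            "Server=.;Database=BAMPrimaryImport;Integrated Security=true";

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Records with a null IsVisible will never be moved to the
            // completed table, so any rows counted here are orphans.
            var command = new SqlCommand(
                "SELECT COUNT(*) FROM dbo.[bam_ActivityA_Active] " +
                "WHERE IsVisible IS NULL",
                connection);

            int orphanCount = (int)command.ExecuteScalar();
            Console.WriteLine("Orphaned records: {0}", orphanCount);
        }
    }
}
```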

Turns out we had at least a couple of issues -

The first was a simple misunderstanding on my part -

We have 4 activities, which are hierarchical; we use the relationships to define the connection between the activities.
In one place, having already created an instance of ‘Activity A’ we went on to create a new instance of ‘Activity B’, end it, and then add a reference from the ‘Activity B’ instance to the ‘Activity A’ instance.

Turns out if you add a reference FROM an activity instance that had already been completed, BAM will ‘kindly’ create a new record in the active table (it doesn’t matter what you’re referring to, or – if it’s an activity – whether it’s closed or open; only the source is important).

Now – learning that, it does make sense – but my defence would be that it seems quite logical to me to create entity A, create entity B, and then define the relationship between them, rather than having to do that half-way through the creation of entity B.
In any case – this was easy enough to fix, even if it took a bit more effort than it would appear given the broader story of our tracking, but, as these things go – once you know the answer, coding is never the difficult part.
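The fix boils down to re-ordering the BAM API calls. A minimal sketch, with hypothetical activity and item names rather than our real ones:

```csharp
using Microsoft.BizTalk.Bam.EventObservation;

class ActivityLogger
{
    // Activity names and data items are illustrative assumptions.
    public void LogActivityB(EventStream es, string activityAId)
    {
        string activityBId = System.Guid.NewGuid().ToString();

        es.BeginActivity("ActivityB", activityBId);
        es.UpdateActivity("ActivityB", activityBId, "Status", "Done");

        // The fix: add the reference from ActivityB to ActivityA BEFORE
        // ending ActivityB. Calling AddReference after EndActivity makes
        // BAM create a fresh record in the active table -- one with a null
        // IsVisible that will never move to the completed table.
        es.AddReference("ActivityB", activityBId,
                        "Activity", "ActivityA", activityAId);

        es.EndActivity("ActivityB", activityBId);
        es.Flush();
    }
}
```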

The second issue was more clearly a bug than a ‘misunderstanding’ – it would not surprise you if I said all of my pipeline code is streaming, and that works fine without too many issues.
In one of our scenarios we create an activity in the send pipeline of a solicit-response port, and close it in the receive pipeline.
Recently I added some asynchronous logic in there that, in some circumstances, will update an activity shortly after the message had been delivered.

Unfortunately, under load, and due to the asynchronous nature of BAM, this update may be delayed.
If this happens, and the system we’re calling happened to respond quickly, it may happen that my update occurs after the activity was closed by the receive pipeline and, in that case, we end up with this ‘null record’ again.

This latter problem could potentially be solved using continuation, ensuring that the completed record does not get moved from the active table until any related records are completed as well, but this does not fit our scenario very well.
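For completeness, here is a minimal sketch of that continuation pattern, with hypothetical names; the continuation token would need to be a value both the pipeline and the asynchronous code can derive:

```csharp
using Microsoft.BizTalk.Bam.EventObservation;

class ContinuationSketch
{
    // Called from the send pipeline. Names are illustrative assumptions.
    public void StartActivity(EventStream es, string activityId, string token)
    {
        es.BeginActivity("ActivityC", activityId);
        es.UpdateActivity("ActivityC", activityId,
                          "Sent", System.DateTime.UtcNow);

        // EnableContinuation keeps the record in the active table until the
        // continuation segment is ended too, so a late asynchronous update
        // no longer races against the record being moved away.
        es.EnableContinuation("ActivityC", activityId, token);
        es.EndActivity("ActivityC", activityId);
        es.Flush();
    }

    // Called later, asynchronously; the token stands in for the activity id.
    public void CompleteActivity(EventStream es, string token)
    {
        es.UpdateActivity("ActivityC", token,
                          "Delivered", System.DateTime.UtcNow);
        es.EndActivity("ActivityC", token);
        es.Flush();
    }
}
```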

In any case, resolving these two issues saw the number of bad records in the active table drop by about 96%. We still get a few a day, which I still need to iron out, but the rate is very low now, and we know how to surgically remove them until we sort this out, and so our environment is much healthier now.

The TrackingData counter that would often show 9,000 records before, now rarely climbs above 200.

To summarise I would make the following points -

  • Obviously there’s a lesson I’ve learnt about how BAM references work, and my assumption that it’s OK to create a reference from a closed activity was wrong.
  • The other reminder is to be very careful with asynchronous coding, something I’m sure we’ll see more and more of in the future. It is very powerful, but can get hairy.
  • There’s also a lesson about carefully checking the contents of the BAM active tables, not only the completed tables, after any BAM-related change, as well as a reminder of how important it is to fully understand any mechanism you use!
  • Last – it seems to me that, whilst I put my hand up and admit I was wrong – it’s not a good story if such a simple mistake made by a developer can bring down the entire server. Surely BAM should be more defensive and either error or cope with someone updating a record that has been closed; writing records that will never get cleaned up cannot be a good story, don’t you think?


2 Comments:

  • So Yossi,
    Did you ever iron out the details? And based on your experience would you continue to use BAM as a custom log store?

    By Blogger Anthony, at 25/10/2011, 15:56  

  • Hi Anthony

    I believe that in the end all the remaining 'rogue' records created were flavours of the above that I'd missed in my initial code scan. I don't believe there was anything else, and to your question - yes - I would definitely do it again, it does work very well, and now that we have these issues ironed out there are no further problems.

    By Blogger Yossi Dahan, at 26/10/2011, 08:22  
