Throttling in full action
Here's another one from the archives (=the list of things I have waiting to be blogged)
At some point we had a sudden peak in system load on our BizTalk processes and, as a result, our BizTalk solution that was running so nicely seemed to have gotten "stuck".
By "stuck" I mean we ended up with lots of processes in "Active" state, but they did not seem to be active at all; a closer inspection (of trace that should have been emitted) showed that although the instances' status said "Active" they were all very passive indeed - nothing was executing on the server, close to 0% CPU and no trace whatsoever.
This is where you might expect me to describe the long hours we spent investigating the issue, the sleepless nights and the empty pizza boxes... but really what happened is that, not being able to afford any more downtime, we called premier support - which turned out to be a great thing, because the first thing they did (well, not literally, but anyway) was ask us to check the state of the server using MsgBoxViewer, which in turn pointed out that we had simply maxed out our memory consumption throttling level.
You see - we cache a lot of data in our processes, mostly because we frequently access reference data that does not change very often; this is by design. What we forgot to do was estimate the amount of memory this caching would require when many different clients use the system, and adjust the throttling level accordingly.
As you can see from the image below, out of the box the BizTalk hosts are configured to throttle at 25% of the server's physical memory. The idea is to prevent the BizTalk processes from taking up too much memory and killing the server, and the assumption is that if throttling kicks in and stops processing instances, memory consumption will slowly reduce until the server gets back to a healthier state. However - by its very nature - caching does not release memory that often, so instances stopped progressing but no memory was released as a result, and we got "stuck".
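To see why this is a dead end, here's a minimal sketch of the mechanics - all numbers are hypothetical, and the model is deliberately crude: a long-lived cache plus a small transient footprint per in-flight instance. Throttling drains the transient part, but the cache never shrinks below the threshold:

```python
# Hypothetical numbers illustrating memory-based throttling hitting a dead end
# when most of the memory is held by a long-lived cache.

PHYSICAL_MEMORY_MB = 2000
THRESHOLD = 0.25 * PHYSICAL_MEMORY_MB   # default: throttle at 25% of physical memory

cache_mb = 550          # reference-data cache; by design, rarely released
per_instance_mb = 5     # transient memory per in-flight instance
active_instances = 20

def process_memory_mb():
    return cache_mb + per_instance_mb * active_instances

# Throttling stops new work, so the transient memory slowly drains...
while process_memory_mb() > THRESHOLD and active_instances > 0:
    active_instances -= 1   # an in-flight instance completes and frees memory

# ...but the cache stays put, so usage never drops below the threshold:
print("stuck:", process_memory_mb(), ">", THRESHOLD)   # 550 > 500.0
```

Once the cache alone exceeds the threshold, no amount of waiting helps - which is exactly the "active but passive" state we saw.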
In our case, the solution was straightforward: as we knew our memory consumption would be high, and we knew there was nothing else running on the server to compete for that memory (more or less), we could increase the threshold to 50%, which is enough to grant BizTalk Server enough memory for the caching and all the processing requirements.
In the process we monitored the situation by watching two BizTalk performance counters - "Process memory usage threshold" (here showing as 500MB) compared to "Process memory usage" (here showing around 130MB).
As long as there was a large enough gap between the two we knew our processes were going to be just fine; it is always important, of course, to monitor these over time to ensure there's no memory leak in the processes - which we had done - on top of peak-load tests, which we had not.
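The check we were doing by eye is easy to automate. A sketch of the idea - `check_headroom` is a hypothetical helper, and in practice you'd feed it samples from the Windows performance counters rather than hard-coded values:

```python
# Compare the two counters and flag when the gap shrinks.
# The 50% margin is an arbitrary choice for illustration, not a BizTalk default.

def check_headroom(usage_mb, threshold_mb, margin=0.5):
    """Return True if usage leaves at least `margin` (as a fraction of the
    threshold) of headroom before throttling would kick in."""
    return usage_mb <= threshold_mb * (1 - margin)

# Values from the post: threshold 500MB, usage around 130MB.
assert check_headroom(130, 500)        # plenty of headroom - fine
assert not check_headroom(450, 500)    # gap nearly gone - investigate
```

Tracking this over time is also what surfaces a slow leak: if usage creeps toward the threshold under a steady load, the cache (or something else) is growing when it shouldn't.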
Now, while all of this is down to a test or two we may have neglected on our side, there are a couple of interesting points here from a product perspective -
- We were confused by what we saw mostly because of the "Active" state of all instances (and we had quite a few); we would have diagnosed the problem much quicker, and on our own, had the admin console indicated that the server was not actually processing anything due to its throttling state.
- I can't help wondering whether the throttling mechanism couldn't be a bit more clever and identify that it has reached a dead end and is not actually helping to improve the situation. Following our case: the engine realised memory usage had gone too high and stopped processing instances. Wouldn't it be great if after, say, 10 minutes it realised that memory is not actually reducing - and so it will never exit the throttling state - and wrote something to the event log?
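The watchdog suggested in the second point could be as simple as this sketch - everything here is hypothetical (the function, the window, the tolerance), it just makes the heuristic concrete:

```python
# Hypothetical watchdog: if the engine has been throttling for a while and
# memory usage is not actually going down, stop waiting and raise the alarm
# (in BizTalk's case, by writing a warning to the event log).

def throttling_is_stuck(samples, window=10, tolerance_mb=1.0):
    """samples: memory-usage readings (MB), one per minute, taken while the
    host is in a memory throttling state. Returns True if, over the last
    `window` samples, usage has not dropped by more than `tolerance_mb` -
    i.e. throttling is not helping and will never release on its own."""
    if len(samples) < window:
        return False          # not enough history yet to judge
    recent = samples[-window:]
    return recent[0] - recent[-1] <= tolerance_mb

# Memory flat at ~550MB across 10 minutes of throttling -> sound the alarm:
if throttling_is_stuck([550.2, 550.1] * 5):
    print("WARNING: memory throttling active but memory usage is not reducing")
```

A steadily declining series would return False, so the watchdog stays quiet in the normal case where throttling is doing its job.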
Again - not trying to make any excuses, just thoughts with the power of hindsight...