Yossi Dahan [BizTalk]

Thursday, April 24, 2008

Throttling in full action

Here's another one from the archives (that is, the list of things I have waiting to be blogged).

At some point we had a sudden peak in system load on our BizTalk processes and, as a result, our BizTalk solution, which had been running so nicely, seemed to have gotten "stuck".

In "stuck" I mean - we ended up with lots of processes in "Active" state, but they did not seem to be active at all; a closer inspection (of trace that should have been emitted) showed that although the instances status says "Active" they were all very passive indeed - nothing was executing on the server - close to 0% CPU and no trace whatsoever.

This is where you might expect me to describe the long hours we spent investigating the issue, the sleepless nights and the empty cartons of pizza... but really what happened is that, not being able to afford any more downtime, we called in premier support, which turned out to be a great thing: the first thing they did (well, not literally, but anyway) was to ask us to check the state of the server using the MsgBoxViewer, which in turn pointed out that we had simply maxed out our memory consumption throttling level.

You see - we use a lot of caching of data in our processes, mostly because we frequently access a lot of reference data - data that does not change very often; this is by design. What we forgot to do was estimate the amount of memory this caching would require when many different clients use the system, and adjust the throttling level accordingly.
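To illustrate the pattern (and why it interacts so badly with throttling), here is a minimal, language-agnostic sketch in Python - the real cache lives in .NET helper components called from our orchestrations, and names like load_reference_data are made up for the example:

```python
# Conceptual sketch only - the real cache is a .NET helper component called
# from the orchestrations; load_reference_data() is a hypothetical stand-in
# for an expensive lookup against the reference database.
_cache = {}

def load_reference_data(key):
    return {"key": key, "payload": "..."}  # imagine a sizeable object graph here

def get_reference_data(key):
    # Populate on first use, then serve from memory for the lifetime of the process.
    if key not in _cache:
        _cache[key] = load_reference_data(key)
    return _cache[key]
```

Nothing in this design ever evicts entries, so memory only grows as more clients (and more distinct keys) hit the system - which is exactly why pausing new work, as throttling does, releases none of it.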

As you can see from the image below, out of the box the BizTalk hosts are configured to throttle at 25% of the server's physical memory. The idea is to prevent the BizTalk processes from taking up too much memory and killing the server, and the assumption is that if throttling kicks in and stops processing instances, memory consumption will slowly reduce until the server gets back to a healthier state. However, by its very nature, caching does not really release memory that often, and so our instances stopped progressing but no memory was released as a result - and so we got "stuck".

[Image: host throttling settings showing the default process memory usage threshold of 25%]

In our case the solution was straightforward: as we knew our memory consumption would be high, and we knew there was nothing else running on the server to compete for that memory (more or less), we could increase the threshold to 50%, which was enough to grant BizTalk Server sufficient memory for both the caching and all the processing requirements.
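If you want a feel for the arithmetic, the sketch below approximates the check described above from the outside, using Python and the third-party psutil package. It is not BizTalk's internal throttling logic, and it assumes the threshold is a percentage of physical memory (as described here) and that the host instances run as the standard BTSNTSvc process:

```python
# Rough outside-in approximation of the memory check described above - not
# BizTalk's internal throttling logic. Requires the third-party psutil package.
import psutil

THRESHOLD_PERCENT = 25  # the out-of-the-box value; we raised ours to 50

def host_instances():
    """BizTalk host instances run as BTSNTSvc.exe (BTSNTSvc64.exe on 64-bit)."""
    return [p for p in psutil.process_iter(["name", "memory_info"])
            if (p.info["name"] or "").lower().startswith("btsntsvc")]

def check_thresholds():
    limit = psutil.virtual_memory().total * THRESHOLD_PERCENT / 100
    for proc in host_instances():
        used = proc.info["memory_info"].rss  # working set, a reasonable proxy
        status = "OVER - throttling likely" if used > limit else "OK"
        print(f"PID {proc.pid}: {used / 2**20:.0f} MB used, "
              f"limit ~{limit / 2**20:.0f} MB -> {status}")

if __name__ == "__main__":
    check_thresholds()
```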

In the process we monitored the situation by watching two BizTalk performance counters - "Process memory usage threshold" (shown here as 500MB) compared to "Process memory usage" (showing around 130MB).

[Image: Performance Monitor showing the "Process memory usage threshold" counter at 500MB and "Process memory usage" at around 130MB]

As long as there was a large enough gap between the two, we knew our processes were going to be just fine; it is always important, of course, to monitor these over time to ensure there's no memory leak in the processes, which we have done, on top of peak load tests - which we have not.
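For what it's worth, a watcher for that gap can be scripted against the same counters. The sketch below uses pywin32's win32pdh; the counter category ("BizTalk:Message Agent"), the exact counter names (they may carry an "(MB)" suffix) and the host instance name are assumptions to confirm in Performance Monitor before relying on it:

```python
# Sketch of a counter watcher using pywin32's win32pdh. Confirm the counter
# category, counter names and host instance name in Performance Monitor first.
import time
import win32pdh

HOST = "BizTalkServerApplication"  # hypothetical host instance name

query = win32pdh.OpenQuery()

def add_counter(counter_name):
    path = win32pdh.MakeCounterPath(
        (None, "BizTalk:Message Agent", HOST, None, -1, counter_name))
    return win32pdh.AddCounter(query, path)

threshold_counter = add_counter("Process memory usage threshold")
usage_counter = add_counter("Process memory usage")

def read(counter):
    _, value = win32pdh.GetFormattedCounterValue(counter, win32pdh.PDH_FMT_LONG)
    return value

while True:
    win32pdh.CollectQueryData(query)
    threshold, usage = read(threshold_counter), read(usage_counter)
    gap = threshold - usage
    print(f"usage={usage} MB  threshold={threshold} MB  gap={gap} MB")
    if gap < threshold * 0.2:  # warn once we are within 20% of the limit
        print("WARNING: process memory usage is closing in on the throttling threshold")
    time.sleep(60)
```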

Now, while all of this is down to a test or two we may have neglected on our side, there are a couple of interesting points coming out of this from a product perspective:

  1. We were confused by what we saw mostly because of the "Active" state of all the instances (and we had quite a few); we would have diagnosed the problem much more quickly, and on our own, had the admin console indicated that the server was not actually processing anything due to its throttling state.

  2. I can't help but wonder whether the throttling mechanism couldn't be a bit cleverer and identify that it has reached a dead end and is not actually helping to improve the situation. In our case the engine realised memory usage had gone too high and stopped processing instances; wouldn't it be great if, after, say, 10 minutes, it realised that memory is not actually reducing - so it will never exit the throttling state - and wrote something to the event log? (A rough sketch of an external watchdog along those lines follows this list.)
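The engine doesn't do this today, but the idea is easy enough to approximate from the outside. The following is only a sketch of such a watchdog, not anything the product ships: sample_memory_usage and sample_threshold are placeholders that would read the same performance counters as the earlier monitoring sketch, and the alert here is just a log entry (in practice you would surface it through whatever monitoring you run):

```python
# Sketch of an external watchdog approximating the behaviour wished for above -
# not a product feature. sample_memory_usage() and sample_threshold() are
# placeholders that would read the same counters as the earlier sketch.
import logging
import time

STUCK_AFTER_SECONDS = 10 * 60   # the "say, 10 minutes" above
POLL_INTERVAL_SECONDS = 30

def watch(sample_memory_usage, sample_threshold):
    logging.basicConfig(level=logging.INFO)
    over_since = None
    previous_usage = None
    while True:
        usage, threshold = sample_memory_usage(), sample_threshold()
        falling = previous_usage is not None and usage < previous_usage
        if usage >= threshold and not falling:
            # Over the limit and not coming down: start (or keep) the clock.
            over_since = over_since or time.time()
            if time.time() - over_since >= STUCK_AFTER_SECONDS:
                logging.warning(
                    "Memory throttling has been in effect for %d minutes with no "
                    "reduction in usage (%d MB >= %d MB); the host is unlikely "
                    "to recover on its own.",
                    STUCK_AFTER_SECONDS // 60, usage, threshold)
        else:
            over_since = None   # below the limit, or memory is actually falling
        previous_usage = usage
        time.sleep(POLL_INTERVAL_SECONDS)
```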

Again - not trying to make any excuses, just thoughts with the power of hindsight...


5 Comments:

  • Hi Yossi,

    It's Mike.

    Nice article - we went through the exact same type of problem quite a while ago, but fortunately our issue came to light during testing.

    I think the one key area your article misses is that this scenario is just one of the many examples of why you should always have MOM or SCOM with any serious BizTalk implementation.

    We used MOM in our testing environments and it can alert you when the process memory is getting high, and also when any kind of throttling begins. I think MOM should be used during testing as it often highlights a lot of issues that a development team might otherwise not be aware of, because until a defect is raised by the test team you usually don't bother looking at the environment too much (except in performance environments, hopefully).

    HP OpenView also has a BizTalk pack which offers the same kind of monitoring; however, OpenView is more limited out of the box, so you will need to create a few custom alerts.

    Hope this helps
    Mike

    By Anonymous Mike Stephenson, at 25/04/2008, 08:46  

  • Hey Mike, thanks for the comment - you are absolutely correct, obviously! And indeed, you are preaching to the preacher. Unfortunately I don't control everything (but I try!).

    I am a strong advocate of having MOM, SCOM or anything else in place early on and integrating it into the development process, so that when the application goes out it is already fully supported. Unfortunately, as we all know, these sorts of things tend to be neglected in projects.

    By Blogger Yossi Dahan, at 25/04/2008, 10:28  

  • Dear All,

    We have a clustered BizTalk environment on Windows Server 2003 32-bit. We have configured different hosts for our deployed BizTalk applications.

    Many of our applications use the CodePlex TCP adapter for communicating over TCP ports, and one of our applications is receiving a load of around 100-150 messages per second.

    The problem we are getting is that the host instance's process memory continuously increases; once it crosses the process memory threshold, throttling kicks in, the host instance gets stuck and does not process anything, and all the orchestrations and messages are stuck in the active state.

    To work around the problem for the time being, I disabled the process-memory-based throttling, but the memory continues to increase and, after reaching approximately 1400 MB, OutOfMemory exceptions start to appear.



    Does anybody have any idea how to resolve this?


    Thanks.

    Syed Hasan Zubair

    By Blogger hasan137, at 05/03/2009, 10:27  

  • Hi Syed,

    It would be best to post the question (if you haven't already) in the newsgroups - there are many more people hanging out over there than there are reading this blog post (sadly!), and it's easier to get a bit more "interactive"; I would also try to pick this up there.

    By Blogger Yossi Dahan, at 07/03/2009, 21:08  

  • We had something very similar in our production environment, and you are right - BizTalk eventually shoots itself in the foot and gets stuck. Throttling impacts synchronous traffic (e.g. WCF receive-send ports) badly.

    FWIW, here is our 'recipe' for unjamming BizTalk when it gets stuck in throttling state 4:
    1) Suspend all the active orchestrations. Depending on how 'far gone' BizTalk is over the memory threshold, this might have no effect other than marking the orchs pending suspension.
    2) Stop the (process) hosts and restart them. The orchs should now be suspended and your memory released.
    3) Resume the orchs a few at a time until the backlog is cleared.

    Note that if you have a cluster and only one or two of the hosts are jammed, the orch suspension step is vital; otherwise the memory load will just be transferred to the working hosts and jam them as well.

    Prevention is better than cure, and all of the Dev best practices that we violated came home to roost:
    => Check for leaks, e.g. undisposed XLangMessages in custom assemblies
    => Using XmlDocuments in orchs is a no-no. Use readers.
    => Scope your message variables and any other variables to a local scope with as short a lifespan as possible.

    On 64-bit servers the 25% process memory default does seem to get hit quite often under load, especially given the vagaries of garbage collection. We've increased it to 50% and haven't seen any issues since.
    http://msdn.microsoft.com/en-us/library/ee308808(v=bts.10).aspx

    By Anonymous nonnb, at 16/06/2011, 09:39  
