Yossi Dahan [BizTalk]


Thursday, April 24, 2008

Throttling in full action

Here's another one from the archives (=the list of things I have waiting to be blogged)

At some point we had a sudden peak in system load on our BizTalk processes and, as a result, our BizTalk solution that had been running so nicely seemed to have gotten "stuck".

By "stuck" I mean we ended up with lots of processes in the "Active" state that did not seem to be active at all; a closer inspection (of trace that should have been emitted) showed that although the instance status said "Active" they were all very passive indeed - nothing was executing on the server, close to 0% CPU and no trace whatsoever.

This is where you might expect me to describe the long hours we spent investigating the issue, the sleepless nights and the empty pizza cartons... - but really what happened is that, not being able to afford any more downtime, we called Premier Support, which turned out to be a great thing: the first thing they did (well, not literally, but anyway) was ask us to check the state of the server using MsgBoxViewer, which in turn pointed out that we had simply maxed out our memory consumption throttling level.

You see - we use a lot of caching of data in our processes, mostly because we frequently access a lot of reference data - data that does not change very often; this is by design. What we forgot to do was estimate the amount of memory this caching would require when many different clients use the system, and adjust the throttling level accordingly.

As you can see from the image below, out of the box the BizTalk hosts are configured to throttle at 25% of the server's physical memory. The idea is to prevent the BizTalk processes from taking up too much memory and killing the server, and the assumption is that if throttling kicks in and stops processing instances, memory consumption will slowly reduce until the server gets back to a healthier state. However, by its very nature caching does not release memory that often, so instances stopped progressing but no memory was released as a result - and so we got "stuck".
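To make the failure mode concrete, here is a toy sketch (illustrative Python, nothing BizTalk-specific; the function name and the numbers are made up, only the 25% default comes from the post) of why a large, never-evicted cache keeps the host throttled forever:

```python
# Hypothetical sketch of the memory-based throttling rule described above:
# the host stops taking on new work once process memory crosses a fixed
# percentage of the server's physical memory.

def should_throttle(process_memory_mb, physical_memory_mb, threshold_pct=25):
    """Return True when process memory exceeds the configured percentage."""
    return process_memory_mb > physical_memory_mb * threshold_pct / 100.0

# With heavy caching, process memory never drops back below the line,
# so the host stays throttled indefinitely:
physical = 4096          # a 4 GB server (illustrative)
cache = 1100             # cached reference data that is never evicted

assert should_throttle(cache, physical)            # 1100 > 1024 -> "stuck"
assert not should_throttle(cache, physical, 50)    # raising to 50% unblocks
```

The sketch also shows why raising the threshold (as we did) fixes the symptom: the cache stays below the new line, so throttling never engages.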


In our case the solution was straightforward - as we knew our memory consumption would be high, and we knew there was nothing else running on the server to compete for that memory (more or less), we could increase the threshold to 50%, which is enough to grant BizTalk Server enough memory for both the caching and all the processing requirements.

In the process we monitored the situation by comparing two BizTalk performance counters - "Process memory usage threshold" (here showing as 500MB) against "Process memory usage" (here showing around 130MB).


As long as there was a large enough gap between the two we knew our processes were going to be just fine; it is always important, of course, to monitor these over time to ensure there's no memory leak in the processes - which we have done - on top of peak load tests - which we had not.
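The "gap" check we did by eye can be expressed as a simple headroom calculation (a sketch, not anything BizTalk provides; the function name is made up, and the numbers are the counter values mentioned above):

```python
# Headroom before the memory throttling threshold kicks in, as a
# percentage of the threshold. Illustrative only - in practice these
# values come from the two BizTalk performance counters named above.

def memory_headroom_pct(usage_mb, threshold_mb):
    """Percentage of the throttling threshold still unused."""
    return (threshold_mb - usage_mb) / threshold_mb * 100.0

# Threshold 500MB, usage ~130MB, as in the counters shown in the post:
print(memory_headroom_pct(130, 500))  # 74.0 - a comfortable gap
```

A value trending towards zero over time would be the memory-leak warning sign mentioned above.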

Now, while all of this comes down to a test or two we may have neglected on our side, there are a couple of interesting points behind this from a product perspective -

  1. We were confused by what we saw mostly because of the "Active" state of all instances (and we had quite a few); we would have diagnosed the problem much quicker, and on our own, had the admin console indicated that the server was not actually processing anything due to its throttling state.

  2. I can't help but wonder whether the throttling mechanism couldn't be a bit more clever and identify that it has reached a dead end and is not actually helping to improve the situation. In our case the engine realised memory usage had gone too high and stopped processing instances. Wouldn't it be great if after, say, 10 minutes it realised that memory is not actually reducing, and so it will never exit the throttling state, and wrote something to the event log?

Again - not trying to make any excuses, just thoughts with the power of hindsight...


Wednesday, April 23, 2008

Just in case there's someone out there who didn't hear yet -

Microsoft have just publicly announced BizTalk Server 2006 R3 on Steve Martin's blog here.

Not much more to say on top of that I guess...that's the whole story.


Monday, April 21, 2008

GAC assembly as a post build event

With BizTalk server every DLL we use has to be in the GAC.

Much too often, after making a change to such a DLL, I forget to GAC it before running a test, which naturally results in the test failing; in most cases the error is obvious and I simply have to return to the infamous build-GAC-restart-host cycle before re-running my test, but every now and then I get thrown by the error and do not realise it is a simple case of me forgetting to GAC an assembly.

To avoid those annoying moments I have developed a habit of adding the command to GAC an assembly to the project's post-build event.

The command required looks like this -

"$(DevEnvDir)..\..\SDK\v2.0\Bin\GacUtil.exe" /i "$(TargetPath)" /f

and it can be used to make sure that on a successful build the assembly generated will be added to the GAC.

Of course this has downsides - it is quite possible that you do not want to GAC on every build; furthermore, the post-build event is part of the project properties and as such goes into whatever source control you're using - so now you have to worry about all those other developers who might get that code and build it: do they want it in the GAC?

Having said all that, I find that - for me - having the post-build event is usually better than not having it.
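One way to soften the first downside is to make the post-build event conditional on the build configuration. A sketch, assuming Visual Studio's standard build macros (the macros are expanded before the batch line runs, so the comparison works):

```
if "$(ConfigurationName)" == "Debug" "$(DevEnvDir)..\..\SDK\v2.0\Bin\GacUtil.exe" /i "$(TargetPath)" /f
```

That way a Release build (on a build server, say) skips the GAC step while your day-to-day Debug builds still get it.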


RSS feed now delivers all post contents

Ewan Fairweather, a good friend, pointed out that my RSS feed only published the first paragraph of each post and that it puts him off reading my blog (as if any reason is required...).

Anyway - I completely forgot I had left it in this state the last time I played around with the blog's settings, so I apologise to anyone this has inconvenienced; I believe this has now changed, and the full content of every post is available directly from the RSS feed.

Wednesday, April 09, 2008

Have you spotted what the terminate shape does?

The list of shapes in the BizTalk toolbox is not a very long one, and so - 4 years after BizTalk 2004 was released - I find it strange to be discussing the behaviour of a simple shape like the terminate shape; the thing is, I never quite paid attention to how useful it can be, and so I figured others may have missed it as well.

Now - it is not like I'm talking about anything revolutionary here, just a small oversight on my part.

In essence the terminate shape is as straightforward as anything can be really - you put it in your workflow and behold - your process, upon reaching this shape, will terminate!

The shape takes one "parameter" - a string (or anything that returns a string) which would be the "reason" for termination.

I've known for a while that this string appears in the HAT query results list as the reason for the termination, and so it is quite useful from an operational perspective (why did that process stop again?)

However, I always claimed that terminate is really intended for error scenarios, and not for situations where the process has simply reached a point where it should stop processing; I have even, in a previous post, suggested introducing a new "end" shape; in that post I mentioned a possible alternative - using the terminate shape - which was also suggested in a comment by an anonymous reader, but as I believed the shape was really meant for error scenarios I felt this was not ideal, though at the time this was more a gut feeling than anything else.

Today I noticed for the first time the label used for the termination "reason" - it is called "error info", which to me is the "proof" I needed (at least to satisfy myself) that the terminate shape was indeed intended for error scenarios; the same error info appears when you view the message flow in HAT of a process that has been terminated: at the top section, with all the general details about the process, you will find an 'error info' label, and any text provided to the terminate shape will be shown there.

I've spotted this only because, being the hard-headed guy that I am, I have a few orchestrations that have a terminate shape as the last shape in the process. "What is the point in that??" I hear you ask... well - this is why I looked at my decision again.

Well - there are a few possible scenarios - here's one: we have a process that is exposed as a web service. The process initiates several sub-processes (using a mixture of call and start orchestration) and then returns a response to the caller.

If we have errors in the process we keep track of them in a helper object and return them as a SOAP header (with or without a response) so that the client is aware of them.

Using the terminate shape at the bottom of the orchestration (if my errors collection is not empty) I can report the errors to HAT and make them visible to our operators as well; instead of the orchestration just showing as completed, they get a hint through the status that something did not go smoothly, and by inspecting the error info field they can find out what it was.

Yes - I know I could use the event log - but this way it gets logged with the process in HAT, where we can easily find it.
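The helper-object pattern is roughly this (an illustrative Python sketch, not BizTalk code; the class name, method names and messages are all made up):

```python
# Collect non-fatal errors while the process runs; if any were recorded,
# end the run with a "reason" string an operator can read in the error
# info field, rather than letting the process show as plainly completed.

class ErrorCollector:
    def __init__(self):
        self.errors = []

    def add(self, message):
        """Record a non-fatal error without stopping the process."""
        self.errors.append(message)

    def termination_reason(self):
        """Build the string handed to the terminate step, or None."""
        if not self.errors:
            return None
        return "Completed with %d error(s): %s" % (
            len(self.errors), "; ".join(self.errors))

errors = ErrorCollector()
errors.add("partner feed timed out")
reason = errors.termination_reason()
assert reason == "Completed with 1 error(s): partner feed timed out"
```

In the orchestration the equivalent of `termination_reason()` would feed the terminate shape's expression only when the collection is non-empty.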


Saturday, April 05, 2008

...and then just when you actually needed HAT..

I'm pretty sure I'm not alone suggesting that the HAT tool is somewhat, let's say, lacking....

There are quite a few annoying things about the tool, but one thing in particular has to be at the top of the list because, in my view, it means that in the one case where you really need HAT to help you out, it fails you miserably.

The "orchestration debugger" is a nice selling point for BizTalk: you develop your process and, assuming you have the relevant tracking settings turned on, you can go back to processes that have already completed and "replay" them to see which shapes were executed and which weren't.

This is really great when viewing processes that have already completed, and also somewhat useful when setting a breakpoint in the process and attaching to it in HAT (although not as useful as one might think).

However - it is completely useless when dealing with suspended orchestrations.

If your orchestration gets suspended for whatever reason, you get a nice error message in the event log; in all likelihood the event log message will even contain the name of the shape in your orchestration in which the exception occurred. However - find the suspended instance in the admin console or in HAT, open the orchestration debugger - and you're in for a surprise: the viewer will only show you execution up to a few shapes BEFORE the actual shape that failed.

I'm not sure I have the story right, but I believe this "bug" (it's really a "side effect", more on this in a second) was introduced in 2006 (though I no longer have 2004 installed to prove it was actually better beforehand).

As far as I know, one of the changes in BizTalk 2006 is around the way orchestrations handle exceptions - in BizTalk 2004 all unhandled exceptions in an orchestration would (if my memory serves me right) result in a suspended non-resumable instance; in 2006 these instances are resumable. This suggests that BizTalk 2006 has to keep the state of the orchestration from BEFORE the error occurred, so that if an administrator chooses to resume the process (possibly after fixing whatever caused the suspension) it could start again and retry the action where it got suspended.

To achieve this BizTalk probably keeps the last GOOD state of the orchestration in the database (from the last persistence point executed); in other words - where before a suspension would cause a persistence point, from 2006 it does not, and the orchestration's last persistence point is what's kept in the message box.

If that is correct it would explain why the orchestration debugger only shows information up to a point before the shape that caused the suspension - it would only have information up to the last persistence point.
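If that theory holds, the behaviour can be illustrated with a toy checkpointing sketch (illustrative Python, not BizTalk internals; the shape names and checkpoint placement are made up):

```python
# State is saved only at explicit checkpoints (persistence points), so
# after a failure the stored state reflects the last checkpoint - not
# the step that actually failed. That is all a debugger driven by the
# persisted state could ever show.

saved_state = None

def checkpoint(state):
    """Persist a snapshot - the only state a post-mortem view can see."""
    global saved_state
    saved_state = dict(state)

def run_process():
    state = {"last_shape": None}
    for shape in ["receive", "transform", "send", "failing_shape"]:
        if shape == "transform":
            checkpoint(state)   # last persistence point before the error
        if shape == "failing_shape":
            raise RuntimeError("suspended at " + shape)
        state["last_shape"] = shape

try:
    run_process()
except RuntimeError:
    pass

# The persisted snapshot stops at the checkpoint taken before the
# failure, even though two more shapes ran before the exception:
assert saved_state == {"last_shape": "receive"}
```

The gap between `saved_state` and the shape named in the exception is exactly the "few shapes BEFORE the actual shape that failed" gap described above.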

I don't know if this was an oversight when releasing 2006 or a conscious sacrifice, but I think it's a big pain point; it would have been great to see it all - where the exception occurred, the state of the orchestration at that point, as well as an indication of where the last persistence point was - so we could tell what will get executed when we resume the orchestration.

There are quite a few good uses for HAT - it's a great tool for knowing what's been executed on the server over time; it's not a bad tool for looking at how long a service took to run; it's even somewhat useful for checking the flow of a particular message through the engine using the message flow or orchestration debugger views - but when it comes to helping find the cause of a suspended orchestration, it is quite pointless for that reason.

So if you were counting on it to bail you out when your process fails - you may as well switch off orchestration shape tracking for your processes.
