What not to do – avoid reading the entire message into memory unnecessarily
This one is fairly common, I suspect, and I can certainly see why – the temptation is simply too big – but too many pipeline components start by reading the entire message into memory when, with a little more effort, this could have been avoided.
One pipeline component I’ve seen, for example, receives a flat file and needs to remove records that have already been processed (duplicate elimination) – quite a good thing to do in a pipeline, and I also liked the approach of doing so before the disassembler, to make the XML produced smaller.
V1 of the component used a memory stream – the incoming stream was read line by line, each line was assessed, and – if it was not a duplicate – it was written to the memory stream.
When the component had finished going through the entire incoming stream, the memory stream would be assigned to the message, replacing the original stream, and the message would get returned to the pipeline.
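To make the V1 approach concrete, here is a minimal sketch in Python (BizTalk components are .NET, so this is purely illustrative of the pattern; the function name and types are my own): every line of the incoming stream is assessed, and the surviving lines are accumulated in an in-memory buffer that ultimately replaces the original stream.

```python
import io


def deduplicate_buffered(incoming):
    """V1 pattern: buffer the entire de-duplicated message in memory.

    Memory use grows with the size of the outgoing message, and nothing
    is handed downstream until the whole incoming stream has been read.
    """
    seen = set()
    out = io.BytesIO()              # the in-memory "memory stream"
    for line in incoming:           # read the incoming stream line by line
        if line not in seen:        # keep only the first occurrence
            seen.add(line)
            out.write(line)
    out.seek(0)                     # rewind: this buffer replaces the original stream
    return out
```

Note that the whole de-duplicated message sits in `out` before a single byte is returned – which is exactly the problem described below.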
There are two downsides to this approach. The first is memory consumption – the component will always consume at least as much memory as the size of the (outgoing) message; BizTalk would eventually reclaim this memory, but only after it had finished processing the message. The second is a potentially unnecessary delay in further processing of the message. One of the huge benefits of the pipeline, in my view, is its streaming fashion: subsequent components, if developed in the correct manner, can start working on parts of the message before the preceding components have completed their processing; essentially, each component passes back to the pipeline the portion of the message it has already processed whilst working on the next portion.
It appears that the customer in question encountered memory issues, as the component’s code was later changed to use a virtual stream instead of a memory stream; a virtual stream is effectively a stream that uses disk for storage instead of memory.
This solves the memory consumption issue, but merely replaces it with I/O operations, which may have an even bigger impact on the server’s overall performance (and it does not address the processing-delay point at all).
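The virtual-stream variant can be sketched the same way (again an illustrative Python sketch, not the actual BizTalk class): the only change from the memory-stream version is that the buffer now lives on disk, so memory stays flat while every buffered line costs disk I/O – and downstream processing still waits for the full pass.

```python
import io
import tempfile


def deduplicate_to_disk(incoming):
    """'Virtual stream' pattern: buffer the de-duplicated output on disk.

    Trades the memory cost of the V1 approach for disk I/O; the delay
    problem is unchanged, since the whole message is still processed
    before anything is returned.
    """
    seen = set()
    out = tempfile.TemporaryFile()  # disk-backed instead of memory-backed
    for line in incoming:
        if line not in seen:
            seen.add(line)
            out.write(line)
    out.seek(0)
    return out
```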
What would have been the correct way to implement this in my view?
The component should have created a custom stream wrapping the original stream from the message; it would then replace the message’s stream with the custom stream and immediately return the message to the pipeline. Note that at this point the component has not touched the message stream – zero bytes have been read.
As BizTalk (and not the component!) reads the message – for instance, when persisting it to the message box – the custom stream’s read function would be called. That function would read the underlying stream (the original stream received by the component), probably buffering reads until the end of a line for simplicity (although in many cases even this is not necessary), and assess whether the record is a duplicate. If it is a duplicate, the function would simply read the next line, and so on, until a non-duplicate record is found, at which point that line would be returned as the output of the read call.
This effectively means that the next component, or the message box, will receive the message line by line, with duplicate records removed, without having to wait for the component to process the entire message – and with at most one line ever held in memory at a time.
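The wrapping-stream approach can be sketched as follows – again in Python for illustration (the real component would derive from .NET’s `System.IO.Stream`; the class name here is my own). No work happens in the constructor; lines are read, assessed, and filtered lazily, only when a downstream consumer actually reads from the stream:

```python
import io


class DeduplicatingStream(io.RawIOBase):
    """Wraps the original message stream and removes duplicate lines
    lazily, as the consumer reads – never buffering more than one line."""

    def __init__(self, inner):
        self._inner = inner      # the original stream; untouched so far
        self._seen = set()
        self._pending = b""      # part of the current line not yet handed out

    def readable(self):
        return True

    def readinto(self, b):
        # Pull lines from the wrapped stream until we have a non-duplicate
        # (or the source is exhausted), then serve it to the caller.
        while not self._pending:
            line = self._inner.readline()   # buffer at most one line
            if not line:
                return 0                    # underlying stream exhausted
            if line not in self._seen:      # skip duplicate records
                self._seen.add(line)
                self._pending = line
        n = min(len(b), len(self._pending))
        b[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n
```

A consumer simply reads from the wrapper as if it were the message stream itself; the original stream is only pulled on demand, so downstream processing can begin before the source has been fully read.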