Friday, October 19, 2007

BPEL and compensation

Is BPEL a good tool for implementing compensation? It really depends, and you really have to know what you are doing - which (with all due respect) doesn't seem to be the case for most people (not even BPEL specialists). So if not even those experts know, how can we expect the rest of us to know? Hence this blog entry.

For instance, on repeated occasions I have heard renowned BPEL and workflow experts mention that compensating transactions are "perhaps" best modeled at the business logic level. This, by the way, includes Bill Burke in the case of JBoss/jBPM - see here. Note that I emphasized the word "perhaps": this indicates the shade of misunderstanding usually present in the arguments.

I have been saying this here and there in the past (and in fine detail in this article), but I want to repeat it again: neither BPEL, nor workflow, nor WS-BA is ideal for compensation unless the compensating party doesn't care whether it will eventually need to compensate. In other words, if the compensation is business as usual to the provider of the compensatable service then BPEL might be OK (though certainly not desirable - see below).

Why is that? Put yourself in the place of a service that is asked to compensate by a BPEL engine somewhere. Also suppose that you are in a B2B ecosystem where you don't necessarily trust the party that owns the BPEL engine. Now what would you rather do: trust the BPEL to compensate - eventually (which might be never!) or rather deal with compensation yourself, say after a timeout? I would definitely choose the latter. I don't want someone else to decide when I need to compensate. I want to decide for myself, and the Atomikos TCC model allows for that. BPEL and jBPM don't.
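To make the provider's point of view concrete, here is a minimal sketch of a reservation that compensates on its own after a timeout. It is not the Atomikos TCC API: the `Reservation` class, its method names and the 30-second timeout are all assumptions made up for illustration, and the clock is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    TRIED = "tried"
    CONFIRMED = "confirmed"
    CANCELLED = "cancelled"


@dataclass
class Reservation:
    """Hypothetical provisional booking that the provider cancels by itself."""
    created_at: float
    timeout_seconds: float
    state: State = State.TRIED

    def expire_if_due(self, now: float) -> None:
        """Provider-side compensation: no remote party has to ask for it."""
        if self.state is State.TRIED and now - self.created_at >= self.timeout_seconds:
            self.state = State.CANCELLED

    def confirm(self, now: float) -> bool:
        """Confirm only if still pending and within the timeout; report success."""
        self.expire_if_due(now)
        if self.state is State.TRIED:
            self.state = State.CONFIRMED
        return self.state is State.CONFIRMED


# The caller never comes back, so the provider cancels by itself.
abandoned = Reservation(created_at=0.0, timeout_seconds=30.0)
abandoned.expire_if_due(now=31.0)

# The caller confirms in time, so the reservation becomes permanent.
confirmed = Reservation(created_at=0.0, timeout_seconds=30.0)
confirmed_ok = confirmed.confirm(now=10.0)
```

The design choice that matters is that cancellation is the default outcome: if the remote process engine crashes or simply forgets, the provider's data is still released.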

So BPEL is ruled out for me - at least as far as compensation goes. What about WS-BA? It is a step in the right direction, but unfortunately it is a bloated protocol, very inefficient and loaded with application-level messages that pollute the compensating part. Even worse, it largely suffers from the same lack of a timeout and depends on the BPEL engine to at least trigger compensation.

Also, WS-BA doesn't allow for application logic on close - I won't bother you with the entire spec details but it is like a try...catch...finally where the exception is raised by the client (ugly!) and where the finally block can only be empty! Again, Atomikos TCC is far superior, more efficient and more elegant. It is also more natural for compensation than any BPEL engine will ever be.

One last note on BPEL and this supposed "modeling the compensation in the business process": I was talking to an IBM architect the other day. He said that they were doing a large telco project with BPEL to co-ordinate things. One of the things he complained about was exactly this: they had to model the compensation and error logic as explicit workflow paths, and it was overloading everything with complexity. As he correctly put it, they were implementing a transaction manager at the business logic (BPEL) level, over and over again in every process model. He added that this complexity was hard to test and was virtually killing the project - especially when there were change requests to consider. I believe him :-) I gave him the URL to our TCC article above.

Atomikos and TCC allow you to focus on the happy path of your workflow models. We take care of the rest. Now imagine what a reduction in complexity that is, and how much more reliable things get! So no, compensation should NOT be modeled at the business level. Except on rare occasions maybe.

REST and reliability

Whenever I see a presentation on REST I am impressed by its simplicity. With just four operations (GET, POST, PUT, DELETE) it seems to accomplish a simple model for service-oriented architectures, where every business resource has a URL.

With this simplicity, REST also leverages the ubiquitous HTTP protocol as the underlying mechanism. More and more people seem to like this, including me.

However, the big question for me is: how do you make this reliable? Imagine that you integrate 4 systems in a REST style. You would be using HTTP and a synchronous invocation mechanism for each service. Now comes the question: how reliable is this? The answer: less than the least reliable system that you are using! More precisely, availability goes down quickly because your aggregated service fails as soon as one of the services fails...
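The arithmetic behind that claim can be sketched in a few lines. The example assumes that the four services fail independently and that each one is up 99% of the time; both the independence assumption and the figures are made up for illustration.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a synchronous chain that needs every service to be up."""
    return prod(availabilities)


services = [0.99, 0.99, 0.99, 0.99]  # assumed availability of each system
combined = chain_availability(services)
print(f"{combined:.4f}")  # prints 0.9606
```

Four systems that are each down 1% of the time give an aggregate that is down almost 4% of the time, which is worse than any single one of them.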

With transports like JMS you can improve reliability, but how do you do REST over JMS, given REST's close relationship with HTTP and URLs? That is the problem with REST for me.

Thursday, October 11, 2007

Data Replication in SOA: The Price of Loose Coupling

When designing a corporate SOA architecture you are often faced with a tough choice: do you rely on a common database (centralized) or do you implement replication instead?

Let me explain what I mean. The idea in SOA is that you define more or less independent services that correspond (hopefully) to clearly defined and business-related activities. For instance, you could have a customer management service and a payment/invoicing service. The customer management service belongs to CRM, the invoicing to the billing department. However, both of these services might need the same customer data. Now what do you do? Basically, you have the following options:

  1. Use the same centralized customer database. This gives you the benefit of easy maintenance because there is only one copy. However, this also means that you are coupling your services into the same database schema, and updates to the schema are likely to affect more than one service.

  2. Replicate the customer database, by identifying one master (the CRM?) that regularly pushes or publishes updates (in an XML feed, for instance). While you lose the benefit of easy maintenance, this does give you loose coupling: as long as the XML format is the same, you can change DBMS schemas as much as you like - without affecting other services.

  3. Merge the customer and invoicing services into one. However, this may not always be possible or desirable, and may even defeat the purpose of service-orientation altogether.

  4. Have the invoicing service query the customer service for each payment. This seems to incur a lot of dependencies and network traffic.

So what do you do? My preference tends to go to the second option. However, it means that realistic SOA architectures are likely to have an event-driven nature.
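A rough sketch of the second option, under heavy assumptions: the column names (`cust_no`, `full_name`), the feed element names and the `BillingReplica` class are all hypothetical. The point is only that the XML feed is the contract, while each side keeps its own private schema.

```python
import xml.etree.ElementTree as ET


def publish_customer(crm_row):
    """Master side: map the private CRM schema onto the agreed feed format."""
    customer = ET.Element("customer", id=str(crm_row["cust_no"]))
    ET.SubElement(customer, "name").text = crm_row["full_name"]
    return ET.tostring(customer, encoding="unicode")


class BillingReplica:
    """Replica side: keeps its own copy, in its own schema, fed by update events."""

    def __init__(self):
        self.customers = {}

    def on_update(self, message):
        event = ET.fromstring(message)
        customer_id = int(event.get("id"))
        self.customers[customer_id] = {"billing_name": event.findtext("name")}


replica = BillingReplica()
replica.on_update(publish_customer({"cust_no": 42, "full_name": "Ada Lovelace"}))
```

Renaming `full_name` in the CRM database only touches `publish_customer`; the billing side keeps working as long as the feed still carries `<name>`.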

Monday, October 08, 2007

Atomikos Offers 3rd Generation TP Monitors

This post on InfoQ was made by Arjuna, one of our (ex) competitors after JBoss (and then Red Hat) bought their transaction technology.

More interesting than the referred paper are the comments, which I would like to discuss here. Most posts seem to rule out transactions as something that doesn't scale. I agree with none of these comments.

The main complaints uttered seem to fall into these categories:

  1. Transaction managers are supposedly centralized.

  2. Transaction managers are accused of overhead for two-phase commit and synchronization.

I will now show that both these statements are misconceptions, and claim that the 3rd generation transaction monitor already exists. Moreover, I will show that 3rd generation transaction managers are better than (or at least as good as) the alternatives - when used correctly.

The product I am talking about is Atomikos ExtremeTransactions, including its JTA/XA open source edition named TransactionsEssentials. Let me now outline why none of the above objections are actually accurate:

  1. Atomikos ExtremeTransactions is a peer-to-peer system for transactions. Whenever two or more processes are involved in the same transaction, the transaction manager component (library) in each process will collaborate with its peer counterpart in the other process. Consequently, there is no centralized component and no bottleneck. Our studies have shown that this gives you linear (i.e., perfect) scalability. This invalidates the first criticism above.

  2. While two-phase commit does incur some synchronization, the same is true for any other solution (assuming that you want to push operations to one or more backends). A simple example to illustrate my point: many people think that queuing is a way to avoid the need for transactions (and two-phase commit). Is it? Hardly: even if we neglect the resulting risk of message loss, you have to realize that most queuing systems use two-phase commit internally anyway. This invalidates the second criticism above.

  3. The often-heard criticism that transactions may block your data is not fair either.
    There is some interesting theoretical work done by Nancy Lynch (MIT) et al - I believe it is this one. Basically, this is mathematics that proves that you cannot have a non-blocking (read: perfect) solution for distributed agreement in realistic scenarios.
    In practice, this means that a queued operation may not make it if the connection to the receiver is down too long. So your system is 'blocked' in the queue, even though you don't use transactions. This is the equivalent of the perceived 'blocking' but now placed in a non-transactional scenario.

  4. Again on the perceived synchronization overhead: if you don't keep track of "what" you have done and "where" (by synchronizing) then you end up with an error-prone process. This is especially true for many critical applications that consume messages and insert the results in a database. If you don't use transactions then you will find yourself implementing duplicate message detection and/or duplicate elimination, neither of which is safe without the proper commit ordering. Basically, you are implementing a transaction manager yourself (yuk!).
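To see why duplicate detection depends on commit ordering, consider this minimal sketch using SQLite from the Python standard library. The table names and the `handle` function are assumptions for illustration. The "seen" marker and the result are written in one local transaction, so a redelivered message either finds both or neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (msg_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE results (msg_id TEXT, amount INTEGER)")
conn.commit()


def handle(msg_id, amount):
    """Store the result and the 'seen' marker atomically; skip redeliveries."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("INSERT INTO processed (msg_id) VALUES (?)", (msg_id,))
            conn.execute(
                "INSERT INTO results (msg_id, amount) VALUES (?, ?)",
                (msg_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: nothing was written


first = handle("m-1", 100)
redelivered = handle("m-1", 100)
stored = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

This only works because both tables live in the same database. As soon as the message acknowledgement and the database insert belong to different resources, something still has to coordinate their commits, which is exactly the job of a transaction manager.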

Am I saying that transactions and two-phase commit don't block? Not exactly - especially if you use XA, things can block. However, Atomikos avoids this in two ways:

  • Very strong heuristic support: unilateral decisions are encouraged both in the backend and in the Atomikos transaction manager. If a transaction takes too long, it is terminated anyhow. Where classical scenarios would block, Atomikos enforces a unilateral termination by either party. The resulting anomaly is reflected in the transaction logs, so the transaction manager can track problem cases (instead of letting you chase different systems to find out what happened - the alternative without transactions). Ironically, we have seen more blocks caused by non-XA transactions: if your database does not support an internal timeout mechanism for non-XA (which seems to be so in the most commonly used DBMS) then it will be non-XA transactions that cause the blocking! I can go on for hours about this - but that is another post.

  • Atomikos also offers local transactions with compensation instead of rollback: you can use our TCC (Try-Cancel/Confirm) API to handle overall rollback. This allows you to use non-XA, local transactions. It never blocks your application, ever! TCC is similar to WS-BA, only better because we have been working on it for much longer than anybody else in the world. See for more on TCC.
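The try/confirm/cancel flow can be sketched roughly as follows. This is not the Atomikos TCC API: `Participant`, `try_`, `confirm`, `cancel` and `run_tcc` are hypothetical names, and each call stands in for a short local (non-XA) transaction at the service concerned.

```python
class Participant:
    """Hypothetical service exposing try/confirm/cancel as local transactions."""

    def __init__(self, name, can_reserve=True):
        self.name = name
        self.can_reserve = can_reserve
        self.state = "idle"

    def try_(self):
        if not self.can_reserve:
            raise RuntimeError(f"{self.name} cannot reserve")
        self.state = "tried"

    def confirm(self):
        self.state = "confirmed"

    def cancel(self):
        self.state = "cancelled"


def run_tcc(participants):
    """Try everyone; confirm all on success, cancel the tried ones on failure."""
    tried = []
    try:
        for participant in participants:
            participant.try_()
            tried.append(participant)
    except RuntimeError:
        for participant in tried:
            participant.cancel()
        return False
    for participant in tried:
        participant.confirm()
    return True


flight = Participant("flight")
hotel = Participant("hotel", can_reserve=False)
outcome = run_tcc([flight, hotel])
```

Here the hotel refuses, so the flight reservation is cancelled and `outcome` is `False`. In this sketch the explicit cancel is only a courtesy: a participant that also cancels itself after a timeout would release its reservation even if `run_tcc` never got that far.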

Summing up then: do I recommend two-phase commit? Yes, if needed. In the past, this need arose out of legacy integration. In the present and future, that need arises out of up-front requirements. The most typical use cases are:

  • Processing persistent messages with exactly-once guarantees. There is no substitute for the reliability and ease of Atomikos ExtremeTransactions here. Note that this can be done intra-process!

  • Across processes/services if you have a reservation model inherent in your business process. Our TCC technology will make sure that your database never blocks.

More information about Atomikos products can be found here