Archive for May, 2010


Instrumentation and Monitoring as a means of problem detection and resolution

While designing the last big project I worked on, we chose to place some monitoring code at specific points of the application. Those metrics emitted information targeted at business users as well as technical users, and the information was then split according to audience through the respective JMX agents (whether a Tivoli ITCAM agent or a JOPR agent).
At first we only expected this to give us valuable information about the live environment, such as when anything abnormal happened on the legacy software we were integrating with, but we ended up noticing that it also gave us valuable information about the operational behaviour of our own software, and the best part: in the live environment. In fact, this turns out to be so common that we can find others pointing in this direction as well.
The picture below presents an overview of the application in question as well as the instrumentation points.

Application overview with instrumentation points


The instrumentation points gathered the following information:

  • The instrumentation point on the left of the OuterMDB collected the number of messages processed in the current and last hour as well as the messages per second.
  • The instrumentation point depicted at the top of the OuterMDB collected the total time spent in pre-processing as well as the number of times pre-processing was invoked.
  • The instrumentation point on top of the InnerMDB collected the number of messages processed in the current and last hour as well as the messages per second.
  • And finally, the instrumentation point at the bottom of the InnerMDB collected the total time spent communicating with the legacy system, the average processing time per request in the current and last hour, the minimum and maximum processing times for the current and last hour, and the number of requests processed as well as the number of timeouts.
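As an illustration of how one of these metric points can be exposed, here is a minimal sketch of a standard JMX MBean (all class and attribute names are hypothetical, not the project's actual code):

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class InstrumentationDemo {

    // Standard JMX naming convention: the interface name is the
    // implementation class name plus "MBean", which makes the getters
    // visible as attributes to any JMX agent (ITCAM, JOPR, jconsole...).
    public interface MessageStatsMBean {
        long getMessagesCurrentHour();
        double getMessagesPerSecond();
    }

    public static class MessageStats implements MessageStatsMBean {
        private final AtomicLong count = new AtomicLong();
        private final long startMillis = System.currentTimeMillis();

        // Called by the instrumented component (e.g. the OuterMDB)
        // each time a message is processed.
        public void recordMessage() { count.incrementAndGet(); }

        @Override public long getMessagesCurrentHour() { return count.get(); }

        @Override public double getMessagesPerSecond() {
            long elapsed = Math.max(1, System.currentTimeMillis() - startMillis);
            return count.get() * 1000.0 / elapsed;
        }
    }

    public static void main(String[] args) throws Exception {
        MessageStats stats = new MessageStats();
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(stats,
                new ObjectName("app:type=MessageStats,point=OuterMDB"));
        stats.recordMessage();
        System.out.println(stats.getMessagesCurrentHour()); // prints 1
    }
}
```

The real implementation also rolled counters over at each hour boundary; that bookkeeping is omitted here for brevity.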

The comparison between the number of messages processed by the InnerMDB and the OuterMDB gave us a means of deciding how to size the thread pools for each of these components; this is information that would be hard to obtain any other way. We also used these metrics to detect malfunctions in the pre-processing legacy software invoked by our PreProcessing component, so that we could switch off pre-processing and avoid a negative impact on overall system performance.
But this monitoring was key to detecting a misbehavior in our JCA connector. A faulty implementation of the equals/hashCode method pair in the ManagedConnection led to severe performance degradation after a few hours of operation. Using our monitoring infrastructure we could isolate the problematic code area. It did not point directly at the equals/hashCode pair, but it was clear that the problem was related to the connection acquisition code.
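For illustration, here is a simplified stand-in for a ManagedConnection with a consistent equals/hashCode pair (not the actual connector code; the class and field names are invented). The point is that when the two methods disagree, hash-based pool lookups never recognize a logically identical connection, which is exactly the kind of slow degradation we saw:

```java
import java.util.HashSet;
import java.util.Set;

// Simplified stand-in for a JCA ManagedConnection. If equals and
// hashCode are inconsistent, the pool's hash-based matching fails,
// existing connections are never reused, and acquisition gets slower
// as the pool fills with duplicates.
public class PooledConnection {
    private final String connectionId;

    public PooledConnection(String connectionId) {
        this.connectionId = connectionId;
    }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PooledConnection)) return false;
        return connectionId.equals(((PooledConnection) o).connectionId);
    }

    // Must agree with equals: equal objects must produce equal hash codes.
    @Override public int hashCode() { return connectionId.hashCode(); }

    public static void main(String[] args) {
        Set<PooledConnection> pool = new HashSet<>();
        pool.add(new PooledConnection("legacy-1"));
        // With a consistent pair, the pool recognizes the same logical connection:
        System.out.println(pool.contains(new PooledConnection("legacy-1"))); // prints true
    }
}
```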
Finally, the monitoring in our application gave us an effective way of watching the legacy application we communicated with, since it provided no native way of monitoring its operation. We were then able to respond instantly to outages in the legacy application through metrics collected in our own software.


Designing for Extreme Transaction Processing – Memento Pattern

Applications with huge transaction processing requirements or tight response times always demand careful architecture design. Special care has to be taken in how data is partitioned in order to better parallelize load. This task can be even trickier if some of the data isn't suitable for proper partitioning. Being able to partition data is an essential requirement for some of the elastic frameworks around: some of them demand that data be local to the node processing the request, while others still work without it, but with a significant performance drop. The negative impact of data that is hard to partition and group can be mitigated at the cost of increased memory usage, since it is always possible to increase replication, forcing data onto multiple nodes to avoid serialization upon request; these tools lack the concept of gravitation that earlier distributed caches had.
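To make the partitioning idea concrete, here is a minimal sketch of key-based routing, the basic mechanism such grids use to keep all data sharing a key on a single node (the class name and key format are illustrative, not tied to any particular product):

```java
// Minimal key-based partitioner: every piece of data tagged with the
// same key is routed to the same partition, so a request touching only
// that key can be processed entirely on the node owning the partition.
public class Partitioner {
    private final int partitions;

    public Partitioner(int partitions) { this.partitions = partitions; }

    public int partitionFor(String key) {
        // Mask the sign bit so the index is always non-negative.
        return (key.hashCode() & 0x7fffffff) % partitions;
    }

    public static void main(String[] args) {
        Partitioner p = new Partitioner(4);
        // Deterministic routing: the same key always lands on the same partition.
        System.out.println(
            p.partitionFor("customer-42") == p.partitionFor("customer-42")); // prints true
    }
}
```

Data that cannot be tagged with such a key (the Type 1 case below) is exactly the data that defeats this scheme.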
Billy Newport (the mind behind IBM eXtreme Scale) proposed a classification scheme for the different styles of Extreme Transaction Processing (XTP) systems he identified:

  • Type 1: Non-partitionable data models – Applications in this category usually perform ad hoc queries that can span an unpartitionable amount of data, which leaves scaling up as the only scaling possibility.
  • Type 2a: Partitionable data models with limited needs for scaling out – Applications in this category already have means of partitioning data, and thus load, but still in a limited fashion. Hence these applications can still be built using regular application servers backed by sharded databases, elastic NoSQL databases or IMDGs, but won't use a Map/Reduce pattern for processing requests.
  • Type 2b: Partitionable data models with huge needs for scaling out – Finally, this category is composed of applications that have means of partitioning data as in Type 2a, but instead of facing limited load, Type 2b applications are pushed to the limit. Type 2b applications usually evolved from Type 2a and, when faced with the scalability limits of traditional architectures, moved from regular application servers to Map/Reduce frameworks. It is worth noting that it is possible to scale to similar load requirements with traditional architectures based on regular application servers, but they'll usually require more human resources for administration as well as more hardware infrastructure.

Among the common items of a classic Transaction Processing (TP) system that must be avoided in an XTP system are two-phase commit resources. Note that it is not that you can't have a two-phase commit resource as part of an XTP system, but each one must be carefully evaluated, and excess will certainly compromise system performance.
Billy presented (at the eXtreme Scale meet-the-experts session at IBM Impact 2010) an excellent example of how an excess of two-phase commit resources could undermine the performance of an e-commerce solution. In his example, the checkout process would, as part of a single transaction, scan the whole shopping cart and, for each product in it, update a fictional inventory database as well as a few other two-phase commit databases from other departments. If any item was out of stock, the transaction would be rolled back and an error presented to the user. Obviously this hypothetical system wouldn't scale much, since the cost of the transaction's long time span combined with the number of resources involved would be tremendous, not to mention the huge load on the transaction manager.
Instead of this rather obvious approach, Billy suggested that these updates could be chained and combined with the Memento design pattern: the updates would be applied sequentially, and if any of them failed, the Memento pattern would be used to revert the changes already applied. With this approach the contention on all the databases involved would be minimal and the application requirement would still be fulfilled.
This is one of many details that need to be carefully observed when designing XTP systems.
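A minimal sketch of this compensation chain (names are illustrative; in Billy's example each update would hit a different department's database, and the "memento" would capture enough state to undo the write):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Apply updates one by one, remembering how to undo each; on failure,
// undo in reverse order instead of holding a single long-running
// two-phase commit transaction across all resources.
public class CheckoutProcessor {

    interface Update {
        // Applies the update and returns its undo action (the "memento").
        Runnable apply() throws Exception;
    }

    public static boolean process(List<Update> updates) {
        Deque<Runnable> undoStack = new ArrayDeque<>();
        for (Update u : updates) {
            try {
                undoStack.push(u.apply());
            } catch (Exception failed) {
                // Revert already-applied updates, most recent first.
                while (!undoStack.isEmpty()) undoStack.pop().run();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        List<Update> updates = List.of(
            () -> { log.append("+A"); return () -> log.append("-A"); },
            () -> { throw new Exception("out of stock"); }
        );
        System.out.println(process(updates)); // prints false
        System.out.println(log);              // prints +A-A
    }
}
```

Each update here is an independent short transaction, so no database lock is held while the others run; the trade-off is that other readers may briefly observe the intermediate state before a compensation runs.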


distributed version control systems

About two years ago I first read something about distributed version control, in a post on a friend's blog (in Portuguese). I had never used or tried any distributed version control tool, and the name (and the idea, which seems to hand a lot of power to developers) didn't sound good to me.

More than a year later I found an (excellent and funny) Mercurial tutorial by Joel Spolsky. Since that day I can't wait to get into a good project to begin using Mercurial at my company. It isn't an easy task, because only CVS (yes, some crazy people still use that) and SVN (recently updated to 1.6; only a few months ago our server was still on 1.4!) are currently officially supported on my company's servers.

If you have any experience with a migration from SVN or CVS to Mercurial (or any other DVCS), please send us your feedback.

I’ll report my experiences as soon as possible.


