Application Performance Management
Application Performance Management (APM) deals with analyzing the runtime behavior of software systems and monitoring their response time and resource consumption. We currently concentrate on the monitoring of Java programs and try to answer the research question of what an ideal Java VM should look like so that it provides analysis tools with the necessary information about executed programs. In particular, we work on memory monitoring (collecting data about allocated objects with their types and lifetimes, allocation sites, allocating threads, and the garbage collection behavior), execution monitoring (collecting execution profiles with method frequencies and dynamic calling contexts), and thread monitoring (analyzing the synchronization between threads, their locking behavior, and their waiting times). We also want to investigate anomalies such as memory leaks or race conditions.
Our research is based on Oracle’s Java HotSpot VM, which is one of the most widely used Java execution environments. Profiles are collected by dynamic instrumentation of programs, by efficient sampling techniques and by modifications of VM components such as the bytecode interpreter, the JIT compiler, and the garbage collector. Furthermore, we produce traces of relevant events during a program’s execution and use them for analysis and data mining. Finally, we apply various visualization techniques to show event traces, heap layouts, and dynamic calling context trees on various abstraction layers and with sophisticated filtering and searching mechanisms.
Lock contention tracing.
Software developers must write concurrent code to benefit from multiple cores and processors, but implementing correct and scalable locking for accessing shared resources remains a challenge. Examining lock contention in an application at runtime is vital to determine where more sophisticated but error-prone locking pays off. We devised a novel approach for analyzing lock contention in Java applications by tracing locking events in the Java HotSpot Virtual Machine. Unlike common methods, our approach observes not only when a thread is blocked on a lock, but also which other thread blocked it by holding that lock, and records both their stack traces. This reveals the causes of lock contention instead of showing only its symptoms. We further devised a versatile tool for the analysis of the traces which enables users to identify locking bottlenecks and their characteristics in an effective way. With a mean runtime overhead of 7.8% for real-world multi-threaded benchmarks, we consider our approach to be efficient enough to monitor production systems. More details and downloads…
Hofer, P.; Gnedt, D.; Schörgenhumer, A.; Mössenböck, H.: Efficient Tracing and Versatile Analysis of Lock Contention in Java Applications on the Virtual Machine Level. 7th Int’l Conf. on Performance Engineering (ICPE’16), March 12-18, 2016, Delft, The Netherlands.
Hofer, P.; Gnedt, D.; Mössenböck, H.: Efficient Dynamic Analysis of the Synchronization Performance of Java Applications. 13th Int’l Workshop on Dynamic Analysis (WODA’15) co-located with SPLASH’15, Oct 26, 2015, Pittsburgh, PA, USA.
Memory event tracing – Ant Tracks. In order to analyze GC delays, one has to understand what goes on in the heap in terms of allocations, reclamations, and object moves (collectively called memory events). Since it is not viable to monitor these events in real time, we produce a binary trace in which all events are stored in the order of their occurrence. When a delay is observed, the trace can be analyzed offline, which not only enables us to replay the memory events and to inspect the state of the heap at any point in time but also allows us to mine for patterns that might have caused the problem. For capturing the allocations, we modified the HotSpot client JIT compiler as well as the bytecode interpreter. For capturing GC events, we instrumented HotSpot’s Parallel GC. By using sophisticated instrumentation and buffering techniques, the tracing overhead could be kept down to 5% on average.
Lengauer, P.; Bitto, V.; Mössenböck, H.: Accurate and Efficient Object Tracing for Java Applications. 6th Int’l Conf. on Performance Engineering (ICPE’15), Jan 31 – Feb 4, 2015, Austin, TX, USA.
Bitto, V.; Lengauer, P.; Mössenböck, H.: Efficient Rebuilding of Large Java Heaps From Event Traces. 12th Int’l Conf. on Principles and Practice of Programming in Java (PPPJ’15), Sept 8-11, 2015, Melbourne, FL, USA.
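The offline replay of memory events described for Ant Tracks can be illustrated with a small sketch. It is not the Ant Tracks trace format or implementation: the Event class, its three event kinds, and the address-keyed map are hypothetical simplifications chosen only to show how replaying a prefix of the trace yields the heap state at that point in time.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: rebuild the set of live objects by replaying a memory event trace. */
final class HeapReplay {
    enum Kind { ALLOC, MOVE, FREE }

    /** Hypothetical event record; a real trace would use a compact binary encoding. */
    static final class Event {
        final Kind kind;
        final long address; // object address (source address for MOVE)
        final long target;  // destination address, used by MOVE only
        final String type;  // object type, used by ALLOC only
        final int size;     // object size in bytes, used by ALLOC only

        private Event(Kind kind, long address, long target, String type, int size) {
            this.kind = kind;
            this.address = address;
            this.target = target;
            this.type = type;
            this.size = size;
        }

        static Event alloc(long address, String type, int size) {
            return new Event(Kind.ALLOC, address, 0, type, size);
        }

        static Event move(long from, long to) {
            return new Event(Kind.MOVE, from, to, null, 0);
        }

        static Event free(long address) {
            return new Event(Kind.FREE, address, 0, null, 0);
        }
    }

    static final class ObjectInfo {
        final String type;
        final int size;

        ObjectInfo(String type, int size) {
            this.type = type;
            this.size = size;
        }
    }

    /** Replays the first upTo events and returns the live objects keyed by address. */
    static Map<Long, ObjectInfo> replay(List<Event> trace, int upTo) {
        Map<Long, ObjectInfo> heap = new TreeMap<>();
        int end = Math.max(0, Math.min(upTo, trace.size()));
        for (Event e : trace.subList(0, end)) {
            switch (e.kind) {
                case ALLOC:
                    heap.put(e.address, new ObjectInfo(e.type, e.size));
                    break;
                case MOVE:
                    ObjectInfo moved = heap.remove(e.address);
                    if (moved != null) heap.put(e.target, moved);
                    break;
                case FREE:
                    heap.remove(e.address);
                    break;
            }
        }
        return heap;
    }

    public static void main(String[] args) {
        List<Event> trace = Arrays.asList(
            Event.alloc(0x1000, "java.lang.String", 24),
            Event.alloc(0x1018, "int[]", 416),
            Event.free(0x1000),
            Event.move(0x1018, 0x1000));
        System.out.println(replay(trace, 2).size());   // live objects before the collection
        System.out.println(replay(trace, 4).keySet()); // live addresses after the collection
    }
}
```

Because the events are stored in order of occurrence, stopping the replay at any index gives the heap as it looked at that moment, which is what makes inspection at arbitrary points in time possible.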
Incremental stack tracing. This technique is an alternative to asynchronous stack sampling (described below). Instead of decoding the full stack for every sample, we decode only parts of it. When a sample is taken, only the topmost frame is decoded and the return address of the current method is patched so that it will return to a stub that will decode the caller frame and again patch the caller’s return address. In that way, the calling context tree is built incrementally and every stack frame is decoded only once. For selecting the sampling points, we also tried a new idea: As with JVMTI sampling, we take samples only at safepoint locations. However, we do not wait until all threads have reached such locations (which is a bottleneck in JVMTI sampling) but rather start the sampling when n threads have reached their safepoint locations, where n is the number of cores on the machine. The efficiency of incremental stack tracing lies between that of JVMTI sampling and asynchronous stack sampling. For the DaCapo benchmarks, we achieved overheads of 2%, 7%, and 23% for 100, 1000, and 10000 samples per second, which is still 2-4 times faster than traditional JVMTI sampling.
Hofer, P.; Gnedt, D.; Mössenböck, H.: Lightweight Java Profiling with Partial Safepoints and Incremental Stack Tracing. 6th Int’l Conf. on Performance Engineering (ICPE’15), Jan 31 – Feb 4, 2015, Austin, TX, USA.
Virtualization time accounting. With hardware virtualization, the hypervisor must frequently suspend one VM to execute another, “stealing time” from the suspended VM. Nevertheless, the stolen time is accounted as CPU time to the scheduled threads in the suspended VM. We devised a technique to reconstruct to what extent the threads of an application running in a virtualized environment are affected by suspension. We accomplish this by periodically sampling the CPU time of the individual threads as well as the steal time for the entire VM. We then assign a fraction of the VM’s steal time to each thread in proportion to the thread’s share of the entire VM’s CPU time. Using this technique, we were able to correct measurements of CPU time consumption.
Hofer, P.; Hörschläger, F.; Mössenböck, H.: Sampling-based Steal Time Accounting under Hardware Virtualization. Work in progress paper, 6th Int’l Conf. on Performance Engineering (ICPE’15), Jan 31 – Feb 4, 2015, Austin, TX, USA.
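The proportional attribution of steal time described above can be sketched in a few lines. This is only an illustration under assumptions: the class and method names are hypothetical, the VM's CPU time is approximated as the sum of the sampled thread CPU times, and the correction step simply subtracts each thread's estimated steal share from its reported CPU time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: distribute a VM-wide steal time sample over threads by their CPU time share. */
final class StealTimeAccounting {

    /**
     * @param threadCpuNanos CPU time charged to each thread in one sampling interval
     * @param vmStealNanos   steal time of the entire VM in the same interval
     * @return estimated steal time per thread
     */
    static Map<String, Long> stealShares(Map<String, Long> threadCpuNanos, long vmStealNanos) {
        // Assumption: the VM's CPU time is the sum of the sampled thread CPU times.
        long vmCpuNanos = 0;
        for (long t : threadCpuNanos.values()) vmCpuNanos += t;

        Map<String, Long> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : threadCpuNanos.entrySet()) {
            long share = vmCpuNanos == 0
                ? 0
                : Math.round((double) vmStealNanos * e.getValue() / vmCpuNanos);
            shares.put(e.getKey(), share);
        }
        return shares;
    }

    /** Assumption: corrected CPU time = reported CPU time minus the estimated steal share. */
    static Map<String, Long> correctedCpuTimes(Map<String, Long> threadCpuNanos, long vmStealNanos) {
        Map<String, Long> shares = stealShares(threadCpuNanos, vmStealNanos);
        Map<String, Long> corrected = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : threadCpuNanos.entrySet()) {
            corrected.put(e.getKey(), Math.max(0L, e.getValue() - shares.get(e.getKey())));
        }
        return corrected;
    }

    public static void main(String[] args) {
        Map<String, Long> cpu = new LinkedHashMap<>();
        cpu.put("worker-1", 600L);
        cpu.put("worker-2", 300L);
        cpu.put("gc", 100L);
        System.out.println(stealShares(cpu, 200L));
        System.out.println(correctedCpuTimes(cpu, 200L));
    }
}
```

In the example, a thread that accounts for 60% of the sampled CPU time is charged 60% of the interval's steal time.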
Automatic GC tuning. The Java HotSpot VM comes with 6 different garbage collectors, each of which has dozens of parameters for setting the size of local buffers, the tenuring age, the adaptation policies, and many other things. Depending on the application’s memory profile and workload, different parameter settings must be used to achieve optimal GC performance. Manual tuning is time-consuming and difficult. We therefore developed a technique for automatic GC parameter tuning based on a hill-climbing approach, which finds the optimum parameter settings for a given application and workload. The target function can be arbitrary (e.g., the overall GC time, the maximum GC pause time, or the maximum heap usage). Using a large number of benchmarks from the well-known DaCapo 2009 and SPECjbb 2005 benchmark suites, the GC times could be reduced by up to 77% and by 35% on average compared to the default parameter settings.
Lengauer, P.; Mössenböck, H.: The Taming of the Shrew: Increasing Performance by Automatic Parameter Tuning for Java Garbage Collectors. 5th Intl. Conf. on Performance Engineering (ICPE’14), March 22-26, 2014, Dublin, Ireland, pp.111-122.
Also read the Dynatrace APM blog article
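The hill-climbing idea behind automatic GC tuning can be sketched as a simple search over integer parameters. This is a minimal illustration, not the published algorithm: the class name, the step and bound arrays, and the synthetic cost function are assumptions. In a real setting the cost function would run the application with the candidate settings and return the chosen target metric (for example the overall GC time).

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

/** Sketch: hill climbing over integer parameters, minimizing an arbitrary target function. */
final class HillClimbingTuner {

    static int[] tune(int[] start, int[] step, int[] min, int[] max,
                      ToDoubleFunction<int[]> cost) {
        int[] best = start.clone();
        double bestCost = cost.applyAsDouble(best);
        boolean improved = true;
        while (improved) {
            improved = false;
            // Try one step down and one step up for every parameter in turn.
            for (int i = 0; i < best.length; i++) {
                for (int dir = -1; dir <= 1; dir += 2) {
                    int[] candidate = best.clone();
                    candidate[i] += dir * step[i];
                    if (candidate[i] < min[i] || candidate[i] > max[i]) continue;
                    double c = cost.applyAsDouble(candidate);
                    if (c < bestCost) { // keep only strict improvements
                        best = candidate;
                        bestCost = c;
                        improved = true;
                    }
                }
            }
        }
        return best; // a local optimum with respect to single-parameter steps
    }

    public static void main(String[] args) {
        // Hypothetical parameters: tenuring threshold and young generation size in MB.
        int[] start = {15, 64};
        int[] step = {1, 64};
        int[] min = {1, 64};
        int[] max = {15, 1024};
        // Synthetic stand-in for a measured GC time; its minimum is at (8, 512).
        ToDoubleFunction<int[]> syntheticGcTime =
            p -> Math.pow(p[0] - 8, 2) + Math.pow((p[1] - 512) / 64.0, 2);
        // Prints [8, 512] for this synthetic cost function.
        System.out.println(Arrays.toString(tune(start, step, min, max, syntheticGcTime)));
    }
}
```

The search stops as soon as no single-parameter step improves the target function, so on a cost surface with several local minima the result depends on the starting point.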
Asynchronous stack sampling. We developed efficient techniques for computing execution profiles and building dynamic calling context trees for Java programs. We used a sampling approach exploiting a feature of the Linux perf monitoring subsystem to produce timer interrupts at which a fragment of every stack is copied to a buffer. The fragments are then analyzed asynchronously by a background thread running on a separate core. In contrast to other Java-based sampling techniques (e.g., using JVMTI), samples can be taken anywhere and not only at safepoint locations, thus increasing the accuracy of the profiles and reducing the run-time overhead at the same time. The average overheads for the DaCapo benchmarks were 1%, 2%, and 10% for sampling rates of 100, 1000, and 10000 samples per second, which is about 5 times faster than the commonly used synchronous sampling in safepoints with JVMTI.
Hofer, P.; Mössenböck, H.: Efficient and Accurate Stack Trace Sampling in the Java Hotspot Virtual Machine. Work in progress paper, 5th Intl. Conf. on Performance Engineering (ICPE’14), March 22-26, 2014, Dublin, Ireland, pp.277-280.
Hofer, P.; Mössenböck, H.: Fast Java Profiling with Scheduling-aware Stack Fragment Sampling and Asynchronous Analysis. 11th Intl. Conf. on Principles and Practice of Programming in Java (PPPJ’14), Sept. 23-26, 2014, Cracow, Poland, pp.145-156.
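How sampled stacks are merged into a dynamic calling context tree can be sketched as follows. The example covers only the tree-building step and assumes that each sample has already been decoded into a list of method names, ordered from the outermost caller to the topmost frame. The class and field names are hypothetical, and neither the perf-based sampling nor the copying of stack fragments is modeled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: a calling context tree built by merging decoded stack samples. */
final class CallingContextTree {
    final String method;
    long selfSamples; // samples in which this node was the topmost frame
    final Map<String, CallingContextTree> children = new LinkedHashMap<>();

    CallingContextTree(String method) {
        this.method = method;
    }

    /** Merges one sample; frames are ordered from outermost caller to topmost frame. */
    void addSample(List<String> frames) {
        CallingContextTree node = this;
        for (String frame : frames) {
            // Reuse the child for an already known calling context, else create it.
            node = node.children.computeIfAbsent(frame, CallingContextTree::new);
        }
        node.selfSamples++;
    }

    /** Samples attributed to this context including all of its callees. */
    long totalSamples() {
        long sum = selfSamples;
        for (CallingContextTree child : children.values()) sum += child.totalSamples();
        return sum;
    }

    void print(String indent) {
        System.out.println(indent + method + " self=" + selfSamples + " total=" + totalSamples());
        for (CallingContextTree child : children.values()) child.print(indent + "  ");
    }

    public static void main(String[] args) {
        CallingContextTree root = new CallingContextTree("<root>");
        root.addSample(Arrays.asList("main", "parse", "readToken"));
        root.addSample(Arrays.asList("main", "parse"));
        root.addSample(Arrays.asList("main", "emit"));
        root.print("");
    }
}
```

Each node represents a method in one specific calling context, so the same method reached through two different call paths appears as two separate nodes.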
Feature-based memory monitoring. Very-large-scale software systems are often structured as software product lines consisting of features that can be individually selected by users for being added to the application. When selecting features it is useful to know their “costs” in terms of memory consumption. We developed a technique for capturing object allocations and deallocations and associating them with program features.
Lengauer, P.; Bitto, V.; Angerer, F.; Grünbacher, P.; Mössenböck, H.: Where Has All My Memory Gone? Determining Memory Characteristics of Product Variants using Virtual-Machine-Level Monitoring. 8th Intl. Workshop on Variability Modelling of Software-intensive Systems (VaMoS’14), January 22-24, 2014, Nice, pp.1-8.
Collecting GC metrics. We instrumented the JVM so that it collects data about factors that influence GC times, such as the number of allocated objects/bytes between GC runs, the number of objects/bytes reclaimed per GC run, the average and maximum age of surviving objects, the number of references between objects, and the average distance of references (which influences the caching behavior).
Lengauer, P.: VM-Level Memory Monitoring for Resolving Performance Problems. Doctoral Symposium at SPLASH’13, October 28, 2013, Indianapolis, USA, pp.29-32.