Introduction

Often in a product lifecycle, software engineers are faced with the daunting task of finding memory leaks in application software. Although the many memory profiling tools available (free or paid) can be used to find leaks in a development environment (for example, during feature development), finding leaks with such tools on customer premises, in a production environment, remains a difficult task.

Challenges in Live Deployments

Memory Profiling Tools

Memory profiling tools either provide instrumented binaries (for example, Sanitizer) or a profiling environment (for example, Valgrind). Neither kind of tool is suitable for deployment on customer premises, where performance and throughput criteria must be strictly met. A product owner or a service provider cannot expect their customers to accept scaled-down and non-deterministic operation of the solution deployed in production, where their own customers are affected by these troubleshooting procedures.

While tools like Sanitizer, with their instrumented binaries, offer some improvement in latency, they still do not guarantee the same performance as non-instrumented application binaries. On the other hand, tools that monitor application memory operations in a controlled environment, like Valgrind, fare worse: throughput is several orders of magnitude lower than when the application is deployed in its normal environment. Typically, customers are either hesitant or refuse outright to deploy such tools on their premises.

Application Log Levels

The other challenge is that the leak scenario is not known and cannot be deciphered just from the error/warning level logs procured from the customer. For the reasons mentioned earlier (performance, throughput, log file sizes, disk space), customers are hesitant to raise log levels to debug. And even if the customer does agree to raising the log levels, they would most likely do so only for a very limited time, which may not be sufficient to trigger the leak scenario.

Code Constructs

Some code constructs, like smart pointers, are not detected as leaks by most memory profiling tools. For example, if a smart pointer is never removed from a list or released by another object, the tool will still list that memory as validly referenced and will not report it as a leak. In addition to the above, organizational and political constraints between corporations can further inhibit the deployment scenarios described above.

Proposed Solution

The proposed solution aims to overcome the above impediments with simple generation of application logs and automated (that is, non-manual) analysis of these logs. Although the solution does not claim to cater to all scenarios, and it is explained using a C++ example, it provides an approach that can be adapted for different deployments and constructs.

Logging Constructors and Destructors of Objects

As mentioned, the solution approach is to simply log the construction and destruction of an object. This is achieved by inserting a log statement in the constructor and destructor definitions of that object’s class. The log statement should have:

  1. Timestamp
  2. A well-defined signature string, which helps in detecting (using simple string comparison) that the line is a constructor/destructor log. The signature should include the name of the object’s class.
  3. The object’s identifier. In C++, this can be the this pointer, which is the object’s memory location. No more than one object can have the same identifier at a given time (which holds true for this in C++).

Example

SrvRecord::SrvRecord(int ttl_value)
{ cmn_errorlog("constructor SrvRecord: %p", this); }

SrvRecord::~SrvRecord()
{ cmn_errorlog("destructor SrvRecord: %p", this); }

A resulting log line looks like:

[09-07-2021:05.36.36.353156] ERR#constructor SrvRecord: 0x7f9564746028#[cmn_dns_interface.cpp:434] 140280653137664 (null) (null) (null)

Log Leak Detector

Once the log output of the application (after the scenario run) is available, a LogLeakDetector program need only skim the log entries one by one, and:

  1. When the program encounters a log with a constructor signature, it stores that log line in a map with the object’s identifier (for example, the pointer value) as the key, and then increments a counter for that class.
  2. When the program encounters a log with a destructor signature, it removes the stored constructor log entry using the object identifier as the key (the object pointer printed in the log will be the same for both – constructor and destructor log statements). It then decrements the counter for that class.
  3. It skips lines having neither the constructor nor destructor signatures.
  4. After all the log lines of all the log files are analyzed and consumed in this manner, the counter value of each class is printed. These counts indicate, per class, constructors that had no corresponding destructors. Any class with an (abnormally) high count is then a leak candidate.
  5. Here you can find a downloadable flow chart that captures the points discussed above.
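
For illustration, below is a minimal sketch of such a detector in C++. The signature strings and parsing details are assumptions based on the log example shown earlier; the downloadable implementation referenced later is the version actually used.

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Sketch of a LogLeakDetector: scan log lines for constructor/destructor
// signatures and keep a per-class count of objects that are still "alive".
int main(int argc, char* argv[])
{
    if (argc < 2) {
        std::cerr << "usage: LogLeakDetector <logfile>\n";
        return 1;
    }

    std::map<std::string, long> classCounters;        // class -> outstanding objects
    std::map<std::string, std::string> liveObjects;   // class+pointer -> ctor log line

    std::ifstream in(argv[1]);
    std::string line;
    while (std::getline(in, line)) {
        // Detect the signature with a simple string search, as described above.
        std::size_t pos = line.find("constructor ");
        bool isCtor = (pos != std::string::npos);
        if (!isCtor)
            pos = line.find("destructor ");
        if (pos == std::string::npos)
            continue;                                 // neither signature: skip the line

        // Parse "<keyword> <Class>: <pointer>" starting at the matched position.
        std::istringstream rest(line.substr(pos));
        std::string keyword, cls, ptr;
        rest >> keyword >> cls >> ptr;                // cls keeps its trailing ':'
        std::size_t hash = ptr.find('#');
        if (hash != std::string::npos)
            ptr.erase(hash);                          // strip any "#[file:line]" decoration

        std::string key = cls + ptr;                  // unique while the object is alive
        if (isCtor) {
            liveObjects[key] = line;                  // keep the snippet for context clues
            ++classCounters[cls];
        } else {
            liveObjects.erase(key);
            --classCounters[cls];
        }
    }

    // Whatever remains had a constructor log with no matching destructor log.
    // Since cls retains its ':', this prints in the form "Class:count".
    for (const auto& entry : classCounters)
        std::cout << "Number of instances of " << entry.first << entry.second << "\n";
    return 0;
}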

Considerations

There are several considerations in the above simple algorithm, and we will attempt to go through them.

1. Why do we need to store the constructor log snippets? Can’t we simply increment and decrement the class counter when consuming a constructor and destructor log statement, respectively?

a.  That would indeed work. However, storing the log snippets with timestamps and other signature information can give us vital clues as to when the object was created, and what the related application logs were at that time and in that context.
b.  It could also give us other clues, such as when a group of similar objects were created.
c.  The signature of the log provides other contextual information.
d.  Even if the entire log statement is not saved, it is imperative to save some contextual data from that statement, such as timestamp, class, etc.
e.  In spite of the above, we will still treat this requirement as optional, in cases of prohibitively large amounts of logging.
f.  The LogLeakDetector could have an option to either store the contextual information or just increment/decrement the class counters.

2. The LogLeakDetector program should stop consuming constructor logs some time before the end of the log file set (each log file name should have a timestamp). This is especially true when the log file set is provided from a running application deployed at a customer site (that is, the application was not stopped, as it is supposed to keep running and providing its functionality, for example as a web server).

a. This is required so that objects constructed towards the end of the log set are not counted, as the destructors for these objects have not yet been logged. For example, if the average interval between the creation and destruction of an object is 30 minutes, constructors logged in the final 30 minutes should be ignored.
b. Since the destructors of such objects have not been logged, there will be no decrement in the counter for these objects’ classes. This would cause an artificial inflation in their counts, giving an impression of a memory leak when that might not be the case.
c. As such, the LogLeakDetector program must have an optional configuration denoting at which time/log file in the log set it should stop consuming constructors.
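
A sketch of the timestamp check behind that cutoff follows, assuming the bracketed "[DD-MM-YYYY:HH.MM.SS.usec]" prefix seen in the sample log line above (the helper name and the day/month ordering are assumptions; adjust to the actual log format):

#include <cstdio>
#include <ctime>
#include <string>

// Hypothetical helper: parse the "[DD-MM-YYYY:HH.MM.SS.usec]" prefix of a
// log line into epoch seconds, so constructor entries after a configured
// cutoff can be skipped while destructor entries are still consumed.
bool parseLogTime(const std::string& line, std::time_t& out)
{
    int day, mon, year, hour, min, sec;
    long usec;
    if (std::sscanf(line.c_str(), "[%d-%d-%d:%d.%d.%d.%ld]",
                    &day, &mon, &year, &hour, &min, &sec, &usec) != 7)
        return false;                                 // no timestamp prefix
    std::tm tm = {};
    tm.tm_mday = day;
    tm.tm_mon  = mon - 1;                             // tm months are 0-based
    tm.tm_year = year - 1900;
    tm.tm_hour = hour;
    tm.tm_min  = min;
    tm.tm_sec  = sec;
    out = std::mktime(&tm);
    return out != static_cast<std::time_t>(-1);
}

int main()
{
    std::time_t t;
    std::string sample = "[09-07-2021:05.36.36.353156] ERR#constructor SrvRecord: 0x7f9564746028";
    if (parseLogTime(sample, t))
        std::printf("epoch seconds: %ld\n", static_cast<long>(t));
    return 0;
}

In the scan loop of the detector sketch above, constructor lines whose parsed time exceeds the configured cutoff would simply be skipped.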

3. The LogLeakDetector can take an optional argument indicating at what regular intervals it should output all the class counter values. This would create a time series denoting the number of objects of the various classes existing in the application at various points in time. Such information can help visualize a pattern in object creation and accumulation; for example, after hours of running a maintenance audit, we see an increase in the number of objects of a particular class.
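
A minimal sketch of that option follows (the helper and its parameters are assumptions; this logic is not part of the downloadable program). It emits a counter snapshot whenever the log timestamps advance past the configured interval, producing the data points for a time-series graph:

#include <ctime>
#include <iostream>
#include <map>
#include <string>

// Hypothetical helper: called for every parsed log line; dumps all class
// counters whenever the configured interval has elapsed in log time.
void maybeDumpCounters(std::time_t lineTime,
                       std::time_t& nextDump,
                       long intervalSeconds,
                       const std::map<std::string, long>& classCounters)
{
    if (lineTime < nextDump)
        return;                                       // interval not yet elapsed
    std::cout << "--- snapshot at " << lineTime << " ---\n";
    for (const auto& entry : classCounters)
        std::cout << entry.first << " " << entry.second << "\n";
    nextDump = lineTime + intervalSeconds;            // schedule the next snapshot
}

int main()
{
    std::map<std::string, long> counters;
    counters["SrvRecord:"] = 3;

    std::time_t next = 0;
    maybeDumpCounters(1000, next, 600, counters);     // dumps, schedules next at 1600
    maybeDumpCounters(1300, next, 600, counters);     // within interval: no output
    maybeDumpCounters(1700, next, 600, counters);     // interval elapsed: dumps again
    return 0;
}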

The Real Deal

The above approach was used in an actual customer situation and proved effective in determining memory leaks. Here you can find the downloadable C++ implementation, a variation of the above approach, which was successfully used in detecting leaks in the customer deployment. It can be easily modified to include all the features mentioned in the flow chart. Some points to note:

1. As mentioned earlier, the C++ program is available for download.

a. It does not have the time-series data logic; however, it can be extended to provide it.
b. It uses classical C++ constructs (STL etc.) and should work on almost any platform.
c. The constructor and destructor signature detection code should be changed per the log signature implemented in the class definitions.

2. The log levels for the constructor and destructor logs were set to Error.

a. This allowed the customer to run it in the production environment without any additional overhead associated with debug logging; that is, the existing application debug logs were not written to the log files.
b. Towards the end of the patch run, the log level was increased to debug. This did indeed help us detect the root cause after correlating the debug level logs with the summary output findings.

3. The summary output from the actual deployment is provided in the downloadable pdf.

a. Here is a sample output. The high counts of some of the classes, as compared to others, did indeed correspond to a leak of those objects.
i. Number of instances of sip_msg_t:374
ii. Number of instances of sif_msg:1725
iii. Number of instances of sip_call_context_t:846
iv. Number of instances of DnsNaptrRecord:375561
v. Number of instances of DnsSrvRecord:354708
vi. Number of instances of DnsHostRecord:333856
vii. Number of instances of dns_result_t:20862

Points to Note

It must be admitted that this approach is specifically suited to cases where the memory allocation and de-allocation routines are defined by the user. As such, the signature logs can be added for user-defined types, but not for built-in types such as int, string, etc. However, one could be motivated to override and intercept operator new to inject such logs.
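
As a sketch of that idea, the global operator new and operator delete can be overridden to emit the same signature-style logs (the "raw_new" tag is a hypothetical label, not taken from the original implementation):

#include <cstdio>
#include <cstdlib>
#include <new>

// Sketch: intercept global allocation so every new/delete pair is logged
// with the same constructor/destructor signature format used earlier.
// (operator new[]/delete[] would be overridden similarly for completeness.)
void* operator new(std::size_t size)
{
    void* p = std::malloc(size);
    if (!p)
        throw std::bad_alloc();
    std::fprintf(stderr, "constructor raw_new: %p size=%zu\n", p, size);
    return p;
}

void operator delete(void* p) noexcept
{
    std::fprintf(stderr, "destructor raw_new: %p\n", p);
    std::free(p);
}

int main()
{
    int* n = new int(42);   // emits a "constructor" log line
    delete n;               // emits the matching "destructor" log line
    return 0;
}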

Similarly, the approach is best suited to object-oriented languages. However, the alloc/malloc/realloc and free invocations can be centralized in a routine wherein the log injections are possible.
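
A minimal sketch of such centralized wrappers follows, under the assumption that the code base can be made to call them instead of malloc/free directly (the names log_malloc/log_free are hypothetical):

#include <cstdio>
#include <cstdlib>

// Sketch: central allocation wrappers that emit the same signature logs,
// for code paths that deal in raw buffers rather than class objects.
void* log_malloc(std::size_t size, const char* tag)
{
    void* p = std::malloc(size);
    if (p)
        std::fprintf(stderr, "constructor %s: %p\n", tag, p);
    return p;
}

void log_free(void* p, const char* tag)
{
    std::fprintf(stderr, "destructor %s: %p\n", tag, p);
    std::free(p);
}

int main()
{
    void* buf = log_malloc(64, "io_buffer");   // counted by the detector as "io_buffer"
    log_free(buf, "io_buffer");                // matching destructor log balances the count
    return 0;
}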

This approach can be used for leak detection even in languages that have a garbage collector, such as Java. In Java, if an object is tucked away in a list or any similar container, it is counted as validly referenced and is not reclaimed by the garbage collector. This approach is perfectly suited to such situations.

Conclusion

We conclude that, though slightly unorthodox, this approach is useful in customer deployments where the usual memory profiler approaches cannot be used. It requires a patch that adds these signature constructor and destructor logs; however, besides the additional logs, there are no other changes from a normal production deployment binary. After the log set is collected, it can be used with LogLeakDetector for further analysis.

In our case, the customer was assured that there were no changes in the application other than the additional logging. This also helped the management (on both sides) accept a calculated risk in deploying this approach for a moderate amount of time in the customer’s environment.

We would like to express our gratitude to our co-workers (both technical and managerial) who supported this endeavor, without which it would have remained an idea, only on paper!

Author

Sahil Rangari