timemory  3.0.0
C++ and Python Timing and Memory Tracking

Timing + Memory + Hardware Counter Utilities for C / C++ / CUDA / Python


timemory on GitHub (Source code)

timemory General Documentation

timemory Source Code Documentation (Doxygen)

timemory Testing Dashboard (CDash)

GitHub:          git clone https://github.com/NERSC/timemory.git
PyPI:            pip install timemory
Anaconda Cloud:  conda install -c jrmadsen timemory

Why Use timemory?

  • Direct access to performance analysis data in Python and C++
  • Header-only interface for the majority of C++ components
  • Variadic interface to all the utilities from C code
  • Variadic interface to all the utilities from C++ code
  • Variadic interface to all the utilities from Python code
    • Includes context-managers and decorators
  • Create your own components: any one-time measurement or start/stop paradigm can be wrapped with timemory
  • Flexible and easily extensible interface: no data type restrictions in custom components
  • High-performance: template meta-programming and lambdas result in extensive inlining
  • Ability to arbitrarily switch and combine different measurement types anywhere in the application
  • Provides static reporting (fixed at compile-time), dynamic reporting (selected at run-time), or hybrid
    • Enable static wall-clock and cpu-clock reporting with ability to dynamically enable hardware-counters at runtime
  • Arbitrarily add support for:
    • CPU hardware counters via PAPI without an explicit PAPI dependency and zero #ifdef
    • GPU hardware counters via CUPTI without an explicit CUPTI dependency and zero #ifdef
    • Generating a Roofline for performance-critical sections
    • Extensive tools provided by Caliper including TAU
    • Colored CUDA NVTX markers
    • Memory usage
    • Wall-clock, cpu-clock, system-clock timing
    • Number of bytes read/written to file-system (and rate)
    • Number of context switches
    • Trip counts
    • CUDA kernel runtime(s)
    • GOTCHA wrappers around external library function calls
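
The variadic interface and the ability to combine measurement types can be pictured with a small sketch. The names bundle, call_counter, and nesting_depth below are hypothetical and are not timemory types; the sketch only assumes that a component exposes start() and stop(), and it stays within C++11.

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <type_traits>

// Hypothetical component: counts how many times start() was called.
struct call_counter
{
    std::int64_t calls = 0;
    void start() { ++calls; }
    void stop() {}
};

// Hypothetical component: tracks the deepest start()/stop() nesting seen.
struct nesting_depth
{
    std::int64_t depth     = 0;
    std::int64_t max_depth = 0;
    void start() { if(++depth > max_depth) max_depth = depth; }
    void stop() { --depth; }
};

// Hypothetical bundle: forwards start()/stop() to every component it holds.
// The component list is fixed at compile time, so no virtual calls are needed.
template <typename... Components>
class bundle
{
public:
    void start() { start_from<0>(); }
    void stop() { stop_from<0>(); }

    // access the I-th component to read its data
    template <std::size_t I>
    typename std::tuple_element<I, std::tuple<Components...>>::type& get()
    {
        return std::get<I>(m_data);
    }

private:
    // recurse over the tuple indices; the second overload ends the recursion
    template <std::size_t I>
    typename std::enable_if<(I < sizeof...(Components))>::type start_from()
    {
        std::get<I>(m_data).start();
        start_from<I + 1>();
    }
    template <std::size_t I>
    typename std::enable_if<(I == sizeof...(Components))>::type start_from() {}

    template <std::size_t I>
    typename std::enable_if<(I < sizeof...(Components))>::type stop_from()
    {
        std::get<I>(m_data).stop();
        stop_from<I + 1>();
    }
    template <std::size_t I>
    typename std::enable_if<(I == sizeof...(Components))>::type stop_from() {}

    std::tuple<Components...> m_data;
};
```

Changing what is measured then amounts to changing the template argument list, e.g. bundle<call_counter> versus bundle<call_counter, nesting_depth>, without touching the instrumented code.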


Timemory is a generic C++11 template library providing a variety of performance components for reporting timing, resource usage, hardware counters for the CPU and GPU, roofline generation, and simplified generation of GOTCHA wrappers to instrument external library function calls.

Timemory also provides Python and C interfaces.


The goal of the package is to provide an easy way to regularly report on the performance of your code. If you have ever added something like this in your code:

import time
tstart = time.time()
# do something
tstop = time.time()
print("Elapsed time: {}".format(tstop - tstart))

Timemory streamlines this work. In C++ codes, all you have to do is include the headers. It comes in handy especially when optimizing a certain algorithm or section of your code – you just insert a line of code that specifies what you want to measure and run your code: initialization and output are automated.
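
In C++, the manual start/stop pattern shown above is commonly wrapped in a scope-based helper. The following is a minimal sketch of that idea using only the standard library; scoped_timer is a hypothetical name and is not part of timemory, which automates this kind of bookkeeping (and the output) for you.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <utility>

// Hypothetical RAII helper (not a timemory type): measures the wall-clock
// time between construction and destruction and passes it to a callback.
class scoped_timer
{
public:
    using clock_type = std::chrono::steady_clock;
    using callback_t = std::function<void(const std::string&, double)>;

    scoped_timer(std::string label, callback_t report)
    : m_label(std::move(label))
    , m_report(std::move(report))
    , m_start(clock_type::now())
    {}

    // non-copyable: each instance owns exactly one measurement
    scoped_timer(const scoped_timer&) = delete;
    scoped_timer& operator=(const scoped_timer&) = delete;

    ~scoped_timer()
    {
        // elapsed time in seconds since construction
        std::chrono::duration<double> elapsed = clock_type::now() - m_start;
        if(m_report)
            m_report(m_label, elapsed.count());
    }

private:
    std::string            m_label;
    callback_t             m_report;
    clock_type::time_point m_start;
};
```

Placing a scoped_timer at the top of a block times that block; the callback decides whether the result is printed, accumulated, or written to a file.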

Profiling and timemory

Timemory is not a full profiler. It is intended to supplement profilers, not to replace them: profilers remain important for discovering where to place timemory markers. The library provides an easy-to-use method for always-on general HPC analysis metrics (i.e. timing, memory usage, etc.) with the same or less overhead than if these metrics were recorded and stored in a custom solution: there is zero polymorphism and, for C++ code, the implementation is extensively inlined. Functionally, the overhead is non-existent: sampling profilers (e.g. gperftools, VTune) at standard sampling rates barely notice the presence of timemory unless it has been used very unwisely.

Additional tools, such as hardware counters, are provided to increase optimization productivity. Want to check whether your changes increased data locality (i.e. decreased cache misses) but don't care about any other sections of the code? Use the following and set TIMEMORY_PAPI_EVENTS="PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM" in the environment:

and delete it when finished. Those three extra lines of code may reduce the time spent in the usual cycle (changing code, running the profiler, opening the output in the profiler, finding the region of interest, comparing to previous results, and repeating) from 4 hours to 1.
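
As an aside on the environment setting above, a comma-separated event list such as the TIMEMORY_PAPI_EVENTS value can be read with a few lines of standard C++. This is only an illustrative sketch: split_events and events_from_env are hypothetical helpers, and timemory's own parsing of this variable may differ.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split "A,B, C" into {"A", "B", "C"}, trimming spaces
// and tabs around each entry and skipping empty entries.
inline std::vector<std::string> split_events(const std::string& csv)
{
    std::vector<std::string> events;
    std::stringstream        ss(csv);
    std::string              token;
    while(std::getline(ss, token, ','))
    {
        const auto first = token.find_first_not_of(" \t");
        if(first == std::string::npos)
            continue;  // entry was empty or whitespace only
        const auto last = token.find_last_not_of(" \t");
        events.push_back(token.substr(first, last - first + 1));
    }
    return events;
}

// Hypothetical helper: read the list from an environment variable,
// returning an empty list when the variable is not set.
inline std::vector<std::string> events_from_env(const char* name)
{
    const char* value = std::getenv(name);
    return value ? split_events(value) : std::vector<std::string>{};
}
```

Calling events_from_env("TIMEMORY_PAPI_EVENTS") with the setting shown above would yield the three cache-miss event names as separate strings.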

In general, profilers are not run frequently enough and performance degradation or memory bloat can go undetected for several commits until a production run crashes or underperforms. This generally leads to a scramble to detect which revision caused the issue. Here, timemory can decrease performance regression identification time. When timemory is combined with a continuous integration reporting system, this scramble can be mitigated fairly quickly because the high-level reporting provided allows one to associate a region and commit with exact performance numbers. Once timemory has been used to help identify the offending commit and identify the general region in the offending code, a full profiler should be launched for the fine-grained diagnosis.
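
The continuous-integration check described above can be as simple as comparing per-region numbers against a stored baseline. The sketch below assumes the measurements have already been collected into maps from region name to seconds; the regression struct, find_regressions, and the tolerance parameter are hypothetical and not part of timemory.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical record of one region that got slower than allowed.
struct regression
{
    std::string region;
    double      baseline;
    double      current;
};

// Hypothetical check: report every region whose current time exceeds the
// baseline by more than the given relative tolerance (0.10 means 10%).
inline std::vector<regression> find_regressions(
    const std::map<std::string, double>& baseline,
    const std::map<std::string, double>& current, double tolerance)
{
    std::vector<regression> out;
    for(const auto& entry : current)
    {
        auto it = baseline.find(entry.first);
        if(it == baseline.end())
            continue;  // new region: nothing to compare against
        if(entry.second > it->second * (1.0 + tolerance))
            out.push_back({ entry.first, it->second, entry.second });
    }
    return out;
}
```

A CI job could fail the build when the returned list is non-empty, which ties the slowdown to a specific region and commit before a full profiler is brought in.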

Create Your Own Tools/Components

There are numerous instrumentation APIs available but very few provide the ability for users to create tools/components that will fully integrate with the instrumentation API in their code. The simplicity of creating a custom component can be easily demonstrated in ~30 LOC with the trip_count component:

namespace tim {
namespace component {
struct trip_count : public base<trip_count, int64_t>
{
    using value_type = int64_t;
    using this_type  = trip_count;
    using base_type  = base<this_type, value_type>;

    static const short precision = 0;
    static const short width     = 5;
    static const std::ios_base::fmtflags format_flags =
        std::ios_base::fixed | std::ios_base::dec | std::ios_base::showpoint;

    static int64_t     unit() { return 1; }
    static std::string label() { return "trip_count"; }
    static std::string description() { return "trip counts"; }
    static std::string display_unit() { return ""; }
    static value_type  record() { return 1; }

    value_type get_display() const { return accum; }
    value_type get() const { return accum; }

    // The start()/stop() bodies below are an assumption, reconstructed from the
    // base-class members set_started(), set_stopped(), value and accum.
    void start()
    {
        set_started();
        value += 1;
        accum += 1;
    }
    void stop() { set_stopped(); }
};
} // namespace component
} // namespace tim
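
The component above derives from a base class that takes the component itself as a template argument (the curiously recurring template pattern). The following self-contained sketch shows why that pattern needs no virtual functions; mini_base, mini_trip_count, and measure are hypothetical stand-ins, and timemory's real base class is considerably richer.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for a CRTP base: holds the data and uses the derived
// type directly, so every call is resolved at compile time.
template <typename Derived, typename Tp>
struct mini_base
{
    using value_type = Tp;

    value_type value      = Tp();
    value_type accum      = Tp();
    bool       is_running = false;

    void set_started() { is_running = true; }
    void set_stopped() { is_running = false; }

    // calls into the derived class without any virtual dispatch
    std::string describe() const
    {
        const Derived& self = static_cast<const Derived&>(*this);
        return Derived::label() + " = " + std::to_string(self.get());
    }
};

// Hypothetical component: counts how many times start() has been called.
struct mini_trip_count : mini_base<mini_trip_count, std::int64_t>
{
    static std::string label() { return "mini_trip_count"; }

    void start()
    {
        set_started();
        accum += 1;
    }
    void       stop() { set_stopped(); }
    value_type get() const { return accum; }
};

// Works with any type that provides start() and stop().
template <typename Component, typename Func>
void measure(Component& c, Func&& f)
{
    c.start();
    f();
    c.stop();
}
```

Because measure is a template, swapping mini_trip_count for another component type changes what is recorded without changing the call site.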

GOTCHA and timemory

C++ codes running on the Linux operating system can take advantage of the built-in GOTCHA functionality to insert timemory markers around external function calls. GOTCHA (https://github.com/LLNL/GOTCHA) is similar to LD_PRELOAD but operates via a programmable API. This includes limited support for C++ function mangling (in general, mangling of template functions is not yet supported).

Writing a GOTCHA hook in timemory is greatly simplified and applications using timemory can specify their own GOTCHA hooks in a few lines of code instead of being restricted to a pre-defined set of GOTCHA hooks.
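
Conceptually, a hook puts a wrapper in front of a function: the wrapper records something, forwards the call, and returns the result. The sketch below shows only that wrap-and-forward shape with a plain function pointer. It is not how GOTCHA works internally (GOTCHA redirects calls at the dynamic-linking level), and call_wrapper, wrap, and example_add are hypothetical names.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical wrapper: counts calls and forwards them to the wrapped function.
template <typename Ret, typename... Args>
class call_wrapper
{
public:
    using func_t = Ret (*)(Args...);

    explicit call_wrapper(func_t target)
    : m_target(target)
    {}

    Ret operator()(Args... args)
    {
        ++m_calls;  // the "measurement": one more call observed
        return m_target(std::forward<Args>(args)...);
    }

    std::int64_t calls() const { return m_calls; }

private:
    func_t       m_target;
    std::int64_t m_calls = 0;
};

// Deduces the return and argument types from the function pointer.
template <typename Ret, typename... Args>
call_wrapper<Ret, Args...> wrap(Ret (*target)(Args...))
{
    return call_wrapper<Ret, Args...>(target);
}

// Stand-in for an external library function.
inline int example_add(int a, int b) { return a + b; }
```

With auto w = wrap(&example_add), each call w(a, b) behaves like example_add(a, b) while the wrapper keeps its own count. The difference with a GOTCHA hook is that the caller's source does not need to change.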

Example GOTCHA

Suppose an application wants to insert tim::auto_timer around (unmangled) MPI_Allreduce and (mangled) ext::do_work in the following executable:

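#include <mpi.h>
#include <cstdint>
#include <tuple>
#include <utility>
#include <vector>

// ext::do_work comes from an external library; this declaration is an
// assumed signature inferred from the call below
namespace ext {
std::tuple<float, double> do_work(int64_t, const std::pair<float, double>&);
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int sizebuf = 100;
    std::vector<double> sendbuf(sizebuf, 1.0);
    // ... do some stuff
    std::vector<double> recvbuf(sizebuf, 0.0);
    MPI_Allreduce(sendbuf.data(), recvbuf.data(), sizebuf, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    // ... etc.
    int64_t nitr = 10;
    std::pair<float, double> settings{ 1.25f, 2.5 };
    std::tuple<float, double> result = ext::do_work(nitr, settings);
    // ... etc.
    MPI_Finalize();  // shut down MPI before exiting
    return 0;
}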

The required specification uses the TIMEMORY_C_GOTCHA macro for unmangled functions and the TIMEMORY_CXX_GOTCHA macro for mangled functions:

static constexpr size_t NUM_FUNCS = 2;
// gotcha_t is the alias for the timemory GOTCHA component type sized for
// NUM_FUNCS wrappers; its definition is not shown in this listing
void init()
{
    TIMEMORY_C_GOTCHA(gotcha_t, 0, MPI_Allreduce);
    TIMEMORY_CXX_GOTCHA(gotcha_t, 1, ext::do_work);
}

Additional Information

For more information, refer to the documentation.
