timemory  3.0.1
C++ and Python Timing and Memory Tracking

Timing + Memory + Hardware Counter Utilities for C / C++ / CUDA / Python

timemory on GitHub (Source code)

timemory General Documentation

timemory Source Code Documentation (Doxygen)

timemory Testing Dashboard (CDash)

GitHub:          git clone https://github.com/NERSC/timemory.git
PyPI:            pip install timemory
Anaconda Cloud:  conda install -c jrmadsen timemory

Timemory is a performance measurement and analysis framework.

Why Use timemory?

  • Timemory is arguably the most customizable performance measurement and analysis API available
  • High-performance: very low overhead when enabled and borderline negligible overhead when disabled at runtime
  • Ability to arbitrarily switch and combine different measurement types anywhere in an application
  • Provides static reporting (fixed at compile-time), dynamic reporting (selected at run-time), or hybrid
    • Enable static wall-clock and cpu-clock reporting with ability to dynamically enable hardware-counters at runtime

Support for Multiple Instrumentation Marker APIs

  • NVTX for Nsight-Systems and NVprof
  • LIKWID
  • Caliper
  • TAU
  • ittnotify (Intel VTune and Advisor)

Create Your Own Performance and Analysis Tools

  • Written in C++
  • Direct access to performance analysis data in Python and C++
  • Create your own components: any one-time measurement or start/stop paradigm can be wrapped with timemory
    • Flexible and easily extensible interface: no data type restrictions in custom components

Generic Bundling of Multiple Tools

  • CPU hardware counters via PAPI
  • NVIDIA GPU hardware counters via CUPTI
  • NVIDIA GPU tracing via CUPTI
  • Generating a Roofline for performance-critical sections on the CPU and NVIDIA GPUs
  • Memory usage
  • Tool insertion around malloc, calloc, free, cudaMalloc, cudaFree
  • Wall-clock, cpu-clock, system-clock timing
  • Number of bytes read/written to file-system (and rate)
  • Number of context switches
  • Trip counts
  • CUDA kernel runtime(s)

Powerful GOTCHA Extensions

  • GOTCHA is an API for wrapping function calls, similar to LD_PRELOAD but driven programmatically
    • Timemory significantly simplifies existing GOTCHA implementations
  • Scoped GOTCHA
  • Use the gotcha component to replace external function calls with your own instrumentation
  • Use the gotcha component to instrument external library calls

Multi-language Support

  • Variadic interface to all the utilities from C code
  • Variadic interface to all the utilities from C++ code
  • Variadic interface to all the utilities from Python code
    • Includes context-managers and decorators

Overview

Timemory is a generic C++11 template library providing a variety of performance components for reporting timing, resource usage, hardware counters for the CPU and GPU, roofline generation, and simplified generation of GOTCHA wrappers to instrument external library function calls.

Timemory also provides Python and C interfaces.

Purpose

The goal of the package is to provide an easy way to regularly report on the performance of your code. If you have ever added something like this to your code:

import time

tstart = time.perf_counter()
# do something
tstop = time.perf_counter()
print("Elapsed time: {}".format(tstop - tstart))

Timemory streamlines this work. In C++ codes, all you have to do is include the headers. It comes in handy especially when optimizing a certain algorithm or section of your code – you just insert a line of code that specifies what you want to measure and run your code: initialization and output are automated.
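
The manual pattern above can be packaged once and reused. The following is a minimal sketch using only the Python standard library. It is not timemory's Python API, and the names `timed` and `REPORT` are hypothetical, chosen for illustration. It shows how one object can serve as both a context manager and a decorator, so that measuring a region costs a single line.

```python
import time
from contextlib import ContextDecorator

# Hypothetical in-memory report: label -> list of elapsed seconds.
REPORT = {}


class timed(ContextDecorator):
    """Record wall-clock time for a labeled region (illustrative only).

    Not re-entrant: a recursive call through the same instance would
    overwrite the stored start time.
    """

    def __init__(self, label):
        self.label = label

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        elapsed = time.perf_counter() - self._start
        REPORT.setdefault(self.label, []).append(elapsed)
        return False  # never swallow exceptions from the measured region


@timed("sum_squares")  # decorator form: every call is measured
def sum_squares(n):
    return sum(i * i for i in range(n))


with timed("whole_run"):  # context-manager form: one block is measured
    total = sum_squares(10_000)

for label, samples in REPORT.items():
    print("{}: {:.6f} s over {} call(s)".format(label, sum(samples), len(samples)))
```

Collecting results in one place and printing them at the end mirrors the idea that initialization and output should not be the caller's concern.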

Profiling and timemory

Timemory is not a full profiler (yet). The ultimate goal is to create a customizable profiler. Currently, timemory supports explicit instrumentation (i.e. minor modifications to source code) and explicit wrapping of dynamically-linked functions. A profiler is therefore still important for discovering where to place timemory markers or which dynamically-linked function calls to wrap with GOTCHA. The library provides an easy-to-use method for always-on general HPC analysis metrics (e.g. timing, memory usage) with the same or less overhead than if these metrics were recorded and stored in a custom solution and, for C++ code, extensively inlined. Functionally, the overhead is non-existent: sampling profilers (e.g. gperftools, VTune) at standard sampling rates barely notice the presence of timemory unless it has been used very unwisely.

Additional tools, such as hardware counters, are provided to increase optimization productivity. Want to check whether your changes increased data locality (i.e. decreased cache misses) but don't care about any other sections of the code? Use the following and set TIMEMORY_PAPI_EVENTS="PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM" in the environment:

//
// do something in region of interest...
//

and delete it when finished. It's three extra LOC that may reduce the time spent on the cycle of changing code, running the profiler, opening the output in the profiler, finding the ROI, comparing to previous results, and repeating, from 4 hours to 1.

In general, profilers are not run frequently enough and performance degradation or memory bloat can go undetected for several commits until a production run crashes or underperforms. This generally leads to a scramble to detect which revision caused the issue. Here, timemory can shorten the time needed to identify a performance regression. When timemory is combined with a continuous integration reporting system, this scramble can be cut short because the high-level reporting provided allows one to associate a region and commit with exact performance numbers. Once timemory has been used to help identify the offending commit and the general region in the offending code, a full profiler should be launched for the fine-grained diagnosis.
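
The commit-to-commit comparison described above can be sketched as a small check that a CI job might run. This is an illustration only: the report layout (a mapping from region label to seconds), the function name `find_regressions`, and the 10% tolerance are assumptions, not timemory's output format or defaults.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return {label: (old, new, relative_change)} for regions that slowed down.

    baseline/current: hypothetical per-region timings in seconds,
    e.g. parsed from the reports produced for two commits.
    """
    regressions = {}
    for label, old in baseline.items():
        new = current.get(label)
        if new is None or old <= 0.0:
            continue  # region was removed or has no usable baseline
        change = (new - old) / old
        if change > tolerance:
            regressions[label] = (old, new, change)
    return regressions


# Made-up numbers purely for demonstration.
baseline = {"solver": 2.00, "io": 0.50, "setup": 0.10}
current = {"solver": 2.60, "io": 0.51, "setup": 0.10}

for label, (old, new, change) in find_regressions(baseline, current).items():
    print("{}: {:.2f} s -> {:.2f} s ({:+.0%})".format(label, old, new, change))
```

Because the comparison is per region, a failing check already names both the commit and the part of the code to hand to a full profiler.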

Create Your Own Tools/Components

There are numerous instrumentation APIs available but very few provide the ability for users to create tools/components that will fully integrate with the instrumentation API in their code. The simplicity of creating a custom component that inherits category-based formatting properties (is_timing_category) and timing unit conversion (uses_timing_units) can be easily demonstrated in ~50 LOC with the wall_clock component:

namespace tim
{
namespace component { struct wall_clock; }

namespace trait
{
// opt in to timing-category formatting and timing-unit conversion
template <> struct is_timing_category<component::wall_clock> : std::true_type {};
template <> struct uses_timing_units<component::wall_clock> : std::true_type {};
}  // namespace trait

namespace component
{
//
// the system's real time (i.e. wall time) clock, expressed as the
// amount of time since the epoch.
//
struct wall_clock : public base<wall_clock, int64_t>
{
    using ratio_t    = std::nano;
    using value_type = int64_t;
    using base_type  = base<wall_clock, value_type>;

    static std::string label() { return "wall"; }
    static std::string description() { return "wall time"; }

    // take a single reading of the clock, in units of ratio_t
    static value_type record()
    {
        return tim::get_clock_real_now<int64_t, ratio_t>();
    }

    double get_display() const { return get(); }
    double get() const
    {
        // report the accumulated total if available, else the last reading
        auto val = (is_transient) ? accum : value;
        return static_cast<double>(val) / ratio_t::den * get_unit();
    }

    void start()
    {
        set_started();
        value = record();  // reading that stop() measures against
    }
    void stop()
    {
        auto tmp = record();
        accum += (tmp - value);
        value = std::move(tmp);
        set_stopped();
    }
};
}  // namespace component
}  // namespace tim
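
The same start/stop structure carries over to other measurements. As a rough sketch in Python (standard library only, not timemory's API; the class names `Component` and `CpuClock` are hypothetical), a base class can own the start/stop bookkeeping while a derived component supplies only `record()` and its label:

```python
import time


class Component:
    """Hypothetical base: accumulates the difference between two record() calls."""

    label = "component"

    def __init__(self):
        self.value = 0        # last raw reading
        self.accum = 0        # total over all start/stop pairs
        self.running = False

    @staticmethod
    def record():
        raise NotImplementedError

    def start(self):
        self.running = True
        self.value = self.record()

    def stop(self):
        tmp = self.record()
        self.accum += tmp - self.value
        self.value = tmp
        self.running = False

    def get(self):
        return self.accum


class CpuClock(Component):
    """CPU time consumed by this process, in nanoseconds."""

    label = "cpu"

    @staticmethod
    def record():
        return time.process_time_ns()


clock = CpuClock()
for _ in range(3):
    clock.start()
    sum(i * i for i in range(50_000))  # region of interest
    clock.stop()

print("{}: {:.3f} ms".format(clock.label, clock.get() / 1e6))
```

Swapping `record()` for any other one-shot reading (bytes allocated, context switches, and so on) yields a new component without touching the accumulation logic.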

GOTCHA and timemory

C++ codes running on the Linux operating system can take advantage of the built-in GOTCHA functionality to insert timemory markers around external function calls. GOTCHA is similar to LD_PRELOAD but operates via a programmable API. This includes limited support for C++ function mangling (in general, mangled template functions are not yet supported).

Writing a GOTCHA hook in timemory is greatly simplified and applications using timemory can specify their own GOTCHA hooks in a few lines of code instead of being restricted to a pre-defined set of GOTCHA hooks.

Example GOTCHA

Suppose an application wants to insert tim::auto_timer around (unmangled) MPI_Allreduce and (mangled) ext::do_work in the following executable:

#include <mpi.h>
#include <cstdint>
#include <tuple>
#include <utility>
#include <vector>

// init() installs the GOTCHA wrappers (defined in the specification that follows);
// ext::do_work comes from an external library whose header is not shown
void init();

int main(int argc, char** argv)
{
    init();

    MPI_Init(&argc, &argv);

    int sizebuf = 100;
    std::vector<double> sendbuf(sizebuf, 1.0);
    // ... do some stuff
    std::vector<double> recvbuf(sizebuf, 0.0);

    MPI_Allreduce(sendbuf.data(), recvbuf.data(), sizebuf, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    // ... etc.

    int64_t nitr = 10;
    std::pair<float, double> settings{ 1.25f, 2.5 };
    std::tuple<float, double> result = ext::do_work(nitr, settings);
    // ... etc.

    MPI_Finalize();
    return 0;
}

This would be the required specification using the TIMEMORY_C_GOTCHA macro for unmangled functions and TIMEMORY_CXX_GOTCHA macro for mangled functions:

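static constexpr size_t NUM_FUNCS = 2;

// gotcha_t must name a tim::component::gotcha specialization with room for
// NUM_FUNCS wrappers; its definition is not shown here.
void init()
{
    TIMEMORY_C_GOTCHA(gotcha_t, 0, MPI_Allreduce);   // index 0: unmangled C symbol
    TIMEMORY_CXX_GOTCHA(gotcha_t, 1, ext::do_work);  // index 1: mangled C++ symbol
}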
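
Conceptually, each hook replaces a function with a wrapper that measures and then forwards to the original. GOTCHA does this at the dynamic-linking level; the Python sketch below only mimics the idea by temporarily rebinding a module attribute. It uses the standard library only, and `scoped_wrap` and `CALLS` are hypothetical names rather than part of timemory or GOTCHA.

```python
import math
import time
from contextlib import contextmanager

# Hypothetical call log: function name -> list of elapsed seconds.
CALLS = {}


@contextmanager
def scoped_wrap(module, name):
    """Instrument module.name inside the with-block, then restore the original."""
    original = getattr(module, name)

    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)  # forward to the real function
        finally:
            CALLS.setdefault(name, []).append(time.perf_counter() - start)

    setattr(module, name, wrapper)
    try:
        yield
    finally:
        setattr(module, name, original)  # wrapper is active only in this scope


with scoped_wrap(math, "sqrt"):
    inside = math.sqrt(16.0)   # goes through the wrapper and is recorded

outside = math.sqrt(25.0)      # original function again, not recorded
```

The caller's code is unchanged in both cases; only the binding behind the name differs, which is the property that makes this style of instrumentation attractive for external libraries.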

Additional Information

For more information, refer to the documentation.
