Using GOTCHA¶
C++ code running on the Linux operating system can take advantage of the built-in GOTCHA functionality to insert timemory markers around external function calls. GOTCHA is a library that wraps functions and is used to place hooks into external libraries. It is similar to LD_PRELOAD, but operates via a programmable API. This enables tasks such as code instrumentation or wholesale replacement of mechanisms in programs without modifying their source code.
Writing a GOTCHA hook in timemory is greatly simplified, and applications using timemory can specify their own GOTCHA hooks in a few lines of code.
The gotcha component in timemory supports implicit extraction of the wrapped function's return type and arguments and significantly reduces the complexity of a traditional GOTCHA specification.
The GOTCHA library requires the name of the wrapped function as it appears in the binary. This name is simple for functions using C linkage (just the name of the function), but these names are more complex in C++ due to name mangling. Timemory includes limited support for using GOTCHA for C++ functions by providing a basic "mangler", e.g. tim::mangle<decltype(func)>(TIMEMORY_STRINGIZE(func)), where func would be ext::do_work for the function std::tuple<float, double> do_work(int64_t, const std::pair<float, double>&) in namespace ext (in general, mangling template functions is not supported – yet).
Use Cases¶
The gotcha component in timemory can provide either of the following functionalities:

- Scoped instrumentation around external dynamically-linked function calls
- Wholesale replacement of external dynamically-linked function calls
Traditional GOTCHA in C¶
Writing a traditional GOTCHA wrapper in C requires a bit of work, and the recommended method requires enabling GNU compiler extensions (-std=gnu11) in order to simplify using the function pointers via the typeof operator:
#include <gotcha/gotcha.h>
static gotcha_wrappee_t wrappee_puts_handle;
// this is the declaration of your wrapper
static int puts_wrapper(const char* str);
static int fputs_wrapper(const char* str, FILE* f);
static gotcha_wrappee_t wrappee_fputs_handle;
struct gotcha_binding_t wrap_actions[] = {
{ "puts", puts_wrapper, &wrappee_puts_handle },
{ "fputs", fputs_wrapper, &wrappee_fputs_handle },
};
int init_mytool()
{
gotcha_wrap(wrap_actions, sizeof(wrap_actions) / sizeof(struct gotcha_binding_t),
"my_tool_name");
}
static int fputs_wrapper(const char* str, FILE* f)
{
// insert clever tool logic here
// ...
// get my wrappee from Gotcha
typeof(&fputs_wrapper) wrappee_fputs = gotcha_get_wrappee(wrappee_fputs_handle);
// wrappee_fputs was directed to the original fputs by gotcha_wrap
return wrappee_fputs(str, f);
}
GOTCHA in timemory¶
A gotcha component in timemory has three template parameters:

- Size: the maximum number of functions that can be wrapped by this GOTCHA instance. This limit exists because a std::array is used instead of a std::vector, which ensures that a re-allocation does not invalidate any of the function pointers.
- Components: the specification of the timemory components to wrap around the function, i.e. component_tuple<...>, component_list<...>, or component_hybrid<...>. The components specified cannot contain another gotcha component.
- Differentiator: an optional template parameter (defaults to void) that can be any data type. Its sole purpose is to differentiate between gotcha components with identical Size and Components template parameters that wrap different functions.
A GOTCHA wrapper with timemory can be defined in a single line of code, and helper macros are provided that eliminate the need to specify the function signature (return type and arguments), since templates can extract these parameters.
template <size_t Nt, typename BundleT, typename DiffT>
struct gotcha : public tim::component::base<gotcha<Nt, BundleT, DiffT>, void>,
                public tim::concepts::external_function_wrapper

The gotcha component rewrites the global offset table such that calling the wrapped function actually invokes either a function wrapped by timemory instrumentation or a replacement timemory component with a function call operator (operator()) whose return value and arguments exactly match the original function. This component is only available on Linux and can only be applied to external, dynamically-linked functions (i.e. functions defined in a shared library). If the BundleT template parameter is a non-empty component bundle, this component will surround the original function call with:

bundle_type _obj{ "<NAME-OF-ORIGINAL-FUNCTION>" };
_obj.construct(_args...);
_obj.start();
_obj.audit("<NAME-OF-ORIGINAL-FUNCTION>", _args...);
Ret _ret = <CALL-ORIGINAL-FUNCTION>;
_obj.audit("<NAME-OF-ORIGINAL-FUNCTION>", _ret);
_obj.stop();
- tparam Nt: Max number of functions which will be wrapped by this component
- tparam BundleT: Component bundle to wrap around the function(s)
- tparam DiffT: Differentiator type to distinguish different sets of wrappers with identical values of Nt and BundleT (or to provide the function call operator if replacing functions instead of wrapping them)
If the BundleT template parameter is an empty variadic class, e.g. std::tuple<>, tim::component_tuple<>, etc., and the DiffT template parameter is a timemory component, the assumption is that the DiffT component has a function call operator which should replace the original function call, e.g. void* malloc(size_t) can be replaced with a component with void* operator()(size_t), e.g.:

// replace 'double exp(double)'
struct exp_replace : base<exp_replace, void>
{
    double operator()(double value)
    {
        float result = expf(static_cast<float>(value));
        return static_cast<double>(result);
    }
};
Example usage:
#include <timemory/timemory.hpp>

#include <cassert>
#include <cmath>
#include <tuple>

using empty_tuple_t = std::tuple<>;
using base_bundle_t = tim::component_tuple<wall_clock, cpu_clock>;
using gotcha_wrap_t = tim::component::gotcha<2, base_bundle_t, void>;
// 'exp_replace' is the replacement component defined in the snippet above
using gotcha_repl_t = tim::component::gotcha<2, empty_tuple_t, exp_replace>;
using impl_bundle_t = tim::mpl::append_type_t<base_bundle_t,
                                              tim::type_list<gotcha_wrap_t, gotcha_repl_t>>;

void init_wrappers()
{
    // wraps the sin and cos math functions
    gotcha_wrap_t::get_initializer() = []() {
        TIMEMORY_C_GOTCHA(gotcha_wrap_t, 0, sin);  // index 0 wraps sin
        TIMEMORY_C_GOTCHA(gotcha_wrap_t, 1, cos);  // index 1 wraps cos
    };

    // replaces the 'exp' function, which may appear in the symbol table
    // as 'exp' or '__exp_finite' (use `nm <binary>` to determine which)
    gotcha_repl_t::get_initializer() = []() {
        TIMEMORY_C_GOTCHA(gotcha_repl_t, 0, exp);
        TIMEMORY_DERIVED_GOTCHA(gotcha_repl_t, 1, exp, "__exp_finite");
    };
}

// the following is useful to avoid having to call 'init_wrappers()' explicitly:
// use comma operator to call 'init_wrappers' and return true
static auto called_init_at_load = (init_wrappers(), true);

int main()
{
    assert(called_init_at_load == true);

    double angle = 45.0 * (M_PI / 180.0);

    impl_bundle_t _obj{ "main" };

    // gotcha wrappers not activated yet
    printf("cos(%f) = %f\n", angle, cos(angle));
    printf("sin(%f) = %f\n", angle, sin(angle));
    printf("exp(%f) = %f\n", angle, exp(angle));

    // gotcha wrappers are reference counted according to start/stop
    _obj.start();
    printf("cos(%f) = %f\n", angle, cos(angle));
    printf("sin(%f) = %f\n", angle, sin(angle));
    printf("exp(%f) = %f\n", angle, exp(angle));
    _obj.stop();

    // gotcha wrappers will be deactivated
    printf("cos(%f) = %f\n", angle, cos(angle));
    printf("sin(%f) = %f\n", angle, sin(angle));
    printf("exp(%f) = %f\n", angle, exp(angle));

    return 0;
}
Function Wrapping¶
If an application wanted to insert tim::auto_timer around (unmangled) MPI_Allreduce and (mangled) ext::do_work in the following executable:
#include <mpi.h>
#include <vector>
int main(int argc, char** argv)
{
init();
MPI_Init(&argc, &argv);
int sizebuf = 100;
std::vector<double> sendbuf(sizebuf, 1.0);
// ... do some stuff
std::vector<double> recvbuf(sizebuf, 0.0);
MPI_Allreduce(sendbuf.data(), recvbuf.data(), sizebuf,
MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
// ... etc.
int64_t nitr = 10;
std::pair<float, double> settings{ 1.25f, 2.5 };
std::tuple<float, double> result = ext::do_work(nitr, settings);
// ... etc.
return 0;
}
This would be the required specification using the TIMEMORY_C_GOTCHA macro for unmangled functions and the TIMEMORY_CXX_GOTCHA macro for mangled functions:
#include <timemory/timemory.hpp>
static constexpr size_t NUM_FUNCS = 2;
using gotcha_t = tim::component::gotcha<NUM_FUNCS, tim::auto_timer_t>;
void init()
{
TIMEMORY_C_GOTCHA(gotcha_t, 0, MPI_Allreduce);
TIMEMORY_CXX_GOTCHA(gotcha_t, 1, ext::do_work);
}
// uses comma operator to call init() during static construction of boolean
static bool is_initialized = (init(), true);
Additionally, TIMEMORY_DERIVED_GOTCHA accepts the same parameters as TIMEMORY_C_GOTCHA and TIMEMORY_CXX_GOTCHA plus one extra parameter for specifying the symbol. For example, TIMEMORY_DERIVED_GOTCHA(gotcha_t, 0, exp, "__finite_exp") would create a gotcha based on the function signature of the standard C math function double exp(double), but it would wrap the symbol named "__finite_exp" in the binary (which may exist in lieu of exp with certain compiler flags – in this situation, it is generally safe to just wrap both).
Function Replacement¶
Suppose that an application spends a significant amount of run-time calling the standard math library's double-precision exp function and you would like to investigate whether single-precision expf is an acceptable substitute in certain regions. Instead of writing the full specification shown previously and manually enabling and disabling the wrapper in the region of interest, you can use timemory. Provided below is the full component specification required to implement the replacement function.
// NOTE: declared in tim::component::
namespace tim
{
namespace component
{
struct exp_intercept : public base<exp_intercept, void>
{
double operator()(double val)
{
return expf(static_cast<float>(val));
}
};
} // namespace component
} // namespace tim
When the exp_intercept component is appropriately configured within a gotcha component, whenever double exp(double) is invoked, timemory will (via the GOTCHA library) redirect the call to double exp_intercept::operator()(double) – and within this function, the replacement call to expf is implemented.
Configuring the gotcha component is slightly different, however. The goal of this component is optimization instead of measurement or analysis, so the gotcha component is specified as such:
using empty_t = component_tuple<>;
using exp_gotcha_t = gotcha<1, empty_t, exp_intercept>;
In other words, we define a gotcha component with an empty set of measurement/analysis components and then specify a component as the third template parameter. The combination of an empty measurement/analysis collection as the second template parameter and a component as the third template parameter triggers a specially optimized wrapper around the original function call, explicitly designed to minimize the overhead of the redirection to the wrapper.
All that remains is implementing the initializer that specifies which functions are wrapped by the gotcha component:
__attribute__((constructor))
void init_gotcha()
{
exp_gotcha_t::get_initializer() = [=]()
{ TIMEMORY_C_GOTCHA(exp_gotcha_t, 0, exp); };
}
In the above, the constructor attribute (only available with certain compilers) creates a function that is automatically executed before main starts. Since this function configures the gotcha within a callback instead of invoking TIMEMORY_C_GOTCHA directly, the gotcha wrapper is not activated during this function; the redirection of exp to expf is instead explicitly tied to the allocation of at least one instance of exp_gotcha_t.
Extended Example¶
Reference: source/tests/gotcha_tests.cpp
Let's revisit the unmangled (i.e. extern "C" if compiling in C++) function MPI_Allreduce and introduce a mangled function, ext::do_work.
MPI_Allreduce has the following signature (mpich):
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
and is unmangled in the binary:
$ nm --dynamic test_gotcha_mpi | grep MPI_Allreduce
U MPI_Allreduce
An imaginary C++ function do_work, defined in an external library, has the following signature:
namespace ext
{
std::tuple<float, double>
do_work(int, const std::pair<float, double>&);
} // namespace ext
and is mangled in the binary:
$ nm --dynamic test_gotcha_mpi | grep do_work
U _ZN3ext16do_work_crefEiRKSt4pairIfdE
Example Setup¶
- Define a component tuple for "always-on" within gotcha components
- Define a component list for runtime options within gotcha components
- Define a hybrid type (gotcha_hybrid_t) for bundling the tuple and list within the gotcha components
- Define a gotcha component (mpi_gotcha_t) for wrapping MPI_Allreduce
  - Size of 1
  - Wraps function with gotcha_hybrid_t
- Define a gotcha component (work_gotcha_t) for wrapping ext::do_work
  - Size of 1
  - Wraps function with gotcha_hybrid_t
  - Uses type int to differentiate itself from mpi_gotcha_t
- Define a component tuple for "always-on" in the application
  - Includes mpi_gotcha_t and work_gotcha_t
- Define a component list for runtime options in the application
  - Re-uses the same component list as the gotcha components
- Define a hybrid type (auto_hybrid_t) for bundling the tuple (which includes the gotcha components) and list in the application
Always-on in gotcha components¶
real_clock
cpu_clock
peak_rss
Optional in gotcha components¶
papi_array_t
cpu_roofline_sp_flops
cpu_roofline_dp_flops
Defining hybrid type for gotcha components¶
using gotcha_tuple_t = component_tuple< real_clock ,
cpu_clock ,
peak_rss >;
using gotcha_list_t = component_list< papi_array_t ,
cpu_roofline_sp_flops ,
cpu_roofline_dp_flops >;
using gotcha_hybrid_t = auto_hybrid< gotcha_tuple_t ,
gotcha_list_t >;
Defining gotcha components¶
using mpi_gotcha_t = gotcha< 1, gotcha_hybrid_t >;
using work_gotcha_t = gotcha< 1, gotcha_hybrid_t, int >;
// use 'int' to differentiate 'mpi_gotcha_t' from 'work_gotcha_t'
Defining hybrid type for application¶
using tuple_t = component_tuple< real_clock ,
cpu_clock ,
peak_rss ,
mpi_gotcha_t ,
work_gotcha_t >;
using list_t = gotcha_list_t;
using auto_hybrid_t = auto_hybrid< tuple_t ,
list_t >;
Configuring GOTCHA wrappers with a gotcha component¶

| Nomenclature | Description |
|---|---|
| wrappee | The original function wrapped by the gotcha component |
| wrapper | The new function that adds customization around the wrappee |

Each gotcha component contains a public static function gotcha_type::configure<Index, ReturnType, Args...> for explicit specification of a GOTCHA wrapper.

| Template Parameter | Restrictions | Required | Description |
|---|---|---|---|
| Index | 0 <= Index < Size | Yes | The "slot" within the component. Sequential order not required |
| ReturnType | Identical to wrappee | Yes | The return type of the original function (the "wrappee") |
| Args... | Identical to wrappee | No | The arguments to the wrappee (must be completely identical) |
Two helper macros are provided to simplify the explicit specification, for two reasons:

- Wrapping mangled functions explicitly can occasionally produce a correctly mangled name, but failure to perfectly forward the argument types will produce a segmentation fault
  - E.g. specifying an argument as int when it should be const int&
- Functions taking alias types will commonly fail to bind when the aliased type changes but the explicit specification did not use the alias
  - E.g. specifying unsigned long instead of uint64_t when uint64_t is defined by one OS as unsigned long but another OS uses unsigned long long
For C functions or extern "C" (unmangled) C++ functions, the helper macro is TIMEMORY_C_GOTCHA(gotcha_type, Index, Function). For mangled C++ functions, the helper macro is TIMEMORY_CXX_GOTCHA(gotcha_type, Index, Function).
| Macro Parameter | Restrictions | Required |
|---|---|---|
| gotcha_type | Fully specified gotcha component | Yes |
| Index | 0 <= Index < Size | Yes |
| Function | Function call w/o return or args | Yes |

| Function | Explicit |
|---|---|
| int MPI_Init(int*, char***) | gotcha_t::configure<0, int, int*, char***>("MPI_Init") |
| void modify(double&) | gotcha_t::configure<1, void, double&>(mangle<decltype(modify)>("modify")) |
| int& settings::get() | gotcha_t::configure<2, int&>(mangle<decltype(settings::get)>("settings::get")) |

| Function | Macro |
|---|---|
| int MPI_Init(int*, char***) | TIMEMORY_C_GOTCHA(gotcha_t, 0, MPI_Init) |
| void modify(double&) | TIMEMORY_CXX_GOTCHA(gotcha_t, 1, modify) |
| int& settings::get() | TIMEMORY_CXX_GOTCHA(gotcha_t, 2, settings::get) |
Testing MPI_Allreduce gotcha¶
The details::generate function uses an RNG to generate 1000 random entries, and details::allreduce calls MPI_Allreduce with MPI_SUM and the appropriate MPI_FLOAT or MPI_DOUBLE.
TEST_F(gotcha_tests, mpi_macro)
{
// configure callback when mpi_gotcha_t is activated
mpi_gotcha_t::get_initializer() = []()
{
TIMEMORY_C_GOTCHA(mpi_gotcha_t, 0, MPI_Allreduce);
};
TIMEMORY_BLANK_POINTER(auto_hybrid_t, details::get_test_name());
auto rank = tim::mpi::rank();
auto size = tim::mpi::size();
float fsum = 0.0;
double dsum = 0.0;
for(int i = 0; i < nitr; ++i)
{
auto fsendbuf = details::generate<float>(1000);
auto frecvbuf = details::allreduce(fsendbuf);
fsum += std::accumulate(frecvbuf.begin(), frecvbuf.end(), 0.0);
auto dsendbuf = details::generate<double>(1000);
auto drecvbuf = details::allreduce(dsendbuf);
dsum += std::accumulate(drecvbuf.begin(), drecvbuf.end(), 0.0);
}
for(int i = 0; i < size; ++i)
{
tim::mpi::barrier();
if(i == rank)
{
printf("[%i]> single-precision sum = %8.2f\n", rank, fsum);
printf("[%i]> double-precision sum = %8.2f\n", rank, dsum);
}
tim::mpi::barrier();
}
}
Testing do_work gotcha¶
The implementation of ext::do_work sums 1000 iterations of cos or cosf on a value and then increments the value by i * p.first or i * p.second, where i is the iteration number.
TEST_F(gotcha_tests, work_macro)
{
// configure callback when work_gotcha_t is activated
work_gotcha_t::get_initializer() = []()
{
TIMEMORY_CXX_GOTCHA(work_gotcha_t, 0, ext::do_work);
};
TIMEMORY_BLANK_POINTER(auto_hybrid_t, details::get_test_name());
float fsum = 0.0;
double dsum = 0.0;
for(int i = 0; i < nitr; ++i)
{
auto ret = ext::do_work(1000, pair_t(0.25, 0.125));
fsum += std::get<0>(ret);
dsum += std::get<1>(ret);
}
auto rank = tim::mpi::rank();
printf("[%i]> single-precision sum = %8.2f\n", rank, fsum);
printf("[%i]> double-precision sum = %8.2f\n", rank, dsum);
}
Default Behavior¶
MPI Test Command¶
./source/tests/gotcha_tests --gtest_filter=gotcha_tests.mpi_macro
MPI Test Output¶
[0]> single-precision sum = 50003576.00
[0]> double-precision sum = 50003802.88
[real]> Outputting 'timemory-gotcha-tests-output/real_0.txt'... Done
|0> [cxx] mpi_macro : 2.646358 sec real, 1 laps, depth 0 (exclusive: 88.5%)
|0> [cxx] |_MPI_Allreduce : 0.303283 sec real, 200000 laps, depth 1
[cpu]> Outputting 'timemory-gotcha-tests-output/cpu_0.txt'... Done
|0> [cxx] mpi_macro : 2.650000 sec cpu, 1 laps, depth 0 (exclusive: 86.8%)
|0> [cxx] |_MPI_Allreduce : 0.350000 sec cpu, 200000 laps, depth 1
[peak_rss]> Outputting 'timemory-gotcha-tests-output/peak_rss_0.txt'... Done
|0> [cxx] mpi_macro : 50256.000000 KB peak_rss, 1 laps, depth 0 (exclusive: 100.0%)
|0> [cxx] |_MPI_Allreduce : 0.000000 KB peak_rss, 200000 laps, depth 1
Work Test Command¶
./source/tests/gotcha_tests --gtest_filter=gotcha_tests.work_macro
Work Test Output¶
[0]> single-precision sum = -2416347.50
[0]> double-precision sum = -1829370.79
[real]> Outputting 'timemory-gotcha-tests-output/real_0.txt'... Done
|0> [cxx] work_macro : 2.879349 sec real, 1 laps, depth 0 (exclusive: 16.6%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 2.401164 sec real, 100000 laps, depth 1
[cpu]> Outputting 'timemory-gotcha-tests-output/cpu_0.txt'... Done
|0> [cxx] work_macro : 2.870000 sec cpu, 1 laps, depth 0 (exclusive: 16.7%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 2.390000 sec cpu, 100000 laps, depth 1
[peak_rss]> Outputting 'timemory-gotcha-tests-output/peak_rss_0.txt'... Done
|0> [cxx] work_macro : 13312.000000 KB peak_rss, 1 laps, depth 0 (exclusive: 100.0%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 0.000000 KB peak_rss, 100000 laps, depth 1
Enabling PAPI¶
MPI Test + PAPI_DP_OPS Command¶
export TIMEMORY_COMPONENT_LIST_INIT="papi_array"
export TIMEMORY_PAPI_EVENTS="PAPI_DP_OPS"
./source/tests/gotcha_tests --gtest_filter=gotcha_tests.mpi_macro
MPI Test + PAPI_DP_OPS Output¶
[0]> single-precision sum = 50004212.00
[0]> double-precision sum = 49998991.47
[papi-1]> Outputting 'timemory-gotcha-tests-output/papi-1_0.txt'... Done
|0> [cxx] mpi_macro : 401800004 DP operations papi-1, 1 laps, depth 0 (exclusive: 99.8%)
|0> [cxx] |_MPI_Allreduce : 800000 DP operations papi-1, 200000 laps, depth 1
[real]> Outputting 'timemory-gotcha-tests-output/real_0.txt'... Done
|0> [cxx] mpi_macro : 3.216234 sec real, 1 laps, depth 0 (exclusive: 87.5%)
|0> [cxx] |_MPI_Allreduce : 0.400651 sec real, 200000 laps, depth 1
[cpu]> Outputting 'timemory-gotcha-tests-output/cpu_0.txt'... Done
|0> [cxx] mpi_macro : 3.210000 sec cpu, 1 laps, depth 0 (exclusive: 87.9%)
|0> [cxx] |_MPI_Allreduce : 0.390000 sec cpu, 200000 laps, depth 1
[peak_rss]> Outputting 'timemory-gotcha-tests-output/peak_rss_0.txt'... Done
|0> [cxx] mpi_macro : 50552.000000 KB peak_rss, 1 laps, depth 0 (exclusive: 100.0%)
|0> [cxx] |_MPI_Allreduce : 0.000000 KB peak_rss, 200000 laps, depth 1
Work Test + PAPI_DP_OPS Command¶
export TIMEMORY_COMPONENT_LIST_INIT="papi_array"
export TIMEMORY_PAPI_EVENTS="PAPI_DP_OPS"
./source/tests/gotcha_tests --gtest_filter=gotcha_tests.work_macro
Work Test + PAPI_DP_OPS Output¶
[0]> single-precision sum = -2416347.50
[0]> double-precision sum = -1829370.79
[papi-1]> Outputting 'timemory-gotcha-tests-output/papi-1_0.txt'... Done
|0> [cxx] work_macro : 6260000004 DP operations papi-1, 1 laps, depth 0 (exclusive: 0.0%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 6259500000 DP operations papi-1, 100000 laps, depth 1
[real]> Outputting 'timemory-gotcha-tests-output/real_0.txt'... Done
|0> [cxx] work_macro : 3.490665 sec real, 1 laps, depth 0 (exclusive: 24.0%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 2.653306 sec real, 100000 laps, depth 1
[cpu]> Outputting 'timemory-gotcha-tests-output/cpu_0.txt'... Done
|0> [cxx] work_macro : 3.490000 sec cpu, 1 laps, depth 0 (exclusive: 26.4%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 2.570000 sec cpu, 100000 laps, depth 1
[peak_rss]> Outputting 'timemory-gotcha-tests-output/peak_rss_0.txt'... Done
|0> [cxx] work_macro : 13404.000000 KB peak_rss, 1 laps, depth 0 (exclusive: 100.0%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 0.000000 KB peak_rss, 100000 laps, depth 1
Enabling CPU Roofline¶
MPI Test + Double-Precision Roofline Command¶
export TIMEMORY_COMPONENT_LIST_INIT="cpu_roofline_dp_flops"
export TIMEMORY_ROOFLINE_MODE="op"
./source/tests/gotcha_tests --gtest_filter=gotcha_tests.mpi_macro
export TIMEMORY_ROOFLINE_MODE="ai"
./source/tests/gotcha_tests --gtest_filter=gotcha_tests.mpi_macro
MPI Test + Double-Precision Roofline Output¶
[0]> single-precision sum = 49997152.00
[0]> double-precision sum = 49995907.40
[ert::executor]> working-set = 64, max-size = 31457280, num-thread = 1, num-stream = 0, grid-size = 0, block-size = 0, align-size = 64, data-type = double
[cpu_roofline_op]> Outputting 'timemory-gotcha-tests-output/cpu_roofline_op_0.txt'... Done
[cpu_roofline_op]> Outputting 'timemory-gotcha-tests-output/cpu_roofline_op_0.json'... Done
|0> [cxx] mpi_macro : 3.215877 sec real, 125315736.339528 (DP operations) / sec, 1 laps, depth 0 (exclusive: 97.8%)
|0> [cxx] |_MPI_Allreduce : 0.360500 sec real, 2773925.735022 (DP operations) / sec, 200000 laps, depth 1
[real]> Outputting 'timemory-gotcha-tests-output/real_0.txt'... Done
[real]> Outputting 'timemory-gotcha-tests-output/real_0.json'... Done
|0> [cxx] mpi_macro : 3.215669 sec real, 1 laps, depth 0 (exclusive: 86.6%)
|0> [cxx] |_MPI_Allreduce : 0.431685 sec real, 200000 laps, depth 1
[cpu]> Outputting 'timemory-gotcha-tests-output/cpu_0.txt'... Done
[cpu]> Outputting 'timemory-gotcha-tests-output/cpu_0.json'... Done
|0> [cxx] mpi_macro : 3.220000 sec cpu, 1 laps, depth 0 (exclusive: 86.6%)
|0> [cxx] |_MPI_Allreduce : 0.430000 sec cpu, 200000 laps, depth 1
[peak_rss]> Outputting 'timemory-gotcha-tests-output/peak_rss_0.txt'... Done
[peak_rss]> Outputting 'timemory-gotcha-tests-output/peak_rss_0.json'... Done
|0> [cxx] mpi_macro : 50816.000000 KB peak_rss, 1 laps, depth 0 (exclusive: 100.0%)
|0> [cxx] |_MPI_Allreduce : 0.000000 KB peak_rss, 200000 laps, depth 1
[0]> single-precision sum = 50004272.00
[0]> double-precision sum = 49999650.33
[ert::executor]> working-set = 64, max-size = 31457280, num-thread = 1, num-stream = 0, grid-size = 0, block-size = 0, align-size = 64, data-type = double
[cpu_roofline_ai]> Outputting 'timemory-gotcha-tests-output/cpu_roofline_ai_0.txt'... Done
[cpu_roofline_ai]> Outputting 'timemory-gotcha-tests-output/cpu_roofline_ai_0.json'... Done
|0> [cxx] mpi_macro : 2.993197 sec real, 6160853780.000000 (L/S completed), 1 laps, depth 0 (exclusive: 59.5%)
|0> [cxx] |_MPI_Allreduce : 0.313040 sec real, 260801261.000000 (L/S completed), 200000 laps, depth 1
[real]> Outputting 'timemory-gotcha-tests-output/real_0.txt'... Done
[real]> Outputting 'timemory-gotcha-tests-output/real_0.json'... Done
|0> [cxx] mpi_macro : 2.992866 sec real, 1 laps, depth 0 (exclusive: 87.4%)
|0> [cxx] |_MPI_Allreduce : 0.378349 sec real, 200000 laps, depth 1
[cpu]> Outputting 'timemory-gotcha-tests-output/cpu_0.txt'... Done
[cpu]> Outputting 'timemory-gotcha-tests-output/cpu_0.json'... Done
|0> [cxx] mpi_macro : 2.980000 sec cpu, 1 laps, depth 0 (exclusive: 88.6%)
|0> [cxx] |_MPI_Allreduce : 0.340000 sec cpu, 200000 laps, depth 1
[peak_rss]> Outputting 'timemory-gotcha-tests-output/peak_rss_0.txt'... Done
[peak_rss]> Outputting 'timemory-gotcha-tests-output/peak_rss_0.json'... Done
|0> [cxx] mpi_macro : 50720.000000 KB peak_rss, 1 laps, depth 0 (exclusive: 100.0%)
|0> [cxx] |_MPI_Allreduce : 0.000000 KB peak_rss, 200000 laps, depth 1
MPI Test + Double-Precision Roofline Plot¶
Work Test + Single-Precision Roofline Command¶
export TIMEMORY_COMPONENT_LIST_INIT="cpu_roofline_sp_flops"
export TIMEMORY_ROOFLINE_MODE="op"
./source/tests/gotcha_tests --gtest_filter=gotcha_tests.work_macro
export TIMEMORY_ROOFLINE_MODE="ai"
./source/tests/gotcha_tests --gtest_filter=gotcha_tests.work_macro
Work Test + Single-Precision Roofline Output¶
[0]> single-precision sum = -2416347.50
[0]> double-precision sum = -1829370.79
[ert::executor]> working-set = 64, max-size = 31457280, num-thread = 1, num-stream = 0, grid-size = 0, block-size = 0, align-size = 64, data-type = float
[cpu_roofline_op]> Outputting 'timemory-gotcha-tests-output/cpu_roofline_op_0.txt'... Done
[cpu_roofline_op]> Outputting 'timemory-gotcha-tests-output/cpu_roofline_op_0.json'... Done
|0> [cxx] work_macro : 3.070875 sec real, 97659471.010205 (SP operations) / sec, 1 laps, depth 0
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 2.414884 sec real, 124146740.673063 (SP operations) / sec, 100000 laps, depth 1
[real]> Outputting 'timemory-gotcha-tests-output/real_0.txt'... Done
|0> [cxx] work_macro : 3.070598 sec real, 1 laps, depth 0 (exclusive: 20.9%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 2.427928 sec real, 100000 laps, depth 1
[cpu]> Outputting 'timemory-gotcha-tests-output/cpu_0.txt'... Done
|0> [cxx] work_macro : 3.070000 sec cpu, 1 laps, depth 0 (exclusive: 19.2%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 2.480000 sec cpu, 100000 laps, depth 1
[peak_rss]> Outputting 'timemory-gotcha-tests-output/peak_rss_0.txt'... Done
|0> [cxx] work_macro : 13392.000000 KB peak_rss, 1 laps, depth 0 (exclusive: 100.0%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 0.000000 KB peak_rss, 100000 laps, depth 1
[0]> single-precision sum = -2416347.50
[0]> double-precision sum = -1829370.79
[ert::executor]> working-set = 64, max-size = 31457280, num-thread = 1, num-stream = 0, grid-size = 0, block-size = 0, align-size = 64, data-type = float
[cpu_roofline_ai]> Outputting 'timemory-gotcha-tests-output/cpu_roofline_ai_0.txt'... Done
[cpu_roofline_ai]> Outputting 'timemory-gotcha-tests-output/cpu_roofline_ai_0.json'... Done
|0> [cxx] work_macro : 3.123588 sec real, 9426961903.000000 (L/S completed), 1 laps, depth 0 (exclusive: 1.1%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 2.457765 sec real, 7338005721.000000 (L/S completed), 100000 laps, depth 1
[real]> Outputting 'timemory-gotcha-tests-output/real_0.txt'... Done
|0> [cxx] work_macro : 3.123375 sec real, 1 laps, depth 0 (exclusive: 21.1%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 2.465038 sec real, 100000 laps, depth 1
[cpu]> Outputting 'timemory-gotcha-tests-output/cpu_0.txt'... Done
|0> [cxx] work_macro : 3.130000 sec cpu, 1 laps, depth 0 (exclusive: 18.5%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 2.550000 sec cpu, 100000 laps, depth 1
[peak_rss]> Outputting 'timemory-gotcha-tests-output/peak_rss_0.txt'... Done
|0> [cxx] work_macro : 13420.000000 KB peak_rss, 1 laps, depth 0 (exclusive: 100.0%)
|0> [cxx] |_ext::do_work(long, std::pair<float, double> const&) : 0.000000 KB peak_rss, 100000 laps, depth 1