Roofline

An overview of the roofline components can be found here.

In general, a roofline plot requires measuring two quantities (NOTE: MOI == metric-of-interest):

Performance: MOI per unit time, e.g. GFLOPs/sec
Arithmetic Intensity (AI): MOI per byte, e.g. FLOPs/byte
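
As a minimal sketch of these two quantities, the snippet below computes them from raw measurements. The function names and the sample numbers are hypothetical illustrations and are not part of timemory.

```python
def performance(moi_count, seconds):
    """MOI per unit time, e.g. FLOPs/sec."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return moi_count / seconds


def arithmetic_intensity(moi_count, bytes_moved):
    """MOI per byte, e.g. FLOPs/byte."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return moi_count / bytes_moved


# Hypothetical measurement: 4e9 FLOPs in 2 seconds with 16e9 bytes of traffic
flops, seconds, nbytes = 4.0e9, 2.0, 16.0e9
gflops_per_sec = performance(flops, seconds) / 1.0e9  # 2.0 GFLOPs/sec
ai = arithmetic_intensity(flops, nbytes)               # 0.25 FLOPs/byte
```

Each measured region then becomes one point on the plot, with AI on the x-axis and performance on the y-axis.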

Generating Roofline Data

Assuming the code contains a tim::component::cpu_roofline<...> component, run the application once per roofline mode:

# execute and enable hardware counters for arithmetic intensity
TIMEMORY_ROOFLINE_MODE=ai ./test_cxx_roofline
# execute and enable hardware counters for operations
TIMEMORY_ROOFLINE_MODE=op ./test_cxx_roofline
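
The two runs above could also be scripted. The following is a sketch only: `run_roofline_modes` is a hypothetical helper, not a timemory function, and it assumes nothing beyond the `TIMEMORY_ROOFLINE_MODE` environment variable shown above.

```python
import os
import subprocess


def run_roofline_modes(command, modes=("ai", "op")):
    """Run `command` once per mode with TIMEMORY_ROOFLINE_MODE set.

    Returns a dict mapping each mode to the return code of that run.
    """
    results = {}
    for mode in modes:
        env = dict(os.environ)  # copy so the parent environment is untouched
        env["TIMEMORY_ROOFLINE_MODE"] = mode
        results[mode] = subprocess.run(command, env=env).returncode
    return results
```

For example, `run_roofline_modes(["./test_cxx_roofline"])` would execute the test binary first in `ai` mode and then in `op` mode.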

Roofline Python Module: `timemory.roofline`

[Figure: roofline plot (roofline.jpg)]

Generating Roofline Plot with `timemory.roofline`

Currently, some hardware counters cannot be accumulated in a single pass. As a result, the application must be executed twice (once in `ai` mode and once in `op` mode), and both output files are passed to `timemory.roofline` to generate the plot:

python -m timemory.roofline \
    -ai timemory-test-cxx-roofline-output/cpu_roofline_ai.json \
    -op timemory-test-cxx-roofline-output/cpu_roofline_op.json \
    -d

Option	Type	Description
`-ai`, `--arithmetic-intensity`	File	Input JSON with AI data
`-op`, `--operations`	File	Input JSON with Operation data
`-d`, `--display`	bool	Open a window with the plot
`-o`, `--output-file`	String	Output filename of roofline plot
`-D`, `--output-dir`	String	Output directory for plot
`--format`	Image file suffix	Image format to render
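
When the plot is generated from a script, the options in the table can be assembled into an argument list. This is a sketch under stated assumptions: `plot_command` is a hypothetical helper, it uses only the flags listed above, and the file names passed to it are placeholders.

```python
import sys


def plot_command(ai_json, op_json, display=False, output_file=None,
                 output_dir=None, image_format=None):
    """Assemble the argument list for `python -m timemory.roofline`."""
    cmd = [sys.executable, "-m", "timemory.roofline",
           "-ai", ai_json, "-op", op_json]
    if display:
        cmd.append("-d")                    # open a window with the plot
    if output_file is not None:
        cmd += ["-o", output_file]          # output filename of the plot
    if output_dir is not None:
        cmd += ["-D", output_dir]           # output directory for the plot
    if image_format is not None:
        cmd += ["--format", image_format]   # image format to render
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where timemory's Python package is installed.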

Executing an Application with `timemory.roofline`

python -m timemory.roofline -- ./test_cxx_roofline

Option	Type	Description
`-k`, `--keep-going`	bool	Continue even if execution returned non-zero exit code
`-t`, `--rtype`	Label	Roofline type
`-r`, `--rerun`	`ai`, `op`	Re-run this mode and not the other mode
`-d`, `--display`	bool	Open a window with the plot
`-o`, `--output-file`	String	Output filename of roofline plot
`-D`, `--output-dir`	String	Output directory for plot
`-n`, `--num-threads`	integer	Number of threads for the peak roofline calculation
`--format`	Image file suffix	Image format to render

Customizing the calculation of the “roof” for the Roofline

At the conclusion of the application, timemory runs a customizable set of calculations to determine the peak (“roof”) values. This functionality is provided through the tim::policy::global_finalize policy. By default, the roofline targets the multithreaded FMA (fused multiply-add) peak and calculates the bandwidth limits for L1, L2, L3, and DRAM.
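
To illustrate how such peak values form the “roof”, the sketch below applies the standard roofline bound, min(compute peak, AI × bandwidth), to each memory level. The peak and bandwidth numbers are hypothetical placeholders rather than measured values, and the function names are not part of timemory.

```python
def attainable(ai, peak_flops, bandwidth):
    """Roofline bound: min(compute peak, AI * memory bandwidth)."""
    return min(peak_flops, ai * bandwidth)


def ridge_point(peak_flops, bandwidth):
    """AI at which the bandwidth ceiling meets the compute peak."""
    return peak_flops / bandwidth


# Hypothetical peaks: compute in GFLOPs/sec, bandwidth in GB/sec
peak = 100.0
bandwidths = {"L1": 400.0, "L2": 200.0, "L3": 100.0, "DRAM": 25.0}

ai = 0.5  # FLOPs/byte
bounds = {level: attainable(ai, peak, bw) for level, bw in bandwidths.items()}
# bounds == {"L1": 100.0, "L2": 100.0, "L3": 50.0, "DRAM": 12.5}
```

With these placeholder numbers, a kernel at 0.5 FLOPs/byte is compute-bound when its data fits in L1 or L2 and bandwidth-bound when it is served from L3 or DRAM.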

Configuring number of threads in the Roofline

Environment Variable	Function
`TIMEMORY_ROOFLINE_NUM_THREADS`	`std::function<uint64_t()>& get_finalize_threads_function()`

Example:

cpu_roofline_dp_flops::get_finalize_threads_function() = []() { return 1; };

Full Customization of the Roofline Model

Full customization of the roofline model can be accomplished through:

tim::ert::exec_data<T> which handles the execution measurements
tim::ert::counter<DeviceT, T, DataT> which handles the accumulation of the execution measurements
tim::ert::configuration<DeviceT, T, DataT> which handles the configuration data such as the number of threads, streams, alignment, etc.
tim::ert::executor<DeviceT, T, DataT> which handles the algorithms and workflow of the

using Tp         = double;
using device_t   = tim::device::cpu;
using params_t   = tim::ert::exec_params;
using wall_t     = tim::component::wall_clock;
using data_t     = tim::ert::exec_data<wall_t>;
using counter_t  = tim::ert::counter<device_t, double, data_t>;
using config_t   = tim::ert::configuration<device_t, double, data_t>;
// the executor receives the shared execution data that the counter accumulates into
using data_ptr_t = std::shared_ptr<data_t>;
using roofline_t = tim::component::cpu_roofline<double>;

// sets up the configuration
config_t::get_executor() = [=](data_ptr_t data) {
    // test getting the cache info
    auto l1_size = tim::ert::cache_size::get<1>();
    auto l2_size = tim::ert::cache_size::get<2>();
    auto l3_size = tim::ert::cache_size::get<3>();
    auto lm_size = tim::ert::cache_size::get_max();

    auto     dtype        = tim::demangle<double>();
    uint64_t max_size     = 8 * lm_size;
    uint64_t align_size   = 64;
    auto     num_threads  = config_t::get_num_threads()();
    auto     working_size = config_t::get_min_working_size()();

    // log the cache info
    std::cout << "[INFO]> L1 cache size: " << (l1_size / tim::units::kilobyte)
                << " KB, L2 cache size: " << (l2_size / tim::units::kilobyte)
                << " KB, L3 cache size: " << (l3_size / tim::units::kilobyte)
                << " KB, max cache size: " << (lm_size / tim::units::kilobyte)
                << " KB\n\n"
                << "[INFO]> num-threads      : " << num_threads << "\n"
                << "[INFO]> min-working-set  : " << working_size << " B\n"
                << "[INFO]> max-data-size    : " << max_size << " B\n"
                << "[INFO]> alignment        : " << align_size << "\n"
                << "[INFO]> data type        : " << dtype << "\n"
                << std::endl;

    params_t  params(working_size, max_size, num_threads);
    counter_t _counter(params, data, align_size);

    return _counter;
};

// does the execution of ERT
auto callback = [=](counter_t& _counter) {
    // these are the kernel functions we want to calculate the peaks with
    auto store_func = [](double& a, const double& b) { a = b; };
    auto add_func   = [](double& a, const double& b, const double& c) { a = b + c; };
    auto fma_func   = [](double& a, const double& b, const double& c) { a = a * b + c; };

    // set bytes per element
    _counter.bytes_per_element = sizeof(double);
    // set number of memory accesses per element from two functions
    _counter.memory_accesses_per_element = 2;

    // set the label
    _counter.label = "scalar_add";
    // run the operation _counter kernels
    tim::ert::ops_main<1>(_counter, add_func, store_func);

    // set the label
    _counter.label = "vector_fma";
    // run the kernels (<4> is ideal for avx, <8> is ideal for KNL)
    tim::ert::ops_main<4, 8>(_counter, fma_func, store_func);
};

// set the callback
roofline_t::set_executor_callback<double>(callback);