PAPI
About PAPI
PAPI (Performance Application Programming Interface) provides a portable interface to the hardware performance counters built into modern processors. PAPI is an open-source project maintained by the PAPI Project.
Build recipes and compilation instructions for this installation are available at:
https://gitlab.discoverer.bg/vkolev/recipes/-/tree/main/papi
What PAPI can do for you
PAPI enables you to:
Measure Hardware Performance Counters
Access CPU cycles, instruction counts, and execution metrics
Monitor cache behaviour (hits, misses, accesses)
Track branch prediction accuracy
Measure floating-point operation rates
Analyse memory bandwidth and access patterns
Profile Application Performance
Identify performance bottlenecks at the hardware level
Understand why code runs slowly (cache misses, branch mispredictions, etc.)
Compare performance across different algorithms or implementations
Measure the impact of compiler optimizations
Optimize Code Based on Data
Use hardware metrics to guide optimization efforts
Detect cache-unfriendly memory access patterns
Identify branch prediction issues
Measure floating-point efficiency
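A minimal sketch of what a cache-unfriendly access pattern looks like, assuming a row-major matrix stored in a flat array. The function names are hypothetical and not part of PAPI. Both functions compute the same sum, but the second one jumps a whole row ahead on every step, which on most hardware tends to produce more L1 data cache misses (PAPI_L1_DCM) once the matrix no longer fits in cache.

```c
#include <stddef.h>

/* Row-major traversal: consecutive iterations read adjacent addresses,
 * so every cache line that is fetched gets fully used. */
double sum_row_major(const double *m, size_t rows, size_t cols)
{
    double sum = 0.0;
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            sum += m[r * cols + c];
        }
    }
    return sum;
}

/* Column-major traversal of the same row-major array: consecutive
 * iterations are cols * sizeof(double) bytes apart, so for large
 * matrices each access is likely to land on a different cache line. */
double sum_col_major(const double *m, size_t rows, size_t cols)
{
    double sum = 0.0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            sum += m[r * cols + c];
        }
    }
    return sum;
}
```

Wrapping each call in its own PAPI_start()/PAPI_stop() pair with PAPI_L1_DCM in the event set should make the difference visible, although the size of the effect depends on the cache sizes and the matrix dimensions.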
Cross-Platform Performance Analysis
Write performance measurement code once, run on multiple architectures
Compare performance across different hardware platforms
Conduct reproducible performance studies
Basic Workflow
The typical workflow for using PAPI is straightforward:
Initialize PAPI - Set up the library
Create event set and add events - Choose what to measure
Start counters - Begin measurement
Execute code - Run the code you want to profile
Stop counters and read values - Get the performance data
Cleanup - Release resources
With these steps, you can measure and analyse the performance characteristics of your applications at the hardware level, providing insights that are impossible to obtain from simple timing measurements alone.
Prerequisites
Load required modules:
# Load LLVM module (required for clang and clang++)
module load llvm
# Load PAPI module
module load papi/7/7.2.0
Verify PAPI is available:
papi_version
papi_avail
Verify compiler availability:
# Check GCC (usually available by default)
gcc --version
g++ --version
# Check clang and clang++ versions and availability (requires llvm module to be loaded)
clang --version
clang++ --version
Compiling PAPI applications
Basic compilation
With gcc:
module load papi/7
gcc -o my_program my_program.c -lpapi
With clang:
module load papi/7
module load llvm
clang -o my_program my_program.c -lpapi
With optimization
For production code, use optimization flags:
With gcc:
module load papi/7
gcc -O3 -o my_program my_program.c -lpapi
With clang:
module load papi/7
module load llvm
clang -O3 -o my_program my_program.c -lpapi
Using module environment variables
The PAPI module sets up compiler flags automatically:
With gcc:
module load papi/7
gcc $CFLAGS -o my_program my_program.c $LDFLAGS -lpapi
With clang:
module load papi/7
module load llvm
clang $CFLAGS -o my_program my_program.c $LDFLAGS -lpapi
Or explicitly specify paths:
# GCC (usually available by default)
gcc -o my_program my_program.c -I$PAPI_ROOT/include -L$PAPI_ROOT/lib -lpapi
# Clang (requires module load llvm)
module load llvm
clang -o my_program my_program.c -I$PAPI_ROOT/include -L$PAPI_ROOT/lib -lpapi
Basic PAPI usage pattern
1. Initialize PAPI
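#include <stdio.h>   /* fprintf */
#include <stdlib.h>  /* exit */
#include <papi.h>
/* On success PAPI_library_init() returns the version it was given */
int retval = PAPI_library_init(PAPI_VER_CURRENT);
if (retval != PAPI_VER_CURRENT) {
    fprintf(stderr, "PAPI init error: %s\n", PAPI_strerror(retval));
    exit(1);
}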
2. Create an Event Set
int EventSet = PAPI_NULL;
retval = PAPI_create_eventset(&EventSet);
if (retval != PAPI_OK) {
fprintf(stderr, "Error creating eventset: %s\n", PAPI_strerror(retval));
exit(1);
}
3. Add Events to Measure
Common events:
PAPI_TOT_CYC - Total CPU cycles
PAPI_TOT_INS - Total instructions
PAPI_L1_DCM - L1 data cache misses
PAPI_L2_DCM - L2 data cache misses
PAPI_BR_MSP - Branch mispredictions
PAPI_FP_OPS - Floating point operations
retval = PAPI_add_event(EventSet, PAPI_TOT_CYC);
if (retval != PAPI_OK) {
fprintf(stderr, "Error adding event: %s\n", PAPI_strerror(retval));
exit(1);
}
4. Start Measurement
retval = PAPI_start(EventSet);
if (retval != PAPI_OK) {
fprintf(stderr, "Error starting counters: %s\n", PAPI_strerror(retval));
exit(1);
}
5. Execute Code to Measure
// Your code here
for (int i = 0; i < iterations; i++) {
// computation
}
6. Stop and Read Values
long long values[1];
retval = PAPI_stop(EventSet, values);
if (retval != PAPI_OK) {
fprintf(stderr, "Error stopping counters: %s\n", PAPI_strerror(retval));
exit(1);
}
printf("Total cycles: %lld\n", values[0]);
7. Cleanup
PAPI_cleanup_eventset(EventSet);
PAPI_destroy_eventset(&EventSet);
PAPI_shutdown();
Complete code examples
C code example
Here is a complete working example in C that demonstrates all the steps together:
/* Created by Veselin Kolev <v.kolev@discoverer.bg> on 31 December 2025
*
* Complete PAPI Example
*
* This program demonstrates basic usage of PAPI to measure hardware
* performance counters. It performs a simple computation and measures
* CPU cycles, instructions, and cache misses.
*
*/
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>
#define NUM_EVENTS 3
#define ERROR_RETURN(retval) { \
fprintf(stderr, "Error %d %s:line %d: \n", retval, __FILE__, __LINE__); \
exit(retval); \
}
int main(int argc, char **argv)
{
int retval, i;
int EventSet = PAPI_NULL;
long long values[NUM_EVENTS];
int events[NUM_EVENTS] = {PAPI_TOT_CYC, PAPI_TOT_INS, PAPI_L1_DCM};
char event_names[NUM_EVENTS][PAPI_MAX_STR_LEN];
int event_indices[NUM_EVENTS]; /* Track which events were successfully added */
int num_added = 0;
/* Initialize PAPI library */
retval = PAPI_library_init(PAPI_VER_CURRENT);
if (retval != PAPI_VER_CURRENT) {
fprintf(stderr, "PAPI library init error: %s\n", PAPI_strerror(retval));
ERROR_RETURN(retval);
}
printf("PAPI initialized successfully\n");
printf("PAPI Version: %d.%d.%d\n",
PAPI_VERSION_MAJOR(PAPI_VERSION),
PAPI_VERSION_MINOR(PAPI_VERSION),
PAPI_VERSION_REVISION(PAPI_VERSION));
/* Create EventSet */
retval = PAPI_create_eventset(&EventSet);
if (retval != PAPI_OK) {
fprintf(stderr, "PAPI create eventset error: %s\n", PAPI_strerror(retval));
ERROR_RETURN(retval);
}
/* Add events to EventSet */
for (i = 0; i < NUM_EVENTS; i++) {
/* Check if event is available before adding */
retval = PAPI_query_event(events[i]);
if (retval != PAPI_OK) {
retval = PAPI_event_code_to_name(events[i], event_names[i]);
if (retval == PAPI_OK) {
fprintf(stderr, "Warning: Event %s is not available on this platform\n", event_names[i]);
} else {
fprintf(stderr, "Warning: Event %d is not available on this platform\n", events[i]);
}
continue;
}
retval = PAPI_add_event(EventSet, events[i]);
if (retval != PAPI_OK) {
fprintf(stderr, "PAPI add event error: %s\n", PAPI_strerror(retval));
retval = PAPI_event_code_to_name(events[i], event_names[i]);
if (retval == PAPI_OK) {
fprintf(stderr, "Event %s may not be available on this platform\n", event_names[i]);
} else {
fprintf(stderr, "Event %d may not be available on this platform\n", events[i]);
}
continue;
}
/* Get event name for display */
retval = PAPI_event_code_to_name(events[i], event_names[i]);
if (retval != PAPI_OK) {
snprintf(event_names[i], PAPI_MAX_STR_LEN, "Event_%d", events[i]);
}
/* Track successfully added events */
event_indices[num_added] = i;
num_added++;
}
if (num_added == 0) {
fprintf(stderr, "Error: No events could be added to the eventset\n");
ERROR_RETURN(1);
}
printf("Successfully added %d event(s) to eventset\n", num_added);
/* Start counting */
retval = PAPI_start(EventSet);
if (retval != PAPI_OK) {
fprintf(stderr, "PAPI start error: %s\n", PAPI_strerror(retval));
ERROR_RETURN(retval);
}
printf("\nStarting measurement...\n");
/* Perform some computation */
volatile double sum = 0.0;
int iterations = 1000000;
for (i = 0; i < iterations; i++) {
sum += i * 1.5;
}
printf("Computation completed (sum = %f)\n", sum);
/* Stop counting and read values */
retval = PAPI_stop(EventSet, values);
if (retval != PAPI_OK) {
fprintf(stderr, "PAPI stop error: %s\n", PAPI_strerror(retval));
ERROR_RETURN(retval);
}
/* Display results */
printf("\n=== Performance Counter Results ===\n");
for (i = 0; i < num_added; i++) {
int idx = event_indices[i];
printf("%-30s: %lld\n", event_names[idx], values[i]);
}
/* Calculate derived metrics */
/* Find which events were successfully added and their positions */
int cycles_pos = -1, ins_pos = -1, cache_miss_pos = -1;
for (i = 0; i < num_added; i++) {
int idx = event_indices[i];
if (events[idx] == PAPI_TOT_CYC) cycles_pos = i;
if (events[idx] == PAPI_TOT_INS) ins_pos = i;
if (events[idx] == PAPI_L1_DCM) cache_miss_pos = i;
}
if (cycles_pos >= 0 && ins_pos >= 0 && values[ins_pos] > 0) {
double cpi = (double)values[cycles_pos] / (double)values[ins_pos];
printf("\nCycles per Instruction (CPI): %.4f\n", cpi);
}
if (cache_miss_pos >= 0 && values[cache_miss_pos] > 0) {
if (ins_pos >= 0 && values[ins_pos] > 0) {
/* Calculate miss rate per instruction */
double miss_rate = (double)values[cache_miss_pos] / (double)values[ins_pos] * 100.0;
printf("L1 Data Cache Miss Rate: %.4f%% (misses per instruction)\n", miss_rate);
} else {
/* Just show the raw number if we don't have instruction count */
printf("\nL1 Data Cache Misses: %lld\n", values[cache_miss_pos]);
}
}
/* Cleanup */
retval = PAPI_cleanup_eventset(EventSet);
if (retval != PAPI_OK) {
fprintf(stderr, "PAPI cleanup eventset error: %s\n", PAPI_strerror(retval));
}
retval = PAPI_destroy_eventset(&EventSet);
if (retval != PAPI_OK) {
fprintf(stderr, "PAPI destroy eventset error: %s\n", PAPI_strerror(retval));
}
PAPI_shutdown();
printf("\nPAPI test completed successfully\n");
return 0;
}
Save this code to a file (e.g., example.c) and compile it:
With gcc:
module load papi/7
gcc $CFLAGS -o example example.c $LDFLAGS -lpapi
With clang:
module load llvm
module load papi/7
clang $CFLAGS -o example example.c $LDFLAGS -lpapi
Or explicitly specify paths:
module load papi/7
gcc -I$PAPI_ROOT/include -L$PAPI_ROOT/lib -o example example.c -lpapi
# Clang (requires module load llvm first)
module load papi/7
module load llvm
clang -I$PAPI_ROOT/include -L$PAPI_ROOT/lib -o example example.c -lpapi
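The derived metrics printed at the end of the C example can also be factored into small helpers, which makes them easier to reuse and to test without hardware counters. The sketch below is one possible way to do that. The function names are hypothetical, and the convention of returning -1.0 when the instruction count is not positive is an assumption of this sketch. It also reports cache misses per thousand instructions (often called MPKI), a common alternative to the per-instruction percentage used above.

```c
/* Cycles per instruction (CPI) from PAPI_TOT_CYC and PAPI_TOT_INS values.
 * Returns -1.0 when no instructions were counted. */
double cycles_per_instruction(long long cycles, long long instructions)
{
    if (instructions <= 0) {
        return -1.0;
    }
    return (double)cycles / (double)instructions;
}

/* Instructions per cycle (IPC), the inverse of CPI.
 * Returns -1.0 when no cycles were counted. */
double instructions_per_cycle(long long cycles, long long instructions)
{
    if (cycles <= 0) {
        return -1.0;
    }
    return (double)instructions / (double)cycles;
}

/* Cache misses per thousand instructions (MPKI), e.g. from PAPI_L1_DCM
 * and PAPI_TOT_INS values. Returns -1.0 when no instructions were counted. */
double misses_per_kilo_instruction(long long misses, long long instructions)
{
    if (instructions <= 0) {
        return -1.0;
    }
    return 1000.0 * (double)misses / (double)instructions;
}
```

In the example program these helpers would be called with values[cycles_pos], values[ins_pos] and values[cache_miss_pos] after PAPI_stop() returns.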
C++ code example
Here is the same example written in C++ using modern C++ features:
/* Created by Veselin Kolev <v.kolev@discoverer.bg> on 31 December 2025
*
* Complete PAPI Example (C++)
*
* This program demonstrates basic usage of PAPI to measure hardware
* performance counters. It performs a simple computation and measures
* CPU cycles, instructions, and cache misses.
*
*/
#include <iostream>
#include <iomanip>
#include <vector>
#include <string>
#include <cstdio>
#include <papi.h>
#include <cstdlib>
const int NUM_EVENTS = 3;
void error_exit(int retval, const char* file, int line) {
std::cerr << "Error " << retval << " " << file << ":line " << line << std::endl;
std::exit(retval);
}
int main(int argc, char **argv)
{
int retval;
int EventSet = PAPI_NULL;
std::vector<long long> values(NUM_EVENTS);
std::vector<int> events = {PAPI_TOT_CYC, PAPI_TOT_INS, PAPI_L1_DCM};
std::vector<std::string> event_names(NUM_EVENTS);
std::vector<int> event_indices; // Track which events were successfully added
int num_added = 0;
/* Initialize PAPI library */
retval = PAPI_library_init(PAPI_VER_CURRENT);
if (retval != PAPI_VER_CURRENT) {
std::cerr << "PAPI library init error: " << PAPI_strerror(retval) << std::endl;
error_exit(retval, __FILE__, __LINE__);
}
std::cout << "PAPI initialized successfully" << std::endl;
std::cout << "PAPI Version: "
<< PAPI_VERSION_MAJOR(PAPI_VERSION) << "."
<< PAPI_VERSION_MINOR(PAPI_VERSION) << "."
<< PAPI_VERSION_REVISION(PAPI_VERSION) << std::endl;
/* Create EventSet */
retval = PAPI_create_eventset(&EventSet);
if (retval != PAPI_OK) {
std::cerr << "PAPI create eventset error: " << PAPI_strerror(retval) << std::endl;
error_exit(retval, __FILE__, __LINE__);
}
/* Add events to EventSet */
for (size_t i = 0; i < events.size(); i++) {
/* Check if event is available before adding */
retval = PAPI_query_event(events[i]);
if (retval != PAPI_OK) {
char name[PAPI_MAX_STR_LEN];
retval = PAPI_event_code_to_name(events[i], name);
if (retval == PAPI_OK) {
std::cerr << "Warning: Event " << name
<< " is not available on this platform" << std::endl;
} else {
std::cerr << "Warning: Event " << events[i]
<< " is not available on this platform" << std::endl;
}
continue;
}
retval = PAPI_add_event(EventSet, events[i]);
if (retval != PAPI_OK) {
std::cerr << "PAPI add event error: " << PAPI_strerror(retval) << std::endl;
char name[PAPI_MAX_STR_LEN];
retval = PAPI_event_code_to_name(events[i], name);
if (retval == PAPI_OK) {
std::cerr << "Event " << name
<< " may not be available on this platform" << std::endl;
} else {
std::cerr << "Event " << events[i]
<< " may not be available on this platform" << std::endl;
}
continue;
}
/* Get event name for display */
char name[PAPI_MAX_STR_LEN];
retval = PAPI_event_code_to_name(events[i], name);
if (retval == PAPI_OK) {
event_names[i] = std::string(name);
} else {
event_names[i] = "Event_" + std::to_string(events[i]);
}
/* Track successfully added events */
event_indices.push_back(static_cast<int>(i));
num_added++;
}
if (num_added == 0) {
std::cerr << "Error: No events could be added to the eventset" << std::endl;
error_exit(1, __FILE__, __LINE__);
}
std::cout << "Successfully added " << num_added << " event(s) to eventset" << std::endl;
/* Start counting */
retval = PAPI_start(EventSet);
if (retval != PAPI_OK) {
std::cerr << "PAPI start error: " << PAPI_strerror(retval) << std::endl;
error_exit(retval, __FILE__, __LINE__);
}
std::cout << "\nStarting measurement..." << std::endl;
/* Perform some computation */
volatile double sum = 0.0;
const int iterations = 1000000;
for (int i = 0; i < iterations; i++) {
sum += i * 1.5;
}
std::cout << "Computation completed (sum = " << std::fixed
<< std::setprecision(6) << sum << ")" << std::endl;
/* Stop counting and read values */
retval = PAPI_stop(EventSet, values.data());
if (retval != PAPI_OK) {
std::cerr << "PAPI stop error: " << PAPI_strerror(retval) << std::endl;
error_exit(retval, __FILE__, __LINE__);
}
/* Display results */
std::cout << "\n=== Performance Counter Results ===" << std::endl;
for (int i = 0; i < num_added; i++) {
int idx = event_indices[i];
std::cout << std::left << std::setw(30) << event_names[idx]
<< ": " << values[i] << std::endl;
}
/* Calculate derived metrics */
/* Find which events were successfully added and their positions */
int cycles_pos = -1, ins_pos = -1, cache_miss_pos = -1;
for (int i = 0; i < num_added; i++) {
int idx = event_indices[i];
if (events[idx] == PAPI_TOT_CYC) cycles_pos = i;
if (events[idx] == PAPI_TOT_INS) ins_pos = i;
if (events[idx] == PAPI_L1_DCM) cache_miss_pos = i;
}
if (cycles_pos >= 0 && ins_pos >= 0 && values[ins_pos] > 0) {
double cpi = static_cast<double>(values[cycles_pos]) / static_cast<double>(values[ins_pos]);
std::cout << "\nCycles per Instruction (CPI): "
<< std::fixed << std::setprecision(4) << cpi << std::endl;
}
if (cache_miss_pos >= 0 && values[cache_miss_pos] > 0) {
if (ins_pos >= 0 && values[ins_pos] > 0) {
/* Calculate miss rate per instruction */
double miss_rate = static_cast<double>(values[cache_miss_pos])
/ static_cast<double>(values[ins_pos]) * 100.0;
std::cout << "L1 Data Cache Miss Rate: "
<< std::fixed << std::setprecision(4) << miss_rate
<< "% (misses per instruction)" << std::endl;
} else {
/* Just show the raw number if we don't have instruction count */
std::cout << "\nL1 Data Cache Misses: "
<< values[cache_miss_pos] << std::endl;
}
}
/* Cleanup */
retval = PAPI_cleanup_eventset(EventSet);
if (retval != PAPI_OK) {
std::cerr << "PAPI cleanup eventset error: " << PAPI_strerror(retval) << std::endl;
}
retval = PAPI_destroy_eventset(&EventSet);
if (retval != PAPI_OK) {
std::cerr << "PAPI destroy eventset error: " << PAPI_strerror(retval) << std::endl;
}
PAPI_shutdown();
std::cout << "\nPAPI test completed successfully" << std::endl;
return 0;
}
Save this code to a file (e.g., example.cpp) and compile it:
With g++:
module load papi/7
g++ $CXXFLAGS -o example example.cpp $LDFLAGS -lpapi
With clang++:
module load llvm
module load papi/7
clang++ -stdlib=libc++ $CXXFLAGS -o example example.cpp $LDFLAGS -lpapi
Or explicitly specify paths:
module load papi/7
g++ -I$PAPI_ROOT/include -L$PAPI_ROOT/lib -o example example.cpp -lpapi
# Clang++ (requires module load llvm first)
module load papi/7
module load llvm
clang++ -stdlib=libc++ -I$PAPI_ROOT/include -L$PAPI_ROOT/lib -o example example.cpp -lpapi
Common performance events
CPU metrics
PAPI_TOT_CYC - Total CPU cycles
PAPI_TOT_INS - Total instructions executed
PAPI_REF_CYC - Reference cycles
Cache metrics
PAPI_L1_DCM - L1 data cache misses
PAPI_L1_DCA - L1 data cache accesses
PAPI_L2_DCM - L2 data cache misses
PAPI_L2_DCA - L2 data cache accesses
PAPI_L3_TCM - L3 total cache misses
Branch metrics
PAPI_BR_CN - Conditional branches
PAPI_BR_MSP - Branch mispredictions
PAPI_BR_PRC - Conditional branches correctly predicted
Floating point
PAPI_FP_OPS - Floating point operations
PAPI_SP_OPS - Single precision operations
PAPI_DP_OPS - Double precision operations
Memory
PAPI_LD_INS - Load instructions
PAPI_SR_INS - Store instructions
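Several of the events listed above are most useful in pairs, where one counter is divided by another. The sketch below shows two such ratios under the assumption that both events of a pair were counted over the same code region. The function names are hypothetical, and the exact meaning of each preset can vary between processors, so the results are best read as indicators rather than exact rates.

```c
/* Generic ratio of two counter values.
 * Returns -1.0 when the denominator is not positive. */
double counter_ratio(long long numerator, long long denominator)
{
    if (denominator <= 0) {
        return -1.0;
    }
    return (double)numerator / (double)denominator;
}

/* Fraction of L1 data cache accesses that missed,
 * from PAPI_L1_DCM and PAPI_L1_DCA values. */
double l1d_miss_ratio(long long l1_dcm, long long l1_dca)
{
    return counter_ratio(l1_dcm, l1_dca);
}

/* Fraction of conditional branches that were mispredicted,
 * from PAPI_BR_MSP and PAPI_BR_CN values. */
double branch_mispredict_ratio(long long br_msp, long long br_cn)
{
    return counter_ratio(br_msp, br_cn);
}
```

Multiplying either result by 100 gives a percentage. A negative result signals that the denominator event reported no counts, which usually means the event was not available or the region was too short.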
Discovering available events
List all available events
papi_avail
List native events
papi_native_avail
Check specific event
papi_event_chooser PAPI_TOT_CYC
Error handling
Always check return values from PAPI functions:
int retval = PAPI_function(...);
if (retval != PAPI_OK) {
fprintf(stderr, "Error: %s\n", PAPI_strerror(retval));
// Handle error appropriately
}
Common error codes
PAPI_OK - Success
PAPI_EINVAL - Invalid argument
PAPI_ENOMEM - Out of memory
PAPI_ECNFLCT - Event conflict
PAPI_ENOEVNT - Event not available
Advanced usage
Multiple events
int events[] = {PAPI_TOT_CYC, PAPI_TOT_INS, PAPI_L1_DCM};
long long values[3];
// Add all events
for (int i = 0; i < 3; i++) {
PAPI_add_event(EventSet, events[i]);
}
PAPI_start(EventSet);
// ... code ...
PAPI_stop(EventSet, values);
// values[0] = cycles, values[1] = instructions, values[2] = cache misses
High-level API
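PAPI also provides simple rate calls for common metrics. In PAPI 6 and later the older PAPI_flops() call is replaced by PAPI_flops_rate(), which takes the floating-point event to count as its first argument (run man PAPI_flops_rate on this installation to confirm the signature):
float real_time, proc_time, mflops;
long long flpops;
/* The first call sets up and starts the counters */
PAPI_flops_rate(PAPI_DP_OPS, &real_time, &proc_time, &flpops, &mflops);
// ... code to measure ...
/* The second call reports the values accumulated since the first call */
PAPI_flops_rate(PAPI_DP_OPS, &real_time, &proc_time, &flpops, &mflops);
printf("MFLOPS: %f\n", mflops);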
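The MFLOPS figure reported by the high-level call above is simply a floating-point operation count divided by elapsed time. A minimal sketch of that arithmetic follows, for cases where the count comes from PAPI_FP_OPS or PAPI_DP_OPS and the time from a separate timer. The function names and the peak_mflops parameter are hypothetical; the peak value has to come from the processor's specification.

```c
/* Millions of floating-point operations per second.
 * Returns -1.0 when the elapsed time is not positive. */
double mflops_from_counts(long long flop_count, double seconds)
{
    if (seconds <= 0.0) {
        return -1.0;
    }
    return (double)flop_count / (seconds * 1.0e6);
}

/* Achieved rate as a fraction of a theoretical peak rate.
 * Returns -1.0 when the peak is not positive. */
double fp_efficiency(double achieved_mflops, double peak_mflops)
{
    if (peak_mflops <= 0.0) {
        return -1.0;
    }
    return achieved_mflops / peak_mflops;
}
```

For example, 3,000,000 operations in half a second correspond to 6 MFLOPS, and against a hypothetical peak of 24 MFLOPS that is an efficiency of 0.25.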
Thread safety
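PAPI can be used from multiple threads once thread support has been enabled by calling PAPI_thread_init() once, after PAPI_library_init(). Each thread should: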
Create its own EventSet
Initialize counters independently
Clean up its own resources
Performance considerations
Overhead: PAPI has minimal overhead, but frequent start/stop operations can add up
Counter Limits: Hardware has a limited number of simultaneous counters (typically 2-8)
Multiplexing: Use PAPI multiplexing to measure more events than available counters
Sampling: For long-running codes, consider sampling instead of continuous measurement
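When more events are requested than there are hardware counters, multiplexing time-slices the counters, so each event is only counted for part of the run and the reported value is an estimate. The sketch below illustrates the usual extrapolation idea, the same one Linux perf_event uses: the raw count is multiplied by the time the event was enabled and divided by the time it was actually running on a counter. It is an illustration only. PAPI performs its own scaling internally, and the function name and parameters here are hypothetical.

```c
/* Estimate the full-run count of an event that was scheduled on a
 * hardware counter for only part of the measured interval.
 *
 * raw_count       - value read from the counter
 * time_enabled_ns - how long the event was requested
 * time_running_ns - how long it was actually counting
 *
 * Returns 0 if the event never ran, and the raw count unchanged if it
 * ran for the whole interval. The result is rounded to the nearest integer. */
long long scale_multiplexed_count(long long raw_count,
                                  long long time_enabled_ns,
                                  long long time_running_ns)
{
    if (time_running_ns <= 0) {
        return 0;
    }
    if (time_running_ns >= time_enabled_ns) {
        return raw_count;
    }
    double scaled = (double)raw_count * (double)time_enabled_ns
                    / (double)time_running_ns;
    return (long long)(scaled + 0.5);
}
```

The shorter the running time relative to the enabled time, the larger the extrapolation and the less reliable the estimate, which is why multiplexed measurements work best on long, steady code regions.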
Troubleshooting
“Event not available”
Some events may not be available on all platforms:
Check with papi_avail
Use PAPI_query_event() to check availability before adding
Have fallback events ready
Permission issues
Some counters require special permissions:
May need to run as root
Or adjust /proc/sys/kernel/perf_event_paranoid to allow user access
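A program can check the current setting itself before it tries to start counters. The sketch below reads the file named above and parses the integer it contains. The function names are hypothetical. Lower values are more permissive, but the exact meaning of each level depends on the kernel and distribution, so the sketch only reports the number and leaves the interpretation to the caller.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse the text content of perf_event_paranoid, e.g. "2\n".
 * Stores the level in *level and returns 0, or returns -1 if the
 * text does not start with an integer. */
int parse_paranoid_level(const char *text, int *level)
{
    char *end = NULL;
    long value = strtol(text, &end, 10);
    if (end == text) {
        return -1;   /* no digits were consumed */
    }
    *level = (int)value;
    return 0;
}

/* Read the level from the given path, normally
 * /proc/sys/kernel/perf_event_paranoid.
 * Returns 0 on success, -1 if the file cannot be read or parsed. */
int read_paranoid_level(const char *path, int *level)
{
    char buf[32];
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_paranoid_level(buf, level);
}
```

If PAPI_start() fails with a permission-related error, printing this value alongside the PAPI error message gives the system administrators the information they need to decide whether the setting should be changed.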
Linking issues
If you get undefined references:
Ensure -lpapi is at the end of the link command
Check that LD_LIBRARY_PATH includes the PAPI library directory
Verify the module is loaded: module list
Additional resources
PAPI User’s Guide: man papi
PAPI API Reference: man PAPI_start
Example programs in the PAPI distribution
Online PAPI documentation
Example makefile
Using gcc
CC = gcc
CFLAGS = -O3 -Wall
LDFLAGS = -lpapi
example: example.c
	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
clean:
	rm -f example
Using clang
Note
Requires module load llvm before running make.
CC = clang
CFLAGS = -O3 -Wall
LDFLAGS = -lpapi
example: example.c
	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
clean:
	rm -f example
Compiler-Agnostic Makefile
# Default to gcc (more widely available), but can be overridden
# For clang: make CC=clang (requires module load llvm)
CC ?= gcc
CFLAGS = -O3 -Wall
LDFLAGS = -lpapi
example: example.c
	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
clean:
	rm -f example
Usage:
# GCC (usually available by default)
module load papi/7
make CC=gcc # Explicitly use gcc
# Clang (requires module load llvm first)
module load papi/7
module load llvm
make CC=clang # Use clang
C++ Makefile Examples
Using g++
CXX = g++
CXXFLAGS = -O3 -Wall
LDFLAGS = -lpapi
example: example.cpp
	$(CXX) $(CXXFLAGS) -o $@ $< $(LDFLAGS)
clean:
	rm -f example
Using clang++
Note
Requires module load llvm before running make.
CXX = clang++
CXXFLAGS = -O3 -Wall -stdlib=libc++
LDFLAGS = -lpapi
example: example.cpp
	$(CXX) $(CXXFLAGS) -o $@ $< $(LDFLAGS)
clean:
	rm -f example
Compiler-Agnostic Makefile for C++
# Default to g++ (more widely available), but can be overridden
# For clang++: make CXX=clang++ (requires module load llvm)
CXX ?= g++
CXXFLAGS = -O3 -Wall
# Add -stdlib=libc++ if using clang++
ifeq ($(CXX),clang++)
CXXFLAGS += -stdlib=libc++
endif
LDFLAGS = -lpapi
example: example.cpp
	$(CXX) $(CXXFLAGS) -o $@ $< $(LDFLAGS)
clean:
	rm -f example
Usage:
# G++ (usually available by default)
module load papi/7
make CXX=g++ # Explicitly use g++
# Clang++ (requires module load llvm first)
module load papi/7
module load llvm
make CXX=clang++ # Use clang++