Understanding how memory access patterns affect performance is crucial in systems programming, especially in languages like C where manual memory management is central. This post explores how different ways of traversing a matrix, either by rows or by columns, impact execution time as matrices grow in size, offering insights into cache utilization and access efficiency in C.
The Problem
In computing, matrix traversal is a common operation in numerous applications, from graphics rendering to scientific computing. However, not all traversal methods yield the same performance, particularly on modern processors where cache memory plays a pivotal role. Cache memory, which sits between the CPU and the main memory, speeds up data access when it can predictably store and retrieve frequently accessed data. If our access pattern is unpredictable or inefficient, it may lead to frequent cache misses, slowing down the program.
In this exercise, we analyze two primary access patterns:
- Row-wise traversal: Iterating over matrix elements by rows, where elements stored consecutively in memory are accessed sequentially.
- Column-wise traversal: Iterating over matrix elements by columns, accessing non-consecutive elements in memory, potentially increasing cache misses.
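The difference between these two patterns comes from how C stores 2D arrays: row-major order, where elements of the same row sit next to each other in memory. A minimal sketch to make this concrete (the array name and sizes here are illustrative, not the test's actual configuration):

```c
#include <stddef.h>

#define ROWS 4
#define COLS 4

int M[ROWS][COLS];

/* Distance in bytes between two elements of M.
   In row-major order, stepping j moves sizeof(int) bytes,
   while stepping i jumps a whole row of COLS * sizeof(int) bytes. */
ptrdiff_t byte_distance(const int *a, const int *b)
{
    return (const char *)b - (const char *)a;
}
```

Column-wise traversal therefore strides through memory in row-sized jumps, touching a different cache line on almost every access once a row is larger than a cache line.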
Analysis Goals
Through the next test, we aim to understand:
- How access patterns influence performance: Observing execution time when traversing by rows vs. columns.
- The effect of matrix size on cache efficiency: Determining if larger matrices lead to a noticeable difference in performance between row and column access.
- Practical optimization insights: Applying these findings to write more cache-efficient code, especially for performance-critical applications.
Performance Evaluation
It's very important to know how to evaluate the performance of our code and how much time it takes to run. In my personal context (gamedev) this is imperative, since it defines the performance of our games.
Measuring the time it takes to execute a piece of code is relatively simple if we define a function captureTime that works as a stopwatch:
```c
t1 = captureTime();
// code to measure
t2 = captureTime();
time = t2 - t1; // time it takes to execute the code
```
The problem with this method is that the result and cost of this captureTime function are hardware- and operating-system-dependent.
In addition, we have to keep in mind that the exact same code doesn't always take the same time to execute, due to the complexity of the factors involved nowadays. The code being measured does not run directly on the bare hardware; an operating system and devices interrupt the usual flow of execution. So what we do is run the same code several times and perform a statistical analysis of the results to get more accurate values.
Evaluating matrix
To analyze performance, we evaluate a simple matrix traversal code to compare execution times for row-wise and column-wise accesses. For small matrices (e.g., 8x8), the execution times are almost identical for both traversal methods. However, as matrix size increases, the differences become significant.
```c
/* M, ROWS and COLS are globals provided by the test setup. */
void init_data()
{
    int i, j;
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++) {
            M[i][j] = i + j * 10;
        }
    }
}

void matrix_rows()
{
    int i, j;
    for (i = 0; i < ROWS; i++) {       /* outer loop: rows */
        for (j = 0; j < COLS; j++) {   /* inner loop: consecutive elements */
            M[i][j] = i + j;
        }
    }
}

void matrix_cols()
{
    int i, j;
    for (j = 0; j < COLS; j++) {       /* outer loop: columns */
        for (i = 0; i < ROWS; i++) {   /* inner loop: jumps a whole row each step */
            M[i][j] = i + j;
        }
    }
}
```
In our test setup, the traversal code is executed multiple times, and the lowest time (in microseconds) is selected after up to 10 repetitions. To ensure accuracy, the three lowest execution times are validated to have an error margin below 1% (0.01).
```c
double measure( void (*f)(void) )
{
    return measure_full((int(*)(int,int))f, 0xDEAD, 0xDEAD, 3, 0.01, 10); // 1% error
}
```
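measure_full belongs to the test harness and its body isn't shown here, but the stability criterion it enforces, requiring the three lowest times to agree within 1%, can be sketched as follows (a hypothetical helper, not the real implementation):

```c
/* Check whether the 'nvalid' lowest of n recorded times agree within a
   relative error 'err' (e.g. 0.01 for 1%). Sketch only: the real
   measure_full also runs the measured function and collects the times. */
int times_are_stable(double *t, int n, int nvalid, double err)
{
    /* Selection-sort the nvalid smallest values to the front (n is small). */
    for (int i = 0; i < nvalid; i++) {
        int min = i;
        for (int j = i + 1; j < n; j++)
            if (t[j] < t[min]) min = j;
        double tmp = t[i]; t[i] = t[min]; t[min] = tmp;
    }
    /* Stable if the spread of the nvalid best stays below err * best. */
    return (t[nvalid - 1] - t[0]) <= err * t[0];
}
```

If the check fails, the harness keeps repeating the measurement, up to the maximum of 10 runs, before accepting a result.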
Matrix sizes are incremented in powers of two (8, 16, 32, ...) for both rows and columns, and execution times are recorded for each case.
Matrix Size | Time init | Time row 1 | Time row 2 | Time row 3 | Time col 1 | Time col 2 | Time col 3 |
---|---|---|---|---|---|---|---|
16 x 16 | 0µs | 0µs | 0µs | 0µs | 0µs | 0µs | 0µs |
32 x 32 | 2µs | 2µs | 2µs | 2µs | 3µs | 2µs | 2µs |
64 x 64 | 6µs | 6µs | 6µs | 6µs | 3µs | 3µs | 3µs |
128 x 128 | 19µs | 21µs | 10µs | 19µs | 16µs | 13µs | 15µs |
256 x 256 | 62µs | 63µs | 63µs | 63µs | 114µs | 116µs | 116µs |
512 x 512 | 343µs | 332µs | 332µs | 332µs | 508µs | 503µs | 503µs |
1024 x 1024 | 1319µs | 1324µs | 1323µs | 1321µs | 6663µs | 5364µs | 6067µs |
2048 x 2048 | 5293µs | 5312µs | 5296µs | 5320µs | 31473µs | 31377µs | 31384µs |
4096 x 4096 | 14514µs | 15338µs | 14309µs | 14742µs | 108086µs | 105834µs | 105835µs |
```mermaid
xychart-beta
    title "Matrix Traversal Times Comparison"
    x-axis ["16", "32", "64", "128", "256", "512", "1024", "2048", "4096"]
    y-axis "Time (µs)" 0 --> 120000
    line "Row Test 1: Traversal by Row" [0, 2, 6, 21, 63, 332, 1324, 5312, 15338]
    line "Row Test 2: Traversal by Row" [0, 2, 6, 10, 63, 332, 1323, 5296, 14309]
    line "Row Test 3: Traversal by Row" [0, 2, 6, 19, 63, 332, 1321, 5320, 14742]
    line "Column Test 1: Traversal by Column" [0, 3, 3, 16, 115, 505, 6031, 31473, 108086]
    line "Column Test 2: Traversal by Column" [0, 2, 3, 13, 116, 503, 5364, 31377, 105834]
    line "Column Test 3: Traversal by Column" [0, 2, 3, 15, 116, 503, 6067, 31384, 105835]
```
Conclusions
Is the execution time similar when traversing by rows or columns?
The execution time depends on the matrix size. As we can observe in the table above, for smaller matrix sizes like 16x16 or 32x32, the access times are essentially the same for rows and columns, showing almost identical performance.
What happens as the matrix size increases?
When moving to slightly larger sizes, such as 64x64 or 128x128, access times are actually shorter for columns than for rows in our measurements. The gap between the two methods becomes even more significant as the matrix size continues to grow.
What happens with matrices of size 256x256 and larger?
For matrices equal to or larger than 256x256, the picture changes: row-wise access starts showing lower times than column-wise access, and this difference increases as the matrix size grows. This behavior is particularly noticeable with larger matrices like 4096x4096.
What can be concluded from the table? Is there a change in execution times between the two traversal methods?
Execution times vary based on the matrix size and whether we are accessing by row or by column. The time for accessing rows and columns is very similar for smaller matrices, but as the matrix size increases, the differences become more apparent.
How much space does one row occupy? Why does this happen?
The size of a row depends on the number of columns. For example, in a 1024x1024 matrix of int, a row holds 1024 elements; at 4 bytes per int, one row occupies 1024 x 4 = 4096 bytes, approximately 4KB. This matters because it determines how many cache lines each row spans, and therefore how data is loaded and accessed.
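In C this figure doesn't have to be computed by hand: sizeof applied to one row of the array yields the exact byte count. A quick sketch (the sizes are illustrative):

```c
#define ROWS 1024
#define COLS 1024

int M[ROWS][COLS];

/* sizeof M[0] is the size in bytes of one full row:
   COLS elements of sizeof(int) bytes each. */
unsigned long row_bytes(void)
{
    return (unsigned long)sizeof M[0];
}
```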
How do the cache levels of your CPU impact the results?
CPU: 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
- Base speed: 2.50 GHz
- Sockets: 1
- Cores: 8
- Threads: 16
- Cache L1: 80KB (per core)
- Cache L2: 512KB (per core)
- Cache L3: 16MB (shared)
For smaller matrices, where the entire matrix fits into the L1 cache, the execution times are very fast and nearly identical for both row-wise and column-wise traversals. However, as the matrix size increases and it no longer fits into the L1 cache, data has to be served from the L2 or even L3 cache, which increases the access time. This explains the rise in access time for column traversal: when iterating column-first, each access jumps a full row ahead in memory, so the elements we need are often not present in the lower-level caches, resulting in slower memory access.
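These transitions can be estimated by comparing each matrix's footprint (assuming 4-byte ints) against the cache sizes above: 128x128 is 64KB, around the L1 figure; 256x256 is 256KB, which only fits from L2 up; and 1024x1024 is 4MB, which spills past L2 into L3. This lines up roughly with where the column-wise times in the table start to diverge. A small helper for the estimate (hypothetical, not part of the test code):

```c
/* Footprint in bytes of an n x n matrix of 4-byte ints. */
unsigned long matrix_bytes(unsigned long n)
{
    return n * n * 4UL;
}
```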
In summary, matrix traversal patterns can significantly impact performance, especially as matrix size increases. For small matrices, both row-wise and column-wise accesses show similar performance due to the data fitting within the CPU's L1 cache. However, as the matrix size grows, row-wise traversal becomes increasingly efficient. This behavior is driven by how data is stored in memory and accessed, and how the CPU's cache hierarchy influences performance.
Key Takeaways:
- Cache locality plays a significant role in matrix traversal performance, where row-wise traversal benefits from better cache utilization due to sequential memory access.
- Smaller matrices show little difference in access times, but as matrix sizes increase and exceed cache sizes, the differences become more pronounced.
- Optimization insight: In real-world applications, such as gamedev or scientific computing, choosing the correct traversal method based on matrix size can lead to significant performance improvements.
If you've made it this far, thank you for reading! I hope this analysis offers valuable insights for optimizing performance in your own projects.