1D or 2D array, what’s faster?

Question

tl;dr : You should probably use a one-dimensional approach.

Note: One cannot dig into detail affecting performance when comparing dynamic 1d or dynamic 2d storage patterns without filling books since the performance of code is dependent one a very large number of parameters. Profile if possible.

1. What’s faster?

For dense matrices the 1D approach is likely to be faster since it offers better memory locality and less allocation and deallocation overhead.

2. What’s smaller?

Dynamic-1D consumes less memory than the 2D approach. The latter also requires more allocations.

Remarks

I laid out a pretty long answer beneath with several reasons but I want to make some remarks on your assumptions first.

I can imagine, that recalculating indices for 1D arrays (y + x*n) could be slower than using 2D array (x, y)

Let’s compare these two functions:

int get_2d (int **p, int r, int c) { return p[r][c]; }
int get_1d (int *p, int r, int c)  { return p[c + C*r]; }

The (non-inlined) assembly generated by Visual Studio 2015 RC for those functions (with optimizations turned on) is:

?get_1d@@YAHPAHII@Z PROC
push    ebp
mov ebp, esp
mov eax, DWORD PTR _c$[ebp]
lea eax, DWORD PTR [eax+edx*4]
mov eax, DWORD PTR [ecx+eax*4]
pop ebp
ret 0

?get_2d@@YAHPAPAHII@Z PROC
push ebp
mov ebp, esp
mov ecx, DWORD PTR [ecx+edx*4]
mov eax, DWORD PTR _c$[ebp]
mov eax, DWORD PTR [ecx+eax*4]
pop ebp
ret 0

The difference is mov (2d) vs. lea (1d).
The former has a latency of 3 cycles and a a maximum throughput of 2 per cycle while the latter has a latency of 2 cycles and a maximum throughput of 3 per cycle. (According to Instruction tables – Agner Fog
Since the differences are minor, I think there should not be a big performance difference arising from index recalculation. I expect it to be very unlikely to identify this difference itself to be the bottleneck in any program.

This brings us to the next (and more interesting) point:

… but I could image that 1D could be in CPU cache …

True, but 2d could be in CPU cache, too. See The Downsides: Memory locality for an explanation why 1d is still better.

The long answer, or why dynamic 2 dimensional data storage (pointer-to-pointer or vector-of-vector) is “bad” for simple / small matrices.

Note: This is about dynamic arrays/allocation schemes [malloc/new/vector etc.]. A static 2d array is a contiguous block of memory and therefore not subject to the downsides I’m going to present here.

The Problem

To be able to understand why a dynamic array of dynamic arrays or a vector of vectors is most likely not the data storage pattern of choice, you are required to understand the memory layout of such structures.

Example case using pointer to pointer syntax

int main (void)
{
    // allocate memory for 4x4 integers; quick & dirty
    int ** p = new int*[4];
    for (size_t i=0; i<4; ++i) p[i] = new int[4]; 

    // do some stuff here, using p[x][y] 

    // deallocate memory
    for (size_t i=0; i<4; ++i) delete[] p[i];
    delete[] p;
}

The downsides

Memory locality

For this “matrix” you allocate one block of four pointers and four blocks of four integers. All of the allocations are unrelated and can therefore result in an arbitrary memory position.

The following image will give you an idea of how the memory may look like.

For the real 2d case:

The violet square is the memory position occupied by p itself.
The green squares assemble the memory region p points to (4 x int*).
The 4 regions of 4 contiguous blue squares are the ones pointed to by each int* of the green region

For the 2d mapped on 1d case:

The green square is the only required pointer int *
The blue squares ensemble the memory region for all matrix elements (16 x int).

This means that (when using the left layout) you will probably observe worse performance than for a contiguous storage pattern (as seen on the right), due to caching for instance.

Let’s say a cache line is “the amount of data transfered into the cache at once” and let’s imagine a program accessing the whole matrix one element after another.

If you have a properly aligned 4 times 4 matrix of 32 bit values, a processor with a 64 byte cache line (typical value) is able to “one-shot” the data (4*4*4 = 64 bytes).
If you start processing and the data isn’t already in the cache you’ll face a cache miss and the data will be fetched from main memory. This load can fetch the whole matrix at once since it fits into a cache line, if and only if it is contiguously stored (and properly aligned).
There will probably not be any more misses while processing that data.

In case of a dynamic, “real two-dimensional” system with unrelated locations of each row/column, the processor needs to load every memory location seperately.
Eventhough only 64 bytes are required, loading 4 cache lines for 4 unrelated memory positions would -in a worst case scenario- actually transfer 256 bytes and waste 75% throughput bandwidth.
If you process the data using the 2d-scheme you’ll again (if not already cached) face a cache miss on the first element.
But now, only the first row/colum will be in the cache after the first load from main memory because all other rows are located somewhere else in memory and not adjacent to the first one.
As soon as you reach a new row/column there will again be a cache miss and the next load from main memory is performed.

Long story short: The 2d pattern has a higher chance of cache misses with the 1d scheme offering better potential for performance due to locality of the data.

Frequent Allocation / Deallocation

As many as N + 1 (4 + 1 = 5) allocations (using either new, malloc, allocator::allocate or whatever) are necessary to create the desired NxM (4×4) matrix.
The same number of proper, respective deallocation operations must be applied as well.

Therefore, it is more costly to create/copy such matrices in contrast to a single allocation scheme.

This is getting even worse with a growing number of rows.

Memory consumption overhead

I’ll asumme a size of 32 bits for int and 32 bits for pointers. (Note: System dependency.)

Let’s remember: We want to store a 4×4 int matrix which means 64 bytes.

For a NxM matrix, stored with the presented pointer-to-pointer scheme we consume

N*M*sizeof(int) [the actual blue data] +
N*sizeof(int*) [the green pointers] +
sizeof(int**) [the violet variable p] bytes.

That makes 4*4*4 + 4*4 + 4 = 84 bytes in case of the present example and it gets even worse when using std::vector<std::vector<int>>.
It will require N * M * sizeof(int) + N * sizeof(vector<int>) + sizeof(vector<vector<int>>) bytes, that is 4*4*4 + 4*16 + 16 = 144 bytes in total, intead of 64 bytes for 4 x 4 int.

In addition -depending on the used allocator- each single allocation may well (and most likely will) have another 16 bytes of memory overhead. (Some “Infobytes” which store the number of allocated bytes for the purpose of proper deallocation.)

This means the worst case is:

N*(16+M*sizeof(int)) + 16+N*sizeof(int*) + sizeof(int**)
= 4*(16+4*4) + 16+4*4 + 4 = 164 bytes ! _Overhead: 156%_

The share of the overhead will reduce as the size of the matrix grows but will still be present.

Risk of memory leaks

The bunch of allocations requires an appropriate exception handling in order to avoid memory leaks if one of the allocations will fail!
You’ll need to keep track of allocated memory blocks and you must not forget them when deallocating the memory.

If new runs of of memory and the next row cannot be allocated (especially likely when the matrix is very large), a std::bad_alloc is thrown by new.

Example:

In the above mentioned new/delete example, we’ll face some more code if we want to avoid leaks in case of bad_alloc exceptions.

  // allocate memory for 4x4 integers; quick & dirty
  size_t const N = 4;
  // we don't need try for this allocation
  // if it fails there is no leak
  int ** p = new int*[N];
  size_t allocs(0U);
  try 
  { // try block doing further allocations
    for (size_t i=0; i<N; ++i) 
    {
      p[i] = new int[4]; // allocate
      ++allocs; // advance counter if no exception occured
    }
  }
  catch (std::bad_alloc & be)
  { // if an exception occurs we need to free out memory
    for (size_t i=0; i<allocs; ++i) delete[] p[i]; // free all alloced p[i]s
    delete[] p; // free p
    throw; // rethrow bad_alloc
  }
  /*
     do some stuff here, using p[x][y] 
  */
  // deallocate memory accoding to the number of allocations
  for (size_t i=0; i<allocs; ++i) delete[] p[i];
  delete[] p;

Summary

There are cases where “real 2d” memory layouts fit and make sense (i.e. if the number of columns per row is not constant) but in the most simple and common 2D data storage cases they just bloat the complexity of your code and reduce the performance and memory efficiency of your program.

Alternative

You should use a contiguous block of memory and map your rows onto that block.

The “C++ way” of doing it is probably to write a class that manages your memory while considering important things like

Example

To provide an idea of how such a class may look like, here’s a simple example with some basic features:

2d-size-constructible
2d-resizable
operator(size_t, size_t) for 2d- row major element access
at(size_t, size_t) for checked 2d-row major element access
Fulfills Concept requirements for Container

Source:

#include <vector>
#include <algorithm>
#include <iterator>
#include <utility>

namespace matrices
{

  template<class T>
  class simple
  {
  public:
    // misc types
    using data_type  = std::vector<T>;
    using value_type = typename std::vector<T>::value_type;
    using size_type  = typename std::vector<T>::size_type;
    // ref
    using reference       = typename std::vector<T>::reference;
    using const_reference = typename std::vector<T>::const_reference;
    // iter
    using iterator       = typename std::vector<T>::iterator;
    using const_iterator = typename std::vector<T>::const_iterator;
    // reverse iter
    using reverse_iterator       = typename std::vector<T>::reverse_iterator;
    using const_reverse_iterator = typename std::vector<T>::const_reverse_iterator;

    // empty construction
    simple() = default;

    // default-insert rows*cols values
    simple(size_type rows, size_type cols)
      : m_rows(rows), m_cols(cols), m_data(rows*cols)
    {}

    // copy initialized matrix rows*cols
    simple(size_type rows, size_type cols, const_reference val)
      : m_rows(rows), m_cols(cols), m_data(rows*cols, val)
    {}

    // 1d-iterators

    iterator begin() { return m_data.begin(); }
    iterator end() { return m_data.end(); }
    const_iterator begin() const { return m_data.begin(); }
    const_iterator end() const { return m_data.end(); }
    const_iterator cbegin() const { return m_data.cbegin(); }
    const_iterator cend() const { return m_data.cend(); }
    reverse_iterator rbegin() { return m_data.rbegin(); }
    reverse_iterator rend() { return m_data.rend(); }
    const_reverse_iterator rbegin() const { return m_data.rbegin(); }
    const_reverse_iterator rend() const { return m_data.rend(); }
    const_reverse_iterator crbegin() const { return m_data.crbegin(); }
    const_reverse_iterator crend() const { return m_data.crend(); }

    // element access (row major indexation)
    reference operator() (size_type const row,
      size_type const column)
    {
      return m_data[m_cols*row + column];
    }
    const_reference operator() (size_type const row,
      size_type const column) const
    {
      return m_data[m_cols*row + column];
    }
    reference at() (size_type const row, size_type const column)
    {
      return m_data.at(m_cols*row + column);
    }
    const_reference at() (size_type const row, size_type const column) const
    {
      return m_data.at(m_cols*row + column);
    }

    // resizing
    void resize(size_type new_rows, size_type new_cols)
    {
      // new matrix new_rows times new_cols
      simple tmp(new_rows, new_cols);
      // select smaller row and col size
      auto mc = std::min(m_cols, new_cols);
      auto mr = std::min(m_rows, new_rows);
      for (size_type i(0U); i < mr; ++i)
      {
        // iterators to begin of rows
        auto row = begin() + i*m_cols;
        auto tmp_row = tmp.begin() + i*new_cols;
        // move mc elements to tmp
        std::move(row, row + mc, tmp_row);
      }
      // move assignment to this
      *this = std::move(tmp);
    }

    // size and capacity
    size_type size() const { return m_data.size(); }
    size_type max_size() const { return m_data.max_size(); }
    bool empty() const { return m_data.empty(); }
    // dimensionality
    size_type rows() const { return m_rows; }
    size_type cols() const { return m_cols; }
    // data swapping
    void swap(simple &rhs)
    {
      using std::swap;
      m_data.swap(rhs.m_data);
      swap(m_rows, rhs.m_rows);
      swap(m_cols, rhs.m_cols);
    }
  private:
    // content
    size_type m_rows{ 0u };
    size_type m_cols{ 0u };
    data_type m_data{};
  };
  template<class T>
  void swap(simple<T> & lhs, simple<T> & rhs)
  {
    lhs.swap(rhs);
  }
  template<class T>
  bool operator== (simple<T> const &a, simple<T> const &b)
  {
    if (a.rows() != b.rows() || a.cols() != b.cols())
    {
      return false;
    }
    return std::equal(a.begin(), a.end(), b.begin(), b.end());
  }
  template<class T>
  bool operator!= (simple<T> const &a, simple<T> const &b)
  {
    return !(a == b);
  }

}

Note several things here:

T needs to fulfill the requirements of the used std::vector member functions
operator() doesn’t do any “of of range” checks
No need to manage data on your own
No destructor, copy constructor or assignment operators required

So you don’t have to bother about proper memory handling for each application but only once for the class you write.

Restrictions

There may be cases where a dynamic “real” two dimensional structure is favourable. This is for instance the case if

the matrix is very large and sparse (if any of the rows do not even need to be allocated but can be handled using a nullptr) or if
the rows do not have the same number of columns (that is if you don’t have a matrix at all but another two-dimensional construct).