C++ STL, Boost and native array Performance and Custom Allocator

Array performance measures the impact of bounds checking on stl::vector and compares its three common access operators. Visual Studio 2008 has STL bounds checking enabled by default in Release build while VS2010 and newer have it disabled. The removal of bounds checking in Release builds significantly improves performance.

The Update Performance test measures the performance to read and write a value in a container object.
The measurments are collected on four different containers:

The Read Access test measures read access only and use 12x more iterations than Update test.
The measurements are collected on three different access methods:

Array Performance Details

Code compiled with default Release optimization

Elapsed #ticks	Ratio	Container Object	Bounds Checking	Compiler /Target
Update Performance obj[i]=obj[i] + n
7,598	8.2	array[i]	Yes	VS2008/x64
7,614	8.3	array[i]	Yes	VS2008/x86
925	1.0	array[i]	No	VS2010/x64
1,648	1.8	array[i]	No	VS2010/x86
939	1.0	array[i]	No	VS2013/x64
1,652	1.8	array[i]	No	VS2013/x86

15,997	17.4	DataVector[i]	No	VS2008/x64
20,378	22.1	DataVector[i]	No	VS2008/x86
925	1.0	DataVector[i]	No	VS2010/x64
1,888	2.0	DataVector[i]	No	VS2010/x86
942	1.0	DataVector[i]	No	VS2013/x64
1,767	1.9	DataVector[i]	No	VS2013/x86

28,240	30.6	std::vector[i]	Yes	VS2008/x64
38,503	41.8	std::vector[i]	Yes	VS2008/x86
936	1.0	std::vector[i]	No	VS2010/x64
1,915	2.1	std::vector[i]	No	VS2010/x86
943	1.0	std::vector[i]	No	VS2013/x64
1,804	2.0	std::vector[i]	No	VS2013/x86

21,249	23.0	boost::vector[i]	Yes	VS2008/x64
33,801	36.7	boost::vector[i]	Yes	VS2008/x86
923	1.0	boost::vector[i]	No	VS2010/x64
1,839	2.0	boost::vector[i]	No	VS2010/x86
922	1.0	boost::vector[i]	No	VS2013/x64
1,978	2.1	boost::vector[i]	No	VS2013/x86

Elapsed #Ticks	Ratio	Container Object	Bounds Checking	Compiler /Target
std::vector Read Access
21,625	15.5	std::vector[i]	Yes	VS2008/x64
26,443	18.9	std::vector[i]	Yes	VS2008/x86
1,525	1.1	std::vector[i]	No	VS2010/x64
1,611	1.2	std::vector[i]	No	VS2010/x86
1,522	1.1	std::vector[i]	No	VS2013/x64
1,397	1.0	std::vector[i]	No	VS2013/x86

196,639	140.8	std::vector.at(i)	Yes	VS2008/x64
105,111	75.2	std::vector.at(i)	Yes	VS2008/x86
1,914	1.4	std::vector.at(i)	Yes	VS2010/x64
1,846	1.3	std::vector.at(i)	Yes	VS2010/x86
1,914	1.4	std::vector.at(i)	Yes	VS2013/x64
1,765	1.3	std::vector.at(i)	Yes	VS2013/x86

48,673	34.8	std::vector::iterator	Yes	VS2008/x64
56,947	40.8	std::vector::iterator	Yes	VS2008/x86
1,863	1.3	std::vector::iterator	No	VS2010/x64
1,767	1.3	std::vector::iterator	No	VS2010/x86
1,870	1.3	std::vector::iterator	No	VS2013/x64
1,987	1.4	std::vector::iterator	No	VS2013/x86

Elapsed #Ticks	Ratio	Container Object	Bounds Checking	Compiler /Target
boost::vector Read Access
15,753	11.3	boost::vector[i]	No	VS2008/x64
23,315	16.8	boost::vector[i]	No	VS2008/x86
1,520	1.1	boost::vector[i]	No	VS2010/x64
1,607	1.2	boost::vector[i]	No	VS2010/x86
1,521	1.1	boost::vector[i]	No	VS2013/x64
1,390	1.0	boost::vector[i]	No	VS2013/x86

19,151	13.8	boost::vector.at(i)	Yes	VS2008/x64
30,230	21.7	boost::vector.at(i)	Yes	VS2008/x86
1,912	1.4	boost::vector.at(i)	Yes	VS2010/x64
1,756	1.3	boost::vector.at(i)	Yes	VS2010/x86
1,913	1.4	boost::vector.at(i)	Yes	VS2013/x64
1,770	1.3	boost::vector.at(i)	Yes	VS2013/x86

16,863	12.1	boost::vector::iterator	No	VS2008/x64
24,542	17.7	boost::vector::iterator	No	VS2008/x86
1,914	1.4	boost::vector::iterator	No	VS2010/x64
1,766	1.3	boost::vector::iterator	No	VS2010/x86
1,869	1.3	boost::vector::iterator	No	VS2013/x64
1,987	1.4	boost::vector::iterator	No	VS2013/x86

Array Performance Conclusion

Contaners

Test Engine

Ordered Test Engine:

///////////////////////////////////////////////////////////////////////////////
/// Exercise ordered containers 
///
/// 1. Add words to container until it has 'groupSize' elements.
/// 2. Scan remainder of words counting duplicate.
/// 3. When scanning sLookupScale * groupSize words, clear list and goto #1

template <typename Container_t>
size_t Test(const Words& words, size_t groupSize, const char* objName, size_t& mallocCnt)
{
    Container_t container;

    size_t lookupCnt = 0;
    size_t cnt = words.size();

    // One shot pre-allocation value use to allocate next set of data nodes.
    // Afterwards it goes back to sAllocRate which is 16.
    container.get_allocator().SetPrealloc(groupSize);

    typename Container_t::iterator found;
    for (size_t idx = 0; idx < cnt; idx++)
    {
        const std::string& word = words[idx];
        if (container.size() < groupSize)
        {
            // Fill - find dup and count, else add new word.
            found = container.find(word);
            if (found != container.end())
                found->second++;    // count duplicates
            else
                container.insert(PairSI(words[idx], 1));
        }
        else
        {
            // Find dup and count, else ignore.
            found = container.find(word);
            if (found != container.end())
                found->second++;    // only count duplicates

            // Occassionally empty list and repeat fill cycle.
            if (++lookupCnt == groupSize * sLookupScale)
            {
                lookupCnt = 0;
                container.clear();
            }
        }
    }

    mallocCnt = container.get_allocator().GetAllocCnt();

    // The following is diagnostic code to sum word count and return its value
    // to compare to other containers to make sure all instances return the same value.
    size_t total = 0;
    for (typename Container_t::const_iterator iter = container.begin(); iter != container.end(); ++iter)
    {
        total += iter->second;
    }

    return total;
}

Custon Allocator

The custom allocate is geared towards node containers such as map. If a multiple element request is made, it by-passes the custom logic and falls into malloc and free adding no benefit. Single element requests will either provide a response from the free list or allocate a group of objects using malloc, returning one of them and placing the remainder on the free list. If the container never erases elements and only grows, there will be only a small improvement due to the chunking of allocations (16 per call to malloc). The biggest gains with this allocator is achieved when the free-list provides a quick response when the client balances calls to insert with erase.

The Test Engine above has been geared towards doing just this, it periodically clears and refills the container.

The custom allocator uses a reference counted SharedInfo object to track its private data. This shared object allows containers like map to create multiple allocator instances of different sizes and still provide statistics to the client code and cleanup correctly. Map can create 3 different allocators which come and go independent of their allocation requests. Meaning the state information could be lost by the time you are done using the Map if it were not shared between the allocators.

Even with the allocator's shared information, it is still associated with a single container. Meaning it is not a global pool used across multiple STL containers. This custom allocator has no thread protection. STL containers are are not thread safe.

Update: Sept, 15th 2013 - Fixed memory leak
Thank you to Marian Kechlibar for detecting a memory leak using valgrind and subsequent verification the leak was fixed.

///////////////////////////////////////////////////////////////////////////////
// Interface for Allocator
class  IAlloc
{
public:
    virtual size_t GetAllocCnt() const = 0;

    typedef size_t            size_type;

    // Optionally 
    // One shot pre-allocation value use to allocate next set of data nodes.
    // Afterwards it goes back to sAllocRate which is 16.
    virtual void SetPrealloc(size_type numElements)
    { }
};

///////////////////////////////////////////////////////////////////////////////
// Custom allocator - groups allocation in chunks of 16.
// Recyles deleted nodes by placing them on free list and using them for allocations.
template <class TT>
class PoolAlloc : public IAlloc
{
    // Rate to allocate extra objects.
    static const unsigned sAllocRate = 16;

public:
    typedef size_t          size_type;
    typedef ptrdiff_t       difference_type;
    typedef TT*             pointer;
    typedef const TT*       const_pointer;
    typedef TT&             reference;
    typedef const TT&       const_reference;
    typedef TT              value_type;   

    template <class UU>    
    struct rebind
    {
        typedef PoolAlloc<UU> other;
    };

    // Wrapper to add singly link list pointers to track next 
    // free object and allocated pointers.
    class Wrapper
    {
    public:
        Wrapper() : pNext(0), pAlloc(0)
        { }

        Wrapper*    pNext;
        Wrapper*    pAlloc;

        // Place client data at end since its size can vary becaue
        // std::map has several internal lists of varying structure.
#if 1
        char        value[sizeof(TT)];  // reserve size for TT, avoid its constructor
#else
        TT          value;   // Safer, but slower
#endif
    };


    // Allocator shared state information.
    class SharedInfo
    {
    public:
        SharedInfo() : allocCnt(0), numToAlloc(sAllocRate), refCnt(1), pAlloc(0)
        { }

        ~SharedInfo()
        { FreeList(); }

        size_t   allocCnt;          // DEBUG - track number of allocations
        size_t   numToAlloc;        // # of allocations in next malloc, used to preallocate.
        int      refCnt;            // Reference counter, std::map has multiple internal list which share this info.
        Wrapper* pAlloc;            // List of all allocations

        void FreeList()
        {
            // Free any allocations (follow pAlloc link list)
            while (pAlloc != NULL)
            {
                Wrapper* pNext = pAlloc->pAlloc;
                free(pAlloc);
                pAlloc = pNext;
            }
        }
    };

    SharedInfo* pSharedInfo;
    Wrapper*    pFree;

    PoolAlloc() throw() : pSharedInfo(new SharedInfo()), pFree(0)
    { }

    PoolAlloc(const PoolAlloc& other) : pSharedInfo(other.pSharedInfo), pFree(0)
    { pSharedInfo->refCnt++; }

    template <class U> 
    PoolAlloc(const PoolAlloc<U>& other) : pSharedInfo((SharedInfo*)other.pSharedInfo), pFree(0)
    { pSharedInfo->refCnt++;}

    bool operator==(const PoolAlloc& other) 
    { return pSharedInfo == other.pSharedInfo; } 

    ~PoolAlloc () throw()
    { if (--pSharedInfo->refCnt == 0) delete pSharedInfo;  }

    // Return number of calls to malloc executed so far.
    size_t GetAllocCnt() const
    {  return pSharedInfo->allocCnt; }

    // One shot pre-allocation value use to allocate next set of data nodes.
    // Afterwards it goes back to sAllocRate which is 16.
    void SetPrealloc(size_type numToAlloc)
    {
        pSharedInfo->numToAlloc = numToAlloc;
    }

    // Pre-allocate nodes on free-list.
    void Preallocate(size_type numElements)
    {
        pSharedInfo->allocCnt++;

        // malloc faster than new[], avoids constructors
        Wrapper* pWrapper = (Wrapper*)malloc(numElements * sizeof(Wrapper));
        pWrapper->pNext = 0;

        // Keep track of allocated memory, wire up link list.
        pWrapper->pAlloc = pSharedInfo->pAlloc;
        pSharedInfo->pAlloc = pWrapper;

        // Split up allocation and store on free list.
        for (unsigned idx = 0; idx != numElements; idx++)
        {
            pWrapper->pNext = pFree;
            pFree = pWrapper;
            pWrapper++;
        }
    }

    pointer address(reference x) const
    { return &x; }

    const_pointer address(const_reference x) const
    { return &x; }

    pointer allocate(size_type numElements, typename PoolAlloc<TT>::const_pointer hint = 0)
    {
        if (numElements != 1)
        {
            // Pass multile element allocation directly to malloc, no special handling.
            pSharedInfo->allocCnt++;
            return static_cast<pointer>(malloc(numElements * sizeof(TT)));
        }

        if (pFree == NULL)
        {
            if (pSharedInfo->allocCnt > 4)
            {
                // Once we have complete basic setup, allocate in larger chunks.
                Preallocate(pSharedInfo->numToAlloc);
                pSharedInfo->numToAlloc = sAllocRate;
            }
            else
            {
                // During setup, allocate in small chunks.
                Preallocate(1);
            }
        }

        if (pFree == NULL)
            return 0;   // or should we throw.

        Wrapper* pWrapper = pFree; 
        pFree = pWrapper->pNext;

        pWrapper->pNext = (Wrapper*)0;

        return (TT*)&pWrapper->value;
    }

    void deallocate(pointer ptr, size_type numElements)
    {
        if (numElements != 1)
        {
            // Pass multi-element allocation directly to free.
            free(ptr);
        }
        else
        {
            // Place memory on free list.
            const size_t sToWrapper = offsetof(Wrapper, value);
            Wrapper* pWrapper = (Wrapper*)((char*)ptr - sToWrapper);
            pWrapper->pNext = pFree;
            pFree = pWrapper;
        }
    }

    size_type max_size() const throw()
    { return size_t(-1) / sizeof(value_type);  }

    void construct(pointer p, const TT& val)
    { new(static_cast<void*>(p)) TT(val);  }

    void construct(pointer p)
    { new(static_cast<void*>(p)) TT(); }

    void destroy(pointer p)
    { p->~TT(); }
};

Custon Allocator Counts Memory Calls (malloc)


///////////////////////////////////////////////////////////////////////////////
// Custom allocator - counts memory allocation calls.
template <class T>
class CountAlloc  
{
public:
    typedef size_t            size_type;
    typedef ptrdiff_t         difference_type;
    typedef T*                pointer;
    typedef const T*          const_pointer;
    typedef T&                reference;
    typedef const T&          const_reference;
    typedef T                 value_type;

    template <class U>
    struct rebind
    {
        typedef CountAlloc<U> other;
    };

    struct SharedInfo
    {
    public:
        SharedInfo() : allocCnt(0), refCnt(1)
        {  }

        size_t allocCnt;
        int    refCnt;
    };
    SharedInfo* pSharedInfo;

    CountAlloc () throw() : pSharedInfo(new SharedInfo())
    {  }

    CountAlloc (const CountAlloc& other) : pSharedInfo(other.pSharedInfo)
    { pSharedInfo->refCnt++; }

    template <class U>
    CountAlloc(const CountAlloc<U>& other) : pSharedInfo((SharedInfo*)other.pSharedInfo)
    { pSharedInfo->refCnt++;}

    bool CountAlloc::operator==(const CountAlloc& other)
    { return this == &other;  }

    ~CountAlloc ()
    { if (--pSharedInfo->refCnt == 0) delete pSharedInfo;  }

    size_t GetAllocCnt() const
    { return pSharedInfo->allocCnt; }

    pointer address(reference x) const
    { return &x; }

    const_pointer address(const_reference x) const
    { return &x; }

    pointer allocate(size_type size, typename CountAlloc<T>::const_pointer hint = 0)
    {
        pSharedInfo->allocCnt++;
        return static_cast<pointer>(malloc(size * sizeof(T)));
    }

    void deallocate(pointer p, size_type n)
    {
        free(p);
        return;
    }

    size_type max_size() const throw()
    { return size_t(-1) / sizeof(value_type); }
    void construct(pointer p, const T& val)
    { new(static_cast<void*>(p)) T(val); }
    void construct(pointer p)
    { new(static_cast<void*>(p)) T(); }
    void destroy(pointer p)
    { p->~T(); }
};

Test Results

Group Size is the maximum size the container is allowed to grow during the process of counting duplicate words. The test input contains 1409 unique words, so any container smaller than this will fail to count some words. The larger the Group Size the larger the container and the harder it works to find duplicates.

The graph and report plot the ratio of the elapsed time of the test divided by the elapsed time of the fastest test.
Smaller ratio values are faster, with 1.00 the fastest. A ratio value of 2 is twice as slow as the fastest.

The following results are with 4 background threads running the same test in parallel as the foreground test. The background threads are used to stress the memory sub-system to make it more expensive to allocate memory. The tests are repeated 4 times and averaged. The test are built using Visual Studio 2010 and 2013 and run on Windows 7/x64.

Tabular Test Results

Microsoft Standard Template Library
Performance (1.00=fastest) - Ratio of elapsedTicks/fastestTicks(15,634,428):

Memory - Number of Memory Allocations called
HasMpAlloc and MapAlloc use custom allocator which reduces memory allocations.

Boost 1.57 Container Performance
Performance (1.00=fastest) - Ratio of elapsedTicks/fastestTicks(29,787,329):

	Array Performance	Download Source code (Apr 2011): perf-array-src.zip
	Heap, Map, MultiMap Commparison	Download Source code (Jan 2015): perf-hash-map-src.zip
	VS2008, VS2010, VS2013	Compiled code excution speed comparison

Update Performance:	container[i] = container[i] + n;
Read Access:	total += container[i];

Hash Map	stdext::hash_map<std::string, int>
Map	std::map<std::string, int>
Hash MultiMap	stdext::hash_multimap<std::string, int>
Multimap	std::multimap<std::string, int>
Vector	std::vector< std::pair<std::string, int> >
Deque	std::deque< std::pair<std::string, int> >

HashAlloc	stdext::hash_map with custom allocator
MapAlloc	std::map with custom allocator

Map	boost::container::map<std::string, int>
Flat Map	boost::container::flat_map<std::string, int>
Multimap	boost::container::multimap<std::string, int>
Vector	boost::container::vector< std::pair<std::string, int> >
Deque	boost::container::deque< std::pair<std::string, int> >