Discussion:
A dictionary class
Anton Shepelev
2017-02-23 20:34:30 UTC
Hello, all

By way of exercise I have written a small dictionary
class and measured its performance in random access,
paying no attention to the insertion and deletion of
elements. My class does not implement ICollection
and has neither memory management nor error
handling.

May I ask you to review my code and indicate how
further to optimise random access, i.e. the indexer?
I have tried passing the structs by value and by
reference, as well as operating on bare array
indices -- only to observe no difference in
performance.

You can download the class, together with a small
test project, here:

https://drive.google.com/file/d/0B2hVk2IFUZDGcHFGbFpIOEp6OVk/view?usp=sharing

The principal source file I reproduce below:


using System;
using System.Collections.Generic;

namespace AntCols
{
struct TDictRec<TKey, TVal>
{ public TKey Key;
public TVal Value;
public AntListNode<ListEntry> ListElem; // position in the array
}

class ListEntry
{ public int Pos;
public ListEntry( int pos )
{ Pos = pos; }
}

enum ProbeRes { Done, More, Fail };

class DictEnumerator<TKey, TVal>: IEnumerator< KeyValuePair<TKey, TVal> >
{ AntDict<TKey, TVal> dict;
IEnumerator<ListEntry> list_enum;

public void Reset()
{ list_enum.Reset(); }

public bool MoveNext()
{ return list_enum.MoveNext(); }

public KeyValuePair<TKey, TVal> Current
{ get
{ int pos = list_enum.Current.Pos;
TKey key = dict.Data[ pos ].Key;
TVal val = dict.Data[ pos ].Value;
KeyValuePair<TKey, TVal> pr =
new KeyValuePair< TKey, TVal >( key, val );
return pr;
}
}

object System.Collections.IEnumerator.Current
{ get
{ return Current; }
}

public void Dispose()
{
// TODO
}

public DictEnumerator( AntDict< TKey, TVal > dict )
{ this.dict = dict;
this.list_enum = dict.List.GetEnumerator();
}
}

public class AntDict< TKey, TVal >: IEnumerable<KeyValuePair<TKey, TVal>>
{ // (stored index in Data)
internal AntList<ListEntry> List = new AntList<ListEntry>();
internal TDictRec<TKey, TVal>[] Data;
int Capacity;
int Count = 0;

public IEnumerator<KeyValuePair<TKey, TVal>> GetEnumerator()
{ return new DictEnumerator< TKey, TVal> ( this ); }

System.Collections.IEnumerator
System.Collections.IEnumerable.GetEnumerator()
{ return GetEnumerator(); }

public TVal this[ TKey key ]
{ get
{ TVal value;
Find( key, out value );
return value;
//return Data[ Find( key ) ].Value;
}
}

public void Add( TKey key, TVal value )
{ if( Count > Capacity / 3 )
{ IncCapacity(); }
Count++;
int empty = FindEmpty( key );
if( empty == -1 )
{ throw new Exception( "No room" ); }
AntListNode<ListEntry> list_elem =
List.Append( new ListEntry( empty ) );
PutEntry( key, value, list_elem, ref Data[empty] );
}

public void Remove( TKey key )
{ TVal dummy;
Count--;
int pos = Find( key, out dummy );
List.Remove( Data[pos].ListElem );
Data[pos].ListElem = null; // TODO: Clear it properly
}

void PutEntry
( TKey key, TVal value, AntListNode<ListEntry> list_elem,
ref TDictRec<TKey, TVal> record
)
{ record.Key = key;
record.Value = value;
record.ListElem = list_elem;
}

delegate ProbeRes ProbeCrit
( ref TDictRec<TKey, TVal> rec, TKey compare );
ProbeCrit CritEqual, CritEmpty;

ProbeRes IsEmpty( ref TDictRec<TKey, TVal> rec, TKey compare )
{ if( rec.ListElem == null )
{ return ProbeRes.Done; }
else
{ return ProbeRes.More; }
}

ProbeRes IsEqual( ref TDictRec<TKey, TVal> rec, TKey compare )
{ if( IsEmpty( ref rec, compare ) == ProbeRes.Done )
{ return ProbeRes.Fail; }
if( compare.Equals( rec.Key ) )
{ return ProbeRes.Done; }
return ProbeRes.More;
}

int Find( TKey key, out TVal value )
{ return Probe( CritEqual, key, out value ); }

int FindEmpty( TKey key )
{ TVal dummy;
return Probe( CritEmpty, key, out dummy );
}

public AntDict()
{ Capacity = 1;
IncCapacity();
CritEqual = new ProbeCrit( IsEqual );
CritEmpty = new ProbeCrit( IsEmpty );
}

void NextIndex( ref int index, ref int iter )
{ iter++;
index += iter;
if( index >= Capacity )
{ index = index % Capacity; }
}

int ProbeTest0( ProbeCrit crit )
{ return 0;
}

int Probe( ProbeCrit crit, TKey key, out TVal value )
{ // HACK: how to use the entire hash value?
int hash = key.GetHashCode() & 0x7FFFFFFF % Capacity;
int pos = hash;
int iter = 0;
int res = -1;
TDictRec<TKey, TVal> rec;
while( true )
{ rec = Data[pos];
switch( crit( ref rec, key ) )
{ case ProbeRes.Fail: value = rec.Value; return res;
case ProbeRes.Done: value = rec.Value; return pos;
}
NextIndex( ref pos, ref iter );
}
}

void IncCapacity()
{ TDictRec<TKey, TVal>[] old_data = Data;
Capacity *= 2;
Data = new TDictRec<TKey, TVal>[ Capacity ];
if( old_data != null )
{ Resize( old_data );
}
}

void Resize( TDictRec<TKey, TVal>[] old_data )
{ int pos;
TDictRec<TKey, TVal> record;
foreach( ListEntry entry in List )
{ record = old_data[ entry.Pos ];
pos = FindEmpty( record.Key );
Data[pos] = record;
entry.Pos = pos;
}
}
}
}
--
() ascii ribbon campaign - against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]
Marcel Mueller
2017-02-23 21:45:43 UTC
Post by Anton Shepelev
May I ask you to review my code and indicate how
further to optimise random access, i.e. the indexer?
I have tried passing the structs by value and by
reference, as well as operating on bare array
indices -- only to observe no difference in
performance.
The performance is probably limited by other faux pas. See below.
Post by Anton Shepelev
using System;
using System.Collections.Generic;
namespace AntCols
{
struct TDictRec<TKey, TVal>
{ public TKey Key;
public TVal Value;
public AntListNode<ListEntry> ListElem; // position in the array
}
The performance advantage of the struct might vanish if AntListNode
happens to be a class.
Post by Anton Shepelev
class ListEntry
{ public int Pos;
public ListEntry( int pos )
{ Pos = pos; }
}
Why do you use a class for the trivial type int?
This is counterproductive. It causes a large number of very small
object allocations for nothing.
Use a struct or eliminate the type entirely.
Post by Anton Shepelev
class DictEnumerator<TKey, TVal>: IEnumerator< KeyValuePair<TKey, TVal> >
[...]

By the way: the internal types of your implementation should not be
exposed in the namespace unless they are used by other classes of your
class library as well. Consider using private nested types.
Post by Anton Shepelev
public class AntDict< TKey, TVal >: IEnumerable<KeyValuePair<TKey, TVal>>
{ // (stored index in Data)
internal AntList<ListEntry> List = new AntList<ListEntry>();
internal TDictRec<TKey, TVal>[] Data;
When the above helper types are nested types, you no longer need to
declare these members as internal.
Post by Anton Shepelev
int Capacity;
int Count = 0;
This is superfluous. List.Count does the job as well.
Post by Anton Shepelev
public IEnumerator<KeyValuePair<TKey, TVal>> GetEnumerator()
{ return new DictEnumerator< TKey, TVal> ( this ); }
System.Collections.IEnumerator
System.Collections.IEnumerable.GetEnumerator()
{ return GetEnumerator(); }
public TVal this[ TKey key ]
{ get
{ TVal value;
Find( key, out value );
return value;
//return Data[ Find( key ) ].Value;
}
}
public void Add( TKey key, TVal value )
{ if( Count > Capacity / 3 )
{ IncCapacity(); }
Auto-shrink could be counterproductive for objects whose size changes
often and by large amounts.
Post by Anton Shepelev
Count++;
int empty = FindEmpty( key );
if( empty == -1 )
{ throw new Exception( "No room" ); }
Err, this should be a job for IncCapacity.
Post by Anton Shepelev
AntListNode<ListEntry> list_elem =
List.Append( new ListEntry( empty ) );
PutEntry( key, value, list_elem, ref Data[empty] );
}
public void Remove( TKey key )
{ TVal dummy;
Count--;
int pos = Find( key, out dummy );
List.Remove( Data[pos].ListElem );
What is the performance of List.Remove? If it is not O(1) - probably -
this could degrade the entire performance.
Post by Anton Shepelev
Data[pos].ListElem = null; // TODO: Clear it properly
}
void PutEntry
( TKey key, TVal value, AntListNode<ListEntry> list_elem,
ref TDictRec<TKey, TVal> record
)
{ record.Key = key;
record.Value = value;
record.ListElem = list_elem;
}
Again a static function w/o static.
By the way it is misplaced. It should be up to the TDictRec constructor
to initialize its contents.
Post by Anton Shepelev
delegate ProbeRes ProbeCrit
( ref TDictRec<TKey, TVal> rec, TKey compare );
ProbeCrit CritEqual, CritEmpty;
The compare functions are side effect free and so static by design.
You should not allocate them anew for each class instance.
Post by Anton Shepelev
ProbeRes IsEmpty( ref TDictRec<TKey, TVal> rec, TKey compare )
The static function misses a 'static' declaration.
Post by Anton Shepelev
{ if( rec.ListElem == null )
{ return ProbeRes.Done; }
else
{ return ProbeRes.More; }
}
ProbeRes IsEqual( ref TDictRec<TKey, TVal> rec, TKey compare )
Same here.
Post by Anton Shepelev
{ if( IsEmpty( ref rec, compare ) == ProbeRes.Done )
{ return ProbeRes.Fail; }
This criterion is not reliable. You might hit an entry with a different
hash code. In this case or in case your key is not found you may end up
in an infinite loop at Probe.
Post by Anton Shepelev
if( compare.Equals( rec.Key ) )
{ return ProbeRes.Done; }
return ProbeRes.More;
}
int Find( TKey key, out TVal value )
{ return Probe( CritEqual, key, out value ); }
int FindEmpty( TKey key )
{ TVal dummy;
return Probe( CritEmpty, key, out dummy );
}
public AntDict()
{ Capacity = 1;
An empty dictionary should be empty and avoid further allocations.
Post by Anton Shepelev
IncCapacity();
For whatever reason you increase the capacity here. The dictionary is
still empty.
Post by Anton Shepelev
CritEqual = new ProbeCrit( IsEqual );
CritEmpty = new ProbeCrit( IsEmpty );
}
void NextIndex( ref int index, ref int iter )
{ iter++;
index += iter;
For whatever reason you use an effectively quadratic function in case
of hash collisions here. Normally there is no need for that. Keep in
mind that the collision algorithm must touch *all* slots before it
returns to the starting place. Can you ensure this in conjunction with
the modulus operation below? Probably not.
There are smarter collision handlings by the way.
Post by Anton Shepelev
if( index >= Capacity )
{ index = index % Capacity; }
}
int ProbeTest0( ProbeCrit crit )
{ return 0;
}
int Probe( ProbeCrit crit, TKey key, out TVal value )
{ // HACK: how to use the entire hash value?
Use uint.
Post by Anton Shepelev
int hash = key.GetHashCode() & 0x7FFFFFFF % Capacity;
GetHashCode could be quite expensive. Most Dictionary implementations
cache the hash codes of objects. This is especially important in case
of resize operations, which require all hash values to be available
for new placement.
Post by Anton Shepelev
int pos = hash;
int iter = 0;
int res = -1;
TDictRec<TKey, TVal> rec;
while( true )
{ rec = Data[pos];
switch( crit( ref rec, key ) )
{ case ProbeRes.Fail: value = rec.Value; return res;
This algorithm will result in a full table scan, or even worse, in
case the key is not in the dictionary. It does not stop searching for
a hit once all matching hash codes have been processed.
Post by Anton Shepelev
case ProbeRes.Done: value = rec.Value; return pos;
}
NextIndex( ref pos, ref iter );
}
}
void IncCapacity()
{ TDictRec<TKey, TVal>[] old_data = Data;
Capacity *= 2;
The size of a hash table should be a prime number. At least it should
never be even as this would discard bits from the hash code at the
modulus operation.
Post by Anton Shepelev
Data = new TDictRec<TKey, TVal>[ Capacity ];
if( old_data != null )
{ Resize( old_data );
}
}
void Resize( TDictRec<TKey, TVal>[] old_data )
{ int pos;
TDictRec<TKey, TVal> record;
foreach( ListEntry entry in List )
{ record = old_data[ entry.Pos ];
pos = FindEmpty( record.Key );
Data[pos] = record;
entry.Pos = pos;
}
}
}
}
No idea about the other, missing classes. (Google drive is for members
only.)


Hint: Have a look at the reference source of the .NET Dictionary<,>
class. The implementation is quite sophisticated in many respects.
(Even though it contains a hack against DoS attacks on the web server.)


Marcel
Anton Shepelev
2017-02-24 13:58:32 UTC
May I ask you to review my code and indicate how
further to optimise random access, i.e. the
indexer? I have tried passing the structs by
value and by reference, as well as operating on
bare array indices -- only to observe no
difference in performance.
The performance is probably limited by other faux
pas. See below.
For whatever it is worth, random access to my
dictionary seems to be only 3-5% slower than to
System.Collections.Generic.Dictionary.
struct TDictRec<TKey, TVal>
{ public TKey Key;
public TVal Value;
public AntListNode<ListEntry> ListElem; // position in the array
}
The performance advantage of the struct might
vanish if AntListNode happens to be a class.
Why? It is a class, from my doubly linked generic
list. This causes the struct to contain only a
pointer to a list node. As I wrote above, I was
focused on optimising access by key -- which does
not involve the list at all -- rather than insertion
or deletion, which do work with the list and are not
very good in my implementation because I do not
cache the hashes, as you suggest.
class ListEntry
{ public int Pos;
public ListEntry( int pos )
{ Pos = pos; }
}
Why do you use a class for the trivial type int?
This is counterproductive. It causes a large
number of very small object allocations for
nothing. Use a struct or eliminate the type entirely.
My mistake due to obsolete overgeneralization.
Initially I intended to store more in it but have
ended up with a single int. I shall replace it with
the bare int.
By the way: the internal types of your
implementation should not be exposed in the
namespace unless they are used by other classes of
your class library as well. Consider using
private nested types.
I declared them in the namespace merely because I
found code with nesting too cumbersome to look at.
Whereas "internal" classes are not visible to the
actual user from other assemblies, I thought that
should not be a problem.
int Capacity;
int Count = 0;
This is superfluous. List.Count does the job as
well.
My generic list has no Count property. It is
currently a bare IEnumerable<T>.
public void Add( TKey key, TVal value )
{ if( Count > Capacity / 3 )
{ IncCapacity(); }
Auto-shrink could be counterproductive for objects
whose size changes often and by large amounts.
There is no autoshrink here, only auto-grow.
[the Add() method continued]
Count++;
int empty = FindEmpty( key );
if( empty == -1 )
{ throw new Exception( "No room" ); }
Err, this should be a job for IncCapacity.
IncCapacity() makes certain that the capacity is at
least three times the number of elements, in which
case an empty cell shall always be found in theory.
If it is not, then there is an error in the code.
The dictionary must look for an empty cell at every
insertion, therefore this is part of the Add()
method. I cannot move it to IncCapacity() because
the latter is not called at every insertion.
public void Remove( TKey key )
{ TVal dummy;
Count--;
int pos = Find( key, out dummy );
List.Remove( Data[pos].ListElem );
Data[pos].ListElem = null; // TODO: Clear it properly
}
What is the performance of List.Remove?
If it is not O(1) - probably -
this could degrade the entire performance.
I have taken care of that. It indeed is O(1), for
it is a doubly linked list and .ListElem its node.
No lookups are required.
void PutEntry
( TKey key, TVal value, AntListNode<ListEntry> list_elem,
ref TDictRec<TKey, TVal> record
)
{ record.Key = key;
record.Value = value;
record.ListElem = list_elem;
}
Again a static function w/o static.
Agreed.
By the way it is misplaced. It should be up to
the TDictRec constructor to initialize its con-
tents.
In order to pass less data through the stack, I have
preferred by-reference (should be out instead of
ref) access over a constructor.
delegate ProbeRes ProbeCrit
( ref TDictRec<TKey, TVal> rec, TKey compare );
ProbeCrit CritEqual, CritEmpty;
The compare functions are side effect free and so
static by design. You should not allocate them
anew for each class instance.
I thought the presence of a generic type template
required it. Fixed now:

static ProbeCrit CritEqual = new ProbeCrit( IsEqual );
static ProbeCrit CritEmpty = new ProbeCrit( IsEmpty );

IsEqual() and IsEmpty() are also static now as you
recommended.
ProbeRes IsEqual( ref TDictRec<TKey, TVal> rec, TKey compare )
{ if( IsEmpty( ref rec, compare ) == ProbeRes.Done )
{ return ProbeRes.Fail; }
if( compare.Equals( rec.Key ) )
{ return ProbeRes.Done; }
return ProbeRes.More;
}
This criterion is not reliable. You might hit an
entry with a different hash code. In this case or
in case your key is not found you may end up in an
infinite loop at Probe.
My dictionary always has empty elements by
construction, therefore lookup for an absent key
will fail when the first empty cell is hit. If the
hash codes are different then the elements are
different as well and the Equals() check will fail.
I don't understand your point here. Can you provide
a simple test case that will cause an endless loop?
public AntDict()
{ Capacity = 1;
An empty dictionary should be empty and avoid
further allocations.
[AntDict() continued]
IncCapacity();
For whatever reason you increase the capacity
here. The dictionary is still empty.
To avoid a special case and therefore an extra
branch operator at insertion and lookup.
void NextIndex( ref int index, ref int iter )
{ iter++;
index += iter;
if( index >= Capacity )
{ index = index % Capacity; }
}
For whatever reason you use an effectively
quadratic function in case of hash collisions here.
Normally there is no need for that. Keep in mind
that the collision algorithm must touch *all* slots
before it returns to the starting place. Can you
ensure this in conjunction with the modulus
operation below? Probably not.
Yes, I can. It is a well-known property of
triangular numbers that t mod 2^n takes every value
from 0 to 2^n-1 as t runs through the first 2^n
triangular numbers.
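This coverage property is easy to check empirically. Below is a minimal
standalone sketch (the class and method names are illustrative, not from
the posted code); it uses the same probing scheme as NextIndex() above:

```csharp
using System;
using System.Linq;

static class TriangularProbe
{
    // Returns true when quadratic probing with triangular-number
    // offsets (index += ++step, modulo capacity) visits every slot.
    public static bool CoversAllSlots(int capacity)
    {
        var visited = new bool[capacity];
        int index = 0, step = 0;
        for (int k = 0; k < capacity; k++)
        {
            visited[index] = true;
            step++;                            // same scheme as NextIndex():
            index = (index + step) % capacity; // index += ++iter, then wrap
        }
        return visited.All(v => v);
    }

    static void Main()
    {
        Console.WriteLine(CoversAllSlots(16)); // True:  power of two
        Console.WriteLine(CoversAllSlots(12)); // False: non-power-of-two
    }
}
```

The second call illustrates why the scheme ties the capacity to powers
of two: for other capacities the triangular offsets revisit some slots
and miss others.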
There are smarter collision handlings by the way.
Which ones? Bucketing? Linear probing?
// HACK: how to use the entire hash value?
int hash = key.GetHashCode() & 0x7FFFFFFF % Capacity;
Use uint.
uint and an extra subtraction for each hash
calculation. I wish the standard GetHashCode()
returned a uint.
GetHashCode() could be quite expensive. Most
Dictionary implementations cache the hash codes of
objects. This is especially important in case of
resize operations, which require all hash values to
be available for new placement.
Good idea, I have not thought about that.
[Probe() method continued]
int pos = hash;
int iter = 0;
int res = -1;
TDictRec<TKey, TVal> rec;
while( true )
{ rec = Data[pos];
switch( crit( ref rec, key ) )
{ case ProbeRes.Fail: value = rec.Value; return res;
case ProbeRes.Done: value = rec.Value; return pos;
}
NextIndex( ref pos, ref iter );
}
This algorithm will result in a full table scan, or
even worse, in case the key is not in the
dictionary. It does not stop searching for a hit
once all matching hash codes have been processed.
With crit = CritEqual it will stop at the first
empty entry, as I wrote above. If you disagree,
then please give me a test case using the following
Contains() method:

public bool Contains( TKey key )
{ TVal value;
return Find( key, out value ) != -1;
}
The size of a hash table should be a prime number.
At least it should never be even, as this would
discard bits from the hash code at the modulus
operation.
This depends on the hashing algorithm. With
quadratic probing, which I use, it must be a power
of two.
No idea about the other, missing classes. (Google
drive is for members only.)
Nope, my link works for anonymous users. I have
just tested it without logging in to Google.
Anton Shepelev
2017-02-24 14:08:54 UTC
Post by Anton Shepelev
delegate ProbeRes ProbeCrit
( ref TDictRec<TKey, TVal> rec, TKey compare );
ProbeCrit CritEqual, CritEmpty;
The compare functions are side effect free and
so static by design. You should not allocate
them anew for each class instance.
I thought the presence of a generic type template
required it. Fixed now:
static ProbeCrit CritEqual = new ProbeCrit( IsEqual );
static ProbeCrit CritEmpty = new ProbeCrit( IsEmpty );
IsEqual() and IsEmpty() are also static now as you
recommended.
Strange as it may be, static CritEqual and CritEmpty
work slower than dynamic ones.
Marcel Mueller
2017-02-24 21:04:08 UTC
Post by Anton Shepelev
The performance advantage of the struct might
vanish if AntListNode happens to be a class.
Why? It is a class, from my doubly linked generic
list.
It was not that obvious that it is a doubly linked list, since List<T>
in .NET is an array (unlike what the name suggests). In fact the
entire container class library is full of inconsistent names.
Post by Anton Shepelev
I declared them in the namespace merely because I
found code with nesting too cumbersome to look at.
I use partial classes and separate files in such cases.
Post by Anton Shepelev
Post by Anton Shepelev
public void Add( TKey key, TVal value )
{ if( Count > Capacity / 3 )
{ IncCapacity(); }
Auto shrink could be counterproductive to objects
that change the size often largely.
There is no autoshrink here, only auto-grow.
Sorry, I read too fast. Forget about it.

But it is unusual that a hash table needs to have only a 30% fill
factor. 70% is usually no problem. The .NET implementation works
quite reasonably up to 100%.
Post by Anton Shepelev
IncCapacity() makes certain that the capacity is at
least three times the number of elements, in which
case an empty cell shall always be found in theory.
If it is not, then there is an error in the code.
OK, so the "throw" is more like an assertion.
Post by Anton Shepelev
By the way it is misplaced. It should be up to
the TDictRec constructor to initialize its con-
tents.
In order to pass less data through the stack, I have
preferred by-reference (should be out instead of
ref) access to a constructor.
struct constructors do not really create much code after the JIT has
done its work. In fact any struct member function takes the this pointer
as reference. Otherwise mutable struct functions would not work at all,
because they would operate on local copies.
Post by Anton Shepelev
This criterion is not reliable. You might hit an
entry with a different hash code. In this case or
in case your key is not found you may end up in an
infinite loop at Probe.
My dictionary always has empty elements by
construction, therefore lookup for an absent key
will fail when the first empty cell is hit.
I did not catch that you only operate at 30% fill factor. In this case
empty cells are "everywhere".
Post by Anton Shepelev
Post by Anton Shepelev
[AntDict() continued]
IncCapacity();
For whatever reason you increase the capacity
here. The dictionary is still empty.
To avoid a special case and therefore an extra
branch operator at insertion and lookup.
OK.
Post by Anton Shepelev
For whatever reason you use an effectively
quadratic function in case of hash collisions here.
Normally there is no need for that. Keep in mind
that the collision algorithm must touch *all* slots
before it returns to the starting place. Can you
ensure this in conjunction with the modulus
operation below? Probably not.
Yes, I can. It is a well-known property of
triangular numbers that t mod 2^n takes every value
from 0 to 2^n-1 as t runs through the first 2^n
triangular numbers.
OK, your container does not support arbitrary capacities.
Post by Anton Shepelev
There are smarter collision handlings by the way.
Which ones? Bucketing? Linear probing?
Linked lists. This /could/ be implemented by linear probing.
Post by Anton Shepelev
Post by Anton Shepelev
// HACK: how to use the entire hash value?
int hash = key.GetHashCode() & 0x7FFFFFFF % Capacity;
Use uint.
uint and an extra subtraction for each hash
calculation. I wish the standard GetHashCode()
returned a uint.
? - why not simply cast to uint?
Post by Anton Shepelev
Post by Anton Shepelev
[Probe() method continued]
With crit = CritEqual it will stop at the first
empty entry, as I wrote above.
The same story with the fill factor.
Post by Anton Shepelev
The size of a hash table should be a prime number.
At least it should never be even, as this would
discard bits from the hash code at the modulus
operation.
This depends on the hashing algorithm.
Indeed. If the hash algorithm distributes well over the entire value
domain there is no problem.
But I have seen quite bad hash implementations. E.g. int.GetHashCode
just returns the integer value, as do derived types like enums. And
char.GetHashCode AFAICS just multiplies by 0x10001.
Post by Anton Shepelev
No idea about the other, missing classes. (Google
drive is for members only.)
Nope, my link works for anonymous users. I have
just tested it without logging in Google.
Hmm, I only get an empty page with a login button in the upper right
corner. Probably Google dislikes my browser or security settings.
But I just saw the download symbol left to it which did the job.


Marcel
Anton Shepelev
2017-03-02 21:03:58 UTC
By the way: the internal types of your
implementation should not be exposed in the
namespace unless they are used by other classes
of your class library as well. Consider using
private nested types.
I declared them in the namespace merely because
I found code with nesting too cumbersome to look
at. Whereas "internal" classes are not visible
to the actual user from other assemblies, I
thought that should not be a problem.
I use partial classes and separate files in such
cases.
I am of the opinion that it is dangerously similar
to include files in C -- a way to split code in a
non-modular manner.
But it is unusual that a hash table needs to have
only a 30% fill factor.
Yes, but it was required to reach the performance
of the standard .NET dictionary. I will try to
optimise my dictionary so that it works as fast
with a denser filling.
70% is usually no problem. The .NET implementation
works quite reasonably up to 100%.
It has a different architecture, wherein the hash
table contains "buckets", each pointing to a list of
elements. In order to avoid this extra level of
indirection I have chosen a bucket-less hash table,
which contains the elements directly, so that the
cost of collisions is somewhat higher in general and
much higher at high density.

The very meaning of "fill factor" differs between
these architectures. In .NET it is the fraction of
used hashes in the bucket array, whereas in mine it
is the fraction of used cells in the element array.
They cannot be directly compared.

Here are the average numbers of collisions
(secondary lookups) in my implementation for
different fill factors, with my easy case of
uniformly distributed keys:

Fill factor collisions per lookup
1/4 0.16
1/3 0.24
1/2 0.44
2/3 0.78
3/4 1.06
IncCapacity() makes certain that the capacity is
at least three times the number of elements, in
which case an empty cell shall always be found
in theory. If it is not, then there is an error
in the code. The dictionary must look for an
empty cell at every insertion, therefore this is
part of the Add() method.
OK, so the "throw" is more like an assertion.
Yes. That is the major purpose of exceptions in my
code, for I agree with Joel Spolsky:

https://www.joelonsoftware.com/2003/10/13/13/
OK, your container does not support arbitrary
capacities.
Indeed, but whose does, and what meaning has it if
but a private, implementation-specific detail?
There are smarter collision handlings by the
way.
Which ones? Bucketing? Linear probing?
Linked lists. This /could/ be implemented by
linear probing.
Do you mean implementing each bucket as a linked
list? That is the design of the standard .NET
Dictionary. I have preferred the simplicity of a
bucket-less array.
Post by Anton Shepelev
// HACK: how to use the entire hash value?
int hash = key.GetHashCode() & 0x7FFFFFFF % Capacity;
Use uint.
uint and an extra subtraction for each hash
calculation. I wish the standard GetHashCode()
returned a uint.
? - why not simply cast to uint?
How do you mean to do that? The simplest solution
I have found is this:

int ihash = key.GetHashCode();
uint hash;
if( ihash < 0 )
{ hash = uint.MaxValue - (uint)(-ihash); }
else
{ hash = (uint)ihash; }

I don't think a single cast will do it...
But I have seen quite bad hash implementations.
E.g. int.GetHashCode just returns the integer
value, as do derived types like enums.
Can one do any better without relying on
domain-specific information such as a specific
range and distribution of those integers?
Marcel Mueller
2017-03-03 06:49:56 UTC
On 02.03.17 22.03, Anton Shepelev wrote:
[nested classes]
Post by Anton Shepelev
I use partial classes and separate files in such
cases.
I am of the opinion that it is dangerously similar
to include files in C -- a way to split code in a
non-modular manner.
It is basically a language issue. In a more generalized approach there
is no difference between static classes and namespaces. This is not true
for C# but in case of a nested class one could say that the enclosing
class is just like a private implementation namespace. So from my point
of view there is nothing wrong with separate files for nested classes.
The C# syntax for this purpose is partial class - so what?
Post by Anton Shepelev
But it is unusual that a hash table needs to have
only a 30% fill factor.
Yes, but it was required to reach the performance
of the standard .NET dictionary. I will try to
optimise my dictionary so that it works as fast
with a denser filling.
There is probably not much that you can do with a direct-mapped
implementation.
Post by Anton Shepelev
70% is usually no problem. The .NET implementation
works quite reasonably up to 100%.
It has a different architecture, wherein the hash
table contains "buckets", each pointing to a list of
elements. In order to avoid this extra level of
indirection I have chosen a bucket-less hash table,
which contains the elements directly, so that the
cost of collisions is somewhat higher in general and
much higher at high density.
The very meaning of "fill factor" differs between
these architectures. In .NET it is the fraction of
used hashes in the bucket array, whereas in mine it
is the fraction of used cells in the element array.
They cannot be directly compared.
I would not call this another definition of fill factor. But indeed it
is a different design with different properties.

I think the .NET implementation takes this approach because typical
hardware architectures dislike sparse memory usage, first of all because
of read amplification at cache line reads. This can significantly
degrade performance at high system load.
A fill factor of 30% roughly means a factor 3 read amplification. Of
course, this does not apply if the data does not fit into the cache
anyway or if it is rarely accessed. Then the read amplification is
determined by the ratio of the cache line size to the bucket size.
Post by Anton Shepelev
Here are the average numbers of collisions
(secondary lookups) in my implementation for
different fill factors, with my easy case of
uniformly distributed keys:
Fill factor collisions per lookup
1/4 0.16
1/3 0.24
1/2 0.44
2/3 0.78
3/4 1.06
I do not remember the theoretical values but it looks plausible.
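For reference, the textbook uniform-hashing approximation puts the
expected number of extra probes in a successful search at
(1/a)*ln(1/(1-a)) - 1 for fill factor a. A small sketch to compare
against the measured table above (the class name is illustrative; the
formula is the standard estimate, not something from this thread):

```csharp
using System;

static class ProbeEstimate
{
    // Expected extra probes per successful lookup under the
    // uniform-hashing approximation: (1/a) * ln(1/(1-a)) - 1.
    public static double ExtraProbes(double fillFactor)
    {
        return Math.Log(1.0 / (1.0 - fillFactor)) / fillFactor - 1.0;
    }

    static void Main()
    {
        // Same fill factors as the table above.
        foreach (double a in new[] { 1.0/4, 1.0/3, 1.0/2, 2.0/3, 3.0/4 })
            Console.WriteLine($"{a:F2}: {ExtraProbes(a):F2}");
    }
}
```

The estimates come out close to, but somewhat below, the measured
collision counts, which is plausible for triangular-number probing.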
Post by Anton Shepelev
Post by Anton Shepelev
IncCapacity() makes certain that the capacity is
at least three times the number of elements, in
which case an empty cell shall always be found
in theory. If it is not, then there is an error
in the code. The dictionary must look for an
empty cell at every insertion, therefore this is
part of the Add() method.
OK, so the "throw" is more like an assertion.
Yes. That is the major purpose of exceptions in my
code, for I agree with Joel Spolsky:
https://www.joelonsoftware.com/2003/10/13/13/
I do not agree with him. The rationale is something like "I do not like
exceptions because I am unable to write exception-safe code."
OK, the totally unchecked exceptions of C# have some disadvantages. But
maybe future language versions will introduce throws declarations.
However, for LINQ expressions (called callbacks in the past) a concept
of generic delegation of exception declarations would be helpful. I.e.
this function throws the exception X plus the exceptions thrown by the
invoked delegates. This is lacking in Java and some other languages too.
Post by Anton Shepelev
OK, your container does not support arbitrary
capacities.
Indeed, but whose does, and what meaning has it if
but a private, implementation-specific detail?
It is. I only did not catch this at the first glance.
Post by Anton Shepelev
Post by Anton Shepelev
There are smarter collision handlings by the
way.
Which ones? Bucketing? Linear probing?
Linked lists. This /could/ be implemented by
linear probing.
Do you mean implementing each bucket as a linked
list? That is the design of the standard .NET
Dictionary. I have preferred the simplicity of a
bucket-less array.
Feel free to do so. I was puzzled by the .NET implementation at first
too. But by now I like it, because it is surprisingly efficient.
Post by Anton Shepelev
? - why not simply cast to uint?
How do you mean to do that? The simplest solution I have found is this:
int ihash = key.GetHashCode();
uint hash;
if( ihash < 0 )
{ hash = uint.MaxValue - (uint)(-ihash); }
else
{ hash = (uint)ihash; }
I don't think a single cast will do it...
int i = -4;
uint j = (uint)i;
j.Dump(); // LINQPad's Dump(); prints 4294967292

Whether the resulting unsigned integer division is reasonably fast on a
certain target platform is another question.
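A minimal standalone sketch of the single-cast approach (the method name
Slot is hypothetical; only the cast-and-modulus idea is from the
thread). Incidentally, note that in C# the % operator binds tighter
than &, so the posted expression parses as
hash & (0x7FFFFFFF % Capacity); since Capacity is a power of two that
happens to equal hash & (Capacity - 1), which is why the code works:

```csharp
using System;

static class HashIndex
{
    // A single unsigned cast reinterprets the two's-complement bits,
    // so negative hash codes map to large uints and no bits are lost;
    // the unsigned modulus then yields an index in [0, capacity).
    public static int Slot(int hashCode, int capacity)
    {
        return (int)((uint)hashCode % (uint)capacity);
    }

    static void Main()
    {
        Console.WriteLine((uint)(-4));   // 4294967292
        Console.WriteLine(Slot(-4, 16)); // 12 (= 4294967292 % 16)
        Console.WriteLine(Slot(37, 16)); // 5
    }
}
```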
Post by Anton Shepelev
But I have seen quite bad hash implementations.
E.g. int.GetHashCode just returns the integer
value, as do derived types like enums.
Can one do any better without relying on
domain-specific information such as a specific
range and distribution of those integers?
Well, the problem is that the available bits from the domain are not
spread over the available bits of the hash code. One part of the
implementation has to do this job. In .NET it is common practice to do
it at the point where hash codes are used or combined. This is an
option, but one has to keep it in mind.
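One common way to do that spreading is a multiplicative mixing step.
The sketch below is illustrative only: the golden-ratio constant and
the xor-shift fold are standard choices (Fibonacci hashing), not
anything proposed in this thread:

```csharp
using System;

static class HashMix
{
    // Multiplying by a large odd constant spreads low-entropy inputs
    // (such as small consecutive ints) over the whole 32-bit range;
    // the xor-shift folds the well-mixed high bits into the low bits,
    // which is what a power-of-two modulus will actually use.
    public static uint Mix(uint x)
    {
        x *= 2654435769u;      // ~2^32 / golden ratio, a common choice
        return x ^ (x >> 16);  // fold high bits into the low bits
    }

    static void Main()
    {
        // Consecutive keys no longer land in consecutive slots:
        for (uint k = 0; k < 4; k++)
            Console.WriteLine(Mix(k) % 16);
    }
}
```

Because multiplication by an odd constant and the xor-shift are both
invertible, the mix is a bijection: distinct inputs never collide at
this stage, only at the final modulus.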


Marcel
Arne Vajhøj
2017-03-03 23:31:55 UTC
Post by Marcel Mueller
Post by Anton Shepelev
Yes. That is the major purpose of exceptions in my
code, for I agree with Joel Spolsky:
https://www.joelonsoftware.com/2003/10/13/13/
I do not agree with him.
+1
Post by Marcel Mueller
The rationale is something like "I do not like
exceptions because I am unable to write exception-safe code."
Yes. And if one is not able to write code that properly handles
exceptions, then it is not likely that one is able to write code
that properly handles return values.

He also conveniently forgets to mention some of the features that make
it relatively easy to handle exceptions besides catch: finally and
using.
Post by Marcel Mueller
OK, the totally unchecked exceptions of C# have some disadvantages.
Yes. But Anders Hejlsberg did not like the Java way.
Post by Marcel Mueller
But maybe future language versions will introduce throws declarations.
However, for LINQ expressions (called callbacks in the past) a concept
of generic delegation of exception declarations would be helpful. I.e.
this function throws the exception X plus the exceptions thrown by the
invoked delegates. This is lacking in Java and some other languages too.
I think you can do that in Java for a specific number of exceptions -
just not for a variable number of exceptions.

Arne
