Discussion:
A dictionary class
Anton Shepelev
2017-02-23 20:34:30 UTC
Hello, all

By way of exercise I have written a small dictionary
class and measured its performance in random access,
paying no attention to the insertion and deletion of
elements. My class does not implement ICollection
and has neither memory management nor error
handling.

May I ask you to review my code and indicate how
further to optimise random access, i.e. the indexer?
I have tried passing the structs by value and by
reference, as well as operating on bare array
indices -- only to observe no difference in
performance.

You can download the class, together with a small
test project, here:

https://drive.google.com/file/d/0B2hVk2IFUZDGcHFGbFpIOEp6OVk/view?usp=sharing

The principal source file I reproduce below:


using System;
using System.Collections.Generic;

namespace AntCols
{
struct TDictRec<TKey, TVal>
{ public TKey Key;
public TVal Value;
public AntListNode<ListEntry> ListElem; // position in the array
}

class ListEntry
{ public int Pos;
public ListEntry( int pos )
{ Pos = pos; }
}

enum ProbeRes { Done, More, Fail };

class DictEnumerator<TKey, TVal>: IEnumerator< KeyValuePair<TKey, TVal> >
{ AntDict<TKey, TVal> dict;
IEnumerator<ListEntry> list_enum;

public void Reset()
{ list_enum.Reset(); }

public bool MoveNext()
{ return list_enum.MoveNext(); }

public KeyValuePair<TKey, TVal> Current
{ get
{ int pos = list_enum.Current.Pos;
TKey key = dict.Data[ pos ].Key;
TVal val = dict.Data[ pos ].Value;
KeyValuePair<TKey, TVal> pr =
new KeyValuePair< TKey, TVal >( key, val );
return pr;
}
}

object System.Collections.IEnumerator.Current
{ get
{ return Current; }
}

public void Dispose()
{
// TODO
}

public DictEnumerator( AntDict< TKey, TVal > dict )
{ this.dict = dict;
this.list_enum = dict.List.GetEnumerator();
}
}

public class AntDict< TKey, TVal >: IEnumerable<KeyValuePair<TKey, TVal>>
{ // (stored index in Data)
internal AntList<ListEntry> List = new AntList<ListEntry>();
internal TDictRec<TKey, TVal>[] Data;
int Capacity;
int Count = 0;

public IEnumerator<KeyValuePair<TKey, TVal>> GetEnumerator()
{ return new DictEnumerator< TKey, TVal> ( this ); }

System.Collections.IEnumerator
System.Collections.IEnumerable.GetEnumerator()
{ return GetEnumerator(); }

public TVal this[ TKey key ]
{ get
{ TVal value;
Find( key, out value );
return value;
//return Data[ Find( key ) ].Value;
}
}

public void Add( TKey key, TVal value )
{ if( Count > Capacity / 3 )
{ IncCapacity(); }
Count++;
int empty = FindEmpty( key );
if( empty == -1 )
{ throw new Exception( "No room" ); }
AntListNode<ListEntry> list_elem =
List.Append( new ListEntry( empty ) );
PutEntry( key, value, list_elem, ref Data[empty] );
}

public void Remove( TKey key )
{ TVal dummy;
Count--;
int pos = Find( key, out dummy );
List.Remove( Data[pos].ListElem );
Data[pos].ListElem = null; // TODO: Clear it properly
}

void PutEntry
( TKey key, TVal value, AntListNode<ListEntry> list_elem,
ref TDictRec<TKey, TVal> record
)
{ record.Key = key;
record.Value = value;
record.ListElem = list_elem;
}

delegate ProbeRes ProbeCrit
( ref TDictRec<TKey, TVal> rec, TKey compare );
ProbeCrit CritEqual, CritEmpty;

ProbeRes IsEmpty( ref TDictRec<TKey, TVal> rec, TKey compare )
{ if( rec.ListElem == null )
{ return ProbeRes.Done; }
else
{ return ProbeRes.More; }
}

ProbeRes IsEqual( ref TDictRec<TKey, TVal> rec, TKey compare )
{ if( IsEmpty( ref rec, compare ) == ProbeRes.Done )
{ return ProbeRes.Fail; }
if( compare.Equals( rec.Key ) )
{ return ProbeRes.Done; }
return ProbeRes.More;
}

int Find( TKey key, out TVal value )
{ return Probe( CritEqual, key, out value ); }

int FindEmpty( TKey key )
{ TVal dummy;
return Probe( CritEmpty, key, out dummy );
}

public AntDict()
{ Capacity = 1;
IncCapacity();
CritEqual = new ProbeCrit( IsEqual );
CritEmpty = new ProbeCrit( IsEmpty );
}

void NextIndex( ref int index, ref int iter )
{ iter++;
index += iter;
if( index >= Capacity )
{ index = index % Capacity; }
}

int ProbeTest0( ProbeCrit crit )
{ return 0;
}

int Probe( ProbeCrit crit, TKey key, out TVal value )
{ // HACK: how to use the entire hash value?
int hash = key.GetHashCode() & 0x7FFFFFFF % Capacity;
int pos = hash;
int iter = 0;
int res = -1;
TDictRec<TKey, TVal> rec;
while( true )
{ rec = Data[pos];
switch( crit( ref rec, key ) )
{ case ProbeRes.Fail: value = rec.Value; return res;
case ProbeRes.Done: value = rec.Value; return pos;
}
NextIndex( ref pos, ref iter );
}
}

void IncCapacity()
{ TDictRec<TKey, TVal>[] old_data = Data;
Capacity *= 2;
Data = new TDictRec<TKey, TVal>[ Capacity ];
if( old_data != null )
{ Resize( old_data );
}
}

void Resize( TDictRec<TKey, TVal>[] old_data )
{ int pos;
TDictRec<TKey, TVal> record;
foreach( ListEntry entry in List )
{ record = old_data[ entry.Pos ];
pos = FindEmpty( record.Key );
Data[pos] = record;
entry.Pos = pos;
}
}
}
}
--
() ascii ribbon campaign - against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]
Marcel Mueller
2017-02-23 21:45:43 UTC
Post by Anton Shepelev
May I ask you to review my code and indicate how
further to optimise random access, i.e. the indexer?
I have tried passing the structs by value and by
reference, as well as operating on bare array
indices -- only to observe no difference in
performance.
The performance is probably limited by other faux pas. See below.
Post by Anton Shepelev
using System;
using System.Collections.Generic;
namespace AntCols
{
struct TDictRec<TKey, TVal>
{ public TKey Key;
public TVal Value;
public AntListNode<ListEntry> ListElem; // position in the array
}
The performance advantage of the struct might vanish if AntListNode
happens to be a class.
Post by Anton Shepelev
class ListEntry
{ public int Pos;
public ListEntry( int pos )
{ Pos = pos; }
}
Why do you use a class for the trivial type int?
This is counterproductive. It causes a large number of very small
object allocations for nothing.
Use a struct or eliminate the type entirely.
Post by Anton Shepelev
class DictEnumerator<TKey, TVal>: IEnumerator< KeyValuePair<TKey, TVal> >
[...]

By the way: the internal types of your implementation should not be
exposed in the namespace unless they are used by other classes of your
class library as well. Consider using private nested types.
Post by Anton Shepelev
public class AntDict< TKey, TVal >: IEnumerable<KeyValuePair<TKey, TVal>>
{ // (stored index in Data)
internal AntList<ListEntry> List = new AntList<ListEntry>();
internal TDictRec<TKey, TVal>[] Data;
When the above helper types are nested types, you no longer need to
declare these members as internal.
Post by Anton Shepelev
int Capacity;
int Count = 0;
This is superfluous. List.Count does the job as well.
Post by Anton Shepelev
public IEnumerator<KeyValuePair<TKey, TVal>> GetEnumerator()
{ return new DictEnumerator< TKey, TVal> ( this ); }
System.Collections.IEnumerator
System.Collections.IEnumerable.GetEnumerator()
{ return GetEnumerator(); }
public TVal this[ TKey key ]
{ get
{ TVal value;
Find( key, out value );
return value;
//return Data[ Find( key ) ].Value;
}
}
public void Add( TKey key, TVal value )
{ if( Count > Capacity / 3 )
{ IncCapacity(); }
Auto-shrink could be counterproductive for objects whose size changes
often and by large amounts.
Post by Anton Shepelev
Count++;
int empty = FindEmpty( key );
if( empty == -1 )
{ throw new Exception( "No room" ); }
Err, this should be a job for IncCapacity.
Post by Anton Shepelev
AntListNode<ListEntry> list_elem =
List.Append( new ListEntry( empty ) );
PutEntry( key, value, list_elem, ref Data[empty] );
}
public void Remove( TKey key )
{ TVal dummy;
Count--;
int pos = Find( key, out dummy );
List.Remove( Data[pos].ListElem );
What is the performance of List.Remove? If it is not O(1) - probably -
this could degrade the entire performance.
Post by Anton Shepelev
Data[pos].ListElem = null; // TODO: Clear it properly
}
void PutEntry
( TKey key, TVal value, AntListNode<ListEntry> list_elem,
ref TDictRec<TKey, TVal> record
)
{ record.Key = key;
record.Value = value;
record.ListElem = list_elem;
}
Again a static function w/o static.
By the way it is misplaced. It should be up to the TDictRec constructor
to initialize its contents.
Post by Anton Shepelev
delegate ProbeRes ProbeCrit
( ref TDictRec<TKey, TVal> rec, TKey compare );
ProbeCrit CritEqual, CritEmpty;
The compare functions are side effect free and so static by design.
You should not allocate them anew for each class instance.
Post by Anton Shepelev
ProbeRes IsEmpty( ref TDictRec<TKey, TVal> rec, TKey compare )
The static function misses a 'static' declaration.
Post by Anton Shepelev
{ if( rec.ListElem == null )
{ return ProbeRes.Done; }
else
{ return ProbeRes.More; }
}
ProbeRes IsEqual( ref TDictRec<TKey, TVal> rec, TKey compare )
Same here.
Post by Anton Shepelev
{ if( IsEmpty( ref rec, compare ) == ProbeRes.Done )
{ return ProbeRes.Fail; }
This criterion is not reliable. You might hit an entry with a different
hash code. In this case or in case your key is not found you may end up
in an infinite loop at Probe.
Post by Anton Shepelev
if( compare.Equals( rec.Key ) )
{ return ProbeRes.Done; }
return ProbeRes.More;
}
int Find( TKey key, out TVal value )
{ return Probe( CritEqual, key, out value ); }
int FindEmpty( TKey key )
{ TVal dummy;
return Probe( CritEmpty, key, out dummy );
}
public AntDict()
{ Capacity = 1;
An empty dictionary should be empty and avoid further allocations.
Post by Anton Shepelev
IncCapacity();
For whatever reason you increase the capacity here. The dictionary is
still empty.
Post by Anton Shepelev
CritEqual = new ProbeCrit( IsEqual );
CritEmpty = new ProbeCrit( IsEmpty );
}
void NextIndex( ref int index, ref int iter )
{ iter++;
index += iter;
For whatever reason you use an effectively quadratic function in case
of hash collisions here. Normally there is no need for that. Keep in
mind that the collision algorithm must touch *all* slots before it
returns to the starting place. Can you ensure this in conjunction with
the modulus operation below? Probably not.
There are smarter collision handlings by the way.
Post by Anton Shepelev
if( index >= Capacity )
{ index = index % Capacity; }
}
int ProbeTest0( ProbeCrit crit )
{ return 0;
}
int Probe( ProbeCrit crit, TKey key, out TVal value )
{ // HACK: how to use the entire hash value?
Use uint.
Post by Anton Shepelev
int hash = key.GetHashCode() & 0x7FFFFFFF % Capacity;
GetHashCode could be quite expensive. Most Dictionary implementations
cache the hash codes of objects. This is especially important in case
of resize operations, which require all hash values to be available
for new placement.
Post by Anton Shepelev
int pos = hash;
int iter = 0;
int res = -1;
TDictRec<TKey, TVal> rec;
while( true )
{ rec = Data[pos];
switch( crit( ref rec, key ) )
{ case ProbeRes.Fail: value = rec.Value; return res;
This algorithm will result in a full table scan, or even worse, in
case the key is not in the dictionary. It does not stop searching for
a hit once all matching hash codes have been processed.
Post by Anton Shepelev
case ProbeRes.Done: value = rec.Value; return pos;
}
NextIndex( ref pos, ref iter );
}
}
void IncCapacity()
{ TDictRec<TKey, TVal>[] old_data = Data;
Capacity *= 2;
The size of a hash table should be a prime number. At least it should
never be even as this would discard bits from the hash code at the
modulus operation.
Post by Anton Shepelev
Data = new TDictRec<TKey, TVal>[ Capacity ];
if( old_data != null )
{ Resize( old_data );
}
}
void Resize( TDictRec<TKey, TVal>[] old_data )
{ int pos;
TDictRec<TKey, TVal> record;
foreach( ListEntry entry in List )
{ record = old_data[ entry.Pos ];
pos = FindEmpty( record.Key );
Data[pos] = record;
entry.Pos = pos;
}
}
}
}
No idea about the other, missing classes. (Google drive is for members
only.)


Hint: Have a look at the reference source of the .NET Dictionary<,>
class. The implementation is quite sophisticated in many respects.
(Even though it contains a hack against DoS attacks on the web server.)


Marcel
Anton Shepelev
2017-02-24 13:58:32 UTC
May I ask you to review my code and indicate how
further to optimise random access, i.e. the
indexer? I have tried passing the structs by
value and by reference, as well as operating on
bare array indices -- only to observe no
difference in performance.
The performance is probably limited by other faux
pas. See below.
For whatever it is worth, random access to my
dictionary seems to be only 3-5% slower than to
System.Collections.Generic.Dictionary.
struct TDictRec<TKey, TVal>
{ public TKey Key;
public TVal Value;
public AntListNode<ListEntry> ListElem; // position in the array
}
The performance advantage of the struct might
vanish if AntListNode happens to be a class.
Why? It is a class, from my doubly linked generic
list. This causes the struct to contain only a
pointer to a list node. As I wrote above, I was
focused on optimising access by key -- which does
not involve the list at all -- rather than insertion
or deletion, which do work with the list and are not
very good in my implementation because I do not
cache the hashes, as you suggest.
class ListEntry
{ public int Pos;
public ListEntry( int pos )
{ Pos = pos; }
}
Why do you use a class for the trivial type int?
This is counterproductive. It causes a large
number of very small object allocations for
nothing. Use a struct or eliminate the type entirely.
My mistake due to obsolete overgeneralization.
Initially I intended to store more in it but have
ended up with a single int. I shall replace it with
the bare int.
By the way: the internal types of your
implementation should not be exposed in the
namespace unless they are used by other classes of
your class library as well. Consider using
private nested types.
I declared them in the namespace merely because I
found code with nesting too cumbersome to look at.
Whereas "internal" classes are not visible to the
actual user from other assemblies, I thought that
should not be a problem.
int Capacity;
int Count = 0;
This is superfluous. List.Count does the job as
well.
My generic list has no Count property. It is
currently a bare IEnumerable<T>.
public void Add( TKey key, TVal value )
{ if( Count > Capacity / 3 )
{ IncCapacity(); }
Auto-shrink could be counterproductive for objects
whose size changes often and by large amounts.
There is no autoshrink here, only auto-grow.
[the Add() method continued]
Count++;
int empty = FindEmpty( key );
if( empty == -1 )
{ throw new Exception( "No room" ); }
Err, this should be a job for IncCapacity.
IncCapacity() makes certain that the capacity is at
least three times the number of elements, in which
case an empty cell shall always be found in theory.
If it is not, then there is an error in the code.
The dictionary must look for an empty cell at every
insertion, therefore this is part of the Add()
method. I cannot move it to IncCapacity() because
the latter is not called at every insertion.
public void Remove( TKey key )
{ TVal dummy;
Count--;
int pos = Find( key, out dummy );
List.Remove( Data[pos].ListElem );
Data[pos].ListElem = null; // TODO: Clear it properly
}
What is the performance of List.Remove?
If it is not O(1) - probably -
this could degrade the entire performance.
I have taken care of that. It indeed is O(1), for
it is a doubly linked list and .ListElem its node.
No lookups are required.
void PutEntry
( TKey key, TVal value, AntListNode<ListEntry> list_elem,
ref TDictRec<TKey, TVal> record
)
{ record.Key = key;
record.Value = value;
record.ListElem = list_elem;
}
Again a static function w/o static.
Agreed.
By the way it is misplaced. It should be up to
the TDictRec constructor to initialize its con-
tents.
In order to pass less data through the stack, I have
preferred by-reference (should be out instead of
ref) access over a constructor.
delegate ProbeRes ProbeCrit
( ref TDictRec<TKey, TVal> rec, TKey compare );
ProbeCrit CritEqual, CritEmpty;
The compare functions are side effect free and so
static by design. You should not allocate them
anew for each class instance.
I thought the presence of a generic type template
required it. Fixed now:

static ProbeCrit CritEqual = new ProbeCrit( IsEqual );
static ProbeCrit CritEmpty = new ProbeCrit( IsEmpty );

IsEqual() and IsEmpty() are also static now as you
recommended.
ProbeRes IsEqual( ref TDictRec<TKey, TVal> rec, TKey compare )
{ if( IsEmpty( ref rec, compare ) == ProbeRes.Done )
{ return ProbeRes.Fail; }
if( compare.Equals( rec.Key ) )
{ return ProbeRes.Done; }
return ProbeRes.More;
}
This criterion is not reliable. You might hit an
entry with a different hash code. In this case or
in case your key is not found you may end up in an
infinite loop at Probe.
My dictionary always has empty elements by
construction, therefore lookup for an absent key
will fail when the first empty cell is hit. If the
hash codes are different then the elements are
different as well and the Equals() check will fail.
I don't understand your point here. Can you provide
a simple test case that will cause an endless loop?
public AntDict()
{ Capacity = 1;
An empty dictionary should be empty and avoid
further allocations.
[AntDict() continued]
IncCapacity();
For whatever reason you increase the capacity
here. The dictionary is still empty.
To avoid a special case and therefore an extra
branch operator at insertion and lookup.
void NextIndex( ref int index, ref int iter )
{ iter++;
index += iter;
if( index >= Capacity )
{ index = index % Capacity; }
}
For whatever reason you use an effectively
quadratic function in case of hash collisions here.
Normally there is no need for that. Keep in mind
that the collision algorithm must touch *all* slots
before it returns to the starting place. Can you
ensure this in conjunction with the modulus
operation below? Probably not.
Yes, I can. It is a well-known property of
triangular numbers that t mod 2^n takes every value
from 0 to 2^n-1 as t runs through the first 2^n
triangular numbers.
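This coverage property is easy to check empirically. Below is a minimal
standalone sketch (the class and method names are illustrative, not from
the posted code); it uses the same probing scheme as NextIndex() above:

```csharp
using System;
using System.Linq;

static class TriangularProbe
{
    // Returns true when quadratic probing with triangular-number
    // offsets (index += ++step, modulo capacity) visits every slot.
    public static bool CoversAllSlots(int capacity)
    {
        var visited = new bool[capacity];
        int index = 0, step = 0;
        for (int k = 0; k < capacity; k++)
        {
            visited[index] = true;
            step++;                            // same scheme as NextIndex():
            index = (index + step) % capacity; // index += ++iter, then wrap
        }
        return visited.All(v => v);
    }

    static void Main()
    {
        Console.WriteLine(CoversAllSlots(16)); // True:  power of two
        Console.WriteLine(CoversAllSlots(12)); // False: non-power-of-two
    }
}
```

The second call illustrates why the scheme ties the capacity to powers
of two: for other capacities the triangular offsets revisit some slots
and miss others.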
There are smarter collision handlings by the way.
Which ones? Bucketing? Linear probing?
// HACK: how to use the entire hash value?
int hash = key.GetHashCode() & 0x7FFFFFFF % Capacity;
Use uint.
uint and an extra subtraction for each hash
calculation. I wish the standard GetHashCode()
returned a uint.
GetHashCode() could be quite expensive. Most
Dictionary implementations cache the hash codes of
objects. This is especially important in case of
resize operations, which require all hash values to
be available for new placement.
Good idea, I have not thought about that.
[Probe() method continued]
int pos = hash;
int iter = 0;
int res = -1;
TDictRec<TKey, TVal> rec;
while( true )
{ rec = Data[pos];
switch( crit( ref rec, key ) )
{ case ProbeRes.Fail: value = rec.Value; return res;
case ProbeRes.Done: value = rec.Value; return pos;
}
NextIndex( ref pos, ref iter );
}
This algorithm will result in a full table scan, or
even worse, in case the key is not in the
dictionary. It does not stop searching for a hit
once all matching hash codes have been processed.
With crit = CritEqual it will stop at the first
empty entry, as I wrote above. If you disagree,
then please give me a test case using the following
Contains() method:

public bool Contains( TKey key )
{ TVal value;
return Find( key, out value ) != -1;
}
The size of a hash table should be a prime number.
At least it should never be even, as this would
discard bits from the hash code at the modulus
operation.
This depends on the hashing algorithm. With
quadratic probing, which I use, it must be a power
of two.
No idea about the other, missing classes. (Google
drive is for members only.)
Nope, my link works for anonymous users. I have
just tested it without logging in to Google.
Anton Shepelev
2017-02-24 14:08:54 UTC
Post by Anton Shepelev
delegate ProbeRes ProbeCrit
( ref TDictRec<TKey, TVal> rec, TKey compare );
ProbeCrit CritEqual, CritEmpty;
The compare functions are side effect free and
so static by design. You should not allocate
them anew for each class instance.
I thought the presence of a generic type template
required it. Fixed now:
static ProbeCrit CritEqual = new ProbeCrit( IsEqual );
static ProbeCrit CritEmpty = new ProbeCrit( IsEmpty );
IsEqual() and IsEmpty() are also static now as you
recommended.
Strange as it may be, static CritEqual and CritEmpty
work slower than dynamic ones.
Marcel Mueller
2017-02-24 21:04:08 UTC
Post by Anton Shepelev
The performance advantage of the struct might
vanish if AntListNode happens to be a class.
Why? It is a class, from my doubly linked generic
list.
It was not that obvious that it is a doubly linked list, since List<T>
in .NET is an array (unlike what the name suggests). In fact the
entire container class library is full of inconsistent names.
Post by Anton Shepelev
I declared them in the namespace merely because I
found code with nesting too cumbersome to look at.
I use partial classes and separate files in such cases.
Post by Anton Shepelev
Post by Anton Shepelev
public void Add( TKey key, TVal value )
{ if( Count > Capacity / 3 )
{ IncCapacity(); }
Auto shrink could be counterproductive to objects
that change the size often largely.
There is no autoshrink here, only auto-grow.
Sorry, I read too fast. Forget about it.

But it is unusual that a hash table needs to have only a 30% fill
factor. 70% is usually no problem. The .NET implementation works
quite reasonably up to 100%.
Post by Anton Shepelev
IncCapacity() makes certain that the capacity is at
least three times the number of elements, in which
case an empty cell shall always be found in theory.
If it is not, then there is an error in the code.
OK, so the "throw" is more like an assertion.
Post by Anton Shepelev
By the way it is misplaced. It should be up to
the TDictRec constructor to initialize its con-
tents.
In order to pass less data through the stack, I have
preferred by-reference (should be out instead of
ref) access to a constructor.
struct constructors do not really create much code after the JIT has
done its work. In fact any struct member function takes the this pointer
as reference. Otherwise mutable struct functions would not work at all,
because they would operate on local copies.
Post by Anton Shepelev
This criterion is not reliable. You might hit an
entry with a different hash code. In this case or
in case your key is not found you may end up in an
infinite loop at Probe.
My dictionary always has empty elements by
construction, therefore lookup for an absent key
will fail when the first empty cell is hit.
I did not catch that you only operate at 30% fill factor. In this case
empty cells are "everywhere".
Post by Anton Shepelev
Post by Anton Shepelev
[AntDict() continued]
IncCapacity();
For whatever reason you increase the capacity
here. The dictionary is still empty.
To avoid a special case and therefore an extra
branch operator at insertion and lookup.
OK.
Post by Anton Shepelev
For whatever reason you use an effectively
quadratic function in case of hash collisions here.
Normally there is no need for that. Keep in mind
that the collision algorithm must touch *all* slots
before it returns to the starting place. Can you
ensure this in conjunction with the modulus
operation below? Probably not.
Yes, I can. It is a well-known property of
triangular numbers that t mod 2^n takes every value
from 0 to 2^n-1 as t runs through the first 2^n
triangular numbers.
OK, your container does not support arbitrary capacities.
Post by Anton Shepelev
There are smarter collision handlings by the way.
Which ones? Bucketing? Linear probing?
Linked lists. This /could/ be implemented by linear probing.
Post by Anton Shepelev
Post by Anton Shepelev
// HACK: how to use the entire hash value?
int hash = key.GetHashCode() & 0x7FFFFFFF % Capacity;
Use uint.
uint and an extra subtraction for each hash
calculation. I wish the standard GetHashCode()
returned a uint.
? - why not simply cast to uint?
Post by Anton Shepelev
Post by Anton Shepelev
[Probe() method continued]
With crit = CritEqual it will stop at the first
empty entry, as I wrote above.
The same story with the fill factor.
Post by Anton Shepelev
The size of a hash table should be a prime number.
At least it should never be even, as this would
discard bits from the hash code at the modulus
operation.
This depends on the hashing algorithm.
Indeed. If the hash algorithm distributes well over the entire value
domain there is no problem.
But I have seen quite bad hash implementations. E.g. int.GetHashCode
just returns the integer value, as do derived types like enums. And
char.GetHashCode AFAICS just multiplies by 0x10001.
Post by Anton Shepelev
No idea about the other, missing classes. (Google
drive is for members only.)
Nope, my link works for anonymous users. I have
just tested it without logging in Google.
Hmm, I only get an empty page with a login button in the upper right
corner. Probably Google dislikes my browser or security settings.
But I just saw the download symbol left to it which did the job.


Marcel
Anton Shepelev
2017-03-02 21:03:58 UTC
By the way: the internal types of your
implementation should not be exposed in the
namespace unless they are used by other classes
of your class library as well. Consider using
private nested types.
I declared them in the namespace merely because
I found code with nesting too cumbersome to look
at. Whereas "internal" classes are not visible
to the actual user from other assemblies, I
thought that should not be a problem.
I use partial classes and separate files in such
cases.
I am of the opinion that it is dangerously similar
to include files in C -- a way to split code in a
non-modular manner.
But it is unusual that a hash table needs to have
only a 30% fill factor.
Yes, but it was required to reach the performance
of the standard .NET dictionary. I will try to
optimise my dictionary so that it works as fast
with a denser filling.
70% is usually no problem. The .NET implementation
works quite reasonably up to 100%.
It has a different architecture, wherein the hash
table contains "buckets", each pointing to a list of
elements. In order to avoid this extra level of
indirection I have chosen a bucket-less hash table,
which contains the elements directly, so that the
cost of collisions is somewhat higher in general and
much higher at high density.

The very meaning of "fill factor" differs between
these architectures. In .NET it is the fraction of
used hashes in the bucket array, whereas in mine it
is the fraction of used cells in the element array.
They cannot be directly compared.

Here are the average numbers of collisions
(secondary lookups) in my implementation for
different fill factors, with my easy case of
uniformly distributed keys:

Fill factor collisions per lookup
1/4 0.16
1/3 0.24
1/2 0.44
2/3 0.78
3/4 1.06
IncCapacity() makes certain that the capacity is
at least three times the number of elements, in
which case an empty cell shall always be found
in theory. If it is not, then there is an error
in the code. The dictionary must look for an
empty cell at every insertion, therefore this is
part of the Add() method.
OK, so the "throw" is more like an assertion.
Yes. That is the major purpose of exceptions in my
code, for I agree with Joel Spolsky:

https://www.joelonsoftware.com/2003/10/13/13/
OK, your container does not support arbitrary
capacities.
Indeed, but whose does, and what meaning has it if
but a private, implementation-specific detail?
There are smarter collision handlings by the
way.
Which ones? Bucketing? Linear probing?
Linked lists. This /could/ be implemented by
linear probing.
Do you mean implementing each bucket as a linked
list? That is the design of the standard .NET
Dictionary. I have preferred the simplicity of a
bucket-less array.
Post by Anton Shepelev
// HACK: how to use the entire hash value?
int hash = key.GetHashCode() & 0x7FFFFFFF % Capacity;
Use uint.
uint and an extra subtraction for each hash
calculation. I wish the standard GetHashCode()
returned a uint.
? - why not simply cast to uint?
How do you mean to do that? The simplest solution
I have found is this:

int ihash = key.GetHashCode();
uint hash;
if( ihash < 0 )
{ hash = uint.MaxValue - (uint)(-ihash); }
else
{ hash = (uint)ihash; }

I don't think a single cast will do it...
But I have seen quite bad hash implementations.
E.g. int.GetHashCode just returns the integer
value, as do derived types like enums.
Can one do any better without relying on
domain-specific information such as a specific
range and distribution of those integers?
Marcel Mueller
2017-03-03 06:49:56 UTC
On 02.03.17 22.03, Anton Shepelev wrote:
[nested classes]
Post by Anton Shepelev
I use partial classes and separate files in such
cases.
I am of the opinion that it is dangerously similar
to include files in C -- a way to split code in a
non-modular manner.
It is basically a language issue. In a more generalized approach there
is no difference between static classes and namespaces. This is not true
for C# but in case of a nested class one could say that the enclosing
class is just like a private implementation namespace. So from my point
of view there is nothing wrong with separate files for nested classes.
The C# syntax for this purpose is partial class - so what?
Post by Anton Shepelev
But it is unusual that a hash table needs to have
only a 30% fill factor.
Yes, but it was required to reach the performance
of the standard .NET dictionary. I will try to
optimise my dictionary so that it works as fast
with a denser filling.
There is probably not much that you can do with a direct-mapped
implementation.
Post by Anton Shepelev
70% is usually no problem. The .NET implementation
works quite reasonably up to 100%.
It has a different architecture, wherein the hash
table contains "buckets", each pointing to a list of
elements. In order to avoid this extra level of
indirection I have chosen a bucket-less hash table,
which contains the elements directly, so that the
cost of collisions is somewhat higher in general and
much higher at high density.
The very meaning of "fill factor" differs between
these architectures. In .NET it is the fraction of
used hashes in the bucket array, whereas in mine it
is the fraction of used cells in the element array.
They cannot be directly compared.
I would not call this another definition of fill factor. But indeed it
is a different design with different properties.

I think the .NET implementation takes this approach because typical
hardware architectures dislike sparse memory usage, first of all because
of read amplification at cache line reads. This can significantly
degrade performance at high system load.
A fill factor of 30% roughly means a factor 3 read amplification. Of
course, this does not apply if the data does not fit into the cache
anyway or if it is rarely accessed. Then the read amplification is
determined by the ratio of the cache line size to the bucket size.
Post by Anton Shepelev
Here are the average numbers of collisions
(secondary lookups) in my implementation for
different fill factors, with my easy case of
uniformly distributed keys:
Fill factor collisions per lookup
1/4 0.16
1/3 0.24
1/2 0.44
2/3 0.78
3/4 1.06
I do not remember the theoretical values but it looks plausible.
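For reference, the textbook uniform-hashing approximation puts the
expected number of extra probes in a successful search at
(1/a)*ln(1/(1-a)) - 1 for fill factor a. A small sketch to compare
against the measured table above (the class name is illustrative; the
formula is the standard estimate, not something from this thread):

```csharp
using System;

static class ProbeEstimate
{
    // Expected extra probes per successful lookup under the
    // uniform-hashing approximation: (1/a) * ln(1/(1-a)) - 1.
    public static double ExtraProbes(double fillFactor)
    {
        return Math.Log(1.0 / (1.0 - fillFactor)) / fillFactor - 1.0;
    }

    static void Main()
    {
        // Same fill factors as the table above.
        foreach (double a in new[] { 1.0/4, 1.0/3, 1.0/2, 2.0/3, 3.0/4 })
            Console.WriteLine($"{a:F2}: {ExtraProbes(a):F2}");
    }
}
```

The estimates come out close to, but somewhat below, the measured
collision counts, which is plausible for triangular-number probing.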
Post by Anton Shepelev
Post by Anton Shepelev
IncCapacity() makes certain that the capacity is
at least three times the number of elements, in
which case an empty cell shall always be found
in theory. If it is not, then there is an error
in the code. The dictionary must look for an
empty cell at every insertion, therefore this is
part of the Add() method.
OK, so the "throw" is more like an assertion.
Yes. That is the major purpose of exceptions in my
code, for I agree with Joel Spolsky:
https://www.joelonsoftware.com/2003/10/13/13/
I do not agree with him. The rationale is something like "I do not like
exceptions because I am unable to write exception-safe code."
OK, the totally unchecked exceptions of C# have some disadvantages. But
maybe future language versions will introduce throws declarations.
However, for LINQ expressions (called callbacks in the past) a concept
of generic delegation of exception declarations would be helpful. I.e.
this function throws the exception X plus the exceptions thrown by the
invoked delegates. This is lacking in Java and some other languages too.
Post by Anton Shepelev
OK, your container does not support arbitrary
capacities.
Indeed, but whose does, and what meaning has it if
but a private, implementation-specific detail?
It is. I only did not catch this at the first glance.
Post by Anton Shepelev
Post by Anton Shepelev
There are smarter collision handlings by the
way.
Which ones? Bucketing? Linear probing?
Linked lists. This /could/ be implemented by
linear probing.
Do you mean implementing each bucket as a linked
list? That is the design of the standard .NET
Dictionary. I have preferred the simplicity of a
bucket-less array.
Feel free to do so. I was puzzled by the .NET implementation at first
too. But by now I like it, because it is surprisingly efficient.
Post by Anton Shepelev
? - why not simply cast to uint?
How do you mean to do that? The simplest solution I have found is this:
int ihash = key.GetHashCode();
uint hash;
if( ihash < 0 )
{ hash = uint.MaxValue - (uint)(-ihash); }
else
{ hash = (uint)ihash; }
I don't think a single cast will do it...
int i = -4;
uint j = (uint)i;
j.Dump(); // LINQPad's Dump(); prints 4294967292

Whether the resulting unsigned integer division is reasonably fast on a
certain target platform is another question.
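A minimal standalone sketch of the single-cast approach (the method name
Slot is hypothetical; only the cast-and-modulus idea is from the
thread). Incidentally, note that in C# the % operator binds tighter
than &, so the posted expression parses as
hash & (0x7FFFFFFF % Capacity); since Capacity is a power of two that
happens to equal hash & (Capacity - 1), which is why the code works:

```csharp
using System;

static class HashIndex
{
    // A single unsigned cast reinterprets the two's-complement bits,
    // so negative hash codes map to large uints and no bits are lost;
    // the unsigned modulus then yields an index in [0, capacity).
    public static int Slot(int hashCode, int capacity)
    {
        return (int)((uint)hashCode % (uint)capacity);
    }

    static void Main()
    {
        Console.WriteLine((uint)(-4));   // 4294967292
        Console.WriteLine(Slot(-4, 16)); // 12 (= 4294967292 % 16)
        Console.WriteLine(Slot(37, 16)); // 5
    }
}
```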
Post by Anton Shepelev
But I have seen quite bad hash implementations.
E.g. int.GetHashCode just returns the integer
value, as do derived types like enums.
Can one do any better without relying on
domain-specific information such as a specific
range and distribution of those integers?
Well, the problem is that the available bits from the domain are not
spread over the available bits of the hash code. One part of the
implementation has to do this job. In .NET it is common practice to do
it at the point where hash codes are used or combined. This is an
option, but one has to keep it in mind.
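One common way to do that spreading is a multiplicative mixing step.
The sketch below is illustrative only: the golden-ratio constant and
the xor-shift fold are standard choices (Fibonacci hashing), not
anything proposed in this thread:

```csharp
using System;

static class HashMix
{
    // Multiplying by a large odd constant spreads low-entropy inputs
    // (such as small consecutive ints) over the whole 32-bit range;
    // the xor-shift folds the well-mixed high bits into the low bits,
    // which is what a power-of-two modulus will actually use.
    public static uint Mix(uint x)
    {
        x *= 2654435769u;      // ~2^32 / golden ratio, a common choice
        return x ^ (x >> 16);  // fold high bits into the low bits
    }

    static void Main()
    {
        // Consecutive keys no longer land in consecutive slots:
        for (uint k = 0; k < 4; k++)
            Console.WriteLine(Mix(k) % 16);
    }
}
```

Because multiplication by an odd constant and the xor-shift are both
invertible, the mix is a bijection: distinct inputs never collide at
this stage, only at the final modulus.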


Marcel
Arne Vajhøj
2017-03-03 23:31:55 UTC
Post by Marcel Mueller
Post by Anton Shepelev
Yes. That is the major purpose of exceptions in my
code, for I agree with Joel Spolsky:
https://www.joelonsoftware.com/2003/10/13/13/
I do not agree with him.
+1
Post by Marcel Mueller
The rationale is something like "I do not like
exceptions because I am unable to write exception-safe code."
Yes. And if one is not able to write code that properly handles
exceptions, then it is not likely that one is able to write code
that properly handles return values.

He also conveniently forgets to mention some of the features that make
it relatively easy to handle exceptions besides catch: finally and
using.
Post by Marcel Mueller
OK, the totally unchecked exceptions of C# have some disadvantages.
Yes. But Anders Hejlsberg did not like the Java way.
Post by Marcel Mueller
But maybe future language versions will introduce throws declarations.
However, for LINQ expressions (called callbacks in the past) a concept
of generic delegation of exception declarations would be helpful. I.e.
this function throws the exception X plus the exceptions thrown by the
invoked delegates. This is lacking in Java and some other languages too.
I think you can do that in Java for a specific number of exceptions -
just not for a variable number of exceptions.

Arne
