7/36  

Refactored Hashed Collection Library

Mentor: Andres Valloud
Second mentor: Martin McClure
Level: Intermediate
Invited students: Germán Leiva, Nicolas Scarcella
Students interested: Nicolas Scarcella(very), Germán Leiva, Carla Griggio(lightly)

Technical Details

Currently, most Smalltalk implementations provide a rich hashed collection hierarchy including Sets and Dictionaries of various kinds.
Unfortunately, historical constraints crystallized the class hierarchies into inflexible shapes.  For example, in VisualWorks, all hashed collections inherit from Set, even though Dictionaries are not
Sets.  Moreover, concepts such as hash vs. identityHash, equality vs. identity checking, weakness, ephemeronness, and storage strategy are not best served by subclassing. Consequently, one can observe reversals in such properties while diving through a hierarchy branch. The current class libraries are so pervasively used that modifying them in place would be quite disruptive to existing applications. Consequently, rather than refactoring the class library in place, it would be better to provide a loadable library that offers better hashed collections on demand without conflicting with the existing hashed collections.

Broadly speaking, hashed collections can be divided into Sets and Dictionaries.  Their properties (using hash or identityHash values, comparing objects via equality or identity, several types of weakness, their storage strategies) should be configurable in place. Subclassing should be used when the behavior of the hashed collection changes.  Therefore, the properties of hashed collections should be implemented in terms of composition and delegation of policy or other
collaborator objects.

The minimum goal for this project is to implement the equivalent of today's Set, IdentitySet, WeakSet, Dictionary, IdentityDictionary, WeakKeyDictionary, WeakValueDictionary, WeakIdentityKeyDictionary, WeakIdentityValueDictionary, some ephemeron based dictionaries, and at least two storage strategies: open addressing with linear probing, and hash buckets.

Benefits to the Student

The student will be exposed to a fascinating area of research rich in results and potential for often dramatic real life application performance improvements.  This knowledge will increase the student's ability to meaningfully contribute to software projects.

Benefits to the Community

Quite frequently, the community ends up implementing a one-off hashed collection that depends on a deficient class hierarchy.  This dependency only compounds the maintenance issues for the current hierarchy.  Having a loadable library with more sensible hashed collections would prevent the need to repeatedly implement hashed collections that occur often in practice (e.g.: a hash bucketed Set), as well as preventing a proliferation of redundant hashed collection classes of similar structure and functionality.  Moreover, the potential for hashed collection class unification offers the possibility to improve the performance of applications that depend on less than optimal hashed collections.




Updated: 23.3.2010