R. A. O'Keefe on Smalltalk Design Principles and Collection Classes

To: squeak-dev at lists squeakfoundation org
From: Richard A. O'Keefe <ok at cs otago ac nz>
Subject: Design Patterns and Collection Classes
Date: Thu Aug 29 02:50:01 CEST 2002

"David Griswold" <David.Griswold at acm org> wrote:

	So if there is any overall message here, it is:  be very careful
	with code in classes like this, and show great restraint by
	making changes as minimal as possible.  The code in Collection
	is simple, but not simple-minded.  Don't underestimate the
	subtlety of why these simple-seeming methods are the way they are.

In another context, Brian Foote and Joseph Yoder wrote
[http://www.laputan.org/mud/mud.html capitals are theirs and were links]:

... when you are prototyping a system, you are not usually concerned with
    how elegant or efficient your code is.  You know that you will only
    use it to prove a concept.  Once the prototype is done, the code will
    be thrown away and written properly.  As the time nears to demonstrate
    the prototype, the temptation to load it with impressive but utterly
    inefficient realizations of the system's expected eventual functionality
    can be hard to resist.  Sometimes, this strategy can be a bit too
    successful. ...

Does this sound a bit like the history of Smalltalk?  It should.  A whole
new language every two years.  A research goal that was not "how do we
build ideal collection classes" but "how do we take the idea of object
orientation and build a language and system around it that can be
understood ("Personal Mastery") and used in powerful ways, how can we
develop a whole new approach to programming?"

... Time, or a lack thereof, is frequently the decisive force that drives
    programmers to write THROWAWAY CODE.  Taking the time to write a
    proper, well thought out, well documented program might take more
    time than is available to solve a problem, or more time than the
    problem merits.  Often, the programmer will make a frantic dash to
    construct a minimally functional program, all the while promising
    him- or herself that a better factored, more elegant version will
    follow thereafter.  They may know full well that building a reusable
    system will make it easier to solve similar problems in the future,
    and that a more polished architecture would result in a system that
    was easier to maintain and extended.

    Quick-and-dirty coding is often rationalized as being a stopgap
    measure.  All too often, time is never found for this follow-up work.
    The code languishes, while the program flourishes.

    THEREFORE [and this is their recommendation], produce, by any means
    available, simple, expedient, disposable code that adequately
    addresses just the problem at hand.

... while this approach [that is, the approach used in a particular
    example, not throwaway code as such] is, in general, a lousy way to
    address this [particular example's] problem, it is perfectly
    satisfactory within the confines of the particular purpose for which
    [it] has ever actually been used.  Such practical constraints are
    typical of THROWAWAY CODE, and are more often than not undocumented.
    For that matter, everything about THROWAWAY CODE is more often than
    not undocumented.  When documentation exists, it is frequently not
    current, and often not accurate.

Does this sound like #removeAll: and #addAll:?  It should!

... [Two example] systems might be good candidates for RECONSTRUCTION,
    were the resources, interest, and audience present to justify such an
    undertaking.  In the mean time, these [particular] systems, which are
    still sufficiently well suited to the particular tasks for which they
    were built, remain in service.  Keeping them on the air takes far
    less energy than rewriting them.  They continue to evolve, in a
    PIECEMEAL fashion, a little at a time.

Doesn't this sound like everything about Squeak?  From the core classes
to Morphic?  Isn't piecemeal (community) improvement part of the point
of Squeak?


So David Griswold claims that the Collection classes are subtle,
simple but not simple-minded, and that the methods (including the ones I
claim are broken) are the way they are for excellent reason and should
not be altered without great [indeed, to my thinking, extreme] restraint,
while I claim that the Collection classes are a "Ball of Mud" where most
methods are the simplest thing that could possibly work, except they don't.

Obviously, eyewitness testimony from the original developers could settle
this.  Except that it couldn't; it could only tell us how they remember
what they thought they were doing.  We need some more objective criterion
to help us decide what kind of code we are dealing with here.  If we are
dealing with entire and perfect chrysolite, what we need is documentation.
If we are dealing with mud, we need documentation and repairs.

So let's look away from #removeAll: and #addAll:.  Let's even look away
from #union:, #intersection:, and #difference:, enchanting though their
perversity may be.

----------------------------------------------------------------
A Design Principle.
----------------------------------------------------------------

    Every class should have a class invariant.
    It does not need to be expressed as code, but it needs to be thought
    about and written down.

    Initialisation methods have the obligation of setting up an object's
    state so that its class invariant is true.

    Other (externally accessible) methods get to assume that the invariant
    is true and have the obligation to ensure that it is still true when
    they complete.  If you cannot rely on sufficient knowledge about the
    object's state at the beginning of a method, the only thing you can
    do to restore the invariant is reinitialise.

    If the invariant of a class depends on the state of a mutable object,
    and that object is accessible outside this class, then you DON'T have
    an invariant and you cannot expect your methods to work.

    So the design principle is:
    a class's invariant should depend only on the values of its instance
    variables and the states of such objects as are immutable or are
    not shared with other objects.

----------------------------------------------------------------
The Collection Classes
----------------------------------------------------------------

Do the collection classes heed this warning?
Many of them do.

OrderedCollection, for example, never hands out its array, and does not
provide any non-private methods for changing its instance variables.
(At least, by putting methods in the 'private' category it is warning
 "don't call these lest the heavens fall".)

Set, for another example, depends on the exact state of its array, but
never hands out access to that array. but it also depends on the state
of the elements; it requires that the hash code of an element not change.
Oddly enough, it does provide a repair procedure that can be called.
You could mutate the elements of a Set, taking care that the Set itself
was not used while this was happening, and then invoke the #rehash
method.  Except that #rehash is private.  With the exception of numbers
and symbols, there are no immutable values that one might use as keys.
There is no equivalent of OPAL's InvariantArray, for example.  (I'm trying
to write one and it's a nightmare.  Array has so many methods that think
it's OK to smash, even when they are making copies.)

But what about Dictionary?

Dictionary cannot make up its mind whether it's about Sets of Associations
or keyed collections of values, and tries to be both.  Considered as a
keyed collection of values, a Dictionary's Associations are part of its
private internal structure.  Dictionary's invariant concerns the structure
of its hash table, which depends on the state of these Associations.  Yet
it is perfectly happy to hand those Associations out to anyone who asks.

Consider
    d := Dictionary new.
    d at: #red put: Color red.
    d associationsDo: [:each | each key: #green].
    d inspect
==> (shows #green -> nil)
    d at: #green
==> (Error: key not found)
    d at: #red
==> (Error: key not found)

Had Dictionary been designed with class invariants in mind,  it could still
have looked much the way it does now, but the Associations it hands out
would be InvariantKeyAssociations, where

    Association subclass: #InvariantKeyAssociation
        instanceVariableNames: ''
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Collections-Support'

    "private methods"
    pvtKey: k value: v
	key := k.
	value := v.

    "accessing methods"
    key: k
        self shouldNotImplement

    "class instance creation methods"
    key: k value: v
	^self new pvtKey: k value: v

Except that this won't work; practically the only time I ever really
need 'pvt' methods is so that a class can initialise a new instance
without the initialisation method being exposed to another class, but
the 'pvt' machinery will not let you do this: "private messages may
only be sent to self".  If only there were a category of private messages
that a class could send to 'self new'.  So the method really has to be
called #privateKey:value: and we just have to hope that no outsider
will use it.  So much for private methods.

It's all very well to say that competent programmers wouldn't even try
to change the key of an association that came from a dictionary, but if
there are two things that the last fifty years of programming and the
recent history of the Internet have taught us, they are
 - everyone makes mistakes
 - some people deliberately try to break other people's programs.

Oh yes, what if no Association is changed, but a key is?
Just like Set, Dictionary could let you make a series of changes to
keys (as long as no previously unequal keys became equal) and then
repair the state of the dictionary using #rehash, except that #rehash
is a private method.  (Which means you can do this, but you shouldn't.)
#rehash depends on a rather weaker invariant:  it requires that keys
still be non-nil and unequal, but it doesn't require that their values or
hashes be unchanged.

Is there another Collection class that exposes its internal structure?
Yes, MappedCollection.

    MappedCollection collection: c map: m

in effect constructs the composition of two tabulated functions:

    (MappedCollection collection: c map: m) at: index
    =
    c at: (m at: index)

The base collection c and the map m are not copied when you do this.
Indeed, that's the point.  But it leads to some very odd consequences.

For example,
    i ~= j and: [
    (mc at: j) ~= 0 and: [
    mc at: i put: 0.
    mc at: j = 0]]
can come out true.  For every other keyed collection, setting the element
at one key doesn't alter any elements at other keys.  So don't try to
sort a MappedCollection!

It's also the only Collection class I can find in Squeak 3.2 where
(coll contents) is not the receiver but a copy with different properties.

One of the nasty things about the fact that the collection and mapping
are shared with unknown numbers of other objects is that

    foo: aCollection bar: aBlock
	|index value|

	index := aCollection size.
	value := aBlock value: (aCollection at: index).
	aCollection at: index put: value

may fail, even though aBlock value: ... doesn't send any messages whatsoever
to aCollection.  How?  aBlock could make the mapping smaller, or alter its
last element so that it is no longer a valid key for the other collection,
or could alter the other collection so that the key disappears, &c.

Oh yes, imagine the fun when you do
    a := #(3 1 4 1) copy.
    m := MappedCollection collection: a map: a.
m ==> a MappedCollection(4 3 1 3)
    m at: 2 put: 2.
m ==> a MappedCollection(1 2 1 2)
    m at: 3 put: 3
m ==> a MappedCollection(1 2 3 4)
a ==> #(2 1 4 3)

MappedCollection tacitly assumes that the collection and mapping are
different objects or that neither will be changed, but it does not say
this anywhere.

At the very least the class comment for MappedCollection
should be changed from

    I represent an access mechanism for a sequenceable collection
    re-ordering or filtering its elements.

(which is quite untrue because it doesn't in the least depend on the
collections being sequenceable) to

    I represent the composition of two collection.
    (MappedCollection collection: c map: m) at: index
    is always the same as c at: (map at: index).
    The collection and map could be any combination of sequenceable
    collections and dictionaries as long as the elements of m all
    are (and remain) acceptable keys for c.

    My instances are sequenceable precisely when their maps are
    sequenceable.

    If you plan to use #at:put: with one of my instances,
    the collection c and map m should be different objects.

    Note that the collection c and map m are not copied, but are
    shared.  If you change them, the mapped collection you made
    from them will also change.  Changes to the elements of the
    collection are normally safe; changes to the key space of
    the collection or the elements of the map may make the
    mapped collection invalid.  Do not change the map while you are
    iterating over the map or the mapped collection.

Hey, there's another oversight, found while writing that new class comment.
There's a missing method:

    "in category 'testing'"
    isSequenceable
        ^map isSequenceable

Alternatively, if the (otherwise pointless and unnecessary) restriction to
sequenceable collections is to be preserved, then #setCollection:map:
should be

    setCollection: aCollection map: aDictionary
        self assert: [aCollection isCollection].
        self assert: [aDictionary isSequenceable].
        domain := aCollection.
        map := aDictionary

except that the names of the method parameters (which I have not changed)
strongly suggest that the map should be allowed to be a dictionary, which
is not sequenceable.  But the instance creation method calls the very same
parameter aSequenceableCollection.

Whatever this code may be, it is not 'subtle'.  It's confused and confusing.

Are there any other collection classes that violate this design principle?
Yes, there is one that doesn't so much violate it as cut its breast off and
eat half of one of its kidneys:  LinkedList.

Most container libraries regard a linked list as a container for general
objects, with the link nodes being part of the private structure of the
list.  From the outside, it should not matter whether the nodes are allocated
one at a time, or in chunks (a good way to combine the flexibility of lists
with the space efficiency of arrays is to have a linked collection of arrays;
the well known "piece table" representation used in some editors is related
to this).  However, Smalltalk LinkedLists are very different:  all their
elements must be Link nodes.  This makes them utterly unlike arrays; any
number of arrays may refer to the same object as an element, and an array
may include an object as element any number of times.  Not so LinkedLists:
each Link may be an element of at most one LinkedList, and then at most
once.

In other words, LinkedList does not encapsulate.

Serious weirdness can result.  Consider
    c1 add: c2 first.
where c1 and c2 are separate collections sharing no objects at all
before this message is sent.  If c1 is any other kind of collection
that can support #add:, this is fine.  Now c1 and c2 have a common
element.  But if c1 and c2 are both LinkedLists, now a node is shared
between the two lists.  While #add: is defined to be #addLast:, it need
not actually true that c2 first is now the last element of c1!  (Because
c2 might have had more than one element.)

I could multiply examples of how easy it is to break LinkedLists
without exploiting (initial) aliasing, but the main consequence is
that Smalltalk programmers should never ever use LinkedList, not
even after making their own subclasses of Link.  Arguably, LinkedList
should be outside the Collection hierarchy.  (As noted in an earlier
message, LinkedList is amongst the SequenceableCollections, but does
not support #at:, surely a defining characteristic of such collections.)


Make no mistake:
 - I am not saying Smalltalk is unusable.
 - I am not saying the Collection classes are unusable.
 - I am not saying that the Smalltalk designers are culpable.
What I am saying is that the Smalltalk designers weren't trying to
produce highly polished code but were engaged in research of a completely
different kind, so that they produced 'good enough for the immediate task'
code which after 20-odd years outside the lab is overdue for an overhaul.
The Collection classes are a means to an end, not Holy Writ, and mortals
may dare to criticise and revise them.