Hashing a Tree Structure

If I were to do this, I’d probably do something like the following:

For each leaf node, compute the concatenation of 0 and the hash of the node data.

For each internal node, compute the concatenation of 1 and the hash of any local data (NB: may not be applicable) and the hash of the children from left to right.

This will lead to a cascade up the tree every time you change anything, but that MAY be low-enough of an overhead to be worthwhile. If changes are relatively infrequent compared to the amount of changes, it may even make sense to go for a cryptographically secure hash.

Edit1: There is also the possibility of adding a “hash valid” flag to each node and simply propagate a “false” up the tree (or “hash invalid” and propagate “true”) up the tree on a node change. That way, it may be possible to avoid a complete recalculation when the tree hash is needed and possibly avoid multiple hash calculations that are not used, at the risk of slightly less predictable time to get a hash when needed.

Edit3: The hash code suggested by Noldorin in the question looks like it would have a chance of collisions, if the result of GetHashCode can ever be 0. Essentially, there is no way of distinguishing a tree composed of a single node, with “symbol hash” 30 and “value hash” 25 and a two-node tree, where the root has a “symbol hash” of 0 and a “value hash” of 30 and the child node has a total hash of 25. The examples are entirely invented, I don’t know what expected hash ranges are so I can only comment on what I see in the presented code.

Using 31 as the multiplicative constant is good, in that it will cause any overflow to happen on a non-bit boundary, although I am thinking that, with sufficient children and possibly adversarial content in the tree, the hash contribution from items hashed early MAY be dominated by later hashed items.

However, if the hash performs decently on expected data, it looks as if it will do the job. It’s certainly faster than using a cryptographic hash (as done in the example code listed below).

Edit2: As for specific algorithms and minimum data structure needed, something like the following (Python, translating to any other language should be relatively easy).

#! /usr/bin/env  python

import Crypto.Hash.SHA

class Node:
    def __init__ (self, parent=None, contents="", children=[]):
        self.valid = False
        self.hash = False
        self.contents = contents
        self.children = children


    def append_child (self, child):
        self.children.append(child)

        self.invalidate()

    def invalidate (self):
        self.valid = False
        if self.parent:
            self.parent.invalidate()

    def gethash (self):
        if self.valid:
            return self.hash

        digester = crypto.hash.SHA.new()

        digester.update(self.contents)

        if self.children:
            for child in self.children:
                digester.update(child.gethash())
            self.hash = "1"+digester.hexdigest()
        else:
            self.hash = "0"+digester.hexdigest()

        return self.hash

    def setcontents (self):
        self.valid = False
        return self.contents

Leave a Comment