08 August 2007

Persistency and Lazy Memoization in a Pairing Heap

I just released a pairing heap PlaneT package. Because the datastructure is both persistent and has amortized space bounds, it requires a neat trick to implement, which I will discuss below.

A pairing heap is an adaptive datastructure for a collection of objects which supports the following operations [edit: You can also read about an alternative implementation here in the Scheme Cookbook. This implementation doesn't achieve the amortized bounds on remove-min which mine does in the presence of persistence]:

Operation Time Bound
min O(1)
insert O(1)
merge O(1)
remove-min O(n) worst case, O(log(n)) amortized.

The pairing heaps in my PlaneT package are persistent, which means that the above operations do not change the heap given to them, but rather construct a new heap which shares some structure with the old heap. For example, the (remove-min h) procedure returns a new heap which contains all the elements of h except for the smallest element.

Note that the time-bound on remove-min in O(n) worst-case, and O(log(n)) amortized. This means that, while a single remove-min operation can take time which is O(n), most will take much less. In fact, the longer remove-min operations are so rare that, in a sequence of an infinite number of remove-min operations, the average time per operation is O(log(n)). (This bound is only conjectured at the moment, but supported by a lot of experimental evidence. Because the pairing-heap adapts its structure to the use pattern it observes---a really neat feature, by the way---it is extremely difficult to rigorously prove bounds for it.)

It turns out to be difficult to ensure that amortized bounds still obtain with a persistent datastructure. The reason is illustrated in the following code snippet which creates a list of 1000 threads, each of which does something with a heap:

(let ((h (make-expensive-remove-min-heap)))
  (list-ec (:range i 1000)
    (thread
     (lambda ()
       (let ((new-h (remove-min h)))
         (do-something-interesting-with new-h))))))
A persistent datastructure can have multiple futures! But, when proving amortized bounds, one typically exploits that the cost of an expensive operation (remove-min here) can be "accounted" to a sequence of prior cheap operations (basically, make-expensive-remove-min-heap must do a lot of inserts, each of which can "carry" a bit of the cost of the upcoming remove-min operation, so that you know that expensive remove-mins are very rare, and the cost averages out to O(log(n))). That style of accounting (often called the "Banker's Method") works if you know that only one remove-min will occur in the future of a given datastructure, as would be case if we were using ephemeral datastructures---in this case, the first remove-min would destroy the datastructure, so we could only perform it once. But we're not using ephemeral datastructures: each of the 1000 threads created above will re-perform the expensive remove-min on the same h, and we'll be in a pile of trouble, time-bounds-wise.

The solution to this problem is to use lazy evaluation with memoization. When constructing a heap, queue up the remove-min operation using delay. When you need to remove the minimum element, use force. Because delay and force memoize the result, you can guarantee that remove-min operations after the first will complete in O(1) time, which maintains the amortized bounds in the presence of persistence.

This is, of course, not my idea. It (and much more) are explained in the delightful thesis of Chris Okasaki, and also in his reportedly-delightful (I've never read it, but it's very well-regarded) book, Purely Functional Data Structures.

By the way, if you need a bunch of datastructures (more than my simple pairing-heap), please check out the Galore PlaneT library by Carl Eastlund, Jens Axel Soegaard, and Stevie Strickland (either Version 2.2---which contains more datastructures, but also more bugs and isn't as clean---or Version 3.6---which is the most modern version). It's what I use when I need a serious finite map or set.

3 comments:

Jens Axel Søgaard said...

If you find bugs in Galore 2.2 do send them to me, so I can fix them.
(I can't remember any receiving any bug reports not fixed).

Will Farr said...

Hi Jens. I don't know of any bugs in Galore 2.2, either (as I recall, you fixed the one I reported extremely quickly), so I struck out the text above. Out of curiosity, then, why did you move to Version 3 instead of sticking with 2?

Jens Axel Søgaard said...

Well, think of Galore 2 and Galore 3 as seperate libraries.

The difference between Galore 2 and Galore 3 is that Galore 3 is more flexible at the cost of being slower [at least that was the case two years ago - I don't have any recent benchmarks].

Version 2 was written by me, and Galore 3 was written by Carl and Stevie. They were (are) interested in using techniques from object oriented program as a middle-layer, where as I was more fascinated by the purely functional approach. [Note: As a user of Galore 3, you see a functional interface].