Iterators are Leaky



A staple of modern languages' standard libraries is the iterator interface. This API abstracts the notion of stepping through a sequence of items one at a time, much like a database cursor. Unfortunately, just like database cursors, its performance can be poor for many of the same reasons. In this post I'm going to cover one such issue: a “leaky abstraction” inherent to iterators that tends to cause repeated scans of the source data when used naively.

Unlike database cursors, languages such as C# or Rust provide a wonderfully rich set of extensions on top of the basic “move to next item” iterator interface. These simplify complex code down to short snippets that are not only terse, but also more readable than the original code.

Compare the following snippet of Rust code:

let mut iter = IntoIterator::into_iter(iterable);
let mut t = iter.next();
loop {
    match (t, iter.next()) {
        (Some(m), Some(i)) if i > m => { t = Some(i); }
        (_, None) => { break; }
        _ => {}
    }
}

With this much terser equivalent:

let t = iterable.into_iter().max();

In the first snippet it may not have been immediately obvious that the maximum was being computed. The second snippet is not only shorter, it is also much more readable.

The concept of iteration is extremely abstract, and can be applied to a huge variety of things. Its use is generally fine when applied to simple containers such as arrays or hash tables, but it can also be used with libraries that wrap network APIs or system calls in an iterable interface. Unfortunately, in those cases many of these useful extended functions ignore the potentially high cost of repeatedly iterating through a result set or the cost of a network round-trip for each next() call.

I call this the “inefficient min/max” problem, which can be best illustrated by writing out an example problematic snippet of code:

let min = container.iter().min();
let max = container.iter().max();

If the container is a small array, this is reasonably efficient. The array will be scanned twice, but it'll probably fit in the CPU cache, and in the grand scheme of things this is no big deal. If we're lucky, the compiler will automatically vectorise this and it'll probably be faster than we expect anyway. It could even be optimal if we're very lucky.

However, as soon as the data set exceeds the size of the CPU cache and needs to be streamed in from memory, this is irritatingly inefficient, exactly halving the throughput that could be achieved. If networking is involved, then this snippet is almost certainly worth taking the time to optimise for better efficiency… but how?

Bad fix #1) Manual looping

We could write an explicit loop that computes both the minimum and the maximum of an iterator simultaneously, as sketched below. This is a straightforward variant of the verbose code snippet above, but it throws away the benefits of those wonderfully short, readable abstractions provided by the high-level iterator interface.
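
A minimal sketch of that manual loop, assuming a slice of i64 for concreteness:

fn min_max(container: &[i64]) -> Option<(i64, i64)> {
    let mut iter = container.iter().copied();
    let first = iter.next()?;
    let (mut min, mut max) = (first, first);
    for i in iter {
        if i < min { min = i; }
        if i > max { max = i; }
    }
    Some((min, max))
}

It works, and in a single pass, but every new aggregate means yet more hand-written bookkeeping.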

Bad fix #2) More utility functions

Computing the minimum and maximum of a block of data is a common pattern, used for all sorts of things. We could add a utility function to compute both simultaneously, e.g.:

let (min,max) = container.iter().range();
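
For illustration, such a helper could be packaged as an extension trait along these lines (the trait name is hypothetical, and a real version would return an Option to cope with empty input):

trait RangeExt: Iterator + Sized {
    fn range(mut self) -> Option<(Self::Item, Self::Item)>
    where
        Self::Item: PartialOrd + Copy,
    {
        let first = self.next()?;
        // The same loop as the manual fix, now hidden behind a trait.
        let (mut min, mut max) = (first, first);
        for i in self {
            if i < min { min = i; }
            if i > max { max = i; }
        }
        Some((min, max))
    }
}

impl<I: Iterator> RangeExt for I {}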

This is a lot better for performance, and potentially more efficient even in the case of small arrays, but it is too specific. It solves a single “spot problem”, ignoring all of the other situations where we may be forced to iterate twice. Even in this trivial example, it's obvious that we may also want to compute the average at the same time:

let (min,max,sum,count) = container.iter().measure();
let average = sum / count;

This has become horribly messy: not all data types support comparison, not all types support addition, and the sum generally needs a wider type than the items being summed to avoid overflow. As a realistic example, the minimum and maximum of sortable types such as text strings make sense, but the average of a set of strings is nonsense. Conversely, the average of a set of colours makes sense, but the minimum and maximum of colours are not well-defined concepts.
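
The overflow point bites immediately with small integer types; the accumulator has to be widened:

let bytes: [u8; 4] = [200, 200, 200, 200];
// Summing into u8 would overflow at 255; sum into u32 instead.
let sum: u32 = bytes.iter().map(|&b| u32::from(b)).sum();
assert_eq!(sum, 800);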

Squeezing all of this into a single function is non-extensible and clearly not a good design.

Bad fix #3) Use fold() or one of its variants

We could implement arbitrary combinations of scanning across iterators using something like the fold() function, but again this is not quite right.

The min and max functions are best implemented with a variant usually called fold1(), which needs no initial accumulator value. (Rust exposes this as Iterator::reduce(), and .NET as the seedless Aggregate() overload.) However, computing the length, which the average requires, is best implemented using the plain fold().

Either way, we're back to manually writing out fiddly code which is almost the same as the original for loop we were trying to get away from!
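
For concreteness, here is what this looks like for the running example, assuming i64 items. reduce() (the fold1() variant) handles max neatly on its own, but combining everything forces us back to plain fold() with sentinel accumulators:

let container = [1i64, 2, 3];

// fold1()/reduce(): no initial accumulator needed, yields an Option.
let max_only = container.iter().copied().reduce(i64::max);

// Everything in one plain fold(): a single pass, but the sentinel
// initial values and tuple threading are manual and error-prone.
let (min, max, sum, count) = container.iter().fold(
    (i64::MAX, i64::MIN, 0i64, 0usize),
    |(min, max, sum, count), &i| (min.min(i), max.max(i), sum + i, count + 1),
);
let average = sum / count as i64;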

Possible fix #1) Add “out” parameters

Really, the problem is that functions such as sum() consume the iterator. What they ought to do is create a future and return an iterator that can then be chained into more computations.

In Rust, it might look something like the following:

let (min, iter1) = container.iter().min();
let (max, iter2) = iter1.max();
let (sum, iter3) = iter2.sum();
let (count, iter4) = iter3.count();
iter4.consume();
let average = sum.await / count.await;
let range = (min.await, max.await);

This approach is not only flexible, but it also allows functions like count() to short-circuit where possible and eliminate an unnecessary computation entirely.

The only irritation is the chain of numbered temporary “iter” variables used to glue everything together. Ideally, it would be nice to use the chained function-call notation, with some sort of let binding for “out” parameters; however, Rust does not currently allow this kind of notation:

 container.iter()
    .min( let min )
    .max( let max )
    .sum( let sum )
    .count( let n )
    .consume();

C# does allow inline variable bindings in function out parameters, so the desired syntax could be achieved with a set of extension methods along the lines of the following:

static IEnumerable<T> Max<T>( this IEnumerable<T> iter, out ValueTask<T?> max ) { ... }
static IEnumerable<T> Min<T>( this IEnumerable<T> iter, out ValueTask<T?> min ) { ... }
static IEnumerable<T> Count<T>( this IEnumerable<T> iter, out ValueTask<int> n ) { ... }
static IEnumerable<T> Consume<T>( this IEnumerable<T> iter ) { ... }

Which would then allow usage like the following snippet:

int[] demo = { 1, 2, 3 };
demo.Max(out var max)
    .Min(out var min)
    .Count(out var n)
    .Consume(); // or use the iterator in a foreach loop...

if (n.Result > 0)
{
    int range = max.Result - min.Result;
    int avg_gap = range / n.Result;
}

One issue with the above is that it pulls in the somewhat complex ValueTask type, which is needed because the Min, Max, Count functions return immediately, but don't actually produce a result until the source iterator is stepped through. In C# this might be forgivable, but systems programming languages cannot afford to use such a heavyweight interface for returning simple scalar values.
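
In Rust, the same idea could be sketched far more cheaply with plain &mut Option<T> “out” slots instead of futures; with_max_into() here is a hypothetical name, not an existing API:

fn with_max_into<'a>(
    iter: impl Iterator<Item = i64> + 'a,
    out: &'a mut Option<i64>,
) -> impl Iterator<Item = i64> + 'a {
    // Record the running maximum as a side effect while items flow through.
    iter.inspect(move |&i| *out = Some(out.map_or(i, |m| m.max(i))))
}

fn main() {
    let data = [1i64, 2, 3];
    let mut max = None;
    // A with_min_into(), with_sum_into(), etc. could be chained the same way.
    for _ in with_max_into(data.iter().copied(), &mut max) {}
    assert_eq!(max, Some(3));
}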

Possible fix #2) Reactive Extensions

The dual of the Iterator interface is the Observable interface. Iterators pull, Observables push.

The snippet above can be rewritten using the .NET implementation of the Rx library, which already provides all of the aggregation operators we need. Sharing a single pass over the source does require publishing the observable:

int[] demo = { 1, 2, 3 };
var iter = demo.ToObservable().Publish(); // share one pass among all subscribers
var max = iter.Max();
var min = iter.Min();
var sum = iter.Sum();
var n = iter.Count();
max.Subscribe(Console.WriteLine); // subscribe to each aggregate...
min.Subscribe(Console.WriteLine);
sum.Subscribe(Console.WriteLine);
n.Subscribe(Console.WriteLine);
iter.Connect(); // ...then pump the source, consuming the observable

There is a catch though: this exercise is about optimum performance! The goal is to do the minimum work needed to compute a set of results, without unnecessarily wasting CPU time on duplication or other overheads. The aim is to have a zero-cost abstraction, or very nearly so.

Unfortunately, Rx is designed for asynchronous code and hence its implementation is littered with complex control flow designed to hook into schedulers. Even a very clever compiler would have little chance of optimising the above snippet, and it almost certainly couldn't be automatically vectorised to take advantage of AVX SIMD instructions.

Rust could likely do much better, but as of this writing there is no Reactive Extensions library implementation for it.

There was at least one attempt, but it seems to have been abandoned in 2018. As expected, it makes references to mutexes and spin locks, making it much more complex than the iterator trait, which is featherweight in comparison.

Possible fix #3) Running totals

The Rx interface provides a hint of how the iterator-based interface ought to be implemented. The issue with functions such as Min/Max/Sum/Fold/Count, etc… is that they consume the iterator and they return a value only at the end. However, a much more useful construct is to provide running minimums, maximums, sums, counts, etc… This is familiar to anyone who's ever used a spreadsheet to compute a running total column.
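
In Rust, a running maximum in this style can already be approximated with the standard scan() adapter; a minimal sketch:

let data = [1, 2, 3, 4, 5, 6, 3, 4];
// Each step folds the new item into the state and yields the running max.
let running_max = data.iter().scan(i32::MIN, |m, &i| {
    *m = (*m).max(i);
    Some(*m)
});
for (n, m) in running_max.enumerate() {
    println!("#{}: running max = {}", n, m);
}

But this yields only the running aggregate, losing the current item, and composes awkwardly, which is exactly the problem explored below.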

This is also how SQL solves the inefficient min/max issue. It is a dedicated query language designed in the era of magnetic tapes spooling from reel to reel, so of course it avoids repeated table scans like the plague. For example, the query below will perform a single pass over the source table in practically all database engines, and may use indexes to reduce this to efficient lookups in some cases:

SELECT MAX(c1), MIN(c1), COUNT(*) FROM "table"    

This is extremely efficient, terse, and readable. Exactly what we're looking for! Unfortunately, no popular imperative language supports this syntax, not even .NET with its Language-Integrated Query (LINQ) feature.

The snippet above has an implicit “GROUP BY”, aggregating all rows of the source table. However, sometimes we want to return every row as well as simultaneously compute aggregates over the result set. To cater for this scenario, modern SQL has been extended with the OVER() clause:

SELECT 
    MAX(c1)  OVER (), 
    MIN(c1)  OVER (), 
    COUNT(*) OVER (), 
    * 
    FROM "table"

Unfortunately, the query above uses a complex query plan with 3 passes over the source data and 2 joins! This is required because the output of the aggregations can't be deferred until the end. The query engine is forced to compute them first, and then join the aggregates to each of the rows:

[Figure: the inefficient three-pass query plan]

By computing the running totals instead, the database engine can process the data in a single pass:

SELECT 
    MAX(c1)  OVER (ORDER BY id ROWS UNBOUNDED PRECEDING),
    MIN(c1)  OVER (ORDER BY id ROWS UNBOUNDED PRECEDING),
    COUNT(*) OVER (ORDER BY id ROWS UNBOUNDED PRECEDING),
    * 
    FROM "table"

Unfortunately this syntax is repetitive, difficult to read, and its performance is not guaranteed. Database engines may use unexpected execution plans, particularly if the rest of the query is more complex than a simple table scan.

Somewhat simpler running total functions without the complex ORDER BY sub-clauses of the SQL syntax have a more ergonomic implementation in procedural languages. (They can include the PARTITION BY feature relatively easily, but I'm ignoring that here for clarity.)

// Max and Min are structs implementing Iterator with an Item
// type of (Item,Item). This is a tuple combining the running max/min
// along with the current iterated value.
fn with_max( self ) -> Max<Self> {...}
fn with_min( self ) -> Min<Self> {...}

// This then allows code like the following, which is very similar
// to how the existing enumerate() function works.
fn main() {
    let container = [1,2,3,4,5,6,3,4,0,2,7,4,5];
    for (n,(max,i)) in container.into_iter().with_max().enumerate() {
        println!("#{}: value={}, max={}", n, i, max);
    }
}
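
For reference, a minimal sketch of how the with_max() adapter itself might be implemented (illustrative only; with_min() would be the mirror image):

// Yields (running max, current item) pairs.
struct Max<I: Iterator> {
    inner: I,
    max: Option<I::Item>,
}

impl<I> Iterator for Max<I>
where
    I: Iterator,
    I::Item: Ord + Copy,
{
    type Item = (I::Item, I::Item);

    fn next(&mut self) -> Option<Self::Item> {
        let i = self.inner.next()?;
        let m = match self.max {
            Some(m) if m >= i => m,
            _ => i,
        };
        self.max = Some(m);
        Some((m, i))
    }
}

trait WithMax: Iterator + Sized {
    fn with_max(self) -> Max<Self> {
        Max { inner: self, max: None }
    }
}

impl<I: Iterator> WithMax for I {}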

Except that the usage snippet above is deceptive: we can't chain these calls! While it's possible to chain with_max() and enumerate(), the sought-after max and min combination can't be composed the way the example suggests.

Somewhat unexpectedly though, this actually compiles and runs:

for (min,(max,i)) in container.into_iter().with_max().with_min() {
    println!("value={:?}, min={:?}, max={:?}", i, min, max );
}

The output however reveals the issue:

 value=3, min=(3, 3), max=3
 value=2, min=(3, 2), max=3
 value=3, min=(3, 2), max=3
 value=4, min=(3, 2), max=4
 value=5, min=(3, 2), max=5

The “min” value is now a tuple, not the expected scalar value. Oops!

This is wandering into the territory of monads and functors. What we want is the ability to define a wrapper around the iterator Item type that can still be fed into one or more functions such that they can reach through the wrapper and process the underlying wrapped item, not the composite type.

That is, we're converting I into (I,I), and then we would like (I,(I,I)), (I,(I,(I,I))), etc… where the rightmost “I” is the underlying iterable Item and the ones on the left are the various min/max/sum/count wrappers. Instead, with the code above, we're getting (I,I) followed by ((I,I),(I,I)), which duplicates the first wrapper inside the second. Each additional adapter written this way compounds the problem into a rapidly growing tree of wrappers.

Implementing this is left as an exercise for the reader…

An aside: Is this even important?

Absolutely yes! The purpose of high-level languages with zero-cost abstractions is to let us have our cake and eat it too: to write readable, elegant, declarative code and still get optimal performance, as if we were hand-optimising the machine code.

As a real-world example, consider the computation of a bounding box and centre of mass for a set of points in three-dimensional space, such as stars in an astrometric survey, or a point cloud computed from stereoscopic images for photogrammetry. In this case, we would like to compute:

  • The minimum of the X, Y, and Z coordinates
  • The maximum of the X, Y, and Z coordinates
  • The sum of the X, Y, and Z coordinates
  • The total count

It's entirely conceivable that the data is being streamed in from disk, far too large to fit into memory. Or we could be dealing with bucketisation or some other algorithm that provides an opaque Iterator instead of neat little arrays that fit into cache.

We're faced with a choice: avoid using even the most basic functions in the standard library and roll our own loop manually, or suffer a tenfold performance hit as we scan the data repeatedly to calculate each of the aggregates above!
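
Rolling our own for the bounding-box example might look like the sketch below: a single pass, but with all of the bookkeeping written out by hand once again.

type Point = (f64, f64, f64);

fn bounds(mut points: impl Iterator<Item = Point>) -> Option<(Point, Point, Point, usize)> {
    let first = points.next()?;
    let (mut min, mut max, mut sum, mut count) = (first, first, first, 1usize);
    for (x, y, z) in points {
        min = (min.0.min(x), min.1.min(y), min.2.min(z));
        max = (max.0.max(x), max.1.max(y), max.2.max(z));
        sum = (sum.0 + x, sum.1 + y, sum.2 + z);
        count += 1;
    }
    // Centre of mass is sum / count, left to the caller.
    Some((min, max, sum, count))
}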

The min(), max(), len(), and similarly trivial functions shouldn't be special snowflakes with limited utility. These are core algorithms to a wide range of computations and should be composable.

Further reading

This is certainly not the only issue with the traditional Iterator design. It is high time for a rethink, not just of this interface, but of the general direction of procedural programming in relation to integrated query sub-languages. These advanced iterator designs ought not to be relegated to optional libraries; they need to be promoted into standard libraries.