Writing Custom Iterators
Steve Weston
(originally for package iterators
)
Peter Meilstrup
(adapted for package iterors
)
Source: vignettes/writing.Rmd
writing.Rmd
An iterator is a special type of object that supplies data on demand, one element 1 at a time. This is a nice abstraction that can help simplify many programs. Iterators are particularly useful in parallel computing, since they facilitate splitting a problem into smaller pieces that can then be executed in parallel.
Iterators can also be used to reduce the total memory that is needed at any one time. For example, if you want to process the lines of text in a file, it is common to write a loop that reads the file one line at a time, rather than reading the entire file in order to avoid running out of memory on huge files. That’s the basic idea of iterators. Iterators provide a standard method for getting the next element, which allows us to write functions that take an iterator as an argument to provide a source of data. The function doesn’t need to know what kind of iterator it is. It just needs to know how to get another piece of data. The data could be coming from a file, a database, a vector, or it could be dynamically generated.
There are a number of iterators that come in the iterors
package. The iteror.array
method allows you to iterate over
arrays, in much the same way as the standard apply
function. apply
has fixed rules on how the results are
returned, which may require you to reshape the results, which can be
inefficient, as well as inconvenient. But since i_apply
doesn’t process any data or combine the results, it is more flexible.
You can use i_apply
with the foreach
package
to perform a parallel apply
operation, and combine the
results any way you want via the .combine
argument to
foreach
.
Another iterator that comes in the iterors
package is
the isplit
function, which works much like the standard
split
function. split
returns a list
containing all of the data divided into groups. isplit
only
generates one group at a time, as they are needed, which can reduce the
amount memory that is needed.
But of course, there will be times when you need an iterator that
isn’t provided by the iterors
package. That is when you
need to write your own custom iterator. Fortunately, that is fairly easy
to do.
What methods are needed for an iteror?
Basically, an iteror
is an S3 object whose base class is
"iteror"
, which has defined methods called
nextOr
and iteror
.
The function iteror
is a generic method. The purpose of
the iteror
method is to construct an iterator for the
specified object. For example, this makes an iteror it
out
of a list:
When iteror
is called on a list (or any object without a
class) it finds the method iteror.default
, which constructs
an iterator which returns one element at a time from the given
vector.
If x
is already of class "iteror"
then
iteror(x)
simply returns x
, which seems odd at
first. But the iteror
method can be defined for other
objects that don’t define a nextOr
method. We call those
objects iterables, meaning that you can iterate over them. The
iterators
package defines iteror
methods for
vectors, lists, matrices, and data frames, making those objects
iterables. By defining an iteror
method for data types, you
can pass those types directly to any function that expects an
iterable.
If you want to create an iterator for some existing class, you can do
that by writing an iteror
method that constructs an
appropriate iterator. The alternative is to write your own function that
takes arbitrary arguments, and returns an iterator. You can choose
whichever method is most natural.
The most important method required for iterators is
nextOr(obj, or)
. Given an iteror obj
, this
simply returns the next value, or forces and returns its second argument
or.
Returning the value of or=
indicates that
there are no more values available in the iterator. There is (by design)
no default value for or
; you need to specify one.
Above we created an iterator from a list; we can now call
nextOr
on that iterator to get the values from the
list:
nextOr(it, NULL)
## [1] 1 2
nextOr(it, NULL)
## [1] 3 4
nextOr(it, NULL)
## NULL
This shows how nextOr
behaves at the stop of iteration,
returning whatever value was given to argument or
. Here we
specified NULL, but we will later see other ways to use the second
argument.
Handling the end of iteration
In many situations, you will be dealing with effectively finite data,
but data for which you don’t necessarily know the length. So if you
extract values one at a time from an iterator, you need to handle the
case where there isn’t a next value. This is where the
iterors
package improves on iterators
by
giving you a few options for how to respond to the end of iteration.
Breaking out of a loop at end of iteration
If you are consuming an iterator in a loop, the easiest (and often
fastest) way to respond to the end of iteration so to break out of the
loop. You just put a break directly in the second argument of
nextOr
.
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
Because the or
argument is lazily evaluated, the
break
does not execute until nextOr
evaluates
its second argument. You can put any code into the or
argument, which will come in useful below.
Detecting end of iteration with a sentinel value
Sometimes a break or return isn’t the right action to take. As we saw
above, nextOr
will return its second argument (whatever
that is) to signal end of iteration, when the iterator contains no more
data. So you can check whether the returned value is the same as the one
you provided, and interpret that as the end of iteration.
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
Here, if nextOr
returns NULL
, the
while
loop will exit. A special value used like
NULL
is used above, is often called a sentinel
value.
Unsafe sentinel values
It’s popular to use NULL
as a sentinel, but it’s worth
considering what would happen if there were an iterator that
legitimately returns NULL
as a value. For instance, this
list contains some tricky values:
bad_values <- list(
quote(.StopIteration),
NA,
NULL,
list(),
simpleError("StopIteration"),
try(stop("StopIteration", call.=FALSE), silent=TRUE),
"",
numeric(0))
Suppose you iterate over this list using NULL as your end-of-iteration sentinel, like this:
## .StopIteration
## [1] NA
The enumeration stops early because it mistakes NULL
, a
legitimate value in the list that we are iterating over, for
NULL
the sentinel value. This is a general problem with the
sentinel value scheme in dynamically-typed languages; any value you
could use to signal a stop, could also be a legitimate value for an
iterator to emit.
Safe sentinel values
A way to escape this problem is to construct a one-shot,
unique sentinel value to use locally, just with this iteror,
and then throw away, not keeping it in your code. One easy way to do
this by using new.env()
. Consuming an iterator using a
local sentinel value looks like this:
end_sentinel <- new.env()
it <- iteror(bad_values)
repeat {
val <- nextOr(it, end_sentinel)
if (identical(val, end_sentinel)) break
print(val)
}
## .StopIteration
## [1] NA
## NULL
## list()
## <simpleError: StopIteration>
## [1] "Error : StopIteration\n"
## attr(,"class")
## [1] "try-error"
## attr(,"condition")
## <simpleError: StopIteration>
## [1] ""
## numeric(0)
Because environments compare by reference, a newly constructed
environment is guaranteed not to be identical()
to any
other object in the R session, and the check for identity is very fast
(just a pointer comparison.) By using a locally generated, unique
sentinel value you avoid mistaking a legitimate value in your data for
the end of iteration.
A simple iterator
It’s time to show the implementation of a very simple iterator.
Although I’ve made it sound like you have to write your own
nextOr
methods, you can use a standard one. In fact, that’s
what all of the following examples do. The method
iteror.function
takes a function you specify and wraps it
in an object with class iteror
, which has a
nextOr
method defined.
Now here’s a function that creates a very simple iterator, one that
returns the same value x
indefinitely.:
iforever <- function(x) {
nextOr_ <- function(or) x
iteror(nextOr_)
}
Note that I called the internal function nextOr_
with an
underscore, rather than nextOr
. I do that by convention to
avoid masking the standard nextOr
generic function. That
would cause problems when you want your iterator to call the
nextOr
method of another iterator, which can be quite
useful, as we’ll see in a later example.
We create an instance of this iterator by calling the
iforever
function, and then use it by calling the
nextOr
method on the resulting object:
it <- iforever(42)
nextOr(it, NULL)
## [1] 42
nextOr(it, NULL)
## [1] 42
You can also get values from an iterator using
as.numeric
. But since this is an infinite iterator, you
need to also give an argument n
argument to avoid getting
stuck in an infinite loop until you run out of memory:
as.numeric(it, n=6)
## [1] 42 42 42 42 42 42
Notice that it doesn’t make sense to implement this iterator by
defining a new iteror
method, since there is no natural
iterable to dispatch on; the argument 42 is just a vector of length 1
and iterator.default
already handles that case. So instead,
we implement this iterator by defining a normal function that returns
the iterator.
This iterator is quite simple to implement, and possibly even useful.
2 The
iterator returned by iforever
is actually just the same
function you provided, with the class iteror
added. This
means you get nextOr
method already defined, whlch just
delegates to your given function. Additionally since the returned object
has class iteror
, it inherits an iteror
method
that will return itself.
Of course, the reason this iterator is so simple is because it doesn’t contain any state. Most iterators need to contain some state, or it will be difficult to make it return different values and eventually stop. Managing the state is usually the real trick to writing iterators.
A stateful iterator that stops
Let’s modify the previous iterator iforever
to stop
after it returns a certain number of values. I’ll call the new function
irep
, and give it another argument called
times
:
irep <- function(x, times) {
nextOr_ <- function(or) {
if (times > 0) {
times <<- times - 1
x
} else {
or
}
}
iteror(nextOr_)
}
Now let’s try it out:
## [1] 7 7 7 7 7 7
The differences between iforever
and irep
are in the inner function nextOr_
. This inner function not
only accesses the values of the variables x
and
times
, but it also modifies the value of
times
. This is accomplished by means of the
“<<-=
”3 operator, and the rules of lexical
scoping.
After enough calls, the inner function will find
times <= 0
, and will return its argument
or
. As the author of an iterator this is how you signal end
of iteration. When implementing an iterator you never need to do
anything with or
other than return it when appropriate;
otherwise leave it alone, allowing R to be lazy in treating that
argument.
Technically, this kind of function that refers to enclosing variables
is called a closure, and is a fundamental feature of
R
. The important thing to remember is that
nextOr_
is able to get the value of variables that were
passed as arguments to irep
, and it can modify those values
using the <<-
operator. These are not global
variables: they are defined in the enclosing environment of the
nextOr_
function. You can create as many iterators as you
want using the irep
function, and they will all work as
expected without conflicts.
Note that this iterator only uses the arguments to irep
to store its state. If any other state variables are needed, they can be
defined anywhere inside the irep
function.
Using an iterator inside an iterator
The previous section described a general way of writing custom iterators. Almost any iterator can be written using those basic techniques. At times, it may be simpler to make use of an existing iterator to implement a new iterator. Let’s say that you need an iterator that splits a vector into subvectors. That can allow you to process the vector in parallel, but still use vector operations, which is essential to getting good sequential performance in R. The following function returns just such an iterator:
ivector <- function(x, ...) {
i <- 1
it <- idiv(length(x), ...)
nextOr_ <- function(or) {
n <- nextOr(it, return(or))
ix <- seq(i, length=n)
i <<- i + n
x[ix]
}
iteror(nextOr_)
}
ivector
uses ...
to pass options on to
idiv
. idiv
supports the chunks
argument to split its argument into a specified number of pieces, and
the chunkSize
argument to split it into pieces of a
specified maximum size.
Let’s create an ivector
iterator to split a vector into
three pieces using the chunks
argument:
it <- ivector(1:25, chunks=3)
as.list(it)
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9
##
## [[2]]
## [1] 10 11 12 13 14 15 16 17
##
## [[3]]
## [1] 18 19 20 21 22 23 24 25
Note how our nextOr_
function handles the end of
iteration. If the underlying idiv
iterator reaches the end,
it will force its or
argument, which we have specified as
return(or)
, Although it is forced by the underlying
iterator, R correctly evaluates a return
call in a lazy
argument according to the scope that call was written in. So when
it
of the internal idiv
reaches the end, and
forces its or
argument, our nextOr_
passes
that signal along by returning its or
argument.
The nextOr
method uses a lazy evaluated argument
expressly for this purpose, so that you can respond to end of iteration
by using a control flow operator like return
,
break
, next
or stop
.
It should be clear that only minor modification need to be made to this function to create an iterator over the blocks of rows or columns of a matrix or data frame. But I’ll leave that as an exercise for the reader.
A recycling iterator
You can start to compose simple iterators together to do more complex things by writing functions that take in one iterator and construct a new one. In this example, we’ll return an iterator that recycles the values of the wrapped iterator: 4
i_recycle <- function(it) {
values <- as.list(iteror(it))
i <- length(values)
nextOr_ <- function(or) {
i <<- i + 1
if (i > length(values)) i <<- 1
values[[i]]
}
iteror(nextOr_)
}
This is fairly nice, but note that this is another one of those
infinite iterators that we need to be careful about. Also, make sure
that you don’t pass an infinite iterator to i_recycle
. That
would be pointless of course, since there’s no reason to recycle an
iterator that never ends. It would be possible to write this to avoid
that problem by not grabbing all of the values right up front, but you
would still end up saving values that will never be recycled, so I’ve
opted to keep this simple.
Let’s try it out:
## [1] 1 2 3 1 2 3 1 2 3
Limiting infinite iterators
I was tempted to add an argument to the i_recycle
function to limit the number of values that it returns, because
sometimes you want to recycle for awhile, but not forever. I didn’t do
that, because rather than make i_recycle
more complicated,
I decided to write another function that takes an iterator and returns a
modified iterator. Functions like this are composable; the
limiting function can be applied to any underlying iterator instead of
just i_recycle
.
i_limit <- function(it, times) {
it <- iteror(it)
nextOr_ <- function(or) {
if (times > 0) {
times <<- times - 1
nextOr(it, or)
} else
return(or)
}
iteror(nextOr_)
}
By convention, the iterors
package uses the prefix
i_
for this kind of composing iterator function, and a bare
i
for functions that build an iterator based on basic
data.
Note that this looks an awful lot like the irep
function
that we implemented previously. In fact, using i_limit
, we
can implement irep
using iforever
much more
simply, and without duplication of code:
irep2 <- function(x, times)
i_limit(iforever(x), times)
To demonstrate irep2
:
## [1] "foo"
## [1] "foo"
## [1] "foo"
Here’s one last example. Let’s recycle a vector three times using
i_limit
, and convert it back into a vector using
as.numeric
:
iterable <- 1:3
n <- 3
it <- i_limit(i_recycle(iterable), n * length(iterable))
as.numeric(it)
## [1] 1 2 3 1 2 3 1 2 3
Sort of a complicated version of:
rep(iterable, n)
## [1] 1 2 3 1 2 3 1 2 3
Aren’t iterators fun?
Conclusion
Writing your own iterators can be quite simple, and yet is very
useful and powerful. It provides a very effective way to extend the
capabilities of other packages that use iterators, such as the
foreach
package. By writing iterators that wrap other
iterators, it is possible to put together a powerful and flexible set of
tools that work well together, and that can solve many of the complex
problems that come up in parallel computing.