Yield! Reconsidering APIs with Collections

Yield: A Little Background

The yield keyword in C# is pretty cool. Used within an iterator, yield lets a function return an item and hand control of execution back to the caller, then resume where it left off on the next iteration. Neat, right? The MSDN documentation lists these limitations on the use of the yield keyword:

  • Unsafe blocks are not allowed.
  • Parameters to the method, operator, or accessor cannot be ref or out.
  • A yield return statement cannot be located anywhere inside a try-catch block. It can be located in a try block if the try block is followed by a finally block.
  • A yield break statement may be located in a try block or a catch block but not a finally block.
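
To make the "resume where it left off" behavior concrete, here is a minimal sketch. The names YieldBasics, CountTo, and Run, along with the log list, are hypothetical and exist only for illustration; the point is just to show the order in which the iterator and the caller take turns.

```csharp
using System.Collections.Generic;

public static class YieldBasics
{
    // Hypothetical iterator: yields 1..max and records when each step runs.
    public static IEnumerable<int> CountTo(int max, List<string> log)
    {
        for (var i = 1; i <= max; i++)
        {
            log.Add("producing " + i);
            yield return i; // hand the item and control back to the caller
            // execution resumes here when the caller asks for the next item
        }
    }

    // Interleaves caller and iterator work so the hand-offs are visible.
    public static List<string> Run()
    {
        var log = new List<string>();
        foreach (var n in CountTo(2, log))
        {
            log.Add("consuming " + n);
        }
        return log;
    }
}
```

Calling YieldBasics.Run() gives a log that alternates "producing" and "consuming" entries, rather than all of the producing entries followed by all of the consuming ones.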

So what does this have to do with API specifications?

A whole lot really, especially if you're dealing with collections. I personally haven't been a big user of the yield keyword, but I've never really been forced to use it. After playing around with it for a bit, I saw a lot of potential. I've written before about what I think makes a good API. In my article, I was making a point to discuss two perspectives:

  • Who needs to implement your interface. You want it to be easy for them to implement.
  • Who needs to call your interface. You want it to be easy for them to use.

In my opinion, the IEnumerable<T> interface was a tricky thing to work with as a return value. You can essentially only iterate over an IEnumerable<T>, and at the time of calling a function, maybe that's not what you want to do. The flip side is that for the person implementing the interface, IEnumerable<T> is a really easy interface to satisfy. However, the yield keyword has opened up some new doors.

In this article, I'd like to go over a couple of different approaches for an API and then explain why the yield keyword might be something you consider next time around. Disclaimer: I'm not claiming anything I'm about to present is the only way or the best way--I'm just sharing some of my own findings and perspective.

Interface For Returning Collections

The first type of API I'd like to look at is for returning collections. Based on my own API guidelines, I'd ideally choose an interface or class to return that provides a lot of information to the caller that is also easy to create for the implementer of my interface. The List<T> class is a great choice:

  • It's easy to construct
  • It's built into the .NET Framework
  • It provides many handy functions (All of the IList<T> functionality as well as things like AddRange(), or functions that support delegates)

My next choice might be to have a return type of IList<T>, which would provide a little less ease of use to the caller, but make it even easier for the implementer of the interface. They could return arrays of type T, since an array implements the IList<T> interface, or their own custom list implementation that doesn't inherit from the List<T> class. The differences between using IList<T> and List<T> are arguably pretty small.
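
As a rough sketch of what that flexibility looks like for the implementer, both methods below satisfy an IList<int> return type. The class and method names are hypothetical, made up for this example.

```csharp
using System.Collections.Generic;

public static class ListReturnTypes
{
    // An array satisfies IList<T>, so no List<T> needs to be built.
    public static IList<int> GetItemsFromArray()
    {
        return new[] { 1, 2, 3 };
    }

    // A List<T> works just as well behind the same return type.
    public static IList<int> GetItemsFromList()
    {
        return new List<int> { 1, 2, 3 };
    }
}
```

One thing for callers to keep in mind: an array is fixed-size, so calling Add() on the array-backed result throws a NotSupportedException even though the IList<T> interface exposes the method. That is part of the "little less ease of use" trade-off.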

A third alternative, which I would have avoided in the past, is to return an IEnumerable<T>. My opinion used to be that this made the life of the interface implementer a bit easier compared to returning an IList<T>, but complicated the life of the caller for a couple of reasons:

  • The caller would have to use the results of the function in a foreach loop.
  • The caller would have to add the items to their own collection to be able to do much more with the items.

My naive implementations when forced to return an IEnumerable<T> were... well... crap. I would have constructed a collection within the function, filled it up, and then returned it as an IEnumerable<T>. Then as the caller of my function, I'd have to re-enumerate the results (or add them to another collection):

public static IEnumerable<T> GetItems<T>()
{
    var collection = new List<T>();
    // add all the items to a collection
    return collection;
}

private static void Main()
{
    // copy the results into a collection of our own...
    var myCollection = new List<int>();
    myCollection.AddRange(GetItems<int>());
    // use myCollection...

    // or.....
    foreach (var item in GetItems<int>())
    {
        // use the items
    }
}

Seems like overkill to me with that implementation. However, we'll examine how using yield can truly transform this into something... better. So to reiterate, a few potential implementations for an API involving collections might be:

  • Return a List<T> class
  • Return an IList<T> (or even an ICollection<T>) interface
  • Return an IEnumerable<T> interface

Constantly Creating Collections

My design decisions, in the past, were really driven by two guidelines:

  • Make it easier for the person implementing/extending the API
  • Make it easy for the person consuming the API

As I quickly illustrated in the first section, this meant that I would have a method where I would create a collection, fill it with items, and then return it. I could generally pick any concrete collection class and return it since I would usually pick a simple collection as the return type. Easy.

One thing that might be noticeable with this approach is that it looks pretty inefficient to keep creating new collections, filling them, and then returning them. I'll illustrate with a simple example. We'll create a class that has a method on it called GetItems(). As per my reasoning presented earlier, we'll have this method return a List<T> instance, and to make this example easier to work with, we'll pass in an IEnumerable<T> instance. For what it's worth, the input to this function is really just for demonstration purposes here. We're really focusing on how we're creating our return value.

using System.Collections.Generic;

public class CreateNewListApi<T>
{
    public List<T> GetItems(IEnumerable<T> input)
    {
        var newCollection = new List<T>();

        foreach (var item in input)
        {
            newCollection.Add(item);
        }

        return newCollection;
    }
}

And now that we have our simple class we can mock up a little test for performance... Just how inefficient is creating new lists every time?

using System;
using System.Diagnostics;

internal class Program
{
    private static void Main(string[] args)
    {
        const int NUM_ITEMS = 100000000;
        var inputItems = new int[NUM_ITEMS];

        Console.WriteLine("API Creating New Collections");
        var api = new CreateNewListApi<int>();

        var watch = Stopwatch.StartNew();
        var results = api.GetItems(inputItems);

        foreach (var item in results)
        {
        }

        Console.WriteLine(watch.Elapsed);
        Console.WriteLine(Process.GetCurrentProcess().PrivateMemorySize64);
        Console.ReadLine();
    }
}

When I run this on my machine, I get an average of about 1.73 seconds. The memory printout I get when running is 1615908864 bytes (roughly 1.6 GB). So is that slow? Is that a lot of memory usage? I think it's pretty hard to say conclusively without being able to compare it against anything. So let's keep these numbers in mind as we continue to investigate the alternatives.

Side Note: At this point, some readers may be saying "Well, if the input to our function was also a list (or if whatever our function has to work with was otherwise equivalent to our return value) then we wouldn't have to go populate a new collection every time... We can just return the underlying collection!" And I would say you are absolutely correct. If your function has access to an instance of the same type as the return type, then you could always just return that instance. But what implications does this have? You're now giving people access to your underlying internals, and they can go modify them as they please. So, if you need to control access to items being added or removed, then it might not make sense for you to expose your internal collections like this.
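
Here is a small sketch of that trade-off. The Inventory class and its members are hypothetical, invented for this illustration: one method hands out the internal list directly, the other hands out a copy.

```csharp
using System.Collections.Generic;

public class Inventory
{
    private readonly List<string> _items = new List<string> { "apple", "pear" };

    // Returns the internal list itself: cheap, but callers can mutate it.
    public List<string> GetItemsDirect()
    {
        return _items;
    }

    // Returns a copy: costs an allocation, but the internals stay protected.
    public List<string> GetItemsCopy()
    {
        return new List<string>(_items);
    }

    public int Count
    {
        get { return _items.Count; }
    }
}
```

Adding an item to the result of GetItemsCopy() leaves Count unchanged, while adding an item to the result of GetItemsDirect() changes the inventory from the outside, with no chance for the class to validate or react.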

Yield to Incoming API Alternatives

We've seen how my past implementations may have looked, so how might we improve on this? If we tweak our API a bit, we can make our method return an IEnumerable<T> instead. Let's see what that might look like:

using System.Collections.Generic;

public class YieldingApi<T>
{
    public IEnumerable<T> GetItems(IEnumerable<T> input)
    {
        foreach (var item in input)
        {
            yield return item;
        }
    }
}

So in this API implementation, all we'll be doing is iterating over some type of collection and then yielding each result. If we run it through the same type of test as our previous API implementation, what kind of results do we end up with?

using System;
using System.Diagnostics;

internal class Program
{
    private static void Main(string[] args)
    {
        const int NUM_ITEMS = 100000000;
        var inputItems = new int[NUM_ITEMS];

        Console.WriteLine("API Yielding");
        var api = new YieldingApi<int>();

        var watch = Stopwatch.StartNew();
        var results = api.GetItems(inputItems);

        foreach (var item in results)
        {
        }

        Console.WriteLine(watch.Elapsed);
        Console.WriteLine(Process.GetCurrentProcess().PrivateMemorySize64);
        Console.ReadLine();
    }
}

When I run this on my machine, I get an average of about 2.80 seconds. The memory printout I get when running is 449409024 bytes. How does this relate back to our first implementation? Well, it's certainly slower. It takes about 1.62x as long to enumerate using the yield implementation as it did with the first API we created. However, yield also uses less than 1/3 (about 27.8%, actually) of the memory footprint when compared to the first implementation. Pretty cool results!

Side Note: So yield was a bit slower according to our results, but what happens if we print the elapsed time before we run that foreach loop? Well, on my machine it averages at about one millisecond. Now that's fast, right?! The cool thing about using yield with the IEnumerable<T> interface is that the work is deferred. That is, not until the program goes to actually run the enumeration do we get our performance hit. Try it out! Try moving the time printout from after the foreach loop to before the foreach loop. Try sticking breakpoints in on the line that yields. You'll see what I mean.
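
If you'd rather see the deferral without a stopwatch or a debugger, a counter works too. This is only a sketch: DeferredDemo, ItemsProduced, and Run are hypothetical names, and the iterator mirrors the GetItems() method from the yielding API above.

```csharp
using System.Collections.Generic;

public static class DeferredDemo
{
    // Counts how many times the iterator body has produced an item.
    public static int ItemsProduced;

    public static IEnumerable<int> GetItems(IEnumerable<int> input)
    {
        foreach (var item in input)
        {
            ItemsProduced++;
            yield return item;
        }
    }

    // Returns the counter value before and after enumerating the results.
    public static int[] Run()
    {
        ItemsProduced = 0;
        var results = GetItems(new[] { 10, 20, 30 });
        var beforeEnumeration = ItemsProduced; // nothing has run yet

        foreach (var item in results)
        {
        }

        var afterEnumeration = ItemsProduced; // the body ran during the loop
        return new[] { beforeEnumeration, afterEnumeration };
    }
}
```

The counter is still 0 right after GetItems() returns and only reaches 3 once the foreach loop has finished. Calling GetItems() just builds the iterator; the body runs as the caller pulls items out of it.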

Summary

In this article, I've explored two different ways of implementing an API (specifically focusing on the return value). We saw a brief performance analysis between the two and I highlighted some differences in both approaches. Let's recap though:

  • Approach 1: Returning a List<T> and creating the collection ahead of time
    • Appeared to be overall a bit faster than yielding.
    • Consumed much more memory than yielding.
    • Callers can use the results immediately for enumeration, checking count, or as a collection to add more things to
    • The return type of List<T> is a bit more restrictive than an IEnumerable<T> like in the second API implementation
  • Approach 2: Return type of IEnumerable<T> and yielding results
    • Appeared to be overall a bit slower than the List<T> implementation
    • Lazy. We don't actually execute any enumeration code until the caller actually enumerates
    • Consumed significantly less memory than the first approach using List<T>
    • Callers can enumerate the results immediately, but they need to add the results to a collection class to do much more than enumerate

So next time you're designing an API for your interfaces and classes, try keeping these things in mind!

EDIT (December 30th, 2013): As per some comments on Google+ by Dan Nemec, I figured I'd add a bit more here in the summary. IEnumerable<T> on its own is certainly not useless, especially if you're leveraging LINQ or extension methods. My main beef in the past was that the consumer of an API with an IEnumerable<T> return value can only iterate over the results... And that's just because that's all that IEnumerable<T> lets you do. Dan made a great point though: if you are leveraging things like extension methods, or LINQ (which introduces tons of handy extension methods for working with IEnumerable<T>), then you get all of that functionality tacked on to IEnumerable<T>.

So if you're not fortunate enough to be working with LINQ or extension methods (i.e. working with legacy code in old .NET framework versions... and yes I am familiar with the attribute you can add in to allow extension methods provided you have a compiler version high enough to support it), then IEnumerable<T> sometimes just plain sucks. I'd wager the majority of C# developers aren't in this boat though, so I'd like to thank Dan again for his comments.
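
For anyone who hasn't combined the two before, here's a quick sketch of what Dan's point looks like in practice. The LinqOnEnumerable class and its methods are hypothetical; Where(), Take(), and ToList() are the standard LINQ extension methods from System.Linq.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LinqOnEnumerable
{
    // A yielding API that returns nothing more than IEnumerable<int>.
    public static IEnumerable<int> GetItems()
    {
        for (var i = 1; i <= 10; i++)
        {
            yield return i;
        }
    }

    // The caller filters, limits, and materializes without writing a loop.
    public static List<int> FirstThreeEvens()
    {
        return GetItems()
            .Where(n => n % 2 == 0)
            .Take(3)
            .ToList();
    }
}
```

The caller still ends up with a List<int> when they want one, but only at the point where they ask for it with ToList().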