In last week's issue I gave an intro to how iterators were a big step forward for me, and how they became a challenge later on. Since then, I published a couple of videos on YouTube. One covered two common challenges I see people run into with iterators. The other discussed some of the design decisions I started considering when moving towards a paging API.
In this article we'll focus on some of those design decisions! What makes an API easy to call, what makes it easy to implement, and where do we actually gain performance? Let's get into it!
NOTE: exclusive newsletter articles should be ad-free aside from affiliate ads. You should not be seeing any Google ads. If you are, please send a screenshot to [email protected] and I'll fix it. Thank you for your support!
What's In This Article - From Iterators to Paging APIs
Not The Article I Wanted...
Truthfully -- I wanted this article to start diving into the performance characteristics. There's some funky stuff with iterators vs materialized collections that is important to understand, and some really cool optimizations with arrays and lists we get in newer dotnet versions. However, my benchmarks seem broken.
And they aren't just broken in the sense that I am proving myself wrong (paging being way slower than an iterator AND slower than materializing a whole set of records)... They're literally broken. The benchmark was trying to tell me that materializing a list of 10 million records when I used an iterator was allocating only 6,000 bytes. Sorry, but that ain't right. I'll revisit this next week after I sleep on it... I bet that as soon as I press send on this newsletter, I'll figure it out!
But... we can focus on some other important parts that I covered in the earlier video topics!
Motivation to Move from Iterators to Paging APIs
While there are some performance gains to be realized without using an iterator and yield-returning items, that wasn't my primary reason. For the most part, I feel like unless I am on a performance-critical hot path, if I have two comparable ways to do something, I'd like to opt for the readable one. In the case of IEnumerables and iterators, it was often less code and seemed more readable than having to copy stuff into an array or a list every time with the same sort of boilerplate:
using var cmd = _connection!.CreateCommand();
cmd.CommandText = "SELECT * FROM Users LIMIT @pageSize OFFSET @offset";
cmd.Parameters.Add(new SqliteParameter("@pageSize", DbType.Int32) { Value = pageSize });
cmd.Parameters.Add(new SqliteParameter("@offset", DbType.Int32) { Value = offset });
using var reader = cmd.ExecuteReader();
// boilerplate being the new collection, looping, adding to the list
// sometimes using logic to set the initial capacity
var users = new List<User>(pageSize);
while (reader.Read())
{
    var id = reader.GetInt32(0);
    var username = reader.GetString(1);
    users.Add(new User(id, username));
}
return users;
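For contrast, here's roughly what the iterator version looks like. This is a sketch, assuming the same _connection field and User type as above:

private IEnumerable<User> GetUsers(int pageSize, int offset)
{
    using var cmd = _connection!.CreateCommand();
    cmd.CommandText = "SELECT * FROM Users LIMIT @pageSize OFFSET @offset";
    cmd.Parameters.Add(new SqliteParameter("@pageSize", SqliteType.Integer) { Value = pageSize });
    cmd.Parameters.Add(new SqliteParameter("@offset", SqliteType.Integer) { Value = offset });

    using var reader = cmd.ExecuteReader();
    while (reader.Read())
    {
        // no intermediate collection: each row is handed to the caller as it's read
        yield return new User(reader.GetInt32(0), reader.GetString(1));
    }
}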
So that would be a win for iterators over the paging APIs I moved towards. But I have lived through *so* many painful hours working with developers to troubleshoot iterator bugs. Some of those bugs include:
- Not materializing an iterator and re-evaluating it multiple times (keep re-querying that data!) -- see the sketch after this list
- Materializing outrageously large sets of data (just download more RAM, right?)
- Holding resources open too long OR... closing off resources too early!
- Not grasping/noticing the lazy-behavior of iterators and realizing/materializing them in inappropriate contexts (i.e. the UI thread)
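To make that first pitfall concrete, here's a minimal sketch. It assumes the GetUsers iterator from earlier and System.Linq:

IEnumerable<User> users = GetUsers(pageSize: 10, offset: 0); // no query has run yet!

var count = users.Count(); // enumerates the iterator: query #1
var first = users.First(); // enumerates it again: query #2

// Materializing once up front avoids the repeated queries:
IReadOnlyList<User> materialized = GetUsers(pageSize: 10, offset: 0).ToList();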
You can check out this video for more reasons that pushed me over the edge:
What Do We Need For Paging?
Earlier this week I posted a video about API design with respect to passing in or returning enumerables/collections. When it comes to paging, we're not passing in any enumerables or collections, but we do need to consider what we're returning. Before we discuss that, let's talk about the input parameters.
Parameters We Pass In
Offset. Count. That's it.
Okay -- that's MOSTLY what we need, but it's not over yet. I want you to think for a second about where we're getting data from when we talk about these paging APIs.
Parameters We Pass In - What's Our Source
For me, it's generally from a database -- primarily MySQL and MongoDB are the two flavors I use a lot. It's also important to note that I (I know this will sound shocking) roll my own SQL 95% of the time and I don't use Entity Framework.
The reason I mention this is because when we're thinking about paging from these data sources, we want to push the paging work TO the data source. Let the source of data do the paging for us! If we can tell the data source to only give us back the small page of data instead of pulling back a HUGE data set, even if we're streaming it, we're better off. We don't want to have to filter that stuff in RAM if we can make the data source do it.
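For example, with MongoDB the same idea might look like this. It's a sketch using the official MongoDB.Driver package, where the collection field and variable names are my assumptions:

// Skip/Limit are applied by the MongoDB server,
// so only the requested page ever crosses the wire
var page = await _userCollection
    .Find(Builders<User>.Filter.Empty)
    .Skip(offset)
    .Limit(pageSize)
    .ToListAsync(cancellationToken);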
So think about what your data source needs in order to do paging. I know offset+count work for what I do most of the time when I'm dealing with these databases. But what if you're reading from a file? What if you already have stuff in memory? What if you need to hit a web API?
Parameters We Pass In - What Are The Bounds?
Some of this might seem trivial, but I think the thought exercise is good. Consider our offset parameter:
- Is anything less than 0 valid?
- Is 0 valid?
- Is something over 0 valid?
- Is something greater than the size of the dataset valid?
Some of these might feel more objective than others but... It depends. For me, negative numbers are a no-no here. Zero and greater than zero are totally valid -- and that includes going beyond the length of the dataset. Why? Because I don't want to have to check. If you ask for something beyond the size, you just get nothing back for that page of data.
As for the size, same exercise:
- Is anything less than 0 valid?
- Is 0 valid?
- Is something over 0 valid?
- Is something greater than the size of the dataset valid?
And of course my answers will be similar... except for a zero size, which I usually short circuit by returning an empty collection instantly. And I'm still on the fence about handling negative numbers. I have found that I tend to use my paging APIs with an offset but don't always know the size I want. I know that the dataset will always be "within reason" for size, so it's sometimes like "just give me everything ya got!". A negative number can be a flag for that, but it's a bit weird. I personally need to converge better on this, but having a dedicated API to "get all" is probably better here.
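Putting my answers above into code, the guard clauses might look something like this (a sketch -- the method name and the FetchUsersPage helper are hypothetical):

public IReadOnlyList<User> GetUsersPage(int offset, int pageSize)
{
    if (offset < 0)
    {
        // negative offsets are a no-no for me
        throw new ArgumentOutOfRangeException(nameof(offset));
    }

    if (pageSize == 0)
    {
        // short circuit: hand back an empty page without touching the data source
        return Array.Empty<User>();
    }

    // an offset beyond the dataset isn't an error -- the query simply returns no rows
    return FetchUsersPage(offset, pageSize); // hypothetical helper running the query from earlier
}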
Parameters We Pass In - Anything Else?
For me, I like to implement filtering libraries where I can make data transfer objects that represent filters that I can AND and OR together. These translate into the query language under the hood. So if I have such functionality, I might pass in a filter to operate on as well.
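As a rough sketch of what I mean, the filter DTOs might look like this (these types are hypothetical, not from a real library):

public abstract record UserFilter
{
    public sealed record ByUsername(string Username) : UserFilter;

    // composite filters let callers AND and OR conditions together
    public sealed record And(UserFilter Left, UserFilter Right) : UserFilter;
    public sealed record Or(UserFilter Left, UserFilter Right) : UserFilter;
}

Under the hood, the implementation walks the filter and translates it into the query language of whatever data source is being used.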
And what if you have async methods? Don't forget your cancellation token! Remember, this is often IO we're talking about here! It's slooowww in comparison to much of the other code we run. Make sure you can break out early with a token!
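Putting the parameters together, a paging method signature might end up looking something like this (again a sketch -- the interface name is hypothetical, and it reuses the hypothetical filter type from above):

public interface IUserRepository
{
    Task<IReadOnlyList<User>> GetUsersPageAsync(
        int offset,
        int pageSize,
        UserFilter? filter = null,                      // optional filter DTO
        CancellationToken cancellationToken = default); // lets callers bail out of slow IO early
}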
Return Types
Spoiler alert: I probably don't want to use IEnumerables here, personally. This is because when I implement paging, I have to materialize a collection under the hood anyway. My API design philosophy is usually: if it's simple to implement the API *and* I can provide more information to the caller (without overstepping), then it's nice to provide that. Therefore, collection and list variations are my choice here.
Okay so... What should we be using here?
- T[] (arrays)
- List<T>
- ICollection<T>
- IList<T>
- IReadOnlyCollection<T>
- IReadOnlyList<T>
- ... something else?
Personally, I like having my return types be readonly. I prefer the syntax when the result of a method is explicitly marked as not being modifiable, and this can help prevent a person implementing the API from exposing a collection that a caller can directly mutate. So with that said, array and list are generally off the table for me. HOWEVER, they do have some incredible performance optimizations in newer dotnet versions. In fact, I probably have some code to revisit on some hot paths because of this performance boost but... More on that next time.
Of these variations, IReadOnlyList<T> is usually my go-to. However, if I am unable to provide this interface (say I am using a collection without an indexer, like a Queue or a Stack) then I might fall back to an IReadOnlyCollection<T>.
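For example (with a hypothetical PendingJob type), a Queue<T> has no indexer, so IReadOnlyList<T> is out, but the readonly contract survives:

// Queue<T> implements IReadOnlyCollection<T> but not IReadOnlyList<T> (no indexer),
// so this method falls back to the weaker readonly interface
public IReadOnlyCollection<PendingJob> GetPendingJobs()
{
    var jobs = new Queue<PendingJob>();
    // ... populate the queue ...
    return jobs;
}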
Follow Up Video
You can watch this video to see this explained verbally for a different way to hear what I have to say:
What Happens Next?
Refactoring Code For Paging
One obvious move is to go find alllll of the spots where I was doing in-memory paging. This meant Skip/Take LINQ calls could be updated -- and keep in mind, I am not using EF Core, so this changes up the dynamic a bit:
- When using Skip/Take, usually we're doing this AFTER we have a set of results returned back and we want to perform the paging on the result set. Of course, this is inefficient because we did all the heavy lifting to pull the data back and now we're going to do the paging in memory.
- When we do paging, we need to instead pass the offset (the skip) and the page size (the take) into the method calls that were fetching data. Depending on your system, this may be through several layers. So we push this concept down into the lower "layers" that fetch data. And if you are using EF Core, you still get to use Skip/Take down there!
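In code, the refactor described above looks roughly like this (the method names are hypothetical):

// Before: pull everything back, then page in memory with LINQ
var inMemoryPage = GetAllUsers()
    .Skip(offset)
    .Take(pageSize)
    .ToList();

// After: push the offset and page size down to the layer that fetches data
var pagedFromSource = GetUsersPage(offset, pageSize);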
The other thing to consider is that not every API needs to be made to page:
- I have some stuff that only exists in memory in small amounts. I might stick to always giving back the entire list/array.
- There might be situations where streaming data still totally makes sense -- don't completely rule it out!
- There are some sets of data I pull from a DB and immediately cache into memory to hold there for the lifetime of the app/service. No paging required!
Measuring Performance
And of course, this brings us back full circle. Let me get my benchmarks together, because they're currently pretty damning. They seem to suggest that reading by pages is 100x slower than an iterator... but also that reading a page of 10 million records fully into memory while using an iterator doesn't allocate bytes. So I'm clearly underslept.
If you want a bit of a sneak peek into this though, much earlier this year I filmed a video when I was doing some of this transition. You can check out some iterator vs collection performance here. Just keep in mind... I didn't include an IReadOnlyList!
Wrapping up Paging APIs
Well, despite this not being the original article I wanted to write for issue 24, I hope that you found it insightful! My goal is not to tell you to never use iterators, or that iterators/enumerables are evil... My goal is to educate you about some of the challenges I've lived through with them on teams, so you can try to come up with solutions to avoid such problems!
If you're thinking about going through your codebase and refactoring things to take advantage of paging, you might find my course on refactoring helpful:
Are you interested in boosting your refactoring skills? Check out my course on Dometrain:
Refactoring for C# Devs: From Zero to Hero
Over the Christmas holidays, Nick Chapsas is running a 30% discount so you can pick this up for a STEAL! Use code: HOLIDAYS30 if it's still active on the site!