January 13, 2019

Dealing with large data sets can call for new techniques to keep performance acceptable. When you have a few million documents or rows, you can do things differently than when you have tens of millions, whether in a standard SQL database or a NoSQL database like Couchbase. One thing we noticed at CKH is that in very large document stores, OFFSET/LIMIT-style pagination on both N1QL queries and Views in Couchbase gets slow as the OFFSET grows. That becomes a problem when you want to build efficient, consistent paging into your system. Not to worry: Couchbase provides a solution.
The concept of paging through a result set is not a difficult one. Using OFFSET and LIMIT is the easiest way to write database queries for pagination. Together, OFFSET and LIMIT make up the pagination clause of the SELECT statement. Let's take a look at how that reads in N1QL.
SELECT *
FROM `travel-sample`
WHERE type = 'hotel'
ORDER BY country, city
OFFSET 0
LIMIT 10;
This query would return the first 10 hotels in the “travel-sample” bucket ordered by country and city. To get the next 10, we simply increment the previous OFFSET by the size of our page (10).
SELECT *
FROM `travel-sample`
WHERE type = 'hotel'
ORDER BY country, city
OFFSET 10
LIMIT 10;
The issue comes in at scale when that OFFSET value becomes very large.
SELECT *
FROM `travel-sample`
WHERE type = 'hotel'
ORDER BY country, city
OFFSET 1000000
LIMIT 10;
This is because, as the OFFSET increases, the indexer must still read and discard every preceding entry just to find where its result set starts and ends. So, how do we solve this issue? Instead of using an OFFSET at all, we tell the indexer where each page starts by a unique key: we simply pass in the last key of the previous page. Let's take a look at how that can be done.
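As an aside, the cost difference is easy to illustrate outside of Couchbase. This is a minimal pure-Python sketch, not Couchbase code: a sorted list of keys stands in for the index, and `bisect` plays the role of the index seek that a keyset predicate enables.

```python
import bisect

# A sorted list of keys stands in for the index.
keys = [f"doc-{i:07d}" for i in range(1_000_000)]

def page_by_offset(offset, limit):
    # An OFFSET scan: the indexer walks past `offset` entries
    # before it can return anything.
    return keys[offset:offset + limit]

def page_by_keyset(last_key, limit):
    # A keyset seek jumps straight past `last_key`, no matter how
    # deep into the index it sits.
    start = bisect.bisect_right(keys, last_key)
    return keys[start:start + limit]

# Both return the same page, but the keyset version does O(log n)
# work to find its starting point instead of O(offset).
assert page_by_offset(500_000, 10) == page_by_keyset("doc-0499999", 10)
```

The list slice hides the real cost here, of course; the point is only that a seek to a known key replaces a scan past every skipped row.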
First, we will need an index that includes our unique key. In Couchbase, the document ID is unique so we can use it as our key.
CREATE INDEX ixtypectcy ON `travel-sample` (type, country, city, META().id);
Now that we have an index to cover our query, we can run it again to get our pages, using the unique ID as the starting place of each page. For page one we can start at 0: because N1QL sorts numbers before strings, a comparison against 0 matches every string value, so the first page starts at the very beginning.
SELECT country, city, META().id
FROM `travel-sample` USE INDEX (ixtypectcy)
WHERE type = "hotel"
AND country >= 0
AND city >= 0
AND META().id > 0
ORDER BY country, city, META().id
LIMIT 10;
For the next page, we start from the key values and document ID where the previous page left off:
SELECT country, city, META().id
FROM `travel-sample` USE INDEX (ixtypectcy)
WHERE type = "hotel"
AND country >= "France"
AND city >= "Avignon"
AND META().id > "038c8a13-e1e7-4848-80ec-8819ff923602"
ORDER BY country, city, META().id
LIMIT 10;
Both of these queries will return with the same consistent speed. The same will hold true no matter the number of documents we have to page through.
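The paging loop itself can be sketched in plain Python as well. This is a stand-in for the queries above, not Couchbase code: records are sorted by (country, city, id), and each page resumes after the last row of the previous one. A composite tuple comparison does the same job as the (country, city, META().id) predicate in the query.

```python
# Toy data set, sorted the same way the index orders it:
# by (country, city, id).
records = sorted(
    [("France", "Avignon", f"id-{i:02d}") for i in range(5)] +
    [("France", "Paris", f"id-{i:02d}") for i in range(5, 10)] +
    [("Germany", "Berlin", f"id-{i:02d}") for i in range(10, 15)]
)

def next_page(last_row, limit):
    # Resume exactly after last_row; None means "first page".
    # Python's tuple comparison gives us the composite-key seek.
    return [r for r in records if last_row is None or r > last_row][:limit]

page1 = next_page(None, 10)          # first ten rows
page2 = next_page(page1[-1], 10)     # everything after the last row seen
```

Note that comparing the whole (country, city, id) tuple at once avoids the edge cases that creep in when each column is filtered with an independent `>=`, which is worth keeping in mind when translating this back into a WHERE clause.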
This same technique can also be used for View queries. Every View result row carries a unique key alongside its value, so you can paginate through the results by pointing the startkey (and startkey_docid) of the next View query at the last row of the previous page.
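Here is a small Python sketch of that View-style paging. It simulates View rows as (key, doc id) pairs in a list; the `startkey` / `startkey_docid` names mirror the View query parameters, but the seek itself is just a list search here, so treat this as an illustration of the pattern rather than the View engine's exact semantics.

```python
# Simulated View output: (key, doc id) rows, sorted by key then id.
rows = sorted((f"key-{i % 3}", f"doc-{i:02d}") for i in range(9))

def view_page(startkey=None, startkey_docid=None, limit=4):
    if startkey is None:
        start = 0  # first page: start at the top
    else:
        # Resume just after the last row of the previous page.
        start = rows.index((startkey, startkey_docid)) + 1
    return rows[start:start + limit]

first = view_page()
second = view_page(*first[-1])   # pass the previous page's last row
```

Because duplicate keys are legal in Views, the doc id is what makes the resume point unambiguous, which is why both values are carried forward.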
For more information about the why and the how of this technique, have a look at this excellent article written by Keshav Murthy.
http://blog.couchbase.com/offset-keyset-pagination-n1ql-query-couchbase/
We have found this technique invaluable in large data sets when we need to perform operations across the whole dataset, or need to page deep into a result set. Hopefully, you can use it to improve your performance as well. If you are looking for someone to performance-tune your applications or want help optimizing Couchbase, let us know; we would be glad to help.
2 Comments
Thanks for posting this interesting technique. It is exactly what I was looking for.
However, after trying it on a large dataset, I don’t see a consistent speed using the doc id approach.
In my particular example, I have a dataset of around 2m documents. When using OFFSET of 1.9m and LIMIT 100, the query takes around 3.5s, the doc id approach takes around 2s. Though it is better, it doesn’t perform as I would expect (ex: around 20ms). It seems that query duration increases as the doc id corresponds to a document with a higher offset.
Is this an expected behaviour or am I doing something wrong?
Hmm, the doc ID approach's performance should be consistent (i.e., the same or very similar for any ID). Did you set up the index for your query?