How to quickly get a random set of rows from a PostgreSQL table?
Suppose you have a database table with a large number of records, and you need to extract a random subset of these records for various reasons such as testing, data analysis, or just exploring the data.
The manager came to me and asked me to extract a random set of records from a 20GB table. At first, I thought about using the standard solution method with the ORDER BY RANDOM() and LIMIT clauses. However, as it turns out, this method has some significant drawbacks when working with large tables.
Let's take a closer look at how ORDER BY RANDOM() works. According to the PostgreSQL documentation, RANDOM() generates a random value for each row, sorts the rows based on these random values, and returns the first N rows defined by the LIMIT clause.
But when we tried this method, the results were not as expected. EXPLAIN ANALYZE showed that the query took an inordinate amount of time and consumed a large amount of memory. Here are the results of our experiments:
EXPLAIN ANALYZE SELECT first_name FROM users ORDER BY RANDOM() LIMIT 100000; Limit (cost=142977.84..143227.84 rows=100000 width=18) (actual time=682.697..712.831 rows=100000 loops=1) -> Sort (cost=142977.84..145477.84 rows=1000000 width=18) (actual time=682.695..701.627 rows=100000 loops=1) Sort Key: (random()) Sort Method: external merge Disk: 33280kB -> Seq Scan on users (cost=0.00..22810.00 rows=1000000 width=18) (actual time=0.040..156.449 rows=1000000 loops=1) Planning Time: 0.236 ms Execution Time: 729.661 ms (7 rows)
After some research, I discovered the TABLESAMPLE BERNOULLI feature of PostgreSQL, which provides a much more efficient solution for extracting a random sample from a large table. The basic idea behind this method is to use a random number generator to determine whether each row of the table should be included in the sample. This method is much more efficient than sorting all the rows based on random values and only selecting the first N rows.
Here's how you can use TABLESAMPLE BERNOULLI to extract a random sample of records from a table:
SELECT * FROM test_table TABLESAMPLE BERNOULLI (10);
In this example, the TABLESAMPLE BERNOULLI (10) clause will select approximately 10% of the total rows in the testtable table. In this case, since the initial table has a million records, it will select around 100,000 rows.
Let's see how TABLESAMPLE BERNOULLI performs in comparison to ORDER BY RANDOM():
EXPLAIN ANALYZE SELECT first_name FROM users TABLESAMPLE BERNOULLI(10); Sample Scan on users (cost=0.00..11310.00 rows=100000 width=10) (actual time=0.040..77.849 rows=100429 loops=1) Sampling: bernoulli ('10'::real) Planning Time: 0.201 ms Execution Time: 86.680 ms (4 rows)
The results showed that TABLESAMPLE BERNOULLI performs better than ORDER BY RANDOM with LIMIT in terms of speed. TABLESAMPLE BERNOULLI was found to be nearly 8 times faster than ORDER BY RANDOM with LIMIT. This makes TABLESAMPLE BERNOULLI a better choice for selecting a random set of rows from a large PostgreSQL table.
It is important to note that the TABLESAMPLE BERNOULLI method does not guarantee that the selected rows will be truly random, as the Bernoulli distribution is a discrete probability distribution. However, it is still a much faster and more efficient method than ORDER BY RANDOM() and LIMIT for selecting a random set of rows from a large table.
For more information on the TABLESAMPLE BERNOULLI method and other sampling methods in PostgreSQL, you can refer to the official documentation: PostgreSQL.
In conclusion, if you need to quickly select a random set of rows from a large table in PostgreSQL, TABLESAMPLE BERNOULLI is the best method to use. With the right approach and a bit of creativity, you can solve even bigger problems with this method. For instance, it would be cool to get a limited dump of the entire database by foreign keys, but this is a challenge for another day. The question remains: how would you solve it?