Scaling past the single machine
TL;DR
We cut down disk usage to 12% of the original by using btrfs with zstd compression, and organizing the table data on disk for optimal query performance.
By using PostgreSQL 10 native partitioning, postgres_fdw and btrfs for storage, we cut down disk usage while maintaining performance.
Intro
Here at Modio AB we store a fair bit of data, with some of our datasets just about reaching "big data" size.
We've got enough data that it's infeasible to back it all up, and it doesn't fit well in RAM either. It's enough that asking a DB server to read data from disk will cause enough IO load to hurt other clients, simply because of the disk activity.
But not all data is equal. Data from the last few weeks is far more likely to be accessed and displayed than data from a year ago, so it doesn't all have to be treated equally.
For this article, I'll use a simple table with the following (example) schema:
id : int | clock : int | ns : int | value : int |
---|---|---|---|
12 | 1525080407 | 0123123 | 0 |
1200 | 1525080407 | 0123123 | 12 |
The table will have an index over the pair (id, clock) and nothing else. To avoid inter-table lookups at insert time, id is not a primary key.
Since this is time series data, it will naturally come in sorted by the clock field (as events happen), so all data for one moment is stored together on disk.
Most queries are of the form "outside temperature between 12 and 14", expressed in SQL as:
select * from history where id=12 and clock between (1525080407) and (1525080407+3600)
These queries would be better served if all the data for a single sensor were located together. Thus, our natural insert order causes data fragmentation on disk, because data is not laid out the way we query it.
This is the cost of using a generic datastore vs. a time series datastore that knows more about the layout of the data. A columnar store, or a specialized time series storage layout, performs a lot better here.
Partitioning
Native partitioning is a new feature in PostgreSQL 10, and it allows you to split out your data into smaller sub-groups. In our case, we use range-based partitioning on the clock column, in order to split data into tables on a monthly basis.
Column | Type | Nullable | Storage |
---|---|---|---|
id | integer | not null | plain |
clock | integer | not null | plain |
value | integer | not null | plain |
ns | integer | not null | plain |
Partition key: RANGE (clock)
Partitions: history_y2018m01 FOR VALUES FROM (1514764800) TO (1517443200),
history_y2018m02 FOR VALUES FROM (1517443200) TO (1519862400),
history_y2018m03 FOR VALUES FROM (1519862400) TO (1522540800)
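For reference, here's a minimal sketch of how such a partitioned table can be declared; the table and index names are illustrative, not our actual DDL:

```sql
-- Parent table, partitioned by the clock column.
CREATE TABLE history (
    id    integer NOT NULL,
    clock integer NOT NULL,
    value integer NOT NULL,
    ns    integer NOT NULL
) PARTITION BY RANGE (clock);

-- One partition per month, bounded by Unix timestamps.
CREATE TABLE history_y2018m01 PARTITION OF history
    FOR VALUES FROM (1514764800) TO (1517443200);
CREATE TABLE history_y2018m02 PARTITION OF history
    FOR VALUES FROM (1517443200) TO (1519862400);

-- In PostgreSQL 10, indexes are created on the partitions themselves,
-- not on the parent table.
CREATE INDEX ON history_y2018m01 (id, clock);
CREATE INDEX ON history_y2018m02 (id, clock);
```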
Partition maintenance
Once we've partitioned our data, we have smaller, more manageable datasets of ~10-15GB, each containing around half a billion datapoints.
To get the on-disk layout we want, we then re-organize each partition so that all the data for a single sensor is stored together. This involves using the CLUSTER command, which rewrites a table on disk so that rows appear in the order prescribed by one of its indexes. There is also an extension called pg_repack that can do something similar online, but pg_repack requires a primary key or a unique key, which we don't have.
CLUSTER takes an exclusive lock on the table, blocking both reads and writes, so any query touching that partition would block while it runs. To prevent this, we start by detaching the partition and attaching a new, empty table in its place. This causes a gap in the data for the time period that the maintenance runs.
After this, we make sure that there's a btree index on the table over (id, clock), since CLUSTER cannot use the more space-efficient BRIN index.
Once that's done, we CLUSTER the table, add CHECK constraints, and return it to its place in the parent table.
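A minimal sketch of that sequence, assuming the parent table and partition names from above (our actual tooling does this programmatically and handles many more edge cases):

```sql
-- Swap in an empty table so the parent keeps a partition for this range.
ALTER TABLE history DETACH PARTITION history_y2018m01;
CREATE TABLE history_y2018m01_tmp (LIKE history);
ALTER TABLE history ATTACH PARTITION history_y2018m01_tmp
    FOR VALUES FROM (1514764800) TO (1517443200);

-- CLUSTER needs a btree index; a BRIN index will not do here.
CREATE INDEX history_y2018m01_id_clock ON history_y2018m01 (id, clock);
CLUSTER history_y2018m01 USING history_y2018m01_id_clock;

-- A CHECK constraint matching the partition bounds lets ATTACH PARTITION
-- skip the full validation scan when the table is put back.
ALTER TABLE history_y2018m01
    ADD CONSTRAINT history_y2018m01_clock_check
    CHECK (clock >= 1514764800 AND clock < 1517443200);

-- Put the re-organized table back in its place.
ALTER TABLE history DETACH PARTITION history_y2018m01_tmp;
ALTER TABLE history ATTACH PARTITION history_y2018m01
    FOR VALUES FROM (1514764800) TO (1517443200);
```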
For the hairy details, you can see the source code.
The disk space issue
Now we have neatly ordered data on disk, but it's still using a lot of space. How do we deal with this? I set about testing cstore_fdw, which uses the ORC format to store and compress data.
The results of those experiments were mixed: tests with our usual queries were much slower than before. Our SELECT benchmark resulted in, at best, 11 transactions per second, compared to approximately 150-200 otherwise. On the other hand, cstore_fdw gave good savings on disk space (8.5 GB on disk for 75GB of stored data).
As an alternative, I ran a test with btrfs+zstd for archival tables, and while the compression was slightly worse (by approximately 3%), query performance was an order of magnitude higher.
The table below uses on-disk size, not apparent size; these differ between filesystems, so the numbers are ballpark figures rather than exact measures. The dataset for this test is ~75GB of time series data, and the queries represent our typical workload. The machine and PostgreSQL configuration are the same for all tests. The data is much larger than available RAM.
solution | disk used (KB) | insert tps | select tps |
---|---|---|---|
no compression | 75717528 | 3879 | 187 |
cstore_fdw | 9604680 | 4565 | 11 |
btrfs (zstd) | 9773740 | 4663 | 166 |
In other tests, compression gave a slight advantage in query performance, simply because of reduced IO latencies.
Putting it together
A simple way to wrap this together would be to use tablespaces, keeping archival data on a different filesystem from normal data. For various reasons, we decided against this to begin with, a decision we might return to.
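For illustration, the tablespace variant would have looked roughly like this (the mount point and names are made up for the example):

```sql
-- A tablespace on a separate, compressed filesystem (e.g. a btrfs+zstd mount),
-- with old partitions moved onto it. Note that SET TABLESPACE rewrites the
-- table and takes an exclusive lock while doing so.
CREATE TABLESPACE archive LOCATION '/srv/postgres/archive';
ALTER TABLE history_y2018m01 SET TABLESPACE archive;
```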
Instead, we're using postgres_fdw to keep archival data in remote tables on a second database server.
Moving data from one database to another over postgres_fdw turned out to be slow, so I had to build a way to efficiently copy data between the databases, using the COPY command and pipes.
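Conceptually, the copy step boils down to a pair of COPY statements, one run against each server, with the output of the first piped into the second (for example via two psql sessions); the table name and sort order here are illustrative:

```sql
-- On the primary: stream the partition out, sorted the way we want it stored.
COPY (SELECT id, clock, value, ns
        FROM history_y2018m01
       ORDER BY id, clock)
TO STDOUT;

-- On the archive: read that stream straight into the target table.
COPY history_y2018m01 FROM STDIN;
```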
The steps
So, the tool will:
- Create tables on the archive database
- Create foreign tables on the primary database (sketched below)
- Detach source table from the parent table
- Attach foreign table to the parent table
- Use COPY to move and sort the data from primary to archive
- Cluster the table after data injection
- Clean up after itself
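A hedged sketch of the foreign-table steps above, as run on the primary database; the server name, credentials and table names are placeholders, not our real setup:

```sql
-- One-time setup: point postgres_fdw at the archive server.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER archive_db FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive.example.com', dbname 'history', port '5432');
CREATE USER MAPPING FOR CURRENT_USER SERVER archive_db
    OPTIONS (user 'history', password 'secret');

-- Per partition: a foreign table pointing at the archived copy...
CREATE FOREIGN TABLE history_y2018m01_fdw (
    id    integer NOT NULL,
    clock integer NOT NULL,
    value integer NOT NULL,
    ns    integer NOT NULL
) SERVER archive_db OPTIONS (table_name 'history_y2018m01');

-- ...attached to the parent in place of the local partition.
ALTER TABLE history DETACH PARTITION history_y2018m01;
ALTER TABLE history ATTACH PARTITION history_y2018m01_fdw
    FOR VALUES FROM (1514764800) TO (1517443200);
```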
Overall, getting all the small details correct has been difficult. However, the end result is a reduction of disk usage to 12% of the original, while keeping query performance at around 90% of the original.
Compression Gains
A single month's data (September 2017) according to compsize:
Type | Perc | Disk Usage | Uncompressed | Referenced |
---|---|---|---|---|
TOTAL | 11% | 1.4G | 12G | 12G |
BRIN index vs. Btree index
For the above table a brin index is approximately 4.1MB uncompressed, while a btree index for the same columns and dataset is 7.4GB.
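For reference, the two index definitions being compared look like this (names are illustrative):

```sql
-- BRIN: a few MB, and effective here because the table is CLUSTERed on (id, clock).
CREATE INDEX history_y2017m09_brin ON history_y2017m09 USING brin (id, clock);

-- Btree on the same columns: better for point lookups on fresh data,
-- but several gigabytes on disk for this dataset.
CREATE INDEX history_y2017m09_btree ON history_y2017m09 USING btree (id, clock);
```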
What about dropping nanosecond resolution of data?
We can't just remove the column from the table, even if that would be desirable, since that would break the foreign tables. However, zeroing it in the table and letting disk compression sort it out can help. In practice, this only saves ~1.1% in size.
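A sketch of what that looks like for one archived partition (again, names are illustrative):

```sql
-- Zero the nanosecond column; the constant values compress to almost nothing.
UPDATE history_y2017m09 SET ns = 0 WHERE ns <> 0;
-- Rewrite the table so the dead row versions are reclaimed and the order is restored.
CLUSTER history_y2017m09;
```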
All together now
- Table partitioning in ranges
- Use brin indexes on older data (disk savings)
- Use btree indexes on fresh data (performance)
- CLUSTER tables on disk to further optimize the brin index (performance)
- Place older partitions on btrfs with zstd (disk savings)
- Use postgres_fdw to offload older data to another DB (disk savings)