Clickhouse vs Postgres - which database to use for analyzing nginx logs
I’ve been searching for a good way to analyze our nginx web logs for a long time. If money were no object, there would be plenty of options. I could have set up an AWS Redshift cluster or a Google BigQuery datastore. Or I could have just gone for one of the plans offered by the various APM tools on the market.
But money is indeed an issue in a startup working on a tight budget. Both Redshift and BigQuery would have ended up with unacceptable monthly bills, and the same goes for the off-the-shelf enterprise products. So I went searching for an open source solution. The ELK stack does fit the use case, and though I managed to set up a single-node cluster, keeping it running was a herculean task. It would fail irrevocably at the slightest disturbance. If a cleanup script failed to run and the disk filled up, it would fail. Sometimes it would fail for no obvious reason, and it would take me several hours to bring it back up. In the end I decided that the trouble of keeping the setup running was not worth it, and the cloud plan offered by the Elastic company was still out of our budget.
Finally I stumbled upon Clickhouse. It is an open source DBMS whose columnar architecture, where all the data belonging to a column is stored together, is claimed to be specifically suited for OLAP workloads. The claims were impressive and I decided to give it a try. But while trying it, I also wanted to check whether Postgres could serve my requirement as well. I always reach for SQLAlchemy whenever I want to write queries, and considering how well it integrates with Postgres, I wanted to consider the possibility that Postgres itself would work. Maybe my data load was not heavy enough to warrant a specialized solution like Clickhouse. With these questions in mind, I proceeded to set up both Postgres and Clickhouse to process my nginx logs and compared the two.
Loading nginx data in the databases
I have my nginx logs stored in AWS S3 buckets, so I got a data dump by simply syncing the bucket's contents to a local folder. The logs spanned a period of approximately 2 months, and the total size of the folder with the gzipped logs worked out to 277 MB. I wrote some scripts to read these files and load the data into the two databases, Clickhouse and Postgres. I have shared the code used for this as a repository named Nginx Log Analytics. I might convert it into a ready-to-use PyPI package soon, and I will also write a separate detailed post about the modeling and loading techniques used. Since this post is focused on comparing the databases after the loading was done, you can just assume that the loading worked.
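The core of those loading scripts is turning each raw log line into a structured row. A minimal sketch of that parsing step, assuming nginx's default "combined" log format (the actual repository may model more or different fields, and `parse_line` is my name, not one from the repo):

```python
import re
from datetime import datetime

# Regex for nginx's default "combined" log format (an assumption; adjust
# the pattern if your log_format directive differs).
LINE_RE = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)

def parse_line(line):
    """Turn one access-log line into a dict ready for a DB insert."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # skip malformed lines
    row = m.groupdict()
    row["status"] = int(row["status"])
    row["body_bytes_sent"] = int(row["body_bytes_sent"])
    row["time_local"] = datetime.strptime(
        row["time_local"], "%d/%b/%Y:%H:%M:%S %z"
    )
    return row

sample = ('93.180.71.3 - - [17/May/2015:08:05:32 +0000] '
          '"GET /downloads/product_1 HTTP/1.1" 304 0 "-" '
          '"Debian APT-HTTP/1.3 (0.8.16~exp12ubuntu10.21)"')
print(parse_line(sample)["status"])  # 304
```

The loader then just streams each gzipped file through this function and batches the resulting dicts into inserts.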
Comparison of data warehouse sizes after loading
As mentioned earlier, the size of the gzipped nginx access logs was 277 MB.
The size of the Clickhouse db after data loading was determined by checking the size of the folder named after the database under /var/lib/clickhouse/data. I used a db named test, so the command

```
du -sh /var/lib/clickhouse/data/test
```

reported 733 MB as the database size.
The size of the Postgres db was determined by checking the sizes of the folders inside /var/lib/postgresql/12/main/base. It contains one numerically named folder per database. I was able to find the one holding the nginx log data by checking the output of

```
SELECT pg_relation_filepath('weblog_entries');
```

in the psql client. The folder size turned out to be a whopping 7.5 GB.
So Postgres occupied 10x more space than Clickhouse for the same data, and almost 30x the size of the original gzipped input. Considering that this was just 2 months' worth of logs from a single server, loading a year's worth of logs from 2 servers would have resulted in 12 times this size, i.e. 90 GB. That is larger than the file system of the machine I was using, which by itself should have ruled out Postgres as a valid option.
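The projection above is simple arithmetic; spelled out (the 7.5 GB figure is the measured Postgres size, the month and server counts are the scenario from the text):

```python
# Back-of-envelope projection of Postgres storage needs.
measured_gb = 7.5      # Postgres size for 2 months of logs from 1 server
months_measured = 2
servers_measured = 1

target_months = 12     # a full year
target_servers = 2

scale = (target_months / months_measured) * (target_servers / servers_measured)
projected_gb = measured_gb * scale
print(f"Scale factor: {scale:.0f}x, projected size: {projected_gb:.0f} GB")
# Scale factor: 12x, projected size: 90 GB
```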
Comparison of Query runtime
I started with a simple comparison of the time taken to count all records.
Clickhouse: 0.005 seconds
```
surya-VirtualBox :) SELECT COUNT() FROM test.weblog_entries;

SELECT COUNT()
FROM test.weblog_entries

┌─COUNT()─┐
│ 6258734 │
└─────────┘

1 rows in set. Elapsed: 0.005 sec.
```
Postgres: 62.96 seconds
```
test=# SELECT COUNT(*) FROM weblog_entries;
  count
---------
 6566618
(1 row)

Time: 62960.427 ms (01:02.960)
```
Clickhouse was roughly 12,600 times faster.
Then I tried a query with a filter.
Clickhouse: 0.137 seconds
```
surya-VirtualBox :) SELECT count() FROM test.weblog_entries WHERE request_time > 1;

SELECT count()
FROM test.weblog_entries
WHERE request_time > 1

┌─count()─┐
│  166219 │
└─────────┘

1 rows in set. Elapsed: 0.137 sec. Processed 6.26 million rows, 56.33 MB (45.61 million rows/s., 410.48 MB/s.)
```
Postgres: 43.903 seconds
```
test=# SELECT COUNT(*) FROM weblog_entries WHERE request_time > 1;
 count
--------
 236806
(1 row)

Time: 43902.541 ms (00:43.903)
```
Thus we can see that Clickhouse is clearly the winner, by several orders of magnitude. But it still leaves me with a longing for Postgres, for two reasons:
- Clickhouse doesn't support unique constraints, so I have to write application logic to make sure that the same log entry is not loaded more than once. I would rather leave this to the database.
- SQLAlchemy's support for Clickhouse is still very rudimentary, available only through unofficial third-party libraries - nowhere near what is available for Postgres.
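The first point means the deduplication has to live in application code. A minimal sketch of one way to do it, assuming each raw log line can serve as its own identity via a hash fingerprint (the helper names are mine, not from the repository; a real loader would persist the seen set or check it against the database):

```python
import hashlib

def fingerprint(raw_line: str) -> str:
    """Stable fingerprint of a raw log line, used as a dedup key."""
    return hashlib.sha256(raw_line.encode("utf-8")).hexdigest()

def dedupe(lines):
    """Yield each distinct log line once, in order of first appearance."""
    seen = set()  # in a real loader this would be persisted across runs
    for line in lines:
        key = fingerprint(line)
        if key not in seen:
            seen.add(key)
            yield line

# A re-synced S3 dump can easily contain the same entries twice.
lines = ["GET /a 200", "GET /b 404", "GET /a 200"]
print(list(dedupe(lines)))  # ['GET /a 200', 'GET /b 404']
```

With a unique constraint in Postgres, all of this would collapse into an `ON CONFLICT DO NOTHING` clause - which is exactly what I would rather be writing.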
Best of both worlds
Postgresql supports a concept called Foreign Data Wrappers (FDW), a mechanism that lets the user interact with foreign data sources through the Postgresql interface. Percona has open sourced the FDW they created for Clickhouse here. This would make it possible for me to use a Postgres interface to communicate with Clickhouse. I have yet to check whether it solves the two problems mentioned above - allowing unique constraints via Postgresql and SQLAlchemy support via the Postgres drivers. If that turns out to be the case, I would have the best of both worlds. Will post about it once I get it done.