Monday, February 10, 2014

Fosdem report with some thoughts from P2D2

(TL;DR version) Issues in cloud-based deployments totally dominated the database topics at both the Fosdem and P2D2 conferences. Whether it is high availability or scalability of the data stack, tasks that can usually be handled with in-house database features, people have recently been choosing more and more to solve these challenges with external tools, which does seem the more appropriate approach in some cases. The built-in features simply involve too many manual steps, so they are not good enough in environments where full automation is necessary. Any such extra effort that distributions can eliminate counts.

A bit more verbose version follows.

This is my report from the Fosdem conference last weekend, and I've also included some thoughts from Prague PostgreSQL Developers Day 2014, which I attended on Thursday. Since databases and general packaging are my concerns, I focused on the database lecture rooms at Fosdem, alternating between rooms devoted to MySQL/MariaDB, PostgreSQL, NoSQL and Distributions.

Generally speaking about the direction databases are taking, there was really a lot of discussion about deploying databases in cloud-based environments.

The problems we face in that field can be summarized under several topics: replication for high availability, scaling to utilize more nodes effectively, and automation to limit the manpower needed in deployments where the number of nodes runs to four digits. Quite a few interesting projects, smaller and larger, that would solve some of these issues were introduced in the lectures, and I'd like to mention the most interesting of them.

Some interesting tools for recent challenges

proxysql, currently a single-developer project, looks like a generally usable lightweight proxy server that can be configured to offer load balancing, partitioning, or even more complicated tasks like query rewriting or selecting the proper server based on region priorities. The lack of a community around the project is its biggest drawback right now.
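The core routing idea behind such a proxy can be sketched in a few lines of Python. This is only a toy illustration, not proxysql's actual logic: the backend addresses are hypothetical, and real proxies parse queries properly rather than with a regex.

```python
import itertools
import re

# Hypothetical backend addresses, stand-ins for real MySQL servers.
PRIMARY = "db-primary:3306"
REPLICAS = ["db-replica1:3306", "db-replica2:3306"]

_replica_cycle = itertools.cycle(REPLICAS)

def route(query: str) -> str:
    """Pick a backend for a query: writes go to the primary,
    read-only statements are round-robined across the replicas."""
    if re.match(r"\s*(SELECT|SHOW)\b", query, re.IGNORECASE):
        return next(_replica_cycle)
    return PRIMARY

print(route("SELECT * FROM users"))         # one of the replicas
print(route("UPDATE users SET active = 1")) # the primary
```

A real proxy would of course also handle connection state, failover and the region-priority rules mentioned above; the point here is just that a thin layer in front of the servers can make routing decisions the application never has to know about.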

For PostgreSQL there is a similar project, Postgres-XC, which also allows access to multiple PostgreSQL instances through one single proxy for large clusters. PgBouncer then offers a solution by pooling connections for deployments where the number of client connections approaches the thousands and thus hits the server's limit.
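The pooling idea that PgBouncer implements can be sketched as follows. This is a minimal illustration, assuming sqlite3 as a stand-in for a real PostgreSQL connection; PgBouncer itself is a separate process speaking the PostgreSQL wire protocol, not an in-process class like this.

```python
import queue
import sqlite3

class ConnectionPool:
    """A toy connection pool: clients borrow one of a few pre-opened
    connections instead of each opening their own, so the server only
    ever sees `size` connections no matter how many clients there are."""

    def __init__(self, dsn: str, size: int = 5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # sqlite3 stands in for a real PostgreSQL connection here
            self._pool.put(sqlite3.connect(dsn, check_same_thread=False))

    def acquire(self):
        return self._pool.get()   # blocks until a connection is free

    def release(self, conn):
        self._pool.put(conn)      # hand the connection back for reuse

pool = ConnectionPool(":memory:", size=2)
conn = pool.acquire()
print(conn.execute("SELECT 1").fetchone())  # (1,)
pool.release(conn)
```

The interesting design choice PgBouncer adds on top of this is *when* a connection is considered free again: per session, per transaction, or per statement, which is what makes it effective against thousands of mostly-idle clients.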

A similarly thin layer between client and server is the community-developed Galera clustering project, an alternative to Oracle's MySQL cluster solution. Even though it is considered the standard for MariaDB now, support in MariaDB itself is not merged yet, and the necessary patch somewhat blocks wider adoption in distributions, since an extra patched MariaDB binary package has to be maintained.

The CONNECT engine for MariaDB looks like a really useful aid in heterogeneous environments, since it allows working with external data from within MariaDB as if it were a native table. A similar feature is now included in the most recent stable version of PostgreSQL, 9.2.

More about replication

An interesting comparison of the external HA solution Slony against the out-of-the-box streaming replication that newer PostgreSQL offers, in which the built-in replication was evaluated as better, showed that PostgreSQL upstream is still doing very well. The number of new features focused on HA, scalability and NoSQL-like usage of this DBMS also promises a good future for PostgreSQL. For example, according to one of the benchmarks out there, the new JSON document type in the most recent PostgreSQL performs even better than the pure-JSON MongoDB; and as a bonus, it doesn't have to give up ACID guarantees, as MongoDB does.
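The NoSQL-like usage pattern in question, storing JSON documents in an ordinary relational table and still querying inside them, can be sketched like this. Note the substitution: the real subject is PostgreSQL's json type, but here SQLite's JSON functions (available in recent builds) stand in for it so the snippet is self-contained.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (body TEXT)")  # JSON documents in a plain relational table
db.executemany("INSERT INTO docs VALUES (?)",
               [('{"name": "fosdem", "year": 2014}',),
                ('{"name": "p2d2", "year": 2014}',)])
db.commit()  # unlike MongoDB, the inserts are a normal ACID transaction

# Query inside the documents, schema-free, but still with SQL
rows = db.execute(
    "SELECT json_extract(body, '$.name') FROM docs "
    "WHERE json_extract(body, '$.year') = 2014").fetchall()
print(rows)  # [('fosdem',), ('p2d2',)]
```

In PostgreSQL the same query would use its native json operators instead of `json_extract`, which is exactly the "document store with ACID" combination the benchmark compared against MongoDB.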

To sum it up: relational databases offer the basic building blocks, but do not yet include good enough solutions for all distributed deployments in a simple way. This may change for PostgreSQL in the next releases, but since one size doesn't fit all, we will still need a little help from plug-ins, proxies, etc.

Where distributions, and especially support companies, can also help is in offering some supported scenarios, because, as a speaker from Avast described, it was a real PITA to choose the best-fit database system for storing their big data, especially when management couldn't give specific enough requirements.

What is up in the NoSQL world?

The situation is a bit different for NoSQL databases, which include some good enough automatic solutions out of the box, or even by definition. The drawbacks of NoSQL databases are rather the missing ACID transactions and the totally different approach to database design they require compared to relational databases.

However, the really full NoSQL lecture rooms should be a clear sign that the NoSQL world will be very important in the cloud-oriented market of the next years.

During the devops-related talks I saw many desires to configure systems using an automation tool like Puppet or Chef, which is what companies use in the end, and there is definitely some room to support these deployment tools better. So generally, no admin work can be manual any more, not in a cloud environment. And automation can't be done without versioning, logging, module testing, integration testing and continuous integration.

What is not happening any more

An interesting thing for me was the near absence of single-server performance talks. It seems database servers have reached a point where improving performance on a single server is not so important any more, especially when we have so many challenges in scaling parallel deployments and in raising the limits on the amount of data that can be stored efficiently.

To sum it up: the databases world may have been quite boring in the last years, but that is definitely changing now. Clouds are simply database-enabled.