Monday, February 10, 2014

Fosdem report with some thoughts from P2D2

(TL;DR version) Issues in cloud-based deployments totally dominated the database topics at both the Fosdem and P2D2 conferences. High availability and scalability of the data stack can usually be handled with in-house features, yet people increasingly choose external tools to address these challenges, which seems more appropriate in some particular cases. Internal features simply include way too many manual steps, so they are not good enough in environments where full automation is necessary. Any such extra effort that distributions can eliminate counts.

A bit more verbose version follows.

This is my report from the Fosdem conference held last weekend, and I have also included some thoughts from Prague PostgreSQL Developers Day 2014, which I attended on Thursday. As databases and general packaging are my concerns, I focused on the database lecture rooms at Fosdem, which means I alternated between the rooms focused on MySQL/MariaDB, PostgreSQL, NoSQL and Distributions.

Speaking generally about the direction databases are taking, there were really a lot of discussions about deploying databases in cloud-based environments.

The problems we face in that field can be summarized in several topics: replication for high availability, scaling to utilize more nodes effectively, and automation to limit the manpower needed in deployments where the number of nodes has more than 3 digits. Quite a lot of interesting smaller or larger projects that could solve some of these issues were introduced in the lectures, and I'd like to mention the most interesting of them.

Some interesting tools for recent challenges

proxysql, currently a single-developer project, seemed like a generally usable lightweight proxy server that can be configured to offer load balancing, partitioning or even some more complicated tasks like query rewriting or selecting the proper server based on region priorities. The lack of a community around the project is its biggest drawback right now.
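To make the load-balancing idea a bit more concrete, here is a minimal sketch of what such a proxy decides for each statement. It is not proxysql code or configuration; the class name, the backend names and the rule "SELECT and SHOW go to replicas" are all my own assumptions for illustration.

```python
import itertools
import re


class QueryRouter:
    """Toy read/write splitter; illustrative only, not proxysql code."""

    # Assumption: statements starting with SELECT or SHOW may run on a replica.
    _READ_ONLY = re.compile(r"^\s*(select|show)\b", re.IGNORECASE)

    def __init__(self, primary, replicas):
        self.primary = primary
        # Endless round-robin iterator over the replica names.
        self._replicas = itertools.cycle(replicas)

    def route(self, query):
        """Return the name of the backend that should run the query."""
        if self._READ_ONLY.match(query):
            return next(self._replicas)
        # Writes, DDL and anything unrecognized go to the primary.
        return self.primary


# Hypothetical backend names.
router = QueryRouter("db-primary", ["db-replica-1", "db-replica-2"])
for sql in ("SELECT * FROM t", "INSERT INTO t VALUES (1)", "select 1"):
    print(sql, "->", router.route(sql))
```

A real proxy has to be more careful than this, for example with reads inside an open transaction or with SELECT ... FOR UPDATE, which should stay on the primary.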

For PostgreSQL, there is a similar project, Postgres-XC, which also allows accessing multiple PostgreSQL instances through one single proxy in a large cluster. PgBouncer then offers a solution based on connection pooling for deployments where the number of client connections climbs into the thousands and thus reaches the limit.
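The pooling idea itself is simple enough to sketch in a few lines. The following is only an illustration of the principle (many clients sharing a small, fixed set of server connections), not how PgBouncer is implemented; the class and the stand-in connect function are hypothetical.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Toy fixed-size pool; illustrative only, not PgBouncer code."""

    def __init__(self, connect, size):
        # Open all server connections up front and keep them in a queue.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks while every connection is in use; raises queue.Empty
        # if a timeout is given and expires.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Hand the connection back for the next client.
            self._idle.put(conn)


# Stand-in for a real database connect call (an assumption for this sketch).
pool = ConnectionPool(connect=object, size=2)
with pool.connection() as conn:
    print("working with", conn)
```

However many clients ask for a connection, the server never sees more than `size` of them at once; the rest simply wait their turn.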

A similar thin layer between client and server is the community-developed Galera clustering project, created by Codership. Even though it is considered standard for MariaDB now, the support in MariaDB itself is not merged yet, and the necessary patch kind of blocks wider adoption in distributions, since an extra MariaDB binary package has to be maintained.

The Connect engine for MariaDB seems like a really fine help in heterogeneous environments, since it allows working with external data from MariaDB as if it were a native table. A similar feature is now included in PostgreSQL 9.3, the most recent stable version.

More about replication

An interesting comparison between the external HA solution Slony and the out-of-the-box streaming replication that recent PostgreSQL offers, in which streaming replication was evaluated as better, showed that PostgreSQL upstream is still doing very well. Also, the number of new features focused on HA, scalability and NoSQL-like usage of this DBMS promises a good future for PostgreSQL. For example, the new JSON document type in the most recent PostgreSQL even performs better than pure-JSON MongoDB, according to one of the benchmarks out there; and as a bonus it doesn't have to give up the ACID features, like MongoDB does.

To sum it up: relational databases offer some basic features, but do not include good enough solutions that cover all distributed deployments in a simple way. This can change for PostgreSQL in the next releases, but since one size doesn't fit all, we will still need a little help from plug-ins, proxies, etc.

Where distributions and especially support companies can help is in offering some supported scenarios, because, as a speaker from Avast described, it was a real PITA to choose the best-fit database system to store their big data, especially when the management can't give specific enough requirements.

What is up in the NoSQL world?

The situation is a bit different for NoSQL databases, which include some good enough automatic solutions out of the box or even by definition. The drawbacks of NoSQL databases lie rather in the missing ACID transactions and in the totally different approach to database design they require compared to relational databases.

However, the really full NoSQL lecture rooms should be a clear sign that the NoSQL world will be very important in the cloud-oriented market of the next years.

During the devops-related talks I saw a strong desire to configure systems using some automation tool like puppet or chef, which is what companies use in the end, and there is definitely some space to support these deployment tools better. So generally, no admin work can be manual any more, not in a cloud environment. Automation, in turn, can't be done without versioning, logging, module testing, integration testing and continuous integration.

What is not happening any more

An interesting thing for me was the lack of talks on single-server performance. It seems database servers have reached a point where improving the performance of a single server is not so important, especially when we have so many challenges in scaling parallel deployments or in raising the limits on the amount of data that can be stored efficiently.

To sum it up, the database world might have been quite boring in the last years, but that is definitely changing now. Clouds are simply database-enabled.

Sunday, February 09, 2014

How to glue Copr, Software Collections, MariaDB and Docker?

We heard about exciting technologies at devconf this weekend, like software collections, which are an alternative way to package multiple versions of packages, or docker, which offers a containerized environment for software delivery.

We also heard about Copr, which is a new build system that allows building software collections. What I didn't hear about there was MariaDB, a community-developed fork of MySQL and the default MySQL implementation in Fedora, which is developed by most of the original developers of MySQL, who escaped from Sun or Oracle after things started to change. That was kind of my mistake and I'll definitely try to come up with something database-related next year. But now, how do all these things and technologies come together?

Imagine you take care of a public cloud environment and you're supposed to provide various versions of the awesome MariaDB database and, for some undisclosed reasons, also the original MySQL from Oracle. This service would be used on demand for a short time, so it should be as fast as possible.

Problem #1 -- packages of various versions of MariaDB and also MySQL conflict with each other, so they cannot be installed on one system. Usually. That could complicate your situation, since you would be able to offer only one database version on one particular machine at a time. Solution -- we can use the databases packaged as software collections, so their files are properly put away under /opt/something.

Problem #2 -- my system, which is Fedora, does not include any of the scl packages I need. Solution -- we'll build them in Copr. It's enough to edit the buildroot (we need to add the meta package and scl-utils-build to the buildroot, so that all macros are available at the time the scl-ized package is parsed by RPM), add the prepared srpms and wait a while for the stuff to build.

Problem #3 -- full virtualization is too big an overhead if we want to provide just one little (well, little is questionable) daemon. Solution -- we'll use docker.

So, this is a workflow that could work in practice. First, we convert the MariaDB package into a software collection package. Second, we build this package in Copr and save the repository from Copr in the containerized system in Docker.

Firing up a new database instance then means just creating a data directory for the new instance and running the daemon, while redirecting the port to the hosting system. It will take only a couple of seconds and you'll have any version of the database you want. Awesome, right?
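As a rough sketch of that last step, the helper below creates the data directory and assembles a docker command line without running it. Treat everything specific in it as an assumption: the image name, the collection name, the data path inside the container and the choice of mysqld_safe as the daemon wrapper are placeholders of mine, and initializing a fresh data directory is left out.

```python
import shlex
from pathlib import Path


def docker_run_command(image, collection, data_dir, host_port,
                       container_port=3306, container_data_dir="/data"):
    """Create the instance data directory and return a docker argv list.

    All names passed in are hypothetical; the command is only built,
    never executed.
    """
    data_path = Path(data_dir).resolve()
    data_path.mkdir(parents=True, exist_ok=True)
    return [
        "docker", "run", "-d",
        # Redirect the database port to the hosting system.
        "-p", f"{host_port}:{container_port}",
        # Mount the per-instance data directory into the container.
        "-v", f"{data_path}:{container_data_dir}",
        image,
        # Run the daemon inside the software collection environment.
        "scl", "enable", collection,
        f"mysqld_safe --datadir={container_data_dir}",
    ]


if __name__ == "__main__":
    cmd = docker_run_command("example/mariadb-scl", "mariadb55",
                             "instances/db1", 13306)
    print(shlex.join(cmd))
```

Each new instance then only needs its own data directory and a free port on the host, which is what makes the whole thing a matter of seconds.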