Thursday, February 16, 2017

Notes from Prague PostgreSQL Developers Day 2017

Thursday in the middle of February means that there was another year of p2d2 conference. This year, it's already 10th p2d2 meet-up of PostgreSQL users. Although it might seem interesting, it's nothing special, 16th will be better opportunity to celebrate.

Beating 1 billion rows per second

Billion rows in any database is nice portion of data, even more interesting it is in relation database, but as Hans-Jürgen Schönig from Cybertec explained, it is possible do it in 1s. It all began when Oracle published an article saying they can do it, Hans decided he can beat it with PostgreSQL as well.

From the real beginning, it was clear that it must be done by parallelism. Traditional one-core execution could help in parallelism when running more queries at a time, while pl/proxy could handle parallelism using sharding on procedures level. Parallelism from scratch is simplu not something feasible these days.

Hans began with a simple scenario with 3 columns, but this can be scaled to real data using more memory and more processors (more machines), theoretically to any level, so we can take this simplifying as acceptable. So, what can one server offer?

With 9.6 parallelism, which is not complete yet, optimizer determines number of cores that will be processing a query, small tables are processed by one core only. Option parallel_workers can be set to overcome guessing and make number of workers higher.

Benchmark must be done in memory only, to not do disc benchmark instead. It turned out, that one server can scale linearly up to 16 cores, even when running on 32 core box, because that box is actually 32 thread box, hyper-threading eventually degrades it to 16 core, though.

So, at this point we have 40M rows. Next thing is partition tables. And we can do this to needed extend. In the end, 32 billion rows were processed in less than 3s, which means the goal was reached, only few patching was used on top 9.6, many of them will be already in 10.

PostgreSQL for OLAP alias how the parallelism works in PostgreSQL

Tomas continued with an overview around processes and connections. Traditionally no threads and no forks are done in PostgreSQL when processing a query. That is ideal for OLTP applications, where apps do many small chunk of work jobs. Sharing processes via connection pooling is one thing how to utilize one process more, but it is not interesting for making query processing more effective.

OLAP applications use usually less connections, so some real multi-processing is more needed there. Gather process can run more processes in parallel, gather data from them and every process can do something on their own. Quite a lot of restrictions so far, like shared locks don't allow to run data modifying queries. There is also only one level of parallelism, but for selects it might cover pretty nice chunk of use cases.

In version 10 there will be some improvements in bitmap scan, hash joins or aggregations. Users will need to set flag whether the PL function is parallel safe though. Using TPC-H, standardized benchmark, many tests were faster at least twice, comparing to single process queries.

Effectiveness of parallelism (difference from theoretical limit) depends on core number, we saw it goes from 100% (1 core) to 80% for 32 cores.

Currently, refreshing materialized views is not parallelized, but it would make great sense. Same for parallel sort, although gather sort might be done in 10.

David Turoň from LinuxBox began with a bit of joke, explaining how miners in Ostrava region start to program instead of mining.. Anyway, he talked about partman_to_cstore, which is a plugin living in gtihub currently, so user needs to build it first. It might be good for: table with rotations (timestamp/timestamptz), senzor data, log of events; generally big tables that does not change. It's nice that in most cases partman_to_cstore might be used without changes in application. Cron may run function and store data to cstore.

Roman Fišer from nangu.tv continued with sharing their problems when delivering a lot of multimedia content to many users. From the tools they use, let's name Consul, Patroni, or HAProxy. As for the approaches they chose, they need to be tolerant for latences between nodes, etc. For data collection they use collectd, store logs to RRD files, Collection3/Cacti as a frontend, InfluxDB/Grafana for back-end, and they think about Prometheus. New PgSQL version integration takes them approx. 6 months.

Containers everywhere

My talk then focused on containers in distributions, of course giving all of the examples from PostgreSQL world. Describing mostly basic stuff I wanted to bring containers a bit closer to users that typically don't like them much. Mentioning OpenShift, Flatpack, or systemd-nspawn, except traditional Docker, people could start trying it on their own.

After presentation there were quite a few questions:
  • Network - could IP tables (firewall) inside the container?
  • Data in volumes shared in COW mode, so they can share the common part on disc - could volume-mounted data be on zfs?
  • Several questions about user namespaces, what they are good for and whether it works as people expect.
  • Sharing libraries in memory was an interesting question we couldn't answer for 100%, but from further look-up, any memory sharing would be against containers' concept of isolating processes, so nobody will probably implement it.

Machine learning using PostgreSQL

Lightning talk by Hans-Jurgen about Machine Learning focused on detecting fraudulent logins on a machine. Support Vector Machine is used to separate good users from bad, it's basically about creating model that is used for decision. PL/Python function to load data and use history data for learning itself. In the end the decision is also done in SQL and that all of course in PostgreSQL..

Pavel Stehule talked about \g, \gset and \gexec commands in the second and also last lightning talk. Those commands were added recently in psql, all help in faster work in psql console.

What does it mean to hack PostgreSQL?

Pavel Stehule closed the conference with a talk about experiences with implementing XMLTABLE for parsing XML, which was missing in PostgreSQL. It was possible to use the XML functions before, but with XMLTABLE it is much more obvious, which is the reason why SQL looks like it looks. It was interesting excursion to the world of PostgreSQL hacker.

When somebody reads ANIS SQL, it does not need to be really nice reading, details are missing and existing implementations are slightly different from standard, so what to do? In PostgreSQL, which implements the XMLTABLE later than DB2 and Oracle, author needs to take the functionality that is already existing. Generally, when implementing some new functionality, we might either consider patching the code, or writing an extension. If the feature can be done using extension, it's much easier, it can live anywhere and people are happy and prosper. In case patching of PostgreSQL is needed, it takes time and developer will spend a lot of energy. Especially tricky it is when using some library like libxml2, which, same as PostgreSQL, uses own memory management, uses different types of data and a conversion needs to be done.

In case of XMLTABLE implementation, some code change needs to happen in parser and executor. Here it comes handy to know well how the execution pipeline works in PostgreSQL -- parser, rewriter, planner, optimizer, executor. Also, SQL needs to be translated to some other language, so changing behaviour is basically changing of translator. On the other hand, this SQL translation won't be the bottle neck. Patching of parser (bison) needs to be clean, no conflict warnings will be acceptable during the review. Implementation is either an SRF function or executor patch.

When writing patch, first iteration in community is to write something dirty, to show that it is possible and conversation about benefits can begin. Next phase is complete patch, with documentation and regressions tests. The last part is commitfest patch, that is used for cleaning the code, documentation, extending tests, ...

The whole patch writing is very hard, one will learn assertiveness and respect in the first place. Some features take even 10 years from first patch to commit released. hstore for example helped Ruby on Rails and that helped PostgreSQL to grow. Without use case it is not very probable that change will be accepted.

So that was the 10th p2d2, I'm already looking for the next year!

Tuesday, February 07, 2017

FOSDEM 2017 - personal notes from Sunday

First day was exhausting, so as it happens, during second day of the Fosdem, I tend to write less notes, so this day will be a bit shorter.

Arduino in Go and Go in Arduino

We started the day 2 in Golang room where Ron showed how to code Arduino in golang. They provide a framework that can be used for not only simple, but also a bit more complicated apps. He showed few practical demos from blinking led, through button-controlled lights, to Linux (busybox) based dron simulating solar system repair. I must admit it was really cool to see a linux controlled dron flying above our heads, doing a flip and that all communicating with a base via wi-fi.
More info: https://fosdem.org/2017/schedule/event/go_gobot/

Hacking cars to make them drive alone

A similar talk, this time not in golang room any more, was Open Source Car control introduction by Josh Hartung -- they buy a cheap Kia car and put an Arduino, inside. They rather spoof wheel assistant rather than creating new complicated controlling.. so they basically do a slight man-in-the middle attack. The reason is that when anybody needs to do something with a car, he needs to hack it, and the only standard is the OBDII, which does not offer what is needed.
Safety can be done by different ways, the safest so far seems to have strict policy, that touching a driving wheel or break hard resets the system, which puts a car to the stock settings.

For assuring good enough product quality, they use coding standards, SW diagnostics, code review process and unit integration testing.
Autonomous car racing looks like seniors driving (approx speed just few km/h), but it looked interesting, given those are not cars produced with milions given into research. And waht is the best, anybody can join and hack own car, the cheaper the better.
More info: https://fosdem.org/2017/schedule/event/open_source_car_control/

Rest of the day in Python devroom

Brian Bouterse explained how CPython (the runtime) can be debugged using gdb to find some issues in the runtime itself. However, GDB can also work with the python code and corefile in a bit smarter way, so users only see python stack functions for example. And why should we use debugger for python? For example if you have some problem that is hard to debug or you cannot share reproducer to the upstream. We can make the Python better by fixing such corner cases. It's a powerful technique, that every Python developer should know of.
More info: https://fosdem.org/2017/schedule/event/python_gdb/

Threads are goto of our generation, mentioned Ewoud Van Craeynest in his talk Asynchronous programming with Coroutines in Python. It does not mean we should not use them, we should just use it less. Or use some framework that will force us to do it more correct, like asyncio.
asyncio allows to call some functions asynchronously by using decorators, can create event loop, can call functions with delay or at some point in time.
No callbacks, easy to use. Usable for SSH, file handling, IRC, HTTPD, RPC, testing, and other use cases.
In the end Ewoud mentioned, that some reimplementations of parts of asyncio also exist, because not everybody liked the way upstream implements the async framework.
More info: https://fosdem.org/2017/schedule/event/python_coroutines/

Ben Nuttall was talking about IoT in Python. GPIO Zero helps implementing Raspberry apps using pre-prepared objects where user only specifies pin number and calls methods to work with the sensor.
Quite complicated and large device hierarchy demonstrated wide range of objects available for users.
Remote GPIO (pigpio) allows to run the same code and in the end control remote device.
Picamera is a able to capture 1080p30, 720p60 VGA90 video, so you can create your mini camera using few lines of code.
Energeie allows to remote control real 230V plugs, an example of remote controlled tortoise lamp showed one of the possible usages. I guess there are more useful use cases though.
Sense HAT is then a complex module that returns many values like temperature, humadity, etc.
More info: https://fosdem.org/2017/schedule/event/python_raspberry_pi/

The last talk on this year's Fosdem was prompt_toolkit by Jonathan Slenders. This tool brings nice and very advanced prompt for cmd-line python apps.
Features included today: autocompletion (using table), syntax highlighting, colors, multi-line edits and many more.
Tools that use prompt_toolkit library already proof its readiness: wharfee (Docker client), pgcli, mycli, ipython, screen splitting, etc.
Since the library includes a lot of around text-editing, pyvim was then quite easy to write, at least to show up that it is not that painful to implement a real editor in pure Python. Jonathan plans to continue in re-factoring the library, so the API can change a bit in the future.
More info: https://fosdem.org/2017/schedule/event/python_prompt_toolkit/

I'll be back!

So that was it, it was exhausting but still great this year, same as the Belgium beer. And we tried Uber the first time, since it was cheeper than traveling by local transport (in 4 persons). It worked perfectly!

FOSDEM 2017 - personal notes from Saturday

As always, do not expect the following text to be a nice reading, it's more a sack of notes for further reference. And yeah, there was also the CentOS Dojo one day before and second date of Fosdem the day after.

Toooo big FOSDEM

First weekend in February means that several thousands of free and open source fans get to Belgium capital, Brussels. First visitor's idea about Fosdem is that it is year to year more and more crowdy. Talks in smaller rooms are almost always full and one needs to sneak inside in advance to get to some interesting talk. On the other hand, number of interesting talks is not decreasing, so it is still worth traveling there, especially when you meet friends from last year or other conferences. For the new comers, at least there was 15 degrees more than in Brno that time. All about the event is available at https://fosdem.org/2017.

Optimizing MySQL, MariaDB already in compiler

Optimizing SQL without touching my.cnf talk by Maxim Bublis, a Dropbox engineer opened MySQL devroom. He spoke about billions of files stored daily, PB of metadata, and EB of raw data.
They use sysbench to test performance and Bazel.build for building own flavor of packages, because profile-guided optimization (PGO) requires rebuild.
It's not a good idea to use benchmarking for unit tests, since those test corner cases.

Maxim mentioned concrete options they use in GCC for PGO: --fprofile-generate --fprofile-correction
Clang with PGO is a bit more successful, it was interesting that GCC 4.9 was worse than 4.6, 5.4 even much more worse, but on the other hand very good with PGO builds.
Link-time optimization, done by linker instead of compiler, is not supported in MySQL so far. However, they start coding on it.
Totally, they achieved 20% improvement, and have many further ideas to future.
More info: https://fosdem.org/2017/schedule/event/opti_mysql/

Sysbench talk by Alexey Kopytov started from history, that it originally was a simple tool, then became more complicated to be usable in general use cases, so Lua was chosen as scripting language. But then a real surprise came -- Alexey was happy to announce the first 1.0 release after 10 years of development.
Option --mysql-dry-run can measure tps in MySQL without need to know server structure. Thanks to LuaJIT and xoroshiro128+ the tool is several times faster. Performance also benefits from having no mutexes and no shared counters -- ConcurencyKit is used to avoid those, so stats are counted per thread.

Sysbench can be used in shebangs as well. Supporting all cmd-line options as global vars was troublesome, version 1.0 can validate them better.
Arbitrary C functions can be called from lua scripts, so no binary execution is needed. It's possible to use more connections in one thread. In the new version we can also see what previously was hidden to user - latency histograms. Users also may find useful to ignore some particular errors, especially in distributed environment. After long considerations Windows support was dropped. Live long and prosper, sysbench!
More info: https://fosdem.org/2017/schedule/event/sysbench/

When one instance is not enough

Spark for multi-procesing explained by Sveta Smirnova from Oracle gave an overview about how Spark differs from MySQL, why parallelism makes full table scan even better than indexing, which is missing in Spark.
More info: https://fosdem.org/2017/schedule/event/mysql_spark/

Group replication is a plugin, which was added to 5.7 version of MySQL pretty recently. It allows to automate failover in single primary mode, provides fault tolerance, multi-master updates on any node, enables group reconfiguration, etc. So, it's a pretty big deal added in the minor version, that for sure deserves some closer look.

The principle is simple, as Alfranio explained -- the plugin checks conflict, on conflict the transaction rolls back, else it gets committed.
Then he explained the concurrency problem solution, how more entities can agree on some fact, using paxos, which is used in group replication.
More info: https://fosdem.org/2017/schedule/event/mysql_gr_journey/

Which proxy for MariaDB?

Colin Charles went through existing proxy solutions we have today for MySQL and clones.
MySQL Proxy is middle part between client and server, with Lua interpretter to rewrite queries, add statements, filter results, etc. It is not much active now though.

MaxScale is similar, fully plug-able architecture operating on level 7, allowing logging, or working with other back-ends. Schema-based sharding, binlog server, or query re-writing also supported in MaxScale. Very usable for Galera cluster and Kafka backend also exists. MaxScale has first forks by booking and airbnb (connection pooling), because they use Ruby and it is then necessary for them to pool connections. Colin also mentioned the license change, when MaxScale 2.0 was released under BSL.

MySQL Router - fully GPLv2, allows transparent routing, plugin interface via MySQL Harness, can do Failover, load ballancing (at this point Colin suggested to see https://www.youtube.com/watch?v=JWy7ZLXxtZ4). MySQL Router does not work with MariaDB server though.

ProxySQL - stable, brings HA to DB topology, connection pooling and multiplexing supported, R/W split and sharding, seamless failover, Query cashing, rewriting, ...
SSL encryption not that good as in MaxScale, also no binlog router. Maxwell's Daemon allows to connect ProxySQL with Kafka.
http://vitess.io/ is also intersting, it allows to scale MySQL as well.
http://proxysql.com/compare shows comparison with other tools.
More info: https://fosdem.org/2017/schedule/event/mysql_proxy_war/

RocksDB rocks even in MariaDB

Last two talks were dedicated to the new engine for MySQL/MariaDB, built on top of RocksDB, called MyRocks. It provides better write efficiency, good enough read, best space efficiency, effective with SSD. In the Facebook testing for example, it uses 50% of spaced used by compressed innodb, or 25% of space used by uncompressed InnoDB, and only 10% of disk operations.
Performance problems are still not fully fixed, but goal for performance is to be as good as InnoDB. As for stability, what can be better prove that it is already stable, than the fact, that it is used already in Facebook for storing user data in production.
Goal is to integrate to MariaDB and Percona upstreams and expand features in the new engine.
It all is more visible when using small rows, because the per-entry overhead is very small comparing to innodb.
Bloom filter helps when non-existent IDs are read (index tree does not need to be read).

Currently it is hard to debug, has issues when using long range scans or when using large transactions or group commit.
For performance check, Linkbench is used, trx time went from 6ms to 1ms when writing, it went worse for reading a bit, but saving time during writes we have more for reads.
More info: https://fosdem.org/2017/schedule/event/myrocksdb/, https://fosdem.org/2017/schedule/event/myrocks_at_fb/

Really big data in many dimensions

Peter Bauman, a database expert from Jacob's University talked about datacubes on steroids with ISO Array SQL. Their long term research helped adding multidimensional arrays to SQL standard. Big Data can be structuralized in many ways, sets, trees, graphs or arrays. The idea of arrays in SQL is similar to other languages -- a column can simply be composed of more values. That should get to the standard later this year and it not only allows to access particular values, but also do other complicated operations, like matrix multiplication, histograms, and other operations that can be composed from those. Yeah, a real algebra in relational databases.
The research can be seen in practice on planetserver.eu, which includes 20TB of data. Generally we must count that the data will be bigger than memory if we speak about BigData..
Distribution of data gets crazy, since they not only fetch data from datacenters far away, but also want to put a database to the sattelite, and querying it from Earth.
They don't use hadoop, because it doesn't know about arrays, that might be TB long (which is the case when working with bit bitmaps for example).
Comparing their Rasdaman, Sparc and Hive, when getting more than one pixel, it didn't look pretty well for Hive and especally not for Sparc.
As for number of dimensions they look at several types of data -- it begain with 1D from sensors, and they went to 4D when modelling athmosphere.
More info: https://fosdem.org/2017/schedule/event/datacubes/

Live Patching of Xen and CI in Ubuntu

Ross Lagerwall from Citrix XenServer talked about Xen and live patching, which is needed because live migration is too slow and makes host unusable during that time. Xen looks at whether a payload needs to be applied when it is possible, so no patched code is on the stack. What they do then is putting jump istruction to the beginning of the function while stopping IRQs and WP.
The payloads are created by kpatch tool from fully build Xen and running some kind of diff tool, while picking just the changed functions. When applying, some non-static data can be handled, while touching initialized data is prevented. Definitely worth watching.
More info: https://fosdem.org/2017/schedule/event/iaas_livepatxen/

Martin Pitt talked about Ubuntu testing infrastructure. Developers must be responsible for writing tests, if the testing should be done in meaningful way.
QE should be responsible for keeping infrastructure working and for consultancy.
In the end, when the testing infrastructure really helps is the upstream testing, where test results are added for every pull request (systemd shown as example).
More info: https://fosdem.org/2017/schedule/event/distribution_ci/

So that was the first day of the Fosdem 2017.

Few notes from pre-Fosdem CentOS Dojo 2017

As it is already a tradition, Friday before Fosdem there is a meet-up of CentOS folks in Brussels. This time it was funny because we were in the same venue as PgDay was held in. Anyway, it was again great to talk to people and see progress in CentOS ecosystem (I really cannot call it system any more).

How it was with CentOS and Red Hat

The day with two parallel trackes was started by Karanbir, the Mr. CentOS. He talked about recent history, when first talk with people from Red Hat happened, where they talked about joining forces for the first time. That gave a great insight at how the relation looks now, which I think was really important, since especially active contributors need to understand this unique relation.

And what the relation is? Obviously, CentOS wouldn't be here without RHEL, but on the other hand, RHEL benefits from CentOS, which is not a downstream any more. Many projects like Gluster, OpenStack or Software Collections provide place for wider community to develop tools that are simply not good fit for Fedora. We can call it upstream or just community projects, important is that it works.

Then he gave several numbers, when all guesses from audience were mostly wront (lower) -- since numbers about CentOS base image downloads (tens millions) or current size of the mirror (254GB from what 150GB is not upstream RHEL) is simply astonishing.

On the negative part of the present state, CentOS folks (the official guys paid by RH for working FT on CentOS) lack resources, people resources, but with big community it still works amazingly, thanks to collaboration.

Now, what next? It's good to understand, that CentOS still can make easier everything you need to do with CentOS -- may it be modularity related, OpenStack, or NodeJS related. What is for example happening is that NodeJS master branch is tested in CI CentOS to see whether it works correctly on CentOS.

Long term they want to have CI that is available for everybody in opensource, and generally focus beyond CentOS ecosystem.

What EL6 user should pay attention in EL7, CI and really big data in Cern

Next talk by Bert Van Vreckem was more targeted to users of el6, who recently moved or decided to move soon to el7. He presented several commands any admin should know about. Exploring his github repo I found also nice collection of articles about writing readable and kinda safe Bash scripts, so I recommend checking https://t.co/F1JkYaOMQV.

Brian Stinson, the main CentOS CI guy, shared some basic info about current CI abilities, while it's more than awesome that many open-source projects, even outside of CentOS ecosystem can levarage this cool thing . It's especially cool because we do not speak about VMs, but about bare metal machines, that one can connect to and get a unique testing infrastructure for free. For the future we might find more architectures -- currently there is quite big demand for ppc64le and 64bit arm, which seems to be the future of the cloud soon.

The last talk of that day I attended was by David Abdurachmanov from Cern, who shared crazy numbers from their practice. They need to process huge amount of data that are often created in less than a second. 350PB data on disk, 800PB carried over net, 650k intel cores or 300Gbps network, that is the reality that helps humans in exploring the smallest particles in the Universe.

I hope at least slides will be soon available at https://wiki.centos.org/Events/Dojo/Brussels2017, but if not, I guess speakers will be more than happy to share them if you contact them. See you next year!