Now available: Slides from Percona Live and Linuxcon Europe

Posted in Florian's blog by Florian Haas at November 01, 2011 08:53 PM

The slides from last week’s talks I (co-)presented at Percona Live and Linuxcon Europe are now available from our web site.

All slides are available entirely free of charge for logged-in users on our web site. To log in, you don’t even need to register — just use your Google Profile, or Google Apps account, or your WordPress account, or anything else that uses OpenID, and you’ll be good to go.

Comments on our slides are, of course, always highly appreciated.


Ready to roll for Percona Live UK

Posted in Florian's blog by Florian Haas at October 23, 2011 06:17 PM

Percona Live MySQL Conference, London, Oct 24th and 25th, 2011

All slides are done, all virtual images are completed and we’re ready to roll for tomorrow’s MySQL High Availability Sprint: Launch the Pacemaker tutorial at Percona Live UK 2011.

This is probably your very last chance to register for PLUK as there are only a handful of tickets left. You can still use my discount code, HaasPLUK11. See you tomorrow!


Twitter

Posted in Florian's blog by Florian Haas at October 19, 2011 10:22 AM

Henceforth, you can find and follow us on Twitter. See you there!


Busy weeks ahead!

Posted in Florian's blog by Florian Haas at October 17, 2011 03:17 PM

I’m speaking at Percona Live, LinuxCon Europe, and linux.conf.au. And I just co-founded a new company.

I have a few busy weeks behind me, and even busier weeks ahead. If you’ve been wondering why recently I haven’t been updating this space too frequently, here’s why:

Yours truly and fellow ex-Linbiters Martin Loschwitz and Andreas Kurz have recently founded hastexo, an independent professional services organization focused on open-source high availability and disaster recovery. We are already offering both on-site and remote consultancy, custom training, and our Availability Checkup package, with more services lined up to be added to our offering.

We’re able to offer direct, 24/7 access to high availability experts with dial-in numbers in Europe, North America and Australia. We’re offering our services under an extremely flexible, versatile payment scheme with an attractive volume discount model. We’re experts in an array of high availability and disaster recovery technologies, such as Pacemaker, Corosync, Heartbeat, DRBD, highly available virtualization (a.k.a. “enterprise cloud”), and cluster file systems.

And we’ve got a unique, free offering. Have you ever considered hiring a high availability consultant to review your setup or provide expert advice, but were unsure as to the expected cost involved? At hastexo, we can help. You simply go to our Help page (free-of-charge registration required), collect information as instructed, and then just create a ticket in our support system. And we’ll make a qualified estimate as to the amount of effort (and cost) required to fix your issue, or improve your uptime, or both.

And, just in case one of us has previously helped you on a mailing list, on IRC, or at a conference, as we frequently do, then please leave us a message in our Shoutbox. We love to support the high availability community, and we’re thrilled to hear about it when we can help.

Speaking of conferences: next week, I’m doing back-to-back conferences in Europe.

And, for those of you making plans for Ballarat in January: I’ll return to linux.conf.au as a tutorial speaker, together with Andrew Beekhof and Tim Serong. I have also submitted a talk for the High Availability and Distributed Storage miniconf, preceding the main conference. See you there!


Speaking at Percona Live — and you can get there for cheap!

Posted in Florian's blog by Florian Haas at September 08, 2011 06:16 AM

Following my departure from Linbit, I’m honored to be serving a number of speaking requests at conferences over the next few months.

The first I am pleased to announce is my commitment to speak at Percona Live in London this October. The conference venue is the America Square Conference Centre not too far from the iconic Tower of London. My 3-hour tutorial MySQL High Availability Sprint: Launch The Pacemaker! is scheduled for Monday, October 24th at 1pm.

In this tutorial, I’ll show you the simplest, quickest and easiest way to set up MySQL high availability in Pacemaker clusters — once you understand the concept, you’ll be able to pull this sort of thing off in under an hour.

What’s cool is that Percona provide tutorial speakers with discount codes for registration, which we can freely share. Thus, if you register for Percona Live using the discount code HaasPLUK11, you get £40 off the Conference+Tutorials ticket — and if you do so before September 19, you save an additional £135 with Early Bird Registration. This discount is valid regardless of whether you actually come to my tutorial or choose a concurrently scheduled one — so even if my tut is not for you, I can still help you get into the conference cheaper!

I’m thrilled to be doing this and can’t wait to see a bunch of familiar faces in London. And I’d be thrilled to see you!


Monitoring DRBD resources with Zabbix on CentOS

Posted in Arrfab's Blog » Cluster by fabian.arrotin at September 07, 2011 12:10 PM

We use DRBD at work on several CentOS 5.x nodes to replicate data between our two computer rooms (in different buildings, but linked with Gigabit fiber). You can already find out when something goes wrong at the DRBD level if you have configured the correct 'handlers' and the appropriate notification scripts (have a look, for example, at the Split Brain notification script). Those scripts are 'cool', but what if you could 'plumb' the DRBD status into your actual monitoring solution? We use Zabbix at $work, and I was asked to centralize events from different sources. Zabbix doesn't directly support monitoring DRBD devices, but one of the cool things about Zabbix is that it's like a Lego system: you can extend what it does if you know what to query and how to do it. If you want to monitor DRBD devices, the best that Zabbix can do on the agent side (when the Zabbix agent runs as a simple zabbix user with /sbin/nologin as its shell) is to query and parse /proc/drbd. So here we go: we need to modify the Zabbix agent configuration to use Flexible User Parameters, like this (in /etc/zabbix/zabbix_agentd.conf):

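# Each key returns 1 if DRBD minor $1 is Connected (cstate) or locally UpToDate (dstate), else 0.
# The anchored grep keeps minor 0 from also matching the status lines of minors 10, 20, ...
UserParameter=drbd.cstate[*],grep "^[[:space:]]*$1:" /proc/drbd | tr '[:blank:]' '\n' | grep '^cs:' | cut -d ':' -f 2 | grep Connected | wc -l
UserParameter=drbd.dstate[*],grep "^[[:space:]]*$1:" /proc/drbd | tr '[:blank:]' '\n' | grep '^ds:' | cut -d ':' -f 2 | cut -d '/' -f 1 | grep UpToDate | wc -l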

We just need to inform the Zabbix server of the actual Connection State (cs) and Disk State (ds). For that we need to create an Application, Items and Triggers ... but what if we could create a Zabbix template and simply link it to a DRBD host? I attach to this post the DRBD Zabbix template (an XML file that you can import into your Zabbix setup), which you can link to your DRBD hosts. Here is the link. That XML file contains two Items (cstate and dstate) and the associated triggers. Of course you can extend it, especially if you use multiple resources or DRBD disks. Because we used Flexible User Parameters, you can, for example, create a new Zabbix item (based on the template) and monitor the /dev/drbd1 device just by using the drbd.dstate[1] key in that item.
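For readers who want to check the parsing logic away from a live node, here is a minimal Python sketch of the same cs/ds extraction. It is not part of the Zabbix setup described in this post: the function names are made up for illustration, and the sample text is only an assumed approximation of what /proc/drbd looks like on a DRBD 8.3 node, so compare it against your own output before relying on it.

```python
import re

# Illustrative sample only (an assumption); the exact layout varies between DRBD versions.
SAMPLE_PROC_DRBD = """\
version: 8.3.11 (api:88/proto:86-96)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
"""


def drbd_fields(text, minor):
    """Return the key:value fields found on the status line of one DRBD minor."""
    # Anchor on the minor number at the start of the line, so that
    # minor 0 does not accidentally match the line for minor 10.
    prefix = re.compile(r"^\s*%d:\s" % minor)
    for line in text.splitlines():
        if prefix.match(line):
            tokens = line.split()[1:]  # drop the leading "N:" token
            return dict(t.split(":", 1) for t in tokens if ":" in t)
    return {}  # unknown minor


def cstate_ok(text, minor):
    """1 if the connection state is Connected, else 0 (same idea as drbd.cstate)."""
    return 1 if drbd_fields(text, minor).get("cs") == "Connected" else 0


def dstate_ok(text, minor):
    """1 if the local disk state is UpToDate, else 0 (same idea as drbd.dstate)."""
    local = drbd_fields(text, minor).get("ds", "").split("/")[0]
    return 1 if local == "UpToDate" else 0
```

On a real node you would pass in the contents of /proc/drbd instead of the sample string. Returning 0 for an unknown minor mirrors the shell pipeline, which also counts zero matching lines in that case.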

Happy Monitoring and DRBD'ing ...

On to new endeavors!

Posted in Florian's blog by Florian Haas at September 05, 2011 02:33 PM

As many of you know already, I left Linbit on September 1. Stay tuned for updates; I will post those here.

Overall my experience at Linbit has been excellent, and the parting has been very amicable. I am enormously grateful for the 4½ years I spent at Linbit, and I wish them well.

At the moment it appears that the autoreply I enabled on my email account when I left is non-functional. Unfortunately, as the email address as such is still active, that means anyone writing to my old email address is not being notified that if anyone does read it, it won’t be me.

My current, correct, email address is on my LinkedIn and Xing profile pages. Please make sure you use the new one when communicating with me. Or contact me on LinkedIn or Xing directly; that’ll also work.

I presume that the same is true for the email addresses previously used by fellow ex-Linbiters Martin Loschwitz and Andreas Kurz. You can of course find them on their respective Xing pages, too.


When can I have a big server in the cloud?

Posted in Xaprb » High Availability by Xaprb at June 10, 2011 03:12 PM

I was at a conference recently talking with a Major Cloud Hosting Provider and mentioned that for database servers, I really want large instances, quite a bit larger than the largest I can get now. The lack of cloud servers with lots of memory, many fast cores, and fast I/O and network performance leads to premature sharding, which is costly. A large number of applications can currently run on a single real server, but would require sharding to run in any of the popular cloud providers’ environments. And many of those applications aren’t growing rapidly, so by the time they outgrow today’s hardware we can pretty much count on simply upgrading and staying on a single machine.

The person I was talking to actually seemed to become angry at me, and basically called me an idiot. This person’s opinion is that no one should be running on anything larger than 4GB of memory, and anyone who doesn’t build their system to be sharded and massively horizontally scaled is clueless.

I’ve received similar push-back from a lot of cloud hosting providers. When I work through the math with clients, a lot of them don’t like the ultimate price/performance ratio offered by cloud hosting. Hype doesn’t drive everyone’s business decisions, so a lot of people are wisely staying far away from cloud hosting for their applications, or even moving whole applications out of cloud hosting into real hardware to consolidate machines and save a lot of money. Some of them are using flash storage devices such as Fusion-io to further lower their TCO (this isn’t the right answer for every app, though).

Why do cloud hosting providers work so hard to make everyone buy lots of anemic machines and shard their applications an order of magnitude more than is required? Why aren’t they jumping to offer really beefy instances? I think there are a couple of simple reasons.

First, they want to colocate virtual machines and over-provision, just as airlines sell more tickets than there are seats in the plane. It’s a numbers game: sell more capacity than you really have, and bet on some of the instances not using all resources allocated to them. Win! Of course, this is only possible with lots of small instances; the law of large numbers doesn’t work without lots of instances, and large instances can’t be colocated. Cloud providers tend to dislike dedicated instances, which leads to the second reason. They don’t want to make strong claims about the availability of any particular machine. This is where the cloud paradigm of “you must build to recover from machines vanishing without warning” comes from. A dedicated beefy instance wouldn’t let the hosting provider push that responsibility onto the application.
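To put rough numbers on the over-provisioning argument, here is a small back-of-the-envelope sketch in Python. Everything in it is an assumption made for illustration: the instances are treated as statistically independent with identical usage, and the mean and standard deviation are invented, not measured. It only shows the shape of the effect: the headroom a provider must reserve per instance shrinks with the square root of the number of instances pooled on the same hardware.

```python
import math


def capacity_per_instance(n, mean, std, k=3.0):
    """Capacity to reserve per instance so that pooled demand is covered
    up to k standard deviations above its mean.

    Assumes n independent instances with identical mean and std of usage,
    both expressed as a fraction of the capacity sold per instance.
    """
    pooled_mean = n * mean
    pooled_std = std * math.sqrt(n)  # std of a sum of n independent variables
    return (pooled_mean + k * pooled_std) / n


# Hypothetical usage profile: 30% average utilisation, 20% standard deviation.
for n in (1, 100, 10000):
    print(n, round(capacity_per_instance(n, mean=0.3, std=0.2), 3))
```

Under these assumed numbers a single dedicated instance needs about 90% of its sold capacity held in reserve, while a pool of a hundred small instances needs only about 36% each, which is the gap the provider gets to sell twice.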

There are lots more reasons — all of them combining into one big overall “cloud application architecture best practice” — but I think those are two of the showstoppers.

I really think this is a wrong paradigm. People talk about the cloud being the technology of the future, but in many ways it’s pretty stone-age compared to what smart system architects can achieve with high-quality hardware and networking at a much lower cost, with very strong guarantees of performance, consistency, and availability.

Cloud computing is new enough that we don’t understand, in a collective sense, how to think about it. (I know that lots of individuals do, but as a whole, there isn’t much of a shared understanding.) The real value proposition that I want to see emerge from cloud computing is pretty much orthogonal to what everyone’s raving about these days. I want to see the DevOps engineering discipline build momentum around the idea that systems should be treated as services, with architectural components provisioned and controlled through APIs. That can be done completely independently of many of the characteristics of current cloud computing platforms (virtualization, ephemerality, horizontally scaled architectures…)

And like most people, I’ve got an ego and I don’t appreciate repeatedly being called a moron by cloud computing providers’ sales people, who don’t know anything about running database servers. I can do math and understand price/performance, and I know the cost and difficulty of building a sharded application. I look forward to the day when I don’t have to just bite my tongue and walk on to the next booth. I look forward to cloud hosting providers advancing to the year 2005 or so. I’m sure it will happen as we figure this all out.

Feel free to comment, but don’t expect me to approve your comment if you’re from a cloud provider and you’re plugging your platform :)

What’s wrong with MMM?

Posted in Xaprb » High Availability by Xaprb at May 04, 2011 07:43 PM

I am not a fan of the MMM tool for managing MySQL replication. This is a topic of vigorous debate among different people, and even within Percona not everyone feels the same way, which is why I’m posting it here instead of on an official Percona blog. There is room for legitimate differences of opinion, and my opinion is just my opinion. Nonetheless, I think it’s important to share, because a lot of people think of MMM as a high availability tool, and that’s not a decision to take lightly. At some point I just have to step off the treadmill and write a blog post to create awareness of what I see as a really bad situation that needs to be stopped.

I like software that is well documented and formally tested. A lot of software is usable even if it isn’t created by perfectionists. But there are two major things in the MySQL world for which I think we can all agree we need strong guarantees of correctness. One is backups. The other is High Availability (HA) tools. And this leads me to my position on MMM.

MMM is 1) fundamentally broken and unsuitable for use as a HA tool, and 2) absolutely cannot be fixed. I’ll take that in two parts.

First, it’s broken and untrustworthy. I could go into the technical details of why MMM is broken at the architectural and implementation level. I could talk about the way that it uses a distributed set of agents, which do not have a reliable communications channel, all maintain their own state which is not communicated or agreed upon across nodes, and don’t even share configuration. I could talk about the fact that MMM itself can’t be made HA or redundant — you can only have a single instance of it.

I could talk about lots of things, but you can argue with every one of those assertions. You can’t argue with the list of failures I’ve personally seen. It fails over with no reason when nothing is wrong — and botches it up, causing the entire replication cluster to get out of sync and break. It tries to fail over when something actually is wrong with the cluster, but it does things out of order and with no synchronization amongst the agents, leading to chaos. It can’t handle anything unexpected, such as the ordinary kinds of network, disk, etc failures you’d expect in systems that have something wrong (which is exactly when an HA tool is supposed to function). It doesn’t protect itself against the human doing something wrong, such as mixing up the agent configuration on different hosts. There are many bizarre ways MMM can fail, but these are all theoretical — until you witness them. I’ve witnessed them, and new customer cases on MMM failures are filed on a regular basis. Here’s one:

In the recent past, we have had a couple of bad experiences with mmm-monitor tool which broke replication and brought our website down for a few hours.

And another:

We have recently started testing MMM for MySQL and when using it under write load we have been experiencing ‘Duplicate entry’ (1062) errors.

In short, MMM causes more downtime than it prevents. It’s a Low-Availability tool, not a High-Availability tool. It only takes one really good serious system-wide mess to take you down for a couple of days, working 24×7 trying to scrape your data off the walls and put it back into the server. MMM brings new meaning to the term “cluster-f__k”.

Now, why isn’t it possible to fix it? One simple reason: MMM is completely untested and untestable. Change one line of code in Agent.pm’s master control flow and tell me that you’re confident that you know what it has just done to the whole system? You can’t do it. If you don’t have tests, you can’t change the code with confidence, period. And as I said before, HA and backup tools are where we need a zero-tolerance policy. “I think this fixed the bug” or “I think it’s safe to change this code” are not acceptable. I have seen a lot of bug fixes that cause new and interesting bugs. I appreciate the variety — life is boring if all we’re doing is seeing the same old bugs — but this isn’t what we need in an HA tool.

In order to fix MMM, it has to be completely rewritten from scratch. Among other things, decisions and actions need to be completely separated. Then the decisions can be verified with a test suite, and the actions can be verified independently. But if you do that, you don’t have MMM anymore, you have a new tool. Therefore MMM can’t be fixed, it can only be thrown out and reimplemented.
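As a rough illustration of what separating decisions from actions could look like, here is a minimal Python sketch. It is not MMM's code or the design of any existing tool; every name in it (NodeStatus, decide, execute, the action strings) is hypothetical, and it leaves out fencing, quorum and everything else a real HA tool needs. The point is only that the decision step is a pure function of observed state, so a test suite can exercise it without touching a database server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    """Observed state of one node (hypothetical fields)."""
    name: str
    reachable: bool
    replication_ok: bool


def decide(writer, nodes):
    """Decision step: return a plan as a list of (action, node_name) tuples.

    Pure function: no network, no SQL, no side effects.
    """
    by_name = {n.name: n for n in nodes}
    current = by_name.get(writer)
    if current is not None and current.reachable:
        return []  # the writer is healthy, so do nothing
    candidates = sorted(
        n.name for n in nodes
        if n.name != writer and n.reachable and n.replication_ok
    )
    if not candidates:
        return [("alert_operator", writer)]  # no safe target: leave it to a human
    return [("remove_writer_role", writer), ("add_writer_role", candidates[0])]


def execute(plan, handlers):
    """Action step: run each planned action through an injected handler."""
    for action, node in plan:
        handlers[action](node)
```

Because execute takes its handlers as an argument, the actions can be verified independently too, for example by passing in handlers that only record what they were asked to do.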

Note that I’m not claiming that MMM was developed by bad programmers or that it is bad quality. I am only claiming that a) it demonstrably doesn’t work correctly, and b) it can’t be fixed without a rigorous test suite, which can’t be added to it without a complete reimplementation.

I will go further and claim that the architecture of MMM is fundamentally unreliable, and it isn’t a good idea to reimplement it (it’s already been done once!). This we could argue for a long time, but I know of so many better architectures that I wouldn’t entertain the notion of building a new tool with the same architecture.

I have seen a number of people reach the same conclusions and then implement new systems in the same general vein as MMM, with a limited set of functionality to avoid some of the problems. For instance, Flipper is a single tool with no agents, so that’s an improvement. Unfortunately, these tools all suffer from the same problem: they aren’t formally tested. I simply can’t accept that in an HA tool.

If I’m such a perfectionist, why haven’t I built a tool that solves this problem? I have a limited amount of time, and at some point, I don’t do things for free. I’ve had multiple conversations that go like this: “My last replication downtime incident cost me $75k. I can’t let that happen again. What will it cost to build a correct solution? No way — I can’t pay $20k for a high availability tool that really works.”

There is active development on something related that I can’t talk much about now. But if you want, you can come to Percona Live and be among the first to find out.

High Availability MySQL Cookbook, the review

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at November 03, 2010 11:07 PM

When I read on the internetz that Alex Davies was about to publish a Packt book on MySQL HA, I pinged my contacts at Packt and suggested that I'd review the book.

I've run into Alex at some UKUUG conferences before and he's got a solid background in MySQL Cluster and other HA alternatives, so I was looking forward to reading the book.

Alex starts off with a couple of in-depth chapters on MySQL Cluster. He does mention that it's not a fit for all problems, but I'd hoped he would do so a bit more prominently ... an upfront chapter outlining the different approaches, and when each approach is a match, would have been better. As it is, the avid reader might be 80 pages into MySQL Cluster before he realizes it's not going to be a match for his problem.

I really loved the part where Alex correctly mentions that you should probably be using Puppet or a similar tool to manage the config files of your environment, rather than scp'ing them around your different boxes ..

Alex then goes on to describe setting up MySQL replication and multi-master replication, with the different approaches one can take here. He gives some nice tips on using LVM to reduce the downtime of your MySQL when you have to transfer the dataset of an already existing MySQL setup. Good stuff.

He then goes on to describe MySQL with shared storage ... if you only mount your redundant SAN disk once on your MySQL nodes, my preference would probably be a Pacemaker stack rather than a Red Hat Cluster based setup, but his setup seems to work too. Alex quickly touches on using GFS to have your data disk mounted simultaneously on both nodes (keep in mind: with only 1 active mysqld) and then goes on to describe a full DRBD based MySQL HA setup.

The last chapter, titled Performance tuning, gives some very nice tips on tuning both your regular storage and your GFS setup, and also covers the tuning parameters for MySQL Cluster.

I was also really happy to see the appendices on the basic installation, where he advocates the use of Cobbler, Kickstart and LVM ..

One of the better books I've read in the past couple of years .. certainly the best book from Packt so far. I hope there is more quality stuff coming from that direction!

DRBD backported (or not) to 2.6.32 in EL6?

Posted in Arrfab's Blog » Cluster by fabian.arrotin at July 20, 2010 12:54 PM

As some of you already know, DRBD is now (since kernel 2.6.33) part of the mainline/upstream kernel. Some were expecting RHEL6 to come with that kernel (used for Fedora 13). The latest RHEL6 beta 2 still comes with 2.6.32, which doesn't include DRBD support. Of course we still don't know what the 'frozen' RHEL6 kernel will be, but on the other hand we know that Red Hat quite often 'backports' modules from newer kernels into the RHEL kernel. What about DRBD? At the time of writing this blog post it still seems undecided, but you can follow the DRBD RFE on Upstream Bugzilla to get a clue, or even comment on it, if you have a Bugzilla account, to make your voice heard. In any case, you can be sure that even if DRBD isn't part of EL6, CentOS will still ship it in the Extras repository, as for EL4/EL5 ...

MySQL HA, an alternative approach

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at April 28, 2010 09:17 PM

For those who've seen my presentation on MySQL HA, you already know that I often use a multi-master setup with a meta OCF resource that groups my favoured MySQL instance with the service IP. Using a meta resource means that Pacemaker monitors MySQL, but doesn't actually manage it. It's an approach that works for us.

One of the other approaches I will be looking at soon is the freshly released OCF resource that Florian announced last week.

Back in the day, our approach meant we didn't have to use clone resources, which you might remember being pretty buggy in the v2 era. Not wanting to use clone resources isn't really a valid reason anymore these days. I've also frequently mentioned the combination of DRBD and multi-master replication; using this set of OCF resources makes that a lot easier ..

Now all I need to do is find me some time to validate this setup.

Why high-availability is hard with databases

Posted in Xaprb » High Availability by Xaprb at April 26, 2010 11:53 AM

A lot of systems are relatively easy to make HA (highly available). You just slap them into a well-known HA framework such as Linux-HA and you’re done. But databases are different, especially replicated databases, especially replicated MySQL.

The reason has to do with some properties that hold for many systems, but not for most databases. Most systems that you want to make HA are relatively lightweight and interchangeable, with little to zero statefulness, easy to start, easy to stop, don’t care a lot about storage (or at least don’t write a lot of data; that’s usually delegated to the database), and there’s little or no harm done if you ruthlessly behead them. The classic example is a web server or even most application servers. Most of the time these things are all about CPU power and network bandwidth. If I were to compare them to a car, I’d say they are like matchbox cars: there are many of them, and they are cheap and easy to replace.

Databases are different. With or without replication, you’re looking at a system that is complex, stateful, heavyweight, and cares a lot about storage. It runs on bigger hardware with fast disks and a lot of memory. It’s usually disk-bound, and it does a lot of writes. It’s hard to start: it takes a long time to warm up and really get ready to serve production workloads (many minutes, hours, or even days). It tends to run with a lot of data in memory in a dirty state, so shutdown is slow, because a clean shutdown requires flushing a bunch of data to disk. If you yank its power plug or kill-dash-nine it, it’ll have to perform recovery on startup, which slows the startup process even more. If I were to compare a database server to a car, I wouldn’t even use a car as the analogy: I’d use one of those big-ass mining trucks. If your mining truck breaks down, you don’t just toss it in the trash and pull another off the shelf.

The problem with a lot of HA solutions is that they want to deal with inconsistencies or irregularities by killing the resource and replacing it in another location. This works fine with web servers, but not with database servers. Doing that will cause serious pain and downtime, defeating the point of HA. And when you add replication into the mix, it gets even worse. A system that wants to manage replication needs to deal with very complex conditions. A lot of replication failures are delicate matters that require skilled human intervention to solve. The HA solution must insulate the application from the misbehaving resource, but leave it running so the human can handle things.

This is not the way most applications are made HA. It’s different with databases, and it’s much harder.

Linux Open Administration Days 2010

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at April 20, 2010 07:43 PM

So about 4 months ago there was the crazy idea to start a new FOSS event in Belgium targeted at sysadmins.

What started out as an event for local people to meet local people with some local speakers actually ended up being a small local event with some top international speakers on configuration management and system administration, mixed with a bunch of good local ones!

I had the honour to open the conference with an extremely short version of the Devops talk I gave earlier last year .. extremely short, as I knew that the topic would recur a lot over the course of the weekend.

We had the first European talk on Chef, by Joshua Timberman, and we had Puppet talks, amongst others by Dan Bode from Puppet Labs, and CFengine talks. Devops was a frequently dropped word.

We had a book raffle where we handed out O'Reilly books .. we had a great free pizza party (got the idea from the Saturday pizza event at LCA 2005), and we had some free beer. Sounds like a good combination for a geeky weekend.

Apart from the regular talks there were plenty of Open Spaces where interesting topics were discussed ... we had spaces on Open Source vs Open Core, and strong voices were heard when we discussed what we should do with the Open Core companies that claim to value Open Source. Some people think we should actually list the fauxpensource ones somewhere and make sure the world knows about them.

We had an awesome configuration management session discussing Chef vs Puppet vs CFengine. And much, much more ...

Some people owe me plenty of sushi, as I had to do my MySQL HA talk before their Managing MySQL talk, but other than that .. things just went fine.

UKUUG Spring Conference 2010

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at April 07, 2010 05:16 PM

Last week I was in Manchester for the 2010 UKUUG Spring Conference, right .. make that 2 weeks ago :)

The UKUUG usually hosts the more interesting conferences around ... it's not just the schedule that attracts me. Yes, there's the strong focus on larger scale Unix (and mostly Linux) deployments and how to manage them, but there's also the opportunity to chat in real life with the Devops from across the Chunnel.

Spending time with R.I.Pienaar, Julian Simpson, Simon Wilkinson, Alex Davies, Simon Riggs, Josette, and many others is always fun.

As I was in town early, I went to the preconference beer meetup, met a lot of people and chatted about config management, virtualization and lots of other stuff ... after the pub the plan was to go for curries nearby .. and while walking to the, ahem, bus stop, I managed to recognise Ben Martin from meeting him ages ago in Hamburg for LinuxKongress, always fun ..

Apart from having to jump on a bus and our group being split at the curry place, rather than being able to tell the latecomers where to walk to and being seated upstairs with the whole group, the curries were interesting and fun.

As I had been pushing Simon Wardley on Twitter to submit a talk for the conference it was really great to finally see him present .. His talk was the perfect soft introduction to the conference ...

Simon's talk was followed by a talk on Security for the virtual datacenters, after I questioned the speaker if anyone actually uses TPM outside an academic lab the talk suddenly changed into a commercial presentation for a Quack, nuff said.

The ever energetic Matt S Trout talked about 21st century Perl before Simon "Life is too short for SELinux" Wilkinson talked about his experiences in getting the OpenAFS crowd on Git.

Bummer Thierry Carrez didn't show us the real juice of UEC and just the installations of a Cloud Controller and a Node Controller , but he managed to do so in approx 30 minutes as promised .

A talk titled Coherent and Integrated Configuration of Virtual Infrastructures always catches my eye .. however when that talk turns out to be a Coherent and Integrated configuration only within the University of Edinburgh (aka lcfg2) talk I'm disappointed, especially since it pretty much didn't introduce any new concepts beyond the ones I introduced back in my Durham UKUUG presentation.

Luckily Andrew Stribblehill gave a very interesting talk on MySQL scalability, in which I promised him some answers to his questions for the next day :)

The Conference dinner was without a doubt the best UKUUG dinner so far , no typical english "food", no weird location (Old Trafford, an abandoned warship) , but just a big chinese place and plenty of food !

I started Thursday morning in the wrong track, I assumed I was in the Virtualization track, but I ended up in the Sun thin client and Abusing Linux to serve weird desktops under the Green computing umbrella track, not my favourites ..

When Patrick and Julian started their Hudson hit my Puppet with a Cucumber talk (which featured some awesome #devops content) I was afraid that we'd have to look for a replacement PostgreSQL talk as Simon hadn't arrived yet .. Luckily he arrived in time for his presentation and he explained to us the new replication features that are slowly making it into PostgreSQL, one way ... log shipping ... not really up to par with other alternatives yet :(

So with no further ado .. here's the presentation I gave

PS. If at a UKUUG event and not sure about a person's name ... try Simon.. pretty good chance you're correct :)

Relevance in the datacenter

Posted in High Availability MySQL by Mark Callaghan (noreply@blogger.com) at March 28, 2010 08:11 AM

Do you know SQL or do you NoSQL? MySQL has been very popular for internet-scale deployments. But times have changed and there are alternatives. The alternatives either out-scale or out-avail MySQL and this is more important than providing the features of an RDBMS for many applications. My prediction is that there will be much less usage of MySQL for internet-scale applications in the future if we do not make big changes.

What are the problems and what can we do to fix them? From my perspective there are two problems:
  1. MySQL is not efficient on modern hardware (multicore, many disk IOPs)
  2. Replication is very expensive to manage
We are in the process of fixing the first problem for InnoDB and Percona has binaries you can use in production today that make things much better. However many problems remain that limit throughput on servers with 8+ cores and there is little visible work in progress to fix them (MyISAM, query cache, LOCK_table, ...). This is a serious issue as 8 cores is or will soon be the new common box in the datacenter and price/performance comparisons will get much worse for MySQL.

Replication requires much more work. I want more automation and more flexibility.

The lack of automation is apparent when you consider the replication related errors that require manual intervention. These errors are frequent or constant when you run a large number of MySQL servers. It is very expensive to support MySQL in this environment. Actions that must be automated include:
  • the promotion of a slave to a master after the failure of the master
  • failover of slaves to the new master
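
As a rough illustration of the first action, here is a minimal Python sketch that picks a promotion candidate. It assumes you have already collected `Relay_Master_Log_File` and `Exec_Master_Log_Pos` from `SHOW SLAVE STATUS` on each surviving slave; the class, the function names and the host names are made up for the example and are not part of MySQL or of any patch mentioned here. It does not solve the hard part, which is repointing the remaining slaves at the right position on the new master.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaveStatus:
    """Subset of SHOW SLAVE STATUS for one slave (host names are illustrative)."""
    host: str
    relay_master_log_file: str  # Relay_Master_Log_File
    exec_master_log_pos: int    # Exec_Master_Log_Pos


def applied_position(slave):
    """Return (binlog sequence number, offset) of the last event the slave executed."""
    sequence = int(slave.relay_master_log_file.rsplit(".", 1)[1])
    return (sequence, slave.exec_master_log_pos)


def pick_promotion_candidate(slaves):
    """Pick the slave that has executed the most of the failed master's binlog."""
    if not slaves:
        raise ValueError("no slaves available for promotion")
    return max(slaves, key=applied_position)


def lagging_slaves(slaves, candidate):
    """Slaves that are behind the candidate and need to catch up before repointing."""
    best = applied_position(candidate)
    return [s for s in slaves if applied_position(s) < best]


if __name__ == "__main__":
    fleet = [
        SlaveStatus("db2", "mysql-bin.000041", 900),
        SlaveStatus("db3", "mysql-bin.000042", 120),
        SlaveStatus("db4", "mysql-bin.000042", 77),
    ]
    winner = pick_promotion_candidate(fleet)
    print("promote:", winner.host)
    print("behind:", [s.host for s in lagging_slaves(fleet, winner)])
```

Comparing master binlog coordinates only works because every slave replicated from the same failed master; once slaves sit behind different intermediate masters, the coordinates are no longer comparable.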
I also want the flexibility to extend replication. I have participated in the development of many replication enhancements (semi-sync, mirror binlog, global group IDs) and that effort has been incredibly difficult. I am still amazed at what Wei and Justin were able to accomplish. I doubt that anyone would ever volunteer for such a project (I was paid). The code is not fun to modify.

I have more ideas to improve replication but it isn't clear to me that I can afford the cost to modify the replication code in official MySQL. But then I looked at the code for Drizzle. Wow! The code is clean, easy to read and easy to modify. So I still have hope for MySQL-related technology in the datacenter, but in the form of Drizzle.

On synchronous replication

Posted in High Availability MySQL by Mark Callaghan (noreply@blogger.com) at March 28, 2010 08:11 AM

Is synchronous replication possible in MySQL? Yes. Is it possible without major surgery to the existing code? Probably (or hopefully). Notes on an approach are at code.google.com.

The MySQL replication team may be working on this now for MySQL 6.0. They have spent a lot of time recently making replication flexible to support semi-sync and other new features. I assume they plan to support sync replication as well.

Patch for global transaction IDs, binlog event checksums and crash-safe replication state

Posted in High Availability MySQL by Mark Callaghan (noreply@blogger.com) at March 28, 2010 08:11 AM

Justin just added a patch for global transaction IDs, binlog event checksums and crash-safe replication state. It is at code.google.com. This patch is based on MySQL 5.0.68, so Justin did a bit of work to port code forward from the version we use (5.0.37).

Well, I assume that this includes support for crash-safe replication state. This replaces transactional replication. But it works for all storage engines.

Percona has ported a few of the replication features from previous Google patches. Hopefully, they are interested in these changes. MySQL has semi-sync replication in 6.0 with a promise to backport to 5.4. Perhaps these changes will end up there too.

Vendor lock in and MySQL documentation

Posted in High Availability MySQL by Mark Callaghan (noreply@blogger.com) at March 28, 2010 08:11 AM

Part of the sales pitch for MySQL is that there is less risk of vendor lock in. This is repeated frequently on their marketing here, here, here and here. The explanation is that the source code for MySQL is available with a GPL license and if you are unhappy with MySQL the company you can continue using MySQL the product and get support elsewhere.

Documentation does not have a similar license. You can decide whether this creates the risk of vendor lock in. Details are here.
  • We cannot edit it.
  • We have limited rights to publish it.
Isn't it in the best interests of Sun/MySQL to address this issue and reassure potential customers?

Arjen, Sheeri, Baron and the lead for the MySQL docs team have also written about this.

Cool things you can almost do with replication

Posted in High Availability MySQL by Mark Callaghan (noreply@blogger.com) at March 28, 2010 08:11 AM

We added support for row-change logging to MySQL 5.0. The logged data is similar to row-based replication with changes to the output that make it much easier to parse. Gene Pang describes this work at 2pm at the conference.

What might be done with this data?
  • replicate row changes to a data store that is not MySQL (Teradata, HBase/Hypertable, memcached)
  • materialized view maintenance
  • change notification
And I talk at the Percona Performance Conference at 10:50am today on the InnoDB IO architecture.

Better days Arrive when Dev Meet Ops

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at March 06, 2010 01:03 PM

A couple of weeks ago Brian Profitt pinged me for a chat about Devops , the result of that chat , his article , can now be found on the Zenoss blog, it's titled Datacenter Barometer: Better days arrive when dev meets ops

It's a very nice read with some pointers to places regular readers of my blog should already know ;)
So with lots of leading Open Source infrastructure companies on different levels, such as config management (OpsCode and Reductive Labs) , monitoring (Zenoss) , deployment (openQRM, RPath) , and obviously Consultancy companies , the upcoming Devops conferences around the planet promise to be a lot of fun ! ;)

Oh, and apparently there is some more on the story on /.

Disabling DHCP on a LibVirt setup

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at December 04, 2009 06:43 PM

So you have this libvirt setup and you want to have a dhcp server on the virtual machines you are playing with , or you want to have all static IP's.

Libvirt uses dnsmasq to provide dhcp services etc and when you generate a config from the gui it will look like

<network>
  <name>piponet</name>
  <uuid>e87d3bf1-a2e7-96ca-e131-7ae51ac033f9</uuid>
  <bridge name='virbr2' stp='on' delay='0' />
  <ip address='192.168.100.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.100.128' end='192.168.100.254' />
    </dhcp>
  </ip>
</network>

If you fully remove the dhcp section, then restart libvirt you'll notice dnsmasq running with no dhcpd on that subnet so you'll have full control again :)

<network>
  <name>piponet</name>
  <uuid>e87d3bf1-a2e7-96ca-e131-7ae51ac033f9</uuid>
  <bridge name='virbr2' stp='on' delay='0' />
  <ip address='192.168.100.1' netmask='255.255.255.0'>
  </ip>
</network>
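
If you have more than a couple of networks to clean up, the same edit can be scripted. What follows is only a sketch using Python's standard `xml.etree.ElementTree`; the function name `strip_dhcp` is made up, and it assumes you feed it the XML yourself (for instance from `virsh net-dumpxml`) and load the result back with `virsh net-define` before restarting the network.

```python
import xml.etree.ElementTree as ET

# The network definition from the post, as generated by the GUI.
NETWORK_XML = """<network>
  <name>piponet</name>
  <uuid>e87d3bf1-a2e7-96ca-e131-7ae51ac033f9</uuid>
  <bridge name='virbr2' stp='on' delay='0' />
  <ip address='192.168.100.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.100.128' end='192.168.100.254' />
    </dhcp>
  </ip>
</network>"""


def strip_dhcp(network_xml):
    """Return the network XML with every <dhcp> section under <ip> removed."""
    root = ET.fromstring(network_xml)
    for ip in root.findall("ip"):
        # findall() returns a list, so removing while looping over it is safe.
        for dhcp in ip.findall("dhcp"):
            ip.remove(dhcp)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(strip_dhcp(NETWORK_XML))
```

The bridge and the IP address stay untouched, so the guests keep their gateway and only lose the address handout.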

Yet Another DNS Issue

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at November 12, 2009 09:01 PM

While browsing through my enormous mailinglist backlog I ran into the following message from Gianluca Cecchi on the DRBD-user mailing list

guess I`ll have to give Lars a T-Shirt when we next meet ;)

From: Gianluca Cecchi
To: drbd-user@lists.linbit.com
Subject: [DRBD-user] notes on 8.3.2


- drbdadm create-md r0 segfaults when the command "hostname" on the
server contains the fully qualified domain name but you have put only
the hostname part in drbd.conf
Instead, the command "drbdadm dump" correctly gives you a warning in
this case (suggesting how to correct the error you made....):

suppose complete hostname is virtfed.domainname.com and you put
virtfed alone in drbd.conf
[root@virtfed ~]# drbdadm dump
WARN: no normal resources defined for this host (virtfed.domainname.com)!?

while
[root@virtfed ~]# drbdadm create-md r0
Segmentation fault

Guess I`ll have to give the Linbit crowd a T-Shirt when we next meet ;)
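
A quick sanity check before running drbdadm could catch this kind of mismatch. The Python sketch below is an illustration only: it assumes the plain `on <hostname> {` section syntax in drbd.conf, the function names are invented, and in real use you would pass in the node name the machine reports (for example `platform.node()`) and the contents of your own config file.

```python
import re

# Matches section headers such as "on virtfed {" at the start of a line.
ON_SECTION = re.compile(r"^\s*on\s+([\w.\-]+)\s*\{", re.MULTILINE)


def hosts_in_drbd_conf(conf_text):
    """Return the host names used in 'on <host> {' sections, in file order."""
    return ON_SECTION.findall(conf_text)


def check_hostname(node_name, conf_text):
    """Return None if node_name has an 'on' section, else a warning string."""
    hosts = hosts_in_drbd_conf(conf_text)
    if node_name in hosts:
        return None
    short = node_name.split(".", 1)[0]
    # Same machine, but one side is fully qualified and the other is not.
    near_misses = [h for h in hosts if h.split(".", 1)[0] == short]
    if near_misses:
        return (f"drbd.conf says '{near_misses[0]}' but this host reports "
                f"'{node_name}': short name versus FQDN mismatch")
    return f"no 'on' section matches this host ({node_name})"


if __name__ == "__main__":
    sample = "resource r0 {\n  on virtfed {\n    device /dev/drbd0;\n  }\n}\n"
    print(check_hostname("virtfed.domainname.com", sample))
```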

Nines , Damn Nines and More Nines

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at October 19, 2009 09:46 PM

Funny how different experiences lead to different evaluations of tools. The MySQL HA solutions that the MySQL Performance Blog lists are in almost the complete opposite order of what my impressions are.

Ok agreed, I should probably not take my MySQL NDB experiences from 2-3 years ago with multiple Queries of death and more problems than you into account anymore , but back then it went in the list Less stable than a single node. I've had NDB POC setups going down for much more than 05:16 minutes
Ndb comes with a lot of restrictions, there are

As for MySQL on DRBD, I've said this before , I love DRBD, but having to wait for a long InnoDB recovery after a failover just kills your uptime , I remember being called by a customer during Fred's last holiday who was waiting over 20 minutes for recovery , twice, so putting the DRBD/SAN setup second would not be my preference. But agreed .. it's only listed at 99.9% meaning almost 9 hours of downtime per year are allowed.

On the other hand we've seen database uptime of MySQL MultiMaster setups with Heartbeat reaching better figures than 99.99%. Heck, I've seen single nodes achieve better than 99.99% :)
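
For anyone who wants to redo the nines arithmetic, here is a small Python sketch. It assumes a 365 day year, and the function name is just for illustration; with that assumption 99.9% works out to roughly 8.76 hours per year, which is the "almost 9 hours" above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year that still meet the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for nines in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(nines)
        print(f"{nines}% -> {minutes / 60:.2f} hours ({minutes:.1f} minutes) per year")
```

Two 20 minute InnoDB recoveries in a year already eat most of a 99.99% budget, which is why recovery time matters as much as failover time.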

So what does this teach us ... there is no golden rule for HA, lots of situations are different, it's the preferences of the customer, the size of the database, the kind of application , and much more .. you always need to think and evaluate the environment ...

Heartbeat 2 OpenAIS

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at October 16, 2009 06:53 PM

While upgrading a pretty recent Heartbeat cluster to OpenAIS earlier today I ran into the following weird situation

Last updated: Fri Oct 16 08:50:03 2009
Stack: openais
Current DC: CO_NMS-1 - partition with quorum
Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
4 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ CO_NMS-1 CO_NMS-2 ]
OFFLINE: [ co_nms-1 co_nms-2 ]

or

crm(live)node# show
co_nms-1(5c48ab4f-767f-e2dc-20ec-5969cddad152): normal
co_nms-2(922ff786-eca9-bed0-d79d-8222727a2c5b): normal
CO_NMS-1: normal
CO_NMS-2: normal

Whohoo.. OpenAIS must have realized I have uppercase and lowercase cores :)

Funny to see .. but quickly solved..
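
Spotting this before the upgrade is easy enough to script. The Python sketch below only detects node names that differ in nothing but case; it assumes you collect the list of names yourself (here they are copied from the output above), and the function name is made up.

```python
from collections import defaultdict


def case_collisions(node_names):
    """Group node names that are identical apart from upper/lower case."""
    groups = defaultdict(set)
    for name in node_names:
        groups[name.lower()].add(name)
    # Keep only the groups where more than one spelling shows up.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    nodes = ["CO_NMS-1", "CO_NMS-2", "co_nms-1", "co_nms-2"]
    for spellings in case_collisions(nodes).values():
        print("same node, different case:", ", ".join(spellings))
```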

Monitoring MySQL

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at October 11, 2009 06:18 PM

Ronald Bradford wants to know what kind of Monitoring you use..
He specifically wants to know about Alerting tools

There's different cases , looking at it from a full infrastructure point my current favourite is Zabbix or good old Nagios,

But when looking at it from a debugging perspective you have MySQLAR or Hyperic, but those aren't in the alerting list.

However, when you are building HA clusters, you have custom scripts running either from mon or from pacemaker ..

Still .. Ronald probably wants more input :)

Why learn to type ?

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at October 09, 2009 10:06 AM

When your machine knows what you mean ..

[s3p-root@XMS-1 tomcat6]# crm configure
crm(live)configure# bye
[s3p-root@XMS-1 tomcat6]# crm confiure
crm(live)configure# bye
[s3p-root@XMS-1 tomcat6]# crm confiture
crm(live)configure# bye
[s3p-root@XMS-1 tomcat6]#

I'd better

apt-get install coffee

GitHub: Speed matters

Posted in Anchor Web Hosting Blog » drbd by bsmith at September 29, 2009 06:39 AM

Impressions from the first article (in its first day) and the first 24 hours of the GitHub migration have caused us at Anchor to believe that:

  1. GitHub is just as popular as we thought,
  2. The migration was worth it, as things are running much faster (just check your twitter feeds, or better yet, check your GitHub source tree for no reason ;) ); and,
  3. People are interested in what has gone under the hood of the new GitHub (insert your favorite fast car here; otherwise let's say a roadster).

Taking these three things into account, this installment will discuss why things are so much faster post migration compared to prior.

I said ‘faster’ and not ‘fast’, because GitHub is now as fast as any website should be. So in comparison, yes, GitHub is fast now, however it is akin to riding your bicycle with half inflated tires: when fully inflated, suddenly your old bike is blazing fast. Now this is not to be critical of the former architecture which held its merits when GitHub was founded. GitHub had simply moved to a stage where an infrastructure architecture refresh was logical.

The main thing, in the large, that made this new architecture fast was that we were given a blank slate and large amounts of freedom to make an architecture that would do the job well. This is an incredibly rare thing, and it no doubt took a lot of courage on GitHub’s part. For that, we have to say “thank you” to the GitHub team for letting us have that freedom. I like to think that we’ve repaid that trust with a pretty awesome architecture that will serve them well for some time to come.

SCALE: When looking at the new architecture as a whole, the increased scale is immediately evident. GitHub now consumes far more hardware than ever before:

Old Infrastructure:

  • 10 VMs
  • 39 VCPUs
  • 54GB RAM

New Infrastructure:

  • 16 physical machines
  • 128 physical cores
  • 288GB RAM

Or for those who enjoy visual cues:

[Figure: resource comparison, old to new infrastructure]

It is a credit to the old infrastructure and GitHub’s code that it ran so well on so little (in comparison). The first credit for increased performance is increased scale.
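
To put rough numbers on the increase, the following Python sketch divides the new figures by the old ones. The figures are the ones listed above; the dictionary keys and the function name are just labels chosen for this example, and the comparison is not like for like, since the old numbers count VMs and VCPUs while the new ones count physical machines and cores.

```python
# Figures taken from the old and new infrastructure lists above.
old = {"machines": 10, "cores": 39, "ram_gb": 54}
new = {"machines": 16, "cores": 128, "ram_gb": 288}


def growth_factors(before, after):
    """Return how many times larger each resource is after the migration."""
    return {key: after[key] / before[key] for key in before}


if __name__ == "__main__":
    for resource, factor in growth_factors(old, new).items():
        print(f"{resource}: {factor:.1f}x")
```

By this count the core count grows a little over three times and the RAM a little over five times.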

An important note regarding the hardware is that there is nothing special (or industry secretive) regarding it. The solution in its entirety is run from commodity hardware. No special black boxes doing scary things with packets and routes. No appliance servers. The solution architecture developed by Anchor can be used with any hardware vendor (insert: Dell, HP, IBM, SuperMicro, etc). Vendor neutrality provides GitHub with no encumbrance with either scaling up or out, a key issue when considering growth and future flexibility.

Note: The architecture’s flexibility allows for the user repository storage to be expanded with a mix of vendor hardware (should GitHub ever change hardware vendor). Furthermore, any component can be exchanged for another vendor’s hardware with no change to GitHub’s architecture or software.

In a nutshell, the increased scale provides:

  • More GitHub front-end servers to service your requests;
  • More storage; and
  • More I/O bandwidth when working with your repository data

HARDWARE PERFORMANCE: The speed specifications of the underlying components are important, in addition to how that hardware is utilised.

Storage I/O: A common factor in poor performance with any solution is an I/O bottleneck at the storage level.  This pain was GitHub’s. To alleviate this, not only is the storage now distributed across several servers (distributing the I/O), but it is now running on direct-attached 15,000 RPM SAS disks on battery-backed hardware RAID. Therefore, the second credit for increased performance is faster storage.

Direct access to hardware: Virtualisation is great. What isn’t great is when virtualisation is used as a universal solution. At Anchor we believe there is a place for virtualisation, and systems with massive I/O or CPU requirements are not that place. By moving resource heavy systems onto dedicated hardware, any contention for resources between individual VMs is removed. The third credit goes to less overhead.

ARCHITECTURE: Throwing hardware at a scaling problem is an easy solution, but without the right division of resources and the right software to properly use it, it’s not going to run real fast.

For GitHub, this was their innovative Git command proxying systems, which do an excellent job of taking requests from the frontends (where users connect with their web browser, git client, or SSH client) and shipping them to the fileservers.  The database structure, filesystem layout, and code efficiency also contribute to this.

Given that the software isn’t our speciality, there’s not a lot for us to say about this, but GitHub are planning a series of posts on their blog, and I’m quite sure it’ll be enlightening.

TO REVIEW: The factors involved in GitHub’s faster response on the new infrastructure include (but are not limited to):

  • Increased Infrastructure (Scale)
  • Faster Hardware (Storage)
  • No resource contention (More resources per server)
  • Solid, scalable architecture (Awesomeness)

Keep an eye on this space, as we delve into technology specific posts regarding what kinds of 11 herbs and spices Anchor used to realise the new GitHub architecture.


When HA won’t play the way you want it to

Posted in Anchor Web Hosting Blog » drbd by oliver at September 08, 2009 03:26 AM

In an ideal world every service would support High Availability and Load Balancing, would scale up easily and cleanly and all of us systems administrators would be paid bucketloads to play golf all day while the computers did all the hard work. To quote Dylan Moran of Black Books fame, “Don’t make me laugh…bitterly”.

I’ll cut to the chase – sometimes you have to really shoehorn technologies to do what you want. Fortunately I love doing this, and the technologies of today’s article are virtualised Windows 2008 on Xen, and Oracle XE 10g. Neither likes to play ball, for a few reasons:

  • Generally speaking, when you virtualise an OS you want to have para-virtualisation drivers enhancing the hardware support. Open Source Xen has PV drivers, but they are not signed with a legitimate certificate. Windows 2008 does not play nicely with unsigned or test-cert-signed drivers.
  • Oracle is just a messy, messy, nasty thing. Yes, paid versions undoubtedly support all manner of loadbalancing and HA options, but the free one does not.

Adding HA to Windows 2008 on Xen

The basic procedure was as follows:

  • Install the telnet server within Windows (making sure to lock it down in the firewall to only be accessible by the host machines)
  • Create a special admin account and password used for triggering a shutdown
  • Create an Expect script which logs into the VM via telnet, and issues the shutdown command
  • Create a modified version of the Heartbeat Xen resource agent which calls the expect script to shut down the VM (and wait a safe period of time) before “xm shutdown” is called. Without this, “xm shutdown” will simply power off the VM (in absence of working PV drivers).

The VM was already running on a DRBD volume between the two HA Xen servers, so I was able to just create a standard set of Heartbeat resources to control DRBD primary/secondary mode and the startup/shutdown of the HA Windows VM. For your benefit (if you want to recreate it) here is the expect script:

#!/usr/bin/expect -f
#
# Script which "automates" shutting down a Windows VM

# Don't log telnet output and commands to stdout, and set a reasonable timeout.
log_user 0
set timeout 3

# Log in via telnet and issue commands. Fairly straightforward.
spawn -noecho /usr/bin/telnet 192.168.1.1
sleep 0.5

# login as the "shutdown" user
expect {
 -re "login: $" {send "shutdown\r"}
 timeout exit
}
sleep 0.5
expect {
 -re "password: $" {send "mysecretpassword\r"}
 timeout exit
}
sleep 0.5
expect {
 -re ">$" {send "shutdown /s /t 0\r"}
 timeout exit
}
sleep 0.1
expect {
 -re ">$" {send "exit\r"}
 timeout exit
}
exit

The rest is fairly self-explanatory if you understand Heartbeat.
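
If you don't, the stop sequence from the list above boils down to three steps: run the expect script, wait, then call "xm shutdown". Here is a Python sketch of that ordering, purely as an illustration. The real resource agent is a modified Heartbeat shell script, not this code; the function name, the 60 second grace period and the script path in the example are all assumptions.

```python
import subprocess
import time


def stop_windows_vm(domain, expect_script, grace_seconds=60,
                    run=subprocess.run, sleep=time.sleep):
    """Ask Windows to shut down via telnet, wait, then let xm clean up.

    run and sleep are parameters only so the sequence can be exercised
    without a Xen host; the defaults are the real implementations.
    """
    # Step 1: the expect script logs in and issues "shutdown /s /t 0".
    run([expect_script])
    # Step 2: give Windows a safe period of time to finish shutting down.
    sleep(grace_seconds)
    # Step 3: without PV drivers this simply powers off whatever is left.
    run(["xm", "shutdown", domain])


if __name__ == "__main__":
    # Dry run: print the steps instead of executing them.
    stop_windows_vm(
        "winvm",                               # hypothetical domain name
        "/usr/local/sbin/win-shutdown.exp",    # hypothetical script path
        run=lambda cmd: print("would run:", " ".join(cmd)),
        sleep=lambda seconds: print("would wait:", seconds, "seconds"),
    )
```

How long the grace period needs to be depends on how long your Windows guest takes to shut down cleanly, so measure that before picking a number.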

Oracle XE 10g

This was more of a learning process, since usually you just install Oracle and leave it the hell alone. Not so for me.

  • Install Oracle on both nodes using (fortunately) the RPMs they provide
  • Configure Oracle on both nodes including creating the databases, using the same password for SYSDBA
  • Shut down both instances of Oracle
  • Create the DRBD resource, and mount it on the primary node
  • On the primary node, move the contents of /usr/lib/oracle/xe/oradata and /usr/lib/oracle/xe/app/oracle/flash_recovery_area onto the mounted DRBD
  • On the secondary node, delete the aforementioned paths
  • Bind mount the oradata and flash recovery area from the mounted DRBD volume into the correct places in the directory tree.
  • Start Oracle
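
The bind mount step can be sketched as follows. This Python snippet only builds the commands rather than running them; the two Oracle paths come from the list above, while the DRBD mount point `/mnt/drbd`, the layout of the directories on it and the function name are assumptions for the sake of the example.

```python
import posixpath

# Paths from the steps above.
ORACLE_DIRS = [
    "/usr/lib/oracle/xe/oradata",
    "/usr/lib/oracle/xe/app/oracle/flash_recovery_area",
]


def bind_mount_commands(drbd_mount, oracle_dirs=ORACLE_DIRS):
    """Build the commands that bind mount the DRBD copies into Oracle's tree."""
    commands = []
    for target in oracle_dirs:
        # Assumes each directory was moved to the top level of the DRBD volume.
        source = posixpath.join(drbd_mount, posixpath.basename(target))
        # The original directory was moved or deleted, so recreate the mount point.
        commands.append(["mkdir", "-p", target])
        commands.append(["mount", "--bind", source, target])
    return commands


if __name__ == "__main__":
    for command in bind_mount_commands("/mnt/drbd"):
        print(" ".join(command))
```

In the Heartbeat setup described next, these bind mounts become resources of their own, ordered after the DRBD filesystem mount and before the Oracle service.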

After I had created a Heartbeat resource group which contained the DRBD resource, the DRBD filesystem mount, the aforementioned bind mounts and the Oracle service itself I was quite pleased to see that Oracle plays quite nicely with our shoehorned HA setup. You’ll want to make sure you have a properly fixed Oracle init script though, as the supplied one is fairly bad.

After making Oracle and Windows 2008 work nicely in HA, I’m almost certain any service no matter how bad can be shoehorned in a similar way to give you decent availability even when it wasn’t originally intended.