Quick links:

The Cluster Guy: Pacemaker removed from OBS

Florian's blog: Heartbeat 3.0.2 released

Florian's blog: Disaster Recovery survey — we need your feedback!

The Cluster Guy: Pacemaker 1.0.7 Released

The Cluster Guy: Ubuntu looking for Pacemaker testers

The Cluster Guy: Pre-Announce: End of Pacemaker 0.6 support is near

Florian's blog: And just when you thought LINBIT was all about DRBD…

Florian's blog: We’re in!

Everything is a Freaking DNS problem - ha: Disabling DHCP on a LibVirt setup

Gregory Steulet blog: How to recover data from a "dead" hard drive or from a human error

Everything is a Freaking DNS problem - ha: Got Interviewed

Florian's blog: Publican packages for Ubuntu karmic

Florian's blog: LINBIT announces stewardship for Heartbeat code base

The Cluster Guy: New Documentation Formats

MySQL Performance Blog » High Availability: Finding your MySQL High-Availability solution – Replication

Everything is a Freaking DNS problem - ha: Yet Another DNS Issue

The Cluster Guy: Pacemaker 1.0.6 Released

Advogato blog for lmb: 29 Oct 2009

MySQL Performance Blog » High Availability: State of the art: Galera – synchronous replication for InnoDB

Everything is a Freaking DNS problem - ha: Nines , Damn Nines and More Nines

MySQL Performance Blog » High Availability: Finding your MySQL High-Availability solution – The questions

Everything is a Freaking DNS problem - ha: Heartbeat 2 OpenAIS

Everything is a Freaking DNS problem - ha: Monitoring MySQL

MySQL Performance Blog » High Availability: Finding your MySQL High-Availability solution – The definitions

Everything is a Freaking DNS problem - ha: Why learn to type ?

The Cluster Guy: Advisory: Don't use Pacemaker on Corosync (yet)

Anchor Web Hosting Blog » drbd: GitHub: Speed matters

The Cluster Guy: Clusters From Scratch

The Cluster Guy: Version Control Prompt

Florian's blog: LINBIT mount their bikes to support Butterfly Children

Pacemaker removed from OBS

Posted in The Cluster Guy at February 04, 2010 10:42 AM

Today I removed Pacemaker from server:ha-clustering on the openSUSE build service.

I lost patience with the service some time ago and the project has been providing pre-built packages from ClusterLabs ever since (see our install page for more details).

It seems no-one else has had the time or patience to keep the build service updated since my departure so, after noticing their age and the fact that they no longer even build on the majority of targets, I made the decision to remove them.

Hopefully this will help avoid confusing those wanting the latest Pacemaker software.

Heartbeat 3.0.2 released

Posted in Florian's blog by Florian Haas at February 01, 2010 06:49 PM


After a long release hiatus, an elaborate project re-organization, and with a new primary project sponsor, the Heartbeat cluster messaging layer saw its 3.0.2 release earlier today.

Heartbeat 3.0.2 is the first official Heartbeat release since 2.1.3, released over 2 years ago. There have been a number of intermediate releases in the interim, including some Fedora releases labeled 3.0.0 and 3.0.1, but this is the first official Heartbeat 3.0 release. This means that Pacemaker, the definitive Linux cluster stack, continues to be fully supported on both the Heartbeat and Corosync/OpenAIS messaging layers.

The release tarball may be downloaded directly from the Mercurial repository, or from the Linux-HA web site. Debian packages will soon be available from Martin Loschwitz’ people.debian.org repository and are expected to make their way into Squeeze shortly.

And yes, we had planned to make this release in January. But 2/1/2010 just looked so neat as a release date.

Disaster Recovery survey — we need your feedback!

Posted in Florian's blog by Florian Haas at January 21, 2010 05:24 PM


Want to help us out making DRBD an even better fit for off-site Disaster Recovery? Then please participate in our survey. It’s just 6 questions and will take up barely 3 minutes of your time. Thanks!

Pacemaker 1.0.7 Released

Posted in The Cluster Guy at January 18, 2010 11:38 AM

The latest installment of the Pacemaker 1.0 stable series is now ready for general consumption.

In this release, we’ve made a number of improvements to clone handling - particularly the way ordering constraints are processed - as well as some really nice improvements to the shell.

The next 1.0 release is anticipated to be in mid-March. We will be switching to a bi-monthly release schedule to begin focusing on development for the next stable series (more details soon). If you have feature requests, now is the time to voice them and/or provide patches :-)

Pre-built packages for Pacemaker and its immediate dependencies are currently building and will be available for openSUSE, SLES, Fedora, RHEL and CentOS from the ClusterLabs Build Area shortly.

Debian users should check for updates in Martin’s repo over the coming days and Ubuntu fans can visit LaunchPad for 8.04 and 9.10 packages.

The source tarball is also available directly from Mercurial.

General installation instructions are available from the ClusterLabs wiki.

Release Statistics

Changesets 193 
Diff 220 files changed, 15933 insertions(+), 8782 deletions(-)

Changes of note since Pacemaker-1.0.6

  • High: PE: Bug 2213 - Ensure groups process location constraints so that clone-node-max works for cloned groups
  • High: PE: Bug lf#2153 - non-clones should not restart when clones stop/start on other nodes
  • High: PE: Bug lf#2209 - Clone ordering should be able to prevent startup of dependant clones
  • High: PE: Bug lf#2216 - Correctly identify the state of anonymous clones when deciding when to probe
  • High: PE: Bug lf#2225 - Operations that require fencing should wait for ‘stonith_complete’ not ‘all_stopped’.
  • High: PE: Bug lf#2225 - Prevent clone peers from stopping while another is instance is (potentially) being fenced
  • High: PE: Correctly anti-colocate with a group
  • High: PE: Correctly unpack ordering constraints for resource sets to avoid graph loops
  • High: Tools: crm: load help from crm_cli.txt
  • High: Tools: crm: resource sets (bnc#550923)
  • High: Tools: crm: support for comments (LF 2221)
  • High: Tools: crm: support for description attribute in resources/operations (bnc#548690)
  • High: Tools: hb2openais: add EVMS2 CSM processing (and other changes) (bnc#548093)
  • High: Tools: hb2openais: do not allow empty rules, clones, or groups (LF 2215)
  • High: Tools: hb2openais: refuse to convert pure EVMS volumes
  • High: cib: Ensure the loop for login message terminates
  • High: cib: Finally fix reliability of receiving large messages over remote plaintext connections
  • High: cib: Fix remote notifications
  • High: cib: For remote connections, default to CRM_DAEMON_USER since thats the only one that the cib can validate the password for using PAM
  • High: cib: Remote plaintext - Retry sending parts of the message that did not fit the first time
  • High: crmd: Ensure batch-limit is correctly enforced
  • High: crmd: Ensure we have the latest status after a transition abort
  • High (bnc#547579,547582): Tools: crm: status section editing support
  • High: shell: Add allow-migrate as allowed meta-attribute (bnc#539968)
  • Medium: Build: Do not automatically add -L/lib, it could cause 64-bit arches to break
  • Medium: PE: Bug lf#2206 - rsc_order constraints always use score at the top level
  • Medium: PE: Only complain about target-role=master for non m/s resources
  • Medium: PE: Prevent non-multistate resources from being promoted through target-role
  • Medium: PE: Provide a default action for resource-set ordering
  • Medium: PE: Silently fix requires=fencing for stonith resources so that it can be set in op_defaults
  • Medium: Tools: Bug lf#2286 - Allow the shell to accept template parameters on the command line
  • Medium: Tools: Bug lf#2307 - Provide a way to determin the nodeid of past cluster members
  • Medium: Tools: crm: add update method to template apply (LF 2289)
  • Medium: Tools: crm: direct RA interface for ocf class resource agents (LF 2270)
  • Medium: Tools: crm: direct RA interface for stonith class resource agents (LF 2270)
  • Medium: Tools: crm: do not add score which does not exist
  • Medium: Tools: crm: do not consider warnings as errors (LF 2274)
  • Medium: Tools: crm: do not remove sets which contain id-ref attribute (LF 2304)
  • Medium: Tools: crm: drop empty attributes elements
  • Medium: Tools: crm: exclude locations when testing for pathological constraints (LF 2300)
  • Medium: Tools: crm: fix exit code on single shot commands
  • Medium: Tools: crm: fix node delete (LF 2305)
  • Medium: Tools: crm: implement -F (--force) option
  • Medium: Tools: crm: rename status to cibstatus (LF 2236)
  • Medium: Tools: crm: revisit configure commit
  • Medium: Tools: crm: stay in crm if user specified level only (LF 2286)
  • Medium: Tools: crm: verify changes on exit from the configure level
  • Medium: ais: Some clients such as gfs_controld want a cluster name, allow one to be specified in corosync.conf
  • Medium: cib: Clean up logic for receiving remote messages
  • Medium: cib: Create valid notification control messages
  • Medium: cib: Indicate where the remote connection came from
  • Medium: cib: Send password prompt to stderr so that stdout can be redirected
  • Medium: cts: Fix rsh handling when stdout is not required
  • Medium: doc: Fill in the section on removing a node from an AIS-based cluster
  • Medium: doc: Update the docs to reflect the 0.6/1.0 rolling upgrade problem
  • Medium: doc: Use Publican for docbook based documentation
  • Medium: fencing: stonithd: add metadata for stonithd instance attributes (and support in the shell)
  • Medium: fencing: stonithd: ignore case when comparing host names (LF 2292)
  • Medium: tools: Make crm_mon functional with remote connections
  • Medium: xml: Add stopped as a supported role for operations
  • Medium: xml: Bug bnc#552713 - Treat node unames as text fields not IDs
  • Medium: xml: Bug lf#2215 - Create an always-true expression for empty rules when upgrading from 0.6

Ubuntu looking for Pacemaker testers

Posted in The Cluster Guy at January 15, 2010 05:22 PM

Ubuntu is looking to switch its supported cluster stack to Corosync+Pacemaker and has put out a “Call for testers”.
Check out the link if this is something you’re interested in.

Pre-Announce: End of Pacemaker 0.6 support is near

Posted in The Cluster Guy at January 12, 2010 12:18 PM

Unless there are violent objections, I plan to officially stop supporting 0.6 at the end of February.

Since I’ve not seen any bugs reported for some time, it seems that anyone still using 0.6 is happy with it for their workload.

Also, 1.0 has been out for over a year now and contains significant improvements over 0.6 including

  • A unified shell that hides the XML scaffolding
  • Migration thresholds that are easy to configure and understand
  • Failures can be ignored after a specified period of time
  • Ability to specify defaults for resource and operation parameters
  • Man pages for all CLI tools
  • Up-to-date online documentation

The online documentation has more details on what’s new/different in Appendix C and detailed instructions for upgrading in Appendix E.

We’re in!

Posted in Florian's blog by Florian Haas at December 08, 2009 05:02 PM


DRBD has entered a new phase. After being developed out of tree for 9 years, and after an extended review and streamlining phase since March, Phil submitted DRBD to be merged into the 2.6.32 release of the Linux mainline kernel. The submission was accepted by block layer maintainer Jens Axboe, who merged DRBD in September, then deferred to the 2.6.33 merge window, and this morning Linus pulled DRBD into his tree.

That makes DRBD an integral part of Linux, starting with the 2.6.33 release expected in a few weeks’ time.

We have something to celebrate.

Disabling DHCP on a LibVirt setup

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at December 04, 2009 06:43 PM

So you have this libvirt setup and you want to have a DHCP server on the virtual machines you are playing with, or you want to have all static IPs.

Libvirt uses dnsmasq to provide DHCP services etc., and when you generate a config from the GUI it will look like this:

  <network>
    <name>piponet</name>
    <uuid>e87d3bf1-a2e7-96ca-e131-7ae51ac033f9</uuid>
    <bridge name='virbr2' stp='on' delay='0' />
    <ip address='192.168.100.1' netmask='255.255.255.0'>
      <dhcp>
        <range start='192.168.100.128' end='192.168.100.254' />
      </dhcp>
    </ip>
  </network>

If you fully remove the dhcp section and then restart libvirt, you'll notice dnsmasq running with no DHCP service on that subnet, so you'll have full control again :)

  <network>
    <name>piponet</name>
    <uuid>e87d3bf1-a2e7-96ca-e131-7ae51ac033f9</uuid>
    <bridge name='virbr2' stp='on' delay='0' />
    <ip address='192.168.100.1' netmask='255.255.255.0'>
    </ip>
  </network>
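
The same edit can be scripted. The following is only a sketch using Python's standard XML library: the helper name strip_dhcp is made up for this illustration, it works purely on the XML text, and it does not talk to libvirt, so the network still has to be redefined and libvirt restarted afterwards.

```python
import xml.etree.ElementTree as ET

# The network definition as generated by the GUI (copied from the post).
NETWORK_XML = """\
<network>
  <name>piponet</name>
  <uuid>e87d3bf1-a2e7-96ca-e131-7ae51ac033f9</uuid>
  <bridge name='virbr2' stp='on' delay='0' />
  <ip address='192.168.100.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.100.128' end='192.168.100.254' />
    </dhcp>
  </ip>
</network>
"""


def strip_dhcp(network_xml: str) -> str:
    """Return the network definition with every <dhcp> block removed.

    Hypothetical helper: it only rewrites the XML text.
    """
    root = ET.fromstring(network_xml)
    for ip in root.findall("ip"):
        for dhcp in ip.findall("dhcp"):
            ip.remove(dhcp)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(strip_dhcp(NETWORK_XML))
```

The bridge and the IP address of the network are left untouched; only the dhcp element and its range disappear.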


How to recover data from a "dead" hard drive or from a human error

Posted in Gregory Steulet blog by Gregory Steulet at November 27, 2009 08:55 PM


Who has never run into failing media, accidentally deleted a valuable file, corrupted a FAT (File Allocation Table) or simply hit bad disk sectors? Of course, restoring a backup would be the easiest way to recover precious data in such cases. Unfortunately, it is often in exactly such a situation that we discover the backup is a month old, or that our files were never backed up at all. The following lines relate such stories and introduce some ways to recover data with open source software.


A nice sunny day

 

It was a nice sunny Saturday and I was invited to my girlfriend’s mother’s home for tea. During lunch she spoke about weird behavior of her PC running Windows XP: error messages, files that could not be deleted from the desktop and a lot of other weird stuff. I was quite proud to offer some help in order to solve those problems.

 

So I tried to switch on the computer by pressing the power button, but the PC shut down after 5 seconds! Furthermore, the power button stayed pushed in each time I pressed it, due to a broken piece of plastic inside the power switch. After 5 minutes I finally succeeded in starting the PC, and a wonderful Windows XP logo appeared, followed by error messages about missing “.dll” files. Finally, after removing 2 out of 3 antivirus/spyware programs and some unused programs, the computer ran more smoothly, although still very slowly, so I decided to run a check disk on the C:\ drive. As you know, such an operation on Windows requires a reboot, which I did. The check disk started, step 1 out of 5, 2 out of 5, when the following messages appeared: “unable to locate the file name attribute of index entry etc”, “not enough space to rebuild index”, etc.

 

When chkdsk ended, the computer restarted, the Windows XP logo appeared and the computer automatically switched off. It had nothing to do with the power button this time; it was the disk. No more visible data on it. This nice sunny day became really cloudy when my girlfriend told me that her mother stored a lot of important work documents and many pictures on it. And of course, like the majority of people storing precious data, she had no backup.

 

I decided to take the hard drive home and plug it in as a “slave” on my own PC. The disk was detected, but Windows indicated 0 disk space available and 0 disk space used. Disappointed, I decided to contact a company specialized in data recovery to ask for a quote. I received it a few days later: 1684 Euro for a complete data recovery, without any guarantee that the integrity of the data would be preserved. It indicated: “The reasons for the defective data are major structure damages.”

 

It was simply too expensive, given that we might recover nothing but unusable blocks from the drive.

 

[Figure: cause of data loss]


Let’s save the data with open source software

 

I decided to look for software on the internet in order to resolve this problem. I was amazed to discover so many recovery tools, e.g. active partition recovery, ontrack easy recovery, active uneraser, lost & found, prosoft media tools, active undelete and many others.

 

Attempting to repair the filesystem directly on the defective disk would generate unnecessary disk activity and carries the risk of further damage to the drive. Therefore the first step in such a situation is to copy the hard drive onto another one.

 

I first thought about dd for this task, and then noticed that a lot of better tools had been developed to copy data from one disk to another, like ddrescue, dd_rescue and dd_rhelp (a wrapper script for dd_rescue). I chose ddrescue because it combines the advantages of dd_rescue and dd_rhelp. Unlike dd, ddrescue does not stop when it encounters errors, which is especially useful when you work with failing media. In addition it can copy blocks backward, so if there is an error in the middle of a block, ddrescue will copy the data both before and after the error inside this block.
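
To illustrate the core idea of "keep going on read errors", here is a deliberately simplified Python sketch. It is not ddrescue's actual algorithm, which does considerably more; the function rescue_copy and the simulated flaky_reader are invented for this example. Unreadable blocks are replaced by zeros so that everything after them stays at the right offset in the copy.

```python
from typing import Callable, List, Tuple


def rescue_copy(
    read_block: Callable[[int], bytes],
    n_blocks: int,
    block_size: int,
) -> Tuple[bytes, List[int]]:
    """Copy n_blocks blocks, zero-filling any block that fails to read.

    read_block(i) returns block i, or raises OSError on a read error.
    Returns the copied image and the list of block numbers that failed.
    """
    image = bytearray()
    bad_blocks: List[int] = []
    for i in range(n_blocks):
        try:
            data = read_block(i)
        except OSError:
            # Do not abort: pad with zeros so later offsets stay aligned.
            data = b"\x00" * block_size
            bad_blocks.append(i)
        image += data
    return bytes(image), bad_blocks


def flaky_reader(i: int) -> bytes:
    """Simulated disk with 4-byte blocks whose block 2 is unreadable."""
    if i == 2:
        raise OSError("simulated read error")
    return bytes([i]) * 4


if __name__ == "__main__":
    image, bad = rescue_copy(flaky_reader, n_blocks=4, block_size=4)
    print(len(image), bad)
```

The list of failed blocks is what a real tool would come back to later, for example with smaller reads or from the other direction.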

 

An Ubuntu installation was used for this recovery. Note that no Linux installation is actually needed: Knoppix can be booted from a live CD, and all the software needed to copy and recover the data is available on it.

 

In the following case the partition /dev/sdc2 needs to be duplicated. Before copying the partition, it is mandatory to create a new partition of the same size as the defective one on a backup disk. fdisk fulfills this mission perfectly.

 

The partition size can be calculated by subtracting the Start column (first cylinder) from the End column (last cylinder). In the case below, 10340 - 833 = 9507, so the partition spans 9508 cylinders counting both the first and the last one.


 

steulet@steulet-desktop:~$ sudo fdisk  /dev/sdc

 

Command (m for help): p

 

Disk /dev/sdc: 80.0 GB, 80060424192 bytes

240 heads, 63 sectors/track, 10341 cylinders

Units = cylinders of 15120 * 512 = 7741440 bytes

Disk identifier: 0x1549f232

 

   Device Boot      Start         End      Blocks   Id  System

/dev/sdc1               1         832     6289888+   c  W95 FAT32 (LBA)

/dev/sdc2   *         833       10340    71880480    7  HPFS/NTFS
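
The cylinder arithmetic for /dev/sdc2 can be double-checked with a few lines of Python. This is just a sketch: the function names are made up, and the geometry (240 heads, 63 sectors per track, 512-byte sectors) is taken from the fdisk listing above.

```python
SECTOR_BYTES = 512


def cylinders_spanned(first_cyl: int, last_cyl: int) -> int:
    """Number of cylinders in a partition, counting both end points."""
    return last_cyl - first_cyl + 1


def partition_bytes(first_cyl: int, last_cyl: int, sectors_per_cyl: int) -> int:
    """Size in bytes of a partition made of whole cylinders."""
    return cylinders_spanned(first_cyl, last_cyl) * sectors_per_cyl * SECTOR_BYTES


if __name__ == "__main__":
    # /dev/sdc2: cylinders 833 to 10340, 240 heads * 63 sectors per track.
    print(cylinders_spanned(833, 10340))
    print(partition_bytes(833, 10340, 240 * 63))
```

The byte count agrees with the Blocks column (71880480 blocks of 1024 bytes). If the target disk reports a different cylinder size, it is safer to compare sizes in bytes than in cylinders.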

 

Creating a new partition, /dev/sdd1, on the new disk, /dev/sdd, with fdisk is a straightforward process.

 

steulet@steulet-desktop:~$ sudo fdisk /dev/sdd

 

Command (m for help): n

Command action

   e   extended

   p   primary partition (1-4)

p

Partition number (1-4): 1

First cylinder (1-38913, default 1):

Using default value 1

Last cylinder, +cylinders or +size{K,M,G} (1-38913, default 38913): 9508

 

Then ddrescue can duplicate the blocks onto the second disk. As shown below, the average copy rate is about 1 MB/s on a small configuration, meaning that about 28 hours are needed to copy a 100 GB partition and 12 days for 1 TB.

 

steulet@steulet-desktop:~$ sudo ddrescue /dev/sdc2 /dev/sdd1

 

 

Press Ctrl-C to interrupt

rescued:    73605 MB,  errsize:       0 B,  current rate:     699 kB/s

   ipos:    73605 MB,   errors:       0,    average rate:    1064 kB/s

   opos:    73605 MB
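
The time estimates follow directly from size divided by rate. A small Python sketch, assuming decimal units (1 MB = 10^6 bytes) and a constant rate, which a failing disk will of course not guarantee; the function name is invented for this example.

```python
MB = 10**6
GB = 10**9
TB = 10**12


def copy_time_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time needed to copy size_bytes at a constant rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s


if __name__ == "__main__":
    hours_100gb = copy_time_seconds(100 * GB, 1 * MB) / 3600
    days_1tb = copy_time_seconds(1 * TB, 1 * MB) / 86400
    print(round(hours_100gb, 1), "hours for 100 GB")
    print(round(days_1tb, 1), "days for 1 TB")
```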

 

Once the partition of the failing media is fully copied, data recovery can start. Once again, a variety of useful tools can be found on the internet. The one used in this example is TestDisk. TestDisk is free open source software originally created to recover lost partitions and to make non-bootable disks bootable again.

 

TestDisk has a large set of features, such as undeleting files from FAT, NTFS and ext2 filesystems, recovering/rebuilding the NTFS boot sector, fixing FAT tables, copying files from deleted FAT, NTFS and ext2/ext3 partitions, and many others. In addition TestDisk runs on many operating systems such as Windows, Linux, macOS, SunOS and BSD.

 


 

A short look at TestDisk reveals some interesting features.

 

TestDisk 6.10, Data Recovery Utility, July 2008

Christophe GRENIER <grenier@cgsecurity.org>

http://www.cgsecurity.org

 

  TestDisk is free software, and

comes with ABSOLUTELY NO WARRANTY.

 

Select a media (use Arrow keys, then press Enter):

Disk /dev/sda - 160 GB / 149 GiB - ATA SAMSUNG HD160JJ

Disk /dev/sdb - 160 GB / 149 GiB - ATA SAMSUNG HD160JJ

Disk /dev/sdc - 80 GB / 74 GiB - ATA SAMSUNG SP0802N

Disk /dev/sdd - 320 GB / 298 GiB - Hitachi HTS543232L9A300

 

[Proceed ]  [  Quit  ]

 

Note: Disk capacity must be correctly detected for a successful recovery.

If a disk listed above has incorrect size, check HD jumper settings, BIOS

detection, and install the latest OS patches and disk drivers.

 

After you have selected the disk to analyze, TestDisk asks for the partition table type. In the current case “Intel” is used. Note that it is also possible to use Mac, Sun or even Xbox partition tables.

 

TestDisk 6.10, Data Recovery Utility, July 2008

Christophe GRENIER <grenier@cgsecurity.org>

http://www.cgsecurity.org

 

 

Disk /dev/sdd1 - 85 GB / 79 GiB - Hitachi HTS543232L9A300

 

Please select the partition table type, press Enter when done.

[Intel  ]  Intel/PC partition

[EFI GPT]  EFI GPT partition map (Mac i386, some x86_64...)

[Mac    ]  Apple partition map

[None   ]  Non partitioned media

[Sun    ]  Sun Solaris partition

[XBox   ]  XBox partition

[Return ]  Return to disk selection

 

 

Note: Do NOT select 'None' for media with only a single partition. It's very

rare for a drive to be 'Non-partitioned'.

 


 

After analyzing the disk partitions, TestDisk shows the available partitions and also offers features such as rebuilding the boot sector.

 

TestDisk 6.10, Data Recovery Utility, July 2008

Christophe GRENIER <grenier@cgsecurity.org>

http://www.cgsecurity.org

 

Disk /dev/sdd - 320 GB / 298 GiB - CHS 38913 255 63

     Partition                  Start        End    Size in sectors

 1 * HPFS - NTFS              0   1  1 10340 254 63  166128102 [HP_PAVILION]

 

Boot sector

Status: OK

 

Backup boot sector

Status: Bad

 

Sectors are not identical.

 

A valid NTFS Boot sector must be present in order to access

any data; even if the partition is not bootable.

 

 

[  Quit  ]  [  List  ]  [Org. BS ]  [Rebuild BS]  [  Dump  ]

                      Copy boot sector over backup sector

 

 

TestDisk 6.10, Data Recovery Utility, July 2008

Christophe GRENIER <grenier@cgsecurity.org>

http://www.cgsecurity.org

 

Disk /dev/sdd - 320 GB / 298 GiB - CHS 38913 255 63

     Partition                  Start        End    Size in sectors

 1 * HPFS - NTFS              0   1  1 10340 254 63  166128102 [HP_PAVILION]

 

filesystem size           166128102 166112100

sectors_per_cluster       8 8

mft_lcn                   10 10

mftmirr_lcn               1048576 1048576

clusters_per_mft_record   -10 -10

clusters_per_index_record 1 1

Extrapolated boot sector and current boot sector are different.

 

 

[  Dump  ]  [  List  ]  [ Write  ]  [  Quit  ]

 

                           List directories and files

 

TestDisk 6.10, Data Recovery Utility, July 2008

Christophe GRENIER <grenier@cgsecurity.org>

http://www.cgsecurity.org

 

Write new NTFS boot sector, confirm ? (Y/N)

 


 

As expected, TestDisk recovered the backup boot sector.

 

TestDisk 6.10, Data Recovery Utility, July 2008

Christophe GRENIER <grenier@cgsecurity.org>

http://www.cgsecurity.org

 

Disk /dev/sdd - 320 GB / 298 GiB - CHS 38913 255 63

     Partition                  Start        End    Size in sectors

 1 * HPFS - NTFS              0   1  1 10340 254 63  166128102 [HP_PAVILION]

 

Boot sector

Status: OK

 

Backup boot sector

Status: OK

 

Sectors are identical.

 

A valid NTFS Boot sector must be present in order to access

any data; even if the partition is not bootable.

 

[  Quit  ]  [  List  ]  [Rebuild BS]  [Repair MFT]  [  Dump  ]

                            Return to Advanced menu

 

Sometimes the MFT (Master File Table) is corrupted as well, and Microsoft Check Disk (chkdsk) can fail when trying to repair it. TestDisk can repair the MFT through the “Repair MFT” entry of the advanced menu, after the NTFS partition has been selected as shown above.

 

TestDisk provides plenty of other features: adding a partition, changing its type, listing the files inside a partition and copying those files to another location, etc.

 

TestDisk 6.10, Data Recovery Utility, July 2008

Christophe GRENIER <grenier@cgsecurity.org>

http://www.cgsecurity.org

 

Disk /dev/sdd - 320 GB / 298 GiB - CHS 38913 255 63

     Partition               Start        End    Size in sectors

* HPFS - NTFS              0   1  1 10340 254 63  166128102 [HP_PAVILION]

 

Structure: Ok.  Use Up/Down Arrow keys to select partition.

Use Left/Right Arrow keys to CHANGE partition characteristics:

*=Primary bootable  P=Primary  L=Logical  E=Extended  D=Deleted

Keys A: add partition, L: load backup, T: change type, P: list files,

     Enter: to continue

NTFS, 85 GB / 79 GiB

 


 

The example below demonstrates how to copy files from the defective media to another place.

 

TestDisk 6.10, Data Recovery Utility, July 2008

Christophe GRENIER <grenier@cgsecurity.org>

http://www.cgsecurity.org

   * HPFS - NTFS              0   1  1 10340 254 63  166128102 [HP_PAVILION]

Directory /

 

dr-xr-xr-x     0     0         0  6-Sep-2009 17:52 .

dr-xr-xr-x     0     0         0  6-Sep-2009 17:52 ..

-r--r--r--     0     0        50  1-Jan-2005 19:49 AUTOEXEC.BAT

-r--r--r--     0     0       218 30-Oct-2005 18:04 BOOT.BAK

-r--r--r--     0     0       298 30-Oct-2005 18:04 boot.ini

dr-xr-xr-x     0     0         0  2-Jan-2005 04:00 Config.Msi

-r--r--r--     0     0         0 23-Nov-2004 15:21 CONFIG.SYS

dr-xr-xr-x     0     0         0 13-May-2006 12:18 C_DILLA

-r--r--r--     0     0        62  6-Sep-2009 17:16 delfichier.bat

dr-xr-xr-x     0     0         0  2-Jan-2005 05:01 Documents and Settings

-r--r--r--     0     0   528011264  6-Sep-2009 17:24 hiberfil.sys

dr-xr-xr-x     0     0         0  2-Jan-2005 05:05 hp

-r--r--r--     0     0         0 23-Nov-2004 15:21 IO.SYS

-r--r--r--     0     0         0 23-Nov-2004 15:21 MSDOS.SYS

-r--r--r--     0     0     47564  5-Aug-2004 13:00 NTDETECT.COM

-r--r--r--     0     0    252240  5-Aug-2004 13:00 ntldr

-r--r--r--     0     0   792723456 23-Dec-2007 05:35 pagefile.sys

dr-xr-xr-x     0     0         0  2-Jan-2005 05:21 Program Files

dr-xr-xr-x     0     0         0  2-Jan-2005 05:21 Python22

dr-xr-xr-x     0     0         0  2-Jan-2005 05:21 RECYCLER

dr-xr-xr-x     0     0         0  2-Jan-2005 05:21 sysprep

dr-xr-xr-x     0     0         0 28-Oct-2005 17:29 System Volume Information

dr-xr-xr-x     0     0         0  2-Jan-2005 04:12 system.sav

dr-xr-xr-x     0     0         0 17-Mar-2006 17:36 temp

dr-xr-xr-x     0     0         0  2-Jan-2005 05:37 WINDOWS

 

 

 

Use Right arrow to change directory, c to copy,

    q to quit

 


But what about ext3 filesystems?

Among TestDisk’s limitations is the impossibility of recovering deleted files that resided on an ext3 partition.

Browsing forums about ext3 file recovery will often discourage you. If you are not yet convinced that file recovery on ext3 is impossible, a look at the ext3 FAQ (http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html) will definitely remove any hope.

1.1      Q: How can I recover (undelete) deleted files from my ext3 partition?

Actually, you can't! This is what one of the developers, Andreas Dilger, said about it:

In order to ensure that ext3 can safely resume an unlink after a crash, it actually zeros out the block pointers in the inode, whereas
ext2 just marks these blocks as unused in the block bitmaps and marks the inode as "deleted" and leaves the block pointers alone.

Your only hope is to "grep" for parts of your files that have been deleted and hope for the best.

Fortunately, this statement appears to be too categorical. When a file is removed, its data are not actually overwritten. On an ext3 filesystem the pointers that reference the file’s blocks are simply removed, meaning that the disk area can be overwritten if write operations occur. Therefore the first thing to do after such a mistake is to avoid any additional write operation. The best way to achieve that is simply to unmount the filesystem.

Once the filesystem is unmounted, one can take the time to look for recovery tools. I found two tools able to recover deleted files.
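
Because the data blocks survive until something overwrites them, the "grep" approach from the FAQ boils down to scanning the raw device or an image of it for a known byte pattern. The following Python sketch shows the idea; the function name find_signature and the marker b"REDO" are made up for this illustration, and the in-memory buffer merely stands in for a real image file opened in binary mode.

```python
import io
from typing import BinaryIO, Iterator


def find_signature(image: BinaryIO, signature: bytes,
                   chunk_size: int = 1 << 20) -> Iterator[int]:
    """Yield the byte offset of every occurrence of signature in image."""
    if not signature:
        raise ValueError("signature must not be empty")
    overlap = len(signature) - 1
    buffer = b""
    base = 0  # offset within the image of buffer[0]
    while True:
        chunk = image.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        pos = buffer.find(signature)
        while pos != -1:
            yield base + pos
            pos = buffer.find(signature, pos + 1)
        # Keep a short tail so matches straddling two chunks are still found.
        keep = min(overlap, len(buffer))
        base += len(buffer) - keep
        buffer = buffer[len(buffer) - keep:]


if __name__ == "__main__":
    # Stand-in for an image file; the content is invented test data.
    fake_image = io.BytesIO(b"x" * 10 + b"REDO" + b"y" * 7 + b"REDO")
    print(list(find_signature(fake_image, b"REDO", chunk_size=4)))
```

Finding an offset is only the first step; working out where a file ends and in which order its blocks lie is the hard part that dedicated recovery tools take care of.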

  • ext3grep  is a simple tool developed by Carlo Wood and intended to aid anyone who accidentally deletes a file on an ext3 filesystem.

Both of them are intended to be run on disk images, meaning that it is mandatory to create a disk image of the partition where the removed files resided.

 


Some doubts?

 

A concrete example requiring file restoration on an ext3 partition could be the removal of all *.log files on a partition hosting the redo log files of an Oracle Database. Redo logs named with a *.log extension (which, by the way, is strongly discouraged for precisely this reason) would be removed by such an operation. As you may know, an Oracle Database cannot work without at least two redo log groups, so such a delete would lead to a database crash.

 

The storage setup in the following example is composed of two RAID 5 groups of three disks each, configured with mdadm. On top of them a RAID 0 (striping) has been configured with LVM2. This configuration, RAID 50, is illustrated in the figure below.

 

[Figure: RAID 50]

 


The Oracle database version is 10.2.0.4 and the filesystemio_options parameter is set to “setall”.

After removing redo logs of the SOUK database, the following can be seen in the Oracle alert log.

 

LGWR: Failed to archive log 2 thread 1 sequence 72 (16198)
Sun Nov  8 17:17:32 2009
Thread 1 advanced to log sequence 72 (LGWR switch)
  Current log# 2 seq# 72 mem# 0: /u05/oradata/SOUK/redog02a_SOUK.log
  Current log# 2 seq# 72 mem# 1: /u05/oradata/SOUK/redog02b_SOUK.log
Sun Nov  8 17:18:47 2009
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/u05/oradata/SOUK/redog01b_SOUK.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
ORA-00312: online log 1 thread 1: '/u05/oradata/SOUK/redog01a_SOUK.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
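To see at a glance which redo log members the alert log complains about, the ORA-00312 lines can be pulled out with a few lines of Python. This is only a minimal sketch: the function name and the sample text are illustrative, and it assumes the alert log uses exactly the message layout shown above.

```python
import re

# ORA-00312 lines name the redo log member that Oracle could not open.
MEMBER_RE = re.compile(r"ORA-00312: online log (\d+) thread (\d+): '([^']+)'")


def reported_redo_members(alert_log_text):
    """Return (group, thread, path) for each ORA-00312 line, in order, without duplicates."""
    seen = []
    for match in MEMBER_RE.finditer(alert_log_text):
        entry = (int(match.group(1)), int(match.group(2)), match.group(3))
        if entry not in seen:
            seen.append(entry)
    return seen


# Sample copied from the alert log excerpt above.
SAMPLE = """\
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/u05/oradata/SOUK/redog01b_SOUK.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
ORA-00312: online log 1 thread 1: '/u05/oradata/SOUK/redog01a_SOUK.log'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
"""

for group, thread, path in reported_redo_members(SAMPLE):
    print(group, thread, path)
```

The resulting list tells you which files to look for in the restore output later on.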


In order to restore the redo log files, the first step is to avoid any additional writes to the filesystem. The more write operations occur, the smaller the chance of recovering those files. That is why every process that writes to this specific filesystem must be stopped.

oracle@slo02test:~/ [SOUK] sqh

SQL*Plus: Release 10.2.0.4.0 - Production on Fri Oct 23 00:08:36 2009
Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 – Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> shutdown abort

ORACLE instance shut down.


The filesystem needs to be unmounted to avoid any write access.

[root@slo02test ~]# umount /u05


Then an image copy of the filesystem can be made with the “dd” command, as shown below. “dd” fits our need perfectly in such a case.

[root@slo02test ~]# dd if=/dev/mapper/vgdata-lvdata of=/u99/copyU05
6291456+0 records in
6291456+0 records out
3221225472 bytes (3.2 GB) copied, 241.446 seconds, 13.3 MB/s
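The same block-by-block copy can be sketched in Python when you also want a checksum to confirm that the image matches what was read. This is an illustration only, not a replacement for dd: the function names and the 1 MiB block size are assumptions, and the example paths in the comment simply reuse the ones from the transcript above.

```python
import hashlib


def copy_image(source_path, image_path, block_size=1024 * 1024):
    """Copy source to image block by block; return (bytes_copied, sha256 hex digest)."""
    digest = hashlib.sha256()
    copied = 0
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
            digest.update(block)
            copied += len(block)
    return copied, digest.hexdigest()


def sha256_of(path, block_size=1024 * 1024):
    """Re-read a file and return its sha256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Hypothetical usage, run as root with the filesystem unmounted:
#   copied, digest = copy_image("/dev/mapper/vgdata-lvdata", "/u99/copyU05")
#   assert digest == sha256_of("/u99/copyU05")
```

Comparing the digest computed during the copy with a second pass over the image is a cheap check that the target disk stored what was written.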


Now we simply have to run ext3grep with the filesystem image as its parameter. Several options are provided, allowing a restore from a specific date, of a specific file, or simply of everything.

[root@slo02test u99]# ext3grep /u99/copyU05 --restore-all
Running ext3grep version 0.10.0
Number of groups: 24
Minimum / maximum journal block: 713 / 17115
Loading journal descriptors... sorting... done
The oldest inode block that is still in the journal, appears to be from 1257696360 = Sun Nov  8 17:06:00 2009
Number of descriptors in journal: 108; min / max sequence numbers: 3 / 34
Writing output to directory RESTORED_FILES/
Finding all blocks that might be directories.
D: block containing directory start, d: block containing more directory entries.
Each plus represents a directory start that references the same inode as a directory start that we found previously.

Searching group 0: DDD+D+++
Searching group 1:
Searching group 2:


Searching group 22:
Searching group 23:

Writing analysis so far to 'copyU05.ext3grep.stage1'. Delete that file if you want to do this stage again.

Result of stage one:
  4 inodes are referenced by one or more directory blocks, 4 of those inodes are still allocated.
  3 inodes are referenced by more than one directory block, 3 of those inodes are still allocated.
  0 blocks contain an extended directory.

Result of stage two:
  4 of those inodes could be resolved because they are still allocated.

All directory inodes are accounted for!

Writing analysis so far to 'copyU05.ext3grep.stage2'. Delete that file if you want to do this stage again.
Restoring oradata/SOUK/redog01a_SOUK.log
Restoring oradata/SOUK/redog01b_SOUK.log
Restoring oradata/SOUK/redog02a_SOUK.log
Restoring oradata/SOUK/redog02b_SOUK.log
Restoring oradata/SOUK/redog03a_SOUK.log
Restoring oradata/SOUK/redog03b_SOUK.log


Restored files are written to the “RESTORED_FILES” directory under the current path. Now we only have to copy them to the correct directory.

[root@slo02test ~]# ls -ls RESTORED_FILES/ -R

RESTORED_FILES/:
total 8
4 drwx------ 2 root root 4096 Oct 22 23:06 lost+found
4 drwxr-xr-x 3 root root 4096 Oct 22 23:09 oradata

RESTORED_FILES/lost+found:
total 0

RESTORED_FILES/oradata:
total 4
4 drwxr-xr-x 2 root root 4096 Oct 22 23:38 SOUK

RESTORED_FILES/oradata/SOUK:
total 82080
10260 -rw-r----- 1 root root 10486272 Oct 22 23:17 redog01a_SOUK.log

10260 -rw-r----- 1 root root 10486272 Nov 22 23:17 redog01b_SOUK.log
10260 -rw-r----- 1 root root 10486272 Oct 22 23:18 redog02a_SOUK.log
10260 -rw-r----- 1 root root 10486272 Oct 22 23:18 redog02b_SOUK.log
10260 -rw-r----- 1 root root 10486272 Oct 22 23:18 redog03a_SOUK.log
10260 -rw-r----- 1 root root 10486272 Oct 22 23:18 redog03b_SOUK.log

[root@slo02test ~]# mount /u05

[root@slo02test ~]# cp RESTORED_FILES/oradata/SOUK/redog0?a_SOUK.log /u05/oradata/SOUK/
[root@slo02test ~]# chown oracle.oinstall /u05/oradata/SOUK/ -R

Once the redo logs have been restored and copied to their original path, the database can be started. However, keep in mind that committed transactions could be lost, depending on your database and filesystem configuration!

oracle@slo02test:~/ [SOUK] sqh
SQL*Plus: Release 10.2.0.4.0 - Production on Thu Oct 22 23:43:41 2009

Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.

Connected to an idle instance.

SQL> startup

ORACLE instance started.

Total System Global Area 1073741824 bytes
Fixed Size                  1271588 bytes
Variable Size             264243420 bytes
Database Buffers          805306368 bytes
Redo Buffers                2920448 bytes
Database mounted.

Database opened.

 

As we can see, deleting redo log files, or any other file, does not necessarily mean losing transactions for good. Ext3grep can be used as an additional way to recover your database.

 


Conclusion

 

Several situations can call for media recovery: site disaster, block corruption, a corrupted FAT, etc. In case of hardware errors, the safest way to proceed is often to call a media recovery company. Dealing with physical damage to a hard drive with such tools can destroy any last hope of a successful recovery. However, if no physical damage is found, these tools can recover precious data.

 

The first thing to do when a media error occurs is to stop all activity on the media immediately. Only once all activity has stopped can we take the time to think about how to proceed. As shown in this article, the second step is generally to copy the failing media to a trusted medium. Working directly on the failing media could lead to losing valuable data for good. That’s why duplicating the data before any other operation is strongly recommended. Several tools, such as dd_rescue, provide duplication functionality.

 

Once the failing media has been duplicated, recovery can be done using the copy. Once again, a variety of tools can be used, depending on the filesystem and the data to recover.

Finally, once the data has been recovered, it may be worth taking a backup. Although those tools can get you out of seemingly hopeless situations and add one more way to recover files to the usual methods, testing backup and restore processes at regular intervals is probably the best way to avoid spending time on such recovery processes.

 

Gregory Steulet
Oracle Certified Professional 10G
MySQL Cluster 5.1 Certified
Avaloq Certified Professional

 

Trivadis SA
Rue Marterey 5
CH-1005 Lausanne
Tel: +41-21-321 47 00
Fax: +41-21-321 47 01
Internet:  www.trivadis.com
Mail: info@trivadis.com


 

Literature and Links…

http://www.cgsecurity.org/wiki/TestDisk

http://www.gnu.org/software/ddrescue/ddrescue.html

http://www.knoppix.org/

http://www.xs4all.nl/~carlo17/howto/undelete_ext3.html

http://foremost.sourceforge.net/

 

 

http://www.hiren.info/

http://www.krollontrack.com/

http://www.giis.co.in/

 

       

Got Interviewed

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at November 18, 2009 08:05 PM

by @botchagalupe
on Virtualization, Open Source tools and DNS Problems

Trackback URL for this post:

http://www.krisbuytaert.be/blog/trackback/960

Publican packages for Ubuntu karmic

Posted in Florian's blog by Florian Haas at November 18, 2009 01:57 PM


Andrew, whom I helped convince to switch to DocBook for writing Pacemaker documentation, pointed me to Publican, a DocBook publishing framework developed by tech writers at Red Hat.

Publican packages have been available in Debian squeeze for a while, but as of today are not included in Ubuntu. Thus, I built packages for karmic, which you may fetch by adding the following line to your /etc/apt/sources.list file:

deb http://people.linbit.com/~florian/ubuntu/publican karmic main

Then fetch my GnuPG key so you can verify the Release file’s integrity:

apt-key adv --keyserver pgp.mit.edu --recv-keys 7349B897BC759CF1

Now you can install with

aptitude update
aptitude install publican

… and then hack away.

Ubuntu folks (Andres, that means you!), can we have this in the upstream Ubuntu distros please?

LINBIT announces stewardship for Heartbeat code base

Posted in Florian's blog by Florian Haas at November 16, 2009 07:06 PM


Here is an announcement we made earlier today on the linux-ha and linux-ha-dev mailing lists.

This is to announce that LINBIT, with the kind permission from the Linux-HA project board, will act as the “steward” of the Heartbeat cluster messaging layer code base, from this point forward. This is a summary of our motivation and plans related to that role.

What does this entail?

  • LINBIT will assume responsibility for bug fixes for the Heartbeat code base, currently hosted at http://hg.linux-ha.org/dev/.
  • LINBIT will bundle up the 3.0 beta codebase, make a 3.0 final release (currently this is planned for the month of January 2010), and subsequently make bugfix releases as deemed necessary.
  • LINBIT will further collaborate with the Pacemaker project to keep the existing dual-stack capability in Pacemaker.
  • LINBIT will continue making the public Mercurial repository available at the present location (any eventual relocation, if desired by the Board, would be publicly announced with ample advance notice).
  • LINBIT will administer the public mailing lists (linux-ha and linux-ha-dev) on the servers currently hosting them (again, any eventual relocation would be publicly announced with ample advance notice).
  • LINBIT intends to offer improved documentation for the Heartbeat messaging layer. This is meant to consolidate the content currently found on the linux-ha.org wiki site.
  • LINBIT intends to offer support services for the Heartbeat/Pacemaker cluster stack (i.e. the Pacemaker cluster resource manager running on top of the Heartbeat cluster communication layer).
  • LINBIT will continue to respect the Board as the final authority on matters affecting the project as a whole.

What does this not entail?

  • LINBIT has no intention to add significant features to the Heartbeat code base, or extend its functionality significantly.
  • LINBIT has no intention to apply changes to the licensing, development model, or collaboration model for the Linux-HA code base.
  • LINBIT has no intention to establish the Heartbeat code base as a long-term alternative or competition to the OpenAIS/Corosync cluster messaging layer. However, we do believe that it is a valid alternative for the short to mid term, and for some configurations where OpenAIS/Corosync is currently suffering from some growing pains.
  • LINBIT has no intention to support or advocate continued use of Heartbeat in v1 (haresources) configurations. We will continue to recommend to switch to the Pacemaker cluster stack, now that two (technically and commercially) supported cluster messaging layers are available.

At this time, the primary contact in charge of Heartbeat development matters at LINBIT is Lars Ellenberg; the person in charge of documentation is myself. The best means of relaying comments and asking questions continues to be the public mailing list.

We hope that this is a useful service to the Heartbeat user community. I want to reiterate that we have no intention whatsoever to change the current, proven, community centric approach to how the Heartbeat code base is managed. We continue to welcome, and depend on, community suggestions, feedback, and collaboration. Heartbeat is a community project and will remain so.

If you have any questions about our intentions and plans, please post them on the mailing list, or peruse the comment fields below.

New Documentation Formats

Posted in The Cluster Guy at November 16, 2009 03:56 PM

I’m pleased to report that the core Pacemaker documentation is now available in PDF, HTML (chunked and single page) and even TXT formats.

The old Pages.app sources have been replaced with DocBook which allows them to be:

  • published in a variety of formats
  • kept under version control
  • included in the packages
  • updated by anybody

Additionally, we’re using Publican to produce the final result so supporting multiple languages should be now possible. Let us know if you’re interested in doing some translation :-)

The primary location for Pacemaker documentation will remain http://www.clusterlabs.org/wiki/Documentation; however, there is also an index of the generated documentation at http://www.clusterlabs.org/doc/ which includes the date and version from which it was generated.

Finding your MySQL High-Availability solution – Replication

Posted in MySQL Performance Blog » High Availability by yves at November 13, 2009 08:22 PM

In the last 2 blog posts about High Availability for MySQL we introduced definitions and provided a list of questions that you need to ask yourself before choosing an HA solution. In this new post, we will cover the most popular HA solution for MySQL: replication.

High Availability solution for MySQL: Replication

This HA solution is the easiest to implement and to manage. You basically need to set up MySQL replication between a master and one or more slaves. Upon failure of the master, one of the slaves is manually promoted to the master role and replication on the other slaves is re-pointed to the new master. This solution works well with all the MySQL storage engines including MyISAM (NDB is a special case discussed later), but it suffers from the limitations of MySQL replication. The main limitation, in terms of HA, is the asynchronous design of MySQL replication, which does not allow the master to be sure the slave has been updated before returning from a commit statement. There is a window of time in which a fully committed transaction may not yet have been pushed to the slave(s), leading to data loss. Many large websites that are fine with some data loss rely on replication for HA and for read scaling.

In addition to hardware failure, the level of availability of this solution is affected by the availability of the MySQL replication link between the servers. Replication often breaks for various reasons, and while replication is broken, there is no High Availability. The availability of this solution is also affected by how far the slaves were behind the master when the outage occurred. So, if you want a good level of availability, you need a good monitoring and alerting system to react quickly to replication issues, and you need a rather small write load so that the slaves do not lag too far behind the master. To maximize the level of availability, recovery should be automatic.
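The monitoring rule described here can be sketched as a small classifier over the columns returned by SHOW SLAVE STATUS. The column names (Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master) are the ones MySQL reports; the function name, the dict input and the 30-second threshold are assumptions made for illustration, and fetching the row from the server is left out.

```python
def replication_health(status, max_lag_seconds=30):
    """Classify one replica from a dict holding SHOW SLAVE STATUS columns.

    Returns "broken", "lagging" or "ok". The threshold is an arbitrary
    example value, not a recommendation.
    """
    if status.get("Slave_IO_Running") != "Yes" or status.get("Slave_SQL_Running") != "Yes":
        return "broken"
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL here when the lag cannot be determined.
        return "broken"
    if int(lag) > max_lag_seconds:
        return "lagging"
    return "ok"


healthy = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 2}
stalled = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "No", "Seconds_Behind_Master": None}
behind = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 600}

for name, row in (("healthy", healthy), ("stalled", stalled), ("behind", behind)):
    print(name, replication_health(row))
```

An alerting system would typically page on "broken" right away and on "lagging" only after it persists for a while.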

Apart from its simplicity, an HA solution based on replication has many interesting properties; no wonder it is so popular. First, if the application is well designed and has specific database handles for read and write operations, this HA solution can scale read operations to a high level. Using the slaves for reads has a second interesting side effect: the caches of the slaves are hot, so failing over to a slave means no degraded performance from cache warm-up. Finally, it is well known that with MySQL, altering a table means recreating the whole table, and it is a blocking operation. Altering a large table may take many hours. The trick here is to run the alter table on a slave and then, once it is done, let the slave catch up with the master using the new table schema, fail over to the slave, and repeat the alter table on the other server. Such online schema changes are easier when a master-to-master topology is used.

The following figure summarizes the simplest HA architecture using MySQL replication. All writes go to the master while reads are spread between the master and the slave. Upon failure of the master, replication is stopped on the slave and all traffic is redirected to the slave, which now handles reads and writes.

[Figure: HA replication]

Pros:

  • Simple
  • Inexpensive
  • All the servers can be used, no idle standby
  • Supports MyISAM
  • Caches on failover slave are not cold
  • Online schema changes
  • Low impact backups

Cons:

  • Variable level of availability (98-99.9+%)
  • Not suitable for high write loads
  • Read scaling only if application splits reads from writes
  • Can lose data
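To put the availability range quoted above into perspective, a percentage can be converted into allowed downtime per year. The sketch below assumes a 365-day year and uses illustrative function names; it only does the arithmetic and says nothing about how the percentages were measured.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def downtime_per_year(availability_percent):
    """Return the allowed downtime in seconds per year for a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100.0)


def format_duration(seconds):
    """Render a number of seconds as days, hours, minutes and seconds."""
    seconds = int(round(seconds))
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days}d {hours:02d}h {minutes:02d}m {secs:02d}s"


for pct in (98.0, 99.0, 99.9):
    print(f"{pct}% -> {format_duration(downtime_per_year(pct))} of downtime per year")
```

The gap between the two ends of the range is large: 98% allows roughly a week of downtime per year, while 99.9% allows under nine hours.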

Automatic failover with replication

I already mentioned that for best HA levels, failover or recovery should be automatic. There are tools to manage automatic failover with replication like MMM, Flipper and Tungsten. Here, I will quickly describe the most popular one, MMM.

With MMM, you need to add a separate server, the Manager, which, as the name implies, manages the availability of the MySQL service. A high availability solution based on MMM requires at least 2 MySQL servers configured in a master-to-master topology. Additional slaves can also be added. An MMM agent runs on each MySQL server and is used to perform OS-level operations. MMM's principle of operation is based on VIPs. There is one write VIP, where write operations are sent, and as many read VIPs as there are MySQL servers. For the write VIP, MMM monitors the state of the current master and, upon failure, tries to kill all the connections to the failing server and transfers the write VIP to the other master. For the read VIPs, MMM monitors the state of the slaves and removes the read VIP of a slave if it has failed or is lagging behind the master by more than a defined threshold. One of the main limitations of MMM is its lack of fencing capability. It is important to stop all the connections to the failing master, and if that server is not responding, maybe because of a network problem, a STONITH device must be used to fence it. I am far from being an expert with MMM (other guys on my team are way better than me), but I heard that the MMM v1 code base had some deficiencies. MMM v2 is a complete rewrite that addresses some of the shortcomings of v1. Walter Heck from OpenQuery gave an excellent webinar on it recently.

The architecture of a highly available setup using MMM and master-master replication is presented in the figure below. Apart from the minimum requirement of two MySQL servers replicating each other, there is a third server, called the manager, that controls both MySQL servers through an agent running on each server. The manager controls and monitors the state of the replication and assigns virtual IPs for specific roles. There is one VIP where write operations are sent and two or more VIPs where read operations are sent. If replication on one of the MySQL servers lags too far behind, its read VIP will be moved to another server.

[Figure: master-master]
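The VIP policy described in this section can be modelled in a few lines. This is a toy model for illustration only, not MMM's code or configuration: the function name, the node dictionary layout and the 60-second lag threshold are all assumptions, and fencing of the failed master is deliberately left out.

```python
def assign_vips(nodes, write_owner, max_lag_seconds=60):
    """Toy model of the VIP policy described above, not MMM's actual code.

    nodes: dict name -> {"alive": bool, "lag": seconds, "can_write": bool}
    write_owner: name of the node currently holding the write VIP
    Returns (new_write_owner, sorted list of nodes eligible for a read VIP).
    """
    owner = nodes.get(write_owner)
    if owner is None or not owner["alive"]:
        # Move the write VIP to another live master, if there is one.
        candidates = sorted(
            name for name, node in nodes.items()
            if node["alive"] and node["can_write"] and name != write_owner
        )
        write_owner = candidates[0] if candidates else None
    # Read VIPs stay only on live nodes that are not lagging too much.
    readers = sorted(
        name for name, node in nodes.items()
        if node["alive"] and node["lag"] <= max_lag_seconds
    )
    return write_owner, readers


cluster = {
    "db1": {"alive": False, "lag": 0, "can_write": True},    # failed master
    "db2": {"alive": True, "lag": 5, "can_write": True},     # second master
    "db3": {"alive": True, "lag": 300, "can_write": False},  # lagging slave
}
print(assign_vips(cluster, "db1"))
```

In this example the write VIP moves from db1 to db2, and db2 is the only node left that may serve reads until db3 catches up.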

In conclusion, replication can be used in many cases to build effective and scalable highly available solutions, but it has some limitations. In my next blog post, I’ll present another HA solution built around Heartbeat and DRBD.



Yet Another DNS Issue

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at November 12, 2009 09:01 PM

While browsing through my enormous mailing list backlog I ran into the following message from Gianluca Cecchi on the DRBD-user mailing list

guess I`ll have to give Lars a T-Shirt when we next meet ;)

  From: Gianluca Cecchi
  To: drbd-user@lists.linbit.com
  Subject: [DRBD-user] notes on 8.3.2


  - drbdadm create-md r0 segfaults when the command "hostname" on the
  server contains the fully qualified domain name but you have put only
  the hostname part in drbd.conf
  Instead, the command "drbdadm dump" correctly gives you a warning in
  this case (suggesting how to correct the error you made....):

  suppose complete hostname is virtfed.domainname.com and you put
  virtfed alone in drbd.conf
  [root@virtfed ~]# drbdadm dump
  WARN: no normal resources defined for this host (virtfed.domainname.com)!?

  while
  [root@virtfed ~]# drbdadm create-md r0
  Segmentation fault
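The mismatch reported in this message, a fully qualified system hostname against a short name in the configuration, is easy to check for before running any tool. The sketch below is a hypothetical helper, not part of DRBD: it takes the system hostname and the host names found in the configuration as plain strings and leaves the parsing of drbd.conf out entirely.

```python
def match_config_host(system_hostname, config_hosts):
    """Return (exact_match, hint) for a hostname against names from a config file.

    exact_match is the matching name or None; hint explains a near miss
    where only the short (pre-dot) part of the name agrees.
    """
    if system_hostname in config_hosts:
        return system_hostname, None
    short = system_hostname.split(".", 1)[0]
    for name in config_hosts:
        if name.split(".", 1)[0] == short:
            hint = (f"config uses '{name}' but the system reports "
                    f"'{system_hostname}'; make the two identical")
            return None, hint
    return None, f"no entry for '{system_hostname}'"


# Names taken from the message above; "othernode" is made up for the example.
print(match_config_host("virtfed.domainname.com", ["virtfed", "othernode"]))
```

On a live system the first argument would typically come from `socket.gethostname()`.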

Guess I`ll have to give the Linbit crowd a T-Shirt when we next meet ;)


Trackback URL for this post:

http://www.krisbuytaert.be/blog/trackback/958

Pacemaker 1.0.6 Released

Posted in The Cluster Guy at November 02, 2009 10:19 AM

The next installment of the Pacemaker 1.0 stable series is now ready for general consumption.

In addition to further polishing of the crm shell and CLI tools, this is the first release to support CoroSync (version 1.1.2 or greater is required).

The ”Pacemaker Explained” reference has also been converted to docbook and is included as part of the tarball (and pre-built packages if the relevant stylesheets are present at build time).

Pre-built packages for Pacemaker and its immediate dependencies will be available for openSUSE, SLES, Fedora, RHEL, CentOS from the openSUSE Build Service in the next couple of days, depending on how overloaded it is.

Debian users should check for updates in Martin’s repo over the coming days, and Ubuntu fans can visit LaunchPad for 8.04 and 9.10 packages.

The source tarball is also available directly from Mercurial.

General installation instructions are available from the ClusterLabs wiki.

Release Statistics

Changesets 185 
Diff 331 files changed, 13858 insertions(+), 3277 deletions(-)

Project Administrivia

We may switch to a bi-monthly release cycle. If you have any thoughts on this (for or against), please get in touch.

Changes of note since Pacemaker-1.0.5

  • High: cib: Correctly clean up when both plaintext and tls remote ports are requested
  • High: ais: Avoid excessive load by checking for dead children every 1s (instead of 100ms)
  • High: ais: Bug lf#2199 - Prevent expected-quorum-votes from being populated with garbage
  • High: ais: Bug rh#525589 - Prevent shutdown deadlocks when running on CoroSync
  • High: ais: Gracefully handle changes to the AIS nodeid
  • High: ais: Prevent deadlock - dont try to release IPC message if the connection failed
  • High: ais: Ubuntu needs a leading zero for directory modes
  • High: cib: For validation errors, send back the full CIB so the client can display the errors
  • High: cib: Prevent use-after-free for remote plaintext connections
  • High: cib: Repair the ability to connect to the cluster from non-cluster machines
  • High: Core: Bug lf#2169 - Allow dtd/schema validation to be disabled
  • High: crmd: Bug bnc#527530 - Wait for the transition to complete before leaving S_TRANSITION_ENGINE
  • High: crmd: Bug lf#2201 - Guard against possible cause of a segfault
  • High: crmd: Prevent use-after-free with LOG_DEBUG_3
  • High: Extras: Add sctp support to the controld RA
  • High: PE: Bug bnc#515172 - Provide better defaults for lt(e) and gt(e) comparisions
  • High: PE: Bug lf#2106 - Not all anonymous clone children are restarted after configuration change
  • High: PE: Bug lf#2170 - stop-all-resources option had no effect
  • High: PE: Bug lf#2171 - Prevent groups from starting if they depend on a complex resource which can’t
  • High: PE: Bug lf#2197 - Allow master instances placement to be influenced by colocation constraints
  • High: PE: Disable resource management if stonith-enabled=true and no stonith resources are defined
  • High: PE: Don’t include master score if it would prevent allocation
  • High: PE: Make sure promote/demote pseudo actions are created correctly
  • High: PE: Prevent target-role from promoting more than master-max instances
  • High: shell: Add allow-migrate as allowed meta-attribute (bnc#539968)
  • High: tools: bnc#547579,547582 - crm: status section editing support
  • High: Tools: crm: add semantic checks depending on the meta-data from resource agents
  • High: Tools: crm: improve processing of group edit and constraints
  • High: Tools: crm: improve the edit command
  • High: Tools: pingd - Fix a number of critical bugs (patch via Kazunori INOUE)
  • Med: xml: Mask the “symmetrical” attribute on rsc_colocation constraints (bnc#540672)
  • Medium (bnc#520707): Tools: crm: new templates ocfs2 and clvm
  • Medium (LF 2164): Tools: hb_report: expand the crm status command
  • Medium (LF 2184): Tools: crm: extend ptest command
  • Medium (LF 2185): Tools: crm: add resource promote/demote commands
  • Medium (LF 2198): Tools: crm: add node fence command
  • Medium: ais: Attempt to enable core file generation if it was disabled
  • Medium: ais: Include version details in plugin name
  • Medium: Build: Re-enable asciidoc documentation
  • Medium: Build: Shell templates arent documentation
  • Medium: cib: Remove delay for remote plaintext connections
  • Medium: Core: Disable syslog for any process that doesn’t want its arguments logged
  • Medium: crmd: Requery the resource metadata after every start operation
  • Medium: cts: add --benchmark for scalability tests
  • Medium: cts: Prepare for corosync testing
  • Medium: Extra: Include SNMP MIB file for crm_mon (from Michael Schwartzkopff)
  • Medium: PE: Bug lf#2178 - Indicate unmanaged clones
  • Medium: PE: Bug lf#2180 - Include node information for all failed ops
  • Medium: PE: Bug lf#2189 - Incorrect error message when unpacking simple ordering constraint
  • Medium: PE: Correctly log resources that would like to start but can’t
  • Medium: PE: Correctly log the state of orphaned clone instances
  • Medium: PE: If no migrate_(from|to) action is defined, look for migrate instead
  • Medium: PE: Only re-instate target-role if it is less than the calculated one
  • Medium: PE: Provide details for the maintenance-mode option
  • Medium: PE: Stop ptest from logging to syslog
  • Medium: Tools: attrd_updater - Suppress all logging with --quiet
  • Medium: Tools: crm: add extra flag to CibObject for invalid objects
  • Medium: Tools: crm: do return cached resources dom node
  • Medium: Tools: crm: expand template documentation
  • Medium: Tools: crm: first child of a removed parent inherits constraints
  • Medium: Tools: crm_attribute - Suppress all logging with --quiet
  • Medium: Tools: crm_shadow - log diffs to stdout instead of stderr
  • Medium: Tools: Use -q as the short form for --quiet (for consistency)

29 Oct 2009

Posted in Advogato blog for lmb at October 29, 2009 11:02 AM

Again a tip on how to write your OpenAIS/Pacemaker configuration in a simpler fashion; this applies to SUSE Linux Enterprise 11 High-Availability Extension too, of course.

For the full cluster functionality with OpenAIS/OCFS2/cLVM2 and an OCFS2 mount on top, you need to configure DLM, O2CB, cLVM2 clones, one to start the LVM2 volume group, and Filesystem resources to mount the file system. Add in all the dependencies needed, and you end up with a configuration pretty much like this (shown in CRM shell syntax, which is already much more concise than the raw XML):

primitive clvm ocf:lvm2:clvmd
primitive dlm ocf:pacemaker:controld
primitive o2cb ocf:ocfs2:o2cb
primitive ocfs2-2 ocf:heartbeat:Filesystem \
        params device="/dev/cluster-vg/ocfs2" directory="/ocfs2-2" fstype="ocfs2"
primitive vg1 ocf:heartbeat:LVM \
        params volgrpname="cluster-vg"
clone c-ocfs2-2 ocfs2-2 \
        meta target-role="Started" interleave="true"
clone clvm-clone clvm \
        meta target-role="Started" interleave="true" ordered="true"
clone dlm-clone dlm \
        meta interleave="true" ordered="true" target-role="Stopped"
clone o2cb-clone o2cb \
        meta target-role="Started" interleave="true" ordered="true"
clone vg1-clone vg1 \
        meta target-role="Started" interleave="true" ordered="true"
colocation colo-clvm inf: clvm-clone dlm-clone
colocation colo-o2cb inf: o2cb-clone dlm-clone
colocation colo-ocfs2-2 inf: c-ocfs2-2 o2cb-clone
colocation colo-ocfs2-2-vg1 inf: c-ocfs2-2 vg1-clone
colocation colo-vg1 inf: vg1-clone clvm-clone
order order-clvm inf: dlm-clone clvm-clone
order order-o2cb inf: dlm-clone o2cb-clone
order order-ocfs2-2 inf: o2cb-clone c-ocfs2-2
order order-ocfs2-2-vg1 inf: vg1-clone c-ocfs2-2
order order-vg1 inf: clvm-clone vg1-clone

That's quite a bite, and becomes cumbersome for every fs you add.

However, there is a little known feature - you can actually clone a resource group:

primitive clvm ocf:lvm2:clvmd
primitive dlm ocf:pacemaker:controld
primitive o2cb ocf:ocfs2:o2cb
primitive ocfs2-2 ocf:heartbeat:Filesystem \
        params device="/dev/cluster-vg/ocfs2" directory="/ocfs2-2" fstype="ocfs2"
primitive vg1 ocf:heartbeat:LVM \
        params volgrpname="cluster-vg"
group base-group dlm o2cb clvm vg1 ocfs2-2
clone base-clone base-group \
        meta interleave="true"

I think this speaks for itself; 20 lines of configuration reduced. You will also find that crm_mon output is much simpler and shorter, allowing you to see more of the cluster status in one go.
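One way to see why the group is so much shorter: a group implies that each member is colocated with, and started after, the member listed before it. The sketch below spells out those implied pairs in crm-shell-like syntax. It is an illustration under that assumption only; the constraint ids are made up, and the resulting linear chain is somewhat stricter than the hand-written constraints in the first listing.

```python
def expand_group(members):
    """List the pairwise constraints a group of resources implies."""
    constraints = []
    for previous, current in zip(members, members[1:]):
        # current must run where previous runs ...
        constraints.append(f"colocation colo-{current} inf: {current} {previous}")
        # ... and may only start after previous has started.
        constraints.append(f"order order-{current} inf: {previous} {current}")
    return constraints


for line in expand_group(["dlm", "o2cb", "clvm", "vg1", "ocfs2-2"]):
    print(line)
```

For the five members of base-group this yields eight constraint lines that the single group statement makes unnecessary.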

State of the art: Galera – synchronous replication for InnoDB

Posted in MySQL Performance Blog » High Availability by Vadim at October 27, 2009 03:08 PM

I first heard about Galera at the Percona Performance Conference 2009, where Seppo Jaakola presented “Galera: Multi-Master Synchronous MySQL Replication Clusters”. I was impressed, as I personally always wanted this for InnoDB, but we had it at the bottom of our list of plans because it is very hard to implement properly.
The idea by itself is not new; I remember synchronous replication being announced for SolidDB at the MySQL UC 2007, but the product was later killed by IBM.

A long time after PPC 2009, version mysql-galera-0.6 became available. It had a serious flaw: to set up a new node you had to take down the whole cluster. All this time Codership (the company that develops Galera) has been working on the 0.7 release, which introduces node propagation while keeping the cluster online. You can play with the 0.7pre release yourself: MySQL/Galera Release 0.7pre.

In the current version, propagation is done by mysqldump from one of the nodes (the “donor”). In the next release Codership is going to support LVM snapshots and xtrabackup, which will make setting up a new node even easier. The current annoyance I see is that if you shut down one node for a short period for quick maintenance, then after the restart the node has to load the whole mysqldump, as if it were a new, empty node. I hope the Codership guys will address this too.
Another thing I miss for now is support for the InnoDB plugin, which, as we know, performs much better than standard InnoDB.

So what is so interesting about Galera? A couple of things:

- High Availability. Any of the N standby nodes is available immediately when the main node fails. Galera is a serious contender for inclusion in the list Yves put together recently, http://www.mysqlperformanceblog.com/2009/10/16/finding-your-mysql-high-availability-solution-%e2%80%93-the-questions/. I am not sure how many nines it will provide :) , but the effort for test setup and deployment should be comparable to an MMM setup.

- Scale Writes. Galera allows you to write to any of the N nodes and automatically propagates the writes to the other nodes. It sounds too ideal, and there is a drawback: as the number of nodes you write to increases, your transaction rollback rate may increase, especially if you are working on the same dataset. You can find some results on Codership’s page, and I am going to run my own benchmarks as well. From the benchmarks you can also see that the communication overhead may be significant for short writes.

- Scale Reads. This can be done with regular replication, but with synchronous replication your “slave nodes” are all in the same state; there is no “slave behind”. When you read from any slave, you read current data. However, it also has a serious drawback: the cluster is only as fast as the “weakest” node in the chain. So if one node gets overloaded and its performance degrades, the same happens to the whole cluster.

- Heterogeneous-database replication. It is not here yet, and I do not know what is on Codership’s roadmap, but the group manager protocol in Galera is database independent, so it is only a matter of database drivers. For InnoDB it is currently a set of patches, and I think it is quite possible to do the same for Postgres. So a MySQL-Postgres cluster setup is not so far off :)

On its “Company page” Codership says its goal is “to promote and exploit the latest developments in computer science to produce fast and scalable synchronous replication solution that “just works” for databases and similar applications”, which I think they are succeeding in. Implementing a fast, scalable and working group communication and transaction manager is an art.

For now I would not put the 0.7 release into production, but you may seriously consider playing with it in a test environment and reporting bugs to the Codership team; they are very responsive.
I am waiting for the next releases and looking into integration with XtraDB.


Entry posted by Vadim | 11 comments


Nines , Damn Nines and More Nines

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at October 19, 2009 09:46 PM

Funny how different experiences lead to different evaluations of tools. The MySQL HA solutions that the MySQL Performance Blog lists are in almost exactly the opposite order of my impressions.

OK, agreed, I should probably not take my MySQL NDB experiences from 2-3 years ago, with multiple Queries of Death and more problems than you, into account anymore, but back then it went in the list “Less stable than a single node”. I've had NDB POC setups going down for much more than 05:16 minutes.
Ndb comes with a lot of restrictions, there are

As for MySQL on DRBD, I've said this before: I love DRBD, but having to wait for a long InnoDB recovery after a failover just kills your uptime.
I remember being called by a customer during Fred's last holiday who was waiting over 20 minutes for recovery, twice, so putting the DRBD/SAN setup second would not be my preference. But agreed, it's only listed at 99.9%, meaning almost 9 hours of downtime per year are allowed.

On the other hand, we've seen database uptime of MySQL MultiMaster setups with Heartbeat reaching better figures than 99.99%. Heck, I've seen single nodes achieve better than 99.99% :)

So what does this teach us? There is no golden rule for HA. Situations differ: the preferences of the customer, the size of the database, the kind of application, and much more. You always need to think and evaluate the environment.


Trackback URL for this post:

http://www.krisbuytaert.be/blog/trackback/951

Finding your MySQL High-Availability solution – The questions

Posted in MySQL Performance Blog » High Availability by yves at October 16, 2009 09:02 PM

Having reviewed the definitions in my previous post (The definitions), the next step is to answer some questions.

Do you need MySQL High-Availability?

That question is quite obvious, but sometimes it is skipped. It can also be formulated as “What is the downtime cost of the service?”. In the cost, you need to include lost revenue from the service, and you also need to consider less direct impacts such as damage to the corporate image and other marketing costs. If your downtime cost is under $10/h, you can stop reading this document; you don’t need HA. For the others, let’s move on!

How to determine which MySQL High-Availability solution is best?

What is really tricky with MySQL is the number of possible HA solutions. From the simplest to the most complex, let’s list the most common ones:

- MySQL replication with manual failover
- Master-Master with MMM manager
- Heartbeat/SAN
- Heartbeat/DRBD
- NDB Cluster

These technologies are far from one-size-fits-all, and many deployments use a combination of solutions. I will not cover ScaleDB and Continuent because I know almost nothing about these solutions. There are many more questions you need to ask yourself before being able to pick the right one. Below I list the most common questions; I might have missed some.

1. What level of HA do you need?

Since not all the technologies offer the same level of availability, this is an important first sorting factor. Here are estimates of the level of availability offered by the various solutions.

Solution                          Level of availability
Simple replication                98 to 99.9+%
Master-Master with MMM manager    99%
Heartbeat/SAN (depends on SAN)    99.5% to 99.9%
Heartbeat/DRBD                    99.9%
NDB Cluster                       99.999%

From the table, if your requirement is 99.99%, you are restricted to NDB Cluster, while if it is only 99% you have more options. Recall that the level of availability is hard to estimate and subject to debate. These are the usually accepted levels of availability for these technologies.
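The selection described above can be sketched in a few lines of Python. This is only an illustration: the dictionary and function names are made up for the example, and taking the optimistic upper end of each range from the table (and ignoring the "+" on simple replication) is an assumption, not part of the original estimates.

```python
# Upper end of each availability estimate from the table above, in percent.
# Using the optimistic end of each range is an assumption for illustration.
ESTIMATED_AVAILABILITY = {
    "Simple replication": 99.9,
    "Master-Master with MMM manager": 99.0,
    "Heartbeat/SAN": 99.9,
    "Heartbeat/DRBD": 99.9,
    "NDB Cluster": 99.999,
}


def candidates(required_percent):
    """Return the solutions whose estimated availability meets the requirement."""
    return [
        name
        for name, level in ESTIMATED_AVAILABILITY.items()
        if level >= required_percent
    ]


print(candidates(99.99))  # only NDB Cluster qualifies
print(candidates(99.0))   # every solution in the table qualifies
```

A stricter reading would use the lower end of each range, which shrinks the list of candidates for any given requirement.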

2. Can you afford to lose data?

Obviously, if you are concerned about loss of data, you are most likely using the InnoDB storage engine, since MyISAM is not transactional and does not sync data to disk. Similarly, MySQL replication is an asynchronous process, and although it is fairly fast at transferring data between the master and the slaves, there is a window of time in which data loss is possible.

If you can afford to lose some data, you can consider “MySQL replication” and “Master-Master with MMM manager”; otherwise, you can only consider the other three solutions.

Solution                          Data 100% safe
MySQL replication                 no
Master-Master with MMM manager    no
Heartbeat/SAN (depends on SAN)    yes
Heartbeat/DRBD                    yes
NDB Cluster                       yes

3. Does your application use MyISAM-only features?

There are some features like Full text indexes and GIS indexes that are supported only by MyISAM. The HA solutions that work well with MyISAM are “MySQL replication” and “Master-Master with MMM manager”. Depending on the application, the MyISAM Full text indexes might be replaced by another search engine like Sphinx in order to remove the restriction. There is no HA solution other than the ones based on replication that handles GIS indexes.

Requirement                              HA solutions
Need MyISAM Full text or GIS indexes     “MySQL replication” and “Master-Master with MMM manager”
Don’t use any special MyISAM feature     All
Can change MyISAM Full text to Sphinx    All

4. What is the write load?

The HA solutions we present are not equal in terms of their write capacity. Due to the way replication is implemented, only one thread on the slave can handle the write operations. If the replication master is a multi-core server and is writing heavily using multiple threads, the slaves will likely not be able to keep up. Replication is not the only technology that puts a strain on the write capacity: DRBD, a shared storage emulator for Linux, also reduces the write capacity of a database server by about 30% (very dependent on hardware). In terms of write capacity, here are your choices.

Solution                          Write capacity
MySQL replication                 Fair
Master-Master with MMM manager    Fair
Heartbeat/SAN (depends on SAN)    Excellent
Heartbeat/DRBD                    Good
NDB Cluster                       Excellent
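As a rough illustration of why a single-threaded slave falls behind a multi-threaded master, here is a back-of-the-envelope sketch in Python. The function names and all the rates are hypothetical; only the 30% DRBD figure comes from the text above, and it is itself described as very hardware dependent.

```python
def replication_lag_seconds(master_writes_per_s, slave_writes_per_s, elapsed_s):
    """Seconds of apply work the slave is behind after a sustained write load.

    Assumes constant rates; the slave only falls behind when the master
    writes faster than the single slave thread can apply.
    """
    backlog = max(0.0, master_writes_per_s - slave_writes_per_s) * elapsed_s
    return backlog / slave_writes_per_s


def drbd_write_capacity(baseline_writes_per_s, penalty=0.30):
    """Write capacity left after the roughly 30% DRBD overhead quoted above."""
    return baseline_writes_per_s * (1 - penalty)


# Hypothetical numbers: the master sustains 4000 writes/s over several
# threads, while the slave thread applies 1000 writes/s.
print(replication_lag_seconds(4000, 1000, 60))  # lag after one minute of load
print(drbd_write_capacity(1000))                # capacity left under DRBD
```

With these made-up rates, one minute of load leaves the slave three minutes behind, which is why the replication-based solutions are rated only "Fair" for write-heavy workloads.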

5. For what level of growth are you planning?

Since NDB Cluster is an integrated sharding environment, if you are planning for growth that will need sharding (splitting the database over multiple servers), then you might need to take a serious look at that solution. If not, then, apart from the write capacity, all the solutions are approximately equal.

6. How qualified is your staff or support company?

There is a quite direct relationship between the level of availability and the complexity of the solution. In order to reach the promised level of availability, the staff maintaining the HA setup, either internal or external, must have the required level of expertise. The required expertise level is summarized in the table below.

Solution                          Expertise level
MySQL replication                 Typical, average MySQL DBA + some Sysadmin skills
Master-Master with MMM manager    Good, average MySQL DBA + good Sysadmin skills
Heartbeat/SAN (depends on SAN)    High, Good MySQL DBA + strong Sysadmin skills
Heartbeat/DRBD                    High, Good MySQL DBA + strong Sysadmin skills
NDB Cluster                       Very high, Specific NDB knowledge, strong MySQL skills and strong Sysadmin skills

7. How deep are your pockets?

The last aspect that needs to be considered is the budget: complexity is expensive. We will consider two types of setup. The first is a basic proof of concept of the technology, with the hardware tested, the data imported, and basic testing and documentation. A proof of concept setup is a good way to get used to a technology and experiment with it in a test environment. The other type of setup is a full production setup that includes extensive testing, fire drills, full documentation, monitoring, alerting, backups, migration to production and post-migration monitoring. Of course, it is the safest way to migrate an HA solution to production. All the times here are estimates based on field experience; the values presented are fairly typical and contain some buffer for unexpected problems. Although an HA solution can be built remotely through a KVM over IP and adequate remote power management, an on-site intervention with physical access to the servers is the preferred way, especially for the most complex solutions.

Solution                          Proof of concept    Migration to Production
MySQL replication                 4 hours             12 hours
Master-Master with MMM manager    8 hours             24 hours
Heartbeat/SAN (depends on SAN)    32 hours            120 hours
Heartbeat/DRBD                    40 hours            120 hours
NDB Cluster                       40 hours            120 hours+

Editor’s Note: We’ve gotten many questions about the time estimates mentioned here. The above estimates shouldn’t be used to compare against any specific situation. Time will vary greatly depending on your project. For example, “setting up replication” can be as simple as CHANGE MASTER TO, and can take as little as a few minutes in some circumstances. Yves’s estimate is for a project to create a replication slave for HA purposes, not for “setting up replication.” There is a big difference between an HA project and a DBA task. – Baron Schwartz


Entry posted by yves | 32 comments


Heartbeat 2 OpenAIS

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at October 16, 2009 06:53 PM

While upgrading a pretty recent Heartbeat cluster to OpenAIS earlier today, I ran into the following weird situation:

  Last updated: Fri Oct 16 08:50:03 2009
  Stack: openais
  Current DC: CO_NMS-1 - partition with quorum
  Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
  4 Nodes configured, 2 expected votes
  1 Resources configured.
  ============

  Online: [ CO_NMS-1 CO_NMS-2 ]
  OFFLINE: [ co_nms-1 co_nms-2 ]

or

  crm(live)node# show
  co_nms-1(5c48ab4f-767f-e2dc-20ec-5969cddad152): normal
  co_nms-2(922ff786-eca9-bed0-d79d-8222727a2c5b): normal
  CO_NMS-1: normal
  CO_NMS-2: normal

Whohoo.. OpenAIS must have realized I have uppercase and lowercase cores :)

Funny to see .. but quickly solved..
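
The output above shows the same two machines registered twice, once per spelling. A small Python sketch can flag node names that differ only by case before they end up in a cluster configuration. It is not part of Pacemaker or the crm shell; the function name is made up for the example, and the node list is simply copied from the output above.

```python
from collections import defaultdict


def case_variant_groups(node_names):
    """Group node names that are identical except for letter case."""
    groups = defaultdict(list)
    for name in node_names:
        groups[name.lower()].append(name)
    # Keep only the groups that contain more than one distinct spelling.
    return {key: names for key, names in groups.items() if len(set(names)) > 1}


nodes = ["CO_NMS-1", "CO_NMS-2", "co_nms-1", "co_nms-2"]
for key, names in case_variant_groups(nodes).items():
    print(f"{key}: {names}")
```

Running it over the four names from the status output reports two groups, one per physical machine.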

Trackback URL for this post:

http://www.krisbuytaert.be/blog/trackback/950

Monitoring MySQL

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at October 11, 2009 06:18 PM

Ronald Bradford wants to know what kind of Monitoring you use..
He specifically wants to know about Alerting tools

There are different cases. Looking at it from a full infrastructure point of view, my current favourite is Zabbix or good old Nagios.

But when looking at it from a debugging perspective you have MySQLAR or Hyperic, but those aren't in the alerting list.

However, when you are building HA clusters, you have custom scripts running either from mon or from pacemaker ..

Still .. Ronald probably wants more input :)


Trackback URL for this post:

http://www.krisbuytaert.be/blog/trackback/945

Finding your MySQL High-Availability solution – The definitions

Posted in MySQL Performance Blog » High Availability by yves at October 09, 2009 11:03 PM

As my first contribution to the MySQL Performance Blog (I joined Percona at the beginning of September), I chose to cover the various high-availability (HA) options available for MySQL. I have done dozens of MySQL HA related engagements while working for Sun/MySQL over the last couple of years using Heartbeat, DRBD and NDB Cluster, and I’ll probably be doing the same at Percona. I built my first DRBD-based HA solution nearly 10 years ago.

There is quite a lot of confusion surrounding HA solutions for MySQL. I will try to present them objectively, my goal here being not to sell any specific technology but to help people choose the right one for their needs. This post is the first of a series; I don’t yet know how many I will write.

Before we start, it must be stated that high-availability is not only a matter of technical solutions; good management practices covering monitoring, alerting, security and documentation are also needed to ensure a successful solution. In other words, no solution is foolproof: if a high-availability solution runs in recovery mode for months without anybody caring about it, the risk of a complete failure is much higher.

In order to all be on the same page, I will first give some definitions of the key terms.  I don’t pretend those definitions are perfect but let’s build on them.

High-Availability

Let’s first define what is meant by high-availability. The most general definition would be that a high-availability setup is a special computer architecture designed to improve the availability of a computer service, like a MySQL database. High-availability, HA for short, introduces a wealth of peculiar concepts; we will first review the main ones.

Uptime/Downtime

Uptime means the service is available even if degraded as long as it is above some defined performance threshold. Downtime means the opposite, either the service is completely down or unresponsive according to the defined performance threshold.  In many cases, people don’t define a performance threshold, it is basically the service monitoring frequency and timeout that fix it.

Level of Availability

The level of availability is basically the guaranteed percentage of uptime you will get over a year. It has always been a subject of debate and is hard to evaluate since, most of the time, the samples are small and the conditions of the deployments are not easily controlled. See the level of availability as the availability you, as the operator of the service, can promise in a worst-case scenario. For example, 98% availability means a downtime of a little more than 7 days per year. The cost is approximately an exponential function of the level of availability and has to be compared with the downtime cost. While an HA setup with a level of availability of 99% is fairly simple and affordable, moving to 99.9% and 99.99% can be much more expensive and complex. Also, you need to consider environmental factors. If your ISP cannot guarantee you a level of availability of more than 99.9% for Internet access, it is useless to go beyond that, no matter the importance of the application.
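
To make the percentages concrete, here is a minimal Python sketch that converts a level of availability into the downtime it allows per year. The function and constant names are made up for the example, and it assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year allowed by a given level of availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for level in (98, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(level)
    print(f"{level}% -> {hours:.2f} hours ({hours / 24:.2f} days) per year")
```

For 98% this gives about 175 hours, or 7.3 days, which matches the "little more than 7 days" quoted above. Each additional nine divides the allowed downtime by ten.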

Single point of failure (SPOF)

Single points of failure are the things you are looking to remove when you build an HA solution. Basically, they are the devices or components whose unavailability brings the service down. A data center can be considered a SPOF at a high enough level of availability. Usually the more SPOFs you consider, the higher the availability of your solution and the higher its cost.

Recovery/failover

Recovery (or failover) is the process by which an HA setup recovers from a failure. During the recovery time, the service is down. With the simplest solutions it can be a manual process, but most of the time it is automatic. Also, there is a time associated with the recovery. If a failure happens during the night and the operator is only available from 8am to 5pm, you might have a recovery time of more than 12 hours. The more complex solutions have automatic recovery and do not need human intervention. Once again, although there are some exceptions, faster and automatic recovery usually means higher costs.
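
The manual-recovery example can be worked through with a short Python sketch. The function name is hypothetical, and the sketch assumes the operator works every day from 8am to 5pm and ignores weekends and the time the repair itself takes.

```python
def hours_until_operator(failure_hour, start=8, end=17):
    """Hours to wait before an operator is on duty; 0 if one already is.

    failure_hour is the hour of day (0 to 23) at which the failure occurs.
    """
    if start <= failure_hour < end:
        return 0
    return (start - failure_hour) % 24


print(hours_until_operator(19))  # failure at 7pm
print(hours_until_operator(10))  # failure during office hours
```

A failure at 7pm waits 13 hours before anyone can even start the recovery, which is how a manual process ends up with a recovery time of more than 12 hours.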

Cluster

A bunch of servers used for the same task.  In our case, dedicated to high availability of the MySQL database service.

With these common definitions, we will then be able to move to the second step, the questions.


Entry posted by yves | 9 comments


Why learn to type ?

Posted in Everything is a Freaking DNS problem - ha by Kris Buytaert at October 09, 2009 10:06 AM

When your machine knows what you mean ..

  [s3p-root@XMS-1 tomcat6]# crm configure
  crm(live)configure# bye
  [s3p-root@XMS-1 tomcat6]# crm confiure
  crm(live)configure# bye
  [s3p-root@XMS-1 tomcat6]# crm confiture
  crm(live)configure# bye
  [s3p-root@XMS-1 tomcat6]#

I'd better

  apt-get install coffee


Trackback URL for this post:

http://www.krisbuytaert.be/blog/trackback/944

Advisory: Don't use Pacemaker on Corosync (yet)

Posted in The Cluster Guy at October 06, 2009 02:20 PM

I spent some time looking into the state of the Pacemaker/Corosync integration today and I can only recommend Pacemaker users stay on the previous version of OpenAIS (aka. Whitetank).

In a nutshell, shutdown is utterly broken.

r2140 of Corosync removed the shutdown worker thread which allowed plugins such as Pacemaker to continue sending and receiving cluster messages.
Without it, Corosync waits for Pacemaker to finish and Pacemaker waits for the messages it tried to send to arrive and be acted upon. Needless to say no-one makes any progress.

Stay tuned, now that integration testing has started it shouldn’t take too long to get everything sorted out.

Update

Since writing this, the necessary testing has been done and Pacemaker is now supported on Corosync provided you have corosync >= 1.1.2 and pacemaker >= 1.0.6

GitHub: Speed matters

Posted in Anchor Web Hosting Blog » drbd by bsmith at September 29, 2009 06:39 AM

Impressions from the first article (in its first day) and the first 24 hours of the GitHub migration have led us at Anchor to believe that:

  1. GitHub is just as popular as we thought,
  2. The migration was worth it, as things are running much faster (just check your twitter feeds, or better yet, check your GitHub source tree for no reason ;) ); and,
  3. People are interested in what has gone under the hood of the new GitHub (insert your favorite fast car here; otherwise let’s say a roadster).

Taking these three things into account, this installment will discuss why things are so much faster post migration compared to prior.

I said ‘faster’ and not ‘fast’, because GitHub is now as fast as any website should be. So in comparison, yes, GitHub is fast now; however, it is akin to riding your bicycle with half-inflated tires: when they are fully inflated, suddenly your old bike is blazing fast. This is not to be critical of the former architecture, which had its merits when GitHub was founded. GitHub had simply reached a stage where an infrastructure architecture refresh was logical.

The main thing, in the large, that made this new architecture fast was that we were given a blank slate and large amounts of freedom to make an architecture that would do the job well. This is an incredibly rare thing, and it no doubt took a lot of courage on GitHub’s part. For that, we have to say “thank you” to the GitHub team for letting us have that freedom. I like to think that we’ve repaid that trust with a pretty awesome architecture that will serve them well for some time to come.

SCALE: When looking at the new architecture as a whole, the increased scale is immediately evident. GitHub now consumes far more hardware than ever before:

Old Infrastructure:

  • 10 VMs
  • 39 VCPUs
  • 54GB RAM

New Infrastructure:

  • 16 physical machines
  • 128 physical cores
  • 288GB RAM

Or for those who enjoy visual cues:

[Figure: resource comparison, old to new infrastructure]

It is a credit to the old infrastructure and GitHub’s code that it ran so well on so little (in comparison). The first credit for increased performance is increased scale.
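
For a rough sense of the increase in scale, the figures listed above can be turned into growth factors with a few lines of Python. The variable names are made up for the example, and the comparison is only approximate: VMs and VCPUs on the old side are not strictly equivalent to physical machines and cores on the new side.

```python
# Figures from the two lists above (old: VMs and VCPUs; new: physical).
old = {"machines": 10, "cpus": 39, "ram_gb": 54}
new = {"machines": 16, "cpus": 128, "ram_gb": 288}

# Growth factor of each resource, new relative to old.
growth = {resource: new[resource] / old[resource] for resource in old}

for resource, factor in growth.items():
    print(f"{resource}: {factor:.1f}x")
# machines: 1.6x
# cpus: 3.3x
# ram_gb: 5.3x
```

So the machine count grew modestly, while CPU and memory grew by roughly three and five times respectively.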

An important note regarding the hardware is that there is nothing special (or industry secretive) regarding it. The solution in its entirety is run from commodity hardware. No special black boxes doing scary things with packets and routes. No appliance servers. The solution architecture developed by Anchor can be used with any hardware vendor (insert: Dell, HP, IBM, SuperMicro, etc). Vendor neutrality provides GitHub with no encumbrance with either scaling up or out, a key issue when considering growth and future flexibility.

Note: The architecture’s flexibility allows for the user repository storage to be expanded with a mix of vendor hardware (should GitHub ever change hardware vendor). Furthermore, any component can be exchanged for another vendor’s hardware with no change to GitHub’s architecture or software.

In a nutshell, the increased scale provides:

  • More GitHub front-end servers to service your requests;
  • More storage; and
  • More I/O bandwidth when working with your repository data

HARDWARE PERFORMANCE: The speed specifications of the underlying components are important, in addition to how that hardware is utilised.

Storage I/O: A common factor in poor performance with any solution is an I/O bottleneck at the storage level.  This pain was GitHub’s. To alleviate this, not only is the storage now distributed across several servers (distributing the I/O), but it is now running on direct-attached 15,000 RPM SAS disks on battery-backed hardware RAID. Therefore, the second credit for increased performance is faster storage.

Direct access to hardware: Virtualisation is great. What isn’t great is when virtualisation is used as a universal solution. At Anchor we believe there is a place for virtualisation, and systems with massive I/O or CPU requirements are not that place. By moving resource-heavy systems onto dedicated hardware, any contention for resources between individual VMs is removed. The third credit goes to less overhead.

ARCHITECTURE: Throwing hardware at a scaling problem is an easy solution, but without the right division of resources and the right software to properly use it, it’s not going to run real fast.

For GitHub, this was their innovative Git command proxying systems, which do an excellent job of taking requests from the frontends (where users connect with their web browser, git client, or SSH client) and shipping them to the fileservers.  The database structure, filesystem layout, and code efficiency also contribute to this.

Given that the software isn’t our speciality, there’s not a lot for us to say about this, but GitHub are planning a series of posts on their blog, and I’m quite sure it’ll be enlightening.

TO REVIEW: The factors involved in GitHub’s faster response on the new infrastructure include (but are not limited to):

  • Increased Infrastructure (Scale)
  • Faster Hardware (Storage)
  • No resource contention (More resources per server)
  • Solid, scalable architecture (Awesomeness)

Keep an eye on this space as we delve into technology-specific posts on which of the 11 herbs and spices Anchor used to realise the new GitHub architecture.


Clusters From Scratch

Posted in The Cluster Guy at September 21, 2009 08:57 AM

The first of a new series of step-by-step guides for Pacemaker.

This installment covers installation, the creation of an active/passive cluster and its conversion to active/active.

Technologies used include:

  • Fedora 11 as the host operating system
  • OpenAIS to provide messaging and membership services,
  • Pacemaker to perform resource management,
  • DRBD as a cost-effective alternative to shared storage,
  • OCFS2 as the cluster filesystem (in active/active mode)
  • The crm shell for displaying the configuration and making changes
  • Apache as the example service.

The PDF is available from our Documentation page or directly via http://www.clusterlabs.org/mediawiki/images/9/9d/Clusters_from_Scratch_-_Apache_on_Fedora11.pdf

Future guides are anticipated to include MySQL, mail servers and asymmetrical clusters. Feedback and suggestions for additional topics are welcome.

Version Control Prompt

Posted in The Cluster Guy at September 21, 2009 07:41 AM

I find it convenient to include current SCM data before my regular Bash prompt (reduces the chance of “accidents”). Perhaps someone else will find it useful too.

function prompt-pre-exec() {
    scm=""
    repo_root=$(hg root 2>/dev/null)
    if [ -e CVS ]; then
        scm=":: cvs ::"

    elif [ -e .svn ]; then
        scm=":: svn : ${prompt_hl}r$(svn info | grep Revision | sed s/.*:\ //)${prompt_n} ($(svn info | grep Date | sed s/.*\(\//)"

    elif [ -e .gitignore ]; then
        repo_branch=`git branch --no-color 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/\1#/'`
        scm=`git show --pretty="format: : git : ${prompt_hl}${repo_branch}%h${prompt_n} : %an, %cr\n:: %s\n" | head -n 2`

    elif [ x != "x$repo_root" ]; then
        repo_cs=$(hg id -i)
        scm=`hg log --template " : hg : ${repo_root##*/} : ${prompt_hl}${repo_cs}${prompt_n} {tags} : {author|user}, {date|age} ago\n:: {desc|firstline|strip}\n" -r ${repo_cs%%+}`
    fi

    if [ "x$scm" != x ]; then
        # Trailing \n characters don't seem to be expanded
        scm="$scm
"
    fi
    export scm    
}

if [ x"$-" = "xhimBH" ]; then
  # Execute the following function before displaying the prompt
  export PROMPT_COMMAND='prompt-pre-exec'

  # Use \[ and \] to exclude the color code from the line wrapping calculations 
  export PS1='${scm}[\@] \u@\h \[${prompt_hl}\]\w #\[${prompt_n}\] '
fi

Then to add color, simply define prompt_hl and prompt_n. I use

export prompt_n="^[^E[00m^]"      # Default color
export prompt_hl="^[^E[01;32m^]"  # Highlight codes

To enter ^[ in emacs, type Ctrl-q then Ctrl-[. Likewise ^E is Ctrl-q Ctrl-e.

LINBIT mount their bikes to support Butterfly Children

Posted in Florian's blog by Florian Haas at September 17, 2009 09:20 AM


A completely non-technical post for a change.

Last weekend, seven employees of LINBIT’s European division participated in the World Games of Mountain Biking in Saalbach-Hinterglemm, Austria. We took part as Marathon race participants and co-sponsors of Biking for Butterfly Children, a charity dedicated to the fight against epidermolysis bullosa (EB).

Currently, no cure for any of the over 30 subtypes of EB exists; dermatologists and care givers can, however, greatly improve patients’ quality of life. Still, EB can be an excruciatingly painful, disfiguring, and debilitating disease. It affects one in 20,000 live births, which makes it an orphan disease. The health care industry has little incentive to research the condition (as there is little money to be made off of it), and those who dedicate their careers to EB patient care and research rely on charitable donations for funding. Biking for Butterfly Children acts as a reliable fund raiser rounding up much-needed donations in the course of amateur cycling events (such as the World Games).

During this year’s World Games, B4BC raised a total of about 7,000 euros in donations — a respectable sum for an all-amateur event, but a lot more money is needed to improve the quality of life of EB patients, and potentially discover a cure to the disease. If you consider joining the fight against EB, please contact your local DebRA chapter.