You will run into the CAP theorem: you cannot have consistency, availability, and partition tolerance all at the same time.
DRBD / MySQL HA relies on synchronous replication at the block-device level. This is fine while both nodes are available, or while one suffers a temporary fault, is rebooted and so on, then comes back. The problems start when you get a network partition.
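For reference, "synchronous" here means DRBD protocol C, where a write is only acknowledged once the peer has it too. A minimal sketch of such a resource (hostnames, devices and addresses are made up for illustration):

    # /etc/drbd.conf -- fully synchronous mirroring (protocol C)
    resource r0 {
        protocol C;                 # ack writes only after the peer has them
        on nodeA {
            device    /dev/drbd0;
            disk      /dev/sda7;
            address   10.0.0.1:7788;
            meta-disk internal;
        }
        on nodeB {
            device    /dev/drbd0;
            disk      /dev/sda7;
            address   10.0.0.2:7788;
            meta-disk internal;
        }
    }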
Network partitions are extremely likely when you're running across two datacentres. Essentially, neither node can distinguish a partition from a failure of the other node: the secondary doesn't know whether it should take over (because the primary has failed) or hold off (because only the link is gone).
While your machines are in the same location, you can work around this by adding a secondary channel of communication (typically a serial cable, or crossover ethernet), so the secondary knows when the primary is GENUINELY down and not merely unreachable across a partitioned network.
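Heartbeat, for instance, lets you declare several independent heartbeat paths, so the peer is only declared dead when all of them go quiet. A fragment along these lines (device and interface names are assumptions):

    # /etc/ha.d/ha.cf -- two independent heartbeat paths
    serial /dev/ttyS0      # null-modem cable between the nodes
    baud   19200
    bcast  eth1            # crossover ethernet as the second path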
The next problem is performance. While DRBD can give decent** performance when your machines have a low-latency connection (e.g. gigabit ethernet - though some people use dedicated high-speed networks), the more latency the network has, the longer it takes to commit a transaction***. This is because DRBD needs to wait for the secondary server (when it's online) to acknowledge every write before saying "OK" to the application, to ensure durability of writes.
If you do this across different datacentres, you typically add several more milliseconds of latency, even if they are close by.
** Still much slower than a decent local IO controller
*** You cannot use MyISAM for a highly available DRBD system, because it doesn't recover properly/automatically from an unclean shutdown, which is required during a failover.
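To put rough numbers on it: with fully synchronous replication, every commit costs at least one network round trip, so a 5 ms round-trip time between datacentres caps a single connection at roughly 1000 / 5 = 200 commits per second, no matter how fast the disks are.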
For what it is worth, I feel your pain. It seems that heartbeat treats the loss of the passive node the same as a takeover by the passive node, so it starts its services. When the start scripts failed and there was no other node to fail over to, heartbeat stayed primary and shut down all the services. The only way to get back up again when this happens is to restart heartbeat.
We dealt with this problem by making a single script that starts all of the cluster services (IP, FS mount, ipvsadm, Apache, etc.) only if they are not already running. We made sure the "all in one" init script returns non-zero only for actual startup failures (and not for warnings like "already running") to avoid problems like this.
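For illustration, a minimal sketch of the idea (the service names and paths are assumptions, not our exact script):

    #!/bin/sh
    # Hypothetical /etc/init.d/cluster-services: start each service only
    # if it is not already running, and treat "already running" as success.

    start_if_needed() {
        svc="$1"
        if "/etc/init.d/$svc" status >/dev/null 2>&1; then
            echo "$svc already running"         # a warning, not a failure
        else
            "/etc/init.d/$svc" start || exit 1  # real failure -> non-zero
        fi
    }

    case "$1" in
        start)
            start_if_needed ipvsadm
            start_if_needed apache2
            ;;
        stop)
            /etc/init.d/apache2 stop
            /etc/init.d/ipvsadm stop
            ;;
    esac
    exit 0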
Best Answer
I assume you're interested in a simple active-passive setup.
ucarp & heartbeat in such a setup do pretty much the same thing. In essence, they run the provided scripts when a machine is elected to be master / hot-standby.
heartbeat might look much more complicated [ since it can help you automate DRBD mounts, restarting multiple services etc ] but in the end you can script all of this yourself and let ucarp invoke it; see the sketch below.
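For example, an invocation along these lines (the addresses, vhid, password and script paths are all assumptions) makes ucarp run your scripts on every master/backup transition:

    # elect a master for the shared address 10.0.1.1; run the given
    # scripts whenever this node gains or loses master status
    ucarp --interface=eth1 --srcip=10.0.1.2 --vhid=1 --pass=secret \
          --addr=10.0.1.1 \
          --upscript=/usr/local/sbin/vip-up.sh \
          --downscript=/usr/local/sbin/vip-down.sh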
Personally, I run heartbeat with a single resource: a script that does all the work on failover.
My very simplistic setup [ heartbeat 2.1.3-6 under debian lenny ]: I have two servers, ser0 and ser0b.
The 'floating ip' assigned to the active node is 10.0.1.1/24, on eth1.
In this case the service that gets high availability is apache. I separately sync apache's configs and the served content from ser0 to ser0b, roughly as sketched below.
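Something as simple as a periodic rsync from the active node would do it (the paths are assumptions):

    # on ser0: push apache config and served content to the standby
    rsync -az --delete /etc/apache2/ ser0b:/etc/apache2/
    rsync -az --delete /var/www/     ser0b:/var/www/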
The files below are identical on both machines, with one marked exception:
/etc/ha.d/authkeys:
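Something along these lines; the shared secret is a placeholder, and the file must be owned by root with mode 600:

    auth 1
    1 sha1 SomeSharedSecret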
/etc/ha.d/haresources:
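A plausible single line, taking the script name below at face value (IPaddr is a stock heartbeat resource; ser0 is assumed to be the preferred node):

    # preferred-node  resources
    ser0 IPaddr::10.0.1.1/24/eth1 ha.cf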
/etc/ha.d/ha.cf:
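A minimal version might look like this; the timings are assumptions, and if you used ucast instead of bcast, the peer address on each machine would be the obvious per-machine exception:

    logfacility local0
    keepalive 1        # seconds between heartbeats
    warntime 5
    deadtime 10        # declare the peer dead after 10s of silence
    initdead 30        # extra allowance while booting
    bcast eth1
    auto_failback on
    node ser0
    node ser0b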
/etc/init.d/ha.cf [ it can just as well live in /etc/ha.d/resource.d/ha.cf ]:
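A bare-bones sketch of what that resource script could contain; heartbeat calls it with "start" on the node that becomes active and "stop" on the node that gives the resource up (apache2 matching the example service above):

    #!/bin/sh
    # invoked by heartbeat on master election / resource release
    case "$1" in
        start)  /etc/init.d/apache2 start  ;;
        stop)   /etc/init.d/apache2 stop   ;;
        status) /etc/init.d/apache2 status ;;
    esac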