Simple clustering for FreeBSD v0.1
----------------------------------

This is a simple clustering system for FreeBSD. It does load balancing
and has basic fault tolerance provisions.

How it works: load balancing
----------------------------

The load balancing functionality is achieved by selecting which host
services a packet based on its source IP address. This is done using
some slight modifications to the FreeBSD kernel (ARP and IP code) to
accomplish two things.

First, the ARP code is modified in two ways. It is told to ignore IP
address conflicts detected through ARP for flagged addresses, and ARP
responses for flagged addresses are handled by looking the address up
in the system ARP table rather than using the ethernet address of the
related card.

Second, ip_input was modified to drop the multicast flag on incoming
packets to flagged aliases.

To support the flagging, in.c was modified to allow you to set flags on
the aliases. When the IA_SRCSELECT flag is set, it will also enable the
reception of (all) multicast packets on that interface.

These two modifications allow you to set up the same IP alias address
(flagged for source selection) on multiple machines. Additionally, you
can publish a multicast ethernet MAC address via each system's ARP
table so an ARP lookup will yield a MAC address that will get the
packet to each system. IP firewall rules are used to filter which
packets are processed at each host.

How it works: fault tolerance
-----------------------------

I wrote a simple clustering daemon to create fault tolerance. The
clustering system is completely distributed, provides automatic
failover and rejoining capabilities, and makes a basic attempt at
detecting and evicting cluster members that are dying off and then
rejoining repeatedly. The clustering system is quite basic and is based
on simple UDP communication between hosts. Repetition is used to try to
avoid reliability problems and should be adequate on a local ethernet.
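
The idea of repeating UDP messages to cope with loss can be sketched as
follows. This is a Python illustration, not clusterd's code: the wire
format, the field names, the repeat count, and the use of a sequence
number to discard repeats are all assumptions made for the example.

```python
import struct

# Hypothetical wire format: cluster id, host id, sequence number, status code.
# None of these field names or sizes come from clusterd itself.
MESSAGE = struct.Struct("!HHIB")
REPEAT_COUNT = 3  # assumed; the README does not say how often a message is repeated


def encode(cluster_id, host_id, seq, status):
    """Pack one announcement into a fixed-size datagram payload."""
    return MESSAGE.pack(cluster_id, host_id, seq, status)


def send_repeated(send, payload, repeat=REPEAT_COUNT):
    """Hand the same payload to `send` several times so one lost copy is harmless."""
    for _ in range(repeat):
        send(payload)


class Receiver:
    """Accepts each (host, sequence) pair once; drops repeats and foreign clusters."""

    def __init__(self, cluster_id):
        self.cluster_id = cluster_id
        self.last_seq = {}  # host_id -> highest sequence number seen so far

    def accept(self, payload):
        cluster_id, host_id, seq, status = MESSAGE.unpack(payload)
        if cluster_id != self.cluster_id:
            return None  # another cluster sharing the same wire
        if seq <= self.last_seq.get(host_id, -1):
            return None  # a repeat (or stale copy) of something already handled
        self.last_seq[host_id] = seq
        return host_id, seq, status


if __name__ == "__main__":
    # Simulate a lossy wire: the first copy is dropped, the rest arrive.
    wire = []
    send_repeated(wire.append, encode(1, 0, 7, 2))
    receiver = Receiver(cluster_id=1)
    print([receiver.accept(p) for p in wire[1:]])  # only the first surviving copy is acted on
```

In a real daemon, `send` would wrap a sendto() call on a UDP socket with
broadcast enabled. The sketch takes a callable instead so that it can be
exercised without touching the network.
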

The clustering system has virtually no security features built in. It
needs firewall protection to prevent external folks from mucking with
it. Ideally it would be rewritten to have its own security, but not by
me :).

What is required?
-----------------

The patches are from FreeBSD 3.1-RELEASE, so you'll need a box to which
the patches will apply. I'd guess that they will apply to any
relatively recent version of FreeBSD without much trouble if you don't
have a 3.1 box handy.

After applying the patches, you'll need to rebuild, and include IPFW in
your kernel (I'd recommend that you use the IPFW option that allows
everything by default if you don't have other plans for it--just to
make things a little easier).

You need to copy the modified header files (net/if.h, netinet/in_var.h)
to /usr/include/[net, netinet] so that the rebuild of ifconfig will
work properly. You'll also need to patch ifconfig.c (which means you
need full source) BEFORE you install and reboot your machine. If you
don't do this, ifconfig won't work and your machine won't connect to
the network.

You need interface cards with working multicast code. This eliminates
the de cards. fxp (Intel) and ep (3com 3c5x9) cards are known to work.

The clusterd daemon should build by just typing "make" in the clusterd
subdirectory. Install this in a convenient place (/usr/local/bin). It
needs to run as root to change ipfw rules, etc. This could be avoided
using standard techniques (suid external exe, etc), but I haven't
gotten around to dealing with that as yet.

You'll need to create a cluster.conf file. This must have the following
entries (it will find it in /usr/local/etc/clusterd.conf, or specify it
as the only parameter on the command line):

    # host id number -- must be unique for each host, and no empty spaces!
    host_id=0
    # number of peers in the cluster
    num_peers=2
    # the cluster broadcast address.
    # probably fine as 255.255.255.255
    cluster_broadcast=255.255.255.255
    # the id number of the cluster -- all nodes participating need the same id
    cluster_id=1
    # an ipfw destination specification for the cluster
    cluster_destination=128.230.143.88
    # a list of bits to use when dividing up the IP address space
    # 8,9,10,11 will divide up by the least significant nibble of the
    # third octet in an address (ie x where the address is 10.0.x.0)
    # 0,1,2,3 will divide up by the least significant nibble of the last
    # octet in an address (ie x where the address is 10.0.0.x)
    cluster_bits=0,1,2,3
    ####End

Once this is created, you probably want to add a few lines to
/etc/rc.local. Here's what mine looks like (with bogus addresses
substituted):

    /sbin/ipfw add 15500 deny ip from any to 10.0.0.5
    /sbin/ifconfig fxp0 alias 10.0.0.5 netmask 255.255.240.0 srcselect
    /usr/sbin/arp -s 10.0.0.5 01:00:ff:ff:00:01 pub
    /usr/local/etc/clusterd

After the files are all in place, you can just "sh /etc/rc.local" to
bring each node up.

(Note: avoid the temptation to use a MAC address in the IP multicast
range (01:00:5e:xx:xx:xx)--funny things will probably happen. On my
network, packets got duplicated by a multicast router.)

Operations
----------

The clusterd process will respond to the following signals:

SIGTERM will cause clusterd to notify peers that it's going down, close
the log file, remove the pid file, and exit. The "reboot" command
causes an immediate exit from the cluster when it sends SIGTERM.
Services should move within a second or two at the most.

SIGINFO will output the node's current view of the cluster to the log
file.

SIGUSR2 will reset any evictions that the process is holding for other
nodes.

SIGINT will cause the cluster node to switch to STATUS_DOWN and
announce it (which should cause the other nodes to pick up their shares
of the traffic).

SIGCONT will cause the cluster node to try to reinitialize (if it is
not in STATUS_GOINGONLINE or STATUS_CORE).
Most useful after another blackballing node has been reset or after a
SIGSTOP.

Eviction
--------

Each cluster node tracks the number of status changes for each other
node. This counter is reset hourly. If the counter exceeds 8
transitions in a one hour period, the node is considered to be flaky
and is evicted from the cluster. Eviction is intended to rid the
cluster of a host which is in a start->overload->crash->restart (or
similar) cycle.

Heartbeats
----------

Each cluster node broadcasts a heartbeat every two seconds. If 9
heartbeats are missed (about 18 seconds of silence), the cluster node
will be considered down, and reception rules will be renegotiated so
that all hosts can access a "running" host.

Monitoring other processes
--------------------------

There is no internal functionality for monitoring other services
(httpd, innd, etc) provided. External monitors should be used, and
should send SIGINT to the clusterd process to cause a failover.

Files of note
-------------

logging goes to /var/log/clusterd.log
the pid file is /var/run/clusterd.pid
cluster.conf should be in /usr/local/etc or specified on the command line

clusterd will syslog if it finds its own pid file, or if it cannot open
the log file.

Comments on the code
--------------------

I've never written clustering code before, so excuse my mess, please.

CAVEATS
-------

1. This code is not extensively tested. You should do significant
   testing before putting any services you care about under its
   supervision.

2. Some IP stacks refuse to accept a multicast ethernet address in
   response to an ARP request. NT 5 beta 2 is one. There probably are
   others. I'm working a bit on this, but not too aggressively. There
   are a number of simple resolutions to this problem; drop me a note
   if you want to discuss it.

3. Using a multicast MAC address is something you almost certainly want
   to limit to a small ROUTED subnet. Doing this on a large switched
   network would have EXTREMELY NEGATIVE effects.
   If this is your environment, you may want more work on #2 to take
   place before using this in production.

4. Be careful when modifying this code. If you mess up the negotiation,
   you can cause a broadcast flood. I clocked a PII-333 dumping 30000
   BROADCAST PACKETS PER SECOND onto my wire. This will bring
   everything on your network to an immediate halt.

Copyright
---------

Copyright 1999 Christopher M Sedore cmsedore@maxwell.syr.edu

Please do not distribute these files without my permission.

Notes on copyright
------------------

If there is interest, I will distribute these with a freer copyright
after they've been tested and utilized by a few people. Contact me with
questions.

Author information
------------------

Code and docs were written by Christopher M Sedore,
cmsedore@maxwell.syr.edu. Please feel free to contact me via electronic
mail.