deriamis/fluffy-linux-cluster
Fluffy Cluster - by Mark Robson (markxr@gmail.com)

This is a cluster which works with any TCP service and is self-balancing.
It does health monitoring of the other nodes and automatically spreads the
load between available nodes.

Others who have done this: the CLUSTERIP target implements this in a
similar fashion (but not all of the parts). The technique documented here
is also similar:
http://www.ultramonkey.org/papers/active_active/active_active.shtml

DESIGN GOALS
------------

1. Minimum configuration

   * All nodes should be able to share the same config
   * No "node number" or "node ID" should need configuring
   * Nodes should be able to leave, have their IP address changed and
     re-join without a problem

2. Very high availability

   * Small amount of lost traffic when a node dies
   * Ideally no lost traffic when a node is administratively shut down
     or added
   * Quick recovery time
   * No complicated manual procedures required when nodes fail or recover
   * Capacity can be added (new nodes) without interrupting service and
     without prior knowledge

3. Good performance

   * If we care enough.
   * In most cases, clustering will be used to provide better performance
     at the app layer; the clustering system itself should be performant
     enough not to matter.

4. Reasonable scalability

   * Should support a fair few nodes

NON-DESIGN GOALS
----------------

* Run on OSs other than Linux
* Be very configurable etc.
* Any security against malicious traffic on the internal network
* Any defence against "Byzantine general" attacks, broken nodes etc.

DESIGN / HOW IT WORKS
---------------------

1. We have an IP address which is shared between all nodes in the cluster.
   This is not configured in the normal way. Rather, the daemon will add
   it to the specified interface at startup (and remove it when it quits).

2. We respond to ARP requests with a multicast Ethernet address. This is
   achieved by first using arptables to block the kernel's replies from
   that IP address, then having a userspace daemon send its own replies.
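For concreteness, such an ARP reply could be built like this in Python - a
sketch only: the function names are mine, and the multicast MAC scheme
(03:00:01 plus the last 3 bytes of the shared IP) is the one noted further
down in this document:

```python
import socket
import struct

def cluster_mac(shared_ip: str) -> bytes:
    """Multicast MAC for the cluster: 03:00:01 followed by the last
    3 bytes of the shared IP (the scheme noted later in this README)."""
    return bytes([0x03, 0x00, 0x01]) + socket.inet_aton(shared_ip)[1:]

def arp_reply(shared_ip: str, asker_mac: bytes, asker_ip: str) -> bytes:
    """Build the 28-byte ARP reply payload the daemon would send: it
    claims the cluster IP is reachable at the multicast MAC."""
    return struct.pack("!HHBBH6s4s6s4s",
                       1,       # hardware type: Ethernet
                       0x0800,  # protocol type: IPv4
                       6, 4,    # hardware / protocol address lengths
                       2,       # opcode: reply
                       cluster_mac(shared_ip),       # sender MAC (multicast!)
                       socket.inet_aton(shared_ip),  # sender IP: cluster IP
                       asker_mac,                    # target MAC: the asker
                       socket.inet_aton(asker_ip))   # target IP: the asker
```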
   (If possible, only one node in the cluster needs to reply to ARPs;
   otherwise the sender gets multiple pointless identical replies.)

   NB: This is not technically legal, in that it's explicitly forbidden by
   RFC 1812:

       "A router MUST not believe any ARP reply that claims that the Link
       Layer address of another host or router is a broadcast or multicast
       address."

   HOWEVER, some other cluster / high-availability products use it (I
   believe Microsoft's network clustering product does), so it's probably
   OK with most routers. Only the upstream router(s) need be compatible;
   nobody else on the internet can tell. Anyone who wants to use this with
   an RFC 1812 compliant router will need to change some option, which it
   no doubt has, to disable this feature.

3. Incoming packets to that IP address get sent via an iptables chain.
   This is dynamically generated by the tool, added at startup time and
   removed again at shutdown. The chain is added to the INPUT chain for
   packets with a destination of our cluster IP, and does the following:

   - Using connection tracking, accept packets which are part of an
     existing connection to the current node, i.e. state=ESTABLISHED
     (also state=RELATED for ICMP errors etc.)
   - Drop packets which are in an INVALID state, as they are most likely
     being handled by another node and we don't want to interrupt a
     connection to another node with RSTs etc.
   - When a NEW packet comes in, queue it using NFQUEUE to a userspace
     daemon which decides what to do with it.

4. Userspace daemon handling NEW connections

   Each new connection is taken in by this userspace daemon, which hashes
   the source IP / port into a (e.g.) 16-bit value. This is then compared
   with the node's current minimum / maximum hash range values. If it's
   within, we accept it. If it's not, we drop it.

5. Userspace daemon managing hash ranges

   Each node within the cluster will have a hash range. These are managed
   cooperatively between the nodes such that the entire hash space is
   covered (i.e. all connections are handled by SOMEONE) and there are no
   overlaps - at least, not if possible. The hash ranges are set up by
   agreement using UDP multicast packets.

   When the hash ranges change, only NEW connections are affected.
   Existing, already-established ones are allowed to continue.
   Additionally, the weighting of a given node is factored in when
   building the hash ranges.

6. Mechanism of setting node weight

   Each node has a weight which is configurable dynamically. When a node
   weight changes, the hash ranges are (somehow, somewhere) recomputed and
   the new ranges communicated to all nodes.

   If a node's weight is set to 0, that node will not accept any new
   connections *BUT ESTABLISHED CONNECTIONS CONTINUE*. If all nodes have
   0 weight, no new connections will be accepted at all.

7. Link status handling

   If the link on the interface goes down, i.e. its carrier is lost, then
   we handle that by (temporarily) setting our own node to zero weight and
   ensuring that when the link comes back up, our own hash ranges are not
   set - we will drop packets when the link comes back until the ranges
   are recalculated (following the successful re-establishment).

   This ensures that if a switch fails, or if a network cable is
   unplugged, the other nodes in the cluster (which still have
   connectivity) can take over the spare hash ranges, and we won't get
   overlap when the node with the failed link comes back.

-------------------

Malarky: MAC addresses to use for multicast:

Use 03:00:01: followed by the last 3 bytes of the shared IP.

----------------

The following comment in CLUSTERIP is unbelievably important:

    /* despite being received via linklayer multicast, this is
     * actually a unicast IP packet. TCP doesn't like PACKET_MULTICAST */
    skb->pkt_type = PACKET_HOST;

It will be necessary to replicate this behaviour through a (very simple)
kernel module.

----------------

In our netfilter queue processing daemon, we can only ACCEPT or DROP
packets; but if we accept them, does processing continue?
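The hashing and range logic of points 4-6 might be sketched as follows.
This is a sketch under assumptions: the 16-bit hash function (MD5 here),
the weight-proportional rounding, and all names are mine, not necessarily
the project's actual choices:

```python
import hashlib
import socket
import struct

HASH_SPACE = 1 << 16  # 16-bit hash space; ranges are [lower, upper)

def conn_hash(src_ip: str, src_port: int) -> int:
    """Hash a NEW connection's source IP/port into a 16-bit value.
    (The real daemon's hash function is unspecified; MD5 is arbitrary.)"""
    packed = socket.inet_aton(src_ip) + struct.pack("!H", src_port)
    return int.from_bytes(hashlib.md5(packed).digest()[:2], "big")

def build_ranges(weights: dict) -> dict:
    """Split the whole hash space into one [lower, upper) range per node,
    proportional to weight.  Zero-weight nodes get no range, so they
    accept no NEW connections (established ones are untouched)."""
    active = {ip: w for ip, w in weights.items() if w > 0}
    total = sum(active.values())
    ranges, lower = {}, 0
    # Deterministic order so every node computes identical boundaries.
    items = sorted(active.items())
    for i, (ip, w) in enumerate(items):
        # Last node takes the remainder so the space is fully covered.
        upper = HASH_SPACE if i == len(items) - 1 \
            else lower + w * HASH_SPACE // total
        ranges[ip] = (lower, upper)
        lower = upper
    return ranges

def accepts(ranges: dict, node_ip: str, h: int) -> bool:
    """Would this node accept a NEW connection hashing to h?"""
    lower, upper = ranges.get(node_ip, (0, 0))
    return lower <= h < upper
```

For example, with weights {"10.0.0.1": 2, "10.0.0.2": 1, "10.0.0.3": 0},
the first node covers [0, 43690), the second [43690, 65536), and the
zero-weight node accepts no new connections, matching point 6.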
If processing continues, we can affect it by using a netfilter packet
"mark", which we can set using nfq_set_verdict_mark.

--------------------

MULTICAST PACKET FORMAT

1. 32-bit magic number, so we know it's not junk: 0x09bea717
   (This will have to be changed if we ever change the protocol.)

2. Cluster IP address - so we know which is which if several clusters run
   on the same network.

3. Node's own IP address (32-bit) - so the recipient knows it's not
   somehow been NAT'd or come out of the wrong interface. Compared with
   the sender IP - if different, drop.

4. Command type - 32-bit word:

   0x01 - Weight / status announcement
   0x02 - Master message
   0x03 - Set node weight
   0x04 - Set node weight - response

5. Weight of the node (32-bit). (Can be zero.)

6. In the case of a master message, a structure giving the boundaries of
   all nodes in the cluster:

   byte - number of nodes
   for each node:
       IP address (32-bit)
       Lower boundary (32-bit) (included)
       Upper boundary (32-bit) (excluded)

Everything is in "network byte order", i.e. big endian.

A node can only be the master IF:

1. It has a non-zero weight
2. It has the lowest IP address of any candidate node.

---------------------

Setting node weight

Send a unicast UDP frame to our port number on localhost; it will be
received by the daemon, which will make the appropriate changes. The
format is as above except:

1. Cluster IP is all zeros
2. Node's own IP address is all zeros
3. Command type = 0x03
4. Weight of node is the desired weight

-------------------

Extra security TODO:

Using recvmsg, get ancillary data to determine which interface and
destination address the communication packets came to. Drop them if they
didn't arrive on the right interface and destination address. This means
that spoofed messages which have the right info will still be dropped,
because other hosts on the internet can't send to our multicast address -
such packets won't be routed. Also, "set weight" packets must be received
on 127.0.0.1, which can't be routed from anywhere else, so can't be
spoofed.
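To make the wire format concrete, here is a Python sketch that packs and
parses a master message (command 0x02). The field and helper names are
mine; the layout (magic, cluster IP, own IP, command, weight, then a node
count byte and per-node boundary table, all in network byte order) follows
the description above:

```python
import socket
import struct

MAGIC = 0x09BEA717
CMD_ANNOUNCE, CMD_MASTER, CMD_SET_WEIGHT, CMD_SET_WEIGHT_RESP = 1, 2, 3, 4

# Fixed header: magic, cluster IP, own IP, command, weight (20 bytes).
HEADER = "!I4s4sII"

def pack_master(cluster_ip, own_ip, weight, boundaries):
    """boundaries: list of (node_ip, lower, upper), upper excluded."""
    msg = struct.pack(HEADER, MAGIC,
                      socket.inet_aton(cluster_ip),
                      socket.inet_aton(own_ip),
                      CMD_MASTER, weight)
    msg += struct.pack("!B", len(boundaries))  # byte: number of nodes
    for ip, lower, upper in boundaries:
        msg += struct.pack("!4sII", socket.inet_aton(ip), lower, upper)
    return msg

def unpack(msg):
    """Parse a received datagram; reject anything without our magic."""
    magic, cluster, own, cmd, weight = struct.unpack_from(HEADER, msg)
    if magic != MAGIC:
        raise ValueError("bad magic - not one of ours")
    out = {"cluster_ip": socket.inet_ntoa(cluster),
           "own_ip": socket.inet_ntoa(own),
           "cmd": cmd, "weight": weight}
    if cmd == CMD_MASTER:
        (n,) = struct.unpack_from("!B", msg, 20)
        nodes, off = [], 21
        for _ in range(n):
            ip, lo, hi = struct.unpack_from("!4sII", msg, off)
            nodes.append((socket.inet_ntoa(ip), lo, hi))
            off += 12
        out["nodes"] = nodes
    return out
```

A "set weight" request would be the same 20-byte header with zeroed
cluster and own IPs, command 0x03, and the desired weight.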
About
Automatically exported from code.google.com/p/fluffy-linux-cluster