
Fluffy Cluster - by Mark Robson (markxr@gmail.com)

This is a clustering system which works with any TCP service and is
self-balancing. Each node monitors the health of the other nodes, and
the load is automatically spread across the available nodes.

Others who have done this:

The iptables CLUSTERIP target implements this in a similar fashion
(though not all of the parts).

The technique documented here is also similar:
http://www.ultramonkey.org/papers/active_active/active_active.shtml

DESIGN GOALS
------------
1. Minimum configuration
	* All nodes should be able to share the same config
	* No "node number" or "Node ID" should need configuring
	* Nodes should be able to leave, have their IP address changed
		and re-join without a problem
2. Very high availability 
	* Small amount of lost traffic when a node dies
	* Ideally no lost traffic when a node is administratively
		shut down or added.
	* Quick recovery time
	* No complicated manual procedures required when 
		nodes fail or recover.
	* Capacity can be added (new nodes) without interrupting
		service and without prior knowledge.

3. Good performance
	* If we care enough.
	* In most cases, clustering will be used to provide
		better performance at the app layer; the clustering
		system itself should be performant enough not to matter.

4. Reasonable scalability
	* Should support a fair few nodes

NON-DESIGN GOALS
----------------
* Run on OSs other than Linux
* Be very configurable etc
* Any security against malicious traffic on the internal network
* Any defence against "Byzantine generals" attacks, broken nodes etc.

DESIGN / HOW IT WORKS
---------------------

1. We have an IP address which is shared between all nodes in the cluster. 
	This is not configured in the normal way. Rather, the daemon
	will add it to the specified interface at startup (and remove it
	when it quits).
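
	For example, the effect is roughly that of running (with 10.1.2.3
	standing in for the cluster IP and eth0 for the interface; both
	values are made up):

		ip addr add 10.1.2.3/32 dev eth0	# at startup
		ip addr del 10.1.2.3/32 dev eth0	# at shutdown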

2. We respond to ARP requests with a multicast ethernet address. This is
	achieved by first using arptables to block the kernel's replies 
	from that IP address, then having a userspace daemon send its
	own replies.

	(Ideally only one node in the cluster needs to reply to each ARP
	request; otherwise the sender gets multiple pointless identical
	replies.)
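
	As a sketch, the blocking rule might look something like this
	(10.1.2.3 again standing in for the cluster IP):

		arptables -A OUTPUT --opcode Reply --source-ip 10.1.2.3 -j DROP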

	NB: This is not technically legal, in that it's explicitly
	forbidden by RFC1812:
		" A router MUST not believe any ARP reply that claims that the Link
		Layer address of another host or router is a broadcast or multicast
		address. "
	
	HOWEVER, some other cluster / high availability products use it
	(I believe Microsoft's network clustering product does), so it's
	probably ok with most routers.

	Only the upstream router(s) need be compatible; nobody else on
	the internet can tell.

	Anyone who wants to use this with an RFC1812-compliant router will
	need to change whatever option the router no doubt has to disable
	this check.

3. Incoming packets to that IP address get sent via an iptables chain. This
	is dynamically generated by the tool and added at startup time and
	removed again at shutdown.

	This chain is added to the INPUT chain for packets with a destination
	of our cluster IP, and does the following:

	- Using connection tracking, accept packets which are part of an
		existing connection to the current node, i.e. state=ESTABLISHED
		(also state=RELATED for ICMP errors etc.)
	- Drop packets which are in an INVALID state, as they are most likely
		being handled by another node and we don't want to interrupt
		a connection to another node with RSTs etc.
	- When a NEW packet comes in, queue it using NFQUEUE to a userspace
		daemon which decides what to do with it.
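
	As a sketch, the generated rules might look something like this
	(the chain name FLUFFY and queue number 0 are made up for
	illustration; 10.1.2.3 again stands in for the cluster IP):

		iptables -N FLUFFY
		iptables -A FLUFFY -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
		iptables -A FLUFFY -m conntrack --ctstate INVALID -j DROP
		iptables -A FLUFFY -m conntrack --ctstate NEW -j NFQUEUE --queue-num 0
		iptables -A INPUT -d 10.1.2.3 -j FLUFFY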

4. Userspace daemon handling NEW connections
	Each new connection is taken in by this userspace daemon, which
	hashes the source IP / port into a (e.g.) 16-bit value.

	This is then compared with the node's current minimum / maximum
	hash range values. If the hash falls within the range, we accept
	the packet; if not, we drop it.
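
	A minimal sketch of such a hash and range test in C (the mixing
	constants and names are illustrative, not necessarily what the
	daemon uses; the boundaries are 32-bit, matching the protocol
	described below):

		#include <stdint.h>

		/* Hash the connection's source IP and port into a 16-bit
		 * value.  Any reasonably uniform mix will do; this one is
		 * made up. */
		static uint16_t conn_hash(uint32_t src_ip, uint16_t src_port)
		{
			uint32_t h = src_ip ^ (((uint32_t)src_port << 16) | src_port);

			h ^= h >> 16;
			h *= 0x45d9f3bu;	/* arbitrary odd multiplier */
			h ^= h >> 16;
			return (uint16_t)h;
		}

		/* Accept only if the hash falls in our half-open range
		 * [lower, upper).  Boundaries are 32-bit so that upper can
		 * be 0x10000, one past the top hash value. */
		static int hash_in_range(uint16_t h, uint32_t lower, uint32_t upper)
		{
			return h >= lower && h < upper;
		}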

5. Userspace daemon managing hash ranges
	
	Each node within the cluster has a hash range. These are managed
	cooperatively between the nodes such that the entire hash space
	is covered (i.e. all connections are handled by SOMEONE) and,
	as far as possible, there are no overlaps.

	The hash ranges are set up by agreement using UDP multicast packets.

	When the hash ranges change it only affects NEW connections. Existing
	already-established ones are allowed to continue.

	Additionally the weighting of a given node is factored in when building
	the hash ranges. 
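
	As a sketch, the node acting as master might carve up the hash
	space like this (names and types are made up; a 16-bit hash space
	is assumed, as in the sketch above):

		#include <stdint.h>

		#define HASH_SPACE 0x10000u	/* 16-bit hash values */

		struct node_range {
			uint32_t ip;	/* node's own IP */
			uint32_t lower;	/* inclusive */
			uint32_t upper;	/* exclusive */
		};

		/* Divide the hash space between n nodes in proportion to
		 * their weights, with no gaps and no overlaps.  Zero-weight
		 * nodes get an empty range (lower == upper); if every
		 * weight is zero, all ranges come out empty. */
		static void assign_ranges(struct node_range *nodes,
					  const uint32_t *weights, int n)
		{
			uint64_t total = 0, acc = 0;
			int i;

			for (i = 0; i < n; i++)
				total += weights[i];

			for (i = 0; i < n; i++) {
				nodes[i].lower = total ? (uint32_t)(acc * HASH_SPACE / total) : 0;
				acc += weights[i];
				nodes[i].upper = total ? (uint32_t)(acc * HASH_SPACE / total) : 0;
			}
		}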

6. Mechanism of setting node weight

	Each node has a weight which is configurable dynamically. When a node
	weight changes, the hash ranges are (somehow, somewhere) recomputed
	and new ranges communicated to all nodes.

	If a node weight is set to 0, that node will not be accepting any
	new connections *BUT ESTABLISHED CONNECTIONS CONTINUE*

	If all nodes have 0 weight, no new connections will be accepted at all.

7. Link status handling

	If the link on the interface goes down, i.e. its carrier is lost,
	we handle that by (temporarily) setting our own node to zero
	weight, and by ensuring that our own hash ranges stay unset when
	the link comes back up - we will drop packets after the link
	returns, until the ranges have been recalculated (following the
	successful re-establishment).

	This ensures that if a switch fails, or a network cable is
	unplugged, the other nodes in the cluster (which still have
	connectivity) can take over the spare hash ranges, and we won't
	get an overlap when the node with the failed link comes back.

-------------------
Malarky:

MAC addresses to use for multicast:

Use 03:00:01: followed by the last 3 bytes of the shared IP.
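
A sketch of deriving that address in C (ip is the shared IP in network
byte order):

	#include <stdint.h>

	/* Build the multicast MAC 03:00:01:xx:yy:zz, where xx:yy:zz are
	 * the last three bytes of the shared (cluster) IP. */
	static void cluster_mac(uint32_t ip, uint8_t mac[6])
	{
		const uint8_t *b = (const uint8_t *)&ip;

		mac[0] = 0x03;	/* multicast + locally administered */
		mac[1] = 0x00;
		mac[2] = 0x01;
		mac[3] = b[1];
		mac[4] = b[2];
		mac[5] = b[3];
	}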

----------------

The following comment in CLUSTERIP is unbelievably important:

/* despite being received via linklayer multicast, this is
 * actually a unicast IP packet. TCP doesn't like PACKET_MULTICAST */
     skb->pkt_type = PACKET_HOST;

It will be necessary to replicate this behaviour through a (very simple) 
kernel module.
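
A minimal sketch of such a module, done as a PRE_ROUTING netfilter hook
(the cluster IP is hard-coded purely for illustration, and
nf_register_net_hook() assumes a reasonably recent kernel):

	#include <linux/module.h>
	#include <linux/netfilter.h>
	#include <linux/netfilter_ipv4.h>
	#include <linux/ip.h>
	#include <linux/skbuff.h>
	#include <linux/if_packet.h>
	#include <net/net_namespace.h>

	#define CLUSTER_IP cpu_to_be32(0x0a010203)	/* 10.1.2.3, made up */

	static unsigned int fluffy_fix_pkt_type(void *priv,
						struct sk_buff *skb,
						const struct nf_hook_state *state)
	{
		/* The frame arrived on a link-layer multicast address, so
		 * eth_type_trans() marked it PACKET_MULTICAST; rewrite the
		 * type so TCP will accept the (IP-unicast) packet. */
		if (ip_hdr(skb)->daddr == CLUSTER_IP)
			skb->pkt_type = PACKET_HOST;
		return NF_ACCEPT;
	}

	static struct nf_hook_ops fluffy_ops = {
		.hook     = fluffy_fix_pkt_type,
		.pf       = NFPROTO_IPV4,
		.hooknum  = NF_INET_PRE_ROUTING,
		.priority = NF_IP_PRI_FIRST,
	};

	static int __init fluffy_init(void)
	{
		return nf_register_net_hook(&init_net, &fluffy_ops);
	}

	static void __exit fluffy_exit(void)
	{
		nf_unregister_net_hook(&init_net, &fluffy_ops);
	}

	module_init(fluffy_init);
	module_exit(fluffy_exit);
	MODULE_LICENSE("GPL");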

----------------
In our netfilter queue processing daemon, we can only ACCEPT or DROP packets;
but if we accept them, processing continues?

If processing continues, we can affect it by using a netfilter packet "mark"
which we can set using

nfq_set_verdict_mark
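
As a sketch, the queue callback might look like this (using
libnetfilter_queue; conn_hash() and the range bounds are reused from
the earlier sketch, and everything here is illustrative rather than
the actual daemon):

	#include <stdint.h>
	#include <arpa/inet.h>
	#include <linux/ip.h>			/* struct iphdr */
	#include <linux/tcp.h>			/* struct tcphdr */
	#include <linux/netfilter.h>		/* NF_ACCEPT, NF_DROP */
	#include <libnetfilter_queue/libnetfilter_queue.h>

	extern uint16_t conn_hash(uint32_t src_ip, uint16_t src_port);
	extern uint32_t range_lower, range_upper;	/* our current range */

	static int queue_cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
			    struct nfq_data *nfa, void *data)
	{
		struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
		unsigned char *payload;
		uint32_t id;
		int ours = 0;

		if (!ph)
			return -1;
		id = ntohl(ph->packet_id);

		if (nfq_get_payload(nfa, &payload) >=
		    (int)(sizeof(struct iphdr) + sizeof(struct tcphdr))) {
			struct iphdr *iph = (struct iphdr *)payload;
			struct tcphdr *tcph =
				(struct tcphdr *)(payload + iph->ihl * 4);
			uint16_t h = conn_hash(iph->saddr, ntohs(tcph->source));

			ours = h >= range_lower && h < range_upper;
		}

		if (ours)	/* mark is passed in network byte order */
			return nfq_set_verdict_mark(qh, id, NF_ACCEPT,
						    htonl(1), 0, NULL);
		return nfq_set_verdict(qh, id, NF_DROP, 0, NULL);
	}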

--------------------
MULTICAST PACKET FORMAT

1. 32-bit magic number so we know it's not junk
	0x09bea717
	(This will have to be changed if we ever change the protocol)
2. Cluster IP address - so we know which is which if several
	clusters run on the same network.
3. Node's own IP address (32-bit) - so the recipient knows it's not
	somehow been NAT'd or come out of the wrong interface.
	Compared with the packet's source IP; if different, drop.
4. Command type - 32-bit word:
	0x01 - Weight / status announcement
	0x02 - Master message
	0x03 - Set node weight
	0x04 - Set node weight - response
5. Weight of the node (32-bit) (can be zero)
6. In the case of a master message, a structure giving the
	boundaries of all nodes in the cluster

	byte - number of nodes
	for each node:
		IP address (32-bit)
		Lower boundary (32-bit) (included)
		Upper boundary (32-bit) (excluded)

	Everything is in "network byte order" i.e. big endian.
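
As a sketch, the fixed part of the message could be declared like this
in C (the names are made up; all fields are big-endian on the wire):

	#include <stdint.h>

	#define FLUFFY_MAGIC 0x09bea717u

	struct fluffy_msg {		/* fixed header, all big-endian */
		uint32_t magic;		/* FLUFFY_MAGIC */
		uint32_t cluster_ip;	/* which cluster this is for */
		uint32_t node_ip;	/* sender's own address */
		uint32_t command;	/* 0x01 .. 0x04, see above */
		uint32_t weight;	/* node weight, may be zero */
	} __attribute__((packed));

	struct fluffy_node_entry {	/* repeated after a master message */
		uint32_t ip;		/* node's IP address */
		uint32_t lower;		/* lower boundary, included */
		uint32_t upper;		/* upper boundary, excluded */
	} __attribute__((packed));

	/* A master message (command 0x02) is the header, then one byte
	 * giving the number of nodes, then that many
	 * fluffy_node_entry structures. */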

A node can only be the master IF:
1. It has a non-zero weight.
2. It has the lowest IP address of any candidate node.

---------------------
Setting node weight

Send a unicast UDP frame to our port number on localhost; it will
be received by the daemon, which applies the appropriate changes.

The format is as above except:
1. Cluster IP is all zeros
2. Node's own IP address is all zeros
3. Command type = 0x03
4. Weight of node is the desired weight
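
A sketch of a client sending that message (struct fluffy_msg and
FLUFFY_MAGIC are from the sketch above; port stands in for the daemon's
port number, which isn't specified here):

	#include <stdint.h>
	#include <unistd.h>
	#include <arpa/inet.h>
	#include <sys/socket.h>

	/* Ask the local daemon to set this node's weight (command 0x03). */
	static int set_weight(uint16_t port, uint32_t weight)
	{
		struct fluffy_msg msg = {
			.magic      = htonl(FLUFFY_MAGIC),
			.cluster_ip = 0,		/* all zeros */
			.node_ip    = 0,		/* all zeros */
			.command    = htonl(0x03),
			.weight     = htonl(weight),
		};
		struct sockaddr_in dst = {
			.sin_family = AF_INET,
			.sin_port   = htons(port),
			.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
		};
		int fd = socket(AF_INET, SOCK_DGRAM, 0);
		int rc = -1;

		if (fd < 0)
			return -1;
		if (sendto(fd, &msg, sizeof msg, 0,
			   (struct sockaddr *)&dst, sizeof dst) == sizeof msg)
			rc = 0;
		close(fd);
		return rc;
	}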

-------------------
Extra security todo:

Using recvmsg, get ancillary data to determine which interface and
destination address the communication packets arrived on. Drop them if
they didn't arrive on the right interface and destination.

This means that spoofed messages which otherwise have the right info
will still be dropped, because other hosts on the internet can't send
to our multicast address - such packets won't be routed.

Also "Set weight" packets must be received on 127.0.0.1, which
can't be routed from anywhere else so can't be spoofed.
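
A sketch of that check using IP_PKTINFO (want_ifindex and want_dst are
supplied by the caller; illustrative only):

	#define _GNU_SOURCE
	#include <sys/socket.h>
	#include <sys/uio.h>
	#include <netinet/in.h>

	/* Receive one datagram, rejecting it unless it arrived on the
	 * expected interface with the expected destination address.
	 * The socket must already have IP_PKTINFO enabled:
	 *	int one = 1;
	 *	setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &one, sizeof one);
	 */
	static ssize_t recv_checked(int fd, void *buf, size_t len,
				    int want_ifindex, in_addr_t want_dst)
	{
		char cbuf[CMSG_SPACE(sizeof(struct in_pktinfo))];
		struct iovec iov = { .iov_base = buf, .iov_len = len };
		struct msghdr mh = {
			.msg_iov = &iov, .msg_iovlen = 1,
			.msg_control = cbuf, .msg_controllen = sizeof cbuf,
		};
		ssize_t n = recvmsg(fd, &mh, 0);
		struct cmsghdr *c;

		if (n < 0)
			return n;
		for (c = CMSG_FIRSTHDR(&mh); c; c = CMSG_NXTHDR(&mh, c)) {
			if (c->cmsg_level == IPPROTO_IP &&
			    c->cmsg_type == IP_PKTINFO) {
				struct in_pktinfo *pi =
					(struct in_pktinfo *)CMSG_DATA(c);

				if (pi->ipi_ifindex == want_ifindex &&
				    pi->ipi_addr.s_addr == want_dst)
					return n;	/* looks genuine */
			}
		}
		return -1;	/* wrong i/f or dest: drop it */
	}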
