Layman: A Distributed File System
AKA: POSIX DFS With Probabilistic Routing

by Erik Parawell, Nov 30th, 2021

Project Motivation

I am always tinkering with my pride and joy of a server in my apartment. It has 20TB of raw storage, with 12TB usable under ZFS; however, I do not have a backup for that backup of all my important files. The laws of backup dictate that I should follow the Rule of Three. However, a lot of my mixed media is best described as "Linux ISOs" and is easily retrievable. It would be nice to have a system where I could mount a network file system on my home NAS (my backup machine) where all my irreplaceable files can live and be automatically propagated in a distributed fashion. One that I control absolutely.

To exercise what I learned in my classes and the practical experience I gained at work, I have decided to resurrect an old project of mine from a graduate course, in which I had already implemented some of the distributed aspects of the file system. Over the last couple of months I have made bug fixes and improvements, such as POSIX compatibility, while getting it ready for a long-term burn-in test.

Project Features

  • POSIX Compatibility: unlike many other DFSes, this one will be POSIX-compatible, meaning the file system can be mounted like any other disk on the host operating system.
  • Probabilistic Routing: to enable lookups without requiring excessive RAM, client requests will be routed probabilistically to relevant storage nodes via bloom filters.
  • Parallel retrievals: large files will be split into multiple chunks. Client applications retrieve these chunks in parallel using threads.
  • Interoperability: the DFS will use Google Protocol Buffers to serialize messages. This allows other applications to easily implement the wire format.
  • Asynchronous Scalability: we will use non-blocking I/O to ensure the DFS can scale to handle hundreds of active client connections concurrently (see the sketch after this list).
  • Fault tolerance: the system must be able to detect and withstand two concurrent storage node failures and continue operating normally. It will also be able to recover corrupted files.
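The post does not name an implementation language, so the code sketches that follow use Go purely for illustration. As a first example of the asynchronous scalability goal, here is a minimal sketch of a connection-per-goroutine TCP server: Go's runtime multiplexes goroutines over non-blocking sockets, so hundreds of mostly idle clients stay cheap. The port, the newline framing, and the echo handler are placeholders, not the project's actual protocol.

```go
package main

import (
	"bufio"
	"log"
	"net"
)

// handleConn reads newline-delimited requests from one client connection.
// The framing and handling are stand-ins; in practice each frame would
// carry a Protocol Buffers message.
func handleConn(conn net.Conn) {
	defer conn.Close()
	r := bufio.NewReader(conn)
	for {
		line, err := r.ReadBytes('\n')
		if err != nil {
			return // client closed the connection or a network error occurred
		}
		if _, err := conn.Write(line); err != nil { // echo as a placeholder for real handling
			return
		}
	}
}

func main() {
	ln, err := net.Listen("tcp", ":9090") // hypothetical listen port
	if err != nil {
		log.Fatal(err)
	}
	// One goroutine per connection; the Go runtime multiplexes these over
	// non-blocking sockets, so many concurrent clients remain inexpensive.
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handleConn(conn)
	}
}
```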

Those goals are nice and all, but it would all be for nothing if my prized photos and school papers suddenly disappeared. After all, a picture is worth a thousand words, and with so many selfies to protect, that's a lot of knowledge on the line.

Thus testing will be my top priority. I will go into further detail about it in a later section.

System Architecture

The system architecture consists of three main components.
1)  The Client
2)  The Controller
3)  N Storage Nodes

The Client:
The client is built to utilize the DFS to its fullest capabilities. Not only is the client fully non-blocking like the server infrastructure, but when it retrieves a file it fetches the chunks from separate servers, both to spread the bandwidth load and to potentially increase throughput when a single node's disk or network I/O would otherwise be the bottleneck.
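To make the parallel retrieval concrete, here is a minimal sketch (in Go, continuing the illustrative language from above) that fetches every chunk concurrently and reassembles them in order. The chunkLocation type, node addresses, and fetchChunk stand-in are hypothetical; the real client would issue network requests using the wire format described earlier.

```go
package main

import (
	"fmt"
	"sync"
)

// chunkLocation pairs a chunk index with the storage node believed to hold it.
type chunkLocation struct {
	index int
	node  string
}

// fetchChunk stands in for the real network call to a storage node.
func fetchChunk(loc chunkLocation) []byte {
	return []byte(fmt.Sprintf("chunk %d from %s", loc.index, loc.node))
}

// retrieveFile pulls every chunk concurrently, one goroutine per chunk,
// then returns them in their original order for reassembly.
func retrieveFile(locs []chunkLocation) [][]byte {
	chunks := make([][]byte, len(locs))
	var wg sync.WaitGroup
	for _, loc := range locs {
		wg.Add(1)
		go func(loc chunkLocation) {
			defer wg.Done()
			chunks[loc.index] = fetchChunk(loc) // each goroutine writes its own slot
		}(loc)
	}
	wg.Wait()
	return chunks
}

func main() {
	locs := []chunkLocation{{0, "node-a:9090"}, {1, "node-b:9090"}, {2, "node-c:9090"}}
	for _, c := range retrieveFile(locs) {
		fmt.Println(string(c))
	}
}
```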

Another key feature is that the client is POSIX-compatible. This means I can use typical bind mounts and have this networked file system act like it lives physically on the client's machine.
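One common way to expose such a POSIX mount is FUSE. The sketch below is a minimal skeleton assuming a hypothetical FUSE mount built on the bazil.org/fuse library, and it only registers an empty root directory; the mount path and filesystem name are made up, and the project's actual mount mechanism may differ.

```go
package main

import (
	"context"
	"log"
	"os"

	"bazil.org/fuse"
	"bazil.org/fuse/fs"
)

// FS is the top-level filesystem handed to the FUSE server.
type FS struct{}

func (FS) Root() (fs.Node, error) { return Dir{}, nil }

// Dir is a stub root directory; a real implementation would list files
// tracked by the Controller and forward reads to Storage Nodes.
type Dir struct{}

func (Dir) Attr(ctx context.Context, a *fuse.Attr) error {
	a.Inode = 1
	a.Mode = os.ModeDir | 0o755
	return nil
}

func main() {
	// "/mnt/layman" is a hypothetical mountpoint for this sketch.
	c, err := fuse.Mount("/mnt/layman", fuse.FSName("layman"))
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
	if err := fs.Serve(c, FS{}); err != nil {
		log.Fatal(err)
	}
}
```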

The Controller:
The Controller is actually quite special. The whole system is closely related to Hadoop's HDFS, with the Controller playing the role of tracking where files live. In this case I am utilizing Bloom filters to reduce RAM consumption on the Controller. A Bloom filter lets me route the client to the correct servers with high accuracy: it can produce false positives but never false negatives. In the worst case the Controller can query each server until it finds the file, but this should almost never be necessary.
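As an illustration of the routing idea, here is a minimal Bloom filter sketch in Go. The Controller could keep one filter per Storage Node, keyed by chunk identifier, and ask each filter which nodes might hold a requested chunk. The bit-array size, number of hash functions, FNV-based double hashing, and chunk-ID format are all illustrative choices, not the project's actual parameters.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// BloomFilter is a simple bit-array Bloom filter using double hashing.
type BloomFilter struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // number of hash functions
}

func NewBloomFilter(m, k uint64) *BloomFilter {
	return &BloomFilter{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// indexes derives k bit positions from two FNV-1a hashes of the key.
func (b *BloomFilter) indexes(key string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	h1 := h.Sum64()
	h.Write([]byte(key)) // extend the stream for a second, loosely independent hash
	h2 := h.Sum64()
	idx := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		idx[i] = (h1 + i*h2) % b.m
	}
	return idx
}

func (b *BloomFilter) Add(key string) {
	for _, i := range b.indexes(key) {
		b.bits[i/64] |= 1 << (i % 64)
	}
}

// MightContain reports whether the key may be present
// (false positives possible, false negatives impossible).
func (b *BloomFilter) MightContain(key string) bool {
	for _, i := range b.indexes(key) {
		if b.bits[i/64]&(1<<(i%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical: one filter per Storage Node, keyed by chunk ID.
	nodes := map[string]*BloomFilter{
		"node-a": NewBloomFilter(1<<20, 5),
		"node-b": NewBloomFilter(1<<20, 5),
	}
	nodes["node-a"].Add("photos/cat.jpg:chunk-0")

	// Routing: ask each node's filter which ones might hold the chunk.
	for name, f := range nodes {
		if f.MightContain("photos/cat.jpg:chunk-0") {
			fmt.Println("candidate node:", name)
		}
	}
}
```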

The Controller also acts as the main orchestrator of the entire system. It not only keeps track of where files are located, but is also in charge of replication. To support this, each Storage Node is responsible for sending the Controller a heartbeat to signal that it is still alive. If a Storage Node dies, every affected chunk needs a new replica node in order to maintain the expected level of replication. Similarly, if a Storage Node finds a corrupted file chunk, the Controller helps coordinate locating an uncorrupted copy of that chunk so the Storage Node can repair the file's integrity.
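A minimal sketch of the heartbeat bookkeeping might look like the following, assuming hypothetical node IDs and timeout values; the rereplicate call is a stand-in for the Controller's real replication logic.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// heartbeatTracker records the last heartbeat seen from each Storage Node
// and flags nodes as dead after a timeout.
type heartbeatTracker struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
	timeout  time.Duration
}

func newHeartbeatTracker(timeout time.Duration) *heartbeatTracker {
	return &heartbeatTracker{lastSeen: make(map[string]time.Time), timeout: timeout}
}

// Beat is called whenever a heartbeat message arrives from a node.
func (t *heartbeatTracker) Beat(nodeID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[nodeID] = time.Now()
}

// reap returns nodes that have missed their heartbeat window and forgets them.
func (t *heartbeatTracker) reap() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var dead []string
	for id, seen := range t.lastSeen {
		if time.Since(seen) > t.timeout {
			dead = append(dead, id)
			delete(t.lastSeen, id)
		}
	}
	return dead
}

func main() {
	tracker := newHeartbeatTracker(15 * time.Second) // illustrative timeout
	tracker.Beat("node-a")

	// Periodically check for dead nodes and kick off re-replication of
	// every chunk they held (rereplicate is a stand-in for the real logic).
	go func() {
		for range time.Tick(5 * time.Second) {
			for _, node := range tracker.reap() {
				fmt.Println("node lost, re-replicating chunks from:", node)
				// rereplicate(node)
			}
		}
	}()

	time.Sleep(30 * time.Second)
}
```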

Storage Nodes:
On startup the Storage Node registers itself with the Controller, at which point it is ready to receive files. This node has a few main functions. First, at retrieval time, the chunk being sent to the client is checked against its stored hash to detect corruption. If corruption is detected, it is promptly repaired, and the chunk is then sent to the client with minimal delay. Another function is the heartbeat it sends to the Controller to let it know that this node is still accessible. In the event of a network outage or a crash, the Controller is able to maintain the availability of the files associated with that Storage Node by assigning new replica locations. Lastly, when a new chunk is received and this node was the first Storage Node to receive it, the node forwards the chunk to its assigned replica locations to maintain the expected replication level.
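The integrity check at retrieval time could be as simple as re-hashing the chunk and comparing it to the checksum recorded when the chunk was first stored. The sketch below assumes SHA-256 and a locally recorded expected hash; the project's actual hash algorithm and on-disk layout may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// verifyChunk re-hashes a chunk on disk and compares it to the checksum
// recorded when the chunk was written.
func verifyChunk(chunkPath, expectedHex string) (bool, error) {
	data, err := os.ReadFile(chunkPath)
	if err != nil {
		return false, err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]) == expectedHex, nil
}

func main() {
	// Placeholder values: the path and expected hash would come from the
	// chunk metadata written at store time.
	const chunkPath = "/data/chunks/photo.jpg.0"
	const expectedHex = "<sha256-hex-recorded-at-write-time>"

	ok, err := verifyChunk(chunkPath, expectedHex)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	if !ok {
		// Corruption detected: fetch a clean replica (coordinated by the
		// Controller) before serving the chunk to the client.
		fmt.Println("chunk corrupted; requesting clean replica")
	}
}
```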

Burn-in Testing

(Figure: a typical and normally functioning server)

Testing is perhaps the most important part of this project. To be sure my system is reliable, I need proof. A long burn-in test should cover long-term reliability.

In order to achieve this goal I made a plan.

Plan:
Due to the distributed nature of the system it would be best to test it on physically separated networks. The current plan is to distribute the Storage Nodes among the homes of friends and family.

Physical Infrastructure:
The humble Raspberry Pi is my weapon of choice. It is both power-efficient and affordable, which makes it an attractive option. In particular I will try to use CM4 modules where possible, with M.2 SATA or NVMe storage drives, as SD cards do not have great write endurance. However, it might be interesting to have one node on an SD card with the expectation of it failing.

Virtual Infrastructure:
Each component should be able to run in Docker, which would provide a consistent runtime environment and easy setup.
Eventually it would be nice to be able to roll out new servers with Ansible, but that is a stretch goal.

Project Status

What is working and what isn't

Feature Status
  • POSIX Compatibility
  • Probabilistic Routing
  • Parallel Retrievals
  • Interoperability
  • Asynchronous Scalability
  • Fault Tolerance
  • Health Checking
  • Docker