code: mafs

ref: b3e875d7db45fca0cde9f4efa12872f61afc54d7
dir: /docs/mafs.ms/

View raw version
.\" The current font is \n(.f
.\" The current point size is \n(.s
.\" The current vertical line spacing is \n(.v
.\" The line length is \n(.l
.\" The page length is \n(.p
.\" The page offset is \n(.o
.\" need this RT call else, the subsequent pages are losing the indent
.RT
.ft B
.ce
Mafs - Plan 9 userspace file systems
.ft R
.sp
Mafs wants you to be able to understand it, so you can be self-sufficient and fix a crash at two in the morning or satisfy your need for speed or a feature. This empowerment is priceless for those with skin in the game. It provides a reference implementation for creating use-case optimized filesystems.
.sp
Mafs is a user space file systems to provide system stability and security. It is based on kfs. It maintains copies of all metadata, creates a copy on any data write and writes to the disk lazily. These equate to greater throughput, the prevention of disk fragmentation, and the display of a recovery path on unsafe shutdowns.
.sp
As this document aims to also provide working knowledge, it gratuitously uses the actual commands and the relevant C data structure definitions to convey information.
.sp
.ft B
Mafs Workflow
.ft R
.sp
.PS
right;
{
	Client: box invis height 4*boxht wid 2*boxwid "" "" "" "Chan.aux has" "file offset, etc.";
	"Client" at Client.n
	line from Client.ne to Client.se
}
move 2*boxwid
{
	Multiple: box invis
	{
		" multiple" at Multiple.nw - 0,0.1 ljust
		" workers" at Multiple.sw + 0,0.1 ljust
		line <-> from Client to Multiple.w "9p" above
	}
	move 0.5*boxwid
	Abstractions: box invis "Abstractions"
	{ "Directory" "File" at Abstractions.s }
	move 0.75*boxwid
	Datastructures: box invis "Data Structures"
	{ "Dentry" at Datastructures.s }
	Buffercache: box invis "Buffer cache" "used blocks" with .sw at Datastructures.ne + 0.5i,0
	Extents: box invis "Extents" "free blocks" with .nw at Datastructures.se + 0.5i,0
	move 0.5*boxwid
	Inmemory: box invis "In-memory" "block contents" with .sw at Buffercache.n + 0.5i,0
	down
	move 0.5*boxwid
	Disk: box "Disk" "blocks" height 1.5*boxht with .sw at Extents.se + 0.4i,0
}
	line <-> from Multiple.e to Abstractions.w - 0.1i,0
	line <-> from Abstractions.e + 0.1i,0 to Datastructures.w - 0.2i,0
	line <-> from Datastructures.e + 0,0.1i to Buffercache.w - 0.1i,0
	line <-> from Datastructures.e - 0,0.1i to Extents.w
	line <-> from Buffercache.se + 0.1i,0 to Disk.w
	line <-> from Extents.e to Disk.w
	line <-> from Buffercache.s to Extents.n
	line <-> from Buffercache.ne - 0,0.1i to Inmemory.sw + 0,0.1i
.PE
.sp
.sp
.ft B
Disk Contents
.ft R
.sp
Mafs organizes and saves content on a disk as directories and files, just like any other filesystem.
.sp
The unit of storage is a logical block (not physical sector) of data. Disk space is split into blocks of 512 bytes. A directory entry uses a block and a data block could use upto 2048 blocks (1MiB).
.sp
Keeping the directory entries to a block reduces "snowdust" due to the "best fit" algorithm used for assigning disk blocks and memory.
.sp
.ne 14
A sample disk of 2048 bytes with 4 blocks.
.PS
right
{
	down;
."	{box dashed; box dashed; box dashed; box dashed;}
	box height 4*boxht;
	move 0.2i;
	"disk of" " 2048 bytes"
}
move; move
{
	move 0.5i;
	down;
	{
		Block0: box dashed;
		Block1: box dashed;
		Block2: box dashed;
		Block3: box dashed;
	}
	box height 4*boxht;
	move 0.2i
	"disk of" " 2048 bytes"
	"Block  " at Block0.nw rjust
	"0  " at Block0.w rjust
	"1  " at Block1.w rjust
	"2  " at Block2.w rjust
	"3  " at Block3.w rjust
}
.PE
.sp
A block is stored to the disk with a tag in the first byte and the Qid.path in the last 8 bytes. The different types of blocks on a disk are:
.br
.nf
enum
{
	Tblank = 0,
	Tfree  = 0,	/* free block */
	Tnone  = 0,
	Tdata,		/* actual file contents */
	Tdentry,		/* directory entry, size = Dentrysize */
				/* Tdata & indirect blocks are last, to allow for greater depth */
	Tind0,		/* contains a list of Tdata block numbers for files
					and Tdentry block numbers for directories.*/
	Tind1,		/* contains a list of Tind0 block numbers */
	Tind2,		/* contains a list of Tind1 block numbers */
	Tind3,		/* contains a list of Tind1 block numbers */
	Tind4,		/* contains a list of Tind1 block numbers */
				/* gap for more indirect block depth in future.
					It can be put upto Tind7 without needing any code changes */
	Maxtind,		/* should be Tind0+Niblock */
	MAXTAG = Maxtind,

	Tmaxind = Maxtind - 1,
};
.fi
.sp
Every file or directory is represented on the disk by a directory entry (Dentry). A directory entry uses a  block (tag = Tdentry) and is uniquely identifiable by a Qid.
.sp
A file's contents are stored in the directory entry itself if they are 320 bytes or lesser. A file stores its contents in blocks with a tag of Tdata if the file size is more than that. A directory holds the directory entries of it's children in blocks each with a tag of Tdentry.
.sp
The blocks used by a file or directory entry are listed in their directory entry. As it is not possible to represent big files using the list of blocks available in the directory entry, the blocks are structured to use multiple levels of indirection as the file size increases.
.sp
A file's data blocks are identified by a tag of Tdata and that file's Qid.path. A directory's data blocks are identified by a tag of Tdentry and Qid.path of the child directory entry. (Is this quirky? Should the child's directory entry have a tag with the parent's Qid.path?)
.sp
A block number of zero represents the end of the file's contents. If a file is truncated, the data and indirect blocks are given up and the dentry.dblocks[0] = 0.
.sp
Mafs does not store the last access time of a file or directory.
.ne 20
.sp
A directory entry is defined as:
.nf
enum {
	Blocksize	= 512ULL,	/* minimum data unit size */

	Maxdatablockunits = 2048,
	Nindperblock= (Blocksize-3*sizeof(u64))/sizeof(u64),/* number of pointers per block */
	Nu64perblock= (Blocksize/sizeof(u64)),		/* number of u64's in a block */
	Dpathidx	= (Blocksize/sizeof(u64) -1),	/* index of path in the last data block, last u64 */

	Namelen = 128,	/* maximum length of a file name, calculated manually */
	Ndblock	= 32,	/* number of direct blocks in a Dentry */
	Niblock	= 5,	/* maximum depth of indirect blocks, can increase it to 8 without issues */
};
struct Dentryhdr
{

	u8 tag;
	u8 namelen;
	s16 uid;
	s16 gid;
	s16 muid;		/* 8 */
	u64 size;		/* 0 for directories. For files, size in bytes of the content - 16 */
	u64 pdblkno; 	/* block number of the parent directory entry. Will be 0 for root. - 24 */
	u64 pqpath; 	/* parent qid.path - 32 */
	u64 mtime;		/* modified time in nano seconds from epoch - 40 */
	u64 qpath;		/* unique identifier Qid.path 48 */
	u32 version;	/* Qid.version 52 */
	u32 mode;		/* same bits as defined in /sys/include/libc.h:/Dir\.mode/ - 56 */
	s8 name[Namelen]; /* Namelen = 128 - 184*/
};
struct Datahdr
{
	u8 tag;
	u8 unused;	/* for alignment and future use */
	u16 len;
	u64 dblkno;	/* block number of the directory entry */
};

enum {
	/* size of identifiers used in a Tdata block */
	Ddataidssize = sizeof(Datahdr) +sizeof(u64 /* trailing path */),
	/* max possible size of data that can be stuffed into a Dentry */
	Ddatasize = Blocksize -sizeof(Dentryhdr) -sizeof(u64 /* trailing path */),
	Maxdatablocksize = Maxdatablockunits*Blocksize -Ddataidssize,
};

struct Super
{
	u64 qidgen;	/* generator for unique ids. starts from 1024. */
	u64 fsok;	/* fsck status, using 64 bits to keep it all aligned */
};
struct Dentry
{
	Dentryhdr;
	union
	{
		struct
		{
			u64 dblocks[Ndblock];	/* direct blocks. */
									/* List of Tdata block numbers for files and
										Tdentry block numbers for directories */
			u64 iblocks[Niblock];	/* indirect blocks */
		};
		Super;

		/* when size <= Dentrysize-184-sizeof(Tag), store the data here itself */
		s8 buf[Ddatasize];
	};
	u64 path;	/* same as qid.path */
};
struct Indirect
{
	u8 tagi;	/* the suffix i to avoid union complaining about ambiguous fields */
	u8 pad[7];	/* unused, to align to a multiple of 8 */
	u64 dblkno;	/* block number of the directory entry */
	u64 bufa[Nindperblock];
	u64 pathi;	/* same as qid.path */
};
struct Data		/* used to unmarshall the disk contents */
{
	Datahdr;
	u8 buf[1];	/* upto Maxdatablocksize, followed by u64 qid.path */
	/* u64 path; same as qid.path at the end of the data content - check consistency */
};
.fi
.sp
A directory entry once assigned is not given up until the parent directory is removed. It is zero'ed if the directory entry is removed. It is reused by the next directory entry created under that parent directory. This removes the need for garbage collection of directory entries on removals and also avoids zero block numbers in the middle of a directory entry's list of blocks. A zero block number while traversing a directory entry's dblocks or iblocks represents the end of directory or file contents. When a directory is removed, the parent will have a directory entry with a tag of Tdentry and Qpnone and the rest of the contents set to zero.
.sp
A directory entry is stored in two blocks to have a copy on write. These entries are in consecutive blocks.
.sp
A directory's size is always zero.
.sp
.nf
	; tests/6.sizes
	Blocksize 512 Maxdatablockunits 2048
	Dentryhdr size 184 Ddatasize 320
	Dentry size 512 Namelen 128
	Datahdr size 12 Ddataidssize 20 Maxdatablocksize 1048556
	Namelen 128 Ndblock 32 Niblock 5
	Nindperblock 61 Maxdatablocksize 1048556
	A Tind0 unit points to 1 data blocks (1048556 bytes)
			block points to 61 data blocks
			reli start 32	max 92
			max size 93*Maxdatablocksize = 97515708 bytes	= 92 MiB
	A Tind1 unit points to 61 data blocks (63961916 bytes)
			block points to 3721 data blocks
			reli start 93	max 3813
			max size 3814*Maxdatablocksize = 3999192584 bytes	= 3 GiB
	A Tind2 unit points to 3721 data blocks (3901676876 bytes)
			block points to 226981 data blocks
			reli start 3814	max 230794
			max size 230795*Maxdatablocksize = 242001482020 bytes	= 225 GiB
	A Tind3 unit points to 226981 data blocks (238002289436 bytes)
			block points to 13845841 data blocks
			reli start 230795	max 14076635
			max size 14076636*Maxdatablocksize = 14760141137616 bytes	= 13 TiB
	A Tind4 unit points to 13845841 data blocks (14518139655596 bytes)
			block points to 844596301 data blocks
			reli start 14076636	max 858672936
			max size 858672937*Maxdatablocksize = 900366660128972 bytes	= 818 TiB
.fi
.ne 30
.sp
On an empty mafs filesystem mounted at /n/mafs, the disk contents added by the below commands are:
.nf
mkdir /n/mafs/dir1
echo test > /n/mafs/dir1/file1
.fi
\" from test.9
.PS
right
bigboxht = boxht
fieldht = 0.35*boxht
{
	down
	{ Bound: box height 9*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry qid.path 64"
	box height fieldht invis "name dir1"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "size 0"
	box height fieldht invis "pdblkno 10"
	box height fieldht invis "pqpath 10"
	box height fieldht invis "mtime 1653302180819962729"
	box height fieldht invis "path 64"
	box height fieldht invis "version 0"
	box height fieldht invis "mode 20000000777"
	box height fieldht invis "direct blocks"
	box height fieldht invis "        0 12"
	box height fieldht invis "        1 0"
	box height fieldht invis "        2 0"
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "        30 0"
	box height fieldht invis "        31 0"
	box height fieldht invis "indirect blocks"
	box height fieldht invis "        0 0"
	box height fieldht invis "        1 0"
	box height fieldht invis "        2 0"
	"Block 11 contents: /dir1 Dentry" at Bound.nw + 0,0.1i ljust
	"Representation of a file in a directory: /dir1/file1" ljust at Bound.n + 0,0.3i
}
move 4*boxwid
{
	down
	{ Bound: box height 5*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry qid.path 65"
	box height fieldht invis "name file1"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "size 5"
	box height fieldht invis "pdblkno 11"
	box height fieldht invis "pqpath 64"
	box height fieldht invis "mtime 1653302180823455071"
	box height fieldht invis "path 65"
	box height fieldht invis "version 0"
	box height fieldht invis "mode 666"
	box height fieldht invis "test"
	"Block 12 contents: file1 Dentry" at Bound.nw + 0,0.1i ljust
}
.PE
.sp
.PS
right
Start: {
	down
	{ Bound: box height 8.5*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry qid.path 66"
	box height fieldht invis "name dir2"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "size 0"
	box height fieldht invis "pdblkno 10"
	box height fieldht invis "pqpath 10"
	box height fieldht invis "mtime 1653302180819962729"
	box height fieldht invis "path 66"
	box height fieldht invis "version 0"
	box height fieldht invis "mode 20000000777"
	box height fieldht invis "direct blocks"
	box height fieldht invis "        0 14"
	box height fieldht invis "        1 15"
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "        31 0"
	box height fieldht invis "indirect blocks"
	box height fieldht invis "        0 0"
	box height fieldht invis "        1 0"
	box height fieldht invis "        2 0"
	"Block 13 contents: /dir2 directory entry" at Bound.nw + 0,0.1i ljust
	"Representation of two files in a directory (/dir2/file1 and /dir2/file2)" ljust at Bound.nw + 0.2,0.3i
}
move 4*boxwid
{
	down
	{ Bound: box height 8.5*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry qid.path 67"
	box height fieldht invis "name file1"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "size 5"
	box height fieldht invis "pdblkno 13"
	box height fieldht invis "pqpath 66"
	box height fieldht invis "mtime 1653302180823455071"
	box height fieldht invis "path 67"
	box height fieldht invis "version 0"
	box height fieldht invis "mode 666"
	box height fieldht invis "test"
	"Block 14 contents: file1 directory entry" at Bound.nw + 0,0.1i ljust
}
.PE
.sp
iblocks[0] holds the block number of a Tind0 block. A Tind0 block contains a list of Tdata block numbers for files or a list of Tdentry block numbers for directories.
.sp
iblocks[1] has the block number of a Tind1 block. A Tind1 block contains a list of Tind0 block numbers.
.sp
Similarly, for other iblocks[n] entries, iblocks[n] has the block number of a Tind\fIn\fR block. A Tind\fIn\fR block contains a list of Tind\fI(n-1)\fR block numbers.
.sp
.sp
Relative index
.sp
The zero'th relative index in a directory entry is the first data block. The next relative index is the second data block of the directory entry, and so on.
.sp
tests/6.reli shows how a relative index (reli) is translated into an actual disk block number.
.sp
To find the actual block number where the first block (zero'th as zero indexed) of a file is stored:
.nf
	tests/6.reli 0 # command, below is the output of this command
	reli 0
	dblock[0]
.fi
.sp
To find the actual block number where the second block of a file is stored:
.nf
	tests/6.reli 1
	reli 1
	dblock[1]
.fi
.sp
And so on, for the 32nd and 33rd blocks of a file:
.nf
	tests/6.reli 31
	reli 31
	dblock[31]

	tests/6.reli 32
	reli 32
	iblock[0] tagstartreli 32
	Tind0 reli 0 is at [0] nperindunit 1
.fi
.sp
.PS
right
Start: {
	down
	{ Bound: box height 8.5*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry qid.path 70"
	box height fieldht invis "name 2MB.file"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "size 2056192"
	box height fieldht invis "pdblkno 16"
	box height fieldht invis "pqpath 69"
	box height fieldht invis "mtime 1653302180819962729"
	box height fieldht invis "path 70"
	box height fieldht invis "version 0"
	box height fieldht invis "mode 664"
	box height fieldht invis "direct blocks"
	box height fieldht invis "        0 18"
	box height fieldht invis "        1 2066"
	box height fieldht invis "        2 0"
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "indirect blocks"
	box height fieldht invis "       0 0"
	box height fieldht invis "       1 0"
	box height fieldht invis "       2 0"
	"Block 17 contents" at Bound.nw + 0,0.1i ljust
	"Representation of a 2 MiB file (/dir3/2MB.file)" ljust at Bound.n + 0,0.3i
}
move 4*boxwid
{
	down
	{ Bound: box height 6*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdata qid.path 70 dblkno 17 len 2048"
	box height fieldht invis "0 0123456789"; {"contents of 2MB.file" at last box.e + 1i,0 ljust}
	"Block 18 contents" at Bound.nw + 0,0.1i ljust
}
.PE
\" from test.a
.PS
right
Start: {
	down
	{ Bound: box height 9*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry qid.path 64"
	box height fieldht invis "name big.file"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "size 26214400"
	box height fieldht invis "pdblkno 10"
	box height fieldht invis "pqpath 10"
	box height fieldht invis "mtime 1653302180819962729"
	box height fieldht invis "path 64"
	box height fieldht invis "version 0"
	box height fieldht invis "mode 666"
	box height fieldht invis "direct blocks"
	box height fieldht invis "       0 12"
	box height fieldht invis "       1 2060"
	box height fieldht invis "       2 4108"
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "       31 63500"
	box height fieldht invis "indirect blocks"
	box height fieldht invis "       0 67596"
	box height fieldht invis "       1 192525"
	box height fieldht invis "       2 0"
	"Block 11 contents" at Bound.nw + 0,0.1i ljust
	"Representation of a 106490448 bytes file (/big.file)" ljust at Bound.n + 0,0.3i
}
move 4*boxwid
{
	down
	{ Bound: box height 4*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdata qid.path 64 dblkno 11 len 2048"
	box height fieldht invis "0 0123456789"; {"starting contents" at last box.e + 1i,0 ljust;}
	box height fieldht invis "."; {"of big.file" at last box.e + 1i,0 ljust}
	box height fieldht invis "."
	box height fieldht invis "."
	"Block 12 contents" at Bound.nw + 0,0.1i ljust
}
move to Start - 0,9.5*bigboxht
{
	down
	{ Bound: box height 3*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tind0 qid.path 64 dblkno 11"
	box height fieldht invis "	0 65548"
	box height fieldht invis "	1 67597"
	box height fieldht invis "	2 69645"
	box height fieldht invis "."
	box height fieldht invis "."
	"Block 67596 contents" at Bound.nw + 0,0.1i ljust
}
right
move 4*boxwid
{
	down
	{ Bound: box height 3*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdata qid.path 64 dblkno 11 len 2048"
	box height fieldht invis " 0123456789"; {"more content" at last box.e + 1i,0 ljust}
	box height fieldht invis "."; {"of big.file" at last box.e + 1i,0 ljust}
	box height fieldht invis "."
	"Block 65548 contents" at Bound.nw + 0,0.1i ljust
}
.PE
.sp
.TS
box;
c s
c l
l a .
System Files
=
Block	Description
0	magic dir entry and data
1	/a/config dir entry
2	/a/super dir entry
_
3	/a/ dir entry
4	/a/users/ dir entry
5	/a/bkp/ dir entry
_
6	/a/users/inuse dir entry
7	/a/frees dir entry
8	/a/ctl dir entry -- virtual file, empty contents
_
9	/a/users/staging dir entry
10	/ direntry
_
.TE
.ta 5n 10n 15n 20n 25n 30n 35n 40n 45n 50n 55n 60n 65n 70n 75n 80n
.sp
The /a/ctl file is used to halt or sync the file system. /a/users is a r/w file that will reload users when written to it. The owner of the /a/ctl file or any user belonging to the sys group can ream the disk.
.sp
There is no /a/magic directory entry as the block number of the magic block is zero and zero block in a directory entry signifies the end of the directory contents.
.sp
.sp
.ne 4
Backup blocks
.sp
A copy of Config, Super and Root blocks is maintained. This ensures a backup of config, Super and root blocks.
.sp
The backup block numbers on the disk are calculated during ream based on the disk size.
.sp
.TS
box;
c l c
l a a .
Block	Description	Backup Block
_
1	/a/config	last block number -1
2	/a/super	last block number -2
10	/	last block number -3
.TE
.ta 5n 10n 15n 20n 25n 30n 35n 40n 45n 50n 55n 60n 65n 70n 75n 80n
.sp
Mafs needs atleast Nminblocks=14 blocks (7 KiB).
.fi
.sp
kfs and cwfs use 8192 byte blocks. Hence, they store multiple directory entries (Dentry) per block. They use slot numbers to identify a particular directory entry in a block of directory entries. Mafs avoids that by using 512 byte blocks thus having only one directory entry per block. This avoids locking up other sibling directory entries on access, more compact disk usage and better throughput for large files but also slows down walk()'s as mafs would be using a read() per directory entry.
.sp
.sp
.ne 4
Users
.sp
Users are defined in /a/users/inuse file. Any changes to it are made through the /a/users/staging file. All changes are written to the staging file and then inuse file is updated by writing the command users to the /a/ctl file.
.sp
Either all changes to /a/users/inuse are installed or nothing is installed from the /a/users/staging file.
.sp
The format of /a/users/inuse is described in users(6).
.sp
.sp
.ft B
Buffer cache - Hash buckets with a circular linked list of Iobuf's for collisions.
.ft R
.sp
An Iobuf is used to represent a block in memory. An Iobuf is unique to a block. All disk interaction, except for free block management, happens through an Iobuf. We read a block from the disk into an Iobuf. To update a block on the disk, we write to an Iobuf, which, in-turn gets written to the disk.
.sp
An Iobuf is protected by a read-write lock (RWlock). This ensures synchronization across multiple processes updating the same file.
.sp
getbuf(), putbuf(), putbufs() and putbuffree() are used to manage Iobuf's. The contents of an Iobuf is not touched unless it is locked by getbuf(). It is unlocked by putbuf(), putbufs() or putbuffree() calls. The Iobuf.dirties Ref is decremented by the mafs writer's dowrite() without a lock(). This is to avoid deadlocks between putbuf() and the writer especially when the writer queue is full.
.sp
allocblock() allocates a free block into an Iobuf. allocblocks() allocates a bunch of free blocks with their own Iobuf's.
.sp
freeblock() erases the Iobuf and returns the block to the free block management routines.
.sp
Iobuf's are organized into a list of hash buckets to speed up access.
.sp
.nf
Hiob *hiob = nil;	/* array of nbuckets */
struct Hiob		/* Hash bucket */
{
	Iobuf* link;	/* least recently used Iobuf in the circular linked list */
	QLock;		/* controls access to this hash bucket */
	u64	n;		/* count of Iobuf's in the circular list */
};
struct Iobuf
{
	Ref;
	RWLock;			/* controls access to this Iobuf */
	u64	blkno;		/* block number on the disk, primary key */
	u16	len;		/* number of Units */
	Iobuf *fore;	/* for lru */
	Iobuf *back;	/* for lru */
	union{
		u8	*xiobuf;	/* "real" buffer pointer */
		u64	*xiobuf64;
		Data *io;
		Dentry *d;
		Indirect *i;
	};

	u8 *append;		/* appended data added not yet written to disk */
	u64 appendsize;
	u8 freshalloc;	/* uninitialized blocks on the disk */
	u64 atime;		/* to find old buffers to flush to the disk */
	u8 tag;
	u64 callerpc;	/* when locked, the locker pc for debugging */
};
.fi
.sp
The Iobuf's are arranged into a list of hash buckets. Each bucket points a circular linked list of Iobuf's to handle collisions. If all the Iobuf's in the circular linked list are locked, new Iobuf's are added to this linked list. This circular list is ordered on a least recently used basis. Iobuf's once added to this list are not removed. When an Iobuf is not in the list, the oldest unlocked Iobuf is reused.
.sp
Hiob hiob[nbuckets] is a valid representation of the list of hash buckets. The block number is hashed to arrive at the relevant hash bucket index.
.sp
hiob[hash(block number)].link = Address of Iobuf0, where Iobuf0 is the least recently used Iobuf.
.PS
{
	right
	Iobuf0: box "Iobuf 0"; move
	Iobuf1: box "Iobuf 1"; move
	Iobuf2: box "Iobuf 2"
}
down
move; move
{
	right
	Iobufn: box "Iobuf n"; move
	Iobufn1: box "Iobuf n-1"; move
	Iobufn2: box "Iobuf n-2"
}
arrow from Iobuf0.ne - 0,0.05i to Iobuf1.nw - 0,0.05i
arrow from Iobuf1.sw + 0,0.05i to Iobuf0.se + 0,0.05i
arrow from Iobuf1.ne - 0,0.05i to Iobuf2.nw - 0,0.05i
arrow from Iobuf2.sw + 0,0.05i to Iobuf1.se + 0,0.05i

arrow from Iobufn.ne - 0,0.05i to Iobufn1.nw - 0,0.05i
arrow from Iobufn1.sw + 0,0.05i to Iobufn.se + 0,0.05i
arrow from Iobufn1.ne - 0,0.05i to Iobufn2.nw - 0,0.05i
arrow from Iobufn2.sw + 0,0.05i to Iobufn1.se + 0,0.05i

arrow dashed from Iobuf0.sw + 0.05i,0 to Iobufn.nw + 0.05i,0
arrow dashed from Iobufn.ne - 0.05i,0 to Iobuf0.se - 0.05i,0
arrow dashed from Iobuf2.sw + 0.05i,0 to Iobufn2.nw + 0.05i,0
arrow dashed from Iobufn2.ne - 0.05i,0 to Iobuf2.se - 0.05i,0
.PE
.sp
The size of the buffer cache is: number of hash buckets * collisions per hash bucket * block size. The approximate size of the buffer cache is Nbuckets * Ncollisions * Blocksize * Maxdatablockunits.
.sp
When there is an ups, the ideal memnunits (-m) is disk throughput * Nrefresh * ups duration/2. Without an ups, Without an ups, disk throughput * Nrefresh * 3 seems optimal.
.sp
If you have RAM to spare, increase Nbuckets instead of Ncollisions as the hash index lookup is faster than searching through a linked list.
.sp
Using Maxdatablockunits of 2048 allows bigger data blocks thus providing better throughput for large files but also wastes a lot more memory while writing to small files. If your use case involves requiring more efficient use of memory, setting Maxdatablockunits to 16 keeps the data block sizes down to what cwfs uses.
.sp
Iobuf.Ref is used to avoid locking up the hash bucket when a process is waiting for a lock on an Iobuf in that hash bucket.
.sp
Iobuf.Ref ensures that an Iobuf is not stolen before another process can get to wlock()'ing it after letting go of the lock on the hash bucket. We cannot hold the lock on the hash bucket until we wlock() the iobuf as that blocks other processes from using the hash bucket. This could also result in a deadlock. For example, the directory entry is block 18, which hashes to a hash index of 7. A writer() locked the directory entry iobuf and wants to add a data block 84 to the directory entry. Block 84 hashes to the same hash index of 7. Another process wanting to access the directory entry is waiting for a lock on that io buffer. While doing so, it has locked the hash bucket. Now, this has caused a deadlock between both these processes. The first process cannot proceed until it can lock the hash bucket holding block 84 and is still holding the lock on the directory entry in block 18. The second process cannot lock block 18 and is holding the lock on the hash bucket.
.nf
	for locking a buffer:
		qlock(hash bucket); incref(buffer); qunlock(hash bucket);
			wlock(buffer); decref(buffer);

	for stealing an unused buffer:
		qlock(hash bucket);
		find a buffer with ref == 0 and wlock()'able.
		qunlock(hash bucket);

	for unlocking a buffer:
		wunlock(buffer);
.fi
.sp
.sp
.ne 4
.ft B
Free blocks
.ft R
.sp
Free blocks are managed using Extents. The list of free blocks is stored to the disk when shutting down. If this state is not written, then the file system needs to be checked and the list of free blocks should be updated.
.sp
When shutting down, the Extents are written to free blocks. This information can be accessed from /a/frees. Also, fsok in the super block is set to 1. Mafs does not start until fsok is 1. When fsok = 0, run the sanity check that the unused blocks and the free blocks in /a/frees match up. disk/reconcile identifies any missing blocks or blocks that are marked as both used and free.
.sp
This process of fixing issues and setting fsok to 1 is manual. There is no automatic file system checker as in other file systems. This document aims to empower you with the knowledge to fix your file system issues instead of entrusting your precious data to an arbitrary decision maker such as the file system checker.
.sp
A tag of Tfree and Qpnone represent a free block. If a directory entry is removed, the parent will have a zero'ed out child directory entry (Qid.path = 0) and a tag of Tdentry and Qpnone.
.sp
.sp
.ne 4
.ft B
Extents
.ft R
.sp
This is a more elaborate version of the storage allocator described in Section 8.7 of the K & R C Programming language book. It differs from the allocator described there by being a "Best first fit" instead of a "first fit". Also, instead of a circular linked list of free blocks, we maintain the free blocks in two doubly linked lists, one by start address and another by size and start address. We also maintain a list of recently used blocks to reduce the travel by linear scans.
.sp
Extents is an abstraction used to manage contiguous units. It is responsibility of the caller to manage the size and scaling of each unit.
.sp
Free blocks, memory and long names strings are managed using Extents.
.sp
An Extent represents a continuous list of items. An Extents is a list of such Extent's.
.sp
.nf
struct Extent {
	struct Extent *low, *high;	/* sorted by start */
	struct Extent *small, *big;	/* sorted by the number of blocks in this extent */
	u64 start;					/* where this extent starts from */
	u64 len; 					/* how many units in this extent */

	/* circular least recently used linked list limited to Nlru items */
	struct Extent *prev, *next;
};
struct Extents {
	Extent *lowest;	/* find the first block number in a jiffy */
	QLock lck;
	u64 n;			/* number of extents */
	Rendez isempty; /* fully used, nothing available */

	u8 nlru;		/* number of items in the lru linked list */
	Extent *lru;	/* least recently used extent in the circular lru linked list */
	char name[32];

	void (*flush)(void);
};
.fi
.sp
To allocate n items from Extents, we find the lowest (by block number or memory address) extent that can satisfy our request. If a bigger Extent is available, slice it and take the portion we need.
.sp
If there is no available Extent to satisfy our request, panic().
.sp
allocblock() and freeblock() use balloc() and bfree() respectively. balloc() assigns blocks from an extent and bfree() adds them to an extent for next allocation.
.sp
.PS
# define field { [right; box invis $1 ljust; box invis $2 rjust; down] }
# define field { [right; box $1 ljust; box $2 rjust; down] }
define field { [right; box invis $1; box invis $2; down] }
boxht = 0.5*boxht
down
{
	box invis "Extents at memory location 1"
	Extents: {box ht 3*boxht wid 2*boxwid}
	Lru: field("lru", "100")
	{ "  assuming that the Extent at 100 was used last" ljust at Lru.e }
	El: field("el","0")
	{ "  unlocked" ljust at El.e }
	field("n","3")
}
.PE
.PS
down
move 4*boxht
define extent {
	[
		down
		Extent: {box ht 4*boxht wid 2*boxwid}
		field("blkno", $1)
		Len: field("len",$2)
		{ right line dashed from Len.sw to Len.se }
		field("low",$3)
		High: field("high",$4)
		if $5 > 0 then {
			"Extent at" ljust above at Extent.nw
			"$5" ljust above at Extent.n
		}
	]
}
{
	right
	extent("10", "1", "0", "200", 100);
	move
	extent("20", "3", "100", "300", 200);
	move
	extent("30", "2", "200", "0", 300);
}
down
{
	move boxht*5
	right
	box invis "+"
	box invis width 2 "freed block numbers" "11,12,13,14"
	box invis "="
}
down
move boxht*10
{
	right
	extent("10", "5", "0", "200", 100);
	move
	extent("20", "3", "100", "300", 200);
	move
	extent("30", "2", "200", "0", 300);
}
.PE
.PS
# ../tests/extents/addabove
define delimiter {
	down
	line right 5 dashed
	move down 0.25
}
define headingfield {
	[
		right;
		Blkno: box invis $1;
		Len: box invis $2;
	]
}
define order {
	down
	arrowwid=0.15
	arrowht=0.15
	arrow 0.25i at $1
}
right
Before: [
	down
	Head: headingfield("blkno", "len", Blkno.w)
	{ order(Head.w) }
	field("20", "3")
]
{ "Extents before" above ljust at Before.nw }
[
	right
	box invis "+"
	box invis "Block number 40" "followed" "by 3 free blocks"
	box invis "="
]
move
After: [
	down
	Headb: headingfield("blkno", "len", Blkno.w)
	{ order(Headb.w) }
	field("20", "3")
	field("40", "4")
]
{ "Extents after" above ljust at After.nw }
.PE
.PS
delimiter
.PE
.PS
# ../tests/extents/mergeabove
right
Before: [
	down
	Head: headingfield("blkno", "len", Blkno.w)
	{ order(Head.w) }
	field("100", "5")
	field("110", "3")
]
{ "Extents before" above ljust at Before.nw }
[
	right
	box invis "+"
	box invis "Block number 105" "followed" "by 4 free blocks"
	box invis "="
]
move
After: [
	down
	Headb: headingfield("blkno", "len", Blkno.w)
	{ order(Headb.w) }
	field("100", "13")
]
{ "Extents after" above ljust at After.nw }
.PE
.PS
delimiter
.PE
.PS
# ../tests/extents/mergeprevious
right
Before: [
	down
	Head: headingfield("blkno", "len", Blkno.w)
	{ order(Head.w) }
	field("105", "4")
]
{ "Extents before" above ljust at Before.nw }
[
	right
	box invis "+"
	box invis "Block number 101" "followed" "by 3 free blocks"
	box invis "="
]
move
After: [
	down
	Headb: headingfield("blkno", "len", Blkno.w)
	{ order(Headb.w) }
	field("101", "8")
]
{ "Extents after" above ljust at After.nw }
.PE
.PS
delimiter
.PE
.PS
# ../tests/extents/mergenext
right
Before: [
	down
	Head: headingfield("blkno", "len", Blkno.w)
	{ order(Head.w) }
	field("101", "4")
]
{ "Extents before" above ljust at Before.nw }
[
	right
	box invis "+"
	box invis "Block number 105" "followed" "by 3 free blocks"
	box invis "="
]
move
After: [
	down
	Headb: headingfield("blkno", "len", Blkno.w)
	{ order(Headb.w) }
	field("100", "8")
]
{ "Extents after" above ljust at After.nw }
.PE
.PS
delimiter
.PE
.PS
# ../tests/extents/addabove1
right
Before: [
	down
	Head: headingfield("blkno", "len", Blkno.w)
	{ order(Head.w) }
	field("180", "4")
]
{ "Extents before" above ljust at Before.nw }
[
	right
	box invis "+"
	box invis "Block number 250" "followed" "by 3 free blocks"
	box invis "="
]
move
After: [
	down
	Headb: headingfield("blkno", "len", Blkno.w)
	{ order(Headb.w) }
	field("180", "4")
	field("250", "4")
]
{ "Extents after" above ljust at After.nw }
.PE
.PS
delimiter
.PE
.PS
# ../tests/extents/addbelow
right
Before: [
	down
	Head: headingfield("blkno", "len", Blkno.w)
	{ order(Head.w) }
	field("250", "4")
]
{ "Extents before" above ljust at Before.nw }
[
	right
	box invis "+"
	box invis "Block number 180" "followed" "by 3 free blocks"
	box invis "="
]
move
After: [
	down
	Headb: headingfield("blkno", "len", Blkno.w)
	{ order(Headb.w) }
	field("180", "4")
	field("250", "4")
]
{ "Extents after" above ljust at After.nw }
.PE
.sp
Kfs stores the list of free blocks in a Tfrees block and the Superblock. Instead we use block management routines, similar to pool.h, to allocate and monitor free blocks. On shutdown(), the block management routines (extents.[ch]) store state into the free blocks. This can be read from /a/frees. On startup, this is read back by the block management routines. On a crash, the fsck can walk the directory structure to identify the free blocks and recreate /a/frees.
.sp
.sp
.ne 12
.ft B
Code details
.ft R
.sp
.TS
allbox;
c c
l a .
Program	Description
_
disk/mafs	Start mafs on a disk.
disk/fsck	Check file system on an improper shutdown.
disk/free	List the free blocks. It reads the contents of /a/frees.
disk/used	List the used blocks by traversing all directory entries.
disk/block	Show the contents of a block.
disk/unused	Lists the unused blocks when given extents of used blocks.
disk/updatefrees	Update the contents of /a/frees.
.TE
.sp
.TS
allbox;
c c r
l a r .
File	Description	chatty9p
_
9p.c	9p transactions	2
blk.c	routines to show blocks.
console.c	obsolete. /a/ctl is the console.
ctl.c	/a/ctl operations.
dentry.c	encode/decode the file system abstraction into block operations.	3
extents.[ch]	routines to manage the free blocks.	6
iobuf.c	routines on Iobuf's. The bkp() routines operate on Iobuf's.	5
sub.c	initialization and super block related routines.	2
tag.c	routines to manage a relative index (reli) in a directory entry.
user.c	user management routines.
.TE
.ta 5n 10n 15n 20n 25n 30n 35n 40n 45n 50n 55n 60n 65n 70n 75n 80n
.in 0
.sp
A Chan's state could get out of sync with the contents if another process changes the on-disk state. Ephase error occurs when that happens.
.sp
For throughput, multiple processes are used to service 9p i/o requests when the -s flag is not used.
.sp
.sp
.ne 4
.ft B
Useful commands:
.ft R
.sp
Ream and start single process Mafs on a disk and also mount it for use.
.sp
.nf
	mount -c <{disk/mafs -s -r mafs_myservice mydisk <[0=1]} /n/mafs_myservice
.in 3n
.br
	-s: use stdin and stdout for communication
	-r mafs_myservice: ream the disk using mafs_myservice as the service name
	 mydisk: running mafs on the disk, mydisk
.in 0
.fi
.sp
Ream and start multiple-process mafs on a disk.
.sp
.nf
	disk/mafs -r mafs_myservice mydisk
	mount -c /srv/mafs_myservice /n/mafs_myservice
.fi
.sp
.ne 7
Ream and start mafs on a file. Also, mount thet filesystem at /n/mafs_myservice.
.sp
.nf
	dd -if /dev/zero -of myfile -bs 512 -count 128 # 64KB file
	mount -c <{disk/mafs -s -r mafs_service myfile <[0=1]} /n/mafs_myservice

	# to reuse the contents of myfile later, remove -r (ream) from the above command.
	mount -c <{disk/mafs -s myfile <[0=1]} /n/mafs_myservice
.fi
.sp
Prepare and use a disk (/dev/sdF1) for mafs.
.sp
.nf
	disk/fdisk -bawp /dev/sdF1/data	# partition the disk
	echo '
	a fs 9  $-7
	w
	p
	q' | disk/prep -b  /dev/sdF1/plan9	# add an fs plan 9 partition to the disk
	disk/mafs -r mafs_sdF1 /dev/sdF1/fs	# -r to ream the disk
	mount -c /srv/mafs_sdF1 /n/mafs_sdF1

	# for using the mafs file system on the disk later on
	disk/mafs  /dev/sdF1/fs	# no -r
	mount -c /srv/mafs_sdF1 /n/mafs_sdF1
.fi
.sp
Starting mafs on a 100 MiB file. The below commands create a disk.file to use as a disk. Mount /n/mafs_disk.file for the file system.
.sp
.nf
	dd -if /dev/zero -of disk.file -bs 1m -count 100;
	mount -c <{disk/mafs -s -r mafs_disk.file disk.file <[0=1]} /n/mafs_disk.file
.fi
.sp
Starting mafs on a RAM file. The below commands create a ramfs filesystem to use as a disk.
.sp
.nf
	ramfs -m /n/mafs_ramfs
	touch /n/mafs_ramfs/file
	dd -if /dev/zero -of /n/mafs_ramfs/file -count 100 -bs 1m
	disk/mafs -r mafs_ramfs_file /n/mafs_ramfs/file
	mount -c /srv/mafs_ramfs_file /n/mafs_ramfs_file
.fi
.sp
Sync Mafs. This command does not return until all the writes are written to the disk. So, could take a long time if you have a long writer queue.
.sp
	echo sync >> /n/mafs_myservice/a/ctl
.sp
.sp
.ne 4
Stop Mafs: There are 2 ways to shutdown:
.br
.in 3n
1. Unmount and remove the /srv/mfs_service file (can be rm and unmount too).
.br
2. Write halt into the /a/ctl file. Unmount() the mafs file system to keep it clean.
.in 0
.sp
In the first instance, the srv() process is driving the shutdown. It calls fsend(). rm /srv/mfs_service file does not wait for fsend() to finish. Hence, there is no way to ensure that memory contents have been flushed to the disk. If the system is shutdown or restarted immediately, there is a very high possibility that the filesystem will be in an inconsistent state.
.sp
In the second instance, fsend() is called by the worker process. It does not return until all the pending writes have been flushed to the disk. It also removes the /srv/mafs_service file and also stops the srv() process. Hence, this is the preferred approach to shutting down the file system.
.sp
There is no way to unmount() automatically on shutdown. The mount() and unmount() calls are client driven and it is not the responsibility of the server to find all the clients that mounted it. Just shutdown and let the respective clients deal with their mess.
.sp
The below command does not return until all the writes are written to the disk. So, could take a long time if you have a long writer queue. This is the proper way to shutdown the mafs file system.
.sp
	echo halt >> /n/mafs_myservice/a/ctl
.sp
Interpret the contents of a block based on the tag and write out a single formatted block based on the tag
.sp
	disk/block  tests/test.0/disk 22
.sp
Traverse the directory heirarchy and write out all the used block numbers. disk/reconcile uses the output of this to reconcile the list of used blocks with the list of free blocks. Also, writes the invalid blocks to stderr. Starting from root, walk down each directory entry printing out the linked blocks with invalid tags. (Why not just write out the list of dirty blocks too? instead of using a different command for it?)
.sp
	disk/used  tests/test.0/disk
.sp
From the contents of /a/frees show the list of free blocks. disk/reconcile uses the output of this to reconcile the list of used blocks with the list of free blocks.
.sp
	disk/free  tests/test.0/disk
.sp
Read two lists of block numbers and flag the common and missing blocks.
.sp
.nf
	disk/reconcile -u <{disk/used tests/test.0/disk} \\
		-F <{disk/free  tests/test.0/disk} 32
.fi
.sp
.ne 3
Find traverses the directory heirarchy and identifies the file that a block number belongs to.
.sp
	disk/find tests/test.0/disk 17
.sp
.ne 3
Find the total number of blocks on a disk.
.sp
.nf
	dd -if /dev/sdF1/fs -bs 512 -iseek 2 -count 1 -quiet 1 | awk '$1 == "nblocks" { print $2 }'

	disk/block /dev/sdF1/fs 2 | awk '$1 == "nblocks" { print $2 }'
.fi
.sp
.ne 5
Build the list of free blocks. This should match the contents of /a/frees.
.sp
.nf
	disk/unused 11721040049 <{disk/used /dev/sdF1/fs} # 11721040049 = total number of disk blocks
	disk/unused 32 <{disk/used test.0/disk} # 32 = total number of disk blocks
.fi
.sp
.ne 5
Change the contents of /a/frees.
.sp
.nf
	disk/updatefrees tests/test.0/disk <{disk/unused 32 <{disk/used tests/test.0/disk}}
	disk/updatefrees /dev/sdF1/fs <{disk/unused 11721040049 <{disk/used /dev/sdF1/fs}}
.fi
.sp
.ne 5
A sanity check that the file system is not corrupt by comparing that the unused blocks and free blocks match up. $nblocks is the total number of disk blocks. $disk is the disk.
.sp
.nf
	diff <{disk/unused -l 32 <{disk/used tests/test.0/disk}}} <{disk/free tests/test.0/disk}
.fi
.sp
.ne 5
Check and correct a crashed filesystem.
.sp
.nf
	disk=/dev/sdF1/fs
	nblocks=`{disk/block /dev/sdF1/fs 2 | awk '$1 == "nblocks"{ print $2 }'}
	disk/updatefrees /dev/sdF1/fs <{disk/unused $nblocks <{disk/used /dev/sdF1/fs}
	disk/fsok /dev/sdF1/fs

	disk=/dev/sdF1/fs
	nblocks=`{disk/block $disk 2 | awk '$1 == "nblocks"{ print $2 }'}
	disk/updatefrees $disk <{disk/unused $nblocks <{disk/used $disk}
	disk/fsok $disk
.fi
\".sp
\"Changing the service name without a ream.
\".sp
\".nf
\"	disk/block /dev/sdF1/fs 2 | wc
\"	Tdentry 2 2
\"	name config
\"	uid -1
\"	gid -1
\"	muid -1
\"	size 146
\"	pdblkno 6
\"	pqpath 6
\"	mtime 1669916563259999461
\"	path 2
\"	version 0
\"	mode 444
\"	size 6001172505088
\"	nblocks 11721040049
\"	backup config 2 to 11721040047
\"	backup super 4 to 11721040045
\"	backup root 20 to 11721040043
\"	service m_ddf_1
\"
\"	dd -if /dev/sdF1/fs -count 10 -skip 682 -bs 1
\"	mafs_ddf_110+0 records in
\"	10+0 records out
\"
\"	dd -if <{echo m_ddf_1; cat /dev/zero} -of /dev/sdF1/fs -count 11 -oseek 682 -bs 1
\"	7+0 records in
\"	7+0 records out
\"
\"	disk/block /dev/sdF1/fs 1
\"	Tdata 2
\"	size 6001172505088
\"	nblocks 11721040049
\"	backup config 1 to 11721040048
\"	backup super 2 to 11721040046
\"	backup root 3 to 11721040044
\"	service m_ddf_1
\".fi
\".sp
\"Changing the magic phrase in the magic block. - the offsets and the block output format has changed.
\".sp
\".nf
\"	disk/block /dev/sdF1/fs 0
\"	Tmagic 1
\"	mafs device
\"	512
\"
\"	dd -if /dev/sdF1/fs -count 16 -iseek 256 -bs 1
\"	mafs device
\"	512
\"	20+0 records in
\"	20+0 records out
\"
\"	dd -if <{echo mafs device; echo 512; cat /dev/zero} -of /dev/sdF1/fs -count 18 -oseek 256 -bs 1
\"	18+0 records in
\"	18+0 records out
\"
\"	dd -if /dev/sdF1/fs -count 18 -iseek 256 -bs 1
\"	mafs device
\"	512
\"	18+0 records in
\"	18+0 records out
\"
\"	disk/block /dev/sdF1/fs 0
\"	Tmagic 1
\"	mafs device
\"	512
\".fi
.sp
.sp
.ne 20
.ft B
Tests
.ft R
.sp
.TS
box;
c l
l a .
Program	Description
_
tests/regress.rc	All regression tests
tests/chkextents.rc	Unit tests on extents
tests/chkreli.rc	Unit tests on relative index lookups
tests/chknlastdatablocks.rc	Unit tests on the number of blocks in the last Tdata block
_
tests/6.offsets	Write file using different offsets to test mafswrite()
tests/6.sizes	Show the effects of the different parameters
tests/6.testextents	Test extents.[ch] state changes
tests/6.reli	Translate relative index to block number on a disk
.TE
.sp
The below disk state tests:
.in 3n
.br
.ti 0
1. Initialize a disk for mafs.
.br
.ti 0
2. Run mafs on that dsk.
.br
.ti 0
3. Stop mafs.
.br
.ti 0
4. Compare the contents with the expected contents (tests/test.0/blocks/*).
.in 0
.sp
.TS
box;
c s
c l
l a .
Disk State
=
Test	Description
_
tests/test.0	empty disk
tests/test.1	create a file /dir1/file1 and echo test into it
tests/test.2	writes at different offsets to a file and then removes the file
_
tests/test.3	write, read and delete files with sizes upto 16384 blocks
tests/test.4	directory copy
tests/test.5	fcp gzipped files
_
tests/test.6	df
tests/test.7	multiple processes working on the filesystem simultaneously
tests/test.8	check backup blocks locations
_
tests/test.9	examples used by this document
tests/test.a	write, read and delete a 100MB file
tests/test.b	duplicate of test.2 but seeded with random data
_
tests/test.d	seed with random data and do mkdir -p a/b/c/d/e/f/g/h
tests/test.e	seed with random data and test directory and file deletions
tests/test.f	test restart
.TE
.sp
.TS
box;
c s
c l
l a .
Extents behaviour
=
Test	Description
_
tests/extents/addabove	Figure 1 of the Extents section
tests/extents/addabove1	Figure 2 of the Extents section
tests/extents/addbelow	Figure 3 of the Extents section
_
tests/extents/mergeabove	Figure 4 of the Extents section
tests/extents/mergenext	Figure 5 of the Extents section
tests/extents/mergeprevious	Figure 6 of the Extents section
.TE
.ta 5n 10n 15n 20n 25n 30n 35n 40n 45n 50n 55n 60n 65n 70n 75n 80n
.in 0
.sp
.ne 3
To run all the regression tests:
.br
.nf
	cd tests/; ./regress.rc
.fi
.sp
.ne 3
To loop through all the blocks of a test:
.br
.nf
	for(t in tests/test.2/blocks/^`{seq 0 39}*){ echo $t; echo '---------'; cat $t; echo }
.fi
.sp
.sp
.ft B
Performance metrics
.ft R
.sp
.nf
	ramfs -m /n/ramfs
	touch /n/ramfs/file
	cat /dev/zero | tput -p > /n/ramfs/file
		196.00 MB/s
		198.76 MB/s
		187.58 MB/s
		176.96 MB/s
		175.87 MB/s
		180.42 MB/s
		183.52 MB/s
		185.99 MB/s
		187.96 MB/s
		189.54 MB/s
		190.83 MB/s
		191.89 MB/s
		192.80 MB/s
	dd -if /dev/zero -of /n/ramfs/file -count 700 -bs 1m

	disk/mafs -r mafs_ramfs_file /n/ramfs/file
	mount -c /srv/mafs_ramfs_file /n/mafs_ramfs_file
	cat /dev/zero | tput -p > /n/mafs_ramfs_file/zeros.file # increase memunits for speed
		122.50 MB/s
		122.13 MB/s
		122.27 MB/s
		122.28 MB/s
	echo halt >> /n/mafs_ramfs_file/a/ctl; lc /srv
	unmount /n/mafs_ramfs_file

	dd -if /dev/zero -of /n/ramfs/file -count 700 -bs 1m
	hjfs -f /n/ramfs/file -r
	echo allow >>/srv/hjfs.cmd
	mount -c /srv/hjfs /n/hjfs/
	cat /dev/zero | tput -p > /n/hjfs/zeros.file
		70.85 MB/s
		71.02 MB/s
		70.88 MB/s
		70.62 MB/s
		70.46 MB/s
		70.54 MB/s
		69.74 MB/s
		68.63 MB/s
		67.74 MB/s
		67.02 MB/s
		63.81 MB/s
	echo halt >>/srv/hjfs.cmd
	unmount /n/ramfs

	time disk/used /dev/sdF1/fs > /tmp/used.blocks # on 600 GiB data stored to a SATA disk
	3.45u 30.41s 3201.71r 	 disk/used /dev/sdF1/fs

	ramfs -S ramfs
	mount -c /srv/ramfs /n/ramfs
	touch /n/ramfs/file
	dd -if /dev/zero -of /n/ramfs/file -count 1k -bs 2m

	cd
	disk/mafs -r mafs_ramfs_file /n/ramfs/file
	mount -c /srv/mafs_ramfs_file /n/mafs_ramfs_file
	cd /n/mafs_ramfs_file
	time dd -if /dev/zero -of /n/mafs_ramfs_file/file -count 500 -bs 1m
	time dd -if /n/mafs_ramfs_file/file -of /dev/null -count 500 -bs 1m
	rm /n/mafs_ramfs_file/file
	mkdir /n/mafs_ramfs_file/plan9front
	time clone -p 20:80 /dist/plan9front ./plan9front # kvik's clone which transfers in parallel
	time git/clone /dist/plan9front
	time walk plan9front >/dev/null
	time cat `{walk plan9front}>/dev/null
	time du -sh plan9front
	cd
	echo halt >> /n/mafs_ramfs_file/a/ctl; lc /srv
	unmount /n/mafs_ramfs_file

	hjfs -f /n/ramfs/file -r
	echo allow >>/srv/hjfs.cmd
	mount -c /srv/hjfs /n/hjfs/
	cd /n/hjfs
	time dd -if /dev/zero -of /n/hjfs/file -count 500 -bs 1m
	time dd -if /n/hjfs/file -of /dev/null -count 500 -bs 1m
	rm /n/hjfs/file
	time git/clone /dist/plan9front
	time walk plan9front >/dev/null
	time cat `{walk plan9front}>/dev/null
	time du -sh plan9front
	echo halt >>/srv/hjfs.cmd
	unmount /n/ramfs
	rm /srv/ramfs

hjfs metrics
------------
: mafs_ramfs_file ; 	ramfs -S ramfs
: mafs_ramfs_file ; 	mount -c /srv/ramfs /n/ramfs
: mafs_ramfs_file ; 	touch /n/ramfs/file
: mafs_ramfs_file ; 	dd -if /dev/zero -of /n/ramfs/file -count 1k -bs 2m
write: no memory
384+0 records in
383+0 records out
: mafs_ramfs_file ; 	disk/mafs -r mafs_ramfs_file /n/ramfs/file
: mafs_ramfs_file ; 	mount -c /srv/mafs_ramfs_file /n/mafs_ramfs_file
: mafs_ramfs_file ; 	cd /n/mafs_ramfs_file
: mafs_ramfs_file ; 	time dd -if /dev/zero -of /n/mafs_ramfs_file/file -count 500 -bs 1m
500+0 records in
500+0 records out
0.00u 0.63s 3.99r 	 dd -if /dev/zero -of /n/mafs_ramfs_file/file ...
: mafs_ramfs_file ; 	time dd -if /n/mafs_ramfs_file/file -of /dev/null -count 500 -bs 1m
500+0 records in
500+0 records out
0.00u 0.52s 3.74r 	 dd -if /n/mafs_ramfs_file/file -of /dev/null ...
: mafs_ramfs_file ; 	rm /n/mafs_ramfs_file/file
: mafs_ramfs_file ; 	time git/clone /dist/plan9front
fetching...
indexing 102147 objects: 100%
checking out repository...
85.92u 33.49s 175.44r 	 git/clone /dist/plan9front
: mafs_ramfs_file ; 	time walk plan9front >/dev/null
0.12u 0.28s 4.40r 	 walk plan9front
: mafs_ramfs_file ; 	time cat `{walk plan9front}>/dev/null
0.08u 2.02s 27.96r 	 cat plan9front/.git/fs plan9front/.git/objects/pack/bbf8e5461ce218a8b35073482679f457f2df8c7a.pack plan9front/.git/objects/pack/bbf8e5461ce218a8b35073482679f457f2df8c7a.idx plan9front/.git/objects/pack ...
: mafs_ramfs_file ; 	time du -sh plan9front
302.097M	plan9front
0.06u 0.34s 4.84r 	 du -sh plan9front
: mafs_ramfs_file ; 	cd
: ; 	echo halt >> /n/mafs_ramfs_file/a/ctl; lc /srv
9ramnodefs			cs_net.alt			cwfs_sdE2			ramfs
acme.glenda.1496	cwfs_ddf_1			cwfs_sdE2.cmd		rio.glenda.1400
bcfdfs				cwfs_ddf_1.cmd		dns					rio.glenda.1434
boot				cwfs_ddf_2			dns_net.alt			rio.glenda.33938
clone				cwfs_ddf_2.cmd		factotum			slashmnt
cons				cwfs_sdE1			mntexport			slashn
cs					cwfs_sdE1.cmd		plumb.glenda.1388
: ; 	unmount /n/mafs_ramfs_file
: ; 	hjfs -f /n/ramfs/file -r
hjfs: reaming /n/ramfs/file
hjfs: ream successful
hjfs: fs is /n/ramfs/file
: ; 	echo allow >>/srv/hjfs.cmd
hjfs: allow
: ; 	mount -c /srv/hjfs /n/hjfs/
: ; 	cd /n/hjfs
: hjfs ; 	time dd -if /dev/zero -of /n/hjfs/file -count 500 -bs 1m
500+0 records in
500+0 records out
0.00u 0.67s 6.63r 	 dd -if /dev/zero -of /n/hjfs/file ...
: hjfs ; 	time dd -if /n/hjfs/file -of /dev/null -count 500 -bs 1m
500+0 records in
500+0 records out
0.00u 0.39s 8.23r 	 dd -if /n/hjfs/file -of /dev/null ...
: hjfs ; 	rm /n/hjfs/file
: hjfs ; 	time git/clone /dist/plan9front
fetching...
indexing 102147 objects: 100%
checking out repository...
84.08u 34.35s 165.18r 	 git/clone /dist/plan9front
: hjfs ; 	time walk plan9front >/dev/null
0.04u 0.11s 0.69r 	 walk plan9front
: hjfs ; 	time cat `{walk plan9front}>/dev/null
0.13u 2.08s 16.49r 	 cat plan9front/.git/fs plan9front/.git/objects/pack/bbf8e5461ce218a8b35073482679f457f2df8c7a.pack plan9front/.git/objects/pack/bbf8e5461ce218a8b35073482679f457f2df8c7a.idx plan9front/.git/objects/pack ...
: hjfs ; 	time du -sh plan9front
302.097M	plan9front
0.00u 0.05s 0.62r 	 du -sh plan9front
: hjfs ; 	echo halt >>/srv/hjfs.cmd
: hjfs ; hjfs: ending
	unmount /n/ramfs
: hjfs ; 	rm /srv/ramfs

hjfs metrics
------------
: mafs ; 	hjfs -f /n/ramfs/file -r
hjfs: reaming /n/ramfs/file
hjfs: ream successful
hjfs: fs is /n/ramfs/file
: mafs ; 	echo allow >>/srv/hjfs.cmd
hjfs: allow
: mafs ; 	mount -c /srv/hjfs /n/hjfs/
: mafs ; 	cd /n/hjfs
: hjfs ; 	time git/clone /dist/plan9front
fetching...
indexing 102147 objects: 100%
checking out repository...
85.35u 46.81s 177.90r 	 git/clone /dist/plan9front
: hjfs ; bell
: hjfs ; 	time walk plan9front >/dev/null
0.10u 0.08s 0.68r 	 walk plan9front
: hjfs ; 	time cat `{walk plan9front}>/dev/null
0.09u 1.85s 16.54r 	 cat plan9front/.git/fs plan9front/.git/objects/pack/bbf8e5461ce218a8b35073482679f457f2df8c7a.pack plan9front/.git/objects/pack/bbf8e5461ce218a8b35073482679f457f2df8c7a.idx plan9front/.git/objects/pack ...
: hjfs ; 	echo halt >>/srv/hjfs.cmd
: hjfs ; hjfs: ending
	unmount /n/ramfs
: hjfs ; 

.fi
.sp
Profiling instructions:
.nf
	Set LDFLAGS=-p in the mkfile and install the executables.
	profilesize=2000000
	ramfs -m /n/ramfs
	touch /n/ramfs/file
	dd -if /dev/zero -of /n/ramfs/file -count 700 -bs 1m
	mount -c <{disk/mafs -s -r mafs_disk.file /n/ramfs/file <[0=1]} /n/mafs_ramfs_file
	cat /dev/zero | tput -p > /n/mafs_ramfs_file/zeros.file
		57.94 MB/s
		55.27 MB/s
		46.67 MB/s
	echo halt >> /n/mafs_ramfs_file/a/ctl; lc /srv
.fi
.sp
.sp
.ne 3
.ft B
Limitations
.ft R
.sp
As we use packed structs to store data to the disk, a disk with mafs is not portable to a machine using a different endian system.
.sp
.sp
.ft B
Design considerations
.ft R
.sp
For exclusive use (mode has DMEXCL bit set) files, there is no timeout.
.sp
Use an fs(3) device for RAID or other configuration involving multiple disks.
.sp
Why are you not using a checksum to verify the contents?
.br
Checksums are probabilistic and can be implemented as a bespoke application instead of complicating the file system implementation.
.sp
.sp
.ft B
Source
.ft R
.sp
http://git.9front.org/plan9front/mafs/HEAD/info.html
.sp
.sp
.ft B
References
.ft R
.sp
[1] Sean Quinlan, "A Cached WORM File System," Software--Practice and Experience, Vol 21., No 12., December 1991, pp. 1289-1299
.br
[2] Ken Thompson, Geoff Collyer, "The 64-bit Standalone Plan 9 File Server"
.br
[2] Brian Kernighan, Dennis Ritchie, "The C Programming Language"