code: mafs

ref: a566629485f872d51d4d5661d8b4aa32bd45f3e2
dir: /docs/mafs.ms/

View raw version
.\" The current font is \n(.f
.\" The current point size is \n(.s
.\" The current vertical line spacing is \n(.v
.\" The line length is \n(.l
.\" The page length is \n(.p
.\" The page offset is \n(.o
.\" need this RT call else, the subsequent pages are losing the indent
.RT
.ft B
.ce
Mafs - Plan 9 userspace file system
.ft R
\" .sp
.\" .ft B
.\" Introduction
.\" .ft R
.sp
Mafs wants you to be able to understand it, so you can be self-sufficient and fix a crash at two in the morning or satisfy your desire for speed or a feature. This empowerment is priceless as software literacy rises.
.sp
Mafs is a user space file system to provide system stability and security. It is based on kfs.
.sp
.sp
.ft B
Workflow
.ft R
.sp
.PS
right;
{
	Client: box invis height 4*boxht wid 2*boxwid "" "" "" "Chan.aux has" "file offset, etc.";
	"Client" at Client.n
	line from Client.ne to Client.se
}
move 2*boxwid
{
	Multiple: box invis
	{
		" multiple" at Multiple.nw - 0,0.1 ljust
		" workers" at Multiple.sw + 0,0.1 ljust
		line <-> from Client to Multiple.w "9p" above
	}
	move 0.5*boxwid
	Abstractions: box invis "Abstractions"
	{ "Directory" "File" at Abstractions.s }
	move 0.75*boxwid
	Datastructures: box invis "Data Structures"
	{ "Dentry" "Span" at Datastructures.s }
	Buffercache: box invis "Buffer cache" "used blocks" with .sw at Datastructures.ne + 0.5i,0
	Extents: box invis "Extents" "free blocks" with .nw at Datastructures.se + 0.5i,0
	move 0.5*boxwid
	Disk: box "Disk" "blocks" height 2*boxht with .nw at Buffercache.e + 0.4i,0
}
	line <-> from Multiple.e to Abstractions.w - 0.1i,0
	line <-> from Abstractions.e + 0.1i,0 to Datastructures.w - 0.2i,0
	line <-> from Datastructures.e + 0,0.1i to Buffercache.w - 0.1i,0
	line <-> from Datastructures.e - 0,0.1i to Extents.w
	line <-> from Buffercache.e + 0.1i,0 to Disk.w
	line <-> from Extents.e to Disk.w
	line <-> from Buffercache.s to Extents.n
.PE
.sp
.sp
.ft B
Disk Contents
.ft R
.sp
Mafs organizes and saves content on a disk as directories and files, just like any other filesystem.
.sp
The unit of storage is a logical block (not physical sector) of data. Disk space is split into 512 byte logical blocks. For optimum throughput, file data blocks are logically grouped into Spans.
.sp
A Tag identifies a Span written to the disk. A continuous space of Tag.len blocks makes up each Span. To efficiently manage system memory across numerous users and files, the size of each Span is constrained.
.sp
.ne 14
A sample disk of 2048 bytes with 4 blocks and 2 Spans
.PS
right
{
	down;
."	{box dashed; box dashed; box dashed; box dashed;}
	box height 4*boxht;
	move 0.2i;
	"disk of" " 2048 bytes"
}
move; move
{
	move 0.5i;
	down;
	{
		Block0: box dashed;
		Block1: box dashed;
		Block2: box dashed;
		Block3: box dashed;
	}
	box height 4*boxht;
	move 0.2i
	"disk of" " 2048 bytes"
	"Block  " at Block0.nw rjust
	"0  " at Block0.w rjust
	"1  " at Block1.w rjust
	"2  " at Block2.w rjust
	"3  " at Block3.w rjust
}
move; move; move; move
{
	down;
	{
		Block0: box dashed "Tag.len 2";
		Block1: box dashed; {line from Block1.sw to Block1.se;}
		Block2: box dashed "Tag.len 1"; {line from Block2.sw to Block2.se;}
		Block3: box dashed;
	}
	box height 4*boxht;
	move 0.2i;
	"disk of" " 2048 bytes"
	"Block  " at Block0.nw rjust
	"0  " at Block0.w rjust
	"1  " at Block1.w rjust
	"2  " at Block2.w rjust
	"3  " at Block3.w rjust
	" Span" at Block0.ne ljust
	"  0" at Block0.se ljust
	"  1" at Block2.e ljust
}
.PE
.ne 20
.sp
.nf
The different types of possible Span's on a disk are:
enum
{
	Tfree  = 0,	/* free blocks */
	Tmagic,		/* the first (zero'th) block holds a magic word */
	Tdentry,		/* directory entry */
				/* Tind\fIn\fR are last, to allow for future increases */
	Tdata,		/* file contents */
	Tind0,		/* list of Tdata Spans for files or Tdentry Spans for directories.*/
	Tind1,		/* list of Tind0 blocks */
	Tind2,		/* list of Tind1 blocks */
	Tind3,		/* list of Tind2 blocks. we can have a 11 TB file */
};
.sp
A Span is stored to the disk with a Tag.
.br
struct Tag
{
	u64 path;	/* Qid.path, unique identifier of directory or file */
	u8 type;	/* Tfree, Tmagic, Tdentry, Tdata, Tind\fIn\fR */
	u8 dirty;	/* is 1, when being written to.
				Identifies dirty data on a crash.
				This byte position is denoted by the enum Nthdirty. */
	u16 len;	/* number of blocks in this Span */
};
.fi
.sp
Every file or directory is represented on the disk by a directory entry (Dentry). Every directory entry uses a 1-block Span (Tag.type = Tdentry) and is uniquely identifiable by a Qid.
.sp
Mafs does not store the last access time of a file or directory.
.sp
A Dentry is defined as:
.nf
enum {
	Rawblocksize= 512,	/* real block size */
	Ndspanid	= 24,	/* number of direct Span identifiers in a Dentry */
	Niblock	= 4,		/* max depth of indirect blocks */
};
struct Qid9p1
{
	u32 version;
	u64 path;		/* unique identifier */
};
struct Spanid		/* Span identifier */
{
	u64 blkno;	/* starting block number */
	u16 len;		/* number of blocks */
};
struct Dentry1
{
	Qid9p1	qid;
	u64 size;		/* 0 for directories. For files, size in bytes of the content */
	u64 pdblkno; 	/* parent dentry absolute block number. 0 for root. */
	u64 pqpath; 	/* parent qid.path */
	u64 mtime;	/* modified time nano seconds from epoch */
	u32 mode;	/* same bits as defined in lib.h Dir.mode */
	s16 uid;
	s16 gid;
	s16 muid;
	Spanid dspans[Ndspanid];/* direct Span identifiers */
						/* Tag.type = Tdentry for directories and Tdata for files */
	u64 iblocks[Niblock];	/* indirect blocks */
};

/*
 * derived constants
 * Ndentriesperblock: number of Dentry's per block
 * Nindperblock: number of block pointers per block
 * Nspanidperblock: number of Span identifiers per a Tind0 block
 */
enum {
	Blocksize	= Rawblocksize - sizeof(Tag),
	Namelen	= (Blocksize-sizeof(Dentry1)),		/* max size of file name components */
	Maxspanlen	= MB/Rawblocksize,			/* in blocks */
	Maxspansize	= (Maxspanlen*Rawblocksize)-sizeof(Tag), /* in bytes */

	Ndentryperblock	= 1, 					/* Blocksize / sizeof(Dentry), */
	Nindperblock	= Blocksize / sizeof(u64),
	Nspanidperblock	= Blocksize / sizeof(Spanid),
};
struct Dentry
{
	struct Dentry1;
	char name[Namelen];
};
.fi
.PS
right
bigboxht = boxht
fieldht = 0.35*boxht
{
	down
	{ Bound: box height 15*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry 1 64"
	box height fieldht invis "qid.version 0"
	box height fieldht invis "qid.path 64"
	box height fieldht invis "size 0"
	box height fieldht invis "pdblkno 3"
	box height fieldht invis "pqpath 63"
	box height fieldht invis "mtime 1653302180819962729"
	box height fieldht invis "mode 20000000777"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "direct spans"
	box height fieldht invis "        0 19 1"
	box height fieldht invis "        1 0 0"
	box height fieldht invis "        2 0 0"
	box height fieldht invis "        3 0 0"
	box height fieldht invis "        4 0 0"
	box height fieldht invis "        5 0 0"
	box height fieldht invis "        6 0 0"
	box height fieldht invis "        7 0 0"
	box height fieldht invis "        8 0 0"
	box height fieldht invis "        9 0 0"
	box height fieldht invis "        10 0 0"
	box height fieldht invis "        11 0 0"
	box height fieldht invis "        12 0 0"
	box height fieldht invis "        13 0 0"
	box height fieldht invis "        14 0 0"
	box height fieldht invis "        15 0 0"
	box height fieldht invis "        16 0 0"
	box height fieldht invis "        17 0 0"
	box height fieldht invis "        18 0 0"
	box height fieldht invis "        19 0 0"
	box height fieldht invis "        20 0 0"
	box height fieldht invis "        21 0 0"
	box height fieldht invis "        22 0 0"
	box height fieldht invis "        23 0 0"
	box height fieldht invis "indirect blocks"
	box height fieldht invis "        0 0"
	box height fieldht invis "        1 0"
	box height fieldht invis "        2 0"
	box height fieldht invis "        3 0"
	box height fieldht invis "name dir1"
	"Block 18 contents: /dir1 Dentry" at Bound.nw + 0,0.1i ljust
	"Representation of a file in a directory: /dir1/file1" ljust at Bound.n + 0,0.3i
}
move 4*boxwid
{
	down
	{ Bound: box height 15*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry 1 65"
	box height fieldht invis "qid.version 0"
	box height fieldht invis "qid.path 65"
	box height fieldht invis "size 5"
	box height fieldht invis "pdblkno 18"
	box height fieldht invis "pqpath 64"
	box height fieldht invis "mtime 1653302180823455071"
	box height fieldht invis "mode 666"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "direct spans"
	box height fieldht invis "        0 20 1"; {"content is in Block 20" at last box.e + 1i,0 ljust}
	box height fieldht invis "        1 0 0"
	box height fieldht invis "        2 0 0"
	box height fieldht invis "        3 0 0"
	box height fieldht invis "        4 0 0"
	box height fieldht invis "        5 0 0"
	box height fieldht invis "        6 0 0"
	box height fieldht invis "        7 0 0"
	box height fieldht invis "        8 0 0"
	box height fieldht invis "        9 0 0"
	box height fieldht invis "        10 0 0"
	box height fieldht invis "        11 0 0"
	box height fieldht invis "        12 0 0"
	box height fieldht invis "        13 0 0"
	box height fieldht invis "        14 0 0"
	box height fieldht invis "        15 0 0"
	box height fieldht invis "        16 0 0"
	box height fieldht invis "        17 0 0"
	box height fieldht invis "        18 0 0"
	box height fieldht invis "        19 0 0"
	box height fieldht invis "        20 0 0"
	box height fieldht invis "        21 0 0"
	box height fieldht invis "        22 0 0"
	box height fieldht invis "        23 0 0"
	box height fieldht invis "indirect blocks"
	box height fieldht invis "        0 0"
	box height fieldht invis "        1 0"
	box height fieldht invis "        2 0"
	box height fieldht invis "        3 0"
	box height fieldht invis "name file1"
	"Block 19 contents: file1 Dentry" at Bound.nw + 0,0.1i ljust
}
.PE
.PS
right
Start: {
	down
	{ Bound: box height 8.5*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry 1 66"
	box height fieldht invis "qid.version 0"
	box height fieldht invis "qid.path 66"
	box height fieldht invis "size 0"
	box height fieldht invis "pdblkno 3"
	box height fieldht invis "pqpath 63"
	box height fieldht invis "mtime 1653302180819962729"
	box height fieldht invis "mode 20000000777"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "direct spans"
	box height fieldht invis "        0 22 1"
	box height fieldht invis "        1 24 1"
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "        23 0 0"
	box height fieldht invis "indirect blocks"
	box height fieldht invis "        0 0"
	box height fieldht invis "."
	box height fieldht invis "        3 0"
	box height fieldht invis "name dir2"
	"Block 21 contents: /dir2 Dentry" at Bound.nw + 0,0.1i ljust
	"Representation of two files in a directory (/dir2/file1 and /dir2/file2)" ljust at Bound.nw + 0.2,0.3i
}
move 4*boxwid
{
	down
	{ Bound: box height 8.5*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry 1 67"
	box height fieldht invis "qid.version 0"
	box height fieldht invis "qid.path 67"
	box height fieldht invis "size 5"
	box height fieldht invis "pdblkno 21"
	box height fieldht invis "pqpath 66"
	box height fieldht invis "mtime 1653302180823455071"
	box height fieldht invis "mode 666"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "direct spans"
	box height fieldht invis "        0 23 1"
	box height fieldht invis "        1 0 0"
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "        23 0 0"
	box height fieldht invis "indirect blocks"
	box height fieldht invis "        0 0"
	box height fieldht invis "."
	box height fieldht invis "        3 0"
	box height fieldht invis "name file1"
	"Block 22 contents: file1 Dentry" at Bound.nw + 0,0.1i ljust
}
down
move 9*bigboxht
{
	down
	{ Bound: box height 8.5*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry 1 68"
	box height fieldht invis "qid.version 0"
	box height fieldht invis "qid.path 68"
	box height fieldht invis "size 5"
	box height fieldht invis "pdblkno 21"
	box height fieldht invis "pqpath 66"
	box height fieldht invis "mtime 1653302180823455071"
	box height fieldht invis "mode 666"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "direct spans"
	box height fieldht invis "        0 25 1"
	box height fieldht invis "        1 0 0"
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "        23 0 0"
	box height fieldht invis "indirect blocks"
	box height fieldht invis "        0 0"
	box height fieldht invis "."
	box height fieldht invis "        3 0"
	box height fieldht invis "name file2"
	"Block 24 contents: file2 Dentry" at Bound.nw + 0,0.1i ljust
}
.PE
.PS
right
Start: {
	down
	{ Bound: box height 6.5*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry 1 70"
	box height fieldht invis "qid.version 0"
	box height fieldht invis "qid.path 70"
	box height fieldht invis "size 2056192"
	box height fieldht invis "pdblkno 26"
	box height fieldht invis "pqpath 69"
	box height fieldht invis "mtime 1653302180819962729"
	box height fieldht invis "mode 20000000777"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "direct spans"
	box height fieldht invis "        0 28 2048"
	box height fieldht invis "        1 2076 1969"
	box height fieldht invis "        2 0 0"
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "name 2MB.file"
	"Block 27 contents" at Bound.nw + 0,0.1i ljust
	"Representation of a 2 MB file (/dir3/2MB.file)" ljust at Bound.n + 0,0.3i
}
move 4*boxwid
{
	down
	{ Bound: box height 6*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdata 2048 70"
	box height fieldht invis "0 0123456789"; {"contents of 2MB.file" at last box.e + 1i,0 ljust}
	"Block 28 contents" at Bound.nw + 0,0.1i ljust
}
.PE
.PS
right
Start: {
	down
	{ Bound: box height 9*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdentry 1 72"
	box height fieldht invis "qid.version 0"
	box height fieldht invis "qid.path 72"
	box height fieldht invis "size 26214400"
	box height fieldht invis "pdblkno 4045"
	box height fieldht invis "pqpath 71"
	box height fieldht invis "mtime 1653302180819962729"
	box height fieldht invis "mode 20000000777"
	box height fieldht invis "uid 10006"
	box height fieldht invis "gid -1"
	box height fieldht invis "muid 10006"
	box height fieldht invis "direct spans"
	box height fieldht invis "       0 8123 2048"
	box height fieldht invis "       1 10171 2048"
	box height fieldht invis "       2 12219 2048"
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "       23 57249 2048"
	box height fieldht invis "indirect blocks"
	box height fieldht invis "       0 22434"
	box height fieldht invis "       1 0"
	box height fieldht invis "       2 0"
	box height fieldht invis "       3 0"
	box height fieldht invis "name 25MB.file"
	"Block 4046 contents" at Bound.nw + 0,0.1i ljust
	"Representation of a 25MB file (/dir4/25MB.file)" ljust at Bound.n + 0,0.3i
}
move 4*boxwid
{
	down
	{ Bound: box height 4*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdata 2048 72"
	box height fieldht invis "0 0123456789"; {"starting contents" at last box.e + 1i,0 ljust;}
	box height fieldht invis "."; {"of 25MB.file" at last box.e + 1i,0 ljust}	
	box height fieldht invis "."
	box height fieldht invis "."
	"Block 8123 contents" at Bound.nw + 0,0.1i ljust
}
move to Start - 0,9.5*bigboxht
{
	down
	{ Bound: box height 6*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tind0 1 72"
	box height fieldht invis "	0 59297 2048"
	box height fieldht invis "	1 18363 1"
	box height fieldht invis "	2 0 0"
	box height fieldht invis "."
	box height fieldht invis "."
	box height fieldht invis "	49 0 0"
	"Block 22434 contents" at Bound.nw + 0,0.1i ljust
}
right
move 4*boxwid
{
	down
	{ Bound: box height 4*bigboxht width 3.3*boxwid }
	move 0.1i
	box height fieldht invis "Tdata 2048 72"
	box height fieldht invis "6789"; {"more content" at last box.e + 1i,0 ljust}
	box height fieldht invis "."; {"of 25MB.file" at last box.e + 1i,0 ljust}
	box height fieldht invis "."
	"Block 59297 contents" at Bound.nw + 0,0.1i ljust
}
.PE
.sp
kfs and cwfs do not use Spans. They use blocks but with a size of 8192 bytes to avoid dealing with small data blocks (less system calls, context switches, etc). Hence, they store multiple directory entries (Dentry) per block. They use slot numbers to identify a particular directory entry in a block of directory entries. Instead, we use variable length Span's for Tdata. All other types of data use 1 block Span's. This makes us use bigger sized content for data blocks (lesser system calls, context switches, etc.).
.sp
iblocks[0] contains the block number of a Tind0 Span(1 block). A Tind0 Span is a list of Span identifiers with the block numbers of Tdata Span's for files and Tdentry Span's for directories.
.sp
iblocks[1] contains the block number of a Tind1 Span(1 block). A Tind1 Span is a list of block numbers of Tind0 blocks.
.sp
Similarly, for other iblocks[n] entries, iblocks[n] contains the block number of a Tind\fIn\fR Span(1 block). A Tind\fIn\fR Span is a list of block numbers of Tind\fI(n-1)\fR blocks.
.sp
The Tag.dirty flag is set while a Span is being written. This helps identify dirty Span's after a crash.
.sp
To increase read and write throughput, all Tdata allocations will be Span's (Block number + len). The maximum Span length is 1MB (Maxspanlen blocks). Only the last span can be less than 1MB size.
.sp
A directory entry once assigned is not given up until the parent directory is removed. It is zero'ed if the directory entry is removed. It is reused by the next directory entry created under that parent directory. This removes the need for garbage collection of directory entries on removals and also avoids zero block numbers in the middle of a directory. A zero block number while traversing a directory's dspanids or iblocks represents the end of directory or file contents. When a directory is removed, the parent will have a directory entry with a tag of Tdentry and Qpnone and the rest of the contents set to zero.
.sp
A directory's size is always zero.
.sp
A file's data blocks are identified by a tag of Tdata and Qid.path. A block number of zero represents the end of the file's contents. If a file is truncated, the data and indirect blocks are given up and the dentry.dspanids[0] = (Spanid){0,0}.
.sp
The directory entry is locked with a read-write lock (RWlock) for any file operations. This ensures synchronization across multiple processes updating the same file.
.sp
Why is the Span's len stored in the directory entry when the same information can be obtained from the block's Tag.len?
.br
.in 3n
.ti 0
1. This avoids an extra read call for the Tag.len before getting the contents from the disk. Instead, we can read the contents with one read call as we know the number of blocks to read. Also, it works as a cross-checking mechanism if the Tag gets overwritten.
.br
.ti 0
2. Code is simpler when we store the length of the Span in the directory entry.
.in 0
.sp
.TS
box;
c s s s
c l c c
l a a a .
Disk State after ream - System Files
=
Block	Description	Directory entry	Content
_
0	magic
1	config		Y
2	super		Y
3	/	Y
_
4	/adm/	Y
5	/adm/config 	Y
6	/adm/super 	Y
_
7	/adm/users 	Y
8	/adm/users		Y
_
9	/adm/bkp/ 	Y
10	/adm/bkp/config.0 	Y
11	/adm/bkp/super.0 	Y
12	/adm/bkp/root.0 	Y
13	/adm/bkp/config.1 	Y
14	/adm/bkp/super.1 	Y
15	/adm/bkp/root.1 	Y
_
16	/adm/ctl (virtual file, empty contents)	Y
17	/adm/frees 	Y
.TE
.ta 5n 10n 15n 20n 25n 30n 35n 40n 45n 50n 55n 60n 65n 70n 75n 80n
.sp
The /adm/ctl file is used to halt or sync. /adm/users is a r/w file that will reload users when written to it. The owner of the /adm/ctl file or any user belonging to the sys group can ream the disk.
.sp
There is no /adm/magic as the block number of the magic block is zero and zero block in a directory entry signifies the end of the directory contents.
.sp
.sp
.ne 4
Backup blocks
.sp
Three copies of Config, Super and Root blocks are maintained. This ensures two backups of config, Super and root blocks.
.sp
The backup block numbers on the disk are calculated during ream based on the disk size.
.sp
.TS
box;
c l c s
c l c c
l a a a .
		Backup Blocks
Block	Description	1	2
_
1	config	last block number -1	middle block number -1
2	super block (obsolete?)	last block number -2	middle block number -2
3	/	last block number -3	middle block number -3
.TE
.ta 5n 10n 15n 20n 25n 30n 35n 40n 45n 50n 55n 60n 65n 70n 75n 80n
.sp
Mafs needs atleast Nminblocks=17 blocks (8.5 KB). The middle block number is Nminblocks + ((nblocks - Nminblocks)/2), where nblocks = total number of blocks.
.fi
.sp
.sp
.ft B
Buffer cache - Hash buckets with circular linked list of Iobuf's for collisions.
.ft R
.sp
An Iobuf is used to represent a Span in memory. An Iobuf is unique to a Span. All disk interaction, except for free block management, happens through an Iobuf. We read Span's from the disk into an Iobuf. To update Span's on the disk, we write to an Iobuf, which, in-turn gets written to the disk.
.sp
getbuf() and putbuf() are used to manage Iobuf's. The contents of an Iobuf is not touched unless it is locked between getbuf() and putbuf() calls.
.sp
allocblocks() allocates free blocks into an Iobuf.
.sp
freeblocks() erases the Iobuf and returns the blocks to the free block management routines.
.sp
Iobuf's are organized into a list of hash buckets to speed up access.
.sp
.nf
struct Hiob		/* Hash bucket */
{
	Iobuf* link;	/* least recently used Iobuf in the circular linked list */
	QLock;
};
struct Iobuf
{
	Ref;
	RWLock;
	u64	blkno;	/* block number on the disk, primary key */
	u16	len;		/* number of blocks of data xiobuf points to */
	Iobuf *fore;	/* for lru */
	Iobuf *back;	/* for lru */
	union{
		u8	*xiobuf;	/* "real" buffer pointer */
		Content *io;	/* cast'able to contents */
	}
	int	flags;
};
.fi
.sp
The Iobuf's are arranged into a hash structure of Nbuckets. Each bucket has a circular linked list of Ncollisions' Iobuf's to handle collisions. If all the Iobuf's in the circular linked list are locked, new Iobuf's are added to this linked list. This circular list is ordered on a least recently used basis. Iobuf's once added to this list are not removed. When an Iobuf is not in the list, the oldest unlocked Iobuf is reused.
.sp
Hiob hiob[nbuckets] is a valid representation of the list of hash buckets. The block number of the Span is hashed to arrive at the relevant hash bucket index.
.sp
hiob[hash(block number)].link = Address of Iobuf0, where Iobuf0 is the least recently used Iobuf.
.PS
{
	right
	Iobuf0: box "Iobuf 0"; move
	Iobuf1: box "Iobuf 1"; move
	Iobuf2: box "Iobuf 2"
}
down
move; move
{
	right
	Iobufn: box "Iobuf n"; move
	Iobufn1: box "Iobuf n-1"; move
	Iobufn2: box "Iobuf n-2"
}
arrow from Iobuf0.ne - 0,0.05i to Iobuf1.nw - 0,0.05i
arrow from Iobuf1.sw + 0,0.05i to Iobuf0.se + 0,0.05i
arrow from Iobuf1.ne - 0,0.05i to Iobuf2.nw - 0,0.05i
arrow from Iobuf2.sw + 0,0.05i to Iobuf1.se + 0,0.05i

arrow from Iobufn.ne - 0,0.05i to Iobufn1.nw - 0,0.05i
arrow from Iobufn1.sw + 0,0.05i to Iobufn.se + 0,0.05i
arrow from Iobufn1.ne - 0,0.05i to Iobufn2.nw - 0,0.05i
arrow from Iobufn2.sw + 0,0.05i to Iobufn1.se + 0,0.05i

arrow dashed from Iobuf0.sw + 0.05i,0 to Iobufn.nw + 0.05i,0
arrow dashed from Iobufn.ne - 0.05i,0 to Iobuf0.se - 0.05i,0
arrow dashed from Iobuf2.sw + 0.05i,0 to Iobufn2.nw + 0.05i,0
arrow dashed from Iobufn2.ne - 0.05i,0 to Iobuf2.se - 0.05i,0
.PE
.sp
The size of the buffer cache is approximately: number of hash buckets * collisions per hash bucket * Maximum Span size. By default, the approximate size of the buffer cache = Nbuckets * Ncollisions * Maxspansize = 256 * 10 * 1MB = 2.56GB. The -h parameter can be used to change the number of hash buckets.
.\" .sp
.\" What is the need for Iobuf.Ref?
.\" I have two threads that are writing to the same file. The file's directory entry is in block 18.
.sp
.sp
.ne 4
.ft B
Free blocks - Extents
.ft R
.sp
Free blocks are managed using Extents. The list of free blocks is stored to the disk when shutting down. If this state is not written, then the file system needs to be checked and the list of free blocks should be updated.
.sp
When shutting down, the Extents are written to free blocks. This information is written to /adm/frees. Also, fsok in the super block is set to 1. When fsok = 0, run an fsck (filesystem checker) to correct the issue.
.sp
A tag of Tfree and Qpnone represents a free block. If a directory entry is removed, the parent will have a zero'ed out child directory entry (Qid.path = 0) and a tag of Tdentry and Qpnone.
.sp
Algorithm to allocate blocks from Extents:
.in 3n
.br
.ti 0
1. Of all the Extents with the length we need, pick the Extent with the lowest block number (blkno).
.br
.ti 0
2. If no Extent of the length we need is available, then break up the smallest extent.
.in 0
.sp
.nf
struct Extent {
	struct Extent *low, *high;	/* sorted by blkno */
	struct Extent *small, *big;/* sorted by len */
	u64 blkno;			/* block number */
	u64 len; 				/* in blocks */
};
struct Extents {
	Extent *lru;	/* least recently used extent */
	QLock el;
	u32 n;		/* number of extents */
};
.fi
.sp
allocblocks() and freeblocks() use balloc() and bfree(). balloc() assigns blocks and bfree() holds them for next allocation.
.sp
.PS
# define field { [right; box invis $1 ljust; box invis $2 rjust; down] }
# define field { [right; box $1 ljust; box $2 rjust; down] }
define field { [right; box invis $1; box invis $2; down] }
boxht = 0.5*boxht
down
{
	box invis "Extents at memory location 1"
	Extents: {box ht 3*boxht wid 2*boxwid}
	Lru: field("lru", "100")
	{ "  assuming that the Extent at 100 was used last" ljust at Lru.e }
	El: field("el","0")
	{ "  unlocked" ljust at El.e }
	field("n","3")
}
.PE
.PS
down
move 4*boxht
define extent {
	[
		down
		Extent: {box ht 6*boxht wid 2*boxwid}
		field("blkno", $1)
		Len: field("len",$2)
		{ right line dashed from Len.sw to Len.se }
		field("low",$3)
		High: field("high",$4)
		{ right line dashed from High.sw to High.se }
		field("small", $5)
		field("big", $6)
		if $7 > 0 then {
			"Extent at" ljust above at Extent.nw
			"$7" ljust above at Extent.n
		}
	]
}
{
	right
	extent("10", "1", "0", "200", "0", "300", 100);
	move
	extent("20", "3", "100", "300", "300", "0", 200);
	move
	extent("30", "2", "200", "0", "100", "200", 300);
}
down
{
	move boxht*5
	right
	box invis "+"
	box invis width 2 "freed block numbers" "11,12,13,14"
	box invis "="
}
down
move boxht*10
{
	right
	extent("10", "5", "0", "200", "200", "0", 100);
	move
	extent("20", "3", "100", "300", "300", "100", 200);
	move
	extent("30", "2", "200", "0", "200", "100", 300);
}
.PE
.PS
# ../tests/extents/addabove
define delimiter {
	down
	line right 5 dashed
	move down 0.25
}
define headingfield {
	[
		right;
		Blkno: box invis $1;
		Len: box invis $2;
	]
}
define order {
	down
	arrowwid=0.15
	arrowht=0.15
	arrowhead=7
	arrow 0.25i at $1
}
right
Before: [
	down
	Head: headingfield("blkno", "len", Blkno.w)
	{ order(Head.w) }
	field("20", "3")
]
{ "Extents before" above ljust at Before.nw }
[
	right
	box invis "+"
	box invis "Block number 40" "followed" "by 3 free blocks"
	box invis "="
]
move
After: [
	down
	Headb: headingfield("blkno", "len", Blkno.w)
	{ order(Headb.w) }
	field("20", "3")
	field("40", "4")
	move
	Headl: headingfield("blkno", "len", Len.e)
	{ order(Headl.e) }
	field("20", "3")
	field("40", "4")
]
{ "Extents after" above ljust at After.nw }
.PE
.PS
delimiter
.PE
.PS
# ../tests/extents/mergeabove
right
Before: [
	down
	Head: headingfield("blkno", "len", Blkno.w)
	{ order(Head.w) }
	field("100", "5")
	field("110", "3")
]
{ "Extents before" above ljust at Before.nw }
[
	right
	box invis "+"
	box invis "Block number 105" "followed" "by 4 free blocks"
	box invis "="
]
move
After: [
	down
	Headb: headingfield("blkno", "len", Blkno.w)
	{ order(Headb.w) }
	field("100", "13")
	move
	Headl: headingfield("blkno", "len", Len.e)
	{ order(Headl.e) }
	field("100", "13")
]
{ "Extents after" above ljust at After.nw }
.PE
.PS
delimiter
.PE
.PS
# ../tests/extents/mergeprevious
right
Before: [
	down
	Head: headingfield("blkno", "len", Blkno.w)
	{ order(Head.w) }
	field("105", "4")
]
{ "Extents before" above ljust at Before.nw }
[
	right
	box invis "+"
	box invis "Block number 101" "followed" "by 3 free blocks"
	box invis "="
]
move
After: [
	down
	Headb: headingfield("blkno", "len", Blkno.w)
	{ order(Headb.w) }
	field("101", "8")
	move
	Headl: headingfield("blkno", "len", Len.e)
	{ order(Headl.e) }
	field("101", "8")
]
{ "Extents after" above ljust at After.nw }
.PE
.PS
delimiter
.PE
.PS
# ../tests/extents/mergenext
right
Before: [
	down
	Head: headingfield("blkno", "len", Blkno.w)
	{ order(Head.w) }
	field("101", "4")
]
{ "Extents before" above ljust at Before.nw }
[
	right
	box invis "+"
	box invis "Block number 105" "followed" "by 3 free blocks"
	box invis "="
]
move
After: [
	down
	Headb: headingfield("blkno", "len", Blkno.w)
	{ order(Headb.w) }
	field("100", "8")
	move
	Headl: headingfield("blkno", "len", Len.e)
	{ order(Headl.e) }
	field("100", "8")
]
{ "Extents after" above ljust at After.nw }
.PE
.PS
delimiter
.PE
.PS
# ../tests/extents/addabove1
right
Before: [
	down
	Head: headingfield("blkno", "len", Blkno.w)
	{ order(Head.w) }
	field("180", "4")
]
{ "Extents before" above ljust at Before.nw }
[
	right
	box invis "+"
	box invis "Block number 250" "followed" "by 3 free blocks"
	box invis "="
]
move
After: [
	down
	Headb: headingfield("blkno", "len", Blkno.w)
	{ order(Headb.w) }
	field("180", "4")
	field("250", "4")
	move
	Headl: headingfield("blkno", "len", Len.e)
	{ order(Headl.e) }
	field("180", "4")
	field("250", "4")
]
{ "Extents after" above ljust at After.nw }
.PE
.PS
delimiter
.PE
.PS
# ../tests/extents/addbelow
right
Before: [
	down
	Head: headingfield("blkno", "len", Blkno.w)
	{ order(Head.w) }
	field("250", "4")
]
{ "Extents before" above ljust at Before.nw }
[
	right
	box invis "+"
	box invis "Block number 180" "followed" "by 3 free blocks"
	box invis "="
]
move
After: [
	down
	Headb: headingfield("blkno", "len", Blkno.w)
	{ order(Headb.w) }
	field("180", "4")
	field("250", "4")
	move
	Headl: headingfield("blkno", "len", Len.e)
	{ order(Headl.e) }
	field("180", "4")
	field("250", "4")
]
{ "Extents after" above ljust at After.nw }
.PE
.sp
Kfs stores the list of free blocks in a Tfrees block and the Superblock. Instead we use block management routines, similar to pool.h, to allocate and monitor free blocks. On shutdown(), the block management routines (extents.[ch]) store state into the free blocks. This can be read from /adm/frees. On startup, this is read back by the block management routines. On a crash, the fsck can walk the directory structure to identify the free blocks and recreate /adm/frees.
.sp
.sp
.ne 4
.ft B
Code details
.ft R
.sp
.TS
allbox;
c l r
l a r .
File	Description	chatty9p
_
9p.c	9p transactions	2
sub.c	initialization and super block related routines.	2
dentry.c	encode/decode the file system abstraction into block operations.	3
iobuf.c	routines on Iobuf's. The bkp() routines operate on Iobuf's.	5
extents.[ch]	routines to manage the free blocks.	6
ctl.c	/adm/ctl operations.
tag.c	routines to convert from a relative index in a directory entry to a tag.
blk.c	routines to show blocks.
console.c	obsolete. /adm/ctl is the console.
.TE
.ta 5n 10n 15n 20n 25n 30n 35n 40n 45n 50n 55n 60n 65n 70n 75n 80n
.in 0
.sp
A Chan's state could get out of sync with the contents if another process changes the on-disk state. Ephase error occurs when that happens.
.sp
For throughput, multiple processes are used to service 9p i/o requests.
.sp
.sp
.ne 4
.ft B
Useful commands:
.ft R
.sp
Ream and start single process Mafs on a disk and also mount it for use.
.sp
.nf
	mount -c <{disk/mafs -s -r mafs_myservice -h 10  mydisk  <[0=1]} /n/mafs_myservice
.in 3n
.br
	-s: use stdin and stdout for communication
	-r myservice: ream the disk using mafs_myservice as the service name
	-h 10: use 10 hash buckets
	 mydisk: running Mafs on the mydisk
.in 0
.fi
.sp
Ream and start multiple-process mafs on a disk.
.sp
.nf
	disk/mafs -r mafs_myservice -h 10  mydisk
	mount -c /srv/mafs_myservice /n/mafs_myservice
.fi
.sp
Ream and start mafs on a file. Also, mount thet filesystem at /n/mafs_myservice.
.sp
.nf
	dd -if /dev/zero -of myfile -bs 512 -count 128 # 64KB file
	mount -c <{disk/mafs -s -r mafs_service -h 10  myfile <[0=1]} /n/mafs_myservice

	# for reusing the contents of myfile later, remove -r (ream).
	mount -c <{disk/mafs -s -h 10  myfile <[0=1]} /n/mafs_myservice
.fi
.sp
Prepare and use a disk (/dev/sdF1) for mafs.
.sp
.nf
	disk/fdisk -bawp /dev/sdF1/data	# partition the disk
	echo '
	a fs 9  $-7
	w
	p
	q' | disk/prep -b  /dev/sdF1/plan9	# add an fs plan 9 partition to the disk
	disk/mafs -r mafs_sdF1 /dev/sdF1/fs	# -r to ream the disk
	mount -c /srv/mafs_sdF1 /n/mafs_sdF1

	# for using the mafs file system on the disk later on
	disk/mafs  /dev/sdF1/fs sdF1	# no -r
	mount -c /srv/mafs_sdF1 /n/mafs_sdF1
.fi
.sp
Stop Mafs.
.sp
	echo halt >> /n/mafs_myservice/adm/ctl
.sp
Interpret the contents of a block based on the tag and write out a single formatted block based on the tag
.sp
	disk/block  tests/test.0/disk 22
.sp
Traverse the directory heirarchy and write out all the used block numbers. disk/reconcile uses the output of this to reconcile the list of used blocks with the list of free blocks. Also, writes the invalid blocks to stderr. Starting from root, walk down each directory entry printing out the linked blocks with invalid tags. Why not just write out the list of dirty blocks too? instead of using a different command for it?
.sp
	disk/used  tests/test.0/disk
.sp
From the contents of /adm/frees show the list of free blocks. disk/reconcile uses the output of this to reconcile the list of used blocks with the list of free blocks
.sp
	disk/free  tests/test.0/disk
.sp
Read two lists of block numbers and flag the common and missing blocks.
.sp
.nf
	disk/reconcile -u <{disk/used  tests/test.0/disk} \\
		-F <{disk/free  tests/test.0/disk} 32
.fi
.sp
Find traverses the heirarchy and identifies the file that the block number belongs to.
.sp
	disk/find  disk.file blocknumber
.sp
.ti 0
.sp
Starting mafs on a 2MB byte file. The above commands will create a disk.file to use as a disk. Mount /n/mafs_disk.file for the file system.
.nf
.sp
	dd -if /dev/zero -of disk.file -bs 512 -count 4096;
.br
	mount -c <{disk/mafs -s -r mafs_disk.file -m 1 -n mafs_disk.file \\
		  <[0=1]} /n/mafs_disk.file
.fi
.sp
tests/sizes.c shows the maximum file size representable by a Dentry.
.br
.nf
.in 3n
: tests ; ./6.sizes
Namelen 174 Ndspanid 24 Niblock 4
Blocksize 500 Nspanidperblock 50 Nindperblock 62
Maxspanlen 2048 Maxspansize 1048564
A Tind0 unit points to 1 data spans (1048564 bytes)
	block points to 50 data spans
A Tind1 unit points to 50 data spans (52428200 bytes)
	block points to 3100 data spans
A Tind2 unit points to 3100 data spans (3250548400 bytes)
	block points to 192200 data spans
A Tind3 unit points to 192200 data spans (201534000800 bytes)
	block points to 11916400 data spans
sizeof(Dentry1) 326 Namelen 174
maxsize direct spans max 24
maxsize Tind0 50 max 74
maxsize Tind1 3100 max 3174
maxsize Tind2 192200 max 195374
maxsize Tind3 11916400 max 12111774
maximum possible spans 12111774
	(12111774*Maxspansize = 12699970192536 bytes)
	(12111774*Maxspansize = 12699970192536 bytes = 11 TB)
.fi
.in 0
.sp
.sp
.ne 6
.ft B
Tests
.ft R
.sp
tests/regress.rc: Regression tests.
.sp
tests/6.offsets: Write file using different offsets to test mafswrite().
.sp
tests/6.sizes: Show the effects of the different parameters.
.sp
tests/chkextents.rc: Unit tests on extents.
.sp
tests/6.testextents: Test extents.[ch] state changes.
.sp
The below disk state tests:
.in 3n
.br
.ti 0
1. Initializes a disk for mafs.
.br
.ti 0
2. Run mafs on that dsk.
.br
.ti 0
3. Stop mafs.
.br
.ti 0
4. Compares the contents with the expected contents (tests/test.0/blocks/*).
.in 0
.sp
.TS
box;
c s
c l
l a .
Disk State
=
Test	Description
_
tests/test.0	empty disk
tests/test.1	create a file /dir1/file1 and echo test into it
tests/test.2	writes at different offsets to a file and then removing the file
_
tests/test.3	write, read and deletion of files with sizes upto 16384 blocks
tests/test.4	directory copy
tests/test.5	fcp gzipped files
_
tests/test.6	df
tests/test.7	multiple processes working on the filesystem simultaneously
tests/test.8	check backup blocks locations
_
tests/test.9	examples used by this document
tests/test.a	write, read and delete a 100MB file
.TE
.sp
.TS
box;
c s
c l
l a .
Extents behaviour
=
Test	Description
_
tests/extents/addabove	Figure 1 of the Extents section
tests/extents/addabove1	Figure 2 of the Extents section
tests/extents/addbelow	Figure 3 of the Extents section
_
tests/extents/mergeabove	Figure 4 of the Extents section
tests/extents/mergenext	Figure 5 of the Extents section
tests/extents/mergeprevious	Figure 6 of the Extents section
.TE
.sp
.nf
To loop through all the blocks of a test:
.in 3n
.br
for(t in tests/test.2/blocks/^`{seq 0 39}*){ echo $t; echo '---------'; cat $t; echo }
.in 0
.fi
.sp
.sp
.ft B
Limitations
.ft R
.sp
As we use packed structs to store data to the disk, a disk with mafs is not be portable to a machine using a different endian system.
.sp
.sp
.ft B
References
.ft R
.sp
[1] Sean Quinlan, "A Cached WORM File System," Software--Practice and Experience, Vol 21., No 12., December 1991, pp. 1289-1299
.br
[2] Ken Thompson, Geoff Collyer, "The 64-bit Standalone Plan 9 File Server"