Operating System: Three Easy Pieces 讲义 NJU: 操作系统 - 蒋炎岩>
储存介质 (Non-Volatile Memory)
Name | Year | 原理 | Remarks |
---|---|---|---|
Magnetic Tape | 1928 | 磁比特数组 | 无法进行高效的随机读写 |
Magnetic Drum | 1932 | 并行版磁带 | |
Hard Disk | 1956 | 二维并行版磁带 | 柱面(platter)+磁道(track)+扇区(sector) |
Floppy Disk | 1971 | 读写头与盘片分离 | floppy drive 每个电脑一份 |
Compact Disk | 1980 | 1bit = 挖坑/填坑 | CD-ReWritable: 激光改变反射率 |
Solid State Drive | 1991 | NAND Flash | NAND(x,y)=$!(x&y)$ |
FTL(Flash translation layer) | 运行操作系统和闪存管理程序 | ||
USB Flash Disk | 1999 | 放电做不到100%干净 |
Hard Disk Time
- RPM: rotations per minute
- 外部半径大
- track skew
- multi-zoned
- cache (track buffer)
Name | Description | 经典数值 |
---|---|---|
rotation delay | rotate on track | 2ms |
seek time | acceleration+coasting+deceleration+settling(most, 0.5-2ms) | 4ms |
transfer time | read/write data | 30 ms |
tatol delay | rotation delay + seek time(0.5-2ms) + transfer time | |
I/O rate Random | Transfer size/total delay | 0.66MB/s(SCSI) 0.31MB/s(SATA) |
I/O rate Sequential | 125MB/s(SCSI) 105MB/s(SATA) |
disk scheduler
Name | Remarks |
---|---|
SJF | length of each job is usually unknown |
SSTF(Shortest Seek Time First) | Disk infomation may not be known to host |
NBF(nearest block first) | |
SCAN(elevator algorithm) | moves back and forth across the disk servicing re- quests in order across the tracks |
SPTF(shortest positioning time first) | seek time + rotation delay |
IO设备与驱动
访问指令
Name | Description |
---|---|
Port IO (PIO) | 为I/O设备提供了一个单独的地址空间,通过读/写端口的方式实现设备控制(状态、控制、数据寄存器) |
Memory-Mapped I/O (MMIO) | 给特定的内存地址赋予特殊的含义,从而读/写内存地址就能实现设备的访问 |
设备管理
Method | Description |
---|---|
Polling | polling and wait device read->send->polling and wait device finished |
interrupt | send data->do sth else->receive ITR->run ISR |
DMA | send info to DMA->do sth else->receive ITR->run ISR |
- coalescing: an interrupt-based optimzation
- a device which needs to raise an interrupt first waits for a bit before deliv- ering the interrupt to the CPU
- multiple interrupts can be coalesced into a single interrupt delivery
- ISR: interrupt service routine(interrupt handler)
设备驱动程序
为每一类设备统一接口访问
Category | Examples |
---|---|
字符设备 | 键盘、鼠标、终端、显卡 |
块设备 | 存储设备(磁盘、SSD) |
网络设备 | 网卡 |
设备举例
Name | Description |
---|---|
键盘控制器 | 缓冲区 |
磁盘控制器 | 通过状态、控制、数据访问,可配置为中断模式 |
显卡(GPU) | 协处理器;有专门的核心、指令等用于图像显示 |
DMA | 实现 memcpy() 的I/O设备;通过总线占用内存宽带 |
网卡 | 网络设备:内存,DMA,中断 |
- DMA(Direct Memery Access)
- 没有DMA时的运行顺序
- CPU 给设备发送少量命令(快)
- CPU 传送数据(慢)
- CPU 等待设备完成(并行)
- 有DMA时的运行
- 给 DMA 发送命令(快)
- DMA 传送数据(DMA 与 CPU 并行)
- DMA 等待设备完成 (DMA 与设备并行)
- CPU 等待 DMA 完成(DMA 与 CPU 并行)
- 没有DMA时的运行顺序
- 网络设备驱动
- send(): 发送包
- TX:发送数据
- RX:接受数据
- poll(): 询问是否收到包
- bufctl(): 设置 ring buffer
- send(): 发送包
- 每当需要做一件耗费CPU太多时间的专有工作,就可以增加一个I/O设备
- 图形渲染->GPU
- 大规模数据传送->DMA
设备总线 (I/O bus)
Category | Name | Description |
---|---|---|
General I/O Bus | PCI | |
Peripheral I/O Bus | SCSI | |
SATA | ||
USB |
文件系统
文件系统是
- 储存设备虚拟化(Virtualization)
- 保存在持久储存上的数据结构
- 管理操作系统内部对象的中间层
- 文件描述符 = 指向对象的指针
- read/write/ioctl系统调用 = 对象访问
文件系统实现
- 文件系统的设计
- balloc/bfree 的实现
- 虚拟磁盘的数据结构
- inode 的表示和储存
- 目录文件的数据结构
- 文件系统的实现 = 实现数据结构的查询/修改操作
- 虚拟文件系统 VFS
- 文件系统缓存
- static partitioning: approx 10%
- dynamic partition
API
Category | Name | Description |
---|---|---|
管理文件 | inode number | low-level name of a file |
open() /openat() |
`int fd=open(“foo”, O_CREAT | |
fsync() |
fsync:所有数据;fdatasync:数据;mync | |
fstat() |
||
管理目录 | isdir() |
|
opendir /closedir() |
DIR *dp = opendir(".") |
|
readdir() |
struct dirent *d = readdir(dp) |
|
fchmod() /fchown() |
||
rmdir() |
||
Link | hard link | same inode |
soft link | symbolic link,dangling reference,ln -s |
|
Filesystem | mkfs |
|
mount |
Filesystem List
Category | Name | Intro |
---|---|---|
Windows | FAT | File Allocation Table, FAT12,FAT16,FAT32 |
exFAT | ||
NTFS | New Technology File System | |
ReFS | B+-tree, CoW(Copy-on-Write) | |
macOS | HFS+ | |
APFS | B-Tree | |
Linux | Ext4 | Ext2, Ext3 |
ReiserFS | storing a huge number of small files | |
XFS | ||
JFS | JSF1, JSF2, also for clustered file systems | |
Btrfs | B-tree | |
Unix | JFS | |
Solaris | ZFS | also for clusterd file systems, COW |
Clustered file systems | GFS | Gobal File System of Red Hat Linux |
VMFS | Virtual Machine File System of VMware | |
Apple Xsan | ||
NetApp | WAFL | snapshots |
Others | LFS | Log-structured File SYstems |
Filesystem Hierarchy Standard
Name | Function |
---|---|
/boot | 系统启动数据 |
/dev | 设备文件 |
/sbin | 系统程序 |
/etc | 配置文件 |
/home | 用户目录 |
/lib | 库文件 |
/media | 可移动设备 |
/var | 可变文件 (logs, snapshots) |
/tmp | 临时文件 |
/usr | 用户程序 (/usr/bin/, /usr/lib/, /usr/local/ |
可靠性
- 硬件:a disk fails;a block fails
- 软件:一致性
RAID (Redundant Arrays of inexpensive disks)
Fail-stop model: a disk can be working or failed
RAIN Level | RAID-0 | RAIN-1 | RAID-4 | RAID-5 |
---|---|---|---|---|
Realization | striping | mirroring | parity-based | rotate parity |
Remark | chunk size | additive/subtractive | ||
Capacibility | $N*B$ | $(N*B)/2$ | $(N-1)*B$ | $(N-1)*B$ |
Reliability | 0 | 1 | 1 | 1 |
Best case | $N/2$ | |||
Throught | ||||
Sequential Read | $N*S$ | $(N/2)*S$ | $(N-1)*S$ | $(N-1)*S$ |
Sequential Write | $N*S$ | $(N/2)*S$ | $(N-1)*S$ | $(N-1)*S$ |
Random Read | $N*R$ | $N*R$ | $(N-1)*S$ | $N*R$ |
Random Write | $N*R$ | $(N/2)*R$ | $\frac{1}{2}R$ | $\frac{N}{4}R$ |
Latency | ||||
Read | $T$ | $T$ | $T$ | $T$ |
Write | $T$ | $T$ | $2T$ | $2T$ |
完整性
single-block failures
Errors | Description | Frequency(Cheap/Costly) | Detection |
---|---|---|---|
LSEs(latent-sector errors) | sector(s) damaged | 9.40%/1.40% | error correcting code(ECC) |
Corruption | silent faults like misdirected write | 0.50%/0.05% | checksum (say a 4KB block) |
Lost Writes | read-after-write/checksum elsewhere |
一致性
- 持久化:原子性丧失
- FAT: FAT[b]=EOF,data[b]=DATA,FAT[f]=b
- ext2: DMAP,inode,DATA 都需要更新
- 恢复文件系统一致性:FSCK(Filesystem Checking)
- 崩溃一致性:Journaling
- 利用 sync() 保证顺序性:磁盘会优化写入顺序
- ext3 journaling (data journaling)
- journal write: physical logging
- TxB,I[v2],B[v2],Db->Journal
- sync()
- journal commit
- TxD->Journal
- sync()
- Checkpoint
- I[v2],B[v2],Db->Disk
- journal write: physical logging
- Metadata Journaling (ordered journaling)
- Data write
- Db->Disk
- sync()
- Journal metadata write: logical logging
- TxB,I[v2],B[v2]->Journal
- sync()
- Journal commit
- TxE->Journal
- sync()
- Checkpoint metadata
- I[v2],B[v2]->Disk
- Free
- mark the transaction free in journal superblock
- Data write
- 崩溃恢复
- 用一个额外的指针维护journal完成的时刻
- 从指针开始,向后重做journal中记录的操作
- 其它方法
- Soft Updates: orders all writes to the file sys- tem to ensure that the on-disk structures are never left in an inconsis- tent state
- copy on write(COW): never overwrites files or directories in place
- LFS: writing to disk sequentially
- write buffering
- $D=\frac{F}{1-F}*R_{peak}*T_{position}$
- Checkpoint Region->Inode Map->Inode->[SS,D]
- recursive update problem
- garbage collection problem