Wim Vanderbauwhede
Jeremy Singer
Operating Systems
Foundations
with Linux on the Raspberry Pi
TEXTBOOK
Arm Educaon Media is an imprint of Arm Limited, 110 Fulbourn Road, Cambridge, CBI 9NJ, UK
Copyright © 2019 Arm Limited (or its aliates). All rights reserved.
No part of this publicaon may be reproduced or transmied in any form or by any means, electronic
or mechanical, including photocopying, recording or any other informaon storage and retrieval
system, without permission in wring from the publisher, except under the following condions:
Permissions
You may download this book in PDF format for personal, non-commercial use only.
You may reprint or republish portions of the text for non-commercial, educational or research purposes but only if there is an attribution to Arm Education.
This book and the individual contributions contained in it are protected under copyright by the
Publisher (other than as may be noted herein). Nothing in this license grants you any right to modify
the whole, or portions of, this book.
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience
broaden our understanding, changes in research methods and professional practices may become
necessary.
Readers must always rely on their own experience and knowledge in evaluating and using any
information, methods, project work, or experiments described herein. In using such information or
methods, they should be mindful of their safety and the safety of others, including parties for whom
they have a professional responsibility.
To the fullest extent permitted by law, the publisher and the authors, contributors, and editors shall
not have any responsibility or liability for any losses, liabilities, claims, damages, costs or expenses
resulting from or suffered in connection with the use of the information and materials set out in this
textbook.
Such information and materials are protected by intellectual property rights around the world and are
copyright © Arm Limited (or its affiliates). All rights are reserved. Any source code, models or other
materials set out in this textbook should only be used for non-commercial, educational purposes (and/or
subject to the terms of any license that is specified or otherwise provided by Arm). In no event shall
purchasing this textbook be construed as granting a license to use any other Arm technology or know-how.
ISBN: 978-1-911531-21-0
Version: 1.0.0 – PDF
For information on all Arm Education Media publications, visit our website at www.armedumedia.com
To report errors or send feedback please email [email protected]
Contents

Foreword
Disclaimer
Preface
About the Authors
Acknowledgments
1. A Memory-centric system model
1.1 Overview
1.2 Modeling the system
1.2.1 The simplest possible model
1.2.2 What is this 'system state'?
1.2.3 Refining non-processor actions
1.2.4 Interrupt requests
1.2.5 An important peripheral: the timer
1.3 Bare-bones processor model
1.3.1 What does the processor do?
1.3.2 Processor internal state: registers
1.3.3 Processor instructions
1.3.4 Assembly language
1.3.5 Arithmetic logic unit
1.3.6 Instruction cycle
1.3.7 Bare-bones processor model
1.4 Advanced processor model
1.4.1 Stack support
1.4.2 Subroutine calls
1.4.3 Interrupt handling
1.4.4 Direct memory access
1.4.5 Complete cycle-based processor model
1.4.6 Caching
1.4.7 Running a program on the processor
1.4.8 High-level instructions
1.5 Basic operating system concepts
1.5.1 Tasks and concurrency
1.5.2 The register file
1.5.3 Time slicing and scheduling
1.5.4 Privileges
1.5.5 Memory management
1.5.6 Translation look-aside buffer (TLB)
1.6 Exercises and questions
1.6.1 Task scheduling
1.6.2 TLB model
1.6.3 Modeling the system
1.6.4 Bare-bones processor model
1.6.5 Advanced processor model
1.6.6 Basic operating system concepts
2. A Praccal view of the Linux System
2.1 Overview
30
2.2 Basic concepts
30
2.2.1 Operang system hierarchy
31
2.2.2 Processes
31
2.2.3 User space and kernel space
32
2.2.4 Device tree and ATAGs
32
2.2.5 Files and persistent storage
32
Paron
32
File system
33
2.2.6 ‘Everything is a le’
33
2.2.7 Users
34
2.2.8 Credenals
34
2.2.9 Privileges and user administraon
35
2.3 Boong Linux on the Arm (Raspberry Pi 3)
36
2.3.1 Boot process stage 1: Find the bootloader
36
2.3.2 Boot process stage 2: Enable the SDRAM
36
2.3.3 Boot process stage 3: Load the Linux kernel into memory
37
2.3.4 Boot process stage 4: Start the Linux kernel
37
2.3.4 Boot process stage 5: Run the processor-independent kernel code
37
2.3.5 Inializaon
37
2.3.6 Login
38
2.4 Kernel administraon and programming
38
2.4.1 Loadable kernel modules and device drivers
38
2.4.2 Anatomy of a Linux kernel module
39
2.4.3 Building a custom kernel module
41
2.4.4 Building a custom kernel
42
Contents
vii
2.5 Kernel administraon and programming
42
2.5.1 Process management
42
2.5.2 Process scheduling
43
2.5.3 Memory management
43
2.5.4 Concurrency and parallelism
43
2.5.5 Input/output
43
2.5.6 Persistent storage
43
2.5.7 Networking
44
2.6 Summary
44
2.7 Exercises and quesons
44
2.7.1 Installing Raspbian on the Raspberry Pi 3
44
2.7.2 Seng up SSH under Raspbian
44
2.7.3 Wring a kernel module
44
2.7.4 Boong Linux on the Raspberry Pi
45
2.7.5 Inializaon
45
2.7.6 Login
45
2.7.7 Administraon
45
3. Hardware architecture
3.1 Overview
3.2 Arm hardware architecture
3.3 Arm Cortex M0+
3.3.1 Interrupt control
3.3.2 Instruction set
3.3.3 System timer
3.3.4 Processor mode and privileges
3.3.5 Memory protection
3.4 Arm Cortex A53
3.4.1 Interrupt control
3.4.2 Instruction set
Floating-point and SIMD support
3.4.3 System timer
3.4.4 Processor mode and privileges
3.4.5 Memory management unit
Translation look-aside buffer
Additional caches
3.4.6 Memory system
L1 Cache
L2 Cache
Data cache coherency
3.5 Address map
3.6 Direct memory access
3.7 Summary
3.8 Exercises and questions
3.8.1 Bare-bones programming
3.8.2 Arm hardware architecture
3.8.3 Arm Cortex M0+
3.8.4 Arm Cortex A53
3.8.5 Address map
3.8.6 Direct memory access
4. Process management
4.1 Overview
4.2 The process abstraction
4.2.1 Discovering processes
4.2.2 Launching a new process
4.2.3 Doing something different
4.2.4 Ending a process
4.3 Process metadata
4.3.1 The /proc file system
4.3.2 Linux kernel data structures
4.3.3 Process hierarchies
4.4 Process state transitions
4.5 Context switch
4.6 Signal communications
4.6.1 Sending signals
4.6.2 Handling signals
4.7 Summary
4.8 Further reading
4.9 Exercises and questions
4.9.1 Multiple choice quiz
4.9.2 Metadata mix
4.9.3 Russian doll project
4.9.4 Process overload
4.9.5 Signal frequency
4.9.6 Illegal instructions
5. Process scheduling
5.1 Overview
5.2 Scheduling overview: what, why, how?
5.2.1 Definition
5.2.2 Scheduling for responsiveness
5.2.3 Scheduling for performance
5.2.4 Scheduling policies
5.3 Recap: the process lifecycle
5.4 System calls
5.4.1 The Linux syscall(2) function
5.4.2 The implications of the system call mechanism
5.5 Scheduling principles
5.5.1 Preemptive versus non-preemptive scheduling
5.5.2 Scheduling policies
5.5.3 Task attributes
5.6 Scheduling criteria
5.7 Scheduling policies
5.7.1 First-come, first-served (FCFS)
5.7.2 Round-robin (RR)
5.7.3 Priority-driven scheduling
5.7.4 Shortest job first (SJF) and shortest remaining time first (SRTF)
5.7.5 Shortest elapsed time first (SETF)
5.7.6 Priority scheduling
5.7.7 Real-time scheduling
5.7.8 Earliest deadline first (EDF)
5.8 Scheduling in the Linux kernel
5.8.1 User priorities: niceness
5.8.2 Scheduling information in the task control block (TCB)
5.8.3 Process priorities in the Linux kernel
Priority info in task_struct
Priority and load weight
5.8.4 Normal scheduling policies: the completely fair scheduler (CFS)
5.8.5 Soft real-time scheduling policies
5.8.6 Hard real-time scheduling policy
Time budget allocation
5.8.7 Kernel preemption models
5.8.8 The red-black tree in the Linux kernel
Creating a new rbtree
Searching for a value in a rbtree
Inserting data into a rbtree
Removing or replacing existing data in a rbtree
Iterating through the elements stored in a rbtree (in sort order)
Cached rbtrees
5.8.9 Linux scheduling commands and API
Normal processes
Real-time processes
5.9 Summary
5.10 Exercises and questions
5.10.1 Writing a scheduler
5.10.2 Scheduling
5.10.3 System calls
5.10.4 Scheduling policies
5.10.5 The Linux scheduler
6. Memory management
6.1 Overview
6.2 Physical memory
6.3 Virtual memory
6.3.1 Conceptual view of memory
6.3.2 Virtual addressing
6.3.3 Paging
6.4 Page tables
6.4.1 Page table structure
6.4.2 Linux page tables on Arm
6.4.3 Page metadata
6.4.4 Faster translation
6.4.5 Architectural details
6.5 Managing memory over-commitment
6.5.1 Swapping
6.5.2 Handling page faults
6.5.3 Working set size
6.5.4 In-memory caches
6.5.5 Page replacement policies
Random
Not recently used (NRU)
Clock
Least recently used
Tuning the system
6.5.6 Demand paging
6.5.7 Copy on Write (CoW)
6.5.8 Out of memory killer
6.6 Process view of memory
6.7 Advanced topics
6.8 Further reading
6.9 Exercises and questions
6.9.1 How much memory?
6.9.2 Hypothetical address space
6.9.3 Custom memory protection
6.9.4 Inverted page tables
6.9.5 How much memory?
6.9.6 Tiny virtual address space
6.9.7 Definitions quiz
7. Concurrency and parallelism
7.1 Overview
7.2 Concurrency and parallelism: definitions
7.2.1 What is concurrency?
7.2.2 What is parallelism?
7.2.3 Programming model view
7.3 Concurrency
7.3.1 What are the issues with concurrency?
Shared resources
Exchange of information
7.3.2 Concurrency terminology
Critical section
Synchronization
Deadlock
7.3.3 Synchronization primitives
7.3.4 Arm hardware support for synchronization primitives
Exclusive operations and monitors
Shareability domains
7.3.5 Linux kernel synchronization primitives
Atomic primitives
Memory operation ordering
Memory barriers
Spin locks
Futexes
Kernel mutexes
Semaphores
7.3.6 POSIX synchronization primitives
Mutexes
Semaphores
Spin locks
Condition variables
7.4 Parallelism
7.4.1 What are the challenges with parallelism?
7.4.2 Arm hardware support for parallelism
7.4.3 Linux kernel support for parallelism
SMP boot process
Load balancing
Processor affinity control
7.5 Data-parallel and task-parallel programming models
7.5.1 Data parallel programming
Full data parallelism: map
Reduction
Associativity
Binary tree-based parallel reduction
7.5.2 Task parallel programming
7.6 Practical parallel programming frameworks
7.6.1 POSIX Threads (pthreads)
7.6.2 OpenMP
7.6.3 Message passing interface (MPI)
7.6.4 OpenCL
7.6.5 Intel threading building blocks (TBB)
7.6.6 MapReduce
7.7 Summary
7.8 Exercises and questions
7.8.1 Concurrency: synchronization of tasks
7.8.2 Parallelism
8. Input/output
8.1 Overview
8.2 The device zoo
8.2.1 Inspect your devices
8.2.2 Device classes
8.2.3 Trivial device driver
8.3 Connecting devices
8.3.1 Bus architecture
8.4 Communicating with devices
8.4.1 Device abstractions
8.4.2 Blocking versus non-blocking IO
8.4.3 Managing IO interactions
Polling
Interrupts
Direct memory access
8.5 Interrupt handlers
8.5.1 Specific interrupt handling details
8.5.2 Install an interrupt handler
8.6 Efficient IO
8.7 Further reading
8.8 Exercises and questions
8.8.1 How many interrupts?
8.8.2 Comparative complexity
8.8.3 Roll your own Interrupt Handler
8.8.4 Morse Code LED Device
9. Persistent storage
9.1 Overview
9.2 User perspective on the file system
9.2.1 What is a file?
9.2.2 How are multiple files organized?
9.3 Operations on files
9.4 Operations on directories
9.5 Keeping track of open files
9.6 Concurrent access to files
9.7 File metadata
9.8 Block-structured storage
9.9 Constructing a logical file system
9.9.1 Virtual file system
9.10 Inodes
9.10.1 Multiple links, single inode
9.10.2 Directories
9.11 ext4
9.11.1 Layout on disk
9.11.2 Indexing data blocks
9.11.3 Multiple links, single inode
9.11.4 Checksumming
9.11.5 Encryption
9.12 FAT
9.12.1 Advantages of FAT
9.12.2 Construct a mini file system using FAT
9.13 Latency reduction techniques
9.14 Fixing up broken file systems
9.15 Advanced topics
9.16 Further reading
9.17 Exercises and questions
9.17.1 Hybrid contiguous and linked file system
9.17.2 Extra FAT file pointers
9.17.3 Expected file size
9.17.4 Ext4 extents
9.17.5 Access times
9.17.6 Database decisions
10. Networking
10.1 Overview
10.2 What is networking
10.3 Why is networking part of the kernel?
10.4 The OSI layer model
10.5 The Linux networking stack
10.5.1 Device drivers
10.5.2 Device-agnostic interface
10.5.3 Network protocols
10.5.4 Protocol-agnostic interface
10.5.5 System call interface
10.5.6 Socket buffers
10.6 The POSIX standard socket interface library
10.6.1 Stream socket (TCP) communications flow
10.6.2 Common internet data types
Socket address data type: struct sockaddr
Internet socket address data type: struct sockaddr_in
10.6.3 Common POSIX socket API functions
Create a socket descriptor: socket()
Bind a server socket address to a socket descriptor: bind()
Enable server socket connection requests: listen()
Accept a server socket connection request: accept()
Client connection request: 'connect()'
Write data to a stream socket: send()
Read data from a stream socket: recv()
Setting server socket options: setsockopt()
10.6.4 Common utility functions
Internet address manipulation functions
Internet network/host byte order manipulation functions
Host table access functions
10.6.5 Building applications with TCP
Request/response communication using TCP
TCP server
TCP client
10.6.6 Building applications with UDP
UDP server
UDP client
UDP client using connect()
10.6.7 Handling multiple clients
The select() system call
Multiple server processes: fork() and exec()
Multithreaded servers using pthreads
10.7 Summary
10.8 Exercises and questions
10.8.1 Simple social networking
10.8.2 The Linux networking stack
10.8.3 The POSIX socket API
11. Advanced topics
11.1 Overview
11.2 Scaling down
11.3 Scaling up
11.4 Virtualization and containerization
11.5 Security
11.5.1 Rowhammer, Rampage, Throwhammer, and Nethammer
11.5.2 Spectre, Meltdown, Foreshadow
11.6 Verification and certification
11.7 Reconfigurability
11.8 Linux development roadmap
11.9 Further reading
11.10 Exercises and questions
11.10.1 Make a minimal kernel
11.10.2 Verify important properties
11.10.3 Commercial comparison
11.10.4 For or against certification
11.10.5 Devolved decisions
11.10.6 Underclock, overclock
Glossary of terms
Index
Foreword
In 1983, when I started modeling a RISC processor using a simulator written in BBC Basic on a BBC Microcomputer, I could hardly have conceived that there would be billions of Arm (then short for 'Acorn RISC Machine') processors all over the world within a few decades.

I expect Linus Torvalds has similar feelings, when he thinks back to the early days, crafting a prototype operating system for his i386 PC. Now Linux runs on a vast array of devices, from smartwatches to supercomputers. I am delighted that an increasing proportion of these devices are built around Arm processor cores.

In a more recent tale of runaway success, the Raspberry Pi single-board computer has far exceeded its designers' initial expectations. The Raspberry Pi Foundation thought they might sell one thousand units, 'maybe 10 thousand in our wildest dreams.' With sales figures now around 20 million, the Raspberry Pi is firmly established as Britain's best-selling computer.

This textbook aims to bring these three technologies together: Arm, Linux, and Raspberry Pi. The authors' ambitious goal is to 'make Operating Systems fun again.' As a professor in one of the UK's largest university Computer Science departments, I am well aware that modern students demand engaging learning materials. Dusty 900-page textbooks with occasional black and white illustrations are not well received. Today's learners require interactive content, gaining understanding through practical experience and intuitive analogies. My observation applies to students in traditional higher education, as well as those pursuing blended and fully online education. I am confident this innovative textbook will meet the needs of the next generation of Computer Science students.

While the modern systems software stack has become large and complex, the fundamental principles are unchanging. Operating Systems must trade off abstraction for efficiency. In this respect, Linux on Arm is particularly instructive. The authors do an excellent job of presenting Operating Systems concepts, with direct links to concrete examples of these concepts in Linux on the Raspberry Pi. Please don't just read this textbook – buy a Pi and try out the practical exercises as you go.

Was it Plutarch who said, 'The mind is not a vessel to be filled but a fire to be kindled'? We could translate this into the Operating Systems domain as follows: 'Learning isn't just reading source code; it's bootstrapping machines.' I hope that you enjoy all these activities, as you explore Operating Systems with Linux on Arm using your Raspberry Pi.
Steve Furber CBE FRS FREng
ICL Professor of Computer Engineering
The University of Manchester, UK
February 2019
Disclaimer
The design examples and related software files included in this book are created for educational purposes and are not validated to the same quality level as Arm IP products. Arm Education Media and the author do not make any warranties of these designs.
Note
When we developed the material for this textbook, we worked with Raspberry Pi 3B boards. However, all our practical exercises should work on other generations and variants of Raspberry Pi devices, including the more recent Raspberry Pi 4.
Preface
Introducon
Modern computer devices are fabulously complicated both in terms of the processor hardware and the software they run.

At the heart of any modern computer device sits the operating system. And if the device is a smartphone, IoT node, datacentre server or supercomputer, then the operating system is very likely to be Linux: about half of consumer devices run Linux; the vast majority of smartphones worldwide (86%) run Android, which is built on the Linux kernel. Of the top one million web servers, 98% run Linux. Finally, the top 500 fastest supercomputers in the world all run Linux.

On the hardware side, Arm has a 95% market share in smartphone and tablet processors as well as being used in the majority of Internet of Things (IoT) devices such as webcams, wireless routers, etc. and embedded devices in general.
Since its creaon by Linus Torvalds in 1991, the eorts of thousands of people, most of them
volunteers, have turned Linux into a state-of-the-art, exible and powerful operang system, suitable
for any system from ny IoT devices to the most powerful supercomputers.
Meanwhile, in roughly the same period, the Arm processor range has expanded to cover an equally
wide gamut of systems and devices, including the remarkably successful Raspberry Pi.
So if you want to learn about Operating Systems but keep a practical, real-world focus, then this book is an ideal starting point. This book will help you answer questions such as:
What is a file, and why is the file concept so important in Linux?
What is scheduling and how can knowledge of Linux scheduling help you create a high-throughput video processor or a mission-critical real-time system?
What are POSIX threads, and how can the Linux kernel assist you in making your multithreaded applications faster and more responsive?
How does the Linux kernel support networking, and how do you create network clients and servers?
How does the Arm hardware assist the Linux kernel in managing memory and how does understanding memory management make you a better programmer?
The aim of this book is to provide a practical introduction to the foundations of modern operating systems, with a particular focus on GNU/Linux and the Arm platform. Our unique perspective is that we explain operating systems theory and concepts but ground them in practical use through illustrative examples of their implementation in GNU/Linux, as well as making the connection with the Arm hardware supporting the OS functionality.
Is this book suitable for you?
This book does not require prior knowledge of operating systems, but some familiarity with command-line operations in a GNU/Linux system is expected. We discuss technical details of operating systems, and we use source code to illustrate many concepts. Therefore, you need to know C and Python, and you need to be familiar with basic data structures such as arrays, queues, stacks and trees.

This textbook is ideal for a one-semester course introducing the concepts and principles underlying modern operating systems. It complements the Arm online courses in Real-Time Operating Systems Design and Programming, and Embedded Linux.
Online addional material
The companion web site of the book (www.dcs.gla.ac.uk/operating-system-foundations) contains:
Source code for all original code snippets listed in the book;
Answers to questions and exercises;
Lab materials;
Additional content;
Additional teaching materials;
Further reading.
Target plaorm
This textbook focuses on the Raspberry Pi 3, an Arm Cortex-A53 platform running Linux. We use the Raspbian GNU/Linux distribution. However, the book does not specifically depend on this platform and distribution, except for the exercises.

If you don't own a Raspberry Pi 3, you can use the QEMU emulator which supports the Raspberry Pi 3.
Soware development environment
The code examples in this book are either in C or Python 3. We assume that the reader has access to a Linux system with an installation of Python, a C compiler, the make build tool and the git version control tool.
Structure
The structure of this textbook is based on our many years of teaching operating systems courses at undergraduate and master's level, taking into account the feedback provided by the reviewers of the text. The content of the text is closely aligned to the Computing Curricula 2001 Computing Science report recommendations for teaching Operating Systems, published by the Joint Task Force of the IEEE Computer Society and the Association for Computing Machinery (ACM).
The book is organized into eleven chapters.
Chapters 1 and 2 provide alternate introductory views to operating systems.
Chapter 1 A memory-centric system model presents a top-down view. In this chapter, we introduce a number of abstract models for processor-based systems. We use Python code to describe the models and only use simple data structures and functions. The purpose is to help the student understand that in a processor-based system, all actions fundamentally reduce to operations on addresses. The models are gradually refined as the chapter advances, and by the end, the model integrates the basic operating system functionality into a runnable Python-based processor model.
Chapter 2 A praccal view of the Linux system approaches the Linux system from a praccal
perspecve: what actually happens when we boot and run the system, how does it work and what is
required to make it work. We rst introduce the essenal concepts and techniques that the student
needs to know in order to understand the overall system, and then we discuss the system itself.
The aim of this part is to help the student answer quesons such as “what happens when the system
boots?” or “how does Linux support graphics?”. This is not a how-to guide, but rather, provides the
student with the background knowledge behind how-to guides.
In Chapter 3 Hardware architecture, we discuss the hardware on which the operating system runs, the hardware support for operating systems (dedicated registers, MMU, DMA, interrupt architecture, relevant details about the bus/NoC architecture, ...), the memory subsystem (caches, TLB), high-level language support, boot subsystem and boot sequence. The purpose is to provide the student with a usable mental model for the hardware system and to explain the need for an operating system and how the hardware supports the OS. In particular, we study the Linux view on the hardware system.
The next seven chapters form the core of the book; each of these introduces a core Operating System concept.
In Chapter 4, Process management, we introduce the process abstraction. We outline the state that needs to be encapsulated. We walk through the typical lifecycle of a process from forking to termination. We review the typical operations that will be performed on a process.
Chapter 5 Process scheduling discusses how the OS schedules processes on a processor. This includes the rationale for scheduling, the concept of context switching, and an overview of scheduling policies (FCFS, priority, ...) and scheduler architectures (FIFO, multilevel feedback queues, priorities, ...). The Linux scheduler is studied in detail.
While memory itself is remarkably straightforward, OS architects have built lots of abstraction layers on top. Principally, these abstractions serve to improve performance and/or programmability. In Chapter 6 Memory management, we review caches (in hardware and software) to improve access speed. We go into detail about virtual memory to improve the management of the physical memory resource. We will provide highly graphical descriptions of address translation, paging, page tables, page faults, swapping, etc. We explore standard schemes for page replacement, copy-on-write, etc. We will examine concrete examples in the Arm architecture and Linux OS.
In Chapter 7, Concurrency and parallelism, we discuss how the OS supports concurrency and how the OS can assist in exploiting hardware parallelism. We define concurrency and parallelism and discuss how they relate to threads and processes. We discuss the key issue of resource sharing, covering locking, semaphores, deadlock and livelock. We look at OS support for concurrent and parallel programming via POSIX threads and present an overview of practical parallel programming techniques such as OpenMP, MPI and OpenCL.
Chapter 8 Input/output presents the OS abstraction of an IO device. We review device interfacing, covering topics like Polling, Interrupts and DMA. We will investigate a range of device types, to highlight their diverse features and behavior. We will cover hardware registers, memory mapping and coprocessors. Further, we will examine the ways in which devices are exposed to programmers. We will review the structure of a typical device driver.
Chapter 9 Persistent storage focuses on data storage. We outline the range of use cases for file systems. We explain how the raw hardware (block- and sector-based 2d storage, etc.) is abstracted at the OS level. We talk about mapping high-level concepts like files, directories, permissions, etc., down to physical entities. We review allocation, space management, and recovery from failure. We present a case study of a Linux file system. We also discuss Windows-style FAT, since this is how USB bulk storage operates.
Chapter 10 Networking introduces networking from an OS perspective: why is networking treated differently from other types of IO, what are the OS requirements to support the OSI stack. We introduce socket programming with a focus on the role the OS plays (e.g. zero-copy buffers, file abstraction, supporting multiple clients, ...).
Finally, Chapter 11 Advanced topics discusses a number of concepts that go beyond the material of the previous chapters. The first part of this chapter deals with customisation of Linux for Embedded Systems, Linux on systems without MMU, and datacentre level operating systems. The second part discusses the security of Linux-based systems, focusing on validation and verification of OS components and the analysis of recent security exploits.
We hope that you enjoy both reading our book and doing the exercises – especially if you are trying them on the Raspberry Pi. Please do let us know what you think about our work and how we could improve it by sending your comments to Arm Education Media: [email protected]
Jeremy Singer and Wim Vanderbauwhede, 2019
About the Authors
Wim Vanderbauwhede
School of Compung Science, University of Glasgow, UK
Prof. Wim Vanderbauwhede is Professor in Computing Science at the School of Computing Science of the University of Glasgow. He has been teaching and researching operating systems for over a decade. His research focuses on high-level programming, compilation, and architectures for heterogeneous manycore systems and FPGAs, with a special interest in power-efficient computing and scientific High-Performance Computing (HPC). He is the author of the book 'High-Performance Computing Using FPGAs'. He received his Ph.D. in Electrotechnical Engineering with Specialisation in Physics from the University of Gent, Belgium in 1996. Before moving into academic research, Prof. Vanderbauwhede worked as an ASIC Design Engineer and Senior Technology R&D Engineer for Alcatel Microelectronics.
Jeremy Singer
School of Compung Science, University of Glasgow, UK
Dr. Jeremy Singer is a Senior Lecturer in Systems at the School of Computing Science of the University of Glasgow. His main research theme involves programming language runtimes, with particular interests in garbage collection and manycore parallelism. He leads the Federated Raspberry Pi Micro-Infrastructure Testbed (FRµIT) team, investigating next-generation edge compute platforms. He received his Ph.D. from the University of Cambridge Computer Laboratory in 2006. Singer and Vanderbauwhede also collaborated in the design of the FutureLearn 'Functional Programming in Haskell' massive open online course.
Acknowledgements
The authors would like to thank the following people for their help:
Khaled Benkrid, who made this book possible.
Ashkan Tousimojarad, who originally suggested the project.
Melissa Good, Jialin Dou and Michael Shu who kept us on track and assisted us with the process.
The reviewers at Arm who provided valuable feedback on our drafts.
Tony Garnock-Jones, Dejice Jacob, Richard Mortier, Colin Perkins, and other colleagues who commented on early versions of the text.
Steve Furber, for his kind endorsement of the book.
Lovisa Sundin, for her help with illustrations.
Jim Garside, Krisan Hentschel, Simon McIntosh-Smith, Magnus Morton and Michèle Weiland for
kindly allowing us to use their photographs.
The countless volunteers who made the Linux kernel what it is today.
Chapter 1
A Memory-centric system model
Operang Systems Foundaons with Linux on the Raspberry Pi
2
1.1 Overview
In this chapter, we will introduce a number of abstract memory-centric models for processor-based systems. We will use Python code to describe the models and only use simple data structures and functions. The models are abstract in the sense that we do not build the processor system starting from its physical building blocks (transistors, logic gates, etc.), but rather, we model it in a functional way.

The purpose is to help you understand that in a processor-based system, all actions fundamentally reduce to operations on addresses. This is a very important point: every observable action in a processor-based system is the result of writing to or reading from an address location.

In particular, this includes all peripherals of the system, such as the network card, keyboard, and display.
What you will learn
Aer you have studied the material in this chapter, you will be able to:
1. Discuss the importance of state and the address space in a processor-based system.
2. Create a processor-based system model in a high-level language.
3. Implement basic operang system concepts such as me slicing in machine code.
4. Explain how hardware and soware features of a processor-based system are designed to handle
I/O, concurrency, and performance.
1.2 Modeling the system
A microprocessor is driven by a clock.
Our model will describe the actions at every tick of the clock using functions.
We will model the system through its state, represented as a simple data structure.
By "state," we mean information that is persistent, i.e., some form of memory. This is not limited to actual computer memory. For example, if our system controls a robot arm, then the position of the arm is part of the state of the system.
1.2.1 The simplest possible model
We start our system model by stating that the action of the processor modifies the system state:
systemState = processorAction(systemState)
In pracce, the system also interacts with the outside world through peripherals such as the keyboard,
network interface, etc., generally called “I/O devices”, storage devices such as disks, etc. Let's just call
these types of acons to modify the state ‘non-processor acons’. Adding this to our model, we get:
Lisng 1.2.1: System state with non-processor acons Python
1 systemState = nonProcessorAction(systemState)
2 systemState systemState = processorAction(systemState)
In a real system, these actions happen at the same time (we call these concurrent actions), so one of the questions (that we will address in detail in Chapter 7, 'Concurrency and parallelism') is how to make sure that the system state does not become undetermined as a result of concurrent actions. But first, let's look in a bit more detail at the system state.
1.2.2 What is this ‘system state’?
We say that the processor 'modifies the system state', so let's take a closer look at this system state. From the point of view of the processor, the system state is simply a fixed-size array of unsigned integers. Nothing more than that. In C syntax, we can express this as shown in Listing 1.2.2:
Lisng 1.2.2: System state as C array C
1 int systemState[STATE_SZ]
This means that manipulation of the system state, and by consequence anything that happens in a processor-based system, boils down to modifying this array.

So, what does this array actually represent? It represents all of the memory in the system, not just the actual system memory (DRAM, Dynamic Random Access Memory) but including the I/O devices and other peripherals such as disks. In system terms, this is known as the 'physical address space', and we will discuss this in detail in Chapter 6, 'Memory management'.¹
In other words, the system state is composed of the states of all the system components, for example for a system with a keyboard kbd, network interface card nic, solid state disk ssd, graphics processing unit gpu, and random access memory ram:
systemState = ramState + kbdState + nicState + ssdState + gpuState
Where ramState, kbdState, nicState, etc. are all fixed-size arrays of integers.
However, it could of course equally be:
systemState = ssdState + kbdState + nicState + ramState + gpuState
The above are two examples of address space layouts. The description of the purpose, size, and position of the address regions for memory and peripherals is called the address map. As an illustration, the Arm address map for A-class systems [1] is shown in Figure 1.1.
¹ As our model focuses on Arm-based systems, we do not discuss port-mapped I/O.
Figure 1.1: Arm 40-bit address map.

If the address size is 32 bits, we can address 2^32 = 4 GB of memory. We see from the figure that different regions are reserved for different purposes, e.g., the second GB is memory-mapped I/O, and the upper 2 GB are random access memory (DRAM).
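To make the idea of an address map concrete in our Python model, here is a minimal sketch. The region names, base addresses, and sizes below are invented for illustration only and do not correspond to the real Arm address map.

# Hypothetical address map: a list of (name, start address, size in words).
# The values are illustrative only.
addressMap = [
    ("dma", 0,    4),      # DMA controller registers
    ("kbd", 4,    4),      # keyboard peripheral
    ("nic", 8,    16),     # network interface card
    ("gpu", 24,   1024),   # graphics processing unit
    ("ram", 1048, 65536),  # DRAM
]

def regionForAddress(addr):
    # Return the name of the region that an address falls into.
    for (name, start, size) in addressMap:
        if start <= addr < start + size:
            return name
    return "unmapped"

print(regionForAddress(10))  # prints 'nic'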
1.2.3 Rening non-processor acons
Using the more detailed state from above, we can split the non-processor actions into per-peripheral actions, so that our model becomes:
Lisng 1.2.3: Model with per-peripheral acons Python
1 kbdState=kbdAction(kbdState)
2 nicState=nicAction(nicState)
3 ssdState=ssdAction(ssdState)
4 gpuState=gpuAction(gpuState)
5 systemState = ramState+kbdState+nicState+diskState+gpuState
6 systemState = processorAction(systemState)
Each of these acons only aects the state of the peripheral; the rest of the system state remains
unaected.
1.2.4 Interrupt requests
Let's return now to the potential problem of state modified by concurrent actions. The way we just separated the state offers a possible solution. Now we can create a kind of notification mechanism which lets the processor know that an outside action has modified the state².

² We could also let the processor check if the state of a peripheral was changed before acting on it. This approach is called polling and will be discussed in Chapter 8, 'Input/output'.
This is exactly what happens in real systems, and the mechanisms used are called interrupts. We will discuss this in detail in Chapter 8, 'Input/output', but it is useful to add an interrupt mechanism to our abstract model.
A peripheral can send an interrupt request (IRQ) to the processor. We will model the interrupt request as a boolean flag which is returned by every peripheral action together with its state (as a tuple). The processor action receives an array of these interrupt requests and uses the array index to identify the peripheral that raised the interrupt ('raising an interrupt' in our model means setting the boolean flag to True).
In pracce, the mechanism is more complicated because many peripherals can raise mulple dierent
interrupt requests depending on the condion. Typically, a dedicated peripheral called interrupt
controller is used to manage the interrupts from the various devices.
Note that the interrupt mechanism is purely a notification mechanism: it does not stop the processor from modifying the peripheral state; all it does is notify the processor that the peripheral unilaterally changed its state. So in principle, the peripheral could still be modifying its state at the very same time that the processor is modifying it. In what follows, we simply assume that this cannot happen, i.e., if a peripheral is modifying its state, then the processor can't change it and vice versa. A possible model for this is that the peripheral state change and the interrupt request happen at the same time and that the processor always needs to process the request before making a state change.
Lisng 1.2.4: Model with interrupt requests Python
1 (kbdState,kbdIrq)=kbdAction(kbdState)
2 ...
3
4 irqs=[kbdIrq,...]
5
6 systemState = ramState+kbdState+nicState+diskState+gpuState
7 (systemState,irqs) = processorAction(systemState,irqs)
We will see in the next section how the processor handles interrupts.
1.2.5 An important peripheral: the timer
A mer is a peripheral that counts me in terms of the system clock. It can be programmed to ‘re’
periodically at given intervals, or aer a one-o interval. When a mer ‘res’ it raises an interrupt
request. The mer is parcularly important because it is the principal mechanism used by the
operang system to track the progress of me and allows it to schedule tasks.
(timerState, timerIrq)=timerAction(timerState)
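As an illustration, timerAction could be modeled as follows. This is only a sketch: the layout of timerState as a [counter, interval] pair and the fixed interval are assumptions made here, not the implementation from the book's code repository.

# Hypothetical timer model: timerState = [counter, interval].
# The counter is incremented on every tick; when it reaches the programmed
# interval, the timer 'fires': the counter is reset and the IRQ flag is raised.
def timerAction(timerState):
    counter, interval = timerState
    counter += 1
    timerIrq = False
    if counter == interval:
        counter = 0
        timerIrq = True
    return ([counter, interval], timerIrq)

timerState = [0, 3]  # fire every 3 ticks
for tick in range(6):
    (timerState, timerIrq) = timerAction(timerState)
    print(tick, timerIrq)  # timerIrq is True on ticks 2 and 5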
1.3 Bare-bones processor model
To gain more insight into the way the processor modifies the system state, we will build a simple processor model which models how the processor changes the system state at every clock cycle. The purpose of this model is to make the introduction of the more abstract model in Section 1.4 easier to understand.
1.3.1 What does the processor do?
The processor is a machine to modify the system state. You need to know that …
A key feature of a processor is the ability to run arbitrary programs.
A program consists of a series of instructions.
An instrucon determines how the processor interacts with the system through the address space: it can
read values at given addresses, compute new values and addresses, and write values to given addresses.
Note that the program is itself part of the system state. The program running on the processor can control which part of the entire program code to access. This is what allows us to create an operating system.
1.3.2 Processor internal state: registers
Although in principle, a processor could directly manipulate the system state, this is not practical because DRAM memory access is quite slow. Therefore, in practice, processors have a dedicated internal state known as the register file, an array of words called registers which you can consider as a small but very fast memory. The register file is separate from the rest of the system state (it is a 'separate address space'). This means we have to refine our model to separate the register file from the rest of the system state, which we will call systemState. We do this using a tuple³:
(systemState,irqs,registers) = processorAction(systemState,irqs,registers)
For convenience, registers often have names (mnemonics). For example, Figure 1.2 shows the core AArch32 register set of the Arm Cortex-A53 [2].
There are 16 ordinary registers (and five special ones which we have omitted). Registers R0-R12 are the 'General-purpose registers'. Then there are three registers with special names: the Stack Pointer (SP), the Link Register (LR) and the Program Counter (PC).
Figure 1.2: Arm Cortex-A53 AArch32 register set.
³ Alternatively, we could make the registers part of the system state, similar to the state of the peripherals. Our choice is purely for convenience because it makes it easier to manipulate the registers in the Python code.
1.3.3 Processor instructions
A typical processor can perform a wide range of instructions on memory addresses and/or register values. We will use a simple list-based notation for all instructions. We will use the (uppercase) Arm mnemonics for registers and instructions; in Python, these are simply variables; their definitions can be found in the code repository in file abstract_system_constants.py.
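As a rough idea of what such definitions could look like (the actual values live in abstract_system_constants.py and may well differ; the numbers below are illustrative assumptions):

# Illustrative register and opcode constants; the real definitions are in
# abstract_system_constants.py and may use different values.
R0, R1, R2, R3, R4, R5, R6 = 0, 1, 2, 3, 4, 5, 6
SP, LR, PC = 13, 14, 15           # stack pointer, link register, program counter
ADD, SUB, MUL = 0, 1, 2           # ALU opcodes (indices into the alu array)
LDR, STR, MOV, SET = 16, 17, 18, 19
B, CBZ, CBNZ, NOP, WFI = 20, 21, 22, 23, 24
registers = [0] * 16              # the register file as a fixed-size array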
We will assume that all instructions take up to three registers as arguments, for example
add_instr = [ADD,R3,R1,R2]
which means that the result of ADD operating on registers R1 and R2 is stored in register R3.
Apart from computational (arithmetic and logic) instructions we also introduce the instructions LDR, e.g.
load_instr = [LDR,R1,R2]
and STR, e.g.
store_instr=[STR,R1,R2]
which respecvely load the content of a memory address stored in R2 into register R1 and store the
content of register R1 at the address locaon given in R2.
We also have MOV, which copies data between two registers, e.g.
set_instr = [MOV,R1,R2]
will set the content of R1 to the content of R2.
We have a special non-Arm instruction called SET, which takes a register and a value as arguments, e.g.
set_instr = [SET,R1,42]
will set the content of R1 to 42.
We also need some instructions to control the flow of the program, such as branches (B)
goto_instr = [B,R1]
where R1 contains the address of the target instruction in the program, and conditional branches (CBZ, 'Compare and Branch if Zero')
if_instr = [CBZ,R1,R2]
where register R1 contains the condition variable (0 or 1) and the program branches to the address in R2 if R1=0 and continues on the next line otherwise. We also have CBNZ, 'Compare and Branch if Non-Zero'.
Finally, we have two instructions which take no arguments: NOP does nothing, and WFI stops the processor until an interrupt occurs.
1.3.4 Assembly language
To write instrucons for actual processors, a similar, but more expressive, notaon called assembly
language is used. For example, consider the following program that reads two values from memory,
stores them in registers, adds them, and writes the result back:
Lisng 1.3.1: Example program Python
1 [
2 [LDR,R1,R4],
3 [LDR,R2,R5],
4 [ADD,R3,R1,R2],
5 [STR,R3,R6]
6 ]
In the assembly language for the Arm processor [3], this code would look as follows:
Lisng 1.3.2: Example Arm assembly program Python
1 ldr r1, r4
2 ldr r2, r5
3 add r3, r1, r2
4 str r3, r6
Assembly languages have many other features, such as a rich set of addressing mechanisms, labeling options, etc. However, for our current purpose, our simple function-based notation is sufficient. For more details, see, e.g., [4].
1.3.5 Arithmec logic unit
The part of a processor that performs computations is known as the arithmetic logic unit (ALU). We can create a simple ALU in Python as follows:
Lisng 1.3.3: ALU model Python
1 from operator import *
2
3 alu = [
4 add,
5 sub,
6 mul,
7 ...
8 ]
This is simply an array of functions; more instructions can be added trivially.
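If the ALU opcodes are simply indices into this array (as assumed in the constants sketch above), performing an operation is just an indexed function call, for example:

from operator import add, sub, mul

alu = [add, sub, mul]
ADD, SUB, MUL = 0, 1, 2   # assumed opcode values

print(alu[MUL](6, 7))   # prints 42
print(alu[SUB](10, 4))  # prints 6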
1.3.6 Instrucon cycle
A processor operates what is known as the instruction cycle or fetch-decode-execute cycle. We can define each of these operations as follows. First, we define fetchInstruction. This function fetches an instruction from memory. To determine which instruction to fetch, it uses a dedicated register known as the program counter, which has address PC in our register file. Then we also need to know where in our memory space we can find the program code. We use CODE to denote the starting address of the program in the system state. After reading the instruction, we increment the program counter, so it points to the next instruction in the program.
Lisng 1.3.4: Instrucon fetch model Python
1 def fetchInstruction(registers,systemState):
2 # get the program counter
3 pctr = registers[PC]
4 # get the corresponding instruction
5 ir = systemState[CODE+pctr]
6 # increment the program counter
7 registers[PC]+=1
8 return ir
The instrucon is stored in the temporary instrucon register (ir in our code). The processor now has to
decode this instrucon, i.e., extract the register addresses and instrucon opcode from the instrucon
word. Remember that the state stores unsigned integers, so an instrucon is encoded as an unsigned
integer. The details of the implementaon can be found in the repository in le abstract_system_cpu_
decode.py. For this discussion, the important point is that the funcon returns a tuple opcode,args
where args is a tuple containing the decoded arguments (registers, addresses or constants). In the
code, if an element of a tuple is unused, we used _ as variable name to indicate this.
Lisng 1.3.5: Instrucon decode model Python
1 def decodeInstruction(ir):
2 ...
3 return (opcode,args)
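The encoding itself is not important for the rest of the chapter, but to make the idea tangible, here is a minimal sketch of one possible scheme, assuming an 8-bit opcode and three 8-bit argument fields packed into a single unsigned integer. The real encoding in abstract_system_cpu_decode.py may be different; the function names carry a 'Sketch' suffix to make clear they are not the repository versions.

# Sketch of a possible encoding: | opcode (8 bits) | arg1 | arg2 | arg3 |
def encodeInstructionSketch(instr):
    opcode = instr[0]
    args = instr[1:] + [0] * (3 - len(instr[1:]))  # pad to three arguments
    return (opcode << 24) | (args[0] << 16) | (args[1] << 8) | args[2]

def decodeInstructionSketch(ir):
    opcode = (ir >> 24) & 0xFF
    args = ((ir >> 16) & 0xFF, (ir >> 8) & 0xFF, ir & 0xFF)
    return (opcode, args)

# Round trip for [ADD,R3,R1,R2], with ADD=0, R1=1, R2=2, R3=3 as assumed earlier
print(decodeInstructionSketch(encodeInstructionSketch([0, 3, 1, 2])))
# prints (0, (3, 1, 2))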
Finally, the processor executes the decoded instruction. In our model, we implement each instruction using a function. The load instruction (mnemonic LDR) is simply an array read operation, store (mnemonic STR) is simply an array write operation. The B and CBZ branching instructions only modify the program counter. By using an array of functions alu as discussed above, the ALU execution is very simple too. The complete code can be found in the repository in file abstract_system_cpu_execute.py.
Lisng 1.3.6: Individual instrucon execute model Python
1 def doLDR(registers,systemState,args):
2 (r1,addr,_)=args
3 registers[r1] = systemState[addr]
4 return (registers,systemState)
5
6 def doSTR(registers,systemState,args)
7 (r1,addr,_)=args
8 systemState[addr]=registers[r1]
9 return (registers,systemState)
Chapter 1 | A Memory-centric system model
Operang Systems Foundaons with Linux on the Raspberry Pi
10
10
11 def doB(registers,args):
12 (_,addr,_)=args
13 registers[PC] = addr
14 return registers
15
16 def doCBZ(registers,args):
17 (r1,addr1,addr2)=args
18 if registers[r1]:
19 registers[PC] = addr1
20 else:
21 registers[PC] = addr2
22 return registers
23
24 def doALU(instr,registers,args):
25 (r1,r2,r3)=args
26 registers[r3] = alu[instr](registers[r1],registers[r2])
27 return registers
The executeInstruction function simply calls the appropriate handler function via a condition on the instruction:
Lisng 1.3.7: Instrucon execute model Python
1 def executeInstruction(instr,args,registers,systemState):
2 if instr==LDR:
3 (registers,systemState)=doLDR(registers,systemState,args)
4 elif instr==STR:
5 (registers,systemState)=doSTR(registers,systemState,args)
6 elif ...
7 else:
8 registers = doALU(instr,registers,args)
9 return (registers,systemState)
1.3.7 Bare-bones processor model
With these denions, we can build a very simple processor model:
Lisng 1.3.8: Simple processor model Python
1 def processorAction(systemState,registers):
2 # fetch the instruction
3 ir = fetchInstruction(registers,systemState)
4 # decode the instruction
5 (instr,args) = decodeInstruction(ir)
6 # execute the instruction
7 (registers,systemState)= executeInstruction(instr,args,registers,systemState)
8 return (systemState,registers)
In the source code, we have also provided an encodeInstruction in file abstract_system_encoder.py. We can encode an instruction using this function, assuming the mnemonics have been defined:
Lisng 1.3.9: Instrucon encoding Python
1 # multiply value in R1 with value in R2
2 # store result in R3
3 instr=[MUL,R3,R1,R2]
4
5 iw=encodeInstruction(instr)
Now you can run this as follows:
Lisng 1.3.10: Running the code Python
1 # Set the program counter relative to the location of the code
2 registers[PC]=0
3 # Set the registers
4 registers[R1]=6
5 registers[R2]=7
6
7 # Store the encoded instructions in memory
8 systemState[CODE] = iw
9
10 # Now run this
11 (systemState,registers) = processorAction(systemState,registers)
12
13 # Inspect the result
14 print( registers[R3] )
15 # prints 42
You can nd the complete Python code for this bare-bones model in the folder bare-bones-model,
have a look and try it out. The le to run is bare-bones-model/abstract_- system_model.py.
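Extending Listing 1.3.10, a program of more than one instruction can be run by storing each encoded instruction at consecutive addresses from CODE and calling processorAction once per instruction. The sketch below assumes the definitions and constants above are in scope and relies on the instruction format [opcode, destination, source1, source2] described in Section 1.3.3; the register values are chosen for illustration.

# Two ALU instructions: R3 = R1 * R2, then R4 = R3 + R1.
program = [
    [MUL, R3, R1, R2],
    [ADD, R4, R3, R1],
]
for i, instr in enumerate(program):
    systemState[CODE + i] = encodeInstruction(instr)

registers[PC] = 0
registers[R1] = 6
registers[R2] = 7

for _ in range(len(program)):
    (systemState, registers) = processorAction(systemState, registers)

print(registers[R3], registers[R4])  # expected output: 42 48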
1.4 Advanced processor model
The bare-bones model is missing a number of features that are essential to support an operating system; in this section, we introduce these features and add them to the model.
1.4.1 Stack support
A stack is a conguous block of memory that is accessed in LIFO (last in, rst out) fashion. Data is
added to the top of the stack using a ‘push’ operaon and taken from the top of stack using a ‘pop
operaon. Stacks are used to store temporary data, and they are commonly used to handle funcon
calls. Most computer architectures include at least a register that is usually reserved for the stack
pointer (e.g., as we have seen the Arm processor has a dedicated ‘SP’ register) as well as ‘PUSH’ and
‘POP’ instrucons to access the stack. In our model, we will implement the stack as part of the RAM
memory, and we dene the push and pop instrucons as in the Arm instrucon set, for example:
Lisng 1.4.1: Example stack instrucons Python
1 push_pop=[
2 [PUSH,R1],
3 [POP,R2]
4 ]
would push the content of R1 onto the stack and then pop it into R2. The PUSH and POP instructions are encoded similarly to the LDR and STR memory operations. We extend the executeInstruction definition to support the stack with the following functions:
Lisng 1.4.2: Push/pop implementaon Python
1 def doPush(registers,systemState,args):
2 sptr = registers[SP]
3 (r1,_,_)=args
4 systemState[sptr]=registers[r1]
5 registers[SP]+=1
6 return (registers,systemState)
7
8 def doPop(registers,systemState,args):
9 sptr = registers[SP]
10 (r1,_,_)=args
11 registers[r1] = systemState[sptr]
12 registers[SP]-=1
13 return (registers,systemState)
1.4.2 Subroune calls
One of the main reasons for having a stack is so that the processor can handle subroutine calls, and in particular, subroutines that call other subroutines or call themselves (recursive call). This is because whenever we call a subroutine, the code in the subroutine will overwrite the register file, so we need to store the registers somewhere before we call a subroutine.
To support this mechanism, most processors have instructions to change the control flow: a first instruction, the call instruction, changes the program counter to the location of the subroutine to be called. A second instruction, the return instruction, returns to the location after the subroutine call instruction. These instructions can use either the stack or a dedicated register to save the program counter.

In the Arm 32-bit instruction set the call and return instructions are usually implemented using BL and BX; the Arm convention is to store the return address in the link register LR, and we will use the same convention in our model. We extend the executeInstruction definition to support subroutine call and return as follows:
Lisng 1.4.3: Call/return implementaon Python
1 def doCall(registers,args):
2 pctr = registers[PC]
3 (_,sraddr,_)=args
4 registers[LR] = pctr
5 registers[PC]=sraddr
13
6 return registers
7
8 def doReturn(registers,args):
9 lreg = registers[LR]
10 registers[PC]=lreg
11 return registers
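As a quick check of this convention, doCall and doReturn can be traced by hand; the register indices follow the earlier constants sketch and the subroutine address 20 is arbitrary:

# Hand trace of a call and return; PC and LR indices as assumed earlier.
registers = [0] * 16
PC, LR = 15, 14
registers[PC] = 5                                  # address of the instruction after the call
registers = doCall(registers, (None, 20, None))    # subroutine starts at address 20
print(registers[PC], registers[LR])                # prints 20 5
registers = doReturn(registers, ())
print(registers[PC])                               # prints 5, back after the call site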
1.4.3 Interrupt handling
Now let's extend the processor model to support interrupts. When the processor receives an interrupt request, it must take some specific actions. These actions are simply special small programs called interrupt handlers or interrupt service routines (ISR). The processor uses a region of the main memory called the interrupt vector table (IVT) to link the interrupt requests to interrupt handlers.
How does the processor handle interrupts? On every clock tick (i.e., on every call to processorAction in our model), if an interrupt was raised, the processor has to run the corresponding ISR. In our model, this means the processor needs to inspect irqs, get the corresponding ISR from the ivt (which in our model is a slice of the systemState array), and execute it. So in fact, the call to the ISR is a normal subroutine call, but one that does not have a corresponding CALL instruction in the code. Before executing the ISR, the processor typically stores some register values on the stack, e.g., the Arm Cortex-M3 stores R0-R3, R12, PC, and LR [5]. According to the Arm Architecture Procedure Call Standard [6], the called subroutine is responsible for storing R4-R11. In our simple model, we only store the PC; extending it to support the AAPCS is a trivial exercise.
Lisng 1.4.4: Interrupt handling Python
1 def checkIrqs(registers,ivt,irqs):
2 idx=0
3 for irq in irqs:
4 if irq :
5 # Save the program counter in the link register
6 registers[LR] = registers[PC]
7 # Set program counter to ISR start address
8 registers[PC]=ivt[idx]
9 # Clear the interrupt request
10 irqs[idx]=False
11 break
12 idx+=1
13 return (registers,irqs)
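To see how the pieces fit together, here is a sketch of wiring a timer interrupt to its handler through the vector table. The IRQ index, the handler address, and the register indices are illustrative assumptions; in the full model the vector table is a slice of systemState, but a plain list is used here for clarity.

# Hypothetical wiring of a timer interrupt: IRQ index 0 is assumed to belong
# to the timer and its interrupt service routine is assumed to start at
# address 200 in the system state.
TIMER_ISR_ADDR = 200
ivt = [TIMER_ISR_ADDR]        # one vector table entry per IRQ index
registers = [0] * 16
PC, LR = 15, 14               # register indices as assumed earlier
registers[PC] = 7             # the code that is about to be interrupted

irqs = [True]                 # pretend the timer has just raised its IRQ
(registers, irqs) = checkIrqs(registers, ivt, irqs)
print(registers[PC], registers[LR], irqs)   # prints 200 7 [False]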
1.4.4 Direct memory access
Another important component of a modern processor-based system is support for Direct Memory Access (DMA). This is a mechanism that allows peripherals to transfer data directly into the main memory without going through the processor registers. In Arm systems, the DMA controller unit is typically a peripheral (e.g., the PrimeCell DMA Controller), so we will implement our DMA model as a peripheral as well.

The principle of a DMA transfer is that the CPU initiates the transfer by writing to the DMA unit's registers, then runs other instructions while the transfer is in progress, and finally receives an interrupt from the DMA controller when the transfer is done.
Typically, a DMA transfer is a transfer of a large block of data, which would otherwise keep the processor occupied for a long time. In our simple model, the DMA controller has four registers:
Source Address Register (DSR)
Destination Address Register (DDR)
Counter (DCO)
Control Register (DCR)
This peripheral is different from the others in our model because it can manipulate the entire system state. In a way, we can view a DMA controller as a special type of processor that only performs memory transfer operations. The model implementation is:
Lisng 1.4.5: DMA model Python
def dmaAction(systemState):
    dmaIrq = 0
    # DMA is the start of the address space
    # DCR values: 1 = do transfer, 0 = idle
    if systemState[DMA+DCR] != 0:
        if systemState[DMA+DCO] != 0:
            ctr = systemState[DMA+DCO]
            to_addr = systemState[DMA+DDR]+ctr
            from_addr = systemState[DMA+DSR]+ctr
            systemState[to_addr] = systemState[from_addr]
            # Decrement the counter after every word transferred
            systemState[DMA+DCO] -= 1
        else:
            # Counter reached zero: stop and raise the count-zero interrupt
            systemState[DMA+DCR] = 0
            dmaIrq = 1
    return (systemState,dmaIrq)
To iniate a memory transfer using the DMA controller, the processor writes the source and desnaon
addresses to DSR and DDR, and the size of the transfer to DCO (the ‘counter’). Then the status is set to
1 in the DCR. The DMA controller then starts the transfer and decrements the counter for every word
transferred. When the counter reaches zero, an interrupt is raised (count-zero interrupt).
1.4.5 Complete cycle-based processor model
By including this interrupt support, the complete cycle-based processor model now becomes:
Lisng 1.4.6: Complete cycle-based processor model Python
1 def processorAction(systemState,irqs,registers):
2 ivt = systemState[IVT:IVTsz]
3 # Check for interrupts
4 (registers,irqs)=checkIrqs(registers,ivt,irqs)
5 # Fetch the instruction
6 ir = fetchInstruction(registers,systemState)
7 # Decode the instruction
8 (instr,args) = decodeInstruction(ir)
9 # Execute the instruction
10 (registers,systemState)= executeInstruction(instr,args,registers,systemState)
11 return (systemState,irqs,registers)
1.4.6 Caching
In an actual system, accessing DRAM memory requires many clock cycles. To limit the time spent waiting for memory accesses, processors have a cache, a small but fast memory. For every memory read operation, the processor first checks if the data is present in the cache, and if so (this is called a 'cache hit') it uses that data rather than accessing the DRAM. Otherwise (a 'cache miss') it will fetch the data from memory and store it in the cache.
For a single-core processor, memory write operations are treated in the same way. Real-life caches are very complicated and will be discussed in more detail in Chapters 3 'Hardware architecture' and 6 'Memory management'. Here we will create a simple conceptual model of a cache to illustrate the key points.
First of all, as a cache is limited in size, how do we store portions of the DRAM content in it? Like the other memories, we will model the storage part of the cache as an array of fixed size. So if we want to store some data in the cache, we find a free location and copy the data into it. At some point, the data will be removed from the cache, freeing up this location. So we need a data structure, e.g., a stack, to keep track of the free locations.
So what happens when the cache is full (so the stack is empty)? We need to free up space by evicting data from the cache. As we will see in Chapter 6 'Memory management', there are several different policies to do this. The simplest one (but certainly not the best one) is to evict data from the most recently used location, because all it requires is that we keep track of that single location. When we evict data from the cache, it needs to be written back to the DRAM memory. Conversely, the data that we put into the cache was read from an address location in the DRAM memory. Therefore the cache must not only keep track of the data but also of its original address. In other words, we need a lookup between the address in the DRAM and the corresponding address in the cache. In Python, we can use a dictionary for this, a data structure that associates keys with values. A cache which behaves like a dictionary, in that it allows us to store any memory address at any cache location, is called 'fully associative'.
In Python, we can write such a cache model as follows:
Lisng 1.4.7: Cache model: inializaon and helper funcons Python
1 # Initialise the cache
2 def init_cache():
3 # Cache of size CACHE_SZ
4 cache_storage=[]
5 location_stack_storage=range(0,CACHE_SZ)
6 location_stack_ptr=CACHE_SZ-1
7 last_used_loc = location_stack[location_stack_ptr]
8 location_stack = (location_stack_storage,location_stack_ptr,last_used_loc)
9 address_to_cache_loc={}
10 cache_loc_to_address={}
11 cache_lookup=(address_to_cache_loc,cache_loc_to_address)
12 cache = (cache_storage, address_to_cache_loc,cache_loc_to_address,location_stac
13 return cache
14
15 # Some helper functions
16 def get_next_free_location(location_stack):
17 (location_stack_storage,location_stack_ptr,last_used_loc) = location_stack
18 loc = location_stack_storage[location_stack_ptr]
19 location_stack_ptr-=1
Chapter 1 | A Memory-centric system model
Operang Systems Foundaons with Linux on the Raspberry Pi
16
20 location_stack = (location_stack_storage,location_stack_ptr,last_used_loc)
21 return (location,location_stack)
22
23 def evict_location(location_stack):
24 (location_stack_storage,location_stack_ptr,last_used_loc) = location_stack
25 location_stack_ptr+=1
26 location_stack[location_stack_ptr] = last_used
27 location_stack = (location_stack_storage,location_stack_ptr,last_used_loc)
28 return location_stack
29
30 def cache_is_full(location_stack_ptr):
31 if location_stack_ptr==0
32 return True
33 else
34 return False
Lisng 1.4.8: Cache model: cache read and write funcons Python
1 def write_data_to_cache(memory, address, cache):
2 (cache_storage, address_to_cache_loc,cache_loc_to_address, location_stack) = cache
3 (location_stack_storage,location_stack_ptr,last_used_loc) = location_stack
4 # If the cache was full, evict rst
5 if cache_is_full(location_stack_ptr):
6 location_stack = evict_location(location_stack)
7 evicted_address = cache_loc_to_address[last_used]
8 memory[evicted_address]=cache_storage[last_used]
9 # Get a free location.
10 (loc,location_stack) = get_next_free_location(location_stack)
11 # Get the DRAM content and write it to the cache storage
12 data = memory[address]
13 cache_storage[loc] = data
14 # Update the lookup table and the last used location
15 address_to_cache_loc[address]=loc
16 cache_loc_to_address[loc] = address
17 last_used=loc
18 location_stack = (location_stack_storage,location_stack_ptr,last_used_loc)
19 cache = (cache_storage,address_to_cache_loc,cache_loc_to_address,location_stack)
20 return (memory,cache)
21
22 def read_data_from_cache(memory,address,cache):
23 (cache_storage, address_to_cache_loc,cache_loc_to_address,location_stack) = cache
24 location_stack = evict_location(location_stack)
25 # If the data is not yet in the cache, fetch it from the DRAM
26 # Note this may result in eviction, which could modify the memory
27 if address not in address_to_cache_loc:
28 (memory, cache) = write_data_to_cache(memory,address,cache):
29 # Get the data from the cache
30 loc = address_to_cache_loc[address]
31 data = cache_storage[loc]
32 cache = (cache_storage, address_to_cache_loc,cache_loc_to_address, location_stack)
33 return (data,memory,cache)
The problem with the above model is that for a cache of a given size, we need a location stack and two lookup tables of the same size. This requires a lot of silicon. Therefore, in practice, the cache will not simply fetch the content of a single memory address, but a contiguous block of memory called a cache line. For example, the Arm Cortex-A53 has a 64-byte cache line. Assuming that our memory stores 32-bit words, the size of the location stack and the lookup tables is then 16x smaller than the actual cache size.
There is another reason for the use of cache lines: when a given address is accessed, subsequent memory accesses are frequently to neighboring addresses. So fetching an entire cache line on a cache miss tends to reduce the number of subsequent cache misses. Adapting our model to use cache lines is straightforward:
Lisng 1.4.9: Cache model with cache lines Python
1 # Initialise the cache
2 def init_cache():
3 # Cache of size CACHE_SZ, cache line = 64 bytes = 16 words
4 cache_storage=[[0]*16]*(CACHE_SZ/16)
5 location_stack_storage=range(0,CACHE_SZ/16)
6 location_stack_ptr=(CACHE_SZ/16)-1
7 last_used_loc = location_stack[location_stack_ptr]
8 location_stack = (location_stack_storage,location_stack_ptr,last_used_loc)
9 address_to_cache_loc={}
10 cache_loc_to_address={}
11 cache_lookup=(address_to_cache_loc,cache_loc_to_address)
12 cache = (cache_storage,address_to_cache_loc,cache_loc_to_address,location_stack)
13 return cache
14
15 # The helper functions remain the same
16
17 def write_data_to_cache(memory,address,cache):
18 (cache_storage, address_to_cache_loc,cache_loc_to_address,location_stack) = cache
19 (location_stack_storage,location_stack_ptr,last_used_loc) = location_stack
20 # If the cache was full, evict rst
21 if cache_is_full(location_stack_ptr):
22 location_stack = evict_location(location_stack)
23 evicted_address = cache_loc_to_address[last_used]
24 cache_line = cache_storage[last_used]
25 for i in range(0,16):
26 data = cache_line[i]
27 memory[(evicted_address<<4) + i]=data
28 # Get a free location.
29 (loc,location_stack) = get_next_free_location(location_stack)
30 # Get the DRAM content and write it to the cache storage
31 cache_line = []
32 for i in range(0,16):
33 cache_line.append(memory[((address>>4)<<4)+i]
34 cache_storage[loc] = cache_line
35 # Update the lookup table and the last used location
36 address_to_cache_loc[address>>4]=loc
37 cache_loc_to_address[loc] = address>>4
38 last_used=loc
39 location_stack = (location_stack_storage,location_stack_ptr,last_used_loc)
40 cache = (cache_storage,address_to_cache_loc,cache_loc_to_address,location_stack)
41 return (memory,cache)
42
43 def read_data_from_cache(memory,address,cache):
44 (cache_storage,address_to_cache_loc,cache_loc_to_address,location_stack) = cache
45 location_stack = evict_location(location_stack)
46 # If the data is not yet in the cache, fetch it from the DRAM
47 # Note this may result in eviction, which could modify the memory
48 if address not in address_to_cache_loc:
49 (memory,cache) = write_data_to_cache(memory,address,cache):
50 # Get the data from the cache
51 loc = address_to_cache_loc[address>>4]
52 cache_line = cache_storage[loc]
53 data = cache_line[addres & 0xF]
54 cache = (cache_storage,address_to_cache_loc,cache_loc_to_address,location_stack)
55 return (data,memory,cache)
The only complication in the cache line-based model is that we need to manipulate the memory address to determine the start of the cache line and the location of the data inside the cache line. We do this using bit shift and bit mask operations: the lowest 4 bits of the address identify the position of the data in the cache line. We don't need to store these bits in the lookup tables of the cache because the cache stores only whole cache lines. In other words, from the perspective of the cache, the memory consists of cache lines rather than individual locations. So we have the following formulas:
data_position_in_cache_line = address & 0xF
cache_line_address = address >> 4
address = (cache_line_address << 4) + data_position_in_cache_line
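As a quick sanity check of these formulas, here is a small worked example with an arbitrary address value:
address = 0x1234
data_position_in_cache_line = address & 0xF   # 0x4: word 4 within the line
cache_line_address = address >> 4             # 0x123: which cache line
# Recombining the two parts gives back the original address
assert (cache_line_address << 4) + data_position_in_cache_line == address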
1.4.7 Running a program on the processor
The processor model is complete and can run arbitrary programs. For example, the following program generates the first 10 Fibonacci numbers greater than 1 and writes them to main memory:
Listing 1.4.10: Fibonacci code Python
fib_prog=[
    [SET,R1,1],
    [SET,R2,1],
    [SET,R3,0],
    [SET,R4,10],
    [SET,R5,1],
    ('loop',[ADD,R3,R1,R2]),
    [MOV,R1,R2],
    [MOV,R2,R3],
    [SUB,R4,R4,R5],
    [STR,R3,R4],
    [CBNZ,R4,'loop'],
    [WFI]
]
Note: the encodeProgram function from abstract_model_encoder.py supports strings as labels for instructions, as shown above. Similar to Arm assembly language, the instructions CBZ, CBNZ, ADR, BL, and B actually take labels rather than explicit addresses.
To run this program, we need to encode it, load it into memory, and ensure that the program counter
points to the start of code in the memory:
Lisng 1.4.11: Running a program on the processor Python
1 # Encode the program
2 b_iws=encodeProgram(b_prog)
3
4 # Write the program to RAM memory
5 pc=0
6 for iw inb_iws:
7 ramState[CODE+pc] = iw
8 pc+=1
9
10 # Initialise the processor state
11 registers[PC]=CODE
19
12
13 # Run the system for a given number of cycles
14 MAX_NCYCLES=50
15 for ncycles in range(1,MAX_NCYCLES):
16 # Run the peripheral actions
17 (kbdState,kbdIrq)=kbdAction(kbdState)
18 (nicState,nicIrq)=nicAction(nicState)
19 (ssdState,ssdIrq)=ssdAction(ssdState)
20 (gpuState,gpuIrq)=gpuAction(gpuState)
21 (systemState,dmaIrq)=dmaAction(systemState)
22
23 # The RAM does not have any action,
24 # it is just a slice of the full address space
25 ramState=systemState[0:MEMTOP]
26 # Collect the IRQs
27 irqs=[kbdIrq,nicIrq,ssdIrq,gpuIrq,dmaIrq]
28 # Compose the system state
29 systemState = ramState+timerState+kbdState+nicState+ssdState+gpuState+dmaState
30 # Run the processor action
31 (systemState,irqs,registers) = processorAction(systemState,irqs,registers)
32
33 # Print the portion of memory that holds the results
34 print(systemState[0:10])
1.4.8 High-level instructions
The model introduced in the previous section is cycle-based, i.e., it models all actions and state changes on a cycle-by-cycle, instruction-by-instruction basis. To simplify the explanations in what follows and to speed up the execution of the model code, we add support for direct execution of high-level Python code using the HLI instruction. This allows us to work at a higher level of abstraction, while still preserving the low-level features of the system that are used by the operating system.
The previous model required us to write individual instructions and encode them. The HLI instruction allows us to use Python functions that replace groups of instructions, as follows:
Lisng 1.4.12: Mul- instrucon acon Python
1 def multi_instruction_action( systemState,registers ):
2 .... (arbitrary Python code) ...
3 return ( systemState,registers )
4
5 hli_prog = [...,
6 [HLI,multi_instruction_action],
7 ...
8 ]
To execute such functions in the processor, we add the doHLI function to the executeInstruction code:
Listing 1.4.13: Adapting push for high-level instructions Python
def doHLI(registers,systemState,args):
    (hl_instr,_,_) = args
    (systemState,registers) = hl_instr(systemState,registers)
    return (registers,systemState)
To illustrate the approach, the Fibonacci example from the previous section could become a single HLI instruction:
Listing 1.4.14: Fibonacci with high-level instructions Python
def fib_hl(systemState,registers):
    (r1,r2,r4) = (1,1,10)
    while r4 != 0:
        r3 = r1+r2
        r1 = r2
        r2 = r3
        r4 -= 1
        systemState[r4] = r3
    registers[1:5] = [r1,r2,r3,r4]
    return (systemState,registers)
The key point is that these functions manipulate the system state and registers in the same way as the individual instructions did.
1.5 Basic operating system concepts
In this section, we use the abstract system model to introduce a number of fundamental operating system concepts that will be discussed in detail in the following chapters.
1.5.1 Tasks and concurrency
One of the main tasks of an operating system is to support multiple tasks at the same time ('concurrently'). If there is only one processor, this means that the code that implements these tasks must time-share the processor. Let us assume that we have two programs in memory and we want to run them concurrently, so that each running program is a single task, Task 1 and Task 2.
We have seen in Section 1.4.7 how we run a program: set the program counter to the starting address, then the fetch-decode-execute cycle will execute each instruction on subsequent clock ticks until the program is finished.
Now we want to run two programs at the same time. Therefore, we will need a mechanism to run instructions of each program alternatingly. This mechanism translates to managing the state. As we have seen before, the state of a running program consists in principle of the complete system state. In practice, each program should have its own section of memory, as we don't want one program to modify the memory of another program.
We start, therefore, by assuming that when the program code is loaded into memory, it is part of a region of memory that the program is allowed to use when it is running. We will see in Chapter 6 'Memory management' that this is indeed the case in Linux. As shown in Figure 1.3, this region (called 'user space') contains the program code, the stack for the program, and the random-access memory for the program, commonly known as the 'heap'. Typically, each task gets a fixed amount of memory allocated to it, and in the code, this memory is referenced relative to the program counter.
Figure 1.3: Task memory space (Linux).
1.5.2 The register file
However, as we have seen, the processor also has some state, namely the register file. So if we want to run two tasks alternately, we need to ensure that the register file contains the correct state for each task. So conceptually, we can store a snapshot of the register file contents for Task 1, then load the previous snapshot of the register file contents for Task 2.
1.5.3 Time slicing and scheduling
So how can we make two tasks alternate? The code to do this will be the core of our operating system kernel and is called a 'task scheduler', or scheduler for short. Let's assume we will simply alternate two (or more) tasks for fixed amounts of time (this is called 'round-robin scheduling'). For example, on the Raspberry Pi 3, the Linux real-time scheduler uses an interval (also called 'time slice' or 'quantum') of 10 ms. For comparison, the average duration of an eye blink is 100 ms. Note that at a typical clock speed of 1 GHz, this means a task can execute 10 million (single-cycle) instructions in this time.
The duration of a time slice is controlled by a system timer. As we have seen before, a timer can be configured to fire periodically, so in our case, the system timer will raise an interrupt request every 10 ms. On receiving this request, the processor will execute the corresponding Interrupt Service Routine (ISR). It is this ISR that will take care of the time slicing; in other words, the interrupt service routine is actually our operating system kernel.
In the Python model, the timer peripheral has a register to store the interval and a control register. We can set the timer as follows:
Lisng 1.5.1: Timer Python
1 # Set timer to periodic with 100-ticks interval
2 set_timer=[
3 [SET,R1,100],
4 [SET,R2,100], # start periodic timer
5 [STR,R1,TIMER],
6 [STR,R2,TIMER+1]
7 ]
On running this program, the timer will fire every 100 clock ticks and raise an interrupt request. Let's have a look at the interrupt handler. What should this routine do to achieve time slicing between two tasks? Let's assume Task 1 has been running and we now want to run Task 2.
First, save the register file for Task 1; we do this by pushing all register contents onto the stack. (If you spot an issue here, well done! We'll get back to this in Section 1.5.4.)
Then determine which task has to be run next (i.e., Task 2). We can identify each task using a small integer (the 'task identifier') that we store in the memory accessible by the kernel. We load the task identifier for Task 2 into a register and update the memory with the task identifier for the next task (in our case, again Task 1).
We now move the register file of Task 1 from the stack to kernel memory. In practice, the kernel uses a special data structure, the Task Control Block (TCB), for this purpose.
Now we can read the register file contents for Task 2 from its TCB. Again, we have to do this via the stack (why?).
Once this is done, Task 2 will start running from the location indicated by the PC and run until the next timer interrupt.
We can express this sequence of actions in high-level Python code for our processor model:
Listing 1.5.2: Time slicing model Python
def time_slice(systemState,registers):
    # Push registers onto the stack
    for r in range(0,16):
        systemState[registers[MSP]] = registers[r]
        registers[MSP] += 1
    # Get next task
    pid1 = systemState[PID] # 0 or 1
    pid2 = 1-pid1
    systemState[PID] = pid2
    tcb1 = TCB_OFFSET+pid1*TCB_SZ
    tcb2 = TCB_OFFSET+pid2*TCB_SZ
    # Pop registers from the stack and store them in tcb1
    # We use r0 to show that in actual code we'd need to read into a temporary register
    for r in range(0,16):
        registers[MSP] -= 1
        r0 = systemState[registers[MSP]]
        systemState[tcb1+r] = r0
    # Push registers for Task 2 from tcb2 onto the stack
    for r in range(0,16):
        r0 = systemState[tcb2+r]
        systemState[registers[MSP]] = r0
        registers[MSP] += 1
    # Pop registers for Task 2 from the stack
    for r in range(0,16):
        registers[MSP] -= 1
        registers[r] = systemState[registers[MSP]]
This code is a minimal example of a round-robin scheduler for two tasks.
You can already try to answer these questions by thinking about how you would address these issues.
1.5.4 Privileges
In Secon 1.5.3, we hinted at a potenal issue with the stack. The problem is that ‘pushing onto the
stack’ means modifying the stack pointer SP. So how can we preserve the stack pointer of the current
task? The short answer for the Arm processor is that it has two stack pointers, one for user space task
stacks (PSP) and one for the kernel stack (MSP). User tasks cannot access the kernel stack pointer; the
kernel code can select between the two using the MRS and MSR instrucon.
This raises the topic of privileges: clearly if the kernel code can access more registers than the user
task code, the kernel code is privileged. This is an essenal security feature of any operang system
because, without privileges, a userspace task code could modify the kernel code or other task code.
We will discuss this in more detail in Chapter 4, ‘Process management’. For the moment, it is sucient
to know that in the Arm Cortex-M3 there are two privilege levels
4
, ‘Unprivileged’ and ‘Privileged’; in
Unprivileged mode the soware has limited access to the MSR and MRS instrucons which allow
access to special registers, and cannot use the CPS instrucon which allows us to change the privilege
level. For further restricons, see [2].
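The essence of this distinction can be captured in our Python model with a privilege flag. The following is a minimal sketch, not part of the model listings above: PSP and MSP are placeholder register indices, and the two functions merely illustrate the kind of checks involved.
# Placeholder register indices for this sketch only
PSP = 13   # process (user) stack pointer
MSP = 14   # main (kernel) stack pointer

def select_stack_pointer(registers, privileged):
    # Unprivileged code always uses the process stack pointer;
    # only privileged code may select the main stack pointer.
    return registers[MSP] if privileged else registers[PSP]

def change_privilege_level(privileged, requested_level):
    # A CPS-like operation: only privileged code may change the level.
    if not privileged:
        raise PermissionError("cannot change privilege level in unprivileged mode")
    return requested_level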
1.5.5 Memory management
So far, we have assumed that tasks already reside in memory. In practice, the OS will have to load the program code into memory. To do so, the OS must find a sufficient amount of memory for both the program code and the memory required by the program. It would clearly not be practical if the program were to use absolute memory addresses: this would mean that the compiler (or the programmer) would need to know in advance where the program would reside in memory. This would be very inflexible. Therefore, program code will use relative addressing, e.g., relative to the value of the program counter. The OS will set the PC to the starting address of the code in memory.
However, relative addressing does not solve all problems. The main question is how to allocate space in memory for the processes. Initially, we could of course simply fill up the memory, as shown in Figure 1.4. But what happens with the memory of finished tasks? The OS should, of course, reuse it, but it could only do so if a new task does not use any more memory than one of the finished tasks. Again, this would be very restrictive.
The commonly used solution to this problem is to introduce the concept of a logical address space. This is a contiguous address space allocated to a process. The physical addresses that correspond to this logical address space do not have to be contiguous. The operating system is responsible for the translation between the logical and physical address spaces. What this involves is explained in detail in Chapter 6, 'Memory management', but you can already think of ways to organize non-contiguous blocks of physical memory of varying size into a logically contiguous space. Apart from address translation, the OS also must ensure that a process cannot access the memory space of another process: this is called memory protection. Typically, this involves checking a logical address against the upper and lower bounds of the process's logical address space. Because this is a very common operation, there is usually hardware support for it, in the form of a Memory Protection Unit (MPU) in low-end processors such as the Cortex-M3, or as part of a more elaborate Memory Management Unit (MMU) in processors such as the Cortex-A53.
Figure 1.4: Problem with contiguous memory allocation.
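As an illustration, the bounds check performed by such a protection unit amounts to something like the following minimal sketch (the function and variable names are ours, not part of the model):
def check_access(logical_address, lower_bound, upper_bound):
    # The access is allowed only if the address falls inside the
    # process's logical address space.
    return lower_bound <= logical_address < upper_bound

# Example: a process with a 4 KB logical address space starting at 0
assert check_access(0x0FF, 0x000, 0x1000)
assert not check_access(0x1000, 0x000, 0x1000)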
1.5.6 Translation look-aside buffer (TLB)
The MMU can be implemented as a peripheral, as we have done for the DMA unit above, but we will defer this to the in-depth discussion of memory management in Chapter 6. However, we want to introduce one particular part of the MMU, a special type of cache called the translation look-aside buffer (TLB). The translation from logical to physical addresses is quite time-consuming, and therefore the MMU uses the TLB to keep track of recently used translations (Figure 1.5). Unlike the memory cache, which contains the data stored in the memory, the TLB contains the physical address corresponding to a logical address.
Figure 1.5: Logical to physical address translation with translation look-aside buffer (TLB).
The same considerations that led us to use cache lines lead to a similar approach to reducing the size of the lookup structures: we divide both the logical and the physical memory into chunks of a fixed size (called respectively pages and frames), and we store the starting addresses of those chunks in the TLB, rather than individual addresses. The position inside the page is calculated in much the same way as the position in a cache line, using a fixed number of LSBs. Typically, pages in Linux are 4 KB; different sizes are possible, see Chapter 3 and Chapter 7. The TLB differs from the cache in that a miss does not result in a fetch from memory but in a lookup of the physical address in what is called the Page Table; also, writes to the TLB only happen on a miss. However, the similarity between the cache and the TLB allows us to explain the main points of memory management without needing to know anything about how the actual Page Table works.
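To make the hit/miss path concrete, here is a minimal Python sketch of a TLB lookup in the spirit of the cache model of Section 1.4.6. The names are ours, and the page table is simply assumed to be a dictionary that already contains a frame number for every page; how a real page table works is the subject of Chapter 6.
PAGE_SHIFT = 12   # 4 KB pages: the 12 LSBs are the offset within the page

def translate(logical_address, tlb, page_table):
    page = logical_address >> PAGE_SHIFT
    offset = logical_address & ((1 << PAGE_SHIFT) - 1)
    if page in tlb:                  # TLB hit
        frame = tlb[page]
    else:                            # TLB miss: look up the page table
        frame = page_table[page]
        tlb[page] = frame            # the TLB is only written on a miss
    return (frame << PAGE_SHIFT) + offset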
1.6 Exercises and questions
1.6.1 Task scheduling
1. Create a scheduler for a single task in Python. You can use the above code and the Fibonacci example, or you can write your own code.
2. Extend the time_slice function and the memory layout to support a larger number (NTASKS) of tasks.
1.6.2 TLB model
1. Create a TLB model in Python, starting from the cache model code in Section 1.4.6.
2. Given the concept of logical and physical address spaces and the idea of pages, propose a data structure that allows the OS to allocate non-contiguous blocks of physical memory to a process as a contiguous logical address space. Discuss the pros and cons of your proposed data structure.
3. Assume 4 GB of memory divided into 4 KB pages, and assume that a page table lookup is 100x slower than a TLB lookup. What should the hit rate of the TLB be to achieve an average lookup time of twice the TLB lookup time? What would the TLB size have to be?
1.6.3 Modeling the system
1. In a physical system, all actions in the above model take place in parallel. What effect does this have on the model?
2. Suppose you have to design the peripheral for a keyboard which has no locking keys or modifier keys. What would be the state, and which events would raise interrupts?
1.6.4 Bare-bones processor model
1. The LDR and STR instructions work on memory addresses. In principle, there is nothing that stops two programs from using the same memory addresses, but this is, of course, in general not desirable. What could we do to avoid this?
2. Can you think of features that our bare-bones processor is missing?
1.6.5 Advanced processor model
1. If the processor has multiple cores that can execute tasks in parallel, what would need to change in the processor model?
2. Can you see any issues with the cache if every core had its own cache? What if they share a single cache?
1.6.6 Basic operating system concepts
The explanation in Section 1.5 omits a lot of detail and raises several questions, which will be answered in the later chapters. For example:
What happens if there are more than 2 running tasks?
How does a user start a task?
How does the OS load programs from disk into memory?
How does the OS ensure that programs can only access their own memory?
What about sharing of peripherals?
What happens when a task is finished?
The issues of privileges and memory management are discussed in more detail in Chapters 5 and 6. The model presented so far raises several questions:
What is involved in guaranteeing memory protection? For example, how could the OS know the bounds of the logical address space of each process?
Is it sufficient to provide memory protection? Should other resources have similar protections?
What could be the reason that the default page size on Linux is 4 KB? What would happen if it was 10x smaller, or 10x larger?
Can you think of scenarios where logical memory is not necessary?
References
[1] Principles of Arm Memory Maps, Arm Ltd, October 2012, issue C. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.den0001c/DEN0001C_principles_of_arm_memory_maps.pdf
[2] Arm Cortex-A53 MPCore Processor - Technical Reference Manual, Arm Ltd, February 2016, revision r0p4. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0500g/DDI0500G_cortex_a53_trm.pdf
[3] ARM Compiler toolchain Version 5.03 Assembler Reference, Arm Ltd, January 2013. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.dui0489i/DUI0489I_arm_assembler_reference.pdf
[4] A. G. Dean, Embedded Systems Fundamentals with Arm Cortex-M based Microcontrollers: A Practical Approach. Arm Education Media, UK, 2017.
[5] Cortex-M3 Devices Generic User Guide, Arm Ltd, December 2010. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.dui0552a/DUI0552A_cortex_m3_dgug.pdf
[6] Procedure Call Standard for the Arm Architecture, ABI r2.10, Arm Ltd, 2015. [Online]. Available: https://developer.arm.com/docs/ihi0042/latest/procedure-call-standard-for-the-arm-architecture-abi-2018q4-documentation
Chapter 2
A practical view of the Linux system
2.1 Overview
In this chapter, we approach the Linux system from a practical perspective, as experienced by users of the system, in particular administrators and application programmers rather than kernel or driver programmers. We first introduce the essential concepts and techniques that you need to know in order to understand the overall system, and then we discuss the system itself from different angles: what is the role of the OS in booting and initializing the system, and what OS knowledge do a system administrator and a systems programmer need. This chapter is not a how-to guide, but rather provides you with the background knowledge behind how-to guides. It also serves as a roadmap for the rest of the book.
What you will learn
Aer you have studied the material in this chapter, you will be able to:
1. Explain basic operang system concepts: processes, users, les, permissions, and credenals.
2. Analyze the chain of events when boong Linux on the Raspberry Pi.
3. Create a Linux kernel module and build a custom Linux kernel.
4. Discuss the administrator and programmers view on the key operang system concepts covered in
the further chapters.
2.2 Basic concepts
To understand what happens when the system boots and initializes, as well as how the OS affects the tasks of the system administrator and the systems programmer, we need to introduce a number of basic operating system concepts. Most of these apply to any operating system, although the discussion here is specific to Linux on Arm-based systems. The in-depth discussion of these concepts forms the subject of the later chapters, so this section serves as a roadmap for the rest of the book as well.
[Photo: The original Linux announcement on Usenet (1991). Photo by Krd.]
2.2.1 Operating system hierarchy
The Linux kernel is only one component of the complete operating system. Figure 2.1 illustrates the complete Linux system hierarchy. Interfacing between the kernel and the user space applications is the system call interface, a mechanism that allows user space applications to interact with the kernel and hardware. This interface is used by system tools and libraries, and finally by the user applications. The kernel provides functionality such as scheduling, memory management, networking and file system support, and support for interacting with system hardware via device drivers.
Interfacing between the kernel and the hardware are the device drivers and the firmware. In the Linux system, device drivers interact closely with the kernel, but they are not considered part of the kernel, because different drivers are needed depending on the hardware, and they can be added on the fly.
2.2.2 Processes
A process is a running program, i.e., the code for the program and all system resources it uses. The concept of a process is used for the separation of code and resources. The OS kernel allocates memory and other resources to a process; these are private to the process and protected from all other processes. The scheduler allocates time for a process to execute. We also use the term task, which is a bit less strictly defined and usually relates to scheduling: a task is an amount of work to be done by a program. We will also see the concept of threads, which are used to indicate multiple concurrent tasks executing within a single process. In other words, the threads of a process share its resources. For a process with a single thread of execution, the terms task and process are often used interchangeably.
When a process is created, the OS kernel assigns it a unique identifier (called process ID or PID for short) and creates a corresponding data structure called the Process Control Block or Task Control Block (in the Linux kernel, this data structure is called task_struct). This is the main mechanism the kernel uses to manage processes.
Figure 2.1: Operating System Hierarchy (based on http://www.brendangregg.com/linuxperf.html, CC BY-SA Brendan Gregg 2017).
2.2.3 User space and kernel space
The terms 'user space' and 'kernel space' are used mainly to indicate process execution with different privileges. As we have seen in Chapter 1, the kernel code can access all hardware and memory in the system, but for user processes, the access is much more restricted. When we use the term 'kernel space', we mean the memory space accessible by the kernel, which is effectively the complete memory space in the system (assuming the system does not run a hypervisor; otherwise, it is the memory available to the Virtual Machine running the kernel). By 'user space', we mean the memory accessible by a user process. Most operating systems support multiple users, and each user can run multiple processes. Typically, each process gets its own memory space, but processes belonging to a single user can share memory (in which case we'll call them threads).
2.2.4 Device tree and ATAGs
The Linux kernel needs information about the system on which it runs. Although a kernel binary must be compiled for a target architecture (e.g., Arm), a kernel binary should be able to run on a wide variety of platforms for this architecture. This means that the kernel has to be provided with information about the hardware at boot time, e.g., the number of CPUs, the amount of memory, the location of memory, the devices and their location in the memory map, etc. The traditional way to do this on Arm systems was a format called ATAGs, which provided a data structure in the kernel that would be populated with information provided by the bootloader. A more modern and flexible approach is called Device Tree (see https://www.devicetree.org/specifications). It defines a format and syntax to describe system hardware in a Device Tree Source file. A device tree is a tree data structure with nodes that describe the physical devices in a system. The Device Tree source files can be compiled using a special compiler into a machine-architecture-independent binary format called the Device Tree Blob.
2.2.5 Files and persistent storage
The Linux Information Project defines a file as:
"A file is a named collection of related data that appears to the user as a single, contiguous block of information and that is retained in storage." (http://www.linfo.org/file.html)
In this definition, storage refers to computer devices or media which can retain data for relatively long periods (e.g., years or decades), such as solid state drives and other types of non-volatile memory, magnetic hard disk drives (HDDs), CDROMs, and magnetic tape; in other words, persistent storage. This is in contrast with RAM memory, the content of which is retained only temporarily (i.e., only while in use or while the power supply remains on).
A persistent storage medium (which I will call 'disk') such as an SD card, USB memory stick, or hard disk stores data in a linear fashion with sequential access. However, in practice, the disk does not contain a single array of bytes. Instead, it is organized using partitions and file systems. We discuss these in more detail in Chapter 9, but below is a summary of these concepts.
Partition
A disk can be divided into partitions, which means that instead of presenting as a single blob of data, it presents as several different blobs. Partitions are logical rather than physical, and the information about how the disk is partitioned (i.e., the location, size, type, name, and attributes of each partition) is stored in a partition table. There are several standards for the structure of partitions and partition tables, e.g., the GUID Partition Table and MBR.
File system
Each paron of a disk contains a further system for logical organizaon. The purpose of most le
systems is to provide the le and directory (folder) abstracons. There are a great many dierent
le systems (e.g., fat32, ext4, hfs+, ...) and we will cover the most important ones in Chapter 9. For
the purpose of this chapter, what you need to know is that a le system not only allows to store
informaon in the form of les organized in directories but also informaon about the permissions
of usages for les and directories, as well as mestamp informaon (le creaon, modicaon, etc.).
The informaon in a le system is typically organized as a hierarchical tree of directories, and the directory
at the root of the tree is called the root directory. To use a le system, the kernel performs an operaon
called mounng. As long as a le system has not been mounted, the system can’t access the data on it.
Mounng a le system aaches that le system to a directory (mount point) and makes it available
to the system. The root (/) le system is always mounted. Any other le system can be connected or
disconnected from the root le system at any point in the directory tree.
2.2.6 'Everything is a file'
One of the key characteristics of Linux and other UNIX-like operating systems is the often-quoted concept of 'everything is a file'. This does not mean that all objects in Linux are files as defined above, but rather that Linux prefers to treat all objects from which the OS can read data or to which it can write data using a consistent interface. So it might be more accurate to say 'everything is a stream of bytes'. Linux uses the concept of a file descriptor, an abstract handle used to access an input/output resource (of which a file is just one type). So one can also say that in Linux, 'everything is a file descriptor'.
What this means in practice is that the interface to, e.g., a network card, keyboard, or display is represented as a file in the file system (in the /dev directory); system information about both hardware and software is available under /proc. For example, Figure 2.2 shows the listing of /dev and /proc on the Raspberry Pi. We can see device files representing memory (ram*), terminals (tty*), the modem (ppp), and many others. In particular, there is /dev/null, which is a special device that discards the information written to it, and /dev/zero, which returns an endless stream of zero bytes (i.e., 0x00, so when you try cat /dev/zero you will see nothing; try cat /dev/zero | hd instead).
Figure 2.2: Listing of /dev and /proc on the Raspberry Pi running Raspbian.
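For example, the device /dev/zero can be opened and read through a plain file descriptor, exactly like a regular file. The following small Python illustration (ours, for a Linux system such as the Pi) shows this:
import os

# Open the device and read 8 bytes from it via its file descriptor
fd = os.open("/dev/zero", os.O_RDONLY)
data = os.read(fd, 8)
os.close(fd)
print(data)   # b'\x00\x00\x00\x00\x00\x00\x00\x00'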
2.2.7 Users
A Linux system is typically a multi-user system. What this means is that it supports another level of separation, permissions, and protection above the level of processes. A user can run and control multiple processes, each in their own memory space, but with shared access to system resources. In particular, the concept of users and permissions is tightly connected with the file system. The file system permissions for a given user control the access of that user in terms of reading, writing, and executing files in different parts of the file system hierarchy.
Just as the kernel runs in privileged mode to control the user space processes, there is also a need for a privileged user to control the other users (similar to the 'Administrator' on Windows systems). In Linux, this user is called root (for more information about the origin of the name, see www.linfo.org/root.html), and when the system boots, the first process (init, which has PID=1) is run as the root user. The init process can create new processes. In fact, in Linux, any process can create new processes (as explained in more detail in Chapter 4). However, a process owned by the root user can assign ownership of a created process to another user, whereas processes created by a non-root user process can only be owned by that user.
2.2.8 Credentials
In Linux, credentials is the term for the set of privileges and permissions associated with any object. Credentials express, e.g., ownership, capabilities, and security management properties. For example, for files and processes, the key credentials are the user id and group id. To decide what a certain object (e.g., a task) can do to another object (e.g., a file), the Linux kernel performs a security calculation using the credentials and a set of rules. In practice, processes executed as root can access all files and other resources in the system; for a non-root user, file and directory access is determined by a system of permissions on the files and by the membership of groups: a user can belong to one or more groups of users.
File access permissions can be specified for individual users, groups, and everyone. For example, in Figure 2.3, we see that the directory /home/wim can be written to by user wim in group wim. If we try to create an (empty) file using the touch command, this succeeds. However, if we try to do the same in the directory /home/pleroma, owned by user pleroma in group pleroma, we get 'permission denied' because only user pleroma has write access to that directory.
Figure 2.3: Example of restrictions on file creation on the Raspberry Pi running Raspbian.
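The same ownership and permission information can be inspected from a program. The following small Python example (ours, not from the book) prints the owner, group, and permission bits of the current user's home directory, comparable to what ls -ld ~ shows:
import os, stat, pwd, grp

path = os.path.expanduser("~")
st = os.stat(path)
owner = pwd.getpwuid(st.st_uid).pw_name   # owning user
group = grp.getgrgid(st.st_gid).gr_name   # owning group
mode = stat.filemode(st.st_mode)          # e.g. 'drwxr-xr-x'
print(path, owner, group, mode)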
Note that because of the 'everything is a file' approach, this system of permissions extends in general to devices, system information, etc. However, the actual kernel security policies can restrict access further. For example, in Figure 2.2, the numbers in the /proc listing represent currently running processes by their PID.
To illustrate the connection between users, permissions, and processes, Figure 2.4 shows how user wim can list processes in /proc belonging to two different non-root users, wim and pleroma. The command cat /proc/548/maps prints out the entire memory map for the process with PID 548. The map is quite large, so for this example, only the heap memory allocation is shown (using grep heap).
Figure 2.4: Example of restrictions on process memory access via /proc on the Raspberry Pi running Raspbian.
However, when we try to do the same with /proc/600/maps, we get 'Permission denied' because the cat process owned by user wim does not have the right to inspect the memory map of a process owned by another user. This is despite the file permissions allowing read access.
2.2.9 Privileges and user administration
The system administrator creates user accounts and decides on access to resources using groups (using tools such as useradd(8), groupadd(8), chgrp(1), etc.). The kernel manages credentials per process using struct cred, which is a field of the task_struct.
The admin also decides how many resources each user and process gets, e.g., using ulimit. Resource limits are set in /etc/security/limits.conf and can be changed at runtime via the shell command ulimit. Internally, the ulimit implementation uses the getrlimit and setrlimit system calls, which modify the kernel struct rlimit in include/uapi/linux/resource.h.
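These system calls are also available to ordinary programs; for instance, Python's resource module wraps getrlimit and setrlimit. A small illustration (ours, not from the book):
import resource

# RLIMIT_NOFILE is the per-process limit on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft =", soft, "hard =", hard)

# An unprivileged process may lower its own soft limit (up to the hard limit)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(soft, 1024), hard))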
2.3 Booting Linux on Arm-based systems (Raspberry Pi 3)
In this section, we discuss the boot process for Linux on the Raspberry Pi 3. The boot sequence of Linux on Arm-based systems varies significantly from platform to platform. The differences sometimes arise due to the needs of the target market but can also be due to choices made by SoC and platform vendors. The boot sequence discussed here is a specific example to demonstrate what happens on a particular platform.
The Raspberry Pi 3 (Figure 2.5) runs Raspbian Linux on an Arm Cortex-A53 processor which is part of the Broadcom BCM2837 System-on-Chip (SoC). This SoC also contains a GPU (Broadcom VideoCore IV) which shares the RAM with the CPU. The GPU controls the initial stages of the boot process. The SoC also has a small amount of One Time Programmable (OTP) memory which contains information about the boot mode, and a boot ROM with the initial boot code.
Figure 2.5: Boot Process for Raspbian Linux on the Raspberry Pi 3.
2.3.1 Boot process stage 1: Find the bootloader
Stage 1 of the boot process begins with reading the OTP to check which boot modes are enabled. By default, this is SD card boot, followed by a USB device boot. The code for this stage is stored in the on-chip ROM. The boot code checks each of the boot sources for a file called bootcode.bin in the root directory of the first partition on the storage medium (FAT32 formatted); if it is successful, it will load the code into the local 128K (L2) cache and jump to its first instruction to start Stage 2.
Note: The boot ROM supports GUID partitioning and MBR-style partitioning.
2.3.2 Boot process stage 2: Enable the SDRAM
Stage 2 is controlled by bootcode.bin, which is closed-source firmware. It enables the SDRAM and loads Stage 3 (start.elf) from the storage medium into the SDRAM.
2.3.3 Boot process stage 3: Load the Linux kernel into memory
Stage 3 is controlled by start.elf, which is a closed-source ELF-format binary running on the GPU. start.elf loads the compressed Linux kernel binary kernel.img and copies it to memory. It also reads config.txt, cmdline.txt, and bcm2710-rpi-3-b.dtb (the Device Tree Binary).
The file config.txt is a text file containing system configuration parameters which, on a conventional PC, would be edited and stored using a BIOS.
The file cmdline.txt contains the command line arguments to be passed to the Linux kernel (e.g., the file system type and the location of the root file system) using ATAGs, and the .dtb file contains the Device Tree Blob.
2.3.4 Boot process stage 4: Start the Linux kernel
Stage 4 starts kernel.img on the CPU: releasing reset on the CPU causes it to run from the address where the kernel.img data was written. The kernel runs some Arm-specific code to populate CPU registers and turn on the cache, then decompresses itself and runs the decompressed kernel code. The kernel initializes the MMU using Arm-specific code and then runs the rest of the kernel code, which is processor-independent.
2.3.5 Boot process stage 5: Run the processor-independent kernel code
Stage 5 is the processor-independent kernel code. This code consists mainly of initialization functions to set up interrupts, perform further memory configuration, and load the initial RAM disk initramfs. This is a complete set of directories as you would find on a normal root file system, and it was loaded into memory by the Stage 3 boot loader. It is copied into kernel space memory and mounted. This initramfs serves as a temporary root file system in RAM and allows the kernel to fully boot and perform user-space operations without having to mount any physical disks.
A single Linux kernel image can run on multiple platforms with support for a large number of devices and peripherals. To reduce the overhead of loading and running a kernel binary bloated with features that aren't widely used, Linux supports runtime loading of components (modules) that are not needed during early boot. Since the modules needed to interface with peripherals can be part of the initramfs, the kernel can be very small but still support a large number of possible hardware configurations. After the kernel is booted, the initramfs root file system is unmounted, and the real root file system is mounted. Finally, the init function is started, which is the first user-space process. After this, the idle task is started, and the scheduler starts operation.
2.3.6 Initialization
After the kernel has booted, it launches the first process, called init. This process is the parent of all other processes. In the Raspbian Linux distribution that runs on the Raspberry Pi 3, init is actually an alias for /lib/systemd/systemd because Raspbian, as a Debian-derived distribution, uses systemd as its init system. Other Linux distributions can have different implementations of init, e.g., SysV init or upstart.
The systemd process executes several processes to initialize the system: keyboard, hardware drivers, file systems, network, services. It has a sophisticated system for configuring all the processes under its control, as well as for starting and stopping processes, checking their status, logging, changing privileges, etc.
The systemd process performs many tasks, but the principle is always the same: it starts a process under the required user name and monitors its state. If the process exits, systemd takes appropriate action, e.g., restarting the process or reporting the error that caused it to exit.
2.3.7 Login
One of systemd's responsibilities is running the processes that let users log into the system (systemd-logind). To log in via a terminal (or virtual console), Linux uses two programs: getty and login (originally, the tty in getty referred to 'teletype', a precursor to modern terminals). Both run as root.
A basic getty program opens the terminal device, initializes it, prints the login prompt, and waits for a user name to be entered. When this happens, getty executes the login program, passing it the user name to log in as. The login program then prompts the user for a password. If the password is wrong, login simply exits. The systemd process will notice this and spawn another getty process. If the password is correct, login executes the user's shell program as that user. From then on, the user can start processes via the shell.
The reason why there are two separate programs is that both getty and login can be used on their own; for example, a remote login over SSH does not use a terminal but still uses login: each new connection is handled by a program called sshd that starts a login process.
A graphical login is conceptually not that different from the above description. The difference is that instead of the getty/login programs, a graphical login program called the display manager is run, and after authentication, this program launches the graphical shell.
In Raspbian, the display manager is LightDM, and the graphical shell is LXDE (Lightweight X11 Desktop Environment). Like most Linux distributions, the graphical desktop environment is based on the X Window System (X11), a project originally started at MIT and now managed by the X.Org Foundation.
2.4 Kernel administration and programming
The administrator of a Linux system does not need to know the inner workings of the Linux kernel, but needs to be familiar with tools to configure the operating system, including adding functionality to the kernel through kernel modules and compiling a custom kernel.
2.4.1 Loadable kernel modules and device drivers
As explained above, the Linux kernel is modular, and functionality can be loaded at run time using Loadable Kernel Modules (LKMs). This feature is used in particular to configure drivers for the system hardware. Therefore, the administrator needs to be familiar with the main concepts of the module system and have a basic understanding of the role of a device driver.
To insert a module into the Linux kernel, the command insmod(8) can be used. insmod makes an init_module() system call to load the LKM into kernel memory.
The init_module() system call invokes the LKM's initialization routine immediately after it loads the LKM. insmod passes to init_module() the address of the initialization subroutine in the LKM, identified using the macro module_init().
The LKM author sets up the module's init_module to call a kernel function that registers the subroutines that the LKM contains. For example, a character device driver's init_module subroutine might call the register_chrdev kernel subroutine, passing the major and minor number of the device it intends to drive and the address of its own open() routine as arguments. register_chrdev records that when the kernel wants to open that particular device, it should call the open() routine in our LKM.
When an LKM is unloaded (e.g., via the rmmod(8) command), the LKM's cleanup subroutine is called via the macro module_exit().
In practice, the administrator will want to use the more intelligent modprobe(8) command to handle module dependencies automatically. Finally, to list all loaded kernel modules, the command lsmod(8) can be used.
For the curious, the details of the implementation are in init_module, load_module, and do_init_module in kernel/module.c.
2.4.2 Anatomy of a Linux kernel module
As an administrator, sometimes you may have to add a new device to your system for which the standard kernel of your system's Linux distro does not provide a driver. That means you will have to add this driver to the kernel.
A trivial kernel module is very simple. The following module will print some information to the kernel log when it is loaded and unloaded.
Listing 2.4.1: A trivial kernel module C
#include <linux/init.h>    // For macros __init __exit
#include <linux/module.h>  // Kernel LKM functionality
#include <linux/kernel.h>  // Kernel types and function definitions

static int __init hello_LKM_init(void){
    printk(KERN_INFO "Hello from our LKM!\n");
    return 0;
}

static void __exit hello_LKM_exit(void){
    printk(KERN_INFO "Goodbye from our LKM!\n");
}

module_init(hello_LKM_init);
module_exit(hello_LKM_exit);
However, note that a kernel module is not an application; it is a piece of code to be used by the kernel. As you can see, there is no main() function. Furthermore, kernel modules:
do not execute sequentially: a kernel module registers itself to handle requests using its initialization function, which runs and then terminates. The types of request that it can handle are defined within the module code. This is quite similar to the event-driven programming model that is commonly utilized in graphical-user-interface (GUI) applications.
do not have automatic resource management (memory, file handles, etc.): any resources that are allocated in the module code must be explicitly deallocated when the module is unloaded.
do not have access to the common user-space system calls, e.g., printf(). However, there is a printk() function that can output information to the kernel log, and this log can be viewed from user space.
can be interrupted: kernel modules can be used by several different programs/processes at the same time, as they are part of the kernel. When writing a kernel module you must, therefore, be very careful to ensure that the module behavior is consistent and correct when the module code is interrupted.
have to be very resource-aware: as a module is kernel code, its execution contributes to the kernel runtime overhead, both in terms of CPU cycles and memory utilization. So you have to be very aware that your module should not harm the overall performance of your system.
The macros module_init and module_exit are used to idenfy which subrounes should be run when
the module is loaded and unloaded. The rest of the module funconality depends on the purpose of
the module, but the general mechanism used in the kernel to connect a specic module to a generic
API (e.g., the le system API) is via a struct with funcon pointers, which funcons in the same way
as an object interface declaraon in Java or C++. For example, the le system API provides a struct
le_operaons (dened in include/linux/fs.h) which looks as follows:
Lisng 2.4.2: le_operaons struct from <include/linux/fs.h> C
1 structle_operations{
2 struct module *owner;
3 lo_t(*llseek)(structle*,lo_t,int);
4 ssize_t (*read) (structle*,char __user *, size_t,lo_t*);
5 ssize_t (*write) (structle*,const char __user *, size_t,lo_t*);
6 ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
7 ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
8 int (*iterate) (structle*,struct dir_context *);
9 int (*iterate_shared) (structle*,struct dir_context *);
10 __poll_t (*poll) (structle*,struct poll_table_struct *);
11 long (*unlocked_ioctl) (structle*,unsigned int, unsigned long);
12 long (*compat_ioctl) (structle*,unsigned int, unsigned long);
13 int (*mmap) (structle*,struct vm_area_struct *);
14 unsigned longmmap_supported_ags;
15 int (*open) (struct inode *, structle*);
16 int(*ush)(structle*,_owner_tid);
17 int (*release) (struct inode *, structle*);
18 int (*fsync) (structle*,lo_t,lo_t,int datasync);
41
19 int (*fasync) (int, structle*,int);
20 int (*lock) (structle*,int, structle_lock*);
21 ssize_t (*sendpage) (structle*,struct page *, int, size_t,lo_t*,int);
22 unsigned long (*get_unmapped_area)(structle*,
23 unsigned long, unsigned long, unsigned long, unsigned long);
24 int(*check_ags)(int);
25 int(*ock)(structle*,int, structle_lock*);
26 ssize_t (*splice_write)(struct pipe_inode_info *, structle*,lo_t*,
27 size_t, unsigned int);
28 ssize_t (*splice_read)(structle*,lo_t*,struct pipe_inode_info *,
29 size_t, unsigned int);
30 int (*setlease)(structle*,long, structle_lock**,void **);
31 long (*fallocate)(structle*le,intmode,lo_toset,
32 lo_tlen);
33 void (*show_fdinfo)(structseq_le*m,structle*f);
34 #ifndef CONFIG_MMU
35 unsigned (*mmap_capabilities)(structle*);
36 #endif
37 ssize_t(*copy_le_range)(structle*,lo_t,structle*,
38 lo_t,size_t, unsigned int);
39 int(*clone_le_range)(structle*,lo_t,structle*,lo_t,
40 u64);
41 ssize_t(*dedupe_le_range)(structle*,u64,u64,structle*,
42 u64);
43 } __randomize_layout;
So if you want to implement a module for a custom file system driver, you will have to provide implementations of the calls you want to support, with the signatures as provided in this struct. Then, in your module code, you can create an instance of this struct and populate it with pointers to the functions you've implemented. For example, assuming you have implemented my_file_open, my_file_read, my_file_write, and my_file_close, you would create the following struct:
Lisng 2.4.3: Example le_operaons struct C
1 static structle_operationsmy_le_ops=
2 {
3 .open=my_le_open,
4 .read=my_le_read,
5 .write=dmy_le_write,
6 .release=dmy_le_close,
7 };
Now all that remains is to make the kernel use this struct, and this is achieved using yet another API call, which you call in the initialization subroutine. In the case of a driver for a character device (e.g., a serial port or audio device), this call would be register_chrdev(0, DEVICE_NAME, &my_file_ops). This API call is also defined in include/linux/fs.h. Other types of devices have similar calls to register new functionality with the kernel.
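Putting the pieces together, a hedged sketch of such a character device module might look as follows. This is not one of the book's own listings: the device name, the stub implementations, and the use of dynamic major number allocation (passing 0 to register_chrdev) are illustrative assumptions; only register_chrdev() and unregister_chrdev() are the kernel API calls mentioned above.

#include <linux/init.h>
#include <linux/module.h>
#include <linux/fs.h>  /* register_chrdev(), unregister_chrdev(), struct file_operations */

#define DEVICE_NAME "mychardev"   /* hypothetical device name */

static int major;                 /* major number allocated by the kernel */

/* Minimal stub implementations; a real driver would do useful work here. */
static int my_file_open(struct inode *inode, struct file *file) { return 0; }
static int my_file_close(struct inode *inode, struct file *file) { return 0; }
static ssize_t my_file_read(struct file *file, char __user *buf, size_t len, loff_t *off) { return 0; }
static ssize_t my_file_write(struct file *file, const char __user *buf, size_t len, loff_t *off) { return len; }

static struct file_operations my_file_ops = {
    .owner   = THIS_MODULE,
    .open    = my_file_open,
    .read    = my_file_read,
    .write   = my_file_write,
    .release = my_file_close,
};

static int __init my_driver_init(void)
{
    /* Passing 0 asks the kernel to pick a free major number for us. */
    major = register_chrdev(0, DEVICE_NAME, &my_file_ops);
    if (major < 0)
        return major;
    printk(KERN_INFO "Registered %s with major number %d\n", DEVICE_NAME, major);
    return 0;
}

static void __exit my_driver_exit(void)
{
    unregister_chrdev(major, DEVICE_NAME);
}

module_init(my_driver_init);
module_exit(my_driver_exit);
MODULE_LICENSE("GPL");

Once such a module is loaded (e.g., with insmod), the kernel calls my_file_open() and the other registered routines whenever a process opens a device node created with the returned major number.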
2.4.3 Building a custom kernel module
If you want to create your own kernel module, you don't need the entire kernel source code, but you do need the kernel header files. On a Raspberry Pi 3 running Raspbian, you can use the following commands to install the kernel headers:
Lisng 2.4.4: Installing kernel headers on Raspbian Bash
1 $ sudo apt-get update
2 $ sudo apt-get install raspberrypi-kernel-headers
The Linux kernel has a dedicated Makefile-based system to build modules (and to build the actual kernel) called kbuild. The kernel documentation provides a good explanation of how to build a kernel module in Documentation/kbuild/modules.txt.
The disadvantage of building a kernel module from source is that you have to rebuild it every time you upgrade the kernel. The Dynamic Kernel Module Support (dkms) framework offers a way to ensure that custom modules are automatically rebuilt whenever the kernel version changes.
2.4.4 Building a custom kernel
In some cases, it might be necessary or desirable for the system administrator to build a custom kernel. Building a custom kernel gives fine-grained control over many of the kernel configuration options and can be used to achieve better performance or a smaller footprint.
The process of building a custom kernel is explained on the Raspberry Pi website. For this, you will need the complete kernel sources. Again, the kernel documentation is a great source of additional information; have a look at Documentation/kbuild/kconfig.txt, Documentation/kbuild/kbuild.txt, and Documentation/kbuild/makefiles.txt.
If you compile the Linux kernel on a Raspberry Pi device, it will take several hours, even with parallel compilation threads enabled. Cross-compiling the kernel on a modern x86-64 PC, on the other hand, takes only a few minutes.
2.5 Administrator and programmer view of the key chapters
From a systems programmer or administrator perspective, Linux is a POSIX-compliant system. POSIX (the Portable Operating System Interface) is a family of IEEE standards aimed at maintaining compatibility between operating systems. POSIX defines the application programming interface (API) used by programs to interact with the operating system. In practice, the standards are maintained by The Open Group, the certifying body for the UNIX trademark, which publishes the Single UNIX Specification, an extension of the IEEE POSIX standards (currently at version 4). The key chapters in this book discuss both the general (non-Linux-specific) concepts and theory as well as the POSIX-compliant Linux implementations.
2.5.1 Process management
Linux administrators and programmers need to be familiar with processes: what they are and how they are managed by the kernel. Chapter 4, 'Process management,' introduces the process abstraction. We outline the state that needs to be encapsulated. We walk through the typical lifecycle of a process from forking to termination. We review the typical operations that will be performed on a process.
2.5.2 Process scheduling
Scheduling of processes and threads has a huge impact on system performance, and therefore Linux administrators and programmers need a good understanding of scheduling in general and the scheduling capabilities of the Linux kernel in particular. It is important to understand how to manage process priorities and per-process and per-user resources, and how to make efficient use of the scheduler. Chapter 5, 'Process scheduling,' discusses how the OS schedules processes on a processor. This includes the rationale for scheduling, the concept of context switching, and an overview of scheduling policies (FCFS, priority, ...) and scheduler architectures (FIFO, multilevel feedback queues, priorities, ...). The Linux scheduler is studied in detail, with particular attention to the Completely Fair Scheduler but also discussing soft and hard real-time scheduling in the Linux kernel.
2.5.3 Memory management
While memory itself is remarkably straightforward, OS architects have built lots of abstraction layers on top. Principally, these abstractions serve to improve performance and/or programmability. For both the administrator and the programmer, it is important to have a good understanding of how the memory system works and what its performance trade-offs are. This is tightly connected with concepts such as virtual memory, paging, swap space, etc. The programmer also needs to understand how memory is allocated and what the memory protection mechanisms are. All this is covered in Chapter 6, 'Memory management.' We briefly review caches (in hardware and software) to improve access speed. We go into detail about virtual memory to improve the management of the physical memory resource. We provide highly graphical descriptions of address translation, paging, page tables, page faults, swapping, etc. We explore standard schemes for page replacement, copy-on-write, etc. We examine concrete examples in the Arm architecture and the Linux OS.
2.5.4 Concurrency and parallelism
Concurrency and parallelism are more important for the programmer than the administrator, as concurrency is needed for responsive, interactive applications and parallelism for performance. From an administrator perspective, it is important to understand the impact of the use of multiple hardware threads by a single application. In Chapter 7, 'Concurrency and parallelism,' we discuss how the OS supports concurrency and how the OS can assist in exploiting hardware parallelism. We define concurrency and parallelism and discuss how they relate to threads and processes. We discuss the key issue of resource sharing, covering locking, semaphores, deadlock, and livelock. We look at OS support for concurrent and parallel programming via POSIX threads and present an overview of practical parallel programming techniques such as OpenMP, MPI, and OpenCL.
2.5.5 Input/output
Chapter 8, 'Input/output,' presents the OS abstraction of an I/O device. We review device interfacing, covering topics like polling, interrupts, and DMA, and we discuss memory-mapped I/O. We investigate a range of device types to highlight their diverse features and behavior. We cover hardware registers, memory mapping, and coprocessors. Furthermore, we examine the ways in which devices are exposed to programmers, and we review the structure of a typical device driver.
2.5.6 Persistent storage
Because Linux, as a Unix-like operating system, is designed around the file system abstraction, a good understanding of files and file systems is important for the administrator, in particular of concepts such as mounting, formatting, checking, permissions, and links. Chapter 9, 'Persistent storage,' focuses on file systems. We discuss the use cases and explain how the raw hardware (block- and sector-based storage, etc.) is abstracted at the OS level. We talk about mapping high-level concepts like files, directories, permissions, etc. down to physical entities. We review allocation, space management, and recovery from failure. We present a case study of a Linux file system. We also discuss Windows-style FAT, since this is how USB bulk storage operates.
2.5.7 Networking
Networking is important at many levels: when booting, the firmware deals with the MAC layer, the kernel starts the networking subsystem (ARP, DHCP), and init starts daemons; then user processes start clients and/or daemons. The administrator may need to tune the TCP/IP stack and configure the kernel firewall. Most applications today require network access. As the Linux networking stack is handled by the kernel, the programmer needs to understand how Linux manages networking as well as the basic APIs.
Chapter 10, 'Networking,' introduces networking from a Linux kernel perspective: why networking is treated differently from other types of I/O, what the OS requirements are to support the network stack, etc. We introduce socket programming with a focus on the role the OS plays (e.g., buffering, file abstraction, supporting multiple clients, ...).
2.6 Summary
In this chapter, we have introduced several basic operating system concepts and illustrated how they relate to Linux. We have discussed what happens when a Linux system (in particular on the Raspberry Pi) boots and initializes. We have introduced kernel modules and kernel compilation. Finally, we have presented a roadmap of the key chapters in the book, highlighting their relevance to Linux system administrators and programmers.
2.7 Exercises and questions
2.7.1 Installing Raspbian on the Raspberry Pi 3
1. Following the instructions on raspberrypi.org, download the latest Raspbian disk image and install it either as a Virtual Machine using qemu or on an actual Raspberry Pi 3 device.
2. Boot the device or VM and ping it (as explained on the Raspberry Pi web site).
2.7.2 Setting up SSH under Raspbian
1. Configure your Raspberry Pi to start an ssh server when it boots (this is not discussed in the text).
2. Log in via ssh and create a dedicated user account.
3. Forbid access via ssh to any account except this dedicated one.
2.7.3 Writing a kernel module
1. Write a simple kernel module that prints some information to the kernel log file when loaded, as explained in the text.
2. Write a more involved kernel module that creates a character device in /dev.
2.7.4 Booting Linux on the Raspberry Pi
1. Describe the stages of the Linux boot process for the Raspberry Pi.
2. Explain the purpose of the initramfs RAM disk.
2.7.5 Initialization
1. After the kernel has booted, it launches the first process, called init. What does this process do?
2. Are there specific requirements on the init process?
2.7.6 Login
1. Which programs are involved in logging in to the system via a terminal?
2. Explain the login process and how the kernel is involved.
2.7.7 Administration
1. Explain the role of the /dev and /proc file systems in system administration.
2. Explain the Linux approach to permissions: who are the participants, what are the restrictions, and what is the role of the kernel?
3. As a system administrator, which tools do you have at your disposal to control and limit the behavior of your user processes in terms of CPU and memory utilization?
Chapter 3
Hardware architecture
Operang Systems Foundaons with Linux on the Raspberry Pi
48
A brief history of Arm, based on an interview from 2012 with Sophie Wilson FRS FREng, a British computer scientist and software engineer who designed the Acorn Micro-Computer and later the instruction set of the Arm processor, which became the de facto model used in 21st-century smartphones.
Image ©2013 Chris Monk, CC BY 2.0, commons.wikimedia.org
In 1983 Acorn Computers had produced the BBC Microcomputer. It was designed as a two-processor system from the outset in order to be able to build both a small cheap machine and a big expensive workstation-style machine. This was possible by using two processors: an IO processor and a second processor that would do the actual heavy lifting. Acorn made many variants of the second processor based on existing microprocessors.
In Sophie's words, "We could see what all these processors did and what they didn't do. So, the first thing they didn't do was they didn't make good use of the memory system. The second thing they didn't do was that they weren't fast, they weren't easy to use." Regarding the rationale behind the design of the original Arm processor, Sophie said, "We rather hoped that we could get to a power level such that if you wrote in a higher-level language, you could, e.g., write 3D graphics games. For the processors that were on sale at the time that wasn't true. They were too slow. So we felt we needed a better processor. We particularly felt we needed a better processor in order to compete with what was just beginning to be a flood of IBM PC compatibles. So, we gave ourselves a project slogan which was MIPS for the masses". "This was very different to what other people were doing at the time. RISC processor research had just been sort of released by IBM, by Berkeley, by Stanford, and they were all after making workstation-class machines that were quite high end. We ended up wanting to do the same thing but at the low end, a machine for the masses that would be quite powerful but not super powerful."
"ARM was that machine: a machine that was MIPS for the masses. We started selling Arm powered machines in 1986, 1987. The things that we'd endowed it with, what we'd set Arm up to be, with its cheap and powerful mindset, were the things that became valuable. When people wanted to put good amounts of processing into something, that was the really important attribute."
"We designed a deeply embedded processor, or an embedded processor, without consciously realizing it in our striving for what we thought would be ideal for our marketplace; that's been what's really mattered. As a sort of side effect of making it cheap and simple to use, we also ended up making it power efficient; that wasn't intentional. In hindsight, it was an obvious accident. We only had 25,000 transistors in the first one. We were worried about power dissipation. We needed to be extremely careful for something that would be mass manufactured and put into cheap machines without heat sinks and that sort of thing. So there were already some aspects of power conservation in the design, but we performed way better than that and as the world has gone increasingly mobile that aspect of Arm has mattered as well. But to start off, we designed a really good, deeply embedded processor."
3.1 Overview
In this chapter, we discuss the hardware on which the operating system runs, with a focus on the Linux view of the hardware system and the OS support features of the Arm Cortex series processors. The purpose of this chapter is to provide you with a usable mental model of the hardware system and to explain the need for an operating system and how the hardware supports the OS.
What you will learn
After you have studied the material in this chapter, you will be able to:
1. Discuss the support that modern hardware offers for operating systems (dedicated registers, timers, interrupt architecture, DMA).
2. Compare and contrast instruction sets for the Arm Cortex M0+ and Arm Cortex A53 in terms of purpose, capability, and OS support.
3. Explain the role and structure of the address map.
4. Explain the hardware structure of the memory subsystem (caches, TLB, MMU).
3.2 Arm hardware architecture
Figure 3.1 [1] shows the entire Arm processor family, with the most recent members on the right and the highest-performance and most capable processors at the top. We will illustrate the Arm hardware architecture using two quite different processors as examples. The Arm Cortex-M0+ is a single-core, very low gate count, highly energy-efficient processor that is intended for microcontroller and deeply embedded applications that require an area-optimized processor and low power consumption, such as IoT devices. It does not have a cache and uses the 16-bit Armv6-M Thumb instruction set. In general, such processors will not run Linux; however, many of the main OS support features are still available.
Figure 3.1: The Arm processor family. (The figure arranges the processors by system capability and performance: Application processors (Cortex-A series, with MMU, supporting Linux and mobile OSs), Real-time processors (Cortex-R series), and Microcontrollers and deeply embedded processors (Cortex-M series), alongside the Classic Arm processors (ARM7, ARM9, ARM11 series).)
By contrast, the Arm Cortex-A53, used in the Raspberry Pi 3, is a mid-range, low-power processor that implements the Armv8-A architecture. The Cortex-A53 processor has one to four cores, each with an L1 memory system, and a single shared L2 cache. It is a 64-bit processor which supports the AArch64 and AArch32 (including Thumb) execution states. It is intended as an Application Processor for application domains such as mobile computing, smartphones, and energy-efficient servers.
All Arm processor systems use the Advanced Microcontroller Bus Architecture (AMBA), an open-standard specification for the connection and management of functional blocks in system-on-chip (SoC) designs.
All Arm processors have a RISC (Reduced Instruction Set Computing) architecture.¹ RISC-based processors typically require fewer transistors than those with a complex instruction set computing (CISC) architecture (e.g., x86), which can result in lower cost and lower power consumption. Furthermore, as the instructions are simpler, most instructions can be executed in a single cycle, which makes instruction pipelining simpler and more efficient. The complex functionality supported in a CISC instruction set is achieved through a combination of multiple RISC instructions.
Typically, RISC machines have a large number of general-purpose registers (while CISC machines have more special-purpose registers). In a RISC architecture, any register can contain either data or an address. Furthermore, a RISC processor typically operates on data held in registers. Separate load and store instructions transfer data between the register bank and external memory (this is called a load-store architecture).
3.3 Arm Cortex M0+
The Arm Cortex-M0+ processor is a low-spec embedded processor, typically used for applications that need low power and don't need full OS support. Figure 3.2 shows the Arm MPS2+ Prototyping Board for Cortex-M based designs, an FPGA development platform supporting the entire Cortex-M processor range except for the M23 and M33. The functional block diagram of the Cortex-M0+ processor [2] is shown in Figure 3.3. The Cortex-M0+ uses the AHB-Lite (Advanced High-performance Bus Lite) bus standard [3]. AHB-Lite is a bus interface that supports a single bus master and provides high-bandwidth operation.
Figure 3.2: Arm MPS2+ FPGA Prototyping Board for Cortex-M based designs. Photo by author.
¹ The name ARM was originally an acronym for Acorn RISC Machine and was altered to Advanced RISC Machines.
It is typically used to communicate with internal memory devices, external memory interfaces, and high-bandwidth peripherals. Low-bandwidth peripherals can be included as AHB-Lite slaves but typically reside on the AMBA Advanced Peripheral Bus (APB). Bridging between AHB and APB is done using an AHB-Lite slave, known as an APB bridge.
Figure 3.3: Cortex-M0+ processor functional block diagram.
Figure 3.4: Thumb instruction set support in the Cortex-M processors.
3.3.1 Interrupt control
The Cortex-M0+ handles interrupts via a programmable controller called the Nested Vectored Interrupt Controller (NVIC). The NVIC architecture supports up to 240 dynamically re-prioritizable interrupts, each with up to 256 levels of priority (the Cortex-M0+ implementation supports a smaller subset of these). The controller keeps track of stacked/nested interrupts to enable back-to-back processing ("tail-chaining") of interrupts.
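As a concrete illustration, bare-metal software enables or disables an interrupt line by setting the corresponding bit in the NVIC's memory-mapped set-enable or clear-enable register. The sketch below assumes the standard Cortex-M System Control Space addresses for these registers; the interrupt number passed in is device-specific.

#include <stdint.h>

/* NVIC registers in the Cortex-M System Control Space (standard addresses). */
#define NVIC_ISER (*(volatile uint32_t *)0xE000E100u)  /* Interrupt Set-Enable   */
#define NVIC_ICER (*(volatile uint32_t *)0xE000E180u)  /* Interrupt Clear-Enable */

/* Enable external interrupt line irq_num (0..31): writing 1 sets the enable
   bit, writing 0 has no effect, so no read-modify-write is needed. */
static void irq_enable(unsigned int irq_num)
{
    NVIC_ISER = (1u << irq_num);
}

/* Disable external interrupt line irq_num (0..31). */
static void irq_disable(unsigned int irq_num)
{
    NVIC_ICER = (1u << irq_num);
}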
3.3.2 Instruction set
As mentioned, the Cortex-M0+ implements the Armv6-M Thumb instruction set; this is a subset of the Armv7-M Thumb instruction set and includes a number of 32-bit instructions that use Thumb-2 technology. The Thumb instruction set is a 16-bit instruction set formed of a subset of the most commonly used 32-bit Arm instructions.
(Figure 3.3 shows the Cortex-M0+ processor core together with the NVIC, the bus matrix, and the AHB-Lite interface to the system, plus optional components: the Wakeup Interrupt Controller (WIC), the Debug Access Port (DAP) with a Serial Wire or JTAG debug port, the breakpoint and watchpoint unit, the memory protection unit, and the single-cycle I/O port. Figure 3.4 compares Thumb instruction set support across the Cortex-M0/M0+ (Armv6-M: general data processing and I/O control tasks) and the Cortex-M3, M4, and M7 (Armv7-M: adding advanced data processing, bit field manipulations, DSP, and floating point).)
Thumb instructions have corresponding 32-bit Arm instructions that have the same effect on the processor model. Thumb instructions operate with the standard Arm register configuration. On execution, 16-bit Thumb instructions are transparently decompressed to full 32-bit Arm instructions in real time, without performance loss. For more details, we refer to [2]. Figure 3.4 illustrates the various Arm Thumb instruction sets and the purposes of the instructions. The key point to notice is that the Armv6-M Thumb instruction set is very small and is a very reduced subset of the complete Thumb instruction set.
3.3.3 System timer
An interesting feature of the Cortex-M0+ is the optional 24-bit System Timer (SysTick). This timer can be used by an operating system. It can be polled by software or can be configured to generate an interrupt. The SysTick interrupt has its own entry in the vector table and therefore can have its own handler. The SysTick timer is controlled via a set of special system control registers.
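As an illustration, the following sketch (not from the book) programs SysTick directly through its memory-mapped control registers. The register addresses follow the standard Cortex-M System Control Space layout; the reload value and the CMSIS-style handler name SysTick_Handler are illustrative assumptions.

#include <stdint.h>

/* SysTick registers in the Armv6-M System Control Space (standard addresses). */
#define SYST_CSR (*(volatile uint32_t *)0xE000E010u) /* Control and Status */
#define SYST_RVR (*(volatile uint32_t *)0xE000E014u) /* Reload Value       */
#define SYST_CVR (*(volatile uint32_t *)0xE000E018u) /* Current Value      */

/* Start SysTick so that it raises an interrupt every 'ticks' processor clocks. */
static void systick_start(uint32_t ticks)
{
    SYST_RVR = (ticks - 1u) & 0x00FFFFFFu; /* 24-bit reload value           */
    SYST_CVR = 0u;                         /* any write clears the counter  */
    SYST_CSR = (1u << 2)                   /* CLKSOURCE: processor clock    */
             | (1u << 1)                   /* TICKINT: enable the interrupt */
             | (1u << 0);                  /* ENABLE: start counting        */
}

/* Handler name as used in the conventional CMSIS vector table (an assumption;
   it depends on the startup code of your toolchain). */
void SysTick_Handler(void)
{
    /* OS tick work, e.g., update a tick counter or invoke the scheduler. */
}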
3.3.4 Processor mode and privileges
The Cortex-M0+ processor supports the Armv6-M Thread and Handler modes through a control register (CONTROL) and two different stack pointers, the Main Stack Pointer (MSP) and the Process Stack Pointer (PSP), as explained in Chapter 1. Thread mode is used to execute application software. The processor enters Thread mode when it comes out of reset. Handler mode is used to handle exceptions. The processor returns to Thread mode when it has finished all exception processing.
It also (optionally) supports different privilege levels for software execution as follows:
Unprivileged: The software has limited access to the MSR and MRS instructions, and cannot use the CPS instruction or access the system timer, NVIC, or system control block. It might have restricted access to memory or peripherals.
Privileged: The software can use all the instructions and has access to all resources.
In Thread mode, the CONTROL register controls whether software execution is privileged or unprivileged. In Handler mode, software execution is always privileged. Only privileged software can write to the CONTROL register to change the privilege level for software execution in Thread mode. Unprivileged software can use the SVC instruction to make a supervisor call to transfer control to privileged software.
3.3.5 Memory protection
The Cortex-M0+ optionally supports memory protection through a Memory Protection Unit (MPU). When implemented, the processor supports the Armv6 Protected Memory System Architecture model [2]. The MPU provides support for protection regions with priorities and access permissions. The MPU can be used to enforce privilege rules, separate processes, and manage memory attributes.
Considering the above features, in principle, the M0+ is capable of running an OS like Linux. In practice, embedded systems with a Cortex-M0+ will not have sufficient storage and memory to run Linux, but they can support other OSs such as FreeRTOS.²
² https://www.freertos.org/
3.4 Arm Cortex A53
This processor is used in the Raspberry Pi 3, shown in Figure 3.5. The functional block diagram of the Cortex-A53 processor [4] is shown in Figure 3.6. It is immediately clear that this is a much more complex processor, with up to 4 cores and a 2-level cache hierarchy. Each core (bottom row) has a dedicated Floating-Point Unit (FPU) and the Neon SIMD (single instruction multiple data) architecture extension. Of the Governor blocks at the top, the main features of interest from an OS perspective are the "Arch timer" and the "GIC CPU interface". The other blocks (CTI, Retention control, and Debug over power down) provide advanced debug and power-saving support.
Figure 3.5: Raspberry Pi 3 Model B with Arm Cortex-A53. Photo by author.
Figure 3.6: Cortex-A53 processor functional block diagram.
3.4.1 Interrupt control
The "GIC CPU interface" block represents the Generic Interrupt Controller CPU Interface, an implementation of the Generic Interrupt Controller (GIC) architecture defined as part of the Armv8-A architecture. The GIC defines the architectural requirements for handling all interrupt sources for any processing element connected to a GIC, and a common interrupt controller programming interface applicable to uniprocessor or multiprocessor systems.
(Figure 3.6 shows four cores, each with L1 instruction and data caches, debug and trace logic, an FPU and NEON extension, and a Crypto extension; per-core governor blocks containing the Arch timer, GIC CPU interface, clock and reset, CTI, retention control, and debug-over-power-down logic; and a shared Level 2 memory system with the L2 cache, Snoop Control Unit (SCU), ACP slave, and an ACE/AMBA 5 CHI master bus interface.)
The GIC is a much more advanced and flexible interrupt handling system than the NVIC of the Cortex-M0+ because it needs to support heterogeneous multicore systems and virtualization. Rather than the simple set of registers used by the NVIC, the GIC uses a memory-mapped interface of 255KB as well as a set of GIC control registers (GICC*) and registers to support virtualization of interrupts (GICH*, GICV*) in the CPU.
3.4.2 Instruction set
The Cortex-A53 supports both the AArch32 and AArch64 instruction set architectures. AArch32 includes the Thumb instruction set used in the Cortex-M series. Consequently, code compiled for the Cortex-M0+, for example, can run on the Cortex-A53. More to the point, the Raspbian Linux distribution for the Raspberry Pi 3 is a 32-bit distribution, so the processor runs the OS and all applications in the AArch32 state.
Figure 3.7: Arm architecture evolution.
Figure 3.7, adapted from [5], shows how the Armv7-A architecture has been incorporated into the Armv8-A architecture. In addition, Armv8 supports two execution states: AArch32, in which the A32 and T32 instruction sets (Arm and Thumb in Armv7-A) are supported, and AArch64, which provides the 64-bit A64 instruction set. Armv8-A is backwards compatible with Armv7-A, but the exception, privilege, and security model has been significantly extended, as discussed below. In AArch32, the Armv7-A Large Physical Address Extensions are supported, providing 32-bit virtual addressing and 40-bit physical addressing. In AArch64, this is extended in a backward-compatible way to provide 64-bit virtual addresses and a 48-bit physical address space. Another addition is cryptographic support at the instruction level, i.e., dedicated instructions to speed up cryptographic computations.
The latest ISO/IEC standards for C (C11, ISO/IEC 9899:2011) and C++ (C++11, ISO/IEC 14882:2011) introduce standard capabilities for multi-threaded programming. This includes the requirement for standard implementations of mutexes and other forms of "uninterruptible object access." The Load-Acquire and Store-Release instructions introduced in AArch64 have been added to comply with these standards.
(Figure 3.7 contrasts Armv7-A features (ARM+Thumb ISAs, NEON, hard float, Advanced SIMD with SP float, TrustZone, Large Physical Address Extension, Virtualization Extensions, 4KB pages, 32-bit VA / 40-bit PA) with the Armv8-A additions (A32+T32 and A64 ISAs, Advanced SIMD with SP+DP float, IEEE 754-2008 compliant floating point, load-acquire/store-release instructions for C11/C++11 compliance, Crypto instructions, the EL0-EL3 exception hierarchy, {4, 16, 64}KB pages, and >32-bit VA / 48-bit PA).)
Floang-point and SIMD support
The Armv8 architecture provides support for IEEE 754-2008 oang-point operaons and SIMD
(Single Instrucon Mulple Data) or vector operaons through dedicated registers and instrucons.
The Armv8 architecture provides two register les, a general-purpose register le, and a SIMD and
oang-point register (SIMD&FP) register le. In each of these, the possible register widths depend on
the Execuon state.
In AArch64 state, there is:
A general-purpose register le containing 31 64-bit registers. Many instrucons can access these
registers as 64-bit registers or as 32-bit registers, using only the boom 32 bits.
A SIMD&FP register le containing 32 128-bit registers. The quadword integer and oang-point
data types only apply to the SIMD&FP register le. The AArch64 vector registers support 128-
bit vectors (the eecve vector length can be 64-bits or 128-bits depending on the instrucon
encoding used).
In AArch32 state, there is:
A general-purpose register le containing 32-bit registers. Two 32-bit registers can support
a doubleword; vector formang is supported.
A SIMD&FP register le containing 64-bit registers. AArch32 state does not support quadword
integer or oang-point data types.
Both AArch32 and AArch64 states support SIMD and oang-point instrucons:
AArch32 state provides:
SIMD instrucons in the base instrucon sets that operate on the 32-bit general-purpose
registers.
Advanced SIMD instrucons that operate on registers in the SIMD&FP register le.
Floang-point instrucons that operate on registers in the SIMD&FP register le.
AArch64 state provides:
Advanced SIMD instrucons that operate on registers in the SIMD&FP register le.
Floang-point instrucons that operate on registers in the SIMD&FP register le.
3.4.3 System timer
The Arm Cortex-A53 implements the Arm Generic Timer architecture [6]. The Generic Timer can schedule events and trigger interrupts based on an incrementing counter value. It provides:
Generation of timer events as interrupt outputs.
Generation of event streams.
It provides a system counter that measures the passing of real time, but it also supports virtual counters that measure the passing of virtual time, i.e., the "equivalent real time" on a Virtual Machine.
The Cortex-A53 processor provides a set of timer registers within each core of the cluster. The timers are:
An EL1 Non-secure physical timer.
An EL1 Secure physical timer.
An EL2 physical timer.
A virtual timer.
The Cortex-A53 processor does not include the system counter; this resides in the SoC. The system counter value is distributed to the Cortex-A53 processor over a synchronous binary-encoded 64-bit bus. For more details, we refer to the Technical Reference Manual [4].
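As a user-space illustration, the sketch below reads the virtual counter and its frequency via the architected EL0-accessible system registers CNTVCT_EL0 and CNTFRQ_EL0. It assumes an AArch64 toolchain and an environment (such as 64-bit Linux) in which EL0 access to these registers is enabled; on the 32-bit Raspbian distribution discussed earlier, the register names and access mechanism differ.

#include <stdint.h>
#include <stdio.h>

/* Read the Armv8-A Generic Timer virtual counter (ticks since an arbitrary start). */
static inline uint64_t read_cntvct(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(v));
    return v;
}

/* Read the counter frequency in Hz, as programmed by firmware. */
static inline uint64_t read_cntfrq(void)
{
    uint64_t f;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
}

int main(void)
{
    uint64_t t0 = read_cntvct();
    /* ... do some work here ... */
    uint64_t t1 = read_cntvct();
    printf("elapsed: %llu ticks at %llu Hz\n",
           (unsigned long long)(t1 - t0), (unsigned long long)read_cntfrq());
    return 0;
}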
3.4.4 Processor mode and privileges
In terms of privileges, the Cortex-A53 implements the Armv8 exception model, with four Exception levels, EL0-EL3, that provide an execution privilege hierarchy:
EL0 has the lowest software execution privilege, and execution at EL0 is called unprivileged execution.
Increased values of n, from 1 to 3, indicate increased software execution privilege. The OS would run at EL1.
EL2 provides support for processor virtualization.
EL3 provides support for two security states, as part of the TrustZone architecture:
In Secure state, the processor can access both the Secure and the Non-secure memory address space. When executing at EL3, it can access all the system control resources.
In Non-secure state, the processor can access only the Non-secure memory address space and cannot access the Secure system control resources.
The addition of EL3 makes it possible, for example, to run a trusted OS in parallel with a hypervisor supporting non-trusted OSs on a single system.
It is possible to switch at run time between the AArch32 and AArch64 instruction set architectures, but there are certain restrictions relating to the exception levels, explained in Figure 3.8. Essentially, code running at a given exception level can only be AArch64 if all higher exception levels are also AArch64: an AArch64 OS can host a mix of AArch64 and AArch32 applications, but an AArch32 OS cannot host an AArch64 application, and an AArch32 hypervisor cannot host an AArch64 OS.
Figure 3.8: Moving between AArch32 and AArch64.
For each implemented Exception level in AArch64 state, a dedicated stack pointer register is implemented. In AArch32 state, the stack pointer depends on the "PE mode" (PE modes do not exist in AArch64). PE modes support normal software execution and handle exceptions. The current mode determines the set of general-purpose and special-purpose registers that are available. The AArch32 modes are:
Monitor mode. This mode always executes at Secure EL3.
Hyp (hypervisor) mode. This mode always executes at Non-secure EL2.
System, Supervisor, Abort, Undefined, IRQ, and FIQ modes. The Exception level these modes execute at depends on the Security state:
In Secure state: execute at EL3 when EL3 is using AArch32.
In Non-secure state: always execute at EL1.
User mode. This mode always executes at EL0.
3.4.5 Memory management unit
As explained in Chapter 1, modern processors provide hardware support for address translation and memory protection. We also briefly explained the concepts of memory pages and the page table. A more detailed discussion is provided in Chapter 6, "Memory management." For the purpose of the discussion of the Cortex-A53 MMU, we can consider the terms "virtual memory" and "logical memory" to be the same. An additional complexity is caused by the support for Virtual Machines (hypervisor) in
the Armv8 architecture: as each VM must provide the illusion of running on real hardware, an extra level of addressing, called the Intermediate Physical Address (IPA), is required.
The MMU controls table-walk hardware that accesses translation tables in main memory. It translates virtual addresses to physical addresses and provides fine-grained memory system control through a set of virtual-to-physical address mappings and memory attributes held in page tables. These are loaded into the Translation Lookaside Buffer (TLB) when a location is accessed. In practice, the TLB is split into a very small, very fast micro TLB and a larger main TLB.
The MMU in each core comprises the following components:
Translation Lookaside Buffer
The TLB consists of two levels:
1. A 10-entry fully-associative instruction micro TLB and a 10-entry fully-associative data micro TLB. We explained the concept of a fully-associative cache in Chapter 1. There are two separate micro TLBs for instructions and data to allow parallel access for performance reasons.
2. A 4-way set-associative 512-entry unified main TLB (Figure 3.9). "Unified" means that this TLB is used for both instructions and data. The main TLB is not fully associative but 4-way set-associative.
Remember that "fully associative" means that every address can be stored at any possible entry of the TLB. If the cache or TLB is not fully associative, it means that there are restrictions on where a given address can be stored. A very common approach is an n-way set-associative cache, which means that the cache is divided into blocks (sets) of n entries, and each block is mapped to a fixed region of memory. An address from a given region of memory can only be stored in its block, but it can be stored in any of the n entries in that block. For example, on the Raspberry Pi 3, the RAM is 1GB. Given a page size of 4kB, this means 256K pages. These map onto 128 blocks (4 entries per block in the TLB), so each block serves 2,048 page frames, each of which can be stored in one of the 4 entries of that block.
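To make the arithmetic concrete, the sketch below computes which set (block) of a 4-way, 512-entry TLB a given virtual address would fall into, under the simplifying assumption that the set index is simply the page number modulo the number of sets; the actual Cortex-A53 indexing function is an implementation detail and may differ.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096u                     /* 4 kB pages                 */
#define TLB_ENTRIES 512u                      /* main TLB size              */
#define TLB_WAYS    4u                        /* 4-way set-associative      */
#define TLB_SETS    (TLB_ENTRIES / TLB_WAYS)  /* = 128 sets (blocks)        */

int main(void)
{
    uint64_t vaddr = 0x12345678u;             /* arbitrary example address  */
    uint64_t page  = vaddr / PAGE_SIZE;       /* virtual page number        */
    uint64_t set   = page % TLB_SETS;         /* candidate set: one of 128  */

    /* With 1 GB of RAM there are 256K page frames, i.e., 256K/128 = 2,048
       frames competing for the 4 entries of each set.                      */
    printf("page %llu maps to TLB set %llu (one of %u ways)\n",
           (unsigned long long)page, (unsigned long long)set, TLB_WAYS);
    return 0;
}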
Figure 3.9: 4-way set-associative main TLB. (The 512-entry TLB is organized as 128 sets (blocks) of 4 entries each; with 1GB of memory and a 4kB page size, 2,048 page frames map onto each block.)
Additional caches
As we will see in Chapter 6, in practice page tables are hierarchical, and address translation in a hypervisor-based environment has two stages (Figure 3.10). The Cortex-A53 MMU, therefore, provides additional caches:
A 4-way set-associative 64-entry walk cache. The walk cache RAM holds the partial result of a stage 1 translation. For more details, see Chapter 6.
A 4-way set-associative 64-entry IPA cache. The Intermediate Physical Address (IPA) cache RAM holds mappings between intermediate physical addresses and physical addresses. Only Non-secure EL1 and EL0 stage 2 translations use this cache.
Note that it is possible to disable stage 1 or stage 2 of the address translation.
Figure 3.10: Two-stage address translation. (Each Guest OS translates the virtual address (32-bit VA) map of each of its applications to its own "physical" address (40-bit IPA) map; the hypervisor then translates the IPA map to the real physical address (40-bit PA) map.)
3.4.6 Memory system
In Chapter 1, we introduced the concept of caching and a simple model for a cache: a small, fast memory for often-used data. The actual memory system in the Cortex-A53 is more complicated, but the same concepts apply (Figure 3.6).
L1 Cache
The L1 memory system consists of separate per-core instruction and data caches. The implementer configures the instruction and data caches independently during implementation, to sizes of 8KB, 16KB, 32KB, or 64KB. The Raspberry Pi 3 configuration has 16KB for both instruction and data caches. Note that the instruction cache is read-only because instruction memory is read-only.
The L1 instruction cache has the following key features:
Cache line size of 64 bytes.
2-way set associative.
16-byte read interface to the L2 memory system. This means it takes 4 cycles to read a 64-byte cache line from the L2 cache.
The L1 data memory system has the following features:
Cache line size of 64 bytes.
4-way set associative.
32-byte write and 16-byte read interface to the L2 memory system.
64-bit read and 128-bit write path from the L1 data memory system to the datapath. In other words, the CPU can read one 64-bit word from, and write two 64-bit words to, the L1 data cache directly.
Support for three outstanding data cache misses. This means that instead of immediately fetching a cache line on a cache miss, the requests are deferred. So the cache will not block to fetch the cache line on the first miss but will allow the CPU to continue executing instructions (and hence potentially create more misses).
The L1 data cache supports only a write-back policy (remember from Chapter 1, this means that initial writes go to the cache, and write-back to memory only occurs on eviction of the cache line). It normally³ allocates a cache line on either a read miss or a write miss (i.e., both write-allocate and read-allocate). A special feature of the L1 data cache is that it includes logic to switch into a pure read-allocate mode for certain scenarios. When in read-allocate mode, loads behave as normal, and writes still look up in the cache but, if they miss, they write out to L2 only.
The L1 data cache uses physical memory addresses. The micro TLB produces the physical address from the virtual address before performing the cache access.
L2 Cache
The L2 cache is a unified cache shared by all cores, with a configurable cache size of 128KB, 256KB, 512KB, 1MB, or 2MB. The Raspberry Pi 3 configuration is 512KB.
Data is allocated to the L2 cache only when evicted from the L1 memory system, not when first fetched from the system. Instructions are allocated to the L2 cache when fetched from the system and can be invalidated during maintenance operations.
The L2 cache has the following key features:
Cache line size of 64 bytes;
16-way set-associative cache structure;
Uses physical addresses.
³ This behavior can be altered by changing the inner cache allocation hints in the page tables.
Data cache coherency
Cache coherency refers to the need to ensure that the local caches on different cores in a multicore system with shared memory present a coherent view of the memory. This essentially means that the system should behave as if there were no caches. We note for completeness that the Cortex-A53 processor uses the MOESI protocol to maintain data coherency between multiple cores. In this protocol, each cache line is in one of five states: Modified, Owned, Exclusive, Shared, or Invalid. The L2 memory system includes a Snoop Control Unit (SCU), which implements this protocol. For more information, we refer to the "Arm Cortex-A Series Programmer's Guide for ARMv8-A" [7].
3.5 Address map
The description of the purpose, size, and position of the address regions for memory and peripherals in a system is called the address map or memory map. Because Arm systems can be 32- or 64-bit, the address space ranges from 4GB (32-bit) to 1TB (40-bit). The white paper Principles of Arm Memory Maps describes Arm address maps for 32-, 36-, and 40-bit systems, and proposes extensions for 44- and 48-bit systems.
Arm has harmonized the memory maps across its various systems to provide internal consistency and software portability, and to address the constraints that come with mixing 32-bit components within larger address spaces. The introduction of the Large Physical Address Extension (LPAE) to ARMv7-class CPUs has grown the physical address space to 36 and 40 bits, providing 64GB or 1024GB (1TB) of memory space. The 64-bit ARMv8 architecture can address 48 bits, providing 256TB.
Figure 3.11: Arm 40-bit address map. (Shown on a log2 scale from 0GB to 1024GB: the 32-bit map (2GB of DRAM, mapped I/O, ROM & RAM & I/O) nests inside the 36-bit map, which nests inside the 40-bit map; each larger map adds further DRAM, mapped I/O, and reserved regions, with optional "2GB hole or DRAM" and "32GB hole or DRAM" regions.)
Figure 3.11 shows how the address maps for different bit widths are related. The address maps are defined as nested sets. As each memory map increases by 4 bits of address space, it contains all of the smaller address maps at the lower addresses.
Each increment of 4 address bits results in a 16-fold increase in addressable space. The address space is partitioned in a repeatable way:
8/16 DRAM;
4/16 Mapped I/O;
3/16 Reserved space;
1/16 Previous memory map (i.e., without the additional 4 address bits).
For example, the 36-bit address map contains the entire 32-bit address map in the lowest 4GB of address space.
The address maps are partitioned into four types of regions:
1. Static I/O and Static Memories, for registers, mapped on-chip peripherals, boot ROMs, and scratch RAMs.
2. Mapped I/O, for dynamically configured, memory-mapped buses, such as PCIe.
3. DRAM, for main system dynamic memory.
4. Reserved space, for future use.
The "DRAM holes" mentioned in the figure are an optional mechanism to simplify the decoding scheme when partitioning a large-capacity DRAM device across the lower physically addressed regions, at the cost of leaving a small percentage of the address space unused.
Figure 3.12: Broadcom BCM2835 Arm-based SoC (Raspberry Pi) address maps. (The figure contrasts the CPU bus address map, in which the I/O peripherals occupy a 32MB range starting at 0x7E000000, below the 2GB mark (0x80000000), and the system SDRAM sits below 0x40000000 (1GB), with the Arm physical address map, in which the I/O peripherals appear at 0x20000000 (the I/O base set in the Arm loader) and the system SDRAM occupies the lowest 1GB.)
If we consider the 32-bit address space in the case of the Broadcom BCM2835 System-on-Chip used in the Raspberry Pi 3, the picture (Figure 3.12) is a bit more complicated because the actual 32-bit address space is used for the addresses on the system bus, but an MMU translates these addresses to a different set of "physical" addresses for the Arm CPU. The lowest 1GB of the Arm physical address map is effectively the Linux kernel memory. For addressing of user memory, an additional MMU is used.
3.6 Direct memory access
Direct memory access (DMA) is a mechanism that allows blocks of data to be transferred to or from devices with no CPU overhead. The CPU manages DMA operations by submitting DMA requests to a DMA controller. While the DMA transfer is in progress, the CPU can continue executing code. When the DMA transfer is completed, the DMA controller signals the CPU via an interrupt.
DMA is advantageous if large blocks of memory have to be copied or if the transfer is repetitive, because both cases would otherwise consume a considerable amount of CPU time. Like most modern operating systems, Linux supports DMA transfers through a kernel API, if the hardware has DMA support. It should be noted that this does not require special instructions: the DMA controller is memory-mapped, and the CPU simply writes the request to that region of memory.
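As an illustration of the kernel-side API, the fragment below allocates a DMA-coherent buffer inside a hypothetical driver. dma_alloc_coherent() and dma_free_coherent() are part of the Linux DMA API; the buffer size and the step of programming the controller are illustrative assumptions, since that part is controller-specific.

#include <linux/dma-mapping.h>  /* dma_alloc_coherent(), dma_free_coherent() */
#include <linux/device.h>
#include <linux/gfp.h>          /* GFP_KERNEL */

#define MY_DMA_BUF_SIZE 4096    /* hypothetical transfer size */

/* Allocate a buffer that both the CPU and the DMA controller can see
   coherently. 'dev' is the struct device of our (hypothetical) peripheral. */
static void *my_setup_dma_buffer(struct device *dev, dma_addr_t *bus_addr)
{
    void *cpu_addr;

    /* cpu_addr is the kernel virtual address; *bus_addr is the address the
       DMA controller must be given (on the Raspberry Pi this is a CPU bus
       address rather than an Arm physical address). */
    cpu_addr = dma_alloc_coherent(dev, MY_DMA_BUF_SIZE, bus_addr, GFP_KERNEL);
    if (!cpu_addr)
        return NULL;

    /* At this point a real driver would write a request descriptor containing
       *bus_addr to the memory-mapped registers of the DMA controller. */
    return cpu_addr;
}

static void my_teardown_dma_buffer(struct device *dev, void *cpu_addr,
                                   dma_addr_t bus_addr)
{
    dma_free_coherent(dev, MY_DMA_BUF_SIZE, cpu_addr, bus_addr);
}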
Figure 3.13: Example system with Cortex-A and CoreLink DMA controller.
Arm processors do not include a DMA engine as part of the CPU core. Arm provides dedicated DMA controllers, such as the lightweight PrimeCell µDMA Controller [8], a very low gate count DMA controller compatible with the AMBA AHB-Lite protocol as used in the Cortex-M series, and the more advanced CoreLink DMA-330 DMA Controller [9], which has a full AMBA-compliant interface; alternatively, SoC manufacturers can provide their own DMA engines. Figure 3.13 shows an example system with the CoreLink DMAC. In the Arm Cortex-M series, the DMA controller will be a peripheral on the AHB bus.
(The example system connects two Arm processors and the DMAC to an AXI interconnect, with AXI-APB bridges to peripherals such as GPIO, a UART, and a timer, a static memory controller (SMC) to flash memory, and a dynamic memory controller (DMC) to DRAM; the DMAC exposes secure and non-secure APB slave interfaces, a peripheral request interface, an AXI master interface, and interrupt outputs.)
However, in the higher-end Arm Cortex-A series, a special interface called the Accelerator Coherency Port (ACP) is provided as part of the AMBA AXI standard. The reason is that on multicore processors with cache coherency, the cache system complicates the DMA transfer because it is possible that some data has not been written to the main memory at the time of the transfer. With the ACP, the Cortex-A series implements a hardware mechanism to ensure that accesses to shared DMA memory regions are cache-coherent. Without such a mechanism, the operating system (or end-user software on a bare-metal system) must ensure the coherency. More details on integrating a DMA engine in an Arm-based multiprocessor SoC are provided in the Application Note Implementing DMA on ARM SMP Systems [10].
On the Arm Cortex-A53, the ACP port is optional, and it is not provided on the SoC in the Raspberry Pi 3. The DMA controller on the Raspberry Pi SoC is not an Arm IP core. It is part of the I/O Peripheral address space. An additional complication, in this case, is that the DMA controller uses CPU bus addresses, so for a DMA transfer the software needs to translate between the Arm physical addresses and the CPU bus addresses.
In general, DMA controllers are complex devices that usually have their own instruction set as well as a register file. This means that the Linux kernel needs a dedicated driver for the DMA controller.
3.7 Summary
In this chapter, we had a look at two different types of Arm processors: the Arm Cortex-M0+, a single-core, very low gate count, highly energy-efficient processor intended for microcontroller and deeply embedded applications that implements the Armv6-M architecture, and the Arm Cortex-A53 used in the Raspberry Pi 3, a mid-range, low-power processor that implements the Armv8-A architecture and has all the features required to run an OS like Linux. We have discussed these processors in terms of their instruction sets, interrupt models, security models, and memory systems. We have also introduced the Arm address maps and Direct Memory Access (DMA) support.
3.8 Exercises and questions
3.8.1 Bare-bones programming
The aim of this exercise is to implement some basic operating system functionality. To do this from scratch is quite a lot of work, so we suggest you start from the existing code provided in the tutorial series Bare-Metal Programming on Raspberry Pi 3 on GitHub.
1. Create a cyclic executive with three tasks where each task creates a continuous waveform: task 1 creates a sine wave; task 2, a block wave; and task 3, a triangle wave; each with a different period. Print either the values of the waveforms or a text-based graph on the terminal.
2. Make your cyclic executive preemptive.
3. Share a resource between the three tasks. This can be a simple shared variable with read and write access.
Other, harder suggestions:
1. Make memory allocation dynamic, i.e., write your own malloc() and free().
2. Create a minimal in-memory file system.
3.8.2 Arm hardware architecture
1. What was the meaning of "MIPS for the masses"?
2. What are the advantages of a RISC architecture over a CISC architecture?
3.8.3 Arm Cortex M0+
1. For what kind of projects would you use an Arm Cortex M0+?
2. Why is the Arm Cortex M0+ not suitable for running Linux?
3.8.4 Arm Cortex A53
1. Discuss floating-point and SIMD support in the Arm Cortex A53.
2. Discuss the processor modes and privileges in the Arm Cortex A53.
3. Discuss the cache and TLB architecture of the Arm Cortex A53.
3.8.5 Address map
1. Explain why Arm systems share a common address map for 32-, 36-, and 40-bit systems.
2. What is the purpose of "DRAM holes"?
3.8.6 Direct memory access
1. What is the role of the Accelerator Coherency Port (ACP) in the DMA architecture?
References
[1] J. Yiu, Arm Cortex-M for Beginners – An overview of the Arm Cortex-M processor family and comparison, Arm Ltd, 3 2017, v2. [Online]. Available: https://developer.arm.com/-/media/Files/pdf/Porting%20to%20ARM%2064-bit%20v4.pdf
[2] Cortex-M0+ Technical Reference Manual, Revision r0p1, Arm Ltd, 12 2012, rev. C. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0484c/DDI0484C_cortex_m0p_r0p1_trm.pdf
[3] AMBA 3 AHB-Lite Protocol – Specification, Arm Ltd, 3 2017, v1.0. [Online]. Available: https://silver.arm.com/download/download.tm?pv=1085658
[4] Arm Cortex-A53 MPCore Processor – Technical Reference Manual, Revision r0p4, Arm Ltd, 2 2016. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0500g/DDI0500G_cortex_a53_trm.pdf
[5] C. Shore, Porting to 64-bit Arm, Arm Ltd, 7 2014, rev. C. [Online]. Available: https://developer.arm.com/-/media/Files/pdf/Porting%20to%20ARM%2064-bit%20v4.pdf
[6] Arm Architecture Reference Manual – ARMv8, for ARMv8-A architecture profile, Arm Ltd, 12 2017, issue C.a. [Online]. Available: https://silver.arm.com/download/download.tm?pv=4239650&p=1343131
[7] Arm Cortex-A Series Programmer's Guide for ARMv8-A, Version 1.0, Arm Ltd, 3 2015, issue A. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf
[8] PrimeCell uDMA Controller (PL230) Technical Reference Manual, Revision r0p0, Arm Ltd, 1 2007, issue A. [Online]. Available: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0417a/index.html
[9] CoreLink DMA-330 DMA Controller Technical Reference Manual, Revision r1p2, Arm Ltd, 1 2012, issue D. [Online]. Available: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0424d/index.html
[10] Implementing DMA on ARM SMP Systems, Arm Ltd, 8 2009, issue A. [Online]. Available: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html
Chapter 4
Process management
Operang Systems Foundaons with Linux on the Raspberry Pi
70
4.1 Overview
Processes are programs in execution. Have you ever seen a whale skeleton displayed in a museum?
This is like a program—it’s a static object, see Figure 4.1. Although it has shape and structure, it’s never
going to ‘do’ anything of interest. Now think about a live whale swimming through the ocean, see
Figure 4.2. This is like a process—it’s a dynamic object. It incorporates the skeleton structure, but it has
more attributes and is capable of activity.
In this chapter, we explore what Linux processes look like, how they operate, and how they enable
multi-program execution. We outline the context that needs to be encapsulated in a process. We walk
through the process lifecycle, considering typical operations that will be performed on a process.
Figure 4.1: Blue whale skeleton at the Natural History Museum in London.
Photo by author.
What you will learn
After you have studied the material in this chapter, you will be able to:
1. Describe why processes are used in operating systems.
2. Justify the need for relevant metadata to be maintained for each process.
3. Sketch an outline data structure for the process abstraction.
4. Recognize the state of processes on a running Linux instance.
5. Develop simple programs that interact with processes in Linux.
4.2 The process abstraction
A process is a program in execution. A program consists only of executable code and static data. These
are stored in a binary artifact, such as an ELF object or a Java class file. On the other hand, a process
also encapsulates the runtime execution context. This includes the program counter, stack pointer and
other hardware register values for each thread of execution, so we know whereabouts we are in the
program execution. The process also records memory management information. Further, the process
needs to keep track of owned resources such as open file handles and network connections.
Figure 4.2: Blue whale swimming in the ocean.
Public domain photo by NOAA.
A process maps onto a user application (e.g., spreadsheet), a background utility (e.g., file indexer) or
a system service (e.g., remote login daemon). It is possible that multiple processes might be executing
the same program at once. These would be different runtime instances of the same program. Some
complex applications only permit a single instance of the process to be executed at once. For instance,
the Firefox web browser has a lock file that prevents multiple instances of the application from
executing with the same user profile, see Figure 4.3.
Figure 4.3: Firefox displays an error message and refuses to run multiple application instances for a single user profile.
4.2.1 Discovering processes
How many processes are executing on your system right now? In an interactive shell session, type:
Listing 4.2.1: List processes Bash
1 ps aux | wc -l
The ps command displays information about processes currently registered with the OS. The options
we use are as follows:
a    Include all users’ processes
u    Display user-friendly output
x    Include processes not started from a user terminal
My Linux server shows 257 processes. How many processes are on your machine? Every time you
invoke a new program, a new process starts. This might occur if you click a program icon in an app
launcher bar, or if you type an executable file name at a shell prompt.
4.2.2 Launching a new process
Let’s find out how to start a new process programmatically, using the fork system call. This is the
standard Unix approach to creating a new process. The fork-ing process (known as the parent) generates
an exact copy (known as the child), which executes the same code. The only difference between the
parent and the child (i.e., the only way to distinguish between the two processes) is the return value
of the fork call. In the child process, this return value is 0. In the parent process, the return value is
a positive integer which denotes the allocated process identifier (or pid) of the child. Figure 4.4 shows
this sequence schematically. Note that if we can’t fork a new process, fork returns -1.
Below we show a simple Python script. This runs a program that creates a second copy of itself.
Figure 4.4: Schematic diagram of the behavior of the fork system call.
Listing 4.2.2: Example of fork() in Python Python
1 import os
2
3 def child():
4 print ("Child process has PID {:d}".format(os.getpid()))
5
6 def parent():
7 # only parent executes this code
8 print ("Parent process has PID {:d}".format(os.getpid()))
9 child_pid = os.fork()
10 # both parent and child will execute subsequent if statement
11 if child_pid==0:
12 # child executes this
13 child()
14 else:
15 # parent executes this
16 print ("Parent {:d} has just forked child {:d}".format(
17 os.getpid(), child_pid))
18
19 parent()
The child process is a copy of the parent process, with the only difference being the return value
of fork. However, the child process occupies an entirely separate virtual address space—so any
subsequent changes made to either the parent or the child memory will not be visible in the other
process. This duplication of memory is done in a lazy way, using the copy-on-write technique to avoid
massive memory copy overheads. Data is shared until one process (either parent or child) tries to
modify it; then the two processes are each allocated a private copy of that data. Copy-on-write is
explained in more detail in Section 6.5.7.
4.2.3 Doing something different
The fork call allows us to start a new process, but the child is almost exactly a replica of the parent.
How do we execute a different program in a child process? Linux supports this with the execve system
call, which replaces the currently running process with data from a specified program binary. The first
parameter is the name of the executable file, the second parameter is the argument vector (effectively
argv in C programs), and the third parameter is a set of environment variables, as key/value pairs.
Listing 4.2.3: Example of execve() in Python Python
1 import os
2
3 os.execve("/bin/ls", ["ls", "-l", "*"], {})
This is precisely how an interactive shell, like bash, launches a new program; first, the shell calls fork to
start a new process, then the shell calls execve to load the new program binary that the user wants to run.
The execve call does not return unless there is an error that prevents the new program from being
executed. See man execve for details of such errors, in which case execve returns -1. There are
several other variants of execve, which you can find via man execl.
In Linux, the fork operation is implemented by the underlying clone system call. The clone function
allows the programmer to specify explicitly which parts of the old process are duplicated for the new
process, and which parts are shared between the two processes. A clone call enables the child process
to share parts of its context, such as the virtual address space, with the parent process. This allows us to
support threads as well as processes with a single API, which is the implementation basis of the Native
Posix Threads Library (NPTL) in Linux. For instance, the pthread_create function invokes the clone
system call. Torvalds [1] gives a description of the design rationale for clone.
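To make the sharing behavior concrete, here is a minimal, hedged sketch (written for this discussion, not taken from NPTL or the kernel) that calls the glibc clone wrapper directly with the CLONE_VM flag, so that parent and child share one virtual address space just as two threads would; compare this with the separate address spaces created by fork.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared = 0;          /* visible to the child because of CLONE_VM */

static int child_fn(void *arg)
{
    shared = 42;                /* writes directly into the parent's memory */
    return 0;
}

int main(void)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (stack == NULL) { perror("malloc"); return 1; }

    /* CLONE_VM shares the address space, as a thread would;
       SIGCHLD lets the parent reap the child with waitpid. */
    pid_t pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); return 1; }

    waitpid(pid, NULL, 0);
    printf("shared = %d\n", shared);   /* prints 42: the write was shared */
    free(stack);
    return 0;
}

If you drop CLONE_VM from the flags, the child behaves like a fork-ed process, its write is not visible to the parent, and the program prints 0 instead.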
4.2.4 Ending a process
A parent process can block, waiting for a child process to complete. The parent calls the wait function
for this purpose. Conversely, a child process can complete by calling the exit function with a status
code argument (a non-zero value conventionally indicates an error). Alternatively, the child process
may terminate by returning from its main routine.
The example C code below illustrates the duality between wait in the parent and exit in the child
processes.
Listing 4.2.4: Example use of wait() in C code C
1 #include <stdio.h>
2 #include <stdlib.h>
3 #include <sys/types.h>
4 #include <sys/wait.h>
5 #include <unistd.h>
6
7 int main() {
8
9 pid_t child_pid, pid;
10 int status;
11
12 child_pid = fork();
13
14 if (child_pid == 0) {
15 //child process
16 pid = getpid();
17 printf("I'm child process %d\n", pid);
18 printf("... sleep for 10 seconds, then exit with status 42\n");
19 sleep(10);
20 exit(42);
21 }
22 else if (child_pid > 0) {
23 //parent
24 //waiting for child to terminate
25 pid = wait(&status);
26 if (WIFEXITED(status)) {
27 printf("Parent discovers child exit with status: %d\n", WEXITSTATUS(status));
28 }
29 }
30 else {
31
32 perror("fork failed");
33 exit(1);
34 }
35 return 0;
36 }
Figure 4.5 illustrates the sequence of Linux system calls that are executed by a parent and a child
process during the lifetime of the child.
Figure 4.5: Schematic diagram showing how to start and terminate a child process.
If the parent process completes before the child process, then the child becomes an orphan process.
It is ‘adopted’ by one of the parent’s ancestors, known as a subreaper. See man prctl for details.
If there are no nominated subreapers among the process’s ancestors, then the child is adopted by the init
process. In either case, the parent field of the child process’s task_struct is updated when its
original parent exits. This process is known as re-parenting.
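The following is a minimal sketch (illustrative only) of re-parenting to a subreaper: the top-level process marks itself with prctl(PR_SET_CHILD_SUBREAPER, 1), which needs Linux 3.4 or later, then forks a child that immediately orphans a grandchild; when the grandchild reports its parent, it names the subreaper rather than init.

#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Mark this process as a subreaper: orphaned descendants will be
       re-parented to us instead of to init. */
    if (prctl(PR_SET_CHILD_SUBREAPER, 1) == -1) { perror("prctl"); return 1; }
    printf("subreaper has PID %d\n", getpid());

    pid_t child = fork();
    if (child == 0) {
        /* Child: create a grandchild, then exit at once, orphaning it. */
        if (fork() == 0) {
            sleep(1);           /* give re-parenting time to happen */
            printf("grandchild %d now has parent %d\n", getpid(), getppid());
            _exit(0);
        }
        _exit(0);
    }

    wait(NULL);                 /* reap the child */
    wait(NULL);                 /* reap the re-parented grandchild */
    return 0;
}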
4.3 Process metadata
A great deal of information is associated with each process. The OS requires this metadata to identify,
execute, and manage each process. Generally, the relevant information is encapsulated in a data
structure known as a process control block.
The most basic metadata is the unique, positive integer identifier associated with a process,
conventionally known as the process pid. Some metadata is related to context switch saved data, such
as register values, open file handles, or memory configuration. This information enables the process to
resume execution after it has been suspended by the OS. Further metadata relates to the interactions
between a process and the OS—e.g., profiling statistics and scheduling details. Figure 4.6 shows
a high-level schematic diagram of the metadata stored in a process control block.
Figure 4.6: Generic OS management metadata required for each process, stored in a per-process data structure known as the process control block.
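As a thought experiment, the following hypothetical C struct sketches the kind of fields Figure 4.6 groups together; it is deliberately simplified for illustration and is not the Linux task_struct (which we meet in Section 4.3.2).

#include <sys/types.h>

#define MAX_OPEN_FILES 16

/* Hypothetical, highly simplified process control block. */
struct pcb {
    pid_t pid;                      /* identity */
    int state;                      /* born, ready, running, waiting, dead */
    unsigned long regs[16];         /* context-switch saved register values */
    unsigned long pc, sp;           /* saved program counter and stack pointer */
    void *memory_map;               /* memory management information */
    int open_fds[MAX_OPEN_FILES];   /* owned resources: open file handles */
    int priority;                   /* scheduling control information */
    unsigned long cpu_time_used;    /* profiling / accounting statistics */
    struct pcb *parent;             /* position in the process hierarchy */
};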
4.3.1 The /proc file system
The Linux kernel exposes some process metadata as part of a virtual file system. Let’s look in the
/proc directory on your Linux system:
Listing 4.3.1: The /proc file system Bash
1 cd /proc
2 ls
You should see a list of directories, many of which will have names that are integers. Each integer
corresponds to a pid, and the files inside these pid directories capture information about the relevant
process.
Table 4.1: Virtual files associated with a process in /proc/[pid]/.
cmdline   The textual command that was invoked to start this process
cwd       A symbolic link to the current working directory for this process
exe       A symbolic link to the executable file for this process
fd/       A folder containing file descriptors for each file opened by the process
maps      A table showing how data is arranged in memory
stat      A list of counters for various OS events, specific to this process
Table 4.1 lists a few of these files and the information they contain. For the full list, execute man 5
proc at a Linux terminal prompt. The /proc/[pid] files are not ‘real’—look at the file sizes with
ls -l. These pseudo-files are not stored on the persistent file system: instead, they are file-like
representations of in-memory kernel metadata for each process.
Let’s list the commands that all the processes in our system are executing:
Listing 4.3.2: Finding all processes in the system via /proc Bash
1 cd /proc
2 for CMD in `find . -maxdepth 2 -name "cmdline"`; do cat $CMD; echo ""; done | sort
We observe that some commands are blank—these processes do not have a corresponding command-
line invocation.
4.3.2 Linux kernel data structures
The Linux kernel spreads process metadata across several linked blocks of memory. In this section, we
will examine three key data structures:
thread_info
task_struct
thread_struct
The C struct called thread_info is architecture-specific; for the Arm platform the struct is defined
in arch/arm/include/asm/thread_info.h. Each thread of execution has its own unique
thread_info instance, embedded at the base of the thread’s runtime kernel stack. (Each thread
has a dedicated 8KB stack in kernel memory for use when executing kernel code; this is distinct from
the regular user-mode stack.) We can extract the thread_info pointer by a low-overhead bitmask
operation on the stack pointer register, see the code below.
Listing 4.3.3: Snippet from function current_thread_info(void) C
1 return (struct thread_info *)
2 (current_stack_pointer & ~(THREAD_SIZE - 1));
The majority of information in thread_info relates to the low-level processor context, such as
register values and status flags. The data structure includes a pointer to the corresponding
task_struct instance for the process.
The C struct called task_struct is the Linux-specific instantiation of the process control block. It is
necessarily a large data structure, storing all the context for the process. The data structure is defined
in the architecture-independent kernel header file linux/sched.h. In the kernel, the C macro
current returns a pointer to the task_struct for the current process. On the 32-bit Arm Linux kernel
4.4, the code sizeof(*current) measures the data structure size as 3472 bytes.
The thread_struct data structure is defined in the header file arch/arm/include/asm/
processor.h. This is a small block of memory, referenced by task_struct, which stores more
processor-specific context relating to fault events and debugging information.
Each thread has its own unique instances of these three key data structures, although references
to other metadata elements might be shared (e.g., for memory maps or open files; recall the earlier
discussion of the clone system call). Figure 4.7 shows a schematic diagram of these per-thread data
structures and their relationships.
Figure 4.7: Runtime layout of Linux data structures that encapsulate process metadata, residing in kernel memory space.
When a process starts, it runs with a single thread. Its process identifier (PID) has the same integer value
as its thread group identifier (TGID). If the process creates a new thread, then the new thread shares the
original process address space. The new thread acquires its own PID but retains the original TGID.
As we will see in the next chapter, the Linux scheduler handles all threads in a process as separate
items: in other words, a thread is a kernel-visible schedulable execution entity, whereas a process is a user-
visible execution context. Process tools like top generally merge multiple threads that share a TGID
into a single process.
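The PID/TGID split is easy to observe from user space: getpid() returns the thread group identifier, while the gettid system call returns the per-thread identifier. The sketch below (compile with -pthread) invokes gettid through syscall, since older glibc versions do not provide a wrapper for it; this is illustrative code, not taken from the book.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* getpid() reports the shared TGID; SYS_gettid reports the per-thread PID. */
static void *report(void *arg)
{
    printf("thread: getpid()=%d gettid()=%ld\n",
           getpid(), (long)syscall(SYS_gettid));
    return NULL;
}

int main(void)
{
    pthread_t t;
    printf("main:   getpid()=%d gettid()=%ld\n",
           getpid(), (long)syscall(SYS_gettid));
    pthread_create(&t, NULL, report, NULL);
    pthread_join(t, NULL);
    return 0;
}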
4.3.3 Process hierarchies
Every process p has a parent, which is the process that created p. The initial system process is the
ancestor of all other processes. In Linux, this is the init process, which has pid 1. The global variable
init_task contains a pointer to the init process’s task_struct.
There are two ways to iterate over processes:
1. Chase the linked list of pointers from one process to the next. This circular doubly-linked list runs
through the processes. Each task_struct instance has a next and prev pointer. The macro
for_each_process iterates over all tasks.
2. Chase the linked list of pointers from child process to parent. Each task_struct instance has
a parent pointer. This linear linked list terminates at the init_task.
The C code below will iterate over the linked list from the current task’s process control block to the
init task. It prints out the ‘family tree’ of the processes.
When you invoke this program, how deep is the tree? On my machine, it traverses 5 levels of process
until it reaches the init process.
Note that this code needs to run in the kernel. It is privileged code since it accesses critical OS data
structures. The easiest way to implement this is to wrap up the code as a kernel module, which is
explained in Section 2.4.3. The printk function is like printf, except that it outputs to the kernel log,
which you can read with the dmesg utility.
Lisng 4.3.4: C code to trace a task’s ancestry C
1 #include <linux/module.h> /* Needed by all modules */
2 #include <linux/kernel.h> /* Needed for KERN_INFO */
3 #include <linux/sched.h> /* for task_struct */
4
5 int init_module(void)
6 {
7 struct task_struct *task;
8
9 for (task = current; task != &init_task; task = task->parent) {
10 printk(KERN_INFO " %d (%s) -> ", task->pid, task->comm);
11 }
12 printk(KERN_INFO " %d (%s) \n", task->pid, task->comm);
13
14 return 0;
15 }
In general, it is more efficient to avoid kernel code. Where possible, utilities remain in ‘userland,’ as the
non-kernel code is often called.
For this reason, most Linux process information utilities like ps and top gather process metadata from
the /proc file system, which can be accessed without expensive kernel-level system calls or special
privileges. The pstree tool is another example utility—it displays similar information to our process
family tree code outlined above, but pstree uses the /proc pseudo-files rather than expensive
system calls. The pstree utility is part of the psmisc Debian package, which you may need to install
explicitly. Figure 4.8 shows typical output from pstree, for a Pi with a single user logged in via ssh.
Figure 4.8: Process hierarchy output from pstree.
4.4 Process state transitions
When a process begins execution, it can move between various scheduling states. Figure 4.9 shows
a simple state transition diagram, which indicates the states a process might be in, and the actions that
will transfer the process to a different state. A more complex version is presented in the next chapter.
Figure 4.9: The state transition diagram for a Linux process, with states named in circles and possible ps state codes indicated.
Table 4.2 lists the different process states and their standard Linux abbreviations, which you might
see in the output of the ps or top command. Each state corresponds to a bitflag value, stored in the
corresponding task->state field. The values are defined in include/linux/sched.h, which we
explore in more detail in the next chapter.
Table 4.2: Linux process states, see man ps for full details.
R    Running, or runnable
S    Sleeping, can be interrupted
D    Waiting on I/O, not interruptible
T    Stopped, generally by a signal
Z    Zombie, a dead process
Let’s play with some processes in Linux. Start a process in your terminal, perhaps a text editor like vim.
While it is running, make it stop by pressing CTRL + z. This sends the STOP signal to the process.
Effectively, we have paused its execution. This is how program debugging works.
Now let’s run another process that incurs heavy disk overhead, perhaps
Listing 4.4.1: find Bash
1 find / -name "foo" &
or
Listing 4.4.2: dd Bash
1 dd if=/dev/zero of=/tmp/foo bs=1K count=200K &
Now you can observe your processes with the ps command. Use the watch tool to see how the states
change over time.
Listing 4.4.3: watch Bash
1 watch ps u
You should see that some processes are running (R) and others are sleeping (S), waiting for I/O (D), or
stopped (T). Press CTRL + c to exit the watch program.
A zombie process is a completed child process that is waiting to be ‘tidied up’ by its parent process.
A process remains in the zombie state until its parent calls the wait function, or the parent terminates
itself. The example Python code below demonstrates a zombie child, as the parent sleeps for
one minute after the fork, but the child process exits immediately.
Listing 4.4.4: Python zombie example Python
1 import os
2 import time
3
4 def main():
5 child_pid = os.fork()
6 # both parent and child will execute subsequent if statement
7 if child_pid==0:
8 # child executes this
9 pid = os.getpid()
10 print ("To see the zombie, run ps u -p {:d}".format(os.getpid()))
11 exit()
12 else:
13 # parent executes this
14 time.sleep(60)
15 print ("Zombie process disappears now")
16
17 main()
4.5 Context switch
The earliest electronic computers were single-tasking. These systems executed one program
exclusively until another program was loaded into memory. For instance, the early EDSAC machine at
Cambridge would ring a warning bell when a program completed execution, so the technician could
read off the results and load in a new program. Up until the 1980s, micro-computers ran single-program
operating systems like DOS and CP/M. For such computers, process management was unnecessary.
Processes are the basis of multi-programming, where the operating system executes multiple
programs concurrently. Effectively, the operating system multiplexes many processes onto a smaller
number of physical processor cores.
The context switch operation enables this multiplexing. All the runtime data required for a process (as we
outlined in Section 4.3) is saved into a process control block (effectively the task_struct in Linux). The
OS serializes the process metadata. Then the process is paused, and another process resumes execution.
If processes are switched in and out of execution at sufficiently high frequency, then it appears that all
the processes are executing simultaneously. This is analogous to a person who is juggling, see Figure
4.10. In the same way as the OS handles more processes than there are processors, the person deals
with more juggling balls than they have hands.
Figure 4.10: Juggling with more balls than hands is like multi-tasking execution. Image owned by the author.
For short-term process scheduling, the process context data is stored in RAM (i.e., kernel memory).
For processes that are not likely to be executed again in the short term, the process memory is paged
out to disk. Given that the context captures all we need to know to resume the process, this paging
is relatively straightforward (see Chapter 6). Another possibility is that the process might be migrated
across a network link to another machine, perhaps within a cloud datacenter (see Chapter 11).
There are three practical questions to ask, in terms of context switching on a Linux system.
Q1: How long does a process actually execute before it is switched out?
We will cover process scheduling in more detail in the next chapter. However, Linux specifies a
scheduling quantum, which is a notional amount of time each process will be executed in a round-robin
style before a context switch. This quantum value is specified on my Raspberry Pi as 10 ms. You
can check the default value on your Linux system with:
Listing 4.5.1: Default Linux timeslice Bash
1 cat /proc/sys/kernel/sched_rr_timeslice_ms
Q2: How much data do we need to save for a process context?
For each thread, there is a thread_info struct, to capture saved register values and other processor
context. This data structure can be up to around 500 bytes on a 32-bit Arm processor with hardware
floating-point support. There is also process control information; however, much of this data will
already be resident in memory, so probably only minor updates are required at a context switch event.
Q3: How long does a context switch take, on a standard Linux machine?
The context switch overhead measures the time taken to suspend one process and resume another.
This overhead must be made as low as possible on interactive systems, to enable rapid and smooth
context switching between user processes.
The open-source lmbench utility [2] contains code to measure a range of low-level system
performance characteristics, including the context switch overhead. Download the code tarball, then
execute the following commands:
Listing 4.5.2: Using lmbench Bash
1 tar xvzf lmbench3.tar.gz
2 cd lmbench3/src
3 make results
4 # ignore errors
5 cd ../bin/armv7l-linux-gnu/
6 ./lat_ctx -s 0 10
This reports the context switch overhead for your machine. On my Raspberry Pi 2 Model B v1.1
running Linux kernel 4.4, lmbench reports a context switch overhead of around 12 µs. What do you
measure on your machine?
4.6 Signal communications
Inter-process communication will be covered in a future chapter. For now, we focus only on sending
signals to processes. A signal is like an interrupt—it’s an event generated by the kernel to invoke
a signal handler in another process. Signals are a mechanism for one-way asynchronous notifications,
with a minimal data payload. The recipient process only knows the signal number and the identity
of the sender. Check out the siginfo_t struct definition in the <sys/siginfo.h> header for
more details.
4.6.1 Sending signals
The simplest way to send a signal to a process is to use the kill command at a shell prompt, also
specifying the target pid. Below is an example to kill an annoying repeat print loop.
Listing 4.6.1: Example kill process Bash
1 while ((1)) ; do echo "hello $BASHPID"; sleep 5; done &
2 # suppose this prints out hello 15082
3 # ... then you should type
4 kill 15082
Effectively, this kill command is like interactively pressing CTRL + c on the console. Study Table 4.3
below to see some other events that a process may handle and their equivalent
interactive key combinations.
Note that some signals are standardized across all Unix variants, whereas other signals may be system-
specific. Execute the command man kill or kill -l for details.
Table 4.3: A selection of Linux signal codes, consult signal.h for the full set.
Name      Number  Description                                   Interactive
SIGINT    2       Terminal interrupt                            CTRL + c
SIGQUIT   3       Terminal quit
SIGILL    4       Illegal instruction
SIGKILL   9       Kill process (cannot be caught/ignored)
SIGSEGV   11      Segmentation fault (bad memory access)
SIGPIPE   13      Write on a pipe with no reader, broken pipe
SIGALRM   14      Alarm clock                                   Use the alarm function to set an alarm
SIGCHLD   17      Child process has stopped or exited
SIGCONT   18      Continue executing, if stopped                bg or fg
SIGSTOP   19      Stop executing (cannot be caught/ignored)     CTRL + z
4.6.2 Handling signals
We have looked at sending signals to processes. Now let’s consider how to handle such signals when
a process receives them. A signal handler is a callback routine which is installed by the process to deal
with a particular signal. Below is a simple example of a program that responds to the SIGINT signal.
Listing 4.6.2: Simple signal handler in C C
1 #include <stdio.h>
2 #include <signal.h>
3 #include <string.h>
4 #include <unistd.h>
5
6 struct sigaction act;
7
8 void sighandler(int signum, siginfo_t *info, void *p) {
9 printf("Received signal %d from process %lu\n",
10 signum, (unsigned long)info->si_pid);
11 printf("goodbye\n");
12 }
13
14 int main() {
15 // instructions for interactive user
16 printf("Try kill -2 %lu, or just press CTRL+C\n", (unsigned long)getpid());
17 // zero-initialize the sigaction instance
18 memset(&act, 0, sizeof(act));
19 // set up the callback pointer
20 act.sa_sigaction = sighandler;
21 // set up the flags, so the signal handler receives relevant info
22 act.sa_flags = SA_SIGINFO;
23 // install the handler for SIGINT
24 sigaction(SIGINT, &act, NULL);
25 // wait for something to happen
26 sleep(60);
27 return 0;
28 }
Some signals cannot be handled by the user process, in particular, SIGKILL and SIGSTOP. Even if you
attempt to install a handler for these signals, it will never be executed.
If we don’t install a handler for a signal, then the default OS handler is used instead. This will generally
report the signal and then cause the process to terminate. For example, consider what happens when your
C programs dereference a null pointer; normally the default SIGSEGV handler supplied by the OS is
invoked, see Figure 4.11.
Figure 4.11: When a program dereferences a null pointer, a segmentation fault occurs and the appropriate OS signal handler reports the error.
4.7 Summary
In this chapter, we have explored the concept of a process as a program in execution. We have seen
how to instantiate processes using Linux system calls. We have reviewed the typical lifecycle of
a process and considered the various states in which a process can be found. We have explored the
runtime data structures that encapsulate process metadata. Finally, we have seen how to attract the
attention of a process using the signaling mechanism. Future chapters will explore how processes
are scheduled by the OS and how one process can communicate with other concurrently executing
processes.
4.8 Further reading
O’Reilly’s book on Linux System Programming [3] covers processes from a detailed user code
perspective. The companion volume on Understanding the Linux Kernel [4] goes into much greater
depth about process management in Linux; although this textbook covers earlier kernel versions,
most of the material is still directly relevant.
4.9 Exercises and questions
4.9.1 Multiple choice quiz
1. Which of these is not a mechanism for allowing two processes to communicate with each
other?
a) message passing
b) context switch
c) shared memory
2. What happens when a process receives a signal?
a) The processor switches to privileged mode.
b) Control jumps to a registered signal handler.
c) The process immediately quits.
3. Which of the following items is shared by two threads that are cloned by the same process?
a) thread_info runtime metadata
b) program memory
c) call stack
4. Immediately after a successful fork system call, the only observable difference between parent
and child processes is:
a) the return value of the fork call
b) the stack pointer
c) the program counter value
4.9.2 Metadata mix
1. Process metadata may be divided into three different kinds: (1) identity, (2) context switch saved
state, and (3) scheduling control information. Look at the following fields from the Linux task_struct
data structure in the linux/sched.h header file. For each field, identify which sort of
metadata it is. You may want to look at the comments in the header file for more information.
a) unsigned int rt_priority
b) pid_t pid
c) volatile long state;
d) struct files_struct *files;
e) void *stack;
f) unsigned long maj_flt;
4.9.3 Russian doll project
A matryoshka doll is a set of wooden dolls of decreasing size placed one inside another. This challenge
involves creating a matryoshka process. Define a constant called MATRYOSHKA, and set it to a
small integer value. Now write a C program with a main function that sets a local variable x to the
MATRYOSHKA value. Then construct a loop that checks the value of x. If x is less than or equal to 0,
then return; otherwise, decrement the value of x and fork a new process. Recall from Section 4.2 that
the fork call should be wrapped in an if statement to ensure different behavior for the parent and
child processes. To make your code more interesting, each individual process could print out its unique
id and its value of x. The output should look like this:
Listing 4.9.1: Matryoshka program output C
1 " I'm 1173: x is 4 "
2 " I'm 1174: x is 3 "
3 " I'm 1175: x is 2 "
4 " I'm 1176: x is 1 "
5 " I'm 1177: x is 0 "
4.9.4 Process overload
When one user starts too many processes rapidly, the entire system can become unusable. Discuss
why this might happen. Effectively, rapid process creation is an OS denial-of-service attack. Search
online for ‘fork-bomb’ attacks to find out more details [5]. How does the ulimit command mitigate
such denial-of-service attacks?
4.9.5 Signal frequency
Consider the signals listed in Table 4.3. Which of these signals are likely to be received frequently?
Which signals are rarer? In what circumstances might you use a custom signal handler for your
application?
4.9.6 Illegal instructions
You can attempt to execute an illegal instruction on your Raspberry Pi with the assembler code block
shown below:
Lisng 4.9.2: Execute an illegal instrucon C
1 int main() {
2 asm volatile (".word 0xe7f0000f\n");
3 return 0;
4 }
Compile this code and execute it. You should see an Illegal Instruction error message. Now
dene a signal handler for SIGILL. At rst, the signal handler should just report the illegal instrucon
and exit the program. As an advanced step, try to get the signal handler to advance the user program
counter by one instrucon (4 bytes) and return. You will need to access and modify the context-
>uc_mcontext.arm_pc data eld.
References
[1] L. Torvalds, The Linux Edge. O’Reilly, 1999, http://www.oreilly.com/openbook/opensources/book/linus.html
[2] L. W. McVoy, C. Staelin et al., “lmbench: Portable tools for performance analysis,” in USENIX Annual Technical Conference, 1996,
pp. 279–294, download code from http://www.bitmover.com/lmbench/
[3] R. Love, Linux System Programming: Talking Directly to the Kernel and C Library, 2nd ed. O’Reilly, 2013.
[4] D. P. Bovet and M. Cesati, Understanding the Linux Kernel, 3rd ed. O’Reilly, 2005.
[5] E. S. Raymond, “The new hacker’s dictionary: Fork bomb,” 1996, see also http://www.catb.org/~esr/jargon/html/F/fork-bomb.html
Chapter 5
Process scheduling
Operang Systems Foundaons with Linux on the Raspberry Pi
90
5.1 Overview
This chapter discusses how the OS schedules processes on a processor. This includes the rationale for
scheduling, the concept of context switching, and an overview of scheduling policies (FCFS, priority, ...)
and scheduler architectures (FIFO, multilevel feedback queues, priorities, ...). The Linux scheduler is
studied in detail.
What you will learn
After you have studied the material in this chapter, you will be able to:
1. Explain the rationale for scheduling and its relationship to the process lifecycle.
2. Discuss the pros and cons of different scheduling policies in terms of the principles and criteria.
3. Calculate scheduling criteria and reason about scheduling policy performance with respect to the criteria.
4. Analyze the implementation of scheduling in the Linux kernel.
5. Control scheduling of threads and processes as a programmer or system administrator.
5.2 Scheduling overview: what, why, how?
In Chapter 1, we introduced the concept of tasks and explained what a processor needs to do to allow
multiple tasks to execute concurrently. Each task constitutes an amount of work for the CPU, and
scheduling is the method by which this work is assigned to the CPU. The operating system scheduler
is the component responsible for the scheduling activity.
5.2.1 Definition
According to the Oxford dictionary [1], a schedule¹ is “a plan for carrying out a process or procedure,
giving lists of intended events and times: we have drawn up an engineering schedule”; to schedule
means to “arrange or plan (an event) to take place at a particular time” or to “make arrangements for
(someone or something) to do something”. In the context of operating systems, both meanings hold:
the scheduler arranges events (i.e., execution of task code on the CPU) to take place at a particular
time and makes arrangements for the task to run.
5.2.2 Scheduling for responsiveness
Scheduling is primarily motivated by the need to execute multiple tasks concurrently. In a modern
computing system, many tasks are active at the same time. For example, on a desktop system, every
tab in a web browser is a task; the graphical user interface requires a number of tasks; there are tasks
taking care of networking, etc. At the time of writing this text, my laptop was running 317 processes.
Of these, 106 were superuser tasks, 24 were services, and the remaining 190 were owned by my
user account. Most of these tasks are long-running, i.e., they only exit when the system shuts down.
In fact, out of the 190 processes under my user name, only 33 belonged to applications that I had
actually launched.
¹ The origin is late Middle English (in the sense ‘scroll, explanatory note, appendix’): from Old French cedule, from late Latin schedula ‘slip of paper,’ diminutive of scheda,
from Greek σχεδη ‘papyrus leaf.’
Now assume for a moment that the system executed these tasks one by one, waiting until a task
completes before executing the next task. The very first task would occupy the processor forever, so none
of the other tasks would be able to run. Therefore, the operating system gives each process, in turn,
a slice of CPU time.
5.2.3 Scheduling for performance
However, there is another important benefit of scheduling. The processor is very fast (remember,
even the humble Raspberry Pi executes 10 million instructions in a single Linux time slice). But
when accessing peripherals for I/O, the processor has to wait for the peripheral, and this can take
a long time because peripherals such as disks are comparatively slow. For example, simply accessing
DRAM without a cache takes between 10 and 100 clock cycles; accessing a hard disk takes several
milliseconds, i.e., millions of clock cycles. Without concurrent execution, the CPU would idle until the
I/O request had completed. Instead, the operating system will schedule the next task on the CPU.
5.2.4 Scheduling policies
A scheduling policy is used to decide what share of CPU time a process will get and when it will
be scheduled. In practice, processes have different needs. For example, when playing a video, it is
important that the image does not freeze or stutter, so it is better to give such a process frequent
short slices than infrequent long slices. On the other hand, many of the system processes that run
invisibly in the background are not timing-critical, so the operating system might decide to schedule
them with low priority.
In the rest of the chapter, we will look in detail at the scheduling component of the kernel and its
relationship to the process management infrastructure discussed in the previous chapter.
5.3 Recap: the process lifecycle
Recall from the previous chapter that the operating system manages each process through a data
structure called the Process Control Block, which in Linux is implemented using the task_struct
data structure. With respect to the process lifecycle, the main attribute of interest is the state, which
can be one of the following (from linux/sched.h):
#define TASK_RUNNING          0x0000
#define TASK_INTERRUPTIBLE    0x0001
#define TASK_UNINTERRUPTIBLE  0x0002
#define __TASK_STOPPED        0x0004
#define __TASK_TRACED         0x0008
#define TASK_PARKED           0x0040
#define TASK_DEAD             0x0080
#define TASK_WAKEKILL         0x0100
#define TASK_WAKING           0x0200
#define TASK_NOLOAD           0x0400
#define TASK_NEW              0x0800
#define TASK_STATE_MAX        0x1000
#define TASK_NORMAL           (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
#define TASK_IDLE             (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)
as well as the exit_state, which can be one of the following:
#deneEXIT_DEAD0x0010
#deneEXIT_ZOMBIE0x0020
Observe that each of these states represents a unique bit in the state value. Figure 5.1 shows the
actual states a process can be in, annotated with the state values. Scheduling is concerned with
moving tasks between these states, in parcular from the run queue to the CPU and from the CPU
to the run queue or the waing state.
Figure 5.1: Linux process lifecycle.
The key point to note is that when a task is running on the CPU, the OS is not running until an
interrupt occurs. Typically the interrupt is caused by the timer that controls the time slice allocated
to the running process, or is raised by peripherals. Another point to note is that most processes actually
spend most of their time in the waiting state. This is because most processes frequently perform I/O
operations (e.g., disk access, network access, keyboard/mouse/touch screen input, ...) and these I/O
operations usually take a relatively long time to complete. You can check this using the time command;
for example, we can time a command that waits for user input, e.g.
Listing 5.3.1: Timing a command that waits for user input Bash
1 wim@rpi:~ $ time man man
2
3 real 0m5.275s
4 user 0m0.620s
5 sys 0m0.060s
The man command displays the man page for a command (in this case its own man page) and waits
until the user hits ’q’ to exit. I hit ’q’ after about five seconds.
To interpret the output of time, we need the definitions of real, user, and sys. According to the
man page:
[Figure 5.1 shows the lifecycle states: born (code loaded, PCB created; TASK_NEW), ready (run queue; TASK_RUNNING), running (on CPU; TASK_RUNNING), waiting (for I/O, thread sync, ...; TASK_NORMAL, TASK_IDLE), and died (PCB still active; TASK_DEAD, EXIT_ZOMBIE, EXIT_DEAD).]
The me command runs the specied program command with the given arguments. When the
command nishes, me writes a message to standard error giving ming stascs about this program
run. These stascs consist of
the elapsed real me between invocaon and terminaon,
the user CPU me (the sum of the tms_utime and tms_cutime values in a struct tms as returned
by mes(2)), and
the system CPU me (the sum of the tms_stime and tms_cstime values in a struct tms as returned
by mes(2)).
The man page of mes gives us some more details:
The struct tms is as dened in <sys/times.h>:
Lisng 5.3.2: struct tms from <sys/mes.h> C
1 struct tms {
2 clock_t tms_utime; /* user time */
3 clock_t tms_stime; /* system time */
4 clock_t tms_cutime; /* user time of children */
5 clock_t tms_cstime; /* system time of children */
6 };
The tms_utime field contains the CPU time spent executing instructions of the calling process. The
tms_stime field contains the CPU time spent in the system while executing tasks on behalf of the
calling process. The tms_cutime field contains the sum of the tms_utime and tms_cutime values
for all waited-for terminated children. The tms_cstime field contains the sum of the tms_stime and
tms_cstime values for all waited-for terminated children.
So what the example tells us is that the process spent only 620 ms out of 5.275 s running user
instructions, and the OS spent 60 ms performing work on behalf of the user process. So for about 4.6
seconds the process was waiting for I/O, i.e., the interrupt from the keyboard caused by hitting the ’q’
key. Most processes will alternate many times between running and waiting. The time a process spends
in the running state is called the burst time.
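The same accounting is available inside a program through the times library call. The hedged sketch below burns some user-mode CPU time, then makes a burst of read system calls so that some time is charged to the system, and finally prints tms_utime and tms_stime scaled by the clock-tick rate; the exact figures will vary from machine to machine.

#include <fcntl.h>
#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>

int main(void)
{
    long ticks = sysconf(_SC_CLK_TCK);     /* clock ticks per second */

    /* Burn some user-mode CPU time. */
    volatile double x = 0.0;
    for (long i = 0; i < 50000000L; i++)
        x += 1e-9;

    /* Make a burst of system calls so time is also spent in the kernel. */
    char buf[64];
    int fd = open("/dev/zero", O_RDONLY);
    for (int i = 0; i < 100000; i++)
        if (read(fd, buf, sizeof(buf)) < 0)
            break;
    close(fd);

    struct tms t;
    if (times(&t) == (clock_t)-1) { perror("times"); return 1; }
    printf("user time:   %.2f s\n", (double)t.tms_utime / ticks);
    printf("system time: %.2f s\n", (double)t.tms_stime / ticks);
    return 0;
}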
5.4 System calls
When a user process wants to perform I/O or any other system-related operation, it needs to instruct
the operating system to perform the required action. This operation is called a system call. Because
the operating system is interrupt-driven, the user process needs to raise a software interrupt to
give control to the operating system. Furthermore, Linux system calls are identified by a unique
number and take a variable number of arguments. Linux allows us to invoke system calls via the
syscall() library function (although this is not the mechanism used for the common system calls in the C
library). The syscall(2) man page provides a very good discussion of the details. The following section
gives a summary of the man page, omitting specific details for non-Arm architectures.
5.4.1 The Linux syscall(2) function
Listing 5.4.1: Linux syscall C
1 #define _GNU_SOURCE /* See feature_test_macros(7) */
2 #include <unistd.h>
3 #include <sys/syscall.h> /* For SYS_xxx definitions */
4 long syscall(long number, ...);
syscall() is a small library function that invokes the system call whose assembly language interface
has the specified number with the specified arguments. Employing syscall() is useful, for example,
when invoking a system call that has no wrapper function in the C library. syscall() saves CPU
registers before making the system call, restores the registers upon return from the system call, and
stores any error code returned by the system call in errno(3) if an error occurs. Symbolic constants for
system call numbers can be found in the header file <sys/syscall.h>.
The return value is defined by the system call being invoked. In general, a 0 return value indicates
success. A −1 return value indicates an error, and an error code is stored in errno.
Architecture calling conventions
Each architecture ABI (Application Binary Interface) has its own requirements on how system call
arguments are passed to the kernel. For system calls that have a glibc wrapper (e.g., most system
calls), glibc handles the details of copying arguments to the right registers in a manner suitable for the
architecture.
Every architecture has its own way of invoking and passing arguments to the kernel. The details for the
Arm (32-bit) EABI and arm64 (i.e., AArch64) architectures are listed in the two tables below.
Table 5.1 lists the instruction used to transition to kernel mode (which might not be the fastest or best way
to transition to the kernel, so you might have to refer to vdso(7)), the register used to indicate the system
call number, the register used to return the system call result, and the register used to signal an error.
Table 5.1: Instruction used to transition to kernel mode.
ABI        Instruction  Syscall#  Retval  Error
arm/EABI   swi #0       r7        r0      -
arm64      svc #0       x8        x0      -
Table 5.2 shows the registers used to pass the system call arguments.
Table 5.2: Registers used to pass the system call arguments.
ABI        arg1  arg2  arg3  arg4  arg5  arg6  arg7
arm/EABI   r0    r1    r2    r3    r4    r5    r6
arm64      x0    x1    x2    x3    x4    x5    -
The Cortex-A53 is an AArch64 core which supports both ABIs. However, the Raspbian Linux shipped
with the Raspberry Pi 3 is a 32-bit Linux, so it uses the EABI. This means that it uses swi (Software
Interrupt) rather than svc (Supervisor Call) to perform a system call. However, in practice, they are
synonyms, and their purpose is to provide a mechanism for unprivileged software to make a system
call to the operating system. The X* register names in AArch64 indicate that the general-purpose R*
registers are accessed as 64-bit registers [2].
For example (taken from the syscall man page), using syscall(), the readahead() system call would be
invoked as follows on the Arm architecture with the EABI in little-endian mode:
Listing 5.4.2: Example syscall: readahead() C
1 syscall(SYS_readahead, fd, 0,
2 (unsigned int)(offset & 0xFFFFFFFF),
3 (unsigned int)(offset >> 32),
4 count);
5.4.2 The implications of the system call mechanism
Whenever a user process wants to perform I/O or any other system-related operation, the operating
system takes over. This means that every system call involves a context switch, with overheads,
as discussed in the previous chapter. Note that in the time taken to perform a context switch
(around 10 µs) the CPU could have executed 10,000 operations, so the overhead of context switching
is considerable.
Virtual dynamic shared object (vDSO)
To reduce the overhead of system calls, over time two mechanisms have been introduced
into the Linux kernel: vsyscall (virtual system call) and vDSO (virtual Dynamic Shared Object). The
original vsyscall mechanism is now obsolete, so we only discuss the vDSO. The purpose of
both mechanisms is the same: to allow system calls without the need for a context switch.
The rationale behind this mechanism is that some frequently used system calls do
not actually require kernel privileges, and therefore handing control over these operations to
the kernel is an unnecessary overhead. As the name indicates, these calls are implemented
in a special dynamically shared library (linux-vdso.so) which is automatically provided by the
kernel to any process created. In practice, for the Arm architecture only two system calls are
implemented this way: clock_gettime() and gettimeofday().
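To see the effect of the vDSO, the sketch below (illustrative only; absolute numbers will depend on your kernel and CPU) times one million gettimeofday() calls made through the C library, which can be satisfied by the vDSO, against one million made through the raw syscall() interface, which always traps into the kernel.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

#define N 1000000

static long elapsed_us(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_usec - a.tv_usec);
}

int main(void)
{
    struct timeval tv, start, end;

    /* Library call: may be routed through the vDSO, no kernel transition. */
    gettimeofday(&start, NULL);
    for (int i = 0; i < N; i++)
        gettimeofday(&tv, NULL);
    gettimeofday(&end, NULL);
    printf("library path: %ld us for %d calls\n", elapsed_us(start, end), N);

    /* Raw system call: forces a trap into the kernel every time. */
    gettimeofday(&start, NULL);
    for (int i = 0; i < N; i++)
        syscall(SYS_gettimeofday, &tv, NULL);
    gettimeofday(&end, NULL);
    printf("syscall path: %ld us for %d calls\n", elapsed_us(start, end), N);

    return 0;
}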
5.5 Scheduling principles
After this detour into the process lifecycle and the role of system calls, let’s have a look at the
principles of OS scheduling and what criteria an OS can use to make scheduling decisions.
Let’s assume a number of tasks are active in the system, and that each of these tasks spends a certain
portion of its lifetime running on the CPU and another portion waiting. It is also possible that a task
is ready to run, but the CPU is occupied by another task.
5.5.1 Preemptive versus non-preemptive scheduling
A first design decision to make is whether the scheduler will be able to interrupt running tasks, for example,
to run a task that it considers more important (i.e., one that has a higher priority). If this is the case, the
scheduler is called preemptive. In Linux, all scheduling is preemptive. The opposite, non-preemptive
scheduling, can be used if the tasks voluntarily yield the CPU to other tasks. This is called cooperative
multitasking and is not commonly used in modern operating systems.
Note that we do not use the term preemption when a task is moved to the waiting state because
this is not a scheduling activity. From a scheduling perspective, the remainder of the task can be
considered as a new task (belonging to the same process or thread).
5.5.2 Scheduling policies
The scheduling policy is the approach to scheduling taken by the scheduler. To understand the concept
better, consider that the scheduler must keep a list of tasks that are ready to run. This list is ordered in
some way, and the task at the head of the list is the one that will run next. Therefore, the main decision of
the scheduler is in which position in the list to put a newly ready task. Furthermore, the scheduler must also
decide for how long a task can run if it is not preempted by another task or interrupted by a system call.
Essentially, these two decisions form the scheduling policy. Linux has several different scheduling policies;
each task (i.e., each process or thread) can be set to one of these policies. The practical implementation
of a policy is an algorithm, so sometimes we will use the term scheduling algorithm instead.
5.5.3 Task attributes
We mentioned above (Section 5.5.1) that the scheduler can consider one task more important than
another, and therefore give a higher priority of execution to the more important task. This means that
the more important task can either be run sooner, or for longer, or both. The importance of a task
depends on its attributes. A task attribute could, for example, be the time when the task was put in the
task list, or its position in the task list; or the time it takes for the task to run; or the amount of CPU
time that has already been spent by the task. Or the task can have an explicit priority attribute, which
in practice is a small integer value used by the kernel to assess how important a process is.
The Linux kernel uses several of the above-mentioned attributes, depending on the scheduling policy
used, and all threads have a priority attribute.
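One attribute a programmer can adjust directly from user space is the traditional Unix nice value, which feeds into the priority the kernel uses for normal tasks. The minimal sketch below lowers the calling process's priority; note that raising priority (a more negative nice value) normally requires superuser privileges, and this is illustrative code rather than an excerpt from the kernel.

#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    errno = 0;
    int before = getpriority(PRIO_PROCESS, 0);   /* 0 means this process */
    if (before == -1 && errno != 0) { perror("getpriority"); return 1; }

    /* Larger nice values mean lower priority; 10 is a polite background job. */
    if (setpriority(PRIO_PROCESS, 0, 10) == -1) { perror("setpriority"); return 1; }

    printf("nice value changed from %d to %d\n",
           before, getpriority(PRIO_PROCESS, 0));
    return 0;
}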
5.6 Scheduling criteria
When selecting a scheduling policy, we can use different criteria, e.g., depending on the typical
process mix on the system, or depending on the requirements of the threads in an application.
The most commonly used criteria are:
CPU utilization: Ideally, the CPU would be busy 100% of the time, so that we don’t waste any CPU cycles.
Throughput: The number of processes completed per unit time.
Turnaround time: The elapsed (wall clock) time required for a particular task to complete, from birth
time to death.
Waiting time: The time spent by a task in the ready queue waiting to be run on the CPU.
Response time: The time taken between submitting a request and obtaining a response.
Load average: The average number of processes in the ready queue. On Linux, it is reported by
"uptime" and "who".
In general, we want to optimize the average value of each criterion, i.e., maximize CPU utilization and
throughput, and minimize all the others. It is also desirable to minimize the variance of a criterion,
because users prefer a consistently predictable system over an inconsistent one, even if the latter
performs better on average.
5.7 Scheduling policies
In this section, we discuss some common scheduling policies that make it easier to understand the
actual design choices and implementation details of the Linux kernel scheduler. To analyze the
behavior and performance of the various scheduling algorithms, we use a Gantt chart, i.e., a simple plot
of the task id on a discrete timeline. Table 5.3 shows the example task configuration that will be used
to create the Gantt charts for the different scheduling policies.
Table 5.3: Example task configuration.
Pid  Burst time  Arrival time  Priority
1    12          0             0
2    6           2             1
3    2           4             1
4    4           8             2
5    8           16            0
6    8           20            1
7    2           20            0
8    10          24            0
5.7.1 First-come, first-served (FCFS)
This is a very simple scheduling policy where the attribute deciding a task’s priority is simply its relative
arrival time in the list of runnable tasks. In this context, this list is a FIFO queue called the run queue.
The scheduler simply takes the task at the head of the queue and runs it on the CPU until it either
finishes or gets interrupted by a system call and hence moves to the waiting state. When the task
has finished waiting, it is re-added at the tail of the run queue. FCFS scheduling can either be
preemptive or non-preemptive, as illustrated in Figures 5.2 and 5.3.
Figure 5.2: Schedule for the example task configuration with non-preemptive FCFS.
[Gantt chart for Figure 5.2: task 1 runs from t = 0 to 12, then task 2 (12–18), task 3 (18–20), task 4 (20–24), task 5 (24–32), task 6 (32–40), and task 7 from t = 40; task arrivals are marked along the time axis.]
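To connect the criteria of Section 5.6 with this task set, the sketch below simulates non-preemptive FCFS for the Table 5.3 tasks (ignoring I/O waits, as the Gantt charts do) and prints each task's waiting and turnaround time together with the averages; the computed start times should match the schedule in Figure 5.2. This is illustrative code written for this chapter's example, not part of any scheduler.

#include <stdio.h>

/* Task set from Table 5.3 (times in abstract time units). */
struct task { int pid, burst, arrival; };

int main(void)
{
    struct task tasks[] = {
        {1, 12, 0}, {2, 6, 2}, {3, 2, 4}, {4, 4, 8},
        {5, 8, 16}, {6, 8, 20}, {7, 2, 20}, {8, 10, 24},
    };
    int n = sizeof(tasks) / sizeof(tasks[0]);
    int clock = 0;
    double total_wait = 0.0, total_turnaround = 0.0;

    /* Non-preemptive FCFS: the tasks are already in arrival order, so run
       each one to completion in turn. */
    for (int i = 0; i < n; i++) {
        if (clock < tasks[i].arrival)
            clock = tasks[i].arrival;          /* CPU idles until arrival */
        int waiting = clock - tasks[i].arrival;
        int turnaround = waiting + tasks[i].burst;
        printf("pid %d: start %2d, waiting %2d, turnaround %2d\n",
               tasks[i].pid, clock, waiting, turnaround);
        clock += tasks[i].burst;
        total_wait += waiting;
        total_turnaround += turnaround;
    }
    printf("average waiting time:    %.2f\n", total_wait / n);
    printf("average turnaround time: %.2f\n", total_turnaround / n);
    return 0;
}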
Figure 5.3: Schedule for the example task configuration with preemptive FCFS.
5.7.2 Round-robin (RR)
Round robin is another very simple scheduling policy that is nevertheless very widely used. We
introduced it already in Chapter 1. This policy consists of running every task for a fixed amount of
time. This amount of time is known as the time slice or scheduling quantum. The choice of the quantum
is crucial: if it is too long, the system will become unresponsive; if it is too short, the context switching
overhead will be considerable. As mentioned in the previous chapter, you can check this value on your
Linux system using:
Listing 5.7.1: Linux round-robin quantum from /proc Bash
1 cat /proc/sys/kernel/sched_rr_timeslice_ms
On the Raspberry Pi 3, it is 10 ms.
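The quantum can also be queried from a program with the POSIX sched_rr_get_interval call, as in the minimal sketch below (pid 0 means the calling process); note that for a process under the default SCHED_OTHER policy the reported interval may differ from the SCHED_RR quantum shown in /proc.

#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;
    /* Ask the kernel for the round-robin time slice of this process. */
    if (sched_rr_get_interval(0, &ts) == -1) {
        perror("sched_rr_get_interval");
        return 1;
    }
    printf("time slice: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}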
The schedule for the example task configuration using RR is shown in Figure 5.4.
Figure 5.4: Schedule for the example task configuration with Round-Robin scheduling.
5.7.3 Priority-driven scheduling
In priority-driven scheduling, the order in the run queue is determined by the priority of the process
or thread; in other words, the run queue is a priority queue. In general, we can observe the following:
A priority-driven scheduler is an on-line scheduler.
It does NOT precompute a schedule of tasks/jobs.
It assigns priorities to jobs when they are released and places them on a ready job queue
in priority order.
When preemption is allowed, a scheduling decision is made whenever a job is released
or completed.
At each scheduling decision time, the scheduler updates the ready job queue and then schedules
and executes the job at the head of the queue.
[Gantt charts for Figures 5.3 (preemptive FCFS) and 5.4 (round-robin, q = 4) plot which task runs in each time slot from t = 0 to 40, with task arrivals marked along the time axis.]
We can distinguish between fixed-priority and dynamic-priority algorithms:
A fixed-priority algorithm assigns the same priority to all the jobs in a task.
A dynamic-priority algorithm assigns different priorities to the individual jobs in a task.
The priority of a job is usually assigned upon its release and does not change. The next two example
scheduling policies use time-related information as the priority.
5.7.4 Shortest job first (SJF) and shortest remaining time first (SRTF)
If we knew how long it would take for a task to run, we could reorder the run queue so that the
shortest task would be at the head of the queue. This policy is called shortest job first (SJF) or
sometimes shortest job next, and an illustrative schedule is shown in Figure 5.5. I mention it because
it is a very common one in other textbooks, e.g. [3], but it is not very practical as in general the
scheduler can’t know how long a task will take to complete. It is, however, the simplest example of
the use of a task attribute as a priority (the priority is the inverse of the predicted remaining CPU time).
Furthermore, SJF is provably optimal, in that for a given set of tasks and their execution times, it gives
the least average waiting time for each process.
Figure 5.5: Schedule for the example task configuration with Shortest Job First scheduling.
The preemptive version of SJF is called shortest remaining time first (SRTF). The criterion for
preemption, in this case, is that a newly arrived task has a shorter remaining run time than the
currently running task (Figure 5.6). This policy has been proven to be the optimal preemptive policy
[4]. Both SJF and SRTF have an additional drawback: it is possible that some tasks will never run
because their remaining time is always considered to be longer than that of any other task in the
system. This is known as starvation.
Figure 5.6: Schedule for the example task configuration with Shortest Remaining Time First scheduling.
Figure 5.7: Schedule for the example task configuration with Shortest Elapsed Time First scheduling.
5.7.5 Shortest elapsed time first (SETF)
SJF and SRTF are so-called clairvoyant algorithms, as they require the scheduler to know information that is not available, in this case, the remaining run time of the process. A more practical approach is to use the elapsed run time of a process instead, which is easily measurable by the OS. The paper "Speed Is as Powerful as Clairvoyance" [5] proved that SETF not only obtains good average-case response time but also does not starve any job.
5.7.6 Priority scheduling
The term "priority scheduling" is used for priority-driven scheduling where the priority of the task is an entirely separate attribute, not related to other task attributes. Priority-driven scheduling can either be preemptive or non-preemptive, as illustrated in Figures 5.8 and 5.9.
Figure 5.8: Schedule for the example task configuration with non-preemptive Priority scheduling.
Figure 5.9: Schedule for the example task configuration with preemptive Priority scheduling.
The advantage of using a separate priority rather than, e.g., a time-based attribute of the task is that the priority can be changed if required. This is essential to prevent starvation, as mentioned for SJF. Any priority-based scheduling policy carries the risk that low-priority processes may never execute because there is always a higher-priority process taking precedence. To remedy this, the priority should not be static but should increase with the age of the process. This is called aging.
5.7.7 Real-me scheduling
Real-me applicaons are applicaons that process data in real-me, i.e., without delays. From a
scheduling perspecve, this means that the tasks have well dened me constraints. Processing must
be done within the dened constraints to be considered correct, in parcular, not nishing a process
within a given deadline can cause incorrect funconality.
We can disnguish two types of real-me systems:
So real-me systems give no guarantee as to when a crical real-me process will be scheduled,
but only guarantee that the crical process will have a higher priority. A typical example is video and
audio stream processing: missing deadlines will aect the quality of the playback bit is not fatal.
In hard real-me systems, a task must be serviced by its deadline, so the scheduler must be able to
guarantee this. This is, for example, the case for the controls of an airplane or other safety-crical
systems.
5.7.8 Earliest deadline first (EDF)
The Linux kernel supports both types of real-time scheduling. For soft real-time scheduling, it uses Round-Robin or FIFO. For hard real-time scheduling, it uses an algorithm known as Earliest Deadline First (EDF). This is a dynamic priority-driven scheduling algorithm for periodic tasks, i.e., tasks that periodically need some work to be done. This periodic activity is usually called a job. The period and the deadline for the jobs of each task must be known.
The job queue is ordered by the earliest deadline of the jobs. To compute this deadline, the scheduler must be aware of the period of each task, the phase differences between those periods, and the execution times and deadlines for each job. Usually, the deadline is the same as the period, i.e., a job for a given task must finish within one period. In that case, each task can be described by a tuple (phase, period, execution time).
Algorithm 5.1: EDF schedule for example tasks T1 = (0, 2, 1), T2 = (0, 5, 2.5)
Time  Ready to Run        Scheduled
0     J1,1[2]; J2,1[5]    J1,1
1     J2,1[5]             J2,1
2     J1,2[4]; J2,1[5]    J1,2
3     J2,1[5]             J2,1
4     J2,1[5]; J1,3[6]    J2,1
4.5   J1,3[6]             J1,3
5     J1,3[6]; J2,2[10]   J1,3
5.5   J2,2[10]            J2,2
6     J1,4[8]; J2,2[10]   J1,4
7     J2,2[10]            J2,2
8     J1,5[10]; J2,2[10]  J1,5
9     J2,2[10]            J2,2
For example, consider a system with two tasks which both started at time t=0, so the phase is 0 for both. T1 has a period of 2 and an execution time of 1; T2 has a period of 5 and an execution time of 2.5:
T1 = (0, 2, 1)
T2 = (0, 5, 2.5)
In other words, both tasks are active half of the time, so in principle, together they will use the CPU 100%. Because the tasks are periodic, it is sufficient to calculate a schedule for the least common multiple of the periods of T1 and T2, in this case 2*5 = 10. The resulting schedule is shown in Algorithm 5.1. This illustrates an important property of EDF: it guarantees that all deadlines are met provided that the total CPU utilization is not more than 100%. In other words, as long as the utilization does not exceed 100%, it is always possible to create a valid schedule.
5.8 Scheduling in the Linux kernel
The Linux kernel supports two categories of scheduling, normal and real-time. A good explanation is provided in the sched(7) man page. With regard to scheduling, the thread is the main abstraction, i.e., the scheduler schedules threads rather than processes.
Each thread has an associated scheduling policy and a static scheduling priority. The scheduler makes its decisions based on knowledge of the scheduling policy and static priority of all threads in the system.
There are currently (kernel 4.14) three normal scheduling policies: SCHED_OTHER, SCHED_IDLE, and SCHED_BATCH, and three real-time policies: SCHED_FIFO, SCHED_RR, and SCHED_DEADLINE. Of these, SCHED_OTHER, SCHED_FIFO, and SCHED_RR are required by the POSIX 1003.1b real-time standard [6].
For threads scheduled using one of the normal policies, the static priority is not used in scheduling decisions (it is set to 0). Processes scheduled under one of the real-time policies have a static priority value in the range 1 (low) to 99 (high). Thus real-time threads always have a higher static priority than normal threads.
The scheduler maintains a list of runnable threads per static priority value. To determine which thread to run next, it looks for the non-empty list with the highest static priority and selects the thread at the head of this list.
The scheduling policy determines where a thread is to be inserted into the list of threads with equal priority and how it will progress to the head of the list.
In Linux, all scheduling is preemptive: if a thread with a higher static priority becomes ready to run, the currently running thread will be preempted and returned to the run list for its priority level. The scheduling policy of the thread determines the ordering within the run list. This means that, e.g., for the run list with static priority 0, i.e., the normal scheduling category (SCHED_NORMAL), there can be up to three different policies that decide the relative ordering of the threads. For each of the higher static priority run lists (real-time), there can be one or two.
5.8.1 User priories: niceness
Niceness or nice value is the relave, dynamic priority of a process. Niceness values range from
-20 (most favorable to the process) to 19 (least favorable to the process) and the value aects how
the process is scheduled, but not in a direct way. The nice value of a running process can be changed
by the user via the nice(1) command or the nice(2) system call. We will see further how the dierent
schedulers use these values. Note that nice values are only for non-real-me processes.
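As a small illustration, the sketch below uses the nice(2) system call to lower the priority of the calling process by 5 (a user-space example of our own, not kernel code):

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* nice() can legitimately return -1, so errno must be checked. */
    errno = 0;
    int new_nice = nice(5);          /* add 5 to the current nice value */
    if (new_nice == -1 && errno != 0)
        perror("nice");
    else
        printf("new nice value: %d\n", new_nice);
    return 0;
}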
5.8.2 Scheduling informaon in the task control block
As menoned before, the task control block is implemented in the Linux kernel in the task_struct
data structure, dened in include/linux/sched.h. Let’s have a look at the scheduling-specic informaon
stored in the task_struct (all other elds have been removed for conciseness).
Lisng 5.8.1: task_struct from <include/linux/sched.h> C
1 struct task_struct {
2
3 int on_rq;
4 /*
5 * prio is the dynamic priority used when scheduling; static_prio is
6 * derived from the user's nice value, converted to a static priority
7 * so that it scales better with the various scheduler parameters.
8 */
9 int prio, static_prio, normal_prio;
10 unsigned int rt_priority; // for soft real-time
11
12 const struct sched_class *sched_class; // see below
13 struct sched_entity se; // see below
14 struct sched_rt_entity rt; // for soft real-time
15 struct sched_dl_entity dl; // for hard real-time
16
17 /** the scheduling policy used for this process, as listed above */
18 unsigned int policy;
19 };
This structure includes a number of other scheduling-related data structures. We will discuss sched_entity and the real-time variants sched_rt_entity and sched_dl_entity in the sections on the CFS and real-time schedulers. The sched_class struct is effectively an interface for the actual scheduling class in use: all functionality is implemented in each of the separate scheduling classes (fair, idle, rt, deadline).
Listing 5.8.2: sched_class from <include/linux/sched.h> C
1 struct sched_class {
2 const struct sched_class *next;
3
4 void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
5 void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
6 void (*yield_task) (struct rq *rq);
7 bool (*yield_to_task)
8 (struct rq *rq, struct task_struct *p, bool preempt);
9
10 void (*check_preempt_curr)
11 (struct rq *rq, struct task_struct *p, int flags);
12
13 struct task_struct * (*pick_next_task)
14 (struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
15 void (*put_prev_task) (struct rq *rq, struct task_struct *p);
16
17 void (*set_curr_task) (struct rq *rq);
18 void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
19 void (*task_fork) (struct task_struct *p);
20 void (*task_dead) (struct task_struct *p);
21
22 void (*switched_from) (struct rq *this_rq, struct task_struct *task);
23 void (*switched_to) (struct rq *this_rq, struct task_struct *task);
24 void (*prio_changed)
25 (struct rq *this_rq, struct task_struct *task,int oldprio);
26
27 unsigned int (*get_rr_interval) (struct rq *rq,
28 struct task_struct *task);
29
30 void (*update_curr) (struct rq *rq);
31
32 };
So in order to perform a scheduling operation for a process p, all the scheduler has to do is call
p->sched_class-><name of the operation>
and the corresponding operation for the particular scheduling class for that process will be carried out.
The Linux kernel keeps a per-CPU runqueue (struct rq), which contains different runqueues per scheduling class, as follows (from sched.h):
Lisng 5.8.3: runqueue struct from <include/linux/sched.h> C
1 /*
2 * This is the main, per-CPU runqueue data structure.
3 *
4 */
5 struct rq {
6 /* runqueue lock: */
7 raw_spinlock_t lock;
8 unsigned int nr_running;
9 #define CPU_LOAD_IDX_MAX 5
10 unsigned long cpu_load[CPU_LOAD_IDX_MAX];
11 struct load_weight load;
12 unsigned long nr_load_updates;
13 u64 nr_switches;
14 struct cfs_rq cfs;
15 struct rt_rq rt;
16 struct dl_rq dl;
17 struct task_struct *curr, *idle, *stop;
18 };
5.8.3 Process priories in the Linux kernel
The kernel uses the priories as set or reported by nice() and as stac priories and represents them
on a scale from 0 to 139. Priories from 0 to 99 are reserved for real-me processes and 100 to 139
(which are the nice values from -20 through to +19 shied by 120) are for normal processes. The
kernel code implemenng this can be found in include/linux/sched/prio.h, together with some macros
to convert between nice values and priories.
Lisng 5.8.4: Linux kernel priority calculaon C
1 #dene MAX_NICE 19
2 #dene MIN_NICE -20
3 #dene NICE_WIDTH (MAX_NICE - MIN_NICE + 1)
4
5 / *
6 * Priority of a process goes from 0..MAX_PRIO-1, valid RT
7 * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
8 * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
9 * values are inverted: lower p->prio value means higher priority.
10 *
11 * The MAX_USER_RT_PRIO value allows the actual maximum
12 * RT priority to be separate from the value exported to
13 * user-space. This allows kernel threads to set their
14 * priority to a value higher than any user task. Note:
15 * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
16 */
17
18 #dene MAX_USER_RT_PRIO 100
19 #dene MAX_RT_PRIO MAX_USER_RT_PRIO
20
21 #dene MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH)
22 #dene DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2)
23
24 / *
25 * Convert user-nice values [ -20 ... 0 ... 19 ]
26 * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
27 * and back.
28 */
29 #define NICE_TO_PRIO(nice) ((nice) + DEFAULT_PRIO)
30 #define PRIO_TO_NICE(prio) ((prio) - DEFAULT_PRIO)
31
32 /*
33 * 'User priority' is the nice value converted to something we
34 * can work with better when scaling various scheduler parameters,
35 * it's a [ 0 ... 39 ] range.
36 */
37 #define USER_PRIO(p) ((p)-MAX_RT_PRIO)
38 #define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio)
39 #define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))
Priority info in task_struct
The task_struct contains several priority-related fields:
int prio, static_prio, normal_prio;
unsigned int rt_priority; // for soft real-time
static_prio is the priority set by the user or by the system itself:
p->static_prio = NICE_TO_PRIO(nice_value);
normal_prio is based on static_prio and on the scheduling policy of a process, i.e., real-time or "normal" process. Tasks with the same static priority that use different policies will get different normal priorities. Child processes inherit the normal priorities.
p->prio is the so-called "dynamic priority." It is called dynamic because it can be changed by the system, for example, when the system temporarily raises a task's priority to a higher level so that it can preempt another high-priority task. Initially, prio is set to the same value as static_prio. The actual dynamic priority is computed as:
p->prio = effective_prio(p);
This function, defined in kernel/sched/core.c, first recomputes normal_prio and returns it, unless the task has (or has been boosted to) a real-time priority, in which case prio is kept unchanged.
Lisng 5.8.5: Implementaon of eecve_prio() C
1 static int eective_prio(struct task_struct *p)
2 {
3 p->normal_prio = normal_prio(p);
4 / *
5 * If we are RT tasks or we were boosted to RT priority,
6 * keep the priority unchanged. Otherwise, update priority
7 * to the normal priority:
8 */
9 if (!rt_prio(p->prio))
10 return p->normal_prio;
11 return p->prio;
12 }
For a real-me task, it calculates normal_prio as
Lisng 5.8.6: Implementaon of normal_prio() C
1 static inline int normal_prio(struct task_struct *p)
2 {
3 int prio;
4 if (task_has_dl_policy(p))
5 prio = MAX_DL_PRIO-1;
6 else if (task_has_rt_policy(p))
7 prio = MAX_RT_PRIO-1 - p->rt_priority;
8 else
9 prio = p->static_prio;
10 return prio;
11 }
In other words, if the task is not a real-time task, then prio, static_prio, and normal_prio all have the same value.
Priority and load weight
The priorities are not used simply to order tasks but to compute a "load weight," which is then used to calculate the CPU time allowed for a task.
The structure task_struct->se.load contains the weight of a process in a struct load_weight:
Listing 5.8.7: load_weight struct C
1 struct load_weight {
2 unsigned long weight;
3 u32 inv_weight;
4 };
The weight is roughly equivalent to 1024/(1.25)^nice; the actual values are hardcoded in the array sched_prio_to_weight (in kernel/sched/core.c):
Listing 5.8.8: Scheduling priority-to-weight conversion C
1 const int sched_prio_to_weight[40] = {
2 /* -20 */ 88761, 71755, 56483, 46273, 36291,
3 /* -15 */ 29154, 23254, 18705, 14949, 11916,
4 /* -10 */ 9548, 7620, 6100, 4904, 3906,
5 /* -5 */ 3121, 2501, 1991, 1586, 1277,
6 /* 0 */ 1024, 820, 655, 526, 423,
7 /* 5 */ 335, 272, 215, 172, 137,
8 /* 10 */ 110, 87, 70, 56, 45,
9 /* 15 */ 36, 29, 23, 18, 15,
10 };
This conversion is used in set_load_weight():
Listing 5.8.9: Implementation of set_load_weight() C
1 static void set_load_weight(struct task_struct *p)
2 {
3 int prio = p->static_prio - MAX_RT_PRIO;
4 struct load_weight *load = &p->se.load;
5 /*
6 * SCHED_IDLE tasks get minimal weight:
7 */
8 if (idle_policy(p->policy)) {
9 load->weight = scale_load(WEIGHT_IDLEPRIO);
10 load->inv_weight = WMULT_IDLEPRIO;
11 return;
12 }
13 load->weight = scale_load(sched_prio_to_weight[prio]);
14 load->inv_weight = sched_prio_to_wmult[prio];
15 }
Here, scale_load is a macro which increases resolution on 64-bit architectures; SCHED_IDLE is a scheduler policy for very low priority system background tasks. The inv_weight field is used to speed up reverse computations. So in essence, the operation is
load->weight = sched_prio_to_weight[prio];
The way the weight is used depends on the scheduling policy.
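To see what these weights mean in practice, the sketch below (our own illustration, with the two weights taken from the table above) computes the CPU share of a nice 0 task competing with a nice +5 task on the same CPU; because adjacent nice levels differ by a factor of roughly 1.25, each nice step shifts about 10% of relative CPU time:

#include <stdio.h>

int main(void)
{
    int w_nice0 = 1024;   /* weight for nice  0 */
    int w_nice5 = 335;    /* weight for nice +5 */
    double share = (double)w_nice0 / (w_nice0 + w_nice5);
    printf("the nice 0 task gets %.1f%% of the CPU\n", 100.0 * share); /* ~75.4% */
    return 0;
}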
5.8.4 Normal scheduling policies: the completely fair scheduler
All normal scheduling policies in the Linux kernel (SCHED_OTHER, SCHED_IDLE, and SCHED_BATCH) are implemented as part of what is known as the "Completely Fair Scheduler" (CFS). The philosophy behind this scheduler, which was introduced in kernel version 2.6.23 in 2007, is stated in the kernel documentation (https://elixir.bootlin.com/linux/latest/source/kernel/sched/sched.h) as follows:
80% of CFS's design can be summed up in a single sentence: CFS basically models an "ideal, precise multi-tasking CPU" on real hardware.
"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical power and which can run each task at precise equal speed, in parallel, each at 1/nr_running speed. For example: if there are 2 tasks running, then it runs each at 50% physical power --- i.e., actually in parallel.
On real hardware, we can run only a single task at once, so we have to introduce the concept of "virtual runtime." The virtual runtime of a task specifies when its next timeslice would start execution on the ideal multi-tasking CPU described above. In practice, the virtual runtime of a task is its actual runtime normalized to the total number of running tasks.
In other words, the CFS attempts to balance the virtual runtime over all tasks. The CFS scheduler run queue (struct cfs_rq cfs in struct rq in sched.h) is a priority queue with the task with the smallest virtual runtime at the head of the queue.
Lisng 5.8.10: Implementaon of CFS runqueue C
1 /* CFS-related elds in a runqueue */
2 struct cfs_rq {
3 struct load_weight load;
4 unsigned long runnable_weight;
5 unsigned int nr_running, h_nr_running;
6 u64 exec_clock;
7 u64 min_vruntime;
8 struct rb_root_cached tasks_timeline;
9 /*
10 * 'curr' points to currently running entity on this cfs_rq.
11 * It is set to NULL otherwise (i.e., when none are currently running).
12 */
13 struct sched_entity *curr, *next, *last, *skip;
14 };
The CFS algorithm computes the duration of the next time slice for this task based on the priorities of all tasks in the queue and runs it.
The calculation of the virtual runtime is done in the functions sched_slice(), sched_vslice(), and calc_delta_fair() in fair.c, using information from the sched_entity struct se:
Listing 5.8.11: sched_entity struct for calculation of virtual runtime C
1 struct sched_entity {
2 /* For load-balancing: */
3 struct load_weight load;
4 struct rb_node run_node;
5 struct list_head group_node;
6 unsigned int on_rq;
7
8 u64 exec_start;
9 u64 sum_exec_runtime;
10 u64 vruntime;
11 u64 prev_sum_exec_runtime;
12
13 u64 nr_migrations;
14
15 struct sched_statistics statistics;
16
17 };
As the actual C code in the kernel is quite convoluted, below we present equivalent Python code:
Listing 5.8.12: Calculation of virtual runtime slice Python
1 # Targeted preemption latency for CPU-bound tasks.
2 # NOTE: this latency value is not the same as the concept of 'timeslice length'
3 # - timeslices in CFS are of variable length and have no persistent notion
4 # like in traditional, time-slice based scheduling concepts.
5 sysctl_sched_latency = 6 * (1 + ilog(ncpus))  # in ms
6 # Minimal preemption granularity for CPU-bound tasks:
7 sysctl_sched_min_granularity = 0.75 * (1 + ilog(ncpus))  # in ms
8 sched_nr_latency = sysctl_sched_latency/sysctl_sched_min_granularity #6/0.75=8
9
10 def sched_slice(cfs_rq, tasks):
11 se = head(tasks)
12 # The idea is to set a period (slice) in which each task runs once.
13 # When there are too many tasks (sched_nr_latency)
14 # we have to stretch this period because otherwise, the slices get too small.
15 nrr = cfs_rq.nr_running + (not se.on_rq)
16 slice = sysctl_sched_latency
17 if nrr > sched_nr_latency:
18 slice = nrr * sysctl_sched_min_granularity
19 # slice is scaled using the weight of every other task in the run queue
20 for se in tasks:
21 cfs_rq = cfs_rq_of(se)
22 if not se.on_rq:
23 cfs_rq.load.weight += se.load.weight
24 slice = slice*se.load.weight/cfs_rq.load.weight
25 return slice
26
27
28 # The vruntime slice of a to-be-inserted task is: vslice = slice / weight
29
30 def calc_delta_fair(slice,task):
31 return slice*1024/task.load.weight
32
33 def sched_vslice(cfs_rq, tasks):
34 slice = sched_slice(cfs_rq, tasks)
35 se = head(tasks)
36 vslice = calc_delta_fair(slice,se)
37 return vslice
The actual posion of a task in the queue depends on vrunme, which is calculated as follows:
Lisng 5.8.13: Calculaon of vrunme Python
1 # Update the current task's runtime statistics.
2 def update_min_vruntime(cfs_rq):
3 curr = cfs_rq.curr
4 leftmost=rb_rst_cached(cfs_rq.tasks_timeline)
5 vruntime = cfs_rq.min_vruntime
6 if curr:
7 if curr.on_rq:
8 vruntime = curr.vruntime
9 else:
10 curr = None
11
12 if leftmost:  # non-empty tree
13 se = rb_entry(leftmost)
14 if not curr:
15 vruntime = se.vruntime
16 else:
17 vruntime = min_vruntime(vruntime, se.vruntime)
18
19 # ensure we never gain time by being placed backwards.
20 cfs_rq.min_vruntime = max_vruntime(cfs_rq.min_vruntime, vruntime)
21
22 def update_curr(cfs_rq):
23 curr = cfs_rq.curr
24 now = rq_clock_task(rq_of(cfs_rq))
25 delta_exec = now - curr.exec_start
26 curr.exec_start = now
27 curr.sum_exec_runtime += delta_exec
28 curr.vruntime += calc_delta_fair(delta_exec,curr)
29 cfs_rq = update_min_vruntime(cfs_rq)
30
31 def update_curr_fair(rq):
32 update_curr(cfs_rq_of(rq.curr.se))
In other words, the kernel calculates the difference between the time the process started executing (exec_start) and the current time (now) and then updates exec_start to now. It then uses this delta_exec and the load weight to calculate vruntime. Finally, min_vruntime is calculated as the minimum of the vruntime of the task at the head of the queue (i.e., the leftmost node in the red-black tree) and the vruntime of the current task. The code checks if there is a current task and if the queue is not empty, and provides fallbacks. This calculated value is then compared with the currently stored value (cfs_rq.min_vruntime), and the larger of the two becomes the new cfs_rq.min_vruntime.
5.8.5 So real-me scheduling policies
The Linux kernel supports both so real-me scheduling policies SCHED_RR and SCHED_FIFO
required by the POSIX real-me specicaon [7]. Real-me processes are managed by a separate
scheduler, dened in <kernel/sched/rt.c>.
From the kernel’s perspecve, real-me processes have one key dierence compared to other
processes: if there is a runnable real-me task, it will be run—unless there is another real-me task
with a higher priority.
There are currently two scheduling policies for so real-me tasks:
SCHED_FIFO: This is a First-Come. First-Served scheduling algorithm as discussed in Secon
5.7.1. Tasks following this policy do not have meslices; they run unl they block, yield the CPU
voluntarily or get pre-empted by a higher priority real-me task. A SCHED_FIFO task must have
a stac priority > 0 so that it always preempts any SCHED_NORMAL, SCHED_BATCH or SCHED_
IDLE process. Note that this means that a SCHED_FIFO task will use the CPU unl it nished, and
no non-real-me tasks will be scheduled on that CPU. Several SCHED_FIFO tasks of the same
priority run round-robin. A task can be pre-empted by a higher-priority task, in which case it will
stay at the head of the list for its priority and will resume execuon as soon as all tasks of higher
priority are blocked again. When a blocked SCHED_FIFO thread becomes runnable, it will be
inserted at the end of the list for its priority.
SCHED_RR: This is a Round-Robin (as explained in Secon 5.7.2) enhancement of SCHED_FIFO
scheduler, so it runs every task for a maximum xed me slice. Tasks of the same priority run round-
robin unl pre-empted by a more important task. If aer running for a me quantum, a task is not
nished, it will be put at the end of the list for its priority. A task that has been pre-empted by a
higher priority task and subsequently resumes execuon will complete the remaining poron of its
round-robin me quantum. As menoned before, the length of the me quantum can be retrieved
via /proc/sys/kernel/sched_rr_timeslice_ms or by using sched_rr_get_interval(2).
111
The kernel gives real-me tasks a stac priority, which does not get dynamically recalculated; the only
way to change this priority is by using the chrt(1) command. This ensures that a real-me task always
preempts a normal one and that strict order is kept between real-me tasks of dierent priories.
So real-me processes use a separate scheduling enty struct sched_rt_enty (rt in the task_struct):
Lisng 5.8.14: So real-me scheduling enty struct C
1 struct sched_rt_entity {
2 struct list_head run_list;
3 unsigned long timeout;
4 unsigned long watchdog_stamp;
5 unsigned int time_slice;
6 unsigned short on_rq;
7 unsigned short on_list;
8
9 struct sched_rt_entity *back;
10 #ifdef CONFIG_RT_GROUP_SCHED
11 struct sched_rt_entity *parent;
12 /* rq on which this entity is (to be) queued: */
13 struct rt_rq *rt_rq;
14 /* rq "owned" by this entity/group: */
15 struct rt_rq *my_q;
16 #endif
17 };
As explained in Secon 5.8.2, the main runqueue contains dedicated runqueues for the normal (CFS),
so real-me (rt) and hard real-me (dl) scheduling classes. The so real-me queue uses a priority
queue implemented using a stac array of linked lists and a bitmap. All real-me tasks of a given
priority prio are kept in a linked list in active.queue[prio] and a bitmap (active.bitmap),
keeps track of whether a parcular queue is empty or not.
Lisng 5.8.15: So real-me runqueue C
1 /* Real-Time classes' related field in a runqueue: */
2 struct rt_rq {
3 struct rt_prio_array active;
4 unsigned int rt_nr_running;
5 unsigned int rr_nr_running;
6 #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
7 struct {
8 int curr; /* highest queued rt task prio */
9 } highest_prio;
10 #endif
11 int rt_queued;
12
13 int rt_throttled;
14 u64 rt_time;
15 u64 rt_runtime;
16 /* Nests inside the rq lock: */
17 raw_spinlock_t rt_runtime_lock;
18
19 #ifdef CONFIG_RT_GROUP_SCHED
20 unsigned long rt_nr_boosted;
21
22 struct rq *rq;
23 struct task_group *tg;
24 #endif
25 };
26
27 /*
28 * This is the priority-queue data structure of the RT scheduling class:
29 */
30 struct rt_prio_array {
31 DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */
32 struct list_head queue[MAX_RT_PRIO];
33 };
34
35 struct rt_bandwidth {
36 /* nests inside the rq lock: */
37 raw_spinlock_t rt_runtime_lock;
38 ktime_t rt_period;
39 u64 rt_runtime;
40 struct hrtimer rt_period_timer;
41 unsigned int rt_period_active;
42 };
Similar to update_curr() in the CFS, there is an update_curr_rt() function in the real-time scheduler, defined in kernel/sched/rt.c. This function keeps track of the CPU time spent by soft real-time tasks, collects some statistics, updates timeslices where needed, and calls the scheduler when appropriate. All calculations are done using actual time; no virtual clock is used.
5.8.6 Hard real-me scheduling policy
Since kernel version 3.14 of the Linux kernel (2014), Linux supports hard real-me scheduling via the
SCHED_DEADLINE scheduling class. This is an implementaon of the Earliest Deadline First (EDF)
algorithm discussed in Secon 5.7.8, combined with the Constant Bandwidth Server (CBS) algorithm [8].
According to the sched(7) Linux manual page:
The SCHED_DEADLINE (sporadic task model deadline scheduling) policy is currently implemented using
GEDF (Global Earliest Deadline First) in conjuncon with CBS (Constant Bandwidth Server). A sporadic task
is one that has a sequence of jobs, where each job is acvated at most once per period. Each job also has
a relave deadline, before which it should nish execuon, and a computaon me, which is the CPU me
necessary for execung the job. The moment when a task wakes up because a new job has to be executed is
called the arrival me. The start me is the me at which a task starts its execuon. The absolute deadline
is thus obtained by adding the relave deadline to the arrival me.
A SCHED_DEADLINE task is guaranteed to receive a given runme every period, and this runme
is available within deadline from the beginning of the period.
The runme, period, and deadline are stored in the struct sched_dl_enty struct (dl in the task_
struct) and can be set using the sched_setar() system call:
Lisng 5.8.16: Hard real-me scheduling enty struct C
1 struct sched_dl_entity {
2 /* the node in the red-black tree.
3 The red-black tree is used as priority queue
4 */
5 struct rb_node rb_node;
6
7 /*
8 * Original scheduling parameters.
9 */
10 u64 dl_runtime; /* Maximum runtime for each instance */
11 u64 dl_deadline; /* Relative deadline of each instance */
12 u64 dl_period; /* Separation of two instances (period) */
13 u64 dl_bw; /* dl_runtime / dl_period */
14 u64 dl_density; /* dl_runtime / dl_deadline */
15
16 /*
17 * Actual scheduling parameters. Initialized with the values above,
18 * they are continuously updated during task execution.
19 */
20 s64 runtime; /* Remaining runtime for this instance */
21 u64 deadline; /* Absolute deadline for this instance */
22 unsigned int ags;/* Specifying the scheduler behavior */
23
24 / *
25 * Some bool ags
26 */
27 unsigned int dl_throttled : 1;
28 unsigned int dl_boosted : 1;
29 unsigned int dl_yielded : 1;
30 unsigned int dl_non_contending : 1;
31
32 /*
33 * Per-task bandwidth enforcement timer.
34 */
35 struct hrtimer dl_timer;
36
37 /*
38 * Inactive timer
39 */
40 struct hrtimer inactive_timer;
41 };
Time budget allocaon
When a task wakes up because a new job has to be executed (i.e., at arrival me), deadline and
runtime are recalculated as follows (this is the Constant Bandwidth Server or CBS algorithm [8]):
if deadline < currentTime or
runme
> dl_runme then
deadline—currentTime dl_period
deadline = currentTime+dl+deadline
runtime = dl_runtime
else deadline and runme are le unchanged.
This calculaon is done in setup_new_dl_enty in kernel/sched/deadline.c:
Lisng 5.8.17: Deadline and runme recalculaon C
1 static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
2 {
3 struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
4 struct rq *rq = rq_of_dl_rq(dl_rq);
5
6 if (dl_se->dl_throttled)
7 return;
8
9 dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
10 dl_se->runtime = dl_se->dl_runtime;
11 }
This funcon is called via
enqueue_task_dl()
g
enqueue_dl_entity()
g
eupdate_dl_entity()
As explained in Secon 5.7.8, the EDF algorithm selects the task with the smallest deadline like the
one to be executed rst. In other words, we have a priority queue where the deadline is the priority.
Just like for the CFS, in the kernel, this priority queue is implemented using a red-black tree. The
lemost node in the tree has the smallest deadline and is cached so that selecng this node is O(1).
When a task executes for an amount of me t, its runme is decreased as
runme = runme t
This is done in update_curr_dl in kernel/sched/deadline.c:
Lisng 5.8.18: Runme update for EDF scheduling C
1 static void update_curr_dl(struct rq *rq)
2 {
3 struct task_struct *curr = rq->curr;
4 struct sched_dl_entity *dl_se = &curr->dl;
5 u64 delta_exec;
6
7 if (!dl_task(curr) || !on_dl_rq(dl_se))
8 return;
9
10 delta_exec = rq_clock_task(rq) - curr->se.exec_start;
11 if (unlikely((s64)delta_exec <= 0)) {
12 return;
13 }
14
15 dl_se->runtime -= delta_exec;
16
17 throttle:
18 if (dl_runtime_exceeded(dl_se) ) {
19 dl_se->dl_throttled = 1;
20 __dequeue_task_dl(rq, curr, 0);
21 if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
22 enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
23
24 if (!is_leftmost(curr, &rq->dl))
25 resched_curr(rq);
26 }
27
28 if (rt_bandwidth_enabled()) {
29 struct rt_rq *rt_rq = &rq->rt;
30
31 raw_spin_lock(&rt_rq->rt_runtime_lock);
32 if (sched_rt_bandwidth_account(rt_rq))
33 rt_rq->rt_time += delta_exec;
34 raw_spin_unlock(&rt_rq->rt_runtime_lock);
35 }
36 }
This funcon is called via
scheduler_tick()
g
task_tick_dl()
g
update_curr_dl()
When the runme becomes less than or equal to 0, the task cannot be scheduled unl its deadline.
The CBS feature in the kernel throles tasks that aempt to over-run their specied runme. This
is done by seng a mer for the replenishment of the me budget to the deadline (start_dl_
timer(curr)).
When this replenishment me is reached, the budgets are updated:
deadline = currentTime+dl+deadline
runtime = dl_runtime
5.8.7 Kernel preempon models
User space programs are always preempble. However, in certain real-me scenarios, it may be
desirable to preempt kernel code as well.
The Linux kernel provides several preempon models, which have to be selected when compiling
the kernel. For hard real-me performance, the “Fully Preempble Kernel” preempon model must
be selected. The last two entries below are available only with the PREEMPT_RT patch set. This is an
ocial kernel patch set which gives the Linux kernel hard real-me capabilies. We refer to HOWTO
setup Linux with PREEMPT_RT properly for more details. The possible preempon models are detailed
in the kernel conguraon le kernel/Kcong.preempt:
No Forced Preempon (Server): The tradional Linux preempon model, geared towards
throughput. System call returns and interrupts are the only preempon points.
Voluntary Kernel Preempon (Desktop): This opon reduces the latency of the kernel by adding
more “explicit preempon points” to the kernel code at the cost of slightly lower throughput.
In addion to explicit preempon points, system call returns and interrupt returns are implicit
preempon points.
Preempble Kernel (Low-Latency Desktop): This opon reduces the latency of the kernel by making
all kernel code (that is not execung in a crical secon) preempble. An implicit preempon point
is located aer each preempon disables secon.
Preempble Kernel (Basic RT): This preempon model resembles the “Preempble Kernel (Low-
Latency Desktop)” model. Besides the properes menoned above, threaded interrupt handlers
are forced (as when using the kernel command line parameter threadirqs). This model is mainly used
for tesng and debugging of substuon mechanisms implemented by the PREEMPT_RT patch.
Fully Preempble Kernel (RT): All kernel code is preempble except for a few selected crical
secons. Threaded interrupt handlers are forced. Furthermore, several substuon mechanisms,
like sleeping spinlocks and rt_mutex are implemented to reduce preempon disabled secons.
Addionally, large preempon disabled secons are substuted by separate locking constructs.
This preempon model has to be selected in order to obtain real-me behavior.
5.8.8 The red-black tree in the Linux kernel
The Linux kernel uses a red-black tree as the implementation of its priority queues. The red-black tree is a self-balancing binary search tree with O(log(n)) guarantees on accessing (search), insertion, and deletion of nodes. More specifically, the height H of a red-black tree with n nodes (the length of the path from the root to the deepest node in the tree) is bounded by:
log2(n + 1) ≤ H ≤ 2 log2(n + 1)
For example, a tree with a million nodes has a height of at most 2 log2(1,000,001) ≈ 40.
The implementation of the red-black tree in the Linux kernel is in lib/rbtree.c, the API is in include/linux/rbtree.h, and the data structure is documented in rbtree.txt. The API is quite simple, as illustrated by the examples in the documentation:
Creang a new rbtree
Data nodes in a rbtree tree are structures containing a struct rb_node member:
Lisng 5.8.19: Node in a rbtree C
1 struct mytype {
2 struct rb_node node;
3 char *keystring;
4 };
When dealing with a pointer to the embedded struct rb_node, the containing data structure may be accessed with the standard container_of() macro. In addition, individual members may be accessed directly via rb_entry(node, type, member).
At the root of each rbtree is an rb_root structure, which is initialized to be empty via:
Listing 5.8.20: Root of an rbtree C
1 struct rb_root mytree = RB_ROOT;
Searching for a value in an rbtree
Writing a search function for your tree is fairly straightforward: start at the root, compare each value, and follow the left or right branch as necessary.
Example:
Listing 5.8.21: Search function for an rbtree C
1 struct mytype *my_search(struct rb_root *root, char *string)
2 {
3 struct rb_node *node = root->rb_node;
4
5 while (node) {
6 struct mytype *data = container_of(node, struct mytype, node);
7 int result;
8
9 result = strcmp(string, data->keystring);
10
11 if (result < 0)
12 node = node->rb_left;
13 else if (result > 0)
14 node = node->rb_right;
15 else
16 return data;
17 }
18 return NULL;
19 }
Inserng data into a rbtree
Inserng data in the tree involves rst searching for the place to insert the new node, then inserng
the node and rebalancing ("recoloring") the tree. The search for inseron diers from the previous
search by nding the locaon of the pointer on which to gra the new node. The new node also needs
a link to its parent node for rebalancing purposes.
Example:
Lisng 5.8.22: Inseron in rbtree C
1 int my_insert(struct rb_root *root, struct mytype *data)
2 {
3 struct rb_node **new = &(root->rb_node), *parent = NULL;
4
5 /* Figure out where to put new node */
6 while (*new) {
7 struct mytype *this = container_of(*new, struct mytype, node);
8 int result = strcmp(data->keystring, this->keystring);
9
10 parent = *new;
11 if (result < 0)
12 new = &((*new)->rb_left);
13 else if (result > 0)
14 new = &((*new)->rb_right);
15 else
16 return FALSE;
17 }
18
19 /* Add new node and rebalance tree. */
20 rb_link_node(&data->node, parent, new);
21 rb_insert_color(&data->node, root);
22
23 return TRUE;
24 }
Removing or replacing existing data in an rbtree
To remove an existing node from a tree, call:
Listing 5.8.23: Removal from an rbtree C
1 void rb_erase(struct rb_node *victim, struct rb_root *tree);
Example:
Lisng 5.8.24: Removal from rbtree – example C
1 struct mytype *data = my_search(&mytree, "walrus");
2
3 if (data) {
4 rb_erase(&data->node, &mytree);
5 myfree(data);
6 }
To replace an exisng node in a tree with a new one with the same key, call:
Lisng 5.8.25: Replace node in rbtree C
1 void rb_replace_node(struct rb_node *old, struct rb_node *new,
2 struct rb_root *tree);
Replacing a node this way does not re-sort the tree: if the new node does not have the same key as the old node, the rbtree will probably become corrupted.
Iterating through the elements stored in an rbtree (in sort order)
Four functions are provided for iterating through an rbtree's contents in sorted order. These work on arbitrary trees, and should not need to be modified or wrapped (except for locking purposes):
Listing 5.8.26: Iterate through an rbtree C
1 struct rb_node *rb_rst(struct rb_node *tree);
2 struct rb_node *rb_last(struct rb_node *tree);
3 struct rb_node *rb_next(struct rb_node *node);
4 struct rb_node *rb_prev(struct rb_node *node);
To start iterang, call rb_rst() or rb_last() with a pointer to the root of the tree, which will return
a pointer to the node structure contained in the rst or last element in the tree. To connue, fetch the
next or previous node by calling rb_next() or rb_prev() on the current node. This will return NULL when
there are no more nodes le.
The iterator funcons return a pointer to the embedded struct rb_node, from which the containing
data structure may be accessed with the container_of() macro, and individual members may be
accessed directly via rb_entry(node, type, member).
Example:
Listing 5.8.27: Iterate through an rbtree – example C
1 struct rb_node *node;
2 for (node = rb_first(&mytree); node; node = rb_next(node))
3 printk("key=%s\n", rb_entry(node, struct mytype, node)->keystring);
Cached rbtrees
An interesting feature of the Linux implementation of the red-black tree is caching. Because computing the leftmost (smallest) node in a red-black tree is quite a common task, the cached rbtree rb_root_cached can be used to optimize O(log N) rb_first() calls into an O(1) simple pointer fetch, avoiding potentially expensive tree iterations. The runtime overhead for maintenance is negligible, and the memory footprint is only slightly larger: a cached rbtree is simply a regular rb_root with an extra pointer to cache the leftmost node. Consequently, any occurrence of rb_root can be substituted by rb_root_cached.
5.8.9 Linux scheduling commands and API
There are a number of commands that allow users to set and change process priorities, for both normal and real-time tasks.
Normal processes
The nice command allows the user to set the priority of the process to be executed:
Listing 5.8.28: Use of the nice command Bash
1 $ nice -n 12 command
The renice command allows the user to change the priority of a running process:
Listing 5.8.29: Use of the renice command Bash
1 $ renice -n 15 -p pid
Remember that nice values range from -20 to 19 and that lower nice values correspond to higher priority. So, -12 has a higher priority than 12. The default nice value is 0. Regular users can set lower priorities (positive nice values). To use higher priorities (negative nice values), superuser privileges are required.
Real-me processes
There is a single command to control the real-me properes of a process, chrt. This command sets
or retrieves the real-me scheduling aributes of a running process or runs the command with the
given aributes. The are a number of ags that allow us to set the scheduling policy (–other, –fo,–rr,
–batch, –idle, –deadline).
For example:
Lisng 5.8.30: Use of the chrt command Bash
1 $ chrt --batch 0 pwd
All real-me policies require superuser privileges, for example:
Lisng 5.8.31: Use of the chrt command Bash
1 $ sudo chrt --rr 32 pwd
The –deadline policy only works with sporadic tasks that have actual runme, deadline, and period
aributes set via the sched_setar system call.
5.9 Summary
In this chapter, we have introduced the concept of scheduling, the rationale behind it, and how it relates to the process life cycle and to the concept of system calls. We have discussed the different scheduling principles and criteria and covered a number of scheduling policies, both the basic policies and the more advanced policies used in the Linux kernel, including soft and hard real-time scheduling policies. We have then applied all this basic scheduling theory in a study of the Linux scheduler, covering the actual data structures and algorithms used by the different schedulers supported by the Linux kernel: the Completely Fair Scheduler, the soft real-time scheduler, and the hard real-time scheduler.
5.10 Exercises and questions
5.10.1 Writing a scheduler
For this exercise, we suggest you start from the existing code provided in the tutorial series Bare-Metal Programming on Raspberry Pi 3 on GitHub. Start from the provided cyclic executive example.
1. Create a round-robin scheduler.
2. Create a FIFO scheduler.
5.10.2 Scheduling
1. What are the reasons for having an operating system scheduler?
2. How does scheduling relate to the process lifecycle?
5.10.3 System calls
1. What is the rationale behind system calls?
2. What are the implications of the system call mechanism on scheduling?
5.10.4 Scheduling policies
1. What are the criteria for evaluating the suitability of a given scheduling policy?
2. Consider the set of processes given in the table at the end of this section, with the arrival time and burst time given in milliseconds. It is assumed below that a process arriving at time t is added to the Ready Queue before a scheduling decision is made.
a) Draw three Gantt charts that illustrate the execution of these processes using the following scheduling algorithms: FCFS, preemptive priority (a smaller priority number implies a higher priority), and RR (quantum = 1).
b) The best possible turnaround time for a process is its CPU burst time, i.e., when it is scheduled immediately upon arrival and runs to completion without being preempted. We will call the difference between the turnaround time and the CPU burst time the excess turnaround time. Which of the algorithms results in the minimum average excess turnaround time?
3. Discuss the similarities and differences between the Shortest Job First (SJF), Shortest Remaining Time First (SRTF), and Shortest Elapsed Time First (SETF) scheduling policies.
5.10.5 The Linux scheduler
1. How are priorities used in the Completely Fair Scheduler?
2. Explain the use of the red-black tree in the Completely Fair Scheduler.
3. Discuss the policies for soft and hard real-time scheduling in the Linux kernel.
Process  Arrival Time  Burst Time  Priority
P1       0             10          3
P2       1             1           1
P3       2             2           3
P4       3             1           4
P5       4             5           2
References
[1] A. Stevenson, Oxford Dictionary of English. Oxford University Press, USA, 2010.
[2] Arm Architecture Reference Manual – ARMv8, for ARMv8-A architecture profile, Arm Ltd, 12 2017, issue C.a. [Online]. Available: https://silver.arm.com/download/download.tm?pv=4239650&p=1343131
[3] A. Silberschatz, P. B. Galvin, and G. Gagne, Operating System Concepts Essentials. John Wiley & Sons, Inc., 2014.
[4] D. R. Smith, "A new proof of the optimality of the shortest remaining processing time discipline," Operations Research, vol. 26, no. 1, pp. 197–199, 1978.
[5] B. Kalyanasundaram and K. Pruhs, "Speed is as powerful as clairvoyance," J. ACM, vol. 47, no. 4, pp. 617–643, Jul. 2000. [Online]. Available: http://doi.acm.org/10.1145/347476.347479
[6] N. Navet, I. Loria, N. N. Koblenz, N. N. Koblenz, and N. N. Koblenz, "POSIX 1003.1b: scheduling policies (1/2)."
[7] M. G. Harbour, "Real-time POSIX: an overview," in VVConex 93 International Conference, Moscow. Citeseer, 1993.
[8] L. Abeni and G. Buttazzo, "Integrating multimedia applications in hard real-time systems," Proceedings of the 19th IEEE Real-Time Systems Symposium, 1998, pp. 4–13.
Chapter 6
Memory management
Operang Systems Foundaons with Linux on the Raspberry Pi
126
6.1 Overview
As with other hardware resources, a machine's random access memory (RAM) is managed by the operating system on behalf of user applications. This chapter explores specific details of memory management in Linux.
What you will learn
After you have studied the material in this chapter, you will be able to:
1. Contrast the speed and size of data storage locations across the range of physical memory technologies.
2. Justify the reasons for using a virtual addressing scheme.
3. Navigate Linux page table data structures to decode a virtual address.
4. Assess the relative merits of various page replacement policies.
5. Appraise the design decisions underlying Arm hardware support for virtual memory.
6. Explain why a process must maintain its working set of data in memory.
7. Describe the operation of key kernel routines that must be invoked to maintain the virtual memory abstraction.
6.2 Physical memory
Memory is a key high-level computing component. Along with the processor, it is the main element identified in Von Neumann's original, abstract model of computer architecture from the 1940s; see Figure 6.1.
Figure 6.1: Von Neumann architecture of a computer.
RAM technology has advanced significantly since those early days, when a flat memory structure, featuring a few kilobytes of storage, would require large, specialized, analog circuits.
The sheer complexity of modern memory is mostly due to the inherent trade-off between size and speed. Small memory may be accessed rapidly, e.g., an individual register in a CPU. On the other hand, large memory has a high access latency; the worst case is often backing storage based on tape drives in a data warehouse.
Let's examine the physical memory hierarchy of a Raspberry Pi device. Figure 6.2 shows a photo of a Pi board, labeling the components that contain the physical memory (processor registers and cache in the system-on-chip package, off-chip memory in the DRAM, and flash storage in the SD card). In terms of memory size and access speed, the diversity on the Pi is striking; there are six orders of magnitude difference in access latency from top to bottom of the hierarchy, and four orders of magnitude difference in size. The memory technology pyramid in Figure 6.3 shows precise details for a Raspberry Pi model 3B.
Figure 6.2: Raspberry Pi 2 board with labeled physical memory components; note that on more recent Pi models, the DRAM is stacked directly underneath the Broadcom system-on-chip, so it is not visible externally. Photo by author.
Figure 6.3: Typical access latency and size for the range of physical memory technologies in Raspberry Pi (registers: <1KB, 1 cycle; L1 cache: 16KB, 5 cycles; L2 cache: 256KB, 30 cycles; DRAM: 1GB, 100 cycles; flash: >8GB, 1,000,000 cycles).
The OS cooperates with hardware facilities to minimize application memory access latency as much as possible. This involves ensuring cache locality and DRAM residency for application code and data. First, let's consider how the OS assists processes in organizing their allocated memory.
6.3 Virtual memory
6.3.1 Conceptual view of memory
In simplest terms, memory may be modeled as a gigantic linear array data structure. From the perspective of a C program, memory is a one-dimensional int[] or byte[].
Each data item has an address (its index in the conceptual array) and a value (the bits stored at that address). Low-level machine instructions allow us to access data at byte, word, or multi-word granularity, where a word might be 32 or 64 bits, depending on the platform configuration. The Arm instruction set is a classic load/store architecture, with explicit instructions to read from (i.e., LDR) and write to (i.e., STR) memory.
6.3.2 Virtual addressing
In common with all modern high-level OSs, Linux uses virtual addressing. This is different from microcontrollers like typical Arduino and Mbed devices, which perform direct physical addressing. In Linux, each process has its own virtual address space, with virtual addresses (also known as logical addresses) mapped onto physical addresses, conceptually as a one-to-one mapping.
Historically, the Atlas computer, built at the University of Manchester in the 1960s, was the first machine to implement virtual memory. Figure 6.4 shows the original installation. The system was designed to map disparate memory technology onto a single address space, with address translation support in dedicated hardware.
Figure 6.4: The Atlas machine, designed at the University of Manchester, was the first system to feature virtual memory. Photo by Jim Garside.
Several key benefits are enabled by virtual memory.
Process isolation: It is impossible to trash another process's memory if the currently executing process is unable to address that memory directly. Accessing 'wild' pointers may cause a segmentation fault, but this will only impact the currently executing program, rather than the entire system.
Code relocation: Binary object files are generally loaded at the same virtual address, which is straightforward for linking and loading tools. This can ensure locality in the virtual address space, minimizing problems with memory fragmentation.
Hardware abstracon: The virtual address space provides a uniform, hardware-independent view of
memory, despite physical memory resources changing when we install more RAM or modify a hosted
VM conguraon.
Virtual addressing requires direct, integrated hardware support in the form of a memory management
unit (MMU). The MMU interposes between the processor and the memory, to translate virtual
addresses (in the processor domain) to physical addresses (in the memory domain). This translaon
process is known as hardware-assisted dynamic address relocaon and is supported by all modern
processor families. The rened Von Neumann architecture in Figure 6.5 gives a schemac overview
of the MMU’s interposing role.
Figure 6.5: Rened Von Neumann architecture showing the Memory Management Unit (MMU).
When the OS boots up, the processor starts in a physical addressing conguraon with the MMU
turned o. The early stages of the kernel boot sequence inialize basic data structures for virtual
memory management; then the MMU is turned on. For the Linux boot sequence on Arm, this happens
in the ___turn_mmu_on procedure in arch/arm/kernel/head.S.
6.3.3 Paging
The Linux virtual address space layout (for 32- and 64-bit Arm architectures) is shown in Figure
6.6. The split between user-space and kernel-space is either 3:1 or 2:2, for the 4GB address space.
The default Raspberry Pi Linux kernel 32-bit conguraon species CONFIG_VMSPLIT_2G=y which
means a 2:2 split. The 64-bit address boundary literals in Figure 6.6 assume eecve 39-bit virtual
addresses; several other variants are possible.
Figure 6.6: Linux virtual address space map for 32- and 64-bit architectures, lower addresses at the top.
The parcular mechanism chosen to implement virtual addressing in Linux is paging, which supports
ne-grained resource allocaon and management of physical memory. In a paged memory scheme,
physical memory is divided into xed-size frames. Virtual memory is similarly divided into xed-sized
pages, where a single page has the same size as a single frame. This allows us to set up a mapping
from pages to frames. The typical size of a single page in Linux is 4KB on a 32-bit Arm plaorm.
Try getconf PAGESIZE on your terminal to nd your system’s congured page size in bytes.
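The same value can be read from within a program using sysconf(3); a minimal sketch (our own illustration):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);   /* typically 4096 on 32-bit Arm */
    printf("page size: %ld bytes\n", page_size);
    return 0;
}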
The default page size is small enough to minimize fragmentation but large enough to avoid excessive overhead for per-page metadata. A page is the minimum granularity of memory that can be allocated to a user process. Larger pages are supported natively on Arm. For instance, 64KB pages and multi-MB 'huge' pages are possible. The advantage of larger pages is that fewer virtual-to-physical address translation mappings need to be stored. The main disadvantage comes from internal fragmentation, where a process is unable to use such a large amount of contiguous space effectively. Effectively, internal fragmentation means there is free memory which belongs to one process and cannot be assigned to another process. Generally, huge pages are appropriate for database systems and similar specialized data-intensive application workloads.
The next section examines the underlying mechanisms required to translate page-based virtual addresses into physical addresses.
6.4 Page tables
During normal process execution, the processors and caches operate entirely in terms of virtual addresses. As outlined above, the MMU intercepts all memory requests and translates virtual addresses into physical addresses.
The translation process relies on a mapping table, known as a page table, which is stored in memory; see Sections 6.4.1 and 6.4.2. Dedicated MMU base registers are available to point to page tables for rapid access. An MMU cache of frequently used address mappings is maintained in the translation look-aside buffer; see Section 6.4.4.
Generally, the address translation is performed by the MMU hardware, transparently from the process or OS point of view. However, the OS is involved when a translation does not succeed: this causes a page fault, see Section 6.5.2. Further, when a process begins execution, the OS needs to set up the initial page table and subsequently maintain it as the virtual address space evolves.
Sometimes the OS needs to operate on physical addresses directly, perhaps for device driver interactions. There are macros to convert between virtual and physical addresses, e.g., virt_to_phys(), but these only work for memory buffers allocated by the kernel with the kmalloc routine.
6.4.1 Page table structure
The page table is an in-memory data structure that translates from virtual to physical addresses. The translation happens automatically through the MMU hardware, which is directly supported by the processor. The MMU will automatically read the translation tables when necessary; this process is known as a page table walk. The OS simply has to maintain up-to-date mapping information in each process's page table, and refresh the page table base register each time a different process is executing.
The simplest possible structure is a single-level page table. For each page in the virtual address space, there is an entry in the table which contains a value corresponding to the appropriate physical address. This wastes space: a typical 32-bit 4GB address space, divided into distinct 4KB pages, will need a single-level page table to contain 1M entries. Each entry consists of an address, say 4B, along with some metadata bits. However, most processes do not make use of their entire virtual address space, so many page table entries would remain unused.
This motivates the design of a hierarchical page table. Before we get into specific details for Linux on Arm, let's consider an idealized two-level page table. A typical 32-bit virtual address is divided into three parts:
1. A 10-bit first-level page table index.
2. A 10-bit second-level page table index.
3. A 12-bit page offset.
Given that pages are 4KB, this is a convenient subdivision. The 10 bits enable us to address 1024 32-bit word entries. Each entry can contain a single address. This means each sub-table of the page table can fit into a single page.
For unused regions of the address space, the OS can invalidate corresponding entries in the first-level page table, as a consequence of which, we do not need second-level page tables for these address ranges. This is the main space-saving for hierarchical page tables, since each invalid first-level page table entry corresponds to 1024 invalid second-level page table entries, potentially saving up to 4MB of second-level page table space.
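To make the 10/10/12 split concrete, the sketch below extracts the three fields from a 32-bit virtual address. This is plain user-space arithmetic to illustrate the idea, not kernel code; the example address is arbitrary.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t vaddr = 0x12345678;              /* arbitrary example virtual address */
    uint32_t p1     = (vaddr >> 22) & 0x3FF;  /* top 10 bits: first-level index */
    uint32_t p2     = (vaddr >> 12) & 0x3FF;  /* next 10 bits: second-level index */
    uint32_t offset =  vaddr        & 0xFFF;  /* low 12 bits: page offset */
    printf("P1=%u P2=%u offset=0x%03x\n", p1, p2, offset);
    return 0;
}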
Figure 6.7 gives a schematic overview of a single virtual address translation, as handled by the MMU, using the two-level page table outlined above. Note the consecutive pair of table indexing operations, based on the P1 and P2 bitfields in the virtual address. The entry in the second-level page table contains the physical frame number, which is concatenated bitwise with the page offset to generate the actual physical address. There are spare bits in the 32-bit page table entry since the frame number will only occupy 20 bits. These remaining (low-order) bits can be used for page metadata such as access permissions, see Section 6.4.3.
Figure 6.7: Virtual address translation via a two-level page table.
(Figure 6.7 schematic: the P1 index is added to the page table base register to select a first-level table entry; that entry plus the P2 index selects a second-level table entry, which supplies the frame number; the frame number is concatenated with the page offset to form the physical address.)
An n-level hierarchical page table will impose an overhead of n (page table) memory references for each 'actual' memory reference. There are techniques to mitigate this overhead; for instance, see Section 6.4.4.
6.4.2 Linux page tables on Arm
This section explores how the Linux model for page tables is realized on the Arm architecture. First, we examine the generic Linux page table architecture; then we review the platform-specific optimizations that are enabled for the Raspberry Pi.
Linux supports a multi-level hierarchical page table. Since kernel version 4.14, page tables can have up to five levels.
1. PGD, page global directory: one per process, with a base pointer stored in an MMU register, and in the process state context at current->mm.pgd.
2. P4D, fourth level directory: only applicable to 5-level page tables, currently not supported on Arm.
3. PUD, page upper directory: applicable to 4- and 5-level page tables, currently supported on AArch64.
4. PMD, page middle directory: intermediate level table.
5. PTE, page table entry: a leaf of the page table, containing multiple page-to-frame translations.
With some platforms, fewer hardware page table levels are available than the Linux kernel supports. For instance, the default 32-bit Raspberry Pi Linux kernel configuration uses a two-level page table, as documented in arch/arm/include/asm/pgtable-2level.h. The PMD is defined to have a nominal size of a single entry; it folds back directly onto the page global directory (PGD), which is optimized away at compile time. This unit-sized intermediate page table 'trick' is also applied to other architectures and configurations.
The two-level page table structure maps neatly onto the Arm MMU paging hardware in the Raspberry Pi Broadcom SoC, which has a two-level page table where the first level contains 4096 entries (i.e., 4 consecutive pages) and each of the second-level tables has 256 entries. Each entry is a 32-bit word. However, because the Arm MMU hardware does not provide a sufficiently rich set of page metadata for the Linux memory manager, the metadata bits for each page have to be managed in software, via page faults and software fixups. For instance, Linux requires a 'young' bit for each page. This bit tracks whether the page has been accessed recently, which is useful for page replacement policies. The 'young' bit is not supported natively on Arm.
Linux sees the abstraction of 2048 64-bit entries in the PGD, defined in pgtable-2level.h with #define PTRS_PER_PGD 2048. Each 64-bit PGD composite entry breaks down into two 32-bit pointers to consecutive second-level blocks. Since the Arm MMU supports 256 entries in a second-level page table block, there are 512 entries in two consecutive blocks. Thus Linux sees the abstraction of 512 32-bit entries in a logical PTE. This is defined in the pgtable-2level.h file with #define PTRS_PER_PTE 512.
These PTE blocks only occupy half a 4KB page. The other half is occupied by arrays of Linux per-page metadata, which is not supported natively by the Arm MMU. Effectively, the Linux PTE metadata shadows the Arm hardware-supported metadata and is maintained by the OS using a page fault and fixup mechanism. The relevant code is in set_pte_ext, which is generally implemented as an assembler intrinsic routine, for efficiency reasons. For instance, check out the assembler routine cpu_v7_set_pte_ext in file arch/arm/mm/proc-v7-2level.S. The hardware page metadata word is generally 2048 bytes ahead of the corresponding Linux shadow metadata. To find this, execute the command:
grep 2048 *.S
Bash
in the linux/arch/arm/mm/ directory. Section 6.4.3 outlines the Linux metadata that the OS maintains for each page.
Eecvely, two dierent page table mechanisms are superimposed seamlessly onto the one-page
table data structure, for both the Arm MMU and the Linux virtual memory manager. Figure 6.8 shows
this page table organizaon as a schemac diagram.
Figure 6.8: Linux page table organization fits into the Arm hardware-supported two-level paging structure, with Linux page metadata bits shadowing hardware metadata at a distance of half a page (2048 bytes).
There are several more complex variants on this virtual addressing scheme. For instance:
1MB sections are contiguous areas of physical memory that can be translated directly from a single PGD entry. This enables more rapid virtual address translation.
Large Physical Address Extension (LPAE) is a scheme that enables 32-bit virtual addresses to be mapped onto 40-bit physical addresses. This permits 1TB of physical memory to be used on 32-bit Arm platforms.
(Figure 6.8 schematic: 2 Arm top-level entries (8 bytes) point via ptr1 and ptr2 to 4 Arm second-level blocks filling one 4KB page: 2 × 256 hardware entries followed by 2 × 256 Linux flagsets.)
6.4.3 Page metadata
To avoid confusion, note that a 'page table entry' may refer to one of two different concepts:
1. A Linux PTE, which is a leaf in the page table, containing 512 mappings from virtual to physical addresses.
2. A single mapping from a virtual to a physical address, along with corresponding metadata.
Throughout this chapter, when we mean (1), we will refer to it as a 'Linux PTE' specifically.
As well as recording the page frame number, to perform the mapping from a virtual to a physical address, a page table entry also stores appropriate metadata about the page. This includes information related to memory protection, sharing, and caching. Individual bits in the page table entry are reserved for specific information, so the OS can find attributes of pages with simple bitmask and shift operations.
Linux devotes a number of PTE bits to metadata. A typical layout is shown in Table 6.1, for the Raspberry Pi two-level page table (consult the file arch/arm/include/asm/pgtable-2level.h for details).
If a process attempts to make an illegal memory access (e.g., if it tries to execute code in a non-executable page or to read data from an invalid page), then a page fault event occurs and the system traps to a page fault handler, see Section 6.5.2.
From a user perspective, the simplest way to see memory metadata is to look at the /proc/PID/maps file for a process. Although the information is not presented at page level, it is shown at the level of segments, which are contiguous page sequences in the virtual address space. For each segment, the permissions are listed: these might include read (r), write (w), and execute (x). A further column shows whether the memory is private (p) to this process or shared (s) between multiple processes.
Table 6.1: Metadata associated with each page table entry in Linux.
Macro | Description | Bit position
L_PTE_VALID | Is this page resident in physical memory, or has it been swapped out? | 0
L_PTE_YOUNG | Has data in this page been accessed recently? | 1
(no single macro) | 4 bits associated with cache residency | 2–5
L_PTE_DIRTY | Has data in this page been written, so the page needs to be flushed to disk? | 6
L_PTE_RDONLY | Does this page contain read-only data? | 7
L_PTE_USER | Can this page be accessed by user-mode processes? | 8
L_PTE_XN | Does this page not contain executable code? (protection against buffer overflow attacks) | 9
L_PTE_SHARED | Is this page shared between multiple process address spaces? | 10
L_PTE_NONE | Is this page protected from unprivileged access? | 11
Figure 6.9: Bitmap patterns for page table entries, for a resident page-to-frame translation (above) and for a non-resident (swapped out) page (below).
Figure 6.10 shows an example of this memory mapping data for a single Linux process.
Figure 6.10: Extract from a process memory mapping reported in /proc/PID/maps.
The binary file /proc/PID/pagemap records actual mapping data. Access to this file requires root privileges, otherwise reads return zero values or cause permission errors. The pagemap file has a 64-bit value for each page. The low 55 bits (bits 0–54) of this value correspond to the physical frame number or swap location of that page. Higher bits are used for page metadata. The Python code below performs a single virtual-to-physical address translation using this map.
Lisng 6.4.1: Reading from the /proc pagemap le Python
1 import sys
2
3 pid = int(sys.argv[1], 10) # specify as decimal
4 vaddr = int(sys.argv[2], 16) # specify as hex
5
6 PAGESIZE=4096 # 4K pages
7 ENTRYSIZE=8
8
9 with open(("/proc/%d/pagemap" % pid), "rb") as f:
10 f.seek((vaddr/PAGESIZE) * ENTRYSIZE)
11 x = 0
i
i
“chapter” 2019/8/13 18:08 page 10 #10
i
i
i
i
i
i
31 12
page index
11 0
metadata
31 9
swap entry
8 3
swap type
000
2 0
i
i
“chapter” 2019/8/13 18:08 page 11 #11
i
i
i
i
i
i
pi@raspberrypi:/home/pi $ cat /proc/23655/maps
00010000-00011000 r-xp 00000000 b3:02 42164 /home/pi/.../a.out
00020000-00021000 rw-p 00000000 b3:02 42164 /home/pi/.../a.out
76e67000-76f92000 r-xp 00000000 b3:02 1941 /lib/arm-.../libc-2.19.so
76f92000-76fa2000 ---p 0012b000 b3:02 1941 /lib/arm-.../libc-2.19.so
76fa2000-76fa4000 r--p 0012b000 b3:02 1941 /lib/arm-.../libc-2.19.so
76fa4000-76fa5000 rw-p 0012d000 b3:02 1941 /lib/arm-.../libc-2.19.so
76fa5000-76fa8000 rw-p 00000000 00:00 0
76fa8000-76fad000 r-xp 00000000 b3:02 10133 /usr/lib/.../libarmmem.so
76fad000-76fbc000 ---p 00005000 b3:02 10133 /usr/lib/.../libarmmem.so
76fbc000-76fbd000 rw-p 00004000 b3:02 10133 /usr/lib/.../libarmmem.so
76fbd000-76fdd000 r-xp 00000000 b3:02 1906 /lib/arm-.../ld-2.19.so
address range
access permissions
mapped file
Chapter 6 | Memory management
Operang Systems Foundaons with Linux on the Raspberry Pi
136
12 for i in range(ENTRYSIZE):
13 x = (ord(f.read(1))<<(8*i)) + x # little endian
14
15 # interpret entry
16 present = (x>>63) & 1
17 swapped = (x>>62) & 1
18 le_page=(x>>61)&1
19 soft_dirty =(x>>54) & 1
20
21 paddr = x & ((1<<32)-1)
22
23 print ("virtual address %x maps to **%d%d%d%d** %x" %
24 (vaddr,present,swapped,le_page,soft_dirty,(paddr*PAGESIZE)))
6.4.4 Faster translation
Since every access to main memory requires an address translation, it is helpful to cache frequently used translations to reduce overall access latency. The micro-architectural component that supports this address translation caching is known as a translation look-aside buffer (TLB). This is a fully associative cache that stores a small set of virtual-to-physical (i.e., page to frame number) mappings. Accessing data in the TLB is much quicker than a page table lookup; a TLB access may take only a single cycle, at least one order of magnitude faster than a full page table walk. Figure 6.11 shows how a TLB works. When a virtual address needs to be translated, the TLB looks up all its (page, frame) entries in parallel. If any page tag matches, then we have a TLB hit. The translation succeeds with minimal overhead. On the other hand, if no entry tag matches, then we have a TLB miss, and an expensive page table lookup is necessary.
Figure 6.11: Fast virtual address lookup with a translation look-aside buffer.
Eecve use of the TLB depends on the same memory access behavior as for standard caches,
i.e., spaal and temporal locality of data accesses. If we can maximize TLB hits, most memory
addresses will be translated without needing to access the page table in main memory. Thus, in the
common case, the performance will be the same as for direct physical addressing; the TLB minimizes
translaon overhead.
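The hardware compares all TLB tags in parallel; software can only approximate that with a loop, but the following sketch illustrates the lookup logic. The structure names, sizes, and the example entry are purely illustrative, not kernel or hardware definitions.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 16
#define PAGE_SHIFT  12

struct tlb_entry {
    bool     valid;
    uint32_t page;   /* virtual page number */
    uint32_t frame;  /* physical frame number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and writes the physical address to *paddr;
   on a miss the caller would fall back to a page table walk. */
static bool tlb_lookup(uint32_t vaddr, uint32_t *paddr) {
    uint32_t page   = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);
    for (int i = 0; i < TLB_ENTRIES; i++) {   /* hardware checks all entries at once */
        if (tlb[i].valid && tlb[i].page == page) {
            *paddr = (tlb[i].frame << PAGE_SHIFT) | offset;
            return true;                      /* TLB hit */
        }
    }
    return false;                             /* TLB miss */
}

int main(void) {
    tlb[0] = (struct tlb_entry){ .valid = true, .page = 0x12345, .frame = 0x00042 };
    uint32_t paddr;
    if (tlb_lookup(0x12345678, &paddr))
        printf("hit: physical address 0x%08x\n", paddr);
    else
        printf("miss: walk the page table\n");
    return 0;
}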
(Figure 6.11 schematic: the virtual address is split into a page number and a page offset; the page number is compared against every (page, frame) pair in the fully associative cache at once, and on a TLB hit the matching frame number is combined with the offset to form the physical address.)
The Arm Cortex-A53 processor in the Raspberry Pi 3 features a two-level TLB. Each core has a micro-TLB, with 10 entries for instruction address lookups, and a further 10 for data address lookups. This corresponds to the Harvard architecture of the L1 cache. The main TLB is a 512-entry, 4-way set associative cache. Each entry is tagged with a process-specific address space identifier (ASID) or is global for all application spaces. The hardware automatically populates and maintains the state of the TLB; although, if the OS modifies an address translation that is cached in the TLB, it is then the responsibility of the OS to invalidate this stale TLB entry.
Since the TLB caches virtual addresses, its data must be flushed when the virtual address space mapping changes, perhaps at an OS context switch. The Arm system coprocessor has a TLB Operations Register c8, which supports TLB entry invalidation. There are different options for how much to invalidate, since a TLB flush is particularly expensive in terms of its impact on performance. For instance, it is not necessary to flush kernel addresses, since the kernel address space is common across all processes in the system. Each process may be associated with a distinct ASID, and only entries linked with the relevant ASID need to be invalidated on a context switch.
6.4.5 Architectural details
In the Arm architecture model, the system control coprocessor CP15 is responsible for the configuration of memory management. Translation table base registers (TTBRs) in this unit are configured to point to process-specific page tables by the OS, on a context switch. These registers are only accessible in privileged mode.
To read TTBR0 into general purpose register r0, we use the instruction:
MRC p15, 0, r0, c2, c0, 0
where p15 is the coprocessor, and c0 and c2 are coprocessor-specific registers. The dual MCR instruction writes from r0 into TTBR0, to update the page table base pointer.
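From C code running in a privileged context (for instance, inside a kernel module on a 32-bit Arm kernel), the same encoding can be expressed with GCC inline assembly. The sketch below is illustrative only: executing it from an unprivileged user-space process will fault.

/* Read TTBR0 via CP15: MRC p15, 0, <Rt>, c2, c0, 0.
   This must run in privileged mode; it is shown only to illustrate
   how the coprocessor access instruction is written from C. */
static inline unsigned int read_ttbr0(void) {
    unsigned int ttbr0;
    __asm__ volatile("mrc p15, 0, %0, c2, c0, 0" : "=r"(ttbr0));
    return ttbr0;
}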
Generally, Arm uses one page table base register for process-specific addresses (TTBR0) and devotes the other to OS kernel addresses (TTBR1). The page table control register TTBCR determines which page table base register is used for hardware page table walks; TTBCR is set when we vector into the kernel.
When the OS performs a context switch, it updates the process page table root pointer, PGD, to switch page tables. Since the on-chip caches are indexed using virtual addresses, it may be necessary to flush the cache on a context switch as well. Since this is a high-overhead operation, there are various techniques to avoid a cache flush on context switch. These optimizations may require more complex cache hardware (e.g., ASIDs per cache line) or more intricate OS memory management (e.g., avoiding overlaps in virtual address space ranges between concurrent processes).
6.5 Managing memory over-commitment
Since a process virtual address space may be much larger than the available physical memory, it is possible to allocate more memory than the system contains. This supports the abstraction that the system appears to have more memory than is physically installed. Recall that each process has a separate virtual address space (VAS); all VASs are mapped onto a single physical address space. This memory over-commitment is managed by the OS.
6.5.1 Swapping
When the system has more pages allocated than there are frames available in physical memory, the OS has to swap pages out of RAM and into the backing store. The Linux swap facility handles this overflow of pages. Swapping in Linux is often referred to as paging in other OS vocabularies.
Swap space is persistent storage, generally orders of magnitude slower than RAM. Typical swap space is a file system partition or a large file on the root file system. Check out cat /proc/swaps to inspect swap storage facilities on your Raspberry Pi Linux device. The Raspbian default swap facility is a single 100MB file in /var/swap.
sudo hexdump -C /var/swap | less
Bash
Examine this output to see what is stored in the swap space currently, although much of this data may be stale copies of old pages. Look for the SWAPSPACE2 magic header near the start of the file. In general, the swap file is divided up into page-sized slots. Note that swapping is not particularly common on the Raspberry Pi, since access latency to SD card storage is particularly high and frequent access can cause device corruption.
In a process page table, individual entries may be marked as swapped out. The pte_present() macro checks whether a page is resident in memory or swapped out to disk. The bitfield layout of the page table entry for a swapped-out page is shown in Figure 6.9, with distinct fields for the swap device number and the device-specific index.
A process may execute when some of its pages are not resident in memory. However, the OS needs to handle the situation when the process tries to execute a memory access from a swapped out (non-resident) page. The next section describes this OS support for page faults.
6.5.2 Handling page faults
A page fault event is a processor exception, which must be handled by an OS-installed exception handler. In Linux, the page fault handler is do_page_fault(), defined in arch/arm/mm/fault.c, which calls out to non-architecture-specific routines in mm/memory.c.
Figure 6.12 depicts a simplified flow graph for the page fault handling code. Initially, the handler checks whether this page is a sensible page for the process to be accessing, as opposed to a 'wild' access outside the process' mapped address space. Then there is a permissions check of the page table entry to determine whether the process is allowed to perform the requested memory operation (read, write, or execute). If either check fails, then there is a segmentation fault. If the checks pass, then the page fault handler will take appropriate remedial action: swapping in a swapped-out page, reading in data from a file, performing a copy-on-write operation, or allocating a fresh frame for a new page.
Figure 6.12: Flow chart for Linux page fault handling code.
Once the page fault has been handled, the OS restarts the faulting process at the instruction that originally caused the exception, and user-mode execution resumes, subject to process scheduling.
Suppose you have just launched a process with PID 6903; you can inspect the actual page faults incurred by this process with the command:
ps -o min_flt,maj_flt,cmd,args 6903
Bash
(Figure 6.12 flow chart: a page fault exception for address x with access mode M traps to the page fault handler; if x is an invalid address, or mode M is forbidden at x, a segfault is signaled; otherwise, if the data for x is in swap it is swapped in, if it is in a file the file data is read in, if the page at x is resident a copy on write is performed, and otherwise a fresh empty page is allocated.)
Running the command without a PID integer argument lists statistics for all the user's processes. To run a program and get a total count for its page faults, use the /usr/bin/time command. (This may require you to install the time package with sudo apt-get install time. Note you need the full path, since time is also a bash built-in command.) Now try /usr/bin/time ls and see how the output reports the number of page faults.
Note that Linux distinguishes between minor faults (when a page is already resident, but not mapped in this process' VAS, e.g., code shared between multiple processes) and major faults (when the OS has to access the persistent store and read in data from a file).
As an example, consider the C code below. It creates a multi-page array and accesses a single byte in each page. Because of demand paging, the pages are only mapped into the process's VAS when first accessed. As the program is executed with larger arrays (use the command line parameter to increase the size), the number of minor page faults increases. Try running it with an argument of 64000 (a roughly 256MB array). Note that if there is not enough memory, then the program will terminate.
Listing 6.5.1: Program that induces minor page faults    C
#include <stdlib.h>

/* assume 4KB page size */
#define PAGES 1024*4

int main(int argc, char **argv) {
    char *p = 0;
    int i = 0, j = 0;
    /* n is number of pages */
    int n = 100;
    if (argc == 2) {
        n = atoi(argv[1]);
    }
    p = (char *)malloc(PAGES*n);
    for (i=0; i<PAGES; i++) {
        for (j=0; j<PAGES*(n-1); j+=PAGES) {
            p[(i+j)] = 42;
        }
    }
    return 0;
}
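As an alternative to /usr/bin/time, a process can also query its own fault counters through the getrusage system call. The sketch below touches some freshly allocated anonymous memory and then reports the minor and major fault counts; the allocation size is arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void) {
    struct rusage ru;
    size_t size = 64 * 1024 * 1024;      /* 64MB of anonymous memory */
    char *p = malloc(size);
    if (!p) return 1;
    memset(p, 1, size);                  /* touching each page triggers minor faults */
    getrusage(RUSAGE_SELF, &ru);
    printf("minor faults: %ld, major faults: %ld\n",
           ru.ru_minflt, ru.ru_majflt);
    free(p);
    return 0;
}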
Now consider a similar program, but one that uses memory-mapped files, so the OS has to fetch the data from the backing store. Grab a text file, e.g., with:
curl -o alice.txt http://www.gutenberg.org/files/11/11-0.txt
Bash
and then compile the code shown below.
Listing 6.5.2: Program that induces major page faults    C
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

size_t get_size(const char *filename) {
    struct stat st;
    stat(filename, &st);
    return st.st_size;
}

int main(int argc, char **argv) {
    int i, total = 0;
    size_t filesize = get_size(argv[1]);
    int fd = open(argv[1], O_RDONLY, 0);
    char *data;
    assert(fd != -1);
    posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
    data = mmap(NULL, filesize, PROT_READ, MAP_PRIVATE |
                MAP_NONBLOCK, fd, 0);
    assert(data != MAP_FAILED);
    for (i = 0; i < filesize; i += 1024)
        total += data[i];
    printf("total = %d\n", total);
    int rc = munmap(data, filesize);
    assert(rc == 0);
    close(fd);
}
The rst me you run this program with:
/usr/bin/time -v ./a.out alice.txt 2>&1 | grep Major
Bash
noce there is at least one major fault as the le is read into memory. However, if you run it
immediately again, for a second me, there will be no major faults; the le data is already cached in
memory, so the program only causes minor faults.
6.5.3 Working set size
The working set for a process measures the number of pages that must be resident for that process to make useful progress, i.e., to avoid constant swapping.
There are various files that track per-process memory consumption. For instance, for a process with id PID, the file /proc/PID/statm reports page-level memory usage. The first column shows the vmsize (the number of pages allocated in the virtual address space), and the second column shows the resident set size (the number of pages resident in physical memory for this process). The following inequality always holds: rss ≤ vmsize. The file /proc/PID/status shows the same information in a more readable format.
For a process to execute effectively, the RSS should be at least as large as the working set size (WSS). Linux does not measure WSS directly; however, various third-party scripts are available to estimate process WSS, e.g., consult http://www.brendangregg.com/wss.html
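Returning to /proc/PID/statm, the sketch below reads the first two columns (vmsize and rss) for the current process; the values are in pages, so they are scaled by the page size for readability.

#include <stdio.h>
#include <unistd.h>

int main(void) {
    long vmsize, rss;
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f) { perror("fopen"); return 1; }
    if (fscanf(f, "%ld %ld", &vmsize, &rss) != 2) { fclose(f); return 1; }
    fclose(f);
    long pagesize = sysconf(_SC_PAGESIZE);   /* convert pages to bytes */
    printf("vmsize: %ld pages (%ld KB), rss: %ld pages (%ld KB)\n",
           vmsize, vmsize * pagesize / 1024, rss, rss * pagesize / 1024);
    return 0;
}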
6.5.4 In-memory caches
Physical memory frames that are not being used to store process pages could be used effectively by the OS for other purposes, such as caching data. Linux features several kinds of in-memory caches that use these free frames.
The file system page cache stores page-sized chunks of files in memory, after they are first touched. The OS reads ahead, to load portions of the file into memory in anticipation of future accesses. The fadvise function allows the process to specify how the file will be accessed. The page cache is the reason why second and subsequent accesses to a file generally take much less time than the initial access.
The swap cache keeps track of which physical frames have been written out to a swap file. This is highly useful for pages shared between multiple processes, for example. Once a page has been written out to the swap file, the data remains in the swap file slot when the page is next swapped in, and the relevant page table entry records where the page lives in swap. If the page is not modified after regaining memory residence and later needs to be swapped out again, the writeback can therefore be avoided. On the other hand, if the page is modified in memory, then its swap cache entry is expunged, because the page becomes dirty and must be written back. This swap cache feature may save unnecessary swap file writebacks.
The buffer cache is used to optimize access to block devices (see the later chapter on I/O). Since read and write operations are expensive for slow block devices, the buffer cache interposes on these accesses to reduce I/O latency. For instance, individual writes from a collection of processes could be batched up for a block device. A buffer cache will record blocks of data that have been read from or written to a block device.
We can use commands like free -h or vmstat -S m to inspect how the Raspberry Pi physical RAM is allocated between process pages, OS buffers, page cache, etc. Ideally, all unused frames in a system would be occupied by buffers and caches, since this is preferable to underutilizing physical RAM. The caches are then shrunk when the process page requirements increase as more processes are admitted.
6.5.5 Page replacement policies
The kernel swap daemon is a background process that commences running after kernel initialization.
ps -e | grep kswapd
Bash
Invoke this command to see the daemon running on your Pi. The responsibility of kswapd is to swap out pages that are not currently needed. This serves to maintain a set of free frames that are available for newly allocated or swapped-in pages.
Some pages are obvious candidates for swapping out; these are clean pages whose data is already in the backing store, e.g., executable code, other memory-mapped files, or pages in the swap cache. Such pages can be discarded without copying any data since the data is already stored elsewhere. On the other hand, dirty pages have been updated since they were read in from the backing store; other pages (e.g., anonymous process pages) may never have been written out to the backing store. Such pages must have their data transferred to persistent storage before they can be swapped out.
It is not efficient to swap out pages if their data may be required again in the near future, since the swap out operation will be followed swiftly by a swap in of the same data.
Bélády's optimal page replacement policy is a theoretical oracle that looks into the future, to select a candidate page for replacement that will not be used again, or will only be used further in the future than any other page currently resident. Since this abstraction is not implementable, Linux assumes that, if a page has not been used in the recent past, then it is unlikely to be used again in the near future. This is the principle of temporal locality.
Two memory manager mechanisms are used to keep track of page usage over time:
1. Each page has an associated metadata bit that may be set when the page is accessed.
2. Pages may be stored in a doubly-linked list that approximates least-recently-used (LRU) order.
Pages grow older as they are not accessed over time; old pages are ideal candidates for swapping out. Below we review several page replacement policies.
Random
The simplest page replacement algorithm does not take page age or usage into account. It simply selects a random victim page to be swapped out immediately, to make space for a new page.
Not recently used
A page is not recently used (NRU) if its access metadata bit is unset. Such a page is a good candidate for replacement. The NRU algorithm might work as follows:
1. A page p is randomly selected as a candidate.
2. If p's access bit is set, go back to (1).
3. Assert p's access bit is unset, and select p for replacement.
There is no guarantee of termination with NRU, since all pages may have access bits set. We assume the OS will periodically unset all bits.
Clock
The clock algorithm keeps a circular list of pages. There is a conceptual 'hand' that points to the next candidate page for replacement, see Figure 6.13. When a page replacement needs to take place, the clock algorithm inspects the current candidate: if its access bit is set, then the access bit is unset and the clock hand advances to the next page. The first page with an unset access bit is selected as the victim to be swapped out. This is a 'second chance' algorithm.
Figure 6.13: Clock page replacement algorithm.
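A minimal user-space model of the clock policy is sketched below. The frame count, the access bits (taken from the figure), and the hand are plain variables here, not kernel data structures.

#include <stdbool.h>
#include <stdio.h>

#define NFRAMES 5

static bool accessed[NFRAMES] = { true, false, false, true, true };
static int hand = 0;   /* the clock hand */

/* Advance the hand until a frame with a clear access bit is found;
   frames with the bit set get a 'second chance': the bit is cleared
   and the hand moves on. */
static int clock_select_victim(void) {
    for (;;) {
        if (!accessed[hand]) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        accessed[hand] = false;          /* second chance */
        hand = (hand + 1) % NFRAMES;
    }
}

int main(void) {
    printf("victim frame: %d\n", clock_select_victim());
    return 0;
}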
Least recently used
A genuine least recently used (LRU) scheme either upgrades the single access bitfield to a longer last-access timestamp field for each page, or shuffles pages in a doubly-linked list to sort them in order of access time. The victim page is then easily selected as the page with the oldest timestamp, or the page at the tail of the list, respectively. Both of these techniques have significant management overhead.
The Linux memory manager actually implements a variant of the LRU page replacement scheme. Pages allocated to processes are added to the head of a global active pages queue. When a page needs to be evicted, the tail of this queue is examined. If the tail page has its access bit set, then it is moved back to the head of the queue, and its access bit is unset. However, if the tail page does not have its access bit set, then it is a candidate for replacement, and it is moved to the inactive pages queue from where it may be swapped out.
The page replacement algorithm is implemented in the function do_try_to_free_pages() in the source code file linux/mm/vmscan.c, but be aware that this is a complex piece of code to trace.
Tuning the system
Linux has a kernel parameter called swappiness, which controls how aggressively the kernel swaps pages out to the backing store. The value should be an integer between 0 and 100 inclusive. Higher values are more aggressive at swapping pages from less active processes out of physical memory, which improves file-system performance (cache).
Note that, on a Raspberry Pi device, the swappiness may be set at a particularly low value, since the swap file or partition is on an SD card, which has high access latency and may fail with excessive write operations.
Find your current system’s swappiness value with:
cat /proc/sys/vm/swappiness
Bash
(Figure 6.13 schematic: five pages arranged in a circle, with access bits p0:1, p1:0, p2:0, p3:1, p4:1 and the clock hand pointing at the next candidate.)
On a desktop Linux installation, the default value is generally 60. Try something like:
sudo sysctl -w vm.swappiness=100
Bash
and see whether this changes the performance of your system over time.
When the physical memory resource becomes chronically over-committed, active pages must be swapped out and swapped in again with increasing frequency. The whole system slows down drastically since no process can make progress without incurring major page faults. All the system time is spent servicing these page faults, so no useful work is achieved. This phenomenon is known as thrashing, and it badly affects system performance.
6.5.6 Demand paging
Linux implements demand paging, which means physical memory is allocated to processes in a lazy, or just-in-time, manner. A call to mmap only has an effect on the process page table; frames are not allocated to the process directly. The process is only assigned physical memory resource when it really needs it.
The Linux memory management subsystem records areas of virtual memory that are mapped in the virtual address space, but for which the physical memory has not yet been allocated. (These are zeroed-out entries in the page table.) This is the core mechanism that underlies demand paging: when the process tries to access a memory location that is in this uninitialized state, a page fault occurs, and the physical memory is directly allocated. This corresponds to the bottom left case (alloc fresh empty page) in Figure 6.12.
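The lazy allocation is easy to observe with an anonymous mmap: the mapping call itself is cheap, and minor faults only appear as pages are first touched. A sketch, with an arbitrary 32MB mapping and an assumed 4KB page size:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    size_t size = 32 * 1024 * 1024;   /* 32MB, not yet backed by frames */
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long before = minor_faults();
    for (size_t i = 0; i < size; i += 4096)   /* touch one byte per page */
        p[i] = 1;
    printf("minor faults while touching pages: %ld\n", minor_faults() - before);

    munmap(p, size);
    return 0;
}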
The high-level layout of a process' virtual address space is specified by the mm_struct data structure. The process' task_struct instance contains a field that points to the relevant mm_struct. The definition of mm_struct is in the file include/linux/mm_types.h. It stores a linked list of vm_area_struct instances, which model virtual memory areas (VMAs).
The list of VMAs encapsulates a set of non-overlapping, contiguous blocks of memory. Each VMA has a start- and end-address, which are aligned with page boundaries. The vm_area_struct, also defined in include/linux/mm_types.h, has access permission flags, and prev and next pointers for the linked list abstraction. Reading from /proc/PID/maps simply traces the linked list of VMAs and prints out their metadata one-by-one; for instance, see Figure 6.10.
Each vm_area_struct also has a field for a backing file, in case this VMA is a memory-mapped file. If there is no file, this is an anonymous VMA which corresponds to an allocation of physical memory. When a page fault occurs for an address due to demand paging, the kernel looks up the relevant VMA data via the mm_struct pointer. Each VMA has an embedded set of function pointers wrapped in a vm_operations_struct. One of these entries points to a specific do_no_page function that implements the appropriate demand paging behavior for this block of memory: the invoked action might be allocating a fresh physical frame for an anonymous VMA, or reading data from a file pointer for a file-backed VMA.
Chapter 6 | Memory management
Operang Systems Foundaons with Linux on the Raspberry Pi
146
A process may use the madvise API call to provide hints to the kernel about when data is likely to be needed, or what kind of access pattern will be used for a particular area of memory: sequential or random access, for instance.
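For instance, a program that streams sequentially through a memory-mapped file (as in Listing 6.5.2) can tell the kernel so, allowing more aggressive read-ahead. The following is a sketch; error handling is minimal and the file name is supplied on the command line.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    struct stat st;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) != 0) { perror("open/fstat"); return 1; }

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint that the mapping will be read sequentially, so the kernel can
       read ahead aggressively and discard pages behind the access point. */
    madvise(data, st.st_size, MADV_SEQUENTIAL);

    long total = 0;
    for (off_t i = 0; i < st.st_size; i++)
        total += data[i];
    printf("checksum: %ld\n", total);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}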
6.5.7 Copy on write
When a child process is forked, it shares its parent's memory (although logically it has a distinct, isolated copy of the parent's virtual address space). The child process virtual address space maps to the same physical frames, until either parent or child tries to write some data. At that stage, a fresh frame is allocated dynamically for the writing process.
This copy on write mechanism is supported through duplicated page table entries between parent and child processes, page protection mechanisms, and sophisticated page fault handling, as outlined above. Copy on write leads to efficient process forking; child page allocation is deferred until data write operations occur, so pages are shared between parent and child until their data diverges through write operations.
For a simple example of copy on write activity, execute the source code below and check the measured time overheads for the buffer updates. Where the time is longest, copy on write paging activity is taking place.
Listing 6.5.3: Measuring overhead of copy on write activity    C
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 100000

void write_data(char *buffer, int size) {
    int i;
    static char x = 0;
    clock_t start, end;
    start = clock();
    for (i=0; i<size; i+=PAGE_SIZE)
        buffer[i] = x;
    x++;
    end = clock();
    printf("time taken: %f seconds\n",
           (double) (end-start) / CLOCKS_PER_SEC);
}

int main(int argc, char **argv) {
    static char buffer[NUM_PAGES*PAGE_SIZE];
    int res;

    printf("1st test - expect high time - pages allocating\n");
    write_data(buffer, sizeof buffer);

    switch (res = fork()) {
    case -1:
        fprintf(stderr,
                "Unable to fork: %s (errno=%d)\n",
                strerror(errno), errno);
        exit(EXIT_FAILURE);
    case 0: /* child */
        printf("child[%d]: 2nd test - expect high time - copy on write\n", getpid());
        write_data(buffer, sizeof buffer);
        printf("child[%d]: 3rd test - expect low time - pages available\n", getpid());
        write_data(buffer, sizeof buffer);
        exit(EXIT_SUCCESS);
    default: /* parent */
        printf("parent[%d]: waiting for child[%d] to finish\n",
               getpid(), res);
        wait(NULL); /* child runs before parent */
        printf("parent[%d]: 4th test - expect fairly low time - pages available "
               "but not in processor cache\n", getpid());
        write_data(buffer, sizeof buffer);
        exit(EXIT_SUCCESS);
    }
}
Copy on write is a widely used technique. For instance, check out online information about 'purely functional data structures' to see how copy on write is used to make high-level algorithms and data structures more efficient.
6.5.8 Out of memory killer
In the worst case, there is insufficient physical memory available to support all running processes. The kernel invokes a killer process (the OOM-killer) at this stage, to identify a victim process to be terminated, freeing up physical memory resource. Heuristics are used to identify memory-hogging processes; look at the integer value in /proc/PID/oom_score: higher numbers indicate more memory-hogging processes.
It's possible to invoke the OOM-killer manually. Run this memory-hogging Python script:
Listing 6.5.4: A memory-hogging script    Python
#!/usr/bin/python
import time
megabyte = (0,) * (1024 * 1024 // 8)
data = megabyte * 400
time.sleep(60)
and then execute these bash commands:
Listing 6.5.5: Trigger the OOM killer interactively    Bash
sudo chmod 777 /proc/sysrq-trigger   # to allow us to trigger special events
echo "f" > /proc/sysrq-trigger       # trigger OOM killer
dmesg                                # find out what happened
and observe that the OOM-killer is triggered and kills the Python runtime. Note the gruesome "kill process or sacrifice child" log message: the OOM-killer (mm/oom_kill.c) attempts to terminate child processes rather than parents where possible, to minimize system disruption.
Check whether the OOM-killer is frequently invoked on your system with something like:
sudo cat /var/log/messages | grep "oom-killer"
Bash
6.6 Process view of memory
A process has the abstraction of logical memory spaces, which are superimposed on the paged virtual address space as contiguous segments. The text segment contains the program code. This is generally loaded to known addresses by the runtime loader, reading data from the static ELF file. Text generally starts at a known address. For instance, invoke:
ld --verbose | grep start
Bash
to find this address for your system.
Data is usually located immediately after the text. This includes statically allocated data which may be initialized (the data section) or uninitialized (the bss section). The runtime heap comes after this data. The runtime stack, which supports function evaluation, parameter passing, and scoped variables, starts near the top of memory and grows downwards.
From a process perspective, there are three ways in which the OS might add new pages to the virtual address space while the program is running:
1. brk or sbrk extends the program break, effectively growing the heap (see the sketch after this list).
2. mmap allocates a new block of memory, possibly backed by a file.
3. The stack can grow down, as more functions are called in a dynamically nested scope; the stack expands on demand, managed as part of the page fault handler.
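The program break can be inspected and moved directly with sbrk, although ordinary programs normally leave this to malloc. A minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    void *before = sbrk(0);     /* current program break */
    sbrk(1024 * 1024);          /* grow the heap by 1MB */
    void *after = sbrk(0);
    printf("program break moved from %p to %p\n", before, after);
    return 0;
}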
Figure 6.14: Evolution of a process's user virtual address space, dynamic changes in red, lower addresses at the bottom.
(Figure 6.14 schematic: from address 0 upwards: text, data, heap growing upwards via brk(), an mmap() region, and the stack growing downwards with nested calls, below the 2GB or 3GB user-space limit.)
Figure 6.14 illustrates these three ways in which a process virtual address space may evolve. For a more concrete example of process interaction with memory, we can use a tool like valgrind to trace memory accesses at instruction granularity. The visualizations in Figure 6.15 show the sequence of memory accesses recorded by valgrind for an execution of the ls command. The precise command used to generate the trace is:
valgrind --tool=lackey --trace-mem=yes ls
Bash
6.7 Advanced topics
There are several memory management techniques to improve system security and defend against buffer overflow attacks. Address space layout randomization (ASLR) introduces random noise into the locations of executable code and runtime memory areas like the stack and heap. This unpredictability makes it more difficult for an attacker to vector to known code. The page metadata bit NX indicates a page is not executable. Again, this mitigates code injection attacks from user-input data.
Figure 6.15: Visualizations of memory access patterns for an invocation of the ls command, shown for the first 100,000 instructions; the high red/yellow line is the stack, the low blue line is the program executable text.
As single address space systems become larger, in terms of both the number of processor cores and the amount of installed physical memory, there is increasing variance in memory access latency. One reason for this is that some processor cores are located closer to particular RAM chips; perhaps a motherboard has several sockets, and integrated packages are plugged into each socket with RAM and processor cores. This arrangement is referred to as non-uniform memory access or NUMA.
Figure 6.16 shows an example NUMA system, based on the Cavium ThunderX2 Arm processor family. There are two NUMA regions (one per socket). Each region has tens of cores and a local bank of RAM. Physical memory is mapped to pages, as outlined above. There is a single address space, so every memory location is accessible from every core, but with different access latencies. Processor caches may hide some of the variance in memory access times, but NUMA caching protocols are complex. Writes to shared data can invalidate shared cache entries, forcing fresh data fetches from main memory.
Figure 6.16: The Isambard HPC facility uses Cavium ThunderX2 NUMA processors which support multiple sockets with a shared address space; note the distinctive memory banks surrounding each processor package. Photo by Simon McIntosh-Smith.
Linux has several schemes to optimize memory access for NUMA architectures. Memory allocation may be interleaved, so it is placed in a round-robin fashion across all the nodes; this ensures memory access times are uniform on average, assuming an equal distribution of probable accesses across the address space. Another allocation policy is node-local, which allocates memory close to the processor executing the malloc; this assumes the memory is likely to be accessed by threads running on cores in that same NUMA region.
You can determine whether your Linux system supports NUMA by executing:
numactl --hardware
Bash
and see how many nodes are reported. Most Arm systems (in particular, all Raspberry Pi boards) are not NUMA. However, multiple-socket motherboards will become increasingly common as core counts increase, tracking Moore's law in future years.
Another memory issue that affects large-scale servers, but may soon be apparent on smaller systems, is distributed memory. Protocols such as remote direct memory access (RDMA) enable pages to be transferred rapidly from other machines to the local machine, copying memory from a remote buffer to a local buffer with minimal OS intervention. This is useful for migration of processes or virtual machines in cloud data centers. In more general terms, direct memory access (DMA) is a technique for efficient copying of data between devices and memory buffers. We will cover DMA in more detail in Chapter 8 when we explore input/output. There is some additional complexity because many devices work entirely in terms of physical memory addresses, since they operate outside of the processor's virtual addressing domain.
Next-generation systems may feature non-volatile memory (NV-RAM). Whereas conventional volatile RAM loses its data when the machine is powered down, NV-RAM persists data (like flash drives or hard disks, but with faster access times). NV-RAM is byte-addressable and offers significant new features for OSs, such as immediate restart of processes or entire systems, and full persistence for in-memory databases.
6.8 Further reading
The Understanding the Linux Kernel textbook has helpful chapters on memory management, disk caches, memory mapping, and swapping [1].
Gorman's comprehensive documentation on memory management in Linux [2] is a little dated (based on kernel version 2.6) but still contains plenty of relevant and valuable material, including source code commentary. It is the definitive overview of the complex virtual memory management subsystem in the Linux kernel.
Details of more recent kernel changes are available at the Linux Memory Management wiki, https://linux-mm.org.
To learn about Arm hardware support for memory management, consult Furber's Arm System-on-Chip textbook [3] for a generic overview, or the appropriate Arm architecture reference manual for specific details.
6.9 Exercises and questions
6.9.1 How much memory?
Calculate the size of your Raspberry Pi system's physical memory by writing a short C program.
Listing 6.9.1: Compute the size of physical memory    C
#include <stdio.h>
#include <sys/sysinfo.h>
#include <unistd.h>

int main() {
    int pages = get_phys_pages();
    int pagesize = getpagesize(); /* in bytes */
    double ramGB = ((double)pages * (double)pagesize / 1024 / 1024 / 1024);
    printf("RAM Size %.2f GB, Page Size %d B\n", ramGB, pagesize);
    return 0;
}
How does the value reported compare with the system memory size stated by (a) /proc/meminfo and (b) the official system documentation? Can you account for any discrepancies?
6.9.2 Hypothetical address space
Consider a 20-bit virtual address space, with pages of size 1KB.
1. Assuming byte-addressable memory, how many bits are required for a page/frame offset?
2. How many bits does this leave for specifying the page number?
3. Assume the page index bitstring is split into two equal portions, for first-level and second-level page table indexing. How many first-level page tables should there be?
4. What is the maximum number of second-level page tables?
5. How many individual entries will there be in each page table?
6. What is the space overhead of this hierarchical page table, as opposed to a single-level page table, when all pages are mapped to frames?
7. What is the space-saving of this hierarchical page table, as opposed to a single-level page table, when only one page is mapped to a frame, i.e., there is a single entry in the page table mapping?
6.9.3 Custom memory protection
The mprotect library function allows you to set page-level protection (read, write, execute) for allocated memory in user space. See man mprotect for more details. Sketch a scenario when a developer may want to change page permissions:
1. From read/write to read-only, once a data structure has been initialized;
2. To make a page executable, once its data has been populated.
6.9.4 Inverted page tables
The simplest variant of an inverted page table contains one entry per frame. Each entry stores an address space identifier (ASID) to record which process is currently occupying this frame, along with the virtual address corresponding to this physical address. Metadata permission bits may also be stored with each entry. To look up a virtual address, it is only necessary to check whether the address is present in any table entry, by looking up all table entries at once. This content-addressable approach is how TLBs work, since hardware support makes it possible to check all entries simultaneously.
1. What is the main problem with supporting inverted page tables entirely in software, using an in-memory data structure for the table?
2. Can you think of a more efficient solution for inverted page table storage in software?
6.9.5 How much memory?
Assume an OS is running p processes, and the platform has an n-level hierarchical page table. Each node (including leaf nodes) in the page table occupies a single page. The page size is large enough to store at least n address entries in a page table node:
1. How many pages would all the page tables occupy if each process has a single page of data in its virtual address space?
2. What is the smallest number of pages occupied by all the page tables if each process has n pages of data in its virtual address space?
3. What is the largest number of pages occupied by all the page tables if each process has n pages of data in its virtual address space?
6.9.6 Tiny virtual address space
Imagine a system with an 8-bit, byte-addressable physical address space:
1. How many bytes of memory will there be?
2. For this system, consider using a virtual addressing scheme with single-level paging. If each page contains 16 bytes, how many pages will there be?
3. In the worst case, what happens to memory access latency in a virtual addressing environment with a single-level page table, with respect to physical addressing?
4. What could be done to mitigate this worst-case memory access latency?
5. In practice, why is it unlikely that 8-bit memory would feature a virtual addressing scheme?
6.9.7 Definitions quiz
Match the following concepts with their definitions:
Concepts:
1. Swap file
2. Page table entry
3. Thrashing
4. MMU
5. Address space randomization
Definitions:
1. Contains translation data and protection metadata for one or more pages.
2. When the OS perturbs regions of memory to ensure unpredictable addresses for key data elements.
3. When a system cannot make useful progress since almost every memory access requires pages to be swapped from the backing store.
4. Backing storage for pages that are not resident in memory.
5. A specialized hardware unit that maintains the abstraction of virtual address spaces from the point-of-view of the processor.
References
[1] D. P. Bovet and M. Cesati, Understanding the Linux Kernel, 3rd ed. O'Reilly, 2005.
[2] M. Gorman, Understanding the Linux Virtual Memory Manager. Prentice Hall, 2004, https://www.kernel.org/doc/gorman/
[3] S. Furber, ARM System-on-Chip Architecture, 2nd ed. Pearson, 2000.
Chapter 7
Concurrency and parallelism
7.1 Overview
In this chapter, we discuss how the OS supports concurrency, how the OS can assist in exploiting hardware parallelism, and how the OS support for concurrency and parallelism can be used to write parallel and concurrent programs. We look at OS support for concurrent and parallel programming via POSIX threads and present an overview of practical parallel programming techniques such as OpenMP, MPI, and OpenCL.
The exercises in this chapter focus on POSIX thread programming to explore the concepts of concurrency, shared resource access, and parallelism, and programming using OpenCL to expose the student to practical parallel heterogeneous programming.
What you will learn
After you have studied the material in this chapter, you will be able to:
1. Relate definitions to the programmer's view of concurrency and parallelism.
2. Discuss programming primitives and APIs to handle concurrency, and the OS and hardware support for them.
3. Use the POSIX programming API to exploit parallelism, and the OS and hardware support for it.
4. Compare and contrast data- and task-parallel programming models.
5. Illustrate by example the popular parallel programming APIs.
7.2 Concurrency and parallelism: definitions
To understand the implications and properties, first of all, we need clear definitions of concurrency and parallelism.
7.2.1 What is concurrency?
Concurrency means that more than one task is running concurrently (at the same time) on the system. In other words, concurrency is a property of the workload rather than the system, provided that the system has support for running more than one task at the same time. In practice, one of the key reasons to have an OS is to support concurrency through scheduling of tasks on a single shared CPU.
7.2.2 What is parallelism?
Parallelism, by contrast, can be viewed as a property of the system: when a system has more than one CPU core, it can execute several tasks in parallel, even if there is no scheduler to time-slice the tasks. If the kernel supports hardware parallelism, it will try to speed up the execution of tasks by making use of the available parallel resources.
7.2.3 Programming model view
Another way of dening the terms parallelism and concurrency is as programming models. In praccal
terms, concurrent programming is about user experience and parallel programming about performance.
In a concurrent program, several threads of operation are running at the same time because the user
expects several actions to be happening at the same time. For example, a web browser must at least
have a thread for networking, one for rendering the pages, and one for user interactions (mouse clicks,
keyboard input). If these threads were not concurrent, the browser would not be usable.
By contrast, in a parallel program, the work that would be performed on a single
CPU is split up and handed to multiple CPUs, which execute each part in parallel. We can further distinguish
between task parallelism and data parallelism. Task parallelism means that every CPU core will perform a
different part of the computation; for example, the steps in an image processing pipeline. Data parallelism
means that every CPU core will perform the same computation but on a different part of the data. If we
run a parallel program on a single-core system, the only effect will be that it runs slower.
Because effectively parallel programs execute concurrent threads, many of the issues of concurrent
programs are also encountered in parallel programming.
7.3 Concurrency
In this section, we have a closer look at concurrency: the issues arising from concurrency and the techniques
to address them; support for concurrency in the hardware and the OS, and the POSIX programming API.
7.3.1 What are the issues with concurrency?
There are two factors which can lead to issues when several tasks are running concurrently: shared
resources and exchange of information between tasks.
Shared resources
When concurrent tasks share a resource, then access to that resource needs to be controlled to avoid
undesirable behavior. A very clear illustration of this problem is a shared section of railroad track, as
shown in Figure 7.1. Clearly, uncontrolled access could lead to disaster. Therefore, points in a railway
system are protected by semaphore signals. The single-track section is the shared resource as it is
required by any trains traveling on the four tracks leading to it. When the signal indicates “Clear,” the
train can use the shared section, at which point the signal will change to “Stop.” Any train wanting to
use the shared section will have to wait until the train occupying it has left, and the signal is “Clear”
again. We will discuss the OS equivalent in Section 7.3.3.
Figure 7.1: Shared railroad track section with points and semaphores.
In a computer system, there are many possible shared resources: the file system, IO devices, memory. Let's
first consider the case of a shared file, e.g., a file in your home directory will be accessible by all
processes owned by you. Slightly simplifying, when a process opens a file, the file content is read into
the memory space of the process. When the process closes the file again, any changes will be written
back to disk. If two or more processes access the file concurrently for writing, there is a potential for
conflict: the changes made by the last process to write back to disk will overwrite all previous changes.
Therefore, most editors will warn if a file was modified by another process while open in the editor.
Figure 7.2: Concurrent access to a shared file using echo and vim.
For example, in Figure 7.2, I opened a file test_shared_access.txt in vim (right pane), and while
it was open, I modified it using echo (left pane). As you can see, when I then tried to save the file in
the editor, it warned me the file had been changed.
In the case of shared files, most operating systems leave the access control to the user. In the previous
example, before writing, vim checked the file on disk for changes. This is not possible for shared access
to IO devices because they are controlled by the operating system. In a way, this makes access control
easy because the OS can easily keep track of the processes using an IO resource. Shared access to
memory is more problematic. As we have seen, memory is not normally shared between processes;
each process has its own memory space. However, in a multithreaded process, the memory is shared
between all threads. In this case, again, the operating system leaves the access control to the user, i.e.,
the programmer of the application. However, the operating system and the hardware provide support
for access control. The OS uses the hardware support to implement its internal mechanisms, and these
are used to implement the programmer API (e.g., POSIX pthreads).
Exchange of information
When concurrent tasks need to exchange information that is required to continue execution, there
is a need to make sure that the receiver either waits for the message (if it is sent late) or stores the
message until needed (if it was sent early). Furthermore, if two or more tasks require information
from one another, care must be taken to avoid deadlock, i.e., the case where all tasks are waiting for
the other tasks so no tasks can continue execution. In the case of communication between threads
in a multithreaded process, the communication occurs via the shared memory. For communication
between processes, there are a number of possibilities, e.g., communication using shared files,
operating system pipes, network sockets, operating system message queues, or even shared memory.
The issues with the exchange of information in concurrent processes can be best explained using the
producer-consumer problem: each process is either a producer or a consumer of information. Ideally, any
item of information produced by a producer would be immediately consumed by a consumer. However,
in general, the rate of progress of producers and consumers is different (i.e., they are not operating
synchronously). Therefore, information needs to be buffered, either by the consumer or by the producer.
In practice, buffering capacity is always limited (the problem is therefore also known as the bounded buffer
problem), so at some point, it is possible that the producer will have to suspend execution until there is
sufficient buffer capacity for the information to be produced. Note that in general there can be more than
one buffer (e.g., it is common for each consumer to have a buffer per producer).
Eecvely, the buer is a shared resource, so in terms of access control, informaon exchange, and
resource sharing are eecvely the same problem. This is also true for synchronizaon: the consumer
needs the informaon from the producer(s) in order to progress. So as long as that informaon is not
there, the producer has to wait. This is also the case with shared resources; for example, trains have
to wait unl the shared secon of track is free. In other words, control of access to shared resources
and synchronizaon of the exchange of informaon are just two dierent views on the same problem.
Consequently, there will be a single set of mechanisms that can be used to address this problem.
Note that if there are mulple producers and/or consumers, and a single shared resource, the problem is
usually known as the reader-writer problem, and has to address the concurrent access to the shared resource.
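As a minimal illustration of information exchange between processes, the sketch below (our own example, not from the text) lets a producer process send integers to a consumer process through an OS pipe; the kernel provides the bounded buffer, blocking the writer when the buffer is full and the reader when it is empty.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); exit(1); }

    if (fork() == 0) {                  /* child process: the consumer */
        close(fd[1]);
        int item;
        while (read(fd[0], &item, sizeof item) == sizeof item)
            printf("consumed %d\n", item);   /* read() blocks while the pipe is empty */
        exit(0);
    }
    /* parent process: the producer */
    close(fd[0]);
    for (int item = 0; item < 10; item++)
        write(fd[1], &item, sizeof item);    /* write() blocks if the pipe buffer is full */
    close(fd[1]);                            /* consumer's read() then returns 0 (EOF) */
    wait(NULL);
    return 0;
}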
7.3.2 Concurrency terminology
When discussing synchronization and shared resources, it is useful to define some additional terms
and concepts.
Critical section
A critical section for a shared resource is that portion of a program which accesses the resource
in such a way that multiple concurrent accesses would lead to undefined or erroneous behavior.
Therefore, for a given shared resource, only one process can be executing its critical section at
a time. The critical section is said to be protected if the access to it is controlled in such a way that
the behavior is well-defined and correct.
Synchronization
In this context, by synchronization, we mean synchronization between concurrent threads of
execution. When multiple processes need to exchange information, synchronization of the processes
results in a well-defined sequence of interactions.
Deadlock
Deadlock is the state in which each process in a group of communicating processes is waiting for
a message from another process in order to proceed with an action. Alternatively, in a group of
processes with shared resources, there will be deadlock if each process is waiting for another process
to release the resource that it needs to proceed with the action.
A classic example of how the problem can occur is the so-called dining philosophers problem.
Slightly paraphrased, the problem is as follows:
Five philosophers sit around a round table with a bowl of noodles in front of each and a chopstick
between each of them.
Each philosopher needs two chopsticks to eat the noodles.
Each philosopher alternately:
thinks for a while,
picks up two chopsticks,
eats,
puts down the chopsticks.
It is clear that there is potential for deadlock here because there are not enough chopsticks for all
philosophers to eat at the same time. If, for example, they all first took the left chopstick and then
tried to take the right one (or vice versa), there would be deadlock. So how do they ensure there is no
deadlock?
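One classical answer (a sketch of ours, not the text's own solution, using the POSIX mutexes introduced later in Section 7.3.6) is to impose a global order on the chopsticks and have every philosopher pick up the lower-numbered chopstick first, so that a cycle of waiting philosophers can never form:

#include <pthread.h>

#define N 5
static pthread_mutex_t chopstick[N];   /* one mutex per chopstick */

static void *philosopher(void *arg)
{
    int i = *(int *)arg;
    int left = i, right = (i + 1) % N;
    /* Always lock the lower-numbered chopstick first: this breaks the
       circular-wait condition and therefore prevents deadlock. */
    int first  = left < right ? left : right;
    int second = left < right ? right : left;

    for (;;) {
        /* think() */
        pthread_mutex_lock(&chopstick[first]);
        pthread_mutex_lock(&chopstick[second]);
        /* eat() */
        pthread_mutex_unlock(&chopstick[second]);
        pthread_mutex_unlock(&chopstick[first]);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[N];
    int id[N];
    for (int i = 0; i < N; i++)
        pthread_mutex_init(&chopstick[i], NULL);
    for (int i = 0; i < N; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, philosopher, &id[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Compile with gcc -pthread. Other classic solutions (for example, allowing at most four philosophers at the table at once) work equally well.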
Edsger Dijkstra (May 11, 1930 – Aug. 6, 2002) was a Dutch
computer scientist. He did his Ph.D. research at the University
of Amsterdam's Mathematical Center (1952–62). He taught and
researched at the Technical University of Eindhoven from 1963
to 1973 and at the University of Texas from 1984 onwards.
He was widely known for his 1959 algorithm that solves the
shortest-path problem. This algorithm is still used to determine
the shortest path between two points, in particular for routing
in communication networks. In the course of his research on
mutual exclusion in communications he suggested in 1962 the
concept of computer semaphores. His famous letter to CACM
in 1968, "Go To Statement Considered Harmful," was very influential
in the development of structured programming. He received the
Turing Award in 1972.
Image ©2002 Hamilton Richards
www.cs.utexas.edu/users/EWD/
7.3.3 Synchronization primitives
In 1962, the famous Dutch computer scientist Edsger Dijkstra wrote a seminal – though interestingly,
technically unpublished – article titled “Over seinpalen” [1], i.e., “About Semaphores,” in which he
introduced the concept of semaphores as a mechanism to protect a shared resource. In Dijkstra's
article, a semaphore S is a special type of shared memory, storing a non-negative integer. To access the
semaphore register, Dijkstra proposes two operations, V(S), which stands for “verhoog,” i.e., increment,
and P(S), which stands for “prolaag,” i.e., try to decrement. The P(S) operation will block until the value
of S has been successfully decremented. Both operations must be atomic.
If the semaphore can only take the values 0 or 1, Dijkstra specifically mentions the railway analogy,
where the V-operation means “free the rail track” and the P-operation “try to pass by the semaphore
onto the single track”, and that this is only possible if the semaphore is set to “Safe” and passing it
implies setting it to “Unsafe”.
Dijkstra calls a binary semaphore a mutex (mutual exclusion lock) [2]; a non-binary semaphore is
sometimes called a counting semaphore. Although there is no general agreement on this definition,
the definitions in the Arm Synchronization Primitives Development Article [3] agree with this:
Mutex A variable, able to indicate the two states locked and unlocked. Attempting to lock a mutex already in
the locked state blocks execution until the agent holding the mutex unlocks it. Mutexes are sometimes called
locks or binary semaphores.
Semaphore A counter that can be atomically incremented and decremented. Attempting to decrement
a semaphore that holds a value of less than 1 blocks execution until another agent increments the semaphore.
The key requirement of Dijkstra's semaphores is the atomicity of the operation. Modern processors
provide special atomic instructions that allow implementing semaphores efficiently.
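To make the semantics concrete, here is a toy sketch (our own, not Dijkstra's formulation) of P and V on a counting semaphore using C11 atomics for the required atomicity; it busy-waits, whereas a real implementation would put the caller to sleep:

#include <stdatomic.h>

typedef struct { atomic_int value; } sema_t;

static void V(sema_t *s)          /* "verhoog": increment */
{
    atomic_fetch_add(&s->value, 1);
}

static void P(sema_t *s)          /* "prolaag": try to decrement, block until it succeeds */
{
    for (;;) {
        int v = atomic_load(&s->value);
        if (v > 0 && atomic_compare_exchange_weak(&s->value, &v, v - 1))
            return;               /* decremented atomically */
        /* otherwise the value was 0, or another thread won the race: retry */
    }
}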
7.3.4 Arm hardware support for synchronization primitives
Exclusive operations and monitors
The ARMv6 architecture introduced the Load-Exclusive and Store-Exclusive synchronization
primitives, LDREX and STREX, in combination with a hardware feature called an exclusive monitor.
Quoting from the Arm Synchronization Primitives Development Article [3]:
LDREX The LDREX instruction loads a word from memory, initializing the state of the exclusive monitor(s)
to track the synchronization operation. For example, LDREX R1, [R0] performs a Load-Exclusive from the
address in R0, places the value into R1 and updates the exclusive monitor(s).
STREX The STREX instruction performs a conditional store of a word to memory. If the exclusive monitor(s)
permit the store, the operation updates the memory location and returns the value 0 in the destination
register, indicating that the operation succeeded. If the exclusive monitor(s) do not permit the store,
the operation does not update the memory location and returns the value 1 in the destination register.
This makes it possible to implement conditional execution paths based on the success or failure of the
memory operation. For example, STREX R2, R1, [R0] performs a Store-Exclusive operation to the address
in R0, conditionally storing the value from R1 and indicating success or failure in R2.
Exclusive monitors An exclusive monitor is a simple state machine, with the possible states open and exclusive.
To support synchronization between processors, a system must implement two sets of monitors, local and global
(Figure 7.3). A Load-Exclusive operation updates the monitors to exclusive state. A Store-Exclusive operation
accesses the monitor(s) to determine whether it can complete successfully. A Store-Exclusive can succeed only
if all accessed exclusive monitors are in the exclusive state.
Figure 7.3: Local and global monitors in a multi-core system (from [3]).
The LDREX and STREX instructions are used by the Arm-specific Linux kernel code to implement the
kernel-specific synchronization primitives which in their turn are used to implement POSIX synchronization
primitives. For example, include/asm/spinlock.h implements spin lock functionality for the Arm
architecture, and this is used in the non-architecture-specific implementation in include/linux/spinlock.h.
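As an illustration of how these instructions are used (a sketch under our own assumptions, not the kernel's actual code), the following C function acquires a simple spin lock on ARMv7-A with an LDREX/STREX retry loop, followed by a DMB barrier before the critical section is entered:

/* Illustrative only: acquire a spin lock (0 = free, 1 = held) using
 * LDREX/STREX via GCC inline assembly on ARMv7-A. */
static inline void spin_acquire(volatile unsigned int *lock)
{
    unsigned int tmp, failed;
    __asm__ __volatile__(
    "1:     ldrex   %0, [%2]\n"      /* read lock value, set exclusive monitor  */
    "       cmp     %0, #0\n"        /* already held?                           */
    "       bne     1b\n"            /* yes: spin                               */
    "       mov     %0, #1\n"
    "       strex   %1, %0, [%2]\n"  /* try to store 1; %1 is 0 on success      */
    "       cmp     %1, #0\n"
    "       bne     1b\n"            /* monitor lost exclusivity: retry         */
    "       dmb     ish\n"           /* barrier before the critical section     */
    : "=&r" (tmp), "=&r" (failed)
    : "r" (lock)
    : "cc", "memory");
}

Releasing the lock is then a barrier followed by an ordinary store of 0. On AArch64 the corresponding exclusive instructions are LDXR/STXR (or their acquire/release forms LDAXR/STLXR).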
Shareability domains
In the context of cache-coherent symmetric multiprocessing (SMP), the Arm system architecture uses
the concept of shareability domains [4], which can be Inner Shareable, Outer Shareable, System,
or Non-shareable, as illustrated in Figure 7.4. These domains are mainly used to restrict the range of
memory barriers, as discussed in Section 7.3.5.
Figure 7.4: Shareability domains in an Arm manycore system, based on [5].
The architectural definition of these domains is that they enable us to define sets of observers for
which the shareability makes the data transparent for accesses. The Inner domain shares both code
and data, i.e., in practice a multicore system running an instance of an operating system will be in
the Inner domain; the Outer domain shares data but not code, and as shown in the figure could, for
example, contain a GPU, or a DSP or DMA engine. Marking a memory region as non-shareable means
that the local agent (core) does not share this region at all. This domain is not typically used in SMP
systems. Finally, if the domain is set to System, then an operation on it affects all agents in the system.
For example, a UART interface would not normally be put in a shareable domain, so its domain would
be the full system.
7.3.5 Linux kernel synchronization primitives
The Linux kernel implements a large number of synchronization primitives; we discuss here only
a selection.
Atomic primitives
The Linux kernel implements a set of atomic operations known as read-modify-write (RMW)
operations. These are operations where a value is read from a memory location, modified, and then
written back, with the guarantee that no other write will occur to that location between the read and
the write (hence the name atomic).
Most RMW operations in Linux fall into one of two classes: those that operate on the special
atomic_t or atomic64_t data type, and those that operate on bitmaps, either stored in an
unsigned long or in an array of unsigned long.
The basic set of RMW operations that are implemented individually for each architecture are known
as "atomic primitives." As a kernel developer, you would use these to write architecture-independent
code such as a file system or a device driver.
As these primitives work on atomic types or bitmaps, let's first have a look at these. The atomic types
are defined in types.h and they are actually simply integers wrapped in a struct:
Lisng 7.3.1: Linux kernel atomic types C
1 typedef struct {
2 int counter;
3 } atomic_t;
4
5 #ifdef CONFIG_64BIT
6 typedef struct {
7 long counter;
8 } atomic64_t;
9 #endif
The reason for this is that the atomic types should be defined as a signed integer but should also be
opaque so that a cast to a normal C integer type will fail.
The simplest operations on atomic types are initialization, read, and write, defined for the arm64
architecture in include/asm/atomic.h as:
Lisng 7.3.2: Linux kernel atomic type operaons (1) C
1 #dene ATOMIC_INIT(i) { (i) }
2
3 #dene atomic_read(v) READ_ONCE((v)->counter)
4 #dene atomic_set(v, i) WRITE_ONCE(((v)->counter), (i))
The READ_ONCE and WRITE_ONCE macros are defined in include/linux/compiler.h and are not
architecture-specific. Their purpose is to stop the compiler from merging or refetching reads or writes
or reordering occurrences of statements using these macros. We present them here purely to show
how non-trivial it is to stop a C compiler from optimizing.
Lisng 7.3.3: Linux kernel atomic type operaons (2) C
1 #include <asm/barrier.h>
2 #dene __READ_ONCE(x, check) \
3 ({ \
4 union { typeof(x) __val; char __c[1]; } __u; \
5 if (check) \
6 __read_once_size(&(x), __u.__c, sizeof(x)); \
7 else \
8 __read_once_size_nocheck(&(x), __u.__c, sizeof(x)); \
9 smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
10 __u.__val; \
11 })
12 #dene READ_ONCE(x) __READ_ONCE(x, 1)
13
14 #dene WRITE_ONCE(x, val) \
15 ({ \
16 union { typeof(x) __val; char __c[1]; } __u = \
17 { .__val = (__force typeof(x)) (val) }; \
18 __write_once_size(&(x), __u.__c, sizeof(x)); \
19 __u.__val; \
20 })
Bitmaps are, in a way, simpler, as they are simply arrays of native-size words. The Linux kernel provides
the macro DECLARE_BITMAP() to make it easier to create a bitmap:
Lisng 7.3.4: Linux kernel bitmap C
1 #dene DECLARE_BITMAP(name,bits) \
2 unsigned long name[BITS_TO_LONGS(bits)]
Here, BITS_TO_LONGS returns the number of words required to store the given number of bits.
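As a usage sketch (the device table and its names are our own, hypothetical example), a driver could declare a bitmap of 128 bits and manipulate it with the set_bit()/clear_bit() operations described next:

#include <linux/bitops.h>
#include <linux/types.h>

#define MAX_DEVICES 128                 /* hypothetical limit */

static DECLARE_BITMAP(device_in_use, MAX_DEVICES);  /* BITS_TO_LONGS(128) words */

static void mark_device_busy(int id)
{
    set_bit(id, device_in_use);         /* atomic RMW on a single bit */
}

static void mark_device_free(int id)
{
    clear_bit(id, device_in_use);
}

static bool device_busy(int id)
{
    return test_bit(id, device_in_use); /* a plain read, not an RMW */
}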
The most common operations on bitmaps are set_bit() and clear_bit(), which for the arm64
architecture are defined in include/asm/bitops.h as:
Lisng 7.3.5: Linux kernel bitmap operaons (1) C
1 #ifndef CONFIG_SMP
2 / *
3 * The __* form of bitops are non-atomic and may be reordered.
4 */
5 #dene ATOMIC_BITOP(name,nr,p) \
6 (__builtin_constant_p(nr) ? ____atomic_##name(nr, p) : _##name(nr,p))
7 #else
8 #dene ATOMIC_BITOP(name,nr,p) _##name(nr,p)
9 #endif
10
11 / *
12 * Native endian atomic denitions.
13 */
14 #dene set_bit(nr,p) ATOMIC_BITOP(set_bit,nr,p)
15 #dene clear_bit(nr,p) ATOMIC_BITOP(clear_bit,nr,p)
16 }
The actual atomic operations used in these macros are defined in include/asm/bitops.h as:
Lisng 7.3.6: Linux kernel bitmap operaons (2) C
1 / *
2 * These functions are the basis of our bit ops.
3 *
4 * First, the atomic bitops. These use native endian.
5 */
6 static inline void ____atomic_set_bit(unsigned int bit, volatile unsigned long *p)
7 {
8 unsigned longags;
9 unsigned long mask = BIT_MASK(bit);
10
11 p += BIT_WORD(bit);
12
13 raw_local_irq_save(ags);
14 *p |= mask;
15 raw_local_irq_restore(ags);
16 }
17
18 static inline void ____atomic_clear_bit(unsigned int bit, volatile unsigned long *p)
19 {
20 unsigned longags;
21 unsigned long mask = BIT_MASK(bit);
22
23 p += BIT_WORD(bit);
24
25 raw_local_irq_save(ags);
26 *p &= ~mask;
27 raw_local_irq_restore(ags);
28 }
The interesting point here is that the atomic behavior is achieved by masking the interrupt requests
and then restoring them, through the use of the architecture-independent functions
raw_local_irq_save() and raw_local_irq_restore(). The architecture-specific implementation of these
functions for AArch64 is also provided in include/asm/bitops.h:
Lisng 7.3.7: Atomic behavior through masking interrupt requests C
1 / *
2 * Aarch64 has ags for masking: Debug, Asynchronous (serror), Interrupts and
3 * FIQ exceptions, in the 'daif' register. We mask and unmask them in 'dai'
4 * order:
5 * Masking debug exceptions causes all other exceptions to be masked too/
6 * Masking SError masks irq, but not debug exceptions. Masking irqs has no
7 * side eects for other ags. Keeping to this order makes it easier for
8 * entry.S to know which exceptions should be unmasked.
9 */
10
11 / *
12 * CPU interrupt mask handling.
13 */
14 static inline unsigned long arch_local_irq_save(void)
15 {
16 unsigned longags;
17 asm volatile(
18 "mrs %0, daif // arch_local_irq_save\n"
19 "msr daifset, #2"
20 : "=r"(ags)
21 :
22 : "memory");
23 returnags;
24 }
25
26 / *
27 * restore saved IRQ state
28 */
29 static inline void arch_local_irq_restore(unsigned longags)
30 {
31 asm volatile(
32 "msr daif, %0 // arch_local_irq_restore"
33 :
34 : "r"(ags)
35 : "memory");
36 }
Masking interrupts is a simple and effective mechanism to guarantee atomicity on a single-core
processor because the only way another thread could interfere with the operation would be through
an interrupt. On a multicore processor, it is in principle possible that a thread running on another core
would access the same memory location. Therefore, this mechanism is not useful outside the kernel.
If you use it in kernel code, it is assumed that you know what you're doing; that is also why the routines
have _local_ in their name, to indicate that they only operate on interrupts for the local CPU.
A nice overview of the API for operations on atomic types can be found in the Linux kernel
documentation in the files atomic_t.txt and atomic_bitops.txt. The operations can be divided into
non-RMW and RMW. The former are read, set, read_acquire and set_release; the latter are arithmetic,
bitwise, swap, and reference count operations. Furthermore, each of these comes in an atomic_
and atomic64_ variant, as well as variants to indicate whether there is a return value or not, and whether the
fetched rather than the stored value is returned. Finally, they all come with relaxed, acquire, and
release variants, which need a bit more explanation.
Memory operation ordering
On a symmetric multiprocessing (SMP) system, accesses to memory from different CPUs are in principle
not ordered. We say that the memory operation ordering is relaxed. Very often, some degree of ordering
is required. The default for the Linux kernel is to impose a strict overall order via what is called a memory
barrier. Strictly speaking, a memory barrier imposes a perceived partial ordering over the memory operations
on either side of the barrier. To quote from the Linux kernel documentation (memory-barriers.txt),
Such enforcement is important because the CPUs and other devices in a system can use a variety of tricks
to improve performance, including reordering, deferral, and combination of memory operations; speculative
loads; speculative branch prediction and various types of caching. Memory barriers are used to override or
suppress these tricks, allowing the code to sanely control the interaction of multiple CPUs and/or devices.
The kernel provides the memory barriers smp_mb__{before,after}_atomic(), and in practice the
strict operation is composed of a relaxed operation preceded and followed by a barrier, for example:
Lisng 7.3.8: Linux kernel atomic operaon through barriers C
1 atomic_fetch_add();
2 // is equivalent to :
3 smp_mb before_atomic();
4 atomic_fetch_add_relaxed();
5 smp_mb after_atomic();
Between relaxed and strictly ordered there are two other possible semantics, called acquire and release.
Acquire semantics applies to RMW operations and load operations that read from shared memory
(read-acquire), and it prevents memory reordering of the read-acquire with any read or write operation
that follows it in program order.
Release semantics applies to RMW operations and store operations that write to shared memory
(write-release), and it prevents memory reordering of the write-release with any read or write operation
that precedes it in program order.
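The classic use of this pair is message passing between two CPUs. The kernel-style sketch below (our own illustration, using the kernel's real smp_store_release()/smp_load_acquire() accessors with made-up variables) publishes a payload and a ready flag so that a reader who sees the flag is guaranteed to see the payload:

static int payload;     /* data being handed over (illustrative) */
static int ready;       /* 0 = not published yet, 1 = published  */

/* Producer, running on CPU 0 */
void publish(int value)
{
    payload = value;                   /* plain store                          */
    smp_store_release(&ready, 1);      /* write-release: the payload store
                                          cannot be reordered after this       */
}

/* Consumer, running on CPU 1 */
int consume(void)
{
    while (!smp_load_acquire(&ready))  /* read-acquire: the payload read
                                          cannot be reordered before this      */
        cpu_relax();
    return payload;                    /* guaranteed to observe the new value  */
}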
Table 7.1 provides a summary of the possible cases.
Table 7.1: Memory operation ordering semantics.
Type of operation                                  Ordering
Non-RMW operations                                 Unordered
RMW operations without a return value              Unordered
RMW operations with a return value                 Fully ordered
RMW operations with an explicit ordering:
  {operation name}_relaxed                         Unordered
  {operation name}_acquire                         RMW read is an ACQUIRE
  {operation name}_release                         RMW write is a RELEASE
Memory barriers
The memory barriers smp_mb__{before,after}_atomic() are not the only types of barrier
provided by the Linux kernel. We can distinguish the following types [6]:
General barrier
A general barrier (barrier() from include/linux/compiler.h) has no effect at runtime; it only serves
as an instruction to the compiler to prevent reordering of memory accesses from one side of this
statement to the other. For the gcc compiler, this is implemented in the kernel code as
Listing 7.3.9: Linux kernel general barrier C
#define barrier() __asm__ __volatile__("": : :"memory")
Mandatory barriers
To enforce memory consistency on a full system level, you can use mandatory barriers. This is most
common when communicating with external memory-mapped peripherals. The kernel mandatory
barriers are guaranteed to expand to at least a general barrier, independent of the target architecture.
The Linux kernel has three basic mandatory CPU memory barriers:
GENERAL mb() A full system memory barrier. All memory operations before the mb() in the instruction
stream will be committed before any operations after the mb() are committed. This ordering will be visible to
all bus masters in the system. It will also ensure the order in which accesses from a single processor reach
slave devices.
WRITE wmb() Like mb(), but only guarantees ordering between write accesses: all write operations before
a wmb() will be committed before any write operations after the wmb().
READ rmb() Like mb(), but only guarantees ordering between read accesses: all read operations before
an rmb() will be committed before any read operations after the rmb(). [6]
For the Arm AArch64 architecture, these barriers are implemented in arm64/include/asm/barrier.h as:
Lisng 7.3.10: Arm implementaon of kernel memory barriers (1) C
1 #dene mb() dsb(sy)
2 #dene rmb() dsb(ld)
3 #dene wmb() dsb(st)
with the dsb() macro implemented in arm/include/asm/barrier.h as:
Lisng 7.3.11: Arm implementaon of kernel memory barriers (2) C
1 #dene isb(option) __asm__ __volatile__ ("isb " #option : : : "memory")
2 #dene dsb(option) __asm__ __volatile__ ("dsb " #option : : : "memory")
3 #dene dmb(option) __asm__ __volatile__ ("dmb " #option : : : "memory")
Here, DMB, DSB, and ISB are respectively the Data Memory Barrier, Data Synchronization Barrier,
and Instruction Synchronization Barrier instructions [7]. In particular, DSB acts as a special kind of
memory barrier: no instruction occurring after it in program order executes until the DSB
instruction has completed. The DSB instruction completes when all explicit memory accesses before
this instruction have completed (and all cache, branch predictor and TLB maintenance operations
before this instruction have completed).
The argument SY indicates a full system DSB operation; LD is a DSB operation that waits only for
loads to complete, and ST is a DSB operation that waits only for stores to complete.
SMP conditional barriers
The SMP conditional barriers are used to ensure a consistent view of memory between different cores
within a cache-coherent SMP system. When compiling a kernel without CONFIG_SMP, SMP barriers
are converted into plain general (i.e., compiler) barriers. Note that this means that SMP barriers cannot
replace a mandatory barrier, but a mandatory barrier can replace an SMP barrier.
The Linux kernel has three basic SMP conditional CPU memory barriers:
GENERAL smp_mb() Similar to mb(), but only guarantees ordering between cores/processors within an SMP
system. All memory accesses before the smp_mb() will be visible to all cores within the SMP system before
any accesses after the smp_mb().
WRITE smp_wmb() Like smp_mb(), but only guarantees ordering between write accesses.
READ smp_rmb() Like smp_mb(), but only guarantees ordering between read accesses. [6]
The SMP barriers are implemented in include/asm-generic/barrier.h as:
Lisng 7.3.12: Linux kernel SMP barriers C
1 #ifdef CONFIG_SMP
2 #ifndef smp_mb
3 #dene smp_mb() __smp_mb()
4 #endif
5 #ifndef smp_rmb
6 #dene smp_rmb() __smp_rmb()
7 #endif
8 #ifndef smp_wmb
9 #dene smp_wmb() __smp_wmb()
10 #endif
11 #endif
For the Arm AArch64 architecture, the SMP barriers are implemented in arm64/include/asm/barrier.h as:
Lisng 7.3.13: Arm implementaon of kernel SMP barriers C
1 #dene __smp_mb() dmb(ish)
2 #dene __smp_rmb() dmb(ishld)
3 #dene __smp_wmb() dmb(ishst)
with the dmb() macro defined above.
DMB is the Data Memory Barrier instruction. It ensures that all explicit memory accesses that appear
in program order before the DMB instruction are observed before any explicit memory accesses
that appear in program order after the DMB instruction. It does not affect the ordering of any other
instructions executing on the processor.
The argument ISH restricts a DMB operation to the inner shareable domain; ISHLD is a DMB
operation that waits only for loads to complete, and is restricted to the inner shareable domain; ISHST
is a DMB operation that waits only for stores to complete, and is restricted to the inner shareable
domain. Recall that the "inner shareable domain" is in practice the memory space of the hardware
(SMP system) controlled by the Linux kernel.
Implicit barriers
Instead of explicit barriers, it is possible to use locking constructs available within the kernel that act
as implicit SMP barriers (similar to pthread synchronization operations in user space, see Section 7.3.6).
Because in practice a large number of device drivers do not use the required barriers, the kernel I/O
accessor macros for the Arm architecture (readb(), iowrite32(), etc.) themselves act as memory barriers
when the kernel is compiled with CONFIG_ARM_DMA_MEM_BUFFERABLE, for example in
arm/include/asm/io.h:
Lisng 7.3.14: Kernel I/O accessor macros for Arm as explicit memory barriers C
1 #ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
2 #include <asm/barrier.h>
3 #dene __iormb() rmb()
4 #dene __iowmb() wmb()
5 #else
6 #dene __iormb() do { } while (0)
7 #dene __iowmb() do { } while (0)
8 #endif
(the Linux kernel code uses do { } while (0) as an architecture-independent no-op).
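To see why a driver needs such a barrier, consider the following sketch (an entirely hypothetical device, register offset, and flag names of our own) in which a DMA descriptor written to ordinary memory must be observed by the device before the MMIO "doorbell" write that tells it to start:

#include <linux/io.h>
#include <linux/types.h>

#define DESC_READY    0x1     /* hypothetical descriptor flag   */
#define DOORBELL_REG  0x40    /* hypothetical register offset   */

struct dma_desc {             /* hypothetical descriptor layout */
    u32 ctrl;
    u32 addr;
};

static void ring_doorbell(void __iomem *regs, struct dma_desc *desc, u32 index)
{
    desc->ctrl = DESC_READY;  /* plain store to coherent memory            */
    wmb();                    /* order the descriptor write before the
                                 MMIO write that the device acts on        */
    writel(index, regs + DOORBELL_REG);
}

On kernels built with CONFIG_ARM_DMA_MEM_BUFFERABLE, writel() itself already performs __iowmb(), which is exactly the point made above.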
Spin locks
Spin locks are the simplest form of locking. Essentially, the task trying to acquire the lock goes into
a loop doing nothing until it gets the lock, in pseudocode:
Listing 7.3.15: Spin lock pseudocode C
while (!has_lock) {
    // try to get the lock
}
Spin locks have the obvious drawback of occupying the CPU while waiting. If the wait is long, another
task should get the CPU; in other words, the task trying to obtain the lock should be put to sleep.
However, for the cases where it is not desirable to put a task to sleep, or if the user knows the wait will
be short, the kernel provides spin locks, also known as busy-wait locks (kernel/locking/spinlock.c).
The spin lock functionality for SMP systems is implemented as a macro which creates a lock function
for a given operation (e.g., read or write). Essentially, the implementation is a forever loop with a
conditional break. First, preemption is disabled, then the function tries to atomically acquire the lock
and exits the loop if it succeeded; otherwise, it re-enables preemption, calls the architecture-specific
relax operation (effectively an efficient way of doing a no-op), and performs another iteration of the
loop to try again.
Lisng 7.3.16: Linux kernel SMP lock-building macro C
1 #dene BUILD_LOCK_OPS(op, locktype) \
2 void __lockfunc __raw_##op##_lock(locktype##_t *lock) \
3 { \
4 for (;;) { \
5 preempt_disable(); \
6 if (likely(do_raw_##op##_trylock(lock))) \
7 break; \
8 preempt_enable(); \
9 \
10 arch_##op##_relax(&lock->raw_lock); \
11 } \
12 } \
For uniprocessor systems (include/linux/spinlock_api_up.h), the spin lock is much simpler:
Lisng 7.3.17: Linux kernel uniprocessor spin lock C
1 #dene ___LOCK(lock) \
2 do { __acquire(lock); (void)(lock); } while (0)
3
4 #dene __LOCK(lock) \
5 do { preempt_disable(); ___LOCK(lock); } while (0)
6
7 // ...
8 #dene _raw_spin_lock(lock) __LOCK(lock)
In other words, the code just disables preemption; there is no actual spin lock. The references to the
lock variable are there only to suppress compiler warnings.
Futexes
As discussed in Section 7.3.3, a mutex is a binary semaphore. A futex is a “fast user-space mutex,”
a Linux-specific implementation of mutexes optimized for performance in the case where there is no
contention for resources.
A futex (implemented in kernel/futex.c) is identified by a user-space address which can be shared
between processes or threads. A basic futex has semaphore semantics: it is a 4-byte integer counter
that can be incremented and decremented only atomically; processes can wait for the value to become
positive. Processes can share this integer using mmap(2), via shared memory segments, or – if they are
threads – because they share memory space.
As the name suggests, futex operation occurs entirely in user space for the non-contended case.
The kernel is only involved to handle the contended case. If the lock is already owned and another
process tries to acquire it, then the lock is marked with a value that says “waiter pending,” and the
sys_futex(FUTEX_WAIT) syscall is used to wait for the other process to release it. The kernel
creates a 'futex queue' internally so that it can, later on, match up the waiter with the waker – without
them having to know about each other. When the owner thread releases the futex, it notices (via the
variable value) that there were waiter(s) pending, and does the sys_futex(FUTEX_WAKE) syscall
to wake them up. Once all waiters have taken and released the lock, the futex is again back to the
uncontended state. At that point, there is no in-kernel state associated with it, i.e., the kernel has no
memory of the futex at that address. This makes futexes very lightweight and scalable.
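The following user-space sketch (our own simplification, with error handling omitted) shows the idea: a lock word that is 0 (free), 1 (held) or 2 (held with possible waiters), a compare-and-swap fast path that never enters the kernel, and FUTEX_WAIT/FUTEX_WAKE only under contention.

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word;   /* 0 = free, 1 = held, 2 = held with possible waiters */

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void lock(void)
{
    int expected = 0;
    /* Fast path: uncontended case stays entirely in user space. */
    if (atomic_compare_exchange_strong(&lock_word, &expected, 1))
        return;
    /* Slow path: mark "waiter pending" and sleep in the kernel. */
    while (atomic_exchange(&lock_word, 2) != 0)
        futex(&lock_word, FUTEX_WAIT, 2);   /* sleeps only while the word is still 2 */
}

void unlock(void)
{
    /* If the word was 2, someone may be asleep in the kernel: wake one waiter. */
    if (atomic_exchange(&lock_word, 0) == 2)
        futex(&lock_word, FUTEX_WAKE, 1);
}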
Originally futexes, as described above, were used to implement POSIX pthread mutexes. However,
the current design is slightly more complicated due to the need to handle crashes. The problem is
that when a process crashes, it can't clean up the mutex, but the kernel can't do it either because it
has no memory of the futex. The changes required to address this issue are described in the kernel
documentation in robust-futexes.txt.
Kernel mutexes
The Linux kernel also has its own mutex implementation (mutex.h), which is intended for kernel
use only (whereas the futex is designed for use by user-space programs). As usual, the kernel
documentation (mutex-design.txt) is the canonical reference. Here we summarize the key points of the
implementation. The mutex consists of the following struct:
Lisng 7.3.18: Linux kernel mutex struct C
1 struct mutex {
2 atomic_long_t owner;
3 spin lock_t wait_lock;
4 struct optimistic_spin_queue osq; /* Spinner MCS lock */
5 struct list_head wait_list;
6 };
The kernel mutex uses a three-state atomic counter to represent the different possible transitions that
can occur during the lifetime of a lock: 1: unlocked; 0: locked, no waiters; <0: locked, with potential
waiters.
In its most basic form, it also includes a wait-queue and a spin lock that serializes access to it.
CONFIG_SMP systems can also include a pointer to the lock task owner as well as a spinner MCS lock
(see the kernel documentation).
When acquiring a mutex, there are three possible paths that can be taken, depending on the state of
the lock:
1. Fastpath: tries to atomically acquire the lock by decrementing the counter. If it was already taken
by another task, it goes to the next possible path. This logic is architecture-specific but typically
requires only a few instructions.
2. Midpath: aka optimistic spinning, tries to spin for acquisition while the lock owner is running, and
there are no other tasks ready to run that have higher priority (need_resched). The rationale is that
if the lock owner is running, it is likely to release the lock soon.
3. Slowpath: if the lock is still unable to be acquired, the task is added to the wait queue and
sleeps until woken up by the unlock path. Under normal circumstances, it blocks as
TASK_UNINTERRUPTIBLE.
While formally kernel mutexes are sleepable locks, it is the midpath that makes this lock attractive,
because busy-waiting for a few cycles has a lower overhead than putting a task on the wait queue.
Semaphores
Semaphores (include/linux/semaphore.h) are also locks with a blocking wait (sleep); they are a
generalized version of mutexes. Where a mutex can only have the values 0 or 1, a semaphore can hold
an integer count, i.e., a semaphore may be acquired count times before sleeping. If the count is zero,
there may be tasks waiting on the wait_list. The spin lock controls access to the other members
of the semaphore. Unlike the mutex above, the semaphore always sleeps.
Lisng 7.3.19: Linux kernel semaphore struct C
1 struct semaphore {
2 raw_spin lock_t lock;
3 unsigned int count;
4 struct list_head wait_list;
5 };
The supported operations on the semaphore (see kernel/locking/semaphore.c) are down (attempt to
acquire the semaphore, i.e., the P operation) and up (release the semaphore, the V operation). Both of
these have a number of variants, but we focus here on the basic versions.
As long as the count is positive, down() simply decrements the counter:
Lisng 7.3.20: Linux kernel semaphore down() operaon (1) C
1 void down(struct semaphore *sem) {
2 unsigned long ags;
3
4 raw_spin_lock_irqsave(&sem->lock,ags);
5 if (likely(sem->count > 0))
6 sem->count--;
7 else
8 __down(sem);
9 raw_spin_unlock_irqrestore(&sem->lock,ags);
10 }
If no more tasks are allowed to acquire the semaphore, calling down() will put the task to sleep until
the semaphore is released. This functionality is implemented in __down(), which simply calls
__down_common(sem, TASK_UNINTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT). The variable state refers
to the state of the current running process, as discussed in Chapter 5. The function adds the current
process to the semaphore's wait list and goes into a loop. The trick here is that specifying a timeout
value of MAX_SCHEDULE_TIMEOUT on schedule_timeout() will call schedule() without a bound on the
timeout. So this will simply put the current task to sleep. The return value will be
MAX_SCHEDULE_TIMEOUT.
Lisng 7.3.21: Linux kernel semaphore down() operaon (2) C
1 / *
2 * Because this function is inlined, the 'state' parameter will be
3 * constant, and thus optimized away by the compiler. Likewise the
4 * 'timeout' parameter for the cases without timeouts.
5 */
6 static inline int __sched __down_common(struct semaphore *sem, long state,
7 long timeout)
8 {
9 struct semaphore_waiter waiter;
10
11 list_add_tail(&waiter.list, &sem->wait_list);
12 waiter.task = current;
13 waiter.up = false;
14
15 for (;;) {
16 if (signal_pending_state(state, current))
17 goto interrupted;
18 if (unlikely(timeout <= 0))
19 goto timed_out;
20 __set_current_state(state);
21 raw_spin_unlock_irq(&sem->lock);
22 timeout = schedule_timeout(timeout);
23 raw_spin_lock_irq(&sem->lock);
24 if (waiter.up)
25 return 0;
26 }
27
28 timed_out:
29 list_del(&waiter.list);
177
30 return -ETIME;
31
32 interrupted:
33 list_del(&waiter.list);
34 return -EINTR;
35 }
The up() function is much simpler. It checks whether there are no waiters; if so, it increments count, and
if not, it wakes up the waiter at the head of the queue (using __up()).
Lisng 7.3.22: Linux kernel sempahore up() operaon (1) C
1 void up(struct semaphore *sem) {
2 unsigned longags;
3
4 raw_spin_lock_irqsave(&sem->lock,ags);
5 if (likely(list_empty(&sem->wait_list)))
6 sem->count++;
7 else
8 __up(sem);
9 raw_spin_unlock_irqrestore(&sem->lock,ags);
10 }
Lisng 7.3.23: Linux kernel semaphore up() operaon (2) C
1 static noinline void __sched __up(struct semaphore *sem)
2 {
3 structsemaphore_waiter*waiter=list_rst_entry(&sem->wait_list,
4 struct semaphore_waiter, list);
5 list_del(&waiter->list);
6 waiter->up = true;
7 wake_up_process(waiter->task);
8 }
7.3.6 POSIX synchronization primitives
Unless you are a kernel or device driver programmer, you would not use the Linux kernel
synchronization primitives directly. Instead, for userspace code, you would use the synchronization
primitives provided by the POSIX API. These are implemented using the kernel primitives discussed
above. The most important POSIX synchronization primitives are mutexes, semaphores, spin locks,
and condition variables. The majority of the API is defined in <pthread.h>, with most of the types
in <sys/types.h>. The actual implementation for Linux is in the GNU C library; see the glibc
source code.
Mutexes
POSIX mutexes are defined as an opaque type pthread_mutex_t (effectively a small integer). The API is
small and simple:
Lisng 7.3.24: POSIX mutex API C
1 //To create mutex:
2 pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
3 // or
4 int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *attr)
5
6 // To destroy a mutex:
7 int pthread_mutex_destroy(pthread_mutex_t *mutex);
8
9 //To lock/unlock the mutex:
10 int pthread_mutex_lock(pthread_mutex_t *lock);
11 int pthread_mutex_unlock(pthread_mutex_t *lock);
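A minimal usage sketch (our own example, not from the text): two threads increment a shared counter, with the mutex protecting the critical section so that no increment is lost.

#include <pthread.h>
#include <stdio.h>

static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&counter_lock);    /* enter critical section */
        counter++;                            /* shared resource access */
        pthread_mutex_unlock(&counter_lock);  /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);       /* always 2000000 with the lock */
    return 0;
}

Without the lock, the two read-modify-write sequences can interleave and the final count is typically less than 2,000,000. Compile with gcc -pthread.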
Semaphores
POSIX semaphores (defined in <semaphore.h>) are counting semaphores as introduced above, i.e.,
they block on an attempt to decrement them when the counter is zero. The Linux man page
sem_overview(7) provides a good overview. The semaphore is defined using the opaque type sem_t.
The P and V operations are called sem_wait() and sem_post():
Lisng 7.3.25: POSIX semaphore API C
1 // Separate header le, not in <pthread.h>
2 #include <semaphore.h>
3 // V operation
4 int sem_post(sem_t *sem);
5 // P operation
6 int sem_wait(sem_t *sem);
7 // Variants
8 int sem_trywait(sem_t *sem);
9 int sem_timedwait(sem_t *sem, const struct timespec *abs_timeout);
The sem_wait() variant sem_trywait() returns an error if the decrement cannot be immediately
performed, instead of blocking. The variant sem_timedwait() allows setting a timeout on the waiting
time. If the timeout expires while the call is still blocked, an error is returned.
POSIX semaphores come in two forms: named semaphores and unnamed semaphores.
Named semaphores
A named semaphore is identified by a name of the form "/somename," i.e., a null-terminated string
consisting of an initial slash, followed by one or more characters, none of which are slashes. Two
processes can operate on the same named semaphore by passing the same name to sem_open(). The
API consists of three functions. The sem_open() function creates a new named semaphore or opens an
existing named semaphore. When a process has finished using the semaphore, it can use sem_close()
to close the semaphore. When all processes have finished using the semaphore, it can be removed
from the system using sem_unlink().
Lisng 7.3.26: POSIX named semaphore API C
1 sem_t *sem_open(const char *name, intoag);
2 int sem_close(sem_t *sem);
3 int sem_unlink(const char *name);
Unnamed semaphores (memory-based semaphores)
An unnamed semaphore is placed in a region of memory that is shared between multiple threads
or processes. The API consists of the functions below. An unnamed semaphore must be initialized using
sem_init(). When the semaphore is no longer required, it should be destroyed using
sem_destroy().
Lisng 7.3.27: POSIX unnamed semaphore API C
1 int sem_init(sem_t *sem, int pshared, unsigned int value);
2 int sem_destroy(sem_t *sem);
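Putting these calls together, the sketch below (our own example) solves the bounded-buffer problem from Section 7.3.1 for one producer and one consumer thread: an "empty-slots" semaphore makes the producer block when the buffer is full, a "filled-slots" semaphore makes the consumer block when it is empty, and a mutex protects the buffer indices.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define BUF_SIZE 8

static int buffer[BUF_SIZE];
static int in, out;                       /* producer / consumer indices */
static sem_t empty_slots, filled_slots;   /* counting semaphores         */
static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    for (int item = 0; item < 100; item++) {
        sem_wait(&empty_slots);           /* P: block if the buffer is full  */
        pthread_mutex_lock(&buf_lock);
        buffer[in] = item;
        in = (in + 1) % BUF_SIZE;
        pthread_mutex_unlock(&buf_lock);
        sem_post(&filled_slots);          /* V: signal one more item         */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    for (int n = 0; n < 100; n++) {
        sem_wait(&filled_slots);          /* P: block if the buffer is empty */
        pthread_mutex_lock(&buf_lock);
        int item = buffer[out];
        out = (out + 1) % BUF_SIZE;
        pthread_mutex_unlock(&buf_lock);
        sem_post(&empty_slots);           /* V: free one slot                */
        printf("consumed %d\n", item);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    sem_init(&empty_slots, 0, BUF_SIZE);  /* all slots initially free */
    sem_init(&filled_slots, 0, 0);        /* no items initially       */
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&empty_slots);
    sem_destroy(&filled_slots);
    return 0;
}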
Spin locks
POSIX spin locks are defined as an opaque type pthread_spinlock_t (effectively a small integer). The
API consists of calls to initialize, destroy, lock, and unlock a spin lock. The trylock call tries to obtain
the lock and returns an error when it fails, rather than blocking.
Lisng 7.3.28: POSIX spin lock API C
1 // To create a spin lock
2 int pthread_spin_init(pthread_spin lock_t *, int);
3 // To destroy a spin lock
4 int pthread_spin_destroy(pthread_spin lock_t *);
5 // Get the lock
6 int pthread_spin_lock(pthread_spin lock_t *);
7 int pthread_spin_trylock(pthread_spin lock_t *);
8 // Release the lock
9 int pthread_spin_unlock(pthread_spin lock_t *);
Condition variables
Finally, the POSIX pthread API provides a more advanced locking construct called a condition variable.
Condition variables allow threads to synchronize based upon the actual value of data. Without
condition variables, the program would need to use polling to check if the condition is met, similar to
a spin lock. A condition variable allows the thread to wait until a condition is satisfied, without polling.
A condition variable is always used in conjunction with a mutex lock.
Below is a typical example of the use of condition variables. The code implements the basic operations
for a thread-safe queue using an ordinary queue (Queue_t, with methods enqueue(), dequeue() and
empty() and an attribute status), a mutex lock, and a condition variable. The wait_for_data() function
blocks on the queue as long as it is empty. The lock protects the queue q, and the pthread_cond_wait()
call blocks until pthread_cond_signal() is called, in enqueue_data().
Note that the call to pthread_cond_wait() automatically and atomically unlocks the associated mutex
while waiting; when the thread is woken by a signal, the mutex is automatically re-acquired before the
call returns.
The dequeue_data() method similarly protects the access to the queue with a mutex and uses the
condition variable to block until the queue is non-empty. The functions init() and clean_up() are used
to create and destroy the mutex and condition variable.
Lisng 7.3.29: POSIX condion variable API C
1 pthread_mutex_t q_lock;
2 pthread_cond_t q_cond;
3
4 void init(pthread_mutex_t* q_lock_ptr,q_cond_ptr) {
5 pthread_mutex_init(q_lock_ptr,NULL);
6 pthread_cond_init(q_cond_ptr,NULL);
7 }
8
9 void wait_for_data(Queue_t* q) {
10 pthread_mutex_lock(&q_lock);
11 while(q->empty()) {
12 pthread_cond_wait(&q_cond, &q_lock);
13 }
14 q->status=1;
15 pthread_mutex_unlock(&q_lock);
16 }
17
18 void enqueue_data(Data_t* data, Queue_t* q) {
19 pthread_mutex_lock(&q_lock);
20 bool was_empty = (q->status==0);
21 q->enqueue(data);
22 q->status=1;
23 pthread_mutex_unlock(&q_lock);
24 if (was_empty)
25 pthread_cond_signal(&q_cond);
26 }
27
28 Data_t* dequeue_data(Queue_t* q) {
29 pthread_mutex_lock(&RXlock);
30 while(q->empty()) {
31 pthread_cond_wait(&RXcond, &RXlock);
32 }
33 Data_t* t_elt=q->front();
34 q->pop_front();
35 if (q->empty()) q->status=0;
36 pthread_mutex_unlock(&RXlock);
37 return t_elt;
38 }
39
40
41 void clean_up(pthread_mutex_t* q_lock_ptr,q_cond_ptr) {
42 pthread_mutex_destroy(q_lock_ptr);
43 pthread_cond_destroy(q_cond_ptr);
44 }
There is an additional API call, pthread_cond_broadcast(). The difference with pthread_cond_signal() is
that the broadcast call unblocks all threads blocked on the condition variable, whereas the signal only
unblocks one thread.
POSIX condition variables are implemented in glibc for Linux using futexes. The implementation is
quite complex. The source code (nptl/pthread_cond_wait.c) contains an in-depth discussion of the
issues and design decisions. However, essentially, the implementation can be written in Python
pseudocode as follows:
Lisng 7.3.30: POSIX condion variable pseudocode Python
1 def Condition(lock):
2 lock = Lock()
3 waitQueue = ThreadQueue()
4
5 def wait():
6 DisableInterrupts()
7 lock.release()
8 waitQueue.sleep()
9 lock.acquire()
10 RestoreInterrupts()
11
12 def signal():
13 DisableInterrupts()
14 waitQueue.wake()
15 RestoreInterrupts()
16
17 def broadcast():
18 DisableInterrupts()
19 waitQueue.wake-all()
20 RestoreInterrupts()
7.4 Parallelism
In this section, we look at the hardware parallelism offered by modern architectures, the implications
for the OS, and the programming support. For clarity, we will refer to one of several parallel hardware
execution units as a "compute unit." For example, in the Arm system shown in Figure 7.5, there would
be four quad-core A72 clusters paired with four quad-core A53 clusters, so a total of 32 compute units.
7.4.1 What are the challenges with parallelism?
The main challenge in exploiting parallelism is in a way similar to scheduling: we want to use all
parallel hardware threads in the most efficient way. From the OS perspective, this means control
over the threads to run on each compute unit. But whereas scheduling of threads/processes means
multiplexing in time, parallelism effectively means the placement of tasks in space. The Linux kernel
has for a long time supported symmetric multiprocessing (SMP), which means an architecture where
multiple identical compute units are connected to a single shared memory, typically via a hierarchy
of fully-shared, partially-shared and/or per-compute-unit caches. The kernel simply manages
a scheduling queue per core.
With the advent of systems like Arm's big.LITTLE (of which the system in Figure 7.5 is an example,
with "big" A72 cores and "little" A53 cores), this model is no longer adequate, because tasks will spend
a much longer time running if they are scheduled on a "little" core than on a "big" core. Therefore,
efforts have been started towards "global task scheduling" or "heterogeneous multiprocessing" (HMP),
which require modifications of the scheduler in the Linux kernel.
Figure 7.5: Extensible Architecture for Heterogeneous Multi-core Solutions (from the ARM Tech Forum talk by Brian Jeff, September 2015).
Apart from these issues, there is also the issue of the control the user has over the placement of
tasks: ideally, the programmer should be able to decide on which compute unit a task should run.
This feature is known as "thread pinning" and is supported by a POSIX API, and we will see how it is
implemented. Finally, parallel tasks on a shared memory system effectively communicate via the
memory, which means that multiple concurrent accesses to the main memory are possible. This
poses challenges for cache coherency and TLB management, which are topics of Chapter 6, "Memory
management." But even ignoring caches, communication effectively means that the issues discussed
in the previous section on concurrency have to be addressed in parallel programs as well. The
main challenge is to ensure that there is no unnecessary serialization of tasks, while at the same
time guaranteeing that the resulting behavior is correct.
7.4.2 Arm hardware support for parallelism
When a processor comprises multiple processing cores, the hardware must be designed to support
parallel processing on all cores. Apart from supporting cache-coherent shared memory and the
features to support concurrency as discussed above, there are a few other ways in which Arm
multicore processors support parallel programming. The first is through SIMD (Single Instruction
Multiple Data) instructions, also known as vector processing. This type of parallelism does not require
intervention from the OS as it is instruction-based, per-core parallelism, i.e., it is handled by the
compiler. The Arm Cortex-A53 MPCore Processor used in the Raspberry Pi 3 supports the "Advanced
SIMD" extensions, as discussed in [8].
Next, the handling of interrupts must also be multi-core-aware. The Arm Generic Interrupt Controller
Architecture [9] provides support for software control of the delivery of hardware interrupts to a particular
processing element ("Targeted distribution model") as well as to one PE out of a given set ("1 of N
model"), and for controlling the delivery of software interrupts to multiple PEs ("Targeted list model").
Then we have support for processor affinity through the Multiprocessor Affinity Register (MPIDR). This
feature allows the OS to identify the PE on which a thread is to be scheduled.
Finally, there are two hint instructions [10] to improve multiprocessing, YIELD and SEV. Software
with a multithreading capability can use a YIELD instruction to indicate to the PE that it is performing
a task, for example, a spin-lock, that could be swapped out to improve overall system performance.
The PE can use this hint to suspend and resume multiple software threads if it supports the capability.
The Send Event (SEV) hint instruction causes an event to be signaled to all PEs in the multiprocessor
system (as opposed to SEVL, which only signals to the local PE). The receipt of a signaled SEV or SEVL
event by a PE sets the Event Register on that PE. The Event Register can be used by the Wait For
Event (WFE) instruction. If the Event Register is set, the instruction clears the register and completes
immediately; if it is clear, the PE can suspend execution and enter a low-power state. It remains in that
state until an SEV instruction is executed by any of the PEs in the system.
7.4.3 Linux kernel support for parallelism
As mentioned above, the Linux kernel supports parallelism through symmetric multiprocessing (SMP)
(ever since kernel version 2.0). What this means is that every compute unit runs a separate scheduler,
and there are mechanisms to move tasks between scheduling queues on different compute units.
SMP boot process
The boot process is therefore extended from the boot sequence discussed in Chapter 2, as illustrated
in Figure 7.6. Essentially, the kernel boots on a primary CPU and, when all common initialization
is finished, the primary CPU sends interrupt requests to the other cores, which results in running
secondary_start_kernel() (defined in arm/kernel/smp.c).
Figure 7.6: Booting flowchart for the ARM Linux kernel on SMP systems (from [11]).
Load balancing
The main mechanism to support parallelism in the Linux kernel is automatic load balancing, which aims
to improve the performance of SMP systems by offloading tasks from busy CPUs to less busy or idle
ones. The Linux scheduler regularly checks how the task load is spread throughout the system and
performs load balancing if necessary [12].
To support load balancing, the scheduler supports the concepts of scheduling domains and groups
(defined in include/linux/sched/topology.h). Scheduling domains allow the grouping of one or more
processors hierarchically for the purposes of load balancing. Each domain must contain one or more groups,
such that the domain consists of the union of the CPUs in all groups. Balancing within a domain occurs
between groups. The load of a group is defined as the sum of the load of each of its member CPUs,
and only when the load of a group becomes unbalanced are tasks moved between groups. The groups
are exposed to the user via two different mechanisms. The first is autogroups, an implicit mechanism
in the sense that if it is enabled in the kernel (in /proc/sys/kernel/sched_autogroup_
enabled), all members of an autogroup are placed in the same kernel scheduler group. The second
mechanism is called control groups or cgroups (see cgroups(7)). These are not the same as the
scheduling task groups, but a way of grouping processes and controlling their resource utilization (including
CPU scheduling) at the level of the cgroup rather than at the individual process level.
Processor anity control
The funconality used to move tasks between CPUs is exposed to the user using a kernel API dened
in include/linux/sched.h. This API consists of two calls (see sched_setanity(2)), sched_- setanity()
and sched_getanity().
Lisng 7.4.1: Linux processor anity control C
1 #dene _GNU_SOURCE
2 #include <sched.h>
3
4 int sched_setainity(pid_t pid, size_t cpusetsize,
5 const cpu_set_t *mask);
6
7 int sched_getainity(pid_t pid, size_t cpusetsize,
8 cpu_set_t *mask);
These calls control the thread's CPU affinity mask, which determines the set of CPUs on which it
can be run. On multicore systems, this can be used to control the placement of threads. This allows
user-space applications to take control over the load balancing instead of the scheduler. Usually,
a programmer will not use the kernel API but the corresponding POSIX thread API (Section 7.6.1),
which is implemented using the kernel API.
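As a minimal sketch of this API (our own example, using only the calls documented in sched_setaffinity(2) and CPU_SET(3)), the following program pins the calling process to CPU 0 and then reports the CPU it is running on:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);                 /* start from an empty CPU set */
    CPU_SET(0, &mask);               /* allow only CPU 0 */
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now pinned, running on CPU %d\n", sched_getcpu());
    return 0;
}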
7.5 Data-parallel and task-parallel programming models
7.5.1 Data parallel programming
Data parallelism means that every compute unit will perform the same computation but on a different
part of the data. This is a very common parallel programming model, supported for example by CPU
cores with SIMD vector instructions, manycore systems, and GPGPUs.
Full data parallelism: map
Purely in terms of performance, in an ideal data-parallel program, the threads working on different
sections of the data would not interact at all. This type of problem is known as "embarrassingly
parallel." In computational terms (especially in the context of functional programming) this pattern
is known as a map, a term which has become well known through the popularity of map-reduce
frameworks. In principle, a map operation can be executed on all elements of a data set in parallel,
so given unlimited parallelism, the complexity is O(1). In practice, parallelism is never unlimited, and in
terms of the implementation of map in programming languages, you cannot assume any parallelism;
for example, Python's map function does not operate in parallel. However, we use the term here to
refer to the computational pattern that allows full data parallelism.
Reducon
On the opposite side of the performance spectrum, we have purely sequenal computaons, i.e.,
where it is not possible at all to perform even part of the computaon in parallel. In computaonal
terms, this is the case for non-associave reducon operaons. Reducon (the second part in map-
reduce) means a computaon which combines all elements of a data set to produce its nal result.
In funconal programming, reducons are also known as folds. In Python, the corresponding funcon
is reduce. Unless a reducon operaon is associave, it cannot be parallelized and will have linear
me complexity O(N) for a data set of N elements.
Associavity
In formal terms, a funcon of two arguments is associave if and only if
f(f(x,y),z)=f(x,(f(y,z))
For example, addion and mulplicaon are associave:
x+y+z=(x+y)+z=x+(y+z)
but division and modulo are not:
(x/y)/zx/(y/z)
Binary tree-based parallel reduction
In practice, many of the common operations on sets are associative: sum, product, min, max,
concatenation, comparison, ...
If the reduction operation is associative, the computation can still be parallelized, not using a map
pattern but through a binary tree-based parallelization (tree-based fold). For example, to sum 8
numbers, we can perform 4 pairwise sums in parallel, then sum the 4 results in two parallel operations,
and then compute the final sum.
1+2+3+4+5+6+7+8
= 3+7+11+15
= 10+26
= 36
Another example is merge sort, where the list to be sorted is split into as many chunks as there
are threads, then each chunk is sorted in parallel, and the chunks are merged pairwise. Whereas
sequenal merge sort is O(N log N), if there are at least half as many threads as elements to sort, then
the sorng can be done in O(log N).
In general, for a data set of size N, with unlimited parallelism, an associave operaon can be reduced
in O(log N) steps.
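The following minimal C sketch (our own illustration, not library code) shows the step structure of such a tree-based reduction for the eight-number example above; each pass over the array halves the number of partial sums, and with one thread per pair each pass could run in parallel:

#include <stdio.h>

int main(void) {
    double data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int n = 8;
    /* log2(n) passes; after each pass, half as many partial sums remain */
    for (int stride = 1; stride < n; stride *= 2) {
        /* pairwise combination within a pass (each iteration is independent) */
        for (int i = 0; i + stride < n; i += 2 * stride)
            data[i] += data[i + stride];
    }
    printf("sum = %g\n", data[0]);   /* prints 36 */
    return 0;
}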
7.5.2 Task parallel programming
Instead of parallelizing the computation by performing the same operation on different parts of the
data, we can also perform different computations in parallel. For example, we can split a Sobel filter
for edge detection into a vertical and a horizontal part and perform these in parallel on the image data.
In practice, this approach is particularly effective if the input data is a stream, e.g., frames from a video,
as in that case, we can create a pipeline which performs different operations in parallel on different
frames. Figure 7.7 shows the complete task graph for a Sobel edge detection pipeline [13]. In this
example, if a node has a fan-out of more than one, copies of the frame are sent to each downstream
node. In general, of course, a node could send different data to each of its downstream nodes.
Figure 7.7: Task graph for a Sobel edge detection pipeline.
7.6 Praccal parallel programming frameworks
7.6.1 POSIX Threads (pthreads)
We have already covered the POSIX synchronizaon primives in Secon 7.3.6, but we did not
discuss the API for creang and managing threads. The POSIX thread (pthreads) API provides data
types and API calls to manage threads and control their aributes. A good overview can be found
in pthreads(7). The thread is represented by an opaque type pthread_t which represents the thread
ID (i.e., it is a small integer). Each thread has a number of aributes managed via the opaque type
pthread_ar_t, which is accessed via a separate set of API calls.
The most important thread management calls are pthread_create(), pthread_join() and pthread_- exit().
The pthread_create() call takes a pointer to the subroune to be called in the thread and a pointer to its
arguments. Inside the thread, pthread_exit() can be called to terminate the calling thread. The pthread_
join() call waits for the thread indicated by its rst argument to terminate, if that thread is in a joinable
state (see below). If that thread called pthread_exit() with a non-NULL argument, then this argument
will be available as the second argument in pthread_join().
Lisng 7.6.1: POSIX pthread API: create and join C
1 #include <pthread.h>
2
3 int pthread_create(
4 pthread_t *thread, const pthread_attr_t *attr,
5 void *(*start_routine)(void*), void *arg);
6
7 int pthread_join(pthread_t thread, void **value_ptr);
8
9 // inside the thread
10 void pthread_exit(void *retval);
Another convenient call is pthread_self(), which simply returns the thread ID of the caller:
Listing 7.6.2: POSIX pthread API: self C
1 pthread_t pthread_self(void);
In many cases, it is not necessary to specify the thread attributes, but we can use the attributes, for
example, to control the processor affinity or the detached state of the thread, i.e., whether a thread is joinable
or detached. Detached means that you know you will not use pthread_join() to wait for it, so on exit
the thread's resources will be released immediately.
The attribute is created and destroyed using the following calls:
Listing 7.6.3: POSIX pthread API: init and destroy C
1 int pthread_attr_init(pthread_attr_t *attr);
2 int pthread_attr_destroy(pthread_attr_t *attr);
For example, to set or get the affinity, we can use the following calls:
Listing 7.6.4: POSIX pthread API: affinity C
1 #define _GNU_SOURCE
2 int pthread_attr_setaffinity_np(pthread_attr_t *attr,
3 size_t cpusetsize, const cpu_set_t *cpuset);
4
5 int pthread_attr_getaffinity_np(pthread_attr_t *attr,
6 size_t cpusetsize, cpu_set_t *cpuset);
Similarly, if we want to get or set the detach state, we can use:
Listing 7.6.5: POSIX pthread API: attributes C
1 int pthread_attr_setdetachstate(pthread_attr_t *attr, int detachstate);
2 int pthread_attr_getdetachstate(const pthread_attr_t *attr, int *detachstate);
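For example, a thread can be created in the detached state as follows (a brief sketch of our own; worker() is a hypothetical start routine):

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {          /* hypothetical start routine */
    printf("detached worker running\n");
    return NULL;                          /* nobody will pthread_join() this thread */
}

int main(void) {
    pthread_attr_t attr;
    pthread_t tid;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    pthread_create(&tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);          /* the attribute object can be destroyed after create */
    pthread_exit(NULL);                   /* let main exit without terminating the detached thread */
}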
There are many more API calls, both to manage the threads and the attributes; see the man page
pthreads(7) for more details.
Below is an example of typical use of pthreads to create a number of identical worker threads to
perform work in parallel.
Lisng 7.6.6: POSIX pthread API example C
1 #include <pthread.h>
2
3 struct thread_info { /* Used as argument to thread_start() */
4 pthread_t thread_id; /* ID returned by pthread_create() */
5 // Any other eld you might need
6 // ...
7 };
8
9 // This is the worker which will run in each thread
10 void* thread_start(void *vtinfo) {
11 struct thread_info *tinfo = vtinfo;
12 // do work
13 // ...
14 pthread_exit(NULL); // no return value
15 }
16
17 int main(int argc, char *argv[]) {
18 int st;
19 struct thread_info *tinfo;
20 unsigned int num_threads = NTH; // macro
21
22 /* Allocate memory for pthread_create() arguments */
23 tinfo = calloc(num_threads, sizeof(struct thread_info));
24 if (tinfo == NULL)
25 handle_error("calloc");
26
27 /* Create threads (attr is NULL) */
28 for (unsigned int tnum = 0; tnum < num_threads; tnum++) {
29 // Here you would populate other fields in tinfo
30 st = pthread_create(&tinfo[tnum].thread_id, NULL,
31 &thread_start, &tinfo[tnum]);
32 if (st != 0)
33 handle_error_en(st, "pthread_create");
34 }
35
36 /* Now join with each thread */
37 for (unsigned int tnum = 0; tnum < num_threads; tnum++) {
38 st = pthread_join(tinfo[tnum].thread_id, NULL);
39 if (st != 0)
40 handle_error_en(st, "pthread_join");
41 }
42
43 // do something with the results if required
44 // ...
45
46 free(tinfo);
47 exit(EXIT_SUCCESS);
48 }
In this program, we create num_threads threads by calling pthread_create() in a for-loop (line 28).
Each thread is provided with a struct thread_info which contains the arguments for that thread.
Each thread takes a funcon pointer &thread_start to the subroune that will run in the thread.
The thread_info struct could, for example, contain a pointer to a large array, and each thread would
work on a poron of that array. As these threads are joinable (this is the default) we wait on them by
calling pthread_join() in a loop (line 37). Because the threads work on shared memory, the results of the
work done in parallel will be available in the main roune when all threads have been joined.
7.6.2 OpenMP
OpenMP is the de facto standard for shared-memory parallel programming. It is based on a set
of compiler directives or pragmas, combined with a programming API to specify parallel regions,
data scope, synchronization, etc. OpenMP is a portable parallel programming approach, and the
specification supports C, C++, and Fortran. It has historically been used for data-parallel programming
through its compiler directives. Since version 3.0, OpenMP also supports task parallelism [14]. It is
now widely used in both task- and data-parallel scenarios. Since OpenMP is a language enhancement,
every new construct requires compiler support. Therefore, its functionality is not as extensive as
that of library-based models. Moreover, although OpenMP provides the user with a high level of abstraction,
the onus is still on the programmer to ensure proper synchronization.
A typical example of OpenMP usage is the parallelization of a for-loop, as shown in the following code
snippet:
Lisng 7.6.7: OpenMP example C
1 #include <omp.h>
2 // ...
3 #pragma omp parallel \
4 shared(collection,vocabulary) \
2 private(docsz_min,docsz_max,docsz_mean)
6 {
7 // ...
8 #pragma omp for
9 for (unsigned int docid = 1; docid<NDOCS; docid++) {
10 // ...
11 }
12 }
The #pragma omp for directive will instruct the compiler to parallelize the loop (using POSIX
threads), treating it effectively as a map. The shared() and private() clauses in the #pragma omp parallel
directive let the programmer identify which variables are to be treated as shared by all threads or
private (per-thread). However, this clause does not regulate access to the variables, so we require
some kind of access control. OpenMP provides a number of directives to control access to sections
of code, the most important of which correspond to concepts introduced earlier:
#pragma omp critical indicates a critical section, i.e., it specifies a region of code that must be
executed by only one thread at a time.
#pragma omp atomic indicates that a specific memory location must be updated atomically, rather
than letting multiple threads attempt to write to it. Essentially, this directive provides a single-
statement critical section.
#pragma omp barrier indicates a memory barrier; a thread will wait at that point until all other threads
have reached that barrier. Then, all threads resume parallel execution of the code following
the barrier.
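As a minimal sketch of the atomic directive (our own example; compile with -fopenmp), the following function accumulates into a shared variable, with the atomic directive acting as a single-statement critical section:

#include <omp.h>

long parallel_sum(int n) {
    long total = 0;                /* shared by all threads (declared outside the parallel region) */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        total += i;                /* atomic update avoids a data race on total */
    }
    return total;
}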
For a full descripon of all direcve-based OpenMP synchronizaon constructs, see the OpenMP
specicaon [15]. In some cases, the direcve-based approach might not be suitable. Therefore
OpenMP also provides an API for synchronizaon, similar to the POSIX API. The following snippet
illustrates the use of locks to protect a crical secon.
Lisng 7.6.8: OpenMP lock example C
1 omp_lock_t writelock;
2 omp_init_lock(&writelock);
3 #pragma omp parallel \
4 shared(collection,vocabulary) \
5 private(docsz_min,docsz_max,docsz_mean)
6 {
7 #pragma omp for
8 for (unsigned int docid = 1; docid<NDOCS; docid++) {
9 // ...
10 omp_set_lock(&writelock);
11 // shared access
12 // ...
13 omp_unset_lock(&writelock);
14 }
15 }
16 omp_destroy_lock(&writelock);
7.6.3 Message passing interface (MPI)
The Message Passing Interface (commonly known under its acronym MPI) [16] is an API specification
designed for high-performance computing. Since MPI provides a distributed memory model for
parallel programming, its main targets have been clusters and multiprocessor machines. The message
passing model means that tasks do not share any memory. Instead, every task has its own private
memory, and any communication between tasks is via the exchange of messages.
In MPI, the two basic routines for sending and receiving messages are MPI_Send and MPI_Recv:
Listing 7.6.9: MPI send and receive API C
1 int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int tag,
2 MPI_Comm comm)
3 int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
4 MPI_Comm comm, MPI_Status *status)
The buer buf contains the data to send or receive, the count its size in mulples of the specied
datatype. Further arguments are the desnaon (for send) or source (for receive). These are usually
called ranks, i.e. “a sender with rank X sends to a receiver with rank Y.” The two remaining elds, tag,
and communicator, require a bit more detail.
The communicator is essenally an object describing a group of processes that can communicate
with one another. For simple problems, the default communicator MPI_COMM_WORLD can be used,
but custom communicators allow, for example, collecve communicaon between subsets of all
191
processes. An important point is that the rank of a process is specic to the communicator being used,
i.e., the same process will typically have dierent ranks in dierent communicators.
The tag is an arbitrary integer that is used for matching of point-to-point messages like send and
receive: if a sender sends a message to a given desnaon with rank dest with a communicator
comm and a tag tag, then the receiver must match all of these specicaons in order to receive the
message, i.e., it must specify comm as its communicator, tag for its tag (or the special wildcard MPI_
ANY_TAG), and the rank of the sender as the source (or the special wildcard MPI_ANY_- SOURCE).
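As a minimal sketch of this matching (our own example; run with at least two ranks, e.g., mpirun -np 2), rank 0 sends a single integer to rank 1 using the default communicator and tag 0:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank in MPI_COMM_WORLD */
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);       /* dest = 1, tag = 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              /* source = 0, tag = 0 */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}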
The MPI specicaon has evolved considerably since its inial release in 1994. For example, MPI-
1 already provided point-to-point and collecve message communicaon. Messages could contain
either primive or derived data types in packed or unpacked data content. MPI-2 added dynamic
process creaon, one-sided communicaon, remote memory access, and parallel I/O.
Since there are lots of MPI implementaons with emphasizes on dierent aspects of high-performance
compung, Open MPI [17], an MPI-2 implementaon, evolved to combine these technologies and
resources with the main focus on the components concepts. The specicaon is very extensive, with
almost 400 API calls.
MPI is portable, and in general, an MPI program can run on both shared memory and distributed
memory systems. However, for performance reasons and due to the distributed nature of the model,
there might exist multiple copies of the global data on a shared memory machine, resulting in an
increased memory requirement. Message buffers also add to the overhead of MPI on shared-
memory platforms [18]. Furthermore, because the API is both low level and very extensive, MPI
programming, especially for performance, tends to be complicated.
7.6.4 OpenCL
OpenCL is an open standard for parallel computing using heterogeneous architectures [19]. Arm provides
an implementation as part of the Compute Library. One of the main objectives of OpenCL is to increase
portability across different platforms and devices, e.g., GPUs, multicore processors, and other accelerators
such as FPGAs, as well as across operating systems. OpenCL provides an abstract platform model and an
abstract device model [20]. The platform (Figure 7.8) consists of a host and a number of compute devices.
Figure 7.8: OpenCL platform model (from [20]).
Each compute device (Figure 7.9) consists of a number of compute units which each comprise a number
of processing elements. All compute units can access the shared compute device memory (which consists
of a global and a constant memory), optionally via a shared cache; each compute unit has local memory
accessible by all processing elements, and private memory per processing element.
Figure 7.9: OpenCL device model (from [20]).
The programming framework of OpenCL consists of an API for controlling the operation of
the devices and the transfer of data and programs between the host memory and the device memory, and
a language for writing kernels (the programs running on the devices) based on C99, with the following
restrictions: no function pointers; no recursion; no variable-length arrays; no irreducible control flow.
Furthermore, as it is assumed that the memory space of the compute device is not under control of the
host OS and that it does not run its own OS, system calls are not supported either. These restrictions
originate from the nature of typical OpenCL devices, in particular, GPUs.
Figure 7.10: NDRanges, work-groups and work-items (from [20]).
Although OpenCL supports task-parallel programming, its main model is data parallelism. To divide
a data space over the compute units and processing elements, OpenCL provides the concepts of the
n-dimensional range (NDRange), work-groups and work-items, as illustrated in Figure 7.10 for a 2-D
space. The NDRange specifies how many threads will be used to process the data set. Note that this
can be larger than the actual number of hardware threads, in which case OpenCL will schedule the
threads on the available hardware. The NDRange can be further split into a global range and a local
range. To illustrate this usage, consider a 1-D case for a device with 16 compute units which each have
128 threads, and we want to map exactly one hardware thread per element in the NDRange. In that
case, the global NDRange will be 16*128 and the local NDRange 128. Now assume that the data to
be processed is an array of 64M words; then we have to process 32,768 elements per hardware
thread. We can use the global NDRange index and global size to identify which portion of the array
a thread must process, as shown in the following code snippet:
Lisng 7.6.10: OpenCL example C
1 // aSize is the size of array, i.e. 64M
2 __kernel square(__global oat* a, __global oat a_squared, const int aSize) {
3
4 int gl_id = get_global_id(0); // 0 .. 16*128-1
5 int gSize = get_global_size(0); // 16*128
6 // alternatively
7 int n_groups = get_num_groups(0); // 16
8 int l_id = get_local_id(0); // 0 .. 127
9 int gr_id = get_group_id(0); // 0 .. 15
10
11 int wSize = aSize/gSize; // 32,768
12
13 int start = gl_id*wSize;
14 int stop = (gl_id+1)*wSize;
15 for (int idx = start; idx<stop; idx++) {
16 a_squared[idx]=a[idx]*a[idx];
17 }
18 }
Alternavely we could use the local NDRange index, group index and number of workgroups, the
relaonship is as follows:
work_group_size = global_size/number_of_work_groups
global_id = work_group_id*work_group_size+local_id
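A sketch of the same kernel written in terms of these quantities (a hypothetical variant of Listing 7.6.10, using only standard OpenCL C built-ins) would look like this:

__kernel void square_local(__global float* a, __global float* a_squared, const int aSize) {
    int n_groups = get_num_groups(0);                  // e.g. 16 work-groups
    int wg_size  = get_local_size(0);                  // e.g. 128 work-items per group
    int gl_id    = get_group_id(0) * wg_size + get_local_id(0);   // reconstructed global id
    int wSize    = aSize / (n_groups * wg_size);       // elements per work-item
    for (int idx = gl_id * wSize; idx < (gl_id + 1) * wSize; idx++)
        a_squared[idx] = a[idx] * a[idx];
}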
The OpenCL host API is quite large and fine-grained; we refer the reader to the specification [20].
We have created a library called oclWrapper¹ to simplify OpenCL host code for the most common
scenarios. Using this wrapper, a typical program looks like this:
Lisng 7.6.11: OpenCL wrapper example C
1 // Create wrapper for default device and single kernel
2 OclWrapper ocl(srclename,kernelname,opts);
3
4 // Create read and write buers
5 cl::Buerrbuf=ocl.makeReadBuer(sz);
6 cl::Buerwbuf=ocl.makeWriteBuer(sz);
7
8 // Transfer input data to device
9 ocl.writeBuer(rbuf,sz,warray);
10
11 // Set up index space
12 ocl.enqueueNDRange(globalrange, localrange);
13
14 // Run kernel
15 ocl.runKernel(wbuf,rbuf ).wait();
16
17 // Read output data from device
18 ocl.readBuer(wbuf,sz,rarray);
First, we create an instance of the OclWrapper class, which is our abstraction for the OpenCL host
API. The constructor takes the kernel file name, the kernel name, and some options, e.g., to specify which
device to use. Then we create buffers; these are objects used by OpenCL to manage the transfer of
data between host and device. Then we transfer the input data for the device (via what in OpenCL is
called the read buffer). Then we set up the NDRange index space and run the kernel. Finally, we read
the output data (through what OpenCL calls the write buffer).
7.6.5 Intel threading building blocks (TBB)
Intel threading building blocks (TBB) is an open-source, object-oriented C++ template library
for parallel programming originally developed by Intel [21, 22]. It is not specific to the Intel CPU
architecture and works well on the Arm architecture², because it is implemented using the POSIX
pthread API. Intel TBB contains several templates for parallel algorithms, such as parallel_for and
parallel_reduce. It also contains useful parallel data structures, such as concurrent_vector
and concurrent_queue. Other important features of Intel TBB are its scalable memory allocator
as well as its primitives for synchronization and atomic operations.
TBB abstracts the low-level threading details. However, the tasking comes with an overhead.
Conversion of legacy code to TBB requires restructuring certain parts of the program to fit the
TBB templates. Moreover, there is a significant overhead associated with the sequential execution
of a TBB program, i.e., with a single thread [23].
¹ https://github.com/wimvanderbauwhede/OpenCLIntegration
² There is currently no tbb package in Raspbian for the Raspberry Pi 3. However, it is easy to build tbb from source, using the following command: make tbb CXXFLAGS="-DTBB_USE_GCC_BUILTINS=1 -DTBB_64BIT_ATOMICS=0"
A task is the central unit of execution in TBB, which is scheduled by the library's runtime engine. One
of the advantages of TBB over OpenMP is that it does not require specific compiler support. TBB is
based entirely on runtime libraries.
7.6.6 MapReduce
MapReduce, originally developed by Google [24], has become a very popular model for processing
large data sets, especially on large clusters (cloud computing). The processing consists of partitioning
the dataset to be processed and defining map and reduce functions. The map functionality is
responsible for parallel processing of a large volume of data and generating intermediate key-value
pairs. The role of the reduce functionality is to merge all the intermediate values with the same
intermediate key.
Because of its simplicity, MapReduce has quickly gained in popularity. The partitioning,
communication and message passing, and scheduling across different nodes are all handled by the
runtime system so that the user only has to express the MapReduce semantics. However, its use is
limited to scenarios where the dataset can be operated on in an embarrassingly parallel fashion. The
MapReduce specification does not assume a shared or distributed memory model. Although most of
the implementations have been on large clusters, there has been work on optimizing it for multicores
[25]. Popular implementations of the MapReduce model are Spark and Hadoop.
7.7 Summary
In this chapter, we have introduced the concepts of concurrency and parallelism, explained the
difference between them, and looked at why both are essential in modern computer systems. We have
studied how the Arm hardware architecture and the Linux kernel handle and support concurrency
and parallelism. In particular, we have discussed the synchronization primitives in the kernel (atomic
operations, locks, semaphores, barriers, etc.) and how they rely on hardware features; we have also
looked at the kernel support for parallelism, in particular in terms of the scheduler and the control over
the placement of threads.
We have introduced the data-parallel and task-parallel programming models and briefly discussed
a number of popular practical parallel programming frameworks.
7.8 Exercises and questions
1. Implement a solution to the dining philosophers problem in C using the POSIX threads API.
2. Create a system of N threads that communicate via static arrays of size N defined in each thread,
using condition variables and mutexes.
3. Write a data-parallel program that produces the sum of the squares of all values in an array, using
pthreads and using OpenMP.
7.8.1 Concurrency: synchronization of tasks
1. What is a critical section? When is it important for a task to enter a critical section?
2. Could a task be pre-empted while executing its critical section?
3. What is the difference between a semaphore and a mutex?
4. What is a spin lock, and what are its properties, advantages and disadvantages?
5. Sketch the operations required for two tasks using semaphores to perform mutual exclusion of
a critical section, including semaphore initialization.
6. Sketch the operations required to synchronize two tasks, including semaphore initialization.
7. Specify the possible order of the code executed by two tasks synchronized using semaphores,
running on a uniprocessor system.
8. What Pthreads concept is provided to enable meeting such synchronization requirements? Sketch
how a typical task uses this concept in pseudocode.
9. Sketch the pseudocode for the typical use of POSIX condition variables and mutexes to implement
a thread-safe queue.
10. Explain the concept of shareability domains in the Arm system architecture.
7.8.2 Parallelism
1. Discuss the hardware support for parallelism in Arm multicore processors.
2. What is processor affinity, and how can controlling it benefit your parallel program?
3. Given unlimited parallelism, what is the big-O complexity for a merge sort? And what is it given
limited parallelism?
4. Explain the OpenCL model of data parallelism.
5. When would you call pthread_exit() instead of exit()?
References
[1] E. W. Dijkstra, "Over seinpalen," 1962, circulated privately. [Online]. Available: http://www.cs.utexas.edu/users/EWD/ewd00xx/EWD74.PDF
[2] ——, "A tutorial on the split binary semaphore," Mar. 1979, circulated privately. [Online]. Available: http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD703.PDF
[3] Arm Synchronization Primitives Development Article, Arm Ltd, 8 2009, issue A. [Online]. Available: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0008a/index.html
[4] L. Lindholm, "Memory access ordering part 3 - memory access ordering in the Arm Architecture," 2013. [Online]. Available: https://community.arm.com/processors/b/blog/posts/memory-access-ordering-part-3---memory-access-ordering-in-the-arm-architecture
[5] Arm Cortex-A Series - Programmer's Guide for ARMv8-A - Version: 1.0, Arm Ltd, 3 2015, issue A. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/DEN0024A_v8_architecture_PG.pdf
[6] L. Lindholm, "Memory access ordering part 2 - barriers and the Linux kernel," 2013. [Online]. Available: https://community.arm.com/processors/b/blog/posts/memory-access-ordering-part-2---barriers-and-the-linux-kernel
[7] Arm Compiler Version 6.01 armasm Reference Guide, Arm Ltd, 12 2014, issue B. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.dui0802b/ARMCT_armasm_reference_guide_v6_01_DUI0802B_en.pdf
[8] Arm® Cortex®-A53 MPCore Processor Advanced SIMD and Floating-point Extension Technical Reference Manual, Revision r0p4, Arm Ltd, 1 2016, revision G. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0502g/DDI0502G_cortex_a53_fpu_trm.pdf
[9] Arm Generic Interrupt Controller Architecture Specification, GIC architecture version 3.0 and version 4.0, Arm Ltd, 8 2017, issue D. [Online]. Available: https://silver.arm.com/download/download.tm?pv=1438864
[10] Arm® Architecture Reference Manual – ARMv8, for ARMv8-A architecture profile, Arm Ltd, 12 2017, issue C.a. [Online]. Available: https://silver.arm.com/download/download.tm?pv=4239650&p=1343131
[11] Migrating a software application from ARMv5 to ARMv7-A/R, Version 1.0, Application Note 425, Arm Ltd, 7 2014, issue A. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.dai0425/DAI0425_migrating_an_application_from_ARMv5_to_ARMv7_AR.pdf
[12] G. Lim, C. Min, and Y. Eom, "Load-balancing for improving user responsiveness on multicore embedded systems," in Proceedings of the Linux Symposium, 2012, pp. 25–33.
[13] W. Vanderbauwhede and S. W. Nabi, "A high-level language for programming a NoC-based dynamic reconfiguration infrastructure," in 2010 Conference on Design and Architectures for Signal and Image Processing (DASIP), Oct 2010, pp. 7–14.
[14] E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang, "The design of OpenMP tasks," Parallel and Distributed Systems, IEEE Transactions on, vol. 20, no. 3, pp. 404–418, 2009.
[15] OpenMP Application Programming Interface Version 4.5, OpenMP Architecture Review Board, 11 2015. [Online]. Available: http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf
[16] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: portable parallel programming with the message-passing interface. MIT Press, 1999, vol. 1.
[17] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine et al., "Open MPI: Goals, concept, and design of a next-generation MPI implementation," in Recent Advances in Parallel Virtual Machine and Message Passing Interface. Springer, 2004, pp. 97–104.
[18] H. Jin, D. Jespersen, P. Mehrotra, R. Biswas, L. Huang, and B. Chapman, "High-performance computing using MPI and OpenMP on multi-core parallel systems," Parallel Computing, vol. 37, no. 9, pp. 562–575, 2011.
[19] J. E. Stone, D. Gohara, and G. Shi, "OpenCL: A parallel programming standard for heterogeneous computing systems," Computing in Science & Engineering, vol. 12, no. 3, p. 66, 2010.
[20] The OpenCL Specification Version: 2.2, Khronos OpenCL Working Group, 5 2017. [Online]. Available: https://www.khronos.org/registry/OpenCL/specs/opencl-2.2.pdf
[21] J. Reinders, Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O'Reilly Media, Inc., 2007.
[22] Intel, "Threading building blocks," 2015, https://www.threadingbuildingblocks.org/
[23] L. T. Chen and D. Bairagi, "Developing parallel programs - a discussion of popular models," Technical report, Oracle Corporation, Tech. Rep., 2010.
[24] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[25] Y. Mao, R. Morris, and M. F. Kaashoek, "Optimizing MapReduce for multicore architectures," Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Tech. Rep. Citeseer.
Chapter 8
Input/output
Operang Systems Foundaons with Linux on the Raspberry Pi
202
8.1 Overview
While the conceptual von Neumann architecture only presents processor and memory as computer
components, in fact, there is a wide variety of devices that users hook up to computers. These devices
facilitate input and output (IO) to enable the computer system to interact with the real world. This
chapter explores the OS structures and mechanisms that are used to communicate with such devices
and to control them.
What you will learn
Aer you have studied the material in this chapter, you will be able to:
1. Sketch the hardware organizaon and datapaths supporng device interacon.
2. Comprehend the raonale for the disncve Linux approach to supporng devices.
3. Implement simple device driver and interrupt handler rounes.
4. Jusfy the need for direct-memory access for certain classes of devices.
5. Idenfy buering strategies in various parts of the system.
6. Appreciate the requirement to minimize expensive block memory copy operaons between
data regions.
8.2 The device zoo
A vast variety of devices may be connected to your Raspberry Pi, using a range of connection ports
and protocols.
Modern devices vary wildly in size, price, bandwidth, and purpose. Input devices receive information
from the outside world, digitize it, and enable it to be processed as data on the computer. In terms
of input devices, a push-button (perhaps attached to the Raspberry Pi GPIO pins) is a simple input
device, with a binary {0, 1} value. A high-resolution USB webcam is a more complex input device,
with a large pixel array of inputs to be sampled. Output devices take data from the machine and
represent this, or respond to it in some way. A simple output device is an LED, which is either on
or off. The green on-board activity LED may be turned on or off with a simple shell command, as
shown below.
Lisng 8.2.1: Controlling the on-board LED Bash
1 ## these commands must be executed with root privileges
2 # turn on the green LED
3 echo 1 >/sys/class/leds/led0/brightness
4 # turn o the green LED
5 echo 0 >/sys/class/leds/led0/brightness
A more complex output device might be a printer, connected via USB, which is capable of producing
pages of text and graphics at high speed. Figure 8.1 shows an indoor environmental sensor node at
the University of Glasgow, with a range of input and output devices attached to a Raspberry Pi.
Figure 8.1: Raspberry Pi sensor node deployed at the University of Glasgow. Photo by Kristian Hentschel.
8.2.1 Inspect your devices
It is possible to inspect some of the devices that are attached to your Raspberry Pi. The lsusb
command will display information about devices that are connected to your Pi over USB. Observe that
each device has a unique ID. Also notice that the Ethernet adapter is connected via USB, which is the
reason for slow network performance on the Raspberry Pi.
The lsblk command will display information about block devices, which are generally storage
devices, connected to your Pi. Figure 8.2 shows the reported block devices on a Raspberry Pi 3 with
an 8GB SD card. File system mount points for each partition are given. Note that sda1 and mmcblk0
alias to the same physical device. The next chapter covers file systems, presenting a more in-depth
study of block storage facilities in Linux.
Figure 8.2: Typical output from the lsblk command.
8.2.2 Device classes
Look at the /proc/devices file to see devices that are registered on your system. This file shows
that Linux distinguishes between two fundamental classes of devices: character and block devices.
A character device transfers data at byte granularity in arbitrary quantities. Data is accessed as a
stream of bytes, like a file, although it may not be possible to seek to a new position in the stream.
Example character devices include /dev/tty, which is the current interactive terminal, and /dev/
watchdog, which is a countdown timer.
A block device transfers data in fixed-size chunks called blocks. These large data transfers may
be buffered by the OS. A block device supports a file system that can be mounted, as described in
Chapter 9. Example block devices include storage media like a RAM disk or an SD card (which may
be known as /dev/mmcblk0 on your system).
Other classes of device include network devices, which operate on packets of data, generally
exchanged with remote nodes. See Chapter 10 for more details.
8.2.3 Trivial device driver
To present the typical Linux approach to devices, this section implements a trivial character device
driver. A driver is a kernel module that provides a set of functions enabling the device to be mapped
to a file abstraction. Once the module is loaded, we can add a device file for it and interact with the
device via the file.
The C code below implements the trivial device driver as a kernel module. This is a character-level
device that returns a string of characters when it is read. In homage to the inimitable Douglas Adams,
our device is called 'The Meaning of Life,' and it supplies an infinite stream of * characters, which have
decimal value 42 in ASCII or UTF8 encoding.
The key Linux API call is register_chrdev, which allows us to provide a struct of file operations to
implement interaction with the device. The only operation we define is read, which returns the *
characters. The registration function returns an int, which is the numeric identifier the Linux kernel
assigns to this device.
We use this identifier to 'attach' the driver to a device file, via the mknod command. See the bash code
below for full details of how to compile and load the kernel module, attach the driver to a device file,
then read some data.
The stream of characters appears fairly slowly when we cat the device file. This is because our code is
highly inefficient; we use the copy_to_user call to transfer a single character at a time from kernel
space to user space.
Lisng 8.2.2: Example device driver C
1 #include <linux/cdev.h>
2 #include <linux/errno.h>
3 #include <linux/fs.h>
4 #include <linux/init.h>
5 #include <linux/kernel.h>
6 #include <linux/module.h>
7 #include <linux/uaccess.h>
8
9 MODULE_LICENSE("GPL");
10 MODULE_DESCRIPTION("Example char device driver");
11 MODULE_VERSION("0.42");
12
13 static const char *fortytwo = "*";
14
15 static ssize_t device_file_read(struct file *file_ptr,
16 char __user *user_buffer,
17 size_t count,
18 loff_t *position) {
19 int i = count;
20 while (i--)
21 if (copy_to_user(user_buffer + i, fortytwo, 1) != 0)
22 return -EFAULT;
23 return count;
24 }
25
26 static structle_operationsdriver_fops={
27 .owner = THIS_MODULE,
28 .read =device_le_read,
29 };
30
31 static intdevice_le_major_number=0;
32 static const char device_name[] = "The-Meaning-Of-Life";
33
34 static int register_device(void) {
35 int result = 0;
36 result = register_chrdev(0, device_name, &driver_fops);
37 if( result < 0 ) {
38 printk(KERN_WARNING "The-Meaning-Of-Life: "
39 "unable to register character device, error code %i", result);
40 return result;
41 }
42 device_file_major_number = result;
43 return 0;
44 }
45
46 static void unregister_device(void) {
47 if (device_file_major_number != 0)
48 unregister_chrdev(device_file_major_number, device_name);
49 }
50
51 static int simple_driver_init(void) {
52 int result = register_device();
53 return result;
54 }
55
56 static void simple_driver_exit(void) {
57 unregister_device();
58 }
59
60 module_init(simple_driver_init);
61 module_exit(simple_driver_exit);
Lisng 8.2.3: Using the new device Bash
1 sudo make -C /lib/modules/`uname -r`/build M=`pwd` modules
2 sudoinsmodmeaningoife.ko
3 DEVNUM=`cat /proc/devices | grep Meaning | cut -d' ' -f 1`
4 sudo mknod /dev/meaning c $DEVNUM 0
5 cat /dev/meaning
6 ^C
8.3 Connecng devices
8.3.1 Bus architecture
Since the Raspberry Pi is built around a commercial system-on-chip soluon, which is also used
for mobile phone devices, it has a rich set of direct IO connecons. Figure 8.3 presents this IO
connecvity at an abstract level.
Some connecons are point-to-point, such as the UART (universal asynchronous transmier/receiver)
for direct device to device communicaon. Others allow mulple devices to share a bus, i.e., signals
travel along shared wires and are directed to the appropriate device. The I
2
C bus supports over 1000
devices; these share data, clock and power wires, with each device having a unique address to direct
message packets.
Some IO interfaces are principally for output, such as HDMI for video output to screen. Other
interfaces are for input, such as the CSI (Camera Serial Interface) for digital cameras. Many interfaces,
like the Ethernet network connecon, are bidireconal in that they support both input and output.
Generally, IO is encoded as digital signals. A small number of interfaces use analog signals, such as
the audio-out port. The GPIO signals are all digital; unlike Arduino devices, the Raspberry Pi does not
include a built-in analog-to-digital converter.
Figure 8.3: IO architectural diagram for Raspberry Pi.
In terms of bandwidth, low bandwidth connections (like those on the right-hand side of the SoC in
Figure 8.3) operate around 10 kbps. High bandwidth connections (like those at the bottom of the SoC
in Figure 8.3) operate around 100 Mbps. One peculiarity of the Raspberry Pi architecture is that the
Ethernet piggybacks onto the USB interface, which sometimes restricts network bandwidth.
More conventional, larger computers may have higher performance buses such as PCI Express. These
are useful for powerful devices such as graphics cards that need to process and transfer bulk data
extremely rapidly.
8.4 Communicang with Devices
8.4.1 Device Abstracons
From user space, devices generally appear like les, and processes interact with devices using standard
le API calls like open and read. Some devices support special commands, accessed using the generic
ioctl system call on Linux. We use ioctl for device-specic commands that cannot be mapped
easily onto the le API.
A simple example involves the console. It is possible to set the status LEDs for an aached keyboard
using ioctl calls. The Python script below ashes the scroll lock on then o for two seconds. Try this on
your Raspberry Pi with a USB keyboard aached.
Lisng 8.4.1: Flash Keyboard LEDs with ioctl Python
1 import fcntl
2 import os
3 import time
4
5 KDSETLED = 0x4b32
6 SCROLL_LED = 0x01
7 NUMLK_LED = 0x02
8 CAPSLK_LED = 0x04
9 RESET_ALL = 0x08
10
11 console_fd = os.open('/dev/console', os.O_NOCTTY)
12 fcntl.ioctl(console_fd, KDSETLED, SCROLL_LED)
13 time.sleep(2)
14 fcntl.ioctl(console_fd, KDSETLED, 0)
15 time.sleep(2)
16 fcntl.ioctl(console_fd, KDSETLED, RESET_ALL)
From kernel space in Linux on Arm, devices are memory-mapped. The kernel device handling code
writes to memory addresses to issue commands to devices and uses memory accesses to transfer data
between device and machine memory.
8.4.2 Blocking versus non-blocking IO
From user space, when you issue an IO command, it may return immediately (non-blocking), or it may
wait (blocking) until the operation completes, when all the data is transferred. The key problem with
blocking is that IO can be slow, so waiting for IO to complete may take a long time. The thread that
initiated the blocking IO is unable to do any other useful work while it is waiting.
On the other hand, a non-blocking IO call returns immediately, performing as much data transfer as is
currently possible with the specified device. If no data transfer can be performed, an error status code
is returned.
In terms of Unix file descriptor flags, the O_NONBLOCK flag indicates that an open file should support
non-blocking IO calls. We illustrate this in the source code below, by reading bytes from the /dev/
random device. This device generates cryptographically secure random noise, seeded by interactions
with the outside world such as human interface events and network packet arrival times.
If there is insucient entropy in the system, then reads to /dev/random can block waing for more
random interacons to occur. Execute the Python script shown below for several mes; see how long
it takes to complete. You might be able to speed up execuon by moving and clicking your USB mouse
if it is connected to your Raspberry Pi.
Lisng 8.4.2: Reading data from /dev/random Python
1 import os
2
3 r = os.open('/dev/random', os.O_RDONLY)
4 x = os.read(r, 100)
5 print('read %d bytes' % len(x))
6 if len(x) > 0:
7 print(ord(x[len(x)-1]))
The script drains randomness from the system; we top up the randomness with user events like mouse
movement. When there is little randomness, the call to read blocks, waiting for data from /dev/random.
Now modify the Python script to make the read operations non-blocking. Do this by changing the
flags in the open call to be os.O_RDONLY | os.O_NONBLOCK. When we execute the script again,
it always returns immediately. If there is no random data available, then it reports an OSError.
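The same non-blocking behavior can be observed from C. The snippet below is a minimal sketch of our own (not code from the book's repository) that opens /dev/random with O_NONBLOCK and checks for EAGAIN instead of blocking:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[100];
    int fd = open("/dev/random", O_RDONLY | O_NONBLOCK);   /* non-blocking open */
    if (fd < 0) { perror("open"); return 1; }
    ssize_t n = read(fd, buf, sizeof(buf));                 /* returns immediately */
    if (n < 0 && errno == EAGAIN)
        printf("no random data available right now\n");     /* a blocking read would have waited */
    else
        printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}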
8.4.3 Managing IO interactions
There are three general approaches to interacting with IO devices, in terms of structuring a
'conversation' or communication session:
1. Polling;
2. Interrupts;
3. Direct memory access (DMA).
The particular approach is generally implemented at device driver level; it is not directly visible to the
end-user. Rather, the approach is a design decision made by the manufacturer of the hardware device
in collaboration with the developer of the software driver.
Subsequent paragraphs explain the three mechanisms and their relative merits. The idea is that a
device has the information we want to fetch into memory, and we need to manage this data transfer.
(Alternatively, the device may require the information we have in memory, and we need to handle this
transfer.)
The cartoon illustration in Figure 8.4 presents an analogy to compare the different approaches. The
customer (on the left-hand side) is like a CPU requesting data; the delivery depot (on the right-hand
side) is like a device; the package delivery is like the data transfer from device to CPU. In each of the
three cases, this transfer is coordinated differently.
Figure 8.4: Parcel delivery analogy for IO transfer mechanisms. Image owned by the author.
Polling
Device polling is acve querying of the hardware status by the client process, i.e., the device driver.
This is used for low-level interacons with simple hardware. Generally, there is a device status ag or
word and the process connually fetches this status data in a busy/wait loop. The pseudo-code below
demonstrates the polling mechanism.
Lisng 8.4.3: Typical device polling code C
1 while (num_bytes) {
2 while (device_not_ready())
3 busy_wait();
4 if (device_ready()) {
5 transfer_byte_of_data();
6 num_bytes--;
7 }
8 }
Soware support for polling is straighorward, as outlined above. It is also easy to implement the
appropriate hardware. However, polling may be inecient in terms of wasted CPU cycles during the
busy/wait loops, parcularly when there is a signicant disparity in speed between CPU and device.
Interrupts
Imagine your phone is ringing right now. You stop reading this book to answer the call. You have been
interrupted! That’s precisely how IO interrupts work. Normal process execuon is temporarily paused,
and the system deals with the IO event before resuming the task that was interrupted.
Interrupt handlers are like system event handlers. A handler routine may be registered for a particular
interrupt. When the interrupt occurs (physically, when a pin on the processor goes high), the system
changes mode and vectors to the interrupt handler.
Section 8.5 explains the details regarding how to define and install an interrupt handler in Linux. This
is probably the most common way to deal with IO device interaction.
Direct memory access
The motivation underlying direct memory access (DMA) is to minimize processor involvement in IO
data transfer. For polling and interrupts (collectively known as programmed IO), the processor explicitly
receives each word of data from the device and writes it to a local memory buffer, or vice versa for
data transfer to the device.
With DMA, the processor merely initiates the transfer of a large block of memory, then receives
a notification (via an interrupt) when the entire transfer is completed. This reduces context switching
overhead from being linear in the data transfer size to a small, constant cost.
The key complexity of DMA is that the hardware device must be much more intelligent, since it needs
to interface directly with the memory controller to copy data into the relevant buffer. DMA is most
useful for high-bandwidth devices such as GPUs and hard disk controllers, not for smaller-scale
embedded systems.
The Raspberry Pi has 16 DMA channels, which may be used for high-bandwidth access to IO
peripherals. Various open-source libraries exploit this facility.
8.5 Interrupt handlers
There are three kinds of events that are managed by the OS using the handler pattern. These are:
1. Hardware interrupts, which are triggered by external devices.
2. Processor exceptions, which occur when undefined operations (like divide-by-zero) are executed.
3. Software interrupts, which take place when user code issues a Linux system call, encoded as an Arm
SWI instruction.
This section focuses on hardware interrupts, but the mechanisms are similar for all three kinds of events.
Interrupt-driven IO can be more efficient than polling, given the relative speed disparity between
processor and IO device. A context switch occurs (from user mode to kernel mode) only when an
interrupt is generated, indicating there is IO activity to be serviced by the processor. There is minimal
busy-waiting with interrupts. Figure 8.5 presents a sequence diagram to show the interactions
between CPU and device for interrupt-driven programmed IO.
Figure 8.5: Sequence diagram to show communication between CPU and device during interrupt-driven IO.
8.5.1 Specific interrupt handling details
Look at the /proc/interrupts file on your Raspberry Pi. This lists the statistics for how many
interrupts have been seen by the system. Figure 8.6 shows an example from a Raspberry Pi 2 Model
B that has been running for several hours. Each interrupt has an integer identifier (left-most column),
a count of how many times it has been handled by CPU0 (second left column) and other CPUs (in
subsequent columns), and a name for the event or device that triggered the interrupt (right-most
column). The timer and dwc_otg devices are likely to have the highest interrupt counts.
Figure 8.6: Sample /proc/interrupts file.
Interrupt handlers, also known as interrupt service routines, are generally registered during system
boot time, or when a module is dynamically loaded into the kernel. An interrupt handler is registered
with the request_irq() function, from include/linux/interrupt.h. Required parameters
include the interrupt number, the handler function, and the associated device name. An interrupt
handler is unregistered with the free_irq() function.
In a mul-processor system, interrupt handlers should be registered for all processors, and
interrupts should be distributed evenly. Check /proc/interrupts to verify this if you have
a mulcore Raspberry Pi board.
It is conceivable that, while the system is servicing one interrupt, another interrupt may arrive
concurrently. Some interrupt handlers may be interrupted, i.e., they are re-entrant. Others may not be
interrupted. It is possible to disable interrupts while an interrupt handler is execung, using a funcon
like local_irq_disable() to prevent cascading interrupon.
8.5.2 Install an interrupt handler
The C code below implements a trivial interrupt handler for USB interrupt events. This is a shared
interrupt line, so multiple handlers may be registered for the same interrupt id. Check the /proc/
interrupts file to identify the appropriate integer interrupt number on your Pi, and modify the
source code INTERRUPT_ID definition accordingly.
Lisng 8.5.1: Trivial interrupt handler C
1 /* ih.c */
2
3 #include <linux/interrupt.h>
4 #include <linux/module.h>
5
6 MODULE_LICENSE("GPL");
7 MODULE_DESCRIPTION("Example interrupt handler");
8 MODULE_VERSION("0.01");
9
10 #dene INTERRUPT_ID 62 /* this is dwc_otg interrupt id on my pi */
11
12 static int count = 0; /* interrupt count */
13 static char* dev = "unique name";
14
15 static irqreturn_t custom_interrupt(int irq, void *dev_id) {
16 if (count++%100==0)
17 printk("My custom interrupt handler called");
18 return IRQ_HANDLED;
19 }
20
21 static int simple_driver_init(void) {
22 int result = 0;
23 result = request_irq(INTERRUPT_ID, custom_interrupt, IRQF_SHARED,
24 "custom-handler", (void *)&dev);
25 if (result < 0) {
26 printk(KERN_ERR "Custom handler: cannot register IRQ %d\n", INTERRUPT_ID);
27 return -EIO;
28 }
29 return result;
30 }
31
32 static void simple_driver_exit(void) {
33 free_irq(INTERRUPT_ID, (void *)&dev);
34 }
35
36 module_init(simple_driver_init);
37 module_exit(simple_driver_exit);
Compile this module as ih.ko, then install it with sudo insmod ih.ko. Then check dmesg to see
whether the module installed successfully and whether custom interrupt handler messages are being
reported in the kernel log. You can also look at /proc/interrupts to see whether your handler is
registered against the appropriate interrupt. Finally, execute sudo rmmod ih to uninstall the module.
A useful 'real' interrupt handler example is in linux/drivers/char/sysrq.c, which handles the
magic SysRq key combinations to recover from Linux system freezes. This code is well worth a careful
inspection.
8.6 Efficient IO
One of the issues that makes IO slow is the constant need for context switches. When IO occurs, kernel-
level activity must take place. User-invoked system calls will vector into the kernel; so too do interrupts
generated by the hardware. Switching into the kernel takes time, switching processor mode and saving
user context. DMA minimizes kernel interventions in IO, which is why it is so much more efficient.
Another inefficiency in IO is excessive memory copying. Recall from our simple device driver example
that we used the copy_to_user function call to transfer data from kernel memory to user memory.
The problem is that user code cannot access data stored in kernel memory.
The technique of buffering improves performance. The objective is to batch small units of data into
a larger unit and process this in bulk. Buffering quantizes data processing. Effectively, a buffer is a
temporary storage location for data being transferred from one place to another.
The technique of spooling is useful for contended resources. A spool is like a queue; jobs wait in the
queue until they are ready. The canonical example is the printer spooler, but the technique also applies
to other slow peripheral devices. There may be multiple producers and a single consumer, with the
producers writing each job to the spooler much faster than the consumer can perform that job. These
techniques are used to accelerate IO by avoiding the need for processes to wait for slow IO devices.
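As a concrete illustration of buffering, here is a minimal user-space sketch (the helper names buffered_write and buffered_flush are our own, not a standard API): small writes are staged in a 4 KB buffer and pushed to the file descriptor with a single write() system call, much as the C library's stdio layer does internally. Error checking is omitted for brevity.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 4096            /* one block-sized staging buffer */

static char buffer[BUF_SIZE];
static size_t used = 0;

/* Push any staged bytes to the file descriptor in a single write() call. */
static void buffered_flush(int fd) {
    if (used > 0) {
        write(fd, buffer, used);
        used = 0;
    }
}

/* Stage a small piece of data; only call write() when the buffer is full. */
static void buffered_write(int fd, const char *data, size_t len) {
    if (len >= BUF_SIZE) {       /* oversized payload: bypass the buffer */
        buffered_flush(fd);
        write(fd, data, len);
        return;
    }
    if (used + len > BUF_SIZE)
        buffered_flush(fd);
    memcpy(buffer + used, data, len);
    used += len;
}

int main(void) {
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    for (int i = 0; i < 1000; i++)
        buffered_write(fd, "tick\n", 5);   /* 1000 tiny writes, few system calls */
    buffered_flush(fd);                    /* push out the final partial buffer */
    close(fd);
    return 0;
}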
8.7 Further reading
For a user-friendly introduction to interfacing devices with your Raspberry Pi, check out Molloy's highly
practical textbook [1] with its companion website. There are lots of ideas for simple projects involving
small-scale hardware components, building up to a Linux kernel module implementation task.
The Linux Device Drivers textbook from O'Reilly presents a comprehensive view of IO and the Linux
approach to device drivers [2]. The book is available online for free. Although it is fairly old, dealing
with Linux kernel version 2.6, the concept coverage is wide-ranging and still highly relevant.
8.8 Exercises and questions
8.8.1 How many interrupts?
Produce a simple script that parses the /proc/interrupts file and monitors the number of interrupts per
second. Why might it be sensible to check the file at minute intervals and divide by 60 to get the per-
second interrupt rate?
8.8.2 Comparative complexity
Draw a table with the following rows (each to be rated low / med / high) and one column per device:
Device driver implementation complexity.
Device hardware complexity.
Typical device cost.
Typical device speed.
Fill in this table for the following devices, estimating the relative costs and complexities for each device:
1. USB mouse;
2. Depth-sensing USB camera;
3. SATA disk controller;
4. Scrolling LED text display screen.
8.8.3 Roll your own Interrupt Handler
Develop a more interesting interrupt handler, based on the trivial example in Section 8.5.2. See
whether you can write a handler for a different interrupt event. Search online for helpful tutorials.
8.8.4 Morse Code LED Device
Imagine an LED that has a character device driver in Linux, so that when you write characters to the
device, the LED flashes the corresponding letters in Morse code.
You could choose to use your scroll lock key or Pi on-board status LED, as outlined in this chapter.
Alternatively, you might attach an external LED component to the GPIO pins.
You will need to implement a device driver with a definition for the write function, but you could use
the trivial character device driver from Section 8.2.3 as a template. You want the rate of Morse code
flashing to be readable, but it would be nice to allow the write operations to return while the Morse
code message is being (slowly) broadcast. What would you do if another write request occurs while
the first message is still in progress?
References
[1] D. Molloy, Exploring Raspberry Pi: Interfacing to the Real World with Embedded Linux. Wiley, 2016.
[2] J. Corbet, A. Rubini, and G. Kroah-Hartman, Linux Device Drivers, 3rd ed. O’Reilly, 2005,
hps://www.oreilly.com/openbook/linuxdrive3/book/
Chapter 9
Persistent storage
Operang Systems Foundaons with Linux on the Raspberry Pi
218
9.1 Overview
Where does data go when your machine is powered down? Volatile data, stored in RAM, will be lost;
however, data saved on persistent storage media is retained for future execution. A file system is a key OS
component that supports the consolidation of persistent data into discrete, manageable units called files.
The Linux design philosophy is often summarized as, 'Everything is a file.' All kinds of OS entities,
including processes, devices, pipes, and sockets, may be treated as files. For this reason, it is important
to have a good understanding of the Linux file system since it underpins the entire OS.
What you will learn
After you have studied the material in this chapter, you will be able to:
1. Illustrate the directed acyclic graph nature of the Linux file system.
2. Appreciate how the user-visible file system maps onto OS-level file system concepts and primitives.
3. Explain how file system directories work, to index and locate file contents.
4. Analyze the trade-offs involved in different file system design decisions, with reference to particular
implementations such as FAT and ext4.
5. Understand the need for file system consistency and integrity, identifying approaches to preserve
or repair this integrity.
6. Identify appropriate techniques for file system operations on a range of modern persistent storage
media.
9.2 User perspective on the file system
9.2.1 What is a file?
A file is a collection of data that is logically related; it somehow 'belongs' together. A file is a fine-
grained container for data; conceptually, it is the smallest discrete unit of data in a file system. Regular
files may contain textual data (read with utilities like cat or less) or binary data (read with utilities
like hexdump or strings). The file command will report details about a single file. It uses the built-in
stat file system call to determine basic information about the target file, and then it checks a set of
'magic' heuristics to guess the actual type of the file based on its contents.
Other utilities infer the type of a file from its extension (the letters after the dot in the filename).
However, this is not always a reliable guide to the file type, since the extension is simply a part of the
filename and can be modified by users.
In Linux, everything is a file (at least, everything appears to be a file). In simplest terms, this means
everything is addressable via a name in the file system, and these names can be the target of file
system calls such as stat. Entities that aren't actually regular files have distinct types. For instance,
if you execute ls -l in a directory, you will see the first character on each line specifies the distinct
type. For directories, this is d, for character devices, it is c, and for symbolic links it is l. The full set of
types is specified in /usr/include/arm-linux/sys/stat.h — look at this header file and search
for 'Test macros for file types'.
9.2.2 How are multiple files organized?
Collections of files can be grouped together into directories, sometimes called folders. A directory contains
files, including other directories. The file system abstraction is a skeuomorphism, designed to resemble
the familiar paper filing cabinet, as shown in Figure 9.1. Each file corresponds to a paper document;
a directory corresponds to a card folder; the entire file system corresponds to the filing cabinet.
Figure 9.1: Traditional filing cabinet containing folders with paper documents. Photo by author.
Linux has a single, top-level root directory, denoted as /, which is the ancestor directory of all other
elements in the file system. We might assume this rooted, hierarchical arrangement leads to a
tree-based structure, and this is often the graphical depiction of the hierarchy, e.g., in the Midnight
Commander file manager layout shown in Figure 9.2.
Figure 9.2: Midnight Commander file manager shows a directory hierarchy as a tree.
However, les can belong to mulple directories due to hard links. For example, consider this sequence
of commands:
Lisng 9.2.1: File creaon example Bash
1 cd /tmp
2 mkdir a; mkdir b
3 echohello>a/thele.txt
4 lna/thele.txtb/samele.txt
where /tmp/a/thele.txt and /tmp/b/samele.txt are actually the same le. Try eding one of them,
and then viewing the other. You will observe that the changes are carried over; also that the two
lenames have common metadata when viewed with ls -l. Maybe lenames should be considered
more like pointers to les, rather than the actual le themselves. This leads to a graph-like structure,
see Figure 9.3. However, if you try to remove one of the les, e.g., rm/tmp/a/thele.txt, then the
link is removed, but the le is sll present. It can be accessed via the other link.
Figure 9.3: Graphical view of multiple linked files that map to the same underlying data.
Note that les can belong to mulple directories, (i.e., have mulple hard links) but directories cannot
have extra links. For example, try to do ln /tmp/a /tmp/b/another_a and noce the error that
occurs. Addional hard links for directories are not allowed. This is because we want to prevent cycles
into the directory hierarchy. If we consider a link to be a directed edge in the directory graph, then we
want to enforce a directed acyclic graph. If the only nodes that can have mulple incoming edges are
regular les (i.e., nodes with no successors), then it is impossible to introduce cycles into the graph.
Directory cycles are undesirable since they make it more complex to traverse the directory hierarchy.
Also, it is possible to create cycles of ‘garbage’ directories that are unreachable from the root directory.
There is a further restricon on hard links created with the ln command: such links cannot span
across dierent devices. Although Linux presents the abstracon of a unied directory namespace
with a single root directory, actually mulple devices (disks and parons) may be incorporated into
this unied namespace. Because of the way in which hard links are encoded (see later secon on
inodes) Linux only supports hard links within a single device.
So links or symbolic links (abbreviated as symlinks) are much more exible. These are textual pointers
to paths in the le system. Use ln -s to set up a symlink. These links can be cyclical and can span
mulple devices, unlike hard links. The key property of symlinks is that they are merely strings, like the
lenames and paths you use for interacve commands on the terminal. The symlink strings are not
veried and may be ‘dangling’ links to non-existent les or directories.
9.3 Operations on files
There is a standard set of file-related actions that every Unix-derived OS must support, known as the
POSIX library functions for files and directories.
First, to operate on the data stored in a file, it is necessary to open the file, acquiring a file descriptor,
which is an integer identifier. The OS maintains a table of open files across the whole system; use the
lsof command to list currently open files.
When we open a file, we state our usage intentions: are we only reading? or writing? or appending to
the end of a file? These intentions are checked against the relevant file permissions. The operation fails,
and an error is returned (which the programmer must check), if there is a permission violation.
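For example, the C fragment below (a sketch, with an arbitrary filename) opens a file read-only and checks the return value; if the process lacks read permission, open fails and errno is set to EACCES.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* State our intention: reading only. */
    int fd = open("data.txt", O_RDONLY);
    if (fd == -1) {
        if (errno == EACCES)
            fprintf(stderr, "permission denied\n");
        else
            perror("open");
        return 1;
    }
    /* ... read from fd here ... */
    close(fd);
    return 0;
}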
This le descriptor should be closed when the process has nished operang on the le data. Too
many open le descriptors can cause problems for OS. There are strict limits imposed on the number
of open les, for performance reasons, to avoid kernel denial-of-service style aacks.
The ulimit -n command will display the open le limit for a single process. On your Raspberry Pi,
this might be set to 1024.
You can check that this limit is enforced with a simple Python script that repeatedly opens les and
retains the le descriptors:
Lisng 9.3.1: Open many les in rapid succession Python
1 i = 0
2 les=[]
3 while True:
4 les.append(open("le"+str(i)+".txt", "w+"))
5 i += 1
Noce this fails before creang 1024 les; some les are already open (such as the Python interpreter
and standard input, output and error streams).
There is also a system-wide open le limit, cat/proc/sys/fs/le-max to inspect this value.
The le /proc/sys/fs/le-nr shows the current number of open les across the whole system.
Once a process has acquired a file descriptor, as a result of a successful open call, it is possible to
operate on that file's data content. This may involve reading data from or writing data to the file.
There is the implicit notion of a position within a file, tracking where the pointer associated with the
file descriptor is 'at.' The pointer is implicitly at the beginning of the file with open (unless we specify
append mode, when it starts at the end). As we read and write bytes of data, we advance the pointer.
We can reset the pointer to an arbitrary position in the file with the lseek call. It is also possible to
change the size of an open file with the truncate call. Figure 9.4 shows the state transitions of a file
descriptor as these calls occur.
Figure 9.4: State machine diagram showing the sequence of file system calls (in red) for a single file.
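The sketch below (illustrative only; the filename and offsets are arbitrary) exercises these position-related calls: it opens a file, jumps the file pointer with lseek, reads a few bytes from that position, and then shrinks the file with ftruncate, the descriptor-based variant of truncate.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[16];
    int fd = open("log.txt", O_RDWR);
    if (fd == -1) { perror("open"); return 1; }

    lseek(fd, 100, SEEK_SET);              /* move the position to byte 100 */
    ssize_t n = read(fd, buf, sizeof buf); /* reading advances the position further */
    printf("read %zd bytes at offset 100\n", n);

    ftruncate(fd, 50);                     /* change the size of the open file */
    close(fd);
    return 0;
}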
File metadata updates, such as name, ownership, and permissions, are atomic. There is no need to
open the file for these operations; file system calls simply use the name of the file.
9.4 Operations on directories
Although directories appear to be like files, they are opened with a distinct API call, opendir, to allow
a program to iterate through the directory contents.
Directory modification operations are atomic, from the programmer's perspective. Operations like
moving, copying, or deleting files have file system API calls, but these require string filename paths
rather than open file descriptors. Note that all the standard bash file manipulation commands like mv
and rm have API equivalents for programmatic use.
In the same way, metadata updates can be performed programmatically, and appear to be atomic.
Again, these operations require string filename paths.
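A minimal sketch of the directory API: opendir yields a directory stream, readdir returns one entry at a time, and closedir releases the stream. The path used here is arbitrary.

#include <dirent.h>
#include <stdio.h>

int main(void) {
    DIR *d = opendir("/tmp");             /* distinct API call for directories */
    if (d == NULL) { perror("opendir"); return 1; }

    struct dirent *entry;
    while ((entry = readdir(d)) != NULL)  /* iterate through the directory contents */
        printf("%s\n", entry->d_name);

    closedir(d);
    return 0;
}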
9.5 Keeping track of open files
For each process, the Linux kernel maintains a table to track files that have been opened by that
process. The integer file descriptor associated with the open file, also known as a handle, corresponds
to an index into this per-process table. The table is called files_struct, defined in include/
linux/fdtable.h; it is a field of the task_struct process control block.
Each entry in the files_struct table has a pointer to a struct file object, which is defined in
include/linux/fs.h. These objects reside in the system-wide file table, defined in fs/file_table.c.
The struct file data structure maintains the current file position within the open file,
the permissions for accessing the file, and a pointer to a dentry object.
The dentry (short for 'directory entry') encodes the filename in the directory hierarchy and links the
name with the location of the file on a device, represented as an inode (see Section 9.10). This data
structure is defined in include/linux/dcache.h.
A single process may have multiple file descriptors, corresponding to multiple entries in the
files_struct table, that point to the same system-wide struct file object. This is possible with
the dup system call, which creates a fresh copy of a file descriptor.
Multiple processes may have their own distinct file descriptors, in their own files_struct tables, that
point to the same system-wide struct file object. This is possible because the per-process state is
cloned when a new process is forked, so the forked process will inherit open file descriptors from its
parent process.
In both the above situations, there is a single file offset. This means that if the file offset is modified via
one of the aliased file descriptors, then the offset is also changed for the other(s).
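The following sketch demonstrates this sharing with dup: both descriptors refer to the same struct file, so a write through one advances the offset seen by the other. The filename is arbitrary and error checking is trimmed.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd1 = open("shared.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int fd2 = dup(fd1);                    /* alias for the same struct file */

    write(fd1, "hello ", 6);               /* advances the shared offset */
    printf("offset via fd2: %ld\n", (long)lseek(fd2, 0, SEEK_CUR));  /* prints 6 */

    write(fd2, "world\n", 6);              /* continues where fd1 left off */
    close(fd1);
    close(fd2);
    return 0;
}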
It is also possible that different entries in the system-wide file table might point to the same directory
entry. This happens if multiple processes open the same file, or even if a single process opens the
same file several times. In these cases, each distinct struct file has its own associated file offset.
Figure 9.5: Open files are tracked in a per-process file table (left), which contains pointers into the system-wide file table (center), which references
directory location information to access the underlying file contents.
9.6 Concurrent access to files
The previous section introduced the notion of multiple processes accessing the same open file.
In general, multiple readers are straightforward. If each reading process has a distinct file descriptor
mapping onto a distinct struct file, then each reader has its own unique position in the file.
Although Linux permits concurrent writing processes, there may be problems and inconsistencies.
If a file is opened with the O_APPEND flag set, then the OS guarantees that writes will always safely
append, even with multiple writers. The issue here is that, while the two processes may append their
writes to the file in the correct order, this data may be interleaved between the processes.
It is possible to lock a file to prevent concurrent access by multiple processes. There are various ways
to perform file-based locking. The C code below demonstrates the use of lockf, which relies on the
underlying fcntl system call.
Listing 9.6.1: Lock the log.txt file for single writer access C
1 #include <stdio.h>
2 #include <stdlib.h>
3 #include <string.h>
4 #include <sys/file.h>
5 #include <unistd.h>
6
7 /* takes a single integer command-line
8 * parameter, specifying how long to
9 * sleep after each write operation
10 */
11 int main(int argc, char **argv) {
12
13 int t = atoi(argv[1]);
14 int i;
15 char msg[30];
16
17 int fd = open("log.txt", O_WRONLY|O_CREAT|O_APPEND, 0666);
18 if(fd == -1){
19 perror("unable to open file");
20 exit(1);
21 }
22 /* lock the open file */
23 if (lockf(fd, F_LOCK, 0) == -1) {
24 perror("unable to lock file");
25 exit(1);
26 }
27
28 for (i=0; i<10; i++) {
29 sprintf(msg, "sleeping for %d seconds\n", t);
30 write(fd, msg, strlen(msg));
31 sleep(t);
32 }
33
34 /* unlock file */
35 if (lockf(fd, F_ULOCK, 0) == -1) {
36 perror("unable to unlock file");
37 exit(1);
38 }
39 close(fd);
40 return 0;
41 }
If a particular file is already locked, then a subsequent call to lockf blocks until that file has been
unlocked. Try compiling this C code, then running two instances of the executable concurrently to
observe what happens: the first process should complete all its writes to the log before the second
process is allowed to write anything.
Note that this kind of file-based locking on Linux is only advisory. Processes may 'ignore' file locks
entirely and proceed to read from or write to open files without respecting locks.
9.7 File metadata
Metadata describes the properties of each file. There is a standard set of attributes that the Linux file
system supports directly, so these items are recorded for all files.
This includes user-centric metadata, such as the textual name and type of the file. The type is
conventionally encoded as part of the name, as a suffix after the final period character in the name.
Particular file system formats may impose restrictions on names, such as their length or permitted
characters.
The file name is a human-friendly label for the user to specify the file of interest. However, the file
system maintains a unique numeric identifier for each file, which is used internally. It is the case that
multiple names may actually map to the same file (i.e., the same numeric id) in the directed acyclic
graph directory structure of Linux, as explained in Section 9.2.2.
The size of the file is specified in bytes, i.e., its length. The file occupies some number of blocks on
a device, but these blocks may not be full if the file size is not a precise multiple of the block size.
Unlike null-terminated C-style strings, there is no explicit end-of-file (EOF) marker. Instead, we must
use the length of the file to determine when we reach the end of its data.
File access permissions metadata is supported in Linux. Each file has an owner (generally the creator
of the file, although the chown command can modify the owner). Each file has a group (to which the
owner may or may not belong; note the chgrp command can modify the group). The owner and group
are encoded as integer identifiers, which may be looked up in the relevant tables in the /etc/passwd
and /etc/group files.
For permissions, there are nine bits of metadata, three each for the owner, the group, and
everyone else. Each triple of bits (from most significant to least significant bit) encodes read, write, and
execute permission respectively. Figure 9.6 illustrates these permission bits. This metadata can be set
using the chmod command, followed by three octal numbers for the three triplets. More advanced
capabilities and fine-grained permissions are supported by the SELinux system.
Figure 9.6: The 9-bit permissions bitstring is part of each file's metadata—in this example, the owner can read and write to the file, all other users can only
read the file.
Timestamp elds record creaon me, most recent edit me, and most recent access me for each
le. These are recorded as seconds since 1970, the start of the Unix epoch. Since they are signed
32-bit integers, the maximum mestamp that can be encoded is some me on 19 January 2038.
Recent Linux patches have extended the timestamp fields to 64 bits, with support for nanosecond
granularity and a longer maximum date.
The most important administrative metadata is the actual location of the file data on the disk. The precise
details depend on the specific nature of the file system implementation, which we will cover in later sections.
Sometimes extra metadata is supported by graphical file managers like Nautilus (for Gnome) or
Dolphin (for KDE). These might include per-file application associations or graphical icons.
For specific kinds of files, application-specific metadata may be included within the file itself, e.g., MP3 audio
files include id3 tags for artist and title, and PDF files include page counts. While this is not natively supported
within the Linux file system, it might be parsed and rendered by custom file managers, e.g., see Figure 9.7.
Figure 9.7: Nautilus file manager parses and displays custom metadata for a PDF file.
9.8 Block-structured storage
A file system is an abstraction built on top of a secondary storage facility such as a hard disk or,
on a Raspberry Pi, an SD card.
Typical file systems depend on persistent, block-structured, random access storage.
Persistent means the data is preserved when the machine is powered off.
Block-structured means the storage is divided into fixed-size units, known as blocks. Each block may
be accessed via a unique logical block address (LBA).
Random access means the blocks may be accessed in any order, as opposed to constraining access
to a fixed sequential order (which would be the case for magnetic tape storage, for instance).
While magnetic hard disks have physical geometries, and data is stored in locations based on tracks
(circular strips on the disk) and sectors (sub-divisions of tracks), more recent storage media such as
solid-state storage do not replicate these layouts. In this presentation, we will deal in terms of logical
blocks, which is an abstraction that can be supported by all modern storage media.
So, a storage device consists of identically sized blocks, each with a logical address. This is similar
to pages in RAM (see Chapter 6), only blocks are persistent. Often the block size is the same as the
memory page size, to facilitate efficient in-memory caching of disk accesses.
We can examine the block size and the number of blocks for the Raspbian OS image installed on your
Raspberry Pi SD card. In a terminal, type
Listing 9.8.1: Simple stat command Bash
1 stat -fc %s /
to show the block size (in bytes) of your file system. This should be 4096, i.e., 4KB. To see the details
of free and used blocks in your file system, type
Listing 9.8.2: Another simple stat command Bash
1 stat -f /
and you should get a data dump like that shown in Figure 9.8. This displays the space occupied by
metadata (the inodes) and by actual file data (the data blocks).
Figure 9.8: Output from the stat command, showing file system block usage.
A block is the smallest granular unit of storage that can be allocated to a file. So a file containing just
10 bytes of data, e.g., hi.txt in the example below, actually occupies 4K on disk.
Listing 9.8.3: Different ways to measure file size Bash
1 echo "hello you!" > hi.txt
2 ls -l hi.txt # shows actual data size
3 du -h hi.txt # shows data block usage
This wasted space is internal fragmentation overhead, caused by fixed block sizes. The 4K block size
is generally a good trade-off value for general-purpose file systems.
This section has outlined block-structured storage at the device level; we present more details on
devices in the chapter covering IO. Next, we will explore how to build a logical file system on top of
these low-level storage facilities.
9.9 Constructing a logical file system
Given this block-based storage scheme, how do we build a high-level file system on top?
Some blocks must be dedicated to indexing, allowing us to associate block addresses with high-level
files and directories. Other blocks will be used to store user data, the contents of files. As outlined
above, the smallest space a file can occupy is a single block. Depending on the block size and the
average file size, this may cause internal fragmentation, where space is allocated to a file but unused
by that file.
A file system architect makes decisions about how to arrange sets of blocks for large files. There
are trade-offs to consider, such as avoiding space fragmentation and minimizing file access latency.
Possible strategies include:
Contiguous blocks: a large file occupies a single sequence of consecutive blocks. This is efficient
when there is lots of space, but can lead to external fragmentation problems (i.e., awkwardly sized,
unusable holes) when files are deleted, or files need to grow in size.
Indexed blocks: a large file occupies a set of blocks scattered all over the disk, with an
accompanying index to maintain block ordering. This reduces locality of disk access and requires
a large index overhead (like page tables for virtual memory). However, there are no external
fragmentation issues.
Linked blocks: a large file is a linked list of blocks, which may be scattered across the disk. There is
no fragmentation issue and no requirement for a complex index. However, it is now inefficient to
access the file contents in anything other than a linear sequence from the beginning.
Every concrete file system format incorporates such design decisions. First, we consider an abstraction
that allows Linux to handle multiple file systems in a scalable way.
9.9.1 Virtual file system
There are many concrete file systems, such as ext4 and FAT, which are reviewed later in this chapter.
These implementations have radically different approaches to organizing persistent data as files on
disks, reflecting diverse design decisions. In general, an OS must support a wide range of file systems,
to enable compatibility and flexibility.
The Linux virtual file system (VFS) is a kernel abstraction layer. The key idea is that the VFS defines a
common file system API that all concrete file systems must implement. The VFS acts as a proxy layer,
in terms of software design patterns. All file-related system calls are directed to the VFS, and then it
redirects each call to the appropriate concrete underlying file system.
Linux presents the abstraction of a unified file system, with a single root directory from which all other
files and directories are reachable. In fact, the VFS integrates a number of diverse file systems, which
are incorporated into the unified directory hierarchy at different mount points. Inspect /etc/mtab to
see the currently mounted file systems, their concrete file system types, and their locations within the
unified hierarchy.
The pseudo-file /proc/filesystems maintains a list of file systems that are supported in your Linux
kernel. Note that the nodev flag indicates the file system is not associated with a physical device.
Instead, the pseudo-files on such file systems are synthesized from in-memory data structures
maintained by the kernel.
A concrete file system is registered with the VFS via the register_filesystem call. The supplied
argument is a file_system_type, which provides a name, a function pointer to fetch the superblock
of the file system, and a next pointer. All file_system_type instances are organized as a linked list.
The global variable file_systems in fs/filesystems.c points to the head of this linked list.
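The fragment below is a heavily simplified sketch of such a registration (myfs and its helper functions are invented names; it assumes a kernel where file_system_type uses the .mount callback). The fill_super step, which would normally construct the superblock and root inode, is deliberately stubbed out, so this skeleton registers a file system type but cannot actually be mounted.

#include <linux/fs.h>
#include <linux/module.h>

/* Stub: a real file system would build its superblock and root inode here. */
static int myfs_fill_super(struct super_block *sb, void *data, int silent)
{
    return -ENOSYS;
}

static struct dentry *myfs_mount(struct file_system_type *fs_type,
                                 int flags, const char *dev_name, void *data)
{
    return mount_nodev(fs_type, flags, data, myfs_fill_super);
}

static struct file_system_type myfs_type = {
    .owner   = THIS_MODULE,
    .name    = "myfs",
    .mount   = myfs_mount,
    .kill_sb = kill_anon_super,
};

static int __init myfs_init(void)
{
    return register_filesystem(&myfs_type);  /* appended to the file_systems list */
}

static void __exit myfs_exit(void)
{
    unregister_filesystem(&myfs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");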
The superblock, in this context, is an in-memory data structure that contains key file system metadata.
There is one superblock instance corresponding to each mounted device. Some of this data comes
from disk (where there may be a file system block also called the superblock). Other information, in
particular the vector of function pointers named struct super_operations, is populated from
the concrete file system code base directly. See the listing below for details of the function pointers
that will be filled in by file system-specific implementations.
Listing 9.9.1: Vector of file system operations, from include/linux/fs.h C
1 struct super_operations {
2 struct inode *(*alloc_inode)(struct super_block *sb);
3 void (*destroy_inode)(struct inode *);
4 void (*dirty_inode) (struct inode *, int flags);
5 int (*write_inode) (struct inode *, struct writeback_control *wbc);
6 int (*drop_inode) (struct inode *);
7 void (*evict_inode) (struct inode *);
8 void (*put_super) (struct super_block *);
9 int (*sync_fs)(struct super_block *sb, int wait);
10 int (*freeze_super) (struct super_block *);
11 int (*freeze_fs) (struct super_block *);
12 int (*thaw_super) (struct super_block *);
13 int (*unfreeze_fs) (struct super_block *);
14 int (*statfs) (struct dentry *, struct kstatfs *);
15 int (*remount_fs) (struct super_block *, int *, char *);
16 void (*umount_begin) (struct super_block *);
17 int (*show_options)(struct seq_file *, struct dentry *);
18 int (*show_devname)(struct seq_file *, struct dentry *);
19 int (*show_path)(struct seq_file *, struct dentry *);
20 int (*show_stats)(struct seq_file *, struct dentry *);
21 // ...
22 };
When a device is mounted, the file system is incorporated into the VFS file hierarchy at the specified
location; the superblock is read via the appropriate function pointer, as specified in the named file
system's file_system_type. The mount system call performs this task, specifying the device
to be mounted, the appropriate concrete file system type, and the directory path at which the file
system should be mounted. Try inserting a USB stick in your Raspberry Pi and mounting it manually.
Use strace to trace system call execution. You may need to disable auto-mounting temporarily; also
use dmesg to find out the path of the device corresponding to your USB stick.
Listing 9.9.2: Mounting a USB stick Bash
1 sudo strace mount /dev/sda1 /mnt 2>&1 | grep mount
The superblock handles VFS interactions for an entire file system. Individual files are handled using
structures called inodes and dentries, which are introduced in subsequent sections.
9.10 Inodes
The inode (which stands for index node) is a core data structure that underpins Linux file systems.
There is one inode per entity (e.g., file or directory) in a file system. You can study the definition of
struct inode in the VFS source code at linux/fs.h. A simplified class diagram view of the inode
data structure is shown in Figure 9.9.
Figure 9.9: Class diagram representation of the inode data structure.
Each inode stores all the metadata associated with a file, including on-device location information
for the file data. Typical metadata items (e.g., owner identity, file size, and permissions) are stored
directly in the struct. Extended metadata (such as access control lists for enhanced security) is stored
externally, with pointers in the inode structure.
As outlined so far, the inode is a VFS-level, in-memory data structure. Other Unix OSs refer to these
structures as vnodes. Concrete file systems may have specialized versions of the inode. For instance,
compare the VFS struct inode definition in include/linux/fs.h with the ext4 variants
struct ext4_inode and struct ext4_inode_info in fs/ext4/ext4.h.
As well as being an in-memory data structure, the inode data is serialized to disk for persistent
storage. Generally, when a native Linux file system is created, a dedicated contiguous portion of the
block storage is reserved for inodes. Each inode is a fixed size, so there is a known limit on the number
of inodes (which implies a maximum number of files). Often the inode table is at the start of the
device. Each inode associated with the device has a unique index number, which refers to its entry in
the inode table. You can inspect the inode number for each file with the ls -i command. Look at the
inode numbers for files in your home directory:
Listing 9.10.1: Inspecting inode numbers for new files Bash
1 cd ~
2 ls -i
3 echo "hello" > foo
4 echo "hello again" > bar
5 ls -i foo bar
Note the large integer value associated with each file. Generally, newly created files will receive
consecutive inode numbers, as you might be able to see with the newly created foo and bar files
(presuming you do not already have files with these names in your home directory).
9.10.1 Multiple links, single inode
As outlined above, a file name is really just a pointer (a hard link) to an inode. Multiple file names (from
different paths in the file system) may map onto the same inode. The num_links field in the inode
keeps track of how many file names refer to this inode; effectively, this is a reference count.
The reference count is incremented with an ln command and decremented with a corresponding rm
command. When the reference count reaches zero, the inode is orphaned and may be deleted by the
OS, freeing up this slot in the table for fresh metadata associated with a new file.
Note that the inode does not contain links back to the filenames that are associated with this inode.
That info is stored in the directories, separately. The inode simply keeps a count of the number of live
links (valid filenames) that reference it.
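The reference count is visible from user space as the st_nlink field returned by stat. The C sketch below (arbitrary filenames) creates a hard link with the link system call and shows the count going from 1 to 2 and back.

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static void show_links(const char *path) {
    struct stat sb;
    if (stat(path, &sb) == 0)
        printf("%s: inode %lu, %lu link(s)\n", path,
               (unsigned long)sb.st_ino, (unsigned long)sb.st_nlink);
}

int main(void) {
    FILE *f = fopen("original.txt", "w");   /* create a file: one link */
    if (f) fclose(f);
    show_links("original.txt");

    link("original.txt", "alias.txt");      /* second hard link, same inode */
    show_links("original.txt");
    show_links("alias.txt");

    unlink("alias.txt");                    /* reference count drops back to 1 */
    show_links("original.txt");
    return 0;
}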
9.10.2 Directories
A directory, in abstract terms, is a key/value store or a dictionary. It associates file system entity names
(which are strings) with inode numbers (which are integers). An entity name might refer to a file or a
directory. There is a system-imposed limit on the length of an entity name, which is set to 255 (8-bit)
characters. Use getconf NAME_MAX / to confirm this. If you try to create a file name longer than
this limit, you will fail with a 'File name too long' error.
VFS does not impose a maximum number of entries in a single directory. The only limit on directory
entries is that each entry requires an inode, and there is a fixed number of inodes on the file system.
(Use df -i to inspect the number of free inodes, labeled as IFree.)
In every directory, there are two distinguished entries, namely . (pronounced 'dot') and ..
(pronounced 'dot dot').
. refers to the directory itself, i.e., it is a self-edge in the reference graph. The command cd . is
effectively a null operation.
.. refers to the parent directory. The command cd .. allows us to traverse up the directory
hierarchy to the root directory, /. Note that the parent of the root directory is the root directory
itself, i.e., root's parent is also a self-edge in the reference graph.
Each process has a current working directory (cwd). For instance, you can discover the working
directory of a bash process with the pwd command, or the working directory of an arbitrary process
with PID n by executing the command readlink /proc/n/cwd. Relative path names (i.e., those not
starting with the root directory /) are interpreted relative to the process's current working directory.
Figure 9.10: Simplified flow chart for Linux directory path lookup, based on code in fs/namei.c
(Figure 9.10 summary: lookup starts at a well-known directory, such as the root or the current working directory; the next directory name is extracted from the path and checked (does it exist, is it mounted, is it accessible?); lookup information is added to the nameidata structure and the cache is updated; this repeats until the last element of the path, at which point the final action is performed and a handle is returned, or a directory lookup error is reported.)
Absolute path names (i.e., those starting with the root directory /) are based on a path to a file that
starts from the root directory of the VFS file system. Note that paths may span multiple concrete file
systems since these are mounted at various offsets from the VFS root.
One of the key facilities provided by the directory is to map from a filename string to an inode number.
The algorithm presented in Figure 9.10 is a high-level overview of this translation process. You may
invoke this behavior on the command line by using the namei utility with a filename string argument.
Translation of filename strings to inode numbers is an expensive activity. A data structure called a
directory entry (dentry) stores this mapping. The struct dentry definition is in include/linux/
dcache.h. VFS features a dentry cache (dcache) for frequently used translations. This cache
is described in detail in the kernel documentation, see Documentation/filesystems/vfs.txt,
along with other VFS structures and mechanisms.
9.11 ext4
The extended file system is the native file system format for Linux. The current incarnation is ext4,
although it has much in common with the earlier versions ext2 and ext3.
Look at the file /etc/fstab, which shows the file systems that are mounted (i.e., reachable from the
root directory) as part of the OS boot sequence. On your Raspbian system, the default file system is
formatted as ext4 and mounted directly at the root directory.
9.11.1 Layout on disk
The way a file system is laid out on a disk (or a disk partition) is known as its format. For any format,
the first block is always the boot block. This may contain executable code, in the event that this disk is
used as a boot device. Generally, this boot code sequence is very short and jumps to another location
for larger, more complex booting behavior.
Immediately after the boot block, ext4 has a number of block groups. Each block group is identical in
size and structure; a single block group is illustrated in Figure 9.11.
Figure 9.11: Schematic diagram showing the structure of an ext4 block group on disk.
The first two elements of a block group are at fixed offsets from the start of the block group. The
superblock records high-level file system metadata, such as the overall size of the file system, the size of
each block, the number of blocks per block group, device pointers to important areas, and timestamps for
the most recent mount and write operations. The struct ext4_super_block is defined in fs/ext4/
ext4.h. Note the superblock only occupies a single block on disk. This on-disk superblock is distinct
from the VFS in-memory superblock structure outlined in Section 9.9.1, although some data is shared
between them.
Ideally, the superblock is duplicated at the start of each block group. This provides redundancy in case
of disk corruption. If there are many block groups, then the superblock is only sparsely duplicated, at
the start of every nth block group.
The block group descriptor table is global; i.e., it covers all blocks. There is one entry in the table for
each block group, which is a struct ext4_group_desc as defined in the ext4.h header. Figure
9.12 presents this data structure as a UML class diagram. Again, the block group descriptor table is
duplicated across multiple block groups for redundancy, like the superblock.
Figure 9.12: Class diagram representation of a block group descriptor.
All other file system structures are pointed to by the block group descriptor. The block and inode
bitmaps use one bit to denote each data block and inode, respectively. A bit is set to 1 if the
corresponding entity is used, or 0 if it is free. These bitmaps may be cached in memory for access speed.
The inode table has one entry, a struct ext4_inode, per file. This table is statically allocated, as
outlined in Section 9.10, so there is a fixed limit on the number of files. Special inode table entries are
at well-known slots; generally, inode 2 is the root directory for the file system.
The data blocks region of the block group is the largest; these blocks actually store file data content.
Generally, all blocks belonging to a file will be located in a single block group, for locality.
You can inspect all the details of your ext4 file system on your Raspberry Pi with a command like:
Listing 9.11.1: Inspect ext4 file system Bash
1 sudo dumpe2fs -h /dev/mmcblk0p2 # for default Raspbian image on SD card
9.11.2 Indexing data blocks
As noted above, there are two different data structures for an inode. The generic VFS struct inode
may be converted into a specific ext4_inode with a macro call, EXT4_I(inode).
The key additional information is the location pointer for the data blocks that comprise the file
content. Actually, an ext4 inode supports three different techniques for encoding data location,
unioned over its 60 bytes for physical location information. Look at fs/ext4/ext4.h and find the
relevant i_block[EXT4_N_BLOCKS] field in struct ext4_inode.
The first approach is direct inline data storage. If the file contents are smaller than 60 bytes, they can
be stored directly in the inode itself. This generally only happens for symlinks with short pathnames,
although it can be used for other short files. (You may need to enable the inline_data feature
when you format the ext4 file system.) This is the only way actual file data is stored in the inode table
section of the file system.
The second approach is the traditional Unix hierarchical block map structure. This is the same as
in earlier ext2 and ext3 file systems. The 60 bytes of location data are split into 15 4-byte pointers
to logical block addresses. The first 12 pointers are direct pointers to the first 12 blocks of the file
data. The next pointer is a single indirect pointer; it points to a block containing a table of pointers
to subsequent data blocks for the file. Conventionally, the block size is 4KB, and each pointer is
4B, so it is possible to have 1K pointers in the single indirect block. The next pointer is a double
indirect pointer, and the final pointer is a triple indirect pointer. In total, this allows us to address
over 1 billion blocks, making for a maximum file size of over 4TB. Figure 9.13 illustrates this structure
schematically, showing pointers up to the second level of indirection. Note that the tables of
pointers, for indirect blocks, are stored in data blocks, rather than in the inode table.
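As a quick check of these figures, with 4 KB blocks and 4-byte (1K-entry) pointer blocks the scheme can reach 12 + 1024 + 1024² + 1024³ = 1,074,791,436 addressable data blocks, and 1,074,791,436 × 4 KB is indeed a little over 4 TB.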
Figure 9.13: Schematic diagram of ext4 map indirection scheme for data block locations.
This mul-level indirecon scheme has several benets. There are the advantages of direct block
addressing, for short les or the rst few blocks of a long le. There is the advantage of hierarchical
metadata, like mul-level page tables, to avoid wasted space for medium size les. There is the
advantage of indirecon, to avoid bloang inodes directly, for very large les. Check out fs/ext4/
indirect.c for further implementaon details.
The third approach, which is new in ext4, involves extents. An extent is a contiguous area of storage
reserved for a file (or a part of a file, since a file may consist of multiple extents). An extent is
characterized by a starting block number, a length measured in blocks, and the logical block address
corresponding to the starting block number. See struct ext4_extent in fs/ext4/ext4_
extents.h for details.
The chief benefit of an extent-based file system is a reduction in metadata. Whereas a block
map system requires a metadata entry (a block location) for every block, each extent only records
a single logical block address for a contiguous run of blocks. It is good practice to make file data as
contiguous as possible, to improve access times.
The 60 bytes of an ext4_inode may be used to store a 12-byte extent header followed by an array
of up to four 12-byte extent structures.
If there are more than four extents in the file, then the extent data can be arranged as a multi-level
N-ary tree, up to five levels deep. Extent data spills out from the inode into data blocks. A data block storing
extent data begins with an ext4_extent_header, which states how many extent entries there
are in this block and the tree depth. The entries follow in an array layout. If the tree depth is 0, then
the entries are leaf nodes, i.e., ext4_extent entries pointing directly to extents of data blocks on
disk. If the tree depth is greater than 0, then the entries are index nodes, i.e., ext4_extent_idx
entries, pointing to further blocks of extent data. These structures are all defined in fs/ext4/ext4_
extents.h. Figure 9.14 gives an example, with a two-level extent tree. This is similar to the indirect
block addressing used in the previous approach.
Figure 9.14: Schematic diagram of ext4 extents scheme for data block locations.
9.11.3 Journaling
The ext4 file system supports a journal, a dedicated log file that records each change that is to occur
to the file system. The journal tracks whether a change has started, is in progress, or has completed.
An append-only log file like this is vital when complex file system operations may be scheduled, which
would cause file system corruption if they start but do not complete, e.g., due to power failure. The log
may be consulted on system restart to recover the file system to a consistent state, either replaying or
undoing the partial actions.
Transaction records are added to the log rapidly and atomically. The journal file is effectively a
circular buffer, so older entries are overwritten when it fills up. Examine the file /proc/fs/jbd2/
<partition>/info to see statistics about the number of transaction entries in the log.
9.11.4 Checksumming
A checksum is a bit-level error detection mechanism. This is highly useful for persistent storage, where
there is a possibility of data corruption.
The most popular algorithm is CRC32c, which generates a 32-bit checksum for an arbitrary size input
byte array. The CRC32c algorithm may be implemented efficiently in software, although some Arm
CPUs have a built-in instruction to perform the calculation directly.
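The listing below is a bit-by-bit sketch of CRC-32C in C, using the reflected Castagnoli polynomial 0x82F63B78. Production implementations are table-driven or use the hardware instruction, but compute the same result; the standard check value for the input "123456789" is 0xE3069283.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bit-by-bit CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;
}

int main(void)
{
    const char *msg = "123456789";
    printf("%08x\n", (unsigned)crc32c((const uint8_t *)msg, strlen(msg)));  /* e3069283 */
    return 0;
}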
The ext4 le system supports CRC32c checksums for metadata. Try the following grep command to
explore elds that have checksums aached:
Lisng 9.11.2: Find checksum elds Bash
1 cd linux/fs/ext4/
2 grep crc *.h
Some checksums are embedded in top-level file system metadata. These include checksum fields in the
superblock and the block group descriptor, although each 32-bit value may be split across two 16-bit
fields in these structures. Checksums are ideal when there are redundant copies of such metadata.
If superblock corruption is detected, then the metadata can easily be restored by cloning a reserve copy.
Each block group bitmap that tracks free inodes or data blocks also has a checksum to ensure its
integrity. Further, there are checksums for individual file metadata. Each inode has its own checksum
field. There are checksums for some data location metadata (extent trees) and extended attributes.
Some file systems like btrfs and zfs store checksums for all blocks, including data blocks. On the other
hand, ext4 only supports checksums for file system metadata.
9.11.5 Encryption
Encryption involves protecting data by means of a secret key, usually a text string. The data appears
to be gibberish without this key. Encryption is used for portable devices or scenarios where
untrusted individuals can access the file system. Data security is particularly important for corporate
organizations, given recent developments in data protection legislation.
The ext4 file system supports encryption of empty directories, which can then have files added to
them. The file names and contents are encrypted. The e4crypt utility is an appropriate tool to handle
ext4 encrypted directories.
Note that ext4 encryption is not supported on Raspbian kernels by default. Encryption requires the
following configuration:
1. The kernel build option EXT4_ENCRYPTION must be set.
2. The target ext4 partition must have encryption enabled.
There are alternative, single-file encryption tools. For instance, zip archives can be encrypted with
a passphrase. In general, the gpg command-line tool allows single file payloads to be encrypted.
Lower-level encryption techniques on Linux include dm-crypt, which operates at the block device
level.
9.12 FAT
The File Allocation Table (FAT) file system is named after its characteristic indexing data structure.
Originally, FAT was a DOS-based file system used for floppy disks. Although it is not 'native' to Linux,
it is well supported since FAT is ubiquitous in removable storage media such as USB flash drives and
SD cards. Due to its simplicity and long history, FAT is highly compatible with other mainstream and
commercial OSs, as well as hardware devices such as digital cameras. For SD cards, the default format
is FAT. Observe that your Raspberry Pi SD card has a FAT partition for booting.
Listing 9.12.1: A Raspberry Pi SD card has a FAT partition Bash
1 cat /etc/mtab | grep fat
The key idea behind the FAT format is that a file consists of a linked list of sequential data blocks.
A directory entry for a file simply needs a pointer to the first data block, along with some associated
metadata. Rather than storing the data block pointers inline, where they might easily be corrupted,
the FAT system has a distinct table of block pointers near the start of the on-disk file system. This is
the file allocation table (FAT). Often the FAT may be duplicated for fault tolerance.
Given a fixed number of data blocks on a FAT file system, say N, the file allocation table should
have N entries, for a one-to-one correspondence between table entries and data blocks.
If the data block is used as a part of a file, then the corresponding table entry contains the pointer
to the next block.
If this is the last block of the file, the table entry contains an end-of-file marker.
There are other special-purpose values for FAT entries, as listed in Figure 9.15. All entries have the
same fixed length, as specified by the FAT variant. FAT12 has 12-bit entries, FAT16 has 16-bit, and
FAT32 has 32-bit.
Figure 9.15: FAT entry values and their interpretation.
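To make the linked-list structure concrete, the sketch below walks a cluster chain through an in-memory copy of a FAT16 table; the end-of-file test (values of 0xFFF8 and above) matches the entry values listed in Figure 9.15. A robust version would also check for free and bad-block entries; the function and data here are our own toy example.

#include <stdint.h>
#include <stdio.h>

/* Follow a FAT16 cluster chain, starting from the first cluster recorded
 * in a file's directory entry, and print each cluster number in order. */
static void print_chain(const uint16_t *fat, uint16_t first_cluster)
{
    uint16_t cluster = first_cluster;
    while (cluster < 0xFFF8) {             /* 0xFFF8..0xFFFF mark end of file */
        printf("cluster %u\n", (unsigned)cluster);
        cluster = fat[cluster];            /* each entry points to the next block */
    }
}

int main(void)
{
    /* Toy FAT: a file occupying clusters 2 -> 3 -> 5, then end of file. */
    uint16_t fat[8] = {0xFFF8, 0xFFFF, 3, 5, 0, 0xFFF8, 0, 0};
    print_chain(fat, 2);
    return 0;
}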
In FAT systems, a directory is a special type of file that consists of multiple entries. Each directory
entry occupies 32 bytes and encodes the name and other metadata of a file (although not as rich as
an inode, generally) along with the FAT index of the first block of file data.
In earlier FAT formats (i.e., 12 and 16), the root directory occupies the special root directory region.
It is statically sized when the file system is created, so there is a limited number of entries in the
root directory. For FAT32, the root directory is a general file in the data region, so it can be resized
dynamically.
Figure 9.16 shows how the various regions of the FAT file system are laid out on a device. The initial
reserved block is for boot code. The FAT is a fixed size, depending on the number of blocks available
in the data region. Each FAT entry is n bits long, for the FAT-n variant of the file system, where n may
be 12, 16 or 32. For FAT12 and FAT16, as shown in the diagram, the root directory is a distinct, fixed-
size region. This is followed by the data region, which contains all other directories and files.
Figure 9.16: Schematic diagram showing regions for a FAT format disk image.
9.12.1 Advantages of FAT
Due to the linked-list nature of files, they must be accessed sequentially by chasing pointers from the
start of the file. Since the pointers are all close together (in the FAT, rather than inline in the data blocks),
they have high spatial locality. The FAT is often cached in RAM, so it is efficient to access and traverse.
FAT is a simple file system in terms of implementation complexity. Its simplicity, along with its
longevity, explains its widespread deployment.
(Table from Figure 9.15.)
entry                      meaning                FAT16 value
0                          free block             0x0000
1                          temporarily non-free   0x0001
2 to MAXWORD-16            next block pointer     0x0002 to 0xFFEF
MAXWORD-15 to MAXWORD-9    reserved values        0xFFF0 to 0xFFF6
MAXWORD-8                  bad block              0xFFF7
MAXWORD-7 to MAXWORD       end of file marker     0xFFF8 to 0xFFFF
9.12.2 Construct a Mini File System using FAT
The best way to understand how a file system works is to construct one for yourself. In this practical
section, we will programmatically build a disk image for a simple FAT16 file system, then mount the
image file on a Raspberry Pi system and interact with it.
The Python program shown below will create a blank file system image. Read through this source code
to understand the metadata details required for specifying a FAT file system.
Listing 9.12.2: Programmatically create a FAT disk image Python
1 # create a binary le
2 f = open('fatexample.img', 'wb')
3
4 ### BOOT SECTOR, 512B
5 # first 3 bytes of boot sector are 'magic value'
6 f.write( bytearray([0xeb, 0x3c, 0x90]) )
7
8 # next 8 bytes are manufacturer name, in ASCII
9 f.write( 'TEXTBOOK'.encode('ascii') )
10
11 # next 2 bytes are bytes per block - 512 is standard
12 # this is in little endian format - so 0x200 is 0x00, 0x02
13 f.write( bytearray([0x00, 0x02]) )
14
15 # next byte, number of blocks per allocation unit - say 1
16 # An allocation unit == A cluster in FAT terminology
17 f.write( bytearray([0x01]) )
18
19 # next two bytes, number of reserved blocks -
20 # say 1 for boot sector only
21 f.write( bytearray([0x01, 0x00]) )
22
23 # next byte, number of File Allocation tables - can have multiple
24 # tables for redundancy - we'll stick with 1 for now
25 f.write( bytearray([0x01]) )
26
27 # next two bytes, number of root directory entries - including blanks
28 # let's say 16 files for now, so root dir is contained in single block
29 f.write( bytearray([0x10, 0x00]) )
30
31 # next two bytes, number of blocks in the entire disk - we want a 4 MB disk,
32 # so need 8192 0.5K blocks == 2^13 == 0x00 0x20
33 f.write( bytearray([0x00, 0x20]) )
34
35 # single byte media descriptor - magic value 0xf8
36 f.write( bytearray([0xf8]) )
37
38 # next two bytes, number of blocks for FAT
39 # FAT16 needs two bytes per block, we have 8192 blocks on disk
40 # 512 bytes per block - i.e. can store FAT metadata for 256 blocks in
41 # a single block, so need 8192/256 blocks == 2^13/2^8 == 2^5 == 32
42 f.write( bytearray([0x20, 0x00]) )
Lisng 9.12.3: Connuaon of FAT disk image creaon Python
1 # next 8 bytes are legacy values, can all be 0
2 f.write( bytearray([0,0,0,0,0,0,0,0]) )
3
4 # next 4 bytes are total number of blocks in entire disk -
5 # ONLY if it overflows the earlier 2-byte entry, otherwise 0s
6 f.write( bytearray([0x00, 0x00, 0x00, 0x00]) )
7
8 # next 2 bytes are legacy values
9 f.write( bytearray([0x80,0]) )
10
11 # magic value 0x29 - the FAT16 extended signature
12 f.write( bytearray([0x29]) )
13
14 # next 4 bytes are volume serial number (unique id)
15 f.write( bytearray([0x41,0x42,0x43,0x44]) )
16
17 # next 11 bytes are volume label (name) - pad with trailing spaces
18 f.write( "TEST_DISK ".encode('ascii'))
19
20 # next 8 bytes are the file system identifier - pad with trailing spaces
21 f.write( "FAT16 ".encode('ascii'))
22
23
24 # pad with '\0'
25 for i in range(0,0x1c0):
26 f.write( bytearray([0]) )
27
28 # end of boot sector magic marker
29 f.write( bytearray([0x55, 0xaa]) )
30
31
32 ## FILE ALLOCATION TABLE
33 # each entry needs 2 bytes for FAT16
34 # We need 8192 entries (== 32 blocks of 512B)
35
36 # (a) rst two entries are magic values 0xf8 0x
37 f.write( bytearray([0xf8,0x,0x,0x]))
38
39 # (b) subsequent 8190 FAT entries should be 0x00
40 f.write( bytearray([0x00,0x00]*8190) )
41
42 ## ROOT DIRECTORY AREA
43 # There are 16 files in the root directory
44 # Each file entry occupies 32 bytes - no entries yet, so all zeros
45 # Root directory takes 16*32 bytes == 512B == 1 block
46 f.write( bytearray([0x00]*512) )
47
48 ## DATA REGION
49 # create 8192 blank blocks, each containing 512 bytes of zero values
50 for i in range(8192):
51 f.write( bytearray([0x00]*512) )
52
53 ## All done - finally close the file
54 f.close()
Lisng 9.12.4: Interacng with the FAT disk image Bash
1 # Step 1: mount the file system
2 sudo mount -t vfat -o loop fatexample.img /mnt
3 # Step 2: add a multiple-block file
4 sudo dd if=/usr/share/dict/words of=/mnt/words.txt count=5 bs=512
5 # Step 3: unmount the file system
6 sudo umount /mnt
7 # Step 4: hexdump the image to find the new file's cluster addresses
8 hexdump -x fatexample.img |less
Inspect the FAT data, which starts at address 0x200 in the file; after the initial magic values, subsequent entries are sequential cluster numbers, finishing with an end-of-cluster marker.
Because the FAT file system we created is initially blank, the newly allocated file is stored in consecutive clusters on the disk. Over time, as a FAT file system becomes more used and fragmented, files may not be consecutive.
When free data blocks are required, the system scans through the FAT to find the index numbers of blocks that are marked as free and uses these for fresh file data.
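As an illustration, the short C program below scans the FAT of the image built earlier and counts the free entries. It is a minimal sketch that assumes the exact layout created above (512-byte blocks, one reserved boot block, so the FAT starts at offset 0x200, and 8192 two-byte entries); it also assumes a little-endian host such as the Raspberry Pi, so the 16-bit entries can be read directly.
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    FILE *f = fopen("fatexample.img", "rb");
    if (!f) { perror("fopen"); return 1; }

    /* The FAT starts immediately after the single reserved boot block. */
    fseek(f, 0x200, SEEK_SET);

    uint16_t entry;
    long free_count = 0;
    for (int i = 0; i < 8192; i++) {      /* 8192 FAT16 entries, 2 bytes each */
        if (fread(&entry, sizeof(entry), 1, f) != 1)
            break;
        if (entry == 0x0000)              /* 0x0000 marks a free block */
            free_count++;
    }
    printf("free FAT entries: %ld\n", free_count);
    fclose(f);
    return 0;
}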
If the Python code above is too long to attempt, you could also use the mkfs tool to create a blank FAT16 disk image, as shown below.
Lisng 9.12.5: Automacally create a FAT image Bash
1 sudo apt-get install dosfstools # for manipulating FAT images
2 dd if=/dev/zero of=./fat.img bs=512 count=8192 # blank image
3 mkfs.fat -f 1 -F 16 -i 41424344 -M 0xF8 -n TEST_DISK \
4 -r 32 -R 1 -s 1 -S 512 ./fat.img
9.13 Latency reduction techniques
To minimize the overhead of accessing persistent storage, which can have relatively high latency, Linux maintains an in-memory cache of blocks recently read from or written to secondary storage. This is known as the buffer cache. It is sized to occupy free RAM, so it grows and shrinks as other processes require more or less memory. The contents of the buffer cache are flushed to disk at regular intervals, to ensure consistency.
Hexdump excerpt from the FAT inspection step in Section 9.12.2, showing the FAT region at offset 0x200:
*
00001f0 0000 0000 0000 0000 0000 0000 0000 aa55
0000200 fff8 ffff 0000 0004 0005 0006 0007 ffff
0000210 0000 0000 0000 0000 0000 0000 0000 0000
*
Another technique to reduce latency is the use of a RAM disk. This involves dedicating a portion of memory to be handled explicitly as part of the file system. It makes sense for transient files (e.g., those resident in /tmp) or log files that will be accessed frequently. The kernel has specific support for this mapping of memory to file system, called tmpfs. Create a RAM disk of 50 MB size as follows:
Lisng 9.13.1: Create a RAM disk Bash
1 sudo mkdir /mnt/ramdisk
2 sudo mount -t tmpfs -o size=50M newdisk /mnt/ramdisk
Note les in the directory /mnt/ramdisk are not persistent. This directory is lost when the system is
powered down. RAM disks are parcularly useful for embedded devices like the Raspberry Pi, for
which repeated high frequency writes to disk can cause SD card corrupon.
9.14 Fixing up broken file systems
Persistent storage media may be unreliable. Bad blocks should be detected and avoided. File systems have mechanisms for recording bad blocks to ensure data is not allocated to these blocks. For instance, FAT has a bad block marker.
Sometimes, file systems are left in an inconsistent state if the system is powered down unexpectedly, or devices are removed without unmounting. Some data may have been cached in RAM, but not written back to disk before the shutdown or removal.
Fix-up utilities like fsck can check and repair file system glitches. They check for directory integrity and make alterations (e.g., to inode reference counts) as appropriate. File system journals may be used to replay incomplete actions on file systems.
These general-purpose tools can handle many common file system problems. For more serious issues, expert cyber forensic tools are available. These facilitate partial data recovery from damaged devices.
9.15 Advanced topics
Some storage media are read-only, such as optical disks. On Linux, any file system may be mounted for read-only access. This implies that only certain operations are permitted, and no metadata updates (even access timestamps) are possible. Generally, read-only media have specialized file system formats such as the universal disk format (UDF) for DVDs.
Specialized Raspberry Pi Linux distributions may mount the root file system as read-only, with any file writes directed to transient RAM disk storage. This is an attempt to guarantee system integrity, e.g., for public display terminals in museums.
Network file systems are commonplace, particularly given widespread internet connectivity. In addition to the issues outlined above for local file systems, network file system protocols must also handle:
1. Distributed access control: global user identities are managed and authenticated in the system.
2. High and variable latency: underlying data may be stored in remote locations over a wide area network, with clients experiencing occasional lack of connectivity.
3. Consistency: multiple users may concurrently access and modify a shared, possibly replicated, resource.
A union le system, also known as an overlay le system, is a transparent composion of two disnct
le systems. The base layer, oen a read-only system like a live boot CD, is composed with an upper
layer, oen a writeable USB sck. Overlays are also extensively used for containerizaon, in systems
like Docker. From user space, the union appears to be a single le system. The lisng below shows
how you can set up a sample union le system on your Raspberry Pi. If you inspect the lower layer, it
is not aected by modicaons in the merged layer. The upper layer acts like a ‘le system di’ applied
to the lower layer.
Lisng 9.15.1: A sample union le system Bash
1 cd /tmp
2 # set up directories
3 mkdir lower
4 echo "hello" > lower/a.txt
5 touch lower/b.txt
6 mkdir upper
7 mkdir work
8 mkdir merged
9 sudo mount -t overlay overlay -olowerdir=/tmp/lower,\
10 upperdir=/tmp/upper,workdir=/tmp/work /tmp/merged
11 cd merged
12 echo "hello again" >> b.txt
13 touch c.txt
14 ls
15 ls ../upper
16 ls ../lower
Standard, concrete file system implementations are built into the kernel, or loaded as kernel modules. The goal of the FUSE project is to enable file systems in user space. FUSE consists of:
1. A small kernel module that mediates with VFS on behalf of non-privileged code.
2. An API that can be accessed from user space.
FUSE enables more flexibility for the development and deployment of experimental file systems. Multiple high-level language bindings are available, allowing developers to create file systems in languages as diverse as Python and Haskell.
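To give a flavor of what a user space file system looks like, here is a minimal sketch of a read-only FUSE file system exposing a single file. It is written against the libfuse 2.x high-level API (FUSE_USE_VERSION 26); the names hellofs, /hello, and the message text are our own choices, not part of any standard.
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

static const char *msg  = "hello from user space\n";
static const char *path = "/hello";

static int hellofs_getattr(const char *p, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(p, "/") == 0) {            /* the root directory */
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(p, path) == 0) {    /* our single file */
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = strlen(msg);
    } else {
        return -ENOENT;
    }
    return 0;
}

static int hellofs_readdir(const char *p, void *buf, fuse_fill_dir_t filler,
                           off_t off, struct fuse_file_info *fi)
{
    if (strcmp(p, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, path + 1, NULL, 0);       /* "hello" */
    return 0;
}

static int hellofs_read(const char *p, char *buf, size_t size, off_t off,
                        struct fuse_file_info *fi)
{
    size_t len = strlen(msg);
    if (strcmp(p, path) != 0) return -ENOENT;
    if ((size_t)off >= len)   return 0;
    if (off + size > len)     size = len - off;
    memcpy(buf, msg + off, size);
    return (int)size;
}

static struct fuse_operations hellofs_ops = {
    .getattr = hellofs_getattr,
    .readdir = hellofs_readdir,
    .read    = hellofs_read,
};

int main(int argc, char *argv[])
{
    /* run as, e.g.:  ./hellofs /mnt/hellofs   (then: cat /mnt/hellofs/hello) */
    return fuse_main(argc, argv, &hellofs_ops, NULL);
}
Assuming the libfuse development headers are installed, this could be built with something like gcc hellofs.c `pkg-config fuse --cflags --libs` -o hellofs and mounted on an empty directory; reading the hello file then returns the message from the user space process.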
9.16 Further reading
The ext4 file system is introduced, motivated, and empirically evaluated in a paper [1] by some of its development team. There are a number of helpful illustrations in this paper. It also includes details on high-level design decisions that underpin ext4.
The wiki page at http://ext4.wiki.kernel.org features a comprehensive collection of online resources about ext4.
The detailed coverage of VFS and the legacy ext2/ext3 file systems in the O'Reilly textbook Understanding the Linux Kernel [2] is well worth reading. The authors provide much more detail, including relevant commentary on kernel source code data structures and algorithms.
9.17 Exercises and questions
9.17.1 Hybrid contiguous and linked file system
Consider a block-structured file system where the first N blocks of a file are arranged contiguously, and then subsequent blocks are linked together in a linked-list data structure (like FAT). What are the advantages of this file system organization? What are the potential disadvantages?
9.17.2 Extra FAT file pointers
Consider a linked file system, like FAT. The directory entry for each file has a single pointer to the first block of the file. Why might it be a good idea to keep a second pointer, to the final block of the file? Which operations would have their efficiency improved?
Imagine a FAT-style system with doubly linked lists, i.e., each FAT entry has pointers to both the next and previous blocks. Would this improve file seek times, in general? Do you think the space overhead is acceptable?
9.17.3 Expected file size
Inspect your ext4 root file system. See how much space is available on it with df -h. Then see how many inodes are free with df -i. Use these results to calculate the expected space to be occupied by each future file, assuming a single inode per file (i.e., no multiple links).
9.17.4 Ext4 extents
This question concerns the relative merits of ext4-style extents in comparison to traditional block map indexing. Consider creating and writing data to an N-block file, where the data blocks are laid out contiguously on disk. How many bytes would need to be written for extent-based location metadata? How many bytes would need to be written for a block map index? When might a block map index be more efficient than extent-based metadata?
9.17.5 Access times
Create a RAM disk, using the commands outlined above. Now plug in a USB drive. Compare the write latencies for both devices, by writing a 100 MB file of random data to them. Use the dd command with source data from /dev/urandom. Which device has lower latency, and why? You might also compare these times with writing 100 MB to your Pi SD card.
9.17.6 Database decisions
Imagine you have to architect a big data storage system to run on the Linux platform. You can choose between:
1. A massive single monolithic data dump file
2. A set of small files, each of which stores a single data record
Discuss the implementation trade-offs involved in this decision. Which alternative would you select, and why?
References
[1] A. Mathur, M. Cao, S. Bhattacharya, A. Dilger, A. Tomas, and L. Vivier, "The new ext4 filesystem: current status and future plans," in Proceedings of the Linux Symposium, vol. 2, 2007, pp. 21–33.
[2] D. P. Bovet and M. Cesati, Understanding the Linux Kernel, 3rd ed. O'Reilly, 2005.
Chapter 10
Networking
Operang Systems Foundaons with Linux on the Raspberry Pi
250
10.1 Overview
This chapter will introduce networking from an operating systems perspective. We discuss why networking is treated differently from other types of I/O and what the operating system requirements are to support networking. We introduce POSIX socket programming both in terms of the role the OS plays (e.g., socket buffers, the file abstraction, supporting multiple clients) and from a practical perspective.
The focus of this book is not on networking per se; we refer the reader to the standard textbooks by Peterson and Davies [1] or Tanenbaum [2], or the open-source book by Bonaventure [3], available at http://cnp3book.info.ucl.ac.be/.
What you will learn
After you have studied the material in this chapter, you will be able to:
1. Explain the role of the Linux kernel in networking.
2. Discuss the relationship between and structure of the Linux networking stack and the kernel networking architecture.
3. Use the POSIX API for programming networking applications: data types, common API and utility functions.
4. Build TCP and UDP client/server applications and handle multiple clients.
10.2 What is networking?
When we say "networking," we refer to the interaction of a computer system with other computer systems using an intermediate communication infrastructure. In particular, our focus will be on the TCP/IP protocol and protocols implemented on top of TCP/IP such as HTTP, and to a lesser extent on the wired (802.3) and wireless (802.11) Ethernet media access control (MAC) protocols.
10.3 Why is networking part of the kernel?
The network interface controller (NIC, aka network adapter) is a peripheral I/O device. Therefore, as with all peripherals, access to this device is controlled via a device driver which must be part of the kernel. However, why does the kernel also implement the TCP/IP protocol suite? Why does it not leave this to user space and simply deliver the data as received by the NIC straight to the user application?
And indeed, there are a number of user space TCP/IP implementations [4, 5, 6, 7]. Some of these claim to outperform the Linux kernel TCP/IP implementation, but the performance of the Linux kernel network stack has improved considerably, and version 4.16 (the current kernel at the time of writing) contained a lot of networking changes.
However, there are two main reasons to put networking in the kernel:
If we did not do this, only a single process at a time could have access to the network card. By using the kernel network stack, we have the ability to run multiple network applications, servers as well as clients. Achieving the same result efficiently in user space is impossible because a process cannot preempt another process the way the OS scheduler can.
Furthermore, there is the issue of control over the incoming packets. Unlike other peripherals, which are typically an integral part of the system and entirely under the control of the user, the NIC delivers data from unknown sources. If we delegated the networking functionality to user space, then the kernel could not act as the controller of the incoming (and outgoing) data.
10.4 The OSI layer model
Communication networks have traditionally been represented as layered models. In particular, the OSI (Open Systems Interconnection) reference model [8], officially the ITU standard X.200, is very widely known. As shown in Table 10.1, this model consists of seven layers. The protocol data unit (PDU) is the information that is transmitted as a single unit between peer entities of a computer network.
Table 10.1: OSI layer model.
Layer | Protocol data unit | Function
Host layers:
7. Application | Data | The sole means for the application process to access the OSI environment, i.e., all OSI services directly usable by the application process.
6. Presentation | Data | Representation of information communicated between computer systems. This could, for example, include encoding, compression, and encryption.
5. Session | Data | Control of the connections between computer systems. Responsible for session management, including checkpointing and recovery.
4. Transport | Segment, Datagram | Transparent transfer of data, including reliability, flow control, and error control.
Media layers:
3. Network | Packet | Functionality to transfer packets between computer systems. In practice, this means the routing protocol and the packet format.
2. Data link | Frame | Functionality to manage data link (i.e., node-to-node) connections between computer systems.
1. Physical | Symbol | Actual hardware enabling the communication between computer systems as raw bitstreams.
The upper four layers, Application, Presentation, Session, and Transport, are known as the "Host layers." They are responsible for accurate and reliable data delivery between applications in computer systems. They are called "host" layers because their functionality is implemented, at least in principle, solely by the host systems, and the intermediate systems in the network don't need to
implement these layers. The lower three layers, Network, Data Link, and Physical, are known as "Media layers" (short for communications media layers). The media layers are responsible for delivering the information to the destination for which it was intended. The functionality of these layers is typically implemented in the network adapter.
10.5 The Linux networking stack
In practice, the Session and Presentation layers are not present as distinct layers in the typical TCP/IP-based networking stack. A practical layer model for the TCP/IP protocol suite is shown in Figure 10.1.
Figure 10.1: Layer model for the TCP/IP protocol suite.
The Linux kernel provides the link layer, network layer, and transport layer. The link layer is implemented through POSIX-compliant device drivers; the network and transport layers (TCP/IP) are implemented in the kernel code. In the next sections, we provide an overview of the Linux kernel networking architecture (Figure 10.2).
Figure 10.2: Linux kernel networking architecture.
[Figure 10.1 shows the stack with example protocols at each layer: hardware layer (Ethernet), link layer (Ethernet driver), network layer (IP, IPv6), transport layer (TCP, UDP), and application layer (HTTP, SMTP, SSL), used by applications such as a web browser or email client.]
[Figure 10.2 shows the kernel networking architecture: the application layer in user space; then, in kernel space, the system call interface, the protocol-agnostic (socket) interface, the network protocols (INET), the device-agnostic interface, and the device drivers; and finally the network interface controller (NIC) in hardware.]
10.5.1 Device drivers
The physical network devices (NICs) are managed by device drivers. For what follows, we assume the NIC is an Ethernet device. The device driver is a software interface between the kernel and the device hardware. On the kernel side, it uses a low-level but standardized API so that any driver for a different NIC can be used in the same way. In other words, the device driver abstracts away as much as possible of the specific hardware.
The normal file operations (read, write, ...) do not make sense when applied to the interaction between a driver and a NIC, so they do not follow the "everything is a file" philosophy. The main difference is that a file, and by extension a file storage device, is passive, whereas a network device actively wants to push incoming packets toward the kernel. So NIC interrupts are not a result of a previous kernel action (as is the case with, e.g., file operations), but of the arrival of a packet. Consequently, network interfaces exist in their own namespace with a different API.
10.5.2 Device-agnostic interface
The network protocol implementation code interfaces with the driver code through an agnostic interface layer which allows various protocols to be connected to a variety of hardware device drivers. To achieve this, the calls work on a packet-by-packet basis so that it is not necessary to inspect the packet content or keep protocol-specific state information at this level. The interface API is defined in linux/net/core/dev.c. The actual interface is a struct of function pointers called net_device_ops, defined in include/linux/netdevice.h. In the driver code, the applicable fields are populated using driver-specific functions.
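As a rough illustration, a driver typically defines its own handler functions and points the relevant net_device_ops fields at them. The sketch below is not taken from any real driver; the "mynic" names are hypothetical, and only three of the many available callbacks are shown.
#include <linux/netdevice.h>

/* Hypothetical handlers for a fictional "mynic" Ethernet device. */
static int mynic_open(struct net_device *dev)
{
    /* enable the hardware, then allow the kernel to queue packets for transmit */
    netif_start_queue(dev);
    return 0;
}

static int mynic_stop(struct net_device *dev)
{
    netif_stop_queue(dev);
    return 0;
}

static netdev_tx_t mynic_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    /* hand the packet to the hardware here, then release the socket buffer */
    dev_kfree_skb(skb);
    return NETDEV_TX_OK;
}

static const struct net_device_ops mynic_netdev_ops = {
    .ndo_open       = mynic_open,
    .ndo_stop       = mynic_stop,
    .ndo_start_xmit = mynic_start_xmit,
};

/* In the driver's probe function:  dev->netdev_ops = &mynic_netdev_ops;  */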
10.5.3 Network protocols
Packets are handed over to the actual network protocol functionality in the kernel. For our purpose, we focus on the TCP/IP protocol suite, known in the Linux kernel as inet. This is a whole suite of protocols, the best-known of which are IP, TCP, and UDP. The code for this can be found in net/ipv4 for IPv4 and in net/ipv6 for IPv6.
In particular, the IPv4 protocols are initialized in inet_init() (defined in linux/net/ipv4/af_inet.c). This function registers each of the built-in protocols using the proto_register() function (defined in linux/net/core/sock.c). It adds the protocol to the active protocol list and also optionally allocates one or more slab caches. The Linux kernel implements a caching memory allocator to hold caches (called slabs) of identical objects. A slab is a set of one or more contiguous pages of memory set aside by the slab allocator for an individual cache.
10.5.4 Protocol-agnostic interface
The network protocols interface with a protocol-agnostic layer that provides a set of common functions to support a variety of different protocols. This layer is called the sockets layer, and it supports not only the common TCP and UDP transport protocols but also the IP routing protocol, various Ethernet protocols, and others, e.g., the Stream Control Transmission Protocol (SCTP). We will discuss the POSIX socket interface in more detail in Section 10.6.
The socket interface is an abstraction for the network connection. The socket data structure contains all of the required state of a particular socket, including the particular protocol used by the socket and the operations that may be performed on it. The networking subsystem knows about the
available protocols through a special structure that defines its capabilities. Each protocol maintains a (large) structure called proto (defined in include/net/sock.h). This struct defines the particular socket operations that can be performed from the sockets layer to the transport layer (for example, how to create a socket, how to establish a connection with a socket, how to close a socket, etc.).
10.5.5 System call interface
We have covered the Linux system call interface in Chapter 5. Essentially, this is the interface between user space and kernel space. Recall that Linux system calls are identified by a unique number and take a variable number of arguments. When a networking call is made by the user, the system call interface of the kernel maps it to a call to sys_socketcall (defined as SYSCALL_DEFINE2(socketcall,...) in net/socket.c), which then further demultiplexes the call to its intended target, e.g., SYS_SOCKET, SYS_BIND, etc.
It is also possible to use the file abstraction for networking I/O. For example, typical read and write operations may be performed on a networking socket (which is represented by a file descriptor, just as a normal file). Therefore, while there exist a number of operations that are specific to networking (creating a socket with the socket call, connecting it to a destination with the connect call, and so on), there are also a number of standard file operations that apply to networking objects just as they do to regular files.
10.5.6 Socket buffers
A consequence of having many layers of network protocols, each one using the services of another, is that each protocol needs to add protocol headers (and/or footers) to the data as it is transmitted and to remove them as packets are received. This could make passing data buffers between the protocol layers difficult, as each layer would need to find where its particular protocol headers and footers are located within the buffer. Copying buffers between layers would, of course, work, but it would be very inefficient. Instead, the Linux kernel uses socket buffers (a.k.a. sk_buffs, struct sk_buff) to pass data between the protocol layers and the network device drivers. Socket buffers contain pointer and length fields that allow each protocol layer to manipulate the application data via standard functions.
Figure 10.3: Socket buffer structure.
[Figure 10.3 shows a simplified struct sk_buff: next/prev pointers that link sk_buffs into a list, a pointer to the owning struct sock, a pointer to the struct net_device (the NIC), offsets for the MAC, network (IP), and transport (TCP) headers, and the head, data, tail, and end pointers into the packet memory.]
Essenally, an sk_bu combines a control structure with a block of memory. Two main sets of
funcons are provided in the sk_bu library: the rst set consists of rounes to manipulate doubly
linked lists of sk_bus; the second set of funcons for controlling the aached memory. The buers
are stored in linked lists opmized for the common network operaons of append to end and remove
from start. In pracce, the structure is quite complicated (the complete struct comprises 66 elds).
Figure 10.3 shows a simplied diagram of the sk_bu struct.
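The second set of functions is what lets each layer prepend or strip its own header without copying the payload. The fragment below is a minimal, self-contained sketch of that idiom (kernel context is assumed; the sizes, the 8-byte header, and the function name are illustrative, not taken from a real driver).
#include <linux/skbuff.h>
#include <linux/string.h>
#include <linux/gfp.h>

/* Illustrative sizes for headroom and payload. */
#define HDR_ROOM  64
#define PAYLOAD  128

static struct sk_buff *build_example_skb(void)
{
    struct sk_buff *skb = alloc_skb(HDR_ROOM + PAYLOAD, GFP_KERNEL);
    if (!skb)
        return NULL;

    skb_reserve(skb, HDR_ROOM);          /* leave headroom for lower-layer headers */

    u8 *payload = skb_put(skb, PAYLOAD); /* extend the data area at the tail */
    memset(payload, 0, PAYLOAD);         /* application data would go here */

    u8 *hdr = skb_push(skb, 8);          /* prepend an (illustrative) 8-byte header */
    memset(hdr, 0, 8);

    return skb;                          /* the caller releases it with kfree_skb() */
}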
10.6 The POSIX standard socket interface library
In this section, we present some of the most useful POSIX standard socket interface library functions and the related internet data types and constants. The selection focuses on IPv4 TCP stream sockets.
10.6.1 Stream socket (TCP) communications flow
The Transmission Control Protocol is a core protocol in the TCP/IP stack and implements one of the two transport layer (OSI layer 4) protocols (the other being UDP, the User Datagram Protocol). All incoming IP network layer packets marked with the relevant TCP identifier in the IP protocol ID header field are passed upwards to TCP, and all outgoing TCP packets are passed down to the IP layer for sending. In turn, TCP is responsible for identifying the (16-bit) port number from the TCP packet header and forwarding the TCP packet payload to any active socket associated with the specified port number.
TCP is reliable and connection-oriented and as such employs various handshaking activities in the background between the TCP layers in the communicating nodes to handle the setup, reliability control, and shutdown of the TCP connection. The socket API provides a simplified programming model for the TCP-to-application interface, and the connected stream sockets can be considered as the communication endpoints of a virtual data circuit between two processes.
To establish a socket connection, one of the communicating processes (the server) needs to be actively waiting for a connection on an active socket, and the other process (the client) can then request a connection; if successful, the connection is made. The timeline of the various socket library function calls required in a typical (simple) stream socket connection is shown below:
Server meline Client meline Descripon
1. Socket(. . . ) Server creates a socket le descriptor
2. Setsockopt(. . . ) Congure server socket protocol opons (1 call per opon)
3. Bind(. . . ) Associate the server socket with a predened local port number
4. Listen(. . . ) Allow client connecons on the server socket
5. Accept(. . . ) Wait for client connecon request
1. Socket(. . . ) Client creates a socket le descriptor
2. Connect(. . . ) Client requests connecon to the server socket
6. Recv(. . . )/send(. . . ) 3. Recv(. . . )/send(. . . ) Client/Server data communicaons
7. Close(. . . ) 4. Close(. . . ) Either process can close the stream socket connecon rst
Treang stream sockets as standard system devices: read()/write()
The read() and write() low level I/O library funcons are not part of the standard socket
library; however stream sockets behave in much the same manner as any other operang
system device (standard input/output, le, etc) and low-level system device I/O operaons
are therefore compable with stream socket I/O. The use of these funcons in place of the
standard socket library funcons send(), and recv() (used for stream sockets only) is a common
programming nicety that will allow the simple redirecon of process communicaons from
network to any other available I/O device in the host OS. In comparison; the standard socket
library funcons sendto() and recvfrom() used for datagram sockets (UDP) are not compable
with the low-level stream I/O due to their unreliable and conneconless characteriscs and
therefore cannot be treated in the same way.
Note
A read from a stream socket (using the read() or recv() functions) may not return all of the expected bytes in the first attempt, and the read operation may need to be repeated an unspecified number of times, with the read results concatenated, until the full number of expected bytes has been received. If the expected number of bytes is not known in advance, the stream should be read a small block of bytes (possibly 1 byte) at a time until the receive count is identified using a data size field within the received data or a predefined data terminator sequence. It is up to the individual internet application to define any data size field syntax and/or data terminators used. Attempting to read more data than has been sent will block the read() or recv() function call, which will hang waiting for new data.
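The helper below sketches the terminator-based approach just described: it reads one byte at a time until a '\n' is seen. The function name read_line is our own; it is not part of the socket API, and a production version would usually buffer larger reads for efficiency.
#include <sys/types.h>
#include <sys/socket.h>

/* Read from a stream socket until '\n' or the buffer is full.
 * Returns the number of bytes stored, 0 on orderly close, -1 on error. */
ssize_t read_line(int sockfd, char *buf, size_t maxlen)
{
    size_t i = 0;
    while (i < maxlen - 1) {
        char c;
        ssize_t n = recv(sockfd, &c, 1, 0);
        if (n <= 0)              /* error, or the peer closed the connection */
            return n;
        buf[i++] = c;
        if (c == '\n')           /* terminator found */
            break;
    }
    buf[i] = '\0';
    return (ssize_t)i;
}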
10.6.2 Common Internet data types
As mentioned previously, only the stream socket related library functions and associated data types and constants are listed here. Due to target differences in the fundamental integer data types utilized between various implementations of the standard socket interface, the POSIX-defined types 'u_char' (8-bit), 'u_short' (16-bit), and 'u_long' (32-bit) (normally declared in sys/types.h for UNIX systems) are used here to signify fixed word length integer data types and may be utilized in any required programming type casts.
The following sections provide reference material for useful standard socket interface library functions and internet data types.
Socket address data type: struct sockaddr
The socket address data structure used in various socket library function calls is defined in sys/socket.h as:
Listing 10.6.1: socket address struct (C)
1 struct sockaddr {
2 u_char sa_family; /* address family */
3 char sa_data[14]; /* value of address */
4 };
Internet socket address data type: struct sockaddr_in
The members of the socket address data structure do not seem to relate much to what we would expect for an internet address (and port number). This is because the socket interface is not restricted to internet communications: many alternative underlying host-to-host transport mechanisms are available (specified by the value of the 'sa_family' socket address structure member), and these have different address schemes that have to be supported. The 14-byte address data is formatted in different ways depending on the underlying transport. For simplicity, a specific internet socket address structure has also been defined, which is used as an overlay to the more generic socket address structure. This makes programming the address information much more convenient, as a template for the specific internet address value format is provided:
Listing 10.6.2: internet socket address struct (C)
1 #include <netinet/in.h>
2 struct sockaddr_in {
3 sa_family_t sin_family; /* address family: AF_INET */
4 in_port_t sin_port; /* port in network byte order */
5 struct in_addr sin_addr; /* internet address */
6 };
The internet socket address structure has a member of type struct in_addr:
Listing 10.6.3: internet address struct (IPv4) (C)
1 #include <netinet/in.h>
2
3 struct in_addr {
4 uint32_t s_addr; /* address in network byte order */
5 };
When using a variable of internet socket address type, it is good practice to zero-fill the overlay padding sin_zero (not shown in the listing above).
Network byte order versus host byte order
The network byte order for TCP/IP is defined as big-endian; this is reflected in the data types used in the standard socket interface library. As such, it is essential that the host byte order is correctly mapped to the network byte order when setting values of the standard socket data type variables used, and vice versa when interpreting these values. The htons() and htonl() functions are used to convert host byte order 16-bit and 32-bit data types to their respective network byte order, and the ntohs() and ntohl() functions are used to convert network byte order 16-bit and 32-bit data types to their respective host byte order. This feature of standard socket programming is a minor but essential aspect of ensuring the portability of internet application code.
Arm platforms can be configured to run in little-endian or big-endian mode at boot time, so it is essential to use the above conversion functions to ensure the correctness of the code.
10.6.3 Common POSIX socket API functions
Create a socket descriptor: socket()
A socket (socket(2)) is opened, and its descriptor created, using:
Listing 10.6.4: socket() API call (C)
1 #include <sys/types.h> /* See NOTES */
2 #include <sys/socket.h>
3
4 int socket(int domain, int type, int protocol);
Return value: the returned socket descriptor is a standard I/O system file descriptor and can also be used with the I/O function close() and, in the case of stream-type sockets, with read() and write(). On error, the value -1 is returned.
Input parameters: The address family parameter domain should be set to AF_INET for internet socket communications. The socket type parameter type should be selected from SOCK_STREAM or SOCK_DGRAM for stream (TCP) and datagram (UDP) sockets, respectively. The protocol parameter protocol can be set to 0 to allow the socket function to select the associated protocol automatically.
Bind a server socket address to a socket descriptor: bind()
For 'server' type applications (i.e., those that listen for incoming connections on an opened socket) the server socket address is bound to a socket descriptor using bind(2):
Listing 10.6.5: bind() API call (C)
1 #include <sys/types.h> /* See NOTES */
2 #include <sys/socket.h>
3 int bind(int sockfd, const struct sockaddr *addr,
4 socklen_t addrlen);
Return value: the funcon returns 0 on success or -1 if the socket is invalid, the specied socket
address is invalid or in use or the specied socket descriptor is already bound.
Input parameters: typically for internet server type applicaons an internet socket address is used for
convenience when specifying the local socket address; however since the internet socket address
structure is designed as an overly to the generic socket address structure — variables of type struct
sockaddr_in can be passed as the addr parameter using a suitable type cast. Before calling the bind()
funcon it is necessary to populate the internet socket address (shown as myaddr below) with the
local system IP address and the required server port number:
Lisng 10.6.6: populang the internet socket address for bind() C
1 mysd = socket(AF_INET, SOCK_STREAM, 0);
2 memset((char *) &myaddr, 0, sizeof(struct sockaddr_in)); /* zero socket address */
3 myaddr.sin_family = AF_INET; /* internet family */
4 myaddr.sin_addr.s_addr = inet_addr("192.168.0.10"); /* local IP address */
5 myaddr.sin_port = htons(3490); /* local server port */
6 bind(mysd, (struct sockaddr *) &myaddr, sizeof(struct sockaddr) );
Note the use of memset() from the ANSI string library to first zero the internet socket address bytes, and the internet address manipulation function inet_addr() to produce the (network byte order) 4-byte IP address. Using the specific port number 0 tells bind() to choose a suitable unused port, if that is desired rather than having a fixed server port allocation (the selected port gets written to the supplied socket address before return). Writing the specific local IP address is not very convenient, and the code can ultimately be made more portable using the INADDR_ANY predefined IP address (declared for use with struct sockaddr_in), which tells bind() to use the local system IP address automatically (which is also written to the supplied socket address before return). Therefore, the server local IP address is more typically set using:
myaddr.sin_addr.s_addr = htonl(INADDR_ANY); /* auto local IP address */
Enable server socket connection requests: listen()
Once a server socket descriptor has been bound to a socket address, it is then necessary to enable connection requests to this socket and create an incoming connection request queue using listen(2):
Listing 10.6.7: listen() API call (C)
1 #include <sys/types.h> /* See NOTES */
2 #include <sys/socket.h>
3
4 int listen(int sockfd, int backlog);
Return value: the funcon returns 0 if okay or -1 if the socket is invalid or unable to listen.
Input parameters: incoming connecon requests are queued unl accepted by the server. The parameter
backlog is used to specify the maximum length of this queue and should have a value of at least 1.
Accept a server socket connecon request: accept()
Server socket connecon requests are accepted using accept(2):
Lisng 10.6.8: accept() API call C
1 #include <sys/types.h> /* See NOTES */
2 #include <sys/socket.h>
3
4 int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
Return value: the funcon returns the newly created socket descriptor associated with the client
socket address on success or -1 on error.
Input parameters: a socket address structure variable (more likely of type struct sockaddr_in with
a suitable type cast) is provided as parameter addr and is used to record the socket address associated
with the socket descriptor of the accepted incoming connecon request which is returned on success.
A pointer to an integer containing the socket address structure length is provided as parameter
addrlen, and this integer variable should contain the length of the socket address structure on input
and is modied to the actual address bytes used on return.
On success, the returned client socket descriptor can be used by the server to send and receive data
to the client.
If no pending connecons are present on the queue, and the socket is not marked as nonblocking,
accept() blocks unl a connecon is present. If the socket is marked nonblocking and no pending
connecons are present on the queue, accept() fails with the error EAGAIN or EWOULDBLOCK.
Linux kernel implementation of accept()
If the accept() is blocking, the kernel will take care of sleeping the caller until the call returns. The process will be added to a wait queue and then suspended until a TCP connection request is received. Once a connection request has been received, the sock data structure is returned to the socket layer. The file descriptor number of the new socket is returned to the process as the return value of the accept() call.
Client connecon request: connect()’
For ‘client’ type applicaons (i.e., those that connect to an acve server socket) a connecon request
to a specied server socket address is made using connect(2):
Lisng 10.6.9: connect() API call C
1 #include <sys/types.h> /* See NOTES */
2 #include <sys/socket.h>
3
4 int connect(int sockfd, const struct sockaddr *addr,
5 socklen_t addrlen);
Return value: the funcon returns 0 on success or -1 on error.
Input parameters: typically for internet client type applicaons an internet socket address is used for
convenience when specifying the remote server socket address; however since the internet socket
address structure is designed as an overly to the generic socket address structure variables of type
struct sockaddr_in can be passed as the addr parameter using a suitable type cast. Before calling the
connect() funcon, it is necessary to populate the internet socket address (shown as srvaddr below)
with the remote server system IP address and the required server port number:
Lisng 10.6.10: populang the internet socket address for connect() C
1 srvsd = socket(AF_INET, SOCK_STREAM, 0);
2 memset((char *) &srvaddr,0, sizeof(struct sockaddr_in)); /* zero socket address */
3 srvaddr.sin_family = AF_INET; /* internet family */
4 srvaddr.sin_addr.s_addr = inet_addr("192.168.0.10"); /* server IP address */
5 srvaddr.sin_port = htons(3490); /* server port */
6 connect(srvsd, (struct sockaddr *) &srvaddr, sizeof(struct sockaddr) );
On success, the socket descriptor used in the connection request can be used by the client to send data to and receive data from the server.
Write data to a stream socket: send()
Data is written to a stream socket using send(2):
Listing 10.6.11: send() API call (C)
1 #include <sys/types.h>
2 #include <sys/socket.h>
3
4 ssize_t send(int sockfd, const void *buf, size_t len, int flags);
5 ssize_t sendto(int sockfd, const void *buf, size_t len, int flags,
6 const struct sockaddr *dest_addr, socklen_t addrlen);
7 ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);
Return value: the funcon returns the actual number of bytes sent or -1 on error.
Input parameters: in-stream sockets the socket transport protocol bitwise ags of MSG_OOB (send
as urgent) and MSG_DONTROUTE (send without using roung tables) can be used (mulple bitwise
ags can be set concurrently by OR’ing the selecon). For standard data sending the value 0 is used
for parameter ags.
Because of the stream socket operaon compability with system I/O device operaon, the send()
socket-specic funcon is somemes replaced with the generic write() system I/O funcon. This
means that data sending can be easily redirected to other system devices (such as an opened le or
standard output).
Read data from a stream socket: recv()
Data is wrien to a stream socket using recv(2):
Lisng 10.6.12: recv() API call C
1 #include <sys/types.h>
2 #include <sys/socket.h>
3
4 ssize_t recv(int sockfd, void *buf, size_t len, int flags);
5 ssize_t recvfrom(int sockfd, void *buf, size_t len, int flags,
6 struct sockaddr *src_addr, socklen_t *addrlen);
7 ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags);
Return value: the funcon returns the actual number of bytes read into the receive buer; or
0 on end of le (socket disconnecon); or -1 on error.
Input parameters: in-stream sockets the socket transport protocol bitwise ags of MSG_OOB (receive
urgent data) and MSG_PEEK (copy data without removing it from the socket) can be used (mulple
bitwise ags can be set concurrently by OR-ing the selecon). For standard data recepon, the value 0
is used for parameter ags.
Because of the stream socket operaon compability with system I/O device operaon, the recv()
socket specic funcon is somemes replaced with the generic read() system I/O funcon. This
means that data recepon can be easily redirected to other system devices (such as an opened le
or standard input). Care should be taken when reading an expected number of bytes; the socket
transport does not guarantee when to receive bytes will be available, and blocks may be split into
smaller receive secons which may confound a simple socket read approach.
If no messages are available at the socket, the recv() call waits for a message to arrive, unless the
socket is nonblocking (see fcntl(2)), in which case the value -1 is returned and the external variable
errno is set to EAGAIN or EWOULDBLOCK. The recv() call normally returns any data available, up
to the requested amount, rather than waing for receipt of the full amount requested. Therefore in
pracce, recv() is usually called in a loop unl the required number of bytes has been received.
Seng server socket opons: setsockopt()
It is possible to set important underlying protocol opons for a server socket using setsockopt(2):
Lisng 10.6.13: setsockopt() API call C
1 #include <sys/types.h> /* See NOTES */
2 #include <sys/socket.h>
3 int getsockopt(int sockfd, int level, int optname,
4 void *optval, socklen_t *optlen);
5 int setsockopt(int sockfd, int level, int optname,
6 const void *optval, socklen_t optlen);
Return value: the funcon returns 0 on success or -1 if the socket is invalid, or the opon is unknown,
or the funcon is unable to set the opon.
Input parameters:
Socket Opon Used For
SO_KEEPALIVE
Detecng dead connecons (connecon is dropped if dead)
SO_LINGER
Graceful socket closure (does not close unl all pending transacons complete)
TCP_NODELAY
Allowing immediate transmission of small packets (no congeson avoidance)
SO_DEBUG
Invoking debug recording in the underlying protocol soware module
SO_REUSEADDR
Allows socket reuse of port numbers associated with “zombie“ control blocks
SO_SNDBUF
Adjusng the maximum size of the send buer
SO_RCVBUF
Adjusng the maximum size of the receive buer
SO_RCVBUF
Enabling the use of the TCP expedited data transmission.
The most commonly applied socket option for internet server applications is the socket reuse address option, which is required to allow the server to bind a socket to a specific port that has not yet been entirely freed by a previous session. Without this setting, any call to bind() may be prevented by a "zombie" session. In order to set this option, the defined SOL_SOCKET (socket protocol level) is used for parameter level; SO_REUSEADDR (the predefined name of the socket reuse address option) is used for parameter optname, and for this option the value is an integer which is set to 0 (OFF) or 1 (ON). A simple example of this for the myfd server socket descriptor is shown below:
Listing 10.6.14: Example setsockopt() API call (C)
1 sra_val = 1;
2 setsockopt(myfd, SOL_SOCKET, SO_REUSEADDR, (char *) &sra_val, sizeof(int));
Many other protocol options are available; see the man page for more details.
10.6.4 Common utility functions
Internet address manipulation functions
The following internet address manipulation functions are available:
Listing 10.6.15: Internet address manipulation functions (C)
1 #include <arpa/inet.h>
2 /* converts dotted decimal IP address string */
3 /* to network byte order 4 byte value */
4 u_long inet_addr(char * addr);
5 /* converts network byte order 4 byte IP addr*/
6 /* to dotted decimal IP address string */
7 char *inet_ntoa(struct in_addr addr);
Internet network/host byte order manipulation functions
The following network/host byte order manipulation functions are available and should be consistently applied:
Listing 10.6.16: Network/host byte order manipulation functions (C)
1 #include <netinet/in.h>
2 u_short htons(u_short x); /* 16-bit host to network byte order convert */
3 u_short ntohs(u_short x); /* 16-bit network to host byte order convert */
4 u_long htonl(u_long x); /* 32-bit host to network byte order convert */
5 u_long ntohl(u_long x); /* 32-bit network to host byte order convert */
Host table access functions
The local host name can be read from the host table using:
Listing 10.6.17: Host table access functions (C)
1 #include <unistd.h>
2 int gethostname (
3 char *name, /* name string buffer */
4 int namelen /* length of name string buffer */
5 );
Return value: the funcon returns 0 on success or -1 on error.
10.6.5 Building applications with TCP
The TCP protocol provides a reliable, bi-directional stream service over an IP-based network between pairs of processes.
One process is known as the server; when it comes to life, it binds itself to a particular TCP port number on the host upon which it executes and at which it will provide its particular service.
The other process is known as the client; when it comes to life, it connects to a server on a particular host that is bound to a particular TCP port number. Upon completion of the connection, either party can begin sending bytes to the other party over the stream.
Request/response communication using TCP
The TCP protocol is designed to maximize the reliable delivery of data end-to-end; to enable both the reliable delivery and to maximize the amount of data so delivered, the protocol is allowed to split the data supplied by the sender into as many packets as it likes (within reason). In particular, TCP does not guarantee that:
1. A sender's data is sent as soon as the send() or write() system call completes, i.e., your system can choose to buffer the data from several send()/write() system calls before actually sending the data over the network to the server.
2. A receiver receives the data in the same sized chunks that were specified in the sender's send()/write() system calls, i.e., it does not maintain "message" boundaries.
If you are trying to implement a request/response application protocol over TCP, then you need to program around these features. In the following sections, it is assumed that your client and server must maintain message boundaries.
Force the sending side to send your data over the network immediately
Listing 10.6.18: Example fflush() API call (C)
1 int s; /* your socket that has been created and connected */
2 FILE *sockout;
3 sockout = fdopen(s, "w"); /* FILE stream corresponding to the socket file descriptor */
4 fprintf(sockout, "your message\n"); /* write the message into the stdio stream */
5 fflush(sockout); /* force the buffered message out over the network */
Maintaining message boundaries
If your messages consist only of characters, use a sentinel character sequence at the end of each message, e.g., <cr><lf>.
If you have binary messages, then the actual message sent consists of a 2-byte length, in network order, followed by that many bytes.
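The fragment below sketches the binary variant: a 2-byte length in network byte order is sent first, followed by the message body. The name send_message is our own, and for brevity the sketch does not retry partial sends, which a production version would.
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

/* Send a length-prefixed binary message: 2-byte length (network order), then the body.
 * Returns 0 on success, -1 on error (including a short send). */
int send_message(int sockfd, const void *msg, uint16_t len)
{
    uint16_t netlen = htons(len);    /* length prefix in network byte order */

    if (send(sockfd, &netlen, sizeof(netlen), 0) != (ssize_t)sizeof(netlen))
        return -1;
    if (send(sockfd, msg, len, 0) != (ssize_t)len)
        return -1;
    return 0;
}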
TCP server
As described in Section 10.6.1 above, a TCP server must execute the socket functions according to the following pseudocode:
Listing 10.6.19: TCP server pseudocode (C)
1 s = socket(); /* create an endpoint for communication */
2 bind(s); /* bind the socket to a particular TCP port number */
3 listen(s); /* listen for connection requests */
4 while(1) { /* loop forever */
5 news = accept(s); /* accept first waiting connection */
6 send()/recv() over news /* interact with connected process */
7 close(news); /* disconnect from connected process */
8 }
9 close(s);
You may need to perform one or more setsockopt() calls before invoking bind(). Below is an example of a skeleton TCP server that reads all data from the connection and writes it to stdout.
Listing 10.6.20: Example skeleton TCP server (C)
1 #include <stdio.h>
2 #include <sys/types.h>
3 #include <sys/socket.h>
4 #include <netinet/in.h>
5
6 #dene MYPORT 3490 /* the port users will be connecting to */
7
8 int main(int argc, char *argv[]) {
9 int sfd, cfd; /* listen on sfd, new connections on cfd */
10 struct sockaddr_in my_addr; /* my address information */
11 struct sockaddr_in their_addr; /* client address information */
12 socklen_t sin_size; int c;
13 int yes=1;
14
15 /**** open the server (TCP) socket */
16 if ((sfd = socket(AF_INET, SOCK_STREAM, 0)) == -1) {
17 perror("socket");
18 return(-1);
19 }
20
21 /**** set the Reuse-Socket-Address option */
22 if (setsockopt(sfd, SOL_SOCKET, SO_REUSEADDR, (char*)&yes, sizeof(int))==-1) {
23 perror("setsockopt");
24 close(sfd);
25 return(-1);
26 }
27
28 /**** build server socket address */
29 bzero((char*) &my_addr, sizeof(struct sockaddr_in));
30 my_addr.sin_family = AF_INET;
31 my_addr.sin_addr.s_addr = htonl(INADDR_ANY);
32 my_addr.sin_port = htons(MYPORT);
33
34 /**** bind server socket to the local address */
35 if (bind(sfd, (struct sockaddr *)&my_addr, sizeof(struct sockaddr)) == -1) {
36 perror("bind");
37 close(sfd);
38 return(-1);
39 }
40
41 /**** create queue (1 only) for client connection requests */
42 if (listen(sfd, 1) == -1) {
43 perror("listen");
44 close(sfd);
45 return(-1);
46 }
47
48 /**** accept connection and read data until EOF, copying to standard output */
49 sin_size = sizeof(struct sockaddr_in);
50 if ((cfd = accept(sfd, (struct sockaddr *)&their_addr, &sin_size)) == -1) {
51 perror("accept");
52 close(sfd);
53 return(-1);
54 }
55 while (read(cfd, &c, 1) == 1)
56 putc(c, stdout);
57 close(cfd);
58 close(sfd);
59
60 return 0;
61 }
TCP client
As described above, a TCP client must execute the socket functions according to the following pseudocode:
Listing 10.6.21: TCP client pseudocode (C)
1 s = socket(); /* create an endpoint for communication */
2 connect(s); /* connect the socket to a particular host and TCP port number */
3 send()/recv() over s /* interact with server process */
4 close(s); /* disconnect from connected process */
You can see from the above code that a server needs to know the port to which it will bind, and from the pseudocode that the client needs to know the port to which the server is bound. A stream in TCP is identified by a 4-tuple of the form [source host, source port, destination host, destination port]. The connect() socket call actually assigns a random TCP port to the client. Since it is not a server, the fact that the port is randomly chosen from the legal port space is immaterial. The following TCP client connects to the above server and sends all data obtained from standard input to the server.
Listing 10.6.22: Example skeleton TCP client (C)
1 /*
2 ** TCPclient.c -- a TCP socket client
3 ** connects to 127.0.0.1:3490, sends contents of standard input
4 **
5 */
6
7 #include <stdio.h>
8 #include <sys/types.h>
9 #include <sys/socket.h>
10 #include <netinet/in.h>
11
12 #dene MYPORT 3490 /* the port users will be connecting to */
13
14 int main(int argc, char* argv[]) {
15 int sfd; /* connect on sfd */
16 struct sockaddr_in s_addr; /* server address information */
17 char buf[1024];
18 int len;
19
20 /**** open the server (TCP) socket */
21 if ((sfd = socket(AF_INET, SOCK_STREAM, 0)) == -1) {
22 perror("socket");
23 return(-1);
24 }
25
26 /**** build server socket address */
27 bzero((char*) &s_addr, sizeof(struct sockaddr_in));
28 s_addr.sin_family = AF_INET;
29 s_addr.sin_addr.s_addr = inet_addr("127.0.0.1");
30 s_addr.sin_port = htons(MYPORT);
31
32 /**** connect to server */
33 if (connect(sfd, (struct sockaddr *)&s_addr, sizeof(struct sockaddr)) == -1) {
34 perror("connect");
35 close(sfd);
36 return(-1);
37 }
38
39 while (fgets(buf, sizeof(buf), stdin) != NULL) {
40 len = strlen(buf);
41 if (send(sfd, buf, len, 0) != len) {
42 perror("send");
43 close(sfd);
44 return(-1);
45 }
46 }
47 close(sfd);
48
49 return 0;
50 }
10.6.6 Building applications with UDP
The UDP protocol provides an unreliable, bi-directional datagram service over an IP-based network between pairs of processes. Unlike TCP, there are no "connections" in UDP. A process that wishes to interact with other processes via UDP simply has to bind itself to a UDP port on its host. As long as it knows of at least one other process's host/port pair, it can begin to communicate with that process. When a process receives a UDP message, it can be informed of the host/port pair for the process that sent the message.
If you think about servers in the TCP realm, they advertise on well-known ports. We can think of long-lived processes that bind themselves to well-known UDP ports as servers.
Processes that bind themselves to random UDP ports, and that initiate communications with other processes, can be considered to be UDP clients.
The timeline of the various socket library function calls required in a typical (simple) datagram socket interaction is shown in the table below:
Note that UDP communicaon is unreliable. UDP primarily provides the ability to put applicaon-
level data directly into IP packets, with the UDP header providing the port informaon necessary to
direct the data, if received, to the correct process. UDP also provides a data integrity checksum of the
applicaon data so that a receiver knows that if it receives the data, it has received the correct data —
i.e. the data in the packet has not been corrupted.
Since the applicaon data is placed in an IP packet, this implies that the size of the applicaon
message, plus the UDP and IP headers, cannot exceed the size of an IP packet. Hosts negoate
the maximum IP packet size for communicaons between them; most networks support packets
containing 1536-byte UDP packets, but some are limited to 512 bytes UDP packets. If you have larger
messages to send, then you must fragment your message into mulple UDP packets, and reassemble
them at the receiver. For this reason, most uses of UDP are for short messages, such as measurements
from distributed sensors.
The following program lisngs are for UDP versions of the service provided in Secon 10.6.5.
Server Process                 Client Process                 Alternative Client          Description
1. socket(...)                 1. socket(...)                 1. socket(...)              Creates a socket file descriptor
2. bind(...)                   2. bind(...)                   2. bind(...)                Associate the socket with a UDP port number (server's is predefined)
3. recvfrom(...)/sendto(...)   3. sendto(...)/recvfrom(...)   3. connect(...)             Client binds server info to socket
4. Close(...)                  4. Close(...)                  4. send(...)/recv(...)      Client/Server data communications
                                                              5. Close(...)               Stop using the socket
UDP server
Lisng 10.6.23: Example skeleton UDP server C
/*
 * UDPserver.c -- a UDP socket server
 *
 */

#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>      /* memset */
#include <unistd.h>      /* close */
#include <stdio.h>

#define MYPORT 3490 /* the port to which the server is bound */

int main(int argc, char *argv[]) {
    int sfd;                     /* the socket for communication */
    int n;
    socklen_t len;
    struct sockaddr_in s_addr;   /* my s(erver) address data */
    struct sockaddr_in c_addr;   /* c(lient) address data */
    char buf[1024];

    memset(&s_addr, 0, sizeof(s_addr)); /* my address info */
    s_addr.sin_family = AF_INET;
    s_addr.sin_port = htons(MYPORT);
    s_addr.sin_addr.s_addr = htonl(INADDR_ANY);

    /**** open the UDP socket */
    if ((sfd = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
        perror("socket");
        return(-1);
    }

    /**** bind to my local port number */
    if ((bind(sfd, (struct sockaddr *)&s_addr, sizeof(s_addr)) < 0)) {
        perror("bind");
        return(-1);
    }

    /**** receive each message on the socket, printing on stdout */
    while (1) {
        memset(&c_addr, 0, sizeof(c_addr));
        len = sizeof(c_addr);
        n = recvfrom(sfd, buf, sizeof(buf), 0, (struct sockaddr *)&c_addr, &len);
        if (n < 0) {
            perror("recvfrom");
            return(-1);
        }
        fputs(buf, stdout);   /* the companion client sends the terminating '\0' */
        fflush(stdout);
    }
}
UDP client
Lisng 10.6.24: Example skeleton UDP client C
/*
 * UDPclient.c -- a UDP socket client
 *
 */

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>      /* memset, strlen */
#include <unistd.h>      /* close */
#include <stdio.h>

#define MYPORT 3490 /* the port to which the server is bound */

int main(int argc, char *argv[]) {
    int sfd; /* the socket for communication */
    struct sockaddr_in s_addr, m_addr; /* s(erver) and m(y) addr data */
    char buf[1024];
    int n;

    memset(&m_addr, 0, sizeof(m_addr)); /* my address information */
    m_addr.sin_family = AF_INET;
    m_addr.sin_port = 0; /* 0 ==> assign me a port */
    m_addr.sin_addr.s_addr = htonl(INADDR_ANY);

    memset(&s_addr, 0, sizeof(s_addr)); /* server addr info */
    s_addr.sin_family = AF_INET;
    s_addr.sin_port = htons(MYPORT);
    s_addr.sin_addr.s_addr = inet_addr("127.0.0.1");

    /**** open the UDP socket */
    if ((sfd = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
        perror("socket");
        return(-1);
    }

    /**** bind to local UDP port (randomly assigned) */
    if (bind(sfd, (struct sockaddr *)&m_addr, sizeof(m_addr)) < 0) {
        perror("bind");
        return(-1);
    }

    /**** send each line from stdin as a separate message to server */
    while (fgets(buf, sizeof(buf), stdin) != NULL) {
        n = strlen(buf) + 1; /* include the EOS! */
        sendto(sfd, buf, n, 0, (struct sockaddr *)&s_addr, sizeof(s_addr));
    }

    /**** close the socket */
    close(sfd);

    return 0;
}
UDP client using connect()
Instead of using sendto()/recvfrom(), the UDP client could first make a call to connect(), and then use send()/recv():
Lisng 10.6.25: Example UDP client with connect() C
1 /* After call to bind() */
2 /**** connect to remote host and UDP port */
3 if (connect(sfd, (struct sockaddr *)&s_addr, sizeof(s_addr)) < 0) {
4 perror("connect");
5 return(-1);
6 }
10.6.7 Handling mulple clients
The skeleton TCP server code from Secon 10.6.5 will block on the accept() and read() calls for the
connecon to a single client. That means that it can only serve this client. Typically, serves should be
able to handle many client requests. In this secon, we discuss the mechanisms that can be used to
build mul-client servers.
The select() system call
The select(2) call enables one to monitor several sockets at the same time. It indicates which sockets are ready for reading, which are ready for writing, and which sockets have raised exceptions. While select() is primarily used for networking applications, it works for file descriptors bound to any type of I/O device. The synopsis is:
Lisng 10.6.26: select() API call C
1 /* According to POSIX.1-2001, POSIX.1-2008 */
2 #include <sys/select.h>
3 /* According to earlier standards */
4 #include <sys/time.h>
5 #include <sys/types.h>
6 #include <unistd.h>
7
8 int select(int nfds, fd_set *readfds, fd_set *writefds,
9 fd_set *exceptfds, struct timeval *timeout);
Return value: the return value is the number of file descriptors that have been set in the fd_sets; if a timeout occurred, then the return value is 0. On error, the value -1 is returned.
Input parameters: For the nfds parameter, see below. Each fd_set parameter should have bits set corresponding to the file descriptors of interest for reading/writing/exceptions; upon return, the fd_set parameter will only have bits set for those file descriptors that are ready for reading/writing or those that have generated exceptions. The timeout parameter should contain the time to wait before returning; if the parameter has a value of 0, then select() simply checks the current state of the file descriptors in the fd_set parameters and returns immediately; if timeout is NULL, then select() waits until there is some activity on one of the file descriptors specified.
The funcon monitors sets of le descriptors; in parcular readfds, writefds, and excepds. Each
of these is a simple bitset. If you want to see if you can read from standard input and some socket
descriptor, sockfd, just add the le descriptors 0 (for stdin) and sockfd to the set readfds.
The parameter numfds should be set to the values of the highest le descriptor plus one. In this
example, it should be set to sockfd+1, since it is assuredly higher than standard input (0).
The select call will block until either:
a file descriptor becomes ready;
the call is interrupted by a signal handler; or
the timeout expires.
When select() returns, readfds will be modified to reflect which of the file descriptors you selected is ready for reading. You can test this with the macro FD_ISSET(). The following macros are provided to manipulate sets of type fd_set:
void FD_ZERO(fd_set *set): clears a file descriptor set
void FD_SET(int fd, fd_set *set): adds fd to the set
void FD_CLR(int fd, fd_set *set): removes fd from the set
void FD_ISSET(int fd, fd_set *set): tests to see if fd is in the set
The struct timeval allows you to specify a timeout period. If the time is exceeded and select() still hasn't found any ready file descriptors, it will return so you can continue processing.
The struct timeval has the following fields:
Lisng 10.6.27: meval struct C
1 struct timeval {
2 time_t tv_sec; /* seconds to wait */
3 suseconds_t tv_usec; /* microseconds to wait */
4 };
When select() returns, meout might be updated to show the me sll remaining. You should not
depend upon this, but this does imply that you must reset meout before each call.
Despite the provision for microseconds, the usual mer interval is around 10 milliseconds, so you will
probably wait that long no maer how small you set your struct meval. It is advisable to set your
mers to be mulples of 10 milliseconds.
Linux kernel implementaon of select()
The select() call works by looping over the list of le descriptors. For every le descriptor, it calls the
poll() method, which will add the caller to that le descriptor’s wait queue, and return which events
(readable, writeable, excepon) currently apply to that le descriptor.
The implementaon of the poll() method depends on the corresponding device driver, but all
implementaons have the following prototype:
Lisng 10.6.28: Linux kernel poll() method prototype C
1 unsigned int (*poll) (struct file *, poll_table *);
The driver's method will be called whenever the select() system call is performed. It is responsible for two actions:
Call poll_wait() on one or more wait queues that could indicate a change in the poll status.
Return a bitmask describing operations that could be immediately performed without blocking.
The poll_table struct (the second argument to the poll() method) is used within the kernel to implement the poll() and select() calls; it is defined in linux/poll.h as a struct which contains a method to operate on a poll queue and a bitmask.
Lisng 10.6.29: Linux kernel poll table struct C
1 typedef struct poll_table_struct {
2 poll_queue_proc _qproc;
3 __poll_t _key;
4 } poll_table;
An event queue that could wake up the process and change the status of the poll operation can be added to the poll_table structure by calling the function poll_wait():
Lisng 10.6.30: Linux kernel poll_wait() call C
1 static inline void poll_wait(struct file *filp,
2 wait_queue_head_t *wait_address, poll_table *p){
3 if (p && p->_qproc && wait_address)
4 p->_qproc(filp, wait_address, p);
5 }
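Putting these two responsibilities together, a driver's poll() method typically has the following shape. This is an illustrative sketch rather than code from the kernel source; struct my_device, its read_queue wait queue, and the data_ready() helper are hypothetical:

/* Illustrative sketch of a device driver poll() method.
 * my_device and data_ready() are hypothetical. */
#include <linux/fs.h>
#include <linux/poll.h>
#include <linux/wait.h>

struct my_device {
    wait_queue_head_t read_queue; /* woken when new data arrives */
    /* ... other device state ... */
};

static int data_ready(struct my_device *dev); /* hypothetical helper */

static unsigned int my_device_poll(struct file *filp, poll_table *wait)
{
    struct my_device *dev = filp->private_data;
    unsigned int mask = 0;

    /* 1. Register the caller on the wait queue(s) that may change the poll status. */
    poll_wait(filp, &dev->read_queue, wait);

    /* 2. Return a bitmask of operations that would not block right now. */
    if (data_ready(dev))
        mask |= POLLIN | POLLRDNORM; /* readable without blocking */

    return mask;
}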
Below is an example TCP server skeleton that uses select(). It simply prints the message received from
the client on STDOUT.
Lisng 10.6.31: Code skeleton for server with select() (1): setup, bind and listen C
#include <stdlib.h>
#include <string.h>
#include <strings.h>     /* bzero */
#include <stdio.h>
#include <unistd.h>      /* close, getdtablesize */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

#include <sys/time.h>
#include <sys/select.h>

#define MYPORT 3490 /* the port users will be connecting to */
#define MAX_NCLIENTS 5
#define MAX_NCHARS 128 /* max number of characters to be read/written at once */
#define FALSE 0
/* ====================================================================== */

int main(int argc, char * argv[]) {
    fd_set master;    /* master set of file descriptors */
    fd_set read_fds;  /* set of file descriptors to read from */
    int fdmax;        /* highest fd in the set */
    int s_fd;

    FD_ZERO(&read_fds);
    FD_ZERO(&master);
    /* get the current size of the file descriptor table */
    fdmax = getdtablesize();

    struct sockaddr_in my_addr;    /* my address information */
    struct sockaddr_in their_addr; /* client address information */

    /**** open the server (TCP) socket */
    if ((s_fd = socket(AF_INET, SOCK_STREAM, 0)) == -1) {
        perror("socket");
        return(-1);
    }

    /**** set the Reuse-Socket-Address option */
    const int yes=1;
    if (setsockopt(s_fd, SOL_SOCKET, SO_REUSEADDR, (char*)&yes, sizeof(int))==-1) {
        perror("setsockopt");
        close(s_fd);
        return(-1);
    }

    /**** build server socket address */
    bzero((char*) &my_addr, sizeof(struct sockaddr_in));
    my_addr.sin_family = AF_INET;
    my_addr.sin_addr.s_addr = htonl(INADDR_ANY);
    my_addr.sin_port = htons(MYPORT);

    /**** bind server socket to the local address */
    if (bind(s_fd, (struct sockaddr *)&my_addr, sizeof(struct sockaddr)) == -1) {
        perror("bind");
        close(s_fd);
        return(-1);
    }

    listen(s_fd, MAX_NCLIENTS);
Lisng 10.6.32: Code skeleton for server with select() (2): select, accept and read C
    FD_SET(s_fd, &master); // add s_fd to the master set

    fdmax = s_fd;

    while (1) {
        read_fds = master;
        select(fdmax+1, &read_fds, NULL, NULL, (struct timeval *)NULL); // never time out
        /* run through the existing connections looking for data to read */
        for (int i = 0; i <= fdmax; i++) {
            if (FD_ISSET(i, &read_fds)) { // if i belongs to the set read_fds
                if (i == s_fd) { // fd of server socket
                    // accept on new client socket newfd
                    socklen_t sin_size = sizeof(struct sockaddr_in);
                    int newfd = accept(s_fd, (struct sockaddr *)&their_addr, &sin_size);
                    if (newfd == -1) {
                        perror("accept");
                    } else {
                        FD_SET(newfd, &master); // add newfd to the master set
                        if (newfd > fdmax) {
                            fdmax = newfd;
                        }
                    }
                } else { // i is a client socket
                    printf("Hi, client\n");
                    /* handle client request */
                    char clientline[MAX_NCHARS]="";
                    char tmpchar;
                    char newline = '\n';
                    int eob = 0;
                    while (eob==0 && strlen(clientline) < MAX_NCHARS-1) {
                        if (read(i, &tmpchar, 1) != 1) break; // client closed or error
                        eob = (tmpchar==newline) ? 1 : 0;
                        strncat(clientline, &tmpchar, 1);
                    }
                    printf("%s", clientline);

                    /* clean up: close fd, remove from master set, decrement fdmax */
                    close(i);
                    FD_CLR(i, &master);
                    if (i == fdmax) {
                        while (FD_ISSET(fdmax, &master) == FALSE) {
                            fdmax -= 1;
                        }
                    }
                } // i?=s_fd
            } // FD_ISSET
        } // for i
    } // while()
    return 0;
}
Mulple server processes: fork() and exec()
Handling mulple clients using select() can be a good opon on a single-core system. However, on
a system with mulple cores, we would like to take advantage of the available parallelism to increase
the server performance. One way to do this is by forking a child process (as discussed in Chapter 4)
to handle each client request. Even on a single-threaded system, this approach has an advantage:
if a fatal error would occur in the process handling the client request, the main server process would
not die. If we handle the client request in the same code as the main server activity (as is the case if we use select()), then an exception in the client code would kill the entire server process.
Although fork()/exec() based code is conceptually simple, the TCP server skeleton sketched below is a bit more complicated because of the need to deal with zombie child processes. We do this using an asynchronous signal handler sigchld_handler(), which gets called whenever a child process exits. For a discussion of signals, see Chapter 4; for details on signals and handlers, see sigaction(2). Essentially, what the server does is fork a client handler whenever a request is accepted. The handler reads the client message until a newline is encountered, then it prints the message, closes the connection, and exits.
Multhreaded servers using pthreads
A nal mechanism to handle mulple clients is to use POSIX threads. The approach is quite similar
to the fork-based server: the server spawns a client handler thread whenever a request is accepted.
The handler reads the client message unl a newline is encountered, then it prints the message, closes
the connecon, and exits.
Lisng 10.6.35: Code skeleton for server with pthreads (1): setup, bind and listen C
1 #include <unistd.h>
2 #include <string.h>
3 #include <stdio.h>
4 #include <sys/types.h>
5 #include <sys/socket.h>
6 #include <netinet/in.h>
7 #include <pthread.h>
8
9 #dene MYPORT 3490 /* the port users will be connecting to */
10 #dene MAX_NCLIENTS 5
11 #dene MAX_NCHARS 128 /* max number of characters to be read/written at once */
12 #dene FALSE 0
13 /* ====================================================================== */
14
15 void *client_handler(void *);
16
17 int main(int argc, char * argv[]) {
18
19 struct sockaddr_in my_addr; /* my address information */
20 struct sockaddr_in their_addr; /* client address information */
21
22 pthread_t tid;
23 pthread_attr_t attr;
24 pthread_attr_init(&attr);
25 pthread_attr_setdetachstate(&attr,PTHREAD_CREATE_DETACHED);
26
27 /**** open the server (TCP) socket */
28 int s_fd = socket(AF_INET, SOCK_STREAM, 0);
29 if (s_fd == -1) {
30 perror("socket");
31 return(-1);
32 }
33
34 /**** set the Reuse-Socket-Address option */
35 const int yes=1;
36 if (setsockopt(s_fd, SOL_SOCKET, SO_REUSEADDR, (char*)&yes, sizeof(int))==-1) {
37 perror("setsockopt");
38 close(s_fd);
39 return(-1);
40 }
41
42 /**** build server socket address */
43 bzero((char*) &my_addr, sizeof(struct sockaddr_in));
44 my_addr.sin_family = AF_INET;
45 my_addr.sin_addr.s_addr = htonl(INADDR_ANY);
46 my_addr.sin_port = htons(MYPORT);
47
48 /**** bind server socket to the local address */
49 if (bind(s_fd, (struct sockaddr *)&my_addr, sizeof(struct sockaddr)) == -1) {
50 perror("bind");
51 close(s_fd);
52 return(-1);
53 }
54
55 listen(s_fd, MAX_NCLIENTS);
Lisng 10.6.36: Code skeleton for server with pthreads (2): accept, create thread and read C
1 socklen_t sin_size = sizeof(struct sockaddr_in);
2
3 while (1) {
4 // accept on new client socket newfd
5 int newfd = accept(s_fd, (struct sockaddr *)&their_addr, &sin_size);
6 if (newfd == -1) {
7 perror("accept");
8 } else {
9 // Create new thread
10 pthread_create(&tid, &attr, client_handler, (void*)(long)newfd); /* pass the fd via the void* argument */
11 }
12 } // while()
13 return 0;
14 }
15
16 void * client_handler(void* fdp) {
17 /* handle client request */
18 int c_fd = (int)(long) fdp; /* recover the fd passed by the main thread */
19 char clientline[MAX_NCHARS]="";
20 char tmpchar;
21 char newline = '\n';
22 int eob = 0;
23
24 while(eob==0 && strlen(clientline)<MAX_NCHARS-1) { /* leave room for the terminating '\0' */
25 read(c_fd,&tmpchar,1);
26 eob=(tmpchar==newline) ? 1 : 0;
27 strncat(clientline,&tmpchar,1);
28 }
29 printf("%s",clientline);
30
31 close(c_fd);
32 pthread_exit(0);
33 }
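To experiment with these skeletons on the Raspberry Pi, they can be compiled with gcc; note that the pthreads version needs the -pthread flag (the file names below are illustrative):

gcc -o server_select server_select.c            # select()-based server
gcc -o server_fork server_fork.c                # fork()-based server
gcc -o server_threads server_threads.c -pthread # pthreads-based server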
10.7 Summary
In this chapter, we have discussed why and how networking is implemented in the Linux kernel and provided an overview of the POSIX API for socket programming. We have provided examples of the most typical client and server functionality and discussed the different mechanisms a server can use to handle multiple clients.
10.8 Exercises and questions
10.8.1 Simple social networking
1. Implement a minimal Twitter-like TCP/IP client-server system.
The client can send messages of 140 characters to one other client via a server.
Each client has an 8-character name.
Implement the server using select(), fork/exec, and pthreads.
2. Add additional features:
a) client discovery;
b) ability to send to multiple clients.
10.8.2 The Linux networking stack
1. Discuss the structure of the Linux networking stack and the structure and role of the socket buffer data structure.
2. How does the Linux model differ from the OSI model?
10.8.3 The POSIX socket API
1. Why does Linux use a separate socket API for networking, instead of using the file API?
2. Sketch in pseudocode the timeline of the various socket library function calls required in a typical (simple) stream socket connection.
3. Which POSIX socket API calls are blocking and why?
Chapter 11
Advanced topics
Operang Systems Foundaons with Linux on the Raspberry Pi
282
11.1 Overview
So far in this textbook, we have presented standard concepts for current mainstream OS distributions, with particular reference to Linux. This final chapter will outline more advanced trends and features: many of these are not yet reflected in contemporary OS code bases; however, they may be integrated within the next decade. Rather than presenting concrete details, this chapter will provide pointers and search keywords to facilitate further investigation.
What you will learn
After you have studied the material in this chapter, you will be able to:
1. Give examples of different classes of systems on which Linux is deployed.
2. Explain how the characteristics of diverse systems lead to various trade-offs in OS construction and configuration.
3. Justify the requirement for lightweight, rapid deployments of specialized systems, particularly in the cloud.
4. Illustrate security vulnerabilities and mitigations in modern manycore systems, particularly with respect to speculative execution.
5. Assess the need for formal verification of OS components in various target scenarios.
6. Appreciate the community-based approach to developing new features in the Linux kernel.
11.2 Scaling down
The computer on which Torvalds initially developed Linux in 1991 was a 32-bit 386 processor clocked at 33MHz, with 4MB of RAM. Thanks to Moore's law, present-day smartphones and wearable devices are much more powerful than this original Linux machine. Many such small-scale consumer devices run variants of Linux such as Android, Tizen, or Chrome OS, see Figure 11.1. The compelling advantage of Linux is that it provides a highly customizable, off-the-shelf, core OS platform, enabling rapid time-to-market for consumer electronics. These modern Linux variants are specialized to enable fast boot times on specialized, proprietary hardware. They often restrict execution to a controlled set of trusted vendor-supplied apps.
The motivation is radically different for Raspberry Pi and other single board computers, which are intended to be as flexible and general-purpose as possible. These devices will support the broad flexibility of Linux kernel configurations, with a vast range of optional hardware device support. Generally, single board computers track smartphone hardware in terms of features and capabilities, since they are often based around similar chipsets and peripherals.
Smaller, less capable, embedded devices include internet-of-things (IoT) sensors or network edge devices. These nodes have minimal RAM and persistent storage, and may only have access to low bandwidth, intermittent network connections. Generally, such devices are targeted with specialized
Linux distribuons. One example is Alpine Linux, which has a minimal installaon footprint of around
100MB. Reduced runme memory requirements are supported by a specialized C library, such as
musl, and monolithic executables that provide a range of Unix ulies, e.g., busybox.
Figure 11.1: Chromebook running a Linux variant on an Arm chipset. Photo by author.
There is a logical progression in this trend to consolidate OS kernel, libraries, and application into a single monolithic image. If the user knows ahead-of-time the precise system use-cases, then it is feasible to eliminate large portions of the OS and libraries from the build, since they will never be required. This is the unikernel concept, exemplified by MirageOS, which performs aggressive specialization and dead code elimination to produce slim binaries for deployment.
11.3 Scaling up
Linux is the default OS for supercomputers. Since 2017, all machines in the TOP500 list of most powerful supercomputers in the world run Linux.
Generally, high-performance computing tasks are handled via a parallel framework such as MPI (see Section 7.6.3). Work is divided into small units to execute on the various nodes. Each shared-memory node runs Linux individually, so a supercomputer may have tens of thousands of Linux kernels running concurrently. The Archer facility at the Edinburgh Parallel Computing Centre, see Figure 11.2, incorporates 4920 nodes.
Similarly, large-scale cloud datacenters may have hundreds of thousands of nodes, each running a Linux image with higher-level control software, such as OpenStack, to enable effective resource management. This is warehouse-scale computing, a phrase appropriately coined by Google engineers [1].
Figure 11.2: Archer high-performance computing facility. Photo by Edinburgh Parallel Computing Centre.
Rack-scale systems feature tens of nodes, with hundreds of cores. Large data processing tasks are scheduled on such systems and may require inter-node cooperation, e.g., for distributed garbage collection. This inter-node synchronization of activities is effectively a meta-level OS [2].
As system architectures become larger and more complex, and the distinction between on-node and off-node memory is increasingly blurred, there is a trend towards multi-node, distributed OS designs. The Barrelfish experimental OS is a multikernel system. Each CPU core runs a small, single-core kernel, and the OS is organized as a distributed system of message-passing processes on top of these kernels. Processes are location agnostic, since inter-process communication may be with local or remote cores. From a programmer perspective, there is no distinction.
A related project is Plan 9 from Bell Labs, a distributed operating system that maintains the 'everything is a file' abstraction. Its developers include some of the original designers of Unix. The key novelties are a per-process namespace (an individual view of the shared network file system) and a message-based file system protocol for all communication. Eric Raymond summarizes the elegance of Plan 9 and the reasons for its minimal adoption [3]. Note there is a Plan 9 OS image available for installation on the Raspberry Pi.
The growth of heterogeneous computing means many machines have special-purpose accelerators such as GPUs, encryption units, or dedicated machine learning processors. These resources should be under the control of the OS, which mediates access by users and processes. This is particularly important for utility computing contexts, where many tenants are sharing an underlying physical resource.
In addition to supporting scaled-up computing on large machines, the next-generation OS also needs to handle scaled-up storage. Traditional Linux file systems like ext4 do not scale well to massive
and distributed contexts, due to the metadata updates and consistency that are required. Parallel frameworks often layer custom distributed file systems on top of per-node file systems, for instance, HDFS for Hadoop.
Global-scale distributed data systems are often key-value stores, such as etcd or mongodb, which feature replication and eventual consistency to mitigate latencies in wide area networks. Object stores, such as Minio and Ceph, allow binary blobs to be stored at known locations (perhaps web addresses) with associated access controls and other metadata.
11.4 Virtualizaon and containerizaon
Ulity compung implies general compute resource is situated in the cloud. Users simply rent CPU
me on virtual servers they can provision on-demand.
Virtualizaon enables mulple virtual machines (VMs) to be hosted and isolated from each other
on a single physical node. The hypervisor layer mulplexes guest VMs on top of the host machine.
Figure 11.3 presents the concepts of virtualizaon as a schemac diagram. This approach is crucial
for infrastructure service providers to support exible deployment and resource overprovisioning.
It is possible to migrate a VM to another physical node if service levels are not sucient. Modern
processors have extensions to support virtualizaon navely. These include extra privilege levels and
an addional layer of indirecon in memory management. Linux supports hardware virtualizaon with
the Kernel-based Virtual Machine (KVM), which acts as a hypervisor layer. Virtual machine soware
that runs on top of KVM includes the QEMU full system emulator. This allows a disnct guest OS,
possibly compiled for a dierent processor architecture, to execute on top of the Linux host OS.
Figure 11.3: Schemac diagram for virtualizaon, showing that an app actually runs on top of two kernels (in guest and host OS respecvely).
There is an alternave approach: unlike fully-edged virtualizaon where each VM runs a disnct
guest OS, Linux containers enable lightweight isolaon of processes that share a common host OS
kernel. While containers lack the flexibility of heavyweight virtualization, they are potentially much more efficient. For this reason, containerization is popular for use cases requiring rapid deployment times such as DevOps, cloud systems, and serverless computing. A user wants to spin up a relevant application service with minimal latency. Tools like Docker enable services to be specified and composed declaratively as scripts, then prebuilt images can be matched to these scripts. This avoids lengthy configuration and build times, enabling services to come up quickly.
Linux kernel facilies such as control groups (cgroups) enable containerizaon. Key concepts are
namespace isolaon and resource liming. Sets of processes can be collected together into a cgroup
and controlled as a unit. The bash lisng below illustrates how to exercise this control, and Figure 11.4
shows the outcome on a typical quad-core Raspberry Pi node.
Figure 11.4: CPU usage from top command, showing how Linux distributes CPU resource based on cgroups conguraon.
Lisng 11.4.1: Using cgroups to limit CPU resource Bash
1 sudo apt-get install stress # tool for CPU stress-testing
2 sudo apt-get install cgroup-tools # utils for cgroups
3 sudo cgcreate -g cpu:morecpu
4 sudo cgcreate -g cpu:lesscpu
5 cgget -r cpu.shares morecpu # default is 1024
6 sudo cgset -r cpu.shares=128 lesscpu # limit CPU usage
7 # now run some stress code in different control groups
8 sudo cgexec -g cpu:lesscpu stress --cpu 4 &
9 sudo cgexec -g cpu:morecpu stress --cpu 4 &
10 top # to see the CPU usage
11 sudo killall stress # to stop the stress jobs
Process sandboxes support throw-away execution. Processes may be run once; then, their side-effects may be isolated and discarded. In this sense, Linux containers are a progression of the earlier Unix chroot and BSD jail concepts. User-friendly configuration tools like Docker have massively popularized containerization.
The growth in the utility computing market requires greater levels of resource awareness in the underlying system. In particular, the OS needs to support three key activities:
1. Predicng: The OS must esmate ahead-of-me how long user tasks will take to complete and
which resources they will need. This is useful for ecient scheduling.
2. Accounng: The OS must keep track of precisely which resources are used by each applicaon.
This depends on low-level tools like perf, alongside higher-level applicaon-specic metrics such
as a number of database queries. This is essenal for billing users accurately for their workloads.
3. Constraining: The OS must allow certain sets of acons for each applicaon. Similar to sandboxing,
there are constraints on the applicaon behavior. Oen the constraints are expressed as a blacklist
of disallowed acons; this is generally how smartphone apps are executed. On the other hand, the
constraints could be expressed as a whitelist of allowable acons; this might be supported by a
capability-based system. CPU usage constraints, as outlined above, rely on quantave thresholds
that must be enforced by the kernel.
11.5 Security
In this section, we discuss two recently discovered types of exploits that make use of flaws in the hardware to compromise the system. Appreciating these exploits requires knowledge of hardware architecture (DRAM, cache, TLB, MMU, DMA), the memory subsystem, memory organization (paging), and memory protection. Therefore studying these exploits is a very good way to assess your understanding of concepts covered in the book.
Figure 11.5: Logos for the Rampage exploit and the Guardion mitigation.
11.5.1 Rowhammer, Rampage, Throwhammer, and Nethammer
The original Rowhammer exploit makes use of a vulnerability in modern DRAM, in particular, DDR3 and DDR4 SDRAM. Essentially, in such DRAMs, there is a non-zero probability of flipping a bit in a given row by alternated accesses to the adjacent rows [4]. The actual exploit uses this flaw by causing permission bits to be flipped in a page table entry (PTE), so that the PTE points to a physical page containing a page table of the attacking process. That process thereby gets read-write access to one of its own page tables, and hence to the entire physical memory. A very good explanation is given in the original blog post by Mark Seaborn. You can also test if your own computer is vulnerable.
Several variants of this exploit have been developed: Rowhammer.js (https://github.com/IAIK/rowhammerjs) [5]; Rampage, which uses the Android DMA buffer management API to induce
the bit ips [6], and building on this hps://vusec.net/projects/throwhammer, which exploits remote
direct memory access (RDMA) [7] and Nethammer [8], which uses only a specially craed packet
stream. Neither of these exploits requires the aacker to run code on the target machine. All of the
cited papers also discuss migaon strategies against the exploits.
The DRAM on the Raspberry Pi board is DDR2, which is generally not vulnerable to Rowhammer-type
exploits.
11.5.2 Spectre, Meltdown, Foreshadow
A modern OS has memory protection mechanisms which stop a process from accessing data belonging to another user, and also stop user processes from accessing kernel memory. Speculative execution attacks exploit the fact that a CPU will already start accessing data before it knows if it is allowed to, i.e., while the memory protection check is in progress. In theory, this is permissible because the results of this speculative execution should be protected at the hardware level. If a process does not have the right privilege, it is not allowed to access this data, and the data is discarded.
However, the protected data is stored in the cache regardless of the privilege of the process. Cache memory can be accessed more quickly than regular memory. The attacker process can try to access memory locations to test if the data there has been cached, by timing the access. This is known as a side-channel attack. Both Spectre and Meltdown, and also the more recent Foreshadow exploit, work by combining speculative execution and a cache side-channel attack.
Meltdown [9] gives a user process read access to kernel memory. The mitigation in Linux is a fundamental change to how memory is managed: as explained in Chapter 6, Linux normally maps kernel memory into a portion of the user address space for each process. On systems vulnerable to the Meltdown exploit, this allows the attacker process to read from the kernel memory. The solution is called kernel page-table isolation (KPTI).
Spectre [10] is a more complex exploit, harder to execute, but also harder to mitigate against. There are two variants: one ("bounds-check bypass", CVE-2017-5753) depends on the existence of a vulnerable code sequence that is conveniently accessible from user space; the other ("branch target injection", CVE-2017-5715) depends on poisoning the processor's branch-prediction mechanism so that indirect jumps will, under speculative execution, be redirected to an attacker-chosen location. The mitigation strategies are discussed in a post on LWN.
Finally, Foreshadow [11] (or L1 Terminal Fault) is the name for three speculative execution vulnerabilities that affect Intel processors. Foreshadow exploits a vulnerability in Intel's SGX (Software Guard Extensions) technology. SGX creates a 'secure enclave' in which users can provide secure software code that will run without being observed by even the operating system. SGX protects against Meltdown and Spectre; however, Foreshadow manages to circumvent this protection. A good explanation of the exploit is given by Jon Masters, chief ARM architect at Red Hat.
The Arm processor on the Raspberry Pi board is not susceptible to these speculative execution attacks as it does not perform speculative execution.
Figure 11.6: Logos for the Meltdown and Spectre exploits; it seems that eye-catching graphics are compulsory for OS security violations.
11.6 Vericaon and cercaon
Formal vericaon techniques use mathemacal models and proofs to provide guarantees about the
properes and behavior of systems. This is essenal as soware grows in size and complexity, and as
it becomes the essenal foundaon of our everyday societal interacons. Many industrial sectors are
establishing cered requirements for soware to be veried formally, e.g., ISO 26262 for automove
vehicles and DO-178C for aerospace. Since the OS is a crical part of the soware stack, it will
become increasingly necessary to apply vericaon techniques to sets of OS components.
Microso pioneered veried components for the Windows OS with its device driver vericaon
program. Poor quality, third-party device drivers running in privileged mode can compromise kernel
data structures and invariants, oen resulng in the ‘blue screen of death, see Figure 11.7. This is
the Windows equivalent of the Unix kernel panic. At one point, bugs in device drivers caused 85% of
system crashes in Windows XP. [12]
Figure 11.7: Blue screen of death in Windows XP (left) and Windows 10 (right); since the driver verification program, such blue screens are much less common. Photo by author.
The SLAM project blends ideas from static analysis, model checking, and theorem proving [13]. The key tool is the Static Driver Verifier (SDV), which analyzes C source code, typically a device driver implementation comprising thousands of lines of code, to check that it respects a set of hard-coded rules that encapsulate legal interactions with the Windows kernel.
The SDV simplies input C code by converng it to an abstract boolean program, retaining the
original control ow but encoding all relevant program state as boolean variables. This abstract
program is executed symbolically to idenfy and report kernel API rule violaons. An example rule
species that locks should be acquired then subsequently released in strict sequence. The collecon
of pre-packaged API rules is harvested from previously idened error reports and Windows driver
documentaon. Empirical evidence shows the SDV approach has signicantly reduced bugs in
Windows device drivers.
When modern OS soware is built-in high-level languages such as C# and Rust, it is feasible to
perform stac analysis directly on the source code, to provide guarantees about memory safety and
data race freedom, for instance. Such guarantees may be composed to generate high-level OS safety
properes.
The seL4 project is a fully verified microkernel system [14]. The OS consists of small independent components with clearly defined communication channels. Minimal functionality is provided in the verified microkernel, which is implemented in 10K lines of code, mostly C with some assembler. Properties include access control guarantees, memory safety, and system call termination. Generally, there is a proof that the C source code matches the high-level abstract specification of the system. These kinds of proofs are extremely expensive, in terms of expert human effort, to construct.
Cercaon involves mechanisms to guarantee the integrity of executable code. Cryptographic
hashes, such as MD5 and SHA1, are used to check a le has not been modied. For instance, when
you download a Raspbian SD card image from the Raspberry Pi website or a mirror, it is possible to
check the published SHA-256 hash of the le to guarantee its authencity, see Figure 11.8.
Figure 11.8: OS image hash is published alongside the download link to ensure authencity.
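For example, a downloaded image can be checked from the command line as follows; the file name and digest are placeholders, to be replaced by the values shown on the download page:

# compute the SHA-256 digest of the downloaded image (file name is illustrative)
sha256sum 2019-09-26-raspbian-buster.zip
# or verify it directly against the published value
echo "<published-sha256-digest>  2019-09-26-raspbian-buster.zip" | sha256sum -c -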
A signed executable ensures provenance as well as integrity. Using public key infrastructure, a code distributor can sign the executable file, or a hash of the file, with their private key. A potential user can check the signature and the hash, to be sure the code is from an appropriate source and has not been modified. Linux utilities like elfsign or the Integrity Measurement Architecture support digital
signatures for executable files. Hardware support, such as Arm TrustZone, is required for secure code certification. In particular, it is necessary to check the firmware and boot loader to ensure that only certified code is able to run on the system.
Reproducibility is a key goal in modern systems. This is important for scientific experiments, for debugging, and for ensuring compatibility in a highly eclectic system of software components. Declarative scripting languages, like those provided by Nix or Puppet, enable systems to be configured easily to a common standard. This is ideal for DevOps scenarios. The Nix package manager keeps track of all data and code dependencies required to build each executable, via a cryptographic hash. This is encoded directly in the path for the executable, e.g. /nix/store/a9i0a06gcs8w9fj9nghsl0b6vvqpzpi4-bash-4.4-p23, which means multiple versions of an application can co-exist in the same system, and be managed easily with configurable profiles. System administrators never need to 'overwrite' an old application or library when they upgrade to a new version, which makes compatibility and rollback much easier.
Lisng 11.6.1: Example nix docker session Bash
1 # check out https://nixos.org/nix/manual/
2 # for more details
3 docker pull nixos/nix
4 docker run -it nixos/nix
5 nix-env -qa
6 nix-build '<nixpkgs>' -A hello
7 nix-shell '<nixpkgs>' -A hello
8 ./result/bin/hello
9 ls -l ./result
11.7 Recongurability
As compung plaorms become more exible, incorporang technology such as FPGA accelerators,
the OS must support on-the-y reconguraon. Similarly, in cloud compung contexts, the resources
available to a VM may change as the guest OS is migrated to dierent virtual servers with a range of
hardware opons. Even a commodity CPU on a laptop can be congured to operate at dierent clock
frequencies, trading o compute performance and power consumpon.
Presently, Linux supports dynamic reconguraon with a range of heurisc policies for parcular
resources. For instance, there is a CPU frequency governor that controls processor clock frequency
depending on current resource usage. Various research projects have explored the potenal for
machine learning to enable automac runme tuning of OS parameters. To date, there is no machine
learning component embedded in a mainstream OS kernel. Self-tuning systems based on machine
learning may arrive soon, although they would not be compliant with current domain-specic
cercaon, e.g., in the automove or aerospace sectors.
There is an accelerang trend to move OS components into user space. We introduced the noon
of a le system in user space (FUSE) in Chapter 9. Networking in user space is also supported, with
frameworks like the Data Plane Development Kit (DPDK) that support accelerated, customized
packet processing in user applicaon code. This exibility enables techniques like soware-dened
networking and network function virtualization. Effectively, the network stack can be reconfigured at runtime in software.
In theory, as the Linux kernel transfers these traditional OS responsibilities to user space code, its architecture increasingly resembles a micro-kernel OS. The historical criticism of Linux was that it was too monolithic to scale and survive, see Figure 11.9. Torvalds addressed these criticisms directly at the time, and reflected on his design principles at a later date [15].
Figure 11.9: Part of the famous ‘Linux is obsolete’ debate focused on its non-microkernel architecture. Cartoons by Lovisa Sundin.
11.8 Linux development roadmap
There is no formal roadmap for Linux kernel development. There are a number of release candidates with experimental features, some of which will be incorporated in future stable releases. Check https://kernel.org for the latest details. The Linux Weekly News service keeps track of ongoing changes to the kernel, see https://lwn.net/Kernel/
Tanenbaum's criticism of the Linux architecture on Usenet (29 Jan 1992), Andrew Tanenbaum, Subject: LINUX is obsolete:
"MINIX is a microkernel-based system. The file system and memory management are separate processes, running outside the kernel. The I/O drivers are also separate processes (in the kernel, but only because the brain-dead nature of the Intel CPUs makes that difficult to do otherwise). LINUX is a monolithic style system. This is a giant step back into the 1970s. That is like taking an existing, working C program and rewriting it in BASIC. To me, writing a monolithic system in 1991 is a truly poor idea."
Excerpt of Torvalds' response on Usenet (29 Jan 1992), Linus Torvalds:
">1. MICROKERNEL VS MONOLITHIC SYSTEM
True, linux is monolithic, and I agree that microkernels are nicer. With a less argumentative subject, I'd probably have agreed with most of what you said. From a theoretical (and aesthetical) standpoint linux loses [sic].
>MINIX is a microkernel-based system.
>[deleted, but not so that you miss the point]
>LINUX is a monolithic style system.
If this was the only criterion for the "goodness" of a kernel, you'd be right..."
11.9 Further reading
Throughout this chapter, we have given a flavor of contemporary trends in OS development and deployment. Some of these issues have an immediate impact on Linux; others may affect the platform over the next decade.
The annual workshop on Hot Topics in Operating Systems (HotOS) is an excellent venue for OS future studies and speculation. If you are interested in OS research and development, consult recent years' proceedings of this event, which should be available online.
11.10 Exercises and questions
11.10.1 Make a minimal kernel
Configure and build a custom Linux kernel for your Raspberry Pi. How small a kernel image can you create?
11.10.2 Verify important properties
Verified software systems provide formal guarantees about their properties and behavior. Suggest some properties you might want to prove about components of an OS.
11.10.3 Commercial comparison
Much of the popularity of Linux could be attributed to the fact it is free, open-source software (FOSS). Compare Linux with a mainstream OS that is not FOSS. Can you identify differences, and explain why they might occur? Is there a different emphasis on developing new features?
11.10.4 For or against certification
Software certification has a number of advantages and disadvantages, which must be carefully assessed. Draw up a debate card, listing the pros and cons of OS certification. This could form the basis for a group discussion with your peers.
11.10.5 Devolved decisions
The modern Linux kernel abdicates responsibility for certain policies to user space, e.g., for file systems (with FUSE) and networking (with DPDK). Discuss other services that might be transferred from the kernel to user space. System logging is one candidate.
11.10.6 Underclock, overclock
It is possible to modify the configuration of your Raspberry Pi board to change the CPU clock frequency. Find the line specifying arm_freq = 1200 in your /boot/config.txt and modify this. The frequency is specified as an integer, denoting MHz. There are other frequencies you can change, such as those for GPU and memory. Check online documentation for details, and note that some settings may void your warranty.
You can investigate how frequency and power trade off, by monitoring your Raspberry Pi power consumption when you run CPU-intensive applications (perhaps the stress utility). You will need to use an external USB digital multimeter or power monitor. Produce a graph to show the relationship between frequency in MHz and power in W.
References
[1] L. A. Barroso, U. Hölzle, and P. Ranganathan, The Datacenter as a Computer: Designing Warehouse-Scale Machines, 3rd ed. Morgan Claypool, 2018.
[2] M. Maas, K. Asanović, T. Harris, and J. Kubiatowicz, "Taurus: A holistic language runtime system for coordinating distributed managed-language applications," in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, 2016, pp. 457–471.
[3] E. S. Raymond, Plan 9: The Way the Future Was. Addison Wesley, 2003, http://catb.org/~esr/writings/taoup/html/plan9.html
[4] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, "Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors," in ACM SIGARCH Computer Architecture News, vol. 42, no. 3, 2014, pp. 361–372.
[5] D. Gruss, C. Maurice, and S. Mangard, "Rowhammer.js: A remote software-induced fault attack in JavaScript," in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2016, pp. 300–321.
[6] V. Van Der Veen, Y. Fratantonio, M. Lindorfer, D. Gruss, C. Maurice, G. Vigna, H. Bos, K. Razavi, and C. Giuffrida, "Drammer: Deterministic Rowhammer attacks on mobile platforms," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 1675–1689.
[7] A. Tatar, R. Krishnan, E. Athanasopoulos, C. Giuffrida, H. Bos, and K. Razavi, "Throwhammer: Rowhammer attacks over the network and defenses," in 2018 USENIX Annual Technical Conference, 2018.
[8] M. Lipp, M. T. Aga, M. Schwarz, D. Gruss, C. Maurice, L. Raab, and L. Lamster, "Nethammer: Inducing Rowhammer faults through network requests," arXiv preprint arXiv:1805.04956, 2018.
[9] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, A. Fogh, J. Horn, S. Mangard, P. Kocher, D. Genkin et al., "Meltdown: Reading kernel memory from user space," in 27th USENIX Security Symposium, 2018, pp. 973–990.
[10] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, "Spectre attacks: Exploiting speculative execution," arXiv preprint arXiv:1801.01203, 2018.
[11] J. Van Bulck, M. Minkin, O. Weisse, D. Genkin, B. Kasikci, F. Piessens, M. Silberstein, T. F. Wenisch, Y. Yarom, and R. Strackx, "Foreshadow: Extracting the keys to the Intel SGX kingdom with transient out-of-order execution," in 27th USENIX Security Symposium, 2018, pp. 991–1008.
[12] T. Ball, E. Bounimova, B. Cook, V. Levin, J. Lichtenberg, C. McGarvey, B. Ondrusek, S. K. Rajamani, and A. Ustuner, "Thorough static analysis of device drivers," ACM SIGOPS Operating Systems Review, vol. 40, no. 4, pp. 73–85, 2006.
[13] T. Ball, B. Cook, V. Levin, and S. K. Rajamani, "SLAM and Static Driver Verifier: Technology transfer of formal methods inside Microsoft," Tech. Rep. MSR-TR-2004-08, 2004, https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2004-08.pdf
[14] G. Klein, J. Andronick, K. Elphinstone, G. Heiser, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt, R. Kolanski, M. Norrish, T. Sewell, H. Tuch, and S. Winwood, "seL4: Formal verification of an operating-system kernel," Communications of the ACM, vol. 53, no. 6, pp. 107–115, Jun. 2010.
[15] L. Torvalds, The Linux Edge. O'Reilly, 1999, http://www.oreilly.com/openbook/opensources/book/linus.html
Glossary of terms
Address space: A set of discrete memory addresses. The physical address space is the set of all of the memory in a computer system, including the system memory (DRAM) as well as the I/O devices and other peripherals such as disks.
Application binary interface (ABI): The specifications to which an executable must conform in order to execute in a specific execution environment.
Arithmetic logic unit (ALU): The part of a processor that performs computations.
Assembly language: A low-level programming language with a very strong correspondence between the program's statements and the architecture's machine code instructions, used as a target by compilers for higher-level languages.
Atomic operation: An operation which is guaranteed to be isolated from interrupts, signals, concurrent processes, and threads.
Booting: The process of starting up a computer system and putting it in a state so that it can be used.
Cache: A small but fast memory used to limit the time spent by the CPU in waiting for main memory access. For every memory read operation, first the processor checks if the data is present in the cache, and if so (cache hit) it uses that data rather than accessing the DRAM. Otherwise (cache miss) it will fetch the data from memory and store it in the cache.
Cache coherency: In a multicore computer system with multiple caches, cache coherency (or cache coherence) is the mechanism that ensures that changes in data are propagated throughout the memory system in a timely fashion so that all the caches of a resource have the same data.
Clock tick: Informal synonym for clock cycle, the time between two consecutive rising (positive) edges of the system clock signal.
Complex instruction set computing (CISC): A CPU with a large set of complex and specialized instructions rather than a small set of simple and general instructions. The typical example is the x86 architecture.
Concurrency: The fact that more than one task is running concurrently (at the same time) on the system. In other words, concurrency is a property of the workload rather than the system, provided that the system has support for running more than one task at the same time. In practice, one of the key reasons to have an OS is to support concurrency through scheduling of tasks on a single shared CPU.
Crical secon A secon of a program which cannot be executed by more than one
process or thread at the same me. Crical secons typically access
a shared resource and require synchronizaon primives such as
mutual exclusion locks to funcon correctly.
Deadlock The state in which each process in a group of communicang process is
waing for a message from the other process in order to proceed with
an acon. Alternavely, in a group of processes with shared resources,
there will be deadlock if each process is waing for another process to
release the resource that it needs to proceed with the acon.
Direct memory access
(DMA)
A mechanism that allows peripherals to transfer data directly into
the main memory without going through the processor registers.
In Arm systems, the DMA controller unit is typically a peripheral.
DRAM Dynamic random-access memory, high-density memory, slower than
SRAM. It is typically used as the main memory in a computer system.
A DRAM cell is typically a small capacitor. As the charge leaks, it
needs to be periodically refreshed.
Endianness The sequenal order in which bytes are arranged into words when
stored in memory or when transmied over digital links. There are
two incompable formats in common use, called big-endian and
lile-endian. In big-endian format, the most signicant byte (the byte
containing the most signicant bit) is stored at the lowest address.
Lile-endian format reverses this order.
Everything is a le A key concept in Linux and other UNIX-like operang systems.
It does not mean that all objects in Linux are les as dened above,
but rather that Linux prefers to treat all objects from which the
OS can read data or to which it can write data using a consistent
interface. So it might be more accurate to say, "everything is a stream
of bytes." Linux uses the concept of a le descriptor, an abstract
handle used to access an input/output resource (of which a le
is just one type). So one can also say that in Linux, “everything is
a le descriptor.
File: A named set of related data that is presented to the user as a single, contiguous block of information, and that is kept in persistent storage.
File system: A system for the logical organization of data. The purpose of most file systems is to provide the file and directory (folder) abstractions. A file system not only allows information to be stored in the form of files organized in directories, but also records information about the permissions and usage of files and directories, as well as timestamp information. The information in a file system is typically organized as a hierarchical tree of directories, and the directory at the root of the tree is called the root directory.
Hypervisor: A program, firmware, or hardware system that creates and runs virtual machines.
Instruction: A computer program consists of a series of instructions. Each instruction determines how the processor interacts with the system through the address space.
Interrupt: A signal sent to the processor by hardware (peripherals) or software indicating an event that needs immediate attention. The action of sending the signal is called an interrupt request (IRQ).
Kernel: The program that is the core of an operating system, with complete control over everything in the system. It is usually one of the first programs loaded when booting the system (after the bootloader). It handles the rest of startup and initialization as well as requests for system services from other processes.
Memory: The hardware that stores information for immediate use in a computer, typically SRAM or DRAM.
Memory management unit (MMU): A computer system hardware component which manages memory access control and memory address translation, in particular the translation of virtual memory addresses to physical addresses.
Memory address: An unsigned integer value used as the identifier for a word of data stored in memory.
MIPS for the masses: The slogan of the original Arm design team, which aimed to create a cheap but powerful processor that would provide lots of processing power ("MIPS" means Millions of Instructions Per Second) for a price that everybody could afford.
Mnemonic: An abbreviation for an operation. Assembly language uses mnemonics to represent each low-level machine instruction or opcode, and typically also each architectural register, flag, etc. Also the surname of the eponymous character in William Gibson's novella "Johnny Mnemonic" (1981).
MPI (Message passing interface): An API specification designed for high-performance computing. It provides a distributed memory model for parallel programming. Its main targets have been clusters and multiprocessor machines, but more recently also many-core systems. The message passing model means that tasks do not share any memory. Instead, every task has its own private memory, and any communication between tasks is via the exchange of messages.
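A minimal MPI sketch in C: every task (MPI process) has its own private copy of the variables and learns its identity from the runtime; any data exchange would have to go through explicit calls such as MPI_Send and MPI_Recv (not shown here).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this task's identity          */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of tasks started */
        printf("task %d of %d\n", rank, size); /* each task has private memory  */
        MPI_Finalize();
        return 0;
    }

Such a program is typically compiled with an MPI wrapper compiler (e.g., mpicc) and launched with a runner such as mpirun, which starts one task per requested process.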
Mounng The operaon performed by the kernel to provide access to a le system.
Mounng a le system aaches that le system to a directory (mount
point) and makes it available to the system. The root le system is always
mounted. Any other le system can be connected or disconnected from
the root le system at any point in the directory tree.
Multasking The concurrent execuon of mulple tasks (also known as processes)
over a certain period of me.
Network interface
controller (NIC)
Also known as a network interface card or network adapter,
is a computer hardware component that connects a computer to
a computer network.
Networking The interacon of a computer system with other computer systems
using an intermediate communicaon infrastructure.
Opcode An opcode or operaon code is the part of a machine language
instrucon that species the operaon to be performed. Most
instrucons also specify the data to be processed in the form of
operands.
OpenCL: An open standard for parallel computing on heterogeneous architectures.
OpenMP: A standard for shared-memory parallel programming. It is based on a set of compiler directives or pragmas, combined with a programming API to specify parallel regions, data scope, synchronization, etc. OpenMP is a portable parallel programming approach, and the specification supports C, C++, and Fortran.
Operang system An operang system (OS) is a dedicated program that manages
the hardware and soware resources of a computer system and
provides common services for computer programs running on the
system. Modern operang systems keep track of resource usage
of tasks and use me-sharing to schedule tasks for ecient use of
the system.
Parallelism Parallel processing is a capability of a computer system.
Paron A disk can be divided into parons, which means that instead of
presenng as a single blob of data, it presents as several dierent
blobs. Parons a are logical rather than physical, and the informaon
about how the disk is paroned is stored in a paron table.
Peripheral A device connected to a computer, used to put informaon into and
get informaon out of the computer. "The Peripheral" is also the name
of a science con novel by William Gibson (2014).
Persistent storage: Also known as non-volatile storage, a type of storage that retains its data even if the device is powered off. Examples are solid-state drives (SSDs), hard disks, and magnetic tapes.
Polling: The action of periodically checking the state of a peripheral.
POSIX: The Portable Operating System Interface (POSIX) is a family of IEEE standards aimed at maintaining compatibility between operating systems. POSIX defines the application programming interface (API) used by programs to interact with the operating system.
Preemption: The act of temporarily interrupting a task being carried out by a computer system (in particular, a process running on a CPU), without requiring the cooperation of that task, and with the intention of resuming the task at a later time. Preemption is a key feature of preemptive multitasking. The alternative approach, where the cooperation of a task is needed, is called cooperative multitasking.
Process: A process is a running program, i.e., the code for the program and all system resources it uses. The concept of a process is used for the separation of code and resources. With this definition, a process can consist of multiple threads.
Process control block (PCB): Also called a Task Control Block (TCB). The operating system kernel data structure which contains the information needed to manage the scheduling of a particular process.
RAM: Random-access memory. Data stored in RAM can be read or written in almost the same amount of time irrespective of the physical location of the data inside the memory. This is in contrast to other direct-access data storage media such as hard disks, CDs, DVDs, and magnetic tapes.
Reduced instruction set computing (RISC): A CPU with a small set of simple and general instructions, rather than a large set of complex and specialized instructions. Arm processors have a RISC architecture.
Register file: An array of words called registers, typically implemented as SRAM memory and part of the CPU.
Root user: In Linux and other Unix-like computer OSes, root is the conventional name of the user who has all rights or permissions (to all files and programs) in all modes (single- or multi-user). Alternative names include superuser and administrator. In Linux, the actual name of the account is not the determining factor.
Scheduling: The mechanism used by the operating system kernel to allocate CPU time to tasks.
SIMD (Single instrucon
mulple data)
A type of parallel computaon where mulple processing elements
perform the same operaon on mulple data points simultaneously.
SRAM Stac random-access memory, lower-density memory, faster than
DRAM. It is typically used for cache memory in a computer system.
An SRAM cell is a latch, so it retains its value as long as the device is
powered on, without the need for refreshing.
Symmetric
mulprocessing (SMP)
An operaonal model for mulcore computer systems where two or
more idencal cores are connected to a single, shared main memory,
have full access to all input and output devices, and are controlled by
a single operang system instance that treats all processors equally,
reserving none for special purposes. Most modern mulcore systems
use an SMP architecture.
System clock: A counter of the time elapsed since some arbitrary starting date called the epoch. Linux and other POSIX-compliant systems encode system time as the number of seconds elapsed since the start of the Unix epoch at 1 January 1970 00:00:00 UTC, with exceptions for leap seconds.
System state: The set of all information in a system that the system remembers between events or user interactions.
System-on-chip (SoC): Also called system-on-a-chip, an IC (integrated circuit) that integrates all components of a computer system. These components typically include a CPU, memory, I/O ports, and secondary storage, combined on a single chip.
Task: A unit of execution or a unit of work on a computer system. The term is somewhat less strictly defined and usually relates to scheduling.
Thread: Multiple concurrent tasks executing within a single process are called threads of execution. The threads of a process share its resources. For a process with a single thread of execution, the terms task and process are often used interchangeably.
Timer: A specialized type of clock used for measuring specific time intervals.
Translation look-aside buffer (TLB): A special type of cache which stores recent translations of virtual memory to physical memory. It is part of the MMU.
User: In general, a user is a person who utilizes a computer system. However, in the context of an operating system, the term user is used more broadly to identify the ownership of processes and resources. Therefore, a user does not need to be a person.
Virtual machine: A program which emulates a computer system. Virtual machines are based on computer architectures and provide the functionality of a physical computer. Modern computer systems provide hardware support for the deployment of virtual machines (virtualization) through hypervisors.
Word: A fixed-size, contiguous array of bits used by a given processor design. A word is a fixed-sized piece of data handled as a unit by the instruction set or the hardware of the processor. The number of bits in a word (also called the word size, word width, or word length) is a key characteristic for any specific processor architecture. Typically, a word consists of a number of bytes (a byte is a sequence of 8 bits), which are stored either in little-endian or big-endian format (see endianness). The most common word sizes for modern processors are 64 and 32 bits, but processors with 8- or 16-bit word sizes are still used for embedded systems.
Index
AArch32 6, 54-55
AArch64 50, 54-55, 57, 94
Accelerator Coherency Port 64
Accept 255, 259
Acon 2, 4-6
Address map 3-4, 61-63, 129
Address space 6, 23, 61-63, 128-131, 148, 296
Address space layout, see Address map
Address space layout randomization 149
Advanced high-performance bus 50
ALU, see Arithmetic logic unit
Application binary interface 94, 296
Arithmetic logic unit 8, 296
Arm Cortex A53 50, 53-61
Arm Cortex M0+ 50-52
Armv6-M 51-52
Armv8-A 50, 53-55
Assembly language 8, 296
Associave 15, 58, 185
Atomic 165-169, 222, 296
Big-endian 257, 297
Binary tree 185
Bind 255, 258-259
Bitmap 165-167, 234
Block device 204
Blocking IO 207
Boot process (see also Booting) 36-37, 183
Boot sequence (see also Booting) 36
Booting 36, 183, 296
Bootloader 32, 36
Branch 7, 9
Buer cache 142, 242
Cache
coherency
15-18, 59-61, 127, 142, 296
61, 64, 296
Character device 203-204
Chgrp 35, 225
Chrt 111, 120
Clock, processor 2, 13, 291
Clock page replacement algorithm 143-144
Clock, system 5, 301
Clock cycle 5, 91
Clock tick 296
Completely fair scheduler 107
Complex instrucon set compung 50, 296
Concurrency 20, 158-161, 296
Connect 255, 260-261
Context switch 81-82
Control ow 12
Copy on write 72, 146-147
Core 49-50, 149, 159
Cortex, see Arm Cortex
Credenals 34-35, 225
Deadlock 161-162, 297
Demand paging 145
Device driver 31, 38, 204, 253
Device tree 32
Dijkstra 162-163
Directory 219-220, 231-233
Direct memory access 13-14, 63, 210, 297
DMA, see Direct memory access
Docker 244, 286
DRAM, see Dynamic random access memory
Dynamic shared object 95
Dynamic random access memory 3, 127, 287, 297
EABI 94-95
Earliest deadline first 101, 112
Ethernet 206, 250, 253
Everything is a le 33, 218, 284, 297
Evict 15, 60, 144
Exclusive monitor 163-164
Exec 73, 275
Extended le system 233
Extents 236
Ext4 233-238
Operang Systems Foundaons with Linux on the Raspberry Pi
306
Index
FAT, see File allocation table
Fetch-decode-execute cycle 8
File 218, 297
File allocation table 238-242
File system 32-33, 218-220, 228-230, 297
Floating-point unit 55
Fork 71-74, 275
Fsck 243
Futex 174
Getsockopt 262
Gey 38
Groupadd 35
Heterogeneous multiprocessing 181
Host layers 251
HTTP 250
Hypervisor 56-57, 285, 298
IEEE 754-2008 55
Illegal instrucon 86-87
Init 34, 37-38, 77
Inializaon 37, 183
Inode 230-231
Insmod 39, 205
Instrucon
cycle
register
set
5-8, 298
8
9
50-51, 54
Interrupt 4, 209-211, 298
Interrupt handler 210-213
Interrupt request 4, 13
Interrupt service routine, see Interrupt handler
Interrupt vector table 13
Ioctl 207
IRQ, see Interrupt request
ISR, see Interrupt service routine
IVT, see Interrupt vector table
Journal 237
Kbuild 42
Kernel, Linux 31-32, 37-42, 76, 292, 298
Kernel, OpenCL 192
Kernel module 39-42, 204
Kernel space 32
Kill 83
Large physical address extension 61, 133
LDREX 163-164
Least recently used 144
Link register 6, 12
Listen 255, 259
Lile-endian 257, 297
Load balancing 183-184
Logical address space 23
Login 38
LR, see Link register
MapReduce 185, 195
Media layers 251-252
Memory 126, 298
Memory address 298
Memory barrier 170-172
Memory management, see Memory management unit
Memory management unit 57, 129, 298
Memory operation ordering 169
Memory protection 52, 152, 288
Memory protection unit 24, 52
Memset 259
MIPS for the masses 48, 298
Modprobe 39
MOESI 61
Monitor 163-164
MPI 190-191, 283, 298
Mprotect 152
Mutex 163, 174-177, 178
NEON (see also SIMD) 53
Nested vectored interrupt controller 51
Network adapter 250, 299
Network interface controller 250, 299
Network layer 252
Network protocol 253
Operang Systems Foundaons with Linux on the Raspberry Pi
308
Index
Networking 250-279, 299
Nice 102, 104, 119
Non-blocking IO 207
Non-preempve 96-100
Not recently used 143
Opcode 9, 299
OpenCL 191-194, 299
OpenMP 189-191, 299
OSI 251-252
Page cache 166
Page fault 138-141
Page table 130-137
Page table entry metadata 134
Parallelism 181-185, 189, 193, 195, 299
Paron 32, 299
PC, see Program counter
Peripheral 2, 299
Permissions 34-35, 225
Persistent storage (see also File system) 300
Physical address 23, 128-130
Physical address space 3
Plan 9 284
Polling 209, 300
POSIX 42, 299
Preempon 96-100, 115-116, 300
Preempve 96-100, 102
Priories 98-99, 104-107, 119
Privileges 23, 34-35
Process, see Task (also see Thread)
Process control block 74-76, 300
Process lifecycle 70, 91-92
Program counter 6, 9
Programming model 158
Pthreads 186-189
RAM, see Random access memory
Random access memory 3, 126-127, 300
Raspberry Pi 36, 53
Read 256, 258, 262, 271
Read-modify-write 165
Recv 255, 261-262, 268, 271
Red-black tree 116-119
Reduced instrucon set compung 50, 300
Reducon 185
Register 6, 55, 300
Renice 119
Rmmod 39, 213
Root directory 33, 219
Root user 34, 300
Round-robin 21, 98
Scheduler 21, 31, 98, 184
Scheduling 21, 31, 43, 90-122, 184
Select 271-273, 275-276
Semaphore 159, 162-163, 175-179
Send 255, 261, 264, 268, 271
Sendto 255, 268, 271
Setsockopt 255, 262-265
SEV 183
Shared resource 158-160
Shortest job first 99
Shortest remaining time first 99
Signal handler 84
SIMD, see Single instruction multiple data
Single instruction multiple data 55, 182, 300
Socket 255-263, 265-268, 271
SP, see Stack pointer
Spin lock 173, 179
Stack 6, 11, 148
Stack pointer 6, 11-12, 23, 52, 57
State 3
Stream socket 255-256, 261-262
Operang Systems Foundaons with Linux on the Raspberry Pi
310
Index
STREX 163-164
Subroune call 12-13
Superblock 229, 233-234
Supervisor 52
Swap cache 166
Swap space 138
SWI 95, 210
Symmetric mulprocessing 164, 169, 183, 301
Synchronizaon 161, 163-165, 171-172, 177, 189-190, 194-195
Syscall 94-95
System state 2-6, 301
System timer 21, 52
System-on-a-chip 36, 206, 301
Systemd 37-38
Tanenbaum 292
Task (see also Process) 20-22, 90, 301
Task scheduler 21
task_struct 76, 102
TCP 252-253, 255, 264-267
TCP/IP 44, 250, 252-253
Thread 31, 77, 101-102, 186, 301
thread_info 76
Threading building blocks 194-195
Thumb 49, 51-52, 54
Time 92-93, 140
Time slice 21, 98
Time slicing 21
TLB, see Translaon look-aside buer
Torvalds 282, 292
Translaon look-aside buer 24, 58, 136, 301
Transport layer 254-255
UDP 253, 268-271
Ulimit 35, 221
Union le system 244
User 34, 301
User space 20, 32
Useradd 35
Virtual address space 128-129
Virtual memory 127-130
Virtual file system 75, 228-229
Wait 73-74
Waing (process state) 79-80, 92
Wilson 48
Working set 141
x86 50
YIELD 183
Zombie 80
Operang Systems Foundaons with Linux on the Raspberry Pi
312
Arm Educaon Media
Online Courses
Our online courses have been developed to help students learn about state-of-the-art technologies from the Arm partner ecosystem. Each online course contains 10-14 modules, and each module comprises lecture slides with notes, interactive quizzes, hands-on labs and lab solutions. The courses will give your students an understanding of Arm architecture and the principles of software and hardware system design on Arm-based platforms, skills essential for today's computer engineering workplace.
Available now:
Ecient Embedded Systems Design and Programming
Rapid Embedded Systems Design and Programming
Digital Signal Processing
Internet of Things
Graphics and Mobile Gaming
System-on-Chip Design
Real-Time Operang Systems Design and Programming
Advanced System-on-Chip Design
Embedded Linux
Mechatronics and Robotics
Introducon to System-on-Chip Design
Online Courses
The Internet of Things promises devices endowed with processing, memory, and communication capabilities. These processing nodes will be, in effect, simple Systems-on-Chips (SoCs). They will need to be inexpensive, and able to operate under stringent performance, power and area constraints.
The Introduction to System-on-Chip Design Online Course focuses on building SoCs around Arm Cortex-M0 processors, which are perfectly suited for IoT needs. Using FPGAs as prototyping platforms, this course explores a typical SoC development process: from creating high-level functional specifications to design, implementation, and testing on real FPGA hardware using standard hardware description and software programming languages.
Discover more at www.armedumedia.com
Learning outcomes:
Knowledge and understanding of
Arm Cortex-M processor architectures and Arm Cortex-M based SoCs
Design of Arm Cortex-M based SoCs in a standard hardware description language
Low-level software design for Arm Cortex-M based SoCs and high-level application development
Intellectual
Ability to use and choose between different techniques for digital system design and capture
Ability to evaluate implementation results (e.g., speed, area, power) and correlate them with the corresponding high-level design and capture
Practical
Ability to use commercial tools to develop Arm Cortex-M based SoCs
Course Syllabus:
Prerequisites: Basics of hardware description language (Verilog or VHDL), Basic C, and assembly programming.
Modules
1. Introduction to Arm-based System-on-Chip Design
2. The Arm Cortex-M0 Processor Architecture: Part 1
3. The Arm Cortex-M0 Processor Architecture: Part 2
4. AMBA3 AHB-Lite Bus Architecture
5. AHB SRAM Memory Controller
6. AHB VGA Peripheral
7. AHB UART Peripheral
8. Timer, GPIO, and 7-Segment Peripherals
9. Interrupt Mechanisms
10. Programming an SoC Using C Language
11. Arm CMSIS and Software Drivers
12. Application Programming Interface and Final Application
Operang Systems Foundaons with Linux on the Raspberry Pi
Operang Systems Foundaons with Linux on the Raspberry Pi
314
Arm Educaon Media
Books
The Arm Educaon books program aims to take learners from foundaonal
knowledge and skills covered by its textbooks to expert-level mastery of
Arm-based technologies through its reference books. Textbooks are suitable
for classroom adopon in Electrical Engineering, Computer Engineering, and
related areas. Reference books are suitable for graduate students, researchers,
aspiring and praccing engineers.
Available now:
Embedded Systems Fundamentals with Arm Cortex-M based Microcontrollers: A Practical Approach
By Dr. Alexander G. Dean
ISBN 978-1-911531-03-6
Digital Signal Processing using Arm Cortex-M based Microcontrollers: Theory and Practice
By Cem Ünsalan, M. Erkin Yücel, H. Deniz Gürhan
ISBN 978-1-911531-16-6
System-on-Chip Design with Arm® Cortex®-M Processors: Reference Book
By Joseph Yiu
ISBN 978-1-911531-18-0
Operating Systems
Foundations
with Linux on the Raspberry Pi
Reference Book
The aim of this book is to provide a practical introduction to the foundations of modern operating systems, with a particular focus on GNU/Linux and the Arm platform. The unique perspective of the authors is that they explain operating systems theory and concepts but also ground them in practical use through illustrative examples of their implementation in GNU/Linux, making the connection with the Arm hardware supporting the OS functionality. For use in ECE, EE, and CS Departments.
Arm Education Media is a publishing operation with Arm Ltd, providing a range of educational materials for aspiring and practicing engineers.
For more information, visit: armedumedia.com
Contents
1 A Memory-centric
System Model
2 A Praccal View of the
Linux System
3 Hardware Architecture
4 Process Management
5 Process Scheduling
6 Memory Management
7 Concurrency and Parallelism
8 Input / Output
9 Persistent Storage
10 Networking
11 Advanced Topics
"While the modern systems software stack has become large and complex, the fundamental principles are unchanging. Operating Systems must trade off abstraction for efficiency. In this respect, Linux on Arm is particularly instructive. The authors do an excellent job of presenting Operating Systems concepts, with direct links to concrete examples of these concepts in Linux on the Raspberry Pi. Please don't just read this textbook – buy a Pi and try out the practical exercises as you go."
Steve Furber CBE FRS FREng
ICL Professor of Computer Engineering,
The University of Manchester