Technical notes on perl, python, php, sql, cgi, c/c++, q/kdb+, unix/shell, revision control tools, data structures & algorithms, and their applications into web services and other various forms of software engineering.

intro to OS

Georgia Tech - Graduate Intro to Operating Systems 
$ /usr/bin/make -f makefile_example 
--------------------------------// makefile_example 
# specify the compiler 
CC=g++ 
# specify options for the compiler 
CFLAGS=-c -Wall 
all: hello 
hello: main.o hello.o 
    $(CC) main.o hello.o -o hello 
main.o: main.cpp 
    $(CC) $(CFLAGS) main.cpp 
hello.o: hello.cpp 
    $(CC) $(CFLAGS) hello.cpp 
clean: 
    rm -rf *.o hello 
###  topic overview  (P1L1) 
OS abstractions, mechanisms, and policies. 
- process mgmt 
- threads, concurrency and synchronization 
- resource mgmt : scheduling, memory mgmt 
- OS services for communication and IO 
- OS support for distributed services 
####     Intro to OS     ##### (P1L2) 
OS : software that abstracts (simplify) & arbitrates (manage/control) the underlying hardware system components. 
- directs operational resources (CPU/GPU, memory, peripheral devices, disks, NIC, USB, etc) 
- enforces working policies (fair/limit resource access, e.g. how many FDs one proc can occupy, etc) i.e. provide isolation & protection 
- mitigates difficulty of complex tasks (abstract HW details, e.g. system calls) 
e.g. thanks to sys call that abstracts R/W, you dont have to worry about the disk sector of HDD when you save data. 
OS examples 
- servers 
- mainframe 
- desktop 
- embedded (e.g. smartphone) 
e.g.  Windows, Unix/Linux, android(linux), iOS, Symbian 
OS elements 
- abstractions   (process, thread, file, socket, mem page, etc) 
- mechanisms     (create, schedule, open, write, allocate, etc) 
- policies       (LRU, EDF = earliest deadline first, etc) 
OS Design Principles 
- separation of mechanism & policy 
-- implement flexible mechanisms to support many policies (e.g. LRU, LFU, random) 
- optimize for common case 
-- where will the OS be used? (HW details, like what kind of machine, CPU, mem, disks, etc) 
-- what will it be used for most? (common user/application) 
-- what are the workload requirements? 
OS Protection Boundary 
- kernel-level (privileged mode for OS to directly access HW) 
- user-level   (unprivileged mode) 
===> the switch btwn the two modes is supported by HW 
      - trap instruction (when a user-level app tries to do something privileged, it triggers a trap, and the OS decides how to respond, e.g. permit or deny, etc) 
      - system call (interface for app to request OS to perform certain privileged-mode services. open, send, malloc) 
      - signals 
"Synchronous Mode" : wait until sys call completes. (as opposed to asynchronous mode) 
(example: sys call) 
===> user-kernel transition is not cheap, involves trap instructions and many CPU cycles, can be a performance overhead. 
OS Basic Services 
- scheduling 
- mem mgr 
- block device driver (for disks, speaker, etc) 
- file system 
==> gotta manage proc/file/mem/device/storage/security, etc 
(sys call quiz) 
[1] Monolithic(general) OS 
[pros] 
- everything included 
- inlining, compile-time optimization 
[cons] 
- customization, portability, manageability are hard 
- large mem footprint 
- performance can suffer 
[2] Modular(interface) OS 
e.g. Linux; one can implement & install a new module, like writing a new file system optimized for certain operations/workloads. 
[pros] 
- maintainability 
- less resource needs 
- smaller mem footprint 
[cons] 
- indirection can impact performance (depending on how good your interface is) 
- maintainability can be an issue (a module can come from a completely separate, outside code base, potentially a source of bugs) 
[3] Microkernel (primitive/minimal) 
- only address space, threads, inter-process communication 
- no OS-level support for file systems, disk/DB drivers, etc 
[pros] 
- small size 
- verifiability 
[cons] 
- portability (microkernels are specialized in nature, so the underlying HW is usually specialized too) 
- complexity of software dev 
- cost of frequent user/kernel crossings 
Linux Architecture 
- user                      (user interface) 
- standard utility programs (library interface)  # user mode 
- standard library          (syscall interface)  # user mode 
- OS                        # kernel mode 
- HW 
(see kernel diagram) 
#####     Process & Process Mgmt     ######  P2L1 
what is a process? 
how is it represented by OS? 
how does OS manage multiple processes sharing the same underlying HW platform? 
process = task = job 
- instance of an executing program 
[state of execution] 
- program counter, stack pointer 
[parts & temporary holding area] 
- data, register state in mem 
[may require special HW] 
- I/O devices, e.g. disks, NIC 
application == program on disk/flash mem  ==>  "static entity" 
process     == state of a program when executing, loaded in memory ==> "active entity" 
e.g. application = MS Office Word 
     process = any instances of Word, you may open more than one instance when you wanna edit multiple docs. 
  stack    # grows and shrinks in LIFO way 
  heap     # dynamically created during execution 
  data     # static state when process starts 
  text     # static state when process starts 
why stack ?  - suppose an executing program is in some state A, then jumps to run some procedure in state B, then comes back to state A. keeping the states in a stack (LIFO) makes sense. 
Address Space == 'in memory' representation of a process. (virtual address, v0, v1, v2,,,) 
Page Tables   == mapping btwn virtual & physical addresses (physical mem, e.g. DRAM) 
e.g. 32bits virtual address space = 4GB 
===> if you allocate full 4GB to every proc, you soon run out of RAM, so OS must be smart how much to (statically/dynamically) allocate from RAM or disk. 
===> OS needs to maintain both address space & page tables 

#  process execution state 

PC (program counter) : keeps track of the current position in the instruction(execution) sequence. 
CPU register         : keeps PC among other things immediately necessary for the execution. 
Stack Pointer        : points to the top of the exec stack. 
===> to manage all those, OS maintains PCB (process control block) 
PCB: a data structure OS maintains to manage every process (current state, PID, PC, registers and their values as related to the proc, mem limit, open files, priority, signal mask, CPU scheduling info, etc) 
PCB is created when process is created, then initialized, then its fields updated as process state changes. 
(e.g. fields like PC updates at every instruction. for this, usually CPU has a dedicated register for PC) 
Context Switch : switching the CPU from the context of one process to the context of another. 
                 e.g. OS may work on process A, then interrupt it and switch to process B, then interrupt it and switch back to A. 
                 OS will maintain PCB_pA, PCB_pB. 
==> context switch is expensive ! 
    - [direct cost]  : number of cycles for load & store (PCB blocks) instructions. 
    - [indirect cost]: cold cache (= cache misses)    # basically the inactive proc's data will be evicted from register|L1|L2 to lower-level processor cache or even RAM. then when the CPU context-switches back to the proc, it will incur cache misses (aka cold cache). 
==> really need to minimize context switch. 
(good quiz) 
==> sometimes, even with a good hot cache, the CPU must still context switch, e.g. due to the priority given to another proc, or maybe due to a policy that dictates time sharing. 

#  process life cycle 

- running VS idle 
- idle -> (scheduler dispatch) -> running -> (interrupt) -> idle 
- a more granular state transition model (new, ready, running, waiting, terminated, etc) is available 

#  process creation 

- fork : copies the parent PCB into a new child PCB. 
         child continues execution of the instruction immediately after the fork. 
- exec : replace child image. 
         load new program and start from the first instruction. 

#  CPU scheduler 

- suppose there are 3 processes in 'ready' state 
- which one to run first? (and how long?) -> decided/managed by "CPU Scheduler" 
- "pre-empt" : interrupt and save current context 
- "schedule" : run scheduler to choose next process. 
- "dispatch" : dispatch process & switch into its context  # start 
===> these OS services are overhead relative to actual work, so we need good design, algorithms, and implementation to keep the OS efficient. 
===> how often should we run the CPU Scheduler? (i.e. how long should a process run?) 
     - timeslice : the amount of time allocated to a process on the CPU before the scheduler runs again. 
                   we want the ratio of time spent on actual work versus scheduling work to be as high as possible. 
                   scheduling design principles (to be discussed deeper later) 
                   - what are appropriate timeslice values? 
                   - metrics to choose next proc to run? 
(good quiz) 

#  Inter-Process Communication (IPC) model 

- transfer data btwn address spaces 
- maintain protection & isolation 
- provide flexibility & performance 
(1) message-passing IPC 
- OS provides communication channel, like shared buffer or file 
- processes use a common API provided by the OS, e.g. write/read messages to/from the channel. 
[pros] OS mgmt 
[cons] overhead 
(2) shared memory IPC 
- OS establishes a shared channel and maps it into each process address space 
- processes directly read/write from this mem 
[pros] no OS overhead since OS is out of the way 
[cons] since OS is out of the way, developer must implement write/read from/to shared mem safely. (can be hard) 
####     Threads and Concurrency     #### 
single process & multiple execution contexts ==> threads 
- threads & concurrency 
- synchronization 
(excellent paper) An Introduction to Programming with Threads 
a thread : 
- an active executing unit of a process 
- works simultaneously with other threads 
- requires coordination on resource sharing (I/O devices, CPUs, memory, etc) 
===> but how to decide how to share? it is an important design decision for OS & SW developers. 
process VS thread 
- single process : 
[code][data][files] - [PCB] 
1. [reg][stack][exec context]     # reg contains PC 
- multi threaded process 
[code][data][files] - [PCB]      # PCB shared among threads. just ONE address space (virt/phy address map) 
1. [reg][stack][exec context]    # multi exec contexts 
2. [reg][stack][exec context]    # code is the same, but each thread executing its own instruction, 
3. [reg][stack][exec context]    # so each needs to have its own stack, reg, PC 
4. .. 
- parallelization : speed up (each thread gets its own CPU core/cache) 
- specialization : certain thread with higher priority, for certain task/request/user etc. 
                   => hot cache !  (same address space, same data, and multiple contexts accessing them, clearly hotter cache) 
multi-proc VS multi thread 
==> multi-threading creates multiple execution contexts within one address space setup (the address-space swap is the most costly part of a context switch between processes, so this benefit is big). 
==> more mem efficient. 
==> also, multi-proc requires IPC, which can be costly, whereas threads can accommodate a shared variable, etc. more easily. (of course this introduces some challenges of its own, which we will discuss later) 
Q. are threads useful on a single CPU ? 
A. Yes. 
suppose you have 2 threads: one issues a request to a disk and must wait for the result. depending on how long the disk takes to do its read ops, the 2nd thread can do useful work while the 1st thread has nothing to do. so threads on a single CPU/core can still be useful. 
more in general, # of threads > # of CPUs can be useful. 
lets define variables as follows; 
time_idle : e.g. like disk seek/read time as above 
time_context_switch : how long it takes to switch context 
if (time_idle > 2 * time_context_switch) then it makes sense to context switch 
time_context_switch(threads) < time_context_switch(processes)   # threads share the address space, so there is less to swap 
====> means threads can hide latency more often than processes can. 
both applications and OS (kernel) benefit from multi-threads. 
Thread basic mechanism: 
- data structure : identify threads, keep track of resource usage 
- mechanisms to create and manage threads 
- mechanisms to safely coordinate among threads running concurrently in the same address space 
 (because threads exist within the same one proc, with the same virtual addresses and the same underlying physical addresses, they need a mutual exclusion mechanism to avoid data races.) 
(1) mutual exclusion: "mutex" - exclusive access to only one thread at a time 
(2) wait & notify : "condition variable" 
(3) wait & wakeup 
===> we focus on (1) & (2) : both called "synchronization" mechanisms 
Threads Creation 
- thread type (data structure) : thread ID, PC, SP, registers, stack, attributes 
- fork(proc, args) : create a new thread with a procedure to execute, and its arguments 
                     it is NOT the unix fork that simply creates a duplicate process of the caller process. 
thread_1 may call  thread_2 = fork(proc, args); then the second thread starts up. 
- join(thread) : waits till the specified thread terminates. 
thread_1 may call  child_result = join(thread_2); 
===> note, both parent and child threads pretty much have the same privilege, in terms of accessing shared data, OS services, etc. 
Shared_list list; 
Thread thread1 = fork(safe_insert, 4); 
safe_insert(6);   // parent thread inserts 6 itself, concurrently with thread1 
join(thread1);  // optional 
===> the resulting list may be either  4-6-null  or 6-4-null 
===> assuming the list is a standard LL, then we need a proper mutex. 
Mutex: a lock construct for certain shared resource 
       - state : locked or not ? 
       - owner : who acquired the lock(mutex) ? 
       - blocked_threads : who are being blocked ? 
lock(mutex){    // called acquiring the lock 
 do_some_work_on_shared_resource();  // this part of the code is 
 do_more();                          // called the "critical section" 
}   // called releasing/freeing the lock 
NOTE: instead of lock(m){ critical section; },  some APIs implement: 
      lock(m); 
      // critical section; 
      unlock(m); 
==> now, re-visit safe_insert(int i); 
list<int> my_list; 
Mutex m; 
void safe_insert(int i){ 
   lock(m){ 
      my_list.insert(i);   // critical section 
   } 
} 

NOTE: suppose t1 locks m first, then while it is locked, t2,t3,t4 sequentially call lock(m). what happens when t1 unlocks m? 
      ==> it is not a FIFO waiting queue; any one of t2,t3,t4 may get m next. 
(good quiz) 

#  Condition Variable 

suppose you have many producer threads filling up a list. 
then you have one consumer thread that wants to process the list ONCE the list is full. 
if you only use the mutex/lock mechanism, the consumer thread has to repeatedly lock the list and check whether it is full, at every iteration of a producer inserting an element. 
===> but this is terribly wasteful. 
===> "condition variable" to the rescue 
// consumer: print_and_clear 
lock(m){ 
   while(my_list.not_full()) 
      wait(m, list_full);   // releases the mutex while waiting, re-acquires it once the wait is over 
   my_list.print_and_remove_all(); 
} // unlock 
// producers : safe_insert 
lock(m){ 
   my_list.insert(my_thread_id); 
   if my_list.full() 
      signal(list_full);    // aka notify 
} // unlock 

[common condition variable API] 
(1) condition data type : mutex ref, waiting threads 
(2) wait(mutex,cond)    : release & re-acquire mutex 
(3) signal(cond)        : notify only one thread waiting on condition 
    broadcast(cond)     : notify all waiting threads 
(good quiz) 

#  Reader / Writer problem 

suppose you have multiple reader threads, and writer threads, and one shared file. 
when nobody is writing, any number of readers can read the file. 
so using a plain lock(m) in every thread is too restrictive. 
if(read_counter == 0 && write_counter == 0){ R = ok; W = ok; } 
if(read_counter > 0){ R = ok; W = not ok; } 
if(write_counter == 1){ R = not ok; W = not ok; } 
====> to be smart about this, instead of locking the mutex for the shared file itself, we can mutex lock on the state variable for the file. 
- free : resource_counter = 0 
- read : resource_counter > 0 
- write: resource_counter = -1 
// variables 
Mutex counter_mutex;     // mutex for resource_counter 
Condition read_phase, write_phase; 
int resource_counter = 0; 
// Readers 
lock(counter_mutex){ 
   while(resource_counter == -1)         // think of this section 
      wait(counter_mutex, read_phase);   // as read 'entry' critical section 
   resource_counter++; 
} // unlock 
// reading happens here, no need to keep mutex locked, because we control it at entry/exit points. 
// lets call this section the read critical section 
lock(counter_mutex){ 
   resource_counter--;           // think of this section 
   if(resource_counter == 0)     // as read 'exit' critical section 
      signal(write_phase);       // last reader out wakes a waiting writer 
} //unlock 
// Writer 
lock(counter_mutex){ 
   while(resource_counter != 0)           // similarly, think of this section 
      wait(counter_mutex, write_phase);   // as write 'entry' critical section 
   resource_counter = -1; 
} //unlock 
// writing happens here, no need to keep mutex locked, because we control it at entry/exit points. 
// lets call this section the write critical section 
lock(counter_mutex){ 
   resource_counter = 0;      // think of this section 
   broadcast(read_phase);     // write 'exit' critical section 
   signal(write_phase); 
} //unlock 
###   Critical Section structure with proxy(condition) variable 
lets generalize what we saw in the above readers/writer example. 
// Entry critical section 
lock(mutex){ 
   while(!predicate)   // predicate for access 
      wait(mutex, cond_var); 
   update predicate; 
} // unlock 

# perform critical operation (read/write shared file) 

// Exit critical section 
lock(mutex){ 
   update predicate; 
   signal/broadcast(cond_var); 
} // unlock 

===> with this condition variable, we can now facilitate more complex mutex control, as seen in R/W example. 
##  Common Pitfalls 
- keep track of mutex/cond_var used with a resource 
  e.g. mutex_type m_f1;  // mutex for file1 
- always correctly use lock & unlock 
  e.g. are you ensuring lock & unlock happen in all cases?  // some compilers can help check. 
- use a single mutex to access a single resource ! 
  e.g. do NOT use multiple mutexes for one same resource. // it will likely mess up the resource. 
- ensure you signal correct cond var. 
- distinguish btwn signal vs broadcast. 
- be aware that signal does not guarantee order in which thread gets notified. you must control thread priority somehow. 
- spurious wake up 
- deadlock 
##  Spurious(unnecessary) Wake Up 
- does not affect correctness, but affects performance. 
----------------------- // readers entry crit sec code 
   while(resource_counter == -1) 
      wait(counter_mutex, read_phase); 

------------------------ // writer exit critical section code 
   resource_counter = 0; 
   broadcast(read_phase); 
   signal(write_phase); 

====> now, suppose broadcast(read_phase) is called: wait() in the reader code wakes up, but when it tries to re-acquire the mutex, the writer exit code still holds it (it has yet to run signal() and unlock), so the reader really still has to wait before the reader entry code can re-acquire the mutex and resume. 
====> causing unnecessary context switches: to the reader entry code, then back to the writer exit code. 
====> spurious wake up. 
====> in this case, we can safely (and more correctly) re-write the writer exit code as below, to avoid spurious wakeup. 
   resource_counter = 0; 
   unlock(counter_mutex); 
   broadcast(read_phase); 
   signal(write_phase); 

=====> but note, we cannot always unlock before signal/broadcast 
=====> for example, in the below code we cannot simply take signal() out of the crit section, since the predicate check must happen under the lock. 
------------------------ // readers exit critical section code 
   if(resource_counter == 0) 
      signal(write_phase); 
##  deadlock 
two or more competing threads are waiting on each other to complete, but none of them ever do. 
suppose both thread A & B want to access shared resources X & Y. 
thread A locks mutex_x, then mutex_y 
thread B locks mutex_y, then mutex_x 
===> deadlock. 
===> one way is to never hold both at once: unlock the first mutex before locking the second (aka fine-grained locking) 
===> but this does NOT work when you need to lock both X & Y at the same time 
===> have one mega mutex for both X & Y  (works ok but may be too restrictive..) 
===> always maintain lock order.  (this is good, logically prevents a wait cycle) 
    e.g. always lock mutex_x, then mutex_y 
[in summary] 
a cycle in the wait graph is necessary and sufficient for a deadlock to occur. 
(edges from thread waiting on a resource to thread owning a resource) 
- check at every lock request, if it causes any deadlock : very expensive 
- detect a deadlock and recover : rollback may not be possible if the thread depends on external input & output 
- ostrich algorithm : do nothing 
(good quiz) 
##  Kernel VS User-level threads 
- threads can exist in both kernel and user levels. 
kernel level threads are managed by OS/kernel level scheduler/components, etc 
and some of those threads interface with user level threads. 
(1) one-to-one model 
1-to-1 connection btwn user level thread and kernel level thread. 
- OS sees/understands threads, synchronization, blocking. 
- must goto OS for all operations (may be expensive) 
- OS may have limits on policies, thread #, etc  (portability issue, because limited by OS level features) 
(2)   many-to-one model 
many user level threads mapped to one kernel level thread. 
==> user level thread mgmt library decides which user level thread gets to comm with kernel level thread at any given point of time. 
- portability: everything is done at user level thread mgmt library, scheduling, synchronization, etc 
               does not depend on OS limits/policies. 
- OS has no insights into app needs. 
- OS may block entire process if one user-level thread blocks on I/O.   # performance degradation 
(3)  many-to-many model 
- best of both models. 
- can have bound or unbound threads. (1-to-1 mapped threads are 'bound', often for special purpose, priority, etc) 
- requires coordination btwn user and kernel level thread mgmt 

#  Scope of Multi-threading 

System Scope : system-wide thread mgmt by OS-level thread managers (e.g. CPU scheduler) 
Process Scope: process-wide user level library manages threads within a single process 
===> to suitably allocate kernel level threads across procs, each of which may have a different number of user level threads within, we need good mgmt policies/implementations, etc. 

#  multi-threading patterns 

(1) boss-workers 
- boss: assigns work to workers 
- workers: performs assigned tasks 
==> throughput of the system limited by boss thread efficiency. 
==> throughput = 1/boss_time_per_order 
how does the boss assign ? 
(a) directly signal specific worker. 
  [pros] workers dont need to synchronize 
  [cons] boss needs to keep track of each worker -> boss busy -> throughput degradation 
(b) place work in a queue (where boss puts work in the queue, workers take them) 
  [pros] boss doesnt need to know about workers details 
  [cons] producer/consumer queue synchronization   # lock while adding/taking tasks out of the queue, which is the shared resource. 
==> overall, considered better than (a) 
how many workers? 
- on demand          # simple, but inefficient: takes time to create a new worker(thread) every time. 
- pool of workers    # better throughput, at the cost of pool mgmt overhead 
-- static vs dynamic # dynamically adjust the pool size. 
- simplicity  (one thread assigning tasks, all workers do the same work, in a divided way.) 
- overhead on thread pool mgmt 
- ignores locality (boss blindly puts work into the queue, so no regard as to which worker may be efficient at certain work, etc.) 
# boss-worker model variant 
- workers specialized for certain tasks.  (as opposed to all workers created equal) 
- better locality, QoS mgmt. 
- load balancing. (boss needs to know better, about each worker specialty. this overhead offset by the pros) 
(2) pipeline 
- threads assigned to each subtask(stage) in the system 
- the entire task is a pipeline of threads 
- multiple tasks concurrently in the system, in different pipeline stages. 
===> the bottleneck becomes the weakest-performing stage in the whole pipeline. 
     throughput == weakest link   # but you can add more threads into the weakest stage. 
===> pipeline stage == thread pool 
- shared buffer based communication btwn stages 
- specialization & locality 
- overhead balancing and synchronization of the pipeline. 
(3) layered 
- each layer = a group of related subtasks(stages) 
- end-to-end tasks must pass up and down through all layers. 
- specialization 
- less fine-grained than pipeline. 
- not suitable for all applications 
- synchronization is hard 
(good quiz) 
#####     PThread (POSIX thread) API    ##### 
POSIX : portable OS interface 
Birrel's mechanisms:             |    Pthread 
- Thread (data type/structure)   |   pthread_t th_foo;   // thread type 
- Fork(proc,args)                |   int pthread_create(pthread_t *th, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg); 
- Join(Thread)                   |   int pthread_join(pthread_t thread, void **status);   // status variable will capture info 

#  pthread attributes 

- lets you specify features : stack size, scheduling policy, priority, inheritance, system/process scope, joinable, etc 
- as arg in pthread_create() 
- has default behavior with NULL 
int pthread_attr_init(pthread_attr_t *attr); 
int pthread_attr_destroy(pthread_attr_t *attr); 
detachability (joinability) : by default, pthread is joinable 
parent thread creates & joins child thread. 
what happens when parent thread exits before child thread completes? -> zombie thread. 
pthread allows parent thread to "detach" child thread. 
==> then threads really become equal !! (except parent holds some info regarding child thread.) 
==> parent thread can die and child thread can still independently operate. 
int pthread_detach(pthread_t th_child); 
// or you can specify a thread to be detached at creation time. 
pthread_attr_init(&attr); 
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); 
pthread_create(..., &attr, ...); 
void pthread_exit(void *status);   // main can call this to exit without killing detached children 
------------------------------------// pthread_attr_example.c 
#include <stdio.h> 
#include <pthread.h> 
void *foo (void *arg){  // thread func 
   printf("Foobar!\n");   // illustrative body 
   return NULL; 
} 
int main (void){ 
   pthread_t tid; 
   pthread_attr_t attr; 
   pthread_attr_init(&attr);  // required 
   pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); 
   pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM); 
   pthread_create(&tid, &attr, foo, NULL); 
   pthread_exit(NULL);  // exit main thread only; the detached thread may still run 
} 

- compile source code with -lpthread (or -pthread) 
e.g.  $ gcc -o main.exe main.c -pthread 
- check return values of common functions 
----------------------------------// pthread_creation_example.c 
#include <stdio.h> 
#include <pthread.h> 
#define NUM_THREADS 4 
void *hello (void *arg){  // thread func 
   int *p = (int*)arg; 
   printf("hello thread %d \n", *p); 
   return 0; 
} 
int main (void){ 
   int i; 
   pthread_t tid[NUM_THREADS]; 
   for (i = 0; i < NUM_THREADS; i++) { 
      pthread_create(&tid[i], NULL, hello, &i);    // we dont know when each thread reads i 
   } 
   for (i = 0; i < NUM_THREADS; i++) { 
      pthread_join(tid[i], NULL); 
   } 
   return 0; 
} 

===> possible output 
- 0 1 2 3 
- 0 2 1 3 
- 0 2 2 3  # since all threads receive a pointer to the same variable i in main, i may get incremented before "1" gets printed. 
           # called a "data race" or "race condition" : one thread tries to read a var while another thread is modifying it. 
           # in this case, you can give each thread its own variable (like an array element) to solve this issue. 
(good quiz) 
(good quiz) 

#  pthread mutex  : to solve mutual exclusion problem among concurrent threads 

Birrell's mechanisms:                     |     Pthread 
- mutex                                   |  pthread_mutex_t m;  // mutex type 
- lock(mutex){...critical section...;}    |  int pthread_mutex_lock(pthread_mutex_t *m);    // lock 
                                          |   ... critical section ... 
                                          |  int pthread_mutex_unlock(pthread_mutex_t *m);  // unlock 
list<int> my_list; 
pthread_mutex_t m; 
void safe_insert(int i){ 
   pthread_mutex_lock(&m); 
   my_list.insert(i); 
   pthread_mutex_unlock(&m); 
} 

# mutex initialization (a must) 
int pthread_mutex_init(pthread_mutex_t *m, const pthread_mutexattr_t *attr); 
==> attr specifies behavior when a mutex is shared among processes 
# mutex trylock 
int pthread_mutex_trylock(pthread_mutex_t *m);  // unlike lock() which waits till locking, trylock() returns immediately, 
                                                // only locks if it was not already locked. 
# mutex destroy 
int pthread_mutex_destroy(pthread_mutex_t *m); 
NOTE:  there are many others 

#  dire tips 

- shared data should always be accessed thru a single mutex 
- mutex scope must be visible to all 
- globally, for all threads, lock mutexes in order 
- dont forget to unlock 
##  condition variable 
Birrell's mechanisms:     |        pthread 
condition var             |  pthread_cond_t c;    // cond var type 
wait                      |  int pthread_cond_wait(pthread_cond_t *c, pthread_mutex_t *m); 
signal                    |  int pthread_cond_signal(pthread_cond_t *c); 
broadcast                 |  int pthread_cond_broadcast(pthread_cond_t *c); 
# pthread cond var init & destroy 
int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);  // attr lets you specify stuff, NULL means default behavior 
int pthread_cond_destroy(pthread_cond_t *c); 
NOTE: do not forget to notify waiting threads. 
      when predicate change ==> signal/broadcast correct cond var. 
      when in doubt, broadcast   # but performance loss 
NOTE: recall how we called signal/broadcast AFTER unlock. keep that aspect in mind. 
##  producer/consumer Pthread implementation example 
-------------------------------------------------------------------// procon_pthread.c 
// global scope 
#include <stdio.h> 
#include <stdlib.h> 
#include <pthread.h> 
#define BUF_SIZE 3 /* Size of shared buffer */ 
int buffer[BUF_SIZE];   /* shared buffer */ 
int add = 0;   /* place to add next element */ 
int rem = 0;       /* place to remove next element */ 
int num = 0;        /* number elements in buffer */ 
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;    /* mutex lock for buffer */ 
pthread_cond_t c_cons = PTHREAD_COND_INITIALIZER; /* consumer waits on this cond var */ 
pthread_cond_t c_prod = PTHREAD_COND_INITIALIZER; /* producer waits on this cond var */ 
void *producer (void *param); 
void *consumer (void *param); 
int main(int argc, char *argv[]) { 
   pthread_t tid1, tid2;  /* thread identifiers */ 
   /* create the threads; may be any number, in general */ 
   if(pthread_create(&tid1, NULL, producer, NULL) != 0) 
      fprintf(stderr, "Unable to create producer thread\n"); 
   if(pthread_create(&tid2, NULL, consumer, NULL) != 0) 
      fprintf(stderr, "Unable to create consumer thread\n"); 
   /* wait for created threads to exit */ 
   pthread_join(tid1, NULL); 
   pthread_join(tid2, NULL); 
   printf("Parent quitting\n"); 
   return 0; 
} 

/* Produce value(s) */ 
void *producer(void *param) { 
   int i; 
   for (i=1; i<=20; i++) {  /* Insert into buffer */ 
      pthread_mutex_lock (&m); 
      if (num > BUF_SIZE) 
         exit(1);  /* overflow */ 
      while (num == BUF_SIZE)   /* block if buffer is full */ 
         pthread_cond_wait (&c_prod, &m); 
      /* if executing here, buffer not full so add element */ 
      buffer[add] = i; 
      add = (add+1) % BUF_SIZE; 
      num++; 
      pthread_mutex_unlock (&m); 
      pthread_cond_signal (&c_cons); 
      printf ("producer: inserted %d\n", i); 
      fflush (stdout); 
   } 
   printf("producer quitting\n"); 
   return 0; 
} 

/* Consume value(s); Note the consumer never terminates */ 
void *consumer(void *param) { 
   int i; 
   while(1) { 
      pthread_mutex_lock (&m); 
      if (num < 0) 
         exit(1);  /* underflow */ 
      while (num == 0)   /* block if buffer empty */ 
         pthread_cond_wait (&c_cons, &m); 
      /* if executing here, buffer not empty so remove element */ 
      i = buffer[rem]; 
      rem = (rem+1) % BUF_SIZE; 
      num--; 
      pthread_mutex_unlock (&m); 
      pthread_cond_signal (&c_prod); 
      printf ("Consume value %d\n", i);  fflush(stdout); 
   } 
   return 0; 
} 

#####      Threads Design Consideration    #### 
recall user level VS kernel level threading discussion. we will go deeper. 
two seminal papers. 

#  thread related data structure 

- recall PCB : virtual address mapping, stack, registers, etc 
==> we can split PCB and assign some of its features to KLT (kernel level thread) 
==> so we have 
- CPU : holds pointers to current and other threads 
- KLT : stack, registers, etc 
- PCB : virtual addr mapping, etc 
- ULT (= user level thread) lib : UL thread ID, UL register, UL stack, etc 
from ULT lib perspectives, each KLT is like a virtual CPU. 
(see video for visual) 

#  Hard & Light process states 

- when 2 KLTs belong to the same addr space (= 1 proc, which may have multiple ULTs), a context switch between the 2 KLTs changes the PCB. 
- the two PCBs share the same virtual addr mapping, but each KLT has its own distinct set of signals and sys call info. 
- share-able stuff goes in the "hard" process state PCB. (one for all KLTs) 
- non-share-able stuff goes in the "light" process state PCB. (one per KLT) 

#  rationale for multiple data structures 

[ single PCB ] 
-- large contiguous data structure 
-- private for each entity (KLT) 
-- saved and restored on every context switch 
-- update for any change within the PCB 
===> NO scalability, overheads, BAD performance, NO flexibility 
[ multiple data structures ]   # like hard & light proc states 
-- smaller data structures 
-- easier to share 
-- on context switch, only save/restore what needs to change 
-- ULT lib need only update portion of the state 
===> improved scalability, reduced overheads, better performance, more flexibility 

#  ULT in Sun OS 5.0, (Solaris 2.0) 

- not Pthread, but similar 
- thread_creation() returns tid 
-- tid : index into table of pointers each pointing to thread data structure (exec context, reg, signal mask, priority, stack ptr, thread local storage, stack ) 
- stack growth can be dangerous : as a solution "red zone" is defined within the data structure, if reached, it detects it as error. 

#  KLT in Sun OS 5.0, (Solaris 2.0) 

Proc level data structure   #  Hard 
-- list of KLTs 
-- virtual addr map 
-- user credentials 
-- signal handlers 
LWP : light weight process   # Light  (like sub-subset of proc info) 
- user level reg 
- sys call args 
- resource usage info 
- signal mask 
==> similar to ULT, but visible to kernel. not needed when process not running. swappable. 
KLT data structure 
- KL reg 
- stack ptr 
- scheduling info (class, etc) 
- ptr to associated LWP, proc, CPU structure 
===> info needed even when proc not running. NOT swappable. 
CPU data structure 
- current thread 
- list of KLT 
- dispatching & interrupt handling info 
(on SPARC, there are many extra regs, each dedicated for current thread) 
===> Proc contains user data, and points to virtual addr space, points to a list of KLT, which each points to swappable LWP & stack. 
===> KLT has ptr to CPU 
(see visual) 

#  Basic Thread Mgmt Interaction 

- suppose 1 proc with 4 ULTs, doing some work, including IO work, with 2 KLTs. 
  (in general) 
- ULT does not know about KLT 
- KLT does not know about ULT 
===> needs a way for them to communicate/interact/coordinate : Signals, System Calls ! 
ULT lib sees 
- ULTs 
- available KLTs 
Kernel sees 
- KLTs 
- CPUs 
- KL scheduler 
many-to-many model 
- UL scheduling decisions 
- change ULT-KLT mapping 
===> kernel can NOT see mutex variable, wait queues within ULTs 
PROBLEM: visibility of state and decisions btwn kernel and UL lib. 
How/When does UL lib run ? 
=> Process jumps to ULT lib scheduler when: 
-- ULTs explicitly yield 
-- timer set by UL lib expires 
-- ULTs call lib functions like lock/unlock 
-- blocked threads become runnable 
-- signaled from timer or kernel 
## multiple CPUs : things get more complicated. 
suppose a proc with 3 ULTs, and two CPUs with KLT on each CPU. 
ULT priority : t3 > t2 > t1 
suppose t2 runs on k2, t1 on k1. t2 holds a mutex which t3 is waiting on. 
now t2 unlocks the mutex, so t3 becomes runnable, while t2 keeps running. 
at this point, the ULT lib somehow needs to notify k1's KLT that a context switch needs to happen to place t3 on k1 
(aka preempt t1). this can be complex. 

# Synchronization related issues 

suppose a proc with 4 ULTs, and two CPUs each with a KLT. 
t1 on k1, t4 on k2. 
suppose t4 waits on mutex locked by t1, now k2 may want to context-switch while t4 is not runnable. 
BUT, if t1's critical section is short, by the time k2 is context-switching, t1 will unlock the mutex. 
thus it makes sense to let t4 spin on k2 (not detach it). 
if t1's critical section is longer, then the default blocking behavior is fine. 
====> called an "adaptive mutex" : it adapts to the expected duration of the critical section of the thread currently holding the mutex. 

#  destroying threads 

- instead of destroying, reuse threads 
- when a thread exits: 
-- put on a 'death row' 
-- periodically destroyed by 'reaper thread' 
-- otherwise thread structures/stacks are reused ==> performance gains !! 
(Linux thread quiz) 
###   interrupts  VS  signals 
[interrupts] 
- events generated externally by components other than the CPU 
e.g. IO devices, timers, other CPUs 
- determined based on the physical platform (i.e. what IO devices are available, etc) 
- appear asynchronously (i.e. not directly related to actions taking place on the CPU) 
[signals] 
- events triggered by the CPU &| software running on the CPU 
- determined based on the OS 
- appear synchronously or asynchronously (sync means signals triggered in direct response to some action on the CPU, e.g. trying to access mem that has yet to be allocated causes SIGSEGV #11) 
- both have a unique ID depending on the HW(interrupts) or OS(signals) 
- can be "masked" (disabled/suspended via corresponding mask) 
-- interrupt mask : per-CPU 
-- signal mask    : per-process 
- if enabled, trigger corresponding handler 
-- interrupt handler set for entire system by OS 
-- signal handler set on per process basis by process 
##  interrupt handling 
- interrupt msg # aka MSI (message signaled interrupt) msg # sent to the CPU 
- based on MSI #, CPU refers to 'interrupt handler table' and executes the corresponding function 
[ interrupt handler table ] 
 INT-0      | handler_0_start_addr 
 INT-1      | handler_1_start_addr 
 INT-2      | handler_2_start_addr 
 ..         | .. 
 ..         | .. 
 INT-N      | handler_N_start_addr 
(HW defined)  (OS specific)            # note this 
##  signal handling 
[signal handler table] 
 SIG-0      | handler_0_start_addr 
 SIG-1      | handler_1_start_addr 
 SIG-2      | handler_2_start_addr 
 ..         | .. 
 ..         | .. 
 SIG-N      | handler_N_start_addr 
(OS defined)  (process specific)       # note this 
- OS normally has a set of 'default' actions/handling 
e.g. if SIG #N, terminate, ignore, stop, continue, etc 
- we may install process-level handler (for some signals) 
e.g. signal(), sigaction() 
- some signals cannot be caught (e.g. SIGKILL). 
signal examples: 
- SIGSEGV (access to protected mem)  # sync 
- SIGFPE  (divide by zero)           # sync 
- SIGKILL (kill, id)  # sync, can be directed to a specific thread 
- SIGKILL (kill)      # async 
##  why disable interrupts or signals ? 
e.g. some executing thread, with PC (pointing to some instruction in thread), and stack pointer (SP, pointing to thread stack) 
     upon interrupt, PC points to first instruction in handler 
===> what if handler has lock(mutex) when the very interrupted thread already locked the same mutex? deadlock ! 
     1. keep handler code simple : dont use lock(mutex)  # too restrictive 
     2. control interruptions of handler code : use interrupt/signal masks 
        (masks let a thread dynamically enable/disable specific interrupts/signals) 
        (thread holds a sequence of mask bits, 0 or 1, indicating which interrupt/signal is enabled) 
so, if an interrupt/signal msg is sent while it is disabled, the interrupt/signal stays 'pending'. 
    once (the mask for) the interrupt/signal is enabled, the interrupt/signal handler gets executed. 
note: if the same interrupt/signal is sent multiple times during the disabled period, it does not queue up; after being enabled, the handler executes only once. if you need each occurrence handled, that must be dealt with differently. 
##  more on masks 
interrupt masks : per CPU 
- if mask disables an interrupt ==> HW's interrupt routing mechanism will not deliver the interrupt to that CPU 
signal masks : per process (= per exec context = ULT on top of KLT) 
- if mask disables a signal ==> kernel sees the mask and will not interrupt the corresponding thread 

# interrupts on multi-core/CPU systems 

suppose multiple cores/CPUs, 
- interrupts can be directed to any particular core/CPU that has the interrupt enabled. 
- OR, in reality, we may set interrupt on just a single core/CPU dedicated to interrupt handling. 
==> this lets us avoid overheads/instability on all other cores. better performance. 
##  type of signals 
[one shot signals] 
-- N signals pending == 1 signal pending : at least once 
-- must be explicitly re-enabled if you want to ensure the signal handled N times. 
[real time signals] 
-- N signals pending == N signals pending 
read <signal.h>  # google it. 
##  handling interrupts as threads 
recall a scenario where an interrupt handler tries to lock(mutex) when the mutex is already locked by the very interrupted thread. 
==> one way to solve this problem is to treat the interrupt handler as an independent thread, instead of a subroutine on the interrupted thread's stack. 
==> the interrupted thread can then continue until it unlock(mutex), at which point the handler thread can lock(mutex). 
BUT, dynamic thread creation is expensive ! 
Dynamic Decision: 
- if handler does not try lock(m) ==> execute on interrupted thread's stack (aka Top Half : fast, non-blocking) 
- if handler does try lock(m)     ==> turn into real thread (aka Bottom Half : arbitrary complexity, can block) 
(==> adds about 40 SPARC instructions per interrupt on Sun OS.) 
- pre-create & pre-initialize thread structures for interrupt routines. 
(==> saves 12 instructions per mutex. no change in interrupt mask, level.) 
(==> fewer interrupts than mutex lock/unlock operations) 
(==> better performance !!) 
##  threads and signal handling 
signal mask : UL signal masks are not visible to the kernel (and vice versa) 
case 1:  ULT signal mask = 1 
         KLT signal mask = 1 
suppose a signal occurs. the KLT sig mask is 1 (enabled), so no problem: the KLT signals the ULT running on top of it, which then handles the signal (since the ULT sig mask is also 1=enabled). 
case 2:  ULT_1 signal mask = 0  (currently executing) 
         ULT_2 signal mask = 1 
         KLT   signal mask = 1 
since the KLT sig mask is 1, the kernel thinks the signal can be handled at process level. 
if a signal occurs at the KLT level, it fwds the signal to the ULT layer. 
but the currently executing ULT is not enabled for the signal. 
here the ULT library needs to resolve this, e.g. by having a wrapper for the signal handler table and context-switching within the ULT layer (to run ULT_2, whose mask is enabled). 
case 3:  ULT_1 signal mask = 0   # suppose ULT_1 running on top of KLT_1 
         KLT_1 signal mask = 1 
         ULT_2 signal mask = 1   # suppose ULT_2 running on top of KLT_2 
         KLT_2 signal mask = 1 
suppose a signal occurs to KLT_1 which fwds the signal to ULT_1 via ULT lib, but ULT lib sees ULT_1 mask is disabled. 
then ULT lib sees ULT_2 is enabled and it is running on KLT_2, so fwds the signal (called "directed signal") to KLT_2, which would then fwd the signal to ULT_2. 
case 4:  ULT_1,2 sig mask = 0    # suppose ULT_N runs on KLT_N 
         KLT_1,2 sig mask = 1 
suppose a signal occurs at KLT_1, then it fwds to ULT layer. 
ULT lib handling routine looks at ULT_1 and other ULT's masks, and confirms all disabled. 
so the ULT lib responds by disabling KLT_1's mask. 
then, the ULT lib re-issues the signal to the other ULTs, which will cause their corresponding KLT_N masks to be disabled in turn. 
later, when ULT_2 finishes its task and is ready to re-enable its own mask, the ULT lib, which knows it disabled all the KLT_N masks, makes a sys call on KLT_2 to re-enable its mask. 
==> Optimize common case !! 
- signals : less frequent than signal mask updates 
- sys calls avoided : cheaper to update UL mask 
- signal handling more expensive : but as above, we made it less frequent than sig mask updates. 
##  thread support in linux 
task (linux lingo) : main execution context = kernel level thread 
single threaded process : 1 task 
multi threaded process  : many tasks 
"task struct" in linux 
struct task_struct { 
  pid_t pid;    // task ID (a task id, not a process id, so the name is misleading) 
  pid_t tgid;   // task group ID = pid of the first task in the process 
  // lots of ptrs to resources, e.g. mem, files, other tasks, etc. 
}; 
task creation: clone(function, stack_ptr, sharing_flags, args);    # like fork(), but with control over what is shared 
                                                                   # fork() itself is implemented via clone() with no sharing flags 
(see video for sharing flags) 
linux thread model : NPTL = native POSIX thread library 
==> 1:1 model. i.e. one KLT mapping to one ULT 
   (used to be many to many, but suffered from the complexity issues we discussed earlier) 
==> kernel sees each ULT info (because of 1:1) 
    (this is made possible for 2 reasons) 
   1. kernel traps are cheaper 
   2. more resources : mem, large range of IDs 
####     Thread Performance Consideration      #### 
performance comparison: multi-thread VS multi-proc VS event-driven 
===> well discussed, in the context of web server/application, in "Flash: an efficient and portable web server" in contrast with Apache 
recall 2 thread models : boss-worker VS pipeline 
=> we only looked at the total "completion" time. 
=> but if you look at the "average execution" time, i.e. sum up how long each task took to complete and divide by the number of tasks, one model may be better. 
===> depends on the performance "metrics" (as well as "workload") 
metrics: relevant property to evaluate.  aka  a measurement standard. 
-- execution time 
-- throughput 
-- response time (=request time) 
-- utilization 
-- wait time            # how long before it starts 
-- platform efficiency  # combination of throughput & utilization 
-- percentage of SLA violation 
-- performance per $ 
-- client-perceived performance 
-- many more.. 
=> metrics should be  measurable and/or quantifiable property of the system. 
 e.g. we may use metrics to compare performance between diff algorithms for a certain problem. 
for a matrix multiplication, we may care about "execution time" 
for a web service, we may care about "number of client requests handled per unit time" OR "response time" but ave/max/min/95percentile ? 
for a hardware chip, we may care about "higher utilization" e.g. (CPU) 
of course, sometimes you cannot collect metrics in the real-world environment, so you set up an experiment/test setup that is sufficiently representative of the target environment, plus maybe some simulation. 
==> overall, we call this a "testbed" 
in summary, the usefulness of concurrency designs and principles (multi-thread VS multi-proc VS event-driven) depends on 
- metrics 
- workload 
##  multi-proc VS multi-thread VS event-driven 
a simple web server 
- 1. client/browser send request 
- 2. web server accepts request 
- 3. server processing steps 
- 4. respond by sending file 
multi-proc  : [pros] simple programming 
              [cons] high mem usage, costly context switch, difficulty in shared resource maintenance. 
multi-thread: [pros] shared addr space, shared state, cheaper context switch (thus less mem usage) 
              [cons] requires synchronization(more complex programming), underlying OS support 
##  event-driven model (EDM) 
- single addr space 
- single process 
- single thread of control 
"event dispatcher" 
- is in a loop, dealing with incoming events, and invoke one or more registered handlers (just jumping to the code). 
- is a state machine 
Events: e.g. 
- request received 
- send completed 
- disk read completed 
handlers: 
- run to completion 
- if needing to block, initiate the blocking operation and pass control back to the dispatch loop 
==> How does EDM achieve concurrency ? 
MP/MT : one request per exec context (proc/thread) = cpu 
EDM   : many requests interleaved in an exec context. (kind of like internal context switch within a context) 
==> why does EDM work ? 
on one CPU, threads hide latency if (t_idle > 2 * t_ctx_switch) 
if (t_idle <= 2 * t_ctx_switch) then switching would waste cycles. 
==> but EDM can process request until any wait happens then switch to another req, like internal switch. 
==> in case of multiple CPUs, each CPU can have an EDM, thus multiple EDM processes. 
==> how is EDM implemented ? 
file descriptors(FD) 
- sockets : network 
- files   : disk 
event = input on FD 
how to get input on FDs. (OS APIs) 
- select()   # monitors a set of FDs, returns when any has input 
- poll()     # similar; scans thru the supplied list of FDs on each call 
- epoll()    # linux; avoids rescanning all FDs, most efficient at scale 

# Benefit of EDM 

- single addr space, single flow of control 
- smaller mem requirement 
- no context switching 
- no synchronization 
==> switching between diff FDs within an EDM proc has some cost ('loss of locality', 'cache pollution'), but in general it is cheaper than a full context switch. 
## helper threads/processes 
EDM challenge : a blocking req/handler will block the entire EDM proc/dispatcher 
==> solution : asynchronous IO operations 
Asynchronous SysCall 
- proc/thread makes sys call. 
- OS obtains all relevant info from stack, and either learns where to return results, or tells caller where to get results later. 
(so caller proc/thread can continue) 
- requires support from kernel (e.g. threads) and/or device (e.g. DMA) 
a helper : 
- a thread or a proc designated for blocking IO ops only 
- pipe/socket based comm with event dispatcher (e.g. select(), poll() etc) 
- helper blocks but main dispatcher will NOT ! 
===> aka asymmetric multi-proc/thread event driven model (AMPED/AMTED) 
NOTE: helper threads/procs are created only for concurrent blocking IO ops. thus efficiency of EDM is retained ! 
-  resolves the portability limitations of basic EDM 
-  less mem footprint than regular worker models 
   (because each helper only deals with IO blocking, nothing else, thus less mem for each thread/proc) 
- not applicable to all classes of services (effective in web servers) 
(good quiz) 
##  Flash : EDM web server 
- an EDM web server with AMPED (asymmetric helper procs) 
  | | | | | | 
Event Dispatcher 
  | | | | | | 
- helpers used for disk reads 
- pipes used for comm with dispatcher 
- helper reads file in mem (via mmap) 
- dispatcher checks (via mincore) if pages are in mem, to decide "local" handler or helper. 
Additional Optimizations in Flash: 
- application-level caching (both data and computation=handler) 
- alignment for DMA | use of DMA with scatter-gather vector IO operations 
##  Apache web server 
core    : basic server skeleton 
modules : per functionality 
flow of control : similar to EDM 
- uses combination of MP + MT. 
-- each proc == boss/worker with dynamic thread pool (i.e. number of threads is dynamically adjusted) 
-- number of procs can be dynamically adjusted too. 
##  Experiment methodology (for flash) 
(1) define comparison points : what systems are you comparing? 
-- MP (each proc has single thread) 
-- MT (boss worker model) 
-- single proc event driven (SPED) 
-- Zeus (SPED with 2 procs) 
-- Apache (various versions) 
==> all compared against Flash (AMPED model) 
(2) define inputs            : what workloads will be used ? 
-- CS web server trace 
-- owlnet trace 
==> controlled, reproducible workloads (from real world web servers) 
-- synthetic workload 
==> simulating realistic request workload (distrib of web page access reqs over time) 
(3) define metrics           : how will you measure performance? 
-- bandwidth == bytes/time 
==> total bytes transferred from files/total time 
-- connection rate == requests/time 
==> total client conn / total time 
evaluate both as a function of file size 
- larger file size => amortize per connection cost => higher bandwidth 
                   => more work per connection => lower connection rate 
[RESULTS] video 

# summary 

(1) when data is in cache : 
- SPED >> AMPED Flash 
  (unnecessary test for mem presence) 
- SPED & AMPED Flash >> MT/MP 
  (sync & context switching overhead) 
(2) with disk-bound workload (i.e. data cannot fit into cache) 
- AMPED Flash >> SPED 
  (blocks because no async I/O) 
- AMPED Flash >> MT/MP 
  (more mem efficient and less context switching) 
(good quiz) 
##  advice on designing experiments 
some sensible discussions: 
- apples to apples comparison 
