computational photography

Georgia Tech, Computational Photography. cs6475 notes 
 
Computer Vision textbook PDF : http://szeliski.org/Book/drafts/SzeliskiBook_20100903_draft.pdf 
 

# prerequisite: linear algebra (you can assume Euclidean space unless otherwise specified) 

- cartesian coordinate systems : points 
- vector : magnitude & direction. scalar multiplication. 
- normalization : process of finding a unit vector (a vector of magnitude 1) in the same direction as a given vector. (obviously cannot normalize zero vector as it has no direction) 
   (example) : https://www.youtube.com/watch?v=s4L4S2ue5bA 
- inner products of vectors : Cauchy-Schwarz Inequality 
   (video) : https://www.youtube.com/watch?v=zgWi7AYnHxo 
- parallel and orthogonal vectors 
-- two vectors are parallel if one is a scalar multiple of the other. 
-- two vectors v & w are orthogonal if their dot product v·w = 0 
--- zero vector is parallel & orthogonal to any vector. 
--- zero vector is the only vector orthogonal to itself. 
(video) : https://www.youtube.com/watch?v=4Xe0PKH3F0E 
 
 
######################### 
####  (1)  intro    ##### 
######################### 
 
computational and technical aspects of photography : how light is captured to generate images. 
 
# environment 
- openCV (either python or C++) 
- matlab/octave 
 
## 
##  what is photography ? 
## 
 
https://en.wikipedia.org/wiki/Photography 
 
"Photography is the science, art and practice of creating durable images by recording light or other electromagnetic radiation, either electronically by means of an image sensor, or chemically by means of a light-sensitive material such as photographic film." 
 
===> storing natural lights into digital(or chemical) images. 
 
## 
##  what is computational photography ? 
## 
 
- a discipline of studying how computing impacts photography. 
-- digital sensors 
-- modern optics 
-- actuators 
-- smart lights 
 
## 
##  limitations of traditional film cameras 
## 
- chemicals, darkroom 
- one roll can contain only up to 36 pics. 
==> cannot take many, and no ability to instantly view pics you took, and films are sensitive. 
 
## 
##  comp photography enables: 
## 
- unbounded dynamic range (HDR) 
- variable 
-- focus 
-- depth of field 
-- resolution 
-- lighting 
-- reflectance 
 
## 
##  elements of comp photography 
## 
given a 3D scene, 
(1) illumination 
(2) optics/aperture 
(3) sensor 
(4) processing 
(5) display 
(6) user 
 
==> convert rays of light into "pixels" 
==> computation can control all steps.  we will study how more deeply, in the following lectures. 
 
 
 
################################### 
####     Dual Photography      ####  a comp photography example 
################################### 
 
= the process of measuring the light transport to generate a dual image. 
(video) : https://www.youtube.com/watch?v=sAunygIq_g0 
 
suppose a projector illuminates an object, and a camera captures an image, which we call the primal image. using the measured light transport, we can computationally exchange the roles of the camera and the projector and create a dual image, which is an image from the point of view of the illuminator. 
 
recall the 6 elems of cp we looked at before: this example touches everything except (3) sensor, which can be included as well. 
 

# novel illumination: 

- of course, your target 3D scene is already illuminated with natural light. 
- but you can use an additional controllable light source (e.g. a projector) plus a controllable aperture (like a modulator/filter that controls where to let light pass thru) 
--> you can have computer algorithm decide how to control this added illumination. 
 

# novel camera (optics/sensor/processing/display) 

for the optics/sensor part, you may have an aperture/filter that controls what light to take in and relates it back to which light source is illuminating it, hence a further understanding of how the illumination changes the resulting image. 
 
===> by controlling the aperture on both ends, we can do more stuff on image. 
 
## 
##  reflective property of rays of light 
## 
- reflection of light depends on the kind of surface. 
-- specular (e.g. mirror) 
-- diffuse (e.g. matte) 
==> thus, depending on the surface, the light can get to the sensor in different ways. 
==> the question : can we control it?  can we observe the controlled change ? 
 
 
###################### 
####   Panorama   ####  another comp photography example 
###################### 
 
recall 
given a 3D scene, 
(1) illumination 
(2) optics/aperture 
(3) sensor 
(4) processing 
(5) display 
(6) user 
 
in terms of the 6 elems of cp, panorama will mostly deal with (2)-(6) 
 

# Panorama steps 

(1) taking pics 
(2) detection and matching (find the overlap btwn two pics so we can stitch them together) 
(3) warping (aligning the pics on top of each other) 
(4) fade/blend/cut 
- because lighting/exposure may be diff btwn the two pics: what percentage of a pixel do we take from one and from the other? or do we take 100% of a pixel from one and none from the other? we need a good algorithm to decide. 
(5) cropping (optional) 
 
==> we will revisit full technical details later. 
 
 
 
####################################################### 
####   computational photography as a discipline   #### 
####################################################### 
 

# Camera Process 

 
- lens             # generalized optics 
- sensor/detector  # generalized sensor e.g. CCD/CMOS, electronics 
==> then we create a pixel/image 
 

#  photo stats 

(video) https://www.youtube.com/watch?v=CEk74FA_eto 
 
in 2011, roughly 380 billion photos were taken 
in ~200 years of photo history, about 3.5 trillion photos have been taken in total. i.e. roughly 10% of all the photos ever taken were taken just in the last year. 
 
==> computations with photographs are becoming more relevant. 
 

# SLR vs Smartphone Camera 

 
DSLR (digital single lens reflex) 
- more light (great lens) 
- depth of field (zoom, etc) 
- shutter lag 
- control field of view 
- other features (flash, modes, etc) 
 
Smartphone camera 
- computations (takes multiple pics and does fusion ) 
- data (location, etc) 
- programmable (APIs for controlling some of the elements of CP) 
- efficient (no need to download, etc) 
 

#  Film vs Digital cameras 

- film and digital cameras have roughly the same features and controls 
-- zoom & focus 
-- aperture & exposure 
-- shutter release & advance 
-- one shutter press - one snapshot 
 

#  CP extends FP/DP 

 
- with FP/DP we can only USE the following, but CP allows us to CHANGE them 
-- optics, illumination, sensor, movement 
-- exploit wavelength, speed, depth, polarization, etc 
-- probes, actuators, network 
 
- also CP offers better specification and support for 
-- dynamic range 
-- varying focus point-by-point 
-- field of view & resolution 
-- exposure time & frame rate 
-- bursts (taking many pics at once) 
 

#  growing impact on society 

- pics/images are used to record history, analyze crime, etc. 
-- kennedy assassination 
-- september 11 
-- meteor 
-- boston bombing 
 

#  computer vision  vs  computer graphics 

they both work on the same things, but in opposite directions 
 
computer vision  : take 2D images, and infer the 3D world (geometry, shape, photometry) 
computer graphics: generate 2D images from models of the 3D world 
 

#  ultimate camera : human eyes 

- CP ultimately lets us understand human biology more 
 
 
 
 
################################### 
####   (2.1)  Digital Image    #### 
################################### 
 
how to represent an image. 
 
(3) sensor  : generate signals to represent a computable image (i.e. digital image ) 
(4) processing 
(5) display 
 

#  overview 

 
1. digital image : pixels & resolution 
                 : x & y coordinates. width * height = resolution 
 
2.1 discrete (=matrix) e.g. I(i,j) 
2.2 continuous (=function) e.g. I(x,y) 
 
3. grayscale (=black & white) & color 
 
4. digital image format 
 

#  Pixel 

- a picture element that contains the light intensity at some location (i,j) in an image. 
 
I(i,j) = some numeric value 
 
- thus an image can be represented as a matrix. 
- in 8-bit-pixel grayscale images, intensity values range from 0=black to 255=white 
- 1 bit pixel means 0 or 1, black or white. two colors 
- 4 bit pixel means 16 colors. 
 

#  digital image as a function 

 
given a matrix of pixel values, we can extract 
- continuous signal 
- discrete signal 
 
given a matrix of pixel values,  I(x,y) = intensity value of the pixel x,y 
 

#  Sampling and Quantization 

- Sampling(deciding the measurement frequency/interval) 
- Quantization (=rounding to nearest value) 
 

#  image statistics 

- image histogram (distribution graph) 
-- take a region of an image, and draw a graph where x-axis is the intensity bins (grayscale 0 to 255), and y-axis is the occurrence of each grayscale value. 
-- you can do lots of stats analysis (average, median, mode, etc) 
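
(a minimal python/numpy sketch of computing an image histogram and a few stats; the filename is just a placeholder) 

import cv2 
import numpy as np 

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)                    # load as 8-bit grayscale 
hist, bin_edges = np.histogram(img.ravel(), bins=256, range=(0, 256))    # occurrences of each intensity 0..255 
print("mean:", img.mean(), " median:", np.median(img), " mode:", hist.argmax()) 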
 

#  Color digital image 

- each pixel has 3 channels (e.g. blue channel, red channel, green channel) 
- 8 bit each, thus 24 bit in total. 
 
 

#  tools 

- openCV (computer vision) 
-- API for image processing, available in C++/python 
- matlab/octave 
- proce55ing (java based) 
 
 

#  understanding image format 

- order of color channels 
- compression info 
- metadata about photos (EXIF = exchangeable image file format: geo-location info, width/height, pixel format, etc) 
 
 
 
####################################### 
####    (2.2) point processing     #### 
####################################### 
 
PP : pixel-based arithmetic manipulation/computation of images. 
- addition/subtraction/multiplication/division etc 
- alpha-blending 
 
e.g. 
you take two pictures of the same target, say a classroom, one when empty, the other with a teacher. if you subtract one from the other, you get the shape of the teacher. this gives an idea of how such subtraction could work for security camera processing. 
 
question: what do we do if we add/subtract and go out of 0-255 value range for a pixel ? 
- rescale (either before or after PP) 
 

#  alpha-blending 

 
suppose again, you have a photo of an empty classroom, and another photo of the same classroom with a teacher. 
if you multiply each photo's pixel values by 1/2 and add them together, your teacher shows up half transparent while everything else stays 100% visible. 
==> this opacity is represented as "alpha", which ranges from 0 to 1. 
i.e. 0 = fully transparent (invisible) 
     1 = fully opaque (visible) 
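
(a minimal openCV sketch of alpha-blending the two photos described above; the filenames and the 0.5 alpha are just placeholders) 

import cv2 

empty_room = cv2.imread("classroom_empty.jpg")       # assumed filenames; images must be the same size 
with_teacher = cv2.imread("classroom_teacher.jpg") 

alpha = 0.5                                          # opacity of the second image 
blended = cv2.addWeighted(empty_room, 1.0 - alpha, with_teacher, alpha, 0)   # (1-alpha)*a + alpha*b 
cv2.imwrite("blended.jpg", blended) 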
 
 
################################# 
####  (2.3) Blending Modes   #### 
################################# 
 
blending pixels 
e.g. 
given two images, a & b 
f_blend_ave(a,b) = (a+b) / 2      #  0.5 alpha for each 
f_blend_normal(a,b) = b           # just taking the base image 
 

#  common blend modes 

 
- divide : brighten photos 
- addition : too many whites 
- subtract : too many blacks 
- difference : subtract with scaling 
- darken   : f_blend(a,b) = min(a,b) for RGB 
- lighten  : f_blend(a,b) = max(a,b) for RGB 
- multiply : f_blend(a,b) = ab                # darker 
- screen   : f_blend(a,b) = 1 - (1-a)(1-b)    # brighter  # cos you invert both a & b, multiply, and invert again 
- overlay  : f_blend(a,b) = 2ab            if a < 0.5 
                          = 1-2(1-a)(1-b)  otherwise 
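
(a minimal numpy sketch of a few of the blend modes above, assuming both images are float arrays normalized to the 0..1 range; the function names are just illustrative) 

import numpy as np 

def blend_darken(a, b): 
    return np.minimum(a, b) 

def blend_multiply(a, b):               # darker 
    return a * b 

def blend_screen(a, b):                 # brighter: invert both, multiply, invert again 
    return 1.0 - (1.0 - a) * (1.0 - b) 

def blend_overlay(a, b):                # multiply where the base is dark, screen where it is bright 
    return np.where(a < 0.5, 2.0 * a * b, 1.0 - 2.0 * (1.0 - a) * (1.0 - b)) 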
 

#  dodge and burn 

(techniques used in traditional film photography, in a darkroom) 
- dodge : brighten an image  # screen mode 
- burn  : darkens an image   # multiply mode 
 
==> of course there are variants of each. 
 
 
 
############################# 
####   (2.4) smoothing   #### 
############################# 
 
1. smooth an image over a neighborhood of pixels (as opposed to pinpointing one particular pixel) 
2. median filtering as a special non-linear filtering and smoothing approach 
 
smoothing : 
- can be construed as blurring or removing noise 
- commonly done with averaging. e.g. you take 3-by-3 or 5-by-5 neighborhoods. 
-- for edge rows/columns (border pixels), common strategies are to wrap the image around, copy over (replicate) the edge values, etc., and then apply the same neighborhood-averaging strategy. 
 
===> we use the notion of the neighborhood ("kernel") size k, where the window size is (2k+1)-by-(2k+1) 
e.g. 
k = 1, then it is 3-by-3 
k = 2, then it is 5-by-5 
 

#  generalized mathematical representation for neighborhood-averaging smoothing 

 
G(i,j) = 1/(2k+1)^2 * Sum_{u=-k..k} Sum_{v=-k..k} F(i+u, j+v) 
 
==> if you want to assign (aka attribute) a non-uniform weight to each elem in the neighborhood, then instead of the common 1/(2k+1)^2 factor you use h(u,v), which gives the weight for each pixel:  G(i,j) = Sum_{u,v} h(u,v) * F(i+u, j+v) 
==> this whole concept is known as "cross correlation" (to be revisited at a later lecture) 
 
(a must see video) https://www.youtube.com/watch?v=059eE08FgkA 
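
(a minimal openCV sketch of the neighborhood averaging above; note that cv2.filter2D actually computes cross-correlation, which is exactly what we want here. the filename and k are placeholders) 

import cv2 
import numpy as np 

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE) 
k = 2                                                          # neighborhood size -> (2k+1)x(2k+1) window 
h = np.ones((2*k + 1, 2*k + 1), np.float32) / (2*k + 1)**2     # uniform weights h(u,v) = 1/(2k+1)^2 
smoothed = cv2.filter2D(img, -1, h)                            # G(i,j) = Sum h(u,v) * F(i+u, j+v) 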
 
 

#  Median Filtering 

 
- a non linear operation often used in image processing 
===> just another statistical approach: instead of the neighborhood average (= mean), take the median. 
[benefits] 
- reduce noise 
- preserve edges (sharp lines !) 
 
(good example video) https://www.youtube.com/watch?v=lcfSk9RP8xA 
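
(a minimal openCV sketch comparing mean filtering with median filtering; the filename and window size are placeholders) 

import cv2 

img = cv2.imread("noisy.jpg", cv2.IMREAD_GRAYSCALE) 
mean_smoothed = cv2.blur(img, (5, 5))        # neighborhood average: reduces noise but blurs edges too 
median_smoothed = cv2.medianBlur(img, 5)     # neighborhood median: removes salt-and-pepper noise, preserves edges 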
 
 
 
#################################################### 
####   (2.5)  cross-correlation & convolution   #### 
#################################################### 
 

# Cross Correlation 

- in signal processing, CC is a measure of similarity of two waveforms as a function of a time lag applied to one of them. 
i.e. we have two diff waveforms (like two diff images represented as pixel matrices) and combine them in such a way that best correlates the two. 
- aka "a sliding dot product" or "sliding inner-product" 
 
Filtering an image: replace each pixel with a linear combination of its neighbors. 
- filter "kernel" or "mask" which is the prescription for weights in the linear combination. 
 

# Gaussian Filter 

 
- a smoothing/filtering example 
-- e.g. a 21-by-21 kernel whose weights follow a 2D normal distribution: heaviest at the center of the window, falling off toward its edges. 
(video)  https://www.youtube.com/watch?v=-AuwMJAqjJc 
         https://www.youtube.com/watch?v=4RpdOAbnNYE 
 

# Convolution 

- a mathematical operation on two functions F and h 
- produces a third function that is typically viewed as a modified version of one of them 
- gives the area of overlap btwn the two functions, as a function of the amount that one of the original functions is translated 
- same as cross-correlation, except the kernel is flipped (horizontally and vertically) before the sliding dot product 
 
(video) https://www.youtube.com/watch?v=C3EEy8adxvc 
 
(good quiz) https://www.youtube.com/watch?v=u1_VRoHYkFU 
            https://www.youtube.com/watch?v=LwMdpIZ8Mw0 
            https://www.youtube.com/watch?v=yhL866nr6zs 
            https://www.youtube.com/watch?v=_W9qZhnDMyM 
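
(a minimal sketch of the relationship between the two operations: convolution is cross-correlation with the kernel flipped in both directions. cv2.filter2D computes cross-correlation, so flipping the kernel before the call gives convolution; the filename and kernel are placeholders) 

import cv2 
import numpy as np 

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32) 
h = np.array([[ 1,  2,  1], 
              [ 0,  0,  0], 
              [-1, -2, -1]], np.float32)                 # an asymmetric kernel, so the two results differ 

correlated = cv2.filter2D(img, -1, h)                    # sliding dot product (cross-correlation) 
convolved  = cv2.filter2D(img, -1, cv2.flip(h, -1))      # convolution = correlation with the flipped kernel 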
 

#  properties of convolution 

- linear and shift invariant 
-- behaves the same everywhere 
-- i.e. the value of the output depends on the pattern in the image neighborhood, not the position of the neighborhood. 
- commutative :  F*G = G*F 
- associative :  (F*G)*H = F*(G*H) 
- identity : unit impulse 
-- kernel E = [..000010000..] # a kernel with a single 1 in the center and every other elem zero (not an identity matrix) 
-- F*E = F    # because E only takes the centered pixel and doesn't mix in the neighboring pixels 
NOTE: this identity is true of cross correlation too 
- separable 
-- possible to convolve all rows first, then all columns 
 
( Linear filter example )  https://www.youtube.com/watch?v=WeNpd_YEF6I 
 
 
 
############################# 
####   (2.6) Gradients   #### 
############################# 
 
- use an image gradient to compute/detect edges 
- image gradient in continuous form for a function 
                 in discrete form for an image 
 

#  using filters to find features 

- extract higher level features 
-- map raw pixels to an intermediate representation 
-- reduce amount of data, preserve useful information 
 

#  what are the good features to match between images ? 

- features 
-- parts of an image that encode it in a compact form 
-- like discontinuities in a scene 
e.g. 
- surface, depth, color, illumination, edge 
 
- edge 
-- information theory view that edges encode change, therefore edges efficiently encode an image. 
 

#  edge detection 

- basic idea: look for a neighborhood with strong signs of change 
(issues to consider) 
-- the size of the neighborhood ? 
-- what metric represents a strong "change" ? a pixel intensity diff above some threshold ? 
 
(example edge detection video) https://www.youtube.com/watch?v=B-ITjBgBU4o 
 
- recall an image(its pixel intensity values) can be expressed as a function of coordinates 
- an edge is where there is rapid change in the image intensity function. 
-- take the derivatives of F(x,y) 
(good video) https://www.youtube.com/watch?v=F8mA5sfAb24 
 

# differential operators for images 

- need an operation that, when applied to an image, returns its derivatives. 
-- model these "operators" as mask or kernel 
--- when applied, yields a new function that is the image gradient 
-- then "threshold" this gradient function to select edge pixels 
 

# image gradient 

- need to define "gradient" 
- gradient of an image : measure of directional change in the image function F(x,y), in x (across columns) and in y (across rows). 
  i.e. the gradient vector is (dF/dx, dF/dy); its magnitude is sqrt((dF/dx)^2 + (dF/dy)^2) 
(mathematical notation) https://www.youtube.com/watch?v=Dl5lPdoCXi8 
(partial derivative, and discrete approximation) https://www.youtube.com/watch?v=kj4vpaiE1KI 
(example of application) https://www.youtube.com/watch?v=ihYEdclqDkA 
                         https://www.youtube.com/watch?v=VRc9WhimKpk 
 
Gradient direction is the angle at which greatest positive change occurs. 
 
(good quiz)  https://www.youtube.com/watch?v=LzqBs--aaR4 
 
 
############################ 
#####   (2.7) Edges    ##### 
############################ 
 
recall : we can differentiate an image in x & y 
 
derivative as a local product 
(see the equations)  https://www.youtube.com/watch?v=E1I9jYMdRgg 
==> basically we can interpret the process of differentiation as cross correlation, with a kernel and an input image of pixel arrays. 
 

#  computing discrete gradients 

- desired : an "operator" (aka mask/kernel) that effectively computes discrete derivative values with cross-correlation (i.e. using finite differences) 
- finite differences provide a numerical solution for differential equations using approximation of derivatives. 
 
(example video)  https://www.youtube.com/watch?v=L8P8rmqWfoc 
 
# 3 examples 
(1) Prewitt kernel 
(2) Sobel kernel 
(3) Roberts kernel 
 
NOTE: significant noise in the signal makes it hard to detect edges. we will revisit this later. 
(a must-see example video)  https://www.youtube.com/watch?v=RuKy02HfeoM 
                            https://www.youtube.com/watch?v=t3dgjb5yD1Q 
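
(a minimal openCV sketch using the Sobel kernel to compute the discrete gradient, its magnitude and its direction; the filename and threshold value are placeholders) 

import cv2 
import numpy as np 

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32) 
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)     # dF/dx (change across columns) 
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)     # dF/dy (change across rows) 
magnitude = np.sqrt(gx**2 + gy**2)                 # gradient magnitude 
direction = np.arctan2(gy, gx)                     # angle of greatest positive change 
edges = magnitude > 100                            # crude threshold to select edge pixels 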
 

#  gradients and convolution 

 
recall convolution:  G = h * F 
derivative of a convolution:  dG/dx = d(h*F)/dx 
 
if D is a kernel that computes derivatives, and H is a kernel for smoothing, we can combine derivative and smoothing into one kernel (by associativity): 
D*(H*F) = (D*H)*F 
 
 

#  gradient to edge 

- smoothing (use a filter like Gaussian filter to suppress noise) 
- compute gradient 
- apply edge enhancement 
- edge localization: edge VS noise. 
- threshold, thinning 
 

#  Canny edge detector   (a very common edge detector) 

 
1. filter image with derivative of Gaussian 
2. find magnitude and orientation of gradient 
3. non-maximum suppression 
- thin multi-pixel wide "ridges" down to single pixel width 
4. linking and thresholding. 
- define two thresholds (low & high). 
- use the high th to start edge curves, and the low th to continue them. 
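
(a minimal openCV sketch of the Canny detector; cv2.Canny performs steps 2-4 internally, so only the Gaussian smoothing is added up front. the threshold values are placeholders) 

import cv2 

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE) 
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)   # suppress noise before differentiating 
edges = cv2.Canny(blurred, 50, 150)            # low & high thresholds used for linking edge curves 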
 
 
 
############################ 
####   (3.1)  Cameras   #### 
############################ 
 
- rays of light to pixels 
- a camera without optics 
- lens in the camera system 
- the lens equation 
 

#  rays VS pixels 

- illumination (light rays) follows a path from the source to the scene 
- rays are fundamental primitives 
- scene via a 2D array of pixels 
- computation can control the parameters of the optics, sensor and illuminations 
 

#  single lens reflex camera 

 
(see the drawing) https://www.youtube.com/watch?v=aEaeRkadK5k 
 
- view finder 
- shutter release 
- focal plane shutters 
- photographic film (later replaced by CMOS sensors) 
- focus/zoom ring 
- frontal glass lens 
 

#  when you take a picture, you try to capture 

(1) geometry (3D-ness, perspective) 
(2) light scattering 
 

# how rays of light (illumination) are captured 

(why the image gets captured upside down (inverted) on the sensor. see video) https://www.youtube.com/watch?v=U5WsCFi7h4Y 
 
==> the mechanism behind the "pinhole camera" (= camera obscura) 
 

# pinhole photograph 

in theory, 
- straight lines remain straight 
- infinite depth of field, i.e. everything in focus. (but there may be optical blur) 
- light diffracts (wave nature of light; a smaller aperture means more diffraction) 
 
[pinhole size] = aperture 
==> the bigger, the more light, the more geometric blur, the less diffraction blur 
==> the smaller, the less light, the sharper image quality, the more diffraction blur 
===> even the best (optimal-size) pinhole admits very little light 
 
d : pinhole diameter 
f : focal length: distance from pinhole to sensor 
p : wavelength of light 
 
d = 2 * sqrt(1/2 * f * p) 
 
(video) https://www.youtube.com/watch?time_continue=148&v=6_Epf-2uKQ0 
 

#  replacing the pinhole with a lens 

==> capture more light, but still maintain the pinhole concept 
(video) https://www.youtube.com/watch?v=r7bo2DKKUUU 
 

#  geometrical optics 

- parallel (to the lens) rays converge to a point located at focal length f from lens 
- rays going thru the center of the lens do NOT deviate (= functions like a pinhole) 
(good visualization video)  https://www.youtube.com/watch?v=Fao6ERbySqU 
 

#  ray tracing with lenses 

- rays from points on a plane parallel to the lens, focus on a plane parallel to the lens on the other side (and upside down) 
o : distance between object and lens 
i : distance between lens and image 
f : focal length 
 
thin lens equation :   1/o + 1/i = 1/f 
 
(good visual video) https://www.youtube.com/watch?v=YQ1-R2oYjhA 
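
(a tiny worked example of the thin lens equation; the numbers are just placeholders) 

# thin lens equation: 1/o + 1/i = 1/f  =>  i = 1 / (1/f - 1/o) 
def image_distance(o_mm, f_mm): 
    return 1.0 / (1.0 / f_mm - 1.0 / o_mm) 

# an object 1 m away, seen through a 50 mm lens, focuses ~52.6 mm behind the lens 
print(image_distance(o_mm=1000.0, f_mm=50.0)) 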
 
 
###################### 
####  (3.2) Lens  #### 
###################### 
 
- focal length 
- field of view 
- sensor size 
- image formation & capture 
- perspective projection (how to capture 3D-ness) 
 
lens : convex (converging, "positive") and concave (diverging, "negative") lenses 
==> modern cameras use a combination of both 
 
changes in object distance, focal length. 
==> changes in the size of the object captured at the focal plane. 
 

# focusing 

- achieved by controlling the position of both the lens and the sensor. 
-- moving the lens lets you decide the size of the object to capture 
-- the sensor must be placed at the focal plane, where the rays converge (at the image distance given by the lens equation; equal to the focal length only for objects at infinity). 
 
sensor == film == screen 
 

# FOV(field of view) = how wide the angle of your view is. 

 
h = sensor size 
f = focal length 
 
FOV = 2 * arctan( h / (2f) ) 
 
==> clearly, smaller h leads to smaller FOV 
             bigger f leads to smaller FOV 
 
(see video)  https://www.youtube.com/watch?v=pUuAx_zFnEk 
             https://www.youtube.com/watch?v=9O_-9srabGM 
             https://www.youtube.com/watch?v=HCxHM1GKirM 
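
(a tiny worked example of the FOV formula; the 36 mm sensor width and the focal lengths are just example numbers) 

import math 

def fov_degrees(h_mm, f_mm): 
    return math.degrees(2.0 * math.atan(h_mm / (2.0 * f_mm))) 

print(fov_degrees(h_mm=36.0, f_mm=50.0))    # ~39.6 degrees across a 36 mm wide sensor with a 50 mm lens 
print(fov_degrees(h_mm=36.0, f_mm=200.0))   # ~10.3 degrees: longer focal length -> narrower FOV 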
 
by changing focal length, you can change both view-point and perspectives (= geometry). 
(video) https://www.youtube.com/watch?v=wgUgMIKrAbc 
 

# a camera coordinate model 

given the coordinate of the object, and its distance from the lens, and focal length, you can solve for the distance of the ideal sensor  position from the lens. 
(video) https://www.youtube.com/watch?v=OaObDEoW4v0 
 

# changes in focal length (and viewpoint) 

(example video) https://www.youtube.com/watch?v=iVd44d7E-Ks 
 
 
 
############################ 
####   (3.3) Exposure   #### 
############################ 
 
exposure triangle 
(1) aperture 
(2) shutter speed 
(3) ISO 
 
==> photographers try to optimize those 3 parameters. 
 

# exposure 

 
exposure = irradiance * time 
       H = E * T 
 
aperture : an opening, a hole, a gap 
irradiance : amount of light falling on a unit area of sensor per second, controlled by lens aperture 
 
exposure time T : how long the shutter is kept open. 
 

# SLR camera (single lens reflex) 

see the structure. 
https://en.wikipedia.org/wiki/Single-lens_reflex_camera#Optical_components 
 
 

# shutter speed 

 
- amount of time the sensor is exposed to light 
- usually denoted as a fraction of a second. e.g. 1/200, 1/30, 10, 15, bulb (i.e. the shutter stays open as long as you press it) 
-- longer shutter speed: you get more blur 
(good visual example): https://youtu.be/59LMCZWi1kU 
 

# aperture 

area = pi * (f/2N)^2 
 
f = focal length 
N = aperture number # often denoted as f/N 
  = the same N gives the same irradiance irrespective of the lens (focal length) 
 
low f-number N on telephoto lens means BIG lens 
 
(good maths example) https://www.youtube.com/watch?v=J21gRHvzT5E 
 
e.g. 
- doubling N halves the aperture diameter, and therefore reduces the area (and the light) by 4x 
- from f/2.8 to f/5.6 cuts light by 4x 
- to cut light by 2x (one stop), increase N by sqrt(2) 
 
(example video)  https://www.youtube.com/watch?v=w48TPp4EJ9E 
                 https://www.youtube.com/watch?v=BpXi-an3Fi0 
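
(a tiny worked example of the aperture-area formula above, showing the stop relationships; the numbers are placeholders) 

import math 

def aperture_area(f_mm, N): 
    return math.pi * (f_mm / (2.0 * N))**2 

print(aperture_area(50.0, 2.8) / aperture_area(50.0, 5.6))   # ~4.0  : f/2.8 -> f/5.6 cuts light by 4x 
print(aperture_area(50.0, 2.8) / aperture_area(50.0, 4.0))   # ~2.0  : increasing N by ~sqrt(2) cuts light by ~2x 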
 

#  ISO = sensitivity 

 
film : sensitivity VS grain (of film) 
digital : sensitivity VS noise (of sensor) 
 
ISO is linear : ISO 200 needs half the light of ISO 100 
                you need a higher ISO value when taking a pic in a dark place. 
 

# adjusting exposure variables (shutter speed, aperture, ISO) 

(example video) https://www.youtube.com/watch?v=YPv83szLWGg 
                https://www.youtube.com/watch?v=e_joOcwQBe4 
 
in recap, 
aperture     : depth of field 
shutter speed: motion blur 
ISO          : more grain 
 
 
############################# 
####  (3.4)  sensor     ##### 
############################# 
 

#  film VS digital 

- two primary sensors. 
- essentially the same. chemical for film. electronic for digital. 
- the difference is how the light is trapped and preserved. 
 
# film 
- converts light into chemicals 
- a film consists of many layers of color filters. 
 
# digital 
- converts light into data 
-- CCD: "charge-coupled device", a device for converting electrical charge, into a digital value. 
-- pixels are represented by capacitors, which convert and store incoming photons as electron charges 
- Bayer Filter: a kind of a color filter. 
- "demosaicing" : RGB values collected from Bayer needs to be processed = demosaiced. 
- CMOS: "complementary metal oxide semiconductor" 
-- photo sites in a CCD are passive and do no "work"; the charge is just sent to an amplifier later. 
-- photo sites in CMOS have local amplifiers for each photo site, and can do local processing 
 
# camera "raw" format 
--> contains minimally processed data from the sensor (image as viewed by the sensor) 
--> image encoded in device dependent color space 
--> captures radiometric characteristics of the scene 
--> like a photographic negative. 
---> has a wider dynamic range and color gamut, preserving most of the information of the image. 
 
 
#################################### 
####  (4.1) Fourier Transform   #### 
#################################### 
 

# reconstructing a signal 

 
A = amplitude 
w = frequency 
t = time 
n = number of periods 
target signal:  f(t) = A*cos(nwt) 
 
          inf 
f^T (t) = Sum A*cos(nwt) 
          n=1 
 
(video) https://www.youtube.com/watch?v=Y2HOU9GgFUU 
 
## 
##  a fourier transform 
## 
- periodic function: a weighted sum of sines and cosines of diff frequencies 
- transforms f(t) into F(w), a frequency spectrum of f(t) 
- a reversible operation 
- for every w from 0 to inf, F(w) holds the amplitude A and phase G of the corresponding sinusoid 
 
  A*cos(wt + G) 
 

# frequency domain for a signal 

 
how many samples N do we need? 
- smaller N : coarse reconstruction 
- bigger N  : finer reconstruction of the signal 
 

# combining (1) time frequency and (2) frequency spectra 

(see equation) https://www.youtube.com/watch?v=MRx71wcjvlA 
               https://www.youtube.com/watch?v=lw5mwyVxueE 
 

# convolution theorem and the Fourier Transform 

- convolution in spatial domain is equivalent to multiplication in frequency domain 
 
(see the logic) https://www.youtube.com/watch?v=sMeQ-gQlaPs 
                https://www.youtube.com/watch?v=IWQfj05i87g 
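
(a minimal numpy sketch of the convolution theorem: blur an image by multiplying in the frequency domain. note this gives circular (wrap-around) boundary handling; the image here is random just to keep the example self-contained) 

import numpy as np 

img = np.random.rand(256, 256)             # stand-in for a grayscale image 
kernel = np.ones((5, 5)) / 25.0            # 5x5 box blur 

F = np.fft.fft2(img)                       # image spectrum 
H = np.fft.fft2(kernel, s=img.shape)       # kernel spectrum, zero-padded to the image size 
blurred = np.real(np.fft.ifft2(F * H))     # multiplication in frequency domain == convolution in spatial domain 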
 
 
 
############################# 
####  (4.2)  Blending    #### 
############################# 
 
- merge two images 
-- window size 
-- advantages of using the fourier domain 
 
- pixel averaging 
- cross fading # applying a different weighting for each pixel, adding up to 100%. 
-- window size for blending becomes important. see video. 
(video)  https://www.youtube.com/watch?v=fgr9QEApc00 
 
- factors for optimal blending window size 
-- to avoid seams    : window >= size of largest prominent feature 
-- to avoid ghosting : window <= 2 * size of smallest prominent feature 
==> use the Fourier domain 
  -- largest frequency <= 2 * smallest frequency 
  -- i.e. the image frequency content should occupy one "octave" (power of two) 
 
# an octave = a band of the frequency spectrum over which the frequency doubles 
 
- frequency spread needs to be modeled. 
FFT(Image_left) = F_left 
FFT(Image_right)= F_right 
-- decompose Fourier image into octaves(bands) 
-- feather the corresponding octaves of F_left & F_right 
-- compute inverse FFT and feather in spatial domain 
-- sum feathered octave images in frequency domain 
 
- what is feathering? 
-- blurring of the edge before applying the blend operations 
-- makes the merged resulting image smoother 
 
 
########################### 
####  (4.3) Pyramids   #### 
########################### 
 
- the whole FFT blending mumbo jumbo can be done with pyramids (Gaussian and Laplacian) 
 
- pyramid representation : A Gaussian Pyramid 
-- just using the same old Gaussian kernel filtering to scale an image down to a lower resolution, and repeat. 
-- this process is called "reduce" function 
-- its inverse is called "expand" which does not produce the original image, but at least attempts to. 
--- the diff between the original image and the expanded image is the error called "Laplacian" 
(video) https://www.youtube.com/watch?v=9qg_uysFBZs 
 
g1 = reduce(g0),  g2 = reduce(g1),  ... 
g1_e = expand(g2) 
L1 = g1 - expand(g2)   # the Laplacian at level 1 
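
(a minimal openCV sketch of reduce/expand and a Laplacian level, using cv2.pyrDown / cv2.pyrUp as the reduce / expand functions; the filename is a placeholder) 

import cv2 

g0 = cv2.imread("example.jpg") 
g1 = cv2.pyrDown(g0)                                       # "reduce": Gaussian blur, then downsample by 2 
g2 = cv2.pyrDown(g1) 
g1_e = cv2.pyrUp(g2, dstsize=(g1.shape[1], g1.shape[0]))   # "expand": upsample back (lossy) 
L1 = cv2.subtract(g1, g1_e)                                # Laplacian at level 1 = detail lost between g1 and g2 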
 

#  pyramid blending process 

given image a,b, and region R 
- build Laplacian pyramids La and Lb for both images (which requires building their Gaussian pyramids first) 
- build a Gaussian pyramid Gr from the selected region R 
- form a combined pyramid using Gr as weights 
  Lout(i,j) = Gr(i,j)*La(i,j) + (1-Gr(i,j)) * Lb(i,j) 
- collapse the Lout pyramid to get the final blended image 
 
(video) https://www.youtube.com/watch?v=gumDre9uX4o 
 
 
 
 
############################ 
####  (4.4)  cuts       #### 
############################ 
 
- cuts as opposed to blending 
- finding an optimal "seam" between images 
 
(good visual examples)  https://www.youtube.com/watch?v=Zt5ZxJQy9dM 
                        https://www.youtube.com/watch?v=fiU26c1apT0 
 
- done with graph-cuts algo, as well as dynamic programming 
(video) https://www.youtube.com/watch?v=THnmHfh3A_g 
 
 
 
############################ 
####  (4.5)  Features   #### 
############################ 
 
detecting features of an image, to be able to, for example, match with other images. 
 
some famous feature detection methods 
- Harris corner detection algo 
- SIFT detector 
 
common feature transformation: 
- translation (location movement) 
- rotation 
- scale (size change) 
- affine (shape change) 
- perspective (e.g. original feature is somebody's face, this can be the same face taken from the side) 
- lighting (pixel values) 
 
==> can come in combination 
 
characteristics of good features 
- repeatability/precision 
- saliency/matchability 
- compactness/efficiency 
- locality 
 
(example good feature video)  https://www.youtube.com/watch?v=4drqUKMnCXQ 
 
## 
## find corners 
## 
- key property : in the region around a corner, image gradient has two or more dominant directions 
- corners are repeatable and distinctive 
 
(good visualization video) https://www.youtube.com/watch?v=pha59TPYE1U 
 
(maths behind corner detection) https://www.youtube.com/watch?v=n8inFQlWxT8 
                                https://www.youtube.com/watch?v=aehKF4P5T3g 
                                https://www.youtube.com/watch?v=owJemgKy-UU 
 
## 
##  Harris Detector Algo overview 
## 
- compute Gaussian derivatives at each pixel 
- compute second moment matrix M in a Gaussian window around each pixel 
- compute corner response function R 
- threshold R 
- find local maxima of response function (non-maximum suppression) 
 

#  properties of Harris Detector 

- rotation invariant(=constant,unaffected) ? 
-- the ellipse rotates, but its shape (= eigenvalues) remains the same 
-- corner response R is invariant to rotation 
 
- intensity invariant ? 
-- partial invariance to additive and multiplicative intensity changes (threshold issue for multiplicative) 
-- only image derivatives are used 
--- invariance to intensity shift: I -> I+b 
--- invariance to intensity scale: I -> a*I 
==> threshold needs to be adaptive 
 
- scale invariant ? 
-- No! dependent on window size. 
-- use pyramids (or frequency domain) 
==> this is why we need SIFT 
 

# examples of Scale Invariant Detectors 

(1) Harris Laplacian 
- find local maximum of: 
-- harris corner detector in space (image coordinates, for x,y) 
-- laplacian in scale 
 
(2) SIFT 
-- find local maximum of: 
-- difference of Gaussians(DoG) in space and scale 
-- DoG is simply the difference between adjacent levels of a Gaussian pyramid within each octave 
- orientation assignment 
-- compute the best orientation for each keypoint region 
- keypoint description 
-- use local image gradients at selected scale and rotation to describe each keypoint region 
 
(SIFT example video) https://www.youtube.com/watch?v=o-ZeIWqV9Vk 
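
(a minimal openCV sketch of SIFT keypoint detection; it assumes OpenCV >= 4.4, where SIFT lives in the main module (older versions need opencv-contrib). the filename is a placeholder) 

import cv2 

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE) 
sift = cv2.SIFT_create() 
keypoints, descriptors = sift.detectAndCompute(img, None)   # scale/rotation-aware keypoints + 128-d descriptors 
out = cv2.drawKeypoints(img, keypoints, None, 
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)  # circles show scale and orientation 
cv2.imwrite("sift_keypoints.jpg", out) 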
 
 
 
################################################## 
####  (4.6)  Feature Detection and Matching   #### 
################################################## 
 
Harris Detector: step by step 
- compute horizontal & vertical derivatives of the image. (convolve with derivative of Gaussians) 
- compute outer products of gradients M 
- convolve with larger Gaussian 
- compute scalar interest measure R 
- find local maxima above some threshold, detect features 
 
(must see good video) https://www.youtube.com/watch?v=AUxQtUV0Umc 
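
(a minimal openCV sketch of the Harris detector via cv2.cornerHarris, which wraps the derivative / second-moment-matrix / response steps above; the parameter values are placeholders) 

import cv2 
import numpy as np 

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE) 
gray = np.float32(img) 
R = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)   # corner response R at every pixel 
corners = R > 0.01 * R.max()                               # threshold R to keep only strong corners 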
 
 

#  Scale Invariant Detection 

- consider regions of different sizes around a point 
(video) https://www.youtube.com/watch?v=_Oa9MZp79Lk 
 
- a region(circle) which is "scale invariant" 
- not affected by the size but will be the same for "corresponding regions" 
- e.g. : average intensity. for corresponding regions (even of different size) it will be the same. 
 
- compute the scale invariant function for different region sizes, and find the max point. 
- a "good" function for scale detectionhas one stable sharp peak. 
(see video for visual) https://www.youtube.com/watch?v=WhLCqGn-pbQ 
- for usual images: a good function would be one which responds to contrast (sharp local intensity change) 
 

#  key point localization 

- find robust extremum (maximum or minimum) both in space and in scale 
-- SIFT: scale invariant feature transform 
--- specific suggestion: use a pyramid to find maximum values (remember edge detection), then eliminate "edges" and pick only corners; remove low-contrast points and edge responses 
(must see video) https://www.youtube.com/watch?v=2ShLnDnBdx4 
 
 
 
####################################### 
####  (5.1)  Image Transformation  #### 
####################################### 
 
image filtering: change the "range" of an image(=function)  i.e. pixel value 
image warping: change the "domain" of an image(=function)   i.e. pixel pos 
(good video) https://www.youtube.com/watch?v=EWb0fDDkIcs 
 
## 
## parametric global warping 
## 
- translation (x,y coordinate shift)  # 2 DoF : degrees of freedom 
- euclidean (translation + rotation)  # 3 DoF 
- aspect 
- scale 
- perspective  # 8 DoF 
- affine       # 6 DoF 
 
(video)  https://www.youtube.com/watch?v=O508zuEgeSo 
 
transformation function T(p) = p'    where p = original pixel 
 
T() is usually some sort of parametric matrix (in combination with trigonometry) 
p is the (x,y) pixel coordinate 
 
(good example video)  https://www.youtube.com/watch?v=6qMM6-t_Iyc 
                      https://www.youtube.com/watch?v=a59YQ4qe7mE 
                      https://www.youtube.com/watch?v=jOaWYpd-Iv4 
                      https://www.youtube.com/watch?v=cTlsAP93oz4 
                      https://www.youtube.com/watch?v=n4I7pUxhuqI 
 
(good quiz)  https://www.youtube.com/watch?v=l-kFpFaCmvg 
             https://www.youtube.com/watch?v=QCvvJMubOFw 
             https://www.youtube.com/watch?v=nuBy-gXdFvE 
             https://www.youtube.com/watch?v=rL_v_02FgtQ 
 
NOTE: transformation here generally means a simple global mapping (x,y translation plus rotation, etc.), while warping is a more general point-to-point mapping. 
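
(a minimal openCV sketch of a parametric global warp: a euclidean/similarity transform applied to every pixel; the angle, scale and filename are placeholders) 

import cv2 

img = cv2.imread("example.jpg") 
h, w = img.shape[:2] 
M = cv2.getRotationMatrix2D(center=(w / 2, h / 2), angle=30, scale=1.0)   # 2x3 matrix for T(p) 
warped = cv2.warpAffine(img, M, (w, h))                                   # apply p' = T(p) to the whole image 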
 
## 
##  fwd / inverse warping 
## 
- problems 
-- fwd : holes, overlap 
-- inv : minification (causes aliasing, blocking) 
 

#  image morphing 

- mesh based warping 
- find feature points (via feature detection algo we studied before), and create a mesh 
-- lots of useful functions/examples in openCV library 
 
 
 
############################## 
####  (5.3)  Panorama     #### 
############################## 
 
1. capture images 
2. detection and matching 
3. warping (aligning images)  # simple translation works but warp is better 
4. blending, fading, cutting 
5. cropping (optional) 
 
# a bundle of rays contains all views 
-> able to create a synthetic view  (as long as there is the same center of projection) 
(example video)  https://www.youtube.com/watch?v=JuQtxpHRFag 
 
 
# dealing with bad matches 
-> RANSAC (random sample consensus) 
--> find "average" translation vector 
 
(video) https://www.youtube.com/watch?v=wFxm2QfyX68 
        https://www.youtube.com/watch?v=qN61qa3mPog 
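
(a minimal openCV sketch of the detect -> match -> RANSAC -> warp pipeline; ORB is used here only because it ships with core OpenCV, and the filenames are placeholders) 

import cv2 
import numpy as np 

img1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE) 
img2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE) 

orb = cv2.ORB_create()                                      # feature detection 
kp1, des1 = orb.detectAndCompute(img1, None) 
kp2, des2 = orb.detectAndCompute(img2, None) 

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)  # pairwise feature matching 
matches = matcher.match(des1, des2) 

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2) 
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2) 
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # RANSAC discards the bad matches 
warped = cv2.warpPerspective(img1, H, (img1.shape[1] + img2.shape[1], img2.shape[0]))  # align img1 onto img2's plane 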
 
 

#  3 types of panorama projection plane 

- plane 
- cylinder 
- sphere 
(video)  https://www.youtube.com/watch?v=qlU6md2KKy0 
 
 

#  openCV panorama code example 

(video)  https://www.youtube.com/watch?v=De7D_9ZjjfU 
 
 
 
######################### 
####   (5.4) HDR     #### 
######################### 
 
high dynamic range : HDR 
==> basically, an 8-bit pixel intensity (0 to 255) does not suffice to capture the real world. the range should be much wider, more like 5 to 10 million values. 
 
luminance : a photometric measure of the luminous intensity per unit area of light travelling in a given direction. measured in candela per square meter (cd/m^2) 
 
## 
##  camera calibration 
## 
- geometric : how pixel coordinates relate to directions in the world. 
- radiometric/photometric : how pixel values relate to radiance amounts in the world. 
 
===> basically taking images at diff exposure values, and using the properly exposed range for each pixel. 
 
(good example video on mechanics)  https://www.youtube.com/watch?v=EyVr1104yUs 
                                   https://www.youtube.com/watch?v=HbR3ZbvSRcg 
                                   https://www.youtube.com/watch?v=DyKchAF-uOI 

# 32-bit radiance map, representation 

(video) https://www.youtube.com/watch?v=TDaPLUt1R2I 
 

#  tone mapping 

- diff algos exist to cope with the problem of HDR images looking unnatural 
- they try to compress the wide range into a smaller, displayable range. 
(video)  https://www.youtube.com/watch?v=_pwAt_N8HaY 
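
(a minimal openCV sketch of the HDR pipeline: radiometric calibration, merging exposures into a 32-bit radiance map, then tone mapping. filenames and exposure times are placeholders) 

import cv2 
import numpy as np 

files = ["exp_1_30.jpg", "exp_1_8.jpg", "exp_1_2.jpg"]           # same scene, different exposures 
images = [cv2.imread(f) for f in files] 
times = np.array([1/30.0, 1/8.0, 1/2.0], dtype=np.float32)       # exposure times in seconds 

response = cv2.createCalibrateDebevec().process(images, times)   # recover the camera response curve 
hdr = cv2.createMergeDebevec().process(images, times, response)  # 32-bit float radiance map 
ldr = cv2.createTonemap(gamma=2.2).process(hdr)                  # compress back to a displayable range 
cv2.imwrite("tonemapped.jpg", np.clip(ldr * 255, 0, 255).astype(np.uint8)) 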
 
 
 
######################### 
####  (5.5) Stereo   #### 
######################### 
 
- Depth of a scene 
-- capture a 3D scene with geometry 
-- recall the focal length and x,y,z coordinate expression of a scene depth/position. 
-- fundamental depth ambiguity : any point along the same ray maps to the same pixel in the image 
                            e.g. forced-perspective shots where your hand looks like it is holding a building (playing with depth ambiguity) 
(good video) https://www.youtube.com/watch?v=WQWolyZ7s_Q 
             https://www.youtube.com/watch?v=BJ6JnF7ZSdk 
 
- how to estimate(infer) depth/shape from a single view point 
-- illumination of structure 
-- shades 
-- occlusion (where an object sits in front of other objects, etc) 
-- using objects of known sizes/textures 
-- perspectives 
-- motion 
-- focus 
 
====> need to resolve depth ambiguity ! 
====> stereo : image captured from two view points 
 
(good example video) https://www.youtube.com/watch?v=-lHvI8VM6Pc 
 
## 
## Depth Parallax 
## 
- parallax : apparent motion of scene features located at diff distances 
-- basically nearby objects tend to move(displace) more than far away objects. 
(example video) https://www.youtube.com/watch?v=npauK1iBKng 
 
## 
##  Anaglyph 
## 
- anaglyph encodes parallax in a single picture. two slightly diff perspectives of the same subject are superimposed on each other in contrasting colors, producing a three dimensional effect, when viewed thru two correspondingly colored filters. basically two images taken thru diff filters (red and cyan) 
(example video)  https://www.youtube.com/watch?v=DN3QCsl_s7U 
(excellent summary video)  https://www.youtube.com/watch?v=ExOgHDC0cgc 
 
 
## 
##  can you compute depth if you have stereo (two view points) for a scene? 
## 
- yes, it's simple geometry. 
- basically you match objects (via the feature detection/matching we studied) between the two images, and then compute the disparity (the difference in image position between the left and right views), from which the distance follows. 
- epipolar constraints (basically: you only need to search along 1D epipolar lines to do the matching, assuming the y-coordinate does not change between the left and right views.) 
- stereo matching/reconstruction algos have been extensively researched already. google them. 
- occlusion: depending on obstacles, matching is not always easy. 
- RGBD camera (red green blue depth) using time of flight (the time it takes for light to travel and bounce back) 
 
(maths video)  https://www.youtube.com/watch?v=3hivcpmV9IE 
               https://www.youtube.com/watch?v=1eQAIl0zPWQ 
               https://www.youtube.com/watch?v=5JnsHrYoSW0 
               https://www.youtube.com/watch?v=D6fDQTvqiV4 
               https://www.youtube.com/watch?v=92UD6nX8124 
               https://www.youtube.com/watch?v=AJTbdMwYTps 
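
(a minimal openCV sketch of stereo matching with simple block matching along epipolar lines, assuming a rectified left/right pair; filenames and parameters are placeholders) 

import cv2 

left = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE) 
right = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE) 

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)   # block matching along epipolar lines 
disparity = stereo.compute(left, right)                         # nearby objects -> large disparity, far -> small 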
 
 
############################# 
####  (5.6)  Photosynth  #### 
############################# 
 
- go beyond panoramas 
- photo tourism (aka photosynth) 
-- scene reconstruction of 3D geometry from many photos 
--- relative camera positions, orientations, focal length of cameras, point cloud, correspondence of feature points 
--- scene construction process: 
---- 1. feature detection (e.g. SIFT) 
---- 2. pairwise feature matching (e.g. RANSAC) 
---- 3. correspondence estimate 
---- 4. incremental structure from motion (pretty much the same fundamental logic as panorama) 
        (video) https://www.youtube.com/watch?v=CH5WOjNm8uk 
 
(a must see impressive video)  https://www.youtube.com/watch?v=q957FNyq6bw 
 
 
good website to play around the topic.   https://photosynth.net 
 
 
google maps        : https://www.youtube.com/watch?v=2AUFBxTnxzw 
google street view : https://www.youtube.com/watch?v=MvGcIvyPbSg 
 
 
 
##################################### 
####   (6.1)  Video Processing   #### 
##################################### 
 
video : a stack of images displayed sequentially over time. 
 
- aspect ratio 
- frame rate 
- codec/compression algo 
 

# persistence of vision 

- human vision system perceives frames of images as flicker-less smooth continuous motion picture 
- specifically more than 24 frames per second 
 
 
the same image processing techniques (filtering, feature detection and matching, feature tracking/registration/blending/morphing/warping) apply to video; we just add another dimension of time. 
(cool demo videos)  https://www.youtube.com/watch?v=te7vTXRiaMs 
                    https://www.youtube.com/watch?v=z_uWEIxcNmU 
 
 
################################## 
####  (6.2)  Video Textures   #### 
################################## 
 
say you have a 10 sec video of, for example, a candle flame. it trembles a bit, and say you loop the video so you have a continuous, infinite-duration video of the candle. but when you stitch the end back to the beginning, it will flicker, which feels unnatural. the video texture is the technique that makes that loop seamless. 
(good example video)  https://www.youtube.com/watch?v=dF6hbn_ExhE 
 
how ? 
- suppose you have 90 frames; then for each frame, compute a similarity (distance) metric to every other frame, so you know which frame to jump to to maintain a flicker-free loop. 
(good visual example)  https://www.youtube.com/watch?v=9mmubrRftU0 
                       https://www.youtube.com/watch?v=k4gIt4Re708 
 
what similarity metrics ? 
- L2 norm : compute the sum of squared euclidean distance of each pixel between two frames 
- L1 norm : compute the sum of abs(p1-p1') aka manhattan distance of each pixel between two frames 
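
(a minimal numpy sketch of the two frame-similarity metrics; frames are assumed to be equally sized image arrays) 

import numpy as np 

def l2_distance(a, b): 
    d = a.astype(np.float64) - b.astype(np.float64) 
    return np.sqrt(np.sum(d * d))      # euclidean distance over all pixels 

def l1_distance(a, b): 
    d = a.astype(np.float64) - b.astype(np.float64) 
    return np.sum(np.abs(d))           # manhattan distance over all pixels 

# distance matrix: D[i][j] says how well frame j could follow frame i in the loop 
# D = [[l2_distance(fi, fj) for fj in frames] for fi in frames] 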
 
which way to jump ? 
- similar frames can be represented as markov chain 
(video)  https://www.youtube.com/watch?v=gkF1GEcAFRg 
 
how to preserve dynamics with transitions 
- for certain videos, you break the dynamics if you simply jump based on frame similarity alone. 
- thus you basically have to model the transitions, and that can be encoded into your markov model 
(super good example video)  https://www.youtube.com/watch?v=-PNDam1o1Y8 
 
fading, blending, morphing, warping 
- all can be applied to video as well 
 
video portrait  (view morphing) 
- you can apply this looping to "stereo" video ! 
(good example)  https://www.youtube.com/watch?v=-O519cNPsFE 
 
video sprites 
- you can merge a moving object into a different background. e.g. you collect video of a hamster running in all directions, and create a video where the hamster keeps running along the edge of a circle. 
(good example) https://www.youtube.com/watch?v=ICjDHaOFJn4 
 
cliplets / cinemagrams 
- playing only parts of a video (e.g. you let one person keep walking while freezing the rest of the crowd) to highlight or emphasize. 
(good video)  https://www.youtube.com/watch?v=8ImD17cPGk0 
 
 
####################################### 
####  (6.3)  Video Stabilization   #### 
####################################### 
 
- removing(stabilizing) the shake and jitter as post processing. 
 
(good example)  https://www.youtube.com/watch?v=tfA8VNWakXI 
 
 
## 
##  pre-processing stabilization  (not the scope of this lecture) 
## 
- optical / in-camera stabilization 
-- sensor shift 
-- floating lens (electromagnets) 
-- accelerometer + gyro 
-- high frequency perturbations (small buffer) 
 
e.g. steadicam 
 
 
## 
##  post processing stabilization 
## 
- removes low frequency perturbations (large buffers) 
- distributed backend processing (cloud computing) 
- can be applied to any camera, any video. 
 
- main steps 
(1) estimate camera motion 
-- find corners (i.e. high gradient in x & y) and track them across frames 
-- background VS foreground motions. we want background. there is a weighting algo. 
-- 8 degrees of freedom (x & y translation, scale/rotation, skew and perspectives) 
(2) stabilize camera path 
-- stationary or linear displacement, parabolic path, etc (lots of smoothing algo) 
-- (video) https://www.youtube.com/watch?v=CfmlNQw-cAg 
(3) crop and re-synthesize 
-- the crop window size can adaptively change 
 
=> create a virtual camera frame. 
--- can deviate too much from original camera 
 
(example video)  https://www.youtube.com/watch?v=xBHMJ9kfS0M 
                 https://www.youtube.com/watch?v=QbRAtoqOLhY 
                 https://www.youtube.com/watch?v=sv43Vd4n5R8 
 
-- rolling shutter VS global shutter 
--- recall CMOS sensor uses rolling shutter 
(video)  https://www.youtube.com/watch?v=QYFd9YhmDXI 
 
 
 
############################################ 
####   (6.4) Panoramic Video Textures   #### 
############################################ 
 
PVT: a video that has been stitched into a single wide field of view. 
- appears to play continuously and indefinitely 
 
(quiz)  https://www.youtube.com/watch?v=HGpoYP8Z5i0 
 
1. take each frame, and create a panorama (just like you do for images) 
2. then separate static & dynamic regions (either manually or using some automated method) 
3. then apply video texture technique to the dynamic region. 
done 
 
(video)  https://www.youtube.com/watch?v=_hNZW5ChIoI 
 

# video texture of dynamic region 

- map a continuous diagonal slice of the input video volume to the output panorama 
- restricts boundaries to frames 
- shears spatial structures across time 
(example) https://www.youtube.com/watch?v=6RTOmjyQ0gg 
--> can be improved with a graph "cut" algo, in addition to fade/blend. 
(example) https://www.youtube.com/watch?v=k5xZ19jrw2A 
          https://www.youtube.com/watch?v=ebGxoaxDixY 
 
 
############################# 
####  (7.1) Light Field  #### 
############################# 
 
- we get an image of a scene as 2D pixels 
- but the fundamental primitive raw data are the rays of light, (normally) following a straight path from the scene to the sensor. 
 

#  what is a light field? 

- at any point in a 3D real-world scene, you can put a sensor that will capture light from all directions. 
- hence, literally, a light field. 
 

#  7 parameters of plenoptic function 

P()  : plenoptic pixel intensity function 
 
plenoptic is latin for full optic (i.e. light field) 
 
here are the 7 params 
 
Theta: angle 1 
Gamma: angle 2 
L    : wavelength (i.e. color) 
t    : time 
x,y,z: coordinates in 3D world 
 
- captures the complete 3D scene --> leads to things like holographic image, video! 
- think of multiple pinhole cameras. you can capture all the rays-of-light info, then generate pixels later. 
 
(example) https://www.youtube.com/watch?v=hQeF150nX8g 
          https://www.youtube.com/watch?v=1D8xRIw4W9g 
          https://www.youtube.com/watch?v=MWz_3pbo3AE    # microlens plenoptic/light-field camera 
          https://www.youtube.com/watch?v=JyheHp7lgE4    # demo 
 
 
########################################### 
####  (7.2) projector camera systems   #### 
########################################### 
 
- basically (computationally) controlling the illumination (light source) together with the sensor/lens/camera itself. 
i.e. coded exposure # controlling the light at the source 
 
 
##################################### 
####  (7.3)  Coded Photography   #### 
##################################### 
 
- recall epsilon photography, where you take multiple photos, each with only one parameter changed 
- we do this within camera via code 
- e.g. coded aperture # controlling the light at the camera (as opposed to coded exposure where you control the light at the source) 
 
 
