### computational photography

Georgia Tech, Computational Photography. cs6475 notes

Computer Vision textbook PDF : http://szeliski.org/Book/drafts/SzeliskiBook_20100903_draft.pdf

# prerequisite: linear algebra (you can assume Euclidean space unless otherwise specified)

- cartesian coordinate systems : points
- vector : magnitude & direction. scalar multiplication.
- normalization : process of finding a unit vector (a vector of magnitude 1) in the same direction as a given vector. (obviously cannot normalize zero vector as it has no direction)
- inner products of vectors : Cauchy-Schwarz Inequality
- parallel and orthogonal vectors
-- two vectors are parallel if one is a scalar multiple of the other.
-- two vectors v & w are orthogonal if v*w = 0
--- zero vector is parallel & orthogonal to any vector.
--- zero vector is the only vector orthogonal to itself.
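a quick numpy sketch of these ideas (the vectors are just illustrative values):

```python
import numpy as np

def normalize(v):
    # unit vector in the same direction; undefined for the zero vector
    n = np.linalg.norm(v)
    if n == 0:
        raise ValueError("cannot normalize the zero vector")
    return v / n

v = np.array([3.0, 4.0])
w = np.array([-4.0, 3.0])   # a scalar multiple of neither v nor -v

u = normalize(v)            # magnitude becomes ~1
print(np.linalg.norm(u))    # ~1.0
print(np.dot(v, w))         # 0.0, so v and w are orthogonal
# Cauchy-Schwarz: |v . w| <= |v| * |w|
print(abs(np.dot(v, w)) <= np.linalg.norm(v) * np.linalg.norm(w))  # True
```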

#########################
####  (1)  intro    #####
#########################

computational and technical aspects of photography : how light is captured to generate images.

# environment
- openCV (either python or C++)
- matlab/octave

##
##  what is photography ?
##

https://en.wikipedia.org/wiki/Photography

"Photography is the science, art and practice of creating durable images by recording light or other electromagnetic radiation, either electronically by means of an image sensor, or chemically by means of a light-sensitive material such as photographic film."

===> storing natural light into digital (or chemical) images.

##
##  what is computational photography ?
##

- a discipline of studying how computing impacts photography.
-- digital sensors
-- modern optics
-- actuators
-- smart lights

##
##  limitations of traditional film cameras
##
- chemicals, darkroom
- one roll can contain only up to 36 pics.
==> cannot take many, no ability to instantly view the pics you took, and film is sensitive.

##
##  comp photography enables:
##
- unbounded dynamic range (HDR)
- variable
-- focus
-- depth of field
-- resolution
-- lighting
-- reflectance

##
##  elements of comp photography
##
given a 3D scene,
(1) illumination
(2) optics/aperture
(3) sensor
(4) processing
(5) display
(6) user

==> convert rays of light into "pixels"
==> computation can control all steps.  we will study how more deeply, in the following lectures.

###################################
####     Dual Photography      ####  a comp photography example
###################################

= the process of measuring the light transport to generate a dual image.

suppose a projector illuminates an object, and a camera captures an image, which we call a primal image. then we can send the light back from the sensor (camera) side to the illuminator, and create a dual image, which is an image from the point of view of the illuminator.

recall the 6 elems of cp we looked at before; dual photography involves everything except (3) sensor, which can be included as well.

# novel illumination:

- of course, your target 3D scene is already illuminated with natural light.
- but you can use an additional controllable light source (e.g. a projector) plus a controllable aperture (like a modulator/filter that controls where to let light pass thru)
--> you can have computer algorithm decide how to control this added illumination.

# novel camera (optics/sensor/processing/display)

for the optics/sensor part, you may have an aperture/filter that controls what light to take in, and relate it back to which light source is illuminating it, hence further understanding how illumination changes the resulting image.

===> by controlling the aperture on both ends, we can do more stuff on image.

##
##  reflective property of rays of light
##
- reflection of light depends on the kind of surface.
-- specular (e.g. mirror)
-- diffuse (e.g. matte)
==> thus, depending on the surface, the light can get to the sensor in different ways.
==> the question : can we control it?  can we observe the controlled change ?

######################
####   Panorama   ####  another comp photography example
######################

recall
given a 3D scene,
(1) illumination
(2) optics/aperture
(3) sensor
(4) processing
(5) display
(6) user

in terms of the 6 elems of cp, panorama will mostly deal with (2)-(6)

# Panorama steps

(1) taking pics
(2) detection and matching (find the overlap btwn two pics so we can stitch them together)
(3) warping (aligning the pics on top of each other)
(4) blending
- since lighting/exposure may differ btwn the two pics, how much of a pixel do we take from one and how much from the other? or do we take 100% of a pixel from one and none from the other? we need a good algorithm to decide.
(5) cropping (optional)

==> we will revisit full technical details later.

#######################################################
####   computational photography as a discipline   ####
#######################################################

# Camera Process

- lens             # generalized optics
- sensor/detector  # generalized sensor e.g. CCD/CMOS, electronics
==> then we create a pixel/image

#  photo stats

in roughly 200 years of photo history, ~3.5 trillion photos have been taken.
in 2011 alone, ~380 billion were taken. i.e. roughly 10% of all the photos ever taken were taken just in the last year.

==> computations with photographs are becoming more relevant.

# SLR vs Smartphone Camera

DSLR (digital single lens reflex)
- more light (great lens)
- depth of field (zoom, etc)
- minimal shutter lag
- control field of view
- other features (flash, modes, etc)

Smartphone camera
- computations (takes multiple pics and does fusion )
- data (location, etc)
- programmers/API for controlling some of the elements of CP

#  Film vs Digital cameras

- film and digital cameras have roughly the same features and controls
-- zoom & focus
-- aperture & exposure
-- one shutter press - one snapshot

#  CP extends FP/DP

- for FP/DP, we can USE, but CP allows us to CHANGE
-- optics, illumination, sensor, movement
-- exploit wavelength, speed, depth, polarization, etc
-- probes, actuators, network

- also CP offers better specification and support for
-- dynamic range
-- varying focus point-by-point
-- field of view & resolution
-- exposure time & frame rate
-- bursts (taking many pics at once)

#  growing impact on society

- pics/images are used to record history, analyze crime, etc.
-- kennedy assassination
-- september 11
-- meteor
-- boston bombing

#  computer vision  vs  computer graphics

they both work on the same things, but in opposite directions

computer vision  : take 2D images, and infer the 3D world (geometry, shape, photometry)
computer graphics: take 3D models of the world, and generate nice 2D images

#  ultimate camera : human eyes

- CP ultimately lets us understand human biology more

###################################
####   (2.1)  Digital Image    ####
###################################

how to represent an image.

(3) sensor  : generate signals to represent a computable image (i.e. digital image )
(4) processing
(5) display

#  overview

1. digital image : pixels & resolution
: x & y coordinates. width * height = resolution

2.1 discrete (=matrix) e.g. I(i,j)
2.2 continuous (=function) e.g. I(x,y)

3. grayscale (=black & white) & color

4. digital image format

#  Pixel

- a picture element that contains the light intensity at some location (i,j) in an image.

I(i,j) = some numeric value

- thus an image can be represented as a matrix.
- in 8-bit-pixel grayscale images, intensity values range from 0=black to 255=white
- 1 bit pixel means 0 or 1, black or white. two colors
- 4 bit pixel means 16 colors.
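a small numpy sketch of bit depth (the 2x2 "image" is a made-up example):

```python
import numpy as np

# a tiny 2x2 "image" with 8-bit intensities (0=black, 255=white)
img8 = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# requantize to 4 bits (16 levels) by dropping the low 4 bits
img4 = (img8 >> 4).astype(np.uint8)   # values now in 0..15
print(img4)

# a 1-bit image: threshold at mid-gray -> two "colors", black or white
img1 = (img8 >= 128).astype(np.uint8)
print(img1)
```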

#  digital image as a function

given a matrix of pixel values, we can extract
- continuous signal
- discrete signal

given a matrix of pixel values,  I(x,y) = intensity value of the pixel x,y

#  Sampling and Quantization

- Sampling(deciding the measurement frequency/interval)
- Quantization (=rounding to nearest value)

#  image statistics

- image histogram (distribution graph)
-- take a region of an image, and draw a graph where x-axis is the intensity bins (grayscale 0 to 255), and y-axis is the occurrence of each grayscale value.
-- you can do lots of stats analysis (average, median, mode, etc)
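a minimal histogram sketch in numpy (the tiny image is made up; real histograms are taken over full image regions):

```python
import numpy as np

def gray_histogram(img):
    # occurrence count of each 8-bit intensity value (256 bins)
    return np.bincount(img.ravel(), minlength=256)

img = np.array([[0, 0, 255], [128, 128, 128]], dtype=np.uint8)
h = gray_histogram(img)
print(h[0], h[128], h[255])   # 2 3 1
print(int(np.argmax(h)))      # 128, the mode of the region
```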

#  Color digital image

- each pixel has 3 channels (e.g. blue channel, red channel, green channel)
- 8 bit each, thus 24 bit in total.

#  tools

- openCV (computer vision)
-- API for image processing, available in C++/python
- matlab/octave
- proce55ing (java based)

#  understanding image format

- order of color channels
- compression info
- metadata about photos (EXIF = exchangeable image file format : geo-location info, width, height, etc)

#######################################
####    (2.2) point processing     ####
#######################################

PP : pixel-based arithmetic manipulation/computation of images.
- alpha-blending

e.g.
you take two pictures of the same target, say a classroom: one when empty, the other with a teacher. if you subtract one from the other, you get the shape of the teacher. this suggests an application: security video camera processing.

question: what do we do if we add/subtract and go out of 0-255 value range for a pixel ?
- rescale (either before or after PP)

#  alpha-blending

suppose again, you have a photo of an empty classroom, and another photo of the same classroom with a teacher.
if you multiply each photo's pixel values by 1/2 and add them together, the teacher shows up half transparent while everything else stays fully visible.
==> this opacity is represented as "alpha", which ranges from 0 to 1.
i.e. 0 = fully transparent (invisible)
1 = fully opaque (visible)
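a minimal alpha-blending sketch in numpy; the `empty`/`teacher` arrays are hypothetical stand-ins for the two photos:

```python
import numpy as np

def alpha_blend(a, b, alpha):
    # alpha = 1 -> only a; alpha = 0 -> only b
    out = alpha * a.astype(np.float64) + (1.0 - alpha) * b.astype(np.float64)
    return np.clip(out, 0, 255).astype(np.uint8)  # stay inside the 0-255 range

empty   = np.full((2, 2), 200, dtype=np.uint8)   # stand-in for the empty room
teacher = np.full((2, 2), 100, dtype=np.uint8)   # stand-in for the room + teacher
half = alpha_blend(empty, teacher, 0.5)
print(half)   # every pixel is 150
```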

#################################
####  (2.3) Blending Modes   ####
#################################

blending pixels
e.g.
given two images, a & b
f_blend_ave(a,b) = (a+b) / 2      #  0.5 alpha for each
f_blend_normal(a,b) = b           # just taking the base image

#  common blend modes

- divide : brighten photos
- addition : too many whites
- subtract : too many blacks
- difference : subtract with scaling
- darken   : f_blend(a,b) = min(a,b) for RGB
- lighten  : f_blend(a,b) = max(a,b) for RGB
- multiply : f_blend(a,b) = ab                # darker
- screen   : f_blend(a,b) = 1 - (1-a)(1-b)    # brighter  # cos you invert both a & b, multiply, and invert again
- overlay  : f_blend(a,b) = 2ab              if a < 0.5
                          = 1 - 2(1-a)(1-b)  otherwise
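the blend modes above can be sketched in numpy, on intensities normalized to [0, 1] (the function names are mine, not a standard API):

```python
import numpy as np

# blend modes on intensities normalized to [0, 1]
def blend_multiply(a, b): return a * b                  # darker
def blend_screen(a, b):   return 1 - (1 - a) * (1 - b)  # brighter
def blend_darken(a, b):   return np.minimum(a, b)
def blend_lighten(a, b):  return np.maximum(a, b)
def blend_overlay(a, b):
    # multiply in the shadows, screen in the highlights
    return np.where(a < 0.5, 2 * a * b, 1 - 2 * (1 - a) * (1 - b))

a = np.array([0.25, 0.75])
b = np.array([0.5, 0.5])
print(blend_multiply(a, b))  # [0.125 0.375]
print(blend_screen(a, b))    # [0.625 0.875]
print(blend_overlay(a, b))   # [0.25 0.75]
```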

#  dodge and burn

(techniques used in traditional film photography, in a dark room)
- dodge : brighten an image  # screen mode
- burn  : darkens an image   # multiply mode

==> of course there are variants of each.

#############################
####   (2.4) smoothing   ####
#############################

1. smooth an image over a neighborhood of pixels (as opposed to a pin point one particular pixel)
2. median filtering as a special non-linear filtering and smoothing approach

smoothing :
- can be construed as blurring or removing noise
- commonly done with averaging. e.g. you take 3-by-3 or 5-by-5 neighborhoods.
-- for edge rows/columns: expand the image by copying over the edge values (or wrapping around), then apply the neighborhood averaging strategy as usual.

===> we use the notion of the neighborhood size k, "kernel", where the window size is 2k+1
e.g.
k = 1, then it is 3-by-3
k = 2, then it is 5-by-5

#  generalized mathematical representation for neighborhood-averaging smoothing

G(i,j) = 1/(2k+1)^2 * Sum of all F(i+u,j+v) in the neighborhood.

==> if you wanna assign (aka attribute) a non-uniform weight to each elem in the neighborhood, instead of placing the common 1/(2k+1)^2 factor, you use h(u,v) which gives the weight for each pixel.
==> the whole concept known as "cross correlation" (to be revisited at a later lecture)
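a direct (slow but literal) numpy sketch of the formula above; the `cross_correlate` helper and test image are illustrative:

```python
import numpy as np

def cross_correlate(F, h):
    # G(i,j) = sum over (u,v) of h(u,v) * F(i+u, j+v), edge values replicated
    k = h.shape[0] // 2
    Fp = np.pad(F.astype(np.float64), k, mode="edge")  # copy edge values outward
    G = np.empty_like(F, dtype=np.float64)
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            G[i, j] = np.sum(h * Fp[i:i + 2*k + 1, j:j + 2*k + 1])
    return G

k = 1
box = np.full((2*k + 1, 2*k + 1), 1.0 / (2*k + 1)**2)  # uniform 1/(2k+1)^2 weights
F = np.zeros((5, 5)); F[2, 2] = 9.0                    # single bright pixel
G = cross_correlate(F, box)
print(G[2, 2])   # 1.0: the spike is spread over the 3x3 neighborhood
```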

#  Median Filtering

- a non linear operation often used in image processing
===> just another statistical approach. instead of the neighborhood average (= mean), take the median.
[benefits]
- reduce noise
- preserve edges (sharp lines !)
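a minimal median-filter sketch in numpy, showing the noise-removal benefit (image and kernel size are illustrative):

```python
import numpy as np

def median_filter(F, k=1):
    # replace each pixel by the median of its (2k+1)x(2k+1) neighborhood
    Fp = np.pad(F, k, mode="edge")
    G = np.empty_like(F)
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            G[i, j] = np.median(Fp[i:i + 2*k + 1, j:j + 2*k + 1])
    return G

F = np.full((5, 5), 10.0)
F[2, 2] = 255.0                 # one "salt" noise pixel
out = median_filter(F)
print(out[2, 2])   # 10.0: the outlier is removed without smearing it around
```

compare with mean filtering, which would spread the 255 spike into all of its neighbors.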

####################################################
####   (2.5)  cross-correlation & convolution   ####
####################################################

# Cross Correlation

- in signal processing, CC is a measure of similarity of two waveforms as a function of a time lag applied to one of them.
i.e. we have two diff waveforms (like two images of pixel matrices) and slide one over the other to find where they best correlate.
- aka "a sliding dot product" or "sliding inner-product"

Filtering an image: replace each pixel with a linear combination of its neighbors.
- filter "kernel" or "mask" which is the prescription for weights in the linear combination.

# Gaussian Filter

- a smoothing/filtering example
-- e.g. a 21-by-21 kernel whose weights follow a 2D normal distribution: highest at the center of the window, falling off toward the edges.

# Convolution

- a mathematical operation on two functions F and h
- produces a third function that is typically viewed as a modified version
- gives the area of overlap btwn the two functions,
- as a function of the amount that one of the original functions is translated.

#  properties of convolution

- linear and shift invariant
-- behaves the same everywhere
-- i.e. the value of the output depends on the pattern in the image neighborhood, not the position of the neighborhood.
- commutative :  F*G = G*F
- associative :  (F*G)*H = F*(G*H)
- identity : unit impulse
-- kernel E = [..000010000..] # like a matrix with a single one in the center, every other elem zero; not the identity matrix
-- F*E = F    # because E takes only the focused pixel and doesn't mix in the neighbor pixels
NOTE: this identity is true of cross correlation too
- separable
-- possible to convolve all rows first, then all columns

( Linear filter example )  https://www.youtube.com/watch?v=WeNpd_YEF6I

#############################
####  (2.6) Gradients    ####
#############################

- use an image gradient to compute/detect edges
- image gradient: in continuous form for a function,
  in discrete form for an image

#  using filters to find features

- extract higher level features
-- map raw pixels to an intermediate representation
-- reduce amount of data, preserve useful information

#  what are the good features to match between images ?

- features
-- parts of an image that encode it in a compact form
-- like discontinuities in a scene
e.g.
- surface, depth, color, illumination, edge

- edge
-- information theory view that edges encode change, therefore edges efficiently encode an image.

#  edge detection

- basic idea: look for a neighborhood with strong signs of change
(issues to consider)
-- the size of the neighborhood ?
-- what metrics represent a strong "change" ? pixel intensity diff of above threshold XYZ ?

- recall an image(its pixel intensity values) can be expressed as a function of coordinates
- an edge is where there is rapid change in the image intensity function.
-- take the derivatives of F(x,y)

# differential operators for images

- need an operation that, when applied to an image, returns its derivatives.
-- model these "operators" as mask or kernel
--- when applied, yields a new function that is the image gradient
-- then "threshold" this gradient function to select edge pixels

- gradient of an image : measure of directional change in the image function F(x,y), in x (across columns) and in y (across rows).
(partial derivative, and discrete approximation) https://www.youtube.com/watch?v=kj4vpaiE1KI

Gradient direction is the angle at which greatest positive change occurs.
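a sketch of the discrete gradient using forward finite differences (one of several possible approximations; the step-edge test image is made up):

```python
import numpy as np

def image_gradient(F):
    # forward finite differences as a discrete approximation of the partials
    gx = np.zeros_like(F); gy = np.zeros_like(F)
    gx[:, :-1] = F[:, 1:] - F[:, :-1]   # dF/dx: change across columns
    gy[:-1, :] = F[1:, :] - F[:-1, :]   # dF/dy: change across rows
    magnitude = np.sqrt(gx**2 + gy**2)
    direction = np.arctan2(gy, gx)      # angle of greatest positive change
    return magnitude, direction

# vertical step edge: dark left half, bright right half
F = np.hstack([np.zeros((4, 2)), np.ones((4, 2))])
mag, ang = image_gradient(F)
print(mag[1, 1])   # 1.0 at the edge column
print(ang[1, 1])   # 0.0: gradient points in +x, straight across the edge
```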

############################
#####   (2.7) Edges    #####
############################

recall : we can differentiate an image in x & y

derivative as a local product
==> basically we can interpret the process of differentiation as cross correlation, with a kernel and an input image of pixel arrays.

- desired : an "operator" (aka mask/kernel) that effectively computes discrete derivative values with cross-correlation (i.e. using finite differences)
- finite differences provide a numerical solution for differential equations using approximation of derivatives.

# 3 examples
(1) Prewitt kernel
(2) Sobel kernel
(3) Roberts kernel

NOTE: significant noise in the signal makes it hard to detect edges. we will revisit this later.

recall convolution  G = h * F
derivative of a convolution: dG/dx = d(h*F)/dx

if D is a kernel to compute derivatives, and H is the kernel for smoothing, we could define kernels with derivative and smoothing in one:
D*(H*F) = (D*H)*F

- smoothing (use a filter like Gaussian filter to suppress noise)
- apply edge enhancement
- edge localization: edge VS noise.
- threshold, thinning

#  Canny edge detector   (a very common edge detector)

1. filter image with derivative of Gaussian
2. find magnitude and orientation of gradient
3. non-maximum suppression
- thin multi-pixel wide "ridges" down to single pixel width
4. hysteresis thresholding
- define two thresholds (low & high)
- use the high threshold to start edge curves, and the low threshold to continue them.

############################
####   (3.1)  Cameras   ####
############################

- rays of light to pixels
- a camera without optics
- lens in the camera system
- the lens equation

#  rays VS pixels

- illumination (light rays) follows a path from the source to the scene
- rays are fundamental primitives
- scene via a 2D array of pixels
- computation can control the parameters of the optics, sensor and illuminations

#  single lens reflex camera

- view finder
- shutter release
- focal plane shutters
- photographic film (later replaced by CMOS sensors)
- focus/zoom ring
- frontal glass lens

#  when you take a picture, you try to capture

(1) geometry (3D-ness, perspective)
(2) light scattering

# how rays of light (illumination) are captured.

(why it gets captured up side down, inverted, on the sensor. see video) https://www.youtube.com/watch?v=U5WsCFi7h4Y

==> the mechanism behind the "pinhole camera" (= camera obscura)

# pinhole photograph

in theory,
- straight lines remain straight
- infinite depth of field, i.e. everything in focus. (but there may be optical blur)
- light diffracts (wave nature of light; a smaller aperture means more diffraction)

[pinhole size] = aperture
==> the bigger, the more light, the more geometric blur, the less diffraction blur
==> the smaller, the less light, the sharper image quality, the more diffraction blur
===> best pinhole = very little light

d : pinhole diameter
f : focal length: distance from pinhole to sensor
p : wavelength of light

d = 2 * sqrt(1/2 * f * p)
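plugging hypothetical numbers into the pinhole formula (the focal length and wavelength below are just example values):

```python
import math

def optimal_pinhole_diameter(f, wavelength):
    # d = 2 * sqrt(1/2 * f * p): balances geometric blur vs diffraction blur
    return 2.0 * math.sqrt(0.5 * f * wavelength)

f = 0.1          # 100 mm focal length, in meters
lam = 550e-9     # green light, ~550 nm
d = optimal_pinhole_diameter(f, lam)
print(round(d * 1000, 3), "mm")   # about a third of a millimeter
```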

#  replacing the pinhole with a lens

==> capture more lights, but still maintain the pinhole concept

#  geometrical optics

- rays parallel to the optical axis converge to a point located at focal length f from the lens
- rays going thru the center of the lens do NOT deviate (= functions like a pinhole)

#  ray tracing with lenses

- rays from points on a plane parallel to the lens, focus on a plane parallel to the lens on the other side (and upside down)
o : distance between object and lens
i : distance between lens and image
f : focal length

thin lens equation :   1/o + 1/i = 1/f
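the thin lens equation solved for the image distance i (the example numbers are hypothetical):

```python
def image_distance(o, f):
    # thin lens equation 1/o + 1/i = 1/f, solved for i
    return 1.0 / (1.0 / f - 1.0 / o)

f = 50.0                          # focal length, mm
print(image_distance(100.0, f))   # 100.0: an object at 2f images at 2f
print(image_distance(1e9, f))     # ~50: a very distant object focuses near f
```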

######################
####  (3.2) Lens  ####
######################

- focal length
- field of view
- sensor size
- image formation & capture
- perspective projection (how to capture 3D-ness)

lens : convex (positive, converging) and concave (negative, diverging) lenses
==> modern cameras use a combination of both

changes in object distance, focal length.
==> changes in the size of the object captured at the focal plane.

# focusing

- achieved by controlling the positions of both the lens and the sensor.
-- moving the lens lets you decide the size of the object to capture
-- the sensor must be placed at the image distance i (from the thin lens equation) from the lens; i ≈ f only for distant objects.

sensor == film == screen

# FOV(field of view) = how wide the angle of your view is.

h = sensor size
f = focal length

FOV = 2 * tan^(-1)(h/2f)

==> clearly, smaller h leads to smaller FOV
bigger f leads to smaller FOV
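plugging example numbers into the FOV formula (36 mm is the width of a full-frame sensor; the focal lengths are illustrative):

```python
import math

def field_of_view_deg(h, f):
    # FOV = 2 * arctan(h / (2f))
    return math.degrees(2.0 * math.atan(h / (2.0 * f)))

full_frame = 36.0   # sensor width, mm
print(round(field_of_view_deg(full_frame, 50.0), 1))   # ~39.6 deg: "normal" lens
print(round(field_of_view_deg(full_frame, 200.0), 1))  # ~10.3 deg: longer f, narrower FOV
```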

by changing focal length, you can change both view-point and perspectives (= geometry).

# a camera coordinate model

given the coordinate of the object, and its distance from the lens, and focal length, you can solve for the distance of the ideal sensor  position from the lens.

# changes in focal length (and viewpoint)

############################
####   (3.3) Exposure   ####
############################

exposure triangle
(1) aperture
(2) shutter speed
(3) ISO

==> photographers try to optimize those 3 parameters.

# exposure

H = E * T    # exposure = irradiance * exposure time

aperture : an opening, a hole, a gap
irradiance E : amount of light falling on a unit area of the sensor per second, controlled by the lens aperture

exposure time T : how long the shutter is kept open.

# SLR camera (single lens reflex)

see the structure.
https://en.wikipedia.org/wiki/Single-lens_reflex_camera#Optical_components

# shutter speed

- amount of time the sensor is exposed to light
- usually denoted as a fraction of a second, e.g. 1/200, 1/30; or in whole seconds: 10, 15, bulb (i.e. shutter stays open as long as you press it)
-- longer shutter speed: you get more blur
(good visual example): https://youtu.be/59LMCZWi1kU

# aperture

area = pi * (f/2N)^2

f = focal length
N = aperture number (f-number)  # aperture often denoted f/N
==> the f-number gives irradiance irrespective of the lens

low f-number N on telephoto lens means BIG lens

e.g.
- doubling N reduces the area by 4x, and therefore reduces light by 4x
- from f/2.8 to f/5.6 cuts light by 4x
- to cut light by 2x (one stop), increase N by sqrt(2)
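checking the f-stop arithmetic with the area formula (the focal length is an arbitrary example; light gathered scales with area):

```python
import math

def aperture_area(f, N):
    # area = pi * (f / 2N)^2
    return math.pi * (f / (2.0 * N)) ** 2

f = 50.0
# doubling N (f/2.8 -> f/5.6) quarters the area, i.e. cuts light by 4x
print(round(aperture_area(f, 2.8) / aperture_area(f, 5.6), 2))                  # 4.0
# multiplying N by sqrt(2) halves the light: one "stop"
print(round(aperture_area(f, 2.8) / aperture_area(f, 2.8 * math.sqrt(2)), 2))   # 2.0
```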

#  ISO = sensitivity

film : sensitivity VS grain (of film)
digital : sensitivity VS noise (of sensor)

ISO is linear : ISO 200 needs half the light of ISO 100
you need a higher ISO value when taking a pic in a dark place.

# adjusting exposure variables (shutter speed, aperture, ISO)

in recap,
aperture     : depth of field
shutter speed: motion blur
ISO          : more grain

#############################
####  (3.4)  sensor     #####
#############################

#  film VS digital

- two primary sensor types.
- essentially the same: chemical for film, electronic for digital.
- the difference is how the light is trapped and preserved.

# film
- converts light into chemicals
- a film consists of many layers of color filters.

# digital
- converts light into data
-- CCD: "charge-coupled device", a device that converts accumulated electrical charge into digital values.
-- pixels are represented by capacitors, which convert and store incoming photons as electron charges
- Bayer Filter: a kind of a color filter.
- "demosaicing" : RGB values collected from Bayer needs to be processed = demosaiced.
- CMOS: "complementary metal oxide semiconductor"
-- photo sites in a CCD are passive and do no "work"; the charge is just sent to an amplifier later.
-- photo sites in CMOS have local amplifiers for each photo site, and can do local processing

# camera "raw" format
--> contains minimally processed data from the sensor (image as viewed by the sensor)
--> image encoded in device dependent color space
--> captures radiometric characteristics of the scene
--> like a photographic negative.
---> has a wider dynamic range or color that preserves most of the information of the image.

####################################
####  (4.1) Fourier Transform   ####
####################################

# reconstructing a signal

A = amplitude
w = frequency
t = time
n = number of periods (harmonic index)
target signal component:  f(t) = A*cos(nwt)

f^T(t) = Sum_{n=1..inf} A*cos(nwt)

##
##  a fourier transform
##
- periodic function: a weighted sum of sines and cosines of diff frequencies
- transforms f(t) into F(w), a frequency spectrum of f(t)
- a reversible operation
- for every w from 0 to inf, F(w) holds amplitude A, and phase G, of the sine function

F(w) = A*cos(wt+G)

# frequency domain for a signal

how many samples N do we need?
- smaller N : coarse approximation
- bigger N : finer signals captured

# combining (1) time frequency and (2) frequency spectra

# convolution theorem and the Fourier Transform

- convolution in spatial domain is equivalent to multiplication in frequency domain
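a 1D numpy check of the convolution theorem (the signal and kernel are made-up values; note the FFT version computes *circular* convolution):

```python
import numpy as np

# convolution theorem: convolution in the spatial domain equals
# pointwise multiplication in the frequency domain
f = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.25, 0.5, 0.25, 0.0])   # small smoothing kernel, zero-padded

# multiply the spectra, then transform back
spatial = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(h)))

# direct circular convolution for comparison
direct = np.array([sum(f[(n - k) % 4] * h[k] for k in range(4)) for n in range(4)])

print(np.allclose(spatial, direct))   # True
```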

#############################
####  (4.2)  Blending    ####
#############################

- merge two images
-- window size
-- advantages of using the fourier domain

- pixel averaging
- cross fading # applying a different weight to each pixel, with the weights adding up to 100%.
-- window size for blending becomes important. see video.

- factors for optimal blending window size
-- to avoid seams    : window >= size of largest prominent feature
-- to avoid ghosting : window <= 2 * size of smallest prominent feature
==> use the Fourier domain
-- largest frequency <= 2 * smallest frequency
-- image frequency content should occupy one "octave" (power of 2)

# an octave = a band of the frequency spectrum (a factor-of-2 range of frequencies)

- frequency spread needs to be modeled.
FFT(Image_left) = F_left
FFT(Image_right)= F_right
-- decompose each Fourier image into octaves (bands)
-- feather the corresponding octaves of F_left & F_right
-- sum the feathered octave images in the frequency domain
-- compute the inverse FFT to get the blended image back in the spatial domain

- what is feathering?
-- blurring of the edge before applying the blend operations
-- makes the merged resulting image smoother

###########################
####  (4.3) Pyramids   ####
###########################

- the whole FFT blending mumbo jumbo can be done with pyramids (Gaussian and Laplacian)

- pyramid representation : A Gaussian Pyramid
-- just using the same old Gaussian kernel filtering to scale an image down to a lower resolution, and repeat.
-- this process is called "reduce" function
-- its inverse is called "expand" which does not produce the original image, but at least attempts to.
--- the diff between the original image and the expanded image is the error called "Laplacian"

g1 = reduce(g0)
g1_e = expand(g2)
L1 = g1 - expand(g2)

#  pyramid blending process

given image a,b, and region R
- build Laplacian pyramids (which require building Gaussian pyramids first)
- build a Gaussian pyramid from selected region R
- form a combined pyramid, using Gr as weights
Lout(i,j) = Gr(i,j)*La(i,j) + (1 - Gr(i,j)) * Lb(i,j)
- collapse the Lout pyramid to get the final blended image
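a crude one-level pyramid sketch in numpy; 2x2 averaging and nearest-neighbor upsampling stand in for the real 5-tap Gaussian reduce/expand, but the reduce/expand/Laplacian structure is the same:

```python
import numpy as np

def reduce(g):
    # crude stand-in for Gaussian reduce: average each 2x2 block
    return 0.25 * (g[0::2, 0::2] + g[1::2, 0::2] + g[0::2, 1::2] + g[1::2, 1::2])

def expand(g):
    # crude stand-in for expand: nearest-neighbor upsample back to 2x size
    return np.repeat(np.repeat(g, 2, axis=0), 2, axis=1)

g0 = np.arange(16, dtype=np.float64).reshape(4, 4)
g1 = reduce(g0)          # lower-resolution level
L0 = g0 - expand(g1)     # Laplacian level: what expand() failed to recover
print(g1.shape)                           # (2, 2)
print(np.allclose(expand(g1) + L0, g0))   # True: the pyramid collapses back exactly
```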

############################
####  (4.4)  cuts       ####
############################

- cuts as opposed to blending
- finding an optimal "seam" between images

- done with graph-cuts algo, as well as dynamic programming

############################
####  (4.5)  Features   ####
############################

detecting features of an image, to be able to, for example, match with other images.

some famous feature detection methods
- Harris corner detection algo
- SIFT detector

common feature transformation:
- translation (location movement)
- rotation
- scale (size change)
- affine (shape change)
- perspective (e.g. original feature is somebody's face, this can be the same face taken from the side)
- lighting (pixel values)

==> can come in combination

characteristics of good features
- repeatability/precision
- saliency/matchability
- compactness/efficiency
- locality

##
## find corners
##
- key property : in the region around a corner, image gradient has two or more dominant directions
- corners are repeatable and distinctive

##
##  Harris Detector Algo overview
##
- compute Gaussian derivatives at each pixel
- compute second moment matrix M in a Gaussian window around each pixel
- compute corner response function R
- threshold R
- find local maxima of response function (non-maximum suppression)

#  properties of Harris Detector

- rotation invariant(=constant,unaffected) ?
-- the ellipse rotates, but its shape(=eigenvalues) remains the same
-- corner response R is invariant to rotation

- intensity invariant ?
-- partial invariance to additive and multiplicative intensity changes (threshold issue for multiplicative)
-- only image derivatives are used
--- invariance to intensity shift: I -> I+b
--- invariance to intensity scale: I -> a*I
==> threshold needs to be adaptive

- scale invariant
-- No! dependent on window size.
-- use pyramids (or frequency domain)
==> this is why we need SIFT

# examples of Scale Invariant Detectors

(1) Harris Laplacian
- find local maximum of:
-- harris corner detector in space (image coordinates, for x,y)
-- laplacian in scale

(2) SIFT
- find local maximum of:
-- difference of Gaussians(DoG) in space and scale
-- the DoG pyramid is built by differencing adjacent Gaussian-blurred levels within each octave
- orientation assignment
-- compute the best orientation for each keypoint region
- keypoint description
-- use local image gradients at selected scale and rotation to describe each keypoint region

##################################################
####  (4.6)  Feature Detection and Matching   ####
##################################################

Harris Detector: step by step
- compute horizontal & vertical derivatives of the image. (convolve with derivative of Gaussians)
- compute outer products of gradients M
- convolve with larger Gaussian
- compute scalar interest measure R
- find local maxima above some threshold, detect features

#  Scale Invariant Detection

- consider regions of different sizes around a point

- a region(circle) which is "scale invariant"
- not affected by the size but will be the same for "corresponding regions"
- e.g. : average intensity. for corresponding regions (even of different size) it will be the same.

- compute the scale invariant function for different region sizes, and find the max point.
- a "good" function for scale detectionhas one stable sharp peak.
- for usual images: a good function would be one which responds to contrast (sharp local intensity change)

#  key point localization

- find robust extremum (maximum or minimum) both in space and in scale
-- SIFT: scale invariant feature transform
--- specific suggestion: use a pyramid to find maximum values (remember edge detection), then eliminate "edges" and pick only corners; remove low-contrast points and edge responses

#######################################
####  (5.1)  Image Transformation  ####
#######################################

image filtering: change the "range" of an image(=function)  i.e. pixel value
image warping: change the "domain" of an image(=function)   i.e. pixel pos

##
## parametric global warping
##
- translation (x,y coordinate shift) # 2 DoF : degrees of freedom
- euclidean    # 3 DoF
- aspect
- scale
- perspective  # 8 DoF
- affine       # 6 DoF

transformation function T(p) = p'    where p = original pixel

T() is usually some sort of parametric matrix (in combination with trigonometry)
p is x,y pixel value

NOTE: transformation in general means a simple x,y coordinate shift plus rotation, while warping is a more general point-to-point mapping.
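a sketch of a parametric T(p) as a homogeneous 3x3 matrix, for the euclidean case (3 DoF); the angle and shift are example values:

```python
import numpy as np

def euclidean_T(theta, tx, ty):
    # euclidean (rigid) transform: rotation by theta plus translation (tx, ty)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0,  1]])   # homogeneous 3x3 form

p = np.array([1.0, 0.0, 1.0])            # pixel position in homogeneous coords
T = euclidean_T(np.pi / 2, 2.0, 0.0)     # rotate 90 degrees, shift right by 2
p_prime = T @ p                          # T(p) = p'
print(np.round(p_prime, 6))              # [2. 1. 1.]
```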

##
##  fwd / inverse warping
##
- problems
-- fwd : holes, overlap
-- inv : minification (causes aliasing, blocking)

#  image morphing

- mesh based warping
- find feature points (via feature detection algo we studied before), and create a mesh
-- lots of useful functions/examples in openCV library

##############################
####  (5.3)  Panorama     ####
##############################

1. capture images
2. detection and matching
3. warping (aligning images)  # simple translation works but warp is better
4. blending
5. cropping (optional)

# a bundle of rays contains all views
-> able to create a synthetic view  (as long as there is the same center of projection)

-> RANSAC (random sample consensus)
--> find "average" translation vector

#  3 types of panorama projection plane

- plane
- cylinder
- sphere

#  openCV panorama code example

#########################
####   (5.4) HDR     ####
#########################

high dynamic range : HDR
==> basically, 8-bit pixel (0 to 255) intensity does not suffice to capture the real world; the range should be much wider, like 5 to 10 million values.

luminance : a photometric measure of the luminous intensity per unit area of light travelling in a given direction. measured in candela per square meter (cd/m^2)

##
##  camera calibration
##
- geometric : how pixel coordinates relate to directions in the world.
- radiometric/photometric : how pixel values relate to radiance amounts in the world.

===> basically taking an image at diff exposure values, and taking the right range for each pixel.

(good example video on mechanics)  https://www.youtube.com/watch?v=EyVr1104yUs
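
a minimal sketch of the merging idea, assuming a perfectly linear camera response (real pipelines, e.g. Debevec-Malik, first recover the response curve): divide each exposure by its exposure time to get a radiance estimate, then combine with weights that favor well-exposed mid-range pixels.

```python
import numpy as np

def merge_hdr(images, exposure_times):
    """Weighted average of radiance estimates from multiple exposures.
    Assumes a linear response; weights favor mid-range pixel values."""
    images = np.asarray(images, dtype=np.float64)     # shape (N, H, W)
    times = np.asarray(exposure_times, dtype=np.float64)
    # hat weight: 0 at the extremes (0, 255), largest near mid-range
    w = 1.0 - np.abs(images / 255.0 - 0.5) * 2.0
    radiance = images / times[:, None, None]          # per-exposure estimate
    return (w * radiance).sum(axis=0) / (w.sum(axis=0) + 1e-8)

# two exposures of the same (flat) scene, 1/30 s and 1/120 s
imgs = [np.full((2, 2), 128.0), np.full((2, 2), 32.0)]
hdr = merge_hdr(imgs, [1 / 30, 1 / 120])
print(hdr[0, 0])   # both exposures agree on a radiance of ~3840
```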

#  tone mapping

- different algorithms exist to cope with the problem of HDR images looking unnatural on ordinary displays
- they compress the wide radiance range into the smaller displayable range.
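
the simplest family is a global operator. a minimal Reinhard-style sketch (assumption: input is linear radiance; the normalization by the mean is an illustrative choice):

```python
import numpy as np

def tone_map(radiance):
    """Global Reinhard-style operator: compresses [0, inf) into [0, 1)
    via L / (1 + L), after normalizing by the scene's average radiance."""
    L = radiance / (radiance.mean() + 1e-8)
    return L / (1.0 + L)

hdr = np.array([[0.1, 1.0], [100.0, 10000.0]])  # 5 orders of magnitude
ldr = tone_map(hdr)
print(ldr)   # everything now fits in [0, 1), order preserved
```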

#########################
####  (5.5) Stereo   ####
#########################

- Depth of a scene
-- capture a 3D scene with geometry
-- recall the focal length and x,y,z coordinate expression of a scene depth/position.
-- fundamental depth ambiguity : any point along the same ray maps to the same pixel in the image
e.g. forced-perspective shots where your hand looks like it is holding a building (playing with depth ambiguity)

- how to estimate (infer) depth/shape from a single viewpoint
-- shading (illumination of structure)
-- occlusion (an object sitting in front of other objects, etc.)
-- using objects of known sizes/textures
-- perspectives
-- motion
-- focus

====> need to resolve depth ambiguity !
====> stereo : image captured from two view points

##
## Depth Parallax
##
- parallax : apparent motion of scene features located at diff distances
-- basically, nearby objects appear to move (displace) more than far-away objects.

##
##  Anaglyph
##
- an anaglyph encodes parallax in a single picture. two slightly different perspectives of the same subject are superimposed on each other in contrasting colors, producing a three-dimensional effect when viewed through two correspondingly colored filters. basically two images taken through different filters (red and cyan).
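
as a sketch, assuming two already-aligned RGB views (channel order RGB), the superimposition is just channel substitution:

```python
import numpy as np

def make_anaglyph(left, right):
    """Red/cyan anaglyph: red channel from the left view,
    green + blue channels from the right view."""
    out = right.copy()
    out[..., 0] = left[..., 0]   # channel 0 = red (RGB order)
    return out

left = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)
right = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)
ana = make_anaglyph(left, right)
print(ana.shape)   # same shape as the inputs
```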

##
##  can you compute depth if you have stereo (two view points) for a scene?
##
- yes, it's simple geometry (triangulation).
- basically you match points (via feature detection/matching) between the two images, and then compute the disparity (the offset of each point between the left and right views); depth is inversely proportional to disparity.
- epipolar constraint (basically it says you only need to search along 1D epipolar lines to do matching, assuming the y-coordinate does not change between the left and right views.)
- occlusion: depending on obstacles, matching is not always easy.
- RGBD cameras (red green blue depth) instead measure depth directly using time of flight (the time it takes for light to travel out and bounce back)
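
a minimal sketch of the matching step under the epipolar constraint, for one rectified scanline (the helper and toy rows are illustration only): block matching by sum of squared differences (SSD), searching only horizontal offsets.

```python
import numpy as np

def disparity_1d(left_row, right_row, block=3, max_disp=4):
    """For each pixel in the left scanline, find the horizontal offset
    into the right scanline that minimizes the SSD over a small block."""
    n = len(left_row)
    disp = np.zeros(n, dtype=int)
    half = block // 2
    for x in range(half, n - half):
        patch = left_row[x - half : x + half + 1]
        best, best_cost = 0, np.inf
        for d in range(0, min(max_disp, x - half) + 1):
            cand = right_row[x - d - half : x - d + half + 1]
            cost = np.sum((patch - cand) ** 2)
            if cost < best_cost:
                best, best_cost = d, cost
        disp[x] = best
    return disp

right = np.array([0, 0, 9, 0, 0, 0, 0, 0], dtype=float)
left = np.roll(right, 2)      # scene feature shifted 2 px between views
d = disparity_1d(left, right)
print(d)
```

given disparity d, focal length f, and baseline B between the two cameras, depth follows from similar triangles: Z = f * B / d (larger disparity = closer object).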

#############################
####  (5.6)  Photosynth  ####
#############################

- go beyond panoramas
- photo tourism (aka photosynth)
-- scene reconstruction of 3D geometry from many photos
--- relative camera positions, orientations, focal length of cameras, point cloud, correspondence of feature points
--- scene construction process:
---- 1. feature detection (e.g. SIFT)
---- 2. pairwise feature matching (with e.g. RANSAC to reject outliers)
---- 3. correspondence estimate
---- 4. incremental structure from motion (pretty much the same fundamental logic as panorama)

(a must see impressive video)  https://www.youtube.com/watch?v=q957FNyq6bw

good website to play around with the topic.   https://photosynth.net

#####################################
####   (6.1)  Video Processing   ####
#####################################

video : a stack of images displayed sequentially over time.

- aspect ratio
- frame rate
- codec/compression algo

# persistence of vision

- the human visual system perceives a sequence of frames as flicker-less, smooth, continuous motion
- specifically at more than about 24 frames per second

the same image processing techniques (filtering, feature detection and matching, feature tracking/registration/blending/morphing/warping) apply to video; we just add another dimension: time.

##################################
####  (6.2)  Video Textures   ####
##################################

say you have a 10-second video of, for example, a candle flame. it trembles a bit. now say you loop the video, so you have a continuous, infinite-duration video of the candle. but when you stitch the end back to the beginning, it will flicker, which feels unnatural. video textures are the technique that makes that transition seamless.

how ?
- suppose you have 90 frames. for each frame, compute a similarity (distance) metric to every other frame. this tells you which frame to jump to in order to maintain a flicker-less loop.

what similarity metrics ?
- L2 norm : the square root of the sum of squared per-pixel differences between two frames (euclidean distance)
- L1 norm : the sum of abs(p1 - p1') per pixel between two frames, aka manhattan distance
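
a sketch of the all-pairs L2 distance matrix between frames (toy 2x2 "frames"; the vectorized identity ||a-b||^2 = ||a||^2 + ||b||^2 - 2a.b avoids a double loop):

```python
import numpy as np

def frame_distances(frames):
    """All-pairs L2 distance: D[i, j] is the euclidean distance
    between the pixels of frame i and frame j."""
    F = np.asarray(frames, dtype=np.float64).reshape(len(frames), -1)
    sq = (F ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * F @ F.T
    return np.sqrt(np.maximum(D2, 0.0))   # clamp tiny negatives

frames = [np.full((2, 2), v, dtype=float) for v in (0, 10, 10.5)]
D = frame_distances(frames)
print(D.round(1))   # frames 1 and 2 are nearly identical, 0 is far
```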

which way to jump ?
- the possible jumps between similar frames can be represented as a markov chain

how to preserve dynamics with transitions
- for certain videos (e.g. a swinging pendulum), you break the dynamics if you simply jump based on per-frame similarity.
- thus you basically have to match short subsequences rather than single frames, and that can be coded into your markov model
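
a sketch of turning the distance matrix into jump probabilities, following the exponential mapping from the original Video Textures formulation (Schödl et al.); sigma is a user-chosen tradeoff parameter, and the toy D below is illustration only:

```python
import numpy as np

def transition_probs(D, sigma=1.0):
    """Markov-chain jump probabilities: P[i, j] ∝ exp(-D[i+1, j] / sigma),
    i.e. jumping from frame i to frame j is likely when frame j looks
    like frame i's natural successor i+1."""
    n = len(D)
    P = np.zeros((n - 1, n))
    for i in range(n - 1):
        P[i] = np.exp(-D[i + 1] / sigma)
    return P / P.sum(axis=1, keepdims=True)   # normalize each row

D = np.array([[0.0, 5.0, 1.0],
              [5.0, 0.0, 4.0],
              [1.0, 4.0, 0.0]])
P = transition_probs(D)
print(P.round(2))   # each row sums to 1
```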

- all of this can be applied to full-motion video as well

video portrait  (view morphing)
- you can apply this looping to "stereo" video !

video sprites
- you can merge a moving object into a different background. e.g. collect video of a hamster running in all directions, then create a video where the hamster keeps running along the edge of a circle.

cliplets / cinemagrams
- playing only parts of a video (like letting one person walk while freezing the rest of the crowd) to highlight or emphasize.

#######################################
####  (6.3)  Video Stabilization   ####
#######################################

- removing (stabilizing) the shake and jitter as a post-processing step.

##
##  pre-processing stabilization  (not the scope of this lecture)
##
- optical / in-camera stabilization
-- sensor shift
-- floating lens (electromagnets)
-- accelerometer + gyro
-- high frequency perturbations (small buffer)

##
##  post processing stabilization
##
- removes low frequency perturbations (large buffers)
- distributed backend processing (cloud computing)
- can be applied to any camera, any video.

- main steps
(1) estimate camera motion
-- find corners (i.e. points with high gradient in both x & y) and track them across frames
-- background VS foreground motion: we want the background; there is a weighting algorithm for that.
-- up to 8 degrees of freedom (x & y translation, scale/rotation, skew and perspective)
(2) stabilize the camera path
-- stationary or linear displacement, parabolic path, etc. (lots of smoothing algorithms)
(3) crop and re-synthesize
-- the crop window size can adaptively change

=> create a virtual camera frame.
--- caveat: the virtual frame can deviate too much from the original camera
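
a minimal sketch of step (2) under a simplifying assumption: the per-frame camera x-translation has already been estimated (step 1), and we smooth it with a moving average. real systems fit stationary/linear/parabolic segments with an optimization, but the idea is the same; the stabilizing correction for frame t is smoothed[t] - path[t].

```python
import numpy as np

def smooth_path(path, radius=2):
    """Moving-average smoothing of a 1D camera trajectory.
    Edge padding keeps the output the same length as the input."""
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    padded = np.pad(path, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

# jittery x-translation of a camera that is (on average) standing still
rng = np.random.default_rng(0)
path = rng.normal(0.0, 3.0, size=50)
smoothed = smooth_path(path, radius=5)
print(smoothed.std() < path.std())   # True: jitter is reduced
```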

-- rolling shutter VS global shutter
--- recall CMOS sensor uses rolling shutter

############################################
####   (6.4) Panoramic Video Textures   ####
############################################

PVT: a video that has been stitched into a single wide field of view.
- appears to play continuously and indefinitely

1. take each frame, and create a panorama (just like you do for images)
2. then separate static & dynamic regions (either manually or using some automated method)
3. then apply video texture technique to the dynamic region.
done

# video texture of dynamic region

- map a continuous diagonal slice of the input video volume to the output panorama
- restricts boundaries to frames
- shears spatial structures across time
--> can be improved with a graph "cut" algo, in addition to fade/blend.

#############################
####  (7.1) Light Field  ####
#############################

- we get an image of a scene as 2D pixels
- but the fundamental primitive raw data are the rays of light, (normally) following a straight path from the scene to the sensor.

#  what is a light field?

- at any point in a 3D real-world scene, you can put a sensor and it will capture light arriving from all directions.
- hence literally a light field.

#  7 parameters of plenoptic function

P(theta, gamma, L, t, x, y, z)  : plenoptic pixel intensity function

plenoptic roughly means "full optic" (from the latin plenus, full) -- i.e. the light field

here are the 7 params

Theta: angle 1
Gamma: angle 2
L    : wavelength (i.e. color)
t    : time
x,y,z: coordinates in 3D world

- captures the complete 3D scene --> leads to things like holographic images and video!
- think of multiple pinhole cameras: you can capture all the rays-of-light info, then generate pixels later.

###########################################
####  (7.2) projector camera systems   ####
###########################################

- basically (computationally) controlling the illumination (light source) together with the sensor/lens/camera itself.
i.e. coded exposure # controlling the light at the source

#####################################
####  (7.3)  Coded Photography   ####
#####################################

- recall epsilon photography, where you take multiple photos, each with only one parameter changed
- we do this within the camera via code
- e.g. coded aperture # controlling the light at the camera (as opposed to coded exposure, where you control the light at the source)

2016-01-17 | Category : gatech