### machine learning

CS7641 notes

the discipline existed for many decades but suddenly gaining popularity recently for 2 reasons.
(1) computation power : we now have reached a point where we can meaningfully do deep neural nets
(2) data availability : we now have tons of data available to us

##
##  books
##

other resources:  http://omscs.wikidot.com/courses:cs7641
http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html
http://www.cs.cmu.edu/~tom/NewChapters.html

##
##  math
##

(font) http://www.alanwood.net/demos/symbol.html

(ref) http://www.cs.unm.edu/~joel/NonEuclid/noneuclidean.html

euclidean geometry: the one we are intuitively familiar with. in any finite n dimension, points are designated by coordinates (x,y,z,etc) and any two points have only one line to connect/intersect them.
non-euclidean geometry: like sphere, hyperbolic geometry, where two points can be connected by multiple ways. (imagine in a sphere, you can choose two coordinates but there are many ways to connect them.)

(ref) https://en.wikipedia.org/wiki/Norm_(mathematics)

on an n dimensional euclidean space R^n, length (i.e. norm) of a vector (x, x2, x3,,, xn) is as follows

n
||x||_2 = sqrt( Σ xi^2  )        # sometimes simply denoted as |x| or ||x||
i

##
##  max f(x)    and    argmax f(x)
##   x                   x

max f(x) means the maximum f(x) value
argmax f(x) means the value of x that gives max f(x)

##############################
####   machine learning   ####
##############################

induction: you try to figure out a general rule based on sepecific examples.
deduction: you try to figure out specific examples/outcome/results based on general rule.
regression: in ML/stats context, this only means "approximation"

(1) supervised learning
- given "labeled" training data X & Y, so the model can predict Y for unseen X in the future. Y is called "label"
- fair to call it "function approximation" or "function induction"
- two kinds:
-- 1. classification: (e.g. logistic regression) trying to classify given data X into some discrete label Y
e.g. given some data(height, weight, hair color, etc), predict gender female or male.
-- 2. regression: numerical (real number, continuous value) approximation.
e.g. given some data(height, weight, hair color, etc), predict age.

(2) unsupervised learning
- you try to describe/summarize "unlabeled" data into groups/clusters.
- given many animals, you might group them into mammals, insects, fish, etc
- unsupervised learning attempts to offer "data description"

(3) reinforcement learning
- supervised learning, but delayed reward/results/signals.
-- e.g. an agent that plays chess. your agent might make many moves till it finds out it lost. but where ?
-- so it's harder than simple supervised learning where you get feedback at each step if you did it good or bad.
- so it's going to train some behavior that achieves/scores some goal well.

NOTE: input data type doesnt matter in distinguishing between (1),(2),(3)

###
###   classification learning
###
typical vocab
- instances: input
- concept: function that maps input to output
- target concept: an actual answer function
- hypothesis class: all the possible functions you are willing to entertain.
- sample: training set (input/output pairs)
- candidate: a concept you consider to be the target concept.
- testing set: input/output pairs data set which you havent used in the training.

#################################
####  (1.1) decision trees   ####
#################################

- a typical classification algorithm.
suppose you are trying to classify the race of a given person.
- nodes: a question (e.g. what eye color ?  speak english ?)
- edges: answers/attributes/values (blue, red, green, etc,  yes,no )
- leaf nodes: output(asian, african, caucasian, etc)

NOTE: you start from the root node.

##  (from Mitchell book)
##  decision tree is suitable for problems with the following characteristics
##
- discrete data (but works for continuous data if we bin them)
- requirement of expressiveness. DT can cover conjunction(&&), disjunction(||), xor
- training data may have errors/missing labels (DT is robust for these)

##
##  decision trees: expressiveness
##
- boolean:
e.g.  A && B

A
t/ \f     # t=true, f=false
B   -
t/ \f
+   -

===> because A and B are communitive, you can swap them in the tree, and it still works.

e.g.  A || B

A
t/ \f     # t=true, f=false
+   B
t/ \f
+   -

==> again, A and B are communitive in this case, so they are swappable.

e.g.  A xor B

A
t/   \f
B     B
t/ \f t/ \f
-  +  +   -

===> you need more nodes. in fact, depending on what you want to express with decision trees, you get diff complexity.
===> the first two are essentially linear O(n), and xor is O(2^n)
===> so this means decision tree is expressive (big hypothesis class), but also you need to be careful about how to search among it.

NOTE: this is a fundamental problem in ML. XOR is an example of a function that is not linearly separable.

##
##  ID3 algorithm
##

- an example of top-down decision tree algo

loop:
1.  A = best attribute
2.  assign A as decision attribute for node
3.  for each value of A, create a descendant (edge)
4.  sort training examples to leaves
5.  if examples perfectly classified, stop. else itereate over leaves.

===> what exactly is "best attribute" ?
there are a few well established methods. e.g. entropy, gini index, correlation, etc
the most common method is information gain (i.e. randomness reduction, entropy minimization)
given a set of examples S and an attribute A,

|Sv|
gain(S,A) = entropy(S) -   Σ   ------ * entropy(Sv)     #  V(A) denotes a set of possible values(v1,v2,,vk) for A
v∈V(A)  |S|                    #  Sv is a subset of S where attribute A has a value v

==> what the summation part is doing is computing the expected(or average) entropy you have over each particular example.
(definition): entropy is a measure of the expected encoding bit length.
e.g. to encode weather Sunny or Rain (two choices), you only need 1 bit
e.g. to encode N labels, you need log_2(N) bits
==> we will revisit this later.

entropy(S) = -Σ p(v) log_2 p(v)      # p(v) = proportion of v in S.
v                      # e.g. if S consists of 14 samples each labeled some color, and 5 of them are Blue, then p(Blue) = 5/14
#      if 3 of them are Green, then p(Green) = 3/14

note: the above method easily extends to continuous value attributes  http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15381-s06/www/DTs.pdf
e.g. we will simply choose values to split on (like if x1 < 0.5 or not) based on max info gain (aka IG)
there are also ways to extend to covering continuous value "labels" but it's harder and not the common use of DT

##
##  ID3: bias
##

restriction bias: which hypothesis set "H" you care about. (our H = all possible decision trees, in this case, not some numerical equations)
preference bias: aka inductive bias: what sort of individual hypothesis is prefered within the H you care about. aka how an algo prefers certain solutions over others.

inductive bias of ID3 algo: (i.e. what ID3 algo prefers)
-- good splits at top      # obviously you want the first (or nodes near the root) node to split the H in half. (most efficient decision tree'ing)
-- correct over incorrect  # obviously you want a decision tree that gives good prediction
-- shorter trees (implied by the first point)

##
##  decision tree: other considerations
##
1. does it make sense to repeat an attribute ?

e.g.
A
/ \
B
/ \
A        # NO, because that's zero information gain (at least for discrete value attribute)

2. continuous attributes?   e.g. weight, distance, age
-- range:

age
/  |  \    # e.g. 3 way split:  below 20, in 20, over 30
A   B   C

note: maybe later under node A, you might wanna use "age" attribute again, spliting within below-age-20 range, like age 0-10, 11-19, then it's valid.

3. when do we stop the loop ?
- everything classified correctly (not a good idea. sometimes impossible, also prone to overfitting)
- no more attribuets?
- cross validation (train and test. if performance is good on test data)
- prune (one way to deal with overfitting)

4. using decision tree for regression ?
- yes, possible. we revisit this later.

###############################
###  (1.2)    Regression    ###
###############################

regression == function approximation

define the error function in terms of x & y then all you need to do is find the minima/maxima

f(x) = c0 + c1*x + c2*x^2 + c3*x^3 + ...       # c = coefficients. aka weights

- constant regression   e.g. f(x) = c    # the mean is the least squared error value
- linear regression   # degree 1
- polynomial regression  # Nth degree polynomial

##
##  Overfitting
##
- for in-sample data, as you increase degree of freedom (polynomial), then error decreses.

e |
r | __
r |   \_
o |     \_
r |       \
|        -------
-------------------
degree of freedom

===> now, think about out-of-sample data, it's likely that to some extent, increasing the decgree of freedom helps.
but, after a threshold, its error increases. past that inflection point is called "overfitting" zone

e |
r | __          /
r |   \_      _/
o |     \_  _/
r |       --
|
-------------------
degree of freedom

"overfitting" definition: in-sample error decreasing while out-of-sample error increasing
"underfitting" : over-generalization: like too-low degree of freedom, like f(x) = mean(x)

# errors

- training data may have errors. so we may be modeling f()+e, not f()
- where does error come from ?
-- sensor error
-- maliciously. people may lie when answering questions, etc.
-- transcription error
-- unmodeled influence. X:size, Y:price for house is obviously an example, because house price is influenced by other factors/attributes such as location.

#  cross validation

- you train models using training data set
- you test models using test data set
- you may arbitrarily split the data chunk into 80/20 as train/test (with/without replacement)
- obviously, if you deal with chronology-sensitive time-series data, like stock price, you cannot train on future data and test on past data.

#  IID: independently & identically distributed

- statisticians/ML scientists like to assume/believe IID when training/cross-validating ML models
- fundamental assumption of many supervised learning algos. the data we use for train/test is representative of the world.

#  representation for regression

- how to represent discrete input as numerical quantity ?
- well if it's binary, like gender male/female, it's easy: 1 or 0
- if it's skin color, then mapping arbitrary number for each color doesn't work.
e.g. brown=0, black=1, white=2, yellow=4, etc because if we decide a positive coefficient, then more yellow means a bigger value for Y ?
-- so we have to get smarter about the representation. one way is to create many binary attributes from one discrete attribute, like skin white 0|1, brown 0|1, orange 0|1, so they can have independent coefficients.
(called one hot encoding https://en.wikipedia.org/wiki/One-hot )

####################################
####   (1.3)  Neural Networks   ####
####################################

in general, unlike human brain where neurons die and born all the time, neural network in ML is considered a pre fixed architecture, the number of layers and nodes and all. so if you re-configure it, then you have to re-train everything.

###
###  perceptron
###

a neuron unit called perceptron, first proposed in the 50s, can be perceived as follows:
- it accepts signals (vector X), and corresponding weights(vector w)
- computes the inner product (we call it activation), and if it's bigger than the threshold Θ, then it fires a signal, as in y=1, otherwise no action, as in y=0

k
activation a = Σ xi*wi
i

if  a >=  Θ , then y=1, oterwise y=0

it is a linear function.

we can use a perceptron unit to implement boolean functions. (by configuring weights and threshold Θ)

(NOT function) https://www.youtube.com/watch?v=pE5SPhHzk5c       # now we are increasing dimension
https://www.youtube.com/watch?v=dPzpnAEe1hg       # so it's non-linear transformation in a way

==> so we can represent anything by combining the above.

###
###  perceptron training
###

- given examples, find weights that map inputs to outputs
- well we should consider threshold Θ too, but we just set Θ to some number, and only adjust weights, so math gets simpler.

(1)  perceptron rule                    # threshold based output
(2)  gradient descent (aka delta rule)  # non threasholded output

###
###  perceptron rule
###

given many examples of x1,x2,x3,,,xn y, you want to decide weights w, so you can apply w to unseen x vector, to (hopefully accurately) predict y.

in this case, assume y = 0|1

here is the perceptron rule:
- decide some init values for weights.

input     output
x1,x2,,,,xn   y     # example data 1
x1,x2,,,,xn   y     # example data 2
x1,x2,,,,xn   y     # example data 3
x1,x2,,,,xn   y     # example data 4
-----------
w1,w2,,,,wn         # weights

y^ = ( Σ(wi*xi) >= Θ )    # y^ is just the predicted y value(0 or 1) based on threshold theta Θ, using the current weights for a given input vector x
i

==>  here is the trick: we wanna treat threshold Θ = 0 zero  to simplify math
==>  just add a nagative Θ, in the weight vector, and we set the corresponding input for that to be always 1, as below.
(that 1 is called bias)

bias    input   output
1,x1,x2,,,,xn   y    # example data 1
1,x1,x2,,,,xn   y    # example data 2
1,x1,x2,,,,xn   y    # example data 3
1,x1,x2,,,,xn   y    # example data 4
---------------
-Θ,w1,w2,,,,wn         # weights

repeat:
y^ = ( Σ(wi*xi) >= 0 )    # now replaced Θ theta with 0 zero
i

Δwi = R * (y-y^) * xi     # Δwi = delta wi, it's how much we adjust the existing wi, at each iteration of perceptrol rule based training
# take the difference y-y^, and multiply by xi * R
# recall y = 0|1 binary, so
# if y == y^ , Δwi = 0, no weight adjustment needed
# if y=1, y^=0, then weights need to get bigger, hence R * 1 * xi
# if y=0, y^=1, then weights need to get smaller, hence R* -1 * xi
# R = learning rate   # if small R, then it learns slowly, if big R, then it magnifies the adjustment val Δwi
wi = wi + Δwi             # and finally, adjust the weights w
stop if y == y^ for all x # i.e. Δwi == 0

===> simple yet powerful. (trivial to implement in code)
===> visualizing in a plane below, separating y=0 from y=1

|\    x    x
| \     x   x
|  \  x
|o  \     x
|  o \ x   x
| o  o\
|----------------

===> if there is a line that separates y=0 from y=1, then we call the data "linearly separable"
===> if the data is linearly separable, perceptron rule will find a right w in finite iterations.
===> if not liniearly separable, then it can run forever (bad), so we should set some condition like if Δw is small enough, then complete. or after K iterations, complete. etc

###
###

- we wanna have an algo that is more robust to non linear-separability.

a = Σ(wi*xi)    #  a = activation
i

==> instead of computing a threasholded value y^ = {a > 0} = 1 or 0, we just try to adjust w so a becomes closer to actual y

E(w) = 1/2 Σ(y-a)^2    # E = error metric. square is just so neg,pos don't cancel each other. 1/2 is explained below.
x,y          # x,y is all x,y data we have as training data

==> so we just wanna know w that minimizes E(w)
==> just take the partial derivative of E, with respect to w

# ∂ is a greek "d" for partial "d"erivative.
# partial deriv simply means taking the derivative of a function with respect to one variable, with other vars kept constant. (as opposed to total derivative which allow all vars to vary)

∂E       ∂
---  =  --- * 1/2 *  Σ(y-a)^2    # just use the chain rule in calculus
∂wi     ∂wi         x,y          # here that 1/2 will be gone, which is why we put 1/2 in the E() definition, it gets cleaner here

=  Σ(y-a)(∂/∂wi) - Σ(wi' * xi)
x,y

=  Σ(y-a)(-xi)        # and if we set this to zero, then it turns out setting a = average of y is the best estimate
x,y

Δwi = R * (y-a) * xi       # R = learning rate
wi  = wi + Δwi

(ref) https://math.stackexchange.com/questions/1352726/minimizing-a-function-sum-of-squares
(ref) https://en.wikipedia.org/wiki/Minimum_mean_square_error

###
###  comparison of learning rules:  perceptron VS gradient descenet
###

perceptron rule:  Δwi = R * (y-y^) * xi    # guarantee finite convergence, but ONLY in linear separable data
gradient descent rule:  Δwi = R * (y-a)  * xi    # robust to non lin-sep data, but converge to local optimum

(quiz) why not do gradient descent on y^  ?   -- cos it's not differentiable (y^ not continuous)

==> what if we can make "differentiable" threshold ?

###
###  Sigmoid         # sig = sigma,  moid = like
###

σ : lower case sigma

let's define a sigmoid function:  σ(a) =  1 / (1 + e^-a)       # a = activation

a -> -∞ ,  σ(a) -> 0     # as a goes negative infinity, σ(a) goes zero
a -> +∞ ,  σ(a) -> 1     # as a goes positive infinity, σ(a) goes one

===> now the thresholded value is not an abrupt 0 or 1, but a nice continuous differentiable function.

let's take the derivative of this Sigmoid function.

Dσ(a) = σ(a)(1-σ(a))      # D = derivative.  yes, this is elegantly simple

###
###  Neural Network Sketch
###

input layer : x vector
hidden layer: sigmoid units
output      : y

x1\
x2--  s1 - s4 - s8\
x3--  s2 - s5 - . --- y
.  /- s3 - s6 - sn/
.  /     \ s7
xn/

==>  the beauty of this is the mapping from input to output is differentiable in terms of weights, despite the non linearity of hidden layer.

"back propagation" : computationally beneficial organization of the chain rule.
--> it's just saying we compute output with x and weights, and propagate the error back toward the input, to update the weights.

NOTE: this suffers many local optima. also no guarantee of finite time convergence.

in a way, you can consider NN == representation learning

###
###  optimizing weights
###

- gradient descent suffers local optima/minima
-- momentum: don't stop at a minima, keep going/searching for other minima
-- higher order derivatives
-- randomized optimization: to be revisited
-- penalty for "complexity" e.g. regression overfitting, e.g. large tree overfitting

complexity in network:
- more nodes
- more layers
- large numbers (i.e. magnitude for weights), like a weight of 27582409238 is more complex than a weight of 1585

in a way, optimization = learning

###
###  Restriction Bias
###
- gives you a restricted set of hypothesis(bias) you are willing to consider, to inform the representational power of the structure you are using (in this case, nueral networks)
- tells you what you can/cannot represent.

perceptron: half spaces by threshold
sigmoids : much more complex, not much restriction
boolean: network of threshold-like units
continous function: hiddden layer
arbitrary function: stitch together all of the above.

==> you can represent any function ultimately, but unbounded complexity leads to overfitting.
so usually, you bound it by fixing the max number of layers, nodes, magnitude of weights.
and do the usual cross-validation (train VS test) to stop where you need to stop.

###
###  Preference Bias
###
- tells you about algorithm to select a representation over others.
- what algorithm?
e.g. how do you decide initial weights ?
- small random values
-- small is preferred to big weights (less complexity, as super big weight might help you overfit)
-- randomizing init weights help you avoid getting stuck in one same local minima. "variability"

"Occam's Razor" : entities should not be multiplied(made more complex) unnecessarily.
i.e. simplicity is preferred to complexity

#############################################
#####   (1.4) Instance Based Learning    ####
#############################################

- another paradigm to approach supervised learning.
- just store all instances in some DB, and so your regression result is f(x) = lookup(x)
- aka "non-parametric"

###
###  KNN : K nearest neighbor  (instance based regression)
###

very simple.

we keep data points (i.e. store all known instances), and use them when we predict. (aka data centric, instance based approach)
given a new X, you find K nearest known Xs in your collection of data points, then take the mean of their Y values.

if we weigh each Y based on the distance of the Xs to your target X, then it is called "Kernel regression"

note: what defines the "nearest" "distance" ?  -- it's really conceptually "distance" == "similarity"

---------------------------------------------
Given: training data D = {xi,yi}
number of neighbors k        # domain knowledge
query point q
distance metric d(q,x)       # domain knowledge

NN = {i: d(q,xi) k smallest}    # NN is a set of k smallest(=nearest) neighbors

return
- classification   : vote (most frequent, or random) yi ∈ NN    # flexibility in implementing "vote"
- regression       : mean of yi ∈ NN                            # flexibility in implementing "mean" like weighted mean, etc
---------------------------------------------

note: in practice, you can tweak the logic a bit. e.g. what if you set k=5, but you find 8 equally nearest neighbors? then maybe get all 8 of them, so you took "k or greater than k nearest neighbors"

|  running time  |  space
---------------------------------------------------
1-NN  learn|  O(1)          |  O(n)
query|  O(log(n))     |  O(1)
K-NN  learn|  O(1)          |  O(n)
query|  O(log(n) + k) |  O(1)
linear regression  learn|  O(n)          |  O(1)
query|  O(1)          |  O(1)

###
###  parametric VS non-parametric regression
###

parametric:  # linear regression
- no need to store data, space efficient,
- but new evidence comes then you have to re-do the whole training
- thus training is slow, but query is fast   # faster than BDD

non-parametric:  # KNN
- need to store data, space consuming
- but able to quickly add new evidence
- training is fast (no need to learn new parameters), query can be "potentially" slow
- suitable for complex problems where we are not sure of the underlying mathematical model

###
###  KNN bias (assumption)
###
- preference bias = our belief about what makes a good hypothesis   # i.e. the ways we express our preference across various hypotheses
-- locality: near points are similar
-- smoothness: averaging
-- all features matter equally

###
###  curse of dimensionality
###
- as the number of features|dimensions grows, the amount of data we need to generalize accurately grows exponentially.
- applies to ML in general, not just to KNN method

O(n^d)

##
## to recap
##
- distance metric  d(q,x)
-- e.g. euclidean, manhattan, weighted version, mismatch (if it's discrete instances like color)
- k
-- how to determine k value (you kinda have to know someting about the data)
-- how to process k chosen instances. weighted average? locally weighted linear regression?

note: if you do "locally weighted linear regression" then you do it different k nearest neighbors across your whole universe of instances, then when you stitch together the approximated functions from each local neighbor, then you get a complex yet effective good approximate function.
this is conceptually significant, becaure you went from   H ∈ lines   to    lines ∈ H      # H = hypothesis space

#################################################
#####  (1.5)  Ensemble Learning : Boosting   ####
#################################################

ensemble learning - a class of learning algos.
e.g.
- bagging
- boosting

(1) learn a (relatively simple) rule over a subset of data
learn another rule over another subset of data
repeat
(2) combine them in "some" way to make one effective/complex rule

# note: if you take the whole data to learn a rule, it cannot produce a simple rule, instead it might do something complex which causes overfitting. (more on this later)

# important questions:            | "bagging"          | "boosting"       |
- Q. how do we decide a subset ?  | uniformly randomly | hardest example  |
- Q. how do we combine ?          | average            | weighted average |

###
###  bagging
###
- suppose you have 20 data points
- you take 5 subsets, each subset has 5 uniformly randomly chosen data points
- for each subset, you do 3rd order polynomial regression training (so you get five models as a result)
- then for any given test data point, you do the average of the output from each model.
(called "bagging")
===> it has been proven that this performs better (on test data) than a single 4th order polynomial which overfits more.

###
###  terminology
###

"error": essentially mismatch. but we wanna define this more formally.

E = Σ(y-y^)^2              # y = correct label, y^ = prediction
RMSE = sqrt(Σ(y-y^)^2 / N)

==> however, it is not good enough. see below.

e.g.

suppose you get 4 test data points: x1 x2 x3 x4  (and associated y1 y2 y3 y4)
and your model predicts Y correctly for x1 and x3
==> we may say it's 50%, or we may calc RMSE
==> suppose the following frequency of data.
x1 : 20%
x2 : 5%
x3 : 70%
x4 : 5%

==> then getting x1 & x3 means getting 90% right, as opposed to 50% or whatever RMSE you calculated. so we address data distrib D

Pr_D[h(x) != c(x)]     # given distribution (= D) of data, probability of hypothesis (= y^) not matching concept (= y)
# i.e. error rate that takes distribution of data into account

"weak learner" : learner that does better than chance (= 50%) regardless of distribution

∀D, Pr_D[h(x) != c(x)] <= 1/2 - ϵ    # ∀D means for all D
# ϵ = epsilon, some small number

e.g. your learner may consists of multiple rules/hypothesis, and given the input data set, x1 x2 .. xn, if you can find some distrib such that it does not beat 50% error rate, then it is not a weak learner.

###
###  boosting
###

(1) given a training set {xi,yi}  where yi in {-1,+1}    # binary classification
(2) for t = 1 to T
- construct Dt      # distribution at a given time t
- find weak classifier h_t(x) with small error ϵ_t = Pr_Dt[h_t(x) != yi]   #   0 < ϵ_t <= 1/2
- output H_final

e.g.
suppose we have n examples  x1 x2 x3 ... xn
D_1(i) = 1/n       # uniform distrib (because at this point, you have no notion of which example is harder/important over the others)
# obviously, Σ Dt(i) = 1
i
D_t+1(i) = [ D_t(i) * e^(-α_t * yi * h_t(xi)) ] / Zt     # note h_t(xi) will predict 1 or -1
# notice if h_t(xi) == yi, then their product will be 1
where α_t = 1/2 * ln( (1- ϵ_t) / ϵ_t )                   # alpha is always a positive value. the smaller error ϵ_t, the bigger α_t, and we use α_t to control Zt
where Zt = whatever normalization constant at time t, which you need to define to ensure Σ D_t+1(i) = 1
i
Zt = Σ [ Dt(i) * e(-α_t * yi * h_t(xi)) ]
i

==> so you reduce the distrib for an example if your classifier gets it correct
i.e. you increase the distrib for an example if your classifier gets it wrong

==> reduce/increase by how much ? that's exactly where your ϵ_t comes into the picture.
the examples you get with bigger error rate gets increased in distrib more.

H_final(x) = sgn(Σ α_t * h_t(x))       # sgn(x) is just a sign function that extracts the sign of the input real number
t                     # so you get an overall consensus, basically weighting smaller error (more correct) hypothesis bigger
# because α_t is bigger for hypothesis with smaller error
# in other words, α_t is a confidence indicator. (to be revisited)

(paper that explains α_t logic) https://www.cs.princeton.edu/courses/archive/spring07/cos424/papers/boosting-survey.pdf
in short, α_t is the way it is (we are only considering binary classification, for real value estimation, α_t gets more complicated, see the paper) because it has been mathematically proven to provide a very quick and effective learning. (assigning appropriate weighting to the errorneous predicted instances)

suppose you have three examples (x1 x2 x3). again we assume binary classification, i.e. yi in {-1,+1}
at the first iteration, lets say you got x1 & x2 right, and x3 wrong. # 66.6% correct, so that's better than chance
at the second iteration, because you increased the weight for x3, you get x3 correct, and at least of one of x1 or x2 correct, to get over 50% correct. lets say you got x1 wrong, and x2 correct.
at the third iteration, your weights on x3 & x1 are far bigger than x2, so you will x1 & x3 right, and may or may not get x2 right. but as you can see, your error is decreasing. because your distrib on x1 & x3 have been increased while x2 weight has been decreased than prev iterations, and you got x1 & x3 right.

##
##  error VS confidence
##

we tend to analyze learning performance in terms of error, but also we should take advantage of confidence info.
e.g.
in KNN, if all K neighbors agree on one answer, then it's more confident than only some of K voting on the answer.
in Boosting, α_t is confidence indicator
also, for other algo, we can repeat and measure variance.

###
###  boosting & overfitting
###
- boosting is not very prone to overfitting. at least not in traditional sense where error on training data decreases while error on test data increasing.
(to be discussed in the next section on SVM)

############################################
####   (1.6) Support Vector Machines    ####
############################################

SVM is a classifier algorithm, trained based on labeled data sets (i.e. supervised learning) that classifies examples by finding an optimal hyperplane that linearly separates examples.

here using a 2D plane, trying to linearly separate o & +, then you can draw a line close to the cluster of "+", or the cluster of "o", but it feels more generalized if you can draw a line that is right in between, i.e. as far away from the data as possible while remaining of course consistent with the data.
in fact, the more optimally "righ in the middle" you can draw the line, it is more generalized and reduces overfitting.

|    + + +  +
| o     + + + ++
| o o    ++  ++
| ooo o    + +
|o o oo o    +
|__________________

To formalize with algebra, let's generalize to hyperplane.  # but to keep intuitively thinking about this in the above 2D plane, you can substitute hyperplane as "line"

given labelled training set (x1,y1),(x2,y2),,,(xn,yn)

y = w^T * x + b   # y = label   # we assume yi in {-1,+1}  in this case
# x = training data/examples
# w,b = parameter of the plane.  w = weight vector, b = bias
# w^T is just w tranpose so you can get the inner product of w & x vectors
# to be precise, you may write this as   yi = w^T * xi + b

so the hyperplane (the line right in between, furtherst from both clusters in the above 2D visual) should be:  w^T * x + b = 0
==> so (again, speaking in 2D for the sake of intuitive example), anything on one side will be classified as "+", and the other side as "o"
the further away that hyperplane is from the training data examples, the more "generalized" we consider the classifier is.

to define that optimal hyperplane(or line in 2D), you need to define hyperplanes (or two parallel lines) at the boundary of training examples.
for the maximum generalized-ness of the classifier(optimal hyperplane), we wanna find the two boundary hyperplanes that maximize the distance(we call "margin") between them, i.e. giving the most generalized (optimal) hyperplane.
the reason for two parallel lines is because otherwise you cannot define the middle line that stays the furthest from both boundary lines.

the hyperplane that is:
- the closest to the border of the "+" cluster will have   w^T * x + b = +1      # this is the definition
- the closest to the border of the "o" cluster will have   w^T * x + b = -1      # you just have to accept them

(ref) http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html
(ref) https://nlp.stanford.edu/IR-book/html/htmledition/support-vector-machines-the-linearly-separable-case-1.html
(ref) http://web.mit.edu/6.034/wwwbob/svm.pdf    # idiot's guide to SVM

Definition:   # from the ref
define the hyperplanes H such that
w^T * xi + b <= -1   when yi = -1
w^T * xi + b >= +1   when yi = +1

H1 and H2 are the planes:    # two boundary planes for us to find the middle plane
H1 : w^T * xi + b = +1       # sometimes (like Tom Mitchell)
H2 : w^T * xi + b = -1       # we use -a +a, instead of -1 +1    # a is for arbitrary

(ref) https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kernels_SVM2_04_12_2011-ann.pdf

H0 is the (optimal) hyperplane in between: w^T * xi + b = 0

2
the distance (called "margin") between H1 & H2 is:  -------  =  2 / ||w||     # or 2a / ||w||
||w||

+   + ++  + ++
+ ++  +    +  +
+  +  +  + +    +
----+--------------------H1     ---   # here, examples on H1, H2 are called "support vectors"
|
-------------------------H0      | margin = 2 / ||w||
|
-----o------o------------H2     ---
o       o  o  o
oo  o   o   o
o  o o     o

# note : how we get 2 / ||w||

this is nothing but just linear algebra in high dimension.

what is the distance bewteen a given point and a line?

a point:  (m,n)
a line :  B*y = A*x + c

|Am + Bn + c|
distance d =  ---------------
sprt(A^2 + B^2)    # L2 norm

(ref) https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line
(ref) http://www.intmath.com/plane-analytic-geometry/perpendicular-distance-point-line.php   # very good
(ref) https://en.wikipedia.org/wiki/Norm_(mathematics)

e.g.

given a line:  y = A*x + C
what is the distance to the origin (0,0)

|c|
d = -----
||A||

(ref) http://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf

###
###  Support Vector Machines
###

maximize the margin 2 / ||w||  while classifying every training example correctly.

note: "classifying every training example correctly" can be written as

w^T * xi + b <= -1   when yi = -1
w^T * xi + b >= +1   when yi = +1

==> that can be simplified by multiplying by yi, as below.

yi * (w^T * xi + b) >= 1   ∀i      #  ∀i means for all i

also, maximizing 2 / ||w|| means minimizing ||w||
so we can re-write the entire statement as follows:

minimize ||w||  while  yi * (w^T * xi + b) >= 1   ∀i

mathematically, there is a technique called quadratic programming from linear algebra that can solve this minimization(optimization) problem. (but we don't go into the details in this course)

maximize Q(α) =  ∑αi - 1/2 * ∑ αi * αj * yi * yj * xi^T * xj      # where i = training examples,  j = test examples
i           ij

s.t. αi >= 0  and  ∑ (αi * yi) = 0       # s.t. means subject to
i

note: α is just a new set of parameters we use for quadratic programming

note:
xi^T * xj  is just a dot product, a number that shows the length of one vector projected onto the other.

in general, given two vectors xi,xj      # again, for our purpose, i = training examples,  j = test examples

- if they are orthogonal (i.e. perpendicular), then xi^T * xj = 0             # dissimilar
- if they point to the same direction (i.e. parallel), then xi^T * xj = big   # veeery similar
- if they point to the opposite direction, then xi^T * xj = negative number   # veeery dissimilar

===> this intuition that the dot product (aka inner product) provides some measure of "similarity" is the key

(ref) http://web.mit.edu/6.034/wwwbob/svm.pdf  # page 17 ~, excellent visuals

there are 2 cases:
case 1: xi,xj are dissimilar, then their dot product is 0, so they don't contribute to Q(α)
case 2: xi,xj are similar
- and classify the same class: decrease Q(α)  because the algo downgrades similar vectors that makes the same prediction
- and classify the diff class: increase Q(α)  because the algo wants to maximize Q(α) for similar vectors that tell the two classes apart. i.e. maximizing the margin.

right now, we are just using  xi^T * xj  as our similarity measurement, but as we will see below, for not linearly separable problems, we will use a transform function φ() to make xi,xj linearly separable. (called kernel trick)
so our similiarity measurement will change
from : xi^T * xj
to : φ(xi)^T *  φ(xj)     #  we simply re-write this as K(xi,xj)    # K = Kernel

note: the definition of φ() is different problem-by-problem.

###
###  properties of SVM
###

1.  w = ∑ (αi * yi * xi)    # once w is obtained, b is easy to obtain
i

2.  αi mostly 0      # most of the alpha is zero, i.e. only few xi matter

==> only xi that are support vectors matter.

###
###  non linearly separable problems
###

so we know how to find the optimal hyperplane for linearly separable problem.

e.g.
++  +
oo    + +
o oo   +  +
o oo    +
ooo

how about not linearly separable problems?

e.g.
++  +
oo    + + o
o oo   + o+
o oo    +
ooo

===> we can still come up with a way to define outliers, and find the hyperplane that seprates examples with some notion of minimized error.

then, how about the example below?

ooooo
o +++ o
o +++++ o
o +++ o
ooooo

we use "kernel trick" (aka kernel function) to transform data points to another space (often higher dimension) where it becomes linearly separable.

in this case with the above visual example, we kind of wanna draw a circle to separate o from +
a circle is where x^2 + y^2 + b = d

now let's design our kernel function for this problem.

given a point q = (q1,q2)
φ(q) = <q1^2, q2^2, sprt(2)*q1*q2>    # φ() is just Phi, a transform function

e.g.

ooooo            φ(o)        φ(+)φ(+)
o +++ o            φ(o)        φ(+) φ(+)
o +++++ o    ===>     φ(o)       φ(+)
o +++ o             φ(o) φ(o)      φ(+)
ooooo               φ(o)         φ(+)

given two points: x and y     # i.e. x = (x1,x2), y = (y1,y2)    # previously, we used xi,xj instead of x,y
φ(x)^T * φ(y) = <x1^2, x2^2, sqrt(2) * x1 * x2>^T * <y1^2, y2^2, sqrt(2) * y1 * y2>
= x1^2 * y1^2 + x2^2 * y2^2 + 2 * x1 * x2 * y1 * y2
= (x1 * y1 + x2 * y2)^2
= (x^T * y)^2            # yes, this is essentially an equation for circle
= K(x,y)                 # we write this as K(x,y)

note: so we got  K(x,y) = φ(x)^T * φ(y)
= (x^T * y)^2

===>  as long as we have K(x,y) = (x^T * y)^2 , we don't even need to know|compute φ()  # in fact φ() is often unknown, expensive to compute
this is the neat aspect of kernel trick

what makes a good kernel function K()?  well that's where you inject your "domain knowledge" of what constitutes "similarity" that separates + and o

e.g.

K = x^T * y
K = (x^T * y)^2
K = (x^T * y + c)^p               # p for polynomial
K = exp(-1 * (||x-y||^2 / 2σ) )   # radial basis function (gaussians)
K = tanh(β * x^T * y - σ)         # sigmoid

note: SVM with a radial basis function has infinite model capacity, so it can overfit.

recall the SVM problem in quadratic programming form:

maximize Q(α) =  ∑αi - 1/2 * ∑ αi * αj * yi * yj * xi^T * xj
i           ij

we can re-write it as:

maximize Q(α) =  ∑αi - 1/2 * ∑ αi * αj * yi * yj * K(xi^T, xj)
i           ij

note: kernels can be defined over almost anything. not just vectors. but also strings, trees, structures, etc etc
===> extremely powerful
===> SVM is used to compare DNA, sentence structures, etc

"Mercer Condition": well behaved distance(or similarity) function

###
###  Boosting revisited
###

- recall we claimed boosting is not volunerable to overfitting (at least not in traditional sense, where error on training data decreases while error on test data starts increasing)
- recall our boosting algo, the final hypothesis

H_final(x) = sgn(Σ α_t * h_t(x))       # sgn(x) is just a sign function that extracts the sign of the input real number
t                     # so you get an overall consensus, basically weighting smaller error (more correct) hypothesis bigger

--> look at the inside of sgn()

Σ α_t * h_t(x)        # α_t is bigger for h_t() with smaller error
t                     # so α_t can be seen as confidence indicator

suppose you have H_final(x) that predicts everything correctly.
among the examples correctly predicted, there are high confidence ones, and not-so-confident ones.
suppose your boosting learner prediction is binary classification of "+" or "o"
e.g.

o oo  o     o o  +  +   ++ +++ +
|----------------|---------------|
confident     not-confident       confident

as you keep training the boosting learner more, it assigns bigger α_t for h_t() that correctly predicts not-so-confident examples.

e.g.

oooo oo              ++++ ++++
|----------------|---------------|
confident     not-confident       confident

==> it may reduce α_t for some of already-super-confident ones but at least it increases α_t for very-unconfident examples.
and the overall error rate stays the same while increasing confidence.
and notice the distance between the most unconfident "+" and the most unconfident "o" are wider.
this can be seen as "margin" from SVM.
the bigger margin, the more generalized, the less overfitting !!
and this is an intuitive explanation of why boosting is not prone to overfitting
(it's not completely intuitive so read the paper for the formal proof)

Still, however, Boosting is known to (tend to) overfit when:
- (typically a neural network with many layers and nodes) underlying learners/hypothesis all overfit (and all perfectly predict correct) then you cannot stop them from overfitting.
- when data has pink noise (= uniform noise)      # white noise = gaussian noise

#################################################
####  (1.7)  Computational Learning Theory   ####
#################################################

- defining learning problems.
- showing specific algorithms work.
- show these problems are fundamentally hard.

theory of computing analyzes how algorithms use resources:
- time
- space
- samples/examples/data    # i.e. how much data does the algo require ?

###
### inductive learning: learning from examples, to derive general form
###

important questions in inductive learning algo.

1. probability of successful training :   1-δ  where δ is the probability of failure
2. number of examples to train on
3. complexity of hypothesis class
4. accuracy to which target concept is approximiated: ε
5. manner in which training examples are presented: batch or online(one by one) ?
6. manner in which training examples are selected

###
###  selecting training examples
###

1. learner selects x, and ask for y
2. teacher/feeder select x and tells y
3. fixed distrib: x chosen from d by nature
4. worst distrib: evil

quiz:
H: set of possible people(i.e. hypothesis)
X: set of yes|no questions

if teacher (who knows the answer) chooses X, then only 1 question needed. - "is this person XYZ(correct answer person)" then answer yes, done.
if learner (who doesn't know the answer) chooses X, then if has to pick a question like "is this person A or B or C or D or... ?" selecting half of H, then the rest is binary search. thus log_2|H|

==> but the above teacher scenario is unrealistic. usually there are some "constraints".

###
###  teacher with constrained queries
###

e.g.
X: x1,x2,,,xk  # k-bit input
H: conjunctions of literals or negation

e.g.
k = 5
h: x1 && x3 && !x5
- deduce the h value for each input X vector

x1 x2 x3 x4 x5  | h
-------------------
0  1  0  1  1  | ?     # it's 0
1  1  1  0  0  | ?     # it's 1
1  0  0  1  0  | ?     # it's 0

==> then, given the below table, how do you induce the right h ?   (could be more than one)

x1 x2 x3 x4 x5  | h
-------------------
1  0  1  1  0  | 1
0  0  0  1  0  | 1
1  1  1  1  0  | 0
1  0  1  0  0  | 0
1  0  1  1  1  | 0

==> compare the first two rows, then you see x1 and x3 are irrelevant to the output h value 1

h:   positive  absent  negated
------------------------------   # this means 3^k potential hypothesis
x1:              +
x2:                       +
x3:              +
x4:      +
x5:                       +

==> so we got  h: !x2 && x4 && !x5

how many questions does teacher need to get the above answer h ?
(1) show what's irrelevant: 2
(2) show what's relevant: k

===> so the answer is  k + 2      #  linear time !

###
###  learner with constrained queries
###

e.g.
X: x1,x2,,,xk  # k-bit input
H: conjunctions of literals or negation

e.g.
k = 5

now you have to come up with input X then you can learn h.
but what X values should you ask to induce h ?

(x1,x2,x3,x4,x5) = (0,0,0,0,0)  ?
(x1,x2,x3,x4,x5) = (1,0,1,0,0)  ?
..
..

x1 x2 x3 x4 x5  | h
-------------------
?  ?  ?  ?  ?  |
?  ?  ?  ?  ?  |
?  ?  ?  ?  ?  |
?  ?  ?  ?  ?  |
..
..

==> you need to ask  O(2^k)  questions.   # there are 2^k possible input X
# exponential sample complexity is bad

==> but it's because of the harsh constraints. if we were able to ask questions like "is x1 positive? " then we could induce h more quickly, like in linear time.
so the constraints matter.

###
###  learner with mistake bounds
###

(1) input arrives            # doesnt matter how the input is selected. (randomly by learner or benevolent/malicious teacher, etc)
(2) learner guesses answer   # no need to know about hypothesis space
(3) wrong answer charged     # learner learns
(4) go to (1)

==>  bound the total number of mistakes

[candidate eliminattion algo]   # illustrated in Mitchell book
(1) assume it's possible each variable positive and negated
(2) given input, compute output
(2.5) eliminate impossible hypothesis   # essentially the same as (3)
(3) if wrong, set all positive variables that were 0 to absent, negate variables that were 1 to absent, goto (2)

e.g.

h:   positive  absent  negated
------------------------------
x1:      +                +
x2:      +                +
x3:      +                +
x4:      +                +
x5:      +                +

but if you compute output based on your current hypothesis: (x1 && !x1 && x2 && !x2 && x3 && !x3 && x4 && !x4 && x5 && !x5)
==> this will only produce 0 for any input because you are taking the AND of every variable being positive and negated.
that means, to get a wrong answer to get charged/penalized, you need an input example with the correct output 1

here is an input. (any prior input with the correct output = 0 doesn't change the H table)

x1 x2 x3 x4 x5  | h
-------------------
1  0  1  1  0  | 1      # so correct output is 1

following (2.5) logic, we can eliminate some possibility

h:   positive  absent  negated
------------------------------
x1:      +                     # given the input, x1 can still be positive or absent, but we know it's not negated
x2:                       +    # similarly update the rest
x3:      +
x4:      +
x5:                       +

actually, if you skip (2.5) and go straight to (3) then you get the below, which is essentially the same.
it says we know it's either positive|negated and/or absent, then ignore absent.

h:   positive  absent  negated
------------------------------
x1:      +       +
x2:              +        +
x3:      +       +
x4:      +       +
x5:              +        +

===> so we essentially eliminated the half of the possibile literals.
(so far this is the first "mistake")

here is a next input.

x1 x2 x3 x4 x5  | h
-------------------
1  0  1  1  1  | 1

==> following the logic (3),  you update x5 in the H table, as below.

h:   positive  absent  negated
------------------------------
x1:      +       +
x2:              +        +
x3:      +       +
x4:      +       +
x5:              +           # now we definitely know x5 is absent

===> this is the 2nd "mistaken"

===> repeat.

==> how many mistakes do we need ?
well, the first mistake eliminated the half of the literals, and the rest is we need at most k mistakes to correct x1 ~ xk variables.
i.e. it never makes more than k+1 mistakes

note: this learning algo will converge to a correct hypothesis, assuming (1) no error in the training sample, (2) there is at least some h ∈ H s.t. h(x) = c(x).  otherwise the algo will converge to an empty hypothesis set.

###
###  definitions
###

computational complexity : how much computational effort is needed for a learner to converge to an answer?

sample complexity : (in case of batch) how many training examples are needed for a learner to create a successful hypothesis?

mistake bounds : how many misclassifications can a learner make over an intinite run ?

training set : S ⊆ X, and c(x) ∀x ∈ S
==> it's just saying out of ALL input space X, we only get a sample subset S, and for every x in S, we know y == c(x)
hypothesis space : H
true hypothesis : c ∈ H         # c = concept, i.e. y
candidate hypothesis : h ∈ H
"consistent" learner : produces c(x) = h(x) ∀x ∈ S
(more formally)
A learner L is a consistent learner if L outputs hypotheses that perfectly fit the training data.
note: for a consistent learner, VS(H) == H
note: for a consistent learner, if H is finite, we can bound the number of training examples needed for a ceoncept to be PAC-learnable by L. (more details to follow)

version space:  VS(S) = {h s.t. h ∈ H consistent wrt S}   # i.e. a set of hypothesis consistent with examples
i.e.                                      # also denoted as VS(S,H)
VS(S) = {h ∈ H | h(x) = c(x) ∀x ∈ S}
e.g.

c: target concept     # well, it's just the XOR function, in this case

x1 x2 | c(x) aka y
------------------
0  0 |   0
0  1 |   1
1  0 |   1
1  1 |   0

suppose, S (i.e. training set) is below

x1 x2 | c(x)
------------
0  0 |  0
1  0 |  1

==> what is VS(S) ?

H = {x1, !x1, x2, !x2, x1 && x2, x1 || x2, x1 ^ x2, x1==x2}   # let's say H consists of 8 hypothesis
VS(S) = {x1, x1 || x2, x1 ^ x2}                                   # 3 of them are all consistent wrt S

==> obviously, we know in this case (x1 ^ x2) is the answer, but as far as S goes, then VS(S) is the above.

##
##  a simplistic VS algo   (from Mitchell book)
##

aka list-then-eliminate algo
1. VS = H
2. for each training example <x,c(x)>
remove from VS any h where h(x) != c(x)
3. output VS

==> this only works if H is finite. if H is huge, then VS might get huge, which is not ideal.

see "learner with mistaken bounds" for a more elaborate approach.

##
##  a more compact representation for Version Space (from Mitchell book p32)
##

define:
D: sample training data
G: the most general (aka "maximally general") hypothesis group, consistent with D
S: the most specific (aka "maximally specific") hypothesis group, consistent with D

formally:

G = {h ∈ H|consistent(h,D) ∧ (¬∃h' ∈ H)[(h' >g h) ∧ consistent(h',D)]}
S = {h ∈ H|consistent(h,D) ∧ (¬∃h' ∈ H)[(h >g h') ∧ consistent(h',D)]}

VS(H,D) = {h ∈ H|(∃s ∈ S)(∃g ∈ G)(g >=g h >=g s)}

note: when a learner is asking for new training examples (or a helpful teacher is giving more training examples), good examples are the ones that help shrink the version space.

##
##  partially learned concepts (from Mitchell book)
##

suppose your VS contains multiple hypotheses, you haven't yet got enough training examples to converge to one h, and you have to classify some new test data x_test.
==> it is possible to classify x_test with the same confidence as you would with fully converged VS with a single h.
e.g.
if every h ∈ H gives h(x_test)=1

more specifically, you don't need to check every h ∈ H, you only need to check if every h ∈ S  where S = most specific hypothesis group in VS.

generally speaking,
if a majority of h ∈ VS classify h(x_test)=1, then you are fairly confident.
if a minority does, then not so confident.

##
##    inductive bias - formally defined  (from Mitchell book)
##

let L(xi,Dc) denote the classification that learner L assigns to test data xi after training on sample data Dc
L(xi,Dc) is inductively inferred from (Dc ∧ xi)

but without some sort of assumptions, L may be completely overfit or simply have memorized (rote learn) every example as such, and never generalize to unobserved data. so there should be some sort of assumption. we call it inductive bias.

intuitively, it's a set of assumptions(preferences) so that the learning algorithm works(effectively) at all.

[definition]
consider a concept learning algorithm L for the set of instances X.
let c be an arbitrary concept defined over X, and let Dc = {<x,c(x)>} be an arbitrary set of training examples of c.
let L(xi,Dc) denote the classification assigned to the instance xi by L after training on the data Dc.
the inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc.

(∀xi ∈ X)[L(xi,Dc) follows deductively (B ∧ Dc ∧ xi)]

to recap,

L(xi,Dc) is inductively inferred from (Dc ∧ xi)

and (∀xi ∈ X)[L(xi,Dc) follows deductively (B ∧ Dc ∧ xi)]

such B is called inductive bias

###
###   Error of hypothesis h
###

(1) training error: fraction of traning examples misclassified by h
(2) test error: fraction of testing examples misclassified by h
(3) true error: fraction of examples that would be misclassified on sample drawn from D   # D = some distrib, i.e. we are talking infinite data set

Definition: the true error of a hypothesis h w.r.t a target concept c and an instance distribution D is the probability that h will misclassify an instance drawn at random according to D.
we write this as below:

error_D(h) = Pr_x~D[c(x) != h(x)]     # very simple, error of h in some distrib D is the probability of c(x) != h(x) for x drawn from D

===> so when we think about penalizing error, we can penalize the misclassification on the examples that have higher distrib, than rare examples.
(pretty much the same idea from the boosting algo)

note: the best h() you pick may not be identical to c() but it's possible at least for training examples you get, you find such h() s.t. h(x)=c(x) ∀x ∈ S

note: true error is theoretical probability. often in real world, we don't fully know the distribution D, and don't have infinite training examples. but if we have a good idea to estimate D, and sufficiently many examples then we can learn accurate-enough true error. let's formally define "accurate enough" & "sufficiently many".

###
###   PAC (probably approximately correct) learning
###

C : a class of target concepts
L : learner, i.e. algorithm
H : the hypothesis space of L
n : |H| = size of hypothesis space
D : distrbution over inputs
0 <= ε <= 1/2  : error goal
0 <= δ <= 1/2  : certainty goal. i.e. for probability of (1-δ), the algo must achieve true error(h) <= ε

C is PAC-learnable by L using H iff (if and only if) learner L will, with probability 1-δ (or better), output a hypothesis h ∈ H s.t. true error_D(h) <= ε in time and sample polynomial in terms of 1/ε, 1/δ and n. (polynomial far better than exponential)

i.e.
Probably == 1-δ
Approximately == ε
Correct == error_D(h)=0

note: of course, such h() is in VS(H)

###
###  ε-exhausted version space
###

VS(S) is ε-exhausted iff ∀h ∈ VS(S), error_D(h) <= ε

e.g.

let's revisit the same quiz setup.

c: target concept     # wel, it's just the XOR function, in this case

x1 x2 | c(x) |  D
------------------
0  0 |   0  | 0.1
0  1 |   1  | 0.5
1  0 |   1  | 0.4
1  1 |   0  | 0

suppose, S (i.e. training set) is below

x1 x2 | c(x)
------------
0  0 |  0
1  0 |  1

==> what is VS(S) ?

H = {x1, !x1, x2, !x2, x1 && x2, x1 || x2, x1 ^ x2, x1==x2}   # let's say H consists of 8 hypothesis
VS(S) = {x1, x1 || x2, x1 ^ x2}                                   # 3 of them are all consistent wrt S

==> obviously, we know in this case (x1 ^ x2) is the answer, but as far as S goes, then VS(S) is the above.

find the smallest ε s.t. we ε-exhausted VS(S)

h   : error_D(h)
----------------------
x1  : 0.5
x1 || x2  : 0
x1 ^ x2  : 0

====> therefore, ε = 0.5      # well, this satisfies  0 <= ε <= 0.5  so it's valid

###
###  Haussler Theorem - bound true error as a function of training examples
###

"how many training examples would it take to ε-exhaust version space with probability 1-δ (or better)"
i.e. "how many traning examples would it take to PAC-learn a concept ?"

Theorem: if H is finite, and M is a sequence of independently randomly drawn training points from data D (i.e. M is i.i.d and represent the distrib of data D), then for any 0 ≤ ε ≤ 1, then Probability[VS(H) not ε-exhausted] ≤ |H|exp(-εM)

(proof)
think about all the hypothesis in H (let's say H = {h1,h2,,,hn} i.e. |H| = n), some have high true error, while others have low true error.

let error_D(h1,h2,,,,hk ∈ H) > ε     # high true error hypothesis set. (high, relative to ε)
# how much training samples do we need to be sure these are really bad?
# note: h1,h2,,,hk are a subset of H that gives error_D(hi) > ε
# i.e.  k < |H|

in other words,

Pr_x~D[hi(x) = c(x)] <= 1-ε    # i.e.  Pr_x~D[hi(x) != c(x)] > ε
where i = 1~k                  # basically saying low probability of h(x)=c(x) for those high true error hypothesis
# low, relative to 1-ε

Pr[an hi consistent with c on M examples] <= (1-ε)^M     # i.e. in version space
# each probability is independent
# so Pr[h1 in VS(S)] && Pr[h2 in VS(S)] && .. Pr[hk in VS(S)]  would be a multiplication (1-ε)^Mk but we don't really use this.

Pr[at least one of h1,h2,,,hk consistent with c on M example]  <=  k(1-ε)^M  <=  |H|(1-ε)^M  <= |H|exp(-εM)  <=  δ

note: exp(x) == e^x
note: bounding   Pr[VS(H) not ε-exhausted] <= δ    means  bounding  Pr[VS(H) is ε-exhausted] > 1-δ
note: calculus(taylor series) says  -ε >= ln(1-ε)    thus   exp(-ε) >= 1-ε, then  exp(-εM) >= (1-ε)^M
note: δ = failure probability, which we decide to bound the upper bound of Pr[VS(H) not ε-exhausted] > 1-δ

note:  why  Pr[at least one of h1,h2,,,hk consistent with c on M examples]  <=  k(1-ε)^M   ?
technically speaking, it should be
Pr[an hi not consistent with c on M examples]    > 1 - (1-ε)^M
Pr[every hi not consistent with c on M examples] > (1 - (1-ε)^M)^k
Pr[at least one of h1,h2,,,hk consistent with c on M example]  <=  1 - (1 - (1-ε)^M)^k    # but it's too complex
here, the trick we use is "union-bound inequality"
i.e.
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)     # by definition of set theory
P(A ∪ B) ≤ P(A) + P(B)
==> that's why we could simply add up (1-ε)^M for k times, to obtain  ≤ k(1-ε)^M

so, to recap, |H|exp(-εM) is the upper bound of version space not ε-exhausted after M samples

now lets rewrite this in terms of M

|H|exp(-εM)  <= δ
ln|H| -εM <= ln(δ)
ln|H| - ln(δ) <= εM
(1/ε)(ln|H| + ln(1/δ)) <= M       # sample complexity lower bound M is polynomial in terms of ε,δ,|H| !!

note: "sample complexity lower bound M" == at least how many training samples we need, to achieve Pr[VS(H) is ε-exhausted] > 1-δ]
(recall) Pr[VS(H) not ε-exhausted] <= δ
Pr[VS(H) is ε-exhausted] > 1-δ
i.e. if Pr[VS(H) is ε-exhausted] > 1-δ, then C is PAC-learnable.

note: recall properties of logarithm
log(x^y) = y*log(x)           # e.g. -log(δ) = log(δ^-1) = log(1/δ)
log(x) + log(y) = log(xy)
log_b(x) = log_a(x) / log_a(b)
b^(log_b(x)) = x

(ref) https://www.andrews.edu/~calkins/math/webtexts/numb17.htm

NOTE: in Haussler theorem, this sample complexity bound M assumes H includes h(x)=c(x) but what if not? then we call it "agnostic" and try to find h from H s.t. h(x) best fits(closest to) c(x).
NOTE: what if |H| = n is infinity ?  then M becomes infinity. this is an important issue. we talk about this in the next lesson.

##
##  PAC learnable - an exercise quiz
##

given

H = {hi(x) = xi}
X : 10 bits   i.e. i = 1~10   #  i.e. |H| = 10
ε : 0.1   # 10% error
δ : 0.2
D : uniform

How many samples do we need to PAC-learn this hypothesis set ?
just use the equation.

M >= 39.12

let's say we need 40 examples. this is good because the input space is 2^k = 2^10 = 1024, so M is less than 4% of the input space.

===> but now think about, we want the error to be lower, like 1%, then M >= 400, so now that's almost 40% of the input space.
i.e. we need a bigger M (i.e. more samples) to PAC-learn with lower error.

#################################
####  (1.8)  VC Dimensions   ####
#################################

###
###  infinite hypothesis spaces
###

recall Haussler theorem

(1/ε)(ln|H| + ln(1/δ)) <= M       # sample complexity bound M is polynomial in terms of ε,δ,n   where n = |H|

what if |H| is huge (or infinite) ?   then M is infinite,,,, bad.
e.g.
- linear speratators           # you can pick an infinite combination of coordinates
- artificial nueral networkds  # weight(s) = real number, so your choice is infinite
- decision trees (continuous input)  # you can infinitely keep splitting. e.g. is x > 1.1 ?  is x > 1.11 ?  is x > 1.111 ? and so on
(but decision trees with discrete input has a finite H, because there is only a finite number of features)
- KNN is tricky, one can argue there is an infinite number of possible input data points, so the classifier that comes out of it, there can be an infinite kind.
(but if you think about it differently, well, the data set is fixed for a given KNN classifier, so its behavior is finite. it depends on whether you consider data as part of H space)

###
###  how to approach infinite |H| learning problem
###

X : {1,2,3,4,5,6,7,8,9,10}     # input space is integers from 1 to 10
H : h(x) =  1 if(x >= θ)       # θ theta is just a parameter, a real number
=  0 otherwise        # so yes, |H| can be infinite, because there is an infinite choice of θ value

like always,
- track all hypotheses    # only need to track non-negative integers ten or below. then it's finite AND gives the same answer!
- keep (and ε-exhaust) version space

i.e. if θ is one or below, then the answer is always zero, so we don't need to track any hypothesis where θ is one or below.
similarly, if θ is 10 or above, then we can ignore all such hypothesis, because the answer is always 1.

i.e. we say "syntactically" the hypothesis space is infinite, but "semantically" it is finite.

##
##  how to approach inifite |H| learning problem (from Tom Mitchell textbook)
##
- it helps to have some way to order (or give structure) to H
e.g.
general-to-specific ordering    #  textbook page24

suppose input x = <x1,x2,x3,x4,x5>

h1(x) = 1 if x4 >= 0     # only cares about x4
0 otherwise

h2(x) = 1 if x2 >= 0 and x4 >= 0    # cares about x2 and x4 only
= 0 otherwise

[definition]
any x that gives h2(x)=1 will give h1(x)=1           # we say "give"=="satisfy"
then we say h1 is more_general_than_or_equal_to h2   # nicely intuitive

h1 >=g h2      # let's denote it like this

we say h1 is more_general_than h2 if h1 >=g h2 ∧ h2 !>=g h1

h1 >g h2  iff  h1 >=g h2 ∧ h2 !>=g h1    # also nicely intuitive

importantly, "h1 more_general_than h2" == "h2 more_specific_than h1"

NOTE: this doesn't allow every single h ∈ H to be ordered. because it's possible hj !>=g hk ∧ hk !>= hj
(we say this general-to-specific relation ordering method gives a "partial" order, not a "total" order, over H)
(formally, we call the relation reflexive, antisymmetric, and transitive)

[a simple algo using general-to-specific ordering]
2. iterate thru training data instance, and if the current h() doesn't fit, then choose more general h() that fits training data.
3. you end up with the most specific h() consistent with input data. # obviously there can be multiple such h() and we randomly arrive at one such h()
(very naive but it's just an approach. we need to deal with outliers/noise/error, overfitting, etc)

###
###   VC dimension  as a measure of the complexity of hypothesis space
###

e.g.

X : {1,2,3,4,5,6,7,8,9,10}
H : h(x) =  1 if(x >= θ)   # well, h(x) output can ever be 1 or 0
=  0 otherwise    # so "all possible ways" are 1 or 0 in this case

- what is the largest set of inputs that the hypothesis class can label in all possible ways?

==> the answer = 1, because suppose S = {6}, then
if θ = 5, then the answer is 1
if θ = 7, then the answer is 0

why not two (or bigger) ? because suppose you pick S = {4,6}

then, "all possible ways" mean all four cases below.

h(4) | h(6)
-----------
0  |  0
0  |  1
1  |  0   # this is never possible because if 4 >= θ, then 6 >= θ too.
1  |  1

==> so we can only label 3 out of 4 all possible ways.

###
###  VC dimension
###

"label in all possible ways" == "shatter"    # just the formal ML term

"the largest set of inputs that the hypothesis class can shatter" is known as "VC dimension"

(VC = Vapnik-Chervonenkis, two russian scientists who came up with this theory)

so even if |H| is infinite, as long as VC dimension is finite, we can finitely figure the amount of data needed to learn.

we write it as  e.g.   VC(H) = 3
e.g.   VC(H) = ∞

###
###  VC dimension : quiz - interval
###

X = R                   # any real number, so |H| = infinite
H = {h(x) = x∈[a,b]}    # is a <= x <=b ?

so all possible labels in this case are just 2^i where i = |S|, each h(xi) = 0 or 1

is the VC dimension >= 1 ?  yes, because you pick any one real number, and by choosing a & b such that you can get all possible labels, as below.

x1
-------------------------    # so h(x1) = 1
a      b

or

x1                      # or you can put x1 > b
-------------------------    # so h(x1) = 0
a      b

is the VC dimension >= 2 ?   Yes!

x1   x2             #  h(x1) h(x2)
----------------------   #    0     1
a     b

x1   x2        #  h(x1) h(x2)
----------------------   #    1     0
a     b

x1 x2           #  h(x1) h(x2)
----------------------   #    1     1
a     b

x1        x2        #  h(x1) h(x2)
----------------------   #    0     0
a     b

is the VC dimension >= 3 ?  no,

x1    x2     x3       #  x1 <= x2 <= x3
----------------------  #  there is no way you can get h(x1)=1, h(x2)=0, h(x3)=1
a      b          #

==> therefore, overall, VC dimension = 2

(this was a simplistic intuitive example, let's look at a more realistic example next)

###
###  VC dimension : quiz - linear separators
###

X = R^2                       # 2 dimensional space
H = {h(x) = w^T * x >= Θ}     # this equation is generalized as a linear separator in hyperplane
# but just for simplicity, let's just think only in R^2, 2-dimensional space
# e.g. classify any points below Θ as "-" and else "+"

|    + + +  +
| o     + + + ++
| o o    ++  ++
| ooo o    + +
|o o oo o    +
|__________________

basically you have control over each value of W vector elements

is VC >= 1 ?  yes because if input size is 1, just pick some W so that x1 is left or right of your linear separator
is VC >= 2 ?  yes because imagine two input points (x1,x2) in a 2d plane below.

|              # h(x1) | h(x2)
|  x1  x2      # --------------
|              #   1   |   1     # this is easy
-----------    #   0   |   0     # just draw your line left or right (or above or below) both points
#   1   |   0     # just draw a line to separate x1 and x2
#   0   |   1     # then flip the sign for theta Θ, so you get this

is VC >= 3 ?  yes

|       x2

|   x1      x3

----------------

h(x1) | h(x2) | h(x3)
---------------------
1   |   1   |   1   # these are always easy
0   |   0   |   0
1   |   0   |   0   # easy also
0   |   1   |   1   # just flip the sign for theta Θ
1   |   1   |   0
0   |   0   |   1
1   |   0   |   1   # just draw a line to separate x2 from x1,x3
0   |   1   |   0   # flip the sign again

==> notice we kind of chose x1,x2,x3 relative coordinates like the above by ourselves, which is ok.
==> VC dimension doesn't have to satisfy every single possible values of X vector.
==> because otherwise think about the case where x1=x2=x3, then we can only cover the case where h(x1)=h(x2)=h(x3)

is VC >= 4 ?  no because no matter how you choose your X and W, you cannot shatter.

e.g.
|                 # here, there is no way you can get h(x1)=h(x4)=1 and h(x3)=h(x2)=0
|  x3     x4      #

|  x1     x2

---------------

e.g.
|      x4        # again, there is no way you can get
|                # h(x1)=h(x2)=h(x4)=0 and h(x3)=1
|      x3

| x1        x2
---------------

###
###  generalize the relationship btwn VC dimension and Hypothesis space dimension
###

hypothesis space dimension | parameters | VC dimension
------------------------------------------------------
0              |  theta Θ   |     1
1              |    a,b     |     2
2              |    W,Θ     |     3      # W is a vector of size 2
3              |    W,Θ     |     4      # W is a vector of size 3
d              |    W,Θ     |    d+1     # W is a vector of size d

##
##  quiz:  polygons(convex)
##
(ref) https://en.wikipedia.org/wiki/Convex_polygon

a polygon is just a shape consisting of lines and edges.
e.g. a triangle, a rectancle, a pentagon, etc
e.g. a circle can be considered a polygon of infinite degree
a convex polygon is where each edge has angle <= 180

X : R^2     #  2d plane
H : points inside some convex polygon (on-edge counts as inside)
parameters : infinite, because the degree of polygon is not specified
VC = infinite

===> see the video for proof, the point is VC=d+1 observation still applies.

##
##  sample complexity & VC dimension
##

m >= (1/ε)(8*VC(H) * log_2(13/ε) + 4log_2(2/δ))    # infinite case  (we don't go into its proof)
m >= (1/ε)(ln|H| + ln(1/δ))                        # finite case

(Q) what is VC dimension of finite H ?
if H is finite, then VC(H) is finite, then we know the following upper bound.

suppose VC(H) = d,  then for those d points from X, there must be 2^d possible labelings of these points. Each of these labelings requires a separate hypothesis.
i.e.
suppose VC(H) = d,  then  ∃ 2^d distinct concepts(=labels) (each gets a different h)     # ∃ means "there exists"

thus,  2^d < |H|     # i.e.  d < log_2|H|
thus,  VC(H) = d < log_2|H|

note:  2^d because we assume binary classification

Theorem: H is PAC-learnabe if and only if VC dimension is finite   # i.e. if VC dim = ∞, then it's not PAC learnable

######################################
####  (1.9)  Bayesian Learning    ####
######################################

- learn the best (= most probable, most likely correct) hypothesis h, given data & some domain knowledge

argmax Pr(h|D)     # probability of hypothesis h (drawn from some hypothesis class H) given data D
h∈H               # notice D is data, not distribution in this case

note:  argmax f(x) means pick x such that f(x) is maximized
x

# probability chain rule

Pr(a,b) = Pr(a|b) * Pr(b)   # probability of a & b is probability of a given b times probability of b
= Pr(b|a) * Pr(a)   # because the order doesn't matter, you can rewrite it as this also

==> in other words,

Pr(b|a) * Pr(a)
Pr(a|b) =  ----------------     # here you got the Bayes theorem
Pr(b)

##
##  Bayes theorem
##
P(D|h) * Pr(h)
Pr(h|D) =  --------------
P(D)

Pr(D) : prior belief of seeing some particular data set D
because Pr(D) doesn't depend on H, so we typically ignore this.

Pr(D|h) : probability of data D given the hypothesis h (assuming h is true)
==> accuracy of h

D = {(xi,yi)}   # training data set D
# so Pr(D|h) means the probability of h(xi) = yi
# i.e. Pr(D|h) can be interpreted as the accuracy of h on D

e.g. given h(x) = {x >= 10}
x=7, then probability of h(x)=1 is zero
and probability of h(x)=0 is 1

Pr(h) : prior belief of seeing h out of H

"prior belief" == "domain knowledge"

NOTE: to argmax Pr(h|D), you can either increase Pr(h) and/or Pr(D|h), as you analyze/pick h.
h∈H

##
##  Bayes rule - quiz
##

a man goes to a hostpital to test for a desease K. his lab test result is positive "+".
the test returns correct positive 98% of the time. and correct negative 97% of the time.
0.8 % of the population has the disease K.

does he really have K ?

what's the probability he has K ?

Pr(K|+) = Pr(+|K) * Pr(K) / Pr(+)
= 0.98 * 0.008 / Pr(+)
= 0.00784 / Pr(+)       # see below for Pr(+) derivation
= 21%

what's the probability he doesn't has K ?

Pr(!K|+) = Pr(+|!K) * Pr(!K) / Pr(+)
= 0.03 * 0.992 / Pr(+)
= 0.02976 / Pr(+)
= 79%

===> Pr(K|+) + Pr(!K|+) = 1  so you can derive Pr(+) = 0.0376

===> despite the positive test result, he likely doesn't have K. (only 21% chance)
===> this is exactly why doctors don't test you for every rare disease even if the test is fairly accurate.
===> but thru some other test, or other prior "domain knowledge", if doctors can classify you into some group of population in which 0.8% changes to a higher probability, e.g. if you have symptoms XYZ, then you are in a group of people who have 10% chance of having K, then the test makes sense.
===> also, from mathematical perspective, you can calculate at what percentage (instead of 0.08%) does the test make sense. interesting to think about.

###
###  MLE vs MAP
###

[definitioin]
a priori knowledge : domain/background generalized ground truth/knowledge
a posteriori knowldge : inductive/derived knowledge from experience/examples

(ref) https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation

MLE: maximum likelihood estimation
- choose value that maximizes the probability of observed data
_
θ_MLE = argmax Pr(D|θ)
θ
(ref) http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/MLE_MAP_Part1.pdf

MAP: maximum a posteriori
- choose value that is most probable given observed data and prior belief
_
θ_MAP = argmax Pr(θ|D)  = argmax Pr(D|θ) * Pr(θ)
θ                 θ

###
###  Bayesian Learning
###

[algorithm]
for each h ∈ H, calculate Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)
output: h = argmax Pr(h|D)
h

NOTE: you can derive Pr(D) based on the constraint that  Σ Pr(hi|D) = 1
hi∈H
but because Pr(D) is the same for every Pr(hi|D), for the purpose of computing argmax, you can ignore Pr(D)
so let's simplify the algo.

Pr(h|D) ≈ Pr(D|h) * Pr(h)         #  "≈" denotes approximately equal to
argmax Pr(h|D) = argmax Pr(D|h) * Pr(h)  #  this is called MAP(maximum a posteriori) estimation
#  again, notice the removal of Pr(D) doesn't affect argmax

furthermore, even a more simplified version is below.

Pr(h|D) ≈ Pr(D|h)            # lets assumes each Pr(hi) is equally likely, i.e. uniform distrib of h in H
# in that case, it's ok to drop Pr(h) for our argmax computation

===> so now we have a few versions of h

h = argmax Pr(h|D)
h

h_map = argmax Pr(D|h) * Pr(h)
h

h_ML = argmax Pr(D|h)             #  ML = maximum likelihood (estimation)  aka MLE
h

NOTE: it's mathematically/conceptually/theoretically elegant, but looking at each h may not be computationally practical if H is big.

###
###  Bayesian Learning in Action
###
(1) given {<xi,yi>} as "noise-free" examples of C    # C = concept
(2) C ∈ H             # let's say H is a finite class
(3) uniform prior     # i.e. Pr(hi) is equal for every i

note: noise-free means c(xi) = yi
note: (1) ~ (3) are kind of strong assumption, but not entirely unreasonable.

let's try compute Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)

Pr(h) = 1 / |H|
Pr(D|h) = 1 if h(xi) = yi  ∀ xi,yi ∈ D     #  ∀ means "for all"
0 otherwise                      # notice this is the same as "version space"

Pr(D) =  Σ Pr(D|hi) * Pr(hi)     # nothing fancy, just so  Σ Pr(hi|D) = 1
hi∈H                                              hi∈H
# assuming each h is mutually exclusive
=  Σ 1 * (1/|H|)           # notice we already know Pr(D|hi) and Pr(hi)
hi∈VS(H,D)

= |VS| / |H|

1 * (1/|H|)       1                       # this is neat
Pr(h|D) = ------------  =  -----  for h ∈ VS(H,D)    # probability of h being correct given D
|VS|/|H|       |VS|                      # is just 1 / |VS|
# because we only consider h ∈ VS(H,D), Pr(D|h)=0 case is implicitly skipped

###
###  noisy data - quiz
###

<xi,yi>  where  yi = k * xi
where  k is an integer from {1,2,3,,,,,∞}
where  k is chosen with probability Pr(1/2^k)
e.g. probability of k=5 is 1/2^5 = 1/32
∞
Σ 1/(2^k) = 1
k=1

===> so that's the noise process

x | y   | k  |  Pr(1/2^k)
--------------------------
1 | 5   | 5  |  1/32
3 | 6   | 2  |  1/4
11 | 11  | 1  |  1/2
12 | 36  | 3  |  1/8
20 | 100 | 5  |  1/32

here is a candidate hypothesis: h(x) = x      #  f(x)=x is called "identity function"
(plus the afore-mentioned noise process)

Pr(D|h) = (1/32) * (1/4) * (1/2) * (1/8) * (1/32) = 1/65536
= Π Pr(yi|h)
i

= Π 1/(2^k)     where k = y1/xi
i

if yi mod xi = 0  ∀ xi,yi

###
###   Bayesian Learning - contd
###
given {<xi,yi>}
yi = f(xi) + εi                       # ε = error term, to make data noisy, aka "noise" corruption
εi is chosen from norm(0,δ^2) i.i.d.  # normal(aka gaussian) distrib of zero mean, with variance δ
# iid = independently and identically distributed

NOTE: the above assumption is important. f(x) is just some deterministic function, which we corrupt with gaussian noise ε.
notice it's not dependent on Hypothesis class

h_ML = argmax Pr(h|D)
h

= argmax Pr(D|h)
h

= argmax Π Pr(yi|h)   # Pr(yi|h) is probability of seeing correct yi given h
h   i

= argmax Π 1/sqrt(2πδ^2) * e^(-1/2 * [(yi - h(xi))^2 / δ^2])    # just using gaussian distrib
h   i                                                      # see below note and wikipedia

## at this point, to simply, we can get rid of coefficient that doesn't affect argmax computation
## we can remove  1/sqrt(2πδ^2)   as it doesn't depend on i
## also we can apply ln() because log() is a monotonic function (i.e. doesn't change the argmax)
## recall log(A) + log(B) = log(A*B),  and ln(e^(abc)) = abc

= argmax Σ -1/2 * (yi - h(xi))^2 / δ^2       # well simplified
h   i                                   # we can do more, get rid of δ^2
# get rid of 1/2
= argmin Σ (yi - h(xi))^2                    # a minus sign can be eliminated by changing argmax to argmin
h   i

## turns out this is just the sum of squared error, just like other supervised ML algo we looked at. (neural net, gradient descent, linear regression, etc)
## whoever decided to model noise as gaussian was onto something

note: gaussian probability distrib function = 1/sqrt(2πδ^2) * e^(-1/2 * (yi - h(xi))^2 / δ^2)

(ref) https://en.wikipedia.org/wiki/Normal_distribution
(ref) https://en.wikipedia.org/wiki/Monotonic_function

###
###  best hypothesis - quiz
###

let's apply Bayesian learning algo.

given {<xi,yi>}
yi = f(xi) + εi

sample data set S is:
x | y
------
1 | 1
3 | 0
6 | 5
10 | 2
11 | 1
13 | 4

which one is the best h ?
1.  h(x) = x mod 9
2.  h(x) = x/3
3.  h(x) = 2

==> just compute Σ (yi-h(xi))^2   and you will find 1. is the smallest error
i

###
###  Minimum Description Length
###

log_2() = lg()    # let's just agree this notation of log base 2

h_MAP = argmax Pr(D|h) * Pr(h)
= argmax [lg(Pr(D|h)) + lg(Pr(h))]
= argmin [-lg(Pr(D|h)) - lg(Pr(h))]    # see NOTE_1 below
= argmin [length(D|h) + length(h)]     # see NOTE_2 below
= argmin [error(h(D)) + size(h)]

==> so you want the hypothesis that minimizes error and size.
(smaller size means simpler)
==> "occam's razor" !!!                 # note, in reality, error and size are in a trade-off relation.
==> called "minimum description length"

NOTE_1:

(from information theory, there is the notion of entrophy)
an event with probability P
--> its optimal entropy is -lg(P)     # "entropy" aka "length"

(ref) https://en.wikipedia.org/wiki/Entropy_(information_theory)

NOTE_2:

length(h) : "length of hypothesis"
: basically how many bits needed to represent a hypothesis
: i.e. "size of h"
: e.g. if your hypothesis representation is a decision tree, then smaller tree size 2^N is better.
: e.g. if your h representation is neural network, then the overall neurons/layer size may be fixed, but still smaller weights are better (because bigger wrights require more bits, in fact that leads to overfitting). so this embracing smaller size really captures the idea for neural network as well.

length(D|h) : recall an earlier discussion.
- so Pr(D|h) means the probability of h(xi) = yi
- i.e. Pr(D|h) can be interpreted as the accuracy of h on D
- so length(D|h) is shorter if accuracy is better
- i.e. length(D|h) = error

###
###   Bayesian Learning  VS  Bayesian Optimal Classification
###

so far, we've been talking about Bayesian "learning"
i.e.

argmax Pr(h|D)     # pick the best (most probable, most likely correct) hypothesis h, given data D (and other domain knowledge)
h∈H               # i.e. for h∈H compute Pr(h|D), and output h that maximizes Pr(h|D)

[quiz]  given below, what should be the best label(i.e. y) for x ?   + or - ?

hi(x)  |  Pr(hi|D)
-----------------------
h1 |   +    |   40%
h2 |   -    |   30%
h3 |   -    |   30%

===> well, according to Bayesian "learning" rule, we pick h1(x) = + as our best h, so y should be +
but that's just picking the best "hypothesis", we want the best "label"
so we do weighted voting, aka Bayesian optimal "classification", as expressed below

v_MAP = argmax Σ Pr(v|h)Pr(h|D)      #  Pr(v|h) is the accuracy of v, given h
v   h∈H                    #  pick v (== label y) such that the sum of (the probability of v given h, i.e. the accuracy of h on v) * (probability of h)

if v = +,

Pr(+|h1) * Pr(h1|D)  +  Pr(+|h2) * Pr(h2|D)  +  Pr(+|h3) * Pr(h3|D)
1 * 0.4       +         0 * 0.3       +         0 * 0.3      = 0.4

if v = -,

Pr(-|h1) * Pr(h1|D)  +  Pr(-|h2) * Pr(h2|D)  +  Pr(-|h3) * Pr(h3|D)
0 * 0.4       +         1 * 0.3       +         1 * 0.3      = 0.6     # thus argmax will pick v = -

########################################
####  (1.10)  Bayesian Inference    ####
########################################

Bayesian network: representing and reasoning with probabilistic quantities.

###
###  random variable
###

what is random variable: let's say "X"
- it's a variable that can take any value from a set of values. Ω_X (dubbed its domain)
- can be discrete or finite, or letters of alphabet, or {0,1}
- a trial: an instance of observation of a particular value of random variable
- a sample: a collection of trials
- every random variable (we assume) has some underlying probability distribution. P(X)

Σ P(x) = 1
x∈Ω_X

e.g.
suppose X is pittsburgh weather.
then we may get something like below.
P(X=cloudy) = 0.9
P(X=snow)   = 0.07
P(X=sunny)  = 0.02
P(X=other)  = 0.01

note: obviously when X is contiunous, then P(x) for any particular value of x is almost zero, so we usually integrate ∫P(x) for some range of x, and call it probability "density".

###
###  joint/marginal/conditional distribution
###

note: "joint distribution" == "joint probability distribution" == "joint frequency"    # same goes for marginal/conditional

joint probability distribution is probability of X=x and Y=y
P(X=x,Y=y) = some_value_between_[0,1]   # also denoted as  P(X=x ∧ Y=y)

e.g.

atlanta summer afternoon weather:

storm | lightening | probability (should add up to 1.0)
--------------------------------
T   |    T       |    0.25
T   |    F       |    0.4
F   |    F       |    0.05
F   |    T       |    0.3

P(storm=T,lightening=T) = 0.25
P(storm=T,lightening=F) = 0.4    # note we usually use variables names like X,Y instead of "storm" "lightening"
P(storm=F,lightening=F) = 0.05
P(storm=F,lightening=T) = 0.3

note: X and Y are independent iff P(X=x) * P(Y=y) = P(X=x,Y=y)   # sometimes we write P(X)*P(Y)=P(X,Y) to simplify

"marginal distribution" is defined as below:  # very obvious
P(X=x) = Σ P(X=x,Y=y)                         # suppose we only have joint distrib data like above table
y∈Ω_Y                                 # then we can derive marginal distrib P(X=x) from the joint distrib data

"conditional distribution" is defined as below:
P(X=x,Y=y)
P(X=x|Y=y) = ------------            # probability of X=x, given Y=y
P(X=x)

e.g.

P(X,Y,Z) = P(X|Y,Z) * P(Y|Z) * P(Z)     # notice, this means we can recover joint distrib from conditional distrib

e.g.

if X,Y,Z are independent, then P(X,Y,Z) = P(X) * P(Y) * P(Z)

and of course, from the above, we derived Bayes Rule:
P(Y|X) * P(X)
P(X|Y) = ---------------
P(Y)

here are exercise quizes;

Pr(¬storm) = 0.3 + 0.05     #  "¬" = logical not. i.e.  "!"
= 0.35           # this is marginal distrib

Pr(lightening|storm) = 0.25/0.65     #  probability of lightening given storm
= 5/13          #  so the denominator is 0.25 + 0.4
= 0.3846        #  this is conditional distrib

NOTE: with only two binary variables (storm and lightening) we have 4 possible cases.
with N binary variables, we get 2^N cases. but a variable may not be binary, e.g. a variable like "food_type" may take a few possible labels. then the complexity grows more quickly.
well we can factor these. more details to follow.

###
###  Conditional Independence
###

definition: X is conditionally independent of Y given Z
if the probability distribution governing X is independent of the value of Y given the value of Z
i.e.  if ∀ x,y,x,  P(X=x|Y=y,Z=z) = P(X=x|Z=z)   # notice this is just expressing exactly what's described above.
(more compactly we write)  P(X|Y,Z) = P(X|Z)             # i.e. probability of X is independent of Y given Z

FYI:

chain rule:  P(X,Y) = P(X|Y) * P(Y)

normal independence:  P(X,Y) = P(X) * P(Y)     # simple. probability of X & Y is the probability of X times the probability of Y
P(X|Y) = P(X)            # this is true under normal independence scenario

###
###   Belief Networks   (aka Bayes Network, Bayesian Nets, aka Graphical Models)
###

Bayes Nets - it's a representation of joint distribution of some variables (aka features, attributes) of particular conditional dependency

e.g.

S = storm         # assume all these variables take binary value. 1 or 0
L = lightening    # so we have 2^3 scenarios
T = Thunder       # but we can simply express all cases with the below steps, ASSUMING T is conditionally independent of S given L

[storm]         [lightening]               [thunder]
given            you can calc              then you can calc
P(S)    --->   P(L|S) or P(L|¬S)    --->    P(T|L) or P(T|¬L)
a                b         c                  d         e      # some value

e.g.  what's the probability of  S=1,L=0,T=1 ?
==>  a * (1-b) * e

NOTE: we assumed T is conditionally independent of S given L.  as above, it's a nice graphical representation of statistical dependence/independence.
but otherwise we need an arrow connecting directly from [storm] to [thunder] as well, i.e. to calc [thunder] probability you need both [storm] and [lightening] probabilities. but that would make the graph grow exponentially. bad.

==> this is much like a neural network (though it's not the same, because this is just a probabilistic/statistical dependence/independence visualization)

Bayes Nets - example

---> e          (for better visual) https://www.youtube.com/watch?v=0Te3m5y3CYA
/     ^
|     |           # this graph just denotes
a ---> c --> d           #  a: deson't depend on anything, completely independent
^     ^           #  b: also completely independent
|     |           #  c: depends on a and b
b----/            #  d: depends on b and c
#  e: depends on c and d

node |  what it represents
--------------------------
a  |  P(a)
b  |  P(b)
c  |  P(c|a,b)
d  |  P(d|b,c)
e  |  P(e|c,d)

- must be acyclic
- nodes visited (computation done) in topological order.  # e.g. all required parents nodes must be visited.

###
###  recovering joint distribution (from conditional probability)
###

recall "joint distribution" tables, i.e. we were given the probability for every combination of variable values.
from such a table, we calculated conditional probability, such as P(a|¬b)

now we can try the other direction: calculating joint distribution from conditonal probability table.

joint distrib | conditional probability
------------------------------------------------------------    # assuming the same above Bayes Net
P(a,b,c,d,e) = P(a) * P(b) * P(c|a,b) * P(d|b,c) * P(e|c,d)

what's the representation complexity (i.e. table size) ?   # assume each variable is boolean

P(a,b,c,d,e) : 2^5 = 32 (combinations) that's gonna be a exponentially growing table size, O(2^n)             31
technically 2^5 - 1, because you can derive the 32nd combination's probability as 1 - Σp(xi)
i=1
P(a) : 2^1   # a is true or false, hence 2. but you can always derive P(a) = 1-P(¬a)
P(b) : 2^1   # so the size is 1
P(c|a,b) : 2^2   # you store probability of c for every combination of a & b (=2^2, so that's size 4), and P(¬c|a,b) can be derived.
P(d|b,c) : 2^2   # (good ref) https://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
P(e|c,d) : 2^2

==> if you construct the conditional probability table, all you need is 1 + 1 + 4 + 4 + 4 = 14 in this case.
31 > 14
O(2^n) > O(n * 2^k)  # k is how many parents a node has

so in a way, storing info in a conditional probability table is more efficient.

NOTE: table size (how many things you have to remember) illustration below

a=T | a=F
P(a):  0.3 | 0.7    # then you need just 1, because P(a=F) can be derived by 1-P(a=T)

b=T | b=F
P(a|b): a=T 0.6 | 0.1     # so table size is 2^2
a=F 0.4 | 0.9     # but this row P(¬a|b) can be derived from the first row P(a|b), so it's really 2
# by now you see the pattern
P(a|b,c): 2^2 = 4         # it's  2^k where k = number of dependent variables (parents in the bayes network).

###
###  sampling (from some distribution)
###

distribution:
- probability of value
- generate values

==> simulates a complex process
==> approximate inference         #  approximation is because maybe exact determination is hard or slow
==> visualization

###
###  inference rule
###

marginalization:   P(x) = Σ P(x,y)     # this is trivial and redundant but useful
y            #  y = 1 or 0, so it's like saying P(x) = P(x,y=1) + P(x,y=0)

chain rule: P(x,y) = P(x) * P(y|x)     # of course, if x and y are completely independent,
= P(y) * P(x|y)     # then you have P(x,y) = P(x) * P(y)
= P(y,x)            # x & y are commutative

NOTE: though P(x,y) = P(y,x), they are different in Belief Nets representation
e.g.
x-->y    # represents P(x)*P(y|x)
y-->x    # represents P(y)*P(x|y)

P(x|y) * P(y)
Bayes rule: P(y|x) = ---------------
P(x)

(quiz)  https://www.youtube.com/watch?v=Vt8fRJwvdwA   # inference by hand
https://www.youtube.com/watch?v=uACvhE19c_M   # messy but very good demonstration of marginalization/chain/bayesian rules

###
###  Naive Bayes  inference
###

naive bayes - it's a special case of bayes nets, where we assume all attributes are conditionally independent of other attributes given a label v
- grahically below

/---> a1
|---> a2
.     .        # every node (a1,a2,,,,an) has one parent "v"
v --.     .        # so it's a depth=1 tree
|     .
----> an

v = some value     # can be a class of hypothesis
a = some attribute

given the above particular setup, the joint probability distribution is as below

P(v)
P(v|a1,a2,,,,an) = ----- * Π P(ai|v)      # z = just some normalization factor (we discuss below)
z     i

MAP class = argmax P(v) * Π P(ai|v)     # pick v ∈ V such that it maximizes P(v|a1,a2,,,,an)
v           i             # notice it's the same as h_map
# so we dropped z also

==> so we infer the best fit hypothesis class given attributes

NOTE: this inference rule is "naive" because the underlying assumption is the probability of each attribute is conitionally independent of other attributes given v

given an email (possibly spam), then what's the probability of the email containing certain words ?

P(spam) = 0.4
P('rayban'|spam) = 0.2    # probability of the word rayban appearing given a spam email
P('rayban'|¬spam) = 0.1   # probability of the word rayban appearing given a non-spam email
P('viagra'|spam) = 0.3
P('viagra'|¬spam) = 0.001
P('gatech'|spam) = 0.0001
P('gatech'|¬spam) = 0.1

==> we can visualize in a bayesian net

---->gatech
/
spam----->rayban
\
---->viagra

now, compute  P(spam|'rayban',¬'viagra',¬'gatech')    # simply apply Bayes rule
= P('rayban',¬'viagra',¬'gatech'|spam) * P(spam) / P('rayban',¬'viagra',¬'gatech')
= P('rayban'|spam) * P(¬'viagra'|spam) * P(¬'gatech'|spam) * P(spam) / P('rayban',¬'viagra',¬'gatech')   # now apply the numbers
=          0.3     *      0.7          *      0.9999       * 0.4     / P('rayban',¬'viagra',¬'gatech')   # we don't know the denominator
=                              some_number_foo                       / P('rayban',¬'viagra',¬'gatech')

we don't know the denominator P('rayban',¬'viagra',¬'gatech'), but that's ok we can normalize, to find its value, aka normalization factor.

remember that  P(spam|'rayban',¬'viagra',¬'gatech') + P(¬spam|'rayban',¬'viagra',¬'gatech') = 1   # obvious, P(a|b) + P(¬a|b) = 1

now try compute  P(¬spam|'rayban',¬'viagra',¬'gatech')   # just apply Bayes rule
= P('rayban',¬'viagra',¬'gatech'|¬spam) * P(¬spam) / P('rayban',¬'viagra',¬'gatech')
= P('rayban'|¬spam) * P(¬'viagra'|¬spam) * P(¬'gatech'|¬spam) * P(¬spam) / P('rayban',¬'viagra',¬'gatech')   # now apply the numbers
=        0.1        *        0.999       *       0.9          * 0.6      / P('rayban',¬'viagra',¬'gatech')
=                         some_number_bar                                / P('rayban',¬'viagra',¬'gatech')

some_number_foo + some_number_bar        # we know both in real values
---------------------------------  = 1
P('rayban',¬'viagra',¬'gatech')        # so we can derive this normalization factor

note: if we only care about "classification" then we just need to know if some_number_foo > some_number_bar, and not worry about the denominator

###
###  why Naive Bayes is cool
###
- inference is cheap    # inference itself is NP-hard, but it's cheap in Naive Bayes form
- few parameters        # like we saw in conditional independence table, its space complexity is linear
- estimate parameters with labeled data (we can do the other direction too)

# of occurrence of ai,v    # look at how many instances of ai in a given labeled data set
P(ai|v) =  -------------------------
# of v                     # v = label. in practice, v is initialized to some super small number, to smooth.
# assumption(or we may call it inductive bias in this case) is all things are mildly possible to begin with. (avoids overfitting)

- connects inference and classification   # inference may be hard, but classification may be simpler
- empirically successsful (handles missing attributes well, because it's just all probabilistic inference. in other algo like decision tree, if you miss attributes, you might get stuck)

#######################################
####  (2)  Unsupervised Learning   ####
#######################################

###########################################
####  (2.1)  Randomized Optimization   ####
###########################################

[optimization]
- input space X
- objective function (aka fitness function) f:X -> R
- goal: find x*∈X s.t. f(x*) = max f(x)                 # i.e. argmax
x

note: x can be discrete

e.g.
- stock portfolio
- route finding
- factory, chemical composition (process control in general)
- neural networks   # pick weights that minimize error
- decision trees    # structure

###
###  optimization approaches
###

- generate & test    # useful when input space is small, or complex function
- calculus           # useful when function is differentiable, solvable
- newtons method     # hopefully function has single optimum
- hillclimbing: pick a point x, evaluate itself and its neighbors f(x-1) f(x) (x+1) and decide which direction to go, repeat until converge, i.e. reaching an optimum.

in other words, optimization can be hard if
- big input space
- complex function
- no derivative (or hard to find)
- possibly many local optima

==> "randomization" helps overcome those obstacles.

##
##  RHC : randomized hill clibming
##

e.g. pick randomly different starting points for your newton's (or hillclimb) method. and compare results.

note: one observation is to find the global optimum, its basin of attraction matters.

|          .
|         . .
|   .    .   ..
|  . .   .     .
| .   . .       .
|.     .         .
|____________________
<--a---><---b---->

a = basin of attraction for a local optimum
b = basin of attraction for the global optimum

NOTE: always improving is prone to local optimization (loosely relating to overfitting maybe) but if you allow more search/exploration, like going down in the above algebraic function, instead of always going up i.e. improving your f(x), then you may find a better solution.
this brings to the concept of annealing algorithm.

###
###  Simulated Annealing Algorithm     (one of randomized optimization algos)
###

annealing is how you can reinforce metal by heating and cooling it. (by controlling the temperature carefully, molecules line up better and therefore your metal, like a sward or whatever, strengthens)
so let's have a fitness function f(x)

1. for a finite set of iterations:
a. sample a point x, then sample a new point xt in N(x)    #  xt = N(x)   N() is a neighbor function, N(x) finds a neighbor of x
b. jump to new sample with probability given by an acceptance probability function P(x,xt,T)   # i.e. probabilistically decide whether to move x to xt
c. decrease temperature T (slowly)   # assume T > 0

P(x,xt,T) = 1  if  f(xt) >= f(x)                    # improving
= exp( [f(xt) - f(x)] / T )  otherwise    # searching/exploring

e.g.
if f(xt) - f(x) = infinitestimally small, then it's essentially e^0 = 1                             # large T causes this effectively also
if f(xt) - f(x) = a huge negative number, then it's essentially 1 / e^a_huge_pos_number = almost 0  # tiny T causes this effectively also

===> this intuitively makes sense. we call this adoption of annleaing to our optimization context: "simulated annealing"

###
###  properties of simulated annealing
###

T -> ∞ : P(x,xt,T) = e^0 = 1  so it always moves x to xt regardless of f(xt)-f(x) i.e. "random walk"
T -> 0 : e^(-∞) = 0 so it never explores, only improves. (may end up in a local optimum)

exp( f(x)/T )     # we don't go into proof
Pr(ending_at_some_point_x) = ----------------    # this is called Boltzmann distribution in physics community
Zt           # Z is just a normalization factor

###
###  Genetic Algorithms    (another randomized optimization algo)
###

conceptually,,

population of individuals

mutation : local search N(x)
crossover : like a child inheriting properties(DNA) from two individuals(parents)   # we may call it "combination" essentially.
generations : iterations of improvement

[genetic algo - sketch]

(1) P_0 = initial population of size K      # how to represent input is a topic of interest
(2) repeat the below until converge
a. compute fitness of all x
b. select "most fit" individuals (top half, weighted probability)
c. pair up individuals, replacing "least fit" individuals via crossover/mutation

note: obviously you have to define fitness(x) function, and also have to define how to select most fit individuals.
"top half" : take the top half. this kind of slection is called "truncate selection"
"weighted probability" : you pick individuals randomly, but you assign higher probability to better fitness-score individuals. (called "roulette wheel" selection)

note: because of its generic applicability, genetic algo is sometimes called the 2nd best solution to any problem

###
###  crossover example
###

x = 8-bit strings

x1: 01101100
x2: 11010111

there are many ways to crossover them.

(1) one point crossover
- choose one position to slice them, then flip flop
e.g.
x1: 0110|1100
x2: 1101|0111
offspring: 0110 0111

==> what properties(assumption/inductive bias) exist in this crossover method?
a. locality of bits   # are preserved (and assumed to matter)
b. subspaces to optimize (independently)   # related to a. locality property
# assume you have some subspace optimized already, so you can focus on other subspaces, etc

(2) uniform crossover
- randomize at each bit position
e.g.
x1: 01101100
x2: 11010111
offspring: 11001110    # this happens in real human gene reproduction

###
###  MIMIC    (mutual info maximizing input clustering)
###

two issues with those randomized optimization algos we've looked at so far.

(1) structured information
you go thru all the trouble of choosing a point, evaluating fitness, iteration, but you don't really keep track of all the info you gain along the way.

(2) unclear probability distribution
ok, annealing algo gives boltzmann distrib, which is good. but not all randomized optimization algos give that kind of nice clear probability distribution.

==> MIMIC algo directly models distribution of the underlying data (via dependency tree), and refines the model over time, thereby conveying structure.

(paper) https://www.cc.gatech.edu/~isbell/papers/isbell-mimic-nips-1997.pdf

[MIMIC algo]

P_Θ(x) = 1/z_Θ  if f(x) >= Θ    # probability of x given threshold theta Θ
0  otherwise           # z is just a normalization factor, out of all x whose f(x) >= Θ   (this point is important)

Θmin = smallest value f(x) can return
Θmax = largest value f(x) can return

P_Θmin(x) = uniform distrib     # because everything is accepted
P_Θmax(x) = optima              # only accepting the max but there can be multiple of them on the same value

MIMIC algo starts with P_Θmin(x) and aims at P_Θmax(x), as below.

[MIMIC algo - pseudo code]
1. generate (many) samples from P_Θt(x)   # kinda like 'population' in genetic algo
2. set Θt+1 to n'th percentile            # pick the best(fittest) n'th percentile out of samples from 1.  # btw,  Θt+1 == Θ_(t+1)
3. retain only those samples s.t. f(x) >= Θt+1
4. estimate P_Θt+1(x)  via dependency tree    # this is where you encode structure, sort of.
5. goto 1.

a couple of important assumptions:
- re: step 4. : given a finite set of data, we can estimate its probability distribution P_Θ(x)
- P_Θ(x) ≈ P_Θt+1(x)   # these two should be close, so that we can generate samples for Θt+1 from the samples of Θ

###
###  estimating distribution (from a finite set of data samples)
###

x = [x1 x2 x3 ... xn]
P(x) = P(x1|x2....xn) * P(x2|x3....xn) * ... * P(xn)     # joint distrib expressed in the usual formula from conditional independence

==> this is hard to compute, as it requires exponential sized table O(2^n), but becomes easy with an assumption: we always have "dependency trees"

dependency tree: a special type of Bayes Nets where network itself is a tree (i.e. every node has one parent, except for the root)
it's useful because (1) it captures relationships between variables, and (2) sampling is easy, as you just pick some parts of the tree.
_                                                                      _
Then P(x) = P(x) = Π P(xi|π(xi))    # π(xi) means xi's parent.  Π is just product, P is P-hat, hat == approximation
π                      # the root node doesn't have a parent, so just assume π(x_root) returns itself, or something
# O(n * 2^k) where k = 1, so it's O(2n) = O(n)

NOTE: conceptually, MIMIC algo just wants a simple yet effective structure to represent inter-variable probabilistic dependency of input variables, so "dependency tree" is just one simplistic way.
we certainly don't want to assume every variable is completely independent from each other, so with dependency tree, we at least allow a (simple enough) structure in which every variable probabilistically depends on (in this case strictly one) other variable. one can argue that's just too arbitrary and not good enough, but it's been proven effective empirically and theoretically. more details below.

###
###  finding dependency trees      # in such a way that it models prob distrib well
###
_
KL divergence: a measure of distance(divergence) btwn true probability distrib P and our estimation P

_                     _
D_kl(P||P_π) = Σ P[lg(P) - lg(P_π)]
= -h(p) + Σh(xi|π(xi))     # h(p) means entropy of probability distrib p.  h(xi|π(xi)) means conditional entropy of xi given its parent
= Σh(xi|π(xi))             # all we want is to find the best π() that minimizes the sum, so we can drop -h(p)
= J_π                      # we denote it like this

===> obviously, our goal is to minimize the KL divergence, i.e.   min J_π

min J_π is hard to compute, so we define a slightly different version that makes it easier to compute.
_
J_π = -Σh(xi) + Σh(xi|π(xi))     # we just added a term -Σh(xi)  which is ok because it doesn't depent on π at all.
= -Σ I(xi,π(xi))             # so in terms of minimizing π(), it's the same as
_
min J_π = max Σ I(xi,π(xi))    # maximize mutual information between xi and its parent

suppose a fully connected graph: each node is xi, and each edge is mutual information between two nodes

o---o
|\ /|
| \ |
|/ \|
o---o

===> we simply wanna pick a set of edages that form a highest total information content tree: maximum spanning tree !!

(so we successfully transformed the problem of finding the best dependency tree to a problem of finding a maximum spaning tree)
(i.e. you can negate the edge value and find a mimimum spanning tree)
(just use Prim's algo, which finds a maximum spanning tree in a densely connected graph)
(pick any node as root, then that dictates the rest as a tree)

###
###  MIMIC algo: pseudo code
###

now we know how to find the best dependency tree, let's get back to MIMIC algo

[MIMIC algo - pseudo code]                                      _
1. generate (many) samples from P_Θt(x)   # we use its estimate Pπ  represented in dependency tree
2. set Θt+1 to n'th percentile            # pick the best(fittest) n'th percentile out of samples from 1.  # btw,  Θt+1 == Θ_(t+1)
3. retain only those samples s.t. f(x) >= Θt+1
4. estimate P_Θt+1(x)  via dependency tree    # find MST over I(x,π(x))
5. goto 1.

a couple of important assumptions:
- re: step 4. : given a finite set of data, we can estimate its probability distribution P_Θ(x)
- P_Θ(x) ≈ P_Θt+1(x)   # these two should be close, so that we can generate samples for Θt+1 from the samples of Θ

note: again, dependency tree is just a simple but effective example of representing probability distrib for MIMIC, and not a must.

###
###  MIMIC: practical matters
###

- MIMIC does well with structure
- representing P_Θ vΘ
- local optima           # but you can usually avoid this because restart is free
- probability theory
- time complexity      # fewer iteration than others(random hillclimb, genetic algo, etc)
# MIMIC does well when the cost of fitness function is expensive

==> overall, MIMIC gives you a lot more structure (i.e. information) about the (probability distrib/dependency of) data we train a learner on.

###########################################################
####  (2.2)  Clustering and Expectation Maximization   ####
###########################################################

one of the classical unsupervised learning problems is "clustering" : taking a set of objects and putting them into groups.

[clustering - high level description]

given:
- set of objects X
- inter-object distances D(x,y)=D(y,x) where x,y ∈ X      # at least some meta notion of "distance" (or similarity)
# no need to satisfy the law of triangular inequality
output:
- partition Pd(x) = Pd(y) if x & y in same cluster    # Pd(a) = b  means partion an input 'a' into a cluster 'b' using distance metric 'd'

(quiz) how do we partition every input into the same cluster ?
Pd(x) = a

(quiz) how do we partition every input into its own cluster ?
Pd(x) = x

===> there are many specific clustering algorithms, so we go thru some of them.

###
###  distance (aka similarity) metrics
###

finding the right distance metric is one of the hardest tasks in clustering problems.
you need domain knowledge of your data, and some theoretical and empirical analysis.

sklearn defines them so well, so i'm gonna quote their tables here.
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html

(1) Metrics intended for real-valued vector spaces:

identifier       class name             args     distance function
---------------------------------------------------------------------
"euclidean"      EuclideanDistance      .        sqrt(sum((x - y)^2))
"manhattan"      ManhattanDistance      .        sum(|x - y|)
"chebyshev"      ChebyshevDistance      .        max(|x - y|)
"minkowski"      MinkowskiDistance      p        sum(|x - y|^p)^(1/p)
"wminkowski"     WMinkowskiDistance     p,w      sum(w * |x - y|^p)^(1/p)
"seuclidean"     SEuclideanDistance     V        sqrt(sum((x - y)^2 / V))
"mahalanobis"    MahalanobisDistance    V or VI  sqrt((x - y)' V^-1 (x - y))

(2) Metrics intended for integer-valued vector spaces: Though intended for integer-valued vectors, these are also valid metrics in the case of real-valued vectors.

identifier    class name          distance function
------------------------------------------------------------------------
"hamming"     HammingDistance     N_unequal(x, y) / N_tot
"canberra"    CanberraDistance    sum(|x - y| / (|x| + |y|))
"braycurtis"  BrayCurtisDistance  sum(|x - y|) / (sum(|x|) + sum(|y|))

(3) Metrics intended for boolean-valued vector spaces: Any nonzero entry is evaluated to “True”. In the listings below, the following abbreviations are used:

N : number of dimensions
NTT : number of dims in which both values are True
NTF : number of dims in which the first value is True, second is False
NFT : number of dims in which the first value is False, second is True
NFF : number of dims in which both values are False
NNEQ : number of non-equal dimensions, NNEQ = NTF + NFT
NNZ : number of nonzero dimensions, NNZ = NTF + NFT + NTT

identifier        class name               distance function
-------------------------------------------------------------------
“jaccard”         JaccardDistance          NNEQ / NNZ
“matching”        MatchingDistance         NNEQ / N
“dice”            DiceDistance             NNEQ / (NTT + NNZ)
“kulsinski”       KulsinskiDistance        (NNEQ + N - NTT) / (NNEQ + N)
“rogerstanimoto”  RogersTanimotoDistance   2 * NNEQ / (N + NNEQ)
“russellrao”      RussellRaoDistance       NNZ / N
“sokalmichener”   SokalMichenerDistance    2 * NNEQ / (N + NNEQ)
“sokalsneath”     SokalSneathDistance      NNEQ / (NNEQ + 0.5 * NTT)

###
###  single linkage clustering (SLC)  algo
###

[SLC algo]
- consider each object a cluster (n objects)    # i.e. Pd(x) = x
- define intercluster distance as the distance between the closest two points in the two clusters   # this step is O(n^2)
- merge two closest clusters
- repeat n-k times to make k clusters      # this step is O(n-k)

note: we took two "closest" points to define intercluster distance. but, depending on your "domain knowledge", you might choose median,average,farthest points. (then the algo has a different name, like average/median/maximal linkage clustering, etc)
note: SLC is in a class of unsupervised learning algo called "hierarchical agglomorative clustering" aka HAC
note: a tree representation of clustering is informative: it tells you which one is closer together and got clustered first

e.g. n=7, k=2    # i.e. group these n objects into k clusters

a  b       d
e   f         # probably you connect d-e first, then maybe a-b, e-g, g-f, b-c
c         g

===> notice you can represent the clustering steps/result as a tree, very informative

/\
/\      /\ \
/  \    /  \ \
/ \  \  / \  \ \
a b  c  d e  g  f

[SLC properties]
- SLC is deterministic
- running time of SLC with n objects, k clusters: O(n^3)    # we can improve with memoization, etc but still O(n^3)
- if distance is represented as edge length of a graph, then SLC algo is the same as a minimum spanning tree algo.
- it fails with an example data set like below
e.g. k=2

............
.          .
.    ..    .          . . .        # ideally we want to cluster the left and right
.    ..    .  .. . . ......        # but SLC will cluster the 4 dots in the big rectangle on the left as a cluster and the rest as the other cluster
.          .         . . ..        # other clustering algo like "K means clustering" will slove this.
............

###
###  K means clustering
###

put n objects into k clusters

[algo]
1. pick k centers (at random)    # "center" doesn't need to be one of the n objects
2. each center "claims" its closest points to form k clusters   # every n_i obj should find its closest center
3. recompute the centers by averaging the clustered points      # hence "k means"
4. repeat 2 & 3 until converge   # this iteration is counted as t

##
##  "k means" in Euclidean space
##

Pt(x) = y : partition(i.e. cluster) object x into some cluster y, t is just a counter of iteration t=1,2,3,,, so on

Ct_i : set of all points in cluster i = {x s.t. Pt(x) = i}

t
center  =   ∑ y   / |Ci|        # the center (i.e. "mean" aka "centroid") of cluster i in iteration t is simply the sum of all objects y in the cluster, divided by the number of objects in the cluster
i    y∈Ct_i               # there are k centers: so we get "k means" literally

[algo]    # just re-written in math notation. "2" just denotes euclidean

0                                       t-1  2                        t
center    --->  Pt(x) = i = argmin|| x - center  ||    ------------>  center =  ∑y   / |Ci|
i                        i               i   2   <--(t=t+1)---        i  y∈Ct_i

[step 1]                  [step 2]                                           [step 3]

##
##  "k means" as optimization problem
##

configuration (aka input): center, P
2
scores : error E(P,center) = ∑ ||center - x||      # want to minimize this error
x       P(x)    2

neighborhood : P,center = {(P',center)} ∪ {(P,center')}    # i.e. we either move the partition or move the center
# and take its ∪ = union to define neighborhood

==> notice the stunning similarity to hillclimbing algo.

when we update Pt(x), we stay in the same cluster or pick better score (less error) cluster  <==>  we evaluate neighbors and stay or go up if improve.
we take the average point of each cluster as a new center  <==> we only go up if it improves
(average is the best estimate if you want to minimize the sum of squared error, it's just calculus)

(ref) https://math.stackexchange.com/questions/1352726/minimizing-a-function-sum-of-squares

so the way we update center ensures the error is monotonically non-increasing (because we stay or improve)
also, configuration set is finite, because the input object set is finite, so the centers are defined deterministically.
also, it's possible that you may thrash back and forth between some equally optimal partitining, so you need to have a simple tie breaking rule.
because you never go into higher error configuration, basically you never repeat configurations, i.e. finite (possibly exponential) iteration, thus eventual convergence.

##
##  K-means clustering - properties
##

(1) complexity

[algo]
1. pick k centers (at random)     # O(k)
2. each center "claims" its closest points to form k clusters   #  O(nk)  or O(nkd) if you consider d dimentional points
3. recompute the centers by averaging the clustered points      #  O(n)
4. repeat 2 & 3 until converge    # well, the worst case is you explore every possible k-clustering combination of n objects, so nCk = O(n^k)   or O(n^kd) if you consider d dimentions

note: step 2 & 3 are additive, so it's just O(nk). and multiply by step 4.
thus overall,

O(nkd) * O(n^kd) = O(n^(kd+1))        # so this is much worse than SLC algo which gives O(n^3)

(ref)  https://en.wikipedia.org/wiki/K-means_clustering#Complexity

(2) problematic property
- suppose below problem. n = 6, k = 3

c d
a b                   # visually, we think it should be the clusters of {a,b}{c,d}{e,f}
e f

- step 1: suppose you pick a,b,f as your initial centers
- step 2: cluster 1 = {a}, cluster 2 = {b}, cluster 3 = {c,d,e,f}
- notice you converged, but not correct.

(random restart helps, or maybe try to pick initial random k centers in such a way that they are far apart, spread, but not rigorous enough, soft clustering is one solution)

###
###   Soft Clustering        # aka fuzzy clustering
###

abc   d   efg    # suppose n=7,k=2 then you wanna allow 'd' to kind of softly belong to both clusters

[steps]    # note this is just one conceptual approach. there are others too.
assume the data was generated by
1. select one of k Gaussian distributions [with fixed known variance σ^2] uniformly
2. sample xi from that Gaussian
3. repeat n times

task: find a hypothesis h={M1,M2,,,Mk} that maximizes the probability of the data (i.e. maximum likelihood)

note: gaussian distrib == normal distrib
note: k is the number of clusters. and we assume those clusters come from k distinct gaussian distributions = {g1,g2,,,,gk}
(by distinct, we mean we expect the mean of each gaussian distrib is different from others)
so if n>k, then obviously we pick some points from the same gaussian distrib.

note: M1 = the mean of g1
M2 = the mean of g2

note: it's really a probabilistic process

note: so the above defined task is saying let's find a hypothesis that assumes those M1,M2,,Mk means which fit the data the most.

###
###  Maximum Likelihood Gaussian
###

- suppose you have some data points like below, and you know they all come from one same Gaussian distrib.
- then how do you estimate the maximum likelihood (ML) mean of this Gaussian distrib?
- well it just happens to be the mean of the data points! easy and simple.

x x
x  x    x
x
x

- but the data points don't come from one same Gaussian distrib. they come from k distinct Gaussian distribs.
- how do we represent which data point came from which Gaussian distrib?
- use hidden variables.  # aka "latent" variables.
e.g.
x1 = <x1,z1,z2,z3,,,,zk>   # z is some random variable, all zero except one of them is 1, indicating x belongs to that Gaussian distrib (i.e. cluster)
x2 = <x2,z1,z2,z3,,,,zk>   # soft clustering  assigns probability, e.g. x2 may have z1=0.34, z2=0, z3=0.60, z4=0.01, so on
# i.e. x2 comes from cluster 1 with 34% probability, from clsuter 2 with 0% probability, from cluster 3 with 60% probability, from cluster 4 with 1%, so on

###
###  expectation maximization algo
###

[algo]  repeat between the below 2 phases until they don't change much

Σ E[Zij] xi
P(x=xi|M=Mj)                        i
E[Zij]  =   -----------------    ------>   Mj = -----------------
k                   <------           Σ E[Zij]         # these denominators are just for normalization
Σ P(x=xi|M=Mj)                        i
j=1

(define expectation of z from M)  <---->   (define max mean M from z)

note: P(x=xi|M=Mi) denotes probability of xi produced by cluster j
note: Zij denotes likelihood of xi coming from cluster j
note: E[Zij] denotes the expectation of xi coming from cluster j where  1 <= i <= n and 1 <= j <= k
using bayes rule, then we can drop P(Mj) because it's uniform, also since we are in MLE setting, E[Zij] is simply proportional to P(x=xi|M=Mj)
note: algorithmically, it's very similar to k-means clustering. in fact this becomes k-means if cluster assignments use argmax.

P(x=xi|M=Mi) = e^( -1/2 * σ^2 * (xi-Mi)^2 )

###
###  Properties of EM
###

- monotonically non-decreasing likelihood
- does not converge (practically does)
- will not diverge   # because it's within the bound of probability
- can get stuck     # just random restart to get around this
- works with any distribution (if E and M are solvable)

(ref) https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm

###
###  (desired) properties for clustering
###

P_D(x) = C    # clustering scheme(algo) that "P"artition an input x to cluster C, based on distance matrix D

here are 3 properties we desire for any clustering(partitioning) algo.

(1) richness: (expressiveness)
- for any assignment of objects to clusters, there is some distance matrix D such that P_D returns that clustering.
i.e. ∀C ∃D P_D = C    # simply saying you can find some D to produce C for any input

(2) scale-invariance
- scaling distances by a positive value does not change the clustering
i.e. ∀D ∀k>0 P_D=P_kD     # intuitive, because distance between objects are relative, so scaling them all by some positive value k shouldn't change the clustering(partitioning) result.

(3) consistency
- shrinking "intra"cluster distances and expanding "inter"cluster distances does not change the clustering
i.e. P_D = P_D'      # intuitive if you visualize
e.g.

a   b      e      # let's say cluster 1 = {a,b,c}
c      d   f    #           cluster 2 = {d,e,f}

===> shrink intra-cluster distances, and expand inter-cluster distnaces

ab          e      # hence P_D = P_D'
c          df

SLC algo that stops when                        | richness | scale-invariance | consistency
--------------------------------------------------------------------------------------------
n/2 clusters reached                          |  no      |  yes             | yes
clusters are Θ units apart                      |  yes     |  no              | yes
clusters are Θ/ω units apart where ω=max D(i,j) |  yes     |  yes             | no
i,j

###
###   impossibility theorem  (proven by Kleinberg)
###

no clustering scheme can achieve all three of:
- richness
- scale-invariance
- consistency

# read Kleinberg paper for more

#####################################
####  (2.3)  Feature Selection   ####
#####################################

- knowledge discovery, interpretability, insight
- curse of dimensionality: the amount of data you need grows exponentially to the number of features

e.g.
given 1000 features, you wanna pick the most relevant feature subset.

===> how hard is feature selection ?  # let's say you want to pick m features out of n features, where n >= m
- NP-hard problem
- worst case: have to evaluate every possible combination of feature subset = 2^n
- two known approaches: (they have obvious trade-offs)

##
##  (1) filtering
##
you do feature selection as an independent process, and feed the result(selected feature subset) to learner.
- pros : speed
- cons : ignores the learning problem
often isolated individual feature analysis only
i.e. some features are relevant in combination but not individually

e.g. your feature selection maybe Decision Tree (what with its information gain based split), then you take all features selected by DT and give it to whatever is your learner (which may be also DT or KNN or boosting, etc) because DT and other algo have different inductive bias. so it may make sense to use DT to select features and use other algo to do the learning/training.

select features on what criteria ?
- information gain
- variance, entropy   # variance as in, within a feature, how much variance of labels?
- GINI index
- independent(non-redundant)  # if a feature is linearly correlated to a combination of other features, then it's redundant to keep that feature.
note: recall we really wanna get rid of features to combat the curse of dimentionality, to help the learner.

##
##  (2) wrapping
##
your feature selection is more closely tied to learner, so you pick some features then learner feedbacks "this feature is useless, and this feature is important" etc
and you iterate thru that feedback cycle.
- pros : takes into account model/learner bias
- cons : learning speed is slow

select features on what criteria?
- hill climbing
- randomized optimization
- forward search : given feature {x1,x2,,,xn}, you give it individually to learner, and get the score for each feature, as in score(xi). pick the best (suppose x3), then pick the 2nd best (suppose x5), evaluate the choice of these two, as in score(x3,x5), if score(x3,x5) >> score(x3) then keep going. pick the 3rd best feature (suppose x2), then compare score(x3,x5) and score(x2,x3,x5). suppose they are not much different(i.e. tiny improvement), then decalare x3,x5 as the selected subset. basically, when adding the next best feature doesn't improve your overall score (score of the combination of features) much, then you stop.
- backward search: the opposite of the forward search. suppose you have five features {x1,x2,x3,x4,x5}, you let learner evaluate (i.e. assign a score to) each feature, and eliminate the worst feature(suppose x4). compoare score(x1,x2,x3,x4,x5) and score(x1,x2,x3,x5), and suppose they are not much different (i.e. only slight degradation), then you proceed. pick the 2nd worst feature (suppose x1), then compare score(x1,x2,x3,x5) and score(x2,x3,x5), and suppose it got much worse, as in score(x1,x2,x3,x5) >> score(x2,x3,x5) then we know removing x1 was not a good idea, and we assume other features are even more important (because their individual scores are higher than score(x1)) thus we declare {1,x2,x3,x5} to be the selected feature subset.

##
##  quiz : smallest set of features to guarantee zero training error
##

x1 x2 x3 x4 | y    # let's just spoil that it's AND(x1,x2) function
----------------   # x3 looks useless, in terms of information gain
0  1  1  1 | 0
0  1  1  0 | 0
1  0  1  0 | 0
1  1  1  0 | 1

if you use DT, then you only need {x1,x2}      x1
0/  \1
0  x2
0/ \1
0 1

if you use perceptron y = 1 if (W^T * x) > 0
0 otherwise

then if you only pick {x1,x2} you cannot find weight vector W that lets you express AND(x1,x2) with the above perceptron.
but if you pick {x1,x2,x3} then you can pick the below weight, for example, to satisfy. (see the above video)
W = [1 1 -1]

so x3 was considered useless but actually useful in this case. we need to define this notion of relevance VS usefulness formally.

###
###  relevance
###

(1) xi is strongly relevant if removing xi degrades Bayes Optimal Classifier(BOC) that is based on ALL features. # i.e. BOC(all features - xi) < BOC(all features)
(2) xi is weakly relevant if: x1 not strongly relevant, AND
: ∃ subset of features S s.t. adding xi to S improves BOC that is based on S. # i.e. BOC(S+xi) > BOC(S)
(3) xi is irrelevant otherwise

note: BOC is the one that takes the weighted average of all the hypotheses based on the probability of being correct hypothesis given the data. so essentially it's the best you could do (on average).

x1 x2 x3 x4 | y    # let's just spoil that it's AND(x1,x2) function
----------------   # x1 and x2 are strongly relevant
0  1  1  1 | 0    # x3 is irrelavent
0  1  1  0 | 0
1  0  1  0 | 0
1  1  1  0 | 1

to illustrate "weakly" relevant, let's add x5 = ¬x1

x1 x2 x3 x4 x5 | y    # let's spoil again that it's AND(x1,x2) function
-------------------   # x2 is strongly relevant, because no other feature, out of ALL features, can represent its info
0  1  1  1  1 | 0    # x3 is still irrelavent
0  1  1  0  1 | 0    # x1 and x5 are "weakly" relevant. because, when considering ALL features, removing x1 does not degrade BOC(all feature) because x5 can cover.
1  0  1  0  0 | 0    # but in a subset S that doesn't include x5, then x1 is relevant. hence "weakly" relevant.
1  1  1  0  0 | 1

###
###  usefulness
###

- relevance measures effect on BOC      # measures information-ness/entropy of variable,feature in a data set
- usefulness measures effect on a "particular" predictor    # measures error given some particular model(or learner,algo,classifier)

note: BOC is really just the gold/true standard of information, the best you could do if you considered all hypothesis with its probability given the data.
any other classifier has inductive bias.
so it is fair to say relevance is usefulness w.r.t. BOC

##########################################
####  (2.4)  Feature Transformation   ####
##########################################

the problem of pre-processing a set of features to create a new (smarter? more compact?) feature set, while retaining as much (relevant? useful?) information as possible:

note: by this definition, feature selection can be considered just a subset of feature transformation.
note: also, we restrict our transformation to be just linear combination of original features. (i.e. feature "linear" transformation)
note: implicit assumption(i.e. domain knowledge) is that we don't need all n features.

we already looked at examples of feature transformation (as part of some specific learning algos)
e.g.
XOR in perceptron   # non-linear
kernel methods in SVM

##
##

google has lots of documents (i.e. objects) and gets ad-hoc search words(features) from user, and has to retrieve relevant objects.
ad-hoc as opposed to knowing what search words will come beforehand.

what are our features?
- words  # full, partial (like plural to singluar, removing words like "the" "of")
- number of words

==> there are too many words
- one word may mean multiple things: polysemy      # e.g. patient. "false positive": when you type for apple, as food, but you get iphone info.
- many words may mean the same thing: synonymy     # e.g. car and automobile. "false negative": when you google "machine learning" you may not get contents on "data mining" which you'd like.

false positive: flagging something positive when it's not. e.g. giving you something as relevant search result(apple iphone) when you searched for apple as fruit.
false negative: flagging something negative when it's positive. e.g. not giving you "data mining" results when you searched for "machine learning"

==> so feature transformation should eliminate these false pos/neg. like combining words "apple fruit", or adding "machine learning" + "data mining"

##
##  variance
##

expected value: E[X] =  xΣ P(X=x) = (1/n)Σ P(X=x)
x∈Ω_X            x∈Ω_X

n       _                                                         n
Σ (xi - x)^2                                                      Σ xi
i=1                                                          _    i=1
note: variance σ(X)^2 =  --------------- = E[(X-E[X])^2] = E[X^2] - E[X]    where mean x = ------- = E[X]
n                                                              n

σ(X) = stddev(X)

k'th moment is defined as:

E[ (X-E[X])^k ]
-----------------    # different k is used to study the behavior of a random variable.
σ(X)^k

e.g.
k=1 : mean
k=2 : variance
k=4 : kurtosis    # a convenient measure of the peakedness of a distrib

###
###   Principal Components Analysis (PCA) - a linear transformation algo
###

PCA transforms variables into a new set of variables (called principal components) which are linear combination of original variables.
first component is constructed so that it accounts for the most variance(i.e. info) of the data.
2nd component is chosen so that it captures the most (remaining) variance. mathematically each component is just an engenvector, and Nth component is simply orthogonal to (N-1)th component.
it's nothing but linear algebra.

(ref) https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/
(ref) https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
(ref) https://onlinecourses.science.psu.edu/stat505/node/51

suppose 2-dimensional data: x,y
there are 2 eigenvectors, and their eigenvalues are variance.   # n-dimensional data have n eigenvectors

- PCA algo treats data as eigen problem
(1) finds direction (an eigenvector) that maximizes variance (eigenvalue)    # 1st principal component
(2) finds direction (the other eigenvector) that is orthogonal               # 2nd principal component
# if it's n-dimensional data, it keeps going to n'th principal component

y
--- |        ..
y        |  |     .. .
variance range |  |    . ...
--- |   . .
|
-------------- x
|-----|
x variance range

so combining y variance range and x variance range, you can get a vector for 1st principal component, then pick orthogonal vector you need to represent 2nd principal component.

conceptually, bigger variance means more information. so you wanna capture as much info as possible about the data.

[PCA for dimensionality reduction]
suppose you have 3-dimensional data (x,y,z), they could be like weight,height,age. you get 3 eigenvectors, and if their eigenvalues are 132,104 and 0.07 then, the third vector probably has not much information, so you can represent the data with just those 2 eigenvectors, without losing much info from the original data. and you successfully reduced dimension!

##
##  PCA properties
##
- global algorithm           # mutual orthogonal
- best reconstruction error  # no information loss
-- minimizes L2 error moving from n -> m dimentions
-- eigenvalue (non-negative, monotonically non-increasing) # if eigenvalue for a variable/feature is zero, that means it has no variance, no entropy, so it's irrelevant. (could be still useful)

note: suppose you had 2D data, and reduced it to 1D, using PCA. then you represented the original data with less dimension (less complexity, more simplicity, which may help computation/classification speed, less mem usage, etc). if you reconstruct the 2D data from 1D representation, there will be some error, as in distance(orig_data,transformed_data), we call it "reconstruction error". the goal is to reduce dimension while minimizing reconstruction error. we can say we want to reduce dimension while maximizing correlation(mutual information) btwn orig features and transformed features. if the orig data has many redundant features, then it's possible to reduce dim with zero recon error.
usually, (as described elsewhere), given N-dim orig data, it'll be the tradeoff btwn dim reduction and recon error growth. but hopefully the first few components capture the most of of eigenvalue(i.e. variance, i.e. info) like below.

| .
|  .
recon |   .                   #  the idea is hopefully  k < d
error |    .                  #  k = number of principal components that capture most of the variance of original data
|     .........         #  d = original feature dimension
|
------------------
# of components

(good ref) http://www.cs.princeton.edu/courses/archive/spr08/cos424/scribe_notes/0424.pdf

note: PCA is sensitive to the scale of measurement, so make sure you scale each variable value range.
also, applying PCA to original data, the new variables (aka principal components) you get are purely the result of variance maximizing liniear algebraic feature transformation.
so while it may help you achieve a computational benefit (dimensionality reduction, or sample complexity is smaller, etc), the new variables no longer have any interpretable meaning.
so if you still need to interpret the result (like data content, which feature meant whatnot), then PCA is not the best option.

###
###  Independent Components Analysis   (ICA)
###

- PCA is about finding correlation, by maximizing variance, which gives you best reconstruction
- ICA is about maximizing statistical independence.
(1) ICA finds linear transformation of features x, s.t. their transformed version features (let's call them y) are independent from each other.
i.e.
x1  -> y1
x2  -> y2     and yi ⊥ yj   i.e.   I(yi,yj)=0    # their mutual info is zero
x3  -> y3
.. so on

(2) ICA maximizes mutual info between original features and transformed features
i.e.
maximize I(x,y)

==> suppose

hidden variables :  y1 y2 y3 y4    # they are all independent from each other
|\/ \ /\ /\
|/\ /\|//\/\
observables : x1 x2 x3 x4 x5  #  linear combination of y1,y2,y3,y4

==> ICA attempts to transform x to y    # this problem is often called "cocktail party problem"

note: the word "latent" is often used in literature to mean "hidden" - they are the same thing.

##
##  Cocktail Party problem      # aka "blind source separation"
##

person_A  person_B  person_C    # suppose three people are talking in a cocktail party, independent of one another

mike_A   mike_B   mike_C       # suppose three microphones are there.
# mike_A, being closest to person_A, probably captures a lot of person_A's voice, then some of person_B's voice, then a little bit of person_C's voice.
# mike_B, mike_C will also have some linear combination of the voices.
# ICA separates them back to person_A to C, independently.

(ICA in-depth paper) "Independent Component Analysis:Algorithms and Applications" https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

##
##  ICA intuition
##

ICA has a fundamental assumption that the given data is a linear mixture/transform of latent(i.e. hidden, original) features which are not normally distributed. thanks to CLT, the linear mixture itself is normally distributed. so ICA attempts to undo the linear mixture by selecting linear transform(hopefully the opposite way of the mixture) that gives features (hopefully the exactl latent features) that maximize the kurtosis.
kurtosis is imply the 4th moment the data, and we know kurtosis = 3 for Gaussian distrib, so ICA basically wants to recover the original features by maximizing kurtosis-3.

note: 1st moment = mean
2nd moment = variance
3rd moment = skewness
4th moment = kurtosis

(ref) https://en.wikipedia.org/wiki/Central_limit_theorem
(ref) https://en.wikipedia.org/wiki/Kurtosis

##
## ICA:  number of components to reduce to
##
if we use ICA for dim reduction (let's say you have 5 features which can be 5 mikes that recorded 3 people, then you might wanna reduce to 3 components), how do we decide the number of components to reduce to?
you can look at
(1) reconstruction error
(2) excess kurtosis of each component
(3) mutual information

##
##  PCA vs ICA
##

- they are trying to do the same thing(i.e. capture the original data in transformed state) but have different fundamental assumptions.
- PCA is linear algebraic
- ICA is probabilistic, information theory based

|                           ICA                     |  PCA
-----------------------------------------------------------------------------------------
directional    | yes (direction of input data must be consistent)  |  no
faces         | parts(noses, eyes, hairs)                         | eigenfaces (focues on brightness/luminance, ends up finding average face, useful for reconstruction)
natural scenes  | edges (independent components are edges)          |
documents       | topics                                            |
speed           | slower                                            | faster

note: it ultimately depends on the data (if the data has independent latent variablres/features then ICA can shine).
also it depends on the use case. (real time system - ICA may be too slow, but more static system, then ICA may shine)

###
###   alternatives to PCA/ICA
###

RCA: random components analysis      T
- generates random directions!      P       #  finds some matrix to project original data onto.
x
- fast (projection & reconstruction)     # because we pick components randomly
- surprisingly works well (i.e. accurate)   # even random vectors still retain info (and like other components analysis, we get to reduce dimensionality)

LDA: linear discriminant analysis
- finds a projection that discriminates based on the label

(paper on feature transformation) "A Survey of Dimension Reduction Techniques" by Fodor   https://e-reports-ext.llnl.gov/pdf/240921.pdf

######################################
####  (2.5)  Information Theory   ####
######################################

it is a huge topic of its own, originating in physics thermodynamics, then pioneered by Claude Shannon. here we only cover the basics relevant to ML.

recall in decision tree classification, we talked about deciding a feature to split a node based on "information" gain, but what is information?
also, how can we rigorously compare input(or output) vectors(features) to compare "similarity" ?

in ML context, every input vector and output vector can be considered as prbability density function.
information theory gives us a mathematical framework to compare/analyze these probability density functions.

###
###  entropy as a measure of information
###

suppose you want to send the results of 10 coin flips.

case (1) with a normal coin: 50% head, 50% tail. so the result may looks like this: HTHHTHTTHT
case (2) with a head-only coin, then the result is always HHHHHHHHHH

how many bits do you need?    i.e. how much information?

case (1): 10 bits
case (2): 0 bit      # because you know it's always head

so "information" is really about the degree of random-ness, uncertainty: "entropy"

suppose you have an alphabet universe of only 4 letters: ABCD, with the below frequency of occurrence in communication
and think about how to encode each letter's bit pattern efficiently

letter | frequency of occurrence | encode_1 | encode_2
--------------------------------------------------------
A    |     0.5                 |   00     |   0
B    |     0.125               |   01     |   110
C    |     0.125               |   10     |   111
D    |     0.25                |   11     |   10

what's the expected size of the message per symbol ?  (aka entropy)

Σ P(s) * size(s)     # the sum of probability of s occurring  times  the number of bits per s
s                    # size(s) == log(1/P(s))   which is the formal definition. this is base-2 log()

= Σ P(s) * log(1/P(s))  =  -Σ P(s) * log(P(s))    # aka "entropy formula"
s                         s

in encode_1:  0.5*2 + 0.125*2 + 0.125*2 + 0.25*2 = 2      # well, you can tell because every symbol is encoded with two bits
in encode_2:  0.5*1 + 0.125*3 + 0.125*3 + 0.25*2 = 1.75   # improved

note: it's more intuitive to represent in binary trees

encode_1:
.
0/   \1
0/\1 0/\1
A  B C  D

encode_2:           aka "variable length encoding"
.
0/   \1
A  0/ \1
D 0/\1
B  C

###
###  information between two variables
###

[joint entropy]:   H(x,y) =     -Σ P(x,y) * log(P(x,y))
x∈Ω_X,y∈Ω_Y

[conditional entropy]:   H(y|x) =    -Σ P(x,y) * log(P(y|x))
x∈Ω_X,y∈Ω_Y

e.g.
if x & y are independent from each other,
then
H(y|x) = H(y)
H(x,y) = H(x) + H(y)

###
###  Mutual Information
###

H(y|x)    # entropy of y given x
# but how much does y depend on x ?

I(x,y) = H(y) - H(y|x)        # I(x,y) is mutual information
= H(x) - H(x|y)        # smaller I(x,y) indicates x tells a great deal about y  (or H(y) is super small to begin with)
= H(x) + H(y) - H(x,y)
= I(y,x)

###
###  quiz exercise
###

P(A) = P(B) = 0.5
P(A,B) = P(A)*P(B) = 0.25
P(A|B) = P(A) = 0.5
H(A) = -Σ P(A)*log(P(A)) = 1
H(B) = H(A) = 1
H(A,B) = -Σ P(A,B)*log(P(A,B)) = 2
H(A|B) = -Σ P(A,B)*log(P(A|B)) = 1
I(A,B) = H(A) - H(A|B) = 1-1 = 0       # since A & B are independent, no mutual info between them

P(A) = P(B) = 0.5
P(A,B) = P(A)*P(B) = 0.5
P(A|B) = P(A,B)/P(B) = 1
H(A) = -Σ P(A)*log(P(A)) = 1
H(B) = H(A) = 1
H(A,B) = -Σ P(A,B)*log(P(A,B)) = 1
H(A|B) = -Σ P(A,B)*log(P(A|B)) = 0
I(A,B) = H(A) - H(A|B) = 1-0 = 1

###
###   Kullback-Leibler Divergence  (aka KL divergence)
###

KL divergence is a measure of distance between two distributions. (distance not in geometric sense, because this doesn't have any unit nor follow the triangular inequality law, so it's really just "divergence", not distance per se)

formula:
p(x)
D(p||q) = ∫ p(x)log-----       = 0 when p(x)==q(x) i.e. two distributions are identical, then their distance is 0
q(x)         non_zero positive real number otherwise

==> because in supervised learning, we are trying to fit our model to some observed distribution.
let's say p(x) is some well known distribution, and q(x) is our sampling. if we care to sample in such a way so that p(x) ≈ q(x) then we try to minimize D(p||q)

mutal information is a particular case of KL divergence.

#######################################
####  (3) Reinforcement Learning   ####
#######################################

recall

- supervised learning:  y = f(x)    # given a lot of x,y examples, learn f()  i.e. function approximation
- unsupercised learning:  f(x)      # given a lot of x examples, learn f() that clusters/describes x
- reinforcement learning: y = f(x)  # given a lot of x,z examples and delayed y at the end, learn f()
z

###
###  RL world
###

a typical example world: grid

3-by-4 grid

+ + + G    # available actions: UP, DOWN, RIGHT, LEFT
+ W + T    # S = start, W = wall(cannot go into), T = trap(sends you back to S), G = goal(wanna reach)
S + + +    # you stay in the current spot if you try to break out of the grid

==> what's the sequence of actions that leads you to G ?

well, it's easy. we have two routes: UURRR or RRUUR

suppose we have this stochasticity/uncertainty/probability: when executing an action, there is 80% chance it executes as intended. and 20% chance going side ways (10% to left, 10% to right).

what is the (probabilistic) reliability of UURRR action sequence ?

0.8 ^ 5 = 0.32768    # this alone is not the answer, because it's possible you intended UURRR, and you took RRUUR and still reached the goal
(0.1 ^ 4) *  0.8 = 0.00008    # here is the probability of taking RRUUR when your intended action was UURRR
---------

===> we use a framework called MDP (Markov Decision Process) to deal with this.

#############################################
####  (3.1)  Markov Decision Processes   ####   aka MDP
#############################################

##
##  background
##

Markov: a russian known for his work on stochastic process.

often we want to model a real world sequential process/workflow where we have
- finite number of discrete states.
- probabilistic transition between states.
- (each state having some sort of reward/cost)

Markov Chain (aka Markov model/system): models exactly the above.
- but one fundamental assumption: next state is determined only by the "current" state.
(it is fair to argue, sometimes, the next state is determined by a few prev states, but Markov model doesn't let you define a sequential process that way.)

==> we say the current state determines the probability distribution for the next state.
e.g.
suppose we have S = {s1,s2,s3}
current_state = s1
p(next_state=s1 | current_state=s1) = 0.1
p(next_state=s2 | current_state=s1) = 0.3
p(next_state=s3 | current_state=s1) = 0.6

Hidden Markov Model (aka HMM): it's Markov Chain, except we don't always get to know the current state, only partially observable. then we probabilistically try to guess the current state (or past transition path) based on past observation/action.

(see page 21) http://www.cs.cmu.edu/~./awm/tutorials/hmm.html
(ref) https://www.cs.cmu.edu/~ggordon/780-fall07/lectures/POMDP_lecture.pdf

MDP:  we add "action" to Markov Chain.
- finite number of discrete states
- probabilistic transitions between states based on controllable actions in each state.
- (again, the same fundamental assumption, aka Markovian property) next state is determined by only current state and current action.
==> it models sequential decision making process well, like a single player game, etc. also computationally tractable to solve.
(but assumes perfect knowledge of current state. thus MDP is aka COMDP: completely observable MDP)
==> a solution to an MDP problem is a policy π(s) = a that maximizes the expected sum of reward

Partially Observable MDP (POMDP): we add "action" to HMM
- it's MDP with current state only partially observable.
- it generalizes MDP further (to model many real world engineering process/system), by allowing uncertainty of knowledge of current state.
- so POMDP builds observation model p(o|s)
- computationally intractable to solve optimally.
(we study POMDP formally in RL class)

##
##  MDP overview
##

in MDP, we define a problem with "states", "model", "actions" and "reward",  then a solution is "policy"

States: S                        # a set of possible states a player can encounter. e.g. we may use x,y coordinate to represent a state.
Model: T(s,a,s') = Pr(s'|s,a)   # aka (probabilistic) transition function that describes the rules/physics of the world. probability of transitioning to state s' if you take action a in state s
Actions: A(s), A                  # A = a set of actions, or A(s) = a set of actions you can take in state s
Reward: R(s), R(s,a), R(s,a,s')  # encompasses the domain knowledge. normally you only need to think R(s)
--------------------------------
Policy: π(s) = a                 # tells you which action to take in state s.
#  π*() is called "optimal policy" that maximizes the long term reward
# ideally we want training examples <s,a>  that tells correct action a in state s, but we don't get them in RL problems. (otherwise it'd be a classic supervised learning)
# instead we get <s,a,r> that tells taking action a in state s gives reward r

##
##  Markovian properties (assumption)
##
- "only the present matters" :so the above model/transition function T(s,a,s') only needs to know the current state s, to discuss the possibility of moving to s' given a.
- rules/model/transition functions are "stationary"   # i.e. they don't change over time

##
##  MDP: notion of "reward"
##
- delayed reward
- minor changes (of reward function, or reward assignment) matter

e.g.
you play a game of chess, and let's say you win(or lose) after 87 moves. the final outcome is your reward, but how do you assign reward for each move? it's possible i'th move was bad, or j'th move was good, etc.
==> known as "temporal credit assignment" problem

e.g.
recall this grid path finding game.

+ + + G       # G or T are termination/absorbing spots
+ W + T       # S is the start spot
S + + +       # you must take an action of either UP,LEFT,RIGHT,DOWN at each step

let's assign reward to each state as below

-0.04 -0.04 -0.04  +1
-0.04   NA  -0.04  -1
-0.04 -0.04 -0.04  -0.04

global optimal action for each state is below:

UP RIGHT RIGHT na
UP  na   UP    na
UP LEFT  LEFT LEFT    # why is the second last column LEFT ?  UP would lead to the goal quicker.
# this is because the cost of possibly going to the trap state is too big (-1) so it makes sense to go thru the longer route, collecting several -0.04s.

what if we change -0.04 to +2 ?   what about -2 ?

+2 +2 +2 +1
+2 na +2 -1    # this means it's in your interest to forever loop, collecting +2 [unless we impose some constraints like you die if you don't reach G|T within 10 actions]
+2 +2 +2 +2    # instead of reaching either of termination states of -1 or +1

note: again, these minor changes in reward function matter a lot.
i.e. reward == domain knowledge

###
###   assumptions on MDP rewards
###

(1) infinite horizons     # how much time, how many turns do you have ? finite? infinite?

policy function π that decides action a given state s

π(s) = a       # but in finite time horizon, this is not stationary
# so it makes sense to write it as π(s,t)=a   so for different value of s, you may get different action even if state has not changed!
# but we don't worry about finite time horizon in this course.

(2) utility of sequences

"stationarity of preferences"

if U(s0,s1,s2,,,,,sn) > U(s0,s1',s2',,,,sn')     #  U(sequence_of_states) decides its preference-level, i.e. the sum of all immediate and subsequent rewards in this case.
then U(s1,s2,,,,,sn) > U(s1',s2',,,sn')            #  we say this holds

∞
U(s0,s1,s2,,,,)   =  Σ R(s_i)     # naive version
i=0

∞
=  Σ [γ^i * R(s_i)]  where 0 <= γ < 1     # more sophistivated version
i=0                                     # gamma γ is discounted reward rate

∞                    Rmax
≤  Σ [γ^i * Rmax]  =  --------      # upper bound  (just a plain geometric series)
i=0                   1-γ

S1 = [1 1 2 1 1 1 2 .....]   # each sequence has an infinite length
S2 = [1 1 1 1 1 1 1 .....]

==> compare sequence S1 and S2. in naive version, U(S1) = U(S2) = ∞

##
##  geometric series (recap)
##

(ref) https://en.wikipedia.org/wiki/Geometric_series

∞                         ∞
Σ [γ^i * Rmax]  =  Rmax * Σ [γ^i]
i=1                       i=1

assume 0 <= γ < 1

∞
x = Σ [γ^i]  =  (γ^0 + γ^1 + γ^2 + ....)    # till infinity
i=0

x = γ^0 + γx
=  1  + γx       #   x - γx = 1
= 1/(1-γ)        # x(1 - γ) = 1

∞             Rmax
Rmax * Σ [γ^i]  =   ------
i=1             1-γ

###
###   policy : definition
###

E[a|b] denotes expectation of a given b.

∞
optimal policy π* = argmax E[ Σ[γ^i * R(s_i)] | π]    # expectation of total reward given π
π     i=0                      # we are using E[] because it's not always deterministic

∞
U_π(s) = E[ Σ[γ^i * R(s_i)] | π,s0=s]   != R(s)    # utility of being in a particular state s is the total reward from that state onward.
i=0                                     # i.e. U_π(s) = long term(or delayed) reward,  R(s) = short term(or immediate) reward
# so R(s) can be negative as long as U_π(s) is big positive, just like going to college.

NOTE: assume U(s) == U_π*(s) below      #  U_π*(s) = true value of being in that state

π*(s) = argmax Σ T(s,a,s') * U(s')
a   s'

U(s) = R(s) + γ max Σ T(s,a,s') * U(s')      # aka Bellman Equation   (notice its recursive nature)
a  s'

NOTE: utility function U(s) is also known as value function V(s)

###
###  policy : finding policy by "Value Iteration"   (aka VI)
###

U(s) = R(s) + γ max Σ T(s,a,s') * U(s')      # Bellman Equation : notice it's non-linear because of max()
a  s'                       # there are n unknown states, so there are n unknown U(s)
# this non-linearity is not good
s : there are n states
a : there is a finite set of actions
γ : discount factor - we assume it's known

"Value Iteration" [algo]                                                                     _
1. start with arbitrary (estimate or just purely arbitrarily initialized) utilities.  # e.g. U_π(s)  where hat means estimate
2. update utilities based on neighbors
3. repeat 2. until converge                # guaranteed to converge

_                                     _
U_t+1(s) = R(s) + γ max Σ T(s,a,s') * U_t(s')        # why does this work ?                      _
a  s'                           # because you are iteratively updating your U_π(s) with true immediate reward R(s)
# for more rigorous proof, see below PDF, page 25

(ref) https://s3.amazonaws.com/ml-class/notes/MDPIntro.pdf    # "Markov Decision Processes and Reinforcement Learning - an introduction to stochastic planning"

note: you don't need to estimate precise values of utility function U(s) because we only need U(s) to be accurate enough to give optimal policy π(s) that gives correct action a
s
note: "converge" is commonly determined by how much the value of U(s) and U'(s) changes. if it's small (or zero) then we say converge, or we may simply set the max iteration count.
note: once we converge to a particular U(), then the policy can be obtained as follows:

π*(s) = argmax Σ T(s,a,s') * U(s')
a   s'

note: normally each iteration takes an equal amount of time.

###
###  policy : finding policy by "policy iteration"   (aka PI)
###

"policy iteration" [algo]
2. evaluate: given π(s), calculate U_t+1() = U_t|π()

U_t+1(s) = R(s) + γ Σ T(s,π(s),s') * U_t(s')     # another Bellman Equation: notice max is gone, now this is linear
s'                           # this linearity is good. but still O(n^3) but there are more tricks to improve it.
# repeat this U_t+1() update until converge

3. improve: ∀s, π(s) = argmax Σ T(s,a,s') * U_t(s')     # find the best action, with one-step look-ahead, using the existing utility.
a   s'

==>  repeat 2. & 3. until π(s) converges      # guaranteed to converge. look up proof

note: notice how step 2 itself is a loop inside a bigger loop (of repeating steps 2,3)
note: notice "converge" can be tricky in stochastic environment. so a common approach is to run play the game with the latest π() for multiple trials (like 50 times), and take the average total reward. and if average total reward doesn't change much between π_t+1(s) and π_t(s), then we say converged. of course, in deterministic environment, we need to play only once.

##
##  VI versus PI (versus QLearning)
##

notice the overlapping/different aspects of the both methods.

- when the underlying MDP model is known, i.e. R() & T() are known, both VI & PI will converge to optimality.
- because PI updates both policy and utility functions in each itertion, its each iteration takes longer than VI (who only updates utility over iterations, and updates policy in one shot at the end), but precisely for the same reason, PI usually requires less iterations in total. we will see below that Q-learning's each iteration is even quicker but the number of iteration it takes to converge will be significantly larger with Q-learning.

(ref) https://people.eecs.berkeley.edu/~pabbeel/cs287-fa12/slides/mdps-exact-methods.pdf

- unlike VI, PI's each iteration (specifically step 2) may take less time as it approaches convergence.
- is either of VI or PI definitely faster ?  no. but one way to think about this is, if updating PI at each iteration helps improve policy dramatically (closer to optimal) and it converges before value function converges, then PI could overally be quicker. and of course, for some problems, we may see the opposite effect. e.g. updating policy at each iteration is not as quick-to-converge as updating only utility func and doing one-shot policy production at the end.

- as we will see next, QLearning is a fundamentally a diff approach. it assumes agent doesn't get to know R(),T(), and uses randomized optimization approach. so it doesn't have a guarantee for convergence to optimal policy. but usually it produces a comparable policy, and even quicker total time (far more iterations, but at this point, a single iteration means a diff thing btwn VI,PI,QL, so not critical to compare per-iteration-time).

##########################################
####  (3.2)  Reinforcement Learning   ####
##########################################

history:

in 1930's, psychologists conducted experiments where a rat is given a signal before cheese was fed.
and they empirically observed:
- animal sees a stimulus(i.e. state): s
- takes an action: a
- gets a reward: r
- strengthens(i.e. reinforces) its response to this stimulus.

==> computer scientists interpreted this as "reward maximization" problem, where you choose an action that maximizes reward, as a function of state.

##
##   RL API components
##

planning:    model(T,R)  --> [planner] --> policy π(s)
RL:    transitions <s,a,r,s'>*  -->  [learner]  ->  policy π(s)     # suppose we get lots of <s,a,r,s'> examples
modeler:    transitions  -->  [modeler] --> model
model:    model  -->  [simulator] --> transitions

note: model is like the rules/physics of the world. so for certain games (like chess, backgammon), the model is well defined and well known, but for real world (like financial market), model is not well defined/known.
(hence the need for model-free approach)

note: action (the effects of action on the environment, to be precise) can be deterministic or stochastic. e.g. when you say go left, agent may go left 80%, and go up 10%, stay 10%.
some literature call this "deterministic environment" "stochastic environment"
note: agent may not know T(), R() and only get the experience <s,a,r,s'>
note: important to consider COMDP vs POMDP

##
##  RL types
##

(1) RL-based planner    # this is not really an established term. some call this model-free planner

_____________________________________________
model -->| [simulator] --> transitions  -->  [learner] |-->  policy      # takes model, and produces policy
|_____________________________________________|                 # this works great for problems for which model is well known, like backgammon.
# based on the model, simulate transitions, and learners can learn policy

(2) model-based RL    # this is an established term

___________________________________
transitions -->  | [modeler] --> model --> [planner] |--> policy       # takes transitions, and produces policy
|___________________________________|                 # based on transitions, build a model, then use planner(like value,policy iterations algo) to produce policy

##
##  Three approaches to RL
##

(1) policy search algorithm:  s -> π() -> a

(2) value-function based algo:  s -> U() -> v            # U() = utility function, aka V() = value function, v = value

(3) model-based RL algo:   s,a  ->  T(),R()  ->  s',r    # T() = transition function, R() reward function

generally,

(1)                     (2)                     (3)
<-------------------------------------------------->
directly usable                                  more direct learning (you can directly get s',r from s,a)
(policy gives action which is what we want)
indirect learning                                indirectly usable (because this dosen't give you the policy to direct the best action)
(indirect because you don't get the right s,a examples )

therefore, we can follow the below steps.

(3) --solve Bellman equations--> (2) --argmax--> (1)

note: we focus on (2) where lots of celebrated research has been done.

##
##  A new kind of value function
##

recall:

U(s) = R(s) + γ max Σ [T(s,a,s') * U(s')]    # U(s) = utility of s, i.e. long term value of state s
a  s'                       # γ = discount rate, T(s,a,s') = Pr(s'|s,a),  s' = all possible next state
# recursive, non-linear, but you can use value iteration to solve.

π(s) = argmax Σ [T(s,a,s') * U(s')]          # we can define the best policy like so
a   s'

now we wanna define a new value function called Q()

Q(s,a) = R(s) + γ Σ [T(s,a,s') * max Q(s',a')]     # Q(s,a) = maximum discounted cumulative reward that can be achieved starting from taking action a in state s
s'              a'               #          aka Q value

note: we can trivially re-define U() and π() below

U(s) = max Q(s,a)
a

π(s) = argmax Q(s,a)
a

note: the beauty of Q Learning algo is, if we can learn Q(s,a) well, then we don't ever need to know R() or T() or U*(), to get π*(s)
note: notice in VI,PI, the value function was U(s), but here the value function is Q(s,a).  action is part of it.

##
##   Q learning : model free approach
##

we wanna learn Q(s,a) without models of T(),R()
i.e.
- we only have access to transitions <s,a,s',r>     # we took a in s, and got r, and landed in s'

_
we can estimate Q()  from transitions

_             _                    _
Q(s,a) = (1-α)Q(s,a) + α[r + γ max Q(s',a')]      # α = learning rate
a'

##
##   learning rate α and convergence
##

intuitively, it's how much to learn from the most recent experience. as we set the learning rate bigger, QLearner will consider the most recent experience more heavily.
e.g. if we set α=0, agent will not learn anything. if we set α=1, then only the most recent reward info will be considered.
note: α=1 is optimal in a deterministic world, but in stochastic world, α=1 will likely make Q value fluctuate/oscillate around the max reward.

so all we are doing in Q estimation is to continuously update V = (1-α)V + αX    #  Vt = (1-αt)Vt + αtXt  to be precise
#  assume Xt is drawn from some probability distrib of X
to "converge", learning rate α satisfies two properties

∞
(1)  Σ α_t = ∞
t=0

∞
(2)  Σ (αt)^2  < ∞
t=0

There is a whole range of values for α_t that meets the above properties, but an example is α_t = 1/t
(in practice, a constant α is also commonly used.)

interestingly, Vt = (1-αt)Vt + αtXt   will converge to E[X]    # essentially average of X

##
##   Q learning (cont'd)
##

_             _                    _
Q(s,a) = (1-α)Q(s,a) + α[r + γ max Q(s',a')]      # α = learning rate
a'
_
= E[r + γ max Q(s',a')]
a'
_
= R(s) + γ E[max Q(s',a')]
s' a'
_
= R(s) + γ Σ [T(s,a,s') * max Q(s',a')]     # so our estimate actually converges to the real Q() in theory
s'              a'

_
note: a problem with the above claim is Q() keeps getting updated, so it is a moving target

##
##  Q learning : convergence
##

Q starts anywhere, then Q(s,a) = (1-α)Q(s,a) + α[r + γ max Q(s',a')]      # α = learning rate
a'
_
then  Q(s,a) converges to Q(s,a)    # solution to Bellman equation !!
only IF
(1) s,a visited infinitely often
(2) α satisfies two properties    # sum to ∞, but its squared sum is less than ∞
(3) s' drawn from T(s,a,s'),  r drawn from R(s)

==> important questions comes up:
_
- how to initialize Q(s,a) ∀s,a ?
- how to decay αt ?
- how to choose actions ?   # always choose the same action ? - dumb, won't learn anything
# always choose randomly ?    - also won't learn anything
#                 _
# choose based on Q() ?  - sensible, but think about it.
# if initalization of estimate is done badly,
# then you may end up choosing the same action, or some local min/max action.
# but this problem is not new, we already discussed this in Randomized Optimization context
_
==> take "simulated annealing"-like approach. i.e. choose based on Q() but take a random action sometimes
_             _
π(s) = argmax Q(s,a)  with P(1-ε)     # neat, this is called "ε-greedy exploration"
random action  otherwise       # aka "greedy limit+infinite exploration" (GLIE)

##
##  ε-greedy exploration
##
_              _
if GLIE (decayed ε), then Q() -> Q() and π() -> π*()       #  called "exploration-exploitation dilemma"
__________     ___________       #  it's a fundamental tradeoff in RL.
(learn)         (use)           #  you have to balance btwn explore and exploit. (have to take action to learn but have to learn to take better action)
(exploration)    (exploitation)    #  otherwise you learn nothing or fall in local minima

note:
- explore refers to taking random action.  # with P(ε)
- exploit refers to choosing action based on the current Q() which it has already learnt/explored, to maximize reward.  # with P(1-ε)
- so we need to balance btwn the two by controlling ε. it is sensible to start with a big ε (more exloration) at the beginning, and gradually decay (as your Q value estimate improves).

(good ref) http://artint.info/html/ArtInt_266.html

##
##  note
##

- unless we decay α & ε, policy will always fluctuate a bit by design.
- notice QLeraning is really a randomized optimization algo. (thus it can get stuck in local optima)
- there is a whole lot more stuff to learn in RL, but we only covered fundamentals above.
e.g.
_                              _
other approach includes smarter Q() initialization. if you set Q(s,a) = the highest possible value for each s,a then you are bound to check all actions. so it's one way to force extensive exploration. (known as optimism in the face of uncertainty)
_
- to recap, in terms of hyperparameters, (1) ε, (2) α, (3) initial Q values

###############################
####  (3.3)  Game Theory   ####
###############################

so far, we've been talking about decision making for a single agent who wants to maximize reward. but there are other agents competing in the world.

- mathematics of conflict
- single agents VS multiple agents
- economics, biology
- increasingly a part of AI/ML

##
##  a simple game
##

two players, each taking turns to say a positive interger < 4, adding up, and he who says 20 wins.
-- you can pretty much list out all possible states and actions.
-- it's called "2-player zero-sum finite deterministic game of perfect information"

zero-sum: the sum of all rewards is a constant K in every strategy combination (K doesn't have to be zero, as long as it's one same constant in every possible outcome)
deterministic: no stochasticity, no probabilistic nature
strategy: what player does in possible states. if you have N states, and M possible actions to choose in each state, then you have N*M strategies. (note the number of choice in each state may not be the same. in some state you have binary choice, in other state, you may have many multiple choices.) also, notice a strategy is basically a combination of actions.

quiz: (how many strategies ?) https://www.youtube.com/watch?v=2k6rkcIMZE0

##
##  minimax
##

as we saw in the quiz, player A considers worst case counter by player B, and player B considers worst case counter by player A. and A wants to miximize its score while B wants to maximize its score (i.e. minimize A's score) so this paradigm is called "minimax"

##
##  Von Neumann theorem
##

in a "2-player zero-sum finite deterministic game of perfect information", minimax == maximin, and there always exists an optimal pure strategy for each player.

##
##  game tree - example
##

let's look at an example of "2-player zero-sum finite NON-deterministic game of perfect information"    # non-deterministic == stochastic

note: once you construct strategy matrix, you may not be able to reproduce the exact original tree but that's ok, as long as you can construct a game tree that satisfies the matrix, then it's all that matters.
note: Von Neumann theorem still applies (minimax==maximin) even after we changed the game to Non-deterministic.

##
##  minipoker
##

we further relax our game to be "2-player, zero-sum, finite NON-deterministic game of HIDDEN information"

"mini-poker":
- given players A & B
- A is dealt a card, red or black, 50% probability each. A wants black. (B cannot see what A got, hence "HIDDEN" information)
- A may resign if red: then -20 cents for A
else A chooses to hold         # assumption is if A gets black, then he always holds
then B may choose (1) to resign : +10 cents for A
or (2) to see A's card:  -40 cents (for A) if red
+30 cents (for A) if black

note: it's a simpler version of Poker

(see game tree & matrix representation of the above game flow)  https://www.youtube.com/watch?v=15oPc9T1b9c

note: with "hidden" information, minimax != maximin, Von Neumann theorem fails.
but we overcome this by using "mixed" strategies, instead of using "pure" strategies

##
##  "mixed" VS "pure" strategy
##

suppose a player has two actions to choose from in some state. so that's two strategies. (to simplify, let's assume there is no other state.)
"pure" strategy means the player always picks one strategy.
"mixed" strategy means the player picks strategy with some probability distribution. e.g. P(strategy_1) = 0.3, P(strategy_2) = 0.7

note: so yes, pure strategy is a type of mixed strategy where probability distrib is 100% on one strategy.

##
##  (iterated) prisoner's dilemma
##

suppose two criminals get arrested, and offered a choice between coop & defect.   # coop = keep silence, deny any accusation
# defect = blain the other guy did it

as each has to make a decision, they are detained in a separate cell, and cannot communicate to each other.

(player B)
|  coop  |  defect
---------------------------------------
(player A)     coop | -1,-1  |  -9,0        #  -9,0  means A gets 9 years in jail, B gets 0 years in jail
defect |  0,-9  |  -6,-6       #  -6,-6 means both A and B get 6 years in jail

so, as a group, it makes sense for both to coop, but since they cannot communicate with each other, they have to cogitate as below.
(from player A's perspective) if B coop, then which action gives A less time in jail? - defect (0 year, as opposed to 1 year)
if B defect, then which action gives A less time in jail? - defect (6 years, as opposed to 9 years)
so it always makes sense to defect.
player B will choose to defect for the same reason. so they both get 6 years in prison. aka prisoner's dilemma.

##
##  Nash Equilibrium   # equilibrium = balanced state
##

given N players, each with a set of strategy S1, S2,,, Sn     # S1 = a set of strategies available for player 1
# S2 = a set of strategies available for player 2
(each player picks a strategy)
a set of particular strategies chosen by each player: S1* ∈ S1, S2* ∈ S2,,, Sn* ∈ Sn
is in a Nash Equilibrium iff ∀i Si* = argmax utility(S1*,S2*,,,Si,,,,Sn*)
Si

i.e.  if you give each player a chance to change his strategy, nobody changes. i.e. all balanced.

note: Nash Equilibrium works for both pure and mixed strategies.

##
##   Nash Equilibrium quiz
##

find the (pure) N.E.

(player B)
|  coop  |  defect
---------------------------------------
(player A)     coop | -1,-1  |  -9,0     #  well, we already went thru this
defect |  0,-9  |  -6,-6    #  so, -6,-6 is the N.E. (i.e. dominant strategy combo)

player B
-----------------------
|  0,4  |  4,0  |  5,3
A |  4,0  |  0,4  |  5,3
|  3,5  |  3,5  |  6,6      #  6,6 is the N.E.

##
##  4 fundamental theorems of N.E.
##

(1) in the n-player pure strategy game, if elimination of strictly dominated strategies eliminates all but one combination, that combination is the unique N.E.
(2) any N.E. will survive elimination of strictly dominated strategies.
(3) if n is finite &&  ∀i Si is finite, then ∃ (either pure or mixed) N.E.     # n = number of players,  Si is the set of strategies available for player i
(4) n repeated games == n repeated N.E.     # even if you repeat the game to learn each other's pattern/behavior, doesn't change the outcome N.E.

note: it's possible for a game to have multiple N.E. and that's an active research area but beyond the scope of this class.

###
###  Uncertain End : discounting
###

let's look deeper at (4) n repeated games == n repeated N.E.

suppose you have multiple rounds of the same game of prisoner's dilemma we discussed.
the idea is if you have multiple rounds, then in earlier rounds, you get to show benign intention/pattern by cooporating, instead of defecting, so the opponent will learn to cooperate in subsequent rounds for the mutual good.
think about the very last round, because there is no subsequent round, you get greedy, and choose defect. by N.E. the other party will defect too.
then the last round outcome is already determined, then what about the 2nd last round? it's essentially the same thing, so both will defect, and so on.
hence, n repeated games == n repeated N.E.

==> but that is assuming both parties know when the last round is.
==> what about the last round is decided probabilistically? then it makes a difference !

[setup]
you start the first round, then, with probability γ, the game continues to the next round.
i.e. with P(1-γ), the game ends.  so with uncertainty, every round could be your last.

what is the expected number of rounds ?    # it's finite if γ<1

∞             1
Σ [γ^i]  =  -----
i=1           1-γ

===> this means we cannot use represent a strategy with hard-coded sequence of actions like state1:coop, state2=defect, state3=defect, etc
we need a better representation.

###
###   tit-for-tat   #  a famous IPD strategy   (IPD = iterated prisoner's dilemma)
###

(1) cooperate on first round
(2) copy opponent's previous move thereafter

we can represent it in a finite statemachine

<---(C)-----
(C)<-->C            D <-->(D)       # parenthesis () denotes opponent's move
-----(D)--->

##
##   facing TfT strategy
##

let's use the same setup.

(player B)
|  coop  |  defect
---------------------------------------
(player A)     coop | -1,-1  |  -9,0
defect |  0,-9  |  -6,-6

facing TfT opponent, what if you do the following strategies?

(1) always defect : 0 +  -6γ/(1-γ)    # the first round, is you defect, your opponent coop, so you get 0 month in prison
# the 2nd round onward, both your opponent and you defect, so that's -6 times the expected number of rounds 1/(1-γ)
# but recall it's from the 2nd round onward, so you need to multiply by γ, so you get -6γ/(1-γ)

(2) always coop : -1/(1-γ)            # -1 times the exoected number of games 1/(1-γ)

===> compare these two. they are equal when γ = 1/6         # derive γ = 1/6 by setting (1) = (2) and solving for γ
also, important to notice (1) is better when γ < 1/6
(2) is better when γ > 1/6

===> how do we generalize this?

##
##  what's the best response to a finite state strategy?
##

recall our finite state machine representation
note: we used to need the matrix when there is only one round of the game but matrix is not sufficient when there is uncertain number of rounds of the same game.

-9
-1  <---(C)-----   -6
(C)<-->C            D <-->(D)       # parenthesis () denotes your move
-----(D)--->
0

our choice impacts two things
(1) our payoff
(2) future decisions of the opponent

--> treat the above representation as MDP problem !   # which we know how to solve, using value-iteration

==> so the policy you will find will determine either
π(s)= a    # s is opponent's action C or D, which we interpret as an input state s
# and action a is our action choice of C or D

so there are only 4 combinations:

π(C) = C, π(D) = C    # always C
π(C) = D, π(D) = D    # always D
π(C) = C, π(D) = D    # this
π(C) = D, π(D) = C    # and this are essentially the same, you just loop between CDCDCDCDCDCDC...

note: MDP always has markov deterministic optimal policy.

e.g.
opponent strategy  |  your best strategy/response
--------------------------------------------------
always C      |     always D
always D      |     always D
TfT        |     always C or TfT

===> now think about which one is "mutual best reponse" ? i.e. pair of strategies where each is best reponse to other. (i.e. Nash Equilibrium)
always C + always D  ==> no, because your opponent will prefer to change to always D
always D + always D  ==> yes, it's N.E.
TfT + always C  ==> no, because your opponent will prefer to change to always D
TfT + TfT       ==> yes, it's N.E.   # notice this N.E. actually achieves "cooperative" N.E. overcoming the dilemma.

===> with TfT, we kinda changed the prisoner's dilemma game in a repeated setting, to achieve "cooperative" N.E.
let's formally define this.

###
###   Repeated Games and the Folk Theorem (intro)
###

general idea: in repeated games, the possibility of retaliation opens the door for cooperation.

in mathematics, a folk theorem means a well known. well established, provable theorem (at least to experts in the field) but not attributed to any particular individual, and no official original publication.
in game theory, the folk theorem means the set of payoffs that can result from Nash strategies in repeated games.

e.g.
let's use the usual setup,

(player B)
|  coop  |  defect
---------------------------------------
(player A)     coop | -1,-1  |  -9,0
defect |  0,-9  |  -6,-6

-9    -6    -1  0
---(c,d)-----------------    # just plotting the above table
c,c | -1
|        # connect all 4 points.
|        # that forms a polygon, which describes the "feasible region" of average payoffs of some joint strategy
d,d       | -6     # "feasible region" is an important notion
|        #
|
(d,c) -9
|

###
###   Minmax Profile        # aka minmax payoff profile
###

pair of payoffs, one for each player, that represent the payoffs that can be achieved by a player defending itself from a malicious adversary (trying to minimize your score).

note: it's zero sum game.

(player opponent)
B      S
--------------    # this game known as Back,Stravinsky (aka Battle of Sexes)
(player you)  B |  1,2    0,0     # e,g,  1,2 means you get 1 point, opponent gets 2 points
S |  0,0    2,1

Q. what is the minmax profile of BS ?

answer:  1,1     # because if you try to minimize your opponent score, you avoid a strategy that can give opponent the high score 2
# so you choose strategy S, at that point, your opponent can only score 1 at most
# vice versa, this particular game is symmetrical, so the answer is 1,1

note: "minmax profile" assumes pure strategy.
for mixed strategy, we use "security level profile"    # but some people use "mixed strategy minmax payoffs/profile"

Q. what is the security level profile of BS ?

==> suppose the opponent chooses strategy B with probability x, and thus strategy S with probability x-1
so your expected total score will be 1*x if you choose B, and 2(1-x) if you choose S
then your opponent wants to minimize your max score, if he chooses x s.t. 1*x > 2(1-x) then you choose B
if he chooses x s.t. 1*x < 2(1-x) then you choose S. so your opponent wants to min max(x,2-2x)    # hence minmax
you can plot max(x,2-2x) for x 0->1, and find where it's the lowest,
but basically you can solve this by setting x = 2-2x, so you get x=2/3
since the game is symmetric in this particular example, answer is 2/3,2/3

##
##  acceptable region      # aka preferable region
##

in Prisoner's Dilemma game, minmax payoff is -6,-6     # or whatever the penalty for both defecting

recall the "feasible" region: a region obtained by connecting 4 points (c,d)(d,d)(c,c)(d,c)

because the minmax payoff is at (d,d), any points within feasible region that can improve for either player is "acceptable" region.
i.e.  a sub region of feasible region where x≥-6 or y≥-6 will form such a region.

-9    -6    -1  0
---(c,d)-----------------
c,c | -1
|
|
d,d       | -6
|
|
(d,c) -9
|

###
###   Folk Theorem (definition)
###

any feasible payoff profile that strictly dominates the minmax/security level profile can be realized as a Nash equilibrium payoff profile, with sufficiently large discount factor.

proof: if it strictly dominates the minmax profile, can use it as a threat. better off doing what you are told!

###
###  Grim Trigger       # another example strategy like TfT
###

it's a strategy where if you cooperate, Grim will cooperate.
if you defect, Grim will always defect for the rest of the game.

state machine wise, it will look like this:

(C)<--> C ---(D)---> D <-->(D),(C)       # parenthesis () denotes your move

the idea is Grim threatens you "if you defect, I will always defect, so you better cooperate."
Grim will force you to cooperate. yes it's a Nash Equilibrium.

###
###  Implausible Threats
###

a fundamental problem with Grim Trigger is its implausible threat.
because if Grim says he will always defect if you don't cooperate, but always defecting is not benefitial to Grim either.
(imagine if somebody threatens you on the street if you don't give him your wallet, he will detonate this nuclear bomb. but if you don't cooperate he dies too, and he must consider his life more worthy than your wallet, so his threat is implausible.)

"subgame perfect" : always best response independent of history

===> so Grim doesn't feel like subgame perfect.
BUT, depending on the reward/penalty, also depending on the opponent strategy, still Grim Trigger can be subgame perfect.

So, the right way to think about subgame perfect is in an actual game setup.

Q.  is Grim Trigger VS TfT  subgame perfect?

A.  No. because suppose at one point in history(you are allowed to artificially engineer some history) if TfT chose Defect, then Grim will choose Defect forever, and that will not be subgame perfect.
it is however a N.E. because it will settle in an always cooperate loop.

Q.  is TfT vs TfT  subgame perfect?

A.  No, because suppose at one point in history (again you can make this up) TfT chose Defect, then the opponent TfT will choose Defect, then both always defect. (not subgame perfect)

###
###   Pavlov      # another strategy,  aka "win-stay, lose-shift"
###

TfT: (cooperate first time, then) repeat what opponent did
Pavlov: slightly modified TfT. (start coop)
if Pavlov cooperates and you coop, (at this point, this round of game ends), then (at next round) Pavlov coop.
if Pavlov cooperates and you defect, then (at next round) Pavlov defects.
if Pavlov defects and you coop, then (at next round) Pavlov defects.
if Pavlov defects and you defect, then (at next round) Pavlov coop.

===> aka "win-stay, lose-shift"  i.e. repeat last choice if good outcome, otherwise switch

visually, it's a finite state machine as below.

<---(D)-----
(C)<-->C            D <-->(C)       # parenthesis () denotes your move
-----(D)--->

note: recall our model of "game", you do one round where both parties decide their hands, then comes the result. then the next round. so it's not truly sequential within a single round, as you may interpret from the above state machine diagram.

Q.  is Pavlov VS Pavlov Nash equilibirum ?
A.  well yes, simply it loops in Cooperate state.

Q.  is Pavlov VS Pavlov subgame perfect ?
A.  yes

suppose (your_hand,opponent_hand)
(1) if (C,C) then both parties choose C at next round, so it'll be an infinite loop.
(2) if (D,D) then going to (1) at next round
(4) if (D,C) then going to (2) at next round
(3) if (C,D) then going to (2) at next round

===> plausible threat at work.  because pavlov strategy is effectively if you screw with me, i will screw you.

###
###  Computational Folk Theorem
###

given a 2-player bimatrix game, (average reward repeated game)
we can build Pavlov-like machines for any game.
(and using that machine) we can construct subgame perfect Nash equilibrium for any game in polynomial time.

i.e.
if it is possible for 2 players to have a mutually benefitial strategies, then we can build a Pavlov-like machine in plolynomial time (i.e. quickly).
if not possible, then the game is zero-sum like (cannot mutually benefit) but still can solve a linear program, to produce a N.E. strategy solution.
or if even that is not possible, still we can find a N.E. strategy where at most one player improves in the zero-sum like game.

note: a 2-player zero-sum game: we are famililar.
a 2-player bimatrix game is where each player has his own reward structure.
note: average reward repeated game means we repeat many times and look at average reward.

(read papers by Peter Stone, Michael Littman on this topic for more details)

###
###  type of games - zero sum VS constant sum VS general sum
###

recall the matrix of strategies and corresponding rewards.

(player 2)
|   A   |   B   |
-----------------
(player 1)  C |  1,-1 |  0,0  |        # zero sum means, reward x,y hold x+y = 0
D | -2,2  |  3,-3 |        # constant sum means x+y = some_constant
# general sum means x,y can be such that there can be win-win or lose-lose strategies.
# e.g. x,y = 1,1 or x,y=-2,-2  so on.

(ref) https://www.cs.cmu.edu/~avrim/ML14/lect0409.pdf

###
###  Stochastic Games (for multi-agent RL)
###

recall MDP gave a model for RL.
similarly,
stochastic games gives a formal model for multi-agent RL.    # turns out MDP is a particular version of stochastic game

N.E. in multi-agent RL : pair of policies s.t. no agent would prefer to change his policy.

[stochastic games notions]   # first proposed by Shapley, a nobel prize winner

S : states                 #  s ∈ S
Ai : actions for player i   #  a ∈ A1, b ∈ A2    (assuming a 2-player game)
T : transitions            #  T(s,(a,b),s')
Ri : rewards for player i   #  R1(s,(a,b)),  R2(s,(a,b))
γ : discount factor        #  assume γ is the same for all players (just the definition we stick to)

##
##   models  &  stochastic games
##

we can add constraints to stochastic games to get specific models we are familiar with.
(for simplification, assume a 2-player game)

(1)  R1 = -R2

===> zero sum game

(2)  T(s,(a,b),s') = T(s,(a,b'),s') ∀b'                # i.e. player 2's action doesn't matter
R2(s,(a,b)) = 0, R1(s,(a,b)) = R1(s,(a,b')) ∀b'   # i.e. player 2's reward is always zero
#      player 2's action doesn't affect player 1's reward
===> MDP  (where your opponent's action doesn't affect you)

(3)  |S| = 1

===> repeated game   because we always play the same state

###
###   zero-sum stochastic games      # assume a 2-player game
###

recall Bellman equation to represent a utility(aka value) function
can we define the analogous Q value function for zero-sum stochastic game ?

_                                                     _
Qi(s,(a,b))  =  Ri(s,(a,b))  +  γ∑ [T(s,(a,b),s') max Qi(s',(a',b'))]      # notice something is wrong here
s'              a',b'

===> naively assumes joint actions of max benefit will be chosen for a particular player i
===> to adopt it to zero-sum stochastic game, we need to use minimax instead

_                                                         _
Qi(s,(a,b))  =  Ri(s,(a,b))  +  γ∑ [T(s,(a,b),s') minimax Qi(s',(a',b'))]    # modified bellman equation for zero-sum stochastic game
s'                a',b'

now, recall the Q learner update equation. to solve the Bellman equation.

Q(s,a) = (1-α)Q(s,a) + α[r + γ max Q(s',a')]      # α = learning rate
a'
===> let's adopt this to zero-sum stochastic game

given an experience of a transition <s,(a,b),(r1,r2),s'>      # took (a,b) in s, got r and landed in s'

Qi(s,(a,b)) = (1-α)Q(s,(a,b)) + α[r + γ minimax Qi(s',(a',b'))]   # neat, this is aka "minimax-Q"
a',b'

===> turns out we get nice properties!!

- value iterations works
- minimax-Q converges
_
- unique solution to Q
- policies can be computed independently # because you just want to minimize your opponent's max score (this idea extends to more-than-2-player games too, as long as it's zero-sum)
- update efficient (i.e. polynomial time, because the above minimax can be solved by linear programming)
- Q functions sufficient to specify policy

###
###   general-sum stochastic games
###

general-sum means not zero-sum. so there can be win-win, lose-lose pairs of strategies, like prisoner's dilemma.

let's extend the same Q-learning to general-sum games.

instead of minimax (which was useful in zero-sum game), we use Nash equilibrium.
_                                                         _
Qi(s,(a,b))  =  Ri(s,(a,b))  +  γ∑ [T(s,(a,b),s') Nash Qi(s',(a',b'))]
s'               a',b'

Qi(s,(a,b)) = (1-α)Q(s,(a,b)) + α[r + γ Nash Qi(s',(a',b'))]
a',b'

====> called "Nash-Q"

sadly, with Nash-Q in general-sum game, we don't get any of the properties from minimax-Q in zero-sum stochastic games.

- value iteration doesn't work
- Nash-Q doesn't converge
_
- no unique solution to Q
- policies cannot be computed independently   # because N.E. is a joint behavior
- update not efficient  (unless P = PPAD)
- Q functions not sufficient to specify policy

===> doomed!  but there are lots of good ideas

(1) repeated stochastic games
- play a stochastic game (which repeats turns for players) and repeat the game itself. that allows us to build a folk theorem like ideas. already some algorithm exists for this.

(2) cheaptalk
- let players to talk/share some info, to coordinate. correlated equilibria.

(3) cognitive hierarchy
- instead of trying to solve for an equilibirum, you just assume about others, so you can apply MDP Q learning type of approaches. (turns out this is often how human plays games, and often gives a near best response)

(4) side payments (coco values)  # coco = cooperative competitive
- as in, hey i get a big reward if we take this joint actions, then i pay back to you some.

###############
###  outro  ###
###############

semi-supervised learning
spectral clustering : using sophisticated linear algebra to avoid local optima
TF-IDF : term frequency - inverse document frequency
cross entropy

########################
####    Appendix    ####
########################

###
###   normal distribution      # aka Gaussian distribution
###

(ref) https://en.wikipedia.org/wiki/Normal_distribution

informally, it's a common continuous probability distribution, where it's symmetric along the mean. and most of the data points are around the mean.

mathematically, it's expressed in terms of mean μ and variance σ^2

e^( -(x-μ)^2 / 2σ^2 )
y = Pr(x) =  ----------------------    on the domain x ∈ (-∞,∞)       # this function is called "probability density function"
sqrt(2πσ^2)

note: its shape is nothing but a bell curve. μ defines the center, and σ defines the width

###
###  Sampling Distribution
###

(ref) http://davidmlane.com/hyperstat/A11150.html      #  sampling distrib
(ref) http://davidmlane.com/hyperstat/A13660.html      #  sample distrib of the mean

let's say you have a lot of data points (can be continuous or discrete), with whatever underlying distribution.

and you pick a set of samples as below:

S_a = {s1,s2,s3,,,,,sn}     #  a sample set of size n
S_b                         #  another set
S_c                         #  more
..
..
S_infinite

note: beware of the confusing wording.
sample size n refers to how many data sample points within a sample set, like above.
and we take lots of sample sets, like S_a, S_b, S_c, so on, like above.
we assume we get infinitely many sets in the context of "sample distribution" (by definition)

for each set, let's say we compute some statistics, like the mean (or median or whatever)
so you get the below

S_a's mean
S_b's mean
S_c's mean
..
..
S_infinite's mean

===> if we plot the frequency distribution of the values of those means, we know the below properties.

##
##   properties of the sampling distribution of the mean
##

given a population with a mean of μ and a standard deviation of σ,
the sampling distrib of the mean has a mean of μ, and a stddev of

σ
σ_M = ---------     where n = sample size
sqrt(n)

σ_M is also called "standard error of the mean"    # it's really just "stddev of the sampling distrib of the mean"

this is intuitive, because if you take more data points within each sample set, then each set's mean is more likely to have a value closer to the mean of the original global population of data. i.e. less fluctuation, i.e. smaller stddev (of the sample distribution)

also, as n increases, we know the sampling distrib of the mean approaches a normal distribution (known as central limit theorem)
let's write it down as an independent note below.

###
###  Central Limit Theorem (CLT)
###
- an important fundamental theorem in statistics        # Alan Turing's master's thesis was on its proof

given "any" (even crazy) distribution of data (can be continuous or discrete), with a mean μ and a variance of σ^2,
its sampling distribution of the mean approaches a normal distribution with a mean μ and a variance of σ^2 / N, as N increases, where N = sample size

note:  variance σ^2 / N  means  stddev σ/sqrt(N)

- this is why many ML algo and scientific/engineering experiment assume normal distribution of the data they analyze.

(ref) http://davidmlane.com/hyperstat/A14043.html      # central limit theorem

###
###  Cross Validation  VS  Model Complexity
###

training your model with training data, and testing(validating) the model with test data.

there are lots of sophisticated techniques.

(ref) https://en.wikipedia.org/wiki/Cross-validation_(statistics)

(1) k-fold cross validation:
- split the data in k equal sized sets. pick 1 set as test data, and others as training data.
- repeat cv k times (hence k folds), using each set exactly once.
- out of k results, take the mean, as your final estimate.

(2) stratified k-fold cross validation
- it's a special type of k-fold cv where each set is selected to have the same mean (of response variable y) as the other sets.

"Model Complexity" analysis is when you do cross validation training/testing using a learning algo (could be any, like DT, NN, KNN, SVM, SA, MIMIC, etc), there are hyper parameters to each algo (like for DT, what depth, pruning method, what method to use to determine the best attribute/value to split on?) and if you try different sets of parameters, using train/test and pick the param set that gives the best performance on the test data, then it is considered the knowledge of the test dataset leaking into the training.
so, the idea is first prepare train+test data (usually 70/30% or 80/20%), then split the train data into train + validation data sets (like 80/20%), then do training on the train data and testing on the validation set, to determine the best parameters (GridSearchCV in scikit-learn library), then try the untouched, held-out test data set.   also note, because the initial train data is split into train+validation sets, we kind of lost the data size, so this is where k-fold CV is done.
--> and sometimes people assume this Model Complexity analysis as part of Cross Validation process.

##
##  Sample Complexity
##

it's just another way of referring to "input data size" aka "problem complexity"

##
##  "data scientist"
##

sadly nowadays we see so many fancy job titles thrown around like data scientist, ML engineer, quant stat researcher, big data/model/AI analyst, and so on.

a data scientist in one company may mean something completely different from a data scientist in other company.

so many schools now offering "analytics" "data science" masters these days.
i think people who put "machine learning" as expertise on their resume ideally should have at least one paper published in one of icml, jmlr, nips, aaai, uai, ijcai, colt, icann, or related ieee/acm journals or international conferences. you know, then we know he/she is legit.

1. 2016-10-02 16:10:20 |
2. Category : gatech
3. Page View: