CS7641 notes
the discipline has existed for many decades, but it has suddenly gained popularity recently for 2 reasons:
(1) computation power : we have now reached a point where we can meaningfully train deep neural nets
(2) data availability : we now have tons of data available to us
##
## books
##
read Tom Mitchell book
other resources: http://omscs.wikidot.com/courses:cs7641
http://www.cs.cmu.edu/~tom/mlbookchapterslides.html
http://www.cs.cmu.edu/~tom/NewChapters.html
##
## math
##
(font) http://www.alanwood.net/demos/symbol.html
(ref) http://www.cs.unm.edu/~joel/NonEuclid/noneuclidean.html
euclidean geometry: the one we are intuitively familiar with. in any finite n dimensions, points are designated by coordinates (x,y,z,etc) and any two points are connected by exactly one line.
noneuclidean geometry: e.g. spherical or hyperbolic geometry, where two points can be connected in multiple ways. (imagine a sphere: you can choose two coordinates, but there are many ways to connect them.)
(ref) https://en.wikipedia.org/wiki/Norm_(mathematics)
on an n dimensional euclidean space R^n, the length (i.e. norm) of a vector x = (x1, x2, x3,,, xn) is as follows
              n
‖x‖_2 = sqrt( Σ xi^2 )   # sometimes simply denoted as ‖x‖ or |x|
             i=1
##
## max f(x) and argmax f(x)
##    x             x
max f(x) means the maximum f(x) value
argmax f(x) means the value of x that gives the max f(x)
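a quick python sketch of the difference (f and the candidate set are made up just for illustration):

```python
# f and the candidate set are made up just to show max vs argmax.
def f(x):
    return -(x - 3) ** 2 + 10      # peaks at x = 3

candidates = [0, 1, 2, 3, 4, 5]

max_f = max(f(x) for x in candidates)     # max_x f(x): the best value of f
argmax_f = max(candidates, key=f)         # argmax_x f(x): the x achieving it

print(max_f, argmax_f)  # -> 10 3
```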
##############################
#### machine learning ####
##############################
induction: you try to figure out a general rule based on specific examples.
deduction: you try to figure out specific examples/outcome/results based on general rule.
regression: in ML/stats context, this only means "approximation"
(1) supervised learning
 given "labeled" training data X & Y, so the model can predict Y for unseen X in the future. Y is called "label"
 fair to call it "function approximation" or "function induction"
 two kinds:
 1. classification: (e.g. logistic regression) trying to classify given data X into some discrete label Y
e.g. given some data(height, weight, hair color, etc), predict gender female or male.
 2. regression: numerical (real number, continuous value) approximation.
e.g. given some data(height, weight, hair color, etc), predict age.
(2) unsupervised learning
 you try to describe/summarize "unlabeled" data into groups/clusters.
 given many animals, you might group them into mammals, insects, fish, etc
 unsupervised learning attempts to offer "data description"
(3) reinforcement learning
 like supervised learning, but with delayed rewards/results/signals.
 e.g. an agent that plays chess. your agent might make many moves until it finds out it lost. but which move was the bad one ?
 so it's harder than simple supervised learning, where you get feedback at each step telling you whether you did well or badly.
 so it's going to train some behavior that achieves/scores some goal well.
NOTE: input data type doesn't matter in distinguishing between (1),(2),(3)
###
### classification learning
###
typical vocab
 instances: input
 concept: function that maps input to output
 target concept: an actual answer function
 hypothesis class: all the possible functions you are willing to entertain.
 sample: training set (input/output pairs)
 candidate: a concept you consider to be the target concept.
 testing set: input/output pairs data set which you havent used in the training.
#################################
#### (1.1) decision trees ####
#################################
 a typical classification algorithm.
suppose you are trying to classify the race of a given person.
 nodes: a question (e.g. what eye color ? speak english ?)
 edges: answers/attributes/values (blue, red, green, etc, yes,no )
 leaf nodes: output(asian, african, caucasian, etc)
NOTE: you start from the root node.
(quiz) https://www.youtube.com/watch?v=JDLcwBR8Odc
(quiz) https://www.youtube.com/watch?v=R0OA5e1tFhA
## (from Mitchell book)
## decision tree is suitable for problems with the following characteristics
##
 discrete data (but works for continuous data if we bin them)
 requirement of expressiveness. DT can cover conjunction(&&), disjunction(||), xor
 training data may have errors/missing labels (DT is robust for these)
##
## decision trees: expressiveness
##
 boolean:
e.g. A && B
         A
      t/   \f
      B     -        # t=true, f=false
    t/  \f
    +    -
===> because A and B are commutative under &&, you can swap them in the tree, and it still works.
e.g. A || B
         A
      t/   \f
      +     B
          t/  \f
          +    -
==> again, A and B are commutative in this case, so they are swappable.
e.g. A xor B
         A
      t/   \f
      B      B
    t/  \f t/  \f
    -    + +    -
===> you need more nodes. in fact, depending on what you want to express with a decision tree, you get different complexity.
===> the first two are essentially linear O(n) in the number of variables, while an n-variable xor (parity) needs O(2^n) nodes.
===> so this means decision trees are expressive (a big hypothesis class), but it also means you need to be careful about how you search within it.
(good quiz) https://www.youtube.com/watch?v=KwMaBd3UitA
(good quiz) https://www.youtube.com/watch?v=QOdGDGjTDvI
(good quiz) https://www.youtube.com/watch?v=tm_2Ot9ud18
https://www.youtube.com/watch?v=z98Lr7105IM
NOTE: this is a fundamental problem in ML. XOR is an example of a function that is not linearly separable.
##
## ID3 algorithm
##
(doc) https://storage.googleapis.com/supplemental_media/udacityu/5414400946/ID3%20Algorithm%20for%20Decision%20Trees.pdf
 an example of a top-down decision tree algorithm
loop:
1. A = best attribute
2. assign A as the decision attribute for the node
3. for each value of A, create a descendant (edge)
4. sort training examples to the leaves
5. if examples are perfectly classified, stop. else iterate over the leaves.
===> what exactly is "best attribute" ?
there are a few well established methods. e.g. entropy, gini index, correlation, etc
the most common method is information gain (i.e. randomness reduction, entropy minimization)
given a set of examples S and an attribute A,
                           |Sv|
gain(S,A) = entropy(S) - Σ ---- * entropy(Sv)   # V(A) denotes the set of possible values (v1,v2,,,vk) for A
                        v∈V(A) |S|              # Sv is the subset of S where attribute A has value v
==> the summation part is computing the expected (weighted average) entropy that remains after splitting on A.
(definition): entropy is a measure of the expected encoding bit length.
e.g. to encode weather Sunny or Rain (two choices), you only need 1 bit
e.g. to encode N labels, you need log_2(N) bits
==> we will revisit this later.
entropy(S) = - Σ p(v) log_2 p(v)   # p(v) = proportion of v in S.
               v                   # e.g. if S consists of 14 samples each labeled with some color, and 5 of them are Blue, then p(Blue) = 5/14
                                   # if 3 of them are Green, then p(Green) = 3/14
note: the above method easily extends to continuous value attributes http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15381s06/www/DTs.pdf
e.g. we will simply choose values to split on (like if x1 < 0.5 or not) based on max info gain (aka IG)
there are also ways to extend to covering continuous value "labels" but it's harder and not the common use of DT
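the entropy and gain formulas above can be sketched in python as follows (the toy dataset and function names are mine, not from the lecture):

```python
import math
from collections import Counter

def entropy(labels):
    """entropy(S) = -sum over v of p(v) * log2 p(v)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr_index):
    """gain(S, A) = entropy(S) - sum over v of |Sv|/|S| * entropy(Sv)."""
    n = len(labels)
    remainder = 0.0
    for v in set(x[attr_index] for x in examples):
        sub = [y for x, y in zip(examples, labels) if x[attr_index] == v]
        remainder += (len(sub) / n) * entropy(sub)    # expected entropy after split
    return entropy(labels) - remainder

# toy data: attribute 0 perfectly predicts the label, attribute 1 is noise
X = [("a", "p"), ("a", "q"), ("b", "p"), ("b", "q")]
y = ["yes", "yes", "no", "no"]
print(info_gain(X, y, 0))  # -> 1.0 (perfect split)
print(info_gain(X, y, 1))  # -> 0.0 (useless split)
```

ID3 would therefore pick attribute 0 as the "best attribute" at this node.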
##
## ID3: bias
##
restriction bias: which hypothesis set "H" you care about. (our H = all possible decision trees, in this case, not some numerical equations)
preference bias: aka inductive bias: which individual hypotheses are preferred within the H you care about. aka how an algo prefers certain solutions over others.
inductive bias of the ID3 algo: (i.e. what the ID3 algo prefers)
 good splits at the top # you want the root (and nodes near it) to split the data as informatively as possible (most efficient decision tree'ing)
 correct over incorrect # obviously you want a decision tree that gives good predictions
 shorter trees (implied by the first point)
##
## decision tree: other considerations
##
1. does it make sense to repeat an attribute along a path ?
e.g.
       A
     /   \
    B
  /   \
 A           # NO, because asking A again gives zero information gain (at least for discrete-valued attributes)
2. continuous attributes? e.g. weight, distance, age
 split on ranges:
        age
      /  |  \      # e.g. 3-way split: below 20, in the 20s, over 30
     A   B   C
note: later, under node A, you might want to use the "age" attribute again, splitting within the below-20 range, like age 0-10 vs 11-19; that's valid.
3. when do we stop the loop ?
 everything classified correctly (not a good idea. sometimes impossible, also prone to overfitting)
 no more attributes left to split on
 cross validation (train and test. stop when performance on test data stops improving)
 pruning (one way to deal with overfitting)
4. using decision tree for regression ?
 yes, possible. we revisit this later.
###############################
### (1.2) Regression ###
###############################
regression == function approximation
define the error function in terms of x & y then all you need to do is find the minima/maxima
(example calculus) https://www.youtube.com/watch?v=dewLtKqna04
f(x) = c0 + c1*x + c2*x^2 + c3*x^3 + ... # c = coefficients. aka weights
 constant regression e.g. f(x) = c # the mean of y is the least-squared-error constant
 linear regression # degree 1
 polynomial regression # Nth degree polynomial
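a minimal pure-python sketch of the constant and degree-1 (linear) fits, using the closed-form least-squares solutions (the data points are made up):

```python
# least-squares fits for the constant and degree-1 cases. data is made up.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 2x + 1

# constant regression f(x) = c: the mean of y minimizes squared error
c = sum(ys) / len(ys)

# linear regression f(x) = c0 + c1*x via the normal equations
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
c1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
c0 = (sy - c1 * sx) / n                          # intercept

print(c, c0, c1)  # -> 4.0 1.0 2.0
```

higher-degree polynomial fits follow the same idea with more coefficients (bigger normal equations).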
##
## Overfitting
##
 for in-sample data, as you increase the degree of freedom (polynomial degree), the error decreases.
e |
r | \__
r |    \__
o |       \___
r |           \_____
  +---------------------
      degree of freedom
===> now, think about out-of-sample data. it's likely that up to some point, increasing the degree of freedom helps.
but past a threshold, its error increases. past that inflection point is the "overfitting" zone.
e |
r | \               __/
r |  \__         __/
o |     \_______/
r |
  +---------------------
      degree of freedom
"overfitting" definition: in-sample error decreasing while out-of-sample error is increasing
"underfitting" : over-generalization: like a too-low degree of freedom, like f(x) = mean(y)
#
# errors
#
 training data may have errors. so we may be modeling f()+e, not f()
 where does the error come from ?
 sensor error
 malice. people may lie when answering questions, etc.
 transcription error
 unmodeled influences. X:size, Y:price for houses is an obvious example, because house price is influenced by other factors/attributes such as location.
#
# cross validation
#
 you train models using training data set
 you test models using test data set
 you may arbitrarily split the data chunk into 80/20 as train/test (with/without replacement)
 obviously, if you deal with chronology-sensitive time-series data, like stock prices, you cannot train on future data and test on past data.
note: read Model Complexity notes at the end of this page.
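a minimal sketch of the 80/20 split described above (the function name and seed are my choices):

```python
import random

def train_test_split(data, test_ratio=0.2, seed=0):
    """shuffle, then split into train/test (sampling without replacement).
    for chronology-sensitive time series, slice by time instead of shuffling."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_test = int(len(data) * test_ratio)
    test = [data[i] for i in idx[:n_test]]
    train = [data[i] for i in idx[n_test:]]
    return train, test

data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))  # -> 8 2
```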
#
# IID: independently & identically distributed
#
 statisticians/ML scientists like to assume/believe IID when training/cross-validating ML models
 fundamental assumption of many supervised learning algos. the data we use for train/test is representative of the world.
#
# representation for regression
#
 how do we represent a discrete input as a numerical quantity ?
 well if it's binary, like male/female, it's easy: 1 or 0
 if it's skin color, then mapping an arbitrary number to each color doesn't work.
e.g. brown=0, black=1, white=2, yellow=4, etc. if we fit a positive coefficient, does "more yellow" mean a bigger value for Y ?
 so we have to get smarter about the representation. one way is to create many binary attributes from one discrete attribute, like white 0/1, brown 0/1, orange 0/1, so they can have independent coefficients.
(called one-hot encoding https://en.wikipedia.org/wiki/One-hot )
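a one-hot sketch (category names taken from the example above):

```python
def one_hot(value, categories):
    """map one discrete attribute into len(categories) binary attributes."""
    return [1 if value == c else 0 for c in categories]

colors = ["white", "brown", "orange"]
print(one_hot("brown", colors))  # -> [0, 1, 0]
print(one_hot("white", colors))  # -> [1, 0, 0]
```

each resulting 0/1 column then gets its own independent regression coefficient.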
####################################
#### (1.3) Neural Networks ####
####################################
in general, unlike the human brain, where neurons die and are born all the time, a neural network in ML is considered a pre-fixed architecture: the number of layers, nodes, and so on. so if you reconfigure it, you have to retrain everything.
###
### perceptron
###
a neuron unit called the perceptron, first proposed in the 1950s, can be described as follows:
 it accepts signals (vector x) and corresponding weights (vector w)
 it computes their inner product (we call it the activation), and if that is bigger than the threshold Θ, it fires a signal, as in y=1. otherwise no action, as in y=0
               k
activation a = Σ xi*wi
              i=1
if a >= Θ, then y=1, otherwise y=0
it is a linear function (a thresholded linear combination of the inputs).
(visualization) https://www.youtube.com/watch?v=VwCMm3llk1s
we can use a perceptron unit to implement boolean functions. (by configuring weights and threshold Θ)
(AND aka conjunction) https://www.youtube.com/watch?v=OrslP0iaQA
https://www.youtube.com/watch?v=QA2PatOWYTw
(OR function) https://www.youtube.com/watch?v=4zeUvG1xjwM
https://www.youtube.com/watch?v=CXijskL6MkU
(NOT function) https://www.youtube.com/watch?v=pE5SPhHzk5c # now we are increasing dimension
https://www.youtube.com/watch?v=dPzpnAEe1hg # so it's nonlinear transformation in a way
==> so we can represent anything by combining the above.
(XOR function) https://www.youtube.com/watch?v=5rAhHGimTOU
###
### perceptron training
###
 given examples, find weights that map inputs to outputs
 strictly we should learn the threshold Θ too, but we can fold Θ into the weights (the bias trick), so the math gets simpler.
(1) perceptron rule # thresholded output
(2) gradient descent (aka delta rule) # non-thresholded output
###
### perceptron rule
###
given many examples of (x1,x2,x3,,,xn ; y), you want to find weights w, so you can apply w to an unseen x vector to (hopefully accurately) predict y.
in this case, assume y ∈ {0,1}
here is the perceptron rule:
 decide some init values for weights.
input output
x1,x2,,,,xn y # example data 1
x1,x2,,,,xn y # example data 2
x1,x2,,,,xn y # example data 3
x1,x2,,,,xn y # example data 4

w1,w2,,,,wn # weights
y^ = ( Σ(wi*xi) >= Θ )   # y^ is the predicted y value (0 or 1) based on the threshold theta Θ, using the current weights, for a given input vector x
       i
==> here is the trick: we wanna treat the threshold as Θ = 0 zero to simplify the math
==> just add a negative Θ to the weight vector, and set the corresponding input to be always 1, as below.
(that constant 1 input is called the bias)
bias  input          output
1,x1,x2,,,,xn        y    # example data 1
1,x1,x2,,,,xn        y    # example data 2
1,x1,x2,,,,xn        y    # example data 3
1,x1,x2,,,,xn        y    # example data 4

-Θ,w1,w2,,,,wn            # weights (the first weight plays the role of -Θ)
repeat:
y^ = ( Σ(wi*xi) >= 0 ) # now replaced Θ theta with 0 zero
i
Δwi = R * (y - y^) * xi   # Δwi = delta wi, how much we adjust the existing wi at each iteration of perceptron-rule training
                          # take the difference y - y^, and multiply by xi * R
                          # recall y ∈ {0,1} binary, so
                          # if y == y^ , Δwi = 0, no weight adjustment needed
                          # if y=1, y^=0, then the weights need to get bigger, hence R * (+1) * xi
                          # if y=0, y^=1, then the weights need to get smaller, hence R * (-1) * xi
                          # R = learning rate # a small R learns slowly; a big R magnifies the adjustment Δwi
wi = wi + Δwi             # and finally, adjust the weights w
stop if y == y^ for all x # i.e. Δwi == 0 everywhere
===> simple yet powerful. (trivial to implement in code)
===> visualizing in a plane below, separating y=0 from y=1

\ x x
 \ x x
 \ x
o \ x
 o \ x x
 o o\

===> if there is a line that separates y=0 from y=1, then we call the data "linearly separable"
===> if the data is linearly separable, the perceptron rule will find a separating w in finitely many iterations.
===> if not linearly separable, it can run forever (bad), so we should set a stopping condition, like: if Δw is small enough, stop. or after K iterations, stop. etc.
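the perceptron rule above, sketched in python on a linearly separable toy problem (boolean AND). the function names, learning rate, and iteration cap are my choices:

```python
# perceptron rule on a linearly separable toy problem (boolean AND), y in {0, 1}.
# bias trick: prepend a constant 1 to every input so w[0] plays the role of -theta.
def train_perceptron(examples, rate=0.1, max_iters=100):
    w = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(max_iters):
        changed = False
        for x, y in examples:
            xb = [1.0] + list(x)                        # bias input
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0
            if y != y_hat:
                for i in range(len(w)):
                    w[i] += rate * (y - y_hat) * xb[i]  # delta wi = R*(y - y^)*xi
                changed = True
        if not changed:          # y == y^ for all x: done
            return w
    return w                     # max_iters guards against non-separable data

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(data)

def predict(x):
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) >= 0 else 0

print([predict(x) for x, _ in data])  # -> [0, 0, 0, 1]
```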
###
### gradient descent
###
 we wanna have an algo that is more robust to non-linearly-separable data.
    n
a = Σ wi*xi    # a = activation
   i=1
==> instead of computing a thresholded value y^ = {a >= 0} = 1 or 0, we just try to adjust w so that a becomes closer to the actual y
E(w) = 1/2 Σ (y - a)^2   # E = error metric. the square is just so neg/pos errors don't cancel each other. the 1/2 is explained below.
          x,y            # x,y ranges over all the (x,y) pairs we have as training data
==> so we just wanna know the w that minimizes E(w)
==> just take the partial derivative of E with respect to each wi
# ∂ is a greek "d" for partial "d"erivative.
# a partial derivative simply means taking the derivative of a function with respect to one variable, with the other vars kept constant. (as opposed to the total derivative, which allows all vars to vary)
∂E     ∂
--- = ---- 1/2 Σ (y - a)^2     # just use the chain rule from calculus
∂wi   ∂wi     x,y              # here the 1/2 cancels, which is why we put 1/2 in the E() definition. it gets cleaner here
    = Σ (y - a) * (∂/∂wi)( y - Σ wi'*xi' )
     x,y                      i'
    = Σ (y - a) * (-xi)        # setting this gradient to zero gives the minimum. e.g. for a constant model, a = average of y turns out to be the best estimate
     x,y
Δwi = R * (y - a) * xi         # R = learning rate. we step in the NEGATIVE gradient direction, which flips the sign back to +
wi = wi + Δwi
(ref) https://math.stackexchange.com/questions/1352726/minimizingafunctionsumofsquares
(ref) https://en.wikipedia.org/wiki/Minimum_mean_square_error
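the delta rule above, sketched in python on made-up data from a known linear function (names, learning rate, and epoch count are my choices):

```python
# delta rule on the unthresholded activation a = sum(wi*xi).
# toy data generated from y = 2x + 1; the bias trick gives inputs x = (1.0, x).
data = [((1.0, x), 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]

w = [0.0, 0.0]
R = 0.05                                             # learning rate
for _ in range(2000):                                # epochs
    for x, y in data:
        a = sum(wi * xi for wi, xi in zip(w, x))     # activation
        for i in range(len(w)):
            w[i] += R * (y - a) * x[i]               # delta wi = R*(y - a)*xi

print(round(w[0], 2), round(w[1], 2))  # -> 1.0 2.0
```

the weights recover the bias 1 and slope 2 of the generating function; a too-big R would make the updates overshoot instead of converge.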
###
### comparison of learning rules: perceptron VS gradient descent
###
perceptron rule: Δwi = R * (y - y^) * xi # guarantees convergence in finite iterations, but ONLY on linearly separable data
gradient descent rule: Δwi = R * (y - a) * xi # robust to non-linearly-separable data, but converges to a local optimum
(quiz) why not do gradient descent on y^ ? because y^ is not differentiable (not continuous in w)
https://www.youtube.com/watch?v=ymHKayqpp1E
==> what if we can make "differentiable" threshold ?
###
### Sigmoid # sig = sigma, moid = like
###
σ : lower case sigma
let's define a sigmoid function: σ(a) = 1 / (1 + e^-a) # a = activation
a → -∞ , σ(a) → 0 # as a goes to negative infinity, σ(a) goes to zero
a → +∞ , σ(a) → 1 # as a goes to positive infinity, σ(a) goes to one
===> now the thresholded value is not an abrupt 0 or 1, but a nice continuous differentiable function.
let's take the derivative of this sigmoid function.
Dσ(a) = σ(a)(1 - σ(a)) # D = derivative. yes, this is elegantly simple
visual: https://www.youtube.com/watch?v=WcDtwxi7Ick
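a quick python check of the sigmoid and its derivative identity (the numerical-difference check is my addition, just to sanity-check the formula):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))     # sigma(a) = 1 / (1 + e^-a)

def d_sigmoid(a):
    s = sigmoid(a)
    return s * (1.0 - s)                  # D sigma(a) = sigma(a) * (1 - sigma(a))

# numerically sanity-check the derivative identity at a few points
for a in [-2.0, 0.0, 3.0]:
    h = 1e-6
    numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
    print(abs(numeric - d_sigmoid(a)) < 1e-6)  # -> True (three times)
```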
###
### Neural Network Sketch
###
input layer : x vector
hidden layer: sigmoid units
output : y
x1 \
x2 --- s1 --- s4 --- s8 \
x3 --- s2 --- s5 --- ...  --- y
 .  -- s3 --- s6 --- sn /
 .            s7
xn /
==> the beauty of this is the mapping from input to output is differentiable in terms of weights, despite the non linearity of hidden layer.
"back propagation" : computationally beneficial organization of the chain rule.
> it's just saying we compute output with x and weights, and propagate the error back toward the input, to update the weights.
NOTE: this suffers many local optima. also no guarantee of finite time convergence.
visual: https://www.youtube.com/watch?v=w9OFiOaTFs8
in a way, you can consider NN == representation learning
###
### optimizing weights
###
 gradient descent suffers local optima/minima
 advanced methods:
 momentum: don't stop at a minima, keep going/searching for other minima
 higher order derivatives
 randomized optimization: to be revisited
 penalty for "complexity" e.g. regression overfitting, e.g. large tree overfitting
complexity in network:
 more nodes
 more layers
 large numbers (i.e. magnitude for weights), like a weight of 27582409238 is more complex than a weight of 1585
in a way, optimization = learning
###
### Restriction Bias
###
 gives you the restricted set of hypotheses (the bias) you are willing to consider, i.e. the representational power of the structure you are using (in this case, neural networks)
 tells you what you can/cannot represent.
perceptron: half spaces (thresholded linear functions)
sigmoids : much more complex, not much restriction
boolean functions: a network of threshold-like units
continuous functions: a hidden layer
arbitrary functions: stitch together all of the above (with more hidden layers)
==> you can represent any function ultimately, but unbounded complexity leads to overfitting.
so usually, you bound it by fixing the max number of layers, nodes, magnitude of weights.
and do the usual crossvalidation (train VS test) to stop where you need to stop.
###
### Preference Bias
###
 tells you about algorithm to select a representation over others.
 what algorithm?
e.g. how do you decide initial weights ?
 small random values
 small is preferred to big weights (less complexity, as very big weights might help you overfit)
 randomizing the initial weights helps you avoid getting stuck in the same local minimum every run. "variability"
"Occam's Razor" : entities should not be multiplied(made more complex) unnecessarily.
i.e. simplicity is preferred to complexity
#############################################
##### (1.4) Instance Based Learning ####
#############################################
 another paradigm to approach supervised learning.
 just store all instances in some DB, and so your regression result is f(x) = lookup(x)
 aka "nonparametric"
###
### KNN : K nearest neighbor (instance based regression)
###
very simple.
we keep data points (i.e. store all known instances), and use them when we predict. (aka data centric, instance based approach)
given a new X, you find K nearest known Xs in your collection of data points, then take the mean of their Y values.
if we weigh each Y based on the distance of the Xs to your target X, then it is called "Kernel regression"
(good visual) https://www.youtube.com/watch?v=CWCLQ6eu2Do
(good visual) https://www.youtube.com/watch?v=ZhJTGBbR18o
note: what defines "nearest" / "distance" ? conceptually, "distance" == "similarity"

Given: training data D = {xi,yi}
number of neighbors k # domain knowledge
query point q
distance metric d(q,x) # domain knowledge
NN = {i: d(q,xi) k smallest} # NN is a set of k smallest(=nearest) neighbors
return
 classification : vote (most frequent, or random) yi ∈ NN # flexibility in implementing "vote"
 regression : mean of yi ∈ NN # flexibility in implementing "mean" like weighted mean, etc

note: in practice, you can tweak the logic a bit. e.g. what if you set k=5, but you find 8 equally nearest neighbors? then maybe get all 8 of them, so you took "k or greater than k nearest neighbors"
(exercise) https://www.youtube.com/watch?v=X8tm6x2k_gQ
(quiz) https://www.youtube.com/watch?v=1t6Wm37e1VM
                          | running time   | space
--------------------------+----------------+-------
1NN               learn   | O(1)           | O(n)
                  query   | O(log(n))      | O(1)
KNN               learn   | O(1)           | O(n)
                  query   | O(log(n) + k)  | O(1)
linear regression learn   | O(n)           | O(1)
                  query   | O(1)           | O(1)
# the O(log(n)) query times assume the stored instances are kept sorted/indexed
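the KNN pseudocode above, sketched in python with euclidean distance (this naive version scans all stored points per query rather than doing a sorted/indexed lookup):

```python
import math
from collections import Counter

def knn_predict(train, q, k, mode="classification"):
    """train: list of (x, y) pairs. q: query point. d(q, x) = euclidean distance."""
    nearest = sorted(train, key=lambda xy: math.dist(q, xy[0]))[:k]
    ys = [y for _, y in nearest]
    if mode == "classification":
        return Counter(ys).most_common(1)[0][0]   # vote: most frequent label
    return sum(ys) / len(ys)                      # regression: mean of the ys

train = [((0.0, 0.0), "o"), ((0.0, 1.0), "o"),
         ((5.0, 5.0), "x"), ((6.0, 5.0), "x")]
print(knn_predict(train, (0.5, 0.5), k=3))  # -> o
```

the distance metric and the vote/mean logic are exactly the two "domain knowledge" knobs the pseudocode calls out.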
###
### parametric VS nonparametric regression
###
parametric: # linear regression
 no need to store data, space efficient,
 but when new evidence comes, you have to redo the whole training
 thus training is slow, but query is fast # faster than BDD
nonparametric: # KNN
 need to store data, space consuming
 but able to quickly add new evidence
 training is fast (no need to learn new parameters), query can be "potentially" slow
 suitable for complex problems where we are not sure of the underlying mathematical model
###
### KNN bias (assumption)
###
 preference bias = our belief about what makes a good hypothesis # i.e. the ways we express our preference across various hypotheses
 locality: near points are similar
 smoothness: averaging
 all features matter equally
###
### curse of dimensionality
###
 as the number of features/dimensions grows, the amount of data we need to generalize accurately grows exponentially.
 applies to ML in general, not just to KNN method
O(n^d)
##
## to recap
##
 distance metric d(q,x)
 e.g. euclidean, manhattan, weighted version, mismatch (if it's discrete instances like color)
 k
 how to determine the k value (you kinda have to know something about the data)
 how to process the k chosen instances. weighted average? locally weighted linear regression?
note: in "locally weighted linear regression" you fit a line to the k nearest neighbors around each query point, across your whole universe of instances; when you stitch together the locally approximated functions from each neighborhood, you get a complex yet effective approximate function.
this is conceptually significant, because you went from H = {lines} to {lines} ⊂ H # H = hypothesis space
#################################################
##### (1.5) Ensemble Learning : Boosting ####
#################################################
ensemble learning : a class of learning algos.
e.g.
 bagging
 boosting
(1) learn a (relatively simple) rule over a subset of data
learn another rule over another subset of data
repeat
(2) combine them in "some" way to make one effective/complex rule
# note: if you learn one rule over the whole data at once, you won't get a simple rule; instead you might get something complex, which causes overfitting. (more on this later)
# important questions:            |  "bagging"          |  "boosting"
 Q. how do we decide a subset ?   |  uniformly randomly |  hardest examples
 Q. how do we combine ?           |  average            |  weighted average
(quiz) https://www.youtube.com/watch?v=Lc5QcpEdySE
###
### bagging
###
(example) https://www.youtube.com/watch?v=Cn7StaXU_8o
 suppose you have 20 data points
 you take 5 subsets, each subset has 5 uniformly randomly chosen data points
 for each subset, you do 3rd order polynomial regression training (so you get five models as a result)
 then for any given test data point, you do the average of the output from each model.
(called "bagging")
===> this tends to perform better (on test data) than a single 4th order polynomial trained on all the data, which overfits more.
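a bagging sketch in python. to stay self-contained it uses constant regression (the subset's mean) as the base learner instead of the 3rd order polynomial from the example; names, sizes, and the seed are my choices:

```python
import random

# bagging: fit several base learners on random subsets, average their outputs.
def bag_fit(data, n_models, subset_size, fit, seed=0):
    """fit n_models base learners, each on a uniformly random subset."""
    rng = random.Random(seed)
    return [fit([rng.choice(data) for _ in range(subset_size)])
            for _ in range(n_models)]

def bag_predict(models, x):
    return sum(m(x) for m in models) / len(models)   # average the outputs

def fit_mean(subset):
    """deliberately simple base learner: predict the subset's mean y."""
    mean = sum(y for _, y in subset) / len(subset)
    return lambda x: mean

data = [(x, 2.0 * x) for x in range(20)]             # 20 made-up data points
models = bag_fit(data, n_models=5, subset_size=5, fit=fit_mean)
pred = bag_predict(models, 10)                       # ensemble average at x=10
```

swapping `fit_mean` for a polynomial fitter gives the 5-subsets-of-5 setup described above.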
###
### terminology
###
"error": essentially mismatch. but we wanna define this more formally.
E = Σ(yy^)^2 # y = correct label, y^ = prediction
RMSE = sqrt(Σ(yy^)^2 / N)
==> however, it is not good enough. see below.
e.g.
suppose you get 4 test data points: x1 x2 x3 x4 (and associated y1 y2 y3 y4)
and your model predicts Y correctly for x1 and x3
==> we may say it's 50%, or we may calc RMSE
==> suppose the following frequency of data.
x1 : 20%
x2 : 5%
x3 : 70%
x4 : 5%
==> then getting x1 & x3 means getting 90% right, as opposed to 50% or whatever RMSE you calculated. so we address data distrib D
Pr_D[h(x) != c(x)] # given distribution (= D) of data, probability of hypothesis (= y^) not matching concept (= y)
# i.e. error rate that takes distribution of data into account
"weak learner" : learner that does better than chance (= 50%) regardless of distribution
∀D, Pr_D[h(x) != c(x)] <= 1/2 - ϵ   # ∀D means for all D
                                    # ϵ = epsilon, some small number
e.g. your learner may consist of multiple rules/hypotheses; if you can find some distribution over the input data x1 x2 .. xn under which it does not beat a 50% error rate, then it is not a weak learner.
(video) https://www.youtube.com/watch?v=ljIO4uuVR8
###
### boosting
###
(1) given a training set {xi,yi} where yi in {-1,+1} # binary classification
(2) for t = 1 to T
 construct Dt # distribution at a given time t
 find weak classifier h_t(x) with small error ϵ_t = Pr_Dt[h_t(x) != yi] # 0 < ϵ_t <= 1/2
 output H_final
e.g.
suppose we have n examples x1 x2 x3 ... xn
first, start with t=1
D_1(i) = 1/n   # uniform distribution (because at this point, you have no notion of which example is harder/more important than the others)
               # obviously, Σ D_t(i) = 1
               #            i
D_t+1(i) = [ D_t(i) * e^(-α_t * yi * h_t(xi)) ] / Zt   # note h_t(xi) will predict +1 or -1
                                                       # notice if h_t(xi) == yi, then the product yi * h_t(xi) will be +1
where α_t = 1/2 * ln( (1 - ϵ_t) / ϵ_t )   # alpha is always a positive value. the smaller the error ϵ_t, the bigger α_t, and we use α_t to control Zt
where Zt = whatever normalization constant at time t, which you need to define to ensure Σ D_t+1(i) = 1
                                                                                        i
Zt = Σ [ D_t(i) * e^(-α_t * yi * h_t(xi)) ]
     i
==> so you reduce the distrib for an example if your classifier gets it correct
i.e. you increase the distrib for an example if your classifier gets it wrong
==> reduce/increase by how much ? that's exactly where your ϵ_t comes into the picture.
the examples you get with bigger error rate gets increased in distrib more.
H_final(x) = sgn(Σ α_t * h_t(x)) # sgn(x) is just a sign function that extracts the sign of the input real number
t # so you get an overall consensus, basically weighting smaller error (more correct) hypothesis bigger
# because α_t is bigger for hypothesis with smaller error
# in other words, α_t is a confidence indicator. (to be revisited)
(paper that explains α_t logic) https://www.cs.princeton.edu/courses/archive/spring07/cos424/papers/boostingsurvey.pdf
in short, α_t is defined the way it is because it has been mathematically proven to provide very quick and effective learning, by assigning appropriate weight to the erroneously predicted instances. (we are only considering binary classification here; for real-valued estimation, α_t gets more complicated, see the paper)
(intuitive example) https://www.youtube.com/watch?v=u1MXf5N3wYU
https://www.youtube.com/watch?v=zodDmURwMHY
https://www.youtube.com/watch?v=OWjjAP4noUE
suppose you have three examples (x1 x2 x3). again we assume binary classification, i.e. yi in {-1,+1}
at the first iteration, let's say you got x1 & x2 right, and x3 wrong. # 66.6% correct, so that's better than chance
at the second iteration, because you increased the weight on x3, you get x3 correct, and at least one of x1 or x2 correct, to stay over 50% correct. let's say you got x1 wrong and x2 correct.
at the third iteration, your weights on x3 & x1 are far bigger than on x2, so you will get x1 & x3 right, and may or may not get x2 right. as you can see, your error is decreasing: the distribution on x1 & x3 has been increased while the weight on x2 has been decreased relative to previous iterations, and you got x1 & x3 right.
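the boosting loop above, sketched in python with decision stumps on a made-up 1-D dataset (the stump hypothesis class and the data are mine; the D, α, and Z updates follow the formulas above):

```python
import math

# boosting sketch for yi in {-1, +1}, with decision stumps on 1-D inputs.
# the dataset and the stump class are made up for illustration; no single
# stump fits Y, but the weighted combination does.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
Y = [+1, +1, -1, -1, +1, -1]

def stumps():
    # threshold rules: +1 on one side of t, -1 on the other (both polarities)
    for t in [0.5, 1.5, 2.5, 3.5, 4.5]:
        yield lambda x, t=t: +1 if x < t else -1
        yield lambda x, t=t: -1 if x < t else +1

def adaboost(X, Y, T=60):
    n = len(X)
    D = [1.0 / n] * n                                  # D_1(i) = 1/n, uniform
    H = []                                             # list of (alpha_t, h_t)
    for _ in range(T):
        # find the weak classifier h_t with the smallest weighted error eps_t
        h, eps = min(((h, sum(D[i] for i in range(n) if h(X[i]) != Y[i]))
                      for h in stumps()), key=lambda pair: pair[1])
        eps = max(eps, 1e-10)                          # guard against eps = 0
        alpha = 0.5 * math.log((1 - eps) / eps)        # alpha_t = 1/2 ln((1-eps)/eps)
        H.append((alpha, h))
        # D_{t+1}(i) = D_t(i) * exp(-alpha * yi * h(xi)) / Z_t
        D = [D[i] * math.exp(-alpha * Y[i] * h(X[i])) for i in range(n)]
        Z = sum(D)                                     # normalization constant Z_t
        D = [d / Z for d in D]
    return lambda x: 1 if sum(a * h(x) for a, h in H) >= 0 else -1  # H_final = sgn

H_final = adaboost(X, Y)
print([H_final(x) for x in X])  # -> [1, 1, -1, -1, 1, -1]
```

notice the misclassified examples get their distribution weight multiplied up each round, exactly as in the D_t+1 formula.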
##
## error VS confidence
##
we tend to analyze learning performance in terms of error, but also we should take advantage of confidence info.
e.g.
in KNN, if all K neighbors agree on one answer, then it's more confident than only some of K voting on the answer.
in Boosting, α_t is confidence indicator
also, for other algo, we can repeat and measure variance.
###
### boosting & overfitting
###
 boosting is not very prone to overfitting, at least not in the traditional sense where error on training data decreases while error on test data increases.
(to be discussed in the next section on SVM)
############################################
#### (1.6) Support Vector Machines ####
############################################
SVM is a classifier algorithm, trained based on labeled data sets (i.e. supervised learning) that classifies examples by finding an optimal hyperplane that linearly separates examples.
(intuitive example quiz) https://www.youtube.com/watch?v=IVaT503Nel4
here, using a 2D plane, trying to linearly separate o & +: you can draw a line close to the cluster of "+", or close to the cluster of "o", but it feels more generalized if you can draw a line that is right in between, i.e. as far away from the data as possible while of course remaining consistent with the data.
in fact, the more optimally "right in the middle" you draw the line, the more generalized it is and the less it overfits.
 + + + +
 o + + + ++
 o o ++ ++
 ooo o + +
o o oo o +
__________________
To formalize with algebra, let's generalize to hyperplane. # but to keep intuitively thinking about this in the above 2D plane, you can substitute hyperplane as "line"
given labelled training set (x1,y1),(x2,y2),,,(xn,yn)
y = w^T * x + b   # y = label # we assume yi in {-1,+1} in this case
                  # x = training data/examples
                  # w,b = parameters of the plane. w = weight vector, b = bias
                  # w^T is just w transposed, so you can take the inner product of the w & x vectors
                  # to be precise, you may write this as yi = w^T * xi + b
so the hyperplane (the line right in between, furthest from both clusters in the above 2D visual) should be: w^T * x + b = 0
==> so (again, speaking in 2D for the sake of the intuitive example), anything on one side will be classified as "+", and the other side as "o"
the further away that hyperplane is from the training examples, the more "generalized" we consider the classifier to be.
to define that optimal hyperplane (or line in 2D), you first define hyperplanes (two parallel lines in 2D) at the boundary of the training examples.
for maximum generalization of the classifier (the optimal hyperplane), we want the two boundary hyperplanes that maximize the distance (we call it the "margin") between them, giving the most generalized (optimal) hyperplane in the middle.
the reason for two parallel lines is that otherwise you cannot define the middle line that stays furthest from both boundary lines.
the hyperplane that is:
 the closest to the border of the "+" cluster will have w^T * x + b = +1 # this is the definition
 the closest to the border of the "o" cluster will have w^T * x + b = -1 # you just have to accept them
(ref) http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html
(ref) https://nlp.stanford.edu/IR-book/html/htmledition/support-vector-machines-the-linearly-separable-case-1.html
(ref) http://web.mit.edu/6.034/wwwbob/svm.pdf # idiot's guide to SVM
Definition: # from the ref
define the hyperplanes H such that
w^T * xi + b <= -1 when yi = -1
w^T * xi + b >= +1 when yi = +1
H1 and H2 are the planes: # two boundary planes for us to find the middle plane
H1 : w^T * xi + b = +1 # sometimes (like Tom Mitchell)
H2 : w^T * xi + b = -1 # we use -a +a, instead of -1 +1 # a is for arbitrary
(ref) https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kernels_SVM2_04_12_2011ann.pdf
H0 is the (optimal) hyperplane in between: w^T * xi + b = 0
the distance (called "margin") between H1 & H2 is: 2 / ||w|| # or 2a / ||w||
+ + ++ + ++
+ ++ + + +
+ + + + + +
----+------------- H1 # here, examples on H1, H2 are called "support vectors"

------------------ H0 # margin = 2 / ||w||

--o-o------------- H2
o o o o
oo o o o
o o o o
#
# note : how we get 2 / ||w||
#
this is nothing but just linear algebra in high dimension.
what is the distance between a given point and a line?
a point: (m,n)
a line : A*x + B*y + c = 0
distance d = |A*m + B*n + c| / sqrt(A^2 + B^2) # the denominator is the L2 norm
(ref) https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line
(ref) http://www.intmath.com/plane-analytic-geometry/perpendicular-distance-point-line.php # very good
(ref) https://en.wikipedia.org/wiki/Norm_(mathematics)
e.g.
given a line: y = A*x + C
what is the distance to the origin (0,0)?
d = |C| / sqrt(A^2 + 1)
(ref) http://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf
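The distance formula above is easy to sanity-check numerically; here is a minimal Python sketch (the function name `point_line_distance` is ours, not from the lecture):

```python
import math

def point_line_distance(A, B, c, m, n):
    """Distance from the point (m, n) to the line A*x + B*y + c = 0."""
    return abs(A * m + B * n + c) / math.sqrt(A ** 2 + B ** 2)

# distance from the origin to the line y = 2*x + 5, rewritten as 2*x - y + 5 = 0
d = point_line_distance(2, -1, 5, 0, 0)
print(d)  # 5 / sqrt(5) = sqrt(5) ≈ 2.236
```

the same formula, applied to the hyperplane w^T * x + b = 0 in n dimensions, is what gives the 2 / ||w|| margin below.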
###
### Support Vector Machines
###
maximize the margin 2 / ||w|| while classifying every training example correctly.
note: "classifying every training example correctly" can be written as
w^T * xi + b <= -1 when yi = -1
w^T * xi + b >= +1 when yi = +1
==> that can be simplified by multiplying by yi, as below.
yi * (w^T * xi + b) >= 1 ∀i # ∀i means for all i
also, maximizing 2 / ||w|| means minimizing ||w||
so we can rewrite the entire statement as follows:
minimize ||w|| while yi * (w^T * xi + b) >= 1 ∀i
mathematically, there is a technique called quadratic programming (from convex optimization) that can solve this minimization(optimization) problem. (but we don't go into the details in this course)
maximize Q(α) = Σ_i αi - 1/2 * Σ_ij αi * αj * yi * yj * xi^T * xj # i,j both range over the training examples
s.t. αi >= 0 and Σ_i (αi * yi) = 0 # s.t. means subject to
note: α is just a new set of parameters we use for quadratic programming
(ref) https://en.wikipedia.org/wiki/Quadratic_programming
note:
xi^T * xj is just a dot product, a number that shows the length of one vector projected onto the other.
in general, given two vectors xi,xj # in Q(α), both i and j index training examples; at prediction time a test point is compared against the support vectors the same way
 if they are orthogonal (i.e. perpendicular), then xi^T * xj = 0 # dissimilar
 if they point to the same direction (i.e. parallel), then xi^T * xj = big # veeery similar
 if they point to the opposite direction, then xi^T * xj = negative number # veeery dissimilar
===> this intuition that the dot product (aka inner product) provides some measure of "similarity" is the key
(ref) http://web.mit.edu/6.034/wwwbob/svm.pdf # page 17 ~, excellent visuals
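The three cases above can be checked with a few lines of Python (plain lists, no libraries; the `dot` helper is ours, not lecture code):

```python
def dot(u, v):
    # inner product of two vectors represented as lists
    return sum(a * b for a, b in zip(u, v))

print(dot([1, 0], [0, 1]))    # orthogonal -> 0 (dissimilar)
print(dot([2, 3], [4, 6]))    # same direction -> 26 (very similar)
print(dot([2, 3], [-2, -3]))  # opposite direction -> -13 (very dissimilar)
```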
there are 2 cases:
case 1: xi,xj are dissimilar, then their dot product is 0, so they don't contribute to Q(α)
case 2: xi,xj are similar
 and share the same class label: the term decreases Q(α) # yi*yj = +1 and the term is subtracted; the algo downgrades similar vectors that make the same prediction
 and have different class labels: the term increases Q(α) # yi*yj = -1; the algo wants to keep similar vectors that tell the two classes apart, i.e. it is maximizing the margin
right now, we are just using xi^T * xj as our similarity measure, but as we will see below, for problems that are not linearly separable, we will use a transform function φ() to make xi,xj linearly separable. (called the kernel trick)
so our similarity measure will change
from : xi^T * xj
to : φ(xi)^T * φ(xj) # we simply rewrite this as K(xi,xj) # K = Kernel
note: the definition of φ() differs problem by problem.
###
### properties of SVM
###
1. w = ∑ (αi * yi * xi) # once w is obtained, b is easy to obtain
i
2. αi mostly 0 # most of the alphas are zero, i.e. only a few xi matter
==> only the xi that are support vectors matter.
###
### non linearly separable problems
###
so we know how to find the optimal hyperplane for linearly separable problem.
e.g.
++ +
oo + +
o oo + +
o oo +
ooo
how about not linearly separable problems?
e.g.
++ +
oo + + o
o oo + o+
o oo +
ooo
===> we can still come up with a way to define outliers, and find the hyperplane that separates examples with some notion of minimized error.
then, how about the example below?
ooooo
o +++ o
o +++++ o
o +++ o
ooooo
we use "kernel trick" (aka kernel function) to transform data points to another space (often higher dimension) where it becomes linearly separable.
in this case with the above visual example, we kind of wanna draw a circle to separate o from +
a circle is where x^2 + y^2 = r^2 for some constant r
now let's design our kernel function for this problem.
given a point q = (q1,q2)
φ(q) = <q1^2, q2^2, sqrt(2)*q1*q2> # φ() is just Phi, a transform function
e.g.
ooooo φ(o) φ(+)φ(+)
o +++ o φ(o) φ(+) φ(+)
o +++++ o ===> φ(o) φ(+)
o +++ o φ(o) φ(o) φ(+)
ooooo φ(o) φ(+)
given two points: x and y # i.e. x = (x1,x2), y = (y1,y2) # previously, we used xi,xj instead of x,y
φ(x)^T * φ(y) = <x1^2, x2^2, sqrt(2) * x1 * x2>^T * <y1^2, y2^2, sqrt(2) * y1 * y2>
= x1^2 * y1^2 + x2^2 * y2^2 + 2 * x1 * x2 * y1 * y2
= (x1 * y1 + x2 * y2)^2
= (x^T * y)^2 # yes, this is essentially an equation for circle
= K(x,y) # we write this as K(x,y)
note: so we got K(x,y) = φ(x)^T * φ(y)
= (x^T * y)^2
===> as long as we have K(x,y) = (x^T * y)^2 , we don't even need to know/compute φ() # in fact φ() is often unknown or expensive to compute
this is the neat aspect of kernel trick
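The identity K(x,y) = φ(x)^T * φ(y) = (x^T * y)^2 can be verified numerically; a small sketch using the φ() defined above:

```python
import math

def phi(q):
    # the transform from the notes: φ(q) = <q1^2, q2^2, sqrt(2)*q1*q2>
    q1, q2 = q
    return (q1 ** 2, q2 ** 2, math.sqrt(2) * q1 * q2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def K(x, y):
    # the kernel, computed directly in the original 2D space
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)
print(dot(phi(x), phi(y)))  # ≈ 16.0 -- via the explicit 3D transform
print(K(x, y))              # 16.0  -- same value, without ever computing φ()
```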
what makes a good kernel function K()? well that's where you inject your "domain knowledge" of what constitutes "similarity" that separates + and o
e.g.
K = x^T * y
K = (x^T * y)^2
K = (x^T * y + c)^p # p for polynomial
K = exp( -||x-y||^2 / (2σ^2) ) # radial basis function (gaussian)
K = tanh(β * x^T * y - σ) # sigmoid
note: SVM with a radial basis function has infinite model capacity, so it can overfit.
recall the SVM problem in quadratic programming form:
maximize Q(α) = Σ_i αi - 1/2 * Σ_ij αi * αj * yi * yj * xi^T * xj
we can rewrite it as:
maximize Q(α) = Σ_i αi - 1/2 * Σ_ij αi * αj * yi * yj * K(xi, xj)
note: kernels can be defined over almost anything. not just vectors. but also strings, trees, structures, etc etc
===> extremely powerful
===> SVM is used to compare DNA, sentence structures, etc
"Mercer Condition": the kernel must act as a well-behaved distance (or similarity) function
###
### Boosting revisited
###
 recall we claimed boosting is not vulnerable to overfitting (at least not in the traditional sense, where error on training data decreases while error on test data starts increasing)
 recall our boosting algo, the final hypothesis
H_final(x) = sgn(Σ_t α_t * h_t(x)) # sgn(x) is just a sign function that extracts the sign of the input real number
# so you get an overall consensus, basically weighting lower-error (more correct) hypotheses more
==> look at the inside of sgn()
Σ_t α_t * h_t(x) # α_t is bigger for h_t() with smaller error
# so α_t can be seen as a confidence indicator
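As a toy illustration of the weighted vote inside sgn(), here is a sketch with made-up weak hypotheses and made-up α_t values (none of these numbers come from an actual boosting run):

```python
def sgn(v):
    # sign function: +1 or -1
    return 1 if v >= 0 else -1

# hypothetical 1-D weak hypotheses (decision stumps) and their weights
h1 = lambda x: 1 if x > 2 else -1
h2 = lambda x: 1 if x > 5 else -1
h3 = lambda x: 1 if x > 8 else -1
alphas = [0.9, 0.6, 0.3]  # bigger alpha = lower-error (more trusted) hypothesis

def H_final(x):
    # weighted consensus of the weak hypotheses
    return sgn(sum(a * h(x) for a, h in zip(alphas, (h1, h2, h3))))

print(H_final(6))  # h1,h2 vote +1 (weight 1.5) vs h3 votes -1 (weight 0.3) -> +1
print(H_final(1))  # all three vote -1 -> -1
```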
suppose you have H_final(x) that predicts everything correctly.
among the examples correctly predicted, there are high-confidence ones, and not-so-confident ones.
suppose your boosting learner prediction is binary classification of "+" or "o"
e.g.
o oo o o o + + ++ +++ +

confident notconfident confident
as you keep training the boosting learner more, it assigns bigger α_t for h_t() that correctly predict the not-so-confident examples.
e.g.
oooo oo ++++ ++++

confident notconfident confident
==> it may reduce the confidence for some of the already-super-confident examples, but at least it increases the confidence on the very-unconfident ones.
and the overall error rate stays the same while confidence increases.
also notice the distance between the least confident "+" and the least confident "o" gets wider.
this can be seen as the "margin" from SVM.
the bigger the margin, the better the generalization, the less the overfitting !!
and this is an intuitive explanation of why boosting is not prone to overfitting
(it's not completely intuitive so read the paper for the formal proof)
Still, however, Boosting is known to (tend to) overfit when:
 when the underlying learners/hypotheses all overfit (typically, each is something like a neural network with many layers and nodes); if they all overfit (and all perfectly predict the training data) then you cannot stop the ensemble from overfitting.
 when data has pink noise (= uniform noise) # white noise = gaussian noise
#################################################
#### (1.7) Computational Learning Theory ####
#################################################
 defining learning problems.
 showing specific algorithms work.
 showing these problems are fundamentally hard.
theory of computing analyzes how algorithms use resources:
 time
 space
 samples/examples/data # i.e. how much data does the algo require ?
###
### inductive learning: learning from examples, to derive general form
###
important questions in inductive learning algo.
1. probability of successful training : 1-δ where δ is the probability of failure
2. number of examples to train on
3. complexity of hypothesis class
4. accuracy to which target concept is approximated: ε
5. manner in which training examples are presented: batch or online(one by one) ?
6. manner in which training examples are selected
###
### selecting training examples
###
1. learner selects x, and ask for y
2. teacher/feeder select x and tells y
3. fixed distribution: x chosen from D by nature
4. worst distrib: evil
quiz:
H: set of possible people(i.e. hypothesis)
X: set of yesno questions
if teacher (who knows the answer) chooses X, then only 1 question is needed: "is this person XYZ (the correct answer person)?" then the answer is yes, done.
if learner (who doesn't know the answer) chooses X, then it has to pick a question like "is this person A or B or C or D or... ?" selecting half of H, then the rest is binary search. thus log_2|H| questions.
==> but the above teacher scenario is unrealistic. usually there are some "constraints".
###
### teacher with constrained queries
###
e.g.
X: x1,x2,,,xk # k-bit input
H: conjunctions of literals or their negations
e.g.
k = 5
h: x1 && x3 && !x5
 deduce the h value for each input X vector
x1 x2 x3 x4 x5  h

0 1 0 1 1  ? # it's 0
1 1 1 0 0  ? # it's 1
1 0 0 1 0  ? # it's 0
==> then, given the below table, how do you induce the right h ? (could be more than one)
x1 x2 x3 x4 x5  h

1 0 1 1 0  1
0 0 0 1 0  1
1 1 1 1 0  0
1 0 1 0 0  0
1 0 1 1 1  0
==> compare the first two rows, then you see x1 and x3 are irrelevant to the output h value 1
h: positive  absent  negated # this means 3^k potential hypotheses
x1:            +
x2:                     +
x3:            +
x4:    +
x5:                     +
==> so we got h: !x2 && x4 && !x5
how many questions does teacher need to get the above answer h ?
(1) show what's irrelevant: 2
(2) show what's relevant: k
===> so the answer is k + 2 # linear time !
###
### learner with constrained queries
###
e.g.
X: x1,x2,,,xk # k-bit input
H: conjunctions of literals or their negations
e.g.
k = 5
now you have to come up with input X then you can learn h.
but what X values should you ask to induce h ?
(x1,x2,x3,x4,x5) = (0,0,0,0,0) ?
(x1,x2,x3,x4,x5) = (1,0,1,0,0) ?
..
..
x1 x2 x3 x4 x5  h

? ? ? ? ? 
? ? ? ? ? 
? ? ? ? ? 
? ? ? ? ? 
..
..
==> you need to ask O(2^k) questions. # there are 2^k possible input X
# exponential sample complexity is bad
==> but it's because of the harsh constraints. if we were able to ask questions like "is x1 positive? " then we could induce h more quickly, like in linear time.
so the constraints matter.
###
### learner with mistake bounds
###
(1) input arrives # doesn't matter how the input is selected (randomly, by the learner, by a benevolent/malicious teacher, etc)
(2) learner guesses answer # no need to know about hypothesis space
(3) wrong answer charged # learner learns
(4) go to (1)
==> bound the total number of mistakes
[candidate elimination algo] # illustrated in Mitchell book
(1) assume it's possible each variable positive and negated
(2) given input, compute output
(2.5) eliminate impossible hypothesis # essentially the same as (3)
(3) if wrong, set positive variables that were 0 to absent, and negated variables that were 1 to absent, goto (2)
e.g.
start with this.
h: positive  absent  negated
x1:    +                +
x2:    +                +
x3:    +                +
x4:    +                +
x5:    +                +
but if you compute output based on your current hypothesis: (x1 && !x1 && x2 && !x2 && x3 && !x3 && x4 && !x4 && x5 && !x5)
==> this will only produce 0 for any input because you are taking the AND of every variable being positive and negated.
that means, for the learner to make a chargeable mistake, you need an input example whose correct output is 1
here is an input. (any prior input with the correct output = 0 doesn't change the H table)
x1 x2 x3 x4 x5  h

1 0 1 1 0  1 # so correct output is 1
following (2.5) logic, we can eliminate some possibility
h: positive  absent  negated
x1:    +       +             # given the input, x1 can still be positive or absent, but we know it's not negated
x2:            +        +    # similarly update the rest
x3:    +       +
x4:    +       +
x5:            +        +
actually, if you skip (2.5) and go straight to (3) then you get the below, which is essentially the same.
it says we know each variable is either positive/negated and/or absent, then ignore absent.
h: positive  absent  negated
x1:    +       +
x2:            +        +
x3:    +       +
x4:    +       +
x5:            +        +
===> so we essentially eliminated half of the possible literals.
(so far this is the first "mistake")
here is a next input.
x1 x2 x3 x4 x5  h

1 0 1 1 1  1
==> following the logic (3), you update x5 in the H table, as below.
h: positive  absent  negated
x1:    +       +
x2:            +        +
x3:    +       +
x4:    +       +
x5:            +             # now we definitely know x5 is absent
===> this is the 2nd "mistake"
===> repeat.
==> how many mistakes do we need ?
well, the first mistake eliminated half of the literals, and after that we need at most k more mistakes to pin down the variables x1 ~ xk.
i.e. it never makes more than k+1 mistakes
note: this learning algo will converge to a correct hypothesis, assuming (1) no error in the training sample, (2) there is at least some h ∈ H s.t. h(x) = c(x). otherwise the algo will converge to an empty hypothesis set.
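The candidate-elimination / mistake-bound procedure above can be sketched in a few lines of Python (the representation and helper names are ours; the target conjunction is made up for the demo):

```python
def make_learner(k):
    # literal (i, True) = "xi positive", (i, False) = "xi negated";
    # start by assuming every variable could be both (step 1)
    possible = {(i, s) for i in range(k) for s in (True, False)}

    def predict(x):
        # h = AND of all remaining literals; initially contradictory, so always 0
        return int(all((x[i] == 1) == s for i, s in possible))

    def learn(x):
        # called on a mistake (we predicted 0, true label was 1):
        # drop every literal this positive example violates (step 3)
        nonlocal possible
        possible = {(i, s) for i, s in possible if (x[i] == 1) == s}

    return predict, learn

# hidden target concept: h(x) = x1 && !x2 (0-indexed), with k = 3 variables
target = lambda x: int(x[0] == 1 and x[1] == 0)
predict, learn = make_learner(3)

mistakes = 0
for x in [(1, 0, 0), (1, 0, 1), (0, 1, 0), (1, 1, 1), (1, 0, 0)]:
    if predict(x) != target(x):
        mistakes += 1
        learn(x)  # the only possible mistake is predicting 0 on a positive

print(mistakes)  # 2 here; never more than k + 1 = 4
```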
###
### definitions
###
computational complexity : how much computational effort is needed for a learner to converge to an answer?
sample complexity : (in case of batch) how many training examples are needed for a learner to create a successful hypothesis?
mistake bounds : how many misclassifications can a learner make over an infinite run ?
training set : S ⊆ X, and c(x) ∀x ∈ S
==> it's just saying out of ALL input space X, we only get a sample subset S, and for every x in S, we know y == c(x)
hypothesis space : H
true hypothesis : c ∈ H # c = concept, i.e. y
candidate hypothesis : h ∈ H
"consistent" learner : produces c(x) = h(x) ∀x ∈ S
(more formally)
A learner L is a consistent learner if L outputs hypotheses that perfectly fit the training data.
note: a consistent learner always outputs a hypothesis from the version space
note: for a consistent learner, if H is finite, we can bound the number of training examples needed for a concept to be PAC-learnable by L. (more details to follow)
version space: VS(S) = {h s.t. h ∈ H consistent wrt S} # i.e. a set of hypothesis consistent with examples
i.e. # also denoted as VS(S,H)
VS(S) = {h ∈ H | h(x) = c(x) ∀x ∈ S}
e.g.
c: target concept # well, it's just the XOR function, in this case
x1 x2  c(x) aka y

0 0  0
0 1  1
1 0  1
1 1  0
suppose, S (i.e. training set) is below
x1 x2  c(x)

0 0  0
1 0  1
==> what is VS(S) ?
H = {x1, !x1, x2, !x2, x1 && x2, x1 || x2, x1 ^ x2, x1==x2} # let's say H consists of 8 hypotheses
VS(S) = {x1, x1 || x2, x1 ^ x2} # 3 of them are all consistent wrt S
==> obviously, we know in this case (x1 ^ x2) is the answer, but as far as S goes, then VS(S) is the above.
(good quiz) https://www.youtube.com/watch?v=3xv0tf3voow
##
## a simplistic VS algo (from Mitchell book)
##
aka list-then-eliminate algo
1. VS = H
2. for each training example <x,c(x)>
remove from VS any h where h(x) != c(x)
3. output VS
==> this only works if H is finite. if H is huge, then VS might get huge, which is not ideal.
see "learner with mistake bounds" for a more elaborate approach.
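The list-then-eliminate algorithm on the XOR example above can be written directly (the hypothesis names/lambdas are our encoding of the 8-member H):

```python
# H from the example above: 8 boolean hypotheses over (x1, x2)
H = {
    "x1":     lambda x1, x2: x1,
    "!x1":    lambda x1, x2: 1 - x1,
    "x2":     lambda x1, x2: x2,
    "!x2":    lambda x1, x2: 1 - x2,
    "x1&&x2": lambda x1, x2: x1 & x2,
    "x1||x2": lambda x1, x2: x1 | x2,
    "x1^x2":  lambda x1, x2: x1 ^ x2,
    "x1==x2": lambda x1, x2: int(x1 == x2),
}

S = [((0, 0), 0), ((1, 0), 1)]  # the training set from the example

# list-then-eliminate: start with VS = H, drop any h inconsistent with an example
VS = {name for name, h in H.items() if all(h(*x) == c for x, c in S)}
print(sorted(VS))  # ['x1', 'x1^x2', 'x1||x2']
```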
##
## a more compact representation for Version Space (from Mitchell book p32)
##
define:
D: sample training data
G: the most general (aka "maximally general") hypothesis group, consistent with D
S: the most specific (aka "maximally specific") hypothesis group, consistent with D
formally:
G = {h ∈ H | consistent(h,D) ∧ (¬∃h' ∈ H)[(h' >g h) ∧ consistent(h',D)]}
S = {h ∈ H | consistent(h,D) ∧ (¬∃h' ∈ H)[(h >g h') ∧ consistent(h',D)]}
VS(H,D) = {h ∈ H | (∃s ∈ S)(∃g ∈ G)(g >=g h >=g s)}
note: when a learner is asking for new training examples (or a helpful teacher is giving more training examples), good examples are the ones that help shrink the version space.
##
## partially learned concepts (from Mitchell book)
##
suppose your VS contains multiple hypotheses, you haven't yet got enough training examples to converge to one h, and you have to classify some new test data x_test.
==> it is sometimes possible to classify x_test with the same confidence as you would with a fully converged VS containing a single h.
e.g.
if every h ∈ VS gives h(x_test)=1
more specifically, you don't need to check every h ∈ VS, you only need to check every h ∈ S where S = the most specific hypothesis group in VS.
generally speaking,
if a majority of h ∈ VS classify h(x_test)=1, then you are fairly confident.
if a minority does, then not so confident.
##
## inductive bias  formally defined (from Mitchell book)
##
let L(xi,Dc) denote the classification that learner L assigns to test data xi after training on sample data Dc
L(xi,Dc) is inductively inferred from (Dc ∧ xi)
but without some sort of assumptions, L may completely overfit or simply memorize (rote-learn) every example, and never generalize to unobserved data. so there should be some sort of assumption. we call it inductive bias.
intuitively, it's a set of assumptions(preferences) so that the learning algorithm works(effectively) at all.
[definition]
consider a concept learning algorithm L for the set of instances X.
let c be an arbitrary concept defined over X, and let Dc = {<x,c(x)>} be an arbitrary set of training examples of c.
let L(xi,Dc) denote the classification assigned to the instance xi by L after training on the data Dc.
the inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc:
(∀xi ∈ X)[L(xi,Dc) follows deductively from (B ∧ Dc ∧ xi)]
to recap,
L(xi,Dc) is inductively inferred from (Dc ∧ xi)
and (∀xi ∈ X)[L(xi,Dc) follows deductively from (B ∧ Dc ∧ xi)]
such B is called inductive bias
###
### Error of hypothesis h
###
(1) training error: fraction of training examples misclassified by h
(2) test error: fraction of testing examples misclassified by h
(3) true error: fraction of examples that would be misclassified on sample drawn from D # D = some distrib, i.e. we are talking infinite data set
Definition: the true error of a hypothesis h w.r.t a target concept c and an instance distribution D is the probability that h will misclassify an instance drawn at random according to D.
we write this as below:
error_D(h) = Pr_x~D[c(x) != h(x)] # very simple, error of h in some distrib D is the probability of c(x) != h(x) for x drawn from D
===> so when we think about penalizing error, we penalize misclassification of examples that have higher probability under D more than that of rare examples.
(pretty much the same idea from the boosting algo)
note: the best h() you pick may not be identical to c() but it's possible at least for training examples you get, you find such h() s.t. h(x)=c(x) ∀x ∈ S
note: true error is a theoretical probability. often in the real world, we don't fully know the distribution D, and don't have infinite training examples. but if we can estimate D well, and have sufficiently many examples, then we can learn an accurate-enough true error. let's formally define "accurate enough" & "sufficiently many".
###
### PAC (probably approximately correct) learning
###
C : a class of target concepts
L : learner, i.e. algorithm
H : the hypothesis space of L
n : |H| = size of hypothesis space
D : distribution over inputs
0 <= ε <= 1/2 : error goal
0 <= δ <= 1/2 : certainty goal. i.e. with probability (1-δ), the algo must achieve true error(h) <= ε
C is PAC-learnable by L using H iff (if and only if) learner L will, with probability 1-δ (or better), output a hypothesis h ∈ H s.t. true error_D(h) <= ε, in time and samples polynomial in 1/ε, 1/δ and n. (polynomial is far better than exponential)
i.e.
Probably == 1-δ
Approximately == ε
Correct == error_D(h)=0
note: of course, such h() is in VS(H)
###
### ε-exhausted version space
###
VS(S) is ε-exhausted iff ∀h ∈ VS(S), error_D(h) <= ε
(good quiz) https://www.youtube.com/watch?v=_AJvYHrgrD0
e.g.
let's revisit the same quiz setup.
c: target concept # well, it's just the XOR function, in this case
x1 x2  c(x)  D

0 0  0  0.1
0 1  1  0.5
1 0  1  0.4
1 1  0  0
suppose, S (i.e. training set) is below
x1 x2  c(x)

0 0  0
1 0  1
==> what is VS(S) ?
H = {x1, !x1, x2, !x2, x1 && x2, x1 || x2, x1 ^ x2, x1==x2} # let's say H consists of 8 hypotheses
VS(S) = {x1, x1 || x2, x1 ^ x2} # 3 of them are all consistent wrt S
==> obviously, we know in this case (x1 ^ x2) is the answer, but as far as S goes, then VS(S) is the above.
find the smallest ε s.t. VS(S) is ε-exhausted
h : error_D(h)
------------------
x1 : 0.5
x1 || x2 : 0
x1 ^ x2 : 0
====> therefore, ε = 0.5 # well, this satisfies 0 <= ε <= 0.5 so it's valid
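The quiz computation can be reproduced mechanically; a sketch (the hypothesis encoding is ours):

```python
# distribution D and target concept c (XOR) from the quiz above
data = [  # (x1, x2, c(x), D(x))
    (0, 0, 0, 0.1),
    (0, 1, 1, 0.5),
    (1, 0, 1, 0.4),
    (1, 1, 0, 0.0),
]

VS = {
    "x1":     lambda x1, x2: x1,
    "x1||x2": lambda x1, x2: x1 | x2,
    "x1^x2":  lambda x1, x2: x1 ^ x2,
}

def true_error(h):
    # error_D(h) = sum of D(x) over the x where h(x) != c(x)
    return sum(p for x1, x2, c, p in data if h(x1, x2) != c)

errors = {name: true_error(h) for name, h in VS.items()}
print(errors)                # x1 -> 0.5, the other two -> 0
print(max(errors.values()))  # 0.5 = smallest epsilon that epsilon-exhausts VS(S)
```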
###
### Haussler Theorem - bound true error as a function of training examples
###
"how many training examples would it take to ε-exhaust the version space with probability 1-δ (or better)?"
i.e. "how many training examples would it take to PAC-learn a concept ?"
Theorem: if H is finite, and we draw M training examples independently at random from distribution D (i.e. the M examples are i.i.d. from D), then for any 0 ≤ ε ≤ 1, Probability[VS(H) not ε-exhausted] ≤ |H|exp(-εM)
(proof)
think about all the hypotheses in H (let's say H = {h1,h2,,,hn} i.e. |H| = n), some have high true error, while others have low true error.
let error_D(h1,h2,,,hk ∈ H) > ε # the high-true-error hypotheses (high, relative to ε)
# how many training samples do we need to be sure these are really bad?
# note: h1,h2,,,hk are the subset of H that gives error_D(hi) > ε
# i.e. k <= |H|
in other words,
Pr_x~D[hi(x) = c(x)] <= 1-ε # i.e. Pr_x~D[hi(x) != c(x)] > ε
where i = 1~k # basically saying low probability of hi(x)=c(x) for those high-true-error hypotheses
# low, relative to 1-ε
Pr[an hi consistent with c on M examples] <= (1-ε)^M # i.e. hi stays in the version space
# each probability is independent
# so Pr[h1 in VS(S)] && Pr[h2 in VS(S)] && .. Pr[hk in VS(S)] would multiply out to (1-ε)^(Mk), but we don't really use this.
Pr[at least one of h1,h2,,,hk consistent with c on M examples] <= k(1-ε)^M <= |H|(1-ε)^M <= |H|exp(-εM) <= δ
note: exp(x) == e^x
note: bounding Pr[VS(H) not ε-exhausted] <= δ means bounding Pr[VS(H) is ε-exhausted] >= 1-δ
note: calculus (taylor series) says -ε >= ln(1-ε), thus exp(-ε) >= 1-ε, then exp(-εM) >= (1-ε)^M
note: δ = failure probability, which we choose as the upper bound on Pr[VS(H) not ε-exhausted]
note: why Pr[at least one of h1,h2,,,hk consistent with c on M examples] <= k(1-ε)^M ?
technically speaking, the exact statement would be
Pr[an hi not consistent with c on M examples] >= 1 - (1-ε)^M
Pr[every hi not consistent with c on M examples] >= (1 - (1-ε)^M)^k
Pr[at least one of h1,h2,,,hk consistent with c on M examples] <= 1 - (1 - (1-ε)^M)^k # but it's too complex
here, the trick we use is "unionbound inequality"
i.e.
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) # by definition of set theory
P(A ∪ B) ≤ P(A) + P(B)
==> that's why we could simply add up (1-ε)^M for k times, to obtain ≤ k(1-ε)^M
so, to recap, |H|exp(-εM) is the upper bound on the probability that the version space is not ε-exhausted after M samples
now let's rewrite this in terms of M
|H|exp(-εM) <= δ
ln|H| - εM <= ln(δ)
ln|H| - ln(δ) <= εM
(1/ε)(ln|H| + ln(1/δ)) <= M # sample complexity lower bound: M is polynomial in terms of 1/ε, 1/δ and |H| !!
note: "sample complexity lower bound M" == at least how many training samples we need, to achieve Pr[VS(H) is ε-exhausted] >= 1-δ
(recall) Pr[VS(H) not ε-exhausted] <= δ
Pr[VS(H) is ε-exhausted] >= 1-δ
i.e. if Pr[VS(H) is ε-exhausted] >= 1-δ, then C is PAC-learnable.
note: recall properties of logarithm
log(x^y) = y*log(x) # e.g. -log(δ) = log(δ^-1) = log(1/δ)
log(x) + log(y) = log(xy)
log_b(x) = log_a(x) / log_a(b)
b^(log_b(x)) = x
(ref) https://www.andrews.edu/~calkins/math/webtexts/numb17.htm
NOTE: in Haussler theorem, this sample complexity bound M assumes H includes some h with h(x)=c(x). what if not? then we call the setting "agnostic" and try to find the h ∈ H that best fits (is closest to) c(x).
NOTE: what if |H| = n is infinite ? then M becomes infinite. this is an important issue. we talk about this in the next lesson.
##
## PAC learnable  an exercise quiz
##
(quiz) https://www.youtube.com/watch?v=LZ6e0z4YkdM
given
H = {hi(x) = xi}
X : 10 bits i.e. i = 1~10 # i.e. |H| = 10
ε : 0.1 # 10% error
δ : 0.2
D : uniform
How many samples do we need to PAClearn this hypothesis set ?
just use the equation.
M >= 39.12
let's say we need 40 examples. this is good because the input space is 2^k = 2^10 = 1024, so M is less than 4% of the input space.
===> but now think about, we want the error to be lower, like 1%, then M >= 400, so now that's almost 40% of the input space.
i.e. we need a bigger M (i.e. more samples) to PAClearn with lower error.
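The quiz numbers fall straight out of the bound; a quick sketch:

```python
import math

def sample_complexity(H_size, eps, delta):
    # Haussler bound for finite H: M >= (1/eps) * (ln|H| + ln(1/delta))
    return (1.0 / eps) * (math.log(H_size) + math.log(1.0 / delta))

print(round(sample_complexity(10, 0.1, 0.2), 2))   # 39.12 -> need 40 examples
print(math.ceil(sample_complexity(10, 0.01, 0.2))) # 392 -> ~40% of the 1024-point input space
```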
#################################
#### (1.8) VC Dimensions ####
#################################
###
### infinite hypothesis spaces
###
recall Haussler theorem
(1/ε)(ln|H| + ln(1/δ)) <= M # sample complexity bound M is polynomial in terms of 1/ε, 1/δ, n where n = |H|
what if H is huge (or infinite) ? then M is infinite,,,, bad.
e.g.
 linear separators # you can pick an infinite combination of coordinates
 artificial neural networks # weights are real numbers, so your choice is infinite
 decision trees (continuous input) # you can infinitely keep splitting. e.g. is x > 1.1 ? is x > 1.11 ? is x > 1.111 ? and so on
(but decision trees with discrete input have a finite H, because there is only a finite number of features)
 KNN is tricky: one can argue there is an infinite number of possible input data points, so there can be infinitely many classifiers that come out of it.
(but if you think about it differently, the data set is fixed for a given KNN classifier, so its behavior is finite. it depends on whether you consider the data as part of the H space)
###
### how to approach infinite H learning problem
###
X : {1,2,3,4,5,6,7,8,9,10} # input space is integers from 1 to 10
H : h(x) = 1 if(x >= θ) # θ theta is just a parameter, a real number
= 0 otherwise # so yes, H can be infinite, because there is an infinite choice of θ value
like always,
 track all hypotheses # only need to track non-negative integers ten or below. then it's finite AND gives the same answer!
 keep (and εexhaust) version space
i.e. if θ is one or below, then the answer is always zero, so we don't need to track any hypothesis where θ is one or below.
similarly, if θ is 10 or above, then we can ignore all such hypothesis, because the answer is always 1.
i.e. we say "syntactically" the hypothesis space is infinite, but "semantically" it is finite.
##
## how to approach infinite H learning problem (from Tom Mitchell textbook)
##
 it helps to have some way to order (or give structure) to H
e.g.
general-to-specific ordering # textbook p.24
suppose input x = <x1,x2,x3,x4,x5>
h1(x) = 1 if x4 >= 0 # only cares about x4
0 otherwise
h2(x) = 1 if x2 >= 0 and x4 >= 0 # cares about x2 and x4 only
= 0 otherwise
[definition]
any x that gives h2(x)=1 will give h1(x)=1 # we say "give"=="satisfy"
then we say h1 is more_general_than_or_equal_to h2 # nicely intuitive
h1 >=g h2 # let's denote it like this
we say h1 is more_general_than h2 if h1 >=g h2 ∧ h2 !>=g h1
h1 >g h2 iff h1 >=g h2 ∧ h2 !>=g h1 # also nicely intuitive
importantly, "h1 more_general_than h2" == "h2 more_specific_than h1"
NOTE: this doesn't allow every single pair of h ∈ H to be ordered, because it's possible that hj !>=g hk ∧ hk !>=g hj
(we say this generaltospecific relation ordering method gives a "partial" order, not a "total" order, over H)
(formally, we call the relation reflexive, antisymmetric, and transitive)
[a simple algo using generaltospecific ordering]
1. start with the most specific h()
2. iterate thru training data instance, and if the current h() doesn't fit, then choose more general h() that fits training data.
3. you end up with the most specific h() consistent with input data. # obviously there can be multiple such h() and we randomly arrive at one such h()
(very naive but it's just an approach. we need to deal with outliers/noise/error, overfitting, etc)
###
### VC dimension as a measure of the complexity of hypothesis space
###
e.g.
X : {1,2,3,4,5,6,7,8,9,10}
H : h(x) = 1 if(x >= θ) # h(x) output can only ever be 1 or 0
 = 0 otherwise # so "all possible ways" means all combinations of 1s and 0s in this case
 what is the largest set of inputs that the hypothesis class can label in all possible ways?
==> the answer = 1, because suppose S = {6}, then
if θ = 5, then the answer is 1
if θ = 7, then the answer is 0
why not two (or bigger) ? because suppose you pick S = {4,6}
then, "all possible ways" mean all four cases below.
h(4)  h(6)

0  0
0  1
1  0 # this is never possible because if 4 >= θ, then 6 >= θ too.
1  1
==> so we can only label 3 out of 4 all possible ways.
(video) https://www.youtube.com/watch?v=ESoyVKSGKJc
###
### VC dimension
###
"label in all possible ways" == "shatter" # just the formal ML term
"the largest set of inputs that the hypothesis class can shatter" is known as "VC dimension"
(VC = VapnikChervonenkis, two russian scientists who came up with this theory)
so even if H is infinite, as long as the VC dimension is finite, we can still bound the amount of data needed to learn.
we write it as e.g. VC(H) = 3
e.g. VC(H) = ∞
###
### VC dimension : quiz  interval
###
(quiz) https://www.youtube.com/watch?v=SDrFK7NxKF8
X = R # any real number, so H = infinite
H = {h(x) = x∈[a,b]} # is a <= x <=b ?
so all possible labelings in this case number 2^|S|, since each h(xi) = 0 or 1
is the VC dimension >= 1 ? yes, because you pick any one real number, and by choosing a & b you can get all possible labels, as below.
 [a---x1---b]     # h(x1) = 1
 x1  [a----b]     # h(x1) = 0 (or you can put x1 > b)
is the VC dimension >= 2 ? Yes!
 x1  [a---x2--b]  # h(x1)=0, h(x2)=1
 [a---x1--b]  x2  # h(x1)=1, h(x2)=0
 [a---x1--x2--b]  # h(x1)=1, h(x2)=1
 [a---b]  x1  x2  # h(x1)=0, h(x2)=0
is the VC dimension >= 3 ? no.
 x1  x2  x3       # with x1 <= x2 <= x3, there is no way a single interval [a,b] can give h(x1)=1, h(x2)=0, h(x3)=1
==> therefore, overall, VC dimension = 2
(this was a simplistic intuitive example, let's look at a more realistic example next)
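The shattering argument above can also be brute-forced: enumerate candidate interval endpoints just below/above each point and count the distinct labelings (the helper names are ours):

```python
from itertools import product

def interval_labelings(points):
    # all labelings achievable by h(x) = 1 if a <= x <= b else 0;
    # candidate endpoints just below/above every point suffice
    pts = sorted(points)
    gaps = [b - a for a, b in zip(pts, pts[1:])]
    eps = 0.5 * min(gaps) if gaps else 1.0
    grid = [p + d for p in points for d in (-eps, eps)]
    return {tuple(int(a <= x <= b) for x in points)
            for a, b in product(grid, repeat=2)}

def shatters(points):
    # shattered == every one of the 2^|S| labelings is achievable
    return len(interval_labelings(points)) == 2 ** len(points)

print(shatters([3.0]))            # True  -> VC >= 1
print(shatters([3.0, 7.0]))       # True  -> VC >= 2
print(shatters([1.0, 2.0, 3.0]))  # False -> the (1,0,1) labeling is impossible
```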
###
### VC dimension : quiz  linear separators
###
X = R^2 # 2 dimensional space
H = {h(x) = w^T * x >= Θ} # this equation is generalized as a linear separator in hyperplane
# but just for simplicity, let's just think only in R^2, 2dimensional space
# e.g. classify any points below Θ as "" and else "+"
 + + + +
 o + + + ++
 o o ++ ++
 ooo o + +
o o oo o +
__________________
basically you have control over each value of W vector elements
is VC >= 1 ? yes because if input size is 1, just pick some W so that x1 is left or right of your linear separator
is VC >= 2 ? yes because imagine two input points (x1,x2) in a 2d plane below.
 # h(x1)  h(x2)
 x1 x2 # 
 # 1  1 # this is easy
 # 0  0 # just draw your line left or right (or above or below) both points
# 1  0 # just draw a line to separate x1 and x2
# 0  1 # then flip the sign for theta Θ, so you get this
is VC >= 3 ? yes

 x2

 x1 x3


h(x1)  h(x2)  h(x3)

1  1  1 # these are always easy
0  0  0
1  0  0 # easy also
0  1  1 # just flip the sign for theta Θ
1  1  0
0  0  1
1  0  1 # just draw a line to separate x2 from x1,x3
0  1  0 # flip the sign again
==> notice we kind of chose x1,x2,x3 relative coordinates like the above by ourselves, which is ok.
==> VC dimension doesn't require shattering every possible configuration of points; SOME shatterable set of that size is enough.
==> because otherwise think about the case where x1=x2=x3, then we can only cover the case where h(x1)=h(x2)=h(x3)
is VC >= 4 ? no because no matter how you choose your X and W, you cannot shatter.
e.g.
 # here, there is no way you can get h(x1)=h(x4)=1 and h(x3)=h(x2)=0
 x3 x4 #

 x1 x2


e.g.
 x4 # again, there is no way you can get
 # h(x1)=h(x2)=h(x4)=0 and h(x3)=1
 x3

 x1 x2

###
### generalize the relationship btwn VC dimension and Hypothesis space dimension
###
hypothesis space dimension | parameters | VC dimension
------------------------------------------------------
0                          | theta Θ    | 1
1                          | a,b        | 2
2                          | W,Θ        | 3   # W is a vector of size 2
3                          | W,Θ        | 4   # W is a vector of size 3
d                          | W,Θ        | d+1 # W is a vector of size d
##
## quiz: polygons(convex)
##
(quiz) https://www.youtube.com/watch?v=TYHFOpgCBek
(ref) https://en.wikipedia.org/wiki/Convex_polygon
a polygon is just a shape consisting of straight edges and vertices.
e.g. a triangle, a rectangle, a pentagon, etc
e.g. a circle can be considered a polygon of infinite degree
a convex polygon is one where each interior angle is <= 180 degrees
X : R^2 # 2d plane
H : points inside some convex polygon (on the edge counts as inside)
parameters : infinite, because the degree of polygon is not specified
VC = infinite
===> see the video for proof, the point is VC=d+1 observation still applies.
##
## sample complexity & VC dimension
##
m >= (1/ε)(8*VC(H) * log_2(13/ε) + 4log_2(2/δ)) # infinite case (we don't go into its proof)
m >= (1/ε)(ln|H| + ln(1/δ)) # finite case
(Q) what is VC dimension of finite H ?
if H is finite, then VC(H) is finite, and we know the following upper bound.
suppose VC(H) = d. then for those d points from X, there must be 2^d possible labelings of these points, and each labeling requires a distinct hypothesis.
i.e.
suppose VC(H) = d, then ∃ 2^d distinct concepts(=labelings) (each gets a different h) # ∃ means "there exists"
thus, 2^d <= |H| # i.e. d <= log_2(|H|)
thus, VC(H) = d <= log_2(|H|)
note: 2^d because we assume binary classification
Theorem: H is PAC-learnable if and only if its VC dimension is finite # i.e. if VC dim = ∞, then it's not PAC learnable
######################################
#### (1.9) Bayesian Learning ####
######################################
- learn the best (= most probable, most likely correct) hypothesis h, given data & some domain knowledge
argmax Pr(h|D) # probability of hypothesis h (drawn from some hypothesis class H) given data D
  h∈H # notice D is data, not a distribution, in this case
note: argmax f(x) means pick x such that f(x) is maximized
        x
#
# probability chain rule
#
Pr(a,b) = Pr(a|b) * Pr(b) # probability of a & b is the probability of a given b, times the probability of b
        = Pr(b|a) * Pr(a) # because the order doesn't matter, you can rewrite it like this as well
==> in other words,
            Pr(b|a) * Pr(a)
Pr(a|b) = ------------------ # here you got the Bayes theorem
                Pr(b)
##
## Bayes theorem
##
            P(D|h) * Pr(h)
Pr(h|D) = ------------------
                P(D)
Pr(D) : prior belief of seeing some particular data set D
because Pr(D) doesn't depend on h, we typically ignore it.
Pr(D|h) : probability of data D given the hypothesis h (assuming h is true)
==> accuracy of h
D = {(xi,yi)} # training data set D
# so Pr(D|h) means the probability of h(xi) = yi
# i.e. Pr(D|h) can be interpreted as the accuracy of h on D
e.g. given h(x) = {x >= 10}
if x=7, then the probability of h(x)=1 is zero
and the probability of h(x)=0 is 1
Pr(h) : prior belief of seeing h out of H
"prior belief" == "domain knowledge"
NOTE: to argmax Pr(h|D), you can either increase Pr(h) and/or Pr(D|h), as you analyze/pick h.
        h∈H
##
## Bayes rule - quiz
##
a man goes to a hospital to test for a disease K. his lab test result is positive "+".
the test returns a correct positive 98% of the time, and a correct negative 97% of the time.
0.8% of the population has the disease K.
does he really have K ?
what's the probability he has K ?
Pr(K|+) = Pr(+|K) * Pr(K) / Pr(+)
        = 0.98 * 0.008 / Pr(+)
        = 0.00784 / Pr(+) # see below for Pr(+) derivation
        ≈ 21%
what's the probability he doesn't have K ?
Pr(!K|+) = Pr(+|!K) * Pr(!K) / Pr(+)
         = 0.03 * 0.992 / Pr(+)
         = 0.02976 / Pr(+)
         ≈ 79%
===> Pr(K|+) + Pr(!K|+) = 1, so you can derive Pr(+) = 0.00784 + 0.02976 = 0.0376
===> despite the positive test result, he likely doesn't have K. (only a 21% chance)
===> this is exactly why doctors don't test you for every rare disease even if the test is fairly accurate.
===> but thru some other test, or other prior "domain knowledge", doctors may be able to classify you into a subpopulation where the 0.8% prior rises. e.g. if you have symptoms XYZ, you are in a group with a 10% chance of having K; then the test makes sense.
===> also, from a mathematical perspective, you can calculate at what prior (instead of 0.8%) the test becomes informative. interesting to think about.
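the arithmetic above can be packaged as a tiny Python sketch of Bayes rule (the function name `posterior` is mine; sensitivity = Pr(+|K), specificity = Pr(-|!K)):

```python
def posterior(prior, sensitivity, specificity):
    """Pr(K|+) via Bayes rule. sensitivity = Pr(+|K), specificity = Pr(-|!K)."""
    joint_k = sensitivity * prior                  # Pr(+, K)
    joint_not_k = (1 - specificity) * (1 - prior)  # Pr(+, !K)
    p_pos = joint_k + joint_not_k                  # Pr(+), the normalizer
    return joint_k / p_pos

print(round(posterior(0.008, 0.98, 0.97), 4))  # 0.2085, i.e. ~21%
```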
###
### MLE vs MAP
###
[definition]
a priori knowledge : domain/background generalized ground truth/knowledge
a posteriori knowledge : inductive/derived knowledge from experience/examples
(ref) https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation
MLE: maximum likelihood estimation
 choose value that maximizes the probability of observed data
_
θ_MLE = argmax Pr(D|θ)
          θ
(ref) http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/MLE_MAP_Part1.pdf
MAP: maximum a posteriori
 choose value that is most probable given observed data and prior belief
_
θ_MAP = argmax Pr(θ|D) = argmax Pr(D|θ) * Pr(θ)
          θ                θ
###
### Bayesian Learning
###
[algorithm]
for each h ∈ H, calculate Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)
output: h = argmax Pr(h|D)
               h
NOTE: you can derive Pr(D) based on the constraint that Σ Pr(hi|D) = 1
                                                       hi∈H
but because Pr(D) is the same for every Pr(hi|D), for the purpose of computing argmax, you can ignore Pr(D)
so let's simplify the algo.
Pr(h|D) ∝ Pr(D|h) * Pr(h) # "∝" denotes "proportional to" (the shared Pr(D) denominator is dropped)
argmax Pr(h|D) = argmax Pr(D|h) * Pr(h) # this is called MAP (maximum a posteriori) estimation
  h                h # again, notice the removal of Pr(D) doesn't affect the argmax
furthermore, an even more simplified version is below.
Pr(h|D) ∝ Pr(D|h) # let's assume each Pr(hi) is equally likely, i.e. a uniform distrib of h in H
# in that case, it's ok to drop Pr(h) for our argmax computation
===> so now we have a few versions of h
h = argmax Pr(h|D)
      h
h_MAP = argmax Pr(D|h) * Pr(h)
          h
h_ML = argmax Pr(D|h) # ML = maximum likelihood (estimation) aka MLE
         h
NOTE: it's mathematically/conceptually/theoretically elegant, but looking at each h may not be computationally practical if H is big.
###
### Bayesian Learning in Action
###
(1) given {<xi,yi>} as "noisefree" examples of C # C = concept
(2) C ∈ H # let's say H is a finite class
(3) uniform prior # i.e. Pr(hi) is equal for every i
note: noisefree means c(xi) = yi
note: (1) ~ (3) are kind of strong assumption, but not entirely unreasonable.
let's try compute Pr(hD) = Pr(Dh) * Pr(h) / Pr(D)
Pr(h) = 1 / |H|
Pr(D|h) = 1 if h(xi) = yi ∀ xi,yi ∈ D # ∀ means "for all"
        = 0 otherwise # notice this is the same as the "version space"
Pr(D) = Σ Pr(D|hi) * Pr(hi) # nothing fancy, just so Σ Pr(hi|D) = 1
      hi∈H                                          hi∈H
                            # assuming each h is mutually exclusive
      = Σ 1 * (1/|H|) # notice we already know Pr(D|hi) and Pr(hi)
    hi∈VS(H,D)
      = |VS| / |H|
            1 * (1/|H|)      1                      # this is neat
Pr(h|D) = --------------- = ------ for h ∈ VS(H,D)  # the probability of h being correct given D
             |VS|/|H|        |VS|                   # is just 1 / |VS|
# because we only consider h ∈ VS(H,D), the Pr(D|h)=0 case is implicitly skipped
###
### noisy data - quiz
###
<xi,yi> where yi = k * xi
where k is an integer from {1,2,3,,,,,∞}
where k is chosen with probability Pr(k) = 1/2^k
e.g. probability of k=5 is 1/2^5 = 1/32
∞
Σ 1/(2^k) = 1
k=1
===> so that's the noise process
x  | y   | k | Pr(k) = 1/2^k
----------------------------
1  | 5   | 5 | 1/32
3  | 6   | 2 | 1/4
11 | 11  | 1 | 1/2
12 | 36  | 3 | 1/8
20 | 100 | 5 | 1/32
here is a candidate hypothesis: h(x) = x # f(x)=x is called "identity function"
(plus the aforementioned noise process)
Pr(D|h) = (1/32) * (1/4) * (1/2) * (1/8) * (1/32) = 1/65536
        = Π Pr(yi|h)
          i
        = Π 1/(2^k) where k = yi/xi
          i
          # valid if yi mod xi = 0 ∀ xi,yi ; otherwise Pr(D|h) = 0
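a quick Python sketch of this likelihood computation (the helper name `likelihood` is mine; it returns 0 when some yi is not an integer multiple of h(xi), per the note above):

```python
def likelihood(data, h):
    """Pr(D|h) under the noise model yi = k * h(xi), with Pr(k) = 1/2^k.
    If some yi is not an integer multiple of h(xi), Pr(D|h) = 0."""
    prob = 1.0
    for x, y in data:
        hx = h(x)
        if hx == 0 or y % hx != 0:
            return 0.0
        k = y // hx
        prob *= 1.0 / (2 ** k)
    return prob

D = [(1, 5), (3, 6), (11, 11), (12, 36), (20, 100)]
print(likelihood(D, lambda x: x))  # (1/32)*(1/4)*(1/2)*(1/8)*(1/32) = 1/65536
```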
###
### Bayesian Learning - contd
###
given {<xi,yi>}
yi = f(xi) + εi # ε = error term, to make data noisy, aka "noise" corruption
εi is chosen from norm(0,δ^2) i.i.d. # normal (aka gaussian) distrib with zero mean and variance δ^2
# iid = independently and identically distributed
NOTE: the above assumption is important. f(x) is just some deterministic function, which we corrupt with gaussian noise ε.
notice it's not dependent on Hypothesis class
h_ML = argmax Pr(h|D)
         h
     = argmax Pr(D|h)
         h
     = argmax Π Pr(yi|h) # Pr(yi|h) is the probability of seeing the correct yi, given h
         h    i
     = argmax Π 1/sqrt(2πδ^2) * e^(-1/2 * [(yi - h(xi))^2 / δ^2]) # just using the gaussian distrib
         h    i # see the note below and wikipedia
## at this point, to simplify, we can get rid of coefficients that don't affect the argmax computation
## we can remove 1/sqrt(2πδ^2) as it doesn't depend on i
## also we can apply ln() because log() is a monotonic function (i.e. doesn't change the argmax)
## recall log(A) + log(B) = log(A*B), and ln(e^(abc)) = abc
     = argmax Σ -1/2 * (yi - h(xi))^2 / δ^2 # well simplified
         h    i # we can do more: get rid of δ^2
                # and get rid of the 1/2
     = argmin Σ (yi - h(xi))^2 # the minus sign is eliminated by changing argmax to argmin
         h    i
## turns out this is just the sum of squared error, just like other supervised ML algo we looked at. (neural net, gradient descent, linear regression, etc)
## whoever decided to model noise as gaussian was onto something
note: the gaussian probability density function = 1/sqrt(2πδ^2) * e^(-1/2 * (yi - h(xi))^2 / δ^2)
(ref) https://en.wikipedia.org/wiki/Normal_distribution
(ref) https://en.wikipedia.org/wiki/Monotonic_function
###
### best hypothesis - quiz
###
let's apply Bayesian learning algo.
given {<xi,yi>}
yi = f(xi) + εi
sample data set S is:
x  | y
------
1  | 1
3  | 0
6  | 5
10 | 2
11 | 1
13 | 4
which one is the best h ?
1. h(x) = x mod 9
2. h(x) = x/3
3. h(x) = 2
==> just compute Σ (yi - h(xi))^2 for each, and you will find 1. has the smallest error
i
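the quiz can be checked with a short Python sketch (the helper name `sse` is mine):

```python
S = [(1, 1), (3, 0), (6, 5), (10, 2), (11, 1), (13, 4)]

candidates = {
    "x mod 9": lambda x: x % 9,
    "x / 3":   lambda x: x / 3,
    "2":       lambda x: 2,
}

def sse(h):
    """Sum of squared errors of hypothesis h on the sample set S."""
    return sum((y - h(x)) ** 2 for x, y in S)

for name, h in candidates.items():
    print(name, round(sse(h), 2))  # "x mod 9" scores 12, the lowest
```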
###
### Minimum Description Length
###
log_2() = lg() # let's agree on this notation for log base 2
h_MAP = argmax Pr(D|h) * Pr(h)
      = argmax [lg(Pr(D|h)) + lg(Pr(h))] # lg() is monotonic, doesn't change the argmax
      = argmin [-lg(Pr(D|h)) - lg(Pr(h))] # see NOTE_1 below
      = argmin [length(D|h) + length(h)] # see NOTE_2 below
      = argmin [error(h on D) + size(h)]
==> so you want the hypothesis that minimizes error and size.
(smaller size means simpler)
==> "occam's razor" !!! # note, in reality, error and size are in a tradeoff relation.
==> called "minimum description length"
NOTE_1:
(from information theory, there is the notion of entropy)
an event with probability P
-> its optimal code length is -lg(P) bits # this "length" is what entropy measures in expectation
(ref) https://en.wikipedia.org/wiki/Entropy_(information_theory)
NOTE_2:
length(h) : "length of hypothesis"
: basically how many bits are needed to represent a hypothesis
: i.e. "size of h"
: e.g. if your hypothesis representation is a decision tree, then a smaller tree (fewer nodes) is better.
: e.g. if your h representation is a neural network, the overall neuron/layer count may be fixed, but smaller weights are still better (bigger weights require more bits, and in fact lead to overfitting). so preferring smaller size captures the idea for neural networks as well.
length(D|h) : recall an earlier discussion.
  so Pr(D|h) means the probability of h(xi) = yi
  i.e. Pr(D|h) can be interpreted as the accuracy of h on D
  so length(D|h) is shorter if accuracy is better
  i.e. length(D|h) = error
###
### Bayesian Learning VS Bayesian Optimal Classification
###
so far, we've been talking about Bayesian "learning"
i.e.
argmax Pr(h|D) # pick the best (most probable, most likely correct) hypothesis h, given data D (and other domain knowledge)
  h∈H # i.e. for each h∈H compute Pr(h|D), and output the h that maximizes Pr(h|D)
[quiz] given below, what should be the best label (i.e. y) for x ? + or - ?
      hi(x) | Pr(hi|D)
----------------------
h1  |   +   |  40%
h2  |   -   |  30%
h3  |   -   |  30%
===> well, according to the Bayesian "learning" rule, we pick h1(x) = + as our best h, so y should be +
but that's just picking the best "hypothesis", we want the best "label"
so we do weighted voting, aka Bayes optimal "classification", as expressed below
v_MAP = argmax Σ Pr(v|h) * Pr(h|D) # Pr(v|h) is the probability of label v, given h
          v   h∈H # pick v (== label y) such that the sum over h of (probability of v given h) * (posterior probability of h) is maximized
if v = +,
Pr(+|h1) * Pr(h1|D) + Pr(+|h2) * Pr(h2|D) + Pr(+|h3) * Pr(h3|D)
= 1 * 0.4 + 0 * 0.3 + 0 * 0.3 = 0.4
if v = -,
Pr(-|h1) * Pr(h1|D) + Pr(-|h2) * Pr(h2|D) + Pr(-|h3) * Pr(h3|D)
= 0 * 0.4 + 1 * 0.3 + 1 * 0.3 = 0.6 # thus argmax will pick v = -
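the weighted vote above is a one-liner in Python (the helper name `bayes_optimal_label` is mine; here each hypothesis votes deterministically, so Pr(v|h) is 1 for its predicted label and 0 otherwise):

```python
def bayes_optimal_label(hypotheses):
    """hypotheses: list of (predicted_label, Pr(h|D)) pairs.
    Returns the label with the highest total posterior weight."""
    votes = {}
    for label, weight in hypotheses:
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

H = [("+", 0.4), ("-", 0.3), ("-", 0.3)]
print(bayes_optimal_label(H))  # "-" (weight 0.6 beats 0.4)
```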
########################################
#### (1.10) Bayesian Inference ####
########################################
Bayesian network: representing and reasoning with probabilistic quantities.
###
### random variable
###
what is random variable: let's say "X"
- it's a variable that can take any value from a set of values Ω_X (dubbed its domain)
- the domain can be discrete or continuous, e.g. letters of the alphabet, or {0,1}
- a trial: an instance of observation of a particular value of the random variable
- a sample: a collection of trials
- every random variable (we assume) has some underlying probability distribution P(X)
Σ P(x) = 1
x∈Ω_X
e.g.
suppose X is pittsburgh weather.
then we may get something like below.
P(X=cloudy) = 0.9
P(X=snow) = 0.07
P(X=sunny) = 0.02
P(X=other) = 0.01
note: obviously when X is continuous, P(X=x) for any particular value x is essentially zero, so we instead use a probability "density" and integrate ∫p(x)dx over some range of x to get a probability.
###
### joint/marginal/conditional distribution
###
note: "joint distribution" == "joint probability distribution" == "joint frequency" # same goes for marginal/conditional
joint probability distribution is probability of X=x and Y=y
P(X=x,Y=y) = some_value_between_[0,1] # also denoted as P(X=x ∧ Y=y)
e.g.
atlanta summer afternoon weather:
storm | lightning | probability (should add up to 1.0)
------------------------------------------------------
  T   |     T     | 0.25
  T   |     F     | 0.4
  F   |     F     | 0.05
  F   |     T     | 0.3
P(storm=T,lightning=T) = 0.25
P(storm=T,lightning=F) = 0.4 # note we usually use variable names like X,Y instead of "storm", "lightning"
P(storm=F,lightning=F) = 0.05
P(storm=F,lightning=T) = 0.3
note: X and Y are independent iff P(X=x) * P(Y=y) = P(X=x,Y=y) # sometimes we write P(X)*P(Y)=P(X,Y) to simplify
"marginal distribution" is defined as below: # very obvious
P(X=x) = Σ P(X=x,Y=y) # suppose we only have joint distrib data like above table
y∈Ω_Y # then we can derive marginal distrib P(X=x) from the joint distrib data
"conditional distribution" is defined as below:
P(X=x,Y=y)
P(X=xY=y) =  # probability of X=x, given Y=y
P(X=x)
e.g.
P(X,Y,Z) = P(XY,Z) * P(YZ) * P(Z) # notice, this means we can recover joint distrib from conditional distrib
e.g.
if X,Y,Z are independent, then P(X,Y,Z) = P(X) * P(Y) * P(Z)
and of course, from the above, we derived the Bayes Rule:
           P(Y|X) * P(X)
P(X|Y) = -----------------
              P(Y)
here are exercise quizzes:
Pr(¬storm) = 0.3 + 0.05 # "¬" = logical not, i.e. "!"
           = 0.35 # this is a marginal probability
Pr(lightning|storm) = 0.25/0.65 # probability of lightning given storm
                    = 5/13 # the denominator is Pr(storm) = 0.25 + 0.4 = 0.65
                    ≈ 0.3846 # this is a conditional probability
NOTE: with only two binary variables (storm and lightning) we have 4 possible cases.
with N binary variables, we get 2^N cases. and a variable may not be binary, e.g. a variable like "food_type" may take several possible values; then the complexity grows even more quickly.
well we can factor these. more details to follow.
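the two quiz answers above can be reproduced directly from the joint table in Python (variable names are mine):

```python
# joint distribution P(storm, lightning) from the table above
joint = {
    (True, True): 0.25,
    (True, False): 0.40,
    (False, False): 0.05,
    (False, True): 0.30,
}

# marginalization: P(storm=F) = sum over all lightning values
p_not_storm = sum(p for (s, l), p in joint.items() if not s)

# conditional: P(lightning=T | storm=T) = P(T,T) / P(storm=T)
p_storm = sum(p for (s, l), p in joint.items() if s)
p_lightning_given_storm = joint[(True, True)] / p_storm

print(round(p_not_storm, 2))              # 0.35
print(round(p_lightning_given_storm, 4))  # 0.3846
```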
###
### Conditional Independence
###
definition: X is conditionally independent of Y given Z
if the probability distribution governing X is independent of the value of Y given the value of Z
i.e. if ∀ x,y,z : P(X=x|Y=y,Z=z) = P(X=x|Z=z) # notice this is just expressing exactly what's described above.
(more compactly we write) P(X|Y,Z) = P(X|Z) # i.e. the probability of X is independent of Y, given Z
FYI:
chain rule: P(X,Y) = P(X|Y) * P(Y)
normal independence: P(X,Y) = P(X) * P(Y) # simple. the probability of X & Y is the probability of X times the probability of Y
P(X|Y) = P(X) # this is true under the normal independence scenario
(good quiz) https://www.youtube.com/watch?v=Qc3gdfwHbrA
###
### Belief Networks (aka Bayes Network, Bayesian Nets, aka Graphical Models)
###
Bayes Nets - a representation of the joint distribution of some variables (aka features, attributes), making particular conditional (in)dependencies explicit
e.g.
S = storm # assume all these variables take a binary value, 1 or 0
L = lightning # so we have 2^3 scenarios
T = thunder # but we can express all cases with the steps below, ASSUMING T is conditionally independent of S given L
[storm] ----> [lightning] -----> [thunder]
given         then you can calc  then you can calc
P(S)          P(L|S) or P(L|¬S)  P(T|L) or P(T|¬L)
 a              b        c         d        e      # some values
e.g. what's the probability of S=1,L=0,T=1 ?
==> a * (1-b) * e
NOTE: we assumed T is conditionally independent of S given L. as above, it's a nice graphical representation of statistical dependence/independence.
but otherwise we need an arrow connecting directly from [storm] to [thunder] as well, i.e. to calc [thunder] probability you need both [storm] and [lightening] probabilities. but that would make the graph grow exponentially. bad.
==> this is much like a neural network (though it's not the same, because this is just a probabilistic/statistical dependence/independence visualization)
Bayes Nets - example # (for better visuals) https://www.youtube.com/watch?v=0Te3m5y3CYA
      +----> e <---+
      |            |
a --> c ---------> d   # this graph just denotes
      ^            ^   # a: doesn't depend on anything, completely independent
      |            |   # b: also completely independent
b ----+------------+   # c: depends on a and b
                       # d: depends on b and c
                       # e: depends on c and d
node | what it represents
-------------------------
  a  | P(a)
  b  | P(b)
  c  | P(c|a,b)
  d  | P(d|b,c)
  e  | P(e|c,d)
- must be acyclic
- nodes are visited (computation done) in topological order # i.e. all required parent nodes must be visited first
###
### recovering joint distribution (from conditional probability)
###
recall "joint distribution" tables, i.e. we were given the probability for every combination of variable values.
from such a table, we calculated conditional probability, such as P(a¬b)
now we can try the other direction: calculating joint distribution from conditonal probability table.
joint distrib  conditional probability
 # assuming the same above Bayes Net
P(a,b,c,d,e) = P(a) * P(b) * P(ca,b) * P(db,c) * P(ec,d)
what's the representation complexity (i.e. table size) ? # assume each variable is boolean
P(a,b,c,d,e) : 2^5 = 32 (combinations). that's an exponentially growing table size, O(2^n)
                                                                              31
technically it's 2^5 - 1 = 31, because the 32nd combination's probability is 1 - Σ p(xi)
                                                                                i=1
P(a)     : 2^1 # a is true or false, hence 2. but you can always derive P(a) = 1 - P(¬a)
P(b)     : 2^1 # so the effective size is 1
P(c|a,b) : 2^2 # you store the probability of c for every combination of a & b (=2^2, so size 4), and P(¬c|a,b) can be derived.
P(d|b,c) : 2^2 # (good ref) https://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
P(e|c,d) : 2^2
==> if you construct the conditional probability tables, all you need is 1 + 1 + 4 + 4 + 4 = 14 in this case.
31 > 14
O(2^n) > O(n * 2^k) # k is how many parents a node has
so in a way, storing info in a conditional probability table is more efficient.
NOTE: table size (how many things you have to remember) illustration below
         a=T | a=F
P(a):    0.3 | 0.7  # then you need just 1, because P(a=F) can be derived by 1 - P(a=T)
              b=T | b=F
P(a|b):  a=T  0.6 | 0.1 # so the table size is 2^2
         a=F  0.4 | 0.9 # but this row P(¬a|b) can be derived from the first row P(a|b), so it's really 2
# by now you see the pattern
P(a|b,c): 2^2 = 4 # it's 2^k where k = number of dependent variables (parents in the bayes network).
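the size comparison above can be sketched in Python (the function names are mine; each node is assumed to store 2^k free parameters, k = its parent count):

```python
def full_joint_size(n):
    """Entries needed for a full joint table over n boolean variables
    (one entry is derivable since all probabilities sum to 1)."""
    return 2 ** n - 1

def bayes_net_size(parent_counts):
    """Entries needed for the conditional probability tables: each node
    stores 2^k free parameters, where k is its number of parents."""
    return sum(2 ** k for k in parent_counts)

# the a,b,c,d,e network above: a and b have no parents; c, d, e have two each
print(full_joint_size(5))               # 31
print(bayes_net_size([0, 0, 2, 2, 2]))  # 1 + 1 + 4 + 4 + 4 = 14
```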
###
### sampling (from some distribution)
###
distribution:
- probability of a value
- generate values
==> simulates a complex process
==> approximate inference # approximate because exact inference may be hard or slow
==> visualization
###
### inference rule
###
marginalization: P(x) = Σ P(x,y) # this is trivial and redundant but useful
                        y # if y = 1 or 0, it's like saying P(x) = P(x,y=1) + P(x,y=0)
chain rule: P(x,y) = P(x) * P(y|x) # of course, if x and y are completely independent,
                   = P(y) * P(x|y) # then you have P(x,y) = P(x) * P(y)
                   = P(y,x) # x & y are commutative
NOTE: though P(x,y) = P(y,x), they are different in a Belief Nets representation
e.g.
x --> y # represents P(x) * P(y|x)
y --> x # represents P(y) * P(x|y)
                     P(x|y) * P(y)
Bayes rule: P(y|x) = --------------
                         P(x)
(quiz) https://www.youtube.com/watch?v=Vt8fRJwvdwA # inference by hand
https://www.youtube.com/watch?v=uACvhE19c_M # messy but very good demonstration of marginalization/chain/bayesian rules
###
### Naive Bayes inference
###
naive bayes - a special case of bayes nets, where we assume all attributes are conditionally independent of each other, given a label v
- graphically below
    /--> a1
   /---> a2
  v  ...      # every node (a1,a2,...,an) has one parent "v"
   \          # so it's a depth=1 tree
    \--> an
v = some value # can be a class of hypothesis
a = some attribute
given the above particular setup, the joint probability distribution is as below
                      P(v)
P(v|a1,a2,...,an) = ------- * Π P(ai|v) # z = just some normalization factor (we discuss below)
                       z      i
MAP class = argmax P(v) * Π P(ai|v) # pick v ∈ V such that it maximizes P(v|a1,a2,...,an)
              v           i # notice it's the same as h_MAP
                            # so we dropped z also
==> so we infer the best fit hypothesis class given attributes
NOTE: this inference rule is "naive" because the underlying assumption is that each attribute is conditionally independent of the other attributes, given v
(quiz) https://www.youtube.com/watch?v=M59h7CFUwPU
given that an email is spam (or not), what's the probability of the email containing certain words ?
P(spam) = 0.4
P('rayban'|spam) = 0.2 # probability of the word rayban appearing, given a spam email
P('rayban'|¬spam) = 0.1 # probability of the word rayban appearing, given a non-spam email
P('viagra'|spam) = 0.3
P('viagra'|¬spam) = 0.001
P('gatech'|spam) = 0.0001
P('gatech'|¬spam) = 0.1
==> we can visualize it in a bayesian net
        /--> gatech
spam --+---> rayban
        \--> viagra
now, compute P(spam|'rayban',¬'viagra',¬'gatech') # simply apply Bayes rule
= P('rayban',¬'viagra',¬'gatech'|spam) * P(spam) / P('rayban',¬'viagra',¬'gatech')
= P('rayban'|spam) * P(¬'viagra'|spam) * P(¬'gatech'|spam) * P(spam) / P('rayban',¬'viagra',¬'gatech') # now apply the numbers
= 0.2 * 0.7 * 0.9999 * 0.4 / P('rayban',¬'viagra',¬'gatech') # we don't know the denominator
= some_number_foo / P('rayban',¬'viagra',¬'gatech')
we don't know the denominator P('rayban',¬'viagra',¬'gatech'), but that's ok: we can normalize to find its value, aka the normalization factor.
remember that P(spam|'rayban',¬'viagra',¬'gatech') + P(¬spam|'rayban',¬'viagra',¬'gatech') = 1 # obvious, P(a|b) + P(¬a|b) = 1
now try computing P(¬spam|'rayban',¬'viagra',¬'gatech') # just apply Bayes rule
= P('rayban',¬'viagra',¬'gatech'|¬spam) * P(¬spam) / P('rayban',¬'viagra',¬'gatech')
= P('rayban'|¬spam) * P(¬'viagra'|¬spam) * P(¬'gatech'|¬spam) * P(¬spam) / P('rayban',¬'viagra',¬'gatech') # now apply the numbers
= 0.1 * 0.999 * 0.9 * 0.6 / P('rayban',¬'viagra',¬'gatech')
= some_number_bar / P('rayban',¬'viagra',¬'gatech')
some_number_foo + some_number_bar
--------------------------------- = 1 # we know both numerators as real values
 P('rayban',¬'viagra',¬'gatech')       # so we can derive this normalization factor
note: if we only care about "classification" then we just need to know if some_number_foo > some_number_bar, and not worry about the denominator
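the whole quiz fits in a short Python sketch (function/variable names are mine; I take P('rayban'|spam) = 0.2 from the listed numbers, and the normalization step plays the role of the unknown denominator):

```python
# quiz numbers from above
p_spam = 0.4
likelihoods = {             # word: (P(word | spam), P(word | ¬spam))
    "rayban": (0.2, 0.1),
    "viagra": (0.3, 0.001),
    "gatech": (0.0001, 0.1),
}

def posterior_spam(present):
    """present: dict word -> bool (whether the word appears in the email).
    Returns P(spam | evidence), normalizing away the shared denominator."""
    num_spam, num_ham = p_spam, 1.0 - p_spam
    for word, (ps, ph) in likelihoods.items():
        if present[word]:
            num_spam *= ps
            num_ham *= ph
        else:
            num_spam *= 1.0 - ps
            num_ham *= 1.0 - ph
    return num_spam / (num_spam + num_ham)  # the normalization step

evidence = {"rayban": True, "viagra": False, "gatech": False}
print(round(posterior_spam(evidence), 4))  # ~0.51, a near coin-flip
```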
###
### why Naive Bayes is cool
###
- inference is cheap # exact inference in general Bayes nets is NP-hard, but it's cheap in the Naive Bayes form
- few parameters # like we saw in the conditional independence table, its space complexity is linear
- estimate parameters with labeled data (we can do the other direction too)
           # of occurrences of (ai, v) # count how many instances of ai appear under label v in the labeled data set
P(ai|v) = ---------------------------
           # of v # v = label. in practice, the counts are initialized to some super small number, to smooth.
# the assumption (or we may call it inductive bias in this case) is that all things are mildly possible to begin with. (avoids overfitting)
- connects inference and classification # inference may be hard, but classification may be simpler
- empirically successful (handles missing attributes well, because it's just probabilistic inference. in other algos like decision trees, a missing attribute might get you stuck)
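the counting-with-smoothing estimator above can be sketched in Python (the function name `estimate_params` and the smoothing constant 1e-3 are my own illustrative choices; attributes are assumed binary):

```python
from collections import Counter, defaultdict

def estimate_params(labeled, smoothing=1e-3):
    """Estimate P(v) and P(ai=1 | v) by counting, seeding every count with a
    tiny pseudo-count so all attributes stay 'mildly possible' (smoothing)."""
    n = len(labeled)
    n_attrs = len(labeled[0][0])
    label_counts = Counter(v for _, v in labeled)
    attr_counts = defaultdict(lambda: [smoothing] * n_attrs)
    for attrs, v in labeled:
        for i, a in enumerate(attrs):
            attr_counts[v][i] += a  # count occurrences of ai=1 under label v
    p_v = {v: c / n for v, c in label_counts.items()}
    p_a_given_v = {v: [c / (label_counts[v] + 2 * smoothing) for c in counts]
                   for v, counts in attr_counts.items()}
    return p_v, p_a_given_v

# toy labeled data: (binary attribute vector, label)
data = [((1, 0), "spam"), ((1, 1), "spam"), ((0, 0), "ham"), ((0, 1), "ham")]
p_v, p_a = estimate_params(data)
print(p_v["spam"])     # 0.5
print(p_a["spam"][0])  # just under 1.0, thanks to smoothing
```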
#######################################
#### (2) Unsupervised Learning ####
#######################################
###########################################
#### (2.1) Randomized Optimization ####
###########################################
[optimization]
- input space X
- objective function (aka fitness function) f:X -> R
- goal: find x*∈X s.t. f(x*) = max f(x) # i.e. argmax
                                x
note: x can be discrete
e.g.
- stock portfolio
- route finding
- factory, chemical composition (process control in general)
- neural networks # pick weights that minimize error
- decision trees # structure
###
### optimization approaches
###
- generate & test # useful when the input space is small, or the function is complex
- calculus # useful when the function is differentiable and solvable
- Newton's method # hopefully the function has a single optimum
- hill climbing: pick a point x, evaluate it and its neighbors, e.g. f(x-1), f(x), f(x+1), decide which direction to go, repeat until converging, i.e. reaching an optimum.
in other words, optimization can be hard if
- big input space
- complex function
- no derivative (or hard to find)
- possibly many local optima
(quiz) https://www.youtube.com/watch?v=Bw4oFUiabXU
https://www.youtube.com/watch?v=b9QA2ETuhso
==> "randomization" helps overcome those obstacles.
##
## RHC : randomized hill climbing
##
e.g. randomly pick different starting points for your Newton's (or hill climbing) method, and compare the results.
(quiz) https://www.youtube.com/watch?v=Ts9vLVxyqdY
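a minimal Python sketch of RHC on a toy discrete landscape (function names and the toy landscape are mine; the landscape has a local optimum at index 2 and the global one at index 8):

```python
import random

def hill_climb(f, x, neighbors):
    """Greedy climb: move to the best neighbor until none improves f."""
    while True:
        best = max(neighbors(x), key=f)
        if f(best) <= f(x):
            return x
        x = best

def random_restart_hill_climb(f, sample_x, neighbors, restarts=20):
    """RHC: hill climb from several random starting points, keep the best."""
    return max((hill_climb(f, sample_x(), neighbors) for _ in range(restarts)),
               key=f)

# toy 1-d landscape: a local optimum at index 2, the global one at index 8
landscape = [0, 1, 3, 2, 1, 0, 2, 4, 6, 5]
f = lambda i: landscape[i]
neighbors = lambda i: [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]

random.seed(1)
result = random_restart_hill_climb(f, lambda: random.randrange(10), neighbors)
print(result)  # the global optimum's index, given enough restarts
```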
note: one observation: whether you find the global optimum depends on the size of its basin of attraction.
 .
 . .
 . . ..
 . . . .
 . . . .
. . .
____________________
<------a------><-b->
a = basin of attraction for a local optimum
b = basin of attraction for the global optimum
NOTE: always improving is prone to getting stuck at a local optimum (loosely related to overfitting, maybe). but if you allow more search/exploration, like sometimes going downhill on the curve above instead of always improving f(x), then you may find a better solution.
this brings us to the concept of the annealing algorithm.
###
### Simulated Annealing Algorithm (one of randomized optimization algos)
###
annealing is how you reinforce metal by heating and cooling it. (by controlling the temperature carefully, the molecules line up better and therefore your metal, like a sword or whatever, strengthens)
so let's have a fitness function f(x)
1. for a finite set of iterations:
a. sample a point x, then sample a new point xt in N(x) # N() is a neighbor function; N(x) gives a neighbor of x
b. jump to the new sample with probability given by an acceptance probability function P(x,xt,T) # i.e. probabilistically decide whether to move x to xt
c. decrease temperature T (slowly) # assume T > 0
P(x,xt,T) = 1 if f(xt) >= f(x) # improving
          = exp( (f(xt) - f(x)) / T ) otherwise # searching/exploring
e.g.
if f(xt) - f(x) is negative but infinitesimally small in magnitude, then it's essentially e^0 = 1 # a large T effectively causes this too
if f(xt) - f(x) is a huge negative number, then it's essentially 1 / e^huge_pos_number = almost 0 # a tiny T effectively causes this too
===> this intuitively makes sense. we call this adoption of annealing to our optimization context: "simulated annealing"
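the loop above can be sketched in a few lines of Python (function names, the cooling schedule, and the toy fitness are my own choices; tracking the best point seen is an extra convenience, not part of the core algorithm):

```python
import math
import random

def simulated_annealing(f, x0, neighbor, t0=10.0, cooling=0.995, iters=4000):
    """Minimal sketch (maximizing f): always accept improving moves, accept
    worse ones with probability exp((f(xt) - f(x)) / T), and cool T slowly."""
    x, best, t = x0, x0, t0
    for _ in range(iters):
        xt = neighbor(x)
        delta = f(xt) - f(x)
        if delta >= 0 or random.random() < math.exp(delta / t):
            x = xt                  # accept the move
        if f(x) > f(best):
            best = x                # remember the best point seen
        t = max(t * cooling, 1e-9)  # decrease the temperature, keep T > 0
    return best

# toy fitness with a single global maximum at x = 3
f = lambda x: -(x - 3.0) ** 2
random.seed(0)
result = simulated_annealing(f, x0=0.0,
                             neighbor=lambda x: x + random.uniform(-0.5, 0.5))
print(round(result, 1))  # close to 3.0
```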
###
### properties of simulated annealing
###
T -> ∞ : P(x,xt,T) = e^0 = 1, so it always moves x to xt regardless of f(xt) - f(x), i.e. a "random walk"
T -> 0 : e^(-∞) = 0, so it never explores, only improves. (may end up in a local optimum)
                             exp( f(x)/T )  # we don't go into the proof
Pr(ending_at_some_point_x) = -------------  # this is called the Boltzmann distribution in the physics community
                                 Z_T        # Z_T is just a normalization factor
###
### Genetic Algorithms (another randomized optimization algo)
###
conceptually,,
population of individuals
mutation : local search N(x)
crossover : like a child inheriting properties(DNA) from two individuals(parents) # we may call it "combination" essentially.
generations : iterations of improvement
[genetic algo - sketch]
(1) P_0 = initial population of size K # how to represent input is a topic of interest
(2) repeat the below until converge
a. compute fitness of all x
b. select "most fit" individuals (top half, weighted probability)
c. pair up individuals, replacing "least fit" individuals via crossover/mutation
note: obviously you have to define fitness(x) function, and also have to define how to select most fit individuals.
"top half" : take the top half. this kind of slection is called "truncate selection"
"weighted probability" : you pick individuals randomly, but you assign higher probability to better fitnessscore individuals. (called "roulette wheel" selection)
note: because of its generic applicability, genetic algo is sometimes called the 2nd best solution to any problem
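the sketch above, with truncation selection, can be written as a small Python program (function names, the one-max toy problem, and the mutation rate are my own illustrative choices):

```python
import random

def genetic_algorithm(fitness, random_individual, crossover, mutate,
                      pop_size=50, generations=100):
    """Truncation-selection GA sketch: keep the fittest half each generation,
    refill the population with mutated offspring of random survivor pairs."""
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]  # "truncation selection"
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            children.append(mutate(crossover(p1, p2)))
        population = survivors + children
    return max(population, key=fitness)

# toy problem: maximize the number of 1-bits ("one-max")
BITS = 20
fitness = lambda x: sum(x)
rand_ind = lambda: [random.randint(0, 1) for _ in range(BITS)]
one_point = lambda a, b: a[: BITS // 2] + b[BITS // 2 :]  # one-point crossover
mutate = lambda x: [1 - b if random.random() < 0.05 else b for b in x]

random.seed(0)
best = genetic_algorithm(fitness, rand_ind, one_point, mutate)
print(fitness(best))  # at, or very near, the maximum of 20
```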
###
### crossover example
###
x = 8bit strings
x1: 01101100
x2: 11010111
there are many ways to crossover them.
(1) one point crossover
 choose one position to slice them, then flip flop
e.g.
x1: 01101100
x2: 11010111
offspring: 0110|0111 # slice after bit 4; the other offspring is 1101|1100
==> what properties(assumption/inductive bias) exist in this crossover method?
a. locality of bits # are preserved (and assumed to matter)
b. subspaces to optimize (independently) # related to a. locality property
# assume you have some subspace optimized already, so you can focus on other subspaces, etc
(2) uniform crossover
 randomize at each bit position
e.g.
x1: 01101100
x2: 11010111
offspring: 11001110 # this happens in real human gene reproduction
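both crossover operators are one-liners in Python (function names are mine; `uniform_crossover` is randomized, so it produces one of many possible offspring):

```python
import random

def one_point_crossover(x1, x2, point):
    """Slice both parents at `point` and swap tails (preserves bit locality)."""
    return x1[:point] + x2[point:], x2[:point] + x1[point:]

def uniform_crossover(x1, x2):
    """Pick each bit independently from either parent."""
    return "".join(random.choice(pair) for pair in zip(x1, x2))

x1, x2 = "01101100", "11010111"
print(one_point_crossover(x1, x2, 4))  # ('01100111', '11011100')
print(uniform_crossover(x1, x2))       # one of many possible offspring
```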
###
### MIMIC (mutual info maximizing input clustering)
###
two issues with those randomized optimization algos we've looked at so far.
(1) structured information
you go thru all the trouble of choosing a point, evaluating fitness, iteration, but you don't really keep track of all the info you gain along the way.
(2) unclear probability distribution
ok, annealing algo gives boltzmann distrib, which is good. but not all randomized optimization algos give that kind of nice clear probability distribution.
==> MIMIC algo directly models distribution of the underlying data (via dependency tree), and refines the model over time, thereby conveying structure.
(paper) https://www.cc.gatech.edu/~isbell/papers/isbellmimicnips1997.pdf
[MIMIC algo]
P_Θ(x) = 1/z_Θ if f(x) >= Θ # probability of x, given threshold theta Θ
       = 0 otherwise # z_Θ is just a normalization factor over all x whose f(x) >= Θ (this point is important)
(quiz) https://www.youtube.com/watch?v=49Y2C_mTWD0
Θmin = smallest value f(x) can return
Θmax = largest value f(x) can return
P_Θmin(x) = uniform distrib # because everything is accepted
P_Θmax(x) = optima # only accepting the max but there can be multiple of them on the same value
MIMIC algo starts with P_Θmin(x) and aims at P_Θmax(x), as below.
[MIMIC algo - pseudo code]
1. generate (many) samples from P_Θt(x) # kinda like 'population' in genetic algo
2. set Θt+1 to n'th percentile # pick the best(fittest) n'th percentile out of samples from 1. # btw, Θt+1 == Θ_(t+1)
3. retain only those samples s.t. f(x) >= Θt+1
4. estimate P_Θt+1(x) via dependency tree # this is where you encode structure, sort of.
5. goto 1.
a couple of important assumptions:
- re: step 4 : given a finite set of data, we can estimate its probability distribution P_Θ(x)
- P_Θt(x) ≈ P_Θt+1(x) # successive distribs should be close, so the samples drawn from P_Θt help us estimate P_Θt+1
###
### estimating distribution (from a finite set of data samples)
###
x = [x1 x2 x3 ... xn]
P(x) = P(x1|x2,...,xn) * P(x2|x3,...,xn) * ... * P(xn) # the joint distrib expanded via the chain rule
==> this is hard to compute, as it requires exponential sized table O(2^n), but becomes easy with an assumption: we always have "dependency trees"
dependency tree: a special type of Bayes Nets where network itself is a tree (i.e. every node has one parent, except for the root)
it's useful because (1) it captures relationships between variables, and (2) sampling is easy, as you just pick some parts of the tree.
Then P(x) ≈ P̂_π(x) = Π_i P(xi|π(xi))  # π(xi) means xi's parent. Π is just product. P̂ ("P-hat") denotes approximation
                                       # the root node doesn't have a parent, so just assume π(x_root) returns itself, or something
                                       # table size is O(n * 2^k) where k = 1, so it's O(2n) = O(n)
NOTE: conceptually, MIMIC algo just wants a simple yet effective structure to represent intervariable probabilistic dependency of input variables, so "dependency tree" is just one simplistic way.
we certainly don't want to assume every variable is completely independent from each other, so with dependency tree, we at least allow a (simple enough) structure in which every variable probabilistically depends on (in this case strictly one) other variable. one can argue that's just too arbitrary and not good enough, but it's been proven effective empirically and theoretically. more details below.
(read the paper) https://www.cc.gatech.edu/~isbell/papers/isbellmimicnips1997.pdf
###
### finding dependency trees # in such a way that it models prob distrib well
###
KL divergence: a measure of distance (divergence) btwn the true probability distrib P and our estimation P̂

D_KL(P || P̂_π) = Σ P [lg(P) - lg(P̂_π)]
               = -h(P) + Σ h(xi|π(xi)) # h(P) means entropy of probability distrib P. h(xi|π(xi)) means conditional entropy of xi given its parent
               ≈ Σ h(xi|π(xi))         # all we want is to find the best π() that minimizes the sum, so we can drop h(P)
               = J_π                   # we denote it like this
===> obviously, our goal is to minimize the KL divergence, i.e. min J_π
min J_π is hard to compute, so we define a slightly different version that makes it easier to compute.

Ĵ_π = -Σ h(xi) + Σ h(xi|π(xi)) # we just added the term -Σh(xi), which is ok because it doesn't depend on π at all
    = -Σ I(xi; π(xi))          # because h(xi) - h(xi|π(xi)) is exactly the mutual information I(xi; π(xi))

min Ĵ_π = max Σ I(xi; π(xi))   # i.e. maximize mutual information between each xi and its parent
suppose a fully connected graph: each node is xi, and each edge is mutual information between two nodes
o---o
|\ /|
| X |
|/ \|
o---o
===> we simply wanna pick a set of edges that form the highest total information content tree: maximum spanning tree !!
(so we successfully transformed the problem of finding the best dependency tree into the problem of finding a maximum spanning tree)
(i.e. you can negate the edge values and find a minimum spanning tree)
(just use Prim's algo, which works well on a densely connected graph)
(pick any node as root, then that dictates the rest as a tree)
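To make this concrete, here is a sketch (names and toy data are mine, binary features assumed) of estimating pairwise mutual information from samples and growing the tree with Prim's algo, picking the highest-MI edge into the tree at each step:

```python
# sketch: pairwise mutual information + Prim's algo for a maximum spanning (dependency) tree
import math
from collections import Counter

def mutual_info(samples, i, j):
    n = len(samples)
    pi = Counter(s[i] for s in samples)
    pj = Counter(s[j] for s in samples)
    pij = Counter((s[i], s[j]) for s in samples)
    mi = 0.0
    for (a, b), c in pij.items():            # zero-count cells are skipped automatically
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

def dependency_tree(samples):
    n_feat = len(samples[0])
    in_tree, parent = {0}, {0: None}         # pick node 0 as the root
    while len(in_tree) < n_feat:
        # Prim's step: add the out-of-tree node with the highest-MI edge into the tree
        _, u, v = max((mutual_info(samples, u, v), u, v)
                      for u in in_tree for v in range(n_feat) if v not in in_tree)
        in_tree.add(v)
        parent[v] = u
    return parent                            # parent[v] == pi(x_v)

# toy data: x1 is a copy of x0, x2 is independent noise
data = [(0,0,1),(0,0,0),(1,1,1),(1,1,0),(0,0,1),(1,1,0),(1,1,1),(0,0,0)]
print(dependency_tree(data))                 # x1's parent is x0 (they're perfectly correlated)
```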
###
### MIMIC algo: pseudo code
###
now we know how to find the best dependency tree, let's get back to MIMIC algo
[MIMIC algo - pseudo code]
1. generate (many) samples from P_Θt(x) # we use its estimate P̂_π represented as a dependency tree
2. set Θt+1 to n'th percentile # pick the best(fittest) n'th percentile out of samples from 1. # btw, Θt+1 == Θ_(t+1)
3. retain only those samples s.t. f(x) >= Θt+1
4. estimate P_Θt+1(x) via dependency tree # find MST over I(x,π(x))
5. goto 1.
a couple of important assumptions:
 re: step 4. : given a finite set of data, we can estimate its probability distribution P_Θ(x)
 P_Θt(x) ≈ P_Θt+1(x) # these two should be close, so that we can generate samples for Θt+1 from the samples of Θt
note: again, dependency tree is just a simple but effective example of representing probability distrib for MIMIC, and not a must.
(good video) https://www.youtube.com/watch?v=dXWTBFMFxqs
###
### MIMIC: practical matters
###
(a must see quiz): https://www.youtube.com/watch?v=E0aAz2odMqY
https://www.youtube.com/watch?v=4DtVLCC73bA
- MIMIC does well with structure
- representing P_Θ ∀Θ
- local optima # but you can usually avoid this because restart is free
- probability theory
- time complexity # fewer iterations than others (random hillclimb, genetic algo, etc)
# MIMIC does well when the cost of the fitness function is expensive
==> overall, MIMIC gives you a lot more structure (i.e. information) about the (probability distrib/dependency of) data we train a learner on.
###########################################################
#### (2.2) Clustering and Expectation Maximization ####
###########################################################
one of the classical unsupervised learning problems is "clustering" : taking a set of objects and putting them into groups.
[clustering  high level description]
given:
 set of objects X
 interobject distances D(x,y)=D(y,x) where x,y ∈ X # at least some meta notion of "distance" (or similarity)
# no need to satisfy the triangle inequality
output:
 partition Pd(x) = Pd(y) if x & y in same cluster # Pd(a) = b means partition an input 'a' into a cluster 'b' using distance metric 'd'
(quiz) how do we partition every input into the same cluster ?
Pd(x) = a
(quiz) how do we partition every input into its own cluster ?
Pd(x) = x
===> there are many specific clustering algorithms, so we go thru some of them.
###
### distance (aka similarity) metrics
###
finding the right distance metric is one of the hardest tasks in clustering problems.
you need domain knowledge of your data, and some theoretical and empirical analysis.
sklearn defines them so well, so i'm gonna quote their tables here.
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
(1) Metrics intended for real-valued vector spaces:
identifier     class name           args      distance function
---------------------------------------------------------------
"euclidean"    EuclideanDistance    .         sqrt(sum((x - y)^2))
"manhattan"    ManhattanDistance    .         sum(|x - y|)
"chebyshev"    ChebyshevDistance    .         max(|x - y|)
"minkowski"    MinkowskiDistance    p         sum(|x - y|^p)^(1/p)
"wminkowski"   WMinkowskiDistance   p, w      sum(|w * (x - y)|^p)^(1/p)
"seuclidean"   SEuclideanDistance   V         sqrt(sum((x - y)^2 / V))
"mahalanobis"  MahalanobisDistance  V or VI   sqrt((x - y)' V^-1 (x - y))
(2) Metrics intended for integer-valued vector spaces: though intended for integer-valued vectors, these are also valid metrics in the case of real-valued vectors.
identifier     class name           distance function
-----------------------------------------------------
"hamming"      HammingDistance      N_unequal(x, y) / N_tot
"canberra"     CanberraDistance     sum(|x - y| / (|x| + |y|))
"braycurtis"   BrayCurtisDistance   sum(|x - y|) / (sum(|x|) + sum(|y|))
(3) Metrics intended for boolean-valued vector spaces: any nonzero entry is evaluated to "True". In the listings below, the following abbreviations are used:
N : number of dimensions
NTT : number of dims in which both values are True
NTF : number of dims in which the first value is True, second is False
NFT : number of dims in which the first value is False, second is True
NFF : number of dims in which both values are False
NNEQ : number of nonequal dimensions, NNEQ = NTF + NFT
NNZ : number of nonzero dimensions, NNZ = NTF + NFT + NTT
identifier        class name              distance function
------------------------------------------------------------
"jaccard"         JaccardDistance         NNEQ / NNZ
"matching"        MatchingDistance        NNEQ / N
"dice"            DiceDistance            NNEQ / (NTT + NNZ)
"kulsinski"       KulsinskiDistance       (NNEQ + N - NTT) / (NNEQ + N)
"rogerstanimoto"  RogersTanimotoDistance  2 * NNEQ / (N + NNEQ)
"russellrao"      RussellRaoDistance      NNZ / N
"sokalmichener"   SokalMichenerDistance   2 * NNEQ / (N + NNEQ)
"sokalsneath"     SokalSneathDistance     NNEQ / (NNEQ + 0.5 * NTT)
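For concreteness, a few of the formulas above written out directly (hand-rolled illustrations assuming equal-length vectors, not sklearn's implementations):

```python
# a few distance metrics from the tables above, written out by hand
import math

def euclidean(x, y):  return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
def manhattan(x, y):  return sum(abs(a - b) for a, b in zip(x, y))
def chebyshev(x, y):  return max(abs(a - b) for a, b in zip(x, y))
def hamming(x, y):    return sum(a != b for a, b in zip(x, y)) / len(x)
def jaccard(x, y):    # boolean vectors: NNEQ / NNZ
    nneq = sum(bool(a) != bool(b) for a, b in zip(x, y))
    nnz  = sum(bool(a) or bool(b) for a, b in zip(x, y))
    return nneq / nnz if nnz else 0.0

print(euclidean((0, 0), (3, 4)))             # 5.0
print(manhattan((0, 0), (3, 4)))             # 7
print(hamming((1, 0, 1, 1), (1, 1, 1, 0)))   # 0.5
```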
###
### single linkage clustering (SLC) algo
###
[SLC algo]
 consider each object a cluster (n objects) # i.e. Pd(x) = x
 define inter-cluster distance as the distance between the closest two points in the two clusters # this step is O(n^2)
 merge the two closest clusters
 repeat n-k times to make k clusters # the loop runs n-k = O(n) times
(quiz) https://www.youtube.com/watch?v=aojgUed9M0w
note: we took the two "closest" points to define inter-cluster distance. but, depending on your "domain knowledge", you might choose the median/average/farthest points. (then the algo has a different name, like average/median/maximal (aka complete) linkage clustering, etc)
note: SLC is in a class of unsupervised learning algos called "hierarchical agglomerative clustering" aka HAC
note: a tree representation of clustering is informative: it tells you which one is closer together and got clustered first
e.g. n=7, k=2 # i.e. group these n objects into k clusters
a   b        d
         e     f   # probably you connect d-e first, then maybe a-b, e-g, g-f, b-c
  c          g
===> notice you can represent the clustering steps/result as a tree, very informative
              /\
         ____/  \____
        /\          /\
       /  \        /  \
      /\   c     /\    f
     /  \       /  \
    a    b    /\    g
             /  \
            d    e
[SLC properties]
 SLC is deterministic
 running time of SLC with n objects, k clusters: O(n^3) # we can improve with memoization, etc but still O(n^3)
 if distance is represented as edge length of a graph, then SLC algo is the same as a minimum spanning tree algo.
 it fails with an example data set like below
e.g. k=2
 ............
 .          .
 .  .. .    .  . . .        # ideally we want to cluster the left and the right groups
 .  .. .    .. . .  ......  # but SLC will cluster the 4 dots in the big rectangle on the left as one cluster and the rest as the other cluster
 .          .  . .  ..      # other clustering algos like "k-means clustering" will solve this.
 ............
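The SLC procedure described in this section can be sketched as follows (a toy sketch; the point set and the brute-force nested mins are purely illustrative):

```python
# SLC sketch: start with singleton clusters, repeatedly merge the closest pair
import math

def slc(points, k):
    clusters = [[p] for p in points]                 # each object starts as its own cluster
    def inter(c1, c2):                               # single-linkage inter-cluster distance
        return min(math.dist(a, b) for a in c1 for b in c2)
    while len(clusters) > k:                         # n - k merges
        _, i, j = min((inter(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        clusters[i] += clusters.pop(j)               # merge the two closest clusters
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(sorted(len(c) for c in slc(pts, 2)))           # [2, 3]
```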
###
### K means clustering
###
put n objects into k clusters
[algo]
1. pick k centers (at random) # "center" doesn't need to be one of the n objects
2. each center "claims" its closest points to form k clusters # every n_i obj should find its closest center
3. recompute the centers by averaging the clustered points # hence "k means"
4. repeat 2 & 3 until converge # this iteration is counted as t
(visual) https://www.youtube.com/watch?v=f4jvifS41M4
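The 4 steps above can be sketched as below (a minimal sketch; the toy points and the exact convergence check are mine, not the course's reference code):

```python
# k-means sketch: assign points to nearest center, recompute centers, repeat
import math, random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                       # 1. pick k centers at random
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # 2. each point joins its closest center
            i = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]          # 3. recompute centers as cluster means
        if new == centers:                                # 4. repeat 2 & 3 until convergence
            break
        centers = new
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))                   # [3, 3]
```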
##
## "k means" in Euclidean space
##
Pt(x) = y : partition(i.e. cluster) object x into some cluster y, t is just a counter of iteration t=1,2,3,,, so on
Ct_i : set of all points in cluster i = {x s.t. Pt(x) = i}
center_i^t = ( Σ_{y∈Ct_i} y ) / |Ct_i| # the center (i.e. "mean" aka "centroid") of cluster i in iteration t is simply the sum of all objects y
                                       # in the cluster, divided by the number of objects in the cluster
                                       # there are k centers: so we get "k means" literally
[algo] # just rewritten in math notation. the subscript 2 denotes the euclidean norm
center_i^0 --> Pt(x) = argmin_i ||x - center_i^(t-1)||_2 --> center_i^t = ( Σ_{y∈Ct_i} y ) / |Ct_i| --> (t=t+1, back to step 2)
  [step 1]                  [step 2]                                       [step 3]
##
## "k means" as optimization problem
##
configuration (aka input): center, P
score : error E(P, center) = Σ_x ||center_P(x) - x||_2^2 # we want to minimize this error
neighborhood of (P, center) = {(P', center)} ∪ {(P, center')} # i.e. we either move the partition or move the center,
                                                              # and take the union ∪ of the two to define the neighborhood
(quiz) https://www.youtube.com/watch?v=y6a5Nm394Gw
==> notice the stunning similarity to hillclimbing algo.
when we update Pt(x), we stay in the same cluster or pick better score (less error) cluster <==> we evaluate neighbors and stay or go up if improve.
we take the average point of each cluster as a new center <==> we only go up if it improves
(average is the best estimate if you want to minimize the sum of squared error, it's just calculus)
(ref) https://math.stackexchange.com/questions/1352726/minimizingafunctionsumofsquares
so the way we update center ensures the error is monotonically nonincreasing (because we stay or improve)
also, configuration set is finite, because the input object set is finite, so the centers are defined deterministically.
also, it's possible that you may thrash back and forth between some equally optimal partitionings, so you need to have a simple tie breaking rule.
because you never go into higher error configuration, basically you never repeat configurations, i.e. finite (possibly exponential) iteration, thus eventual convergence.
##
## Kmeans clustering  properties
##
(1) complexity
[algo]
1. pick k centers (at random) # O(k)
2. each center "claims" its closest points to form k clusters # O(nk) or O(nkd) if you consider d dimensional points
3. recompute the centers by averaging the clustered points # O(n)
4. repeat 2 & 3 until converge # well, the worst case is you explore every possible k-clustering combination of n objects, so nCk = O(n^k) or O(n^kd) if you consider d dimensions
note: step 2 & 3 are additive, so it's just O(nk). and multiply by step 4.
thus overall,
O(nkd) * O(n^kd) = O(n^(kd+1)) # so this is much worse than SLC algo which gives O(n^3)
(ref) https://en.wikipedia.org/wiki/K-means_clustering#Complexity
(2) problematic property
 suppose below problem. n = 6, k = 3
c d
a b # visually, we think it should be the clusters of {a,b}{c,d}{e,f}
e f
 step 1: suppose you pick a,b,f as your initial centers
 step 2: cluster 1 = {a}, cluster 2 = {b}, cluster 3 = {c,d,e,f}
 notice you converged, but not correct.
(random restarts help. or try to pick the initial k centers so that they are spread far apart. but neither is rigorous enough; soft clustering is one solution)
###
### Soft Clustering # aka fuzzy clustering
###
abc d efg # suppose n=7,k=2 then you wanna allow 'd' to kind of softly belong to both clusters
(quiz) https://www.youtube.com/watch?v=NSx_numd44
[steps] # note this is just one conceptual approach. there are others too.
assume the data was generated by
1. select one of k Gaussian distributions [with fixed known variance σ^2] uniformly
2. sample xi from that Gaussian
3. repeat n times
task: find a hypothesis h = {M1,M2,...,Mk} that maximizes the probability of the data (i.e. maximum likelihood)
note: gaussian distrib == normal distrib
note: k is the number of clusters. and we assume those clusters come from k distinct gaussian distributions = {g1,g2,...,gk}
(by distinct, we mean we expect the mean of each gaussian distrib is different from others)
so if n>k, then obviously we pick some points from the same gaussian distrib.
note: M1 = the mean of g1
M2 = the mean of g2
note: it's really a probabilistic process
note: so the above defined task is saying let's find a hypothesis that assumes those M1,M2,,Mk means which fit the data the most.
###
### Maximum Likelihood Gaussian
###
 suppose you have some data points like below, and you know they all come from one same Gaussian distrib.
 then how do you estimate the maximum likelihood (ML) mean of this Gaussian distrib?
 well it just happens to be the mean of the data points! easy and simple.
x x
x x x
x
x
 but the data points don't come from one same Gaussian distrib. they come from k distinct Gaussian distribs.
 how do we represent which data point came from which Gaussian distrib?
 use hidden variables. # aka "latent" variables.
e.g.
x1 = <x1,z1,z2,z3,,,,zk> # z is some random variable, all zero except one of them is 1, indicating x belongs to that Gaussian distrib (i.e. cluster)
x2 = <x2,z1,z2,z3,,,,zk> # soft clustering assigns probability, e.g. x2 may have z1=0.34, z2=0, z3=0.60, z4=0.01, so on
# i.e. x2 comes from cluster 1 with 34% probability, from cluster 2 with 0% probability, from cluster 3 with 60% probability, from cluster 4 with 1%, and so on
###
### expectation maximization algo
###
[algo] repeat alternating between the below 2 phases until they don't change much

  E step (define expectation of z from M):          M step (define max-likelihood mean M from z):

               P(x=xi|M=Mj)                                Σ_i E[Zij] * xi
  E[Zij] = ----------------------      <------>      Mj = -----------------
           Σ_{j=1..k} P(x=xi|M=Mj)                           Σ_i E[Zij]

  # the denominators are just for normalization
note: P(x=xi|M=Mj) denotes the probability of xi being produced by cluster j
note: Zij denotes likelihood of xi coming from cluster j
note: E[Zij] denotes the expectation of xi coming from cluster j, where 1 <= i <= n and 1 <= j <= k
using bayes rule, we can drop P(Mj) because it's uniform; also, since we are in the MLE setting, E[Zij] is simply proportional to P(x=xi|M=Mj)
note: algorithmically, it's very similar to k-means clustering. in fact this becomes k-means if cluster assignments use argmax.
P(x=xi|M=Mj) ∝ e^( -(xi-Mj)^2 / (2σ^2) ) # Gaussian density with fixed variance σ^2 (normalization constant dropped)
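The two phases above, sketched for 1-D data with fixed variance σ^2 = 1 (matching the fixed-known-variance assumption in the soft clustering setup; the data and initial means are made up):

```python
# EM sketch for a 1-D mixture of k Gaussians with fixed sigma^2 = 1
import math

def em(xs, means, iters=50):
    for _ in range(iters):
        # E step: E[Z_ij] proportional to P(x=x_i | M=M_j), normalized over clusters
        z = []
        for x in xs:
            w = [math.exp(-0.5 * (x - m) ** 2) for m in means]
            s = sum(w)
            z.append([wj / s for wj in w])
        # M step: M_j = sum_i E[Z_ij] * x_i / sum_i E[Z_ij]
        means = [sum(z[i][j] * xs[i] for i in range(len(xs))) /
                 sum(z[i][j] for i in range(len(xs))) for j in range(len(means))]
    return means

xs = [0.0, 0.2, 0.1, 9.9, 10.0, 10.2]
fitted = em(xs, means=[1.0, 9.0])
print([round(m, 1) for m in fitted])      # ≈ [0.1, 10.0]
```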
###
### Properties of EM
###
 monotonically nondecreasing likelihood
 does not converge (practically does)
 will not diverge # because it's within the bound of probability
 can get stuck # just random restart to get around this
 works with any distribution (if E and M are solvable)
(ref) https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
###
### (desired) properties for clustering
###
P_D(x) = C # clustering scheme(algo) that "P"artition an input x to cluster C, based on distance matrix D
here are 3 properties we desire for any clustering(partitioning) algo.
(1) richness: (expressiveness)
 for any assignment of objects to clusters, there is some distance matrix D such that P_D returns that clustering.
i.e. ∀C ∃D P_D = C # simply saying you can find some D to produce C for any input
(2) scaleinvariance
 scaling distances by a positive value does not change the clustering
i.e. ∀D ∀k>0 P_D=P_kD # intuitive, because distance between objects are relative, so scaling them all by some positive value k shouldn't change the clustering(partitioning) result.
(3) consistency
 shrinking "intra"cluster distances and expanding "inter"cluster distances does not change the clustering
i.e. P_D = P_D' # intuitive if you visualize
e.g.
a b e # let's say cluster 1 = {a,b,c}
c d f # cluster 2 = {d,e,f}
===> shrink intra-cluster distances, and expand inter-cluster distances
ab e # hence P_D = P_D'
c df
(quiz) https://www.youtube.com/watch?v=Ru5MsE6RQ0
SLC algo that stops when ...                              | richness | scale-invariance | consistency
----------------------------------------------------------|----------|------------------|------------
n/2 clusters reached                                      |    no    |       yes        |     yes
clusters are Θ units apart                                |    yes   |       no         |     yes
clusters are Θ/ω units apart, where ω = max_{i,j} D(i,j)  |    yes   |       yes        |     no
###
### impossibility theorem (proven by Kleinberg)
###
no clustering scheme can achieve all three of:
 richness
 scaleinvariance
 consistency
# read Kleinberg paper for more
#####################################
#### (2.3) Feature Selection ####
#####################################
 knowledge discovery, interpretability, insight
 curse of dimensionality: the amount of data you need grows exponentially to the number of features
e.g.
given 1000 features, you wanna pick the most relevant feature subset.
===> how hard is feature selection ? # let's say you want to pick m features out of n features, where n >= m
 NP-hard problem
 worst case: have to evaluate every possible combination of feature subset = 2^n
 two known approaches: (they have obvious tradeoffs)
##
## (1) filtering
##
you do feature selection as an independent process, and feed the result(selected feature subset) to learner.
 pros : speed
 cons : ignores the learning problem
 often amounts to isolated, individual feature analysis only
i.e. some features are relevant in combination but not individually
e.g. your feature selector may be a Decision Tree (with its information-gain-based splits): you take all features the DT selected and hand them to whatever your learner is (which may also be a DT, or KNN, boosting, etc). because DT and other algos have different inductive biases, it can make sense to use a DT to select features and a different algo to do the actual learning/training.
select features on what criteria ?
 information gain
 variance, entropy # variance as in, within a feature, how much variance of labels?
 GINI index
 independent(nonredundant) # if a feature is linearly correlated to a combination of other features, then it's redundant to keep that feature.
note: recall we really wanna get rid of features to combat the curse of dimensionality, to help the learner.
##
## (2) wrapping
##
your feature selection is more closely tied to the learner: you pick some features, the learner gives feedback ("this feature is useless, this feature is important", etc),
and you iterate through that feedback cycle.
 pros : takes into account model/learner bias
 cons : learning speed is slow
select features on what criteria?
 hill climbing
 randomized optimization
 forward search : given features {x1,x2,...,xn}, give each one individually to the learner and get a score for each, as in score(xi). pick the best (suppose x3), then pick the next best addition (suppose x5) and evaluate the pair, as in score(x3,x5). if score(x3,x5) >> score(x3), keep going: pick the 3rd best (suppose x2) and compare score(x3,x5) against score(x2,x3,x5). suppose they are not much different (i.e. only a tiny improvement); then declare {x3,x5} the selected subset. basically, when adding the next best feature doesn't improve the overall score (the score of the combined features) much, you stop.
 backward search : the opposite of forward search. suppose you have five features {x1,x2,x3,x4,x5}. let the learner evaluate (i.e. assign a score to) each feature, and eliminate the worst (suppose x4). compare score(x1,x2,x3,x4,x5) and score(x1,x2,x3,x5); suppose they are not much different (i.e. only slight degradation), so you proceed. pick the next worst feature (suppose x1) and compare score(x1,x2,x3,x5) against score(x2,x3,x5). suppose it got much worse, as in score(x1,x2,x3,x5) >> score(x2,x3,x5): then we know removing x1 was a bad idea, and since the other remaining features score even higher individually than x1, we declare {x1,x2,x3,x5} the selected feature subset.
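The forward-search loop above can be sketched as below. `score()` is a stand-in for "train the learner on this feature subset and evaluate it"; here it's a made-up toy that rewards features 0 and 2:

```python
# forward search sketch: greedily add the best feature until improvement stalls
def forward_search(n_features, score, eps=0.01):
    chosen = []
    best = float("-inf")
    while len(chosen) < n_features:
        # try adding each remaining feature and keep the best addition
        cand = max((f for f in range(n_features) if f not in chosen),
                   key=lambda f: score(chosen + [f]))
        new = score(chosen + [cand])
        if new - best <= eps:          # tiny (or no) improvement: stop
            break
        chosen, best = chosen + [cand], new
    return sorted(chosen)

# toy score: features 0 and 2 are the useful ones, everything else adds tiny cost
toy_score = lambda feats: sum(1.0 for f in feats if f in (0, 2)) - 0.001 * len(feats)
print(forward_search(5, toy_score))    # [0, 2]
```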
##
## quiz : smallest set of features to guarantee zero training error
##
(quiz) https://www.youtube.com/watch?v=qJTmtN6o56c
x1 x2 x3 x4 | y     # let's just spoil that it's AND(x1,x2)
------------+---    # x3 looks useless, in terms of information gain
 0  1  1  1 | 0
 0  1  1  0 | 0
 1  0  1  0 | 0
 1  1  1  0 | 1
if you use a DT, then you only need {x1,x2}:
         x1
       0/  \1
       0    x2
          0/  \1
          0    1
if you use a perceptron: y = 1 if (W^T * x) > 0, else 0
then if you only pick {x1,x2} you cannot find a weight vector W that expresses AND(x1,x2) with the above perceptron.
but if you pick {x1,x2,x3} then you can pick, for example, the below weights (see the above video):
W = [1 1 -1]   # x3 is always 1 in this data, so its weight acts as a bias/threshold
so x3 was considered useless but actually useful in this case. we need to define this notion of relevance VS usefulness formally.
###
### relevance
###
(1) xi is strongly relevant if removing xi degrades the Bayes Optimal Classifier (BOC) that is based on ALL features. # i.e. BOC(all features - xi) < BOC(all features)
(2) xi is weakly relevant if: xi is not strongly relevant, AND
                            : ∃ a subset of features S s.t. adding xi to S improves the BOC that is based on S. # i.e. BOC(S + xi) > BOC(S)
(3) xi is irrelevant otherwise
note: BOC is the one that takes the weighted average of all the hypotheses, weighted by each hypothesis's probability of being correct given the data. so essentially it's the best you could do (on average).
x1 x2 x3 x4 | y     # let's just spoil that it's AND(x1,x2)
------------+---    # x1 and x2 are strongly relevant
 0  1  1  1 | 0     # x3 is irrelevant
 0  1  1  0 | 0
 1  0  1  0 | 0
 1  1  1  0 | 1
to illustrate "weakly" relevant, let's add x5 = ¬x1
x1 x2 x3 x4 x5 | y   # let's spoil again that it's AND(x1,x2)
---------------+---  # x2 is strongly relevant, because no other feature, out of ALL features, can represent its info
 0  1  1  1  1 | 0   # x3 is still irrelevant
 0  1  1  0  1 | 0   # x1 and x5 are "weakly" relevant: considering ALL features, removing x1 does not degrade BOC(all features) because x5 covers it,
 1  0  1  0  0 | 0   # but in a subset S that doesn't include x5, x1 is relevant. hence "weakly" relevant.
 1  1  1  0  0 | 1
###
### usefulness
###
 relevance measures effect on BOC # measures informationness/entropy of variable,feature in a data set
 usefulness measures effect on a "particular" predictor # measures error given some particular model(or learner,algo,classifier)
note: BOC is really just the gold/true standard of information, the best you could do if you considered every hypothesis weighted by its probability given the data.
any other classifier has an inductive bias.
so it is fair to say relevance is usefulness w.r.t. BOC
##########################################
#### (2.4) Feature Transformation ####
##########################################
the problem of preprocessing a set of features to create a new (smarter? more compact?) feature set, while retaining as much (relevant? useful?) information as possible:
note: by this definition, feature selection can be considered just a subset of feature transformation.
note: also, we restrict our transformation to be just linear combination of original features. (i.e. feature "linear" transformation)
note: implicit assumption(i.e. domain knowledge) is that we don't need all n features.
we already looked at examples of feature transformation (as part of some specific learning algos)
e.g.
XOR in perceptron # nonlinear
kernel methods in SVM
##
## adhoc information retrieval problem # aka google problem
##
google has lots of documents (i.e. objects) and gets adhoc search words(features) from user, and has to retrieve relevant objects.
adhoc as opposed to knowing what search words will come beforehand.
what are our features?
 words # full, partial (like plural to singular, removing words like "the" "of")
 number of words
==> there are too many words
 one word may mean multiple things: polysemy # e.g. patient. "false positive": when you type for apple, as food, but you get iphone info.
 many words may mean the same thing: synonymy # e.g. car and automobile. "false negative": when you google "machine learning" you may not get contents on "data mining" which you'd like.
false positive: flagging something positive when it's not. e.g. giving you something as relevant search result(apple iphone) when you searched for apple as fruit.
false negative: flagging something negative when it's positive. e.g. not giving you "data mining" results when you searched for "machine learning"
==> so feature transformation should eliminate these false pos/neg. like combining words "apple fruit", or adding "machine learning" + "data mining"
##
## variance
##
expected value: E[X] = Σ_{x∈Ω_X} x * P(X=x)   # = (1/n) Σ x if all n outcomes are equally likely

note: variance σ(X)^2 = (1/n) Σ_{i=1..n} (xi - x̄)^2 = E[(X-E[X])^2] = E[X^2] - E[X]^2   where the mean x̄ = (1/n) Σ_{i=1..n} xi = E[X]
σ(X) = stddev(X)
the k'th (standardized) moment is defined as:
   E[ (X-E[X])^k ]
   ---------------   # different k is used to study the behavior of a random variable.
       σ(X)^k
e.g.
k=1 : mean
k=2 : variance
k=4 : kurtosis # a convenient measure of the peakedness of a distrib
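The moment formulas above, computed directly (a toy sketch with made-up data):

```python
# k'th standardized moment of a sample, straight from the definition
import math

def moment(xs, k):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return sum((x - mean) ** k for x in xs) / n / math.sqrt(var) ** k

xs = [1.0, 2.0, 2.0, 3.0]
print(round(moment(xs, 2), 3))   # 1.0  (the 2nd standardized moment is always 1)
print(round(moment(xs, 3), 3))   # 0.0  (symmetric data: zero skewness)
```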
###
### Principal Components Analysis (PCA)  a linear transformation algo
###
PCA transforms variables into a new set of variables (called principal components) which are linear combination of original variables.
first component is constructed so that it accounts for the most variance(i.e. info) of the data.
the 2nd component is chosen so that it captures the most (remaining) variance. mathematically each component is just an eigenvector, and the Nth component is simply orthogonal to the (N-1)th component.
it's nothing but linear algebra.
(ref) https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/
(ref) https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
(ref) https://onlinecourses.science.psu.edu/stat505/node/51
suppose 2-dimensional data: x,y
there are 2 eigenvectors, and their eigenvalues are variances. # n-dimensional data has n eigenvectors
 PCA algo treats data as eigen problem
(1) finds direction (an eigenvector) that maximizes variance (eigenvalue) # 1st principal component
(2) finds direction (the other eigenvector) that is orthogonal # 2nd principal component
# if it's ndimensional data, it keeps going to n'th principal component
  y ^
    |            . .
    |          . .  .      # data scattered roughly along a diagonal
    |       .  . . .
    |      .  .
    +---------------------> x
    (the data spans a y variance range on the vertical axis and an x variance range on the horizontal)
so combining the y variance range and the x variance range, you can get a vector for the 1st principal component, then pick the orthogonal vector to represent the 2nd principal component.
(read literature for mathematical proof)
conceptually, bigger variance means more information. so you wanna capture as much info as possible about the data.
[PCA for dimensionality reduction]
suppose you have 3-dimensional data (x,y,z), e.g. weight, height, age. you get 3 eigenvectors, and if their eigenvalues are 132, 104 and 0.07, then the third vector probably carries not much information, so you can represent the data with just the first 2 eigenvectors, without losing much info from the original data. and you successfully reduced the dimension!
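The dimensionality-reduction idea above can be sketched with numpy's eigensolver (assuming numpy is available; the synthetic data, where one column is roughly twice another, is made up for illustration):

```python
# PCA sketch: eigendecomposition of the covariance matrix, project onto top-m components
import numpy as np

def pca(X, m):
    Xc = X - X.mean(axis=0)                      # PCA is sensitive to centering/scale
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)             # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(vals)[::-1]               # sort descending by variance
    return Xc @ vecs[:, order[:m]], vals[order]  # projection + eigenvalues, largest first

rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.c_[t, 2 * t + 0.01 * rng.normal(size=100), rng.normal(size=100)]  # col1 ≈ 2*col0
proj, eigvals = pca(X, 2)
print(eigvals.round(2))   # eigenvalues sorted descending; the 3rd is near zero (redundant direction)
```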
##
## PCA properties
##
 global algorithm # components are mutually orthogonal
 best reconstruction error # minimal information loss among linear projections
 minimizes L2 error moving from n -> m dimensions
 eigenvalues (non-negative, monotonically non-increasing) # if the eigenvalue for a component is zero, it has no variance, no entropy, so it's irrelevant. (could still be useful)
note: suppose you had 2-D data, and reduced it to 1-D using PCA. you represented the original data with fewer dimensions (less complexity, more simplicity, which may help computation/classification speed, memory usage, etc). if you reconstruct the 2-D data from the 1-D representation, there will be some error, as in distance(orig_data, reconstructed_data): we call it the "reconstruction error". the goal is to reduce dimension while minimizing reconstruction error. equivalently, we want to reduce dimension while maximizing correlation (mutual information) btwn the original features and the transformed features. if the original data has many redundant features, it's possible to reduce dimension with zero reconstruction error.
usually, (as described elsewhere), given N-dim original data, it'll be a tradeoff btwn dimension reduction and reconstruction error growth. but hopefully the first few components capture most of the eigenvalue mass (i.e. variance, i.e. info) like below.
 recon |.
 error | .
       |  .            # the idea is hopefully k < d
       |   .           # k = number of principal components that capture most of the variance of the original data
       |    ......     # d = original feature dimension
       +------------>
        # of components
(visual) https://www.youtube.com/watch?v=_nZUhVqhZA
(good ref) http://www.cs.princeton.edu/courses/archive/spr08/cos424/scribe_notes/0424.pdf
note: PCA is sensitive to the scale of measurement, so make sure you scale each variable value range.
also, applying PCA to the original data, the new variables (aka principal components) you get are purely the result of a variance-maximizing linear algebraic feature transformation.
so while it may help you achieve a computational benefit (dimensionality reduction, or sample complexity is smaller, etc), the new variables no longer have any interpretable meaning.
so if you still need to interpret the result (like data content, which feature meant whatnot), then PCA is not the best option.
###
### Independent Components Analysis (ICA)
###
 PCA is about finding correlation, by maximizing variance, which gives you best reconstruction
 ICA is about maximizing statistical independence.
(1) ICA finds linear transformation of features x, s.t. their transformed version features (let's call them y) are independent from each other.
i.e.
x1 > y1
x2 > y2 and yi ⊥ yj i.e. I(yi,yj)=0 # their mutual info is zero
x3 > y3
.. so on
(2) ICA maximizes mutual info between original features and transformed features
i.e.
maximize I(x,y)
==> suppose
hidden variables :   y1   y2   y3   y4     # they are all independent from each other
                      \   |  X |   /
observables      : x1  x2  x3  x4  x5      # each xi is a linear combination of y1,y2,y3,y4
==> ICA attempts to transform x to y # this problem is often called "cocktail party problem"
note: the word "latent" is often used in literature to mean "hidden"  they are the same thing.
##
## Cocktail Party problem # aka "blind source separation"
##
person_A person_B person_C # suppose three people are talking in a cocktail party, independent of one another
mike_A mike_B mike_C # suppose three microphones are there.
# mike_A, being closest to person_A, probably captures a lot of person_A's voice, then some of person_B's voice, then a little bit of person_C's voice.
# mike_B, mike_C will also have some linear combination of the voices.
# ICA separates them back to person_A to C, independently.
(ICA fascinating video demo) https://www.youtube.com/watch?v=T0HP9cxri0A
(ICA indepth paper) "Independent Component Analysis:Algorithms and Applications" https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf
##
## ICA intuition
##
ICA has a fundamental assumption that the given data is a linear mixture/transform of latent (i.e. hidden, original) features which are NOT normally distributed. thanks to the CLT, the linear mixture itself is close to normally distributed. so ICA attempts to undo the linear mixture by selecting the linear transform (hopefully the inverse of the mixture) whose outputs (hopefully the exact latent features) are maximally non-gaussian.
kurtosis is simply the 4th moment of the data, and we know kurtosis = 3 for a Gaussian distrib, so ICA basically wants to recover the original features by maximizing |kurtosis - 3| (the "excess kurtosis").
note: 1st moment = mean
2nd moment = variance
3rd moment = skewness
4th moment = kurtosis
(ref) https://en.wikipedia.org/wiki/Central_limit_theorem
(ref) https://en.wikipedia.org/wiki/Kurtosis
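a tiny sketch to make the kurtosis intuition concrete (my example; note scipy's kurtosis() returns EXCESS kurtosis by default, i.e. it already subtracts the Gaussian's 3):

```python
# Excess kurtosis as a non-Gaussianity measure, and the CLT effect of mixing.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.RandomState(0)
gaussian = rng.normal(size=100_000)
laplace = rng.laplace(size=100_000)     # heavy-tailed, clearly non-Gaussian
mixture = gaussian + laplace            # sums drift back toward Gaussian (CLT)

print(round(float(kurtosis(gaussian)), 2))  # ~0 (Gaussian: excess kurtosis 0)
print(round(float(kurtosis(laplace)), 2))   # ~3 (Laplace: excess kurtosis is 3 in theory)
print(round(float(kurtosis(mixture)), 2))   # in between: mixing reduces non-Gaussianity
```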
##
## ICA: number of components to reduce to
##
if we use ICA for dim reduction (let's say you have 5 features which can be 5 mikes that recorded 3 people, then you might wanna reduce to 3 components), how do we decide the number of components to reduce to?
you can look at
(1) reconstruction error
(2) excess kurtosis of each component
(3) mutual information
##
## PCA vs ICA
##
- they are trying to do the same thing (i.e. capture the original data in a transformed state) but have different fundamental assumptions.
- PCA is linear algebraic
- ICA is probabilistic, information theory based
(important quiz) https://www.youtube.com/watch?v=SjM2Qm7N9CU
                        | ICA                                      | PCA
------------------------+------------------------------------------+---------------------------------
blind source separation | good                                     | bad
directional             | yes (direction of input data must be consistent) | no
faces                   | parts (noses, eyes, hair)                | eigenfaces (focuses on brightness/luminance, ends up finding the average face, useful for reconstruction)
natural scenes          | edges (independent components are edges) |
documents               | topics                                   |
speed                   | slower                                   | faster
note: it ultimately depends on the data (if the data has independent latent variables/features then ICA can shine).
also it depends on the use case. (in a real time system, ICA may be too slow, but in a more static system, ICA may shine)
###
### alternatives to PCA/ICA
###
RCA: random components analysis # aka random projection
- generates random directions! # finds some random matrix to project the original data onto.
- fast (projection & reconstruction) # because we pick components randomly
- surprisingly works well (i.e. accurate) # even random vectors still retain info (and like other components analysis, we get to reduce dimensionality)
LDA: linear discriminant analysis
 finds a projection that discriminates based on the label
(paper on feature transformation) "A Survey of Dimension Reduction Techniques" by Fodor https://e-reports-ext.llnl.gov/pdf/240921.pdf
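a minimal sketch of RCA / random projection with scikit-learn (my illustration; the data and sizes are arbitrary):

```python
# Project data onto random directions; fast because there is no eigendecomposition.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 100))        # 500 samples, 100 features

rp = GaussianRandomProjection(n_components=20, random_state=0)
X_new = rp.fit_transform(X)            # 500 x 20 projected data
print(X_new.shape)                     # (500, 20)

# random projections approximately preserve pairwise distances
# (Johnson-Lindenstrauss lemma), which is why this "surprisingly works well"
```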
######################################
#### (2.5) Information Theory ####
######################################
it is a huge topic of its own, originating in physics (thermodynamics), then pioneered by Claude Shannon. here we only cover the basics relevant to ML.
recall in decision tree classification, we talked about deciding which feature to split a node on based on "information" gain, but what is information?
also, how can we rigorously compare input (or output) vectors/features to measure "similarity" ?
in ML context, every input vector and output vector can be considered as a probability density function.
information theory gives us a mathematical framework to compare/analyze these probability density functions.
###
### entropy as a measure of information
###
suppose you want to send the results of 10 coin flips.
case (1) with a normal coin: 50% head, 50% tail. so the result may look like this: HTHHTHTTHT
case (2) with a head-only coin, then the result is always HHHHHHHHHH
how many bits do you need? i.e. how much information?
case (1): 10 bits
case (2): 0 bit # because you know it's always head
so "information" is really about the degree of randomness, uncertainty: "entropy"
suppose you have an alphabet universe of only 4 letters: ABCD, with the below frequency of occurrence in communication
and think about how to encode each letter's bit pattern efficiently
letter  frequency of occurrence  encode_1  encode_2

A  0.5  00  0
B  0.125  01  110
C  0.125  10  111
D  0.25  11  10
what's the expected size of the message per symbol ? (aka entropy)
  Σ P(s) * size(s)                               # the sum of probability of s occurring times the number of bits per s
  s                                              # size(s) == log(1/P(s)) which is the formal definition. this is base-2 log()
= Σ P(s) * log(1/P(s)) = -Σ P(s) * log(P(s))     # aka "entropy formula"
  s                        s
in encode_1: 0.5*2 + 0.125*2 + 0.125*2 + 0.25*2 = 2 # well, you can tell because every symbol is encoded with two bits
in encode_2: 0.5*1 + 0.125*3 + 0.125*3 + 0.25*2 = 1.75 # improved
note: it's more intuitive to represent in binary trees
encode_1:
        .
     0/   \1
    0/\1  0/\1
    A  B  C  D
encode_2: aka "variable length encoding"
        .
     0/   \1
     A    0/ \1
          D  0/\1
             B  C
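the entropy arithmetic above, checked in code (my sketch):

```python
# Entropy of the A/B/C/D alphabet, and the expected message size per symbol
# under the fixed-length code (encode_1) and the variable-length code (encode_2).
import math

p = {"A": 0.5, "B": 0.125, "C": 0.125, "D": 0.25}
code_len_1 = {"A": 2, "B": 2, "C": 2, "D": 2}    # encode_1: fixed 2 bits each
code_len_2 = {"A": 1, "B": 3, "C": 3, "D": 2}    # encode_2: variable length

entropy = -sum(pr * math.log2(pr) for pr in p.values())
avg_1 = sum(p[s] * code_len_1[s] for s in p)
avg_2 = sum(p[s] * code_len_2[s] for s in p)

print(entropy, avg_1, avg_2)   # 1.75 2.0 1.75
```

encode_2 meets the entropy exactly here because every probability is a power of 1/2.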
###
### information between two variables
###
[joint entropy]: H(x,y) = -Σ P(x,y) * log(P(x,y))
                           x∈Ω_X,y∈Ω_Y
[conditional entropy]: H(y|x) = -Σ P(x,y) * log(P(y|x))
                                 x∈Ω_X,y∈Ω_Y
e.g.
if x & y are independent from each other,
then
H(y|x) = H(y)
H(x,y) = H(x) + H(y)
###
### Mutual Information
###
H(y|x) # entropy of y given x
       # but how much does y depend on x ?
I(x,y) = H(y) - H(y|x) # I(x,y) is mutual information
       = H(x) - H(x|y) # smaller H(y|x) indicates x tells a great deal about y (or H(y) is super small to begin with)
       = H(x) + H(y) - H(x,y)
       = I(y,x)
(paper) http://www.robotvisions.org/4641/downloads/InfoTheory.fm.pdf
https://github.com/pushkar/robotvisions/blob/master/4641/downloads/InfoTheory.fm.pdf (backup URL)
###
### quiz exercise
###
(two independent coins) https://www.youtube.com/watch?v=w0RZOrSvF7U
P(A) = P(B) = 0.5
P(A,B) = P(A)*P(B) = 0.25
P(A|B) = P(A) = 0.5
H(A) = -Σ P(A)*log(P(A)) = 1
H(B) = H(A) = 1
H(A,B) = -Σ P(A,B)*log(P(A,B)) = 2
H(A|B) = -Σ P(A,B)*log(P(A|B)) = 1
I(A,B) = H(A) - H(A|B) = 1-1 = 0 # since A & B are independent, no mutual info between them
(two "dependent" coins) https://www.youtube.com/watch?v=P5GpR9XVWQ
P(A) = P(B) = 0.5
P(A,B) = 0.5 # the coins are fully dependent, so this is NOT P(A)*P(B)
P(A|B) = P(A,B)/P(B) = 1
H(A) = -Σ P(A)*log(P(A)) = 1
H(B) = H(A) = 1
H(A,B) = -Σ P(A,B)*log(P(A,B)) = 1
H(A|B) = -Σ P(A,B)*log(P(A|B)) = 0
I(A,B) = H(A) - H(A|B) = 1-0 = 1
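the two coin quizzes above can be verified numerically (my sketch, computing everything from the joint distribution):

```python
# Mutual information of two coins: independent vs perfectly dependent.
import math

def H(probs):                           # entropy in bits of a distribution
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_A = H([0.5, 0.5])                     # each coin alone: 1 bit
joint_indep = [0.25, 0.25, 0.25, 0.25]  # independent fair coins: 4 equally likely outcomes
joint_dep = [0.5, 0.0, 0.0, 0.5]        # dependent coins: B always equals A

for joint in (joint_indep, joint_dep):
    H_AB = H(joint)
    I = H_A + H_A - H_AB                # I(A,B) = H(A) + H(B) - H(A,B)
    print(H_AB, I)                      # prints 2.0 0.0 then 1.0 1.0
```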
###
### KullbackLeibler Divergence (aka KL divergence)
###
KL divergence is a measure of distance between two distributions. (distance not in the geometric sense, because it has no unit and doesn't satisfy the triangle inequality, so it's really just a "divergence", not a distance per se)
formula:
D(p||q) = ∫ p(x) * log( p(x) / q(x) ) dx
        = 0 when p(x)==q(x) i.e. the two distributions are identical, then their distance is 0
        > 0 otherwise
why do we care? because in supervised learning, we are trying to fit our model to some observed distribution.
let's say p(x) is some well known distribution, and q(x) is our sampled/fitted one. if we want p(x) ≈ q(x) then we try to minimize D(p||q)
mutual information is a particular case of KL divergence.
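a small sketch of KL divergence for discrete distributions (my example values), which also shows it is not symmetric:

```python
# D(p||q) for discrete distributions, in bits.
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]

print(kl(p, p))                 # 0.0 : identical distributions have zero divergence
print(kl(p, q) > 0)             # True : KL is non-negative
print(kl(p, q) == kl(q, p))     # False: not symmetric, hence not a true distance
```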
#######################################
#### (3) Reinforcement Learning ####
#######################################
recall
- supervised learning: y = f(x) # given a lot of <x,y> examples, learn f() i.e. function approximation
- unsupervised learning: f(x) # given a lot of x examples, learn f() that clusters/describes x
- reinforcement learning: y = f(x), given x's and z's # z = rewards/reinforcement; the right y is only revealed indirectly through delayed rewards, and we learn f()
###
### RL world
###
a typical example world: grid
3-by-4 grid
+ + + G # available actions: UP, DOWN, RIGHT, LEFT
+ W + T # S = start, W = wall(cannot go into), T = trap(sends you back to S), G = goal(wanna reach)
S + + + # you stay in the current spot if you try to break out of the grid
==> what's the sequence of actions that leads you to G ?
well, it's easy. we have two routes: UURRR or RRUUR
suppose we have this stochasticity/uncertainty/probability: when executing an action, there is 80% chance it executes as intended. and 20% chance going side ways (10% to left, 10% to right).
what is the (probabilistic) reliability of UURRR action sequence ?
0.8 ^ 5 = 0.32768 # this alone is not the answer, because it's possible you intended UURRR, and you took RRUUR and still reached the goal
(0.1 ^ 4) * 0.8 = 0.00008 # here is the probability of taking RRUUR when your intended action was UURRR
----------------
+) 0.32776 # correct answer (the sum of the two)
===> we use a framework called MDP (Markov Decision Process) to deal with this.
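the arithmetic above in code (note the lecture counts only the two action realizations that trace a path to G exactly; other lucky recoveries, e.g. bumping into a wall and still arriving, are ignored at this point):

```python
# Reliability of the intended action sequence UURRR under 80/10/10 noise.
p_intended = 0.8 ** 5             # UURRR executed exactly as intended
p_sideways = (0.1 ** 4) * 0.8     # intended UURRR but physically moved RRUUR
print(round(p_intended, 5), round(p_sideways, 5), round(p_intended + p_sideways, 5))
```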
#############################################
#### (3.1) Markov Decision Processes #### aka MDP
#############################################
##
## background
##
Markov: a Russian mathematician known for his work on stochastic processes.
often we want to model a real world sequential process/workflow where we have
 finite number of discrete states.
 probabilistic transition between states.
 (each state having some sort of reward/cost)
Markov Chain (aka Markov model/system): models exactly the above.
 but one fundamental assumption: next state is determined only by the "current" state.
(it is fair to argue, sometimes, the next state is determined by a few prev states, but Markov model doesn't let you define a sequential process that way.)
==> we say the current state determines the probability distribution for the next state.
e.g.
suppose we have S = {s1,s2,s3}
current_state = s1
p(next_state=s1 | current_state=s1) = 0.1
p(next_state=s2 | current_state=s1) = 0.3
p(next_state=s3 | current_state=s1) = 0.6
Hidden Markov Model (aka HMM): it's Markov Chain, except we don't always get to know the current state, only partially observable. then we probabilistically try to guess the current state (or past transition path) based on past observation/action.
(see page 21) http://www.cs.cmu.edu/~./awm/tutorials/hmm.html
(ref) https://www.cs.cmu.edu/~ggordon/780fall07/lectures/POMDP_lecture.pdf
MDP: we add "action" to Markov Chain.
 finite number of discrete states
 probabilistic transitions between states based on controllable actions in each state.
 (again, the same fundamental assumption, aka Markovian property) next state is determined by only current state and current action.
==> it models sequential decision making process well, like a single player game, etc. also computationally tractable to solve.
(but assumes perfect knowledge of current state. thus MDP is aka COMDP: completely observable MDP)
==> a solution to an MDP problem is a policy π(s) = a that maximizes the expected sum of reward
Partially Observable MDP (POMDP): we add "action" to HMM
 it's MDP with current state only partially observable.
 it generalizes MDP further (to model many real world engineering process/system), by allowing uncertainty of knowledge of current state.
- so POMDP builds an observation model p(o|s)
 computationally intractable to solve optimally.
(we study POMDP formally in RL class)
##
## MDP overview
##
in MDP, we define a problem with "states", "model", "actions" and "reward", then a solution is "policy"
States: S # a set of possible states a player can encounter. e.g. we may use x,y coordinate to represent a state.
Model: T(s,a,s') = Pr(s'|s,a) # aka (probabilistic) transition function that describes the rules/physics of the world. probability of transitioning to state s' if you take action a in state s
Actions: A(s), A # A = a set of actions, or A(s) = a set of actions you can take in state s
Reward: R(s), R(s,a), R(s,a,s') # encompasses the domain knowledge. normally you only need to think R(s)

Policy: π(s) = a # tells you which action to take in state s.
# π*() is called "optimal policy" that maximizes the long term reward
# ideally we want training examples <s,a> that tells correct action a in state s, but we don't get them in RL problems. (otherwise it'd be a classic supervised learning)
# instead we get <s,a,r> that tells taking action a in state s gives reward r
##
## Markovian properties (assumption)
##
- "only the present matters" : so the above model/transition function T(s,a,s') only needs to know the current state s to discuss the probability of moving to s' given a.
 rules/model/transition functions are "stationary" # i.e. they don't change over time
##
## MDP: notion of "reward"
##
 delayed reward
 minor changes (of reward function, or reward assignment) matter
e.g.
you play a game of chess, and let's say you win(or lose) after 87 moves. the final outcome is your reward, but how do you assign reward for each move? it's possible i'th move was bad, or j'th move was good, etc.
==> known as "temporal credit assignment" problem
e.g.
recall this grid path finding game.
+ + + G # G or T are termination/absorbing spots
+ W + T # S is the start spot
S + + + # you must take an action of either UP,LEFT,RIGHT,DOWN at each step
let's assign reward to each state as below
-0.04 -0.04 -0.04   +1
-0.04   NA  -0.04   -1
-0.04 -0.04 -0.04 -0.04
global optimal action for each state is below:
UP RIGHT RIGHT na
UP na UP na
UP LEFT LEFT LEFT # why does the bottom-right region go LEFT ? UP would lead to the goal quicker.
# this is because the cost of possibly slipping into the trap state is too big (-1), so it makes sense to take the longer route, collecting several -0.04s.
what if we change -0.04 to +2 ? what about -2 ?
+2 +2 +2 +1
+2 na +2 -1 # this means it's in your interest to loop forever, collecting +2 [unless we impose some constraint like: you die if you don't reach G/T within 10 actions]
+2 +2 +2 +2 # instead of reaching either of the termination states of -1 or +1
(more example) https://www.youtube.com/watch?v=yLPJxEZS88A
note: again, these minor changes in reward function matter a lot.
i.e. reward == domain knowledge
###
### assumptions on MDP rewards
###
(1) infinite horizons # how much time, how many turns do you have ? finite? infinite?
policy function π decides action a given state s
π(s) = a # but in a finite time horizon, this is not stationary
# so it makes sense to write it as π(s,t)=a: for different values of t, you may get a different action even if the state has not changed!
# but we don't worry about finite time horizons in this course.
(2) utility of sequences
"stationarity of preferences"
if U(s0,s1,s2,...,sn) > U(s0,s1',s2',...,sn') # U(sequence_of_states) decides its preference level, i.e. the sum of all immediate and subsequent rewards in this case.
then U(s1,s2,...,sn) > U(s1',s2',...,sn') # we say this holds
                  ∞
U(s0,s1,s2,...) = Σ R(s_i) # naive version
                 i=0
                  ∞
                = Σ [γ^i * R(s_i)] where 0 <= γ < 1 # more sophisticated version
                 i=0                                # gamma γ is the discounted reward rate
                  ∞                  Rmax
                ≤ Σ [γ^i * Rmax] = -------- # upper bound (just a plain geometric series)
                 i=0                 1-γ
S1 = [1 1 2 1 1 1 2 .....] # each sequence has an infinite length
S2 = [1 1 1 1 1 1 1 .....]
==> compare sequences S1 and S2: in the naive version, U(S1) = U(S2) = ∞, so we cannot prefer one over the other. discounting fixes this.
##
## geometric series (recap)
##
(ref) https://en.wikipedia.org/wiki/Geometric_series
 ∞                        ∞
 Σ [γ^i * Rmax] = Rmax *  Σ [γ^i]
i=0                      i=0
assume 0 <= γ < 1
     ∞
x =  Σ [γ^i] = (γ^0 + γ^1 + γ^2 + ....) # till infinity
    i=0
x = γ^0 + γx
  = 1 + γx      # x - γx = 1
  = 1/(1-γ)     # x(1 - γ) = 1
         ∞           Rmax
Rmax *   Σ [γ^i] = --------
        i=0          1-γ
###
### policy : definition
###
E[a|b] denotes expectation of a given b.
                           ∞
optimal policy π* = argmax E[ Σ[γ^i * R(s_i)] | π] # expectation of total reward given π
                      π      i=0                   # we are using E[] because it's not always deterministic
          ∞
U_π(s) = E[ Σ[γ^i * R(s_i)] | π, s0=s] != R(s) # utility of being in a particular state s is the total reward from that state onward.
           i=0                                 # i.e. U_π(s) = long term (or delayed) reward, R(s) = short term (or immediate) reward
                                               # so R(s) can be negative as long as U_π(s) is big positive, just like going to college.
NOTE: assume U(s) == U_π*(s) below # U_π*(s) = true value of being in that state
π*(s) = argmax Σ T(s,a,s') * U(s')
          a    s'
U(s) = R(s) + γ max Σ T(s,a,s') * U(s') # aka Bellman Equation (notice its recursive nature)
                 a  s'
NOTE: utility function U(s) is also known as value function V(s)
###
### policy : finding policy by "Value Iteration" (aka VI)
###
U(s) = R(s) + γ max Σ T(s,a,s') * U(s') # Bellman Equation : notice it's nonlinear because of max()
a s' # there are n unknown states, so there are n unknown U(s)
# this nonlinearity is not good
s : there are n states
a : there is a finite set of actions
γ : discount factor - we assume it's known
"Value Iteration" [algo] _
1. start with arbitrary (estimate or just purely arbitrarily initialized) utilities. # e.g. U_π(s) where hat means estimate
2. update utilities based on neighbors
3. repeat 2. until converge # guaranteed to converge
_ _
U_t+1(s) = R(s) + γ max Σ T(s,a,s') * U_t(s') # why does this work ? _
a s' # because you are iteratively updating your U_π(s) with true immediate reward R(s)
# for more rigorous proof, see below PDF, page 25
(ref) https://s3.amazonaws.com/mlclass/notes/MDPIntro.pdf # "Markov Decision Processes and Reinforcement Learning  an introduction to stochastic planning"
note: you don't need to estimate precise values of the utility function U(s); we only need U(s) to be accurate enough to yield the optimal policy π(s), i.e. the correct action a in every state s
note: "convergence" is commonly determined by how much U(s) changes between iterations. if the change is small (or zero) then we say it converged, or we may simply set a max iteration count.
note: once we converge to a particular U(), then the policy can be obtained as follows:
π*(s) = argmax Σ T(s,a,s') * U(s')
a s'
note: normally each iteration takes an equal amount of time.
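a hypothetical minimal value iteration implementation for this grid world (layout, rewards and the 80/10/10 slip model follow the notes; gamma and the iteration count are my choices):

```python
# Value iteration on the 3x4 grid: -0.04 per step, +1 goal, -1 trap.
ROWS, COLS = 3, 4
WALL, GOAL, TRAP = (1, 1), (0, 3), (1, 3)      # (row, col); row 0 is the top
GAMMA, STEP_REWARD = 0.99, -0.04
MOVE = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
SLIP = {"U": "LR", "D": "LR", "L": "UD", "R": "UD"}  # 10%/10% perpendicular slips

states = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) != WALL]

def step(s, a):
    r, c = s[0] + MOVE[a][0], s[1] + MOVE[a][1]
    ok = 0 <= r < ROWS and 0 <= c < COLS and (r, c) != WALL
    return (r, c) if ok else s                 # bump: stay where you are

def transitions(s, a):                         # [(probability, next_state), ...]
    return [(0.8, step(s, a))] + [(0.1, step(s, x)) for x in SLIP[a]]

U = {s: 0.0 for s in states}
for _ in range(1000):                          # Bellman update until (near) convergence
    U = {s: 1.0 if s == GOAL else -1.0 if s == TRAP else
            STEP_REWARD + GAMMA * max(sum(p * U[s2] for p, s2 in transitions(s, a))
                                      for a in MOVE)
         for s in states}

policy = {s: max(MOVE, key=lambda a: sum(p * U[s2] for p, s2 in transitions(s, a)))
          for s in states if s not in (GOAL, TRAP)}
print(policy[(2, 0)])                          # action at the start state S
```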
###
### policy : finding policy by "policy iteration" (aka PI)
###
"policy iteration" [algo]
1. start with π_0(s) # just take a guess
2. evaluate: given π_t(s), calculate U_t+1() = U_π_t() # i.e. the utility of following the current policy
U_t+1(s) = R(s) + γ Σ T(s,π(s),s') * U_t(s') # another Bellman Equation: notice max is gone, now this is linear
s' # this linearity is good. but still O(n^3) but there are more tricks to improve it.
# repeat this U_t+1() update until converge
3. improve: ∀s, π(s) = argmax Σ T(s,a,s') * U_t(s') # find the best action, with onestep lookahead, using the existing utility.
a s'
==> repeat 2. & 3. until π(s) converges # guaranteed to converge. look up proof
note: notice how step 2 itself is a loop inside a bigger loop (of repeating steps 2,3)
note: notice "convergence" can be tricky in a stochastic environment. so a common approach is to play the game with the latest π() for multiple trials (like 50 times), and take the average total reward. if the average total reward doesn't change much between π_t+1(s) and π_t(s), then we say it converged. of course, in a deterministic environment, we need to play only once.
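a hypothetical policy iteration sketch on the same grid world (same layout/noise assumptions as the value iteration sketch; gamma and loop counts are my choices):

```python
# Policy iteration: alternate linear evaluation (no max) with greedy improvement.
ROWS, COLS = 3, 4
WALL, GOAL, TRAP = (1, 1), (0, 3), (1, 3)
GAMMA, STEP_REWARD = 0.99, -0.04
MOVE = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
SLIP = {"U": "LR", "D": "LR", "L": "UD", "R": "UD"}
states = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) != WALL]

def step(s, a):
    r, c = s[0] + MOVE[a][0], s[1] + MOVE[a][1]
    ok = 0 <= r < ROWS and 0 <= c < COLS and (r, c) != WALL
    return (r, c) if ok else s

def q(s, a, U):                         # expected utility of taking a in s
    return sum(p * U[s2] for p, s2 in
               [(0.8, step(s, a))] + [(0.1, step(s, x)) for x in SLIP[a]])

pi = {s: "U" for s in states}           # 1. start with an arbitrary policy
for _ in range(50):
    U = {s: 0.0 for s in states}
    for _ in range(500):                # 2. evaluate: linear Bellman update (no max)
        U = {s: 1.0 if s == GOAL else -1.0 if s == TRAP else
                STEP_REWARD + GAMMA * q(s, pi[s], U) for s in states}
    new_pi = {s: max(MOVE, key=lambda a: q(s, a, U)) for s in states}  # 3. improve
    if new_pi == pi:                    # policy stopped changing: converged
        break
    pi = new_pi
print(pi[(2, 0)])
```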
##
## VI versus PI (versus QLearning)
##
notice the overlapping and differing aspects of the two methods.
- when the underlying MDP model is known, i.e. R() & T() are known, both VI & PI will converge to optimality.
- because PI updates both the policy and the utility function in each iteration, each of its iterations takes longer than VI's (which only updates the utility over iterations, and produces the policy in one shot at the end), but for precisely the same reason, PI usually requires fewer iterations in total. we will see below that each Q-learning iteration is even quicker, but the number of iterations it takes to converge is significantly larger.
(ref) https://people.eecs.berkeley.edu/~pabbeel/cs287-fa12/slides/mdps-exact-methods.pdf
- unlike VI, each PI iteration (specifically step 2) may take less time as it approaches convergence.
- is either of VI or PI definitely faster ? no. but one way to think about it: if updating the policy at each iteration improves it dramatically (closer to optimal) and it converges before the value function converges, then PI could be quicker overall. of course, for some problems, we may see the opposite effect, i.e. updating the policy at each iteration converges no quicker than updating only the utility function and doing one-shot policy extraction at the end.
- as we will see next, Q-Learning is fundamentally a different approach. it assumes the agent doesn't get to know R(),T(), and takes a randomized-optimization-like approach, so it has no guarantee of convergence to the optimal policy. but it usually produces a comparable policy, often in less total time (far more iterations, but at this point a single iteration means a different thing between VI, PI and QL, so per-iteration time is not the critical comparison).
##########################################
#### (3.2) Reinforcement Learning ####
##########################################
history:
in 1930's, psychologists conducted experiments where a rat is given a signal before cheese was fed.
and they empirically observed:
 animal sees a stimulus(i.e. state): s
 takes an action: a
 gets a reward: r
 strengthens(i.e. reinforces) its response to this stimulus.
==> computer scientists interpreted this as "reward maximization" problem, where you choose an action that maximizes reward, as a function of state.
##
## RL API components
##
planning: model(T,R) > [planner] > policy π(s)
RL: transitions <s,a,r,s'>* > [learner] > policy π(s) # suppose we get lots of <s,a,r,s'> examples
modeler: transitions > [modeler] > model
model: model > [simulator] > transitions
note: model is like the rules/physics of the world. so for certain games (like chess, backgammon), the model is well defined and well known, but for real world (like financial market), model is not well defined/known.
(hence the need for modelfree approach)
note: action (the effects of action on the environment, to be precise) can be deterministic or stochastic. e.g. when you say go left, agent may go left 80%, and go up 10%, stay 10%.
some literature calls these "deterministic environment" vs "stochastic environment"
note: agent may not know T(), R() and only get the experience <s,a,r,s'>
note: important to consider COMDP vs POMDP
##
## RL types
##
(1) RL-based planner # this is not really an established term. some call this a model-free planner
model > [simulator] > transitions > [learner] > policy # takes a model, and produces a policy
# this works great for problems whose model is well known, like backgammon.
# based on the model, simulate transitions, and the learner can learn a policy
(2) model-based RL # this is an established term
transitions > [modeler] > model > [planner] > policy # takes transitions, and produces a policy
# based on transitions, build a model, then use a planner (like the value/policy iteration algos) to produce a policy
##
## Three approaches to RL
##
(1) policy search algorithm: s > π() > a
(2) value-function based algo: s > U() > v # U() = utility function, aka V() = value function, v = value
(3) model-based RL algo: s,a > T(),R() > s',r # T() = transition function, R() = reward function
generally,
(1) policy search : directly usable (the policy gives the action, which is what we want)
                    but learning is indirect (we don't get the right <s,a> examples)
(3) model-based   : learning is direct (you can directly get s',r from s,a)
                    but indirectly usable (the model doesn't give you the policy that picks the best action)
(2) value-function based sits in between.
therefore, we can follow the below steps:
(3) --solve Bellman equations--> (2) --argmax--> (1)
note: we focus on (2) where lots of celebrated research has been done.
##
## A new kind of value function
##
recall:
U(s) = R(s) + γ max Σ [T(s,a,s') * U(s')] # U(s) = utility of s, i.e. long term value of state s
a s' # γ = discount rate, T(s,a,s') = Pr(s's,a), s' = all possible next state
# recursive, nonlinear, but you can use value iteration to solve.
π(s) = argmax Σ [T(s,a,s') * U(s')] # we can define the best policy like so
a s'
now we wanna define a new value function called Q()
Q(s,a) = R(s) + γ Σ [T(s,a,s') * max Q(s',a')] # Q(s,a) = maximum discounted cumulative reward that can be achieved starting from taking action a in state s
s' a' # aka Q value
note: we can trivially redefine U() and π() below
U(s) = max Q(s,a)
a
π(s) = argmax Q(s,a)
a
note: the beauty of Q Learning algo is, if we can learn Q(s,a) well, then we don't ever need to know R() or T() or U*(), to get π*(s)
note: notice in VI,PI, the value function was U(s), but here the value function is Q(s,a). action is part of it.
##
## Q learning : model free approach
##
we wanna learn Q(s,a) without models of T(),R()
i.e.
 we only have access to transitions <s,a,s',r> # we took a in s, and got r, and landed in s'
 no access to models of T(), R()
_
we can estimate Q() from transitions
_ _ _
Q(s,a) = (1-α)Q(s,a) + α[r + γ max Q(s',a')] # α = learning rate
                                  a'
##
## learning rate α and convergence
##
intuitively, it's how much to learn from the most recent experience. as we set the learning rate bigger, QLearner will consider the most recent experience more heavily.
e.g. if we set α=0, agent will not learn anything. if we set α=1, then only the most recent reward info will be considered.
note: α=1 is optimal in a deterministic world, but in stochastic world, α=1 will likely make Q value fluctuate/oscillate around the max reward.
so all we are doing in Q estimation is to continuously update V = (1-α)V + αX # V_t+1 = (1-α_t)V_t + α_t*X_t to be precise
# assume X_t is drawn from some probability distrib of X
to "converge", learning rate α satisfies two properties
∞
(1) Σ α_t = ∞
t=0
∞
(2) Σ (αt)^2 < ∞
t=0
There is a whole range of values for α_t that meets the above properties, but an example is α_t = 1/t
(in practice, a constant α is also commonly used.)
interestingly, V_t+1 = (1-α_t)V_t + α_t*X_t will converge to E[X] # essentially the average of X
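a quick check of this claim (my sketch: X = fair die rolls; with α_t = 1/t the update reduces to a running sample mean):

```python
# V <- (1-a_t)*V + a_t*X_t with a_t = 1/t converges to E[X] = 3.5 for a fair die.
import random

random.seed(0)
V = 0.0
for t in range(1, 200_001):
    X = random.randint(1, 6)
    alpha = 1.0 / t               # satisfies sum(alpha)=inf and sum(alpha^2)<inf
    V = (1 - alpha) * V + alpha * X
print(round(V, 3))                # close to 3.5
```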
##
## Q learning (cont'd)
##
_ _ _
Q(s,a) = (1-α)Q(s,a) + α[r + γ max Q(s',a')] # α = learning rate
a'
_
= E[r + γ max Q(s',a')]
a'
_
= R(s) + γ E[max Q(s',a')]
s' a'
_
= R(s) + γ Σ [T(s,a,s') * max Q(s',a')] # so our estimate actually converges to the real Q() in theory
s' a'
_
note: a problem with the above claim is Q() keeps getting updated, so it is a moving target
##
## Q learning : convergence
##
_
Q starts anywhere, then Q(s,a) = (1-α)Q(s,a) + α[r + γ max Q(s',a')] # α = learning rate
a'
_
then Q(s,a) converges to Q(s,a) # solution to Bellman equation !!
only IF
(1) s,a visited infinitely often
(2) α satisfies two properties # sum to ∞, but its squared sum is less than ∞
(3) s' drawn from T(s,a,s'), r drawn from R(s)
==> important questions comes up:
_
 how to initialize Q(s,a) ∀s,a ?
 how to decay αt ?
 how to choose actions ? # always choose the same action ? -> dumb, won't learn anything
                         # always choose randomly ? -> you'd learn, but never use what you learnt
                         # _
                         # choose based on Q() ? -> sensible, but think about it.
                         # if the initialization of the estimate is done badly,
                         # then you may end up always choosing the same action, or some local min/max action.
                         # but this problem is not new, we already discussed it in the Randomized Optimization context
_
==> take "simulated annealing"like approach. i.e. choose based on Q() but take a random action sometimes
_ _
π(s) = argmax Q(s,a) with P(1-ε) # neat, this is called "ε-greedy exploration"
         a
       random action otherwise   # aka "greedy in the limit with infinite exploration" (GLIE)
##
## ε-greedy exploration
##
_ _
if GLIE (decayed ε), then Q() -> Q() and π() -> π*() # (the first Q and π are our estimates) balancing this is the "exploration-exploitation dilemma"
     __________         ___________   # it's a fundamental tradeoff in RL.
     (learn)             (use)        # you have to balance btwn explore and exploit. (have to take action to learn, but have to learn to take better action)
     (exploration)       (exploitation) # otherwise you learn nothing or fall into local minima
note:
 explore refers to taking random action. # with P(ε)
 exploit refers to choosing the action based on the current Q() which it has already learnt/explored, to maximize reward. # with P(1-ε)
 so we need to balance btwn the two by controlling ε. it is sensible to start with a big ε (more exploration) at the beginning, and gradually decay it (as your Q value estimate improves).
(good ref) http://artint.info/html/ArtInt_266.html
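a toy sketch of tabular Q-learning with ε-greedy exploration and decay (the chain MDP, constants and decay schedule below are all made up for illustration):

```python
# Tabular Q-learning on a 5-state chain: states 0..4, actions L/R,
# reward +1 only on reaching state 4. Deterministic toy environment.
import random

random.seed(0)
N, GAMMA, ALPHA = 5, 0.9, 0.5
Q = {(s, a): 0.0 for s in range(N) for a in ("L", "R")}

def env_step(s, a):
    s2 = max(0, s - 1) if a == "L" else min(N - 1, s + 1)
    return s2, (1.0 if s2 == N - 1 else 0.0)

eps = 1.0
for episode in range(500):
    s = 0
    for _ in range(20):
        # epsilon-greedy: explore with P(eps), exploit with P(1-eps)
        a = random.choice("LR") if random.random() < eps else \
            max("LR", key=lambda act: Q[(s, act)])
        s2, r = env_step(s, a)
        target = r + GAMMA * max(Q[(s2, b)] for b in "LR")
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target  # Q-learning update
        s = s2
        if r > 0:                      # reached the rewarding terminal state
            break
    eps = max(0.05, eps * 0.99)        # decay epsilon (GLIE-style)

policy = {s: max("LR", key=lambda act: Q[(s, act)]) for s in range(N)}
print(policy)                          # greedy policy should point R toward the reward
```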
##
## note
##
 unless we decay α & ε, the policy will always fluctuate a bit by design.
 notice Q-Learning is really a randomized optimization algo. (thus it can get stuck in local optima)
 there is a whole lot more stuff to learn in RL, but we only covered the fundamentals above.
e.g.
_ _
other approach includes smarter Q() initialization. if you set Q(s,a) = the highest possible value for each s,a then you are bound to check all actions. so it's one way to force extensive exploration. (known as optimism in the face of uncertainty)
_
 to recap, the hyperparameters are: (1) ε, (2) α, (3) initial Q values
###############################
#### (3.3) Game Theory ####
###############################
so far, we've been talking about decision making for a single agent who wants to maximize reward. but there are other agents competing in the world.
 mathematics of conflict
 single agents VS multiple agents
 economics, biology
 increasingly a part of AI/ML
##
## a simple game
##
two players, each taking turns to say a positive integer < 4 (i.e. 1, 2 or 3), adding up; whoever says 20 wins.
 you can pretty much list out all possible states and actions.
 it's called a "2-player zero-sum finite deterministic game of perfect information"
zero-sum: the sum of all rewards is a constant K in every strategy combination (K doesn't have to be zero, as long as it's the same constant in every possible outcome)
deterministic: no stochasticity, no probabilistic nature
strategy: what a player does in every possible state. if you have N states and M possible actions to choose from in each state, then you have M^N strategies. (note the number of choices in each state may not be the same: in some states you have a binary choice, in others, many choices.) also, notice a strategy is basically a combination of actions, one per state.
quiz: (how many strategies ?) https://www.youtube.com/watch?v=2k6rkcIMZE0
quiz: (matrix representation) https://www.youtube.com/watch?v=fqRs8qNUoeM
https://www.youtube.com/watch?v=8CYEPMd40qE
##
## minimax
##
as we saw in the quiz, player A considers the worst case counter by player B, and player B considers the worst case counter by player A. A wants to maximize its score while B wants to minimize A's score (which, the game being zero-sum, maximizes B's own score), so this paradigm is called "minimax"
##
## Von Neumann theorem
##
in a "2-player zero-sum finite deterministic game of perfect information", minimax == maximin, and there always exists an optimal pure strategy for each player.
##
## game tree  example
##
let's look at an example of a "2-player zero-sum finite NON-deterministic game of perfect information" # non-deterministic == stochastic
(quiz) https://www.youtube.com/watch?v=_dqKspIU7y4
https://www.youtube.com/watch?v=_URgDCeeMSY
note: once you construct strategy matrix, you may not be able to reproduce the exact original tree but that's ok, as long as you can construct a game tree that satisfies the matrix, then it's all that matters.
note: Von Neumann theorem still applies (minimax==maximin) even after we changed the game to Nondeterministic.
##
## minipoker
##
we further relax our game to be a "2-player, zero-sum, finite, NON-deterministic game of HIDDEN information"
"mini-poker":
 given players A & B
 A is dealt a card, red or black, 50% probability each. A wants black. (B cannot see what A got, hence "HIDDEN" information)
 A may resign if red: then -20 cents for A
   else A chooses to hold # assumption: if A gets black, he always holds
   then B may choose (1) to resign : +10 cents for A
                  or (2) to see A's card: -40 cents (for A) if red
                                          +30 cents (for A) if black
note: it's a simpler version of Poker
(see game tree & matrix representation of the above game flow) https://www.youtube.com/watch?v=15oPc9T1b9c
note: with "hidden" information, minimax != maximin, so the Von Neumann theorem fails.
but we overcome this by using "mixed" strategies, instead of using "pure" strategies
##
## "mixed" VS "pure" strategy
##
suppose a player has two actions to choose from in some state. so that's two strategies. (to simplify, let's assume there is no other state.)
"pure" strategy means the player always picks one strategy.
"mixed" strategy means the player picks strategy with some probability distribution. e.g. P(strategy_1) = 0.3, P(strategy_2) = 0.7
note: so yes, pure strategy is a type of mixed strategy where probability distrib is 100% on one strategy.
##
## (iterated) prisoner's dilemma
##
suppose two criminals get arrested, and each is offered a choice between coop & defect. # coop = keep silent, deny any accusation
                                                                                        # defect = blame the other guy
as each has to make a decision, they are detained in separate cells, and cannot communicate with each other.
                           (player B)
                    | coop | defect
    ----------------+------+--------
(player A)   coop   | 1,1  | 9,0    # 9,0 means A gets 9 years in jail, B gets 0 years in jail
             defect | 0,9  | 6,6    # 6,6 means both A and B get 6 years in jail
so, as a group, it makes sense for both to coop, but since they cannot communicate with each other, each has to reason as below.
(from player A's perspective) if B coops, which action gives A less time in jail? -> defect (0 years, as opposed to 1 year)
                              if B defects, which action gives A less time in jail? -> defect (6 years, as opposed to 9 years)
so it always makes sense for A to defect.
player B will choose to defect for the same reason. so they both get 6 years in prison. aka the prisoner's dilemma.
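the dominance argument above can be checked mechanically; a minimal sketch (table and helper names are mine, payoffs are years in prison so lower is better):

```python
# Sketch: checking that "defect" strictly dominates "coop" for player A in
# the table above (payoffs are years in prison, so lower is better).
# years[(a_move, b_move)] = (A's years, B's years)
years = {
    ("coop", "coop"): (1, 1),
    ("coop", "defect"): (9, 0),
    ("defect", "coop"): (0, 9),
    ("defect", "defect"): (6, 6),
}

def a_prefers_defect_vs(b_move):
    """True if A spends strictly fewer years defecting, given B's move."""
    return years[("defect", b_move)][0] < years[("coop", b_move)][0]

print(all(a_prefers_defect_vs(b) for b in ("coop", "defect")))  # True
```

by symmetry the same check holds for player B, which is exactly why both end up defecting.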
##
## Nash Equilibrium # equilibrium = balanced state
##
given N players, each with a set of strategies S1, S2, ..., Sn # S1 = the set of strategies available for player 1
                                                               # S2 = the set of strategies available for player 2
(each player picks a strategy)
a set of particular strategies chosen by each player: S1* ∈ S1, S2* ∈ S2, ..., Sn* ∈ Sn
is in a Nash Equilibrium iff ∀i Si* = argmax_{Si} utility(S1*, S2*, ..., Si, ..., Sn*)
i.e. if you give each player a chance to change his strategy, nobody changes. i.e. all balanced.
note: Nash Equilibrium works for both pure and mixed strategies.
##
## Nash Equilibrium quiz
##
find the (pure) N.E.
                           (player B)
                        coop       defect
 (player A)  coop   |    1,1   |    9,0   |   # well, we already went thru this
             defect |    0,9   |    6,6   |   # so 6,6 is the N.E. (i.e. the dominant strategy combo)

                      (player B)
               |  0,4  |  4,0  |  5,3  |
 (player A)    |  4,0  |  0,4  |  5,3  |
               |  3,5  |  3,5  |  6,6  |   # 6,6 is the N.E.
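the quiz answer can also be found by brute force; a minimal sketch (function name is mine):

```python
# Sketch: brute-force pure-strategy Nash equilibrium search for a bimatrix
# game (function name is mine). payoff[i][j] = (A's payoff, B's payoff),
# higher is better, so negate years-in-prison style tables before using it.
def pure_nash_equilibria(payoff):
    n_rows, n_cols = len(payoff), len(payoff[0])
    eqs = []
    for i in range(n_rows):
        for j in range(n_cols):
            a, b = payoff[i][j]
            # A must not gain by switching rows; B must not gain by switching columns
            a_best = all(payoff[k][j][0] <= a for k in range(n_rows))
            b_best = all(payoff[i][k][1] <= b for k in range(n_cols))
            if a_best and b_best:
                eqs.append((i, j))
    return eqs

# the 3x3 quiz game above (rows = player A, columns = player B)
quiz = [[(0, 4), (4, 0), (5, 3)],
        [(4, 0), (0, 4), (5, 3)],
        [(3, 5), (3, 5), (6, 6)]]
print(pure_nash_equilibria(quiz))  # [(2, 2)], i.e. the 6,6 cell
```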
##
## 4 fundamental theorems of N.E.
##
(1) in an n-player pure strategy game, if elimination of strictly dominated strategies eliminates all but one combination, that combination is the unique N.E.
(2) any N.E. will survive elimination of strictly dominated strategies.
(3) if n is finite && ∀i Si is finite, then ∃ a (either pure or mixed) N.E. # n = number of players, Si = the set of strategies available for player i
(4) n repeated games == n repeated N.E. # even if you repeat the game to learn each other's pattern/behavior, it doesn't change the outcome N.E.
note: it's possible for a game to have multiple N.E. and that's an active research area but beyond the scope of this class.
###
### Uncertain End : discounting
###
let's look deeper at (4) n repeated games == n repeated N.E.
suppose you have multiple rounds of the same game of prisoner's dilemma we discussed.
the idea is that with multiple rounds, in earlier rounds you get to show benign intentions by cooperating instead of defecting, so the opponent will learn to cooperate in subsequent rounds for the mutual good.
think about the very last round, because there is no subsequent round, you get greedy, and choose defect. by N.E. the other party will defect too.
then the last round outcome is already determined, then what about the 2nd last round? it's essentially the same thing, so both will defect, and so on.
hence, n repeated games == n repeated N.E.
==> but that assumes both parties know when the last round is.
==> what if the last round is decided probabilistically? then it makes a difference !
[setup]
you start the first round; then, with probability γ, the game continues to the next round.
i.e. with probability 1-γ, the game ends. so with uncertainty, every round could be your last.
what is the expected number of rounds ? # it's finite if γ < 1
 ∞
 Σ [γ^i] = 1/(1-γ) # the sum starts at i=0 because the first round always happens
 i=0
===> this means we cannot represent a strategy as a hardcoded sequence of actions like round1: coop, round2: defect, round3: defect, etc
we need a better representation.
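the expected round count 1/(1-γ) can be checked by simulation; a minimal sketch (function name is mine):

```python
# Sketch: checking E[rounds] = 1/(1-γ) by simulation. After each round the
# game continues with probability γ, so the first round always happens and
# round i+1 happens with probability γ^i.
import random

def simulate_rounds(gamma, rng):
    rounds = 1                   # the first round always happens
    while rng.random() < gamma:  # continue with probability γ
        rounds += 1
    return rounds

rng = random.Random(0)
gamma = 0.8
trials = 100_000
avg = sum(simulate_rounds(gamma, rng) for _ in range(trials)) / trials
print(avg, 1 / (1 - gamma))  # the average comes out close to 5.0
```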
###
### tit-for-tat # a famous IPD strategy (IPD = iterated prisoner's dilemma)
###
(1) cooperate on first round
(2) copy opponent's previous move thereafter
we can represent it as a finite state machine:

       (C)                        (D)
      +----+                     +----+
      |    |         (D)         |    |
      +--> [C] ----------------> [D] <-+
            ^         (C)         |
            +---------------------+
   # parentheses () denote the opponent's move
(quiz) https://www.youtube.com/watch?v=ln7FE1e1cJE
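the two rules above can be written directly as a strategy function; a minimal sketch (names are mine):

```python
# Sketch: tit-for-tat written as a strategy function (names are mine), plus
# a tiny driver that plays two strategies against each other.
def tit_for_tat(opponent_history):
    # cooperate first, then copy the opponent's previous move
    return "C" if not opponent_history else opponent_history[-1]

def always_defect(opponent_history):
    return "D"

def play(strategy_a, strategy_b, rounds):
    hist_a, hist_b = [], []
    for _ in range(rounds):
        a = strategy_a(hist_b)  # each side sees only the opponent's history
        b = strategy_b(hist_a)
        hist_a.append(a)
        hist_b.append(b)
    return "".join(hist_a), "".join(hist_b)

print(play(tit_for_tat, always_defect, 5))  # ('CDDDD', 'DDDDD')
print(play(tit_for_tat, tit_for_tat, 5))    # ('CCCCC', 'CCCCC')
```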
##
## facing TfT strategy
##
let's use the same setup.
                           (player B)
                        coop       defect
 (player A)  coop   |    1,1   |    9,0   |
             defect |    0,9   |    6,6   |
facing TfT opponent, what if you do the following strategies?
(1) always defect : 0 + 6γ/(1-γ) # in the first round you defect while your opponent coops, so you get 0 years in prison
# from the 2nd round onward, both your opponent and you defect, so that's 6 times the expected number of rounds 1/(1-γ)
# but recall it's from the 2nd round onward, so you multiply by γ, giving 6γ/(1-γ)
(2) always coop : 1/(1-γ) # 1 times the expected number of rounds 1/(1-γ)
===> compare these two. they are equal when γ = 1/6 # derive γ = 1/6 by setting (1) = (2) and solving: 6γ = 1
also, important to notice (1) is better when γ < 1/6
                           (2) is better when γ > 1/6
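the γ = 1/6 crossover above can be checked numerically; a minimal sketch (function names are mine):

```python
# Sketch: expected years in prison against a tit-for-tat opponent as a
# function of the continuation probability γ (lower is better here).
def always_defect_years(gamma):
    # round 1: you defect while TfT coops -> 0 years; afterwards both defect
    return 0 + 6 * gamma / (1 - gamma)

def always_coop_years(gamma):
    # both cooperate every round -> 1 year per expected round
    return 1 / (1 - gamma)

print(always_defect_years(0.1) < always_coop_years(0.1))     # True: defecting wins for γ < 1/6
print(always_defect_years(0.5) > always_coop_years(0.5))     # True: cooperating wins for γ > 1/6
print(always_defect_years(1 / 6), always_coop_years(1 / 6))  # equal at γ = 1/6, both ≈ 1.2
```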
===> how do we generalize this?
##
## what's the best response to a finite state strategy?
##
recall our finite state machine representation.
note: the matrix was enough when there is only one round of the game, but it is not sufficient when there is an uncertain number of rounds of the same game.

       (C): 1                     (D): 6
      +----+                     +----+
      |    |        (D): 0       |    |
      +--> [C] ----------------> [D] <-+
            ^       (C): 9        |
            +---------------------+
   # parentheses () denote YOUR move; the number is your years in prison for that transition
our choice impacts two things
(1) our payoff
(2) future decisions of the opponent
==> treat the above representation as an MDP problem ! # which we know how to solve, using value iteration
==> so the policy you find will be of the form
π(s) = a # s is the opponent's action C or D, which we interpret as an input state s
# and action a is our action choice of C or D
so there are only 4 combinations:
π(C) = C, π(D) = C # always C
π(C) = D, π(D) = D # always D
π(C) = C, π(D) = D # this
π(C) = D, π(D) = C # and this are essentially the same, you just loop between CDCDCDCDCDCDC...
note: MDP always has markov deterministic optimal policy.
(mustsee quiz) https://www.youtube.com/watch?v=2QTz1H_PRp0
e.g.
 opponent strategy | your best strategy/response
 ------------------+----------------------------
 always C          | always D
 always D          | always D
 TfT               | always C or TfT
===> now think about which pair is a "mutual best response" ? i.e. a pair of strategies where each is the best response to the other. (i.e. a Nash Equilibrium)
always C + always D ==> no, because your opponent will prefer to change to always D
always D + always D ==> yes, it's N.E.
TfT + always C ==> no, because your opponent will prefer to change to always D
TfT + TfT ==> yes, it's N.E. # notice this N.E. actually achieves "cooperative" N.E. overcoming the dilemma.
===> with TfT, we kinda changed the prisoner's dilemma game in a repeated setting, to achieve "cooperative" N.E.
let's formally define this.
###
### Repeated Games and the Folk Theorem (intro)
###
general idea: in repeated games, the possibility of retaliation opens the door for cooperation.
in mathematics, a folk theorem means a well-known, well-established, provable theorem (at least to experts in the field) that is not attributed to any particular individual and has no official original publication.
in game theory, the folk theorem means the set of payoffs that can result from Nash strategies in repeated games.
(excellent quiz) https://www.youtube.com/watch?v=fEdzBqxhUKM
e.g.
let's use the usual setup,
                           (player B)
                        coop       defect
 (player A)  coop   |    1,1   |    9,0   |
             defect |    0,9   |    6,6   |
===> just to think about this geometrically, suppose x-axis = opponent's payoff, y-axis = your payoff (years in jail here)
plotting the 4 outcomes of the table above as points (opponent, you):
 (c,c) = (1,1), (d,d) = (6,6), (c,d) = (0,9), (d,c) = (9,0)
connect all 4 points. that forms a polygon, which describes the "feasible region": the average payoffs achievable by some joint strategy # "feasible region" is an important notion
###
### Minmax Profile # aka minmax payoff profile
###
a pair of payoffs, one for each player, representing the payoffs that can be achieved by a player defending itself against a malicious adversary (one trying to minimize your score).
note: i.e. each payoff is computed as if the game were zero-sum.

                  (opponent)
                   B       S
 (you)    B  |    1,2  |  0,0  |   # this game is known as Bach/Stravinsky (aka Battle of the Sexes)
          S  |    0,0  |  2,1  |   # e.g. 1,2 means you get 1 point, the opponent gets 2 points

Q. what is the minmax profile of Bach/Stravinsky ?
answer: 1,1 # to minimize your opponent's score, you avoid the strategy that can give the opponent the high score 2
# so you choose strategy S; at that point, your opponent can score at most 1
# and vice versa: this particular game is symmetric, so the answer is 1,1
note: the "minmax profile" assumes pure strategies.
for mixed strategies, we use the "security level profile" # but some people say "mixed strategy minmax payoffs/profile"
Q. what is the security level profile of Bach/Stravinsky ?
answer: 2/3, 2/3
==> suppose the opponent chooses strategy B with probability x, and thus strategy S with probability 1-x
so your expected score will be 1*x if you choose B, and 2*(1-x) if you choose S
then your opponent wants to minimize your max score: if he chooses x s.t. 1*x > 2*(1-x), you choose B
if he chooses x s.t. 1*x < 2*(1-x), you choose S. so your opponent wants to min max(x, 2-2x) # hence minmax
you can plot max(x, 2-2x) for x from 0 to 1, and find where it's lowest,
but basically you can solve this by setting x = 2-2x, which gives x = 2/3
since the game is symmetric in this particular example, the answer is 2/3, 2/3
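the same min-max computation can be done numerically; a minimal sketch (names are mine):

```python
# Sketch: the security level computation above, done numerically. The
# opponent plays B with probability x; your best response then earns
# max(1*x, 2*(1-x)), and the opponent picks x to minimize that.
def best_response_value(x):
    return max(1 * x,        # your expected payoff if you answer with B
               2 * (1 - x))  # your expected payoff if you answer with S

xs = [i / 1000 for i in range(1001)]        # coarse grid over [0, 1]
x_star = min(xs, key=best_response_value)
print(x_star, best_response_value(x_star))  # ≈ 2/3 and ≈ 2/3
```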
##
## acceptable region # aka preferable region
##
in the Prisoner's Dilemma game, the minmax payoff is 6,6 # i.e. whatever the penalty is when both defect
recall the "feasible" region: the polygon obtained by connecting the 4 points (c,d)(d,d)(c,c)(d,c)
because the minmax payoff is at (d,d), the points in the feasible region that improve on it for the players form the "acceptable" region.
i.e. since payoffs here are years in jail (lower is better), it's the sub-region of the feasible region where x ≤ 6 and y ≤ 6.
(plot: same 4 outcome points as before; the acceptable region is the part of the feasible polygon on the (c,c) side of (d,d))
(visual) https://www.youtube.com/watch?time_continue=79&v=tLoEupti9E
###
### Folk Theorem (definition)
###
any feasible payoff profile that strictly dominates the minmax/security level profile can be realized as a Nash equilibrium payoff profile, with sufficiently large discount factor.
proof idea: since the target payoff strictly dominates the minmax profile, players can threaten to drop any deviator to his minmax payoff. everyone is better off doing what they are told!
###
### Grim Trigger # another example strategy like TfT
###
it's a strategy where if you cooperate, Grim will cooperate.
if you ever defect, Grim will defect for the rest of the game.
state machine wise, it looks like this:

       (C)                      (C),(D)
      +----+                    +----+
      |    |         (D)        |    |
      +--> [C] ---------------> [D] <-+
   # parentheses () denote your move; D is an absorbing state

the idea is Grim threatens you: "if you defect, I will always defect, so you'd better cooperate."
Grim forces you to cooperate. yes, it's a Nash Equilibrium.
###
### Implausible Threats
###
a fundamental problem with Grim Trigger is that its threat is implausible.
Grim says he will always defect if you ever defect, but carrying out that threat is not beneficial to Grim either.
(imagine somebody on the street threatens to detonate a nuclear bomb if you don't hand over your wallet. if you don't cooperate he dies too, and he must value his life more than your wallet, so his threat is implausible.)
"subgame perfect" : always best response independent of history
===> so Grim doesn't feel like subgame perfect.
BUT, depending on the rewards/penalties, and also depending on the opponent's strategy, Grim Trigger can still be subgame perfect.
So, the right way to think about subgame perfect is in an actual game setup.
Q. is Grim Trigger VS TfT subgame perfect?
A. No. suppose at one point in history (you are allowed to artificially engineer the history) TfT chose Defect; then Grim will choose Defect forever, and that is not a best response from that point on, hence not subgame perfect.
it is however a N.E. because it will settle in an always cooperate loop.
Q. is TfT vs TfT subgame perfect?
A. No. suppose at one point in history (again, you can make this up) one TfT chose Defect; then the opponent TfT echoes it back, and the defection bounces back and forth forever: (C,D), (D,C), (C,D), ... (not subgame perfect)
###
### Pavlov # another strategy, aka "win-stay, lose-shift"
###
TfT: (cooperate first time, then) repeat what opponent did
Pavlov: slightly modified TfT. (start coop)
if Pavlov cooperates and you coop, (at this point, this round of game ends), then (at next round) Pavlov coop.
if Pavlov cooperates and you defect, then (at next round) Pavlov defects.
if Pavlov defects and you coop, then (at next round) Pavlov defects.
if Pavlov defects and you defect, then (at next round) Pavlov coop.
===> aka "winstay, loseshift" i.e. repeat last choice if good outcome, otherwise switch
visually, it's a finite state machine as below.

       (C)                        (C)
      +----+                     +----+
      |    |         (D)         |    |
      +--> [C] ----------------> [D] <-+
            ^         (D)         |
            +---------------------+
   # parentheses () denote your move

note: recall our model of a "game": in each round both parties pick their hands simultaneously, then the result comes, then the next round starts. so play is not truly sequential within a single round, as the state machine diagram above might suggest.
Q. is Pavlov VS Pavlov a Nash equilibrium ?
A. well yes, simply it loops in Cooperate state.
Q. is Pavlov VS Pavlov subgame perfect ?
A. yes
suppose (your_hand, opponent_hand)
(1) if (C,C) then both parties choose C at the next round, so it'll be an infinite loop.
(2) if (D,D) then both switch, going to (1) at the next round
(3) if (C,D) then going to (2) at the next round
(4) if (D,C) then going to (2) at the next round
===> a plausible threat at work. the Pavlov strategy is effectively: "if you screw with me, I will screw with you."
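the four cases above can be checked mechanically; a minimal sketch (function names are mine):

```python
# Sketch: Pavlov ("win-stay, lose-shift") playing against itself, checking
# that every joint hand settles into (C, C) within two rounds.
def pavlov_next(my_hand, opp_hand):
    # win-stay: if the opponent cooperated, keep my hand.
    # lose-shift: if the opponent defected, switch my hand.
    if opp_hand == "C":
        return my_hand
    return "D" if my_hand == "C" else "C"

def pavlov_vs_pavlov(start, rounds):
    """Trace the joint hands of two Pavlov players from a given start."""
    hands, trace = start, [start]
    for _ in range(rounds):
        a, b = hands
        hands = (pavlov_next(a, b), pavlov_next(b, a))
        trace.append(hands)
    return trace

for start in [("C", "C"), ("D", "D"), ("D", "C"), ("C", "D")]:
    print(start, "->", pavlov_vs_pavlov(start, 3))
# every starting pair reaches (C, C) within two rounds and stays there
```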
###
### Computational Folk Theorem
###
given a 2-player bimatrix game (average reward, repeated game),
we can build Pavlov-like machines for any game,
(and using that machine) we can construct a subgame perfect Nash equilibrium for any game in polynomial time.
i.e.
if it is possible for the 2 players to have mutually beneficial strategies, then we can build a Pavlov-like machine in polynomial time (i.e. quickly).
if not possible, then the game is zero-sum-like (the players cannot mutually benefit), but we can still solve a linear program to produce a N.E. strategy solution.
or if even that is not possible, we can still find a N.E. strategy where at most one player improves, in the zero-sum-like game.
note: a 2-player zero-sum game: we are familiar with this.
a 2-player bimatrix game is one where each player has his own reward structure.
note: "average reward repeated game" means we repeat many times and look at the average reward.
(read papers by Peter Stone, Michael Littman on this topic for more details)
###
### types of games : zero-sum VS constant-sum VS general-sum
###
recall the matrix of strategies and corresponding rewards.

                      (player 2)
                     A         B
 (player 1)  C |   1,-1  |   0,0  |   # zero-sum means every reward pair x,y has x+y = 0
             D |   2,-2  |   3,-3 |   # constant-sum means x+y = some_constant
                                      # general-sum means x,y can be anything, so there can be win-win or lose-lose strategies,
                                      # e.g. x,y = 1,1 or x,y = -2,-2 and so on.
(ref) https://www.cs.cmu.edu/~avrim/ML14/lect0409.pdf
###
### Stochastic Games (for multiagent RL)
###
recall MDP gave a model for RL.
similarly,
stochastic games give a formal model for multiagent RL. # turns out MDP is a particular version of a stochastic game
N.E. in multiagent RL : pair of policies s.t. no agent would prefer to change his policy.
[stochastic games notions] # first proposed by Shapley, a nobel prize winner
S : states # s ∈ S
Ai : actions for player i # a ∈ A1, b ∈ A2 (assuming a 2-player game)
T : transitions # T(s,(a,b),s')
Ri : rewards for player i # R1(s,(a,b)), R2(s,(a,b))
γ : discount factor # assume γ is the same for all players (just the definition we stick to)
##
## models & stochastic games
##
we can add constraints to stochastic games to get specific models we are familiar with.
(for simplification, assume a 2-player game)
(1) R1 = -R2
===> zero sum game
(2) T(s,(a,b),s') = T(s,(a,b'),s') ∀b' # i.e. player 2's action doesn't matter
R2(s,(a,b)) = 0, R1(s,(a,b)) = R1(s,(a,b')) ∀b' # i.e. player 2's reward is always zero
# player 2's action doesn't affect player 1's reward
===> MDP (where your opponent's action doesn't affect you)
(3) |S| = 1
===> repeated game because we always play the same state
###
### zero-sum stochastic games # assume a 2-player game
###
recall the Bellman equation used to represent a utility (aka value) function.
can we define the analogous Q value function for a zero-sum stochastic game ?

 Qi*(s,(a,b)) = Ri(s,(a,b)) + γ Σ_s' [ T(s,(a,b),s') max_(a',b') Qi*(s',(a',b')) ] # notice something is wrong here

===> this naively assumes the joint actions of max benefit to player i will be chosen
===> to adapt it to the zero-sum stochastic game, we need to use minimax instead

 Qi*(s,(a,b)) = Ri(s,(a,b)) + γ Σ_s' [ T(s,(a,b),s') minimax_(a',b') Qi*(s',(a',b')) ] # modified Bellman equation for the zero-sum stochastic game
now, recall the Q-learning update equation, used to solve the Bellman equation.

 Q(s,a) <- (1-α)Q(s,a) + α[ r + γ max_a' Q(s',a') ] # α = learning rate

===> let's adapt this to the zero-sum stochastic game.
given an experience of a transition <s,(a,b),(r1,r2),s'> # took joint action (a,b) in s, got rewards (r1,r2), and landed in s'

 Qi(s,(a,b)) <- (1-α)Qi(s,(a,b)) + α[ ri + γ minimax_(a',b') Qi(s',(a',b')) ] # neat, this is aka "minimax-Q"
===> turns out we get nice properties !!
- value iteration works
- minimax-Q converges
- unique solution to Q*
- policies can be computed independently # because you just minimize your opponent's max score (this idea extends to more-than-2-player games too, as long as the game is zero-sum)
- the update is efficient (i.e. polynomial time, because the above minimax can be solved by linear programming)
- the Q functions are sufficient to specify the policy
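the minimax in the update above is, per state, a small matrix-game solve, which minimax-Q does exactly via linear programming; a grid-search stand-in for a 2-action stage game, just to illustrate what is being computed (names are mine):

```python
# Sketch: the inner minimax solve that minimax-Q performs per state. For a
# zero-sum stage game with 2 actions per player (payoff matrix q: rows =
# our actions, columns = opponent's), the real algorithm solves a small
# linear program; a coarse grid search over our mixed strategy stands in
# here, for illustration only.
def minimax_value(q, steps=1000):
    best = float("-inf")
    for i in range(steps + 1):
        p = i / steps  # probability we play row 0
        # the opponent picks the column that hurts us most
        worst = min(p * q[0][j] + (1 - p) * q[1][j] for j in range(2))
        best = max(best, worst)
    return best

matching_pennies = [[1, -1], [-1, 1]]
print(minimax_value(matching_pennies))  # 0.0, achieved by playing 50/50
```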
###
### general-sum stochastic games
###
general-sum means not zero-sum. so there can be win-win, lose-lose pairs of strategies, like the prisoner's dilemma.
let's extend the same Q-learning to general-sum games.
instead of minimax (which was useful in the zero-sum game), we use Nash equilibrium computation.

 Qi*(s,(a,b)) = Ri(s,(a,b)) + γ Σ_s' [ T(s,(a,b),s') Nash_(a',b') Qi*(s',(a',b')) ]

 Qi(s,(a,b)) <- (1-α)Qi(s,(a,b)) + α[ ri + γ Nash_(a',b') Qi(s',(a',b')) ]
====> called "Nash-Q"
sadly, with Nash-Q in a general-sum game, we get none of the properties minimax-Q had in zero-sum stochastic games.
- value iteration doesn't work
- Nash-Q doesn't converge
- no unique solution to Q*
- policies cannot be computed independently # because a N.E. is a joint behavior
- the update is not efficient (unless P = PPAD)
- the Q functions are not sufficient to specify the policy
===> doomed! but there are lots of good ideas
(1) repeated stochastic games
 - play a stochastic game (which repeats turns for players) and repeat the game itself. that allows folk-theorem-like ideas. some algorithms already exist for this.
(2) cheap talk
 - let players talk/share some info, to coordinate. correlated equilibria.
(3) cognitive hierarchy
 - instead of trying to solve for an equilibrium, you just make assumptions about the others, so you can apply MDP/Q-learning type approaches. (turns out this is often how humans play games, and it often gives a near best response)
(4) side payments (coco values) # coco = cooperative-competitive
 - as in: "hey, I get a big reward if we take this joint action, so I'll pay some of it back to you."
###############
### outro ###
###############
cascade learning
semi-supervised learning
spectral clustering : using sophisticated linear algebra to avoid local optima
TF-IDF : term frequency - inverse document frequency
cross entropy
########################
#### Appendix ####
########################
###
### normal distribution # aka Gaussian distribution
###
(ref) https://en.wikipedia.org/wiki/Normal_distribution
informally, it's a common continuous probability distribution, symmetric about the mean, with most of the data points near the mean.
mathematically, it's expressed in terms of the mean μ and variance σ^2:

              e^( -(x-μ)^2 / (2σ^2) )
 y = Pr(x) = ------------------------- on the domain x ∈ (-∞,∞) # this function is called the "probability density function"
                   sqrt(2πσ^2)

note: its shape is nothing but a bell curve. μ defines the center, and σ defines the width.
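the density above written out directly (function name is mine):

```python
# Sketch: the normal probability density function. At x = μ it reduces to
# 1/sqrt(2πσ^2), and the curve is symmetric about μ.
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

print(normal_pdf(0.0, 0.0, 1.0))  # peak of the standard normal, ≈ 0.3989
print(normal_pdf(1.0, 0.0, 1.0) == normal_pdf(-1.0, 0.0, 1.0))  # True, by symmetry
```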
###
### Sampling Distribution
###
(ref) http://davidmlane.com/hyperstat/A11150.html # sampling distrib
(ref) http://davidmlane.com/hyperstat/A13660.html # sample distrib of the mean
let's say you have a lot of data points (can be continuous or discrete), with whatever underlying distribution.
and you pick a set of samples as below:
S_a = {s1, s2, s3, ..., sn} # a sample set of size n
S_b # another set
S_c # more
..
..
S_infinite
note: beware of the confusing wording.
sample size n refers to how many data sample points within a sample set, like above.
and we take lots of sample sets, like S_a, S_b, S_c, so on, like above.
we assume we get infinitely many sets in the context of "sample distribution" (by definition)
for each set, let's say we compute some statistics, like the mean (or median or whatever)
so you get the below
S_a's mean
S_b's mean
S_c's mean
..
..
S_infinite's mean
===> if we plot the frequency distribution of the values of those means, we know the below properties.
##
## properties of the sampling distribution of the mean
##
given a population with a mean of μ and a standard deviation of σ,
the sampling distribution of the mean has a mean of μ, and a stddev of

 σ_M = σ / sqrt(n)  where n = sample size

σ_M is also called the "standard error of the mean" # it's really just "the stddev of the sampling distribution of the mean"
this is intuitive: if you take more data points within each sample set, then each set's mean is more likely to be close to the mean of the original global population. i.e. less fluctuation, i.e. smaller stddev (of the sampling distribution).
also, as n increases, the sampling distribution of the mean approaches a normal distribution (the central limit theorem).
let's write it down as an independent note below.
###
### Central Limit Theorem (CLT)
###
- an important fundamental theorem in statistics # Alan Turing's Cambridge fellowship dissertation gave a proof of it
given "any" (even crazy) distribution of data (can be continuous or discrete), with a mean μ and a variance of σ^2,
its sampling distribution of the mean approaches a normal distribution with a mean μ and a variance of σ^2 / N, as N increases, where N = sample size
note: variance σ^2 / N means stddev σ/sqrt(N)
 this is why many ML algo and scientific/engineering experiment assume normal distribution of the data they analyze.
(ref) http://davidmlane.com/hyperstat/A14043.html # central limit theorem
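both σ_M = σ/sqrt(n) and the CLT can be checked by simulation; a minimal sketch (variable names are mine, using a uniform(0,1) population whose variance is 1/12):

```python
# Sketch: checking σ_M = σ/sqrt(n) by simulation. Draw many sample sets of
# size n from a uniform(0,1) population (σ^2 = 1/12) and compare the spread
# of the sample means against the predicted standard error of the mean.
import math
import random

rng = random.Random(42)
n, n_sets = 100, 5000
means = [sum(rng.random() for _ in range(n)) / n for _ in range(n_sets)]

grand_mean = sum(means) / n_sets
observed_se = math.sqrt(sum((m - grand_mean) ** 2 for m in means) / n_sets)
predicted_se = math.sqrt(1 / 12) / math.sqrt(n)  # σ / sqrt(n)
print(observed_se, predicted_se)  # both ≈ 0.029
```

a histogram of `means` would also look bell-shaped even though the underlying population is flat, which is the CLT at work.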
###
### Cross Validation VS Model Complexity
###
training your model with training data, and testing(validating) the model with test data.
there are lots of sophisticated techniques.
(ref) https://en.wikipedia.org/wiki/Cross-validation_(statistics)
(1) k-fold cross validation:
 - split the data into k equal-sized sets. pick 1 set as test data, and the others as training data.
 - repeat this k times (hence k folds), using each set as the test set exactly once.
 - out of the k results, take the mean as your final estimate.
(2) stratified k-fold cross validation
 - a special type of k-fold CV where each fold is selected to have the same mean (of the response variable y) as the other folds. # for classification, this amounts to each fold preserving the class proportions
"Model Complexity" analysis: when you do cross-validation training/testing with a learning algo (could be any: DT, NN, KNN, SVM, SA, MIMIC, etc), each algo has hyperparameters (e.g. for a DT: what depth, which pruning method, which criterion to pick the best attribute/value to split on).
if you try different parameter sets using train/test, and pick the parameter set that gives the best performance on the test data, that is knowledge of the test dataset leaking into the training.
so the idea is: first split into train + test data (usually 70/30% or 80/20%), then split the train data again into train + validation sets (like 80/20%). train on the train set and evaluate on the validation set to determine the best parameters (GridSearchCV in the scikit-learn library), and only then try the untouched, held-out test set.
also note: because the initial train data is split into train + validation sets, we lose effective data size, and this is where k-fold CV helps.
==> and sometimes people consider this Model Complexity analysis part of the Cross Validation process.
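the k-fold splitting itself is simple enough to write by hand; a minimal sketch (function name is mine; scikit-learn's KFold does the same job):

```python
# Sketch: plain k-fold index splitting with no libraries. Each fold serves
# as the test set exactly once; a real run would average the k test scores
# as the final estimate.
def k_fold_indices(n_samples, k):
    indices = list(range(n_samples))
    # distribute the remainder so fold sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    # one (train_indices, test_indices) pair per fold
    return [([i for f in folds[:j] + folds[j + 1:] for i in f], folds[j])
            for j in range(k)]

splits = k_fold_indices(10, 3)
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))  # fold sizes come out 4, 3, 3
```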
##
## Sample Complexity
##
roughly, the number of training samples a learner needs in order to learn the target function to a given accuracy. people sometimes use it loosely to just mean "input data size".
##
## "data scientist"
##
sadly, nowadays we see so many fancy job titles thrown around: data scientist, ML engineer, quant stat researcher, big data/model/AI analyst, and so on.
a data scientist in one company may mean something completely different from a data scientist in another company.
so many schools are offering "analytics" / "data science" masters degrees these days.
i think people who put "machine learning" as an expertise on their resume should ideally have at least one paper published in one of ICML, JMLR, NIPS, AAAI, UAI, IJCAI, COLT, ICANN, or related IEEE/ACM journals or international conferences. you know, then we know he/she is legit.