##############################################
####   data visualization and analytics   ####
##############################################

lecture outline
1. intro to R
2. data visualization
3. data pre-processing
4. logistic regression
5. linear regression
6. regularization

#########################
###    intro to R    ####
#########################

R docs
https://www.r-project.org/
http://www.statmethods.net/
https://www.r-bloggers.com/how-to-learn-r-2/

(1) use Rstudio IDE

(2) run from command line
e.g.
$ Rscript foo.R   (or simply put #!/usr/bin/Rscript at the first line of foo.R and run it directly)

# useful cmds

help()
ls()      # list variables
save.image(file="R_workspace")  # save all variables to a file
save(new.var, legal.var.name, file = "R_workspace") # save specified variables
system("ls -al") # runs a unix shell command
Sys.sleep(123)  # sleep for 123 seconds
stop("foo bar") # halts execution and errors out
file.size()

# syntactic sugar

- white space is ignored (indentation has no syntactic meaning)
- a semicolon at the end of each line is not needed, unless you want to put two statements on one line.
- comments start with "#"
- a dot "." is not an operator, just a regular character, so it can be part of a function name, etc.
- note variables are dynamically typed, i.e. types are determined at run time rather than compile time
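To illustrate the last point, a minimal sketch of dynamic typing (variable names here are made up for illustration):

```r
# the same name can hold values of different types; no declaration needed
x = 42           # x is numeric
x = "forty two"  # now x is a character string
class(x)         # "character"
```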

# factors

factors are variables in R which take on a limited number of different values.
e.g.
current.season = factor("summer", levels = c("summer", "fall", "winter", "spring"), ordered = TRUE)
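Because the factor above is declared ordered, its values can be compared according to the level order; a small sketch (later.season is a made-up second variable):

```r
# ordered factors support comparison based on their level order
current.season = factor("summer", levels = c("summer", "fall", "winter", "spring"), ordered = TRUE)
later.season   = factor("winter", levels = c("summer", "fall", "winter", "spring"), ordered = TRUE)
levels(current.season)         # "summer" "fall" "winter" "spring"
current.season < later.season  # TRUE, since "summer" precedes "winter" in the level order
```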

# conditional statements

if (a == b){  #  inequality is  !=
  d = 1;
}else if(a > b){
  d = 2;
}else{
  d = 3;
}

print(d)

AND &&
OR  ||
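Note that && and || operate on single values and short-circuit, while the single-character & and | are element-wise over vectors; a quick sketch:

```r
a = c(TRUE, FALSE)
b = c(TRUE, TRUE)
a & b          # TRUE FALSE : element-wise AND over the whole vector
a[1] && b[1]   # TRUE       : scalar AND, stops early if the left side is FALSE
```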

# loop

for(i in 20:25){
  print(i)            # this will print 20,21,22,23,24,25
}

for(i in c(20,21,22,23,24,25)){  # an alternative way
  print(i)
}

for(i in seq(20,25,1)){   # an alternative way
  print(i)
  if(i > 23){
    break       # how you break out. NOTE: break exits only the innermost loop
    # next      # next skips to the next iteration
  }
}

##
##  sequence in R
##

> 1:5         # a way to generate a vector
1 2 3 4 5

==> or, the seq() function is more flexible. its parameters are seq(from, to, by), where "by" defaults to 1 (or -1 when from > to and "by" is omitted)

> seq(0, 1, 0.1)
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

> seq(0, 5, by = 1)
0 1 2 3 4 5

> seq(5, 0, by = 1)
Error in seq.default(5, 0, by = 1) : wrong sign in 'by' argument

> seq(5, 5)
5

> 1.14:pi
1.14 2.14 3.14

> 2:0
2 1 0

> seq(1, 1, by = 0)
1

> seq(0, 5, by = 0)
Error in seq.default(0, 5, by = 0) : invalid (to - from)/by in seq(.)

> 0:-9   # also seq(0,-9)    seq(0,-9,by=-1)
0 -1 -2 -3 -4 -5 -6 -7 -8 -9

> seq(0,-9,by=1)
Error in seq.default(0, -9, by = 1) : wrong sign in 'by' argument

##
##  formatted print
##

foo = 12345
print(foo)
sprintf("here is a number %d ", foo)

NOTE: unfortunately, there is no printf() in R, so you must combine print() and sprintf():

print(sprintf("here is a number %d", foo))

%s : string
%d : integer
%f %e %a : double (fixed, scientific, and hex notation, respectively)

(ref) https://stat.ethz.ch/R-manual/R-devel/library/base/html/sprintf.html

# function definition

my_add = function(a = 3, b = 5){return(a+b);}     # unlike Perl, you really need parentheses () for every function call
foo = my_add(b = 7, a = 9);  # you can do this
foo = my_add(b = 7);         # this assumes a=3

# vector

foo = c('a','b','c')      # c means concatenate
print(foo)                # it will print 'a','b','c'
foo = c(foo,'d')          # this is one way to append a new element
foo = append(foo,'d')     # an alternative way.
foo[1]                    # NOTE: this returns a vector (not a scalar elem) of one elem 'a'
#       also, the indexing starts at 1, confusingly enough
foo[-2]                   # this DOES NOT return 'b' but in fact returns a vector excluding 'b' i.e. c('a','c')

print(length(foo))   # prints the length of foo

NOTE: vector elem must be of the same data type
e.g.
y = vector(mode = "logical", length = 4)     # y = FALSE FALSE FALSE FALSE
z = vector(length = 3, mode = "numeric")     # z = 0 0 0

NOTE: vector is one dimensional array. matrix is 2d array.

more examples
q = rep(3.2, times = 10)         # q = 3.2 3.2 3.2 3.2 3.2 3.2 3.2 3.2 3.2 3.2
w = seq(0, 1, by = 0.1)          # w = 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
w = seq(0, 1, length.out = 11)   # w = 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

more useful functions
w <= 0.5             #  w = TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
any(w <= 0.5)        #  TRUE
all(w <= 0.5)        #  FALSE
which(w <= 0.5)      #  1 2 3 4 5 6
w[w <= 0.5]          #  0.0 0.1 0.2 0.3 0.4 0.5
subset(w, w <= 0.5)  #  0.0 0.1 0.2 0.3 0.4 0.5
w[w <= 0.5] = 0      #  w = 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.7 0.8 0.9 1.0

NOTE: there is a function for everything, so google more.

sd()       # standard deviation
unique()   # like distinct in SQL

##
## array / matrix
##
lots of Pandas-like operations available.

z = seq(1, 20,length.out = 20)
x = array(data = z, dim = c(4, 5))

[,1] [,2] [,3] [,4] [,5]
[1,]  1    5    9    13   17
[2,]  2    6    10   14   18
[3,]  3    7    11   15   19
[4,]  4    8    12   16   20

x[2,3]   # 10
x[2,]    # 2 6 10 14 18

x[-1,]                    # -X means exclude X; an empty row/column index selects everything
[,1] [,2] [,3] [,4] [,5]
[1,]  2    6    10   14   18
[2,]  3    7    11   15   19
[3,]  4    8    12   16   20

y = x[c(1,2), c(1,2)]   # you can specify which row/column
[,1] [,2]
[1,]  1    5
[2,]  2    6

e.g.
2 * x + 1

# matrix multiplication
e.g.
x %*% x    #   %*% is the operator syntax

# transpose
t(x)

# outer product
e.g.
outer(x[,1],x[,1])
The outer product is a matrix whose (i,j) element is the product of the i-th component of the first vector and the j-th component of the second

# rbind() and cbind()
e.g.
rbind(x[1,], x[1,])         # bind as rows
[,1] [,2] [,3] [,4] [,5]
[1,] 1    5    9    13   17
[2,] 1    5    9    13   17

cbind(x[1,], x[1,])    # bind as columns
[,1] [,2]
[1,] 1    1
[2,] 5    5
[3,] 9    9
[4,] 13   13
[5,] 17   17

###
###  List    # can contain diff data types
###
e.g.
L=list(name = 'John', age = 55, no.children = 2, children.ages = c(15, 18))

e.g.
names(L)            | name age no.children children.ages
L[[2]]              | 55
L$name              | John
L['name']           | a one-element list containing John
L$children.ages[2]  | 18
L[[4]][2]           | 18

NOTE: double square bracket notation on lists returns the actual value; a single bracket returns another list
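A quick sketch of that difference, checking the returned types with class():

```r
L = list(name = 'John', age = 55)
class(L['name'])    # "list"      : single bracket returns a one-element sub-list
class(L[['name']])  # "character" : double bracket returns the value itself
```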

###
###  dataframe    # yes, there is such a data structure in R
###
e.g.
vecn = c("John Smith","Jane Doe")
veca = c(42, 45)
vecs = c(50000, 55000)
R = data.frame(name = vecn, age = veca, salary = vecs)   # dimnames = c("name","age","salary")

name  age salary
1  John Smith   42  50000
2  Jane Doe     45  55000

names(R) = c("NAME", "AGE", "SALARY")    # this will change the column names
e.g.
NAME  AGE SALARY
1  John Smith   42  50000
2  Jane Doe     45  55000

------------------------------------------------------
List the dimension (column) names    | names(iris)        # assume iris is a dataframe object
find the number of rows              | nrow(iris)
find the number of columns           | ncol(iris)
Show the first four rows             | head(iris,4)
Show the first row                   | iris[1,]
Sepal length of the first 10 samples | iris$Sepal.Length[1:10]
Allow replacing iris$Sepal.Length with shorter Sepal.Length  | attach(iris, warn.conflicts = FALSE)
Average of Sepal.Length across all rows | mean(Sepal.Length)
Means of all four numeric columns       | colMeans(iris[,1:4])
Create a subset of sepal lengths less than 5 in the setosa species | subset(iris, Sepal.Length < 5 & Species == "setosa")
number of rows corresponding to setosa species | dim(subset(iris, Species == "setosa"))[1]
summary of the dataset iris                    | summary(iris)
how to remove a column                         | df$salary = NULL

(ref) http://stackoverflow.com/questions/6286313/remove-an-entire-column-from-a-data-frame-in-r

# sort by column

dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
levels = c("Low", "Med", "Hi"), ordered = TRUE),
x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
z = c(1, 1, 1, 2))

R> dd[with(dd, order(-z, b)), ]
b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1

ref: http://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-columns

df$col_foo = NULL      # removes the column

#  strsplit()

a very useful builtin function that splits a string into a vector.
str = "a b c"
str_arr = unlist(strsplit(str," "))    # strsplit returns a list (one element per input string);
                                       # unlist() flattens it so str_arr is the character vector c("a","b","c")

http://www.dummies.com/programming/r/how-to-split-strings-in-r/

#  substr("foobar",1,3)  # returns "foo"

NOTE: notice how indexing is slightly different from Perl syntax.

#  R functions

nchar("foobar")     # returns 6, the length of the string
toString(123)       # converts to string
as.numeric("123")   # converts string "123" into numeric
as.vector()
as.matrix()
as.data.frame()
data.matrix(frame, rownames.force = NA)  # convert a dataframe into a numeric matrix
as.integer(3.14)    # 3
is.integer(3)       # TRUE
as.character()
str(foo)        # compactly shows the structure of foo object (dimension, number of col/rows, data type for each, etc)
dim()
nrow()
ncol()
t()
replace()    #  https://stat.ethz.ch/R-manual/R-devel/library/base/html/replace.html
e.g.
> x = c(3, 2, 1, 0, 4, 0)
> replace(x, x==0, 1)
[1] 3 2 1 1 4 1

sample(x, size = n, replace = FALSE)    # x is a vector, n is a number. useful for sampling n elements out of a big data chunk; set replace=TRUE for sampling with replacement.

rnorm()   # normal(gaussian) distribution random number generator
runif()   # uniform distribution random number generator

which()
e.g.
x <- c(3, 7, NA, 4, 8)
> which(x == 7)
[1] 2
which(is.na(x))
[1] 3

is.infinite()    # e.g. x[is.infinite(x)] = NA
mode()           # returns the storage mode (type) of an object; NOT the statistical mode
class()
write(a_vector,sep=",",file="")   # lets you print with a delimiter
abs()
exp(x)     # e^x
log()
mean()         # also remember na.rm=TRUE/FALSE parameter  e.g.   mean(a_vector, na.rm = TRUE)
sum()
rowSums()
colSums()
%*%
%in%
e.g.                               # ref http://stackoverflow.com/questions/1169248/r-function-for-testing-if-a-vector-contains-a-given-element
v <- c('a','b','c','e')
'b' %in% v     ## returns TRUE
match('b',v)   ## returns the first location of 'b', in this case: 2

#  performance tuning

(1) vectorize
e.g.
vectorized code replaces explicit loops with whole-vector operations

a=1:10000000; res = 0
for(e in a) res = res + e^2     # slow: interpreted loop

sum(a^2)                        # fast: vectorized equivalent, same result

.Call()  # old native API
Rcpp     # package, there are others that let you load Cpp code

#  how to measure elapsed time

system.time(<R cmd>)
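For instance, system.time() can quantify the vectorization speedup from the earlier loop example (exact timings will vary by machine):

```r
a = 1:1000000
loop.time = system.time({       # explicit loop: interpreted one element at a time
  res = 0
  for (e in a) res = res + e^2
})
vec.time = system.time({        # vectorized: one call into optimized C code
  res2 = sum(a^2)
})
loop.time["elapsed"]            # typically much larger than vec.time["elapsed"]
```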

# to increase recursion limit

options(expressions=10000)

#####################################
####     visualization with R    ####
#####################################

there are more packages, but here we'll look at two.

(1) base graphics
- native package, already loaded by default. plot() followed by low level helper functions.
e.g.
plot(x=dataframe_foo$col1, y=dataframe_foo$col2)
title(main="col1 col2 relationship")
grid()
legend()
lines()

# good tutorial  http://www.harding.edu/fmccown/r/
# man page: https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/plot.html
# for parameters, see below
pch  http://www.endmemo.com/program/R/pchsymbols.php
type http://www.statmethods.net/graphs/line.html

plot(x,y,xaxt="n") # this prevents the plotting of x-axis labels

# more ref: https://www.r-bloggers.com/mastering-r-plot-part-2-axis/

(2) ggplot2
- implements grammar of graphics
- philosophically, ggplot2 tries to separate data and graphing
- (supposed to be) concise and maintainable code
[code]
install.packages('ggplot2')      # one time installation
library(ggplot2)                 # every time you use it
e.g.
qplot()    # simpler
ggplot()   # complex

qplot(x=x1, y=x2, data=myDataFrame, main="figure title", geom="point")  # point means scatterplot
ggsave("foobar", device="png")  # very convenient

ref: http://www.sthda.com/english/wiki/qplot-quick-plot-with-ggplot2-r-software-and-data-visualization
ref: http://stackoverflow.com/questions/7144118/how-to-save-a-plot-as-image-on-the-disk

###
###  types of graphs
###

(1) strip plot
- a plain x-y graph where x is index, and y is some value for each index
e.g.
library(ggplot2)
data(faithful)
plot(faithful$eruptions, xlab="sample number", ylab="eruption times (min)", main = "Old Faithful Eruption Times")

(2) histogram
- "graphs one-dimensional numeric data by dividing the range into bins and counting number of occurrences in each bin."
it bins the data into a certain data range. so you can see frequency(or probability) of occurrences for each range.
obviously, it is important to carefully choose the right bandwidth(bin size) for each range/bin.
e.g.
qplot(x=waiting, data=faithful, binwidth=3, main="Waiting time to next eruption")
ggplot(faithful, aes(x=waiting)) + geom_histogram(binwidth=1)

(2.5) smoothed histogram
- refer to the lecture videos/notes for the theoretical/mathematical details.
https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cse6242/course+notes/L2+Data+Visualization+-+Course+Notes.pdf

(3) line plot
- a graph displaying a relation between x and y as a line in a Cartesian coordinate system.
- The relation may correspond to an abstract mathematical function or to a relation between two samples (for example, dataframe columns)
- essentially a strip plot and connect each adjacent points.
e.g.
data(mtcars)
# create a vector of 30 equally spaced values between -2 and 2
x <- seq(-2,2,length.out = 30)
y<-x^2
# create line plot using qplot
qplot(x, y, geom = "line")
# create a line and point plot
# same as above but put in the points
qplot(x, y, geom= c("point", "line"))
# now do the same thing but use ggplot
myData <- data.frame(x=x, y=y)
ggplot(myData, aes(x=x, y=y)) + geom_line() + geom_point()

(4) scatter plot
- A scatter plot graphs the relationships between two numeric variables.
- It graphs each pair of variables as a point in a two dimensional space whose coordinates are the corresponding x, y values.
e.g.
## Plot a scatter plot using base graphics
plot(faithful$waiting, faithful$eruptions,
     pch=17,  # type of marker
     col=2,   # color
     cex=1.2, # size
     xlab="Waiting Times (min)", ylab = "Eruption Time (min)")

# Variable Relationships and Scatter Plots

- The relationship between two numeric variables and a categorical variable can be graphed using a scatter plot where the categorical variable  controls the size, color, or shape of the markers.
e.g.
x may be height, y may be weight, and each x,y pair you plot the gender of an individual.
==> you may observe something like male/female are distinctly separated in groups by height, etc etc

"correlation"
-- positive: the more x, the more y
-- negative(inverse): the more x, the less y
e.g.
high positive correlation means very rigorous correlation
low means loose

"multi-variable scatter plots"
--> when you want to plot the relationship between three numerical data sets.
e.g.  height,weight,salary
--> pick one for x, one for y, then the third col can be represented by the size of each dot(marker).
e.g.
# we are going to attempt to visually indicate the relationship of 3 separate data features across a 2D graph
# to do this we are going to use the marker size to indicate the 3rd feature - the cylinder
qplot(x=wt, y=mpg, data = mtcars, size=cyl, main = "MPG vs Weight (x1000 lbs) by Cylinder")

##
##  aside: Facets
##
Facets allow us to visualize more than two dimensions
- Similar to the technique we used to plot 3 concepts earlier in the lesson
Facets have two separate panels
- The two panels are aligned perfectly
- Each have same axis concepts and ranges
- Each have a unique 3rd concept that is clearly labeled in the title
e.g.
let's say you have car data with x=weight and y=miles_per_gallon,
and two distinct kinds of cars: hybrid and regular.
the two kinds can share x=weight, but MPG may differ significantly between hybrid and regular, so you may want to share the x-axis but plot two panels, each with its own y-axis.

# The argument facets in qplot or ggplot takes a formula a~b where a and b specify the variables according to which way the column and rows are organized.
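A minimal sketch using the built-in mtcars dataset (am is the transmission column, 0 = automatic, 1 = manual, used here as the faceting variable):

```r
library(ggplot2)
data(mtcars)
# one panel per transmission type, sharing the same x and y axes
qplot(x = wt, y = mpg, data = mtcars, facets = . ~ am,
      main = "MPG vs Weight, faceted by transmission")
```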

see example graphs in
https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cse6242/course+notes/L2+Data+Visualization+-+Course+Notes.pdf
https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cse6242/Student+Generated+Notes/DAVA2-DataVisualization.pdf

(5) Contour Plots
- graph relationships between 3 numeric variables
- Typically, 'z' is a function of x and y
An example of a contour plot is a topographic map (e.g. x,y geographical coordinate, plus z=height/altitude)
e.g.
# create a grid of x values
# create 100 points equally spaced between -1 and 1
xGrid <- seq(-1,1,length.out=100)
# create the y grid - since its identical to x, just copy x
yGrid <- xGrid
# create an expanded grid
R<-expand.grid(xGrid,yGrid)
# name the columns x and y
names(R) = c('x','y')
# compute the z values; since they should be circular, the z function / elevation is
# the square of the x-axis plus the square of the y-axis
R$z <- R$x^2 + R$y^2
ggplot(R,aes(x=x,y=y,z=z)) + stat_contour()

(6) Box Plots
- alternatives to histograms that are usually more lossy, in the sense that they discard more information.
- As a plus, they emphasize quantiles and outliers in a way that a histogram cannot
- histograms are good for showing the shape of the data (the modes etc), but a box plot is better for showing the median, IQR, whiskers, and outliers
- Box plots emphasize percentile information
The R percentile of a numeric dataset is the point below which approximately R percent of the data lies and above which approximately 100-R percent lies.
- Another name for the R percentile is the 0.R quantile (e.g. the 75th percentile is the 0.75 quantile)
The median (also: 2nd quartile) is the point below which 50% of the data lies.

quartiles split the data into 25% chunks (the 25th, 50th, and 75th percentiles)

IQR: the inter-quartile range, i.e. the 25th-75th percentile range (the middle 50% of the data)
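A base-graphics sketch on the faithful dataset; the box spans the IQR, the heavy line is the median, and points beyond the whiskers are drawn as outliers:

```r
data(faithful)
# box plot of waiting times between eruptions
boxplot(faithful$waiting,
        ylab = "waiting time (min)",
        main = "Waiting time between eruptions")
```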

# all sorts of sample code here
https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cse6242/course+notes/L2+Data+Visualization+-+Course+Notes.pdf
https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cse6242/Student+Generated+Notes/DAVA2-DataVisualization.pdf

(7) QQPlots
Quantile-quantile plots, or QQPlots, are scatter plots of quantiles of one dataset on the x-axis vs the quantiles of another dataset on the y-axis.
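A common special case compares one dataset's quantiles against the theoretical normal distribution, which base R supports directly; a sketch on the faithful dataset:

```r
data(faithful)
# sample quantiles vs normal quantiles; points far off the line suggest non-normality
qqnorm(faithful$waiting, main = "QQ plot: waiting times vs normal")
qqline(faithful$waiting)   # reference line through the 1st and 3rd quartiles
```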

# again, see example graphs/code here
https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cse6242/course+notes/L2+Data+Visualization+-+Course+Notes.pdf
https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cse6242/Student+Generated+Notes/DAVA2-DataVisualization.pdf

##################################
####   Data (pre)processing   ####
##################################

(recommended textbook chapter: quite good)

###
###  missing data
###

"absence of evidence is not the evidence of absence"

- data could be missing for many reasons.
- The presence of no data or missing data can sometimes convey some information about the data in itself which may be useful in the data analysis task

MCAR: missing completely at random
- a concept that states the probability of an observation being missing does not depend on observed or unobserved measurements

MAR: missing at random
- states given the observed data, the probability that data is missing does not depend on the unobserved data (i.e. missing data depends on observed data)

NOTE: in R, missing data is represented as NA
e.g.
is.na()           # returns TRUE if NA, FALSE otherwise
complete.cases()  # returns a vector whose components are FALSE for samples containing NA, TRUE otherwise
na.omit()         # returns a new dataframe omitting all samples containing NA
na.rm             # a parameter: if set TRUE, the function operates on the supplied data after removing all NA values

e.g.
mean(movies$length)               # avg length
mean(movies$budget)               # avg budget (NA if any budget is missing)
mean(movies$budget, na.rm = TRUE) # avg budget, ignoring missing values
mean(is.na(movies$budget))        # fraction of missing budget values
moviesNoNA = na.omit(movies)      # returns a dataset with all missing data removed.

e.g.
moviesNoNA = na.omit(movies)
qplot(rating, budget, data = moviesNoNA, size = I(1.2)) + stat_smooth(color = "red", size = I(2), se = F)

###
###  handling missing data
###
(1) remove rows with missing values
(2) fill the missing values with the mean/median from the rest. (substitution)
(3) build a more sophisticated probabilistic model to fill the missing values. (imputation)

MCAR: all three techniques are reasonable (though some are better than others)
MAR or non-MAR: these techniques may introduce systematic bias into the data analysis process.
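A minimal sketch of technique (2), mean substitution, on a toy vector:

```r
x = c(1, 2, NA, 4, NA, 6)
x[is.na(x)] = mean(x, na.rm = TRUE)   # fill each NA with the mean of the observed values
x                                     # 1.00 2.00 3.25 4.00 3.25 6.00
```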

###
###  outliers
###
(1) corrupted values  # e.g. human input errors
(2) unlikely values   # i.e. substantially unlikely value violating the modeling assumption. (can be correct data)
==> outliers may lead to drastically wrong conclusions

# quiz:  which is more robust?  mean or median ?   A: median
e.g.
39 72 63 81 46 777777 19 53    # outlier 777777 will mess up the mean

### Bias
- This seems to be a problem with the mean and outliers
- Could be system error if the data is corrupted
### Precision (uncertainty)
- Also referred to as random error
- This is the deviation of the data from the mean

###
###  handling outliers
###
(1) truncating: removing all values deemed outliers
(2) winsorization: shrink outliers to border of main part of data
(3) robustness: analyze the data using a robust procedure (like taking the median)

###
###  detecting outliers
###
there are a few common ways.
(1) values below the alpha percentile or above the 100 - alpha percentile
(2) values more than c times the standard deviation away from the mean (assuming the data has a gaussian distribution.)
==> but the calculation of the mean and the stddev will include outliers. so it's tricky.
(2.1) first remove extreme values,
(2.2) then use the percentile filtering
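A sketch of that two-step idea on the toy data from the mean/median quiz (the 5%/95% cutoffs and c = 3 are arbitrary choices for illustration):

```r
x = c(39, 72, 63, 81, 46, 777777, 19, 53)
# (2.1) first trim the extreme tails with percentile filtering
trimmed = x[x > quantile(x, 0.05) & x < quantile(x, 0.95)]
# (2.2) compute mean/sd on the trimmed data, then flag points beyond c = 3 sd
m = mean(trimmed); s = sd(trimmed)
x[abs(x - m) > 3 * s]   # 777777
```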

R function:  Winsorize()   # not in base R; available in add-on packages such as DescTools

###
###  skewness and transformations
###
- data is often skewed.
- there are a few kinds of transformations that map the data to a common distribution form that can be processed by common models.
- a suitable model then can be fitted to the transformed data

e.g.
power transformation family
e.g.
qplot(carat, price, size = I(1), data = diamondsSubset)
qplot(log(carat), price, size = I(1), data = diamondsSubset)
qplot(carat, log(price), size = I(1), data = diamondsSubset)
qplot(log(carat), log(price), size = I(1), data = diamondsSubset)

==> log-log scaling (that's a real term btw) gives a clean linear correlation, so you can do linear regression on log carat, analyze the result as log price, then convert back to price, etc.

qplot(log(carat), log(price), size = I(1), data = diamondsSubset)
===> also can be written as  qplot(carat, price, log="xy", size = I(1), data = diamondsSubset)

###
###  variables
###
(1) numeric: real numbers, Euclidean space
(2) ordinal: anything you can sort, including numbers, but also e.g. seasons (spring, summer, fall, winter)
(3) categorical: anything not numerical and not ordinal (like food: pasta, kebab, steak, etc)

###
###  binning (aka discretization)
###
- is a data transformation that takes a numeric variable (typically a real value, though it may be an integer), divides its range into several bins, and
replaces each value with a number representing its bin.
- Binarization: a special case of binning; it replaces a variable with 0 or 1 depending on whether the variable is smaller or greater than a certain threshold.
- Discretization can be done in R using the function cut()
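A small sketch of cut() binning hypothetical ages into three labeled ranges:

```r
ages = c(5, 17, 23, 41, 67, 80)
bins = cut(ages, breaks = c(0, 18, 65, 100),
           labels = c("minor", "adult", "senior"))   # bins are (0,18], (18,65], (65,100]
bins          # minor minor adult adult senior senior
table(bins)   # counts per bin
```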

###
### Why would we want to bin values
###
(1) Data reduction
- Storing bin values and processing them may be much faster than storing the original values while taking less space
- It can improve scalability of data analysis procedures that handle big data
(2) It can help capture nonlinear effects when using linear models
-- A common technique is to bin a variable, apply different nonlinear transformations to the bin values, and concatenate all values into a vector of measurements that is then fed into a linear model.
This can capture nonlinear relationships in the data using linear modelling tools.

###
###  indicator variable
###
replace a variable x (which may be numeric, ordinal, or categorical) taking k values with a binary k-dimensional vector v, such that v[i] (or vi in mathematical notation) is 1 if and only if x takes on the i-th value in its range.
- In other words, replace variable (that may take multiple values) by a vector that is all zeros, except for one component that equals 1 (corresponding to the index representing the original value of the variable).
Often, indicator variables are used in conjunction with binning
e.g.
suppose you have a big dataframe NYCrestaurant with many columns, maybe review_star(from 1 to 5), neighborhood, genre(italian, japanese, french, chinese, etc)
then for categorical variable (or ordinal), it's hard to apply all the useful statistical analysis, so an idea is to get a binary vector where each index corresponds to each possible value
e.g. suppose we only consider italian, japanese, french, chinese, and if it's french, we denote
[0,0,1,0]

==> advantages: (1) you can apply numerical analysis, and (2) even if the vector is long, i.e. high dimensional, it is almost all zeros (hopefully), so storage cost is minimal and computation is efficient.
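A sketch of building such indicator vectors for the hypothetical genre column; model.matrix() with a -1 in the formula produces one 0/1 column per level:

```r
genre = factor(c("italian", "french", "japanese", "french"),
               levels = c("italian", "japanese", "french", "chinese"))
indicators = model.matrix(~ genre - 1)   # one indicator column per level
indicators                               # each row has exactly one 1
```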

###
###  data manipulation
###
(1) shuffling:  sampling (using random permutation) to create train & test data sets, to avoid bias.
- concept of with or without replacement: without means do not allow the same element to be re-picked.
- in R, sample(k,k) generates a random permutation of order k
- Example in R
#create an array of 4 rows and 5 columns and insert the numbers 1-20
D <- array(data=seq(1,20,length.out = 20),dim=c(4,5))
#use sample() to re-order the dataframe
#the first 4 represents the numbers to use, and the second 4 indicates the number of times to draw a sample
#since we need without replacement, both numbers must be the same, which is the row count!
#note that it defaults to without replacement; to use with replacement use 'replace=TRUE'
D_Shuffled = D[sample(4,4),]
- Since we are using without replacement, the code simply shuffles the rows
- This is a common operation as we frequently want to take the data, randomly shuffle, then create a training and testing set.
In other words, this is helpful for cross validation. This is important because earlier measurements in the data may be different than later measurements.

(2) Partitioning (aka splitting)
In some cases we need to partition the dataset’s rows into two or more collection of rows
- For example, a training set and a testing set
-- Generate a random permutation of k objects (using sample(k,k)), where k is the number of rows in the data, then divide the permutation vector into two or more parts based on the prescribed sizes, and create new dataframes whose rows correspond to the divided permutation vector.
Splitting data into a training and testing set in R
- Uses the above example code to shuffle
- Code example
#create an array of 4 rows and 5 columns and insert the numbers 1-20
DataSet <- array(data=seq(1,20,length.out = 20),dim=c(4,5))
randomPermutation <- sample(4,4)
trainingIndices <- randomPermutation[1:floor(4*0.75)]      # 75%
testingIndices <- randomPermutation[(floor(4*0.75)+1):4]   # 25%
trainingData <- DataSet[trainingIndices,]
testingData <- DataSet[testingIndices,]

###
###  Tall data
###
Data in tall format​ (tall data​) is an array or data frame containing multiple columns where one or more columns act as a unique identifier and an additional column (or several columns) represents value.
e.g.
Date       Item    Quantity
---------------------------
2015/01/01 apples  200
2015/01/01 oranges 150
2015/01/02 apples  220
2015/01/02 oranges 130

===> the first two columns (Date & Item) constitute a unique identifier.

- Tall data is convenient for adding new records incrementally and for removing old records.
- The tall data format makes it hard to conduct analysis or summarization (e.g. it's hard to compute average daily sales with the tall data above).

###
###  Wide data
###
- transpose tall data
- pros/cons opposite to tall data
-- pros: simpler/easier to analyze.
-- cons: harder to add/remove entries.

e.g.

[tall data]                       [wide data]
Date       Item    Quantity       Date        apples  oranges
---------------------------       ----------------------------
2015/01/01 apples  200            2015/01/01  200     150
2015/01/01 oranges 150            2015/01/02  220     130
2015/01/02 apples  220
2015/01/02 oranges 130

###
###  reshaping data  # e.g. between tall and wide data
###
The R package 'reshape2' converts data between tall and wide formats
- melt() function accepts a frame in the wide format, and the indices of the columns that act as unique identifiers with the remaining columns acting as measurement or values. i.e. a tall version of the data is returned.

- acast(), dcast()  convert tall to wide. # acast() returns an array, dcast() returns a dataframe
The arguments are a molten (tall) dataframe, a formula a~b~c... where each of a, b, ... represents a sum of variables whose values will be displayed along the dimensions of the returned array or data frame (a for rows, b for columns, etc), and a function fun.aggregate that aggregates multiple values into a single value.
- This is needed in case multiple values map to a single cell in the new array/dataframe

Example
- using the 'tips' dataset from the reshape2 package.
qplot(total_bill, tip, data=tips, size=I(1.5), facets=sex~time)
The tips dataset has columns: total_bill, tip, sex, smoker, day( of week), time(lunch/dinner), size (of party)
- In previous lessons we used facets to break this out, but we can also do this in code
- We want to transform the data into a series of small tables that will show different
insights
- Code example
library(reshape2)
data(tips)
#this is only needed to get the data in tall format
tipsm<- melt(tips, id = c("sex","smoker","day","time","size"))
#get the mean of measurement variables broken down by sex
dcast(tipsm, sex~variable, fun.aggregate = mean)

###
###  split-apply-combine   # using plyr package
###
many data analysis operations on dataframes can be decomposed to 3 stages:
- split
- apply
- combine
===> use the "plyr" package

input \ output | array | dataframe | list  | discarded
---------------+-------+-----------+-------+----------
array          | aaply | adply     | alply | a_ply
dataframe      | daply | ddply     | dlply | d_ply
list           | laply | ldply     | llply | l_ply

e.g.
library(plyr)
names(baseball)
# count the number of players recorded for each year
bbPerYear = ddply(baseball, "year", "nrow")       # "nrow" is the name of the function applied per group
year nrow
1871    7
1872   13
1873   14
1874   17
..     ..

qplot( x = year, y = nrow, data = bbPerYear, geom="line", ylab="number of players per season")

===> as you can see above, the plyr package is very powerful.

# compute mean RBI for all years. summarize is the apply function which takes as argument a function that computes the RBI mean
bbMod = ddply(baseball, "year", summarize, mean.rbi = mean(rbi, na.rm = TRUE))
qplot(x=year, y=mean.rbi, data=bbMod, geom="line", ylab="mean RBI per season")

# Ozone example
- ozone is a built in data set in plyr
e.g.
library(plyr)
latitude_mean  = aaply(ozone, 1, mean)    # 1 = average over the 1st array dimension (latitude)
longitude_mean = aaply(ozone, 2, mean)
time_mean      = aaply(ozone, 3, mean)
longitude      = seq(along = longitude_mean)                                         # here do the same thing for latitude
qplot(x = longitude, y = longitude_mean, ylab = "mean ozone level", geom = "line")   # by sub'ing longitude with latitude
===> shows the mean ozone level for each longitude

months = seq(along = time_mean)
qplot(x=months, y=time_mean, geom="line", ylab="mean ozone level", xlab="months since jan 1985")
===> shows the mean ozone level for each month

###################################
#####   Logistic Regression   #####
###################################

logistic as opposed to numeric. given a small animal, if you want to predict whether it's a dog or a cat (a categorical label, as opposed to a numerical one), that's logistic regression. if you try to predict the weight (a numerical value, like a real number) of a given animal based on its gender, height, and width, that's linear regression.
in this context, "regression" simply means approximation/estimation/classification/labelling

the idea is: given some input data X (called independent variables, aka features), you train a model (aka coefficients, theta; in the logistic regression context, also called a classifier), so that computing theta * X = Y predicts an output Y (aka the dependent variable).

"In statistics, logistic regression, or logit regression, or logit model is a regression model where the dependent variable (DV) is categorical"
(ref) https://en.wikipedia.org/wiki/Logistic_regression

Classification is the most common problem in machine learning
- Classification​ is predicting a label associated with a vector of measurements
e.g. Critical to spam or fraud detection, click prediction, newsfeed ranking, digital advertisement, etc
- We will learn the theory of logistic regression​ (the most common classifier) and how to use it in practice

When is it a good time to use Logistic Regression?
- When there is a binary (or nominal​) outcome
- Example: User either clicks or does not click on a website
- Basically two possible outcomes (usually yes or no)
- Note that this is not multiclass classification; binary classification deals with exactly 2 choices

Logistic regression does have multiclass generalizations, but we will not cover them in this module
- Other examples of binary classification
-- Feed ranking
-- Recommendation Systems
-- Spam detection / filtering
-- Credit card fraudulent transaction
-- Medical testing (to determine illness)

###
###  definitions
###

(1) measurement vector
- denoted as "x"   e.g.   x = (x1, x2, x3, x4,,, xd)    # d dimensions
- each element in x is called a "feature"
- there are usually multiple vectors, so we denote x^i  as in x^i = (x^i_1, x^i_2, x^i_3,,, x^i_d)

(2) response variable
- aka label "y"
-  yes = 1,  no = -1        # sometimes no = 0
- we are trying to predict y
- there can be multiple labels, e.g.  y^i corresponds to x^i

(3) parameter vector
- denoted as "Theta"
- describes the classifier
- has the same dimension as x
- aka weights vector
- the inner product with x is important. e.g. <x,Theta> = x1 * Theta_1 + x2 * Theta_2 + ... xd * Theta_d
-- as in, each feature xd has a weight Theta_d

###
###  the classification task
###
- given a vector of features x, assign a label y
- The goal here is to take a bunch of historical data (a training set) that has x features with a y classification, and then devise a mathematical method to determine the label y for new measurement vectors x.
- This is done based on a training set
-- A training set consists of pairs of measurement vectors and response
e.g. (x1,y1), (x2,y2), ... ,(xn,yn)
- Note the number of pairs is denoted as ‘n’
-- This is gathered from historic data
- Training data is a table (dataframe) of rows representing training set examples and labels
e.g.
x1 | x2 | x3 | x4 | y
---------------------
.. | .. | .. | .. | click
.. | .. | .. | .. | not-click

This is an example of ML applied to predicting whether a user will click on an ad in LinkedIn
- X = vector of features related to the specific user
-- These can be specific words, phrases, or topics
-- Can be length
-- Can be the relationship of the person who posts the feed to the user who views the feed (or their interaction counts)
- Y = the user clicks or does not click

==> obviously, the idea is you train the classifier, then you use it for future data.

###
###  Linear Classifiers
###
- Linear classifiers have a prediction rule with the following algebraic form:
Y = sign(<Θ,x>)
-  <Θ,x> is the inner product
-  Θ is the parameter vector describing the classifier; it’s the same for all x vectors!
- The sign function works as follows
If the dot product of <Θ,x> is positive, Y is assigned to be ‘+1’
If the dot product of <Θ,x> is negative, Y is assigned to be ‘-1’
If the dot product of <Θ,x> is zero, Y can be either positive or negative (not fully defined)

===> obviously, getting a good Θ is important.
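The prediction rule above can be sketched in a few lines of R (the theta and x values below are made-up toys, not from the lecture; the zero case is assigned +1 here for simplicity):

```r
# linear classifier prediction rule: y = sign(<theta, x>)
predict.label = function(theta, x) {
  s = sum(theta * x)          # the inner product <theta, x>
  if (s >= 0) 1 else -1       # sign; the s == 0 tie is assigned +1 here
}

theta = c(0.5, -1.0, 2.0)     # toy parameter vector
x     = c(1.0,  0.2, 0.3)     # toy measurement vector
predict.label(theta, x)       # <theta,x> = 0.9 > 0, so +1
```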

###
###  Why linear classifiers?
###
- They are easy to train. (train speed is not always important)
- The prediction(query) can be very quick.  i.e. LCs predict labels very quickly at serve (query) time, which is often crucially important.
- These can be parallelized (if the dimensionality is high) so different processors can hold/compute subparts of the inner product.
- These can also be fast if the parameter vector OR the measurement vector is mostly zeros.
--> A vector that is mostly zeros is called a sparse vector
- In many industries, the prediction time is far more important than the training time; query speed is usually crucial.
- They have a well-established statistical theory behind them; this theory leads to effective modeling strategies
--> Helps us understand which classifiers to use and which to not use
- Linear classifiers are particularly useful in high dimensions (a high ‘d’ value); they excel at high dimensions due to:
--- Their simplicity/scalability
--- Attractive computational load (even with millions of features)
--- Nice statistical properties
- For the visualization of data, assume dimension = 2    # typically it's much higher

##
##  The Linear Plane
##
- Linear classifiers are called linear because the decision boundary - the region that separates what is predicted positive from what is predicted negative - is a flat shape
- In 2D, it’s a line  (anything on one side is negative, anything on the other side is positive)
- In 3D, it’s a 2D plane
- In the general case of d dimensional space, it’s a d-1 dimensional hyperplane

Given a vector of two dimensions: x = (x1, x2)
- The classification will be: sign(ax1 + bx2 + c)   # c is what is known as an offset term
- Or, using theta: sign(Θ1*x1 + Θ2*x2 + Θ3)
- Note that technically, c (the last Θ) is not represented in the x vector; a workaround is to pad the x vector with a 1 at the end so the last Θ (the constant) can be used.

sign(ax1 + bx2 + c)   |  classification
------------------------------------------
-1           |  negative
+1           |  positive

When we take the inner product between the vector of measurements and the weight vector - IF the weight vector is normalized (meaning it has unit length) - the result is the projection of the input vector onto that unit (weight) vector.

The classification decision does not change whether you renormalize theta or not
- That said, the geometric interpretation is easier to see if the theta vector is normalized
- The decision boundary (the line that separates pos vs neg) will not change if we multiply theta by a positive constant scalar (we get the same decision boundary and the same prediction rule)
- The unit vector is the theta vector that corresponds or describes that projection direction
- The corresponding inner product would capture the distance of the point from the hyperplane (in the second graph) which will correspond to the decision boundary

The decision boundary, in 2D space, is the line that separates pos and neg predictions.
The decision boundary, in general d-dimensional space, is a hyperplane of d-1 dimensions
- It’s called a hyperplane instead of a line because it’s a line only in 2D; it’s a 2D plane in 3D, and so on
- This decision plane / hyperplane describes the classifier
- The theta vector​ is perpendicular (aka orthogonal​) to the decision boundary
- The projection of each point onto the theta vector shows how far away it is from the decision boundary

NOTE: Prediction is only the first part – we also want to predict some confidence or probability in our prediction rule
-- We need to map the inner product into a confidence / probability
-- One function that can do this is the sigmoid function
-- A sigmoid function is one way of getting a probabilistic classifier that doesn’t just predict whether a label is +1 or -1; it also predicts a probability associated with that fact
- Knowing this probability can be helpful, as it can help us rank things
-- for example, ad space on a webpage – we can estimate the ‘most likely to be clicked’ links to be displayed first

###
###  Bias Term
###
- The bias term​ is ‘c’ which was mentioned earlier
-- This was mentioned in the discussion regarding padding the x vector with a 1 at the end; basically, it’s a theta with an artificially added corresponding x value (that additional x_d+1 being always 1)
- The bias term is useful as this is what allows the decision boundary to not be required to pass through the origin
- Recall that the trick to adding the bias term is adding the bias term as the ‘last’ theta and then adding a 1 to the end of the measurement vector (x vector)
- This trick becomes critical later, as there is some math involved that needs the bias term; if this trick is not done those math problems become more complicated
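A minimal sketch of the padding trick in R (the theta and x values below are made up for illustration; theta's last component plays the role of the bias c):

```r
# bias-term trick: append a constant 1 to x so the offset c becomes just another theta entry
theta = c(2, -1, 0.5)     # last component is the bias term c (made-up values)
x     = c(1, 3)           # original 2-dimensional measurement vector

x.padded = c(x, 1)                # pad with 1
score    = sum(theta * x.padded)  # 2*1 + (-1)*3 + 0.5*1 = -0.5
sign(score)                       # -1: the boundary no longer has to pass through the origin
```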

###
###  Increasing Data Dimensionality
###
- The biggest drawback of linear classifiers is the decision boundary; it's quite possible the data cannot be divided by a hyperplane (that is to say, the points are intermingled), i.e. one flat plane (a 2D hyperplane) in 3D space may not really let you accurately separate/predict pos & neg.
- There is a trick to overcome this while keeping all of the benefits of a linear classifier (simplicity, scalability, etc)
-- For example, assume we have a two dimensional data vector; convert it to a six dimensional vector.
We transform x to x^ which has 6 dimensions
-- In the 6 dimensional space, the decision boundary will now be a hyperplane (5D)
-- In the original space (2D) the decision boundary could be highly nonlinear, but using this trick helps mitigate that.
-- This helps us capture nonlinear trends when we use linear classifiers.

- Increasing the dimensions is worth noting as now there are more computations and more storage is needed
-- That said, we can work with this increase so long as the parameter or feature vectors are mostly zero
- It should also be noted that this increases the dimensionality of theta, to keep up with the dimension of x vector
-- This may lead to overfitting if we do not have enough data in our training set!
In summary
- It is possible to transform the data from x to x^, where the classifiers (y) stays the same
-- The original data must be transformed to the transformed vector; this also means data we wish to make predictions from must also be transformed
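As a concrete sketch in R (the specific quadratic feature set below is an assumption for illustration, not given in the lecture):

```r
# transform x = (x1, x2) into a 6-dimensional x^ so that a linear classifier
# in the new space corresponds to a nonlinear (quadratic) boundary in the original 2D space
expand.features = function(x) {
  c(x[1], x[2], x[1]^2, x[2]^2, x[1] * x[2], 1)
}

expand.features(c(2, 3))    # 2 3 4 9 6 1
```

Note that the same transformation must be applied both at training time and at prediction time.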

###
###  Classifiers
###
- Classifiers define a map from a vector of features x to a label
-- Sometimes we get a confidence, and sometimes that confidence is also the probability that the label is 1
- Probabilistic classifiers provide that tool by defining the probabilities of the labels +1 and -1 given the features of vector x
- We know from probability theory that
p(y = +1|x) +  p(y=-1|x) = 1
- That is to say, the probability of y being +1 (given x) and y being -1 (given x) will equal 1
-- Because of this, the probabilistic classifier should give us a way of either measuring the first OR second value
-- The other value we can just get by subtracting the first number from 1
- The probability that a given element of vector x will be classified as 1:   P_Θ (Y=1 | X=x)
- The probability that a given element of vector x will be classified as -1:  P_Θ (Y=-1 | X=x)

###
###  Maximum Likelihood Estimator (MLE)
###

MLE is one of the most well-known methods of training probabilistic classifiers
- Start with a vector of points in a Euclidean space (same as the other x measurement vector)
- Maximum likelihood assumes a specific mapping from theta to the probability that Y = 1 (or -1)
-- We will use logistic regression as this mapping

- The thinking behind the MLE: Give me a hyperplane that will maximize the likelihood of successfully classifying/explaining/modeling the data
- If we had a hyperplane that explained the data very well (in other words, maximizes the likelihood of the data) this will probably be a good classifier of the data
- Two statistical philosophies:
(1) Frequentists philosophy
(2) Bayesian philosophy

MLE is an example of frequentists philosophy
- The frequentists philosophy is a philosophy in statistics that talks about nature being something that we are trying to predict
--> There is one correct state of nature that we do not know
--> The task of the statistical procedure is to predict that state of nature using observations
i.e.
- Frequentists Maximum Likelihood Estimator uses pairs of feature vectors and labels:
((x1, y1), (x2, y2),,, (xn, yn))
usually from historic data to estimate correct classifier or nature.
- Frequentists philosophy is not bound to only the MLE
- Note that we will focus on the MLE for this course as it is the dominant philosophy in the industry at the moment.

Bayesian statistics is an alternative to MLE and frequentists philosophy
- Bayesian statistics claims that a single classifier cannot represent the ‘truth’. Estimate the revised probability that each classifier is correct and use them all.
i.e.
- There is no single correct state of nature that generates the data
- Therefore, there is no point in thinking about a specific model that generates data that we will try to predict as there is a collection of different models, with each model being correct with some probability.
-- We then use all of the models together to inform our decision making, with some probability assigned to each model.

Comparing both philosophies
- Frequentists philosophy
-- The industry still heavily relies on Maximum Likelihood Estimators
- Bayesian philosophy
-- Gaining ground lately
-- ML is starting to rely more on Bayesian
-- Takes more resources and time to produce (when compared to MLE)

###
###  MLE defined
###

maximize probability P as a function of Theta(=classifier)    # i.e. find Θ that maximizes likelihood of y given x, i.e. find Θ that best explains the data.

ΘMLE = argmax p_Θ(Y=y^1 | X=x^1) * p_Θ(Y=y^2 | X=x^2) * ... * p_Θ(Y=y^n | X=x^n)
         Θ

- That is to say, the θ that maximizes the likelihood of the data (the argmax θ)
- The likelihood of the data is the product of the conditional probabilities of the labels given the feature vectors
-- We take a product over all of the probabilities of the training data pairs
-- The product is a likelihood, and it is a function of θ (because θ describes the probabilistic classifier)
-- Because it’s a function of θ, we can investigate which θ maximizes it and then take that maximizer as the classifier we will use (which is the MLE)
--- This θ we can use later in prediction/query time so we can classify new vectors/data

We also can use the log likelihood as below:

ΘMLE = argmax log(p_Θ(Y=y^1 | X=x^1)) + log(p_Θ(Y=y^2 | X=x^2)) + ... + log(p_Θ(Y=y^n | X=x^n))
         Θ

As it turns out, this is useful because log is a monotonically increasing function: the maximizer of the product is the same as the maximizer of the log of the product.

- The maximum values of the product and of the log of the product will differ, but the maximizer (the θ that maximizes the product and the θ that maximizes the log of the product) will be the same
-- In other words, the θ itself will not change; the value of the maximized result WILL change, but the θ used to create that maximum will not change

- The reason we use the log is the log of a product is the sum of the logs (computationally simpler!)
-- This simplifies the expression, especially later when we want to compute derivatives
-- We will be using the log likelihood and not the likelihood as the log likelihood is far superior for the task at hand

- The MLE is the maximizer of the likelihood (which is the same as the maximizer of the log likelihood, which is computationally more convenient)
- Justification for using the Maximum Likelihood Estimator
-- It converges to the optimal solution in the limit of large data (known as the consistency property​)
--- Disclaimer: Data is generated based on the logistic regression model family and n (number of rows in training data) approaches infinity while d (number of columns) is fixed
-- The convergence occurs at the fastest possible rate of convergence (statistical theory​)
--- Note that when the dimension count is high, some of these claims do not hold anymore as they assume d is fixed and n goes to infinity
--- In practice, d can be billions of features – greater than n even – which means this breaks down (recall the curse of dimensionality which claims you need a certain number of rows more than d)

MLE Quiz
- Question: Describe a computational procedure for finding the value x for which f(x) is at a maximum. Does it scale to high dimensions of x?
Procedure
-- Compute f(x) on a grid of all possible values and find the maximum
-- Do this for scalars x or low dimensional vectors x
(note: this does not scale to high dimensions. An alternative technique that does scale is gradient ascent)
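The quiz procedure, sketched in base R on a made-up 1D function f(x) = -(x - 2)^2 (function and grid range are arbitrary choices for illustration):

```r
# grid search: evaluate f on a grid of candidate values and keep the argmax.
# fine for scalar x, but the grid size grows exponentially with the dimension of x
f = function(x) -(x - 2)^2                 # toy concave function, maximized at x = 2

grid   = seq(-10, 10, by = 0.01)           # all candidate values
x.best = grid[which.max(f(grid))]          # approximately 2
```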

note:  we mean  log likelihood == log of likelihood

###
###  Probabilistic Classifiers
###
- Logistic Regression is the most popular probabilistic classifier (which is a model/mapping from a parameter vector Theta into a probability for y given x)
- it's scalable (cos it's a linear classifier), simple, thus popular.

Definition of Logistic Regression:    # it's just a probability estimator

                      1
P(Y=y | X=x)  =  ----------------  =  1 / (1 + exp(y<Θ,x>))
                 1 + exp(y<Θ,x>)

- The probability of Y being either +1 or -1  conditioned on a vector X taking on a specific value little x
- little y is the value that big Y (the random variable) takes
recall
(1)  y is either +1 or -1
(2)  <θ,x> is the inner product of θ and x

note:  y<Θ,x>  literally means y times <Θ,x>
note:  exp(a) means e^a  where e is Euler's number.   # pronounced "oiler", e = lim (1+1/n)^n as n → ∞

The only unknown is the vector θ
- Once this is known, logistic regression will give you a probability for y equals +1 or -1 given any vector of measurements x

- two quick checks on the equation:  is it really a probability estimator ?
(1) probabilities sum up to 1 ?
yes. here is a proof.
first, Y = +1 or -1, thus P(Y=1|X) + P(Y=-1|X) = 1

        1                    1
----------------  +  -----------------  =  1     # just simple algebra
1 + exp(<Θ,x>)       1 + exp(-<Θ,x>)

(2) probability always non negative ?

exp(a) will be positive for any value of a, so 1/(1+exp(y<Θ,x>))  will be always positive.

note:  neg log likelihood = log(1+exp(yi<θ,xi>))     # aka loss function, aka cost function
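Both checks can be verified numerically in R (using the notes' sign convention for the exponent; 1.7 is an arbitrary stand-in value for the inner product):

```r
# logistic regression probability, P(Y=y | X=x) = 1 / (1 + exp(y*<theta,x>))
p = function(y, score) 1 / (1 + exp(y * score))   # score stands for <theta, x>

score = 1.7
p(1, score) + p(-1, score)    # check (1): the two probabilities sum to 1
p(1, score) > 0               # check (2): always positive
```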

Quiz: Decision Boundary
- Question:  Where should the decision boundary be placed
- Answer:  The set of points where p(Y= 1|x) = p(Y= -1|x) = 0.5     # i.e. separating pos & neg

##
##   Prediction Confidence
##
in summary

------------------------------------------------------------------------------------------------------------
predict the label y associated with a feature vector x | prediction rule:  sign(<θ,x>)
measure the confidence of that prediction              | logistic regression: P(Y=y|X=x) = 1/(1+exp(y<θ,x>))
------------------------------------------------------------------------------------------------------------
note:
- this assumes you have figured out the vector θ
- Not only can this give a confidence, this has a direct interpretation as probability

NOTE from the forum: "Loss function" aka "cost function" is a general term in the context of optimization.  It means "the function you're going to minimize".  It's analogous to "objective function" which means "the function you're going to maximize".
(ref) https://en.wikipedia.org/wiki/Loss_function
In our context (and in maximum likelihood estimation in general), the loss function is the negative of the log likelihood of the training set given the model.

##
##  MLE and iterative optimization
##

recall the definition of Logistic Regression:   # how do we find the best Θ that maximizes the P(Y|X) ?  # i.e. Θ that best explains the data

                      1
P(Y=y | X=x)  =  ----------------
                 1 + exp(y<Θ,x>)

also recall MLE definition:                     # well simply use MLE

ΘMLE = argmax log(p_Θ(Y=y^1 | X=x^1)) + log(p_Θ(Y=y^2 | X=x^2)) + ... + log(p_Θ(Y=y^n | X=x^n))
         Θ

===> combining the above two, we can simply derive:

 ^             n          1
 ΘMLE = argmax Σ log ------------------        # maximizing the log likelihood
          Θ   i=1    1 + exp(yi<Θ,xi>)         # log(a/b) = log(a) - log(b),  log(1) = 0
                                               # so we can rewrite this as below
               n
      = argmax Σ -log[ 1 + exp(yi<Θ,xi>) ]     # i.e. maximizing the negative of the negative log likelihood
          Θ   i=1                              # because  log[ 1 + exp(yi<Θ,xi>) ]  is the negative log likelihood

               n
      = argmin Σ log[ 1 + exp(yi<Θ,xi>) ]      # re-written as minimizing the negative log likelihood
          Θ   i=1                              # how to compute this? evaluating all possible Θ is unrealistic for
                                               # high-dimensional Θ, so we use Gradient Descent, which is just calculus
##
##  Gradient Descent
##

recall from your glorious multivariate calculus days,
- gradient is just "a multi-variable generalization of the derivative. While a derivative can be defined on functions of a single variable, for functions of several variables, the gradient takes its place. The gradient is a vector-valued function, as opposed to a derivative, which is scalar-valued."
- gradient descent is a first-order iterative optimization algo for finding the minimum of a function.
-- if your goal is to find the maximum, then gradient ascent.

(1) initialize Θ to random values               # let's say d = Θ dimension
(2) for j = 1 to d, update:

                    ∂    n
        Θj = Θj - α ---  Σ log[1 + exp(yi<Θ,xi>)]      # if you are so inclined, you can solve this
                    ∂Θj i=1                            # partial derivative as below; it's really just a calculus exercise

                     n   exp(yi<Θ,xi>) * yi * xij
           = Θj - α  Σ  --------------------------
                    i=1     1 + exp(yi<Θ,xi>)

(3) repeat (2) until the updates become smaller than a threshold. (or stop after a certain iteration count)

alpha α is called "step size": kind of like learning rate aka decay factor.
- we typically decay α as iteration goes on.
- there are certain theories on how α should be decided. e.g. sqrt(j) but we don't go into details in this course.
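The steps above can be sketched in base R on a tiny made-up dataset (the data, step size, and iteration count are all arbitrary; the last column of 1s is the bias-term trick, and the sign convention follows these notes):

```r
# gradient descent on the negative log likelihood  sum_i log(1 + exp(yi*<theta,xi>))
X = matrix(c( 1,  2, 1,
              2,  1, 1,
             -1,  0, 1,
              0, -2, 1), ncol = 3, byrow = TRUE)   # rows are padded measurement vectors
y = c(1, 1, -1, -1)                                # labels

nll = function(theta) sum(log(1 + exp(y * (X %*% theta))))

theta = c(0, 0, 0)                        # step (1): zeros instead of random, for reproducibility
alpha = 0.1                               # step size, chosen arbitrarily
for (iter in 1:100) {                     # step (3): fixed iteration count instead of a threshold
  s     = as.vector(X %*% theta)          # <theta, xi> for every i
  w     = exp(y * s) / (1 + exp(y * s))   # the per-example factor from the partial derivative
  grad  = as.vector(t(X) %*% (w * y))     # step (2): all d partial derivatives at once
  theta = theta - alpha * grad
}
nll(theta) < nll(c(0, 0, 0))              # TRUE: the objective decreased
```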

in logistic regression, the log likelihood is concave, so gradient descent on the negative log likelihood (equivalently, gradient ascent on the log likelihood) will find the single global optimum.
note: it's possible that for some j, the optimum is at Θj = +∞ or -∞, meaning that simply by ever increasing/decreasing Θj the likelihood keeps improving, in which case we need some criterion to decide when to stop iterating. (note: when this happens, it's often overfitting, so we need regularization)

note: to give more intuition on step (2),  # here, derivative and gradient are used interchangeably.
recall  log[1 + exp(yi<Θ,xi>)]  is the negative log likelihood.
recall we want to find the Θ that minimizes the negative log likelihood.
so it's just calculus: take the partial derivative of the equation w.r.t. Θ. the "log" is monotone, so it doesn't change where the optimum is.
intuitively, using the below ascii sketch of y = Θ^2, there is a global minimum. a bigger-than-optimal Θ gives a positive derivative,
and a smaller-than-optimal Θ gives a negative derivative. since you want to adjust Θ iteratively to arrive at the optimal Θ, you update Θ = Θ - α*gradient (hence gradient "descent").
the step size α is there to control the adjustment ratio, otherwise you may adjust too much/too little. (again, we don't go into the black art of how to decide/adjust α in this course)
again, if the purpose is finding the Θ that maximizes the function, then we do Θ = Θ + α*gradient (hence gradient ascent).

\       /
 +     +
  +   +
   +_+

##
##  Stochastic Gradient Descent (SGD)
##

(1) initialize Θ to random values               # let's say d = Θ dimension
(2) randomly pick one labeled data vector (xi, yi)
    for j = 1 to d, update:

                    ∂ log[1 + exp(yi<Θ,xi>)]
        Θj = Θj - α -------------------------
                              ∂Θj

(3) repeat step (2) until the updates are smaller than a threshold. (generally reduce alpha at each iteration)

note: notice how each iteration only looks at one labeled data vector (one instance), as opposed to the entire dataset (all n instances), so each SGD iteration is computationally faster, and thus SGD usually converges more quickly.

==> thus we say

amount of data (size of n) | preferred technique
-------------------------------------------------
massive      |  SGD
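A minimal base-R sketch of SGD on a tiny made-up dataset (data, step size, and iteration count are arbitrary choices; each update uses a single randomly chosen example):

```r
# stochastic gradient descent: each update looks at ONE randomly chosen (xi, yi)
set.seed(1)                                        # for reproducibility
X = matrix(c(1, 2, 1,  2, 1, 1,  -1, 0, 1,  0, -2, 1), ncol = 3, byrow = TRUE)
y = c(1, 1, -1, -1)
nll = function(theta) sum(log(1 + exp(y * (X %*% theta))))

theta = c(0, 0, 0)
alpha = 0.1
for (iter in 1:500) {
  i     = sample(nrow(X), 1)                       # pick one labeled example
  s     = sum(X[i, ] * theta)                      # <theta, xi> for that example only
  w     = exp(y[i] * s) / (1 + exp(y[i] * s))
  theta = theta - alpha * w * y[i] * X[i, ]        # single-example gradient step
}
nll(theta) < nll(c(0, 0, 0))                       # the full objective still decreased
```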

##
##  Overfitting
##

- intuitive: if you fit your model (theta/classifier) too perfectly to your training measurement (x) vectors/data, then your model will probably not perform well on the test data.
- using logistic regression, overfitting is more likely to occur in high dimensionality.

##
##  Regularization
##

(notes) https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cse6242/Student+Generated+Notes/4-LogisticRegression.pdf

- one way to tackle overfitting is regularization, a technique that adds penalty (Beta) terms to the MLE cost function.
- the Beta penalty discourages extreme theta (super positive/negative theta) to avoid overfitting.
(obviously too much of it leads the model to underfit)
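As a sketch (an L2 penalty is used here as one common choice; the penalty weight, called lambda below, plays the role of the Beta coefficient in these notes, and the data is made up):

```r
# regularized cost: negative log likelihood plus a penalty on extreme theta values
lambda = 0.5                               # penalty weight, chosen arbitrarily

regularized.nll = function(theta, X, y) {
  sum(log(1 + exp(y * (X %*% theta)))) + lambda * sum(theta^2)   # L2 penalty
}

X = matrix(c(1, 2,
             2, 1), ncol = 2, byrow = TRUE)   # toy data
y = c(1, -1)
regularized.nll(c(0, 0), X, y)     # = 2*log(2): the penalty term is zero at theta = 0
regularized.nll(c(10, -10), X, y)  # an extreme theta pays a large penalty
```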

######################################
####    (5)  Linear Regression    ####
######################################

(slides) https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cse6242/course+notes/L5+Linear+Regression+-+Course+Notes.pdf
(notes) https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cse6242/Student+Generated+Notes/5-LinearRegression.pdf

- Linear Regression is the task of predicting a real value based on a vector of measurements
-- should be called "numerical approximation"
-- an important ML algorithm
-- used heavily in finance, demand forecasting, and pricing strategies
- We will learn the theory of the most common regression model and how to use it in practice using R

note: LR is a probabilistic model which does not make specific deterministic predictions but rather predictions about the probability of Y given X

##
##  Linear Regression model
##

(see video for maths equations) https://www.youtube.com/watch?v=Yz53nrdfCfo

                     d
essentially:   Y  =  Σ  Θi * Xi  +  ε   =   Θ^t * X + ε
                    i=1

where
Y: a random variable, scalar, Y (real number)
X: a random vector X = (x1,x2,x3,,,,xd)
Θ: a weight vector you train
ε: a Gaussian random variable with mean zero and variance σ^2 (representing noise)
(This is an assumption that linear regression makes, but it may not represent reality)
NOTE: LR does not attempt to model the distribution of X, i.e. it doesn't model p(X); it tries to model Y given a particular X, i.e. it models p(Y|X)

###
###  Linear Transformation
###

LR is useful.
But, to be able to apply the linear regression algo, you need Y & X represented in a linear combination format.
So you need to transform the original non-linear data X' -> X for which a linear relationship exists between Y & X.
Here, you just need to familiarize yourself with the language.
- you basically did a "non-linear" transformation X' -> X to increase the linear relationship between X & Y
- in other words, we say X = (x1,x2,,,xd) may be non-linear functions of some original data X' = (x'1,x'2,,,,x'd)
- in other words, the linear regression model Θ^t * X is linear in X, but non-linear in the original data X'

##
##  Residual plot
##
- given a pair of Y & X as your training data, you train a model, then apply the same X to get your predicted value Y'
- then you plot Y' - Y, called the residual value/plot. obviously a big residual value means a bad prediction.
- if the residual plot shows a pattern, your LR is not working well, because your model didn't capture that pattern.
- if the residual plot shows no pattern, just random scatter, your LR likely worked.

homoscedasticity: variance of residual values is equal for diff values of X
heteroscedasticity: variance of residual values is unequal for diff values of X
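A base-R sketch on made-up data where the true relationship really is linear, so the residual plot should show no pattern:

```r
# fit a line, then plot the residuals against X
set.seed(42)
x = runif(100, 0, 10)
y = 3 + 2 * x + rnorm(100)       # truly linear data plus Gaussian noise

M   = lm(y ~ x)
res = resid(M)                   # residuals (observed minus fitted values)
plot(x, res, ylab = "residual")  # random scatter around 0 => the linear model is adequate
abline(h = 0)
```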

###
###  Training Data
###

two ways to get data
- observational data: like temperature, nature, you cannot control p(x) e.g. you cannot control the weather/temperature/humidity, etc
- experimental data: like planting crops, you can control p(x)  e.g. you can control where you plant crops, which regions, how, etc

###
###  Minimizing the sum of square deviations
###

                          n
RSS(Θ) = || Y - XΘ ||^2 = Σ (Y^i - Θ^t * X^i)^2
                         i=1

^
Θ = argmin RSS(Θ) = argmax Σ log p(Y^i | X^i)
      Θ               Θ    i

===> optimal Θ can be obtained by minimizing the sum of square deviations(aka residuals)
===> notice we can use the maximizing-the-log-likelihood technique again. generally this computation is faster in linear regression than in logistic regression because, as the below video shows, it's all simple matrix computations.
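Those "simple matrix computations" can be sketched directly: minimizing RSS has the closed-form solution Θ = (X^t X)^{-1} X^t Y, the normal equations (the data below is made up):

```r
# closed-form least squares via the normal equations
set.seed(7)
X = cbind(1, runif(50, 0, 10))               # 50 rows; the first column of 1s is the bias trick
theta.true = c(2, 3)                         # made-up true parameters
Y = X %*% theta.true + rnorm(50, sd = 0.1)   # linear signal plus small noise

theta.hat = solve(t(X) %*% X, t(X) %*% Y)    # solves (X^t X) theta = X^t Y
theta.hat                                    # close to (2, 3)
```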

###
###  coefficient of determination
###
- recall RSS was one way to determine the quality of prediction
- another way is to use R^2, aka the coefficient of determination
- R^2 is the square of the sample correlation coefficient between the values Y^i and the fitted values Ŷ^i = Θ^t * X^i

the output of R^2 is between 0 & 1, 1 being perfect prediction, and 0 meaning no predictive power

##
##  Fitting Linear Models  in R
##

==> we will look at lm()

(doc)  https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html

Description: lm() is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these).

Usage

lm(formula, data, subset, weights, na.action,
method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...)

e.g.
lm(linear model formula, dataframe_of_both_Y_&_X) --> model object M that can be queried
predict(M, dataframe without labels Y) --> makes prediction of Y
coef(M) --> gives model parameters. this is Theta

##
##  using lm()
##

library(ggplot2)
data(diamonds)

diamSmall = diamonds[sample(1:dim(diamonds)[1], 500), ]   # a random sample of 500 rows
Price = theta_1 + theta_2 * carat + epsilon
M1 = lm(price~carat, diamSmall)   # see the below video for the syntax of lm()
theta = coef(M1)
predict(M1,data.frame(carat=c(3,4,5)))
summary(M1)

(a must see good example R video)    https://www.youtube.com/watch?v=rVhInIhDXIc
(a must see good syntax lm() video)  https://www.youtube.com/watch?v=F5Xvtfs4QFg

##
##

Price = theta_1 + theta_2 * carat + theta_3 * carat^2 + theta_4 * carat^3 + epsilon
M2 = lm(price~carat+I(carat^2)+I(carat^3), diamSmall)
theta = coef(M2)

(how to improve the model) https://www.youtube.com/watch?v=RvgwjlTi8nA

##
##
- adding more X vectors, like carat + color

e.g.

M3 = lm(price~carat+color, diamSmall)
theta3 = coef(M3)

##
##  log model
##
- rather than predicting price as a linear function of carat, we predict log(price) as a linear function of log(carat)

e.g.

M4 = lm(log(price)~log(carat), diamSmall)
theta4 = coef(M4)

##
##  comparing models
##
- look at R^2 and RSS.
-- a simple linear model
-- a nonlinear transformed model
-- a linear model that adds more X vectors (explanatory variables)
-- a log transformed model (there are other mathematical transformations that might help improve R^2 and RSS)

cor() computes the sample correlation; cor(actual, fitted)^2 gives R^2
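A sketch of such a comparison on made-up data where the true relationship is quadratic, computing R^2 as the squared correlation between actual and fitted values:

```r
# compare a plain linear fit against a nonlinear-transformed fit by R^2
set.seed(3)
x = runif(200, 0, 5)
y = 1 + x^2 + rnorm(200, sd = 0.5)     # the true relationship is quadratic

M.linear    = lm(y ~ x)                # misspecified model
M.quadratic = lm(y ~ x + I(x^2))       # matches the true form

r2 = function(M) cor(y, fitted(M))^2   # coefficient of determination
r2(M.linear) < r2(M.quadratic)         # TRUE: the better model has higher R^2
```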

##################################
####   (6)  Regularization    ####
##################################

(notes) https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cse6242/Student+Generated+Notes/6-Regularization.pdf
(slides) https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cse6242/course+notes/L6+Regularization+-+Course+Notes.pdf

- We will learn about modeling in high dimensions
- What we have learned breaks down in high dimensions
- Overfitting and underfitting can be a problem
- Techniques to combat these problems
-- Shrinkage
-- Ridge Regression
-- Lasso Regression

##
##  What is Regularization?
##

Problems with overfitting the data
- Models can over-generalize / overfit attributes that are too specific
- The classification rule is relatively complex and places too much emphasis on an irrelevant attribute
-- Even though the training data works, it will not generalize well to new data
-- Once we allow complex rules, we can overfit to the training data
-- Sometimes, simpler rules are better

Regularization discourages complexity in the prediction logic that is learned from the training data
- Regularization will increase the mistakes on the training set but decrease them on the test set

##
## Model Complexity
##
The case of Mean Squared Error – the average of the squared error between the true parameter values and the predicted parameter values

MSE = bias^2 + variance + irreducible error

where bias^2 + variance is called the "estimation error"

Bias:
- is the square of the difference between the true model parameters and the expected value of the model's estimate. i.e. the systematic residual.
- captures the systematic "deviation" from the model
- if the model went through every one of the training points, the bias / systematic deviation will be zero.
- low bias (with high variance) usually means overfitting

Variance:
- the variance(i.e. variability, fluctuation) of the estimator
- variance captures the random "fluctuation" of the estimator, not a systematic error
- low variance (with high bias) is underfitting

Irreducible Error:
- not in our control

==> obviously you want to reduce all three components.
==> reducing bias usually means adding more parameters / complexity, which risks overfitting
==> reducing variance usually means a simpler model (simpler models give consistent results), which risks underfitting
- obviously, you have to optimize that trade-off.
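The trade-off can be simulated. A minimal sketch (my own toy example: estimating a normal mean, comparing the plain sample average against an average shrunk by an assumed fixed factor of 0.8):

```r
# Toy bias^2/variance decomposition: estimate a normal mean theta.true = 0.5
# from samples of size 10, comparing the plain sample mean (unbiased) against
# a shrunk version (biased, but lower variance).
set.seed(42)
theta.true = 0.5
est.plain  = replicate(5000, mean(rnorm(10, mean = theta.true)))
est.shrunk = 0.8 * est.plain                 # assumed shrinkage factor

mse   = function(est) mean((est - theta.true)^2)
bias2 = function(est) (mean(est) - theta.true)^2
vr    = function(est) var(est)

# MSE decomposes (up to simulation noise) as bias^2 + variance:
c(mse(est.plain),  bias2(est.plain)  + vr(est.plain))
c(mse(est.shrunk), bias2(est.shrunk) + vr(est.shrunk))
```

Here the shrunk estimator carries nonzero bias yet ends up with a lower total MSE, which is exactly the trade that regularization exploits.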

##
## Goldilocks Principle
##
- as you increase complexity (overfit,low bias,high variance), the model's prediction error decreases against the training data.
- but at the same time, the model's prediction error against the test data decreases only until some point of complexity, then increases again because past that point you overfit.

(good visual) https://www.youtube.com/watch?v=QEV_wch_62s    # Goldilocks Principle

##
##  how to prevent overfitting
##
- reduce the magnitude/importance/dominance of certain features  (first step)
- reduce the number of features to use (if the first step is not enough, but be careful, this can be very useful but can be tricky)

##
##  MLE revisited
##

Recall that the MLE is a technique for estimating model parameters.
- It answers the question: which parameters will most likely characterize the dataset / best predict the data?
- Recall that the likelihood function is a product of the probabilities of the training set samples.

We search for the parameter that maximizes the likelihood function.
Sometimes, instead of looking at the likelihood we look at the log likelihood as it's easier to maximize computationally.
Traditional statistics usually refers to where the number of parameters (the model dimensionality) is fixed while the training set size increases to infinity.
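A sketch of maximizing a log likelihood numerically (toy example: the mean of a normal with known sd = 1, where the closed-form MLE is just the sample average):

```r
# Numerically maximize the log likelihood of a normal mean (sd known and = 1).
set.seed(7)
x = rnorm(50, mean = 3)

# Log likelihood: sum of log densities over the training samples.
loglik = function(theta) sum(dnorm(x, mean = theta, sd = 1, log = TRUE))

theta.hat = optimize(loglik, interval = c(-10, 10), maximum = TRUE)$maximum
theta.hat    # should agree with the closed-form MLE:
mean(x)
```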

This is often denoted as d<<n
▪ ‘d’ is the number of parameters
▪ ‘n’ is the training set size
▪ ’<<’ implies ‘much smaller’

when d<<n, the MLE performs well

Asymptotic optimality kicks in
▪ Asymptotic optimality​: as your training set approaches infinity, the model will converge to the ground truth or ‘nature’ that generates the data, and the convergence happens at the fastest possible rate.
▪ Asymptotic variance of the MLE is the best possible
▪ Asymptotic variance is the inverse Fisher information

The d<<n turns out to be a huge assumption.
- We do not know that d is much smaller than n
- Typically, d is not much smaller than n
-- If this is the case, MLE performs poorly
-- We must control for overfitting: it's hard to tell which features are noise and which are important signals.
- this is a big problem as many datasets out there have, in some cases, millions of features and not enough records
- Regularization is important

##
##  MSE revisited
##

Mean Squared Error: MSE = Bias^2 + Variance   (ignoring the irreducible error here)

The Maximum Likelihood Estimator for linear regression is unbiased (meaning bias == 0)
- In other words, variance is the only component of the MSE
-- Somehow, it's still possible to overfit with bias = 0
-- We can actually increase the bias here
- We can use regularization by introducing bias to reduce variance
- Why would we want to increase the bias via regularization?
-- Because it can reduce the variance to the point where we have a lower overall MSE
-- Bias can reduce variance and lower MSE overall

In the case of linear models, bias to simplicity​ can reduce variance and lower MSE overall
- Simplicity bias (another way of saying bias to simplicity) can be thought of as shrinking the parameters of the model towards 0; this does increase the bias but it lowers the estimation variance and thus lowers the MSE, which may result in better overall estimation accuracy and less overfitting of the training data.
- In other words, increase bias to reduce variance and the MSE overall
- We basically have to reduce |Θ^MLE|
-- That is to say, reduce the values of the parameters
-- Note it’s the absolute value of the coefficients we learned – we need to shrink these as close to 0 as we possibly can
==> This would mean the model places less importance on the corresponding feature, which is the entire point.
==> Even though the number of features stays the same, the importance of the features is reduced, and thus the model becomes simpler and less likely to overfit.

##
## Regularization Methods
##
let us look at a few:
◦ James-Stein Shrinkage
◦ Breiman’s Garrote
◦ Ridge Estimator
◦ Lasso Estimator
◦ Elastic Net Estimator
All of these can be used when the traditional statistical assumption d<<n falls apart

##
## James-Stein Shrinkage
##

Note that the James-Stein Estimator isn’t necessarily the ‘best’ form of regression, but we talk about it as it has important historical significance

Traditional statistical theory states ‘no other estimation rule for means is uniformly better than the observed average’
◦ i.e. In traditional statistics theory, there is no better way to estimate the mean or expectation than by computing the average
◦ i.e. all it's saying is we sum it all up and divide by the number of samples.

The James-Stein Estimator​ says: when estimating multiple means: shrink all individual averages towards a grand average (= the mean of all the means together)
◦ The James-Stein estimator addresses a situation where we have to estimate multiple means at the same time (means of different, random variables)
◦ We want to estimate the mean simultaneously
e.g.
batting hit average of diff baseball players.
In James-Stein, we compute each entity’s separate mean, but then we shrink all individual means toward the ‘grand average’
◦ This was shocking at first, but it has been shown to – in some cases – perform better than the traditional way of just computing the average for each individual entry.

The weakness of the James-Stein Estimator is that JSE shrinks all the dimensions of Θ^mle uniformly towards the grand average.
(sometimes that's what we want, but often we need to adjust each dimension differently.)

Θ^JS = (1 / (1 + t)) * Θ^mle    where t > 0     # t ("tau") is the shrinkage/constraint coefficient
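A sketch of the shrink-toward-the-grand-average idea (illustrative only: the batting averages and the fixed weight w below are made up, and the exact James-Stein formula chooses the weight from the data rather than fixing it):

```r
# Shrink each player's individual batting average toward the grand average.
avgs  = c(0.310, 0.280, 0.265, 0.340, 0.255)   # hypothetical per-player means
grand = mean(avgs)
w = 0.5                                        # assumed fixed shrinkage weight in [0, 1]

shrunk = grand + w * (avgs - grand)            # every mean pulled toward 'grand'
shrunk
```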

##
##  Breiman's Garrote
##

to overcome the James-Stein shrinkage method's limitation, BG allows for shrinking each dimension differently.

Θ^BG_j = t_j * Θ^mle_j    where j = 1..d

where t = argmax_t log-likelihood(t_1 * Θ^mle_1, ..., t_d * Θ^mle_d)     # maximizing the log likelihood

subject to ∑_{j=1..d} t_j <= c     # c is a tuning parameter called a "hyper parameter"

==> we try different values for c, to tune/optimize.  Also, sometimes there is more than one hyper parameter.
==> computational cost increases almost prohibitively at high dimension, so be careful.

note: when we constrain t_j >= 0, it is called the non-negative garrote

##
##  Ridge Regularization (aka Ridge Regression)
##

James-Stein was proposed for means and garrote was proposed for regression
- In the same vein, Ridge Regression was also proposed for linear regression but the same principle can apply for other models (logistic regression and other methods)

Definition of Ridge Regression

Θ^ridge = argmax_Θ log-likelihood(Θ)    subject to ||Θ||_2^2 <= c

Maximize the likelihood (or the log likelihood) subject to a constraint
▪ The constraint is the L2 norm of the vector Θ is less than some constant c
Recall that the L2 norm is the square root of the sum of squares of Θ
▪ The L2 norm squared is just the sum of the squares of the dimensions of Θ

Keep in mind that Θ is a vector
In other words, the L2 norm squared is Θ1^2 + Θ2^2 + Θ3^2 + ... + Θd^2
More generally:

||v||_p = ( ∑_{j=1..d} |v_j|^p )^(1/p)

For the p norm (in this case the L2 norm, but the same goes for any Lp norm) we take the absolute value of each component of the vector raised to the power p (that is to say, |v_j|^p), sum them up, and then take the result to the power 1/p.
If we do the math and substitute 2 for p, we get what we saw earlier (in the ‘subject to’ part above):
because the norm is then raised to the second power, it cancels with the 1/2 power, so you just get the sum of squares.

The original problem is not easy to solve computationally because it’s a constraint optimization problem.
◦ This can be reformulated(re-written) using something called the Lagrangian, which is a tool from calculus.
▪ Lambda is considered to be the Lagrange multiplier, aka the shrinkage parameter (it controls how much to shrink Θ)

So, solving the original problem becomes solving the re-written problem by setting the derivative of the Lagrangian to zero with respect to theta.
note: in reality, you don't need to solve the equation analytically; just solve it for certain values of lambda and pick the best one.
lambda, in this case, is really a parameter (aka the ridge parameter) that controls how much you penalize a big Θ.
as you increase lambda, Θ becomes smaller.
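For linear regression with squared loss, the Lagrangian form has the well-known closed-form solution Θ = (X'X + λI)^-1 X'y. A sketch (simulated data, intercept omitted / features assumed centered) showing the coefficients shrink as lambda grows:

```r
# Ridge closed form: theta = (X'X + lambda*I)^-1 X'y.
set.seed(3)
n = 100; d = 5
X = matrix(rnorm(n * d), n, d)
y = X %*% c(2, -1, 0.5, 0, 1) + rnorm(n)

ridge = function(X, y, lambda)
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)

sum(abs(ridge(X, y, 0)))      # lambda = 0 is ordinary least squares
sum(abs(ridge(X, y, 10)))     # bigger lambda -> smaller |theta|
sum(abs(ridge(X, y, 100)))
```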

-----

###
###  LASSO estimator(regression)
###

lasso: least absolute shrinkage and selection operator

it's essentially the same as the Ridge Estimator (regression); the only difference is that we use the L1 norm.
what happens when you use the L1 norm is that some Θ may get penalized all the way to zero (i.e. gone),
while using the L2 norm you may penalize Θ to smaller values but never exactly zero.
So using the L1 norm, because it effectively removes some Θ, can be seen as doing "variable selection" as well as "shrinkage (parameter estimation)"

i.e.  Lasso encourages sparse solutions. this is useful when your dimension d is huge, since the sparse model Lasso produces is computationally more efficient.
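In the special case of an orthonormal design matrix (a simplifying assumption; in general the lasso needs an iterative solver, e.g. coordinate descent), the lasso solution is just per-coordinate soft-thresholding of Θ^mle, which makes the exact-zeros behavior easy to see:

```r
# Soft-thresholding: the per-coordinate lasso solution under an orthonormal design.
soft = function(theta, lambda) sign(theta) * pmax(abs(theta) - lambda, 0)

theta.mle = c(3.0, -0.4, 0.05, 1.2)   # hypothetical fitted coefficients
soft(theta.mle, 0.5)     # small coefficients are set exactly to zero
theta.mle / (1 + 0.5)    # ridge in the same setting only rescales, never zeroes
```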

###
### linear regression maths
###

note: regularization techniques can be applied to both lin reg and logistic regression.

(video)

###
###  Elastic Net Estimator
###
- linear regression that combines L1(lasso) & L2(ridge) regularization.
- good for dealing with high dimensional data
- commonly used in real industry
- basically you get two lambda parameters. one for L1, one for L2.
- computationally no different from Lasso. so use this in practice.

2016-10-02 16:34:19 | Category: gatech