Machine Learning for Trading

Machine Learning for Trading.  cs7646 notes. 
 
(ref) https://blog.quantopian.com/9-mistakes-quants-make-that-cause-backtests-to-lie-by-tucker-balch-ph-d/ 
(ref) http://www.economist.com/news/finance-and-economics/21706278-central-bank-may-exert-strange-sway-over-stockmarket-returns-long-arm 
 
########################################################## 
####  Part 1: manipulating financial data in python   #### 
########################################################## 
 
why python ? 
- strong scientific libraries 
- well maintained 
- fast (low level operations written in C) 
 
data 
- we assume csv files that come with headers 
- CSV: comma separated values 
 
stock data 
- timestamp, open, high, low, close, adj_close, volume, turnover, etc 
 
 
 
################################## 
###  (1.1)  Pandas dataframe   ### 
################################## 
 
- Pandas: developed at AQR (a hedge fund). "Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series." 
- its key component is the "dataframe" (for our purpose, it's basically a 2d array; a 1d array is a "Series", and a more elaborate 3d structure is called a "Panel") 
 
(ref) http://pandas.pydata.org/pandas-docs/stable/dsintro.html 
(ref) http://pandas.pydata.org/pandas-docs/stable/index.html 
(ref) http://stackoverflow.com/questions/13784192/creating-an-empty-pandas-dataframe-then-filling-it 
 
 
----------------- pandas_df_example.py 
import pandas as pd 
 
def test_run(): 
    df = pd.read_csv("data/aapl.csv") 
    print df          # print the entire csv file 
    print df.head()   # print the top 5 lines 
    print df.tail()   # print the last 5 lines 
    print df.tail(45) # print the last 45 lines 
    print df[10:21]   # print rows 10 to 20.  NOTE: the range is semantically [a,b), i.e. the end index is excluded 
 
if __name__ == "__main__": 
    test_run() 
---------------- 
 
an example output 
---------------- 
            Date    Open    High     Low   Close    Volume  Adj Close 
3170  2000-02-07  108.00  114.25  105.94  114.06  15770800      28.39 
3171  2000-02-04  103.94  110.00  103.62  108.00  15206800      26.88 
3172  2000-02-03  100.31  104.25  100.25  103.31  16977600      25.72 
3173  2000-02-02  100.75  102.12   97.00   98.81  16588800      24.60 
3174  2000-02-01  104.00  105.00  100.00  100.25  11380000      24.96 
---------------- 
 
 
------------------------ compute_mean.py 
"""Compute mean volume""" 
 
import pandas as pd 
 
def get_mean_volume(symbol): 
    """Return the mean volume for stock indicated by symbol. 
 
    Note: Data for a stock is stored in file: data/<symbol>.csv 
    """ 
    df = pd.read_csv("data/{}.csv".format(symbol))  # read in data 
    return df['Volume'].mean()  # TODO: Compute and return the mean volume for this stock 
                                # see how powerful pandas dataframe is. 
                                # - you can specify the column by the header 
                                # - also all sorts of functions like mean() max(), etc 
 
def test_run(): 
    """Function called by Test Run.""" 
    for symbol in ['AAPL', 'IBM']: 
        print "Mean Volume" 
        print symbol, get_mean_volume(symbol) 
 
 
if __name__ == "__main__": 
    test_run() 
------------------------- 
 
-------------- output 
Mean Volume 
AAPL 21491431.3386 
Mean Volume 
IBM 7103570.80315 
------------- 
 
 
## 
##  matplotlib/pyplot 
## 
- it's just so powerful and useful. see example code at 
http://matplotlib.org/users/pyplot_tutorial.html#working-with-text 
 
-------------------------------- 
"""Plot Close & AdjClose prices for IBM""" 
 
import pandas as pd 
import matplotlib.pyplot as plt 
 
def test_run(): 
    df = pd.read_csv("data/IBM.csv") 
    df[['Close','Adj Close']].plot() 
    plt.savefig('output/plot.png')  # if you wanna save (save BEFORE calling show, otherwise the saved .png is blank) 
    plt.show()                      # must be called to show plots 
 
if __name__ == "__main__": 
    test_run() 
------------------------------ 
 
 
## 
##  build a data frame in pandas 
## 
 
-------------------- 
import pandas as pd 
def test_run(): 
    start_date='2010-01-22' 
    end_date='2010-01-26' 
    dates=pd.date_range(start_date,end_date)   # a DatetimeIndex (a list-like of dates) 
    df1=pd.DataFrame(index=dates)              # those dates are used as index, instead of bare integers 
 
    # assume there exists a column named "Date" in SPY.csv 
    # na_values specifies what should be interpreted as NaN 
    # usecols specify which columns to use 
    dfSPY = pd.read_csv("data/SPY.csv", index_col="Date", parse_dates=True, usecols=['Date','Adj Close'], na_values=['nan']) 
    df1=df1.join(dfSPY)  # typical left join.  OR do df1=df1.join(dfSPY,how='inner') which keeps only the rows whose indices exist in both tables 
    df1=df1.dropna()     # dropping NaN rows (dropna returns a new frame, so re-assign it) 
    print df1 
 
if __name__ == "__main__": 
    test_run() 
-------------------- 
 
# a left join takes the left table as the base, and joins any corresponding rows from the right table. 
# in this case, dfSPY rows whose index dates exist in df1 are taken; where no match exists, the value is set to NaN 
 
 
 
## 
##  another example: read multiple stock adj close, and display days when SPY traded 
## 
 
-------------------------- 
import os 
import pandas as pd 
 
def symbol_to_path(symbol, base_dir="data"): 
    """Return CSV file path given ticker symbol.""" 
    return os.path.join(base_dir, "{}.csv".format(str(symbol))) 
 
 
def get_data(symbols, dates): 
    """Read stock data (adjusted close) for given symbols from CSV files.""" 
    df = pd.DataFrame(index=dates) 
    if 'SPY' not in symbols:  # add SPY for reference, if absent 
        symbols.insert(0, 'SPY') 
 
    for symbol in symbols: 
        # TODO: Read and join data for each symbol 
        df_tmp = pd.read_csv(symbol_to_path(symbol), index_col="Date", parse_dates=True, usecols=['Date','Adj Close'], na_values=['nan']) 
        df_tmp = df_tmp.rename(columns ={'Adj Close':symbol}) 
        df = df.join(df_tmp) 
        if symbol == "SPY": 
            df = df.dropna(subset=['SPY']) 
    return df 
 
 
def test_run(): 
    # Define a date range 
    dates = pd.date_range('2010-01-22', '2010-01-26') 
 
    # Choose stock symbols to read 
    symbols = ['GOOG', 'IBM', 'GLD'] 
 
    # Get stock data 
    df = get_data(symbols, dates) 
    print df 
 
 
if __name__ == "__main__": 
    test_run() 
----------------------------    #  expect an output like below 
 
 
               SPY    GOOG     IBM     GLD 
2010-01-22  104.34  550.01  119.61  107.17 
2010-01-25  104.87  540.00  120.20  107.48 
2010-01-26  104.43  542.42  119.85  107.56 
 
 

#  slicing dataframes 

there is a lot of powerful slicing syntax, so learn it. 
e.g. 
df2 = df1[[col_name_foo,col_name_bar]] 
df2 = df1.ix[s_idx:e_idx,[col_name_foo,col_name_bar]] 
 
REF: http://pandas.pydata.org/pandas-docs/stable/indexing.html 
 
 

#   more about plotting 

 
example code snippet 
-------------------- 
import matplotlib.pyplot as plt 
def plot_data(df, title="Stock prices"): 
    """Plot stock prices with a custom title and meaningful axis labels.""" 
    ax = df.plot(title=title, fontsize=12, grid=True) 
    ax.set_xlabel("Date") 
    ax.set_ylabel("Price") 
    plt.savefig('myplot.png')  # (ref) http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.savefig 
                               # make sure you save BEFORE you call plt.show() otherwise your saved .png will be blank 
    plt.show() 
 
-------------------- 
 

#  normalizing: plotting raw stock prices directly makes it hard to see the movement of lower priced stocks 

 
def normalize_data(df): 
    return df / df.ix[0,:]     # df.ix[0,:] gives just the first row, so every stock starts at 1.0. enjoy the powerful syntax of pandas 
 
 

#  using "in" 
 
 
"in" lets you check membership, but on a dataframe it checks the column names (not the cell values) by default 
 
df = 
     col_1 
0    a 
1    b 
3    c 
4    d 
 
'a' in df         # False 
'col_1' in df     # True 
'a' in df.values  # True 
 
 
 
########################## 
###   (1.3)  Numpy     ### 
########################## 
 
pandas is a wrapper around numpy, which in turn wraps the underlying C/Fortran code, so it is fast. this is one of the reasons python is used for financial research. 
note you may notice some nuances between pandas and numpy. 
e.g. 
for the std deviation calc, pandas defaults to the unbiased (sample) estimator (ddof=1) while numpy defaults to the biased (population) estimator (ddof=0), which gives you slightly different values. 
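a minimal sketch of that std deviation nuance (a throwaway toy array, not from the course code): 
----------------------------- // std_nuance.py (illustration only) 
import numpy as np 
import pandas as pd 

x = [1.0, 2.0, 3.0, 4.0] 
s = pd.Series(x) 

print s.std()                      # pandas default ddof=1 (sample / unbiased)   -> ~1.29 
print np.std(np.array(x))          # numpy  default ddof=0 (population / biased) -> ~1.12 
print np.std(np.array(x), ddof=1)  # matches pandas once ddof is aligned 
----------------------------- 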
 
numpy gives you what they call an ndarray (= N dimensional array), which is an array with elaborate features. 
 
import numpy as np 
 
## 
##  creation 
## 
 
pandas.DataFrame.values: Underlying values as ndarray 
numpy.array: Create a NumPy ndarray from given sequence 
numpy.ndarray: NumPy n-dimensional array type 
(ref) http://docs.scipy.org/doc/numpy/user/basics.creation.html 
 
nd1 = df.values 
 
nd1 = np.empty(5)       # 1d array of size 5 (value is not initialized so whatever is in memory will be there) 
nd1 = np.empty((5,4,3)) # 3d array of size 5 4 3 for xyz coordinates 
nd1 = np.ones((5,4), dtype=np.int)   # ndarray default dtype is float 
nd1 = np.ones((5,4)) 
(ref) http://docs.scipy.org/doc/numpy/reference/routines.array-creation.html 
 
 
## 
##  random 
## 
 
nd1 = np.random.random((5,4))   # note: random() takes the shape as a single tuple; rand(5,4) takes separate args 
 
numpy.random.random: Samples a Uniform distribution in [0.0, 1.0) 
numpy.random.rand: Like random, but slightly different syntax 
numpy.random.normal: Normal or Gaussian distribution 
numpy.random.randint: Integers from Uniform distribution 
(ref) http://docs.scipy.org/doc/numpy/reference/routines.random.html 
 
 
## 
##  accessing elements (indexing, slicing) 
## 
 
(ref) http://docs.scipy.org/doc/numpy/user/basics.indexing.html 
(ref) http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html 
 
nd1[x,y]   # an element at coordinate x,y 
nd1[0,0] 
nd1[0:3,2:5]  # ":" indicates a range, called "slicing". here you specified rows 0-2 and columns 2-4 
              # note that the end index is not inclusive 
nd1[:,-1]     # a plain ":" like this means ALL rows.  "-1" means counting from the end (the last column) 
 
nd1[0:2,0:2] = nd1[-2:,2:4]   # you can do something neat like this also. 
 
nd1[:,0:5:2]  # n:m:t means indices from n to m-1, taking every t-th one starting at n 
              # so in this case you take columns 0, 2, 4 
 
nd1 = np.random.rand(5)   # an array of 5 random elems 
indices = np.array([1,1,2,3]) 
print nd1[indices]        # using an array as index to other array  (numpy magic) 
 
## more fancy example 
-------------------------------// fancy_index.py 
#!/usr/bin/python 
 
import numpy as np 
 
arr = np.array([(1,5,4),(3,2,6)]) 
print "original array:\n",arr 
print "mean:",arr.mean() 
print "elems which are less than mean:",arr[arr<arr.mean()] 
arr[arr<arr.mean()] = 7 
print "replaced all elems under mean:\n",arr 
-------------------------------- 
$ python fancy_index.py 
original array: 
[[1 5 4] 
 [3 2 6]] 
mean: 3.5 
elems which are less than mean: [1 3 2] 
replaced all elems under mean: 
[[7 5 4] 
 [7 7 6]] 
 
 
 
NOTE: pandas has an even more elaborate indexing mechanism where it infers a lot from the context, so it helps if we write explicitly. 
e.g. 
.loc  # for label based access 
.iloc # for position based access (i.e. integers) 
.ix   # tries label then tries position based access 
 
the massive confusion occurs when you have an index label "3", not as the positional index integer. 
you may think you specified the 4th element but in reality you may get the elem of whichever position the label "3" points to. 
see the stackoverflow example. 
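a minimal sketch of that confusion (made-up toy series, not from the lecture): 
----------------------------- // loc_vs_iloc.py (illustration only) 
import pandas as pd 

s = pd.Series(['a', 'b', 'c', 'd'], index=[0, 1, 3, 4])  # note: label 3 sits at position 2 

print s.loc[3]    # 'c'  -> label based: the element whose index label is 3 
print s.iloc[3]   # 'd'  -> position based: the 4th element 
----------------------------- 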
 
(a good ref) http://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation 
(a good goto ref) http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing 
 
NOTE: Pick up a particular element (a note from a colleague in piazza) 
- to pick up a particular element from a dataframe: 
e.g. 
df = 
     col1    col2 
0    a        e 
1    b        f 
3    c        g 
4    d        h 
 
df.col1[df.col2 == 'e']: this will return a Series object, not that particular value! Even though you know only one element is there and you are ready to use it, how does Pandas know there's only one? It will always return a Series in this case. 
df.col1[df.col2 == 'e'].iloc[0]: OK, returns 'a' 
 
 
 
## 
##  attributes 
## 
 
numpy.ndarray.shape: Dimensions (height, width, ...) 
numpy.ndarray.ndim: No. of dimensions = len(shape) 
numpy.ndarray.size: Total number of elements 
numpy.ndarray.dtype: Datatype 
(ref) http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html 
 
nd1.shape[0]   # number of rows 
nd1.shape[1]   # number of columns 
 
 
## 
##  mathematical operations 
## 
 
(must read ref) http://docs.scipy.org/doc/numpy/reference/routines.math.html 
 
----------------------------------------// math_sample.py 
#!/usr/bin/python 
 
import numpy as np 
 
np.random.seed(854)        # this fixes your rand generator to always produce the same random output 
                           # otherwise your code will produce something diff every time 
a = np.random.randint(0,10,size=(5,3)) # 5×3 array with random integers [0,10) 
print "array:\n",a 
print "sum of every elem:",a.sum() 
print "sum of each col:",a.sum(axis=0) 
print "sum of each row:",a.sum(axis=1) 
print "min:",a.min() 
print "min of each col:",a.min(axis=0) 
print "min of each row:",a.min(axis=1) 
print "max:",a.max() 
print "index of the max elem:",a.argmax() 
print "mean:",a.mean() 
---------------------------------------- 
$ python math_sample.py 
array: 
[[4 3 8] 
 [7 8 6] 
 [9 5 1] 
 [6 7 2] 
 [8 1 8]] 
sum of every elem: 83 
sum of each col: [34 24 25] 
sum of each row: [15 21 15 15 17] 
min: 1 
min of each col: [4 1 1] 
min of each row: [3 6 1 2 1] 
max: 9 
index of the max elem: 6 
mean: 5.53333333333 
 
 
## 
##  arithmetic operations 
## 
 
nd1 + nd2    # simply adds up corresponding index elems 
nd1 - nd2    # likewise 
nd1 / nd2    # likewise 
nd1 / n      # you can use a number n 
nd1 * nd2    # likewise (note this is NOT dot product matrix multiplication) 
nd1 * n 
nd1.dot(nd2) 
 
NOTE: when doing arithmetic operations, if all numbers are integers, then python assumes integers. 
e.g. 
3/2   = 1 
3/2.0 = 1.5 
 
NOTE: if you add 2 numpy arrays, they will add by "position", but if you add 2 pandas dataframes, they will be aligned by "indices", not position. 
e.g. 
 
df1 = 
     col1 
0    0.1 
1    0.2 
3    0.3 
4    0.4 
 
df2 = 
     col1 
0    0.1 
1    0.2 
2    0.3 
4    0.4 
 
==>  df1 + df2 will yield 5 indices, with some new NaN.  to add by pos, you can do:  df1.values + df2.values 
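a small sketch of index alignment vs positional addition (toy frames with assumed values): 
----------------------------- // align_vs_position.py (illustration only) 
import pandas as pd 

df1 = pd.DataFrame({'col1': [0.1, 0.2, 0.3, 0.4]}, index=[0, 1, 3, 4]) 
df2 = pd.DataFrame({'col1': [0.1, 0.2, 0.3, 0.4]}, index=[0, 1, 2, 4]) 

print df1 + df2                # aligned on index labels -> 5 rows, NaN where a label is missing 
print df1.values + df2.values  # positional -> plain element-wise numpy addition, 4 rows 
----------------------------- 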
 
numpy.add: Element-wise addition, same as + operator 
numpy.subtract: Element-wise subtraction, same as - 
numpy.multiply: Element-wise multiplication, same as * 
numpy.divide: Element-wise division, same as / 
numpy.dot: Dot product (1D arrays), matrix multiplication (2D) 
 
(ref) http://docs.scipy.org/doc/numpy/reference/routines.math.html#arithmetic-operations 
(ref) http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html 
(ref) http://docs.scipy.org/doc/numpy/reference/routines.linalg.html 
(ref) http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html 
(ref) http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html 
 
 
 
 
## 
##  time 
## 
 
(ref) https://docs.python.org/2/library/time.html#time.time # simple time 
(ref) https://docs.python.org/2/library/timeit.html         # average 
(ref) https://docs.python.org/2/library/profile.html        # profiling 
 
-------------------------// time.py 
#!/usr/bin/python 
 
from time import time 
import numpy as np 
 
def manual_mean(arr): 
    total = 0 
    for i in range(0,arr.shape[0]): 
        for j in range(0,arr.shape[1]): 
            total = total + arr[i,j] 
    return total / arr.size 
 
def numpy_mean(arr): 
    return arr.mean() 
 
def how_long(func,*args): 
    t1 = time() 
    print "mean:",func(*args) 
    t2 = time() 
    return "took "+str(t2-t1)+" seconds" 
 
nd1 = np.random.random((1000,10000)) # a sufficiently large array 
print "numpy builtin mean() function:",how_long(numpy_mean,nd1) 
print "      manual mean calculation:",how_long(manual_mean,nd1) 
 
------------------------------ 
$ python time.py 
numpy builtin mean() function: mean: 0.500048407196 
took 0.00843000411987 seconds 
      manual mean calculation: mean: 0.500048407196 
took 2.81500196457 seconds 
 
====> as above, numpy/pandas/scipy builtin functions are fast because they are compiled C code, whereas with a python for loop the interpreter does one low-level C operation, comes back to the python layer to increment the loop, goes back to C, and so on. that's why it's orders of magnitude slower. 
 
 
## 
##  sort, search, counting 
## 
 
(ref) http://docs.scipy.org/doc/numpy/reference/routines.sort.html 
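the lecture only gives the reference above; a few representative calls (my own toy example, not course code): 
----------------------------- // sort_search_count.py (illustration only) 
import numpy as np 

a = np.array([5, 2, 9, 1, 7]) 

print np.sort(a)               # [1 2 5 7 9]   sorted copy 
print np.argsort(a)            # [3 1 0 4 2]   indices that would sort a 
print a.argmax()               # 2             position of the max 
print np.count_nonzero(a > 4)  # 3             how many elements exceed 4 
----------------------------- 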
 
 
 
################################################################ 
####    (1.4)  statistical analysis of time series data     #### 
################################################################ 
 
we will now go back to Pandas, and do some serious number crunching. 
 
(ref) global stats 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html   # mean 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html # median 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.std.html    # std deviation 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html    # sum 
http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats           # more API 
 
 
# assume df1 contains a table where rows = dates (yyyymmdd), columns = tickers, values = close price 
e.g. 
df1.mean() 
df1.median() 
 
 
(ref) rolling stats 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.rolling_mean.html 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.rolling_std.html 
http://pandas.pydata.org/pandas-docs/stable/computation.html?highlight=rolling%20statistics#moving-rolling-statistics-moments 
 

# Bollinger Bands: A way of quantifying how far stock price has deviated from some norm. (2-stddev from the mean) 

 
note rolling stats methods are not methods of dataframe, but pandas functions. 
e.g. 
rolling_mean_SPY = pd.rolling_mean(df1['SPY'], window=20)  # specified the window as 20 rows(i.e. days) 
 
# to plot 
ax = df1['SPY'].plot(title="SPY rolling mean", label='SPY') 
rolling_mean_SPY.plot(label='Rolling mean', ax=ax) 
ax.set_xlabel('Date') 
ax.set_ylabel('Price') 
ax.legend(loc='upper left') 
plt.show() 
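note: in newer pandas versions pd.rolling_mean/pd.rolling_std were removed; as far as I know the equivalent is a method on the Series itself: 
rolling_mean_SPY = df1['SPY'].rolling(window=20).mean() 
rolling_std_SPY  = df1['SPY'].rolling(window=20).std() 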
 
 
 
-------------------------------- // sample code from udacity quiz 
"""Bollinger Bands.""" 
 
import os 
import pandas as pd 
import matplotlib.pyplot as plt 
 
def symbol_to_path(symbol, base_dir="data"): 
    """Return CSV file path given ticker symbol.""" 
    return os.path.join(base_dir, "{}.csv".format(str(symbol))) 
 
 
def get_data(symbols, dates): 
    """Read stock data (adjusted close) for given symbols from CSV files.""" 
    df = pd.DataFrame(index=dates) 
    if 'SPY' not in symbols:  # add SPY for reference, if absent 
        symbols.insert(0, 'SPY') 
 
    for symbol in symbols: 
        df_temp = pd.read_csv(symbol_to_path(symbol), index_col='Date', 
                parse_dates=True, usecols=['Date', 'Adj Close'], na_values=['nan']) 
        df_temp = df_temp.rename(columns={'Adj Close': symbol}) 
        df = df.join(df_temp) 
        if symbol == 'SPY':  # drop dates SPY did not trade 
            df = df.dropna(subset=["SPY"]) 
 
    return df 
 
 
def plot_data(df, title="Stock prices"): 
    """Plot stock prices with a custom title and meaningful axis labels.""" 
    ax = df.plot(title=title, fontsize=12, grid=True) 
    ax.set_xlabel("Date") 
    ax.set_ylabel("Price") 
    plt.show() 
 
 
def get_rolling_mean(values, window): 
    """Return rolling mean of given values, using specified window size.""" 
    return pd.rolling_mean(values, window=window) 
 
 
def get_rolling_std(values, window): 
    """Return rolling standard deviation of given values, using specified window size.""" 
    return pd.rolling_std(values, window=window) 
 
 
def get_bollinger_bands(rm, rstd): 
    """Return upper and lower Bollinger Bands.""" 
    upper_band = rm + 2*rstd 
    lower_band = rm - 2*rstd 
    return upper_band, lower_band 
 
 
def test_run(): 
    # Read data 
    dates = pd.date_range('2012-01-01', '2012-12-31') 
    symbols = ['SPY'] 
    df = get_data(symbols, dates) 
 
    # Compute Bollinger Bands 
    # 1. Compute rolling mean 
    rm_SPY = get_rolling_mean(df['SPY'], window=20) 
 
    # 2. Compute rolling standard deviation 
    rstd_SPY = get_rolling_std(df['SPY'], window=20) 
 
    # 3. Compute upper and lower bands 
    upper_band, lower_band = get_bollinger_bands(rm_SPY, rstd_SPY) 
 
    # Plot raw SPY values, rolling mean and Bollinger Bands 
    ax = df['SPY'].plot(title="Bollinger Bands", label='SPY') 
    rm_SPY.plot(label='Rolling mean', ax=ax) 
    upper_band.plot(label='upper band', ax=ax) 
    lower_band.plot(label='lower band', ax=ax) 
 
    # Add axis labels and legend 
    ax.set_xlabel("Date") 
    ax.set_ylabel("Price") 
    ax.legend(loc='upper left') 
    plt.show() 
 
 
if __name__ == "__main__": 
    test_run() 
---------------------------------------------- 
 
 

# Daily returns: day-to-day percentage change in stock price. 

 
def:   daily_ret[t] = (price[t]/price[t-1]) - 1 
 
it is a useful indicator. e.g. you can compare a particular industry stock daily ret against SP500 daily ret, to extract a pattern, etc. 
 
 
here is a sample script with three implementation examples. 
 
----------------------------// sample from udacity quiz 
"""Compute daily returns.""" 
 
import os 
import pandas as pd 
import matplotlib.pyplot as plt 
 
def symbol_to_path(symbol, base_dir="data"): 
    """Return CSV file path given ticker symbol.""" 
    return os.path.join(base_dir, "{}.csv".format(str(symbol))) 
 
 
def get_data(symbols, dates): 
    """Read stock data (adjusted close) for given symbols from CSV files.""" 
    df = pd.DataFrame(index=dates) 
    if 'SPY' not in symbols:  # add SPY for reference, if absent 
        symbols.insert(0, 'SPY') 
 
    for symbol in symbols: 
        df_temp = pd.read_csv(symbol_to_path(symbol), index_col='Date', 
                parse_dates=True, usecols=['Date', 'Adj Close'], na_values=['nan']) 
        df_temp = df_temp.rename(columns={'Adj Close': symbol}) 
        df = df.join(df_temp) 
        if symbol == 'SPY':  # drop dates SPY did not trade 
            df = df.dropna(subset=["SPY"]) 
 
    return df 
 
 
def plot_data(df, title="Stock prices", xlabel="Date", ylabel="Price"): 
    """Plot stock prices with a custom title and meaningful axis labels.""" 
    ax = df.plot(title=title, fontsize=12, grid=True) 
    ax.set_xlabel(xlabel) 
    ax.set_ylabel(ylabel) 
    plt.show() 
 
 
def compute_daily_returns(df): 
    """Compute and return the daily return values.""" 
    # Note: Returned DataFrame must have the same number of rows 
 
    daily_ret = df.copy() 
 

    # (three equivalent implementations shown below; in practice keep only one) 
    # 
    # implementation 1 (naive, do NOT do this, as it defeats the purpose of pandas/numpy) 
    # 
    x_len = len(df.ix[:,0])   # number of rows 
    y_len = len(df.ix[0,:])   # number of columns 
    for i in range(0,x_len): 
        for j in range(0,y_len): 
            if i == 0: 
                daily_ret.ix[i,j] = 0 
            else: 
                daily_ret.ix[i,j] = df.ix[i,j] / df.ix[i-1,j] - 1 
 
    # 
    # implementation 2 (not too bad) 
    # 
    # note: .values is necessary because otherwise pandas would try to match each row by index 
    # when doing element-wise arithmetic on two data frames. 
    daily_ret[1:] = (df[1:] / df[:-1].values) - 1 
    daily_ret.ix[0,:] = 0   # explicitly set the first row (NaN) to zero 
 
    # 
    # implementation 3 (probably the most elegant) 
    # 
    daily_ret = (df / df.shift(1)) - 1 
    daily_ret.ix[0,:] = 0   # explicitly set the first row (NaN) to zero 
 
    return daily_ret 
 
 
def test_run(): 
    # Read data 
    dates = pd.date_range('2012-07-01', '2012-07-31')  # one month only 
    symbols = ['SPY','XOM'] 
    df = get_data(symbols, dates) 
    plot_data(df) 
 
    # Compute daily returns 
    daily_returns = compute_daily_returns(df) 
    plot_data(daily_returns, title="Daily returns", ylabel="Daily returns") 
 
 
if __name__ == "__main__": 
    test_run() 
------------------------------------- 
 
in short, the vectorized version boils down to: 
daily_ret = df.copy() 
daily_ret[1:] = (df[1:] / df[:-1].values) - 1   # .values avoids pandas aligning rows by index 
# alternatively you can write:    daily_ret = (df / df.shift(1)) - 1 
daily_ret.ix[0,:] = 0   # explicitly set the first row (NaN) to zero 
 
 
 

#   Cumulative Return: 

  pretty much the same concept as daily ret but we do:  cumret[t] = price[t]/price[0] - 1 
  notice the only difference is the denominator is always price[0] 
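a one-line sketch (same df of adjusted close prices as before; just restating the formula, not course code): 
cum_ret = df / df.ix[0,:] - 1    # whole cumulative return series, 0 on day one 
e.g. if a stock goes from $100 on day 0 to $110 on day t, then cumret[t] = 110/100 - 1 = 0.10 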
 
 
 
##################################### 
####   (1.5)  Incomplete Data    #### 
##################################### 
 
data is often not complete. 
- amalgamated from different pri/secondary venues 
- corp actions, causing merger/acquisition, delist, venue change, etc 
- illiquid symbols only trade sporadically (how do you treat the absence period when doing rolling ave/std analysis? usually roll fwd the last close price, fill backward if you cannot fill fwd) 
 
pandas has fillna() function. 
(ref) http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html 
 
e.g. 
fillna(method='ffill', inplace=True)   # learn about the params method (accepts a string) and inplace (accepts a boolean) 
 
----------------------// sample from udacity quiz 
 
"""Fill missing values""" 
 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import os 
 
def fill_missing_values(df_data): 
    """Fill missing values in data frame, in place.""" 
    df_data.fillna(method='ffill',inplace=True)    # fill forward first 
    df_data.fillna(method='backfill',inplace=True) # then fill backward for any leading gaps 
 
 
def symbol_to_path(symbol, base_dir="data"): 
    """Return CSV file path given ticker symbol.""" 
    return os.path.join(base_dir, "{}.csv".format(str(symbol))) 
 
 
def get_data(symbols, dates): 
    """Read stock data (adjusted close) for given symbols from CSV files.""" 
    df_final = pd.DataFrame(index=dates) 
    if "SPY" not in symbols:  # add SPY for reference, if absent 
        symbols.insert(0, "SPY") 
 
    for symbol in symbols: 
        file_path = symbol_to_path(symbol) 
        df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date", 
            usecols=["Date", "Adj Close"], na_values=["nan"]) 
        df_temp = df_temp.rename(columns={"Adj Close": symbol}) 
        df_final = df_final.join(df_temp) 
        if symbol == "SPY":  # drop dates SPY did not trade 
            df_final = df_final.dropna(subset=["SPY"]) 
 
    return df_final 
 
 
def plot_data(df_data): 
    """Plot stock data with appropriate axis labels.""" 
    ax = df_data.plot(title="Stock Data", fontsize=2, grid=True) 
    ax.set_xlabel("Date") 
    ax.set_ylabel("Price") 
    plt.show() 
 
 
def test_run(): 
    """Function called by Test Run.""" 
    # Read data 
    symbol_list = ["JAVA", "FAKE1", "FAKE2"]  # list of symbols 
    start_date = "2005-12-31" 
    end_date = "2014-12-07" 
    dates = pd.date_range(start_date, end_date)  # date range as index 
    df_data = get_data(symbol_list, dates)  # get data for each symbol 
 
    # Fill missing values 
    fill_missing_values(df_data) 
 
    # Plot 
    plot_data(df_data) 
 
 
if __name__ == "__main__": 
    test_run() 
----------------------------------- 
 
 
################################################## 
####   (1.6)  Histograms and Scatter Plots    #### 
################################################## 
 
histograms: a bar chart where y-axis shows the number of occurrences for each x-val 
 
the histogram of daily return values typically looks like a Gaussian (normal) distribution. this is what quants assumed for the (daily, weekly, monthly, and so on) returns of mortgage backed securities, i.e. that all mortgages defaulting at once is unlikely because each mortgage is independent of the others. that assumption proved to be wrong. 
 
obviously, you can calc mean, median, std dev, kurtosis for a histogram. 
 
kurtosis: an indicator of how fat/thin the tails of the histogram are compared to a pure Gaussian/normal distribution. 
        positive kurtosis: fat tails (more occurrences at the edges than there would be in a pristine Gaussian distribution) 
        negative kurtosis: skinny tails 
 
 
(a good quiz) https://www.youtube.com/watch?v=KMOaqWXq4xA 
 
 
#  assume the same daily return function from the prev lecture where index/row = dates, column = ticker symbols 
 

#  plot a histogram with mean, std dev 

 
dr = compute_daily_returns(df) 
dr['SPY'].hist(bins=20, label='SPY')    # bins default is 10 
print dr['SPY'].kurtosis()              # gives kurtosis 
mean = dr['SPY'].mean() 
stddev = dr['SPY'].std() 
plt.axvline(mean,color='w',linestyle='dashed',linewidth=2)            # vertical line at the mean (color w = white) 
plt.axvline(mean + stddev,color='r',linestyle='dashed',linewidth=2)   # +1 std dev (color r = red) 
plt.axvline(mean - stddev,color='r',linestyle='dashed',linewidth=2)   # -1 std dev 
plt.show() 
 

#  plot multiple stocks on the same histogram 

dr = compute_daily_returns(df) 
dr['SPY'].hist(bins=20, label='SPY') 
dr['IBM'].hist(bins=20, label='IBM') 
plt.legend(loc='upper right') 
plt.show() 
 
 
### 
###  scatter plots / Beta & Alpha 
### 
 
scatter plot the daily returns of two stocks (x-axis: SPY, y-axis: a stock of your choice), and fit a line through the points using linear regression. 
a scatter plot is just a graph of the relationship between two variables. the way you do this is, 
e.g. 
daily returns 
          SPY  IBM 
20160829  0.02 0.009   # then we just plot x=0.02, y=0.009 
20160830  0.01 0.03    # do the same for the rest 
.. 
.. 
 
(for the sake of simplification, assume SPY represents the US stock market.) 
the fit is a least squares fit of the data points, i.e. the sum of the squared vertical distances from the data points to the line is at a minimum. 
the slope of the fitted line, in financial terminology, is called "Beta" (the polynomial coefficient). 
if slope = N, then on average when the market goes up M%, the stock goes up N*M%. 
 
the y-intercept of the fitted line (its value at x = 0) is called "Alpha". 
positive alpha means the stock on average performs better than the market, and vice versa. 
 
(good visual video) https://www.youtube.com/watch?v=DHcgfSgjFwU 
 
NOTE: slope != correlation 
 
slope is just the steepness 
correlation is the tightness of the fit of the line   (the correlation coefficient ranges from -1 to 1; values near +-1 mean a tight fit) 
 
(good quiz video) https://www.youtube.com/watch?v=29PTCuEZvh4 
 
## 
##  plot a scatter plot 
## 
 
# assume we have the previously defined functions df = get_data(symbols,dates) , dr = compute_daily_returns(df) 
----------------- 
dr.plot(kind='scatter', x='SPY', y='XOM')               # learn the parameter "kind" 
beta_XOM,alpha_XOM = np.polyfit(dr['SPY'],dr['XOM'],1)  # learn numpy.polyfit(); here we want a degree 1 polynomial y=ax+b 
                                                        # a = coefficient (= beta),  b = intercept (= alpha) 
plt.plot(dr['SPY'], beta_XOM * dr['SPY'] + alpha_XOM)   # basically you specify x and y 
plt.show() 
 
# calculate correlation coefficient 
print dr.corr(method='pearson')          #  pearson is the most common method 
---------------- 
 
 
 
############################################################ 
###  (1.7)  Sharpe ratio and other portfolio statistics  ### 
############################################################ 
 
a portfolio is an allocation of funds to a set of stocks. 
 

#  daily portfolio value 

 
suppose you have the usual dataframe df (rows = dates, columns = tickers, values = adjusted close price) 
then normalize it:  normed = df / df.ix[0,:]   so the first day price is 1.0 for each ticker 
then multiply by the allocation (percentages that add up to 1.0 in total):  alloced = normed * allocs 
then multiply by the total funds to get the position value for each stock:  pos_vals = alloced * start_val 
then sum all columns in each row to get the daily portfolio value:  port_val = pos_vals.sum(axis=1) 
then compute the usual daily_rets from port_val 
 
df -> normed -> alloced -> pos_vals -> port_val -> daily_rets 
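a minimal sketch of that pipeline (allocs and start_val are made-up inputs; df is the usual adjusted close frame): 
----------------------------- // daily_portfolio_value.py (illustration only) 
import pandas as pd 

def get_portfolio_value(df, allocs, start_val): 
    """df: adjusted close prices (dates x tickers); allocs: weights summing to 1.0""" 
    normed   = df / df.ix[0,:]        # every ticker starts at 1.0 
    alloced  = normed * allocs        # weight each ticker 
    pos_vals = alloced * start_val    # dollar value of each position, per day 
    port_val = pos_vals.sum(axis=1)   # total portfolio value per day 
    return port_val 

# e.g.  port_val = get_portfolio_value(df, [0.4, 0.4, 0.1, 0.1], 1000000) 
#       then compute daily_rets of port_val exactly as in (1.4) 
----------------------------- 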
 

#  portfolio statistics 

 
(0) daily_rets = daily_rets[1:]    # so you filter out day 1 
 
(1) cum_ret = port_val[-1] / port_val[0] - 1 
 
(2) avg_daily_ret = daily_rets.mean() 
 
(3) std_daily_ret = daily_rets.std()    # standard deviation == risk, volatility 
 
(4) sharpe_ratio =  [see below] 
 

#  sharpe ratio 

- consider the return(reward) in the context of risk 
in general 
- higher return is better (obviously) 
- lower volatility(i.e. lower std dev, lower risk) is better 
 
(good quiz) https://www.youtube.com/watch?v=1ykUfwPnzL8 
 
- sharpe ratio is a metric that adjusts return for risk 
-- considers the "risk free rate of return" i.e. the interest rate you get from a bank savings account or LIBOR or a 3-month short term treasury bill (which is ~0% as of 2015) 
 
Rp : portfolio return 
Rf : risk free rate of return 
Sp : stddev of portfolio return 
 
sharpe ratio = E[Rp-Rf] / std[Rp-Rf]      # E = expected 
             = mean[daily_rets - daily_rf] / std[daily_rets - daily_rf] 
             = mean[daily_rets - daily_rf] / std[daily_rets]              # if daily_rf is a constant, then we can remove it from std dev calc 
 
 
#  arithmetic mean VS geometric mean 
arithmetic is given N elements, divide the sum by N. (sum == add them all up) 
geometric is given N elements, take Nth root of the product (product == multiply them all) 
(ref) https://en.wikipedia.org/wiki/Geometric_mean 
 
===> obviously both are different ways of mean(== average) 
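a tiny sketch of both (toy numbers, not from the lecture): 
import numpy as np 
x = np.array([1.0, 2.0, 4.0]) 
print x.mean()                      # arithmetic mean: 7/3 ~= 2.33 
print np.prod(x) ** (1.0/len(x))    # geometric mean: (1*2*4)^(1/3) = 2.0 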
 
 
#  How do we compute Rf ? 
- suppose your bank savings account interest rate is 1% annually. 
-- then (1 + daily_Rf)^252 = 1.01, i.e. daily_Rf = 1.01^(1/252) - 1       # 252 = approx. number of trading days per year 
- or just use 0%, as it's been that way for so long. 
 
 
NOTE:  sharpe ratio can vary depending on the frequency of data sampling 
     - SR is originally meant as an annual measure, so when we say SR we mean SRannualized. if the SR you compute is at a different frequency such as daily|weekly|monthly, annualize it as below 
     - SRannualized = K * SR_freq_X 
     - K = sqrt(# of samples per year in freq_X) 
 
if X = daily,   then K = sqrt(252) 
if X = weekly,  then K = sqrt(52) 
if X = monthly, then K = sqrt(12) 
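a minimal sketch putting the formula and the annualization factor together (daily data assumed, daily_rf treated as a constant; this is just my restatement, not the graded course code): 
----------------------------- // sharpe.py (illustration only) 
import numpy as np 

def sharpe_ratio(daily_rets, daily_rf=0.0, samples_per_year=252): 
    """annualized sharpe ratio from an array/Series of daily returns.""" 
    k = np.sqrt(samples_per_year) 
    return k * np.mean(daily_rets - daily_rf) / np.std(daily_rets - daily_rf) 
----------------------------- 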
 
(good quiz)  https://www.youtube.com/watch?v=d8HN4WND07w 
 
given 60 days of data, 
- avg daily ret = 10bps 
- daily risk free rate = 2bps 
- stddev of daily ret = 10 bps 
 
sqrt(252) * (10 - 2) / 10    =  12.7       # note how we still use sqrt(252), not sqrt(60) 
 
 
NOTE: sharpe ratio penalizes "upward" and "downward" deviations equally. other measures focus only on the downside (because that's really the risk), such as the Sortino ratio. 
 
 
########################################################## 
###  (1.8) optimizers : building a parameterized model ### 
########################################################## 
 
an optimizer is an algorithm to 
(1) find minimum values of objective functions (e.g. f(x) = x^3 + 2x +5 ) 
(2) build parameterized models based on data 
(3) refine allocations to stocks in portfolios 
 
key words: objective function, decision variables(like weights for each stock), constraints(like sum of all weights should be 1.0) 
 
 
## 
##  minimizer 
## 
- you just define your function, then the scipy minimizer (you can specify various methods) does it for you. 
- given f(x), it finds x such that f(x) is minimized 
- what kind of method? 
-- gradient descent 
-- newton's method 
-- etc 
NOTE: we give the minimizer an initial guess but if you have no clue, you can give it a random value or some standard value. 
NOTE: it's gonna find a local minimum, but that's not guaranteed to be the global minimum 
 
(good quiz) : https://www.youtube.com/watch?v=pY8s-GY_mEY 
 
-----------------------------------// sample scipy code 
# (code example) https://www.youtube.com/watch?v=1S0JxAAqLCs 
""" Minimize an objective function, using SciPy """ 
""" i.e. given f(x), it finds x such that f(x) is minimized """ 
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import scipy.optimize as spo 
 
def f(X): 
   # given a scalar X, return some value (a real number) 
   Y = (X - 1.5)**2 + 0.5 
   print "X = {}, Y = {}".format(X,Y)  # for tracing 
   return Y 
 
def test_run(): 
   Xguess = 2.0  # just a guess 
   min_result = spo.minimize(f, Xguess, method='SLSQP', options={'disp':True}) 
   print "minima found at: " 
   print "X = {}, Y = {}".format(min_result.x, min_result.fun) 
 
   # plotting 
   Xplot = np.linspace(0.5, 2.5, 21) 
   Yplot = f(Xplot) 
   plt.plot(Xplot, Yplot) 
   plt.plot(min_result.x, min_result.fun, 'ro') 
   plt.title("Minima of an objective function") 
   plt.show() 
 
if __name__ == "__main__": 
   test_run() 
------------------------------------ 
 
===> output as below 
 
X = [ 2.], Y = [ 0.75] 
X = [ 2.], Y = [ 0.75] 
X = [ 2.00000001], Y = [ 0.75000001] 
X = [ 0.99999999], Y = [ 0.75000001] 
X = [ 1.5], Y = [ 0.5] 
X = [ 1.5], Y = [ 0.5] 
X = [ 1.50000001], Y = [ 0.5] 
Optimization terminated successfully.    (Exit mode 0) 
            Current function value: [ 0.5] 
            Iterations: 2 
            Function evaluations: 7 
            Gradient evaluations: 2 
minima found at: 
X = [ 1.5], Y = [ 0.5] 
 
===> works great !!! 
 
 
 
 
## 
##  convex problems 
## 
- a class of problems easy for optimizers to solve. 
- "a real-valued function defined on an interval is called convex (or convex downward or concave upward) if the line segment between any two points on the graph of the function lies above or on the graph" i.e. it has a single local minimum, which is also the global minimum 
(see the video for visual examples) https://www.youtube.com/watch?time_continue=69&v=7QmGj1_i3MU 
 
(ref) https://en.wikipedia.org/wiki/Convex_function 
 
NOTE: it doesn't have to be 2-dimensional; convexity applies to functions of any number of variables 
 
 
## 
##  building a parameterized model 
## 
 
e.g.   f(x) = ax + b   # a = slope, b = y_intercept, we call them 'coefficients' or 'parameters' in this context 
 
you want to best fit the line to given data points (like a x,y scatter plot), using the least square method. 
 
----------------------------// code example (complete) 
# (code example) https://www.youtube.com/watch?v=dCBX0nAvsJg    # 1d polynomial 
""" fit a line to data points that minimize the error """ 
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import scipy.optimize as spo 
 
def error(line, data): 
   """ compute error between given line model and observed data 
   parameters 
   ---------- 
   line: tuple/list/array (c0,c1) where c0 = slope, c1 = y_intercept 
   data: 2D array where each row is a point (x,y) 
   returns error as a single real value. 
   """ 
   # Metric: sum of squared y-axis differences 
   err = np.sum((data[:,1] - (line[0] * data[:,0] + line[1]))**2) 
   return err 
 
def fit_line(data, error_func): 
   """fit a line to given data, using a supplied error function. 
   parameters 
   --------- 
   data: 2D array where each row is a point (x0, y) 
   error_func: function that computes the error between a line and observed data points 
   returns line that minimizes the error_func 
   """ 
   # generate initial guess for line model 
   line_guess = np.float32([0, np.mean(data[:,1])]) # slope = 0, intercept = mean(y value) 
 
   # plot initial guess (optional) 
   x_ends = np.float32([-5,5]) 
   plt.plot(x_ends, line_guess[0] * x_ends + line_guess[1], 'm--', linewidth=2.0, label="initial guess") 
 
   # call optimizer to minimize error function 
   result = spo.minimize(error_func, line_guess, args=(data,), method='SLSQP', options={'disp':True})   # see the syntax for "args=(data,)" 
   return result.x 
 
def test_run(): 
   # define original line 
   line_orig = np.float32([4,2]) 
   print "Original Line: c0 = {}, c1 = {}".format(line_orig[0],line_orig[1]) 
   X_orig = np.linspace(0,10,21) 
   Y_orig = line_orig[0] * X_orig + line_orig[1] 
   plt.plot(X_orig, Y_orig, 'b--', linewidth=2.0, label="original line") 
 
   # generate noisy data points 
   noise_sigma = 3.0 
   noise = np.random.normal(0, noise_sigma, Y_orig.shape) 
   data = np.asarray([X_orig, Y_orig + noise]).T 
   plt.plot(data[:,0], data[:, 1], 'go', label="Data points") 
 
   # try to fit a line to this data 
   line_fit = fit_line(data,error) 
   print "fitted line: c0 = {}, c1 = {}".format(line_fit[0], line_fit[1]) 
   plt.plot(data[:,0], line_fit[0] * data[:,0] + line_fit[1], 'r--', linewidth = 2.0, label='fitted line') 
 
   plt.legend(loc='upper left') # (ref) http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend 
   plt.title("fitted line") 
   plt.show() 
 
if __name__ == "__main__": 
   test_run() 
---------------------------------------- 
=====> output as below 
 
Original Line: c0 = 4.0, c1 = 2.0 
Optimization terminated successfully.    (Exit mode 0) 
            Current function value: 160.511686532 
            Iterations: 5 
            Function evaluations: 24 
            Gradient evaluations: 5 
fitted line: c0 = 4.32064306468, c1 = -0.863187705267 
 
 
--------------------------------------- // more code example (partial) 
# (code example) https://www.youtube.com/watch?v=pmoEqFfKTSE    # N degree polynomial 
 
def error_poly(C, data): 
   """ compute error between given polynomial model and observed data 
   parameters 
   ---------- 
   C: numpy.poly1d object or equivalent array representing polynomial coefficients 
   data: 2D array where each row is a point (x,y) 
   returns error as a single real value. 
   """ 
   # Metric: sum of squared y-axis differences 
   err = np.sum((data[:,1] - np.polyval(C, data[:,0]))**2) 
   return err 
 
def fit_poly(data, error_func): 
   """fit a polynomial to given data, using a supplied error function. 
   parameters 
   --------- 
   data: 2D array where each row is a point (x0, y) 
   error_func: function that computes the error between a polynomial and observed data points 
   returns a polynomial that minimizes the error_func 
   """ 
   # generate initial guess for poly model 
   Cguess = np.float32([0, np.mean(data[:,1])]) # degree 1 guess: slope = 0, intercept = mean(y value) 
 
   # plot initial guess (optional) 
   x_ends = np.linspace(-5, 5, 21) 
   plt.plot(x_ends, np.polyval(Cguess,x_ends), 'm--', linewidth=2.0, label="initial guess") 
 
   # call optimizer to minimize error function 
   result = spo.minimize(error_func, Cguess, args=(data,), method='SLSQP', options={'disp':True}) 
   return np.poly1d(result.x) # convert optimal result into a poly1d object 
 
--------------------------------------- 
 
 
(quiz) https://www.youtube.com/watch?v=uIrwpQG-KvI 
 
NOTE: why the 'SLSQP' method?  - it's convenient because we can specify bounds and constraints. but it only finds a local minimum; to search for the global minimum, use one of scipy's global optimization routines. 
(ref) http://docs.scipy.org/doc/scipy/reference/optimize.html#global-optimization 
 
 
 
########################################################## 
####  (1.9) optimizers: how to optimize a portfolio   #### 
########################################################## 
 
given a set of assets and a time period, find an allocation of funds to assets that maximizes performance. 
 
what is performance ? 
- we can choose from a number of metrics: 
-- cumulative return 
-- volatility or risk 
-- risk adjusted return (i.e. Sharpe Ratio) 
 
(quiz) https://www.youtube.com/watch?v=ZGS9xsTxW-w 
 
## 
##  framing the problem 
## 
- provide a function to minimize: f(x) = sharpe ratio * -1   # minimizing negative sharpe ratio = maximizing sharpe ratio 
  where x is an allocation(like a list of percentage) for each stock in the portfolio 
- provide an initial guess for x 
- call the optimizer 
 
## 
##  ranges and constraints 
## 
- ranges (aka bounds): you can limit the range of each decision variable so the optimizer runs quicker 
e.g. allocation can only be between 0 to 1 
 
- constraints: properties of x that must be true. 
e.g. the sum of the allocation must add up to 1.0 
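a minimal SciPy sketch of the framing above (compute_sharpe() is a hypothetical placeholder for the stats built in (1.7); only the bounds/constraints syntax is the point here): 
----------------------------- // optimize_allocs.py (sketch, not the course solution) 
import numpy as np 
import scipy.optimize as spo 

def neg_sharpe(allocs, df): 
    """objective: -1 * sharpe ratio of the portfolio defined by allocs. 
    compute_sharpe() is a placeholder, not a real course function.""" 
    return -1.0 * compute_sharpe(df, allocs) 

def optimize_allocations(df): 
    n = df.shape[1]                             # number of stocks 
    guess  = np.ones(n) / n                     # start with equal weights 
    bounds = [(0.0, 1.0)] * n                   # each allocation between 0 and 1 
    cons   = ({'type': 'eq', 
               'fun': lambda allocs: 1.0 - np.sum(allocs)},)   # allocations must sum to 1.0 
    result = spo.minimize(neg_sharpe, guess, args=(df,), 
                          method='SLSQP', bounds=bounds, constraints=cons) 
    return result.x 
----------------------------- 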
 
 
 
############################################ 
####  Part 2: computational investing   #### 
############################################ 
 
 
########################################################### 
####  (2.1)  so you want to be a hedge fund manager ?  #### 
########################################################### 
 
## 
##  types of funds 
## 
 
          ETF        | Mutual Fund          | Hedge Fund 
------------------------------------------------------------- 
 buy/sell like stock | buy/sell at EOD      | buy/sell by agreement 
 baskets of stocks   | quarterly disclosure | no disclosure 
 transparent         | less transparent     | not transparent 
 3,4 letter ticker   | 5 letter ticker      | no ticker, just the fund name 
 
a hedge fund usually has no more than 100 investors. 
 
 
## 
##  incentives: how are the fund managers compensated? 
## 
 
AUM: assets under mgmt. it means how much money is managed by the fund. 
 
             | expense ratio 
------------------------------------ 
ETFs         | 0.01 ~ 1.00 % of AUM 
mutual funds | 0.5 ~ 3.00 % of AUM 
hedge funds  | 2% of AUM, plus 20% of profit (aka "two and twenty" but recently lowered like one and ten) 
 
(quiz) https://www.youtube.com/watch?v=EQGcoxbfUsI 
 
 
## 
##  investors of hedge funds 
## 
who ? 
- individuals 
- institutions: retirement/pension funds like CalPERS, university foundations 
- funds of funds: like a hedge fund that invests in other several funds 
 
why ? 
- track record 
- simulation/back_test + story 
- good portfolio fit (how fit is your strategy to their objectives) 
 
 
## 
##  hedge fund goals and metrics 
## 
goals 
- beat a benchmark (e.g. how much outperform SP500) 
- absolute return (long/short) 
metrics 
- cumulative return  # val[-1] / val[0] -1 
- volatility         # daily_rets.std() 
- risk/reward        # Sharpe ratio = sqrt(252) * mean(daily_rets - RF) / daily_rets.std() 
                     # RF = risk free rate 
                     # 252 = number of trading days in a year, since we use daily_rets here 
 
## 
##  computational infra requirement of a hedge fund 
## 
- huge DBs 
- robust/high-bandwidth NW connectivity 
- low latency 
- real time processing 
 
 
/--/--[historical data]--[trading algorithm]--[orders]---------[market] 
|  |  [target portfolio]_/                  \_[live portfolio]_/ 
|  |       |                                     | 
|   \_[portfolio optimizer]----------------------/ 
|          | 
|     [N-day forecast] 
|          | 
 \____[forecast (ML) algo] 
           | 
      [info feed] 
 
 
 
################################### 
####  (2.2) Market Mechanics   #### 
################################### 
 
- an order 
-- buy or sell 
-- ticker 
-- # of shares 
-- limit or market 
-- price if limit 
 
- the order book 
-- depth 
 
(quiz) https://www.youtube.com/watch?v=47ebviu1bbA 
 
note: execution price for a limit order can be better than the specified price. 
 
- NBBO 
- primary, secondary mkts 
- if a broker internally match buy and sell orders from its clients, then it still must follow NBBO price, and report the trade price & size to the primary mkt 
- dark pool (inter-broker liquidity pool) 
note: these days 80~90% of retail investor orders get executed at broker/dark pool level, never going to the exchanges for execution. 
 
- colo 
- arbitrage (exploiting and profiting from temporary price deviation from intrinsic value) 
- stop loss/gain   # if a stock price drops/rises to a certain threshold, then sell/buy 
- trailing stop # dynamic stop loss 
- short sell # naked VS located 
 
 
############################################# 
####  (2.3)  what is a company worth ?   #### 
############################################# 
 
(quiz) https://www.youtube.com/watch?v=DynqV_ELgsM 
 
- intrinsic value   # based on future dividend     # future 
- book value        # balance sheet                # current 
- market cap(value) # price * outstanding shares   # supposedly reflects both future and current 
 
## 
##  intrinsic value   (future cash flow based valuation) 
## 
- interest rate 
- discounted cash flow   # $1 today is worth more than $1 in the future (well, we have negative interest rates, so even that is debatable) 
 
present_value = future_value / (1 + interest_rate)^i   # i is the time from present, usually how many years 
 
this interest_rate is called the "discount rate" 
this present_value is essentially the intrinsic value of the future value 
 
===> all this is normally discussed in the context of fixed income maths 
===> but suppose you consider dividend as future fixed income (future_value) 
you get a fixed amount dividend every year. 
 
then you can sum up the present values of all the future dividends, out to infinity. 
 
inf 
 Σ FV/(1 + DR)^i  =  FV/(n-1)       #   n = 1 + DR 
i=1               =  FV/DR 
 
so the dividend divided by the discount rate is the intrinsic value of the company ! 
(it's just one way to look at the company's value) 
 
e.g. 
if dividend is $1 every year, and discount rate is 5%, then the intrinsic value of the company is $20 
 
(quiz)  https://www.youtube.com/watch?v=1GmhJHs2yIY 
 
NOTE: one of the rules of thumb: the rule of 72 
      how many periods does it take to double your money? 
      periods = 72 / rate    # where rate is % gain per period 
      e.g. 3% annual gain. then it takes 24yrs to double your money. 
 
 
## 
##  book value   (current asset based valuation) 
## 
 
def: total assets minus intangible assets and liabilities 
 
  tangible assets: e.g. factories, private jets 
intangible assets: e.g. patents 
liabilities : e.g. a loan from a bank 
 
 
(quiz)  https://www.youtube.com/watch?v=4TESS_ZfqVQ 
 
 
[ALMOST ALWAYS]  book_value < market_cap, because if it were the other way around you could in theory buy the whole company at market cap and sell its assets at book value for a profit. 
 
NOTE: Book value (current asset based valuation) is usually the lower bound; discount it by a further 30-50% for the worst case. 
      In contrast, future cash flow based valuation tends to overestimate, as there can be surprise scandalous events at the company. 
      A combination of the two works well. The maths is the easy part; judging the assumptions (the conservative/aggressive scenario range) is the hard part. 
      Benjamin Graham: margin of safety. If your investment strategy still looks good under the conservative scenario, then go for it. 
 
 
####################################################### 
####   (2.4) CAPM : capital asset pricing model    #### 
####################################################### 
 
a portfolio: a weighted set of assets 
 
Wi : portion of funds in asset i 
 
Σ[abs(Wi)]= 1.0       # note: a short position is also considered a weight just like a long position 
i                     # you can have 75% of funds allocated to IBM, -25% to GOOG 
                      # note: a 'leveraged' portfolio may sum to over 1.0 
 
# return of the portfolio at time t 
Rp(t) = Σ[Wi * Ri(t)] 
        i 
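a tiny numeric sketch of that weighted sum (made-up weights and returns): 
import numpy as np 
w = np.array([0.75, -0.25])   # long 75% of funds in one stock, short 25% in another (abs weights sum to 1.0) 
r = np.array([0.02, 0.01])    # today's returns of the two stocks 
print np.sum(w * r)           # portfolio return today: 0.015 - 0.0025 = 0.0125 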
 
 
(quiz) https://www.youtube.com/watch?v=Wlc1OpIpeDE 
 
 
## 
##  the market portfolio 
## 
 
major indices 
- US: sp500 
- UK: ftse 
- JP: topix, n225 
 

# "Cap weighted" (aka mkt value weighted) index 

 simply the weight is based on its mkt cap 
Wi = mktcap_i / Σ[mktcap_j] 
                j 
 
most major indices are 'cap weighted' 
(ref) https://en.wikipedia.org/wiki/Capitalization-weighted_index#Some_capitalization-weighted_indices 
 

# other weight methods 

- "fundamentals weighted": https://en.wikipedia.org/wiki/Fundamentally_based_indexes 
-- based on company fundamentals e.g.  sales, earnings, book value, cash flow and dividends, number of employees, etc 
 
- "price weighted" : https://en.wikipedia.org/wiki/Price-weighted_index 
-- e.g. n225, Dow Jones Industrial Average 
 
- some ETF is 'equal weighted' 
 
## 
##  the CAPM equation 
## 
 
Rm(t) = return of the market in time t 
 
return of a stock name "i" in time t 
 
Ri(t) = Bi*Rm(t) + Ai(t)      # review "daily return" scatter plot section on Beta and Alpha definitions 
                              # CAPM says the expectation E of Ai (treated as a random variable) is zero 
                              # in reality, Ai is not always zero 
 
# as you can see, CAPM separates mkt component(beta coefficient) and individual stock performance component(alpha intercept) 
 
(quiz) https://www.youtube.com/watch?v=KMvckeTu_Iw 
 
 
## 
##  CAPM(passive) vs Active Mgmt 
## 
- passive : buy index and hold 
- active  : pick stocks (diff weight from index) 
 
recall CAPM equation:  Ri(t) = Bi*Rm(t) + Ai(t) 
 
===> CAPM says Ai(t) is purely "random" and therefore E(A)=0 
===> Active managers believe they can predict Alpha 
 
 
## 
##  CAPM for portfolios 
## 
 
Rp(t) = Σ[Wi * (Bi*Rm(t) + Ai(t))] 
        i 
 
Bp = Σ[Wi * Bi] 
     i 
 
Rp(t) = Bp*Rm(t) + Ap(t)    # CAPM asserts  Ap(t) = 0 
 
Rp(t) = Bp*Rm(t) + Σ[Wi*Ai(t)]    # Active mgrs believe they can predict Ai individually 
                   i 
 
 
(good quiz) https://www.youtube.com/watch?v=YhQDFRDQUfI 
 
 
## 
##  Implications of CAPM 
## 
 
Rp = Bp * Rm + Ap 
 
- E(A) = 0 
- only way to beat market is to choose B 
-- choose high B in upward mkts 
-- choose low B in downward mkts 
- EMH (efficient mkt hypothesis), which CAPM assumes, says you can't predict the mkt 
 
 
## 
##  APT: arbitrage pricing theory 
## 
 
- 1976, Stephen Ross 
- asserts that you can break Beta into smaller components, one per sector (f: finance, t: tech, m: manufacturing, ...), because a particular stock may have different degrees of exposure to different industry sectors 
 
Ri = Bi*Rm + Ai 
   = Bif * Rf  +  Bit * Rt  +  Bim * Rm  + .... + Ai      # here Rf, Rt, Rm are sector returns (finance, tech, manufacturing), not the risk free rate or the overall market 
 
 
############################################## 
##  (2.5)  how hedge funds use the CAPM    ### 
############################################## 
 
recall:  for each stock 
- Beta : observed from historical data (not guaranteed to be the same for the future) 
- Alpha: predicted based on whatever research you do, e.g. ML, horoscope 
 
for positive Alpha stocks: long 
for negative Alpha stocks: short 
 
recall: 
Rp(t) = Σ[Wi * (Bi*Rm(t) + Ai(t))] 
        i 
      = Bp * Rm + Ap 
 
 
(quiz) https://www.youtube.com/watch?v=iOZPvdGdd98 
 
NOTE: even with the perfect prediction of Beta & Alpha, if you mess up the funds allocation(i.e. weight), you can still lose money 
 
e.g. 
stock X: predict Ax = +1%,  Bx = 1.0 , alloc long 0.5 of funds 
stock Y: predict Ay = -1%,  By = 2.0 , alloc short -0.5 of funds 
 
if sp500 goes up +10%, i.e. Rm = +10%,  Rp = (0.5 * (Bx * Rm + Ax)) + (-0.5 * (By * Rm + Ay)) 
                                           = 0.055 - 0.095 
                                           = -0.04 
==> -4% return 
 
if sp500 goes down -10%, i.e. Rm = -10%,  Rp = 0.06 
==> +6% return 
 
==> this means, you wanna reduce mkt risk by choosing weights such that Bp = 0 
then, Rp = Bp * Rm + Ap 
         = Ap 
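e.g. with the stocks above (Bx = 1.0, By = 2.0), weights Wx = 2/3 (long) and Wy = -1/3 (short) give 
Bp = (2/3)*1.0 + (-1/3)*2.0 = 0, and abs(Wx) + abs(Wy) = 1.0, 
so the portfolio return no longer depends on which way the market moves, only on the alphas. 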
 
(quiz) https://www.youtube.com/watch?v=AtzQx3vfsWQ 
 
 
 
###################################### 
####  (2.6)  Technical Analysis   #### 
###################################### 
 
fundamentals : earnings, book value, dividends, economy, etc (more decision complexity) 
technical : price and volume history only -> compute heuristic statistics called "indicators" 
 
(quiz) https://www.youtube.com/watch?v=9deLoyqwVGU 
 

# why technical analysis might work 

- there is information in price 
- heuristics work 
 
 

# when is technical analysis effective ? 

- individual indicators, by themselves, are weakly predictive 
-- combinations of them are stronger 
- look for contrasts (stock VS mkt) 
- works better on shorter time periods (whereas fundamentals work better over longer periods) 
-- faster decision speed 
 
## 
##  indicators 
## 
 
there are many. here we look at a few popular ones (a small pandas sketch of (1)-(3) follows after the normalization note below). 
 
(1) momentum (over a period of time) 
 
  momentum[t] = price[t] / price[t-n] - 1 
 
==> normally between -0.5 ~ 0.5 
 
 
(2) simple N-day moving average (SMA) 
-- smoothed & lagging indicator 
-- when the current price "crosses" the moving average, it may indicate a significant shift in the price movement 
-- a big excursion from the moving average may be considered an arbitrage opportunity. how do you define "a big excursion"? -> bollinger band 
-- normally between -0.5 ~ 0.5 
 
 
(3) bollinger band 
- BB is the band defined as SMA +- 2*std_dev 
- if the price goes beyond (either above or below) the band, then at this point it is considered "deviated" 
- ONLY "after" it comes back toward the band, then we consider it as a signal 
 
 BB[t] = (price[t] - SMA[t]) / 2*std[t]    #  BB[t] = 1.0 ~ -1.0 means within the band 
                                           #  BB[t] > 1.0 means ABOVE the BB 
                                           #  BB[t] < -1.0 means BELOW the BB 
 
(a must see quiz) https://www.youtube.com/watch?v=z3OpSx4q0zM 
 
(4) PE ratio 
- fundamentals indicator 
- normally between 1 to 300 
 
 
====> different indicators have diff 'normal' range. 
====> must normalize them when combining in ML analysis 
 
normed = (values - values.mean()) / values.std()    # e.g. see the sketch below 
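 
a minimal pandas sketch of the indicators above plus the normalization (my own function names, not course code); assumes `prices` is a Series of adjusted close prices indexed by date and uses the newer rolling() API. 
 
--------------- indicators_sketch.py (illustrative) 
import pandas as pd 
 
def momentum(prices, n=20): 
    # momentum[t] = price[t] / price[t-n] - 1 
    return prices / prices.shift(n) - 1 
 
def price_sma_ratio(prices, n=20): 
    # how far the price has excursed from its N-day simple moving average 
    return prices / prices.rolling(window=n).mean() - 1 
 
def bollinger(prices, n=20): 
    # BB[t] = (price[t] - SMA[t]) / (2 * std[t]);  |BB| > 1 means outside the band 
    sma = prices.rolling(window=n).mean() 
    std = prices.rolling(window=n).std() 
    return (prices - sma) / (2 * std) 
 
def normalize(values): 
    # different indicators have different 'normal' ranges -> standardize before combining 
    return (values - values.mean()) / values.std() 
--------------- 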
 
 
 
 
##################################### 
####  (2.7) Dealing with Data    #### 
##################################### 
 
we analyze open, high, low, close for an interval of choice 
 
## 
##  interval 
## 
 
tick-by-tick 
minute 
hourly 
daily     # we only deal with daily data in this course 
weekly 
monthly 
quarterly 
 
 
## 
##  adjustment 
## 
- stock split : happens when the price gets too high (splitting increases liquidity, e.g. in the options mkt, because one option contract is for 100 shares) 
- dividend 
 
==> use the corporate-action-adjusted close price, so you can treat the whole price history as one continuous series 
==> this means adjusted close values are relative to the current day: if you look at the adjusted close of a stock for 20101023 with current_day=20120511, and then look at the same 20101023 adj close again with current_day=20150213, you will get a different value, because it got further adjusted for any corp action that happened after 20120511 and by 20150213. 
 
 
(quiz) https://www.youtube.com/watch?v=bwfQ7roGcew 
(quiz) https://www.youtube.com/watch?v=1NSIHspARlw 
(quiz) https://www.youtube.com/watch?v=fRNBuERr7nI 
 
 
## 
##  survivor bias 
## 
- 68 names that were in the sp500 as of 2007 underperformed and were dropped by 2009 
- if you backtest over 2007-2012 data using the sp500 universe as of 2012, you only deal with survivors, which obviously performed better than the stocks that didn't survive. include the non-survivors and your test won't perform as well as the survivor-only-universe version; that difference is called "survivor bias" 
- you must use survivor-bias-free data 
 
 
## 
##  latency 
## 
 
## 
##  format 
## 
- ascii vs binary 
- fix protocol 
- compressed 
- proprietary protocol 
- api, csv availability, etc 
 
 
 
#################################################### 
####  (2.8) EMH : efficient market hypothesis   #### 
#################################################### 
 
## 
##  EMH assumptions 
## 
- large number of investors in it for profit 
- new info arrives randomly 
- prices adjust quickly 
- prices reflect all available info 
 
## 
##  where does info come from ? 
## 
- price/volumes 
- fundamental 
- exogenous   # info about the world that affect the stock/company, e.g. oil price rise -> airline stock drops 
- company insiders 
 
## 
##  3 forms of EMH 
## 
- weak: future prices cannot be predicted by analyzing historical prices (fundamentals are still effective) 
- semi-strong: prices adjust rapidly to new public information (even fundamentals are not useful, but insider info still is) 
- strong: prices reflect all info public and private (even insider info is useless, already reflected) 
 
(quiz) https://www.youtube.com/watch?v=5XNiMlYnB2k 
 
## 
##  is EMH correct? 
## 
- some hedge funds' performance seems to refute the stronger forms of EMH 
- we have seen insider traders make profits (and go to jail, incidentally) 
- history shows lower P/E ratios correlate with higher returns 
- if EMH is true, no point in active investment, BUT active investor/mgr may be precisely the agents who make a market efficient. 
 
(PE ratio and annual return) https://www.youtube.com/watch?v=_MnGekooUQU 
 
 
################################################################ 
####  (2.9)  The Fundamental Law of Active Portfolio Mgmt   #### 
################################################################ 
 

# Grinold's fundamental law (of active portfolio mgmt) 

 
performance: aka IR(information ratio)   # how much you can outperform the mkt, e.g. SP500 
skill      : aka IC(information coefficient) 
breadth    : opportunities, applications/methods to apply your skill to pick stocks 
 
performance = skill * sqrt(breadth) 
         IR = IC * sqrt(BR) 
 
recall  Rp(t) = Bp * Rm(t) + Ap(t)    #  Bp * Rm(t) : mkt component 
                                      #       Ap(t) : skill component 
 
IR:information ratio = Sharpe Ratio of excess return (ret due to the skill component) 
 
IR = mean(Ap(t)) / stdev(Ap(t))      #  mean(Ap(t)) : reward component 
                                     #  stdev(Ap(t)): risk component 
 
IC:information coefficient = correlation of forecasts to returns (val btwn 0 to 1)  # 0 means no correlation 
BR:breadth = number of trading opportunities per year 
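 
a small sketch (my own, with assumed variable names) of estimating IR from daily portfolio and market returns, given the portfolio beta; the sqrt(252) annualization is an assumption, not from the lecture. 
 
--------------- information_ratio_sketch.py (illustrative) 
import numpy as np 
 
def information_ratio(port_ret, mkt_ret, beta_p, periods_per_year=252): 
    """IR = mean(Ap) / stdev(Ap), where Ap(t) = Rp(t) - Bp*Rm(t).""" 
    alpha = np.asarray(port_ret) - beta_p * np.asarray(mkt_ret)   # the skill component 
    return np.sqrt(periods_per_year) * alpha.mean() / alpha.std() 
--------------- 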
 
 
 
## 
## the coin flipping casino 
## 
- flip biased coins: P(heads) = 0.51. the 0.51 edge is like alpha 
- the uncertainty of each flip's outcome is like beta 
 
- betting N tokens 
-- win: 2*N 
-- lose: 0 
 
- casino 
-- 1000 tables (each table flips a coin) 
-- 1000 tokens 
-- game runs in parallel 
 
(quiz) https://www.youtube.com/watch?v=0pI3vLnykyI 
 
## 
##  betting 1000 tokens on a table(= a single coin) VS betting 1 token per table times 1000 
## 
- expected return 
-- single bet : 0.51 * 1000 + 0.49 * -1000 = 20 tokens 
-- multi bet  : 1000 * (0.51 * 1 + 0.49 * -1) = 20 tokens 
 
- risk 1: probability of losing it all 
-- single bet: 49% 
-- multi bet : 0.49^1000 = almost zero 
 
- risk 2: stddev of individual bets 
-- single bet: 31.62     # first bet result is 1000 or -1000, and the rest is all 0 
-- multi bet : 1.0       # every bet result is 1 or -1 
 
===> expected return is the same, but risk is reduced in multi bet 
 
(quiz) https://www.youtube.com/watch?v=DNUe2fQ7WmI 
 
lets define "sharpe ratio" in this context. 
SR_single_bet : 20/31.62 = 0.63 
SR_multi_bet  : 20/1     = 20 
 
SR_multi_bet = SR_single_bet * sqrt(num_of_bets) 
          20 = 0.63 * sqrt(1000) 
 
===> can be interpreted as 
 
 performance = skill * sqrt(breadth) 
 
===> fundamental law of active portfolio mgmt observed! 
 
 1. higher alpha generates higher sharpe ratio 
 2. more execution opportunities provide higher sharpe ratio 
 3. sharpe ratio grows as sqrt(breadth) 
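 
a quick numpy sanity check of the numbers above (my own sketch): one run of the 1000-table casino with a 0.51 coin and 1000 tokens. 
 
--------------- coin_casino_sketch.py (illustrative) 
import numpy as np 
 
np.random.seed(0) 
n_tables, tokens, p_win = 1000, 1000, 0.51 
 
exp_ret = p_win * tokens - (1 - p_win) * tokens               # 20 tokens either way 
 
outcome = np.where(np.random.rand(n_tables) < p_win, 1, -1)   # +1 win / -1 lose per table 
 
multi_bets = outcome.astype(float)                            # 1 token on every table 
single_bets = np.zeros(n_tables) 
single_bets[0] = tokens * outcome[0]                          # all 1000 tokens on table 0 
 
print(exp_ret / single_bets.std())     # SR_single ~ 20 / 31.6 = 0.63 
print(exp_ret / multi_bets.std())      # SR_multi  ~ 20 / 1.0  = 20 
print(0.63 * np.sqrt(n_tables))        # ~20 : SR_multi = SR_single * sqrt(breadth) 
--------------- 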
 
 
## 
##  applied to real world 
## 
- RenTec hedge fund trades 100k times /day 
- Warren Buffett holds 120 stocks for a looong period 
==> both similar performance 
==> can a single theory relate these two ? 
 
(quiz) https://www.youtube.com/watch?v=A7rNXQHpqqs 
- both RenTec and Buffett have the same IR 
- RenTec's algo is 1/1000 as smart as Buffett's (i.e. IC_r = IC_b/1000) 
- Buffett trades 120 times /year 
- how many trades must RenTec execute? 
 
IC_b * sqrt(120) = IC_r * sqrt(x) 
                 = IC_b/1000 * sqrt(x) 
1000 * sqrt(120) = sqrt(x) 
     120,000,000 = x 
 
 
#################################################################### 
####  (2.10) portfolio optimization and the efficient frontier  #### 
#################################################################### 
 
given: 
- set of equities 
- target return 
 
find: 
- allocation to each equity that minimizes risk 
 
## 
##  what is risk?   (revisited) 
## 
for our purpose: risk == volatility == stdev of historical daily returns 
(it's just one way.) 
 
you can plot for each stock, for its (risk,ret) coordinate. 
 
  | 
r |       . 
e |   .. 
t |   .  . 
  |  . 
  |------------- 
      risk 
 
===> suppose you create a portfolio out of it, then you can plot your portfolio's (risk,ret) in the graph as well 
 
(quiz) https://www.youtube.com/watch?v=t_dHxckcMW4 
 
NOTE: stddev is the square root of the mean of the squared deviations of the per-period returns from their mean over the sample. 
 
## 
##  the importance of "covariance" 
## 
- Harry Markowitz's discovery (Nobel Prize winner) 
- by carefully choosing allocations, your portfolio's risk can be smaller than that of any of its constituent stocks 
- stock + bond blend can have lower risk than only stock or only bond portfolio 
 
covariance of one stock's returns with another's, normalized by their stdevs == correlation coefficient 
 
assuming you pick stocks that have similar returns, and then if you can combine anti-correlated stocks in overall equal weights, then your portfolio still keeps the same return with significantly reduced volatility 
 
(visual) https://www.youtube.com/watch?v=qOl04hw7f9g 
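 
a tiny numpy illustration of the point (made-up numbers): two assets with the same mean return and volatility but perfectly anti-correlated daily returns, blended 50/50. 
 
--------------- anti_correlation_sketch.py (illustrative) 
import numpy as np 
 
np.random.seed(0) 
a = np.random.normal(0.001, 0.01, 1000)   # asset A daily returns 
b = 0.002 - a                             # asset B: same mean & stdev, correlation = -1 
port = 0.5 * a + 0.5 * b                  # 50/50 blend 
 
print(a.mean(), a.std())                  # ~0.001, ~0.01 
print(b.mean(), b.std())                  # ~0.001, ~0.01 
print(port.mean(), port.std())            # 0.001, 0.0  <- same return, (virtually) no volatility 
--------------- 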
 
 
## 
##  MVO: mean variance optimization 
## 
 
the idea is to pick stocks that are anti-correlated (negatively correlated: when one goes down, the other goes up) in the short term, and positively correlated in the long term. 
 
[input] 
- expected return (of each stock) 
- target return (of portfolio. obviously min & max bounds are those of min & max ret individual stocks) 
- volatility (of each stock) 
- covariance (matrix of correlation between each asset and every other asset) 
 
[output] 
- asset weights for portfolio that minimizes risk (and achieves the target return) 
 
 
## 
##  the efficient frontier 
## 
 
  |           ... 
r |       ... 
e |    ... 
t |   . 
  |    . 
  |----------------- 
        risk 
 
===> for any target return, there is an optimal portfolio that gives a particular risk value 
===> you can plot a curve of those optimal (ret,risk) points, called "efficient frontier"  (see the video) 
     (incidentally, when you really go lower on target ret, then your optimal risk goes up as you cannot really diversify) 
===> the point where the line from the origin is tangent to the frontier curve gives the portfolio with the (theoretical) max sharpe ratio 
 
(must must see visual) https://www.youtube.com/watch?v=vnAbsNN3SbA 
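 
a rough scipy sketch of mean variance optimization (not the course's provided optimizer): minimize portfolio variance subject to a target return, full investment, and long-only weights. the expected returns and covariance matrix are made-up inputs. 
 
--------------- mvo_sketch.py (illustrative) 
import numpy as np 
import scipy.optimize as spo 
 
mu = np.array([0.08, 0.10, 0.12])            # expected annual returns (made up) 
cov = np.array([[0.04, 0.01, 0.00],          # covariance matrix (made up) 
                [0.01, 0.09, 0.02], 
                [0.00, 0.02, 0.16]]) 
target = 0.10 
 
def port_var(w): 
    return w.dot(cov).dot(w)                 # portfolio variance 
 
cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},        # fully invested 
        {'type': 'eq', 'fun': lambda w: w.dot(mu) - target})   # hit the target return 
bounds = [(0.0, 1.0)] * len(mu)              # long only 
w0 = np.ones(len(mu)) / len(mu) 
 
res = spo.minimize(port_var, w0, method='SLSQP', bounds=bounds, constraints=cons) 
print(res.x)                                 # optimal allocations 
print(np.sqrt(port_var(res.x)))              # the minimized portfolio stdev (risk) 
--------------- 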
 
 
 
################################################# 
####   (2.11)  building a market simulator   #### 
################################################# 
 
assume MOC (mkt on close) orders in this simulator. # we will use adjusted close. 
 
orders.txt : each row contains  date(=yyyy-mm-dd),symbol(e.g. AAPL),action(=buy or sell),shares 
 
take care of the accounting of the portfolio, i.e. keep track of its value. 
 
## 
## steps 
## 
 
(1) read in historical data. pandas dataframe, call it 'prices' 
e.g. 
 
date, symbol_0, symbol_1, symbol_2,,,  symbol_N, cash 
------------------------------------------------------ 
start_date, [adjusted close price of each symbol], 1.0   # cash is always 1.0 
 .. 
 .. 
 .. 
end_date 
 
 
(2) build another DF 'trades' 
e.g. 
 
date, symbol_0, symbol_1,,, symbol_N, cash 
------------------------------------------                                     N 
sdate, [how many shares we buy/sell], [how much cash you pay on that date]  #  Σ (price_i * size_i) = cash 
 ..   e.g. +100, -400 for each symbol                                       # i=1 
 ..                                                                         # obviously, you get cash when you sell shares, vice versa 
 .. 
edate 
 
 
(3) build the 3rd DF 'holdings' 
- basically you can create this from 'prices' & 'trades' 
e.g. 
 
date, symbol_0, symbol_1,,, symbol_N, cash 
------------------------------------------ 
sdate, [how many shares you hold as of yyyymmdd], [how much cash you hold as of yyyymmdd] 
 .. 
 .. 
 .. 
edate 
 
 
(4) build the 4th DF 'values'    #  values = prices * holdings 
- we want to calc the value of each asset for each date. 
e.g. 
 
date, symbol_0, symbol_1,,, symbol_N, cash                                         port_val 
------------------------------------------                                         -------- 
sdate,                                     # sum all column values in this row -->   xyz 
 ..                                        # sum(axis=1) does this (sums across the columns, per row) 
 ..                                        # sum(axis=0) sums down the rows, per column 
 .. 
edate, 
 
NOTE: obviously, on start_date, total values sum up to your initial fund. 
NOTE: 20110615  # ignore orders (don't execute/trade any, though there may be holdings) on this date. 
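 
a rough pandas sketch of steps (2)-(4) (my own names, not the project template); it assumes `prices` already has a 'cash' column of 1.0 and `orders` has columns [date, symbol, action, shares] with dates that appear in the prices index. 
 
--------------- marketsim_sketch.py (illustrative) 
import pandas as pd 
 
def simulate(prices, orders, start_cash=1000000): 
    # (2) 'trades': share changes per symbol per date, plus the cash each order moves 
    trades = pd.DataFrame(0.0, index=prices.index, columns=prices.columns) 
    for date, sym, action, shares in orders.itertuples(index=False): 
        sign = 1 if action == 'BUY' else -1 
        trades.loc[date, sym] += sign * shares 
        trades.loc[date, 'cash'] += -sign * shares * prices.loc[date, sym] 
 
    # (3) 'holdings': cumulative shares / cash held as of each date 
    holdings = trades.cumsum() 
    holdings['cash'] += start_cash 
 
    # (4) 'values' = prices * holdings; port_val sums each row across its columns 
    values = prices * holdings 
    return values.sum(axis=1) 
--------------- 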
 
 
## 
##  leverage 
## 
 
intuitive definition:  your investment in the mkt / liquidation value of your portfolio 
 
formal definition: Σ(|investment|) / Σ(investments + cash)     # normally this has to be less than 2.0 
 
NOTE: in theory, leverage can go negative, but to get there, it has to pass thru an infinitely big leverage, so in practice, your broker will liquidate your portfolio before leverage can get negative. 
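 
a quick numeric check of the formula (made-up positions): 
 
--------------- leverage_example.py (illustrative) 
# long $50,000 of one stock, short $20,000 of another, $30,000 cash remaining 
positions = [50000.0, -20000.0] 
cash = 30000.0 
 
leverage = sum(abs(p) for p in positions) / (sum(positions) + cash) 
print(leverage)    # 70000 / 60000 = 1.17, i.e. under the usual 2.0 limit 
--------------- 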
 
 
 
######################################### 
####  Part 3:  ML algo for trading   #### 
######################################### 
 
################################################################### 
####  (3.1) intro: how ML is used in a financial institution   #### 
################################################################### 
 
data ->[ML] 
        | 
X -> [Model] -> Y 
 
X : observation of the world. can be multi-dimensional (e.g. bollinger band, PE ratio) 
Y : prediction. normally a single dimension. like a future stock price, a future portfolio val. 
 
(quiz) https://www.youtube.com/watch?v=N76YcGfn4HY 
 
 
## 
##  Supervised Regression Learning 
## 
- one flavour of ML techniques 
- "supervised" == provide example x,y (to train) 
- "regression" == numerical prediction|approximation (as opposed to classification) 
- "learning"   == training with data 
 
there are many algorithms that solve supervised regression learning problems 
e.g. 
- linear regression (parametric learning) : finds parameters for a model 
- KNN: k nearest neighbor (instance based): it keeps data it used (hence instance based) 
- decision trees  : tree-structure of node representing a question, leaf representing an answer value 
- decision forests: combinations of decision trees 
 
NOTE: often "regression" in our context i.e. linear regression becomes synonymous with "minimize the sum of squared error" but "regression" in its purest term is simply numerical approximation 
 
e.g.  regression 
a person is female, height X, wears a size Y t-shirt, size Z pants, length V hair, then what is her weight? 
 
e.g.  classification 
it is yellow, weighs X pounds, Y feet high, Z feet width, four legs, then what animal is this? 
 
 
## 
##  how it works with stock data 
## 
 
X = measurable quantities about the company that can be predictive of Y. 
  = the usual pandas dataframe where index_row=yyyymmdd, index_column=ticker, the val can be close price, PE ratio, price momentum, bollinger band, etc (obviously how we choose what we should use for X is another important topic) 
 
Y = the future price we want to predict (for training, the Y values come from historical prices) 
 
so based on X, your model may predict the 5-day-ahead price, and you can use historical data to train your model. 
 
## 
##  backtest 
## 
 
you specify the training period, and measure performance (portfolio val/ret, etc) for a chosen symbol universe 
and roll fwd to the current day. 
e.g. 
you can train your model based on 2009 data, and test against 2011 data, and model generates orders and you can measure the performance. 
 
## 
##  problems with regression 
## 
- noisy and uncertain 
- challenging to estimate confidence 
- ideal holding period, allocation 
 
==> policy learning via reinforcement learning offers improvement 
 
 
############################## 
####  (3.2)  Regression   #### 
############################## 
 
## 
##  parametric regression 
## 
 
building a model represented with parameters. 
 
we already know this. recall x,y scatter plots 
e.g. 
x = change in atmosphere pressure 
y = amount of rain 
 
you plot data points, and fit an N-th degree polynomial. (if 1D, then it is a line fitting) 
once you get your coefficients, e.g. y = c0 * x^3 + c1 * x^2 + c2 * x + c3 
then you can throw away data points, only keep the coefficients(aka parameters) which you use next time you predict a Y from a given X 
 
(visual) https://www.youtube.com/watch?v=6EC1w_fs5u8 
 
 
## 
##  KNN : K nearest neighbor  (instance based regression) 
## 
 
very simple. 
 
we keep the data points, and use them when we predict. (aka data centric, instance based approach) 
given a new X, you find K nearest known Xs in your collection of data points, then take the mean of their Y values 
 
if we weigh each Y based on the distance of the Xs to your target X, then it is called "Kernel regression" 
 
(good visual) https://www.youtube.com/watch?v=CWCLQ6eu2Do 
(good visual) https://www.youtube.com/watch?v=ZhJTGBbR18o 
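 
a minimal numpy sketch of a KNN query (Euclidean distance, unweighted mean of the K nearest Ys); a kernel-regression variant would weight each Y by its distance instead. 
 
--------------- knn_query_sketch.py (illustrative) 
import numpy as np 
 
def knn_predict(Xtrain, Ytrain, x_query, k=3): 
    dist = np.sqrt(((Xtrain - x_query) ** 2).sum(axis=1))   # distance to every stored instance 
    nearest = np.argsort(dist)[:k]                          # indices of the k closest points 
    return Ytrain[nearest].mean()                           # plain KNN: unweighted mean of their Ys 
--------------- 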
 
 
## 
##  parametric VS non-parametric learner 
## 
 
depends on the nature of the problem. 
do you have a guess about the underlying mathematical model? 
yes -> parametric ("biased": we start from an assumed model) 
no  -> non-parametric ("unbiased": no assumed model) 
 
(good quiz) https://www.youtube.com/watch?v=PVOWHYJV8P4 
 
parametric: 
- no need to store data, space efficient, 
- but new evidence comes then you have to re-do the whole training 
- thus training is slow, but query is fast   # faster than BDD 
 
non-parametric: 
- need to store data, space consuming 
- but able to quickly add new evidence 
- training is fast (no need to learn new parameters), query can be "potentially" slow 
- suitable for complex problems where we are not sure of the underlying mathematical model 
 
 
## 
##  training and testing 
## 
- "out of sample testing" : we separate the data we train on and test on. 
- because if we train and test on the same data, obviously the result will be very good. 
- we call the X & Y data we trained on "Xtrain" "Ytrain", and we define "Xtest" and "Ytest" accordingly 
- does the output of your model give the similar result to "Ytest" when given "Xtest" ? 
 
## 
##  learning API example 
## 
 
#  for linear regression 
learner = LinRegLearner() 
learner.train(Xtrain,Ytrain) 
y = learner.query(Xtest) 
assess(y)                 # either compare(y,Ytest) or measure_error(y) 
 
 
#  for KNN 
learner = KNNLearner(k=3) 
learner.train(Xtrain,Ytrain) 
y = learner.query(Xtest) 
assess(y)                 # either compare(y,Ytest) or measure_error(y) 
 
 
e.g. 
 
class LinRegLearner(object):    # KNNLearner exposes the same train/query interface 
    def __init__(self): 
        pass 
    def train(self, X, Y): 
        self.m, self.b = favorite_linreg(X, Y)   # any line-fitting routine, e.g. np.polyfit(X, Y, 1) 
    def query(self, X): 
        y = self.m * X + self.b 
        return y 
 
 
################################################## 
####  (3.3)  Assessing a learning algorithm   #### 
################################################## 
 
(good visual of KNN) https://www.youtube.com/watch?v=tJzbX_Pqxx4 
(must do quiz) https://www.youtube.com/watch?v=1M7Wppz-ZIU 
(must do quiz) https://www.youtube.com/watch?v=hv6QQRAYcFo 
 
Linear Regression (parametric model) 
- as we increase the degree of polynomial, we overfit 
- LinReg is better than KNN when it comes to outer edge samples as it can extrapolate 
 
KNN 
- as we decrease K, we overfit 
- KNN is weaker than LinReg when it comes to outer edge samples as it cannot extrapolate 
 
(good quiz) https://www.youtube.com/watch?v=jGJHxyfI10I 
 
 
## 
##  metric 1: RMS error 
## 
- RMS = root mean square 
- one way to approximate the average error, but emphasis on larger errors (because we square the error) while getting rid of -/+ signs 
 
RMSE = sqrt(sum( (Ytest-Ypredict)^2 ) / N) 
 
(video) https://www.youtube.com/watch?v=sA1K22Hmh1g 
 
==> you can measure RMSE on both Ytrain(= in-sample) and Ytest(out-of-sample) 
==> obviously RMSE will be larger on out-of-sample data points, cos the model has not seen them before. 
 
NOTE: 
- RMSE can be 0 to infinity 
- one other valid way is to take the abs value of the error, but that does not emphasize larger errors like RMS does. 
- also, mathematically, least squares is preferable to minimizing the absolute value function because the squared error is differentiable everywhere, which makes taking first derivatives easier. 
(ref) http://stats.stackexchange.com/questions/46019/why-squared-residuals-instead-of-absolute-residuals-in-ols-estimation 
(ref) http://math.stackexchange.com/questions/967883/why-get-the-sum-of-squares-instead-of-the-sum-of-absolute-values 
 
 
## 
##  cross validation 
## 
- train and test 
commonly, we split the data points 60% for train, 40% for test 
also, if you slice the data into several sets, you can pick some for training and the rest for testing. this lets you assess your algo even when there aren't many data points. 
 
in case of financial time series data, we must avoid look-ahead bias. 
"roll foward cross validation" means your training data is (chronologically) BEFORE testing data. 
 
 
## 
##  metric 2: correlation 
## 
- correlation between Ytest & Ypredict 
e.g. 
np.corrcoef()     # gives -1 to 1 
                  # -1 negative correlation 
                  # 0  no correlation 
                  # 1  positive correlation 
 
(ref) http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html 
 
===> overall, in general, the higher the correlation, the lower the RMS error 
 
(good quiz) https://www.youtube.com/watch?v=1Cq3MjlDJw4 
 
(how to calc correlation) https://www.mathsisfun.com/data/correlation.html 
        n                                        n                    n 
 Corr = Σ[(x_i - x_ave)*(y_i - y_ave)]  /  sqrt( Σ[(x_i - x_ave)^2] * Σ[(y_i - y_ave)^2] ) 
        i                                        i                    i 
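 
both metrics in a few lines of numpy (my own sketch): 
 
--------------- assess_sketch.py (illustrative) 
import numpy as np 
 
def rmse(Ytest, Ypredict): 
    return np.sqrt(((Ytest - Ypredict) ** 2).mean()) 
 
def correlation(Ytest, Ypredict): 
    return np.corrcoef(Ypredict, Ytest)[0, 1]    # off-diagonal entry of the 2x2 corr matrix 
--------------- 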
 
NOTE: when constructing a portfolio of stocks, to mitigate risk (one metric is volatility, i.e. standard dev), you want the stocks to be uncorrelated with each other (otherwise it's no better than a portfolio with one stock). Bonds and stocks are normally relatively uncorrelated, but when the macro environment goes down they suddenly become correlated, so the portfolio needs to be re-optimized frequently. i.e. in a market meltdown nothing rises except correlation. 
 
 
## 
##  Overfitting 
## 
- for in-sample data, as you increase the degrees of freedom (polynomial degree), the error decreases. 
 
e | 
r | __ 
r |   \_ 
o |     \_ 
r |       \ 
  |        ------- 
  ------------------- 
   degree of freedom 
 
 
===> now, think about out-of-sample data: it's likely that, to some extent, increasing the degrees of freedom helps. 
     but past a threshold, its error increases. beyond that inflection point is called the "overfitting" zone 
 
e | 
r | __          / 
r |   \_      _/ 
o |     \_  _/ 
r |       -- 
  | 
  ------------------- 
   degree of freedom 
 
 
"overfitting" definition: in-sample error decreasing while out-of-sample error increasing 
 
====> for KNN, the graph runs in the other direction (decreasing K increases the degrees of freedom), but conceptually it's the same 
 
(good visual) https://www.youtube.com/watch?v=mfzHchd5La8 
(good quiz) https://www.youtube.com/watch?v=Xu6xYBcXxaQ 
 

# 'underfit' 

- i.e. over-generalization 
- e.g. given 1000 data points, if you simply take and return the average(or median), that is a good example. 
 
 
############################################### 
####   (3.3.5)  Decision Trees (part 1)    #### 
############################################### 
 
(video) https://www.youtube.com/watch?v=OBWL4oLT7Uc 
 
A Decision Tree    # we assume "binary" in this course 
- factors : x1, x2, x3,,,, xn    # factors aka "attributes" 
- labels  : y 
- nodes   : factors are used, split_value, left_link, right_link 
-- root node 
-- branches 
-- leaves 
 
NOTE: not all factors need to be used. and the same factors can be repeatedly used. 
NOTE: obviously we prefer a balanced tree to guarantee O(log_2(n)) 
 
x2  = 0.98 
x8  = 0.997 
x10 = 0.55 
x11 = 9.7 
 
                    root                    # suppose nodes are numbered 
                 (x11 < 10)                 # node_0 
             yes/          \no 
          (x2 < 0.5)       (x11 < 12)       # node_1 node8 
       yes/       \no    yes/      \no 
   (x10 < 0.5) (x8 < 1)   [4]      [5]      # node_2 node_5 node_9 node_10 
  y/       \n  y/    \n 
 [3]       [4][3]    [4]                    # node_3 node_4 node_6 node_7 
 
 
====> how do we represent this in a data structure ? 
====> a numpy ndarray works. (don't use a pure OOP node/pointer representation, which will be computationally less efficient) 
 
node_number factor split_val left_link right_link(=link to a node number) 
-------------------------------------------------- 
    0(root)  x11     10.0        1          8     # NOTE:here we use absolute node number 
    1        x2       0.5        2          5     # but we can use a relative reference like right_link = left_link + K 
    2        x10      0.5        3          4     # as long as you know the size of the left sub tree 
                      ..                          # in fact, this is easier for array-based tree implementation 
                      ..                          # if you use depth-first node numbering, then lefttree is always +1 from the current node 
                      ..                          # righttree is lefttree.size + 1 
 
 
############################################### 
####   (3.3.6)  Decision Trees (part 2)    #### 
############################################### 
 
(video) https://www.youtube.com/watch?v=WVc3cjvDHhw 
 
a recursive algorithm 
 
here a node consists of [factor, splitval, lefttree_link, righttree_link]  and its position in the array is the node number. 
and that node number is what we specify by 'lefttree_link' and 'righttree_link' 
e.g. 
you may have a tree: 
[[factor, splitval, lefttree_link, righttree_link]   # node 0 
[factor, splitval, lefttree_link, righttree_link]    # node 1 
[factor, splitval, lefttree_link, righttree_link]    # node 2 
[factor, splitval, lefttree_link, righttree_link]]   # node 3 
 
---- pseudo code ---- 
build_tree(data): 
   if data.shape[0] <= leaf_size: return [NA, data.y, NA, NA]  # 1 = leaf_size,  data.y should be the mean 
   if all data.y same: return [NA, data.y, NA, NA]             # factor, left/right trees are NA in this case (you can use -1 in code) 
   else: 
      determine best feature(==factor) i to split on       # diff approaches exist: entropy, correlation, gini index, etc 
      splitval  = data[:,i].median()                       # median to split the tree in half as much as possible 
      lefttree  = build_tree(data[data[:,i] <= splitval])  # a numpy technique called "comprehension" 
      righttree = build_tree(data[data[:,i]  > splitval])  # take all data whose factor i values are <= or > splitval 
      root      = [i, splitval, 1, lefttree.shape[0] + 1]  # here we specify the relative paths for lefttree and righttree 
      return (append(root, lefttree, righttree))   # append makes a new ndarray in this case 
----------------------                             # NOTE: a colleague in piazza suggests np.vstack((root, lefttree, righttree)) instead 
 
(ref) http://docs.scipy.org/doc/numpy/reference/generated/numpy.append.html    # numpy.append() 
 
NOTE: how to check if all elems within an numpy array are the same? 
      e.g. 
      len(np.unique(arr)) == 1 
      np.all(arr[0] == arr) 
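 
a hedged numpy version of the pseudo code above (not the official solution): correlation with Y picks the split feature (one of the info-gain approaches listed later), leaves are marked with factor = -1, and left/right links are relative offsets. 
 
--------------- build_tree_sketch.py (illustrative) 
import numpy as np 
 
def build_tree(data, leaf_size=1): 
    """data: ndarray with factors in columns 0..n-2 and Y in the last column.""" 
    x, y = data[:, :-1], data[:, -1] 
    if data.shape[0] <= leaf_size or np.all(y == y[0]): 
        return np.array([[-1.0, y.mean(), -1, -1]])             # leaf: 'split_val' holds the mean Y 
 
    # best factor = the one whose values correlate most strongly with Y 
    corr = np.nan_to_num([abs(np.corrcoef(x[:, i], y)[0, 1]) for i in range(x.shape[1])]) 
    i = int(np.argmax(corr)) 
    splitval = np.median(x[:, i]) 
 
    left = x[:, i] <= splitval 
    if left.all() or (~left).all():                             # median fails to split -> make a leaf 
        return np.array([[-1.0, y.mean(), -1, -1]]) 
 
    lefttree = build_tree(data[left], leaf_size) 
    righttree = build_tree(data[~left], leaf_size) 
    root = np.array([[i, splitval, 1, lefttree.shape[0] + 1]])  # relative links to the subtrees 
    return np.vstack((root, lefttree, righttree)) 
--------------- 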
 
## 
##  leaf_size 
## 
- it's just how many samples we want to aggregate to a leaf. 
 
suppose our data is: 
X1 Y 
10 1 
20 2 
30 3 
40 4 
50 5 
 
and leaf_size = 2 
 
suppose we split (based on whatever split logic we decide to use) as below 
 
   root    # split_val=35 
  /    \ 
10 1   40 4 
20 2   50 5 
30 3 
 
==> the righttree <= leaf_size == 2, so we take the mean i.e. 4.5 
==> the lefttree > leaf_size so we split more 
 
(again, lets say we split the node by split_val=25, as below, based on whatever split logic) 
 
     root 
    /    \ 
  node   40 4 
 /   \   50 5 
10 1  30 3 
20 2 
 
then we have 3 leaves, each giving the Ypredict value 1.5, 3.0, 4.5 
 

[1,35,1,4],         # for lefttree_link & righttree_link, here we used 'relative' path 
[1,25,1,2],         # in this example, factor is always 1 because we only have X1 
[1,1.5,-1,-1],      # but if we have more Xn, then factor can be any of {1,n} 
[1,3,-1,-1], 
[1,4.5,-1,-1], 

 
NOTE: for a leaf node, "split_val" is not really a split value, because a leaf does not split anything; a leaf node's split_val field actually holds the Ypredict value. 
 
## 
##  how to query a BDT 
## 
 
suppose you have a training data set 
[[x1,x2,x3,...,xn,y]   # y can be multi-dimensional as well, but in this course, we assume y is single dimension 
 [x1,x2,x3,...,xn,y] 
 [x1,x2,x3,...,xn,y] 
 ... 
 [x1,x2,x3,...,xn,y]] 
 
and you build a tree (i.e. train a learner/model), and once you have a tree, you can use the tree to predict Y for given Xs 
 
suppose you get a testX data set 
[[x1,x2,x3,...,xn]  # row 1 
 [x1,x2,x3,...,xn]  # row 2 
 [x1,x2,x3,...,xn]  # row 3 
 ...                # ... 
 [x1,x2,x3,...,xn]] 
 
the way you query(test) the testX is: 
(1) for each row in testX, you first look at the tree root node, and it tells you which factor to use, so effectively you only pick one X val out of x1,x2,...,xn in that row. 
(2) goto left or right child tree based on the split_val and the val of the X 
(3) eventually you get the Ypredict val for the row 
(4) repeat for the rest of the rows 
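 
a minimal query that walks the ndarray tree produced above (relative left/right links, leaves marked with factor = -1); my own sketch. 
 
--------------- query_tree_sketch.py (illustrative) 
import numpy as np 
 
def query_tree(tree, x): 
    """Predict Y for a single feature vector x.""" 
    node = 0 
    while tree[node, 0] != -1:                    # factor == -1 marks a leaf 
        factor = int(tree[node, 0]) 
        if x[factor] <= tree[node, 1]:            # compare against split_val 
            node += int(tree[node, 2])            # relative offset to the left subtree 
        else: 
            node += int(tree[node, 3])            # relative offset to the right subtree 
    return tree[node, 1]                          # leaf "split_val" holds Ypredict 
--------------- 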
 
 
## 
##   how to determine the "best" feature 
## 
goal: divide and conquer 
- group data into most similar groups 
 
approaches 
- info gain: entropy 
-- a de facto standard; just a way to measure the diversity (or randomness) of the data in the group 
- info gain: correlation 
-- the stronger corr factors are better suited to split on 
- info gain: Gini index # another measure of diversity 
 
row  x0  x1  .. xN   Y 
------------------------- 
 0  0.2 0.5  .. 9.1  8934 
 1  7.9 8.3  .. 6.5  583 
 2  0.6 0.2  .. 0.7  935 
 3  9.2 4.1  .. 3.1  194 
 .   .   .   ..  .    . 
 .   .   .   ..  .    . 
 
===>  lets build a tree (following the logic from the pseudo code) 
- determine the best factor. we will use correlation 
-- calc correlation between each of x0,x1,,,xN and Y 
-- suppose the strongest correlation is for x11 & Y 
- then use x11.median() as your splitval 
- all rows whose factor x11 val <= splitval  become lefttree 
- all rows whose factor x11 val >  splitval  become righttree 
- define the root node. [x11, splitval, lefttree, righttree]  again, notice how easy it gets if we use the relative path for left/right trees 
 
## 
##  random forests 
## 
 
in the pseudo code, where is the computational complexity bottleneck? 
- these two operations 
 
      determine best feature(==factor) i to split on 
      splitval  = data[:,i].median() 
 
===> instead, we simply take random picks as below 
 
      determine "random" feature(==factor) i to split on 
      splitval  = (data[random,i] + data[random,i]) / 2 
 
===> this does degrade/impair the quality of the decision tree. 
===> but we leverage the ensemble bagging learner method, hence "random forests"  (aka random trees) 
===> turns out "random forests" method wins against a single super robustly trained DT 
 
 
## 
##  strengths & weaknesses of decision tree learners (DTL) 
## 
- cost of learning:  DTL > LinReg > KNN 
- cost of query   :  KNN >  DTL > LinReg 
- benefit         : no need to normalize your data (while KNN requires normalization) 
 
 
 
 
########################################################## 
####  (3.4) Ensemble learners, bagging and boosting   #### 
########################################################## 
 
 
an ensemble learner: a set of (weak) learners that together form a stronger learner as a whole 
 
e.g. in 2006, netflix announced a $1 mil prize for an ML algo that predicts user movie preferences 10% better (awarded 2009) 
 
recall a learner that trains a model, which we test. 
 
 
                     train_data 
                   / |     |    \ 
                  /  |     |     \_____ 
                 /   |     |           \_ 
                /    |     |             \ 
[learner]      KNN LinReg Decision_Tree  SVM(support vector machine) 
               |     |     |              | 
[input] X -> model model  model          model 
               |     |     |              | 
               Ya   Yb     Yc            Yd 
                \    |     |             / 
                 \   |     |     _______/ 
                  \  |     |  __/ 
                    ensemble 
                 (vote, mean, etc) 
                       | 
                      Y_final 
 
## 
##  why ensemble? 
## 
- diff models augment each other (wrappers for existing methods) 
-- reduce error 
-- reduce overfitting(aka overfitage) 
 
(quiz) https://www.youtube.com/watch?v=KRb5xBs79dc 
 
 
## 
##  bootstrap aggregating (aka bagging) 
## 
 
recall we split our data 
train data : 60% 
test data  : 40% 
 
we then split the train data into m bags containing n' instances 
(bags, bins, buckets, the same thing) 
 
n : number of instances within the train data 
n': number of instances in a bag 
m : number of bags 
 
every time you pick an instance from the train data pool, you pick "randomly with replacement" i.e. you may pick the same elem again. 
so if the train data contains K distinct instances, in theory, you may pick the exact same thing n' times into a bag, which is ok. 
 
in most implementations, n' = n, but because of the 'random with replacement' selection, a bag typically contains only about 60% of the distinct instances in the train data. 
 
e.g. 
 
data         picked "random with replacement" 
[train]--------------------------------------- 
               |      |     |      ....      | 
             bag_1  bag_2  bag_3   ....    bag_m 
               |      |     |      ....      | 
[test] X -> [model][model][model]  ....   [model] 
               |      |     |      ....      | 
               ------------------------------------[mean]--> Y 
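 
a bare-bones bagging wrapper (my own sketch, not the project spec), reusing the train/query learner API from earlier: m bags drawn randomly with replacement, one learner per bag, mean of the queries. 
 
--------------- bagging_sketch.py (illustrative) 
import numpy as np 
 
class BagLearner(object): 
    def __init__(self, learner, kwargs, bags=20): 
        self.learners = [learner(**kwargs) for _ in range(bags)] 
 
    def train(self, Xtrain, Ytrain): 
        n = Xtrain.shape[0] 
        for l in self.learners: 
            idx = np.random.randint(0, n, size=n)    # n' = n, sampled with replacement 
            l.train(Xtrain[idx], Ytrain[idx]) 
 
    def query(self, Xtest): 
        return np.mean([l.query(Xtest) for l in self.learners], axis=0)   # ensemble = mean 
--------------- 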
 
 
e.g. 
 
consider 1NN model. -> extreme overfit. 
but if you train your 1NN models on m bags, and ensemble, then you get smoother, "less" overfit model(function). 
 
(visual) https://www.youtube.com/watch?v=sVriC_Ys2cw 
 
 
## 
##  Boosting: Ada Boost 
## 
 
boosting is a simple variation of bagging. 
"ada boost" is a famous example. 
 
- you create your 1st bag "randomly with replacement" 
- then before creating the 2nd bag, you test the model using the whole "train" data as your test data. 
-- for some Xs in the train data, the model does not yield good Y (most likely for instances that didn't get bagged); you then weight each instance in the train data pool based on its error from this test. 
- then you create your 2nd bag, this time error weighted selection, instead of "random with replacement". so the bigger error instances get more likely picked. 
- then train a model for each bag, and get the ensembled Y out of the n train data instances, then you again weight each instance in the train data according to the error. 
- create the 3rd bag based on the error weight. 
- repeat until you get m bags. 
 
 
===>  arguably, ada boost is more susceptible to overfitting, as it tries harder to fit not-well-fitted instances 
 
(quiz) https://www.youtube.com/watch?v=D5sa6IrbGvg 
 
 
 
(PERT) Perfect Random Tree Ensembles 
http://www.interfacesymposia.org/I01/I2001Proceedings/ACutler/ACutler.pdf 
 
(continuous boosting algorithm) http://www.mitpressjournals.org.prx.library.gatech.edu/doi/pdf/10.1162/neco.2006.18.7.1678 
 
 
########################################### 
####   (3.5) Reinforcement Learning    #### 
########################################### 
 
so far, the learners we discussed provide price forecasts, but no direction on when to enter or exit a position, etc 
 
reinforcement learner trains "policy" that directs specific actions. 
 SL:   ƒ(x) = y 
 RL:   π(s) = a 
 
## 
## the RL problem 
## 
- RL really describes a problem, not a solution. 
- there are many algorithms that solve a RL problem. 
 
a typical RL model :  agent interacts with environment 
- environment: transition_function(state,action) = new_state 
- agent: policy(state) = action 
- reward: every time an agent takes an action based on a given state, it gets reward(pos or neg). the goal is to maximize the reward. 
 
==> agent should have an algo that modifies the policy based on reward. 
 
(quiz) https://www.youtube.com/watch?v=ZV1TQ8CiT4M 
 
 
## 
##  mapping trading to RL 
## 
- environment: market 
- state: market features, various tech indicators, holding 
- action: buy, sell, do nothing 
 
## 
##  markov decision problems  (MDPs) 
## 
- RL algos solve MDPs 
 
a MDP consists of 
- set of states S 
- set of actions A 
- transition function T[s,a,s']   # this basically refers to the probability of taking action a in state s leading to state s' (sum of all possible s' will be 1) 
- reward function R[s,a]          # reward func is simpler. it gives reward for action taken in state s 
 
find a policy func P(s) that maximizes reward.  such an optimal P() is denoted P*() 
===> there are RL algorithms that take T() and R(), and find P*() 
1) policy iteration 
2) value iteration 
# but in the trading context, we know neither of T() and R(), so we have to take a diff approach. 
 
## 
##  unknown transitions T() and rewards R() 
## 
 
experience tuple:  <s,a,s',r>      # here s' = the state you will be in when taking action a in state s 
                   <s',a2,s'',r2>  # you can express the transition based on the outcome of the prev experience tuple 
                     ...           # and so on 
                     ... 
 

#  model-based RL:  build (statistically based on experience tuples) a model of T[s,a,s'] and R[s,a] 

- once you have T() and R() modeled, then you can apply {value|policy} iteration, to find P*() 
 

#  model-free RL: develop P*() directly out of experience tuples. 

- e.g. Q-learner 
 
 
## 
##  what to optimize ? 
## 
 
# infinite horizon                         inf 
- maximize the sum of all future rewards:   Σ Ri 
                                           i=1 
 
# finite horizon                                        n 
- maximize the sum of future rewards within horizon n:  Σ Ri 
(e.g. n steps, n years, etc)                           i=1 
 
                      inf 
# discounted reward:   Σ L^(i-1) * Ri      # it's the same logic for intrinsic value calc 
- discount ratio L    i=1 
  0 < L <= 1 
 L often is determined by interest rate in tech analysis context 
 
 
(quiz) https://www.youtube.com/watch?v=XHTiNDaBLNE 
 
 
################################ 
####   (3.6)  Q-Learning    #### 
################################ 
 
- a model free approach: does not use transitions T, or rewards R 
- instead, it builds a table of utility values (called Q-values), as agents interact with the world. 
- guarantees an optimal policy 
 
## 
##  Q-function 
## 
- returns the value from the table for a given set of s & a 
 
 s = state 
 a = action (you consider taking) 
 Q[s,a] = immediate reward + discounted reward 
 

# how to use Q ? 

 
optimal policy:  P(s) = argmax_a(Q[s,a])     # try every a, and return the max reward a case 
                P*(s) = Q*[s,a]              # another way to denote optimal is using '*' 
 

# how to build a Q table? (i.e. how to train a Q learner?) 

- select training data (and test data) 
- iterate over time: at each iter, you get an experience tuple <s,a,s',r> 
-- when you take a in s, you transitioned to s' with r 
(steps) 
--- set starttime, init Q[] 
--- compute s 
--- select a 
--- observe r,s' 
--- update Q 
- test diff P() 
- repeat until converge 
 

# more details 

- alpha A: learning rate 0 ~ 1.0    # usually we use 0.2 
- improved_estimate: immediate_reward + G * future_reward 
- gamma G: discount rate 0~ 1.0     # bigger G means bigger discount 
- future_reward 
 
  Q'[s,a] = (1-A) * Q[s,a] + A * improved_estimate 
  Q'[s,a] = (1-A) * Q[s,a] + A * (r + G * Q[s',argmax_a'(Q[s',a'])]) 
 
==> as you can see, a bigger val of alpha A causes a quicker Q learning (i.e. quicker convergence, which may not be necessarily more accurate) 
 

# more more details 

- success depends on exploration (of possible states and actions) 
-- how? ->  choose random action with prob c   # start at 0.3 then decrease to 0.0 over iterations 
--- there are two flips of coin involved. 
--- (1) randomly choose whether you pick a random action or pick the best Q val 
--- (2) if the former, then also randomly pick an action out of available actions 
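 
the update rule and the two exploration coin-flips in a few lines (my own sketch; Q is assumed to be a 2-D numpy array indexed [state, action]): 
 
--------------- q_update_sketch.py (illustrative) 
import numpy as np 
 
def choose_action(Q, s, rand_rate=0.3): 
    # flip 1: random action or best-Q action?  flip 2: which random action? 
    if np.random.rand() < rand_rate: 
        return np.random.randint(Q.shape[1]) 
    return int(np.argmax(Q[s])) 
 
def update_q(Q, s, a, s_prime, r, alpha=0.2, gamma=0.9): 
    # Q'[s,a] = (1-A)*Q[s,a] + A*(r + G * max_a' Q[s',a']) 
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_prime].max()) 
--------------- 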
 
## 
##  solving the trading problem with Q-learning 
## 
Actions: 
- buy 
- sell 
- do nothing 
 
Rewards: 
- immediate reward   # daily return 
- delayed reward     # return at the end of a trading cycle 
 
(quiz) https://www.youtube.com/watch?v=BgWJhpcxmzw 
 
States: 
generally avoid stuff you cant compare across stocks, like absolute price values, etc 
i.e. stuff like return, ratio, etc is useful 
e.g. 
- adjusted close / SMA 
- P/E ratio 
- bollinger band 
- whether we are currently holding the stock 
- return since entry 
 
(quiz) https://www.youtube.com/watch?v=_vcd_VtSvkQ 
 
## 
##  creating the state 
## 
states are combinations of "features aka factors" 
- express a state as an integer  # so it's easier, but technically you can use real num also 
- discretize each factor         # convert a real number to an integer in some range [0,N] 
- combine those integers (representing states and factors) into an integer 
 
## 
##  discretizing 
## 
------------------------------- pseudocode 
stepsize = size(data)/steps               # integer division 
data.sort() 
for i in range(0,steps): 
    threshold[i] = data[(i+1)*stepsize - 1]   # top value of each 'stepsize' chunk (-1 keeps the last index in range) 
---------------------------------------- 
 
- suppose you want to discretize inputs into integers in the range [0, N-1]; then you have N steps 
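 
a runnable version of the idea (my own sketch), using numpy's searchsorted to map a real value to its bucket: 
 
--------------- discretize_sketch.py (illustrative) 
import numpy as np 
 
def make_thresholds(data, steps=10): 
    """Sort the training values and record the top value of each equal-sized chunk.""" 
    data = np.sort(np.asarray(data)) 
    stepsize = len(data) // steps 
    return np.array([data[(i + 1) * stepsize - 1] for i in range(steps)]) 
 
def discretize(value, thresholds): 
    """Map a real value to an integer in [0, steps-1].""" 
    return min(int(np.searchsorted(thresholds, value)), len(thresholds) - 1) 
--------------- 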
 
## 
## Q-learning recap 
## 
- building a model 
-- (1) define states, actions, rewards 
-- (2) choose in-sample training period 
-- (3) iterate: update Q-table 
-- (4) backtest, repeat (3),(4) until converge 
 
- testing a model 
-- backtest on later data 
 
## 
## Q-learning advantages 
## 
 
below summary is taken from udacity 
- The main advantage of a model-free approach like Q-Learning over model-based techniques is that it can easily be applied to domains where all states and/or transitions are not fully defined. 
- As a result, we do not need additional data structures to store transitions T(s, a, s') or rewards R(s, a). 
- Also, the Q-value for any state-action pair takes into account future rewards. Thus, it encodes both the best possible value of a state max_a(Q(s, a)) as well as the best policy in terms of the action that should be taken argmax_a(Q(s, a)). 
 
 
#################################################### 
####  (3.7)  Dyna: a special Q-learner example  #### 
#################################################### 
 
a problem with Q-learners is that they take too many experience tuples to converge. 
i.e. too many real world interactions, but we don't want to let the learner do too much real trading just for practice. 
 
Dyna builds a model of transitions T and a reward matrix R, so the learner can "hallucinate" many (a few hundred) interactions after each real interaction, to update the Q table. 
 
## 
## a normal Q-learner  (recap) 
## 
1. init Q[] 
2. observe s 
3. execute a, observe s' & r 
4. update Q with <s,a,s',r> 
5. repeat 2,3,4 until converge (expensive operation) 
 
## 
## Dyna-Q 
## 
1. init Q[] 
2. observe s 
3. execute a, observe s' & r 
4. update Q with <s,a,s',r> # not significant 
5. repeat 2,3,4 a few times, to learn models of T & R 
6. hallucinate 2,3 with T & R 
7. update Q with <s,a,s',r> 
8. repeat 6,7 until converge (cheaper operation !) 
 
T[s,a,s'] = probability that taking a in s ends up in s' 
R[s,a] = expected reward you get for taking a in s 
 
 

# how to hallucinate  (step 6) 

-- keep track of encountered state+action pairs, e.g. [(s0,a0),(s1,a1),,,,(sN,aN)] 
--- it's possible to get repeated state+action pairs. 
--- yes multiple actions may be associated with a state, because you can come to the same state as before and take a diff action. 
 
 s = randomly chosen from known(previously visited) states 
 a = randomly chosen from known(previously taken) actions for that random s you just chose 
 s'= infer from T[] 
 r = R[s,a] 
 
how do i infer s' from T[] ?  (assume you have the above described T[]) 
(1)  s'= np.argmax(T[s,a,:]) 
-- not necessarily good to blindly take the highest probability s_prime because if it's not the effective state to be in, it takes long to converge. 
(2) s' = np.random.choice(range(self.num_states),p=self.T[random_s,random_a,:]) 
-- this is better. 
 

# how to update T & R   # (step 5) 

- learning a model of T 
-- just count real world examples, and build a table Tc 
 init Tc[all] = 0.00001 
 while executing, observe s,a,s' 
 increment Tc[s,a,s'] 
 
 T[s,a,s'] = Tc[s,a,s'] / Σ(Tc[s,a,i]) 
                          i 
 
- learning a model of R 
 R[s,a] = expected reward for s,a 
 r = immediate reward 
 R'[s,a] = (1-A)*R[s,a] + A*r      # Alpha usually is 0.2 (this alpha can be diff from the alpha for Q table) 
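 
a hedged sketch of that bookkeeping (class/method names are mine): count transitions into Tc, normalize to get T, blend R, and draw hallucinated tuples from the learned model. 
 
--------------- dyna_sketch.py (illustrative) 
import numpy as np 
 
class DynaModel(object): 
    def __init__(self, num_states, num_actions, alpha=0.2): 
        self.Tc = np.full((num_states, num_actions, num_states), 0.00001)  # transition counts 
        self.R = np.zeros((num_states, num_actions))                       # expected rewards 
        self.alpha = alpha 
        self.seen = []                                                     # (s, a) pairs visited 
 
    def update(self, s, a, s_prime, r):          # step 5: learn models of T & R from a real tuple 
        self.Tc[s, a, s_prime] += 1 
        self.R[s, a] = (1 - self.alpha) * self.R[s, a] + self.alpha * r 
        self.seen.append((s, a)) 
 
    def hallucinate(self):                       # step 6: fabricate a <s,a,s',r> tuple from the model 
        s, a = self.seen[np.random.randint(len(self.seen))] 
        T = self.Tc[s, a] / self.Tc[s, a].sum()                # T[s,a,s'] = Tc / sum_i Tc[s,a,i] 
        s_prime = np.random.choice(len(T), p=T) 
        return s, a, s_prime, self.R[s, a] 
--------------- 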
 
 
NOTE: there are a few variants to dyna-Q implementation. the point is to facilitate the hallucinated learning. 
(ref) https://github.com/paulorauber/rl/blob/master/learning/model_building.py   # sample code 
(ref) https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node96.html     # theory 
(ref) http://www-anw.cs.umass.edu/~barto/courses/cs687/Chapter%209.pdf  # theory 
 
 
 
################# 
##  appendix   ## 
################# 
 
work station: buffet03.cc.gatech.edu 
(ref) http://quantsoftware.gatech.edu/ML4T_Software_Setup 
 
you basically copy git repo locally then do your coding, then test, then maybe WinSCP and submit your python code via t-square 
 
scp -r <username>@buffet03.cc.gatech.edu:/path/to/directory /path/to/destination 
(or use WinSCP) 
 
for IDE, anaconda/spyder will be good. 
 

# git memo 

git clone https://github.gatech.edu/tb34/ML4T_2016Fall.git 
git pull   # to sync with the latest 
 
## 
## spyder 
## 
 
http://stackoverflow.com/questions/26679272/not-sure-how-to-use-argv-with-spyder 
==> alternatively you can do the below on the ipython console 
runfile('C:/Users/mel/Desktop/gatech/ml4t/ML4T_2016Fall/mc3_p1/testlearner.py', args='/Users/mel/Desktop/gatech/ml4t/ML4T_2016Fall/mc3_p1/Data/simple.csv', wdir='C:/Users/mel/Desktop/gatech/ml4t/ML4T_2016Fall/mc3_p1') 
 
 
## 
##  wall street lingo 
## 
- cyclical: a stock whose company's business performance generally depends on the economy, like US Steel. 
- secular: a recession-resistant stock, like food, medicine, or a toothbrush company, whose performance is consistent regardless of overall economic health 
-- know a stock is either a cyclical or secular growth stock. when the economy is down buy secular, when up buy cyclical. 
- rotation: a term often used when money flows between cyclical and secular stocks. 
 
 
### 
### what hedge funds really do 
### 
 
hedge funds VS mutual funds 
H: less regulated. 
M: more regulated. has to declare strategy in prospectus. can advertise and access small investors. 
 
 
 
