computer networking

 
graduate computer networks.  cs6250 notes 
 
(supplementary texts) 
computer networking - Kurose, Ross 
computer networks - Tanenbaum, Wetherall (video notes available online, link below) 
http://media.pearsoncmg.com/ph/streaming/esm/tanenbaum5e_videonotes/tanenbaum_videoNotes.html 
 
"Introduction to Computer Networking" through Stanford for free online: 
https://class.stanford.edu/courses/Engineering/Networking/Winter2014/about 
 
(non-degree udacity version) 
https://www.udacity.com/wiki/ud436 
 
 
------------------------------------------------------------------------------ 
 
(1) intro 
(2) architecture & principles 
(3) switching 
(4) routing 
(5) naming, addressing & forwarding 
(5.1) router design basics 
(5.2) DNS 
(6) congestion control & streaming 
(7) rate limiting and traffic shaping 
(8) content distribution 
(9) SDN 
(9.1) programming SDNs 
(10) traffic engineering 
(11) NW security 
(11.1) internet worms 
(11.2) spam 
(11.3) DoS attacks 
 
 
 
##################### 
####  (1) intro  #### 
##################### 
 
 
#  course goal 
 
survey on 
- tools 
- techniques 
- concepts 
in CN research 
 
project-based hands-on course 
lectures, then problem sets/assignments 
 
3 areas 
- protocols & architecture 
(principles, switching/routing, naming, addressing, forwarding) 
- resource control & content distribution 
(congestion control, streaming, rate limiting) 
- operations & mgmt 
(SDN, traffic engineering, network security) 
 
 
Assignment 1: install mininet (submit only via T-square)(use vbox ver. 4.3.12) 
https://www.udacity.com/wiki/cn/assignment1-mininet 
 
(all assignment code here) 
https://github.com/OMS6250/gt-cs6250 
 
 
 
 
######################################## 
###  (2) architecture & principles   ### 
######################################## 
 
- history of the internet 
late 1960s~ ARPANET, SATNET, NPL network in UK, etc 
1973 TCP/IP, getting standardized in late 70s, ARPANET adopting it.. 
1982~ DNS (replacing the hosts.txt file) 
1988~ TCP congestion control 
1989~ NSFNET, BGP routing 
 
1992~ video/audio 
1993~ web 
1995~ search engine 
2000~ p2p 
 
- took off around mid-90s 
 
## problems 
- IPv4 address space : only 2^32 
- congestion control : insufficient dynamic range (poor fit for wireless and high-speed intercontinental paths) 
- routing : BGP = no security, poor convergence, non-determinism, ease of misconfig 
- security : secure SW deployment 
- DoS 
 
===> all require changes to the basic infra 
 
 
## Internet design principles 
 
Clark paper (1988): "The Design Philosophy of the DARPA Internet Protocols" 
http://ccr.sigcomm.org/archive/1995/jan95/ccr-9501-clark.pdf 
 
goal: multiplexed(shared) utilization of existing interconnected networks 
goal: survivability (replication, fate sharing) 
goal: heterogeneity: IP narrow waist, best effort service model 
goal: distributed mgmt: addressing(ARIN, RIPE, etc), naming (DNS), routing (BGP) 
goal: cost 
goal: ease of attachment 
goal: accountability 
(missing) 
goals: security, availability, mobility, scaling 
 
 
## Interconnection 
 
OSI stack model - IP being the "narrow waist" 
 
L2 - point2point connectivity 
L3 - end2end connectivity (universal fwding) 
TCP- reliable transport, congestion control 
 
concepts 
(1) packet switching (statistical multiplexing) 
-- no NW state ahead of time, only best effort 
-- contrast to phone NW (dedicated circuit switching), obviously pros & cons for each 
e.g.       PS            CS 
pros: shared resources  dedicated resources between sender & receiver 
cons: variable delay    busy signal(phone) 
 
(2) fate sharing 
 
- end2end principle 
- "ok to lose a state info of an entity if the entity itself is lost" 
e.g. a router crashes, all of the state on the router is lost. state shares the fate. 
==> makes engineering of withstanding complex failure easier 
==> there are examples of the current internet violating this principle 
 
 
http://en.wikipedia.org/wiki/Fate-sharing 
 
 

# end-to-end argument   (sometimes violated, for good reasons like performance enhancement, etc) 

 
"End-to-end arguments in system design" by JH Saltzer, DP Reed, DD Clark 
http://web.mit.edu/saltzer/www/publications/endtoend/endtoend.pdf 
http://www.cc.gatech.edu/home/keith/pubs/hotnets07-end-to-end.pdf 
(extra papers) 
http://nms.csail.mit.edu/6829-papers/bravenewworld.pdf 
http://www.fclj.org/wp-content/uploads/2013/01/Vol.63-2_2011-Mar._Art.-02_Clark.pdf 
 
excerpt from the paper: The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the end points of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.) 
 
 
summary: the intelligence required to implement an application on a communication system should be placed at the end points rather than in the middle of the network. 
i.e. dumb NW, intelligent end-points 
 
e.g. error handling in file transfer 
e.g. end-to-end encryption 
e.g. TCP/IP split error 
 
==> we need an application-layer (= higher layer) check anyway, but that does not mean a lower-level check is meaningless; in fact it can sometimes help (as a performance enhancement). 
 
what are the "ends" ?  ==> depends on the context, can be PC, or routers, etc. 
 
# e2e violation examples 
- NAT 
- VPN tunnels 
- TCP split 
- spam filters 
- p2p 
- cache in NW aggregation 
 
===> questions: what's in VS out ?  what functions should be in dumb NW ? (routing, multicast, QoS, NAT, ? ) 
 
RFC1918 : NAT private IP addr space  http://tools.ietf.org/html/rfc1918 
e.g. 192.168.0.0/16 
 
NAT table example: leverages port numbers 
 
private IP addr       pub IP addr 
192.168.1.31:1000 <-> vvv.xxx.yyy.zzz:50873 
192.168.1.32:1000 <-> vvv.xxx.yyy.zzz:50874 
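
a minimal Python sketch of the translation the NAT table above performs (the public address placeholder and the starting port number are just made-up examples): 

PUBLIC_IP = 'vvv.xxx.yyy.zzz'        # the NAT box's single public address (placeholder) 
nat_out = {}                         # (priv_ip, priv_port) -> pub_port 
nat_in = {}                          # pub_port -> (priv_ip, priv_port) 
next_port = 50873                    # arbitrary starting point for allocated public ports 

def translate_outbound(priv_ip, priv_port): 
    global next_port 
    key = (priv_ip, priv_port) 
    if key not in nat_out:           # first packet from this host:port creates the mapping 
        nat_out[key] = next_port 
        nat_in[next_port] = key 
        next_port += 1 
    return PUBLIC_IP, nat_out[key]   # rewritten src addr:port as seen by the outside world 

def translate_inbound(pub_port): 
    return nat_in[pub_port]          # rewrite dst back to the private host:port 

print(translate_outbound('192.168.1.31', 1000))   # -> ('vvv.xxx.yyy.zzz', 50873) 
print(translate_outbound('192.168.1.32', 1000))   # -> ('vvv.xxx.yyy.zzz', 50874) 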
 
 
STUN (Session Traversal Utilities for NAT) 
"a standardized set of methods and a network protocol to allow an end host to discover its public IP address if it is located behind a NAT" 
(ref) 
http://en.wikipedia.org/wiki/STUN 
 
===> a device behind NAT sends an initial outbound packet to a STUN server; that packet creates the NAT entry, and the server's reply tells the device its pub IP addr/port. 
 
 
 
 
############################################################################# 
####      Assignment 2 - mininet topology (submit only via T-square)     #### 
############################################################################# 
 
https://www.udacity.com/wiki/cn/assignment2-topology 
 
NOTE: in step 2, this cmd may not work; if so, run the substitutions individually 
sudo sed -i "s/(security.ubuntu.com\|mirrors.kernel.org)/old-releases.ubuntu.com/g" /etc/apt/sources.list 
 
(fix) 
sudo sed -i "s/security.ubuntu.com/old-releases.ubuntu.com/g" /etc/apt/sources.list 
sudo sed -i "s/mirrors.kernel.org/old-releases.ubuntu.com/g" /etc/apt/sources.list 
 
==> this is needed because Ubuntu 13.04 is not supported anymore; we need to point apt to the old-releases path. 
 
NOTE: for some reason, VirtualBox has many issues. my bandwidth never got near 50Mbps. VMware worked instead. 
(VMWare Player for Windows: desktop end user computing) 
https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0 
===> just import .ova mininet-VM image from VMWare, and the rest is exactly the same setup as was on virtual box. 
===> some other issues/solutions with vbox discussed on piazza. vbox seems full of issues. 
(ref) 
https://www.virtualbox.org/ticket/10157?cversion=0&cnum_hist=3 
 

#  NOTE:   line-rate VS bit-rate 

 
line-rate : actual speed with which bits are sent onto physical wire (physical layer GROSS bit rate) 
bit-rate  : data transfer rate offered by physical layer to data link layer (physical layer NET bit rate) 
 
(ref) 
http://blog.ipspace.net/2009/03/line-rate-and-bit-rate.html 
http://www.diffen.com/difference/Gross_vs_Net 
 
 

#  NOTE : why does the first ping RTT take so long ? 

the first ping takes longer due to the ARP lookup. If you have not gotten to lesson 3 yet, you'll learn about how the ARP lookup occurs. 
 
Since the sender doesn't know where the receiver is, it must first send an ARP request to find the machine. The query must travel over the same links and will therefore add at least 6ms to the initial ping time. 
 
If you were to be able to start the ping again without restarting mininet, you would find that the first ping gave a valid time because it would have the IP to MAC address information stored in its ARP cache. 
 
Why does ping not take this into account when sending the first ping? The ping application is just telling the network stack to send a data packet. The fact that the network stack does not have the MAC address corresponding to the IP address is shielded from the ping application. Therefore, it doesn't know that the ARP process had to occur and cannot account for it when computing the latency of the first packet. 
 

# NOTE: how to scp between vm and the host OS 

You can use an scp command for files between a host and a VM that aren't easily accessible from a web server, a good reference is here: http://www.hypexr.org/linux_scp_help.php 
 
e.g. 
$ scp mntopo.py mel@melx60.local:/Users/mel/dev/gt/a2 
 
WinSCP is an easy-to-use Windows SCP+SFTP+FTP client that works easily with these VMs. http://winscp.net/ 
 
 
 
################################## 
######     mininet  intro    ##### 
################################## 
 
Just read this intro web site. 
 
(URL) 
https://github.com/mininet/mininet/wiki/Introduction-to-Mininet 
 
 
mininet - open source "NW emulator"   (lightweight virtualization on a single linux kernel) 
 
we can define a custom topology - end-hosts, switches, routers, and links. (you can ssh into hosts if you start up sshd and a bridge. you can run any application that runs on the underlying linux kernel. no programs that depend on a non-linux OS like windows, BSD, etc) 
 
# features 
- python API  (= http://api.mininet.org ) 
- included in Ubuntu (12.10+) 
- switches are programmable using OpenFlow protocol. custom SDN. (you will need to write your OpenFlow controller) 
 
# limitations 
- no NAT (nat.py and hwintf.py let you have external connectivity and add a hw interface) 
- all hosts share the same file system and PID space by default (bind.py lets you have per-host private directories) 
 
--- see example code. 
(1) you can custom define a topology. 
(2) change performance parameters. 
(3) run programs in hosts 
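
a minimal sketch of the three points above, assuming the standard mininet Python API (the topology class name and the 10 Mbps / 5 ms link parameters are just made-up examples): 

from mininet.topo import Topo 
from mininet.net import Mininet 
from mininet.link import TCLink 

class TwoHostTopo(Topo): 
    "(1) custom topology: two hosts behind one switch" 
    def __init__(self, **opts): 
        Topo.__init__(self, **opts) 
        h1 = self.addHost('h1') 
        h2 = self.addHost('h2') 
        s1 = self.addSwitch('s1') 
        # (2) performance parameters on the links 
        self.addLink(h1, s1, bw=10, delay='5ms') 
        self.addLink(h2, s1, bw=10, delay='5ms') 

if __name__ == '__main__': 
    net = Mininet(topo=TwoHostTopo(), link=TCLink) 
    net.start() 
    h1, h2 = net.get('h1', 'h2') 
    # (3) run a program on a host 
    print(h1.cmd('ping -c 3 %s' % h2.IP())) 
    net.stop() 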
 
 
# NOTE:  sometimes mn crashes or your code fails, leaving zombie mininet procs; issuing "sudo mn -c" helps clean up mininet 
 
 
useful cmd:  dumpNetConnections 
http://nullege.com/codes/search/mininet.util.dumpNetConnections 
 
 
 
 
 
 
############################# 
####   (3) switching    ##### 
############################# 
 
switching & bridging 
 
How hosts find each other on a subnet. 
How subnets are interconnected. 
How to scale better. 
 
 
# datagram naming convention 
L2 : frame 
L3 : packet 
 
 
ARP : address resolution protocol  ( /usr/sbin/arp -a ) 
- an L2 protocol for a host to learn the MAC address of another host. 
- broadcast a query to the subnet(LAN) broadcast addr "who has an IP addr 130.207.160.47 ?" 
- and only the host with that IP addr will answer with its MAC addr, in unicast. 
- the host can then start building an ARP table (IP - MAC mapping) 
 
typical encapsulation of L3 packet with L2 frame header(dst/src addr). 
 
e.g. 
broadcast ARP request contains: 
- src IP 
- src MAC 
- dst IP 
 
response includes: 
- src IP 
- src MAC 
- dst IP 
- dst MAC   # see this is known, thus no need for ARP broadcast. 
 
note that not just switches, but hosts have their own ARP tables. 
 
 
Hub : broadcast medium 
- broadcast all frames in its incoming port to every outgoing port. 
- flooding, collisions of frames (--> latency), failures/misconfig (one failed/misconfigured device could damage other devices) 
- almost extinct because switches are just as cheap these days. 
 
Switch : segments broadcast domains 
- maintains a switch table (mapping btw MAC addrs - output ports) 
- traffic isolation of LAN segments 
- learning switches (examines frames' src addrs and builds a switch table) 
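
a minimal sketch of the learning-switch logic above (the frame fields and port arguments are just placeholders): 

switch_table = {}   # MAC addr -> output port 

def handle_frame(src_mac, dst_mac, in_port, all_ports): 
    switch_table[src_mac] = in_port                  # learn the source addr -> port mapping 
    if dst_mac in switch_table: 
        return [switch_table[dst_mac]]               # known destination: forward out that port only 
    return [p for p in all_ports if p != in_port]    # unknown destination: flood everywhere else 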
 
## 
## loops and broadcast 
## 
 
a cycle is often intentional in underlying physical NW topology for redundancy. 
==> fwd loops 
==> broadcast storms 
 
(solution) 
- build a (minimum) spanning tree 
- root = switch with the smallest ID (initially every node thinks it is the root, then updates its view of the root through the exchange below, keeping track of its distance from the root) 
 
format(Y,D,X) 
Y = claimed root 
D = distance from root 
X = origin 
 
===> every node broadcasts, and the smaller-ID switch overrules. 
===> eventually forms a minimum spanning tree 
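
a hedged sketch of the (Y,D,X) update rule above: each switch keeps its best known (claimed root, distance, via-neighbor) and replaces it whenever a neighbor's announcement, with the distance bumped by one hop, compares smaller (the switch IDs and numbers here are made up): 

my_id = 7 
best = (my_id, 0, my_id)          # initially every switch claims to be the root 

def on_announcement(Y, D, X): 
    global best 
    candidate = (Y, D + 1, X)     # reaching that root through neighbor X costs one more hop 
    if candidate < best:          # tuple compare: smaller root wins, then shorter distance, then smaller sender 
        best = candidate 

on_announcement(3, 2, 5) 
print(best)    # -> (3, 3, 5): switch 7 now believes 3 is the root, 3 hops away, via neighbor 5 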
 

# switch VS router 

(switch) 
- L2/ethernet 
- auto-configuring 
- fast fwding (flat table lookup) 
(router) 
- L3/IP 
- not restricted to spanning tree (multi-path routing) 
 
===> switch limitations : scaling. (spanning tree msgs, ARP queries, etc can be a heavy load, undermining the L2 benefits) => OpenFlow, SDN solutions, which will make the L2-L3 walls blurry. 
 

#  buffer sizing : important questions in switch/router design 

 
(see video for graphics) 
https://www.youtube.com/watch?v=GDe77B9lts4#t=10 
https://www.youtube.com/watch?v=1K_DVt1XSQo 
https://www.youtube.com/watch?v=bO5mQ1EJ0rg 
 
 
(traditional notion) 
 
src  ----   router   ----  dst 
 
C = capacity of the bottleneck link (bits per sec) 
T = one-way trip propagation delay (unit in ms, sec, etc) 
2T = round trip propagation delay (src -> router -> dst -> router -> src) 
 
buffer size  =  2T * C  ( = number of bits that can be outstanding along the 2T path.) 
 
===> theoretically correct. but not in practice. 
===> bigger buffer size leads to (1) bigger queueing delay (delayed feedback about congestion, etc) and (2) more HW resources 
 
TCP window W : number of unack'ed packets that can be outstanding on a path 
sending rate R = W/RTT 
 
===> recall TCP uses AIMD(additive increase multiplicative decrease) congestion control (typical sawtooth graph) 
 
for every W acks received, we send W+1 packets. (sawtooth peak = Wm, bottom = Wm/2) 
 
required buffer = Wm - Wm/2 
 
Rold = Rnew  (just to ensure the sending rate remains the same before and after the window change) 
Wold/RTTold = Wnew/RTTnew 
 
RTT = 2T(propagation delay) + B/C(queueing delay) 
B = max buff size of the bottleneck link 
C = capacity of the bottleneck link 
 
RTTold = 2T + B/C 
Wnew = Wold/2 
RTTnew = 2T    # no queueing delay after halving the window 
 
====> Wold/(2T + B/C) = (Wold/2)/2T   yields   B = 2T * C 
 
====> works ok for a single flow, but a backbone router carries ~20k asynchronous flows, peaking with diff patterns/timing. 
====> the overall variance of the sum of the Wm's will be less dynamic. (according to the Stanford buffer sizing paper, flows desynchronize beyond roughly 200 flows) 
 
Central Limit Theorem (CLT) 
 
more variables ===========> narrower Gaussian 
               by 1/sqrt(n) 
 
   2T * C      ===========>  2T*C/sqrt(n) 
 
# variables = unique congestion windows of flow = n 
# Gaussian = fluctuation of the sum of all the congestion windows 
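
a quick worked example of the rule of thumb vs the 2T*C/sqrt(n) result (the capacity, delay, and flow count below are made-up numbers): 

from math import sqrt 

C = 10e9        # bottleneck link capacity: 10 Gb/s 
T = 0.05        # one-way propagation delay: 50 ms, so 2T = 100 ms 
n = 10000       # number of concurrent long-lived flows 

classic = 2 * T * C              # traditional 2T*C rule 
small   = 2 * T * C / sqrt(n)    # CLT-based result for many desynchronized flows 

print(classic / 8 / 1e6, "MB vs", small / 8 / 1e6, "MB")   # 125.0 MB vs 1.25 MB 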
 
 
 
(good paper to read) 
http://yuba.stanford.edu/~nickm/papers/sigcomm2004.pdf 
 
 
 
 
############################ 
####   (4)  Routing     #### 
############################ 
 
Internet = inter-connected Autonomous Systems (ASes) 
 
AS example:  Comcast, Google, Georgia Tech 
 
intra-domain routing : routing within AS 
inter-domain routing : routing between ASes 
 
 
## 
## intra-domain routing 
## 
 
IGP (interior gateway protocol as opposed to BGP) 
intra-AS topology:  nodes(aka pops) & edges 
 
pop = point of presence 
 
two kinds of intra-domain routing 
(1) distance vector 
(2) link state 
 
 
(1) distance vector (e.g. RIP) 
- send "vectors" to adjacent neighbors (copies of own routing table) 
- routers compute costs to each destination based on shortest available path (bellman-ford) 
- convergence can take long. (count-to-infinity problem) SEE video https://www.youtube.com/watch?v=_lAJyA70Z-o#t=112 
 
===> solution is "poison reverse" 
 
# poison reverse 
 
x --- y 
 \   / 
   z 
 
lets say it's undirected. 
 
if d(x,y) > d(y,z) + d(z,x),  then y must go to x via z. 
in this case, y would advertise to z about the cost of x being infinite. 
 
lets say 
d(x,y) = 5 
d(y,z) = 2 
d(z,x) = 1 
 
  x y z 
x 0 5 1 
y 5 0 2   // when y sends its vector to z, instead of "5 0 2", y will send "inf 0 2" 
z 1 2 0 
 
=====> this will let z immediately choose the shortest path when updating the table. 
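
a minimal distance-vector sketch of the 3-node example above (one round of Bellman-Ford updates per call; poison reverse is left out for brevity): 

INF = float('inf') 
nodes = ['x', 'y', 'z'] 
cost = {('x','y'): 5, ('y','z'): 2, ('z','x'): 1} 
cost.update({(b, a): c for (a, b), c in list(cost.items())})   # undirected links 

# each node starts with its direct link costs 
dv = {u: {v: (0 if u == v else cost.get((u, v), INF)) for v in nodes} for u in nodes} 

def one_round(): 
    "d_u(dst) = min over neighbors v of c(u,v) + d_v(dst)" 
    changed = False 
    for u in nodes: 
        for dst in nodes: 
            if dst == u: 
                continue 
            best = min(cost.get((u, v), INF) + dv[v][dst] for v in nodes if v != u) 
            if best < dv[u][dst]: 
                dv[u][dst] = best 
                changed = True 
    return changed 

while one_round(): 
    pass 
print(dv['y']['x'])   # -> 3: y reaches x via z (2 + 1), not via the direct cost-5 link 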
 
### 
### RIP (routing information protocol) 
### 
 
- edges have unit cost (distance) 
- infinity = 16 (to ensure count-to-infinity doesn't take that long) 
- table refresh : every 30 sec (or when updates occur)  # timeout limit = 180 sec 
- updates to all neighbors except for the one that caused the update (aka "split horizon" rule, to prevent routing loops) 
 
 
 
(2) link-state routing 
- more prevalent than distance vector today 
- each node distributes a "NW map" to all other nodes, and performs an SPF(shortest path first) computation like Dijkstra 
- e.g. flood costs c(u,v) to all nodes. relaxation: dist(v) = min( dist(v), dist(w) + c(w,v) ) 
- notably two protocols: OSPF, IS-IS (intermediate system to intermediate system). recently, IS-IS is more popular with ISPs 
===> limitation : scalability (as nodes increase, NW gets complex) 
 
solution to scalability problem in link-state routing protocols:  "Hierarchy" 
OSPF : area 
IS-IS: level 
 
====> essentially dividing the whole backbone NW into areas/levels; each area forms its own shortest-path NW, and we also form the SP between areas. Area0 = backbone; each AreaN(n>0) has a border router that connects it to Area0. 
 
 
 
## 
## inter-domain routing 
## 
 
- routing between ASes. 
- BGP (border gateway protocol) 
- route advertisements(announcements) 
 
(video) https://www.youtube.com/watch?v=HmtJxJslXVI#t=134 
 
imagine comcast_AS --- AT&T_AS --- CMU_AS --- GT_AS 
 
route advertisement from AT&T_AS to comcast_AS contains 
 
(1) Destination IP prefix (NW addr of GT_AS) e.g. 130.207.0.0/16 
(2) Next Hop (IP of AT&T's border router which comcast_AS needs to send to, to get to GT_AS) e.g. 192.5.89.89 
(3) AS Path (sequence of AS numbers from AT&T_AS to GT_AS) e.g. 10579(AT&T's AS num)...29192(GT's AS num) 
GT's AS num in this case is called "origin AS" because it originated the advertisement of the prefix. 
 
transmitting routing info for border routers between adjacent ASes: external BGP (eBGP) 
disseminating BGP route advertisements for routers inside AS for external destination: internal BGP (iBGP) 
 
NOTE: distinction between IGP vs iBGP 
 
IGP : disseminate routes inside an AS to "internal" destinations 
iBGP: disseminate routes inside an AS to "external" destinations 
 
so, often, a router inside AS x goes thru IGP-iBGP-eBGP-IGP to reach a router inside AS y 
 

#  BGP route selection 

- how do we pick the best(shortest) path among multiple available paths? 
(1) prefer a route with higher "local preference" 
"local preference" is just a numerical value assigned to each router within AS (purely local property) 
lets local NW operator to label preferred routes 
(2) among the same local preference, BGP picks the shortest AS path length (fewer AS-hop is deemed better) 
(3) multi-exit discriminator(MED) - tells which exit point to prefer within AS 
                                  - lower value preferred 
                                  - only applies to compare routes advertised from the same AS 
(4) shortest IGP path (to iBGP next hop) aka "hot potato" routing (= AS sends traffic to neighbor ASes via a path that traverses as little of its own NW as possible) distance/cost here is often just "propagation delay often associated with sheer physical distance" 
(5) tiebreak (if above points are all tie, then just tie-break based on some arbitrary methods e.g. "most stable" "lowest router ID") 
 
priority (1) > (2) > (3) > (4) > (5) 
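
a hedged sketch of the preference order above, ranking candidate routes as tuples (simplified: real BGP only compares MEDs of routes learned from the same neighboring AS, whereas here MED is compared unconditionally, and the attribute names/values are made up): 

def bgp_rank(route): 
    return (-route['local_pref'],        # (1) higher local preference first 
            len(route['as_path']),       # (2) shorter AS path 
            route['med'],                # (3) lower MED 
            route['igp_cost'],           # (4) hot potato: lower IGP cost to the next hop 
            route['router_id'])          # (5) tie-break, e.g. lowest router ID 

routes = [ 
    {'local_pref': 100, 'as_path': [10579, 29192],      'med': 0, 'igp_cost': 5, 'router_id': 2}, 
    {'local_pref': 110, 'as_path': [7018, 3356, 29192], 'med': 0, 'igp_cost': 9, 'router_id': 1}, 
] 
best = min(routes, key=bgp_rank)   # the local_pref 110 route wins despite its longer AS path 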
 

#  local preference 

 
   /-- AS2 --\ 
AS1     |     AS4 
   \-- AS3 --/ 
 
 
now from AS1 operator perspective, he can control local pref value for AS1-AS2 link, and AS1-AS3 link. 
lets say default val = 100, and he assigns 110 to AS1-AS2 link, then any traffic from AS1 to AS4 will route via AS2. 
===> as such, setting local pref can control "OUTBOUND" traffic. (extremely useful for configuring primary & backup routes in practice) 
 
also, an AS can attach BGP "community" to a route to affect how neighbor ASes set local pref. 
thus, effectively controlling the incoming traffic by affecting neighboring ASes outbound traffic. 
 
===> AS4 advertises AS4-AS3 as primary and AS4-AS2 as backup, thus affecting AS2's local pref such that AS2 chooses to send traffic to AS4 via AS3 instead of the direct path. 
 
===> this sort of community arrangement requires "prior-agreement" 
 
 

#  MED (multi-exit discriminator) 

 
(video) 
https://www.youtube.com/watch?v=Z6Wc8o2ObUY#t=59 
 
suppose two ASes with a dual linkage, one link in SanFrancisco, and the other in NewYork. 
 
   AS1 
  |   | 
 SF   NY 
  |   | 
   AS2 
 
==> when a router inside AS2 wants to send traffic to a router in AS1, it chooses which one of SF or NY to go thru based on AS2's internal shortest IGP path(= hot potato routing). so some routers in AS2 will go to SF link, others will go to NY link. 
 
AS1 may advertise its BGP routes to AS2 with MED values.  SF:20, NY:10 
==> this will force AS2 routers to go to the NY egress. (recall MED > hot potato) 
 
==> not widely used, but used when for example AS2 is free-riding AS1's backbone network. typical when transit provider peers with content provider. 
 
==> also, in this example, assuming no MED override, if IGP cost details change within AS2, it might change lots of BGP dst prefix table entries, which can significantly affect traffic shift between SF & NY. research has been done to de-couple this IGP-BGP dependency. 
 
 

#  inter-domain routing business model 

 
ranking rule : 
customer > peer > provider 
 
AS_provider  ----------- 
  ||                   | 
AS_you -- AS_peer  --- AS_dst 
  ||                   | 
AS_customer  ----------- 
 
 
if AS_you use a link with 
- AS_provider : you have to pay them  (regardless of direction of traffic) 
- AS_peer     : free ( aka "settlement free peering") 
- AS_customer : you get paid (regardless of direction of traffic) 
 
route advertisement AS_you receive from 
- customer : re-advertise to everyone else (because more people use the route, AS_you get more money) 
- provider/peer :  only re-advertise to customers (if you re-advertise these to a provider or peer, you'd effectively be providing transit between them for free while paying for the privilege) 
 
===> this re-ad scheme(aka Filter/Ranking/Export rule) guarantees routing stability. 
 
but in practice, stuff like regional-peering and paid-peering violates the principle, thus BGP in practice may cause the oscillation issue below. 
 

# inter-domain routing oscillations 

 
(paper : Persistent route oscillations in inter-domain routing) 
http://research.cens.ucla.edu/people/estrin/resources/journals/2000jan-Varadhan-Govindan-Persistent.pdf 
 
(video) 
https://www.youtube.com/watch?v=Cb5VSo6h5_0 
 
ASes have local preference, but the updates by each AS can lead to continuous path updates by other ASes ad infinitum. 
 
Griffin later formalized the BGP correctness property called "safety" which can be achieved by the above provider/peer/customer re-ad scheme. 
 
(paper) 
http://www.cs.cmu.edu/~srini/15-744/readings/GW02.pdf 
 
 
 
 
################################################## 
####   (5)  Naming, Addressing & Forwarding    #### 
################################################## 
 
IP addressing. 
 
IPv4 - 32 bit number. 2^32 = 4 billion. 
 
dotted quad. e.g.  130.207.7.36 
 
## 
## pre-1994 : "classful" addressing 
## 
    bits  12345678 
class A:  0         ## first 8-bits = NW ID, remaining 24bits = host ID 
class B:  10        ## first 16-bits= NW ID, remaining 16bits = host ID 
class C:  110       ## first 24-bits= NW ID, remaining 8 bits = host ID 
 
around 1994, BGP table size grew to a level where we ran out of class C allocation. but there was demand for more... 
 
 
## 
##  IP addr allocation: 
## 
 
IANA (internet assigned numbers authority) 
- authority to assign addr space to regional routing registry (AfriNIC, APNIC, ARIN, LACNIC, RIPE) 
- each regional registry assigns to individual org (like, university, companies) 
 
IANA already depleted all /8 addr space. 
 
$ whois 130.207.7.36 
 
NetRange:       130.207.0.0 - 130.207.255.255 
CIDR:           130.207.0.0/16 
OriginAS:       AS2637               ##  AS number here 
NetName:        GATECH 
Ref:            http://whois.arin.net/rest/org/GIT-Z 
 
 
 
## 
## post-1994: CIDR (classless inter-domain routing) 
## 
 
32 bits: IP addr + mask 
 
 
e.g. 
65.14.248.0/22    # tells the first 22bits are NW ID 
                  # lets you have variable length to scope better 
 
 
but obviously confusion arose. 
 
e.g. 
65.14.248.0/22 
65.14.248.0/24    ## this is simply a subset of the /22 
 
===> what if those two prefix appear on a routing table? 
===> "longest prefix match" i.e. pick the longest mask 
 

# longest prefix match 

 
===> lets you have an efficient routing table thru hierarchy/aggregation 
e.g. 
 
  other NWs 
     || 
    NW_a 
    /  \ 
 NW_b  NW_c 
 
NW_a : 12.0.0.0/8 
NW_b : 12.20.0.0/16 
NW_c : 12.21.0.0/16 
 
in the above case, the rest of the world(other NWs) only need to have NW_a prefix in their routing table. 
hence, aggregation of IP prefixes. 
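
a quick sketch of LPM over the prefixes from the two examples above, using Python's ipaddress module (the table entries come from the examples; the 'link1'/'link2' next-hop labels are made up): 

import ipaddress 

table = { 
    '65.14.248.0/22': 'link1', 
    '65.14.248.0/24': 'link2', 
    '12.0.0.0/8':     'NW_a', 
    '12.20.0.0/16':   'NW_b', 
    '12.21.0.0/16':   'NW_c', 
} 

def lookup(addr): 
    ip = ipaddress.ip_address(addr) 
    matches = [p for p in table if ip in ipaddress.ip_network(p)] 
    if not matches: 
        return None 
    best = max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)   # longest mask wins 
    return table[best] 

print(lookup('65.14.248.10'))   # -> 'link2' : the /24 beats the /22 
print(lookup('12.20.5.1'))      # -> 'NW_b'  : the /16 beats the covering /8 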
 

# multi-homing 

 
multi-homing stymies prefix aggregation 
 
(video) 
https://www.youtube.com/watch?v=niUWkzKAsRI 
 
lets say AT&T (who owns 12.0.0.0/8) as an ISP allocates you (AS_you) 12.20.123.0/24 
and you want to be reachable thru AT&T and Verizon ISPs. 
 
the internet 
  |    | 
AT&T   Verizon 
   \  / 
   AS_you 
 
 
===> AT&T cannot aggregate and advertise only 12.0.0.0/8 to the internet, because then AT&T would be advertising 12.0.0.0/8 while Verizon advertises 12.20.123.0/24. 
===> by the longest prefix match policy, all traffic to AS_you would come via Verizon. thus AT&T also needs to advertise 12.20.123.0/24 
===> we end up having "lots" of /24s in the global routing table. 
 

# longest prefix match(LPM) to control inbound traffic 

 
AS_a : 10.1.0.0/16 
 
 
       AS_b 
     /      \ 
AS_a         AS_d 
     \      / 
       AS_c 
 
 
 
===> since AS_d will have two entries for 10.1.0.0/16, it will pick either one based on the policies discussed in Routing lecture. 
 
===> now, instead of advertising the same /16 thru AS_b & AS_c, AS_a might de-aggregate the prefix as below. 
 
10.1.0.0/17 
10.1.128.0/17 
 
then advertise each to AS_b & AS_c respectively (along with 10.1.0.0/16), thus effectively load-balancing inbound traffic. 
even if either link fails, the /16 will cover for it. 
 
===> but leads to more table entries. have to be careful. (the weekly CIDR Report publishes a list of the worst offenders) 
 
 
## 
## Lookup tables and how LPM works 
## 
- exact match VS LPM 
- IP addr lookup 
- LPM implementation - Tries 
 
 
##  lookup algorithms 
 
Protocol            | Mechanism         | Techniques 
------------------------------------------------------------------------------------------ 
MPLS, Ethernet, ATM | exact match       | direct/associative lookup, hashing, binary tree 
IPv4/6              | LPM               | radix trie, compressed trie, binary search on prefix interval 
 
 
Ethernet addr : 48-bits (global), exact match search = O(1), but inefficient use of memory 
 
LPM : why longest -> because among overlapping addr space, it matches the narrowest range. 
======> how do we efficiently search all prefix lengths + prefixes of given length? 
 
 

# IPv4 lookup example (with Exact match) 

====> even with exact match, there can (theoretically) be 32 diff prefix lengths. 
====> checking 32 diff patterns for every packet is extremely wasteful of memory. 
 

# IPv4 lookup example (with Tries) 

(video) 
https://www.youtube.com/watch?v=8tcPbGVVnDo 
 
routing table: 
P1 111* 
P2 10* 
P3 1010* 
P4 10101 
 
Trie: 0=left, 1=right   (called "single bit" trie) 
 
    A 
     \1 
      B 
    0/ \1 
    P2  C 
     \1  \1 
      D   P1 
    0/ 
    P3 
     \1 
      P4 
 
 
====> prefix is spelled out by following path from root 
====> keep track of non-empty node as you traverse, then that is your longest match 
====> delete/insert is easy. mem space usage is efficient. 
====> but if it's 32 bit addr, then that is 32 accesses -> too many. 
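
a minimal single-bit trie sketch for the routing table above (prefixes given as bit strings; the Node class is just for illustration): 

class Node: 
    def __init__(self): 
        self.child = {}      # '0' or '1' -> Node 
        self.prefix = None   # e.g. 'P3' if a prefix ends at this node 

def insert(root, bits, name): 
    node = root 
    for b in bits: 
        node = node.child.setdefault(b, Node()) 
    node.prefix = name 

def lpm(root, bits): 
    node, best = root, None 
    for b in bits: 
        if node.prefix is not None:   # remember the last non-empty node on the path 
            best = node.prefix 
        node = node.child.get(b) 
        if node is None:              # can't follow the address any further 
            return best 
    return node.prefix if node.prefix is not None else best 

root = Node() 
for bits, name in [('111', 'P1'), ('10', 'P2'), ('1010', 'P3'), ('10101', 'P4')]: 
    insert(root, bits, name) 

print(lpm(root, '10111'))   # -> 'P2' : the longest match is just the 10* prefix 
print(lpm(root, '10101'))   # -> 'P4' 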
 
alternative is "direct trie" where it has N size branching factor just to ensure certain worst case access count. 
 
e.g. 
 
    root        // Level 0 
  ////\\\\\ 
 (2^24 nodes)   // lets you have 24bits worth possibilities (maybe NW ID space) 
 //\\ 
(2^8 nodes)     // each of level 1 node has 2^8 child nodes (maybe host ID space) 
 
===> ensures a 2-access lookup, but this is inflexible and results in inefficient use of memory 
===> e.g. if you want to represent a /16 prefix, but the first level covers 24 bits, it means 2^8 identical entries at level 1. 
 
GOOD quiz from video 
 
given the below direct trie, how many entries for a /20 prefix ? 
 
   root 
   ||||| 
  (16-bits) 
   ||||| 
  (8-bits) 
   ||||| 
  (8-bits) 
 
 
the first 16 bits of the /20 select a single entry at level 1; that entry points to a second-level (8-bit) node, where the remaining 4 bits of the prefix leave 2^(8-4) = 16 entries to fill. answer: 16 
 
 
## 
##  memory efficiency + fast lookup 
## 
 
obvious trade-off of speed(direct trie) and space(single bit trie) 
 
===> good compromise = multi-bit trie 
 
depth  :  w/k 
degree :  2^k 
stride :  k-bits         // single bit trie k = 1 
 
 
e.g.  if k = 2, it's 4-ary trie 
 
routing table: 
P1 111* 
P2 10* 
P3 1010* 
P4 10101 
 
 
        root 
     10/    \11 
      P2     A 
   10/    10/ \11 
    P3     P1  P1 
 10/ \11 
  P4  P4 
 
===> how about lookup request comes for "10111" ? 
===> we fwd to P2 
 
(video) 
https://www.youtube.com/watch?v=e_5cnbLCeDw 
 
===> to save space further: "leaf-pushed" tries, path-compressed "Patricia" tries, the 3-level "Lulea" scheme with bitmap compression for repeated entries, etc. 
 
 
## 
##  alternative to LPM with Tries 
## 
- CAM (content addressable memory)  :  input = tag (e.g. address) 
                                       output= value (e.g. port) 
===> only exact match O(1) 
 
- ternary CAM (0,1,*)     ===> wildcard permits implementation of LPM 
 
 
## 
##  NAT and IPv6  (solution to IPv4 addr depletion) 
## 
 
#### 
#### NAT (network addr translation) can reuse the same private IP addr space. 
#### 
 
- popular on broadband, SOHO, VPN 
 
192.168.0.0/16   (see RFC1918/3130 for private IP addr space) 
 
      private                  global                                      global 
NW_a (192.168.0.0/16) --- NAT (203.178.1.1) --    the     ---  some host (18.31.0.38) 
NW_b (192.168.0.0/16) --- NAT (133.4.1.5)  ---  internet 
 
 
lets say some host in NW_b (192.168.1.10) want to send packets to (18.31.0.38) above. 
 
packet                                      packet 
---------------------                      -------------------- 
src 192.168.1.10:1234   ======(NAT)=====>   src 133.4.1.5:5678 
dst 18.31.0.38:80                           dst same 
 
PROS:  saves addr space 
CONS:  breaks the end-to-end model, asymmetric (if the NAT breaks, the communication path breaks) 
 
 
#### 
####  IPv6 
#### 
 
see the packet header structure IPv4 VS IPv6(simpler !) 
(video) 
https://www.youtube.com/watch?v=Z6i9BRtXy_I 
 
- 128 bits 
top 48-bits : public routing topology (= 3bit aggregation + 13bit tier1 ISP + 8bit reserved + 24bit additional) 
    16-bits : site identifier 
    64-bits : interface ID = 48-bit ethernet + 16 more bits 
 
- simpler header 
- more addr space 
- multi-homing 
- security (ipv6 crypto extension) 
 
====> notice 13bit tier1 ISP part, this means changing ISP requires re-numbering. 
 
in 2014, for BGP routes 
IPv6 : 16k 
IPv4 : 500k 
 
=====> always hard to deploy incrementally (recall the narrow waist, where other layers depend on IPv4), which causes incompatibility 
 
 
##  IPv6  "incremental deployment" 
- "dual stack" that speaks both IPv4/6 
- containing v4 addr inside v6 addr 
- v6 to 4 tunneling (needs some gateway routers that do encapsulation/decapsulation) 
 
 
 
################################### 
####   (5.1)  Router Design    #### 
################################### 

# need for the design of big fast modern routers (e.g. Cisco CRS-1, Juniper M320) 
# (especially in backbone NW) 
- links are getting faster 
- traffic demands are increasing (e.g. streaming video) 
- NWs are getting bigger (hosts, users, etc) 
 

# basic router architecture 

(1) receive packet 
(2) look at header to determine dst 
(3) look in fwd table to determine output interface 
(4) modify header (e.g. TTL) 
 
                Single Line card (I/O) 
              ------------------------------------------------- 
              |                                               | 
packet -----> |  lookup table -----> update header ---> queue --> output port 
[data|header] |  (IP <-> next hop)                            | 
              |                                               | 
              ------------------------------------------------- 
 
===> usually we have a collection of line cards connected via an inter-connection fabric 
 
(1) how big should queues(buffers) be ?   (already discussed) 
(2) how lookup works ?                    (already discussed) 
(3) how should we place lookup tables in line cards ? 
(4) how to design the inter-connection fabric ? 
 

# how should we place lookup tables in line cards ? 

 
- traditionally, multiple line cards shared one central lookup table in some shared buffer memory thru shared bus. 
-- caused contention, bottleneck on shared mem/bus. 
 
-> modern line cards all retain local copies of the lookup table. (for high speed) 
 
 

# how to design the inter-connection fabric ? 

 
- how do we connect multiple line cards? 
- shared bus ?  --> only supports one input/output at a time. - no good. 
- ideally, given the 6 line cards below, as long as pairs don't compete, we want to enable them to send in the same timeslot. 
 
     interconnect fabric 
LC_1   ---------------->  LC_4 
LC_2   ------\/-------->  LC_5 
LC_3   ------/\-------->  LC_6 
 
 
===> solution : cross-bar switch (aka switched backplane) 
 
## cross-bar switching 
- every input port has a connection to every output port 
- during each timeslot, each input connects to zero/one output 
 
e.g.  you can pair up 1-4, 2-6, 3-5, etc 
 
1--------- 
     | | | 
2----|-|-| 
     | | | 
3----|-|-| 
     | | | 
     | | | 
     4 5 6 
 
Advantage  :  parallelism 
requirement:  good scheduling algorithm (to ensure fair use of the cross-bar switch) 
 

# switching algo : maximal matching 

 
conceptually:  N inputs,  N outputs 
 
- in each timeslot, we wanna achieve a one-to-one mapping between inputs and outputs 
 
(video for visual) 
https://www.youtube.com/watch?v=GJHB5KW6JXY 
 
- using the above cross-bar graph, maybe there are needs to 1-4, 2-4, 2-5, 2-6, 3-4, 3-6. 
- which ones do you keep/queue to achieve the max number of matched non-competing pairs? 
- (queued ones will be processed at next timeslot) 
- lets say we got 3 pairs maximally matched in a given timeslot. if each line card's thru-put is 10Gb/s, then the inter-connect fabric must run at 30Gb/s or more to process them all (the "speed-up" feature of a cross-bar switch). also, if some queueing builds up (lets say three packets) at the output of (lets say) LC_1, and the packets are destined to diff LCs, the speed-up lets the inter-connect drain the queue quickly. but obviously, if the queue is long, there will still be some latency. 
 
===>  solution : virtual output queues. 
instead of having a single output queue, virtually set up a queue per output port(e.g. LC4/5/6). see video. 
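
a minimal greedy sketch of maximal matching for one timeslot, using the requests listed above (1-4, 2-4, 2-5, 2-6, 3-4, 3-6); real switches use iterative schedulers such as iSLIP, this just illustrates "no input or output is used twice": 

def maximal_match(requests): 
    used_in, used_out, matched = set(), set(), [] 
    for i, o in requests: 
        if i not in used_in and o not in used_out: 
            matched.append((i, o)) 
            used_in.add(i) 
            used_out.add(o) 
    return matched 

print(maximal_match([(1, 4), (2, 4), (2, 5), (2, 6), (3, 4), (3, 6)])) 
# -> [(1, 4), (2, 5), (3, 6)] : three transfers can be served in the same timeslot 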
 
 

#  scheduling and fairness 

 
scheduling : deciding which inputs & outputs get matched during each timeslot 
-- wanna ensure 
(1) efficiency 
(2) fairness     # a tricky term 
 
##  "max-min fairness"  method 
definition: 
- small demands get what they want. 
- large users split the rest of the capacity. 
 
e.g. 
flow "rate" allocation : {x1,x2,x3,,,,xn} 
---> increasing any xi => some other xj < xi must be decreased.   ## if this is achieved, we call it max-min fair 
 
(formally) 
we allocate resources to users in the order of increasing demands. 
no user receives more than what they requested. 
users with unsatisfied demands split the remaining resources. 
 
(ref) 
http://en.wikipedia.org/wiki/Max-min_fairness 
 
e.g. 
Demands{2.0, 2.6, 4.0, 5.0} 
capacity: 10 
 
10/4 = 2.5     # first user only needs 2.0, leaving an excess of 0.5 
0.5/3 = 0.167  # second user only needs 2.6 (< 2.667), leaving an excess of 0.067 
0.067/2= 0.033 # split between the third/fourth users, giving each 2.7 
 
[2.0, 2.6, 2.7, 2.7] 
 
 
======> now look at the definition again. it makes sense. this combats congestion better, because an ill-behaved bursty flow only punishes itself. 
 
e.g. 
Demands {1,2,5,10} 
capacity: 20 
 
20/4 = 5   [5,5,5,5]    # first/second users will have excess of 4,3 respectively 
                        # give them to the user with unsatisfied flow that is only the fourth user 
                        # thus resulting allocation is [1,2,5,12]   NOT  [1,2,5,10] 
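
a minimal "progressive filling" sketch of the allocation procedure, reproducing the first example above: 

def max_min_fair(demands, capacity): 
    alloc = [0.0] * len(demands) 
    remaining = sorted(range(len(demands)), key=lambda i: demands[i])   # smallest demand first 
    cap = float(capacity) 
    while remaining: 
        share = cap / len(remaining)          # split what's left equally 
        i = remaining[0] 
        if demands[i] <= share:               # smallest remaining demand is fully satisfied 
            alloc[i] = demands[i] 
            cap -= demands[i] 
            remaining.pop(0) 
        else:                                 # everyone left is unsatisfied: equal split 
            for j in remaining: 
                alloc[j] = share 
            return alloc 
    return alloc 

print(max_min_fair([2.0, 2.6, 4.0, 5.0], 10))   # -> [2.0, 2.6, 2.7, 2.7] 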
 

#  How to achieve max-min fairness ? 

(video) 
https://www.youtube.com/watch?v=gMtA0M3-i00 
 
- Round-Robin scheduling: service each queue in order : problem is each queue may have diff size packets. 
- bit-by-bit scheduling : service only bit in each queue : problem is feasibility of processing only a bit from a queue 
- "fair queueing"       : service packets according to their soonest "finish time" 
                        : it looks at non-empty flow queue, and estimates virtual finish time, and process from the minimum finish time packets. 
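
a rough sketch of the "soonest finish time" idea: each arriving packet gets a virtual finish time = max(its flow's last finish time, its arrival time) + its size, and the scheduler always serves the packet with the smallest finish time (this glosses over the real virtual-clock bookkeeping in fair queueing): 

import heapq 

last_finish = {}   # flow -> virtual finish time of its last queued packet 
heap = []          # (finish_time, flow, size) 

def arrive(flow, size, now): 
    finish = max(last_finish.get(flow, 0), now) + size 
    last_finish[flow] = finish 
    heapq.heappush(heap, (finish, flow, size)) 

def serve(): 
    return heapq.heappop(heap)      # packet with the soonest virtual finish time 

arrive('A', 100, 0)    # small packet on flow A 
arrive('B', 1500, 0)   # one large packet on flow B 
arrive('A', 100, 0)    # another small packet on flow A 
print(serve(), serve(), serve())   # both A packets are served before B's big one 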
 
 
 
 
 
########################## 
####   (5.2)  DNS    ##### 
########################## 
 
 
DNS(domain name system) is a hierarchical distributed naming system for the internet resources. like a phone book. 
 
e.g.  maps URL with IP addr. 
www.gatech.edu -> 130.207.160.173 
 
 
client looks up /etc/hosts or local DNS resolvers(servers) which is usually auto-configured by DHCP. 
 
www.gatech.edu sits under gatech.edu, which sits under the .edu domain (hierarchy: root -> .edu -> gatech.edu -> www.gatech.edu) 
local DNS servers may have cache(with TTL) but if not, then query their upstream (higher authority) DNS servers. 
 

# record types: 

 
A record    : name -> IP addr 
NS record   : name -> authoritative nameserver for that name(domain) = "referrals" 
MX record   : name -> mail server 
CNAME record: canonical name (alias name) 
PTR record  : IP -> name  (reverse lookup) 
AAAA record : name -> IPv6 
 
===> see video on how a local DNS server typically resolves a query by following NS referrals from higher-authority DNS servers (A-NS, A-NS exchanges) until the final A record is returned. 
(video) 
https://www.youtube.com/watch?v=hSLNgI9Sr9s 
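
the dig examples below show the raw records; for a quick scripted check, Python's standard socket module can do forward and reverse lookups through the same local resolver: 

import socket 

name, aliases, addrs = socket.gethostbyname_ex('www.gatech.edu') 
print(name, aliases, addrs)      # canonical name (CNAME target), alias list, A-record addresses 

host, _, _ = socket.gethostbyaddr(addrs[0]) 
print(host)                      # reverse (PTR) lookup of the first address 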
 

#  DNS lookup example (using "dig") 

 
# lets check www.gatech.edu 
 
$ dig www.gatech.edu 
 
; <<>> DiG 9.9.5-3-Ubuntu <<>> www.gatech.edu 
;; global options: +cmd 
;; Got answer: 
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12817 
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 2, ADDITIONAL: 3 
 
;; OPT PSEUDOSECTION: 
; EDNS: version: 0, flags:; udp: 4096 
;; QUESTION SECTION: 
;www.gatech.edu.                        IN      A                          ## our query, A-record 
 
;; ANSWER SECTION: 
www.gatech.edu.         300     IN      CNAME   tlweb.gtm.gatech.edu.      ## it was an alias.     300 = TTL 
tlweb.gtm.gatech.edu.   30      IN      A       130.207.244.165            ## then we got the IP.  30  = TTL (in seconds) 
 
;; AUTHORITY SECTION: 
gtm.gatech.edu.         300     IN      NS      gtm-dns-bcdc.gatech.edu. 
gtm.gatech.edu.         300     IN      NS      gtm-dns-rich.gatech.edu. 
 
;; ADDITIONAL SECTION: 
gtm-dns-bcdc.gatech.edu. 300    IN      A       130.207.160.47 
gtm-dns-rich.gatech.edu. 300    IN      A       130.207.165.192 
 
;; Query time: 522 msec 
;; SERVER: 203.178.142.130#53(203.178.142.130) 
;; WHEN: Tue Sep 02 18:40:01 JST 2014 
;; MSG SIZE  rcvd: 169 
 
 
 
# lets check nytimes.com 
 
$ dig nytimes.com 
 
; <<>> DiG 9.9.5-3-Ubuntu <<>> nytimes.com 
;; global options: +cmd 
;; Got answer: 
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28165 
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 6, ADDITIONAL: 7 
 
;; OPT PSEUDOSECTION: 
; EDNS: version: 0, flags:; udp: 4096 
;; QUESTION SECTION: 
;nytimes.com.                   IN      A 
 
;; ANSWER SECTION: 
nytimes.com.            500     IN      A       170.149.168.130         ##  you see two IP addrs 
nytimes.com.            500     IN      A       170.149.172.130         ##  this is for "load balancing" 
 
;; Query time: 172 msec 
;; SERVER: 203.178.142.130#53(203.178.142.130) 
;; WHEN: Tue Sep 02 18:43:43 JST 2014 
;; MSG SIZE  rcvd: 300 
 
 
 
# lets check specifically for NS record 
 
$ dig ns gatech.edu 
 
; <<>> DiG 9.9.5-3-Ubuntu <<>> ns gatech.edu 
;; global options: +cmd 
;; Got answer: 
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50822 
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 6 
 
;; OPT PSEUDOSECTION: 
; EDNS: version: 0, flags:; udp: 4096 
;; QUESTION SECTION: 
;gatech.edu.                    IN      NS                            ## issued NS query 
 
;; ANSWER SECTION: 
gatech.edu.             300     IN      NS      dns3.gatech.edu.      ## got three answers 
gatech.edu.             300     IN      NS      dns2.gatech.edu.      ## any of them can answer authoritatively 
gatech.edu.             300     IN      NS      dns1.gatech.edu. 
 
;; ADDITIONAL SECTION: 
dns1.gatech.edu.        3600    IN      A       128.61.244.253         ## here you got IP addr of those DNS servers 
dns1.gatech.edu.        300     IN      AAAA    2610:148:1f00:f400::3  ## for IPv6 as well 
dns2.gatech.edu.        3600    IN      A       130.207.244.81 
dns2.gatech.edu.        300     IN      AAAA    2610:148:1f01:f400::3 
dns3.gatech.edu.        1018    IN      A       168.24.2.35 
 
;; Query time: 275 msec 
;; SERVER: 203.178.142.130#53(203.178.142.130) 
;; WHEN: Tue Sep 02 18:45:55 JST 2014 
;; MSG SIZE  rcvd: 200 
 
 
 
# lets check MX record for email server 
 
 $ dig mx gatech.edu 
 
; <<>> DiG 9.9.5-3-Ubuntu <<>> mx gatech.edu 
;; global options: +cmd 
;; Got answer: 
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25730 
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 3, ADDITIONAL: 8 
 
;; OPT PSEUDOSECTION: 
; EDNS: version: 0, flags:; udp: 4096 
;; QUESTION SECTION: 
;gatech.edu.                    IN      MX 
 
;; ANSWER SECTION: 
gatech.edu.             300     IN      MX      10 mxip2.gatech.edu.     ## two email servers 
gatech.edu.             300     IN      MX      10 mxip1.gatech.edu.     ## 10 = preference (lower value = higher priority) 
 
;; ADDITIONAL SECTION: 
mxip1.gatech.edu.       300     IN      A       130.207.165.196          ## IP addr of the email servers 
mxip2.gatech.edu.       300     IN      A       130.207.160.176 
 
;; Query time: 177 msec 
;; SERVER: 203.178.142.130#53(203.178.142.130) 
;; WHEN: Tue Sep 02 18:48:42 JST 2014 
 
 
 
 
# now lets add trace option to see the recursive call we discussed in the lecture. 
# as you can see, it first queries "." then gets NS records for edu., then gatech.edu. 
 
$ dig +trace gatech.edu 
 
; <<>> DiG 9.9.5-3-Ubuntu <<>> +trace gatech.edu 
;; global options: +cmd 
.                       523     IN      NS      k.root-servers.net. 
.                       523     IN      NS      i.root-servers.net. 
.                       523     IN      NS      f.root-servers.net. 
.                       523     IN      NS      c.root-servers.net. 
.                       523     IN      NS      h.root-servers.net. 
.                       523     IN      NS      d.root-servers.net. 
.                       523     IN      NS      l.root-servers.net. 
.                       523     IN      NS      b.root-servers.net. 
.                       523     IN      NS      m.root-servers.net. 
.                       523     IN      NS      g.root-servers.net. 
.                       523     IN      NS      j.root-servers.net. 
.                       523     IN      NS      a.root-servers.net. 
.                       523     IN      NS      e.root-servers.net. 
.                       3565    IN      RRSIG   NS 8 0 518400 20140909000000 20140901230000 8230 . qO29qQ4hS1KsQSiGZ2rit1kW7MoxHFKICx91l3VXoOeSDXBGgpGPQdiN amTgbAz8XPug3mEzgmFsdLiqt/hrS7NRR+OKmX4SO8bDYDhx1N7pBXew xhT2d9R5te0P9L9qEctITe51n+cMxKU3KNzwirQvB6D2IvJzylLb0K97 A5Y= 
;; Received 913 bytes from 203.178.142.130#53(203.178.142.130) in 2361 ms 
 
edu.                    172800  IN      NS      a.edu-servers.net. 
edu.                    172800  IN      NS      c.edu-servers.net. 
edu.                    172800  IN      NS      d.edu-servers.net. 
edu.                    172800  IN      NS      f.edu-servers.net. 
edu.                    172800  IN      NS      g.edu-servers.net. 
edu.                    172800  IN      NS      l.edu-servers.net. 
edu.                    86400   IN      DS      28065 8 2 4172496CDE85534E51129040355BD04B1FCFEBAE996DFDDE652006F6 F8B2CE76 
edu.                    86400   IN      RRSIG   DS 8 1 86400 20140909000000 20140901230000 8230 . rOWSU8gDv19r+tGrca5YQlVd1sZmwFD+Klct3ayGoJvjPQTOwmNh2ifl +sY1ZeF0aJJ3f9mkug5AWlfupyyYRykd7jRWMfHSEhZLS1gJGk2V4XRG mjnHyp2eChLVn8arqN1yZfFn35z6PDkljy7eaZxdbF6ELdXclwAz7wet ob8= 
;; Received 481 bytes from 128.63.2.53#53(h.root-servers.net) in 584 ms 
 
gatech.edu.             172800  IN      NS      dns1.gatech.edu. 
gatech.edu.             172800  IN      NS      dns2.gatech.edu. 
gatech.edu.             172800  IN      NS      dns3.gatech.edu. 
9DHS4EP5G85PF9NUFK06HEK0O48QGK77.edu. 86400 IN NSEC3 1 1 0 - 9HVQECE2O4M8V3B21LPRECG4GFKRPKK3 NS SOA RRSIG DNSKEY NSEC3PARAM 
9DHS4EP5G85PF9NUFK06HEK0O48QGK77.edu. 86400 IN RRSIG NSEC3 8 2 86400 20140909094952 20140902083952 1790 edu. g/GZPEFm3QNMhgCKgCVG7q4J5SUA1hs3qQejMMvkRBingnRUNljQyoV6 WP0V7gK/HGuuVwH63UES7VIyiiibnPNXW4CKEBjEidqtj+UanYU+vKJf sZMpdKLD0gK2TcwLSR47+yyDay/FhdbggJnDarxP9X2CHvUSvsQnEo0Y DbU= 
KGDNA3UUELMTQU7DADUDVIQ027H2GFU6.edu. 86400 IN NSEC3 1 1 0 - KMMD6PFI08T89S5J7OE46QQUG4S1NQJP NS DS RRSIG 
KGDNA3UUELMTQU7DADUDVIQ027H2GFU6.edu. 86400 IN RRSIG NSEC3 8 2 86400 20140909012233 20140902001233 1790 edu. bjYXQAj29IVrzsIOyMpevt967JYNX1qDxwG/Ws2rrLQkTlFIb8h+5ac3 v8741wD6Gxlv+3BgrROcslwE9/izGahaVUAuvYwz/Mh0kNfy3UyM4gav 07as5hQn9ce3JT8lnE4dLPCQCuncbPFnMQscQeTz1j+vxjKdb4YJ6H3i 1To= 
;; Received 629 bytes from 192.42.93.30#53(g.edu-servers.net) in 1096 ms 
 
gatech.edu.             300     IN      A       130.207.160.173 
gatech.edu.             300     IN      NS      dns1.gatech.edu. 
gatech.edu.             300     IN      NS      dns2.gatech.edu. 
gatech.edu.             300     IN      NS      dns3.gatech.edu. 
;; Received 216 bytes from 128.61.244.253#53(dns1.gatech.edu) in 175 ms 
 
 
 
 
## now lets look at PTR 
## in the end, we get PTR, but what about the intermediate process? 
## first, as you can see, we ask the "." root servers 
## then instead of getting referred to ".edu" or ".com" we get referred to a special top-level domain called in-addr.arpa 
## in-addr.arpa holds referrals to the regional registries (APNIC, ARIN, RIPE, etc) 
## in this case, the next step is 130.in-addr.arpa maintained by ARIN 
## eventually resolving each octet. once you reach the gatech.edu DNS it can do the reverse lookup 
 
$ dig +trace -x 130.207.160.173 
 
; <<>> DiG 9.9.5-3-Ubuntu <<>> +trace -x 130.207.160.173 
;; global options: +cmd 
.                       257     IN      NS      b.root-servers.net. 
.                       257     IN      NS      e.root-servers.net. 
.                       257     IN      NS      h.root-servers.net. 
.                       257     IN      NS      c.root-servers.net. 
.                       257     IN      NS      l.root-servers.net. 
.                       257     IN      NS      i.root-servers.net. 
.                       257     IN      NS      f.root-servers.net. 
.                       257     IN      NS      j.root-servers.net. 
.                       257     IN      NS      a.root-servers.net. 
.                       257     IN      NS      d.root-servers.net. 
.                       257     IN      NS      k.root-servers.net. 
.                       257     IN      NS      g.root-servers.net. 
.                       257     IN      NS      m.root-servers.net. 
.                       3299    IN      RRSIG   NS 8 0 518400 20140909000000 20140901230000 8230 . qO29qQ4hS1KsQSiGZ2rit1kW7MoxHFKICx91l3VXoOeSDXBGgpGPQdiN amTgbAz8XPug3mEzgmFsdLiqt/hrS7NRR+OKmX4SO8bDYDhx1N7pBXew xhT2d9R5te0P9L9qEctITe51n+cMxKU3KNzwirQvB6D2IvJzylLb0K97 A5Y= 
;; Received 913 bytes from 203.178.142.130#53(203.178.142.130) in 1 ms 
 
in-addr.arpa.           172800  IN      NS      a.in-addr-servers.arpa. 
in-addr.arpa.           172800  IN      NS      b.in-addr-servers.arpa. 
in-addr.arpa.           172800  IN      NS      c.in-addr-servers.arpa. 
in-addr.arpa.           172800  IN      NS      d.in-addr-servers.arpa. 
in-addr.arpa.           172800  IN      NS      e.in-addr-servers.arpa. 
in-addr.arpa.           172800  IN      NS      f.in-addr-servers.arpa. 
in-addr.arpa.           86400   IN      DS      53696 8 2 13E5501C56B20394DA921B51412D48B7089C5EB6957A7C58553C4D4D 424F04DF 
in-addr.arpa.           86400   IN      RRSIG   DS 8 2 86400 20140909000000 20140901230000 50944 arpa. Qn5LgBItaKVkE9v2w1i/c/ytn25Rcp4o6zkvqgzXcR4bUlpZu0M/nWWh 695PeG23LXQoJRrbQu+LBpYclTBFB5RM30dUtzfXRRDSuB+sler1Vkmx Z6m+Uav1XEhRAM+qWoGZjKFSyy/AC3kJhlbONjNugWW6nKYH1yyr2Hp0 oDI= 
;; Received 645 bytes from 193.0.14.129#53(k.root-servers.net) in 1869 ms 
 
130.in-addr.arpa.       86400   IN      NS      r.arin.net. 
130.in-addr.arpa.       86400   IN      NS      t.arin.net. 
130.in-addr.arpa.       86400   IN      NS      u.arin.net. 
130.in-addr.arpa.       86400   IN      NS      v.arin.net. 
130.in-addr.arpa.       86400   IN      NS      w.arin.net. 
130.in-addr.arpa.       86400   IN      NS      x.arin.net. 
130.in-addr.arpa.       86400   IN      NS      y.arin.net. 
130.in-addr.arpa.       86400   IN      NS      z.arin.net. 
130.in-addr.arpa.       86400   IN      DS      22794 5 1 A8F6CD5A570993125F227CD49C8390B9D1B79D1F 
130.in-addr.arpa.       86400   IN      RRSIG   DS 8 3 86400 20140909203418 20140902082741 33363 in-addr.arpa. Ep0hzJc/gRR+VpN6c/FClZ13X9f8gHcsTwyF4Kp9smPvzggKncmI2vtq dGz+/m6RO3jKxNN929QsPvwxTQVUii8hqVy2HmTDqDDJ4O117hbldmN3 1o/GjWu6DKLN6PP5KWGzjoukhc3W/H58J9+7KsKg2tql23m2yZGNWGHH 2xE= 
;; Received 401 bytes from 2001:dd8:6::101#53(e.in-addr-servers.arpa) in 13 ms 
 
207.130.in-addr.arpa.   86400   IN      NS      dns3.gatech.edu. 
207.130.in-addr.arpa.   86400   IN      NS      dns1.gatech.edu. 
207.130.in-addr.arpa.   86400   IN      NS      dns2.gatech.edu. 
207.130.in-addr.arpa.   10800   IN      NSEC    208.130.in-addr.arpa. NS RRSIG NSEC 
207.130.in-addr.arpa.   10800   IN      RRSIG   NSEC 5 4 10800 20140912080109 20140902080109 13244 130.in-addr.arpa. p0xZ/7phyxdHOQy6bRCWVaWJIswTorkRX225sjUsbrSrrz8uN8/nAixM KAvyk6kPeOXIWFyXdYYbtaOpWnyFGA5DjzLWTq6+MCHj4C6wf9mwbjAe H7+LoQuc7nBVL0FYxkkDFacQ0Cx702X2jEFTCTkl5yQakRfz0xeeHxpj /+Y= 
;; Received 342 bytes from 192.42.93.32#53(y.arin.net) in 98 ms 
 
173.160.207.130.in-addr.arpa. 300 IN    PTR     tlw-cache.bcdc.gatech.edu. 
160.207.130.in-addr.arpa. 300   IN      NS      dns2.gatech.edu. 
160.207.130.in-addr.arpa. 300   IN      NS      dns1.gatech.edu. 
160.207.130.in-addr.arpa. 300   IN      NS      dns3.gatech.edu. 
;; Received 257 bytes from 2610:148:1f01:f400::3#53(dns2.gatech.edu) in 178 ms 
 
 
 
 
 
################################################### 
#####   (6) Congestion control & streaming    ##### 
################################################### 
 
resource control (handling bandwidth constraints on links), and content distribution (streaming) 
 

# congestion control : fill the internet pipes without overflowing them 
#                    =  (1) efficient use  (2) fair allocation   (3) avoid congestion collapse 
 
typical congestion : congestion collapse 
- increase in load can lead to decrease in work done (just like traffic jam) due to (1) spurious re-transmission (2) undelivered packets 
- packets are still fwd'ed but getting dropped a lot means less work is done. 
 
Work 

|     ------    (saturation) 
|    /      \ 
|   /        \   (collapse) 
|  /          \ 
| / 
|/ 
-------------------------   Load 
 
 
 

#  two approaches to congestion control 

 
(1) end-to-end 
(2) NW-assisted 
 
 
(1) end-to-end  (typical TCP) 
- no feedback from NW. 
- congestion inferred by loss & delay 
 
(2) NW-assisted 
- routers provide feedback (thru a single-bit flag, instructing explicit rates, etc) 
- the TCP ECN(explicit congestion notification) extension uses a single bit 
 

# TCP congestion control 

- in TCP, "buffer drop(ie packet loss)" is considered a sign of NW congestion. 
(obviously, this is not true in wireless NW for example where no congestion but interference can cause packet loss) 
- so in TCP, senders increase rate until packets are dropped, then slow down. (see below two approaches) 
 

#  two approaches to adjusting rates 

 
(1) window-based (AIMD) 
- "window" indicating how many outstanding(in-flight) un-acked packets can be in the NW. 
- increase window size = increase rate 
 
additive increase       : once W(n packets) successfully completes without packet loss, then try W+1 (n+1 packets) 
multiplicative decrease : if packet drop for W window, then try W/2 (n/2 packets per window).  (or not always by half, but maybe some constant) 
 
(2) rate-based 
- monitor loss rate 
- use timer to modulate the transmission rate 
 
====> (1) window-based is common. 
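
a tiny sketch of the resulting sawtooth, pretending a loss happens whenever the window exceeds some fixed capacity (a stand-in for real network feedback): 

def aimd(rounds=20, capacity=16, w=1): 
    trace = [] 
    for _ in range(rounds): 
        trace.append(w) 
        if w > capacity: 
            w = max(1, w // 2)   # multiplicative decrease on a drop 
        else: 
            w += 1               # additive increase: +1 packet per RTT 
    return trace 

print(aimd())   # the window climbs by 1 each RTT, then halves: the classic sawtooth 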
 
 
## 
##  a quiz 
## 
 
RTT: 100ms 
packet size: 1KB (=8K bits) 
window size: 10 packets 
 
===> what is the sending rate?    ==>   10 * 8K * (1000ms/100ms) = 800Kbps 
 
 
 
## 
##  phase plot : to understand optimal point of efficiency and fairness of NW resource allocation 
## 
 
lets think about efficiency first 
 
x2 


|\ 
| \ 
|  \ 
|   \   B 
|    \ 
| A   \ 
|      \ 
|       \ x1 + x2 = c (capacity, "efficient") 
|----------------x1 
 
if usage falls onto plane A, then it's "under-utilized" 
if usage falls onto plane B, then it's "over-utilized" 
 
===> now lets apply fairness, to derive the optimal point 
 
x2 


|\      / x1 = x2 ("fair") 
| \    / 
|  \  / 
|   \/   # this cross over point 
|   /\   # is the optimal point of efficiency and fairness 
|  /  \ 
| /    \ 
|/      \ x1 + x2 = c (capacity, "efficient") 
|----------------x1 
 
 
====> AIMD will converge to efficient and fair optimal point. 
(see video) 
https://www.youtube.com/watch?v=EfeAglStSd4 
 
AI : gets us closer to efficiency line, further(same) from fairness line 
MD : gets us closer to fairness line, further from efficiency line 
 
 

# AIMD for TCP congestion control 

- distributed (i.e. each sender can follow AIMD independently) 
- fair & efficient (as seen above) 
 
(must see video) 
https://www.youtube.com/watch?v=U5eV9HJtOYw 
 
TCP sawtooth. 
 
why  Wm/2 + 1  ?   ==> every RTT without packet loss, we add 1 packet to the window. 
                   ==> we add Wm/2 + 1 times at each sawtooth, +1 is the last addition that causes the buff drop. 
 
p = total packets sent per packet loss 
  ≈ 1/2 * Wm/2 * (Wm/2 + 1) * k     (proportional to Wm^2) 
 
loss rate = 1/p 
 
so  Wm ≈ k * sqrt(p)  =  k / sqrt(loss rate) 
 
Thruput = L = 3/4 * Wm/RTT                 # the average window over a sawtooth is 3/4 of the peak Wm 
           ≈ k / (RTT * sqrt(loss rate))   ## tells Tput is inversely proportional to RTT and to the sqrt of the loss rate 
 
 

#  TA-recommended wiki 

(pi chart is good) 
http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Flow_control 
http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Congestion_control 
 
 

#  TCP congestion control in Data Centers 

 
Data centers: 
 
 
         NW 
          ||       #  high "fan-in" : requires high bandwidth and low latency 
       switches    #  switches only have small buffers 
    /    ||   \ 
racks  racks  racks (of servers) 
 
"Fan-in is the number of inputs a gate can handle." (wiki) http://en.wikipedia.org/wiki/Fan-in 
 
"incast" : TCP collapse in application throughput when servers using TCP all simultaneiously request data. 
         : fill-up of switch buffer result in retransmission burst that is caused by TCP timeouts which can last a few hunderds msec. 
         : but RTT in DC NW is typically less than 1 msec. 
         ==> senders will have to wait for TCP timeouts before retransmission. 
         ==> this leads to gross under-utilization of many-to-one communication NW like DC. 
 
(ref) 
http://www.pdl.cmu.edu/Incast/ 
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-40.pdf    ## see Fig 1 in page 2 
 

# "barrier synchronization" - common request pattern in DCs today 

- a client/app has multiple parallel threads and no thread can continue until all threads reach a certain point of execution. 
- but as seen above, this can lead to TCP incast. 
e.g. one thread suffers a packet drop while all other threads succeed. the successful threads complete within an RTT, but the failed thread must wait for a TCP timeout that can be far bigger than the RTT. while it waits, the link can sit idle and under-utilized. (besides, adding more servers to the NW overflows the switch buffer, causing severe packet loss and leading to thruput collapse.) 
 
(solutions) 
- bring the TCP timeout closer to the RTT (fine-grained retransmission timers) 
- only ack every other packet (rather than every packet) 
 
 
 
## 
##  multi-media & streaming 
## 
 
(challenges) 
- large volume of data 
- data volume varies over time (rate not constant) 
- low tolerance for delay and delay variation (jitter) 
(some loss is still acceptable) 
 
digitizing audio/video :  sampling & quantization 
 
resulting bit rate:  (size of a sample in bits) * (number of samples per sec) 
            e.g.  8 bits           * 8000 samples/sec 
                = 64kbps 
 

# video compression 

 
image compression         -> spatial redundancy 
compression across images -> temporal redundancy 
 
reference/anchor frame (I-frame) 
derived frame (P-frame) = I-frame + motion vectors(for blocks that changed) 
 
common video compression : MPEG 
 

# streaming video 

- server stores audio/video files 
- client requests 
- server sends it in slices/segments, each with a timestamp (so the client knows in which order to play them) 
- client maintains a playout buffer to play smoothly.  (obviously, the client should first wait for the playout buffer to build up some reserve) 
- (as discussed, some loss or initial delay is ok, but retransmission can lead to playout buffer starvation) 
 
===> clearly TCP is not a good fit. 
- reliable delivery 
- slow down upon loss 
- header overhead (20bytes per packet) 
 
===> hence UDP (user datagram protocol) 
- no retransmission 
- no sending rate adaptation 
- smaller header 
(adjusting send rate, transmission timing, how to encapsulate, whether to retransmit, all left to higher layer e.g. app to decide) 
- still needs to be TCP friendly cos we share the same link 
 
 
youtube: flash/html5 over http/tcp (simple, not UDP) at the expense of video quality. CDN(content distrib NW) - later lecture. 
       : youtube uses DASH  https://en.wikipedia.org/wiki/Dynamic_Adaptive_Streaming_over_HTTP 
skype,voip: A/D (analog-to-digital) conversion is done at the app level. 
 
skype: 
- central login server 
- p2p data exch 
- compress audio (67 bits/packet, 140 packets/sec, = 40kbps each direction) 
- encryption 
 
====> weak against NW delays/congestion/disruption 
====> QoS (1) explicit reservation (2) mark certain packets higher priority 
 

#  QoS : marking and policing 

- apps compete for bandwidth 
:solutions 
- marking certain app packets higher priority (maybe set up separate queues of diff priority) 
- scheduling (service higher priority queue more frequently than less prioritized queue) 
- fixed b/w per app (= inefficient) 
- admission control (blocks if over-cap, like busy signal on phone, not realistic for web services) 
 
 
# e.g.   multiple queues for each flow instead of one big queue at switch/router 
 
(classmate comment from piazza) 
 
Packets are put into queues by the switch/router/device via classification (e.g. noticing that the IP protocol field is TCP, ok that goes in queue 1, this one is UDP, ok that goes in queue 2). The packets themselves can't escape this classification or queue placement (applications could change from using TCP to UDP or vice-versa if they find they're getting better speeds, but that's a bit higher level). Allowing packets to manipulate their queue locations is simply not possible because the packets are treated as data- they don't get to decide which queue they go in. 
 
That said, packets can have QoS markings either in the CoS bits of VLAN tag or in the ToS bits of the IP header. However, in a properly set-up QoS network, you (almost) never trust QoS values as they come from an end-user system. You re-mark based on your own classifications (e.g. VoIP traffic should be high because it can't handle loss, undesirable services may be dropped or throttled, etc.) that you find in packet headers at various levels. Once a trusted device (switch/router/firewall) has classified and re-marked packets to "proper" values for your network, the rest of the devices can queue them based on these QoS bits. 
 
 
 
############################################# 
#####     assignment 3  - parking lot   ##### 
############################################# 
 
(udacity) 
https://www.udacity.com/wiki/cn/assignment3-parking-lot 
 
(video) 
https://www.youtube.com/watch?v=ZTQgPSE3NWg 
 
 

#  parking lot topology 

 
b/w   : 10Mbps     #  changed from 100 to 10 
delay : 1ms 
 
N senders each connected to a corresponding switch, which connects to a lone receiver. 
 
sender   sender   sender           sender 
  |        |        |                | 
switch - switch - switch -- ... -- switch - receiver 
 
 
iperf : to generate simultaneous TCP flows from each sender to the receiver.  just man iperf 
 
run for 60 sec, and plot the thruput timeseries from each sender as you increase N. 
 
edit parkinglot.py 
 
code 
(1) set up a topology for a given value of N 
(2) generate flows using iperf 
(3) monitor the thruput of all flows 
 
(result should look like this) 
https://lh3.ggpht.com/PnWypC2R9zax2bXfCI4u_tKUHInGXxU9T7LxYtjEthubbCQfatbsY1f0Dv6GMHQCRq4DnUH7X4hvmNtm1A=s500 
 
 
## 
##  iperf notes 
## 
 
http://en.wikipedia.org/wiki/Iperf 
 
run from both server and clients. 
 
 
#------ (snippet from the assignment code)-------- 
 
# server side 
'iperf -s -p 5001' 
 
-s (means server) 
-p port_number 
 
# client side 
'iperf -c %s -p %s -t %d -i 1 -yc > %s/iperf_%s.txt' % (recvr.IP(), 5001, seconds, args.dir, node_name) 
 
-c server_IP  ## run as client and connect to the target IP 
-p port_number 
-t duration_in_sec 
-i interval_in_sec 
-yc (tells the output format = CSV) 
-P number_of_flows  ## default = 1  # number of parallel client threads to run 
 
# incidentally, you can run mininet-version of iperf within mininet. 
 
mininet> iperf sender receiver 
*** Iperf: testing TCP bandwidth between sender and receiver 
*** Results: ['12.3 Mbits/sec', '12.3 Mbits/sec'] 
mininet> 
 
mininet> receiver iperf -s & 
mininet> sender iperf -i 1 -c receiver 
[ ID] Interval       Transfer     Bandwidth 
[  3]  0.0- 1.0 sec  1.25 MBytes  10.5 Mbits/sec 
[  3]  1.0- 2.0 sec  4.25 MBytes  35.7 Mbits/sec 
[  3]  2.0- 3.0 sec  5.38 MBytes  45.1 Mbits/sec 
[  3]  3.0- 4.0 sec  5.75 MBytes  48.2 Mbits/sec 
[  3]  4.0- 5.0 sec  5.75 MBytes  48.2 Mbits/sec 
[  3]  5.0- 6.0 sec  5.62 MBytes  47.2 Mbits/sec 
[  3]  6.0- 7.0 sec  6.00 MBytes  50.3 Mbits/sec 
[  3]  7.0- 8.0 sec  5.62 MBytes  47.2 Mbits/sec 
[  3]  8.0- 9.0 sec  5.88 MBytes  49.3 Mbits/sec 
[  3]  9.0-10.0 sec  5.62 MBytes  47.2 Mbits/sec 
[  3]  0.0-10.0 sec  51.2 MBytes  42.9 Mbits/sec 
 
 
# NOTE:  iperf's default TCP congestion control algo is CUBIC TCP 
basically concave - steady - convex growth, where the plateau/inflection point is the window size at which the previous packet drop happened 
(ref) 
http://en.wikipedia.org/wiki/CUBIC_TCP 
 
 

# opening a flow from each host to the receiver 

 
    # start an iperf server on the receiver, logging to <dir>/iperf_server.txt
    recvr = net.getNodeByName('receiver')
    port = 5001
    recvr.cmd('iperf -s -p', port,'> %s/iperf_server.txt' % args.dir, '&')

    # launch one iperf client per sender host h1..hN, each writing CSV output to its own file
    sender = []
    for i in range(n):
        sender.append( net.getNodeByName('h' + str(i+1)) )
        waitListening(sender[i], recvr, port)   # (from mininet.util) block until the server is accepting connections
        sender[i].sendCmd('iperf -c %s -p %s -t %d -i 1 -yc > %s/iperf_%s.txt' % (recvr.IP(), 5001, seconds, args.dir, sender[i]))
    # wait for every client to finish its run
    for i in range(n):
        sender[i].waitOutput()
 
 
# NOTE: if crashed in the middle, just issue "sudo mn -c" and "sudo killall iperf" and start fresh 
 
 
 
## 
##  TCP window size congestion control by iperf 
## 
 
as mentioned above, it is CUBIC TCP which does not do slow-start 
 
http://en.wikipedia.org/wiki/CUBIC_TCP 
http://www4.ncsu.edu/~rhee/export/bitcp/cubic-paper.pdf    ## talks about its logic in verbose details 
 
According to this doc, the client side of iperf uses a binary exponential backoff algorithm. 
https://iperf.fr/#adaptive 
 
other known congestion control/avoidance algo 
(1) SLOW START (exponential - linear)  # fast recovery means drop only by half and only employ linear growth (pretty much the basic AIMD) 
http://en.wikipedia.org/wiki/Slow-start 
http://www5e.biglobe.ne.jp/%257eaji/3min/42.html 
(2) WRED(weighted random early detection) 
http://www.itbook.info/qos/wred.html 
 
 
 
 
###################################################### 
######       Assignment 4 - Learning switch      ##### 
###################################################### 
 
pretty much a simple review of how an L2 switch learns routes and populates its MAC-to-port table. 
 
Learning switches work as follows: 
- If the path to the destination MAC address is known, send it down the known path. 
- If the path to the destination MAC address is not known, flood. 
Flooding is when you send the packet that came in out all other ports on the particular switch. So if you have a switch with ports 0, 1, 2, and 3, and a packet comes in port 0, a flood will send a copy of the packet out ports 1, 2, and 3. 
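a toy sketch (mine, not the assignment solution) of the learn-or-flood logic just described; the table name and port numbers are made up:

# toy L2 learning switch: learn source MAC -> ingress port, flood unknown destinations
mac_to_port = {}                         # per-switch table: MAC address -> output port
ALL_PORTS = [0, 1, 2, 3]

def handle_frame(src_mac, dst_mac, in_port):
    mac_to_port[src_mac] = in_port       # learn: the sender is reachable via in_port
    if dst_mac in mac_to_port:
        return [mac_to_port[dst_mac]]    # known destination: send down the learned port
    return [p for p in ALL_PORTS if p != in_port]   # unknown: flood out every other port

print(handle_frame("00:00:00:00:00:01", "00:00:00:00:00:02", 0))  # flood -> [1, 2, 3]
print(handle_frame("00:00:00:00:00:02", "00:00:00:00:00:01", 2))  # learned -> [0]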
 
 
## 
## what is pyretic ? 
## 
 
Frenetic :  network programming language for SDN. 
Pyretic : Frenetic wrapped with Python 
 
http://www.frenetic-lang.org/overview.php 
http://frenetic-lang.org/pyretic/ 
https://www.cs.princeton.edu/~jrex/papers/pyretic-login13.pdf 
https://github.com/frenetic-lang/pyretic/wiki/Language-Basics 
https://github.com/frenetic-lang/pyretic/wiki/Query-Policies 
 
 
# spanning tree commonly used in L2 switches in LANs 
 
ACM paper "An algorithm for distributed computation of a spanning tree in an extended LAN" 
http://david.choffnes.com/classes/cs4700fa14/papers/p44-perlman.pdf 
(we are glossing over this algo this time, but this is what is used by pyretic, NOT by real switches) 
 
 
## 
## combining policies in pyretic 
## 
 
In Pyretic, there are two ways that you can combine policies: parallel composition or sequential composition. Parallel composition is when you run the operations in parallel, and the output is the union of all the outputs of the different policies. Sequential composition is running two policies in sequence, and, if there is output, it is after passing through the multiple policies. 
 
In the original code that was released for Assignment 4, the policies were composed as follows: 
self.policy = new_policy + self.flood + self.query 
The + means parallel composition. In this case, new_policy is what happens if the path is in the switch's table, self.flood is the flood action, and self.query is to get all packets with a new source MAC address (which we'll gloss over for the time being). So, all these actions happen at the same time, which means that if we know how to get to MAC address X, we'll send it out via new_policy and by the flood action, rather than just through the new_policy. 
 
The new version checks whether we already know how to reach the destination: if so, use new_policy; otherwise flood: 
self.policy = if_(not_flood_pkts, new_policy, self.flood) + self.query 
 
# NOTE : in pyretic, port number starts from 1, and in mininet, it starts from 0. 
# NOTE : recall you can always print the entire topology in mininet 
 
mininet> net 
h1 h1-eth0:s1-eth0 
h2 h2-eth0:s2-eth0 
h3 h3-eth0:s3-eth0 
h4 h4-eth0:s4-eth0 
h5 h5-eth0:s4-eth1 
h6 h6-eth0:s5-eth1 
h7 h7-eth0:s5-eth0 
s1 s1-eth0:h1-eth0 s1-eth1:s2-eth1 
s2 s2-eth0:h2-eth0 s2-eth1:s1-eth1 s2-eth2:s5-eth2 s2-eth3:s3-eth1 
s3 s3-eth0:h3-eth0 s3-eth1:s2-eth3 s3-eth2:s4-eth2 
s4 s4-eth0:h4-eth0 s4-eth1:h5-eth0 s4-eth2:s3-eth2 
s5 s5-eth0:h7-eth0 s5-eth1:h6-eth0 s5-eth2:s2-eth2 
c0 
 
e.g. 
mininet> pingall 
mininet> h2 tcpdump > log.txt & 
mininet> h2 ping h6 -c 1 
mininet> h2 killall tcpdump 
 
# NOTE: your first ping may see a few dropped packets because of the initial broadcast-and-reply storm; that's normal. if it's causing a problem in the switches' learning process, tcpdump is your friend. 
 
====> also, while at it, run "sudo ifconfig" on another terminal, then you see all switches/hosts eth ports, like below. 
 
eth0      Link encap:Ethernet  HWaddr 00:0c:29:79:89:6c 
          inet addr:192.168.10.167  Bcast:192.168.10.255  Mask:255.255.255.0 
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 
          RX packets:6562 errors:0 dropped:0 overruns:0 frame:0 
          TX packets:6692 errors:0 dropped:0 overruns:0 carrier:0 
          collisions:0 txqueuelen:1000 
          RX bytes:609203 (609.2 KB)  TX bytes:1326062 (1.3 MB) 
 
lo        Link encap:Local Loopback 
          inet addr:127.0.0.1  Mask:255.0.0.0 
          UP LOOPBACK RUNNING  MTU:65536  Metric:1 
          RX packets:27866 errors:0 dropped:0 overruns:0 frame:0 
          TX packets:27866 errors:0 dropped:0 overruns:0 carrier:0 
          collisions:0 txqueuelen:0 
          RX bytes:2985445 (2.9 MB)  TX bytes:2985445 (2.9 MB) 
 
s1-eth0   Link encap:Ethernet  HWaddr 2a:c9:28:76:2b:94 
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 
          RX packets:7 errors:0 dropped:0 overruns:0 frame:0 
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0 
          collisions:0 txqueuelen:1000 
          RX bytes:558 (558.0 B)  TX bytes:446 (446.0 B) 
 
s1-eth1   Link encap:Ethernet  HWaddr 7e:65:d3:2d:76:16 
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 
          RX packets:7 errors:0 dropped:0 overruns:0 frame:0 
          TX packets:1 errors:0 dropped:0 overruns:0 carrier:0 
          collisions:0 txqueuelen:1000 
          RX bytes:405 (405.0 B)  TX bytes:41 (41.0 B) 
 
s2-eth0   Link encap:Ethernet  HWaddr c6:06:b5:b8:ad:63 
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 
          RX packets:11 errors:0 dropped:0 overruns:0 frame:0 
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0 
          collisions:0 txqueuelen:1000 
          RX bytes:818 (818.0 B)  TX bytes:223 (223.0 B) 
 
s2-eth1   Link encap:Ethernet  HWaddr 06:ac:4d:09:77:93 
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0 
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0 
          collisions:0 txqueuelen:1000 
          RX bytes:41 (41.0 B)  TX bytes:405 (405.0 B) 
... 
 
====> then you can run tcpdump specifically on individual ports(or we call it interface). 
 
$ sudo tcpdump -i s1-eth1 > log.txt & 
 
# NOTE: you may see packet loss on your first ping(all) cos of VM/Pyretic performance, i.e. while it's creating new rules it may not process all packets, etc. 
 
 
 
 
######################################################## 
######   (7) Rate Limiting and Traffic Shaping    ###### 
######################################################## 
 
- traffic shaping 
- network measurement 
 
## 
## traffic classification and shaping 
## 
- ways to classify traffic 
- traffic shaping approaches - (1)leaky bucket, (2) RT traffic shaper, (3) token bucket, (4) composite shaper(combines (1)&(3)) 
 
(motivation): 
- NW resource control 
- ensure flows do not exceed a threshold rate 
 
 
## 
##  traffic source classification 
## 
- data : bursty, periodic, regular 
- audio: continuous, periodic 
- video: continuous, bursty, periodic 
 
===> we generally classify into two: 
(1) CBR (= constant bit rate) source   e.g. audio 
we shape CBR based on "peak" rate 
 
(2) VBR (= variable bit rate) source   e.g. video, data 
we shape VBR based on both "avg" and "peak" rates 
 
 
## 
##  "leaky bucket" traffic shaper 
## 
 
Beta : size of bucket 
Rho  : drain rate        # R for Regulator 
 
 ------------- 
    Beta     | --> Rho 
 ------------- 
 
thus, 
=====>  Beta defines the max size of burst. 
=====>  Rho defines the max average rate 
 
e.g. audio 
Beta : 16KB 
pkt  :  1KB 
Rho  : 8pps 
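a minimal leaky-bucket sketch (my own) using the Beta/Rho/pkt numbers of the audio example above: arrivals that find a full bucket are dropped, and the queue drains at Rho packets per tick.

from collections import deque

class LeakyBucket:
    def __init__(self, beta_pkts=16, rho_pps=8):
        self.beta = beta_pkts            # bucket size (max queued packets)
        self.rho = rho_pps               # drain rate (packets per tick/second)
        self.queue = deque()

    def arrive(self, pkt):
        if len(self.queue) >= self.beta:
            return False                 # bucket full: packet dropped
        self.queue.append(pkt)
        return True

    def tick(self):
        # drain at most rho packets per tick: output is smoothed to <= rho pps
        return [self.queue.popleft() for _ in range(min(self.rho, len(self.queue)))]

lb = LeakyBucket()
for i in range(20):                      # a 20-packet burst arrives at once
    lb.arrive(i)                         # packets 16..19 are dropped (Beta = 16)
print(len(lb.tick()), len(lb.tick()))    # 8 8 : drained at Rho = 8 per tick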
 
 
(ref) : greek letters 
http://www.purplemath.com/modules/grklttrs.htm 
 
 
## 
##  (r,T) traffic shaping 
## 
- time is divided into frames of duration T 
- a flow can inject <= r bits into any frame 
 
e.g. 
 
  ____T_____     ________    _______ 
  | __r__   |    | ___   |   | ___  | 
------------------------------------ 
 
 
- a flow that obeys this rule is called "(r,T) smooth" 
- suitable for fixed-rate flows. if a flow needs to send more than r bits in a frame, it must wait for the next frame. 
- more relaxed than the "leaky bucket": instead of sending one packet per time unit, the flow can send up to r bits per time unit. 
 
(question) what if flow exceeds a threshold rate? 
- assign lower priority to the excess packets (possibly dropped if NW is congested) 
- priority may be assigned at (1) sender, or (2) network 
e.g. 
at sender, application may mark priority on its own packets cos the app knows the best which ones are important. 
at NW, router may mark some packets lower priority == called "policing" 
 
 
## 
##  "token bucket"   aka bursty traffic patterns 
## 
- tokens added at bucket at rate = Rho 
- bucket size = Beta 
- avg traffic rate  = Lavg  (lambda avg) 
- peak traffic rate = Lpeak (lambda peak) 
 
e.g. 
            Rho 
             \/ 
           |    | 
           |----| 
           |Beta| 
           |----| 
Lpeak  --->      ---> 
(Lavg) 
 
## token bucket works as a flow rate regulator. flow can send packets as long as tokens are in the bucket. 
 
consider sending a packet of size "b" < Beta 
- if bucket has more than b tokens, the packet is sent, b tokens removed from bucket 
- if bucket has less than b tokens, the packet must wait till b tokens drip into the bucket 
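a minimal token-bucket sketch (mine, with made-up parameter values): tokens drip in at Rho per second up to a maximum of Beta, and a packet of size b is sent only if b tokens are available, so bursts up to Beta are allowed while the long-term rate stays <= Rho.

class TokenBucket:
    def __init__(self, beta=100_000, rho=10_000):
        self.beta = beta                 # bucket size in tokens (e.g. bytes)
        self.rho = rho                   # token fill rate (tokens per second)
        self.tokens = beta               # start with a full bucket
        self.t = 0.0                     # time of last update (seconds)

    def allow(self, b, now):
        # add the tokens accumulated since the last call, capped at beta
        self.tokens = min(self.beta, self.tokens + (now - self.t) * self.rho)
        self.t = now
        if self.tokens >= b:
            self.tokens -= b             # enough tokens: send the b-byte packet now
            return True
        return False                     # not enough tokens: the packet must wait

tb = TokenBucket()
print(tb.allow(100_000, now=0.0))   # True : a full-Beta burst goes out immediately
print(tb.allow(1_000,  now=0.0))    # False: bucket empty, must wait for tokens
print(tb.allow(1_000,  now=0.1))    # True : 0.1s * 10_000 tokens/s = 1_000 tokens dripped in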
 

#  "token bucket"   VS  "leaky bucket" 

(token bucket) 
- permits bursts, but bounds them 
- in any time interval of length T, the amount sent < Beta + T*Rho 
- long term rate < Rho 
- no discard/priority policy 
--> policing is difficult: the per-interval bound only tells you that over 2T the flow sent < 2*(Beta + T*Rho), 
--> but a conforming flow should actually satisfy the tighter bound Beta + 2T*Rho over 2T ... so it's hard to tell conforming from non-conforming traffic. 
--> also, though the token bucket allows long bursts, if a burst is high-priority traffic it is difficult to police and may interfere with other high-priority traffic. so we need to limit how long a token-bucket sender can monopolize the NW. 
 
(leaky bucket) 
- smooths bursty traffic 
- priority policy (for flows that exceed the smooth rate Rho) 
 
====> it's just literally phrasing what was depicted above 
====> token bucket is more flexible as we have more parameters to configure burst size 
 
 
## 
##  Policing with Token Buckets  ( aka Composite Shaper ) 
## 
- combining "leaky bucket" and "token bucket"  allows for good policing 
- (implementation is complex since each flow now needs two timers/counters) 
 

# quiz 

Beta = 100KB 
Rho  = 10 pps 
pkt size = 1KB 
T = 1sec 
 
max rate ?   ===>  Beta + (Rho * pkt size * T) = 100KB + (10 pps * 1KB * 1 sec) = 110 KB per sec 
 
 
## 
##  Power Boost 
## 
- traffic shaping mechanism (introduced by Comcast in 2006) that allows a subscriber to send at a higher rate for a brief period of time 
- gives spare capacity to users who do not put a sustained load on the NW. 
- if a higher rate limit is still enforced during the boost period, it's called "capped" power boost; otherwise "uncapped" power boost. 
 
(youtube) 
https://www.youtube.com/watch?v=ROLM1CKZuAs 
 
rate 
 | 
 |  ---------                    # Rho_cap   (capped) 
 |  |  Beta | 
 |--        ------------------   # Rho_sus   (sustained) 
 |        Rho * T 
 -------------------------------- time 
 
 
e.g. 
power boost bucket size = Beta 
send rate = r > Rho_sus 
 
how long (d = duration) can the flow send at r ? 
 
Beta = d * (r - Rho_sus) 
   d = Beta / (r - Rho_sus) 
 
 

# quiz 

 
Rho_sus = 10Mbps 
r       = 15Mbps 
Beta    = 8Mbits 

what is d ?    d = 8Mbits / (15Mbps - 10Mbps) = 1.6 sec 
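checking that arithmetic (d = Beta / (r - Rho_sus)):

beta_bits = 8_000_000                    # 8 Mbits of power-boost bucket
r, rho_sus = 15_000_000, 10_000_000      # send rate and sustained rate, in bps
d = beta_bits / (r - rho_sus)
print(d)                                 # 1.6 -> the flow can send at 15 Mbps for 1.6 seconds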
 
 
## 
##   power boost effects on "latency" 
## 
- during power boost, we can observe increased latency, and potentially pkt loss 
 
 
       bucket(buffer) 
       --------- 
r  -->   Beta  | --> Rho_sus 
       --------- 
 
the buffer fills up if the access link's sustained drain rate (Rho_sus) < r 
packets being buffered up = latency 
if it gets worse, pkt drop 
 
====> thus, a sender who wants to keep latency low may run its own traffic shaper in front of the power-boost-enabled link, shaping its rate to r <= Rho_sus 
====> this prevents buffering (obviously bursts are more restricted in that case) 
 

#  buffer bloat  (above visual) 

 
delay = datasize_in_buffer / Rho_sus 
 
===>  the bigger the buffer, the bigger the delay 
===>  TCP keeps increasing its thruput until it fills the buffer, and that queueing is what causes the delay 
 
(solutions) 
- smaller buffer 
- traffic shaping  (the point of this lesson!)   in the real world, shaping can be done by many OpenWrt-capable routers 
 
(ref) 
https://openwrt.org/ 
 
 
## 
##   Network Measurement 
## 
- how to "see" what traffic is being sent on the network? 
(1) passive measurement: collection of packets, flow stats, app level logs that are already on the network 
(2) active measurement : inject additional traffic to measure various characteristics (e.g. ping, traceroute) 
 
# measurement is important 
- for billing (commonly, customers are billed at the 95th-percentile rate. see also CIR = committed information rate) 
- for security (to detect DoS, "botnets", compromised hosts, etc) 
 
(video) 
https://www.youtube.com/watch?v=EvKoakEmquQ 
 
 
## 
##  How to measure (passively) 
## 
 
(0) SNMP (simple network mgmt protocol) 
-- many NW devices come with a MIB (mgmt info base) that can be polled for info such as per-interface byte/packet counts (i.e. how much is sent from the device). 
-- take the delta at a fixed interval, and you can measure the data rate. 

MIB is ubiquitous: available on almost all devices, but coarse. only interface-level byte/packet counts, no more granular info such as how much traffic is sent by which host/flow, etc. 
 
 
(1) packet monitoring 
(2) flow monitoring 
 

#  (1) packet monitoring 

- can see full pkt content(or at least header).  e.g.  tcpdump, ethereal, wireshark, or some expensive sniffer HW with high speed link (optical link) 
- lots of details ! (timing, header info, etc) 
- but high overhead.... 
 
# if you are ok with less details, and want less overhead, then go with flow monitoring below. 
 

# (2) flow monitoring   (usually done by router) 

-  monitors statistics(such as bytes count) per flow 
- "flow" consists of: 
--- src & dst IP 
--- src & dst port 
--- protocol type 
--- TOS byte   # type of service 
--- interface 
--- next-hop IP           # additional 
--- src & dst AS + prefix # additional 
 
=====> in other words, router can group packets into diff flows based on above info. (all header info) 
=====> in addition to the above, flow monitoring typically groups packets that happen "close" together in time. 
=====> if that flow stops for 15 sec for example, monitoring may declare the flow died. 
 
- less overhead 
- more coarse: no per-packet detail (e.g. packet timing or payload). packets are grouped into flows based on header fields, but the individual packets/payloads themselves are not recorded; if header/payload-level info is required, do "packet" monitoring. 
 
also, flow monitoring might use "sampling" i.e. only sample some packets to produce flow statistics 
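a toy sketch (mine) of what a flow monitor does with the fields above: keep only per-flow counters keyed by the 5-tuple, and export/expire a flow after an idle timeout (15 sec here, as in the example above):

IDLE_TIMEOUT = 15.0                     # seconds of silence before a flow is declared over

flows = {}                              # key -> {"packets": n, "bytes": b, "last_seen": t}

def record_packet(src_ip, dst_ip, src_port, dst_port, proto, size, now):
    key = (src_ip, dst_ip, src_port, dst_port, proto)
    f = flows.setdefault(key, {"packets": 0, "bytes": 0, "last_seen": now})
    f["packets"] += 1
    f["bytes"] += size
    f["last_seen"] = now

def expire_flows(now):
    # export and drop flows that have been idle longer than the timeout
    expired = {k: v for k, v in flows.items() if now - v["last_seen"] > IDLE_TIMEOUT}
    for k in expired:
        del flows[k]
    return expired

record_packet("10.0.0.1", "10.0.0.2", 4321, 80, "TCP", 1500, now=0.0)
record_packet("10.0.0.1", "10.0.0.2", 4321, 80, "TCP", 1500, now=1.0)
print(flows)                            # one flow record: 2 packets, 3000 bytes
print(expire_flows(now=20.0))           # the flow is exported after 15s of idle time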
 
 
 
############################################## 
#####    (8)  Content Distribution      ###### 
############################################## 
 
recall the 3 parts of the course. 
 
- protocols & architecture 
(principles, switch/routing, naming, addressing, forwarding) 
- resource control & content distribution 
(congestion control, streaming, rate limiting) 
- operations & mgmt 
(SDN, traffic engineering, network security) 
 
===> so far, we've talked about 
- congestion control (which works within TCP protocol) 
- traffic shaping (which is a high level NW tool) 
===> now we are onto the final item of the 2nd part of the course. 
 

#  content distrib 

- an internet wide tool that enables websites and NW operators to deliver data quickly/efficiently. 
 

#  the web and caching 

- caching in the web to improve performance (especially for HTTP, an application-layer protocol) 
- HTTP request/response layered over byte stream protocol (almost always TCP) 
 
 
        request 
      -----------> 
client            server (stateless, i.e. does not maintain info about past client requests.) 
      <----------- 
        response 
 
## 
##  HTTP req/resp 
## 
 
lets look at each, one by one. 
 

#  HTTP requests 

 
(1) Request line     : method     -> GET, POST, HEAD 
                     : URL        -> /index.html (relative path) 
                     : version    -> HTTP version number 
(2) Headers          : referer    -> URL that caused the page to be requested. e.g. an HTML page embedding a youtube video within. 
(additional/optional): user agent -> client software used to fetch the page. e.g. chrome, firefox 
 
 GET : get(return) contents associated with URL 
 POST: send data to server 
 HEAD: return only the header info that a GET would return, no content 
 
## NOTE: GET can also be used to send data to the server (e.g. via URL query parameters) 
 
 

#  HTTP request [example] 

 
GET / HTTP/1.1          ##  request line  (method, URL, version) 
Accept: */*             ##  header        # accepts any content type, e.g.  text/html 
Accept-Language: en-us  ##  header 
Accept-Encoding: gzip   ##  header        # accepts pages encoded in gzip compression format 
User-Agent: Mozilla/5.0 ##  header 
Host: www.kenics.net    ##  header        # the host to which the request is made. (needed to distinguish sites when a single web server IP hosts multiple websites) 
Connection: keep-alive  ##  header 
 
 

#  HTTP response 

 
(1) status line      : HTTP version 
                     : response code -> 100s - info 
                                        200s - success   e.g. 200 == OK 
                                        300s - redirect  e.g. 301 == moved permanently 
                                        400s - error     e.g. 404 == not found 
                                        500s - server error 
(2) headers          : location        -> for redirection, etc 
                     : server          -> indicates server software 
                     : allow           -> which HTTP methods are allowed e.g. GET, HEAD, POST, etc 
                     : content-encoding-> how content is encoded. e.g. compressed, etc. 
                     : content-length  -> byte size 
                     : expires         -> how long the content can be cached. 
                     : last-modified   -> indicates the last time the page was modified. 
 

#  HTTP response [example] 

 
HTTP/1.1 200 OK       ## status line: version and response code 
Date: Tue, 25 Oct 2011 12:12:12 GMT 
Server: Apache/2.2.15 Debian 
Set-Cookie: lskdjflksadjflasjfskd       # sets state for the client. e.g. "logged in" etc 
Expires: Mon, 1 Jan 2012 00:00:00 GMT 
Last-Modified: Tue, 25 Oct 2011 00:11:22 GMT 
Cache-Control: no-store, no-cache, etc 
Transfer-Encoding: chunked 
Content-Type: text/html; charset=utf-8 
(blank line, then the response body e.g. the HTML content) 
 
 

#   Early-day HTTP (v.0.9/1.0) 

- one request/response per TCP connection 
--(pros) simple to implement 
--(cons) TCP conn for every request. (3-way handshake, slow start overhead) 
===> also, since the TCP conn is torn down after each request completes, servers end up holding many connections in TIME_WAIT state until the timers expire, i.e. resources stay reserved on the server even after the conns are terminated. 
===> (solution) persistent connections 
 

#  Persistent Connections 

- multiple HTTP req/resp are multiplexed onto a single TCP conn. 
--- delimiters (at the end of each HTTP message) indicate where one req/resp ends, 
--- content-length lets the receiver know the resp size (i.e. the server has to know the transfer size in advance.) 
- also can be combined with "pipelining" 
 

#  pipelining 

- client sends requests as soon as it encounters a referenced object. 
(so all referenced objects can begin to be fetched within one RTT) 
 
====>  persistent connections + pipelining  = default behavior in HTTP 1.1 
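a small example (mine) of a persistent connection using Python's standard http.client; the host and paths are placeholders. note that http.client reuses the TCP connection but does not pipeline: each request waits for the previous response to be read.

import http.client

conn = http.client.HTTPConnection("example.com", 80)   # one TCP connection

for path in ["/", "/index.html"]:
    conn.request("GET", path, headers={"Connection": "keep-alive"})
    resp = conn.getresponse()
    body = resp.read()                 # must drain the body before reusing the connection
    print(path, resp.status, len(body))

conn.close()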
 
 
## 
##   caching (to improve performance) 
## 
- TCP thru-put is inversely proportional to RTT. so if web server is far away (bigger latency, lower thru put), nearer local/nw cache improves performance. 
- clients can cache documents (in local browser, or in network web cache by ISP, or CDN = a special type of web cache) 
 
e.g. 
 
client_A 
client_B  ---  NW_A   ---  NW_B  ---  web content origin 
client_C 
 
===> imagine NW_A's ISP keeps a local cache. then all clients and the ISP benefit: clients get content faster, and the ISP saves bandwidth on its upstream links (lower transit costs). 
===> a cached copy expires based on the "Expires" header. or the cache can ask the origin server whether the content has been modified; if not, the server answers with HTTP code 304 == Not Modified. 
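a sketch (mine, again with http.client and a placeholder host) of that revalidation step: re-request with If-Modified-Since and keep the cached copy on a 304.

import http.client

conn = http.client.HTTPConnection("example.com", 80)   # placeholder origin server

# first fetch: keep the body and remember the Last-Modified validator
conn.request("GET", "/")
resp = conn.getresponse()
cached_body = resp.read()
last_modified = resp.getheader("Last-Modified")

# later: revalidate instead of re-downloading (assumes the server sent Last-Modified)
if last_modified:
    conn.request("GET", "/", headers={"If-Modified-Since": last_modified})
    resp = conn.getresponse()
    resp.read()
    if resp.status == 304:       # Not Modified: the cached copy is still good
        print("serving from cache")
    else:                        # 200 etc: the object changed, refresh the cache
        print("refreshing cache:", resp.status)
conn.close()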
 
- caching can be directed by (1) browser setting, or (2) origin server directs the browser to a cache (done via a special reply to a DNS req) 
 
e.g. 
 
$ dig  www.google.com    # get IP addr for google.com 
$ ping 74.125.235.78     # ping that IP 
PING 74.125.235.78 (74.125.235.78) 56(84) bytes of data. 
64 bytes from 74.125.235.78: icmp_seq=1 ttl=56 time=2.45 ms     ## only 2msec,  very likely origin directed me to a local NW cache 
64 bytes from 74.125.235.78: icmp_seq=2 ttl=56 time=2.33 ms 
^C 
 
 
## 
##  (web) CDN: content distrib networks 
## 
 
- overlay network of web caches 
-- designed to deliver content to client from optimal location 
(optimal == geographically closest, in general, but really whatever gives the least latency) 
 
- geographically distinct/disparate server groups (each group can serve all the content on the CDN), often very extensive. 
-- some CDNs are owned by content providers like google, others by network vendors/ISPs like AT&T. 
-- non-NW CDNs like google typically place servers inside other ASes/ISPs. 
-- the number of cache nodes in a CDN varies. it could be tens of thousands of servers. 
 
 
## 
##  challenge in running a CDN 
## 
Goal: replicate content on many servers. 
 
Challenge: "server selection" - how to choose the appropriate server replica/cache for a client. 
           "content routing" - how to direct clients to a particular server. 
 

#  server selection 

criteria: 
- least loaded server 
- least latency (== typically the de facto method) 
- any "alive" server (for fault tolerance) 
 
 

#  content routing 

(1) routing (e.g. anycast)  # let internet routing choose. simple, but no control by the CDN/ISP. no control = less predictability 
(2) application-based (e.g. HTTP redirect)  # effective control, but the client must first go to the origin server just to get redirected = extra latency 
(3) naming-based (e.g. DNS) # most common. as discussed above, a client looks up a domain name, and the response contains the IP addr of a nearby cache. this gives significant flexibility in directing diff clients to diff server replicas. 
 
===> see how (3) naming-based achieves both speed and control. 
 
e.g.   just do DNS look up of some domain (yahoo.com) from diff hosts in diff locations. you get diff IPs. if you do reverse look up on the IP addr, you will reach the corresponding CDNs. 
 
 

#   CDNs + ISPs   :   symbiotic(inter-dependent) relationship 

 
CDNs peer with ISPs 
- direct peer = no intermediate AS hops = better thruput/low latency for customers 
- more vectors = more reliability 
- burstiness = lower transit costs 
 
 

#  Bit Torrent 

- peer-to-peer content distrib protocol 
-- for large file sharing 
 
(ref) 
http://en.wikipedia.org/wiki/BitTorrent 
 
P2P: imagine a network of clients who all want this one big file. the origin server cannot handle each request separately, because it will soon overload the origin, or congest the network. chop the content and each client owns diff pieces of the content. so a client fetches the rest of the content from peers. 
(challenge in P2P): free-loading/riding = a client that finishes downloading leaves the NW as soon as it gets the file, never helping other peers. 
===> bit-torrent resolves this. 
 

# bit torrent     (video) https://www.youtube.com/watch?v=hrKx_ZNTGW4 

(1) peer creates "torrent" = metadata containing (a) tracker that allows client to find seeder/peers (b) all pieces and the checksum for each piece 
(2) "seeders" create initial copy 
 
seeders = peers in the NW that maintain a complete copy of the file. 
leechers= peers that have only an incomplete copy of the file. 
 
(workflow) 
a new client first contacts the "tracker", which returns a list of seeders/peers, and the client starts downloading some chunks of the file. 
once it has accumulated a certain number of chunks, the client starts swapping chunks with other peers in the NW. ideally their pieces are all distinct, and after enough swapping everybody ends up with a complete copy. 
 

#  Solution to freeriding : "choking" (aka tit-for-tat) 

- temporary refusal to upload to a peer if that peer is not uploading. i.e. you cannot download from that peer. 
-> eliminates freerider problem. forces/ensures cooperation among mutually distrustful parties. 
(for game theory behind this, read about "repeated prisoner's dilemma problem" ) 
 

#  how to ensure distinct chunks for each peer 

- if all peers had the same chunks, then no peer could ever assemble more than that same incomplete copy. 
- "rarest piece first" policy: a client determines which pieces are rarest among its peers and downloads those first. 
--> but at the very start a new client has no chunks to trade, and the single peer holding a rare piece may be slow. so there is a complementary "random first piece" policy (grab any piece, e.g. from a seeder, just to have something to trade), which also keeps one slow peer from holding everyone back. once a client's download is complete, it cancels any still-outstanding requests for pieces it had sent to other peers. 
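a toy sketch (mine) of the rarest-piece-first choice: count how many peers advertise each piece we are missing and request the least-available one.

from collections import Counter

def pick_rarest(my_pieces, peer_bitfields):
    # peer_bitfields: one set of advertised piece indices per peer
    availability = Counter()
    for pieces in peer_bitfields:
        availability.update(pieces)
    missing = [p for p in availability if p not in my_pieces]
    if not missing:
        return None
    return min(missing, key=lambda p: availability[p])   # rarest piece first

peers = [{0, 1, 2}, {0, 1}, {0, 1}]
print(pick_rarest(my_pieces={0}, peer_bitfields=peers))   # 2 : held by only one peer, fetch it first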
 
 
## 
##   DHT (distributed hash tables) 
## 
- structured content overlay 
- a well known DHT : Chord (uses "consistent hashing") 
 

#  Chord: 

- scalable, distributed "lookup service" which maps keys -> values (e.g. DNS, directories) 
- Chord is known for its (1) scalability, (2) provable correctness (3) performance 
 
(motivation) 
- scalable location of data in a large distributed system. 
 
imagine a NW where a publisher of data publishes some mp4 named "foo bar" and the client needs to lookup("foo bar"). 
so, key = name of the data, value = location of the data. 
obviously, we can just use a hash table. but this hash table needs to be distributed in the NW. not just a central place, for scalability. hence, the need for DHT. 
 
====> DHT implemented based on "consistent hashing" mechanism. 
 

#  Consistent Hashing (mechanism) 

 
Main Idea: keys and nodes map to the same ID space. 
 
lets say a hash function hash() assigns IDs 
 
node: hash(IP) 
key : hash(key) 
 
in Chord DHT, key is stored at successor (a node with next highest ID). 
 
e.g. ID space 6-bit = addr 0 to 63 
 
publisher pubs data with "foo bar" key at IP 123.456.106.10 
hash("foo bar") = 36 
hash("123.456.106.10") = 42 
 
SEE the ring metric (video) https://www.youtube.com/watch?v=viaNG1zyx1g 
 
===> load balancing (all nodes get roughly the same number of keys.) 
===> flexibility (when a node joins or leaves, only a few keys need to be re-mapped.) 
both properties are provably (near-)optimal. 
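a toy sketch (my own, squeezing SHA-1 into the 6-bit ID space of the example) of the placement rule: nodes and keys hash into the same ID space, and a key lives on its successor (the first node clockwise at or after the key's ID).

import hashlib

ID_BITS = 6                              # 6-bit ID space: IDs 0..63, as in the example

def chord_id(s):
    # hash names/IPs/keys into the same small ID space
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** ID_BITS)

def successor(key_id, node_ids):
    for n in sorted(node_ids):
        if n >= key_id:
            return n
    return sorted(node_ids)[0]           # wrap around the ring

nodes = {chord_id(ip): ip for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]}
k = chord_id("foo bar")
print(k, "->", nodes[successor(k, nodes.keys())])   # the key is stored at its successor node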
 
 

#  Consistent Hashing (implementation) 

(video) https://www.youtube.com/watch?v=H0hZBsVSt10 
 
option 1: every node knows location of every other node. ==> lookup: O(1), Table: O(n) 
e.g. a node does lookup("foo bar") = 36, then since it knows every other node ID, it goes find the node with the smallest ID bigger than 36. 
 
option 2: every node only knows its successor.  ==> lookup: O(n), Table: O(1) 
 
===> both options trade-off speed VS space 
 
(solution) : "finger table" that takes the best of both options. 
 

#  finger table 

- every node knows M other nodes. 
- which M nodes?  finger i points to the successor of (its own ID + 2^i), i.e. the known nodes are spaced exponentially around the ring. 
 
e.g. node 10 maintains the table for the following M nodes. 
 
lets say there are nodes 10, 32, 43, 60 in 6-bit addr space. 
 
10 + 2^0 = 11     # -| 
10 + 2^1 = 12     #  | 
10 + 2^2 = 14     #  |- all these point to node 32. so we say finger 0 to 4 points to node 32. 
10 + 2^3 = 18     #  | 
10 + 2^4 = 26     # -| 
10 + 2^5 = 42     # --- finger 5 points to node 43 
... 
10 + 2^M = ... 
 
lets say node 10 wants to look up key=42. the closest preceding finger, node 32, is the best place to continue the search, so node 10 asks node 32 for key 42's successor. 
 
===> this lookup is known to be lookup: O(log N)   Table: O(log N) 
 
# when a node joins: (1) update the finger tables of the new node and of the existing nodes, (2) the new node's successor transfers some keys to the new node. 
# when a node leaves: one way to handle this is to let each node keep, in addition to its own finger table, its successor's finger table, so it can take over if the successor leaves. 
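a sketch (mine) that builds the finger table for the node-10 example above and uses it to pick the next hop for a lookup:

NODES = [10, 32, 43, 60]                 # node IDs on a 6-bit (0..63) ring
M = 6

def successor(i):
    for n in sorted(NODES):
        if n >= i % 64:
            return n
    return sorted(NODES)[0]              # wrap around

def finger_table(node):
    # finger i points to the successor of node + 2^i
    return [successor(node + 2 ** i) for i in range(M)]

print(finger_table(10))                  # [32, 32, 32, 32, 32, 43]: fingers 0-4 hit 32, finger 5 hits 43

def closest_preceding_finger(node, key):
    # walk the fingers from the farthest down, picking the last one before the key
    for f in reversed(finger_table(node)):
        if (f - node) % 64 < (key - node) % 64:
            return f
    return node

print(closest_preceding_finger(10, 42))  # 32 : ask node 32 about key 42's successor next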
 
(video) https://www.youtube.com/watch?v=GOOXa2GkPws 
 
 
 
################################################# 
#####     Assignment 5 - buffer bloat       ##### 
################################################# 
 
no real coding involved, but a few sets of experiments using mininet to analyze buffer bloat for TCP/UDP. 
 
end_host ---  router(w/ buffer) --- server 
 
exp 1: measure the download time of a web page (via wget) with and without a competing long-lived constant TCP flow 
 
===> obviously, without any other flow, end_host's wget works as fast as it can. 
===> a long-lived TCP flow (which we create using iperf) will keep running its AIMD congestion-window dynamics: filling the buffer, then cutting its window and relieving the buffer, until it saturates it again. 
===> so if you keep pinging the server from end_host, your ping RTT will pretty much track the AIMD behavior: as the buffer grows, RTT grows; as soon as multiplicative decrease kicks in, ping RTT drops, then gradually increases again. 
 
exp 2: try exp 1 but with a shorter buffer at router (use TCP) 
 
===> essentially the same behavior, but because the buff is smaller, ping RTT max latency will be smaller, which is a good thing. 
===> depending on the congestion-control mechanism of the TCP flow (CUBIC, slow-start, etc) we may see more frequent pkt drops, which may be bad from an overall NW-engineering perspective, but we won't go into that aspect now. 
 
exp 3: instead of one buffer queue at router, create two queues, one for iperf, and one for end_host's wget/ping. (use TCP) 
 
===> a typical QoS, end_host's latency is not affected at all. good. 
 
exp 4: try exp 2 using UDP iperf instead of TCP iperf 
 
===> obviously, the buffer queue will always be completely filled (UDP does not back off on loss), so end_host should expect worse wget/ping latency. 
 
exp 5: try exp 3 using UDP iperf instead of TCP iperf 
 
===> no impact. 
 
for exp4,5, change the "iperf.sh" code as below ( enable either one ) 
# TCP 
# iperf -c 10.0.0.2 -p 5001 -t 3600 -i 1 -w 16m -Z reno > iperf.txt & 
# UDP 
# iperf -c 10.0.0.2 -u -b 1.5m -p 5001 -t 3600 -i 1 > iperf.txt & 
 
 
 
 
####################################################### 
#####    (9) SDN (Software Defined Networks)      ##### 
####################################################### 
 
recall the 3 parts of the course. 
 
- protocols & architecture 
(principles, switch/routing, naming, addressing, forwarding) 
- resource control & content distribution 
(congestion control, streaming, rate limiting) 
- operations & mgmt 
(SDN, traffic engineering, network security) 
 
===>  now we are onto the 3rd part. 
 
## 
## network operations & mgmt 
## 
(1) SDN 
(2) traffic engineering  (load balance) 
(3) network security 
 
## 
## what is NW mgmt? 
## 
 
===> process of "configuring NW" to achieve a variety of tasks. 
- load balance 
- security 
- business relationships 
 
## mistakes lead to 
- oscillation (where routers cannot agree on the routes) 
- loops 
- partitions (NW gets separated into two disconnected segments) 
- black-holes (routers dont know the route, drop packets) 
 
## 
##  why is cfg hard? 
## 
(1) "correctness" is hard to define. 
(2) interactions between (multiple diff routing) protocols -> "unpredictability" 
    each AS policy configured independently -> "possible unwanted behavior" 
(3) operators make mistakes. (cos cfg/policy are difficult/complex) 
    device-level cfg is pre-set by vendors. ====>  SDN changes this by a logically centralized controller. 
 
## 
##  what operators need (and what SDN provides) 
## 
(1) NW-wide views (both topology and traffic) 
(2) NW-level objectives (load-balance, security, etc) 
(3) direct-control (can directly control data plane by a logically centralized controller) (no need to cfg each device one by one or indirectly trying to influence its behavior etc) 
 
##  Routers should 
- forward packets 
- collect measurements 
- compute routes (========>   SDN takes over this responsibility) 
 
===> in essence, SDN == "remove routing from routers" (cos a logically centralized controller computes the routes) 
 
## 
##  SDN 
## 
- definition 
- advantages 
- overview (history, infra, applications) 
 
 

# definition 

 
today's NW: two functions (both owned by routers) 
(1) Data Plane : fwd packets 
(2) Control Plane: computes routing tables 
 
===> SDN takes over (2) by a logically centralized controller (enables NW-wide control) 
 
today: each router is a closed, vertically integrated device with a proprietary API. slow innovation 
SDN  : open API (that can be accessed/controlled by a remote centralized controller), fast innovation 
 
 

#  SDN history 

pre-2004: distributed configuration 
==> RCP(routing control platform) that does centralized control for BGP. (yes, BGP only) 
 
2005: "4D" generalized RCP into three planes 
- Decision Plane : decides the fwd state 
- Data Plane     : fwd packets based on the decision 
- Dissemination/Discovery Plane : provides info for decision plane 
 
2008: OpenFlow (incorporates ideas from RCP/4D) 
===> vendors started making cheap switches from open chipsets that can be controlled from remote SW 
 
===> enabled us to decouple Data Plane and Control Plane in commodity switching HW 
 
 

#  SDN advantages 

- separation of control plane (from data plane) enables 
(1) coordination 
(2) evolve 
(3) reasoning 
about the NW behavior. 
==============> enables us to apply CS techniques to old NW problems. 
 
 

#  infra of SDN 

- Control Plane : Software program = controller( written in python, C, etc) 
- Data Plane    : programmable HW (controlled by the controller via the "openflow" protocol, which is essentially a set of cmds to control the switches) 
 
 

#  SDN applications 

- data centers 
- backbone NWs 
- enterprise NWs 
- internet exchange points (IXPs) 
- home NWs 
 

#  quiz  (which tasks are done by Control or Data Planes?) 

 
(CP) computing a fwd path that satisfies a high level policy 
(CP) computing a shortest path routing tree 
(DP) rate limiting traffic 
(DP) load balancing traffic based on hash of src IP addr 
(CP) authenticating a user's device based on MAC addr 
 
===> recall the job of the CP is to compute states that end up in the DP. 
===> question is can that be done by a high level remote controller or done at the fwd time by the local device. 
 
(video) 
https://www.youtube.com/watch?v=U9TZZPvI-Cs 
 
 

#  control plane  VS  data plane 

 
control plane : logic that controls fwd behavior (computes paths)  = SW 
e.g. 
routing protocols, cfg for NW middle-boxes. 
 
data plane : fwd traffic according to control plane logic (action) = HW 
e.g. 
fwding, switching 
 
====> why the separation? 
(1) independent evolution 
--- SW and HW can evolve independently. 
(2) control from high-level program 
--- NW operator can debug/check NW behavior more easily. 
 

#  opportunities 

(1) data centers: VM migration 
(2) Routing: more control over decision logic 
(3) enterprise NW: security/access control 
(4) research : coexistence with production (controller lets you affect only research packets) 
 

#  example: SDN in data center 

- DC : consists of many racks of servers 
- 20k servers per cluster (each server hosting 20 virtual servers = 400k virtual servers per cluster) 
- problem : provisioning/migration in response to varying traffic load is hard 
- SDN solves this problem by programming switch state by a centralized DB 
- all L2 addressing (entire DC can be expressed in an L2 topology) (no L3 addr change needed. just switch fwd state/table updates !) 
====> killer apps of SDN 
 
 

#  example: SDN in backbone security 

- goal: filter malicious(attacking) traffic 
- challenges: three 
(1) scalability : hundreds of thousands of switches (controller's control needs to extend over) 
(2) consistency : ensuring diff replicas(of controllers) see the same view 
(3) security/robustness: failure or compromise? 
 
(quiz) https://www.youtube.com/watch?v=yMV29hKp8Pw 
 
 
## 
##  different kinds of SDN controllers 
## 
- NOX 
- Ryu 
- floodlight 
- pyretic 
- frenetic 
- procera 
- route flow 
- trema 
- more 
 
===> here we will look at Nox/Ryu/Floodlight (then look at pyretic/frenetic in the next lecture on SDN programming) 
 

#  NOX     (www.noxrepo.org) 

- first gen OpenFlow controller (open-source, stable, widely used) 
- two flavors 
(classic) : C++/python (deprecated) 
(new NOX) : C++  (clean, fast) 
- supports OpenFlow 1.1 
 
NOX architecture 
(components) : (1) switches, (2) NW-attached servers 
(abstraction): switch control (from controller via openflow) 
(control)    : flow-level (flow = {srcIP, dstIP, srcPort, dstPort, ...}) 
 
NOX operation 
(flow) : <header(10-tuple), counter(counts pkt for this flow), actions(to be performed on the packets of this flow)> 
(action): fwd, drop, send to controller 
 
NOX programmatic interface (event-based model) 
- the controller acts on events (e.g. switch join/leave, pkt rcvd/in, stats) and controls switches (via the OpenFlow protocol) 
- controller maintains NW view (e.g.topology) 
===> so programmers write 'event handler' 
===> performance is good but requires low-level openflow cmds/knowledge (good for DC usage) 
===> POX(python NOX) is easier to write but not as good performance  (good for prototype) 
 

# Ryu 

- open source python controller 
- OF 1.0, 1.2, 1.3, Nicira extensions (support for later OF versions is an advantage) 
- open stack (advantage) 
- not as good performance (cos of python) 
 

# Floodlight 

- java 
- OF 1.0 
- good documentation 
- REST API 
- good performance 
- hard to learn 
 
===> all of the above are still hard in the sense that you have to know openflow cmds, modify flow tables, set up flow-specific actions, etc. 
===> pyretic does even higher level programming wrapping of these requirements. 
 

#  customize control with Pox 

- hub : fwds an input pkt to every other output port 
- switch: learns a fwd table (mac-addr based) 
===> lets implement some customization(of control logic) using POX 
(video) https://www.youtube.com/watch?v=vKZWy6uOnsA 
 

#  summary 

- flow is matched with the OF 10-tuple header 
(1) ingress port 
(2) mac src 
(3) mac dst 
(4) mac type 
(5) vlan id 
(6) ip src 
(7) ip dst 
(8) ip proto 
(9) tcp/udp src port 
(10) tcp/udp dst port 
 
--> so modifying fwd behavior is easy 
 
Switching: just match dst mac addr, and specify the corresponding action(out of some output port) 
"Flow" Switching: match all 10 tuples, then decide an action for some output port 
Firewall : match only the src mac addr (wildcard the rest) and make a fwd/drop decision; that's a firewall 
e.g. 
constructing a firewall can be as simple as building a hash table where key=(switch/src mac addr), value=(true/false)  true=drop, false=fwd. 
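a tiny sketch (mine, not the assignment code) of that hash-table firewall idea:

# firewall decision table: (switch dpid, src MAC) -> True means drop
firewall = {
    (1, "00:00:00:00:00:01"): True,
}

def allowed(dpid, src_mac):
    return not firewall.get((dpid, src_mac), False)   # default: forward

print(allowed(1, "00:00:00:00:00:01"))   # False -> drop
print(allowed(1, "00:00:00:00:00:02"))   # True  -> fall through to the normal switching logic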
 
"Caching" for performance 
(1) pkt is fwd'ed to controller if no flow table entry at switch (i.e. first time a particular flow is seen) 
(2) controller decides an action, installs in switch's flow table 
(3) decision is cached in flow table until expiry 
 
# final recap on SDN 
- customizing control is easy (turning switch into firewall takes <40 lines of python code) 
- performance benefits of caching rules/decisions 
 
 
 
######################################## 
####   (9.1)  Programming SDNs     ##### 
######################################## 
 
SDN: updates switch flow table entries from controller (via openflow) 
 
Consistency problems 
(1) pkt level : updates on multiple switches may disrupt pkts along an end-to-end path. (possible loop, etc) 
(2) flow level: updates on a path while flow is flowing; a flow may experience diff states/actions. 
 

#  SDN programming : 3 steps 

(1) read/monitor NW states/events(failure, topology changes, security events) 
(2) compute policy based on (1). this is decision plane. (fwd rules) 
(3) write policy (onto OF switches) 
 
===> again, the consistency problems happen at: 
(1) if the controller reads switch states at different times, it does not get an accurate view of the NW. 
(3) writing onto switches at different times can disrupt pkts in flight. 
===> after all, OF rule is a simple match-action predicate. 
===> [downside] is it's hard to express complex policies 
 

#  reading state with multiple rules 

 
3 problems for consistency 
 
(1) 
example: match web server traffic except src 1.2.3.4 
solution: predicates (src_ip != 1.2.3.4) && (src_port == 80) 
 
===> then the runtime system should translate into low level OF rules (and install in the right order). 
 
(2) 
problem: limited # of rules --- cannot install all possible patterns of rules 
solution: have the runtime system dynamically "unfold" rules as traffic arrives 
          (programmers specifies "GroupBy(src_ip)") 
          (then the runtime system dynamically adds rules to the switches as traffic arrives) 
          (thus guaranteeing only rules in the switch that corresponds to active traffic) 
 
(3) 
problem: extra unexpected events 
 
           [application] 
            [controller] 
         /     |     |   \ 
flow-->sw0 -- sw1---sw2--sw3 
 
====> assume the first pkt of a new flow arrives at sw0; it is sent to the controller, which in turn decides the rules and disseminates them to sw0-sw3. 
====> but while the controller is deciding/disseminating, sw0 might keep sending more pkts from that new flow up to the controller. 
====> this is redundant and unnecessary. 
====> solution: the programmer specifies "Limit(1)" so sw0 only sends one pkt. (i.e. the runtime system hides extra events) = suppression 
 
 
in recap, [Consistency] for reading states 
(1) predicates 
(2) unfolding 
(3) suppression 
 
 

#   writing states(policy) : avoiding disruption 

controller writes states for 
- maintenance 
- unexpected failure 
- traffic engineering 
 
important to achieve 
- no fwd loop 
- no blackhole (where sw doesnt know where to fwd) 
- no security violation (traffic going to where it's not supposed to) 
 
example: traffic engineering 
 
 
        /-(1)-sw1-(1)-\ 
src --sw0     (1)     sw3--- dst 
        \-(1)-sw2-(3)-/ 
 
===> suppose each inter-sw link has a link weight. the shortest path is src-sw0-sw1-sw3-dst 
===> suppose we update the link weight sw1-sw3 to (5) then the shortest path from sw1 to dst becomes sw1-sw2-sw3-dst 
===> but suppose the new link weight gets installed on sw1 first: sw1 now directs traffic to sw2, but if sw2 has not yet received the update from the controller, it sends the traffic back to sw1. thus a fwd loop. 
 
solution: two-phase commit 
==> keep track of pkt's applied rule (old rule or new rule?) and after confirming all switches have new rule, then really start applying the new rule. 
==> no need to literally wait for every single switch in the NW to have the new rule, in fact we can shift to the new rule after only the affected portion of the topology is ready. (saves space for sw to retain old/new rules, also saves rule dissemination time) 
 
(quiz) https://www.youtube.com/watch?v=aXG9Q88VeiU 
 
 
## 
##  application of SDN :  NW virtualization 
## 
 
- definition 
- implementation 
- examples (e.g. mininet) 
 
 

# definition of NW virtualization 

- abstraction of physical NW 
-- multiple logical NW on shared physical substrate  (this is cool) 
 
(see the diagram) 
https://www.youtube.com/watch?v=0SRQWPeRiqQ#t=27 
 
for virtualization, 
- a single physical node might act as multiple virtual nodes (= VMs) 
- multiple physical links might act as a single virtual link (= tunneling/encapsulation) 
 
===> the idea is similar to VMs; here we say virtual NW. in both cases some hypervisor arbitrates each virtual machine's/NW's access to the underlying physical host/NW resources. 
 

#  why virtual NW? 

- "ossification" of internet architecture (i.e. internet arch is so pervasive and hard to make fundamental changes) 
- overlay nw was studied in 2000's but not effective in overcoming this dilemma. 
- virtualization enables evolution by letting multiple architectures exist in parallel. 
 
in practice, NW virt really took off in the DC business, e.g. Amazon EC2 
also, big NW service providers like yahoo, google use this NW virt to let multiple services share its underlying physical NW resources. 
 

#  promise of NW virt 

(1) rapid innovation (moving functionality into software lets programmers/engineers evolve the NW at software speed rather than on HW dev cycles) 
(2) new forms of NW control 
(3) (potentially) simplifying programming 
 

# SDN  vs  NW virtu 

SDN : separate Data & Control Planes 
NWV : separate logical & physical layers 
 
so SDN can be used to simplify NWV, but it does not abstract the details of the underlying PHY NW. 
 

#  design goals for NW virtualization 

- flexible (able to support diff topo, routing/fwd policies, independent cfg for each, vendor independent) 
- manageable 
- scalable (maximizing the # of coexisting virt NWs) 
- secure (isolation of logical NWs from one another) 
- programmable 
- heterogeneous (able to support diff arch/tech) 
 

#  implementation of virtualized NWs 

nodes : VMs                     (e.g. VMware, Xen, linux Vservers) 
edges : tunnels(encapsulation)  (e.g. openVswitch) 
 
 

#  NW virtualization in Mininet 

within the VM, mininet creates virtual hosts (each running an independent bash proc) using OS-level virtualization. 
each node has its own view of the NW stack, and a shared filesystem. 
root namespace system controls the connections between nodes/switches. (assigns ether IF ports) 
 
 

#  programming SDNs : why & how? 

problem: programming OpenFlow is not easy. 
        - low level abstraction 
        - controller only sees events that switches dont know how to handle 
        - race condition if switch rules not installed properly (consistency issue) 
 
solution: "northbound" API 
 
         application  (e.g. path computation, loop avoidance, etc) 
             || 
             ||(northbound) == no definitive standard like southbound. but many examples (e.g. frenetic/pyretic) 
             || 
          controller 
             || 
             ||(southbound)  == OF 
             || 
          switches 
 
 

#  Frenetic: SQL-like query language 

e.g. 
select(bytes) 
where(incoming IF:2 & src_port:80) 
groupBy(dstMAC) 
every(60) 
 
http://frenetic-lang.org/ 
 

#  overlapping NW policies 

problem: modules affect same traffic 
 
===> suppose you write modules for (1)monitor (2)routing (3)firewall (4)load-balance 
===> then controller needs to translate into a single set of OF rules onto switches. 
===> need a proper way to "compose/combine" the modules. 
 
policy composition : two ways 
(1) parallel exec (e.g. counting pkts + fwd) 
(2) sequential exec (e.g. firewall, then switch) 
 
# sequential example : load-balance 
first rewrite the dst addr (half of the traffic to srvA, half to srvB, selectable with predicates), then route to the rewritten dst. 
 
# summary 
- Northbound API: higher-level abstraction from OF rules 
- composition : parallel and sequential 
 
 
## 
##  Pyretic : SDN language of runtime 
## 
language : express policies 
runtime: compile policies to OF rules 
 
key abstraction: "located" pkts  (we apply rules based on the pkt and its location in NW. location means a particular sw, or port, etc) 
 

# pyretic features 

- NW policy as "function" (input is pkts) mapping pkts to other pkts <====> in OF, policy = bit patterns 
- boolean predicates (logical operators &|! that resolve into true/false) which OF does not offer 
- virtual pkt header fields (tells pkt location and lets programmer tag pkts to apply specific policies at specific portion of the program) 
- policy composition (parallel and sequential) 
 

#  NW policy func examples 

- identity 
- none 
- mod() 
- fwd() 
- match() 
- flood() 
 
in pyretic, pkt is just a dictionary of key-value e.g. dst_ip = 10.0.0.3 
 

#  policy composition in pyretic 

sequential:  e.g.    match(dstIP=2.2.2.8)>>fwd(1)           #  ">>" means sequential,  fwd(1) means output to port 1 
parallel  :  e.g.    match(dstIP=2.2.2.8)>>fwd(1) + match(dstIP=2.2.2.9)>>fwd(2)       #  "+" means parallel 
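
to make the composition semantics concrete, here is a toy plain-python model (NOT the actual pyretic API; field names and port numbers are made up): a policy is a function from a located pkt (a dict) to a list of output pkts, ">>" corresponds to feeding the outputs of one policy into the next, "+" to unioning the outputs.

# toy model of policy-as-function and composition (not pyretic itself)
def match(**fields):
    return lambda pkt: [pkt] if all(pkt.get(k) == v for k, v in fields.items()) else []

def fwd(port):
    return lambda pkt: [dict(pkt, outport=port)]

def seq(p1, p2):      # models ">>" : run p2 on every pkt that p1 produces
    return lambda pkt: [out for mid in p1(pkt) for out in p2(mid)]

def par(p1, p2):      # models "+"  : union of what p1 and p2 produce
    return lambda pkt: p1(pkt) + p2(pkt)

policy = par(seq(match(dstip="2.2.2.8"), fwd(1)),
             seq(match(dstip="2.2.2.9"), fwd(2)))

print(policy({"dstip": "2.2.2.8", "srcip": "1.1.1.1"}))   # -> forwarded out port 1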
 

#  traffic monitoring in pyretic 

e.g.   query on pkt streams 
self.query = packets(1,['srcmac','switch'])   # limit=1: capture only the first pkt for each (srcmac, switch) group 
self.query.register_callback(learn_new_mac)   # register a callback invoked for each pkt captured by the above query 
 
===> learning switch! 
 

#  dynamic policy 

- timeseries of static policies :  current value = self.policy 
(1) set a default policy 
(2) register callback that updates policy 
 
in pyretic, the first pkt seen from each new src MAC addr at each switch is delivered to self.query, and the policy is rebuilt with an added predicate every time a new MAC -> output_port mapping is learnt. (see the toy sketch below) 
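
a toy sketch of that idea in plain python (again NOT pyretic code; the pkt fields are made up): the query is modeled as a callback on first-seen src MACs, and the policy is recomputed from the learnt MAC -> port table.

# toy dynamic-policy learning switch (not pyretic itself)
table = {}                                     # learnt mac -> output port

def learn_new_mac(pkt):
    """callback for the first pkt seen from each new src MAC (the 'query')"""
    table[pkt["srcmac"]] = pkt["inport"]

def policy(pkt):
    """current forwarding decision, derived from the learnt table"""
    return table.get(pkt["dstmac"], "flood")   # known dst -> its port, unknown -> flood

for pkt in [{"srcmac": "aa", "dstmac": "bb", "inport": 1},
            {"srcmac": "bb", "dstmac": "aa", "inport": 2}]:
    if pkt["srcmac"] not in table:
        learn_new_mac(pkt)                     # emulates the query callback
    print(pkt["dstmac"], "->", policy(pkt))    # first pkt: flood, second pkt: port 1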
 
(quiz) https://www.youtube.com/watch?v=T73UoU4oO7s 
(obviously the third one is the answer) 
 
 
########################################### 
###   assignment 7 - SDN and firewalls  ### 
########################################### 
 
enable X11 forwarding in putty.   [connection]-[ssh] 
 
X window is a tool that lets you display what you run in a remote server. 
 
for MS windows PC, "xming" is a good example. it links with putty well. 
for Mac OS, try XQuartz 
http://support.apple.com/kb/HT5293 
http://xquartz.macosforge.org/landing/ 
 
 
 
 
to block a certain mac_addr pair in both directions, we can do something like below. 
basically, we add two table entries: one entry for src->dst, and another entry for dst->src of the pair we want to block. 
 
POX----------------------------------------------- 
(ref) 
https://openflow.stanford.edu/display/ONL/POX+Wiki#POXWiki-ofp_matchMethods 
https://openflow.stanford.edu/display/ONL/POX+Wiki#POXWiki-OpenFlowActions 
http://archive.openflow.org/wk/index.php/OpenFlow_Tutorial 
https://openflow.stanford.edu/display/ONL/POX+Wiki#POXWiki-ofp_flow_mod-Flowtablemodification 
https://openflow.stanford.edu/display/ONL/POX+Wiki#POXWiki-ConnectionObjects 
http://archive.openflow.org/wk/index.php/OpenFlow_Tutorial#ofp_match_class 
 
 
    def _handle_ConnectionUp (self, event): 
        policies = self.read_policies(policyFile) 
        for policy in policies.itervalues():    # assume policy contains dl_src and dl_dst which are the pair we wanna block 
            # TODO: implement the code to add a rule to block the flow 
            # between the source and destination specified in each policy 
 
            # Note: The policy data structure has two fields which you can 
            # access to turn the policy into a rule. policy.dl_src will 
            # give you the source mac address and policy.dl_dst will give 
            # you the destination mac address 
 
            # Note: Set the priority for your rule to 20 so that it 
            # doesn't conflict with the learning bridge setup 
 
            msg = of.ofp_flow_mod() 
            msg.priority = 20 
            msg.match.dl_src = policy.dl_src 
            msg.match.dl_dst = policy.dl_dst 
            msg.actions.append(of.ofp_action_output(port = of.OFPP_NONE)) 
 
            msg2 = of.ofp_flow_mod() 
            msg2.priority = 20 
            msg2.match.dl_src = policy.dl_dst 
            msg2.match.dl_dst = policy.dl_src 
            msg2.actions.append(of.ofp_action_output(port = of.OFPP_NONE)) 
 
            event.connection.send(msg) 
            event.connection.send(msg2) 
 
# NOTE: as above, i set the output action to OFPP_NONE, which sends the matching flow (matched on dl_src/dl_dst, where dl means data link layer, i.e. L2) nowhere, i.e. drops it. per the spec, not defining msg.actions at all has the same effect. 
 
 
pyretic------------------------------------------------------------ 
 
as you can see, pyretic is more conveniently abstracted than pox 
 
(ref) 
https://github.com/frenetic-lang/pyretic/wiki/How-to-use-match 
https://github.com/frenetic-lang/pyretic/wiki/Language-Basics 
https://github.com/frenetic-lang/pyretic/wiki/Language-Basics#filter-policies 
 
    # TODO and add traffic that isn't allowed 
    # Note: this uses the same policy named tuple from the POX 
    # firewall code. Please refer there for further info. HINT - You could use '|' in place of  '+' as well. 
    for policy in policies.itervalues(): 
        not_allowed = not_allowed | ( match(dstmac=policy.mac_0,srcmac=policy.mac_1) ) | ( match(dstmac=policy.mac_1,srcmac=policy.mac_0) ) 
 
    # TODO express allowed traffic in terms of not_allowed - hint use '~' 
    allowed = ~not_allowed 
 
# NOTE:  "|" "&" "~" are logical operators, but "+" is not; it is just syntax for parallel composition. 
# "+" can often be used in place of "|", but since it is not a logical operator, we cannot negate the combined result (hence build not_allowed with "|" and negate it with "~"). 
 
 
 
########################################## 
#####   (10) Traffic Engineering    ###### 
########################################## 
 
recall the 3 parts of the course. 
 
- protocols & architecture 
(principles, swtich/routing, naming, addressing, forwarding) 
- resource control & content distribution 
(congestion control, streaming, rate limiting) 
- operations & mgmt 
(SDN, traffic engineering, network security) 
 
===>  now we are onto the "traffic engineering" part. 
 
## 
##  Traffic Engineering 
## 
- the process of reconfiguring the network in response to changing traffic loads, in order to achieve some operational goal (e.g. maintaining certain peering ratios, relieving load on a certain link, balancing load overall, etc). peering simply refers to two routers being connected. 
 
-> SDN is useful in traffic engineering (especially for (1) datacenter network, and (2) transit networks) 
 

#  IP NWs must be managed 

- self-management protocols: TCP(congestion control), Routing protocols (dynamic topology update) 
 
(problem) : does the NW run efficiently ??  (some path high load, while others low load, or inefficient routing, etc) 
(question): how should routing adapt to traffic? 
- attempt to avoid congested links 
- satisfy certain application requirements, such as "delay" 
 

# overview 

- tuning routing protocol cfg 
- intra/inter domain(AS) traffic engineering 
- multipath routing 
 
 

#  3 steps in TE : (1)measure, (2)model, (3)control 

 
(1)measure - figure out the current traffic load and topology 
(2)model   - analyze how diff cfg changes would affect the traffic/topology 
(3)control - reconfigure/readjust 
 
===> hard to make it both practical and accurate in practice. 
 
 
## 
##   intra-domain(ISP, campus, datacenter, etc) traffic engineering : tuning link weights 
## 
 
lets assume a single AS with static link weights between routers. 
 
- OSPF, ISIS protocols will dynamically maintain the single shortest path topology. 
- routers flood information to each other to learn/update topology. 
- operator can affect the topology by simply configuring link weights. 
 
# (how does operator set link weights?) 
(1) inversely proportional to capacity  (weak links should have a big weight) 
(2) proportional to propagation delay 
(3) some NW-wide optimization (based on certain knowledge about the characteristics of the flows, etc) 
 
===> comes down to a graph optimization problem 
 
input: Graph G(R,L)  # R - routers(aka nodes),  L - links(uni-directional, aka edges) 
       each link has a capacity, denoted C(l)   (capacity is distinct from the weight the operator sets) 
       traffic matrix Mij : represents traffic load from router i to router j 
 
output: a set of link weights (which satisfies certain goal, such as minimizing the maximum congested link, or evenly splitting loads across links, etc) 
        # here we call the goal "objective function" 
        # we denote W(l) as the weight of unidirectional link l within G. 
 

# objective function 

 
- we can represent the cost of congestion as a function that increases (roughly quadratically/convexly) as the traffic load (link utilization) increases 
- but the optimization problem is easier to solve if we use a piece-wise linear cost function 
e.g. 
utilization : U(l)/C(l)      # U(l) = amount of traffic on the link, C(l) = capacity of the link 
objective   : min Σ(l){ f(U(l)/C(l)) } 
 
====> solving this optimization is NP-complete, i.e. no efficient algo to find the optimal set of link weights. 
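
even though finding optimal weights is intractable, evaluating the objective for a given set of per-link loads is easy. a minimal sketch (the piecewise-linear breakpoints/slopes below are assumptions, chosen only so the penalty grows sharply as utilization approaches 1):

def f(u):
    # piecewise-linear, convex, increasing cost of link utilization u = U(l)/C(l)
    pieces = [(0.0, 1), (1/3.0, 3), (2/3.0, 10), (0.9, 70), (1.0, 500), (1.1, 5000)]   # (start, slope)
    cost = 0.0
    for i, (start, slope) in enumerate(pieces):
        end = pieces[i + 1][0] if i + 1 < len(pieces) else float("inf")
        if u > start:
            cost += slope * (min(u, end) - start)
    return cost

def objective(loads, capacities):
    # sum over links l of f(U(l)/C(l))
    return sum(f(U / C) for U, C in zip(loads, capacities))

# two parallel links: a balanced split scores much lower than an unbalanced one
print(objective([50.0, 50.0], [100.0, 100.0]), objective([95.0, 5.0], [100.0, 100.0]))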
 
(other operational reality problems) 
- minimizing the number(and frequency) of changes to the NW 
- resistant to failure 
- robust to measurement noise 
 
===> often, changing the weight of a link or two is enough to optimize. 
 
 
## 
##  inter-domain TE 
## 
- reconfiguring BGP policy/cfg 
- edge routers in each AS peer with edge routers of other ASs 
- alleviating congestion on edge links 
- using new/upgraded edge links 
- changing end-to-end path 
 
-- changing the cfg of edge routers affects how routers inside the AS route traffic toward/away from certain edge routers. 
 
 

# 3 goals of inter-domain TE 

(1) predictability 
(2) limit influence to/of neighbors  (you change BGP policy for one AS, and dont wanna influence neighbor ASs too much) 
(3) reduce overhead of routing changes (as few IP prefixes as possible) 
 

# factors that confound predictability : globally visible changes 

e.g. 
 AS1 - AS2 - AS3 
  |    |      | 
 AS4 - AS5 - AS6 
 
suppose AS1 sending lots of data to AS3 via AS2. 
and to reduce load on the AS2-AS3 link, let's say we change BGP cfg so that the path from AS2 to AS3 becomes AS2 -> 5 -> 6 -> 3 
but then, in response to that globally visible change, AS1 may decide to change its own route to AS1 -> 4 -> 5 -> 6 -> 3 
so AS1's traffic no longer enters AS2 at all: the traffic matrix itself has changed, 
defeating the purpose (and the predicted effect) of the initial BGP cfg change. 
 
===> unpredictable. 
(solution) : avoid globally visible changes 
 
(2) limit influence to/of neighbors 
===> we may make a path look longer with AS-path prepending, while keeping the AS-path roughly the same length as the other paths in the group of common paths; that way we avoid globally visible changes. this gives us additional flexibility in TE. 
===> this will enforce consistent BGP route advertisements. 
e.g. have multiple peering links between two ASs, so if one goes down, we can use the other egress point. 
 
AS1e0 ---\ 
          AS2 
AS1e1 ---/ 
 
 
(3) reduce overhead of routing changes 
===> group related IP prefixes that have common AS paths. 
===> move traffic in terms of grouped prefixes. 
 
(10% of origin ASs are responsible for 82% of outbound traffic) 
 
 
## 
##  Multi-path Routing 
## 
- operator can configure. 
- applies to both intra/inter domain routing. 
 
multipath routing for (intra) domain routing 
- set equal cost for multiple paths (aka ECMP) 
- router might decide based on congestion (e.g. 35% to path A, 65% to path B) 
- the router's fwd table has multiple next hops defined for outgoing pkts with the same destination; the router alternates between entries (or hashes flows across them). (see the sketch below) 
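
a minimal sketch of hash-based next-hop selection (the interface names and the choice of md5 are just for illustration); hashing the flow 5-tuple keeps all pkts of one flow on the same path, avoiding reordering:

import hashlib

NEXT_HOPS = ["eth1", "eth2"]          # two equal-cost next hops for the same dst prefix

def pick_next_hop(src_ip, dst_ip, src_port, dst_port, proto):
    key = "{}|{}|{}|{}|{}".format(src_ip, dst_ip, src_port, dst_port, proto).encode()
    idx = int(hashlib.md5(key).hexdigest(), 16) % len(NEXT_HOPS)
    return NEXT_HOPS[idx]

print(pick_next_hop("10.0.0.1", "10.0.9.9", 43211, 80, "tcp"))   # same flow -> same answer every time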
 
 
## 
##  TE in DC 
## 
 
characteristics of DC 
(1) multi-tenancy (pros: amortize the cost of shared infra) (cons: extra caution on security and resource isolation) 
(2) elastic resources (flexibly allocate/deallocate resources) 
(3) flexible service mgmt (as load increases on certain service, operator can provision additional virtual machines, or move it to a different set of servers, virtual host migration, etc.) 
===> key enabling technology : virtualization  (quickly provision/move/migrate servers/services in response to fluctuations in workload) 
 
===> needs TE solutions to allow the NW reconfig in response to changing workload and migrating services. 
 

#  challenges in DC NW 

- traffic load balance 
- support for virtual machine migration 
- power savings 
- provisioning (when demands fluctuate) 
- security (particularly for multi-tenant env) 
 
DC topology:  3 layers 
 
 
                           (internet) 
                             |    | 
(1) Core                   routers(L3)  # or top level switches in VL2 
                             |    | 
(2) aggregation(aka edge)   switches    # connects access layers 
                             ||||||| 
(3) access                  switches 
                            servers 
 
 
=======>  modern DC topologies are often built as a single flat L2 NW, which makes migration easier (no IP change when servers move) 
=======>  ease of migration leads to ease of load balancing 
=======>  but a flat all-L2 topology inevitably hurts scalability 
 
(first of all, L2 addresses are not topological: they are flat/monolithic, so every switch has to carry a fwd table with entries for literally tens of thousands of server MAC addrs.) 
(L2 addressing doesn't take advantage of the natural hierarchy of the NW topology the way IP does) 
 
=======> other problems include ; 
- single point of failure in L2 hierarchy (as links/traffic get aggregated towards the core) 
- oversubscription at the core (core can carry 40-200 times more traffic than the bottom of the hierarchy) - huge capacity mismatch 
 

#  solution to the scalability issue of L2 topology in DC 
 
- "pods" = assign pseudo MAC addrs to the servers and group them into pods. 
 
lets say servers with MAC addr A,B,C,D,E,F 
 
pod1 : A,B 
pod2 : C,D 
pod3 : E,F 
 
within each pod, the switch assigns a pseudo MAC addr to each server. 
 
pod1 : A(1:1), B(1:2) 
pod2 : C(2:1), D(2:2) 
pod3 : E(3:1), F(3:2) 
 
===> no more huge fwd table in every switch 
 
(problems) 
- mapping pseudo MAC to real MAC 
- need to intercept ARP queries (who has this IP addr? tell me your MAC addr) 
 
e.g. 
let's say in the above example L2 topology, the host with real MAC addr A (pseudo MAC 1:1) issues an ARP query. the frame first gets intercepted by its pod's switch, which fwds it to the "fabric mgr" (usually at the core layer). the fabric mgr returns to host A the pseudo MAC addr (say 2:1) for the queried IP addr. host A then sends a frame to that pseudo MAC addr (2:1), and host A's switch knows which pod switch to fwd it to. when the frame reaches the pod2 switch, that switch knows the pseudo-MAC to real-MAC mapping for the hosts within its pod, so host C successfully receives the frame. 
 
====>  provides hierarchical fwding in an L2 topology. very good. (notice that no software modification was required on the hosts/servers) 
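
a toy sketch of the mapping machinery (pod numbering, IPs and names are made up; real systems like PortLand do this in the pod switches plus a fabric manager):

# fabric manager keeps IP -> pseudo MAC; each pod switch keeps pseudo MAC -> real MAC for its own pod
PODS = {1: ["A", "B"], 2: ["C", "D"], 3: ["E", "F"]}                 # real MACs per pod
PSEUDO = {real: "%d:%d" % (pod, i + 1)                               # e.g. C -> "2:1"
          for pod, macs in PODS.items() for i, real in enumerate(macs)}
REAL = {p: r for r, p in PSEUDO.items()}
FABRIC_MGR = {"10.0.2.1": PSEUDO["C"]}                               # hypothetical IP of host C

def arp_reply(ip):
    # the pod switch intercepts the ARP query and asks the fabric manager
    return FABRIC_MGR[ip]                                            # answers with a pseudo MAC

def deliver(pseudo_mac):
    # core/aggregation fwd on the pod prefix; the destination pod switch rewrites to the real MAC
    pod = pseudo_mac.split(":")[0]
    return "pod %s switch rewrites %s -> %s" % (pod, pseudo_mac, REAL[pseudo_mac])

print(arp_reply("10.0.2.1"), "|", deliver(arp_reply("10.0.2.1")))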
 
 

#  TE in DC 

 
- in a DC, a customized topology and special mechanisms can help reduce link utilization (reduce the number of hops needed to reach the edge of the DC). 
 
(problems) 
- limited server-to-server capacity (because links at the core layer are over-subscribed) 
- fragmentation (as a result of virtualization, a service may be fragmented across diff virtual machines, causing lots of traffic to travel across the DC; this also complicates L2/L3 re-config) 
 
====> thus we want the abstraction of one BIG L2 switch (making the entire DC a single virtual L2 topology, aka VL2) 
====> VL2 achieves two objectives: 
(1) L2 semantics across the entire DC topology (done via name/location separation and resolution services, like the fabric manager) 
(2) LB (load balance), i.e. uniform high capacity between servers/links. VL2 relies on flow-based random traffic indirection using "valiant load balancing" 
 
 

#  valiant load balancing 

- goal 
(1) spread traffic evenly across servers in VL2 NW. 
(2) location independence (unrestricted by the destinations of traffic flows) 
 
===> "picking a random indirection point" 
 
recall "access/aggregation/core" layer model. 
servers at the access layer randomly pick a switch at the aggregation layer (which we call the "indirection" level), and that switch fwds the frame to the appropriate destination server (based on dst MAC addr), effectively load balancing. (the same idea appears in multi-processor architectures) 
 
(video for visual topo) 
https://www.youtube.com/watch?v=BzNYSrcextU 
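
a toy sketch of the indirection step (switch names and the MAC are made up):

import random

AGGREGATION = ["agg1", "agg2", "agg3", "agg4"]        # candidate indirection switches
TOR_OF = {"00:00:00:00:00:0c": "tor2"}                # dst MAC -> destination ToR switch

def vlb_path(src_tor, dst_mac):
    mid = random.choice(AGGREGATION)                  # pick a random indirection point
    return [src_tor, mid, TOR_OF[dst_mac]]            # src ToR -> random agg -> dst ToR

print(vlb_path("tor1", "00:00:00:00:00:0c"))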
 
 
## 
##   Jellyfish : a technique to wire a DC network randomly  (using a "random regular graph") 
## 
- goal 
(1) high throughput             # big data 
(2) incremental expandability   # easy replacement of servers/switches 
 
- problem : structure constrains expansion 
e.g. 
a hypercube topology requires 2^k switches (for degree-k switches), so it can only grow by doubling 
even a 3-level fat tree (more efficient than a hypercube) still needs a number of switches quadratic in the switch port count, so expansion only comes in coarse increments 
 
===> and recall that in the typical core/aggr/access hierarchy, the top-level core layer always becomes the bottleneck. 
 

#  Random Regular Graph 

- topology in Jellyfish is RRG 
- each graph is uniformly randomly selected from the set of all regular graphs. 
- a regular graph is one where each node has the same degree. (degree N means each node has N edges) 
- switches are nodes in a graph 
 
every top-of-rack(ToR) switch has 
Ki : total ports 
Ri : ports to connect to other ToR switches 
Ki-Ri : ports to connect to servers 
N : number of racks 
 
===> N(Ki-Ri) servers can be supported 
 
we denote a random regular graph as RRG(N,k,r) 
 
a RRG is sampled uniformly from the set of all RRGs, but achieving such property is a complex graph theory problem. 
 
 

# constructing a Jellyfish RRG 

(1) pick a random pair of switches with free ports (a pair that is not already neighbors) and link them. 
-> repeat this step until no more links can be added (each switch's degree, i.e. its switch-facing port count r, limits its number of links). (see the sketch below) 
-> when new switches are added, uniform-randomly remove existing links and use the freed ports to link to the new switches 
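
a minimal sketch of the construction step (the real algorithm also re-wires when it gets stuck with free ports on non-connectable switches, and handles incremental expansion; both are omitted here):

import random

def jellyfish(num_switches, r, seed=0):
    """build a random graph: each switch uses up to r ports for switch-switch links"""
    random.seed(seed)
    adj = {s: set() for s in range(num_switches)}
    while True:
        candidates = [(a, b) for a in adj for b in adj
                      if a < b and b not in adj[a]
                      and len(adj[a]) < r and len(adj[b]) < r]
        if not candidates:
            break
        a, b = random.choice(candidates)
        adj[a].add(b)
        adj[b].add(a)
    return adj

print(jellyfish(6, 3))                # 6 ToR switches, 3 switch-facing ports each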
 
essentially, compared to a less random, more structured graph like a fat tree, 
an RRG has been empirically shown to support 25% more servers at full capacity using the same equipment 
(achieved because the RRG gives shorter paths between more server pairs than a conventional fat-tree topo) 
 

# Open Questions 

- how close are RRG to optimal? (in terms of optimal throughput achievable with a given set of equipment) 
- what about heterogeneous switches? (switches with diff link speeds, diff numbers of ports) 
(from system design perspective) 
- cabling (how do we do it with such a random topo?) 
- routing (how do we do it with no hierarchical structure?) 
 
 
 
########################################### 
#####    (11)  Network Security      ###### 
########################################### 
 
networks (including the internet) are susceptible to various kinds of attacks; 
 

# attacks on routing(BGP) 

 
some rogue AS advertises a set of IP-prefix that it does not own(called "route hijack") to neighboring ASs which in turn advertise to the rest of the internet. 
 
e.g. 
in 2008, Pakistan hijacked YouTube prefixes in a botched attempt to block youtube traffic to its citizens (well, they disrupted youtube for the rest of the world alright.) 
in 1997, AS7007 advertised ALL IP prefixes as originating from its own AS, thus disrupting much of the internet. 
 
===> highlights the vulnerability of BGP, because ASs believe route advertisement 
 

# attacks on naming(DNS) 

- "reflection" (aka DDoS)      # distributed denial of service 
- "phishing" 
 
 
## 
##  internet's design :  fundamentally insecure 
## 
- designed for simplicity 
- "on" by default. (once a host joins a network, it is reachable by other hosts in the NW. initial internet was a network of trusted networks.) 
- hosts are insecure 
- attacks can look like "normal" traffic 
- federated design (the internet has no centralized control by design, which makes it hard to coordinate a defense) 
 
 
## 
##   components of security 
## 
 
(1) Availability   : ability to use a resource           # target of DoS 
(2) Confidentiality: concealing information 
(3) Authenticity   : assure origin of information 
(4) Integrity      : prevent unauthorized changes 
 
threat : potentially violates any of the above 
attack : violates any of the above 
 
 

#  attack on confidentiality/authenticity/integrity/availability 

- eavesdropping 
e.g. 
host A sends email to host B, an eavesdropper can intercept/sniff the packets using wireshark, tcpdump, etc. ("promiscuous mode") 
 
attacker can get lots of info 
- DNS query: reveals what web sites you are visiting 
- pkt header: who you are communicating to, what application you use. 
- payload: everything you are sending, like private email, etc. 
===> breach on confidentiality 
===> eavesdropper can modify the content and reinject to the flow, impersonate host A (aka "man-in-the-middle" attack) 
===> breach on authenticity/integrity 
(or drop the pkt that it intercepts, then you have compromised availability as well) 
 
(consequence) 
- theft of confidential info 
- unauthorized use of resources 
- spread of false info 
- disruption of service 
 
 
 
## 
##  routing security  (BGP) 
## 
- control plane security (authentication of msg being advertised by BGP) 
- goal is to determine the veracity of routing advertisement 
--- session: point-to-point between routers 
--- path   : protects AS paths 
--- origin : protects origin AS, to effectively guarantee the origin AS is the owner of IP prefix that it advertises. 
 
- data plane security : hard open problem. how to ensure the pkt arrives at the intended destination thru the intended route. 
 

#  attack on routing 

- cfg error (inadvertently router advertises wrong routes, etc e.g. AS7007 was a result of a miscfg) 
- router compromised (by attacker) 
- unscrupulous ISPs (may advertise bad routes) 
 

# types of attack 

- cfg (intentional overwrite of cfg or its mgmt software) 
- tamper with software 
- tamper with routing data 
 
====> the most common attack being "route hijack(aka attack on origin authentication)" 
 

#  why hijack is critical 

- lets say you wanna visit a website via URL, then you issue a DNS query, which returns an IP addr for the web server hosting the website. 
- but suppose a rogue AS hijacks (BGP-advertises) the prefix of that DNS server, so your DNS query goes to a rogue DNS server instead. 
(now DNS is hijacked.) 
 

# MITM (man in the middle) attack 

- continuing the above BGP hijack example: if the attacker can collect your pkts, modify them, and still route them on to the legitimate destination, it becomes a MITM attack. 
- but in that case, how does the attacker hijack the prefix for the rest of the internet while keeping the route between itself and the legitimate AS intact? 
 
suppose  AS_500 hijacks the prefix of AS_200 
 
 
AS_200  ---  AS_300  --- AS_400  --- AS_500 
   |                                 | 
   -------- AS_600  --- AS_700 ------- 
             |                       | 
            AS_800  --- AS_900 ------- 
 
 
=====> a technique called "AS path poisoning" 
=====> basically, AS_500 advertises a path that includes AS_500 - AS_400 - AS_300; AS_300 and AS_400 will ignore the advertisement (BGP loop prevention: they see their own AS in the path), while every other AS hears AS_500's malicious advertisement. 
 
=====> notice the apparent loop: a traceroute would seem to loop. and if routers in AS_500 are configured to not decrement TTL, AS_500 can effectively "hide" itself during the MITM attack. 
 
=====> makes you appreciate the importance of origin & path authentication. 
 
 

#  session authentication 

- ensures BGP route msgs sent between two routers of two ASs are authentic. 
- TCP session (MD5 auth: the msg carries a hash computed over the msg contents plus a shared secret key) 
- TTL hack (eBGP peers are usually one hop apart: the sender sends TTL=N and the receiver checks for TTL=N-1, dropping the msg otherwise. this eliminates remote attackers) 
 
 

#  guaranteeing origin & path authentication 

- "secure BGP" (aka BGPSEC) 
(1) Origin(Address) Attestation: certificate binding a prefix to its owner.  # the certificate must be signed by a trusted party, like a routing registry. 
(2) Path Attestation: signatures of AS path (see below) 
 
each AS owns pub/pri key pair. each AS signs the path attestation with its private key, and other ASs verify using the pub key. 
 
AS_0 --- AS_1 --- AS_2 --- AS_3 
(orig) 
 
===> lets call the prefix of AS_0 "p" 
 
AS_1 will send BGP route advertisement to AS_2.  "p" AS-path:1,   attestation {2,1} signed by key1 (pri key of AS_1) 
AS_2 will send BGP route advertisement to AS_3.  "p" AS-path:2-1, attestation {3,2,1} signed by key2, {2,1} signed by key1 
 
====> notice how it carries attestation signed by each AS along the way, so the AS_3 can verify each step of the advertised AS-path by BGP. 
====> prevents any path insertion/modification/hijack. 
====> notice how the path attestation signed by an AS includes the AS that receives the BGP advertisement. this is CRITICAL for security. 
====> suppose in the above example, path attestation included only up to the AS who is signing the path with its pri key. 
e.g. 
"p" AS-path:2-1, attestation {2,1} signed by key2, {1} signed by key1 
 
===> then any attacker(lets say AS_4) can take this msg, and overwrite it as 
"p" AS-path:4-1, attestation {4,1} signed by key4, {1} signed by key1 
 
===> effectively took over the path for the prefix "p". 
===> this is prevented by including the next AS in its path attestation. (because AS_4 cannot produce a path attestation {4,1} signed by "key1" which is owned by AS_1 ) 
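
a toy sketch of why including the next AS in the attestation matters (big simplification: an HMAC with a per-AS secret stands in for the real public-key signature, so "verification" here just recomputes the MAC; real BGPSEC uses asymmetric keys so anyone can verify with the signer's pub key):

import hmac, hashlib

KEYS = {1: b"key-AS1", 2: b"key-AS2", 3: b"key-AS3", 4: b"key-AS4"}     # hypothetical per-AS keys

def attest(signer_as, path_including_next_as):
    # AS `signer_as` signs the path *including the AS it advertises to*
    msg = ",".join(str(a) for a in path_including_next_as).encode()
    return hmac.new(KEYS[signer_as], msg, hashlib.sha256).hexdigest()

a1 = attest(1, [2, 1])          # AS_1 -> AS_2 : attestation over {2,1}
a2 = attest(2, [3, 2, 1])       # AS_2 -> AS_3 : attestation over {3,2,1}, plus a1 forwarded

# AS_3 verifies each hop of the claimed AS-path 2-1
assert a2 == attest(2, [3, 2, 1]) and a1 == attest(1, [2, 1])

# a hijacker AS_4 claiming path 4-1 would need {4,1} signed with AS_1's key, which it cannot produce
forged = attest(4, [4, 1])                  # signed with AS_4's own key instead
assert forged != attest(1, [4, 1])          # so it does not verify as AS_1's attestation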
 
NOTE: path attestation cannot defend against 
(1) suppression # failure of route advertisement 
(2) replay      # a premature re-advertisement of a withdrawn route 
===> still no guarantee that pkt actually travels along the advertised AS path. (= weakness of BGP, an open problem in internet routing) 
 
 
## 
##  DNS security 
## 
 
# DNS architecture 
                                                          (can be corrupted) 
                                                           zone files     dynamic update sys  (can be spoofed) 
                                                              |            | 
stub resolver ----query---->  caching resolver  --------->  master authoritative (can be spoofed) 
                    |               |                             | 
                 (MITM)       (cache poisoning)             slave authoritative (can be spoofed) 
 
 
for MITM & spoofing, we have "DNSSEC" 
for cache poisoning, we have "0x20" encoding 
 
=====> also, another infamous attack is "DNS reflection" where DNS is used to stage a massive DDoS attack. 
 
 

#  why is DNS vulnerable? 

- resolvers trust responses. (e.g. cache poisoning: an attacker may respond to the resolver before the legitimate response arrives, etc) 
- response can contain certain info unrelated to query 
 
=====>  No Authentication (of responses by basic DNS protocol, which uses UDP = connectionless unlike TCP) 
 

#  DNS cache poisoning 

              (with a 16-bit query ID) 
              A record google.com ?                   A record google.com ? 
stub resolver  ------------------->  recursive resolver -----------------  SOA (start of authority) 
                IP of A (IN A)               ||            IP of A (IN A) 
               <------------------           ||          <---------------- 
                                       attacker 
                                 (can send lots of bogus responses before SOA replies) 
                                 (theoretically 2^16 msg will suffice to match the query ID) 
                                 (in fact, due to birthday paradox, probability wise, just hundreds of bogus msg will usually hit) 
(defense) 
- query ID 
- randomizing query ID 
===> as above, no good. 
 
================> for the attacker to successfully poison the cache, it must (1) win the race and (2) hit the query ID 
================> if it loses the race, it can just wait for the cache entry to expire, or use the "Kaminsky attack": 
================> trigger a stream of A-record queries (for random subdomains) and answer with bogus NS records for the zone instead of A records, thus owning the entire zone ! 
 

#  defenses to DNS cache poisoning 

(1) IP + randomization 
(2) source port randomization (adds another 16-bit entropy, but this is resource intensive and NAT can derandomize) 
(3) 0x20 encoding (takes advantage of DNS queries being case-insensitive: the 0x20 bit of each ASCII letter decides whether the char is upper case or not. the resolver randomizes which letters of the domain are capitalized, remembers the combination, and discards any response that doesn't match the 0x20 encoding of the original query) 
e.g. 
www.GoOGLe.com = www.google.com 
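
a minimal sketch of the idea (purely illustrative; real resolvers also match the query ID, port, and question section):

import random

def encode_0x20(qname):
    # randomize the case of each letter; DNS matching is case-insensitive, so the
    # legit server echoes the exact same mixed-case name back in its response
    return "".join(c.upper() if random.random() < 0.5 else c.lower() for c in qname)

def response_ok(sent_qname, qname_in_response):
    # exact case-sensitive comparison: a spoofer who doesn't know the chosen casing is discarded
    return sent_qname == qname_in_response

q = encode_0x20("www.google.com")          # e.g. "wWw.GoOGLe.cOm"
print(response_ok(q, q))                   # legit server echoes our casing -> True
print(response_ok(q, "www.google.com"))    # spoofed response guessing the casing -> almost always False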
 
 

#  DNS amplification attack 

- exploits the asymmetry in size between DNS queries and responses 
 
e.g.  query = 60 bytes,  response = 3000 bytes 
 
an attacker (possibly many hosts) sends queries with the src IP spoofed to be the victim's IP; the DNS resolver then sends all the (much larger) responses to the victim. a classic DoS attack (here ~50x amplification); with multiple attackers, a DDoS. 
 
(defense) 
- prevent spoofing of src IP (using filter rule, etc) 
- configure resolvers to not answer queries from arbitrary locations on the internet (no open resolvers). 
 

#  DNSSEC 

- add signature at each referral. just like AS path attestation. 
- lets assume recursive resolver has a pub key to (root) server below. 
 
 
                               -------- (root)  # signs{IP and pub key of .com server} with private key of (root) server 
    A google.com?              |                # RR can verify (root)'s response with the pub key 
stub ------------>  recursive  -------- .com    # signs{IP and pub key of google.com} with private key of .com server 
                    resolver   |                # RR can verify .com srv's response with the pub key of .com server given by (root) 
                               -------- google.com 
 
===> ensures the authenticity in each step of the recursive DNS resolution. 
 
 
 
######################################### 
#####   (11.1)  Internet Worms     ###### 
######################################### 
 

# viruses and internet worms 

 
virus: "infection" of an existing program that results in modification of behavior. 
===> usually requires user action. e.g. opening an attachment in email, running a malicious .exe 
 
worm : code that propagates/replicates across the network 
===> usually spreads automatically, exploiting the flaw/vulnerability of existing programs, open services. 
 
===> to recap: virus = spreads manually, worm = spreads automatically 
 
 

#  types of viruses 

- parasitic       : infects executable files 
- memory resident : infect running programs 
- boot sector     : spreads when system is booted 
- polymorphic     : encrypts part of virus program using randomly generated key 
 
 

#  lifecycle of worms 

(1) discover/scan for vulnerable hosts 
(2) infect vulnerable machines via remote exploit 
 
===> general approach consists of three steps 
(1) scan 
(2) spread 
(3) remain un-discovered/un-detected 
 

#  first worm : "Morris" worm 

- 1988 
- no malicious/destructive payload, but resource exhaustion. (10% of all internet hosts got affected) 
- multiple propagation vectors 
(1) remote shell exec. (dictionary attack on pw, once a host is cracked, then attack the host's list of trusted hosts) 
(2) buffer overflow/remote exploit     # classic 
(3) debug mode in sendmail (used to be able to exec a cmd remotely in SMTP msg) 
 
====> having multiple propagation vectors is effective 
 
# some worms even patch the vulnerability after infection, to prevent other worms from also infecting the host (competing infections could lead to faster detection of the anomaly). 
 
 

#  worm outbreak cases 

- summer of 2001, three worms hit the internet: (1) Code Red1, (2) Code Red2, (3) Nimda 
 
code red 1; 
- jul 13th, 2001 
- "modern" worm 
- buff overflow on MS IIS server 
- active 1st-20th of each month (due to a bug it dies on the 20th and resurges on the 1st) 
- random scanning of IP addr (multi-threaded for the random IP scanning) 
- payload : mount a DoS on whitehouse.gov 
 
===> compromised 350k hosts in 14hrs. 
===> weak, because it actually DoS'd the hard-coded IP of whitehouse.gov, not the domain itself, so once the NW admins changed the IP, the web site was safe. 
 
code red 2; similar 
- only on windows 2000 
- only active aug 4th - oct 1st 2001, by design 
- payload : IIS backdoor 
- preferential scanning of "nearby addr" in the same NW addr, the idea being if one host is found vulnerable, it's likely other hosts are too within the same NW. 
 
Nimda; 
- sep 18, 2001 
- multi-vector (IIS, email attachment, copying itself across open NW shares, infects web servers so that any hosts that visit the web site would get infected, and exploits IIS backdoor by code red 2) 
- anti-worm/virus defenses work by filtering/blocking on some signature of the worm/virus, but until that signature is extracted, worms can spread freely; day zero is critical. 
 
 

#  modeling the spread rate of worms 

 
see the graph 
(video) https://www.youtube.com/watch?v=6oWgDcpMd_k 
 
- "random constant spread" model 
k : initial compromise rate 
N : number of vulnerable hosts 
a : fraction of already compromised hosts 
 
e.g. 
N*a = number of already infected hosts 
 
for a particular time interval dt, the number of new infections is 
N da = (N a) * k * (1 - a) dt        i.e.  da/dt = k a (1 - a) 
 
solving this for a gives the logistic curve 
 
a(t) = e^{k(t-T)} / (1 + e^{k(t-T)})        # T = the time at which half the vulnerable hosts are infected (a = 1/2) 
 
====> all comes down to increasing "k"  = initial compromise rate 
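
a minimal sketch of that solution, just to visualize how a larger initial compromise rate k steepens the outbreak (the k values and time range are arbitrary):

import math

def fraction_infected(t, k, T=0.0):
    # logistic solution a(t) = e^{k(t-T)} / (1 + e^{k(t-T)}) of da/dt = k*a*(1-a)
    return math.exp(k * (t - T)) / (1.0 + math.exp(k * (t - T)))

for k in (1.0, 3.0):                     # higher k ==> much steeper outbreak
    print(k, [round(fraction_infected(t, k), 3) for t in range(-3, 4)])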
 
 

#  increasing initial compromise rate 

(1) hit list: list of vulnerable hosts 
(2) permutation scanning (shares the list of already scanned IPs) 
e.g. the Slammer worm, which random-scanned and exploited a buffer overflow in MS SQL server. it used UDP, which is connectionless (the whole exploit fit into a single pkt), so its spread was not limited by NW latency/RTT, only by bandwidth. 
===> caused ~$1.2 billion in damage (brought down BoA ATMs, five root DNS servers, cell phone NWs in South Korea, etc) 
===> no malicious payload, but the bandwidth it consumed caused resource exhaustion on infected hosts/NWs. 
 
 
 
##################################### 
#####      (11.2)  Spam         ##### 
##################################### 
 

#  spam : unwanted commercial email 

- most spams end up in spam folder 
 
- problem 
(1) filters  : separate good from bad 
(2) storage  : mail servers have to store them 
(3) security : e.g. phishing 
 
- 95% of email is said to be spam 
 

#  filter 

- prevent spam msg from reaching inbox 
- how to differentiate spam from "good" email 
(1) content-based      : grep for "rolex" "viagra" 
===> easy to evade (diff spelling, or diff forms like image, mov, excel sheet,etc) 
(2) IP addr of sender  : aka "blacklist" 
===> when a sender emails a receiver, the receiver queries DNSBL(blacklist, aka spam house) to check the sender IP, if hit, then terminate the connection. saves the operator the cost of storage. 
(3) behavioral features: sender/sending behavior, like particular size/location/recipient group/upstream ISP/time of the day in batch, etc 
===> hard to understand/classify the behavior. 
 
 

#  spamming technique 

- BGP agility: hijack IP prefix for 10 min, send spam, withdraw the prefix. (gets ephemeral IP addr !!!  evades BL) 
 

#  set of effective NW-level anti-spam features 

(single-pkt) 
- distance between sender & receiver 
- density (any nearby senders?) 
- time of day 
- sender AS 
(single-msg) 
- number of recipients 
- length of msg 
(aggregates) 
- variation in msg length across a group of messages 
 
=====> all used in SNARE (Spatio-temporal Network-level Automatic Reputation Engine), which achieves a 70% detection rate at a 0.1% false-positive rate. 
 
 
 
 
############################################ 
#####        (11.3)  DoS attack       ###### 
############################################ 
 
- definition 
- defenses 
- how to infer DoS attack 
- security against DoS (using SDN) 
 
 

#  what is DoS attack? 

- attempt to exhaust resources: 
(1) NW bandwidth 
(2) TCP connections (e.g. a host usually has a limit on how many TCP conns it can hold at a time) 
(3) server resources (e.g. a web server doing server-side work rendering pages; bogus requests can exhaust server CPU, mem, etc) 
 
pre 2000 : single-source 
post 2000: distributed (hence DDoS) 
 

#  defenses 

 
(1) ingress filtering 
 
e.g. 
 
      AS -------------------- internet 
prefix "x.y.z.x/24"          | 
                          gw router rule : drop if src is NOT prefix "x.y.z.x/24" 
 
- foolproof 
- works at edge AS 
- does NOT work in core of the internet 
 
 
(2) uRPF checks      # unicast reverse path filtering  (there is multicast RPF = mRPF as well) 
 
- can be used in core 
- automatic 
- requires symmetric routing (not always possible) 
e.g. 
 
----(eth0) router (eth1)---- 
 
suppose a table content: 
 dst  10.0.1.0/24  -> eth0 
      10.0.18.0/24 -> eth1 
 
if "src"=10.0.18.3 arrives at (eth0), then it came from the reverse direction/path, so it must be fake, thus drop. 
 
 
(3) TCP SYN Cookies 
 
- recall TCP 3-way handshake 
 
client ---SYN pkt---> srv    # as soon as the SYN pkt arrives, the server allocates socket buffer for the TCP conn 
       <-- SYN-ACK---        # but if client never responds, then client can force the server to allocate many many socket buffers 
       ---- ACK ---->        # (often clients spoof src IP addr) 
 

# TCP Syn Cookie 

- no buffer is allocated upon SYN pkt arrival. but computes "sequence number" which is a hash of (srcIP, srcPort, dstIP, dstPort, random num) 
- keeps no state, just returns the seq# in SYN-ACK 
- then an honest client can reply ACK to the server, then server matches the seq#, only then establish the connection (allocating socket buffer) 
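
a minimal sketch of the cookie computation (simplified: real SYN cookies also encode the MSS and a coarse timestamp in specific bit positions; the secret/time-slot handling below is just an assumption):

import hashlib, os, time

SECRET = os.urandom(16)                              # server-side secret

def syn_cookie(src_ip, src_port, dst_ip, dst_port, slot=None):
    # stateless initial sequence number = hash of the 4-tuple + secret (+ coarse time slot)
    slot = int(time.time()) >> 6 if slot is None else slot
    data = "{}:{}:{}:{}:{}".format(src_ip, src_port, dst_ip, dst_port, slot).encode()
    return int.from_bytes(hashlib.sha256(SECRET + data).digest()[:4], "big")

# on SYN : reply SYN-ACK with seq = syn_cookie(...) and allocate NOTHING
# on ACK : recompute the cookie and compare with (ack_number - 1);
#          only if it matches, allocate the socket buffer / connection state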
 
 

#  inferring DoS activity (aka "backscatter analysis") 

- suppose attacker spoofs srcIP and SYN floods the server 
- server replies a storm of SYN-ACK (called "backscatter") to the forged srcIP which is basically a victim 
- spoof srcIP is usually picked randomly 
 
- if we measure the amount of what we deem as "backscatter" within a certain portion(called telescope) of NW, then we can infer the size of DoS in the overall NW. 
 
e.g. 
 
n : number of IP addresses 
m : number of backscatter pkts  (per sec, etc) 
x : total attack rate  (per sec, etc) 
 
m = x * (n / 2^32)        # the telescope sees the fraction n/2^32 of the backscatter 
hence   x = m * (2^32 / n) 
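
e.g. a quick plug-in of numbers (the telescope size and observed rate are made up):

n = 2**24          # telescope monitors a /8, i.e. 1/256 of the IPv4 space
m = 100            # observed backscatter pkts/sec
x = m * (2**32 / float(n))
print(x)           # ~25,600 pkts/sec estimated total attack rate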
 
(quiz) https://www.youtube.com/watch?v=HGbUNN5thS8 
 
 

#  assignment 8: automated DoS mitigation 

- PyResonance: an extension of pyretic. allows the composition of finite state machines that run various programs depending on the state (of the NW) 
- essentially, sflow monitor will detect DoS, then state changes from Normal to Under Attack, then controller can install a diff flow table entry (to mitigate the impact of DoS attack) 
 
FSM = finite state machine 
 
(ref) 
# FSM in general 
https://en.wikipedia.org/wiki/Finite-state_machine 
 
# pyretic in general 
https://github.com/Resonance-SDN/pyresonance/wiki 
 
# tells about how asynchronous the sFlow-RT engine is 
http://www.inmon.com/products/sFlow-RT.php 
 
# sflow in general 
http://www.sflow.org/about/index.php     # can be used for both small L2 switches and L3 core 10Gbps routers. 
http://www.sflow.org/sFlowOverview.pdf   # L2-L7 details can be monitored (various NW events, like dos) 
 
 
# pretty much covers all pyretic logics/syntax we need to use for this assignment 
# but also look at the firewall assig code 
https://github.com/frenetic-lang/pyretic/wiki/Language-Basics 
http://frenetic-lang.org/pyretic/doc/ 
 
# paper on pyresonance concept 
https://smartech.gatech.edu/bitstream/handle/1853/49181/GT-CS-13-04.pdf 
 
 
# NOTE: if you get "Address already in use" error, that means you ran pyretic.py instance somewhere already. kill them. 
 
