perl multi threading

 
########################################## 
###   multi-process VS multi-thread    ### 
########################################## 
 
academic details: http://kenics.net/gatech/ios/ 
 
for practical purposes, multi-threading is useful when the parallel workers need to share a variable. 
 
## 
##  an example scenario 
## 
- suppose you have 5000 files, each containing one company's trading price/volume history, and you want to build one giant hash variable in perl, 
e.g. 
$history{$ticker}{$timestamp}{'price'} = $price; 
$history{$ticker}{$timestamp}{'vol'} = $vol; 
 
then you want to do all sorts of stats analysis, like how many stocks traded more than XYZ volume within a certain time window, 
or calculate the 21-day average daily volume for each given yyyymmdd, etc. 
 
suppose it takes 1 min to process a single file. then processing all 5000 files sequentially takes 5000 min, i.e. about 3.5 days. 
so you need to parallelize the processing. also, importantly, the parallel workers need to be able to share a variable. hence multi-threading. 
 
NOTE: it is not impossible to share this kind of storage variable between processes, but it is not as simple. and of course, there is often more than one way to achieve the result you want. so even in the above example, depending on what kind of stats you want to compute, a different approach may be more suitable. 
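
to make the scenario above concrete, here is a minimal single-threaded sketch that builds the nested hash and computes one of the stats mentioned (how many tickers traded more than a given volume within a time window). the sample rows, the 14-digit timestamp format, and the sub names build_history / count_active_tickers are all made up for illustration, not taken from any real data set.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical sample rows: ticker,timestamp(yyyymmddHHMMSS),price,volume
my @rows = (
    "AAA,20160930093000,10.5,1200",
    "AAA,20160930100000,10.7,300",
    "BBB,20160930093500,55.0,80",
    "BBB,20161003093000,54.1,5000",
    "CCC,20160930094500,7.2,2500",
);

# build $history{$ticker}{$timestamp}{price|vol} and return a ref to it
sub build_history {
    my %history;
    for my $row (@_) {
        my ($ticker, $timestamp, $price, $vol) = split /,/, $row;
        $history{$ticker}{$timestamp}{price} = $price;
        $history{$ticker}{$timestamp}{vol}   = $vol;
    }
    return \%history;
}

# count tickers with at least one record of more than $min_vol within [$from, $to]
sub count_active_tickers {
    my ($history, $min_vol, $from, $to) = @_;
    my $count = 0;
    for my $ticker (keys %$history) {
        # grep in scalar context returns the number of matching timestamps
        my $hits = grep {
            $_ ge $from && $_ le $to && $history->{$ticker}{$_}{vol} > $min_vol
        } keys %{ $history->{$ticker} };
        $count++ if $hits;
    }
    return $count;
}

my $history = build_history(@rows);
print count_active_tickers($history, 1000, "20160930000000", "20160930235959"), "\n";
```

the timestamps are compared as strings (ge / le), which only works because they are assumed to be fixed-width. with a different timestamp format you would need a numeric or date-aware comparison.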
 
 
## 
##  multi-proc 
## 
building on the above example scenario, suppose you just need to process each file totally independently. then, although there are libs/modules that let you do this, you can simply run your script in parallel from the shell, with a file name as the input argument. 
e.g. 
for filename in ./file_dir/*; do script.pl "$filename" & done; wait 
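
the shell loop above starts one process per file all at once, which is a lot for 5000 files. if you want the same multi-proc approach from inside perl with a cap on concurrency, a rough sketch using the built-in fork / wait could look like this. run_parallel and process_file are hypothetical names, the limit of 4 is arbitrary, and process_file is only a stand-in for whatever script.pl does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

$| = 1;   # unbuffered STDOUT, so children do not re-print text buffered by the parent

# run $worker->($item) in one child process per item, at most $max_procs at a time.
# returns the number of children that exited with status 0.
sub run_parallel {
    my ($worker, $max_procs, @items) = @_;
    $max_procs = 1 if $max_procs < 1;
    my ($running, $ok) = (0, 0);

    # reap one finished child and record whether it succeeded
    my $reap = sub {
        my $pid = wait();
        if ($pid == -1) { $running = 0; return; }   # no children left
        $running--;
        $ok++ if $? == 0;
    };

    for my $item (@items) {
        $reap->() while $running >= $max_procs;      # wait for a free slot
        my $pid = fork();
        die "fork failed: $!\n" unless defined $pid;
        if ($pid == 0) {                             # child process
            my $success = eval { $worker->($item); 1 };
            exit($success ? 0 : 1);
        }
        $running++;                                  # parent process
    }
    $reap->() while $running > 0;                    # wait for the rest
    return $ok;
}

# hypothetical stand-in for the per-file work: just count lines
sub process_file {
    my ($file) = @_;
    open(my $in, '<', $file) or die "cannot open $file: $!\n";
    my $lines = 0;
    $lines++ while <$in>;
    close $in;
    print "$file: $lines lines\n";
}

my @files = glob("./file_dir/*");
my $done  = run_parallel(\&process_file, 4, @files);
print "$done of ", scalar(@files), " files processed\n";
```

note that each child gets its own copy of every variable, so anything a child stores in a hash is gone when it exits. that is exactly the limitation that motivates the multi-thread approach.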
 
 
## 
##  multi-thread 
## 
 
here is an example using the perl "threads" module. 
 
(ref) http://perldoc.perl.org/threads.html 
(ref) http://perldoc.perl.org/threads/shared.html 
(ref) http://perldoc.perl.org/perlthrtut.html 
 
----------------------------// thread.pl 
 
#!/usr/bin/perl 
 
use strict; 
use warnings; 
use threads; 
use threads::shared; 
 
my $thread_num = 10; 
my @files = glob("/tmp/foo.*.csv"); 
my $file_num = scalar(@files); 
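my $files_per_thread = int(($file_num + $thread_num - 1) / $thread_num) || 1;   # round up, and never 0, otherwise splice() below removes nothing and the loop never ends 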
my @thr; 
my %hist :shared;   # notice the syntax. to share a variable btwn threads, you must do this. 
my $idx :shared = 1; 
 
print "gonna process $file_num files. \n"; 
 
while(scalar(@files) > 0){ 
    my @subset = splice(@files,0,$files_per_thread); 
    push @thr,threads->create(\&process_files,@subset); 
} 

for (@thr){ 
    $_->join();   # wait for each thread to finish 
} 
 
print "finished processing all $file_num files. \n"; 
 
# do whatever stats computation 
 
print "finished computing stats \n"; 
exit 0; 
 
sub process_files{ 
    for my $file (@_){ 
        process_file($file); 
    } 
} 
 
sub process_file{ 
    my ($file) = @_; 
    open (my $in, '<', $file) or die "cannot open $file: $! \n"; 
    { 
        lock($idx);                                       # as per the spec, we cannot explicitly unlock 
        print "processing (${idx}/${file_num}) $file \n"; # so just create a block to specify the lock scope 
        $idx++; 
    } 
    while(<$in>){ 
        chomp;                                            # drop the newline so it does not end up in $volume 
        print "processing $_ \n"; 
        my ($ticker, $timestamp, $price, $volume) = split(',',$_); 
        { 
            lock(%hist); 
            my %dummy_0 :shared;  # see http://www.perlmonks.org/?node_id=259551 
            my %dummy_1 :shared;  # shared var can only store (1) scalar, (2) ref to shared var/data. 
            $hist{$ticker} = \%dummy_0 unless(exists $hist{$ticker});   # it's a bit annoying but you need this for threads::shared 
            $hist{$ticker}{$timestamp} = \%dummy_1 unless(exists $hist{$ticker}{$timestamp}); 
            $hist{$ticker}{$timestamp}{price} = $price; 
            $hist{$ticker}{$timestamp}{volume} = $volume; 
        } 
    } 
    close $in; 
} 
---------------------------- 
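
the dummy-variable trick in thread.pl exists because a shared hash can only hold scalars or refs to shared data. as a hedged alternative, threads::shared also exports shared_clone(), which returns a shared copy of a data structure. the sketch below shows the error you get with a plain hashref and then the shared_clone version. it assumes a perl built with thread support and a threads::shared recent enough to provide shared_clone. add_tick, %scratch and the sample values are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

my %hist    :shared;
my %scratch :shared;   # only used to demonstrate the failure below

# storing a ref to a plain (non-shared) hash in a shared hash dies
my $stored = eval { $scratch{AAA} = {}; 1 };
print "plain hashref rejected: $@" unless $stored;

# shared_clone() gives back a shared copy, so no dummy variable per level is needed
sub add_tick {
    my ($ticker, $timestamp, $price, $volume) = @_;
    lock(%hist);       # released automatically when the sub returns
    $hist{$ticker} = shared_clone({}) unless exists $hist{$ticker};
    $hist{$ticker}{$timestamp} = shared_clone({ price => $price, volume => $volume });
    return;
}

# three threads each add one record for the same ticker
my @thr = map {
    my $n = $_;
    threads->create(sub { add_tick("AAA", "2016093009300$n", 10 + $n, 100 * $n) });
} 1 .. 3;
$_->join() for @thr;

print scalar(keys %{ $hist{AAA} }), " timestamps stored for AAA \n";
```

whether this is nicer than the dummy variables is a matter of taste. shared_clone copies its argument, so it is best used on small structures like the per-timestamp hash here.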
 
 

  1. 2016-10-02 15:33:26 |
  2. Category : perl