Machine Learning for Identification of Cars


There is plenty of data on the internet; however, most of it is raw. Think for a second about public surveillance cameras - useful to check the traffic on a route or at a busy place, but anything else? What if you want to know how many cars are on the route? How many cars were there yesterday at the same time? Given so many cars on the route, how polluted is the air in the area? While working on the road map for a data dive event, I started to wonder how feasible it is to use data from public surveillance cameras. So I quickly built a pilot project and now I would like to share my experience.

First step - data acquisition. At the beginning I was thinking of mounting my smartphone somewhere to collect data on a busy route. Nevertheless, I quickly found public surveillance cameras in Vilnius and started to collect images. Run a search and I'm sure you will find them in your city:

Here is the bash script I use to collect the images:

#you need the full path for crontab
cd /home/git/carCount/img
a=$(date +%s)
b=${a}_4.jpg
wget -O "$b" -q "http://www.sviesoforai.lt/map/camera.aspx?size=full&image=K7742-1.jpg&rnd=0.15417794161476195"

Data preparation. After a while you will have enough data to train your machine (for a start, more than 30 images should be O.K.). How do we train the algorithm? The goal is to identify the cars in a given image. That means we have to provide examples of positive images (a clear image of a car) and negative images (no car, parts of a car, etc.). Important note - we don't feed the whole image, but cut each image with a sliding window (100x100 pixels in my case). Four examples of positive images:

Meanwhile, it is worth converting each image to the portable graymap format (PGM). For this specific task, we can sacrifice the information about the color of the car - it won't improve the prediction. Besides, PGM images can be loaded into R and easily transformed into a matrix. Here is the bash script, which converts JPG to PGM and slices each image:

#remove image duplicates
find . -maxdepth 1 -type f -exec md5sum {} \; > test.txt
awk 'a[$1]++ {gsub(/^\*/,"",$2); print "rm ", $2}' test.txt | sh
rm test.txt

#convert jpg to pgm
if [ -d "out" ]; then
    rm -r out
fi
mkdir out
for k in *.jpg; do convert "$k" "out/$k.pgm"; done

#slice each image with a 100x100 sliding window
cd out
mkdir slide
for filename in *.pgm; do
    for ((y = 150; y <= 250; y += 50)); do
        for ((x = 100; x <= 400; x += 50)); do
            echo "slide/$y.$x.$filename"
            convert "$filename" -crop 100x100+$x+$y "slide/$y.$x.$filename"
        done
    done
done

Training, predicting, cross validation. Now it is time to open R, load the 100x100 images from the "train/out/slide" directory and train the algorithm. Important note - each image is a matrix, but you have to feed a single matrix of all images to the learning algorithm (a support vector machine in my case). What you have to do is "unroll" each image matrix into a 1x10000 vector and build a new matrix, where each row is an image. Once training is done, load the unseen data from the "crossval/out/slide" directory and check the "result/" directory, where you will find the images of cars. The R script, which does all of the above:

setwd('/home/git/carCount/')

library(pixmap) #provides read.pnm
library(e1071)  #provides svm

######read positives############
files=list.files('test/pos/')
pos=matrix(nrow=NROW(files),ncol=100*100)

for(i in 1:NROW(files))
{
  gray_file=read.pnm(paste('test/pos/',files[i],sep=''))
  pos[i,]=c(gray_file@grey)
}
#label positive examples as 1
outcome=rep(1,NROW(files))

########read negatives#############
files=list.files('test/neg/')
neg=matrix(nrow=NROW(files),ncol=100*100)

for(i in 1:NROW(files))
{
  gray_file=read.pnm(paste('test/neg/',files[i],sep=''))
  neg[i,]=c(gray_file@grey)
}
#label negative examples as 0
tmp=rep(0,NROW(files))
outcome=c(outcome,tmp)
#train the SVM on the combined positive and negative examples
forecast=svm(rbind(pos,neg),outcome)
#quick sanity check on a few positive examples
cross_val=pos[84:90,]
pred=predict(forecast,cross_val,decision.values=TRUE)

##########################unseen data######################
files=list.files('crossval/out/slide/')
cross=matrix(nrow=NROW(files),ncol=100*100)

for(i in 1:NROW(files))
{
  gray_file=read.pnm(paste('crossval/out/slide/',files[i],sep=''))
  cross[i,]=c(gray_file@grey)
}
pred=predict(forecast,cross,decision.values=TRUE)

###############copy positives into result directory###############
dir.create('result')
file.copy(paste('crossval/out/slide/',files[which(as.double(pred)>0.6)],sep=''),'result/')

Classified as positive by the algorithm:

Classified as negative by the algorithm:

Conclusion. It is truly amazing how well the algorithm is able to separate the wheat from the chaff without any additional tuning. Mind you, my impression is biased after so many failures with financial data, which is noisy and where good predictions are scarce. Nevertheless, this project is far from ideal - it doesn't take into account weather conditions, traffic jams, the perspective view, movements of the camera, etc. But I leave this fun for the data-dive event.

Fork the code: https://github.com/kafka399/carCount/

Levenshtein Distance in C++ and Code Profiling in R


At work, a client asked whether the existing search engine could accept singular and plural forms equally, e.g. "partner" and "partners" would lead to the same result.

The first option - stemming. In that case, the search engine would use the root of a word, e.g. "partn". However, stemming has many weaknesses: two different words might have the same root, a user can misspell the root of the word, and outside English and a few other languages it is not that trivial to implement stemming.
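
For illustration only, here is what a stemmer does with those forms - a quick sketch assuming the SnowballC package (this is not what our search engine uses):

library(SnowballC)
#both forms collapse to the same root, which is what stemming exploits
wordStem(c("partner","partners","partnering"))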

Levenshtein distance comes as the second option. The algorithm is simple - you have two words and you calculate the difference between them. You can insert, delete or replace any character, but each operation costs you. Let's imagine a user enters "Levenstin distances" into the search engine and expects to find relevant information. He just made 2 errors by misspelling the author's name, and he used the plural form of "distance". If the search engine accepts 3 errors, the user will still get the relevant information.
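
To make the cost idea concrete, here is a minimal sketch of the edit distance calculation in R (the production code is C++ and is linked below; base R also ships adist(), which gives the same number):

levenshtein=function(a,b)
{
  a=strsplit(a,'')[[1]]
  b=strsplit(b,'')[[1]]
  d=matrix(0,nrow=length(a)+1,ncol=length(b)+1)
  d[,1]=0:length(a)
  d[1,]=0:length(b)
  for(i in seq_along(a))
    for(j in seq_along(b))
      d[i+1,j+1]=min(d[i,j+1]+1,           #deletion
                     d[i+1,j]+1,           #insertion
                     d[i,j]+(a[i]!=b[j]))  #substitution
  d[length(a)+1,length(b)+1]
}

levenshtein("Levenstin distances","Levenshtein distance") #3 errors, as above
adist("Levenstin distances","Levenshtein distance")       #same result with base R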

The challenge comes when you have a dictionary of terms (e.g. more than 1 million) and you want to get similar terms based on the Levenshtein distance. You can visit every entry in the dictionary (very costly) or you can push the dictionary into a trie. Do you need proof of the cost? Here we go:

The red color indicates the performance of the search when all the terms are in the trie; green - the simple dictionary.

Now we come to the second part of the post - why bother plotting such graphs, if we could check a few entries to determine the average time and the winner? The reason is simple - in God we trust, all others must bring data. To put it differently - while profiling code, you should be interested in the average time AND the variation. As you can see in the graph above, the variation of the simple dictionary search is very small - it takes approximately the same time to scan the whole dictionary every time. The trie search has higher variation - a query can take a while or it can finish almost immediately, but overall it works faster. Now, imagine that a programmer wants to decide which implementation of a volatile cache, A or B, is faster. Let's assume that big O notation is not going to help, and she conducts 2 tests for A and 2 for B. While running test A, the cache size expands, while for B it shrinks. As a result, B wins over A and she makes the wrong choice. However, her colleague claims that despite A having greater volatility, it is much faster - and she tried it with 500 queries! Whom should I trust?

I use this piece of code for profiling:

#one column of elapsed times per implementation
simple=read.table('simple.txt')
node=read.table('node.txt')

simple=cbind(simple,'simple')
colnames(simple)=c('time','type')
node=cbind(node,'node')
colnames(node)=c('time','type')

rez=data.frame(rbind(simple,node))

require(ggplot2)

#density of the run times on a log scale, one curve per implementation
ggplot(rez,aes(time,fill=type))+geom_density(alpha=0.6,size=1.3)+scale_x_log10()

The data, the C++ code for the Levenshtein distance and the trie can be found on GitHub.

I found this source very useful: http://stevehanov.ca/blog/index.php?id=114

Vectorized R vs Rcpp


In my previous post, I tried to show that Rcpp is 1000 times faster than pure R, and that generated a fuss in the comments. Being lazy, I didn't vectorize the R code, so in the end I was comparing apples to oranges.

To fix that problem, I built a new script, where I'm trying to compare apples against apples. The first piece of code, named "ifelse R", uses the R "ifelse" function to vectorize the code. The second piece is fully vectorized code written in R, the third is pure C++ code, and the last one is C++, where the Rcpp "ifelse" function is used.
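
To make the two R variants concrete, here is a minimal sketch of the kind of difference I mean, on a made-up calculation (the function that was actually benchmarked is in the GitHub repo linked below):

x=runif(1e7)

#"ifelse R": vectorized through the ifelse() function
ifelse_r=function(x) ifelse(x>0.5,x-0.5,0)

#"vectorized R": the same result with purely vectorized primitives
vectorized_r=function(x) pmax(x-0.5,0)

system.time(ifelse_r(x))
system.time(vectorized_r(x))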

| Name | Seconds |
|------|---------|
| ifelse R | 27.50 |
| vectorized R | 10.40 |
| pure C++ | 0.44 |
| vectorized C++ | 2.24 |

Here we go - vectorization truly helps, but pure C++ code is still 23 times faster. Of course, you pay the price when writing it in C++. I found it a bit strange that the vectorized C++ code doesn't perform that well…

You can get the code from GitHub.

How Do Big Block Trades Affect Stock Market Prices?


I will be giving a presentation on "Optimal transaction cost" in Vilnius on the 16th of August. While preparing the presentation and looking for an optimal execution solution, a natural question arises: does the size of the trade affect the stock market price? I'm sure you would say 100% yes. Well, you would be right, but what is the scale of such an effect? Is it possible to profit from the execution of big block trades?

Such a test is not trivial, and to conduct it you need high frequency data, which is messy in most cases. For testing purposes I chose the BNP Paribas stock from February 2011 to May 2011. Initially, I had more than 460k trades and more than 320k quotes. However, the data was filtered for buyer-initiated trades. To find buyer-initiated trades, I used the Lee-Ready rule - a short description can be found here on page 2. I found out about the Lee-Ready rule while reading Maxdama's last post and a damn good summary (check page 42).
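
For the record, a rough sketch of the Lee-Ready classification in R, assuming each trade already has the prevailing bid and ask attached (the values below are made up; midpoint trades fall back to the tick test, and zero ticks are left unclassified for brevity):

classify_trades=function(price,bid,ask)
{
  mid=(bid+ask)/2
  #quote rule: above the midpoint - buyer initiated, below - seller initiated
  side=ifelse(price>mid,'buy',ifelse(price<mid,'sell',NA))
  #tick rule for trades at the midpoint
  tick=c(NA,sign(diff(price)))
  at_mid=is.na(side)&!is.na(tick)&tick!=0
  side[at_mid]=ifelse(tick[at_mid]>0,'buy','sell')
  side
}

classify_trades(price=c(10.01,10.02,10.015,10.00),
                bid  =c(10.00,10.00,10.01, 10.00),
                ask  =c(10.02,10.02,10.02, 10.02))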

The first chart below shows the average return one trade later (within seconds in most cases), when a big or small trade was done. The X axis represents the difference between the trade price and the following trade's price, the Y axis represents the trade size, and the dot size represents the number of trades within that volume cluster. As you can see, small trades add 0.0004% to the price, while big ones (more than 980 shares) increase the price by 0.0007% on average.

The next figure shows the average return one minute later. This time the difference between small trades and big ones is almost 3 times!

While we can see that stock market prices are affected by big blocks, there's no easy way to profit from it. You have to take into account the bid/ask spread, plus you become a liquidity demander when liquidity is dry. On the other hand, this test shows the cost for each volume cluster, and this cost can be used when choosing an optimal strategy for portfolio/stock liquidation.

Transaction Cost Analysis and Pre-trade Analysis


Transaction cost analysis (TCA) is a framework for achieving best execution in a trading context. TCA can be split into three groups: pre-trade analysis, intraday analysis, and post-trade measurement.

Pre-trade analysis allows us to get insight into the future volatility of the price and to forecast intra-day and daily volumes and market impact. It evaluates all strategies and advises the strategy that is most consistent with the manager's risk preferences.

Intraday analysis is real-time analysis, where the system ensures that the strategy performs in line with the forecast.

Post-trade analysis measures the implementation of the investment decision to ensure that the pre-trade models are accurate and best execution is delivered.

If you want to learn more about transaction cost analysis, I highly recommend Optimal Trading Strategies: Quantitative Approaches for Managing Market Impact and Trading Risk by Robert Kissell. The book not only covers all the types of analysis mentioned above, but also considers the practical aspects of implementing trading strategies and algorithms. The book includes a very interesting part about Principal Bid Transactions, or blind bids.

To build a forecast model we need some input parameters. Pre-trade analysis can be split into two parts - a fundamental part and a statistically based part. It is much easier to derive fundamental facts, such as market capitalization, the company profile and the sectors in which the company operates, than the statistically based parameters. For the latter I'm going to use R and raw stock price data. The former can easily be found on the Internet, Bloomberg or Reuters.

The first input parameter that I'm going to derive is the average daily trading volume of the traded security. It is very easy to obtain - only daily data is necessary. I used a 20-day rolling window to get the average volume. Here's an example for the IBM security:
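
The chart can be reproduced with something along these lines - a sketch assuming the quantmod package (the original plotting code is not shown in this post):

library(quantmod)

getSymbols('IBM')
#20-day rolling average of the daily traded volume
avg_volume=runMean(Vo(IBM),n=20)
plot(avg_volume,main='IBM: 20-day average daily volume')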

The graph shows how the 20-day average volume has evolved since last September - the minimum was 3.5 million and the maximum was 6 million shares traded a day. But the average can be misleading - it takes into account the last 20 days and then derives one number. In most cases, we are going to deal with a short-term prediction such as 1 day or 1 hour, so a 20-day approximation does not make a lot of sense. One possible solution is to use confidence intervals to estimate the daily volume. The 95% confidence interval is widely used in finance - I will use the same interval to answer the following question: based on this interval, what volume will be traded the next day?
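
In practice this boils down to an empirical quantile - a sketch, reusing the IBM series loaded above: the 5% quantile of daily volume is the level exceeded on roughly 95% of days.

daily_volume=as.numeric(Vo(IBM))
#volume exceeded on ~95% of trading days
quantile(daily_volume,probs=0.05)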

The density diagram shows the distribution of the daily volume of the IBM security. As you can see, on most of the days the volume was above 3 million. However, on 5% of the days the volume was below 3 million. Based on this diagram, we can predict that tomorrow's volume will be higher than 3 million.

Before we finish with the daily volume, we have to test for weekly seasonality. If there is weekday seasonality, then the volume forecast has to be adjusted.
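
One way to run such a check - a sketch, again on the IBM daily series from quantmod (the code behind the actual chart is not shown here):

library(quantmod)
getSymbols('IBM')

vol=data.frame(day=weekdays(index(IBM)),volume=as.numeric(Vo(IBM)))
weekday_avg=tapply(vol$volume,vol$day,mean)
#deviation of each weekday from the overall average volume
round(weekday_avg/mean(vol$volume)-1,3)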

The chart above clearly indicates that the traded volume on Monday is below the daily average (by ~5%). There is an increase in the volume on Friday, but its significance is questionable. Let's check the density diagram to get rid of any doubt about Monday's and Friday's volumes.

The density diagram above shows that Monday's peak and body are shifted to the left. On the other hand, we can see that Friday's body is aligned with the other days, only it has a fat tail. Based on this diagram we can drop the Friday volume adjustment and apply only the Monday adjustment.

Once we are finished with the daily data, we can move to intra-day data. When liquidating (or acquiring) a position in a security, the position has to be sliced into smaller parts to avoid influencing the market and to hide the intentions. For this reason, intra-day volume patterns have to be known. With that in mind, let's look at how the volume is distributed in relation to the trading hour.

The graph above shows the hourly volume pattern, where the volume is grouped by hour. The black dots indicate the median for the hour. Please keep in mind that trading starts at 9:30, so the first trading hour has only 30 minutes (if you want to align the first hour with the others, you need to multiply the volume of the first hour by two). As we can see, the first and the last hours are the most heavily traded and the volume drops in the middle of the day.
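
The grouping behind this chart and the trade-size chart that follows can be sketched like this, on a made-up set of trades ('timestamp' and 'size' are illustrative column names - the real input is the filtered tick data):

trades=data.frame(timestamp=as.POSIXct('2011-02-01 09:30:00')+cumsum(rexp(1000,rate=1/25)),
                  size=sample(c(100,200,300,500,1000),1000,replace=TRUE))
trades$hour=as.integer(format(trades$timestamp,'%H'))

tapply(trades$size,trades$hour,sum)    #total volume per hour
tapply(trades$size,trades$hour,median) #median trade size per hour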

The next useful thing, when liquidating a big position, is the average trade size. We need to know how the average Joe behaves and what size he trades.

The chart above shows all trades grouped by hour - the black dots indicate the median of the trades. The following table is supplementary to the chart - here you will find the median for each hour.

| Hour | Median trade size (shares) |
|------|----------------------------|
| 10 | 842.5 |
| 11 | 565.5 |
| 12 | 394.0 |
| 13 | 300.0 |
| 14 | 297.0 |
| 15 | 369.5 |
| 16 | 708.5 |

Once again, average trade size is much higher in the first and last hours and it drops ~2.5 times during the trading session.

The final chart shows the volatility grouped by hour. There is a lot of jitter when the market opens, it becomes calmer during lunch time and volatility slightly increases when the market closes. These are the numbers for each hour:

| Hour | Volatility |
|------|------------|
| 10 | 0.16% |
| 11 | 0.09% |
| 12 | 0.08% |
| 13 | 0.05% |
| 14 | 0.05% |
| 15 | 0.06% |
| 16 | 0.07% |
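
A sketch of how such numbers could be computed from minute bars ('timestamp' and 'price' are illustrative names; the bars below are simulated):

bars=data.frame(timestamp=as.POSIXct('2011-02-01 09:30:00')+60*(0:389),
                price=100*cumprod(1+rnorm(390,sd=0.0005)))
bars$ret=c(NA,diff(log(bars$price)))
bars$hour=as.integer(format(bars$timestamp,'%H'))

#standard deviation of one-minute returns per hour, in percent
round(100*tapply(bars$ret,bars$hour,sd,na.rm=TRUE),2)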

In this article we analyzed a list of potential input parameters for pre-trade analysis and further forecasting. The list is not static and can be extended with supplementary parameters, such as the bid-ask spread distribution, etc. The next step is to aggregate these parameters and build models to forecast volatility and volume.