# Applying Machine Learning to Peer to Peer Lending

Peer to peer lending allows to lend money to unrelated individuals without going through traditional financial service such as bank, credit union, etc. Nevertheless, there is an intermediary - service and platform provider. The provider verifies the identity of the borrower and income status, processes the payments, promotes its platform, deals with bad loans or demands bankruptcy for the borrower.
The advantage of peer to peer lending for the borrowers is lower interest rate and higher rate for lenders. However, higher rate comes with higher risk - the return is more volatile than a bank deposit.

Regardless of many lending platforms such us Prosper.com (US), zopa.co.uk (UK), www.fundingcircle.com (UK), auxmoney.com (DE), pret-dunion.fr (FR), comunitae.com (ES), lendico.com (global), majority of platforms accept only the investment from local investors. However bondora.com (EE) allows invest into three markets - Estonia, Finland and Spain and accepts investors from across Europe. Additionally, the rate of return at bondora.com is not fixed as in other platforms and allows much higher returns. However, there is a possibility that you may lose some or all of your initial investment as it is not protected by any financial compensation scheme.

The company behind bondora.com is isePankur AS, which is based in Estonia, a small country in north of Europe. The really amazing thing about bondora.com, that they share data with everyone. The data-set gives us an opportunity to glimpse at the performance of the company and a possibility to build our own credit scoring model!

The data goes back to 2009 and the chart below shows the total number of loans and funded loans per month. It looks like the business exploded in 2013 and the following charts will give us a few clues.

In 2013 bondora.com became more active in Finland and Spain markets, though it was in 2014, when the grow skyrocketed in these markets.

Another big change, what might ignited the grow of bondora.com is shift in the duration of the loans. Dominant loan duration before 2013 was 1-2 years, but in 2013 the company started issuing 5 years loans, which became the primary duration in 2014 and was half of all loans.

The latest data-set has data about 29688 loans and 172 columns or features (depending which parlance you prefer). Below you can see a partial print screen of the interface and dozen of the features.

### Model building

Now, that we are familiar with bondora’s data-set, let’s move to model building. The prediction model can predict two types of outcomes - categorical (yes/no, true/false classes) or numerical (one is less than one hundred). Although most credit scoring models are built to return a credit score for a borrower, I have opted for simple model, where the outcome has two classes: good or bad.

The definition of “good” class is straight forward - the class in which you are willing to invest or give a credit, but “bad” class definition is complicated. The data-set has data about the borrowers who were late with their payments for 7, 14, 21, 60 days and defaulted loans. How bad is a borrower if he is late for X days? True, he doesn’t respect the schedule and the contract for various reason such as harsh life, distraction or any other reason. However, once he is back on track he pays what he owns, plus late charges, which leads to higher return for additional risk.
The defaulted loans really sounds as bad loans, right? Well, what if the loan defaults, but you get back the principal and partial interest rate? Doesn’t sound that bad, does it? What you really don’t like is the default on the loan and zero payments - these loans are the fraud and you want to avoid them. So let’s mark them as a “bad” ones.

Beside choosing the outcome, it took me awhile to realize another problem with the data-set - the shift in the business model. Nowadays, most of the loans are issued for 5 years and the data-set doesn’t event have data on matured 5 years loans! So I did the trick - I marked all 5 years loans as repaid which are still “alive” after 2-3 years.

While working on a few machine learning projects I quickly learned, that the biggest impact on performance of the model comes not from the fancy machine learning algorithms, but from well engineered features. In the chart below you can find, that 3rd feature is “total_interest”, which is made of “Interest” and “LoanDuration”. The two features perform well, but the derived feature has much bigger weight.
Additionally, I have added data about VIX index. The index tracks volatility of the stock market via S&P500 index - its value increases during the crisis and falls back during calm times. By adding independent source the performance of the model increased 2%.

After initial cleaning of the data-set and feature engineering it was time to build a simple model. My favorite machine learning algorithm is Random Forest for the following reasons: you can feed almost any raw data and it chew happily; the algorithm itself is easy to understand, nevertheless it is kind of black-box; it gives the weights of the features:

### Model metrics

In classification task precision and recall are used frequently for model metrics. The predicted value can be assigned to four classes: True Positive - real fraud (model predicted True and value was True), True Negative - not a fraud (model predicted False value and value was False), False Positive - not fraud marked as a fraud (model predicted True, however value was False) and False Negative - real fraud marked as not fraud (model predicted False, but the value was True).

In case of p2p lending, if you commit False Positive error (Type I), you just miss one investment. You main concern is False Negative (Type II) errors, because you will be loosing money on the bad investment. Positive predictive value metrics is used for performance evaluation: sum(True positive)/sum(Test outcome positive)

### Blending

Funny, but the blending works well with ML algorithms and with the people. Michael Nielsen in his book “Reinventing Discovery: The New Era of Networked Science” gives many examples how the collective intelligence can be more powerful than single mind. The idea is very simple - if you gather predictions from dozen of people or ML models then the average score will better than the best single prediction. There is one pitfall - if the predictions are given by “herd mind” then the result most likely be horrible. So, in ML environment try to include very different algorithms - decision trees, linear models, neural networks, etc. to sustain diversity.

For my final model I blend tree algorithms - Random forest, SVM and generalized boosted modeling (GBM). If all of them predict, that the loan is not a fraud, then I will make an investment.

### Real time data

The modeling part is done, but without real time data it is just waste of time. Fortunately, there is a nice Python framework for web crawling - Scrapy. Initial time investment in the tool might look significant, but because it is robust it won’t be your concern any time soon, unless the platform gets face-lift…

### Automated investment

Selenium is a suite of tools to automate web browsers across many platforms. It is widely used for web interface testing purpose, however any web-based task can be automated.

Once I have real time data I feed my model with it to find out which loans are good for the investment. Acquired list is send to Selenium script, which logins into the platform and makes the investments.

My first idea was to use Raspberry Pi for the project, however I had the problems setting up R-language and Python frameworks. So, I rented an instance on DigitalOcean for 10 dollars a month (there is 5$/month option as well) and it worked well. Meanwhile, I realized, that I don’t need the server for 24 hours a day. The solution was to power-on four times a day and shutdown the server once the job is finished. But as you probably know, powering-off your server is not enough to save you from paying - you need to archive your virtual instance (the same applies to Amazon AWS). So, I came up with the script, which creates a virtual image from the archive, powers it on, runs the crawler, runs R mining module, makes the investments if necessary and then shutdowns the instance, archives the image and deletes virtual machine. Does it sounds good? Well, it was good, until I went on holiday for one week without almost any access to Internet. At the end of my vacation, I found, that every day four new servers were created and were still spinning… Turns out, that Raspberry Pi was able to initiate new instances, but wasn’t able to shutdown and delete. The support at DigitalOcean have asked for the log files from my server and I ended up paying the bill, because it was almost non-existant on my RaspPi. The lesson taken - log as much as possible and incorporate health checks for virtual machine in your scripts. ### Results I started my investments at Bondora in September 2014. At the beginning all my investments were done via Bondora’s investment engine, where you can define investment parameters (country, risk profile, etc.). Somewhere in November - December I decided, that I will rely on my own engine only. Below you can see, that the engine based on machine learning algorithms does good job by avoiding bad borrowers. # Data Based Review of Strapless Mio Link HRM | Comments Our bodies generate a lot of data - blood pressure, heart rate, amount of glucose in blood, etc. However, the struggle is how to collect data? Until recently, heart rate monitors(HRM) were popular only among practitioners of endurance sports - runners, cyclists, swimmers, etc. Nevertheless, a strapless, wristband heart rate monitor seduces a new category of users - data minded people who are interested in quantified self movement. Mio Link looks like a watch on your wrist, allowing to wear it around the clock. Have you ever tried to wear chest based HRM outside workout? The technology behind wristband HRM is quite simple - integrated LEDs beams light into the skin and pulsing volume of blood flow is collected. Wristband HRM might provide better accuracy than chest based HRM, because latter is affected in lower air temperatures. Below you can see data comparison between Mio Link and Garmin Soft Strap from one of my workouts: There some data discrepancy, but most importantly peaks and valleys are intact. However, sport workouts are limited in time and I wanted to know, how does wristband HRM work around the clock. The second chart shows data recorded during the day. As you can see the error rate (red color indicates potential errors) is much higher. This might be explained by the fact that my movements were not constant or repetitive contrary to running or sleeping. The third chart shows the data gathered during the night. The data bears some noise as well, but the spikes indicate shift in data blocks and presumably body movements. It would be interesting to know if sleep stages can be extracted from heart rate data. If you plan to use it on daily basis, keep in mind, that the battery last 8-10 hours. It might sound bad, but during the day you have plenty of time windows when you can charge the battery without loosing a lot sensitive data. For example, while you sit in front of computer your heart rate most likely will be low and stable and it is perfect time for charging. If the idea of wristband HRM sounds appealing, beside Mio link, you can check for Scosche RHYTHM+ as well, which is equivalent of the former. # Credit Card Fraud Detection | Comments During the last Data Science community meeting in Luxembourg Phd Candidate Alejandro Correa Bahnsen gave a presentation about Credit Card Fraud Detection(CCFD). In short - CCFD is just another machine learning problem which is similar to Network Security and Intrusion Detection(NSID) problem, but it has its own obstacles. From business perspectives - card fraud detection system directly impacts profitability of all credit card operators, therefore are very desirable. The cost of card fraud in UK alone in 2006 was estimated ~ 500 millions pounds. Assuming, that CCFD system can identify 30% (Churn models at TelComs are able to save that much) of all fraud would lead to 150 millions pound savings a year. Hence, we have a desirable product with price range from 1.5 to 15 millions pounds in the UK. Here is the catch - in any given country there are a few credit card operators, for ex. only CETREL operates in Luxembourg. So, if you want to sell a solution you play all or nothing game. Additionally, if you wish to build a CCFD system or at least a prototype you most likely missing the data and you won’t get it. Chicken and egg problem. How does the data set might look? Think of millions rows where less than 0.002% of data are the fraud operations. If you train your model with unadjusted data set, it will predict all future events as normal operations. In order to avoid such behavior you need to throw away most of the data with normal operations and balance data where distribution would be 1% of fraud vs 99% normal or 5% vs 95%. You can play with freely available [network intrusion data] to get idea how imbalanced data would look like. Another thing to keep in mind is the size of data set. Conventional wisdom - the more data you have, the better model you can build, but it is not true if you run a real time system, where latency is big deal. In such case you have to think about something similar to map-reduce framework, where you keep only the averages of the variables per client. CCFD systems work as binary classificators, where the response either fraud or normal operation, meaning that it doesn’t take into account how much fraud cost. Alejandro tries to incorporate loss-profit function, where each operation has its own cost. If you think about his approach - sounds as regression problem to me. And the last thing - I suppose, it’s worth to try to run unsupervised learning system in parallel. Unsupervised CCFD would issue a lot false alerts at the beginning, however it would considerably improve over time with good feedback from supervised CCFD. # Kaggle Challenge - Event Recommendation Engine | Comments Event Recommendation Engine Challenge was my second challenge at Kaggle and I finished 15th out of 225 on final (private) leaderboard. I was able to finish 1st on public leaderboard. Believe or not but the difference doesn’t come from over fitting but rather from an external data source (Google Maps) which was forbidden. I did read the rules, but such important restriction was buried under additional layer of rules which I didn’t bother to read. So moral of the story - if you are doing well then read the rules for second time. Nevertheless it is strange, why the host of the challenge didn’t preprocess the data and didn’t convert location of the users into latitude/longitude format? It would definitely lead to better models, as in my case such conversion gave + 4% in precision. For this competition I used random forest almost exclusively and devoted all my time for the feature derivation. For the final prediction I built three models and then combined them together: set.seed(333) final_model3=randomForest(factor((interested-not_interested)/2+.5) ~ .,data=final_model,importance=TRUE,nodesize=4) set.seed(33) final_model1=randomForest(factor((interested-not_interested)/2+.5) ~ .,data=final_model,importance=TRUE,nodesize=4) set.seed(3) final_model2=randomForest(factor((interested-not_interested)/2+.5) ~ .,data=final_model,importance=TRUE,nodesize=4) final_model=combine(final_model3,final_model1,final_model2)  Below you can find a chart with most important features of my final model: • time_diff - From early beginning I found that the difference between when the event is scheduled to begin and when the user saw the event ad is important feature which is easy to derive. • popularity - How many users said they are interested in the event. • start_hour - Turns out, that it is important to know at what hour an event is going to begin. • friends - The name of this feature might be misleading, nevertheless it keeps how many user friends are invited to the event. • joinedAt The difference between year when the user joined the service and 2000-01-01. I was surprised to find, that such feature has weight at all. • timezone Had few NA values which I replaced by 0. Then I converted timezone numerical value into factor of two hours: 14-12, 12-10 and etc. • birthyear Numeric value. • weekdays On which weekday did the event happen? (Monday, Tuesday and etc). • friend_yes, friend_no, friend_maybe Number of friends which are interested, not interested or maybe in the event. • c_XXX All c_xxxx features were used without preprocessing. • locale I used the first two letter of locale variable. • location_mat Once I found, that external sources such as Google maps are forbidden then I tried to determine if user location shares the same words with event country, state and city descriptions. If it does I would add +1 (max 3) for location_mat variable. • distance (forbidden) the feature scored high, but I had to remove it. The first step was to obtain latitude and longitude for users with known location. If you are interested here is the source code which shows how easily it can be done in R. Need to say, that I did manual and automated data cleaning - converted states names and some frequent errors like “City name 19”. Once I had the coordinates of the users I was able to calculate the distance to event. Then I used k-means to predict user location for those who did not specified it, based on friends location. For example if 5 out of 8 friends of the user are based in Indonesia then user is given Indonesia location and distance to the event is calculated. Here’s the source code for prediction of user location. Click here if you are interested in the source code of my solution. # Machine Learning for Hackers | Comments Which way do you prefer to learn a new material - deep theoretical background first and practice later or do you like to break things in order to fix them? If latter is your way of learning things, then most likely you will enjoy Machine Learning for Hackers. The book has chapters on machine learning techniques such as PCA, kNN, analysis of social graphs hence even advanced R users might find something interesting. So I want you to show you my example of visualisation of similarity between parliamentarians in Lithuania which idea is taken for chapter 9. In most of the cases you should be able to get access to voting results of legislative body in your country. Nevertheless the data can be buried in “wrong” format such as html or pdf. I use Scrapy framework to parse html pages, however I have faced a problem, when my IP address was blocked due to many requests (10 000) within 2 hours. But in cloud age the problem was quickly solved and I made a delay in my crawler. Here is the examples of the data in CSV format. With data in hand it was easy to proceed further. To find similarities between parliamentarians I took voting results of approximately 4000 legislations and built a matrix, where rows represent parliamentarians and columns - legislations. “Yes” votes were encoded as 1, “No” as -1 and the rest as 0. R has a handy function dist to compute the distances between the rows (parliamentarians) of a data matrix. The result of the function is one dimension data of the distance between parliamentarians, however to reveal the structure of a data set we need two dimensions. Once again, R has a function cmdscale which does Classical Multidimensional Scaling (CMS). I found this document very useful in explaining Multidimensional Scaling. Here is the final result: Click on the image to enlarge. The plot above reveals, that right wing party TSLKD has a majority in parliament and LSDP (socialists) are in opposition and liberals (LSF, JF, MG) are in the center. You might argue, that that is already known, however the plot is based on actual data, therefore differences in voting support outlooks of the parliamentarians(right, central, left). The map shows which members of the party are outliers and which one from the other party can be invited while forming a new parliament (second tour of the election is on the way). Members of the left wing are mixed up and it would make sense to them to merge or form a coalition. Are you looking for source code? Click here. # Garmin Data Visualization | Comments People go on rage, when governments initiate surveillance projects like CleanIT, nevertheless share very private data without a doubt. I have to admit, that some data leaks are well buried in the process. Take for example Garmin which produces GPS training devices for runners. In order to see your workouts you are forced to upload sensitive data on internet. In response you are given a visualization tool and a storage facility. What are alternatives? It seems, that in the past there was a desktop version, however I was not able to find it. So, we are left with the last option - hack it. First of all you need to transfer data from Garmin device to computer. I own Forerunner 610 with relays on ANT network and I found Python script with takes care of data transfer. Once data is transfered there is another obstacle - Garmin uses a proprietary format FIT. In order to tackle this problem I use another Python script which I have adapted to have csv format. Once data is in CSV format R can be used to plot data. I had a lot of fun by trying to understand Garmin longitude and latitude coordinates. Here is a short explantion by Hal Mueller: The mapping Garmin uses (180 degrees to 231 semicircles) allows them to use a standard 32 bit unsigned integer to represent the full 360 degrees of longitude. Thus you get the maximum precision that 32 bits allows you (about double what you get from a floating point value), and they still get to use integer arithmetic instead of floating point. # Building a Presentation, Report or Paper in R | Comments If you need to build a presentation, obviously you have following options: • Powerpoint alike presentation • Online engines • LaTex The first two are beloved by business people and the third one is widely used in academia. The objective of the first group is shiny presentation, contrary to the second where asceticism and demand for automation are top priorities. However, if you are data scientist or any other data specialist with a need to build an automated report, then you know, that LaTex is just wrong. LaTex allows you to build a shiny presentation or outstanding paper, however it can take light years to build something useful for beginners . If you never tried LaTex here is an example of the monster - you literally have to code a document or presentation: <code>\documentclass{article} \title {Investment strategy} \author {Dzidorius Martinaitis} \begin{document} \maketitle</code>  So, what do you do, if you need only 1% of all LaTex features and a report/document needs to be build automatically? Turns out, that HTML little brother Markdown is saving the world. Markdown(.md) source files are easy to read and easy to write and you can convert it into .html, .pdf, .docx, .tex or any other format. There are many ways to do conversion, however I use Pandoc utility. By the way this post was written in markdown in Vim and you can check the source file. However, the nicest thing about Markdown is integration with R. You can build your report in one file, where R code would be embed in Markdown. Knitr package will help you to convert R code into Markdown simply by calling this piece of code: <code>require(knitr); knit('workshop.Rmd', 'workshop.md');</code>  Below you will find an excerpt of .Rmd file which is mix of R and Markdown: <code>Get the data === Who is tweeting about #Haxogreen {r results=asis,comment=NA, message=FALSE} require(twitteR) load('tweets.Rdata') names=sapply(tweets,function(x)x$screenName)
rez=(aggregate(names,list(factor(names)),length))
rez=rez[order(rez$x),] colnames(rez)=c('name','count') options(xtable.type = 'html') require(xtable) xtable(t(tail(rez,6)))  Plot top10 tweeters === {r topspam, figure=TRUE,fig.cap=''} barplot(tail(rez$count,10),names.arg=as.character(tail(rez\$name,10)),cex.names=.7,las=2)
</code>


Here is a workshop presentation which contains the example above - I built it for Haxogreen hackers camp and source code can be found on gitHub.

# Data Mining for Network Security and Intrusion Detection

In preparation for “Haxogreen” hackers summer camp which takes place in Luxembourg, I was exploring network security world. My motivation was to find out how data mining is applicable to network security and intrusion detection.

Flame virus, Stuxnet, Duqu proved that static, signature based security systems are not able to detect very advanced, government sponsored threats. Nevertheless, signature based defense systems are mainstream today - think of antivirus, intrusion detection systems. What do you do when unknown is unknown? Data mining comes to mind as the answer.

There are following areas where data mining is or can be employed: misuse/signature detection, anomaly detection, scan detection, etc.

Misuse/signature detection systems are based on supervised learning. During learning phase, labeled examples of network packets or systems calls are provided, from which algorithm can learn about the threats. This is very efficient and fast way to find know threats. Nevertheless there are some important drawbacks, namely false positives, novel attacks and complication of obtaining initial data for training of the system. The false positives happens, when normal network flow or system calls are marked as a threat. For example, an user can fail to provide the correct password for three times in a row or start using the service which is deviation from the standard profile. Novel attack can be define as an attack not seen by the system, meaning that signature or the pattern of such attack is not learned and the system will be penetrated without the knowledge of the administrator. The latter obstacle (training dataset) can be overcome by collecting the data over time or relaying on public data, such as DARPA Intrusion Detection Data Set. Although misuse detection can be built on your own data mining techniques, I would suggest well known product like Snort which relays on crowd-sourcing.

Anomaly/outlier detection systems looks for deviation from normal or established patterns within given data. In case of network security any threat will be marked as an anomaly. Below you can find two features graph, where number of logins are plotted on x axis and number of queries are plotter on y axis. The color indicates the group to which points are assigned - blue ones are normal, red ones - anomalies.

Anomaly detection systems constantly evolves - what was a norm year ago can be an anomaly today. The algorithm compares network flow with historical flow over given period and looks for outliers with are far away. Such dynamic approach allows to detect novel attacks, nevertheless it generates false positive alerts (marks normal flow as suspicious). Moreover, hackers can mimic normal profile, if they know that such system is deployed.

The first task when implementing anomaly detection (AD) is collection of the data. If AD is going to be network based, there are two possibilities to collect aggregated data from network. Some Cisco products provide aggregated data as Netflow protocol. However, you can use Wireshark or tshark to collect network flow data from the computer. For example:

tshark -T fields -E separator , -E quote d -e ip.src -e ip.dst -e tcp.srcport -e tcp.dstport -e udp.srcport -e upd.dstport -e tcp.len -e ip.len -e eth.type -e frame.time_epoch -e frame.len


Once you have enough data for mining process, you need to preprocess acquired data. In the context of intrusion, anomalous actions happen in bursts rather than single event. Varun Chandola et al. proposed to derive following features:

• Time window based: Number of flows to unique destination IP addresses inside the network in the last T seconds from the same source Number of flows from unique source IP addresses inside the network in the last T seconds to the same destination Number of flows from the source IP to the same destination port in the last T seconds host based - system calls network based - packet information Number of flows to the destination IP address using same source port in the last T seconds

• Connection based: Number of flows to unique destination IP addresses inside the network in the last N flows from the same source Number of flows from unique source IP addresses inside the network in the last N flows to the same destination Number of flows from the source IP to the same destination port in the last N flows Number of flows to the destination IP address using same source port in the last N flows

Below you can find an example of feature creation in R. The dataset was created by calling tshark script, which is specified above.

#load data
#get rid of everything below min. in timestamp
tmp[,10]=as.integer(as.POSIXct(format(as.POSIXct(as.integer(tmp[,10]),origin="1970-01-01"),"%Y-%m-%d %H:%M:00")))
#fix some rows
tmp=tmp[-which(sapply(tmp[,1],function(x) nchar(x)>15)),] tmp=tmp[which(!is.na(tmp[,4])),]

#aggregate date by 5 mins. it assumes, that flow is continuous
factor=as.factor(tmp[1:5000,10])

feature=do.call(rbind, sapply(seq(from=1,to=length(factor),by=4),function(x){ return(list(ddply( subset(tmp,factor==levels(factor)[x:(x+4)]),.(V1,V4),summarize,times=length(V11),.parallel=FALSE ))) }))


After preprocessing the data we can apply local outlier detection, KNN, random forest and others algorithms. I will provide R code and practical implementation of some algorithms in the following post.

While preparing this post, I was looking for the books, I found only few books covering data mining and network security. To my surprise Data Mining and Machine Learning in Cybersecurity book includes both topics and it is well written. However, if you are security specialist looking for data mining books, you can read my summary of “Data Mining: Practical Machine Learning Tools and Techniques”

# My First Competition at Kaggle

For me Kaggle becomes a social network for data scientist, as stackoverflow.com or github.com for programmers. If you are data scientist, machine learner or statistician you better off to have a profile there, otherwise you do not exist.

Nevertheless, I won’t bet on rosy future for data scientist as journalists suggest (sexy job for next decade). For sure, the demand for such specialists is on rise. However, I see one big threat for data scientist - Kaggle and similar service providers. You see, such services allows to tap high end data scientists (think of PhD in hard science) at minuscule fraction of real price. Think of Hollywood business model - top players get majority of the pool and the rest is starving. If you try the same service model on IT projects you will most likely get burned. My reasoning can be wrong, but I suspect, that project timespan is the issue - IT projects can take for while to finish (1-10 years), but main stream ML project won’t take that long.

Notwithstanding these obstacles, machine learning, information retrieval, data mining and etc. is a must with ability to write code for production, deal with streaming big data and cope with performance of intelligent system. Then, in programmers parlance, you will became “data scientist ninja” and every company will die for you. There is a good post on the subject on mikiobraun blog, but I mind you, that it is a bit controversial.

Although for last 4 years I often has been working on financial models and time-series, this competition added a new experience to me and hunger for the knowledge. During competition I found this book very practical and plentiful of ideas what to do next: Data Mining: Practical Machine Learning Tools and Techniques. As complimentary book I used Data Mining: Concepts and Techniques, though most of information can be found in one of them. I will try to summarize some chapters in my own story.

Understanding the data. ”Online Product Sales” competition metadata (data about data) is miserly - there are three types of the data - the date fields, categorical fields, quantitative fields and response data for next 12 months. However metadata is most important element in all ML projects, which can save you a lot of time once you understand it better and it leads to much better forecast if you have “domain knowledge”.

Cleaning data. There is famous phrase: “garbage in garbage out”, meaning, that before any further action you have to detect and fix incorrect, incomplete or missing data. You have many possibilities to deal with missing data - remove all rows, where the data is missing; replace it with mean or regressed value or nearest value and etc. If your data is plentiful and missing values are random (meaning, that NA values do not bear any information) - just get rid of them. Otherwise you need impute new values based on mean or other technique. Mean based replacement worked best for me in this competition. Outliers are another type of the troubles. Suppose, that variable is normally distributed, but few variables are far away from the center. The easiest solution would be to remove such values - as many do in finance by removing “crisis period”. When next crisis hits, the journalists are rushing to learn a new buzzword- black swan. Turns out, that outliers can’t be ignored, because the impact of them is huge. So be precautious while dealing with outliers.

Feature selection. It was surprising to me that too many features or variables can pollute forecast, therefore you need to do feature selection. Such task can be done manually be checking correlation matrix, co-variance and etc. However, random forest or generalized boosted methods can lead to better selection. In R you just call randomForest() or gbm() and job is done.

Variable transformation - a way to get superior performance. “Online Product Sales” competition has two date fields, however these fields encoded as integers. By transforming these variables into date and retrieving year and month led to better performance of the model. In most of cases taking logarithm for numeric fields gives performance boost. Scaling (from 0 to 1 or from -1 to 1) and centering (normal distribution) might be considered when linear models are in use. It is worth to transform categorical variables as well, where 1 would mean, that a feature belongs to the group and 0 otherwise. Check model.matrix function in R for latter transformation and preProcess function in caret package for numerical variables.

Validation stage - helps you to measure performance of the model. If you have huge database to build a model you can divide you set into two/three parts - for training, testing and cross validation and you are ready to start. However, if you are not so lucky, then other methods come to play. Most popular method is division of the set into two groups, namely “training” and “test” and rotating it for 10 times. For example, you have 100 rows, so you take first 75 for training and 25 for test and you check the performance ratio. In the next step you take the rows from 25 to 100 for training and you use first 25 for test. Once you repeat such procedure 10 times, you have 10 performance ratios and you take average of it. Stratified sampling is a buzzword, which you should know when you do a sampling. Keeping all this information in mind I wasn’t able to to implement accurate cross validation and my results differ within 0.05 range.

Model selection and ensemble. Intuitively you want to choose the best performing algorithm, however the mix of them can lead to superior performance. For regression problem I trained four models (two random forest versions, gbm, svm), made the predictions, averaged the results and that led to better prediction.

# GitHub Data Analysis

Few weeks ago GitHub announced, that its timeline data is available on bigquery for analysis. Moreover, it offers prizes for the best visualization of the data. Despite my art skills and minimal chances to win beauty contest, I decided to crunch GitHub data and run data analysis.

After initial trial of bigquery service, I found hard to know, what price, if any, I’m going to pay for the service. Hence, I pulled the data (6.5 GB) from bigquery on my machine and further I used my machine for analysis. Bash scripts have been used to clean up and extract necessary data, R for data analysis and visualization and C++ for text extraction.

GitHub dataset is one table, where each row consist of information about repository (i.e. path, date of creation, name, description, programming language, number of forks/watchers and etc.) and action, which was done by user (i.e. username, location, timestamp and etc.).

As a result, we can check how GitHub users actions are spread over time during the day. The X axis on the graph below is labeled with the hours of the day (GMT) and the Y axis represent median values of the actions for each hour. From it, we can make a deduction, that highest load for GitHub can be expected between 15:00 and 17:00 GMT and lowest to be expected between 05:00 and 07:00 GMT. The color of the line indicates how busy was the day based on quantiles: green are calm days (20% of days), blue - normal days (50% quantile) and red are busy days (80% quantile). I should to mention, that auto-correlation or serial correlation is high (70% for following hour), which means, that busy hours tend to be followed by busy hours and calm hours tend to be followed by calm hours. Moreover, busy days tend happen after busy days.

Second graph below shows median of actions divided by weekdays. There is not big surprise - weekends are more slow than weekdays, nevertheless the programmers are slightly less productive on Mondays and Fridays.

The analysis of creation of new repository shows, that the pattern of busy or calm hours remains over the years. This can be attributed to the fact, that majority of the users comes from North America and Europe. Another hypothesis can be drawn from this information, that number of creation of the new repositories grow exponentially. However, I mind you, that the graph below is biased - most likely, GitHub users update recent projects, consequently more recent projects appeared on timeline. Even though, 2009-2011 years show exponential grow. The X axis of the graph below is labeled with the hour of the day, the Y axis - log of median values of new repositories.

Following graph shows the number of forks per project (the X axis, log scale) versus number of watchers (the Y axis, log scale). As expected, there is linear correlation between forks and watchers. Even so there is something interesting about outliers, which are below bottom line - the projects, where number of watchers is low, but number of forks is high. These are anomalies and worth to check.

The next thing to do is to look at the repository description. Let’s group the repositories by programming language and count most dominant words in the description. The graph below has C++ word cloud on the left and Java - right . C++ projects are about library, game, simple(?), engine, Arduino. Java is dominated by android, plugin, server, minecraft, spring, maven.

Ruby (left) vs Python(right ):

“Surprise”, “surprise” - R projects (left) are largely about data analysis, however “machine” word, which corresponds to Machine learning is very tiny. Shell (right) is dominated by configuration, managing, git(?).

GitHub dataset includes location field. Unfortunately, the users can enter whatever they want - country, city or leave it empty. Nevertheless, I was able to extract good chunk of actions, where location field has meaningful value. The video below shows country based users activity, where dark red corresponds to high activity and light red - minor. Only 30 most active countries are included, the rest are grey. The same pattern persist over the days - activity in Asia increases around midnight, Europe wakes up around 8:00 or 9:00, where America starts around 15:00. Who said, that hackers and programmers work at night?

What else can be done with GitHub dataset? Most repositories have description field, which can be used to find similar projects by implementing tf-idf method. I tried that method and the results are satisfying.

Most of the graphs shown above are reproducible (except word clouds) and the code can be found on GitHub.