Kaggle: TalkingData AdTracking Fraud Detection Challenge

How to Earn a Silver Medal in a Kaggle Competition as a Beginner

Kaggle is a data science and machine learning competition platform owned by Google, on which companies such as Facebook, Google, and Quora publish their data and problems. Data scientists, data-mining experts, and machine-learning engineers from all over the world, who call themselves “Kagglers”, compete to produce the most accurate predictions by analyzing the data and building predictive models.

The contest that we took part in was the TalkingData AdTracking Fraud Detection Challenge, hosted on Kaggle. TalkingData is China’s largest independent big data service platform. Online advertising companies such as TalkingData frequently encounter click fraud, which produces inaccurate click data and, at large volumes, wastes both time and money. TalkingData handles over 3 billion clicks per day and uses machine learning methods to build a blacklist of IP addresses and devices belonging to users who click on ads but do not install the apps.

The data provided by TalkingData contains information about users who click on advertisements, such as IP addresses, apps, devices (for example iPhone, Xiaomi, and Huawei), operating systems (such as iOS or Android), channels, the time of each click, and so on. Our job was to use the training data, roughly 200 million clicks over a four-day period with the features mentioned above, to build a reasonable model that predicts whether a click is fraudulent. The score is determined by how well our predictions rank the clicks in the test data, measured by the area under the ROC curve (AUC).
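To make the shape of the data more concrete, here is a minimal sketch of how a single click record might be represented, along with the kind of per-IP counting that a blacklist approach relies on. The field names, the integer-coded IDs, the `installed` label, and the sample values are all assumptions made for illustration; they are not taken from the actual dataset or from our model.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Click:
    """One ad click. Field names and integer IDs are illustrative assumptions."""
    ip: int
    app: int
    device: int
    os: int
    channel: int
    click_time: datetime
    installed: int  # hypothetical label: 1 if the click led to an install, else 0


def clicks_per_ip(clicks):
    """Count how many clicks each IP address produced."""
    return Counter(c.ip for c in clicks)


def install_rate_per_ip(clicks):
    """Fraction of each IP's clicks that ended in an install."""
    totals = clicks_per_ip(clicks)
    installs = Counter(c.ip for c in clicks if c.installed == 1)
    # Counter returns 0 for IPs that never installed anything.
    return {ip: installs[ip] / n for ip, n in totals.items()}


# Toy, made-up records: one IP that clicks repeatedly without installing,
# and one IP whose single click leads to an install.
sample_clicks = [
    Click(1001, 3, 1, 13, 379, datetime(2017, 11, 6, 14, 32, 21), 0),
    Click(1001, 3, 1, 13, 379, datetime(2017, 11, 6, 14, 32, 22), 0),
    Click(1001, 12, 1, 13, 178, datetime(2017, 11, 6, 14, 32, 25), 0),
    Click(2002, 9, 1, 19, 213, datetime(2017, 11, 6, 15, 1, 7), 1),
]

print(clicks_per_ip(sample_clicks))
print(install_rate_per_ip(sample_clicks))
```

An IP with many clicks and an install rate near zero is the kind of pattern a blacklist would flag. A real model would combine many such aggregates rather than rely on a single threshold.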

NOTE: You can see our full experience and the details of our model, including feature engineering, the training algorithm, and some blending techniques, in this report.