http://www.csmining.org/cdmc2016/

Data Mining Tasks Description

Task 1: 2016 e-News categorisation

This year, the dataset is sourced from six online news outlets:

The New Zealand Herald (www.nzherald.co.nz), Reuters (www.reuters.com), The Times (www.timesonline.co.uk), Yahoo News (news.yahoo.com), BBC (www.bbc.co.uk) and The Press (www.stuff.co.nz).

The five selected news categories are business, entertainment, sport, technology, and travel. Each document in the dataset was labelled manually by skimming the text and determining its category. In the provided data files, each news item is formatted as a single line of plain text whose last character is the class label (for training data); all punctuation and symbols were removed when the data were prepared.
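The line format described above can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_training_line` and the sample line are made up for illustration, and it assumes that any whitespace in front of the final label character is just a separator.

```python
def parse_training_line(line):
    """Split one training line into (text, label).

    Assumes the class label is the last non-whitespace character of the
    line, as the task description states for training data.
    """
    stripped = line.rstrip()  # drop the newline and any trailing spaces
    if not stripped:
        raise ValueError("empty line")
    label = stripped[-1]
    text = stripped[:-1].rstrip()  # remove the separator before the label
    return text, label


# Made-up line for illustration only; real lines hold encrypted text.
sample = "xq7 bb2 kk9 3\n"
text, label = parse_training_line(sample)
print(repr(text), repr(label))
```

Test lines carry no label, so a helper like this would apply to the training file only.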

Note that the dataset text is encrypted for the sake of fair play, and this task is not intended as a decryption exercise. Decryption techniques are therefore prohibited and must not be used in your competition methods. The results of any participant found to have engaged in this misconduct will be declared void.

The statistical information of the training dataset is summarised below:

Topic            No. of News
Business         361
Entertainment    343
Sport            363
Technology       356
Travel           362

Task 2: UniteCloud Operation Log for Anomaly Detection

UniteCloud is a resilient private cloud infrastructure created at Unitec Institute of Technology, New Zealand, using OpenNebula for cloud orchestration and KVM for virtualization.

This dataset consists of operational data captured from the live UniteCloud servers at a 1-minute sampling interval. Each sample has 243 features, which correspond to operational measurements from 243 sensors on the UniteCloud servers. The data are labelled according to the anomalous events and anomaly categories identified in the collected logs. The supplied training dataset contains 57,654 samples, each with 243 sensor values; the last column holds the label, and non-zero labels indicate the seven anomalous events.

The goal of this task is to identify the various abnormal events accurately from the sensor logs without high computational cost.
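The sample layout described above (243 sensor values followed by a label column) could be split as in the following sketch. The comma delimiter, the numeric parsing, and the helper names `split_row` and `is_anomalous` are assumptions made for illustration; the description does not specify the file format beyond the column order.

```python
N_FEATURES = 243  # sensor values per sample, per the task description


def split_row(row, delimiter=","):
    """Split one log row into (features, label).

    The delimiter is an assumption; the description only states that
    the label sits in the last column.
    """
    fields = row.strip().split(delimiter)
    if len(fields) != N_FEATURES + 1:
        raise ValueError(
            f"expected {N_FEATURES + 1} columns, got {len(fields)}"
        )
    features = [float(value) for value in fields[:-1]]
    label = int(float(fields[-1]))  # accepts "3" as well as "3.0"
    return features, label


def is_anomalous(label):
    """Non-zero labels mark one of the seven anomalous event types."""
    return label != 0


# Synthetic row for illustration only, not taken from the dataset.
row = ",".join(["0.5"] * N_FEATURES + ["3"])
features, label = split_row(row)
print(len(features), label, is_anomalous(label))
```

Checking the column count up front makes a wrong delimiter guess fail loudly instead of producing misaligned features.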

The statistical information of this dataset is summarized as:

No. of Samples   No. of Features   No. of Classes   No. of Training   No. of Testing
82,363           243               8                57,654            24,709

Task 3: Android Malware Classification

This dataset was created from a set of APK (application package) files collected from the Opera Mobile Store between January and September 2014. Just as Windows (PC) systems use .exe files to install software, Android uses APK files to install software on the Android operating system.

The permission system restricts access to privileged system resources and is considered the first barrier to malware. Application developers have to declare the permissions explicitly in the AndroidManifest.xml file contained in the APK. All official Android permissions are categorized into four types: Normal, Dangerous, Signature and SignatureOrSystem. Because dangerous permissions give access to restricted resources and can have a negative impact if used incorrectly, they require the user's approval at installation.

To serve as input to a machine-learning algorithm, permissions are commonly coded as binary variables, i.e., each element of the vector takes one of two values: 1 for a requested permission and 0 otherwise. The number of possible Android permissions varies with the version of the OS. In this task, for each APK file under consideration, we provide the list of permissions declared in its AndroidManifest.xml file. The class label of the APK file (+1 if it is regarded as malicious and -1 otherwise) is determined by the detection results of the security appliances hosted by VirusTotal. Note that adware is not counted as malware in our setting. Participants in the CDMC 2016 competition are invited to design a classifier that best matches these labels.
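The binary coding described above can be sketched as follows. The helper names `build_vocabulary` and `encode` are hypothetical, the two example APKs are invented, and the permission lists are assumed to be already parsed into Python lists, since the on-disk format of the permission file is not specified here.

```python
def build_vocabulary(permission_lists):
    """Map every permission name seen in the data to a column index."""
    names = sorted({p for perms in permission_lists for p in perms})
    return {name: index for index, name in enumerate(names)}


def encode(permissions, vocabulary):
    """Return a 0/1 vector: 1 where the APK requests the permission."""
    vector = [0] * len(vocabulary)
    for permission in permissions:
        index = vocabulary.get(permission)
        if index is not None:  # names missing from the vocabulary are skipped
            vector[index] = 1
    return vector


# Two invented APKs for illustration only.
apks = [
    ["android.permission.INTERNET", "android.permission.SEND_SMS"],
    ["android.permission.INTERNET"],
]
vocab = build_vocabulary(apks)
vectors = [encode(perms, vocab) for perms in apks]
print(vectors)  # [[1, 1], [1, 0]]
```

Building the vocabulary from the training lists only and skipping unknown names keeps the test vectors the same length as the training vectors.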

The statistical information of the dataset is summarized as:

No. of APK files   No. of Permissions   No. of Classes   No. of Training   No. of Testing
61,730             up to 583            2                30,920            30,810

MD5 hashes are also provided in case you need to verify the files:
CDMC2016_AndroidPermissions.Train, md5(473f64d9e650e82325b1ce0216cc50c9)
CDMC2016_AndroidLabels.Train, md5(784b2ce7da61ff2935dca770c4bcbfb3)
CDMC2016_AndroidPermissions.Test, md5(192c70a8489c41fa95f5b95732fcdfb1)
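
A download can be checked against the hashes listed above with Python's standard `hashlib` module. This is a minimal sketch that assumes the three files sit in the current working directory under exactly the names given; the helper names `md5_of_file` and `verify` are made up for illustration.

```python
import hashlib
import os

# File names and hashes as listed above.
EXPECTED = {
    "CDMC2016_AndroidPermissions.Train": "473f64d9e650e82325b1ce0216cc50c9",
    "CDMC2016_AndroidLabels.Train": "784b2ce7da61ff2935dca770c4bcbfb3",
    "CDMC2016_AndroidPermissions.Test": "192c70a8489c41fa95f5b95732fcdfb1",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_hex.lower()


for name, expected in EXPECTED.items():
    if not os.path.exists(name):
        print(f"{name}: not found in the current directory")
    elif verify(name, expected):
        print(f"{name}: OK")
    else:
        print(f"{name}: MD5 mismatch")
```

Reading in chunks keeps memory use constant regardless of file size.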
