基于One-Class的矩阵分解方法
在矩阵分解中。 有类问题比較常见,即矩阵的元素仅仅有0和1。 相应实际应用中的场景是:用户对新闻的点击情况,对某些物品的购买情况等。
基于graphchi里面的矩阵分解结果不太理想。调研了下相关的文献,代码主要实现了基于PLSA的分解方法,具体请參考后面的參考文献
#!/usr/local/bin/python
#-*-coding:utf-8-*-
import sys
import math
import numpy as np
import string
import random
"""
reimplement algorithm of aspect_bernoulli.m
which may be usefull in one class matrix factorization
"""
"""
get matrix that all elements are 1
"""
def get_unit_matrix(row, col):
return np.ones((row, col), dtype=int)
def norm_zero(array):
col, row = array.shape
for c in xrange(col):
for r in xrange(row):
if array[c][r] < 0.000001:
array[c][r] = 0.000001
return array
"""
X: binary data matrix, each column is an observation
K: number of aspects(components) to estimate
iter: max number of iterations to do
ufile: save matrix result u
vfile: save martix result v
"""
def run_bernoulli(X, K, iter, ufile, vfile):
X = np.array(X)
T,N = X.shape
S = np.random.rand(K, N)
S_SUM = np.add.reduce(S)
S_SUM_EXT = np.tile(S_SUM, (K, 1))
S = S/S_SUM_EXT
A = np.random.rand(T, K)
UM_A = get_unit_matrix(T, K)
A_TEMP = UM_A - A
UM_X = get_unit_matrix(T, N)
X_TEMP = UM_X - X
l = []
for i in xrange(iter):
#STEP1
AS = np.dot(A, S) #A*S
AS_NORM = norm_zero(AS) #max(eps, A*S)
X_AS_NORM = X/AS_NORM #(X./max(eps, A*S))
FIRST = np.dot(A.T, X_AS_NORM) #A'*(X./max(eps, A*S))
S_1 = S*FIRST #S.*(A'*(X./max(eps, A*S)))
AS_TEMP = UM_X - AS #1-A*S
AS_TEMP_NORM = norm_zero(AS_TEMP) #max(eps, 1-A*S)
X_AS_TEMP_NORM = X_TEMP/AS_TEMP_NORM #(1-x)./max(eps, 1-A*S)
SECOND = np.dot(A_TEMP.T, X_AS_TEMP_NORM) #(1-A)'*((1-x)./max(eps, 1-A*S))
S_2 = S*SECOND #S.*((1-A)'*((1-x)./max(eps, 1-A*S)))
S = S_1 + S_2 #S.*(A'*(X./max(eps, A*S))) + S.*((1-A)'*((1-x)./max(eps, 1-A*S)))
S_SUM = np.add.reduce(S)
S_SUM_EXT = np.tile(S_SUM, (K, 1))
S = S/S_SUM_EXT
#STEP2
AS = np.dot(A, S) #A*S
AS_NORM = norm_zero(AS) #max(eps, A*S)
X_AS_NORM = X/AS_NORM #(X./max(eps, A*S))
A = A* np.dot(X_AS_NORM, S.T)
#STEP3
A_TEMP_S = np.dot(A_TEMP, S) #A1*S
A_TEMP_S_NORM = norm_zero(A_TEMP_S) #max(eps, A1*S)
A_TEMP_S_NORM_X = X_TEMP/A_TEMP_S_NORM #(1-X)./max(eps, A1*S)
A_TEMP = A_TEMP * (np.dot(A_TEMP_S_NORM_X, S.T)) #A1.*(((1-x)./max(eps, A1*S))*S')
A = A/(A + A_TEMP) #A./(A+A1)
A_TEMP = UM_A - A #1-A
#STEP4 compute loss function
AS = np.dot(A, S) #A*S
AS_NORM = norm_zero(AS) #max(eps, A*S)
AS_NORM = np.log(AS_NORM) #log(max(eps, A*S))
LOSS_FIRST = X * AS_NORM # X.*log(max(eps, A*S))
AS_TEMP = UM_X - AS #1-A*S
AS_TEMP_NORM = norm_zero(AS_TEMP) #max(eps, 1-A*S)
AS_TEMP_NORM = np.log(AS_TEMP_NORM) #log(max(eps, 1-A*S))
LOSS_SECOND = X_TEMP * AS_TEMP_NORM #(1-X).*log(max(eps, 1-A*S))
LOSS = LOSS_FIRST + LOSS_SECOND #X.*log(max(eps, A*S)) + (1-X).*log(max(eps, 1-A*S))
LOSS_SUM = np.add.reduce(LOSS) #sum(X.*log(max(eps, A*S)) + (1-X).*log(max(eps, 1-A*S)))
loss = np.sum(LOSS_SUM)/(N*T) #sum(sum(X.*log(max(eps, A*S)) + (1-X).*log(max(eps, 1-A*S))))/(N*T)
l.append(loss)
rmse = estimate_rmse(X, AS)
if i > 1:
if math.fabs(l[i] - l[i-1]) < 0.00001:break
print "iter = %d, loss = %f, rmse = %f"%(i, loss, rmse)
np.savetxt(ufile, A)
np.savetxt(vfile, S)
參考文献:
1. The aspect Bernoulli model: multiple causes of presences and absences (文章算法相应文章)
2. One-Class Matrix Completion with Low-Density Factorizations
最新文章
- html5 drag
- OSG中的HUD
- 工作中使用的html5和css3 新特性
- Scala 中的函数式编程基础(二)
- uva12532 线段树单点更新
- 软件工程随堂小作业——随机四则运算Ⅱ之算法思路(C++)
- javascript实现快速排序和二分法查找
- (转载)Delphi开发经验谈
- 汉字转拼音的Java类库:JPinyin
- JavaBean基础
- HBase Compact
- 关于LZO和LZOP
- TCP/IP 标志位 SYN ACK RST UTG PSH FIN
- IT经典书籍——Head First系列…
- javascript第七章--DOM
- Linux文件上传工具下载工具及详细使用说明
- 分布式之redis
- 100个MySQL 的调节和优化的提示
- Centos下搭建golang环境
- Spring JDBC处理CLOB类型字段
热门文章
- 本地自旋锁与信号量/多服务台自旋队列-spin wait风格的信号量
- PHP7添加swoole扩展
- [MySQL] 按年度、季度、月度、周、日统计查询
- 什么是BOM头(字节顺序标记(ByteOrderMark))
- 【转自网络】JS实现保存当前网页HTML到本地
- struts2学习之基础笔记5
- C++_String_类字符串操作(转)
- Linux基础、常用命令
- GitHub报错error: bad signature
- BZOJ 2806 [Ctsc2012]Cheat (后缀自动机+二分+单调队列+dp)