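The book we mentioned in the first post, 《集体智慧》 (English title: Programming Collective Intelligence), is not only intensely practical; it also explains many complex machine learning algorithms in remarkably plain terms, which really impressed me. It covers collaborative filtering, clustering, classification, data mining algorithms, and more. Enough preamble: today let's see how the book describes Euclidean distance and the Pearson correlation coefficient, two measures that are usually buried under a pile of formulas, in an accessible way. PS: the book uses Python.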
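First, some Python code. Python is arguably the best language for teaching these days. If you are not familiar with it, I recommend reading A Byte of Python first and then Dive Into Python; both books, and the Dive Into Python example code, are available for download.
critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,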
'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
'The Night Listener': 3.0},
'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
'You, Me and Dupree': 3.5},
'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
'The Night Listener': 4.5, 'Superman Returns': 4.0,
'You, Me and Dupree': 2.5},
'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
'You, Me and Dupree': 2.0},
'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}
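The code above is the raw data we start from: each user's ratings of a set of movies. Now let's see how to use Euclidean distance to measure the similarity between two users.
In the chart, the horizontal axis is the 1-5 rating for the movie Dupree and the vertical axis is the 1-5 rating for Snakes. Each point is a user, placed according to the ratings that user gave the two movies. Some users sit close together and others far apart: the closer two users are in this Euclidean space, the more similar they are, and the farther apart, the less similar.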
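The closeness described above can be turned into a number. The following is a minimal Python 3 sketch, not necessarily the book's exact code: it computes the Euclidean distance between two users over the movies both have rated and maps it to a score between 0 and 1. The function name `sim_distance`, the 1/(1+d) mapping, and the trimmed-down `ratings` sample (a subset of the data above) are choices made for this illustration.

```python
from math import sqrt

# A subset of the ratings shown earlier, repeated so the sketch runs on its own.
ratings = {
    'Lisa Rose': {'Snakes on a Plane': 3.5, 'You, Me and Dupree': 2.5,
                  'Superman Returns': 3.5},
    'Claudia Puig': {'Snakes on a Plane': 3.5, 'You, Me and Dupree': 2.5,
                     'Superman Returns': 4.0},
    'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0,
             'Superman Returns': 4.0},
}

def sim_distance(prefs, person1, person2):
    """Distance-based similarity: 1 for identical ratings on the shared
    movies, approaching 0 as they diverge, 0 if no movie is shared."""
    shared = [movie for movie in prefs[person1] if movie in prefs[person2]]
    if not shared:
        return 0.0
    sum_of_squares = sum(
        (prefs[person1][m] - prefs[person2][m]) ** 2 for m in shared
    )
    # Assumption of this sketch: map distance d to 1 / (1 + d).
    return 1 / (1 + sqrt(sum_of_squares))

print(sim_distance(ratings, 'Lisa Rose', 'Claudia Puig'))
print(sim_distance(ratings, 'Lisa Rose', 'Toby'))
```

In this sample Lisa Rose and Claudia Puig differ on only one of the three movies, so they score higher than Lisa Rose and Toby.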
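Euclidean distance has a drawback, though: whenever the rating data is not normalized, it becomes hard to judge similarity. For example, suppose Toby's ratings always run higher than average, so that a movie Toby gives 5 (recommended) usually gets 4.5 (also recommended) from other people. Because the two sides are using different yardsticks, Euclidean distance is not a good fit. Another similarity measure handles this case: the Pearson correlation coefficient, which indicates how well two data sets fit a straight line. Consider the following two examples: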

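Figure 2 shows how two users, Gene Seymour and Mick LaSalle, rated a series of movies. The dashed line is the best-fit line, so called because it comes as close as possible to all of the points. If the two users had rated every movie identically, this line would be a diagonal and the correlation coefficient would be 1. In the figure the two users agree on only a few movies, so the correlation coefficient is only about 0.4.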

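Figure 3 shows two users with a higher correlation coefficient. Interestingly, Jack Matthews consistently gives higher scores than Lisa Rose (this is known as grade inflation), yet the Pearson coefficient still captures the similarity between the two users. In Euclidean space, by contrast, the two rating vectors would be judged dissimilar even though the users' tastes are very close.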
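To make the grade-inflation point concrete, here is a minimal Python 3 sketch of the Pearson correlation coefficient over the movies two users share. The function name `sim_pearson` and the two users `Strict` and `Generous` are hypothetical, added only for illustration; the Lisa Rose and Jack Matthews ratings are copied from the data above.

```python
from math import sqrt

ratings = {
    # Copied from the data set shown earlier.
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                  'Just My Luck': 3.0, 'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                      'The Night Listener': 3.0, 'Superman Returns': 5.0,
                      'You, Me and Dupree': 3.5},
    # Hypothetical users: Generous always rates exactly one point higher.
    'Strict': {'Movie A': 1.0, 'Movie B': 2.0, 'Movie C': 3.0},
    'Generous': {'Movie A': 2.0, 'Movie B': 3.0, 'Movie C': 4.0},
}

def sim_pearson(prefs, person1, person2):
    """Pearson correlation of two users' ratings on the movies both rated."""
    shared = [movie for movie in prefs[person1] if movie in prefs[person2]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[person1][m] for m in shared]
    ys = [prefs[person2][m] for m in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a user with constant ratings has no defined correlation
    return covariance / sqrt(spread_x * spread_y)

print(round(sim_pearson(ratings, 'Lisa Rose', 'Jack Matthews'), 3))  # 0.747
print(sim_pearson(ratings, 'Strict', 'Generous'))                    # 1.0
```

A constant offset between two users leaves the coefficient at 1, which is why Pearson tolerates grade inflation where a distance-based score does not.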
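So which similarity measure should we use? In short, it depends on the application. Besides the Pearson coefficient and Euclidean distance, there are many other similarity measures, such as the Jaccard coefficient and the Manhattan distance.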
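For comparison, the two other measures just named can be sketched in a few lines of Python 3. This is an illustration under stated assumptions rather than code from the book: the function names are hypothetical, the Manhattan distance is taken over the movies both users rated, and the Jaccard coefficient is applied to the sets of movies each user rated, ignoring the rating values.

```python
ratings = {
    # Copied from the data set shown earlier.
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                  'Just My Luck': 3.0, 'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0,
             'Superman Returns': 4.0},
}

def manhattan_distance(prefs, person1, person2):
    """Sum of absolute rating differences over the shared movies."""
    shared = [movie for movie in prefs[person1] if movie in prefs[person2]]
    return sum(abs(prefs[person1][m] - prefs[person2][m]) for m in shared)

def jaccard_coefficient(prefs, person1, person2):
    """Size of the overlap of the two rated-movie sets divided by their union."""
    movies1, movies2 = set(prefs[person1]), set(prefs[person2])
    union = movies1 | movies2
    if not union:
        return 0.0
    return len(movies1 & movies2) / len(union)

print(manhattan_distance(ratings, 'Lisa Rose', 'Toby'))   # 3.0
print(jaccard_coefficient(ratings, 'Lisa Rose', 'Toby'))  # 0.5
```

Note that the Manhattan distance is a distance (smaller means more alike), while the Jaccard coefficient is a similarity (larger means more alike).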
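A review of Programming Collective Intelligence by guwendong got me deeply interested in personalization technology. I read every post on his blog in one sitting and learned a great deal, then spent two weeks obsessively reading about collaborative recommendation. Right afterwards I told my advisor that collaborative filtering would be my research direction, and I dreamed up a pile of ways to build a startup on the technology (don't forget that my life goal is to start a company and, when the time is right, travel the world). I had plenty of ideas, such as a specialized recommendation community for health knowledge, or an information and knowledge recommendation site aimed at saving money. Later, though, I found that the topic was too broad, and my research area has since been narrowed down to discovering hot topics in online communities. Still, I will keep following collaborative filtering. Why?
1. Information will keep exploding in volume, but everyone's time and energy are limited. How do you get the most suitable information you care about in that limited time? Partly by actively seeking it out yourself, but the most important answer, and one with very good prospects, is recommendation technology, which draws on the opinions of other people (generally those whose interests match or resemble yours) to surface content you are interested in.
2. Society will become more connected. What does that mean? Take the booming SNS (social network services): you can keep up with all kinds of information about your friends at any time, but what is still missing is a way to put that information to work for you proactively. Collaborative recommendation will surely have a role here as well.
In addition, guwendong has studied a series of emerging personalization applications, which are worth a look.
From now on this blog will follow technologies and applications related to personalization :) Next, I will write some hands-on posts based on Programming Collective Intelligence, both to put the techniques into practice and to lay some groundwork for my paper. I hope everyone will keep me honest and share generously :)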
Recently I have been focusing on collaborative filtering, which I find very interesting. Many applications use this technology: on Amazon.com, for example, when you view a product, the system offers other products you are likely to buy. As the internet keeps growing we face enormous amounts of data, so how can we find what we really need? Collaborative filtering helps you find the gold hidden in it.
Here are some notes I took after reading some papers; for details, see the wiki page on this site, Papers note on collaborative filtering. I also strongly recommend this book to my friends: it makes complicated concepts and many machine learning algorithms unbelievably easy to follow.
These days I have switched to SQL Server 2008 business intelligence, which may last two or three months. I had never used any of SQL Server's business intelligence services, so getting familiar with all this new material is hard, but it is really exciting work. I had never realized how many services are available inside SQL Server, especially SQL Server 2008. (A paper comparing the features of SQL Server 2008 and Oracle 11g is available to download and review: sql2008_vs_oracle11g.)
This week I mainly practiced some demos with the help of the SQL Server 2008 tutorial. In general, the business intelligence features of SQL Server 2008 can be divided into three parts: SSIS (SQL Server Integration Services), SSAS (SQL Server Analysis Services), and SSRS (SQL Server Reporting Services). As in most projects, we first have to preprocess the original data into a database or data warehouse. This work is tedious, but it is very important for later analysis (database/data warehouse structure).
So in the next post we will dive into SSIS and extract data from complicated flat files into a SQL Server database.
Further reading:
1. Business Intelligence and Data Warehousing in SQL Server 2005
Pentaho Reporting is a collection of open source projects primarily focused on the creation, generation and distribution of rich and sophisticated report content from all sources of information.
Two visual tools make report design very easy: Report Designer and Report Design Wizard. The first lets report authors quickly create sophisticated, rich reports for the classic engine. The second makes report creation easy: working from templates, the wizard can also create simple to sophisticated reports for the classic engine. Today, let's begin with Pentaho's Report Designer. It's very powerful! :)
As time goes by, I find that data mining is the strongest tool for finding the potential value that exists in large amounts of data. The relationship between beer and diapers is known to most people. So, would you like to create value by digging into data? Would you like to find the most exciting knowledge hidden behind the data? If so, let's begin our journey into data mining :)
I am new to this area, so let's just do it: keep learning and practicing. Making a little progress every day is a big success in life.
In the coming days we will use an open source tool called WEKA, which was bought by Pentaho in 2007. WEKA will definitely become a data mining component of the Pentaho platform; however, this component is still under development and will not be available to the public for at least three months. So now is a good time for us to get started with WEKA.
Links:
1. WEKA original project: http://www.cs.waikato.ac.nz/ml/weka/
2. Pentaho data mining website: http://www.pentaho.com/products/data_mining/
Pentaho Mondrian MDX learning and practice: using MySQL and the Pentaho Mondrian workbench

Let’s learn and practise step by step
Pentaho+tomcat+mysql tutorial
[Pentaho material download link: http://community.pentaho.com/sourceforge/]
Tools you will use in this tutorial:
1. pentaho_j2ee_deployments_1.7.0.M1.1018.zip
2. pentaho_demo_mysql5-1.7.0.M1.1018
3. Eclipse 3.2 or above
4. tomcat 5.5 or above
5. MySQL 5.0 or above
Now let's begin.
Hi everyone, I am Kyle from China, and I have been an intern at Intel for almost three months. Most of my work is related to business intelligence, such as ETL (extract, transform, load), multidimensional analysis based on databases, and reporting and analysis through the Pentaho BI platform.
I was a complete newcomer to this area at first, but after three months of self-study and practice I am getting really interested in BI. Most importantly, recording how I improved will help me a lot, and later it will help other beginners too.
As a postgraduate student focusing on knowledge management, I will also log some experiment samples and ideas from the papers I read, such as how to evaluate users' interests based on their behavior.
Now let's set off on this exciting journey together and pursue the great future of BI.


