Statistics on All sfgg Blog Posts

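sfgg only recently opened its blog system for the public to apply for, but I have to say that, thanks to the pleasant writing experience, its popularity is growing step by step. On a whim, I wanted to collect some statistics on sfgg blogs, so I picked up python, the language I know best, wrote a crawler, cleaned up the data and ran some simple statistics. It took an evening plus a morning, and it finally worked. Below are my notes on the process and the code, along with a few questions for the experts that came up while I was writing it.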

Getting the URL list

http://segmentfault.com/blogs/newest?page=1
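This is the format of sfgg's paginated blog list. I only need to iterate over the value of the page parameter to capture every blog link, so once the beautifulsoup package for python was set up, a few lines of code did the job.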

getUrl.py

import bs4
import urllib.request as req

# Walk the 47 list pages and save every post link, one per line.
with open(r'E:\source\python\urlList.txt', 'w') as urlList:
    for i in range(47):
        content = req.urlopen("http://segmentfault.com/blogs/newest?page=" + str(1 + i))
        # Name the parser explicitly so bs4 does not have to guess one.
        soups = bs4.BeautifulSoup(content.read(), 'html.parser')
        for soup in soups('h2', {"class": "title"}):
            urlList.write(soup.a['href'] + '\n')
            print('write', soup.a['href'], 'done!')

print("All url done!")

I did not bother with efficiency here, because it was already fast enough.
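The page count of 47 is hard-coded in getUrl.py. A minimal sketch of an alternative, assuming that a page past the end simply returns no post links (not verified against the site): keep requesting pages until one comes back empty. The names `collect_links`, `fetch_links` and `fake_site` are hypothetical, and the fetcher is passed in so the loop can be tried without touching the network.

```python
def collect_links(fetch_links, max_pages=1000):
    """Call fetch_links(page) for page = 1, 2, ... until a page has no links."""
    links = []
    for page in range(1, max_pages + 1):
        found = fetch_links(page)
        if not found:  # empty page: assume we ran past the last one
            break
        links.extend(found)
    return links

# Stand-in for the real site: three pages of two links each.
fake_site = {1: ["/a/1", "/a/2"], 2: ["/a/3", "/a/4"], 3: ["/a/5", "/a/6"]}
all_links = collect_links(lambda page: fake_site.get(page, []))
print(len(all_links))
```

In a real run, `fetch_links` would wrap the urlopen and beautifulsoup calls from getUrl.py; `max_pages` is only a safety cap in case the site never returns an empty page.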

Getting the post metadata

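The plan for the first version was to keep using beautifulsoup to extract the information, put the data into a DataFrame from the pandas package, and then write it to the database in one go.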
getMeta.py

import pandas as pd
import mysql.connector as conn
import urllib.request as req
import bs4

rows = []

with open(r'E:\source\python\urlList.txt', 'r') as urlList:
    for line in urlList:
        url = line.strip()  # drop the trailing newline read from the file
        content = req.urlopen(url)
        soup = bs4.BeautifulSoup(content.read(), 'html.parser')
        author = soup('div', {'class': 'author-status'})[0].h4.a.text
        title = soup('h1', {'class': 'post-title'})[0].text
        vote = int(soup('span', {'id': 'article-rank'})[0].text)
        bookmark = int(soup('a', {'class': 'btn btn-rank bookmark'})[0].span.text)
        # The comment link reads "没有评论" (no comments) or a count plus a two-character suffix.
        commentText = soup('a', {'href': '#comments'})[0].text
        comment = 0 if commentText == "没有评论" else int(commentText[:-2])
        pageview = int(soup('span', {'class': 'views'})[0].text[:-2])
        rows.append({'url': url, 'author': author, 'title': title, 'vote': vote,
                     'bookmark': bookmark, 'comment': comment, 'pageview': pageview})
        print("Finish", url, "operation!")

# Build the DataFrame once; appending row by row copies the frame each time.
data = pd.DataFrame(rows)
cnx = conn.connect(user='root', password='xxxxxxxx',
                   host='127.0.0.1',
                   database='sfgg')
# flavor='mysql' is the pandas API of the time; later pandas releases removed it in favour of an SQLAlchemy engine.
data.to_sql("meta", cnx, flavor='mysql', if_exists='fail')
cnx.close()

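A few annoying problems came up:

A few posts turned out to have no author information, so author=soup('div',{'class':'author-status'})[0] raises an out-of-range index error.
When writing the data to mysql, Chinese text triggered the error 1366 (HY000): Incorrect string value:. After googling for a good while, the explanation seems to be that mysql's utf8 character set does not handle UTF-8 characters longer than three bytes. Fine, but I have no idea how to fix it.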
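To make the three-byte limit concrete, here is a minimal sketch that counts how many bytes a character takes in UTF-8. The helper name `chars_over_three_bytes` and the sample strings are made up for illustration, and it is only an assumption that four-byte characters were what tripped the insert here. If they were, the usual remedy is MySQL's `utf8mb4` character set, which stores four-byte characters.

```python
def chars_over_three_bytes(text):
    """Return the characters of text whose UTF-8 encoding needs more than 3 bytes."""
    return [ch for ch in text if len(ch.encode("utf-8")) > 3]

# Common Chinese characters take exactly 3 bytes each in UTF-8.
print(len("博".encode("utf-8")))  # prints 3

# Emoji and other characters outside the Basic Multilingual Plane take 4 bytes.
print(chars_over_three_bytes("好文章👍"))  # prints ['👍']
```

Running a check like this over the scraped titles and author names would show whether any of them actually contain four-byte characters.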

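So I started thinking about a second version, adding a thread pool to speed things up and switching the database to sqlite. I drew on the sfgg article about parallelizing python in one line. And so came the second version: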
getMetav2.py

import bs4
import pickle
import sqlite3
import threading
import pandas as pd
import urllib.request as req
from multiprocessing.dummy import Pool as ThreadPool

# Read all URLs up front, dropping trailing newlines and blank lines.
with open(r'E:\source\python\urlList.txt', 'r') as f:
    urlList = [line.strip() for line in f if line.strip()]

errorLock = threading.Lock()  # serializes writes to errorList.txt across threads

def getData(url):
    try:
        content = req.urlopen(url)
        soup = bs4.BeautifulSoup(content.read(), 'html.parser')
        title = soup('h1', {'class': 'post-title'})[0].text
        author = soup('div', {'class': 'author-status'})[0].h4.a.text
        vote = int(soup('span', {'id': 'article-rank'})[0].text)
        bookmark = int(soup('a', {'class': 'btn btn-rank bookmark'})[0].span.text)
        # The comment link reads "没有评论" (no comments) or a count plus a two-character suffix.
        commentText = soup('a', {'href': '#comments'})[0].text
        comment = 0 if commentText == "没有评论" else int(commentText[:-2])
        pageview = int(soup('span', {'class': 'views'})[0].text[:-2])
        print(url)
        return {'url': url, 'author': author, 'vote': vote, 'bookmark': bookmark,
                'title': title, 'comment': comment, 'pageview': pageview}
    except Exception:
        # Append, so earlier failures are kept ('w+' truncated the file on every error).
        with errorLock, open(r'E:\source\python\errorList.txt', 'a') as errorList:
            errorList.write(url + '\n')
        return None

pool = ThreadPool(50)
result = pool.map(getData, urlList)
pool.close()
pool.join()

print('finish meta data')

with open(r'E:\source\python\result', 'wb') as f:
    pickle.dump(result, f)

# Failed URLs come back as None; keep only the successful rows.
data = pd.DataFrame([ob for ob in result if ob is not None])
cnx = sqlite3.connect(r'E:\SQLite\meta.db')
data.to_sql("meta", cnx)
cnx.close()

print('All finished!')

This version first of all records the links that failed, then persists the results once every link has been visited, and stores the data in sqlite.
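The pattern of running a worker in a thread pool while recording failures can be shown on its own, without the network. The sketch below is an illustration under assumptions, not the code used for the crawl: `run_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real scraping function by failing on any URL that contains "bad".

```python
from multiprocessing.dummy import Pool as ThreadPool

def run_all(worker, items, threads=4):
    """Apply worker to every item in a thread pool; return (results, failures)."""
    def safe(item):
        try:
            return item, worker(item), None
        except Exception as exc:  # keep going; remember what failed and why
            return item, None, exc

    with ThreadPool(threads) as pool:
        outcomes = pool.map(safe, items)  # map keeps the input order

    results = [value for _, value, err in outcomes if err is None]
    failures = [(item, repr(err)) for item, _, err in outcomes if err is not None]
    return results, failures

def fake_fetch(url):
    """Stand-in for the real scraper: fails on URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError("no author block")
    return {"url": url, "pageview": len(url)}

results, failures = run_all(fake_fetch, ["/a/1", "/a/bad", "/a/22"])
print(len(results), "ok;", failures)
```

Returning the failures to the caller, instead of writing a file from inside each thread, keeps the worker free of shared state and leaves it to the caller to decide whether to log or retry them.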

Some simple statistics

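Who is the most prolific blogger?
Whose posts have the most votes?
Which blogger is most worth bookmarking? Take them home now!
Which blogger has the highest approval for a single article?
Whose blog draws the biggest response?
Which articles get the most exposure?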

The titles here are too long, so I will not put them on the chart.

0.每个人都得经历挫折和不断成长的过程：从退学到创业的这几年！
1.SegmentFault.php
2.皮包公司的秘密 - 人口贩子公司
3.技术人攻略访谈十九：iOS大V养成记
4.Gulp.js：比 Grunt 更简单的自动化的项目构建利器
5.Vim 的哲学（一）
6.技术人攻略访谈二十五：运维人的野蛮生长
7.告别码农，成为真正的程序员
8.从13到14（一个没有文采的女同学）
9.小团队，大梦想：寻找小伙伴，成为SegmentFault 的小伙伴
Which bloggers get the most exposure?
Does a title have to be long?
Regressing page views on title length gives:

pageview = 0.6495*titleLength + 231.0448

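At first glance the relationship between the two looks positive, but the t-statistic of the slope is significant only at about the 50% confidence level, so the result is essentially not credible. The idea that a title has to be long is nonsense.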
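The t-statistic in question is the slope divided by its standard error; a value close to zero means the slope cannot be told apart from zero. A minimal sketch of that calculation for a one-variable regression follows. It uses made-up numbers rather than the sfgg data, and the function name `simple_ols` is hypothetical. For a plain OLS fit, statsmodels reports the same quantity in the t column of `res.summary()`.

```python
import math

def simple_ols(x, y):
    """Fit y = slope * x + intercept by least squares; return (slope, intercept, t)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, intercept, slope / std_err

# Made-up numbers, not the sfgg data.
title_length = [1, 2, 3, 4, 5]
pageview = [2, 4, 5, 4, 5]
slope, intercept, t_stat = simple_ols(title_length, pageview)
print(round(slope, 4), round(intercept, 4), round(t_stat, 4))
```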

Code:
analysis.py

import sqlite3
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf

conn = sqlite3.connect(r'E:\SQLite\meta.db')

data = pd.read_sql(r'select * from meta', conn)

# Who is the most prolific blogger
topAuthor = data['author'].value_counts()[:10]
topAuthor.plot(kind='bar', rot=270)
plt.show()


# Whose posts have the most votes
# (Series.sort() is gone from current pandas; sort_values() returns a sorted copy.)
topVote = data['vote'].groupby(data['author']).sum()
topVote = topVote.sort_values(ascending=False)[:10]
topVote.plot(kind='bar', rot=270)
plt.show()


# Which blogger is most worth bookmarking
topMark = data['bookmark'].groupby(data['author']).sum()
topMark = topMark.sort_values(ascending=False)[:10]
topMark.plot(kind='bar', rot=270)
plt.show()


# Which blogger has the highest approval (votes plus bookmarks, summed per author)
topValue = (data['vote'] + data['bookmark']).groupby(data['author']).sum()
topValue = topValue.sort_values(ascending=False)[:10]
topValue.plot(kind='bar', rot=270)
plt.show()


# Whose blog draws the biggest response
topComment = data['comment'].groupby(data['author']).sum()
topComment = topComment.sort_values(ascending=False)[:10]
topComment.plot(kind='bar', rot=270)
plt.show()


# Which articles get the most exposure
topViewSingle = data['pageview'].groupby(data['title']).sum()
topViewSingle = topViewSingle.sort_values(ascending=False)[:10]
topViewSingle.plot(kind='bar', rot=270)
plt.show()


# Which bloggers get the most exposure
topViewSum = data['pageview'].groupby(data['author']).sum()
topViewSum = topViewSum.sort_values(ascending=False)[:10]
topViewSum.plot(kind='bar', rot=270)
plt.show()


# Relationship between title length and page views
titleLength = [len(x) for x in data['title']]
lengthData = pd.DataFrame({'titleLength': titleLength, 'pageview': data['pageview']})
res = smf.ols('pageview~titleLength', data=lengthData).fit()
print(res.summary())

A few questions

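1. In getMetav2.py I tried a process pool, that is, from multiprocessing import Pool as ThreadPool, and apart from maxing out my CPU it had no effect at all. Why is that? What makes the thread pool and the process pool behave differently here?
2. Please point out any bad habits in this code. Some things are hard to notice on your own. Thanks, everyone.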
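On the first question, one possible explanation, offered as an assumption rather than a diagnosis: the file paths suggest the script ran on Windows, where multiprocessing starts each worker by importing the main module again. Without an `if __name__ == "__main__":` guard, every worker re-runs the module-level code, including the lines that create the pool, so the workers never get to the actual tasks. Threads share the parent's memory and do not re-import anything, which would explain why the thread pool is unaffected. A minimal sketch of the guarded layout, with a hypothetical `square` function standing in for getData:

```python
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def square(n):
    """Stand-in for getData; lives at module level so worker processes can import it."""
    return n * n

def run(pool_factory, items, workers=2):
    """Map square over items using whichever pool class is passed in."""
    with pool_factory(workers) as pool:
        return pool.map(square, items)

if __name__ == "__main__":
    # Only the parent process reaches this block; spawned workers that
    # re-import the module skip it, so no pool is created recursively.
    print(run(ThreadPool, [1, 2, 3]))
    print(run(Pool, [1, 2, 3]))
```

Even with the guard in place, a process pool is unlikely to beat a thread pool for this job, since the work is dominated by waiting on the network rather than by computation.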

Closing words

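Thanks to the sfgg admins for not banning my IP while I was crawling; the servers hold up well, nice. Thanks to all the bloggers for their tireless original writing, which is why sfgg has so many excellent articles. Thanks to python, which let me realize this idea in less than a day. And finally, thanks to 沙渺 for generously letting me crawl sfgg's articles.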


bigtan
