[Python Crawler + Data Analysis] Movies of 2018: How Many Have You Seen?

Author: 法納斯特, 2018-12-05 13:59:45

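December has begun, and 2018 ends in a little over half a month. Remember the resolutions you made at the start of the year? How many did you keep? I suspect many of you, like me, are holding your heads and weeping... This article scrapes Maoyan Movies and analyzes the data on films released in 2018.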

I. Web Page Analysis

01 Tags

Clicking the filter tags that Maoyan Movies already provides yields the URL of the filtered listing.

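02 Index Pages

Open the browser's developer tools and locate each film's link and rating on the index page.

The index has more than 30 pages, but only 10 of them contain films with ratings.

This crawl collects only the films that have a rating.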
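
As a rough sketch of that paging: assuming each index page lists 30 films and is selected through an `offset` query parameter (both taken from the crawler URL used later in this article), the 10 rated pages could be enumerated as follows. The helper name `index_urls` is made up for illustration.

```python
BASE = 'http://maoyan.com/films?showType=3&yearId=13&sortId=3&offset='


def index_urls(pages=10, page_size=30):
    """Return the index-page URLs for the first `pages` pages."""
    return [BASE + str(page * page_size) for page in range(pages)]


urls = index_urls()
print(len(urls))   # 10
print(urls[0])     # ...&offset=0
print(urls[-1])    # ...&offset=270
```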

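03 Detail Pages

Next, collect the information on each detail page.

The main fields are name, genre, country, runtime, release date, rating, number of raters, and cumulative box office.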

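II. Cracking the Anti-Scraping Measures

The developer tools show that Maoyan applies font-based obfuscation to three values: the rating, the number of raters, and the cumulative box office.

The page source shows that the character codes for these three values change every time the page is refreshed, so they cannot be matched directly.

The font file therefore has to be downloaded and matched in two steps.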

from fontTools.ttLib import TTFont

# The reference font base.woff can be dumped the same way:
# font = TTFont('base.woff')
# font.saveXML('base.xml')
font = TTFont('maoyan.woff')
font.saveXML('maoyan.xml')

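Convert the woff file to XML so that its details can be inspected in PyCharm.

Open the woff file with the site below.

url: http://fontstore.baidu.com/static/editor/index.html

This shows the digit glyphs below (the upper and lower blocks).

Open the XML file in PyCharm (the left and right blocks) and you will find the corresponding entries.

From the figure above you can identify the digit 6; the other digits work the same way.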
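
The two-step matching can be sketched without any font files. In the sketch below, glyph outlines are stood in for by plain tuples of points, and the new font's glyph names and all outlines are invented for illustration; in the real crawler they come from `fontTools`. Step one matches each glyph of the freshly downloaded font against a reference font whose digits were labelled by hand. Step two turns the glyph name into the character that actually appears in the page source.

```python
def build_char_map(base_outlines, base_digits, new_outlines):
    """Map each obfuscated character of the new font to the digit it draws.

    base_outlines: {glyph name: outline} for the hand-labelled reference font
    base_digits:   {glyph name: digit} for that same reference font
    new_outlines:  {glyph name: outline} for the newly downloaded font
    """
    # Step 1: an identical outline means the same digit, whatever the name
    digit_by_outline = {outline: base_digits[name]
                        for name, outline in base_outlines.items()}
    char_map = {}
    for name, outline in new_outlines.items():
        digit = digit_by_outline.get(outline)
        if digit is None:
            continue  # no reference glyph has this shape
        # Step 2: a name like 'uniE6E1' encodes the code point U+E6E1
        char_map[chr(int(name[3:], 16))] = digit
    return char_map


# Hypothetical data: two digits, outlines reduced to a few points
base_outlines = {'uniF8E7': ((0, 0), (5, 9), (9, 0)), 'uniE6E1': ((1, 1), (8, 8))}
base_digits = {'uniF8E7': '6', 'uniE6E1': '0'}
new_outlines = {'uniE137': ((1, 1), (8, 8)), 'uniF2C4': ((0, 0), (5, 9), (9, 0))}

char_map = build_char_map(base_outlines, base_digits, new_outlines)
decoded = '\ue137\uf2c4'.translate(str.maketrans(char_map))
print(decoded)  # 06
```

Keying the lookup on the outline rather than the glyph name is what makes the approach survive a page refresh: the names change with every download, the shapes do not.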

import re
import requests

def get_numbers(u):
    """
    Crack Maoyan's font-based obfuscation.
    Returns the digits and the obfuscated characters, in matching order.
    """
    # Locate the woff font referenced in the page's inline CSS
    cmp = re.compile(r"url\('(//[^']+?\.woff)'\)\s*format\('woff'\)")
    rst = cmp.findall(u)
    ttf = requests.get("http:" + rst[0], stream=True)
    with open("maoyan.woff", "wb") as f:
        for chunk in ttf.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
    base_font = TTFont('base.woff')
    maoyan_font = TTFont('maoyan.woff')
    maoyan_unicode_list = maoyan_font.getGlyphOrder()
    # Hand-labelled glyphs of the reference font base.woff
    base_num_list = ['.', '3', '0', '8', '9', '4', '1', '5', '2', '7', '6']
    base_unicode_list = ['x', 'uniF561', 'uniE6E1', 'uniF125', 'uniF83F', 'uniE9E2', 'uniEEA6', 'uniEEC2', 'uniED38', 'uniE538', 'uniF8E7']
    maoyan_num_list = []
    utf8last = []
    # Index 0 is skipped; index 1 is the 'x' glyph, followed by the digit glyphs
    for name in maoyan_unicode_list[1:12]:
        maoyan_glyph = maoyan_font['glyf'][name]
        for base_name, num in zip(base_unicode_list, base_num_list):
            if maoyan_glyph == base_font['glyf'][base_name]:
                maoyan_num_list.append(num)
                # 'uniXXXX' names encode the code point; 'x' is a literal character
                utf8last.append(chr(int(name[3:], 16)) if name.startswith('uni') else name)
                break
    return (maoyan_num_list, utf8last)

III. Data Collection

01 Building the Request Headers

# 'br' is left out of Accept-Encoding: requests can only decode Brotli
# responses when an optional Brotli package is installed
head = """
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Host:maoyan.com
Upgrade-Insecure-Requests:1
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36
"""

def str_to_dict(header):
    """
    Build a request-header dict, so that different functions can use different headers
    """
    header_dict = {}
    header = header.split('\n')
    for h in header:
        h = h.strip()
        if h:
            # Split on the first colon only; values such as URLs contain colons
            k, v = h.split(':', 1)
            header_dict[k] = v.strip()
    return header_dict

因?yàn)樗饕摵驮斍轫撜埱箢^不一樣，這里為了簡便，構(gòu)造了一個(gè)函數(shù)。

02 Getting the Film Detail Page Links

import time
import requests
from bs4 import BeautifulSoup

def get_url():
    """
    Collect the film detail page links from the index pages
    """
    for i in range(0, 300, 30):
        time.sleep(10)
        url = 'http://maoyan.com/films?showType=3&yearId=13&sortId=3&offset=' + str(i)
        host = """Referer:http://maoyan.com/films?showType=3&yearId=13&sortId=3&offset=0
        """
        header = head + host
        headers = str_to_dict(header)
        response = requests.get(url=url, headers=headers)
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        data_1 = soup.find_all('div', {'class': 'channel-detail movie-item-title'})
        data_2 = soup.find_all('div', {'class': 'channel-detail channel-detail-orange'})
        # Titles and ratings appear in the same order, so pair them up
        for item, rating in zip(data_1, data_2):
            time.sleep(10)
            url_1 = item.select('a')[0]['href']
            # '暂无评分' is the site's "no rating yet" label
            if rating.get_text() != '暂无评分':
                film_url = 'http://maoyan.com' + url_1
                for message in get_message(film_url):
                    print(message)
                    to_mysql(message)
                print(film_url)
                print('---------------^^^Film_Message^^^-----------------')
            else:
                # The listing is sorted by rating, so the first unrated film ends the crawl
                print('The Work Is Done')
                return

03 Getting the Film Detail Page Information

def get_message(url):
    """
    Fetch the information on a film's detail page
    """
    time.sleep(10)
    data = {}
    host = """Referer:http://maoyan.com/news
    """
    header = head + host
    headers = str_to_dict(header)
    response = requests.get(url=url, headers=headers)
    u = response.text
    # Crack Maoyan's font-based obfuscation
    (maoyan_num_list, utf8last) = get_numbers(u)
    # Parse the film information
    soup = BeautifulSoup(u, "html.parser")
    mw = soup.find_all('span', {'class': 'stonefont'})
    unit = soup.find_all('span', {'class': 'unit'})
    ell = soup.find_all('li', {'class': 'ellipsis'})
    name = soup.find_all('h3', {'class': 'name'})
    data["name"] = name[0].get_text()
    data["type"] = ell[0].get_text()
    data["country"] = ell[1].get_text().split('/')[0].strip().replace('\n', '')
    data["length"] = ell[1].get_text().split('/')[1].strip().replace('\n', '')
    data["released"] = ell[2].get_text()[:10]
    # Some films have no box office; they keep the default of 0
    data["box_office"] = 0
    for i in range(len(mw)):
        moviewish = mw[i].get_text()
        # Swap each obfuscated character for the digit it stands for
        for char, num in zip(utf8last, maoyan_num_list):
            moviewish = moviewish.replace(char, num)
        if i == 0:
            # Rating, stored with its unit '分' (points)
            data["score"] = moviewish + '分'
        elif i == 1:
            # Number of raters; '万' means ten thousand
            if '万' in moviewish:
                data["people"] = int(float(moviewish.replace('万', '')) * 10000)
            else:
                data["people"] = int(float(moviewish))
        elif unit:
            # Box office: '万' is ten thousand, anything else is treated as hundred million
            if unit[0].get_text() == '万':
                data["box_office"] = int(float(moviewish) * 10000)
            else:
                data["box_office"] = int(float(moviewish) * 100000000)
    yield data
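
The unit handling inside `get_message` can be pulled out and checked on its own. This is a minimal sketch, assuming counts appear either as plain digits or with a ten-thousand or hundred-million suffix. The helper name `to_int` is hypothetical, and it uses `round()` instead of the `int()` truncation above so that a float product landing just below a whole number is not cut short.

```python
# Multipliers for the Chinese number units (Simplified and Traditional forms)
UNITS = {'万': 10_000, '萬': 10_000, '亿': 100_000_000, '億': 100_000_000}


def to_int(text, unit=''):
    """Convert text such as '12.5万', or '3.6' with unit '亿', to an integer."""
    text = text.strip()
    # The suffix may be attached to the number or passed separately
    if text and text[-1] in UNITS:
        text, unit = text[:-1], text[-1]
    return round(float(text) * UNITS.get(unit, 1))


print(to_int('12.5万'))     # 125000
print(to_int('3.6', '亿'))  # 360000000
print(to_int('8421'))       # 8421
```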

IV. Data Storage

01 Creating the Database and Table

import pymysql

# Create the database
db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306)
cursor = db.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS maoyan DEFAULT CHARACTER SET utf8mb4")
db.close()

# Create the table
db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306, db='maoyan')
cursor = db.cursor()
sql = ('CREATE TABLE IF NOT EXISTS films (name VARCHAR(255) NOT NULL, type VARCHAR(255) NOT NULL, '
       'country VARCHAR(255) NOT NULL, length VARCHAR(255) NOT NULL, released VARCHAR(255) NOT NULL, '
       'score VARCHAR(255) NOT NULL, people INT NOT NULL, box_office BIGINT NOT NULL, '
       'PRIMARY KEY (name))')
cursor.execute(sql)
db.close()

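The box-office column is declared as BIGINT. Its signed maximum is 9223372036854775807 (19 digits); the unsigned maximum is 18446744073709551615 (20 digits).

A signed INT has a maximum of 2147483647 (10 digits), which cannot hold a box office of 3.6 billion (3600000000).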
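
The limits follow directly from the column widths, which a few lines of arithmetic confirm. This assumes MySQL's standard 32-bit INT and 64-bit BIGINT; the variable names are only for the example.

```python
INT_MAX = 2**31 - 1              # signed 32-bit INT
BIGINT_MAX = 2**63 - 1           # signed 64-bit BIGINT
BIGINT_UNSIGNED_MAX = 2**64 - 1  # BIGINT UNSIGNED
box_office = 3_600_000_000       # 3.6 billion, the figure quoted above

print(INT_MAX, box_office <= INT_MAX)        # 2147483647 False
print(BIGINT_MAX, box_office <= BIGINT_MAX)  # 9223372036854775807 True
print(BIGINT_UNSIGNED_MAX)                   # 18446744073709551615
```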

02 Storing the Data

def to_mysql(data):
    """
    Write one film record to MySQL
    """
    table = 'films'
    keys = ', '.join(data.keys())
    values = ', '.join(['%s'] * len(data))
    db = pymysql.connect(host='localhost', user='root', password='774110919', port=3306, db='maoyan')
    cursor = db.cursor()
    sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(table=table, keys=keys, values=values)
    try:
        # Values are passed separately so that the driver escapes them
        if cursor.execute(sql, tuple(data.values())):
            print("Successful")
            db.commit()
    except pymysql.MySQLError as e:
        # For example, a duplicate primary key when the film is already stored
        print('Failed', e)
        db.rollback()
    finally:
        db.close()
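
To see what statement `to_mysql` actually sends, the string-building step can be run without a database. In this sketch the helper name `build_insert` and the sample row are invented for illustration. The `%s` markers are pymysql's parameter placeholders, so the values travel separately from the SQL text; the column names, by contrast, are interpolated directly and should only ever come from trusted code.

```python
def build_insert(table, data):
    """Return a parameterized INSERT statement and its values for one row."""
    keys = ', '.join(data.keys())
    placeholders = ', '.join(['%s'] * len(data))
    sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(
        table=table, keys=keys, values=placeholders)
    return sql, tuple(data.values())


# Hypothetical row with three of the table's columns
row = {'name': 'Example Film', 'people': 125000, 'box_office': 3600000000}
sql, params = build_insert('films', row)
print(sql)     # INSERT INTO films(name, people, box_office) VALUES (%s, %s, %s)
print(params)  # ('Example Film', 125000, 3600000000)
```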