Scraping Over 20,000 Rental Listings to Show What Rent in Guangzhou Looks Like

Author: zone7 2018-12-20 11:50:46


Overview

Preface
Statistical results
Crawler implementation
Data analysis implementation
Afterword

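Preface

Before reading this article, I suggest reading the following three articles first, because this one builds on them:

Crawler Power Tools: A First Look (1)
Heard Your Crawler Got Blocked Again? (2)
Scraping Data Without Saving It Is Just Fooling Around (3)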

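Back in August, on a whim, I used a Python crawler to scrape rental data for Shenzhen and wrote the article "Using Python to Show You How High Shenzhen Rents Are", which was well received and widely reposted. Most of my friends live in Guangzhou or Shenzhen, so many of them had long been asking me to analyze rental prices in Guangzhou as well, and this article is the result. The crawler has also been upgraded this time, with many details refined, and the source code is worth reading closely. This analysis collected 23,339 listings across 11 districts of Guangzhou, as shown in the figure below:

Sample data

The districts in the latter half have fewer listings because those districts genuinely have fewer homes for rent. The survey is therefore not especially accurate; treat it as a project for fun.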

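Statistical results

Let's look at the statistical results first and the technical analysis afterwards.

Distribution of listings in Guangzhou (by district)

Tianhe accounts for the largest share of listings, but rents there are far from cheap.

Listing distribution

Unit rent (price per square meter per month, mean)

That is, the price of 1 square meter for 1 month. The larger the block, the higher the price.

Unit rent: yuan per square meter per month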

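Tianhe, Yuexiu and Haizhu all exceed the 50 yuan mark, at 75.042, 64.249 and 59.621 respectively, several times the figures for other districts. Suppose you rent a 20-square-meter room in Tianhe:

75.042 x 20 = 1500.84

Add about 200 for utilities and property fees:

1500.84 + 200 = 1700.84

For an ordinary lifestyle, say 10 yuan for breakfast, 15 for lunch and 15 for dinner every day:

1700.84 + 40 x 30 = 2900.84

So basic day-to-day living costs 2900.84 yuan.

Eating out now and then, buying some clothes each month, transport, dating and going shopping with a girlfriend easily add another 2500:

2900.84 + 2500 = 5400.84

Give each parent 1000: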

5400.84 + 2000 = 7400.84

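On a monthly salary of 10,000 you can still save a little, which is somewhat better than in Shenzhen, although salaries in Guangzhou may not be as high as in Shenzhen.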
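The running totals in this budget can be re-checked with a few lines of Python. This is only a minimal sketch: the function name `monthly_budget` and its parameter names are made up for illustration, while the figures are the ones quoted in the text.

```python
# Monthly budget estimate using the figures quoted in the article.
UNIT_PRICE_TIANHE = 75.042  # yuan per square meter per month (from the article)
ROOM_AREA = 20              # square meters


def monthly_budget(unit_price, area, utilities=200, meals_per_day=40,
                   days=30, leisure=2500, parents=2000):
    """Return the running total after each cost item is added."""
    rent = unit_price * area
    with_utilities = rent + utilities
    with_meals = with_utilities + meals_per_day * days
    with_leisure = with_meals + leisure
    with_parents = with_leisure + parents
    totals = (rent, with_utilities, with_meals, with_leisure, with_parents)
    # Round to cents so floating-point noise does not show up in the output
    return [round(total, 2) for total in totals]


steps = monthly_budget(UNIT_PRICE_TIANHE, ROOM_AREA)
labels = ["rent", "+ utilities", "+ meals", "+ leisure", "+ parents"]
for label, total in zip(labels, steps):
    print(f"{label:<12} {total:.2f}")
```

Changing `unit_price` to another district's figure, such as 64.249 for Yuexiu, gives the same breakdown for that district.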

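Unit rent (price per square meter per day, mean)

That is, the price of 1 square meter for 1 day.

Unit rent: yuan per square meter per day

Haha, get a feel for what *** is like. [facepalm]

Floor plans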

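Listings are mainly 3-bedroom, 2-living-room and 2-bedroom, 2-living-room units. Teaming up with friends to rent is the best choice; sharing with strangers can lead to all sorts of uncomfortable situations. The larger the font, the more listings have that floor plan.

Floor plans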

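Rental area statistics

Listings of 30 to 90 square meters make up the majority, so for now the practical option is to rent with a few friends and share the cost.

Rental area statistics

Word cloud of listing descriptions

This is built from the scraped listing descriptions; the larger the font, the more often the word appears. Words such as 住家 (family home), 全套 (fully equipped), 豪華 (luxury) and 齊全 (complete) take up a large share, which suggests that the furnishings and appliances are generally complete.

Listing descriptions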

Crawler technology analysis

Request libraries: scrapy, requests
HTML parsing: BeautifulSoup
Word cloud: wordcloud
Data visualization: pyecharts
Database: MongoDB
Database driver: pymongo

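Crawler implementation

Unlike the previous article, this one uses the scrapy framework to crawl the data, and several aspects have been improved; for example, the URL of every page is now generated automatically.

On the Fang site (fang.com), the URL of a district's first page has a different form from the URLs of its other pages, but the two follow a pattern, so each URL has to be built by joining its parts.

First-page URL example:

# First page
http://gz.zu.fang.com/house-a073/

Other pages:

# Second page
http://gz.zu.fang.com/house-a073/i32/
# Third page
http://gz.zu.fang.com/house-a073/i33/
# Fourth page
http://gz.zu.fang.com/house-a073/i34/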
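The URL pattern shown in these examples can be sketched as a small helper. `build_page_urls` and its parameters are hypothetical names used only for illustration and are not part of the original spider. The sketch assumes that page N (for N of 2 or more) is the first-page URL followed by `i3N/`, as in the examples, and it does not cover the special case of the unrestricted listing URL.

```python
def build_page_urls(first_page_url, total_pages):
    """Return the URLs of every page in one district, first page included.

    Assumes page N (N >= 2) lives at <first_page_url>i3N/, following the
    example URLs in the article.
    """
    if total_pages < 1:
        return []
    if not first_page_url.endswith("/"):
        first_page_url += "/"
    urls = [first_page_url]
    for page in range(2, total_pages + 1):
        urls.append(f"{first_page_url}i3{page}/")
    return urls


for url in build_page_urls("http://gz.zu.fang.com/house-a073/", 4):
    print(url)
```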

First, parse the first-page URLs

# Imports used by the spider methods in this article
from bs4 import BeautifulSoup
from scrapy import Request

def head_url_callback(self, response):
    soup = BeautifulSoup(response.body, "html5lib")
    # The dl tag that holds the URL of each district
    dl = soup.find_all("dl", attrs={"id": "rentid_D04_01"})
    my_as = dl[0].find_all("a")  # all a tags inside that dl tag
    for my_a in my_as:
        if my_a.text == "不限":  # "不限" (no district restriction) is handled specially
            self.headUrlList.append(self.baseUrl)
            self.allUrlList.append(self.baseUrl)
            continue
        if "周邊" in my_a.text:  # skip entries for surrounding areas ("周邊")
            continue
        # print(my_a["href"])
        # print(my_a.text)
        self.allUrlList.append(self.baseUrl + my_a["href"])
        self.headUrlList.append(self.baseUrl + my_a["href"])
    print(self.allUrlList)
    url = self.headUrlList.pop(0)
    yield Request(url, callback=self.all_url_callback, dont_filter=True)

Next, parse the non-first-page URLs

The total number of pages in each district has to be read first; only then can the URL of each page be built.

# Build the URLs of the remaining pages from each district's first-page URL
def all_url_callback(self, response):  # parse and build every URL that needs to be crawled
    soup = BeautifulSoup(response.body, "html5lib")
    # The div whose first span holds the page count
    div = soup.find_all("div", attrs={"id": "rentid_D10_01"})
    span = div[0].find_all("span")  # all span tags inside that div
    span_text = span[0].text
    # Drop the first and last characters of the span text to get the page count
    total_pages = int(span_text[1:-1])
    # Page 1 is already in allUrlList, so start from page 2
    for page in range(2, total_pages + 1):
        if self.baseUrl == response.url:
            # The unrestricted listing uses an extra "house/" path segment
            self.allUrlList.append(response.url + "house/i3" + str(page) + "/")
            continue
        self.allUrlList.append(response.url + "i3" + str(page) + "/")
    if len(self.headUrlList) == 0:
        # Every district has been processed: start crawling the listing pages
        url = self.allUrlList.pop(0)
        yield Request(url, callback=self.parse, dont_filter=True)
    else:
        # Move on to the next district's first page
        url = self.headUrlList.pop(0)
        yield Request(url, callback=self.all_url_callback, dont_filter=True)
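The callback reads the page count by slicing the first and last characters off the span text. A slightly more defensive variant is sketched here with a regular expression. Note the assumptions: the helper name `parse_total_pages` is hypothetical, and the sample text "共100頁" ("100 pages in total") is only a guess at what the site's span looks like, inferred from the slicing in the spider.

```python
import re


def parse_total_pages(span_text):
    """Extract the page count from pagination text such as "共100頁".

    The exact text format is an assumption; the spider in the article
    gets the same number with span_text[1:-1].
    """
    match = re.search(r"\d+", span_text)
    if match is None:
        raise ValueError(f"no page count found in {span_text!r}")
    return int(match.group())


sample = "共100頁"  # hypothetical sample text
print(parse_total_pages(sample))
```

Unlike the slice, the regular expression still works if the text gains surrounding whitespace or extra characters, and it fails with a clear error when no number is present.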

Finally, parse the data on a page

def parse(self, response):  # parse the data on one page
    self.logger.info("==========================")
    soup = BeautifulSoup(response.body, "html5lib")
    divs = soup.find_all("dd", attrs={"class": "info rel"})  # the dd tags that hold each listing
    for div in divs:
        ps = div.find_all("p")
        # Some listings are incomplete and some entries are inserted ads,
        # so expected tags may be missing; skip such entries on error
        try:
            # Every p tag carries a piece of the information we want
            for index, p in enumerate(ps):
                text = p.text.strip()
                print(text)  # check that this is the information we want
            roomMsg = ps[1].text.split("|")
            area = roomMsg[2].strip()[:-1]  # drop the trailing unit character
            price_text = ps[-1].text.strip()
            item = RenthousescrapyItem()  # the project's scrapy Item class
            item["title"] = ps[0].text.strip()
            item["rooms"] = roomMsg[1].strip()
            item["area"] = int(float(area))
            item["price"] = int(price_text[:-3])  # drop the 3-character unit suffix
            item["address"] = ps[2].text.strip()
            item["traffic"] = ps[3].text.strip()
            if (self.baseUrl + "house/") in response.url:  # pages from the unrestricted listing
                item["region"] = "不限"
            else:
                item["region"] = ps[2].text.strip()[:2]
            item["direction"] = roomMsg[3].strip()
            print(item)
            yield item
        except Exception:
            print("Oops, an exception occurred")
            continue
    if len(self.allUrlList) != 0:
        url = self.allUrlList.pop(0)
        yield Request(url, callback=self.parse, dont_filter=True)
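The field extraction inside `parse` can be tried without scrapy by isolating the string handling. The sketch below is illustrative only: `parse_room_message` is a hypothetical helper, and the sample strings are invented to match what the spider's indexing implies (a "|"-separated line whose second, third and fourth parts are the floor plan, the area with a one-character unit, and the orientation, plus a price ending in a three-character unit).

```python
def parse_room_message(room_text, price_text):
    """Split a listing's summary line and price into typed fields.

    Mirrors the indexing used in the spider's parse method; the input
    format is an assumption based on that indexing.
    """
    parts = [part.strip() for part in room_text.split("|")]
    if len(parts) < 4:
        raise ValueError(f"unexpected listing summary: {room_text!r}")
    return {
        "rooms": parts[1],
        "area": int(float(parts[2][:-1])),      # drop the unit character
        "direction": parts[3],
        "price": int(price_text.strip()[:-3]),  # drop the 3-character unit
    }


# Hypothetical sample values, not taken from the scraped data
print(parse_room_message("整租 | 3室2廳 | 90㎡ | 朝南", " 4500元/月 "))
```

Stripping each part before slicing matters: if the unit is cut from a string that still carries surrounding whitespace, the slice can leave the unit in place and the numeric conversion fails.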

Data analysis implementation

The statistics are computed mainly with pymongo aggregation operations and then displayed with the charting libraries.

Data analysis:

    # Unit rent for one district (yuan per square meter)
    def getAvgPrice(self, region):
        areaPinYin = self.getPinyin(region=region)
        collection = self.zfdb[areaPinYin]
        totalPrice = collection.aggregate([{'$group': {'_id': '$region', 'total_price': {'$sum': '$price'}}}])
        totalArea = collection.aggregate([{'$group': {'_id': '$region', 'total_area': {'$sum': '$area'}}}])
        totalPrice2 = list(totalPrice)[0]["total_price"]
        totalArea2 = list(totalArea)[0]["total_area"]
        return totalPrice2 / totalArea2

    # Monthly price of one square meter in each district
    def getTotalAvgPrice(self):
        totalAvgPriceDirList = []
        for region in self.getAreaList():
            avgPrice = round(self.getAvgPrice(region), 3)
            totalAvgPriceDirList.append({"value": avgPrice, "name": region + "  " + str(avgPrice)})
        return totalAvgPriceDirList

    # Daily price of one square meter in each district (a month is taken as 30 days)
    def getTotalAvgPricePerDay(self):
        totalAvgPriceList = []
        for region in self.getAreaList():
            avgPrice = self.getAvgPrice(region)
            totalAvgPriceList.append(round(avgPrice / 30, 3))
        return (self.getAreaList(), totalAvgPriceList)

    # Number of sampled listings in each district
    def getAnalycisNum(self):
        analycisList = []
        for region in self.getAreaList():
            collection = self.zfdb[self.pinyinDir[region]]
            print(region)
            totalNum = collection.aggregate([{'$group': {'_id': '', 'total_num': {'$sum': 1}}}])
            totalNum2 = list(totalNum)[0]["total_num"]
            analycisList.append(totalNum2)
        return (self.getAreaList(), analycisList)

    # Share of listings in each district
    def getAreaWeight(self):
        result = self.zfdb.rent.aggregate([{'$group': {'_id': '$region', 'weight': {'$sum': 1}}}])
        areaName = []
        areaWeight = []
        for item in result:
            if item["_id"] in self.getAreaList():
                areaWeight.append(item["weight"])
                areaName.append(item["_id"])
                print(item["_id"])
                print(item["weight"])
        return (areaName, areaWeight)

    # Fetch listing titles, used to build the word cloud
    def getTitle(self):
        collection = self.zfdb["rent"]
        queryArgs = {}
        projectionFields = {'_id': False, 'title': True}  # a dict selects the fields to return
        searchRes = collection.find(queryArgs, projection=projectionFields).limit(1000)
        content = ''
        for result in searchRes:
            print(result["title"])
            content += result["title"]
        return content

    # Count listings by floor plan (for example, 3 bedrooms and 2 living rooms)
    def getRooms(self):
        results = self.zfdb.rent.aggregate([{'$group': {'_id': '$rooms', 'weight': {'$sum': 1}}}])
        roomList = []
        weightList = []
        for result in results:
            roomList.append(result["_id"])
            weightList.append(result["weight"])
        return (roomList, weightList)

    # Count listings by floor area
    def getAcreage(self):
        # (lower bound, exclusive; upper bound, inclusive; chart label)
        ranges = [
            (0, 30, "0-30平方米"),
            (30, 60, "30-60平方米"),
            (60, 90, "60-90平方米"),
            (90, 120, "90-120平方米"),
            (120, 200, "120-200平方米"),
            (200, 300, "200-300平方米"),
            (300, 400, "300-400平方米"),
            (400, 10000, "400+平方米"),
        ]
        attr = []
        value = []
        for low, high, label in ranges:
            rows = list(self.zfdb.rent.aggregate([
                {'$match': {'area': {'$gt': low, '$lte': high}}},
                {'$group': {'_id': '', 'count': {'$sum': 1}}}
            ]))
            attr.append(label)
            # An empty result means no listing falls in this range
            value.append(rows[0]["count"] if rows else 0)
        return (attr, value)
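To make the two main statistics concrete without a MongoDB instance, here is a plain-Python sketch of the same calculations. The function names, the `AREA_EDGES` and `AREA_LABELS` constants, and the three sample listings are all invented for illustration. The unit price is total rent divided by total area, as in `getAvgPrice`, and the area ranges are open at the lower bound and closed at the upper bound, matching the `$gt` / `$lte` filters (with the last range left unbounded here).

```python
from bisect import bisect_left

# Upper bounds (inclusive) of every area range except the last, open-ended one
AREA_EDGES = [30, 60, 90, 120, 200, 300, 400]
AREA_LABELS = ["0-30", "30-60", "60-90", "90-120",
               "120-200", "200-300", "300-400", "400+"]


def average_unit_price(listings):
    """Total rent divided by total area: yuan per square meter per month."""
    total_price = sum(item["price"] for item in listings)
    total_area = sum(item["area"] for item in listings)
    if total_area == 0:
        raise ValueError("no area to divide by")
    return total_price / total_area


def count_by_area(listings):
    """Count listings per area range (lower bound exclusive, upper inclusive)."""
    counts = [0] * len(AREA_LABELS)
    for item in listings:
        if item["area"] <= 0:
            continue  # the MongoDB filters also require area > 0
        counts[bisect_left(AREA_EDGES, item["area"])] += 1
    return AREA_LABELS, counts


# Hypothetical sample data
sample = [
    {"price": 3000, "area": 60},
    {"price": 1500, "area": 20},
    {"price": 9000, "area": 120},
]
per_month = average_unit_price(sample)
print(per_month, round(per_month / 30, 3))  # monthly and daily unit price
print(count_by_area(sample))
```

Dividing the summed rent by the summed area weights large homes more heavily than averaging each listing's own price per square meter would, which is worth keeping in mind when reading the district figures.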

Data display:

    # Module-level imports assumed by the methods below (pyecharts 0.5.x style API):
    # from os import path
    # import jieba.analyse
    # import matplotlib.pyplot as plt
    # from wordcloud import WordCloud, ImageColorGenerator
    # imread: an image-reading function; the original import is not shown

    # Show a pie chart
    def showPie(self, title, attr, value):
        from pyecharts import Pie
        pie = Pie(title)
        pie.add("aa", attr, value, is_label_show=True)
        pie.render()

    # Show a treemap
    def showTreeMap(self, title, data):
        from pyecharts import TreeMap
        treemap = TreeMap(title, width=1200, height=600)
        treemap.add("廣州", data, is_label_show=True, label_pos='inside', label_text_size=19)
        treemap.render()

    # Show a bar chart
    def showLine(self, title, attr, value):
        from pyecharts import Bar
        bar = Bar(title)
        bar.add("廣州", attr, value, is_convert=False, is_label_show=True, label_text_size=18, is_random=True,
                # xaxis_interval=0, xaxis_label_textsize=9,
                legend_text_size=18, label_text_color=["#000"])
        bar.render()

    # Show a word cloud
    def showWorkCloud(self, content, image_filename, font_filename, out_filename):
        d = path.dirname(__file__)
        # content = open(path.join(d, filename), 'rb').read()
        # Keyword extraction based on TF-IDF: topK is the number of
        # highest-weighted keywords to return (default 20), and withWeight
        # controls whether the weights are returned as well
        tags = jieba.analyse.extract_tags(content, topK=100, withWeight=False)
        text = " ".join(tags)
        # Background image that defines the shape and colors
        img = imread(path.join(d, image_filename))
        # A Chinese font must be specified, otherwise the text is garbled
        wc = WordCloud(font_path=font_filename,
                       background_color='black',
                       # shape of the word cloud
                       mask=img,
                       # maximum number of words
                       max_words=400,
                       # maximum font size; defaults to the image height if not set
                       max_font_size=100,
                       # canvas width and height; ignored when mask is set
                       # width=600,
                       # height=400,
                       margin=2,
                       # share of words laid out horizontally; the default is 0.9,
                       # which leaves 0.1 for vertical words
                       prefer_horizontal=0.9
                       )
        wc.generate(text)
        img_color = ImageColorGenerator(img)
        plt.imshow(wc.recolor(color_func=img_color))
        plt.axis("off")
        plt.show()
        wc.to_file(path.join(d, out_filename))

    # Show a word cloud with pyecharts
    def showPyechartsWordCloud(self, attr, value):
        from pyecharts import WordCloud
        wordcloud = WordCloud(width=1300, height=620)
        wordcloud.add("", attr, value, word_size_range=[20, 100])
        wordcloud.render()
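`showPyechartsWordCloud` takes two parallel lists, `attr` (the words) and `value` (their counts). The article does not show how those lists are built, so the following is only one possible sketch. `word_frequencies` is a hypothetical helper, and it assumes the text has already been segmented into words (for example with jieba).

```python
from collections import Counter


def word_frequencies(words, top_n=100):
    """Return (attr, value): the top_n most frequent words and their counts."""
    counter = Counter(word.strip() for word in words if word.strip())
    most_common = counter.most_common(top_n)
    attr = [word for word, _ in most_common]
    value = [count for _, count in most_common]
    return attr, value


# Hypothetical, already-segmented words from listing descriptions
words = ["豪華", "齊全", "豪華", "全套", " ", "豪華", "齊全"]
attr, value = word_frequencies(words, top_n=2)
print(attr, value)
```

The two lists can then be passed straight to a method with the `(attr, value)` signature used by the display functions.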

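Afterword

It has been three or four months since my last rental market analysis, and my technical skills have improved somewhat in that time, so working hard at coding really is the shortcut to growth. Finally, to cope with changes in external conditions, we should keep building our own hard skills; that is how we improve our ability to get by.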

Editor: 未麗燕 Source: zone7

Python, rent analysis, crawler
