A Hands-On Guide to Building an ES Search Engine with Flask (Preparation)

Author: Python進階者 2021-06-18 09:02:26




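1 Introduction

Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library.

So the question we care about is how to connect Elasticsearch to Python (why does everything have to tie back to Python?).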

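2 Interacting from Python

Naturally, Python has a client library for talking to Elasticsearch. The snippets in this article pass the doc_type argument, so they assume a 7.x release of the client; the 8.x client no longer accepts it.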

pip install elasticsearch

Initialize a client object connected to Elasticsearch.

def __init__(self, index_type: str, index_name: str, ip="127.0.0.1"): 
 
    # self.es = Elasticsearch([ip], http_auth=('username', 'password'), port=9200) 
    self.es = Elasticsearch("localhost:9200") 
    self.index_type = index_type 
    self.index_name = index_name

The default port is 9200. Before initializing, make sure an Elasticsearch instance is already running locally.
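Before indexing anything, it can help to confirm that the node is actually reachable. This is a minimal sketch, assuming the client's ping() method (which returns True or False in elasticsearch-py). The helper name wait_for_es and the retry settings are made up for illustration.

```python
import time


def wait_for_es(es, retries: int = 5, delay: float = 1.0) -> bool:
    """Return True once es.ping() succeeds, or False after `retries` failed attempts.

    `es` is expected to be an elasticsearch.Elasticsearch client; any object
    with a ping() method that returns a bool works as well.
    """
    for attempt in range(retries):
        if es.ping():
            return True
        if attempt < retries - 1:
            time.sleep(delay)  # wait a little before trying again
    return False
```

With the client from the constructor above, this could be called as wait_for_es(self.es) before any index or search request.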

Fetch a document by ID

def get_doc(self, uid): 
    return self.es.get(index=self.index_name, id=uid)

Insert documents

def insert_one(self, doc: dict): 
    self.es.index(index=self.index_name, doc_type=self.index_type, body=doc) 
 
def insert_array(self, docs: list): 
    for doc in docs: 
        self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

Search documents

def search(self, query, count: int = 30): 
    dsl = { 
        "query": { 
            "multi_match": { 
                "query": query, 
                "fields": ["title", "content", "link"] 
            } 
        }, 
        "highlight": { 
            "fields": { 
                "title": {} 
            } 
        } 
    } 
    match_data = self.es.search(index=self.index_name, body=dsl, size=count) 
    return match_data 
 
def __search(self, query: dict, count: int = 20):  # count: number of hits to return
    results = [] 
    params = { 
        'size': count 
    } 
    match_data = self.es.search(index=self.index_name, body=query, params=params) 
    for hit in match_data['hits']['hits']: 
        results.append(hit['_source']) 
 
    return results
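The search response is a nested dictionary, so the caller usually has to pick what it needs out of hits.hits. The sketch below shows one way to build the same DSL and flatten a response. build_dsl and flatten_hits are hypothetical helper names, and the code only assumes the usual response shape (a _source per hit plus an optional highlight entry).

```python
def build_dsl(query: str, fields=("title", "content", "link")) -> dict:
    """Build a multi_match query that also highlights matches in the title."""
    return {
        "query": {"multi_match": {"query": query, "fields": list(fields)}},
        "highlight": {"fields": {"title": {}}},
    }


def flatten_hits(response: dict) -> list:
    """Return one dict per hit: the stored document plus its score and highlight."""
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        item = dict(hit.get("_source", {}))
        item["score"] = hit.get("_score")
        # "highlight" is only present when a highlighted field matched
        item["title_highlight"] = hit.get("highlight", {}).get("title", [])
        results.append(item)
    return results
```

Keeping the DSL construction separate from the request makes both pieces easy to check without a running cluster.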

Delete the index (and all documents in it)

def delete_index(self): 
    try: 
        self.es.indices.delete(index=self.index_name) 
    except: 
        pass

Wrapping all of this in a search class makes it easier to call, so here is the whole thing.

from elasticsearch import Elasticsearch 
 
 
class elasticSearch(): 
 
    def __init__(self, index_type: str, index_name: str, ip="127.0.0.1"): 
 
        # self.es = Elasticsearch([ip], http_auth=('elastic', 'password'), port=9200) 
        self.es = Elasticsearch("localhost:9200") 
        self.index_type = index_type 
        self.index_name = index_name 
 
    def create_index(self): 
        if self.es.indices.exists(index=self.index_name) is True: 
            self.es.indices.delete(index=self.index_name) 
        self.es.indices.create(index=self.index_name, ignore=400) 
 
    def delete_index(self): 
        try: 
            self.es.indices.delete(index=self.index_name) 
        except: 
            pass 
 
    def get_doc(self, uid): 
        return self.es.get(index=self.index_name, id=uid) 
 
    def insert_one(self, doc: dict): 
        self.es.index(index=self.index_name, doc_type=self.index_type, body=doc) 
 
    def insert_array(self, docs: list): 
        for doc in docs: 
            self.es.index(index=self.index_name, doc_type=self.index_type, body=doc) 
 
    def search(self, query, count: int = 30): 
        dsl = { 
            "query": { 
                "multi_match": { 
                    "query": query, 
                    "fields": ["title", "content", "link"] 
                } 
            }, 
            "highlight": { 
                "fields": { 
                    "title": {} 
                } 
            } 
        } 
        match_data = self.es.search(index=self.index_name, body=dsl, size=count) 
        return match_data

Now let's try inserting the data stored in MongoDB into ES.

from datetime import datetime

import pymongo

from app.elasticsearchClass import elasticSearch

client = pymongo.MongoClient('127.0.0.1', 27017)
db = client['spider']
# Exclude MongoDB's _id field from the returned documents
sheet = db.get_collection('Spider').find({}, {'_id': 0})

es = elasticSearch(index_type="spider_data", index_name="spider")
es.create_index()  # note: create_index() drops the index first if it already exists

for i in sheet:
    data = {
        'title': i["title"],
        'content': i["data"],
        'link': i["link"],
        'create_time': datetime.now(),
    }
    es.insert_one(doc=data)

To check the result in ES, start the elasticsearch-head plugin.

If it was installed via npm, cd into its root directory and run npm run start.

Then open http://localhost:9100/ locally.

The newly added spider documents are indeed there.
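If elasticsearch-head is not installed, the same check can be done from Python. This is a minimal sketch, assuming the client's count API, whose response carries a count field. doc_count is a hypothetical helper name.

```python
def doc_count(es, index_name: str) -> int:
    """Return how many documents the index currently holds."""
    # `es` is an elasticsearch.Elasticsearch client (or any object with a
    # compatible count() method)
    response = es.count(index=index_name)
    return response["count"]
```

With the wrapper class above this would be called as doc_count(es.es, "spider"), where es.es is the client held by the wrapper. A document indexed a moment ago may not be counted until the index refreshes, which by default happens about once a second.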

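3 Crawling data into storage

ES search needs data behind it, and large volumes of data often come from crawlers.

To save time, we'll write the simplest possible crawler and scrape Baidu Baike.

Keeping it crude, first collect a large number of URLs recursively.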

import re

import requests

exist_urls = []  # URLs that have already been requested
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
}


def get_link(url):
    """Return the /item/ names linked from the page, or [] if the request fails."""
    try:
        response = requests.get(url=url, headers=headers, timeout=10)
        response.encoding = 'UTF-8'
        html = response.text
        link_lists = re.findall('.*?<a target=_blank href="/item/([^:#=<>]*?)".*?</a>', html)
        return link_lists
    except Exception:
        return []
    finally:
        exist_urls.append(url)


# While the crawl depth is below 10, call main recursively to follow the links found on each page
def main(start_url, depth=1):
    link_lists = get_link(start_url)
    if link_lists:
        # Build full URLs first so they can be compared against the visited URLs
        full_urls = {'https://baike.baidu.com/item/' + name for name in link_lists}
        unique_lists = list(full_urls - set(exist_urls))

        with open('url.txt', 'a+') as f:
            for unique_url in unique_lists:
                f.write(unique_url + '\n')

        if depth < 10:
            for unique_url in unique_lists:
                # A deeper call may already have visited this URL
                if unique_url not in exist_urls:
                    main(unique_url, depth + 1)


if __name__ == '__main__':
    start_url = 'https://baike.baidu.com/item/%E7%99%BE%E5%BA%A6%E7%99%BE%E7%A7%91'
    main(start_url)
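Recursion is not the only way to walk the link graph. The sketch below is an alternative rather than the code above: a breadth-first crawl with a depth limit and a visited set. The page fetcher is passed in as fetch_links (a hypothetical parameter) so the traversal can be tried without touching the network; in practice it could wrap a function such as get_link.

```python
from collections import deque


def crawl(start_url, fetch_links, max_depth=10):
    """Breadth-first crawl; returns URLs in the order they were discovered.

    Pages found at depth `max_depth` are recorded but not fetched themselves.
    """
    seen = {start_url}
    order = [start_url]
    queue = deque([(start_url, 1)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth limit reached: do not expand this page
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order
```

Because every URL enters the queue at most once, the amount of work is bounded by the number of distinct pages reached rather than by the number of paths leading to them.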

Once all the URLs have been saved to url.txt, start the tasks.

# parse.py
from celery import Celery
import requests
from lxml import etree
import pymongo

app = Celery('tasks', broker='redis://localhost:6379/2')
client = pymongo.MongoClient('localhost', 27017)
db = client['baike']


@app.task
def get_url(link):
    item = {}
    headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'}
    res = requests.get(link, headers=headers, timeout=10)
    res.encoding = 'UTF-8'
    doc = etree.HTML(res.text)
    # Text nodes of the entry's summary paragraphs
    content = doc.xpath("//div[@class='lemma-summary']/div[@class='para']//text()")
    print(res.status_code)
    print(link, '\t', '++++++++++++++++++++')
    item['link'] = link
    # Strip spaces, tabs and line breaks from the joined summary text
    data = ''.join(content).replace(' ', '').replace('\t', '').replace('\n', '').replace('\r', '')
    item['data'] = data
    # insert_one replaces Collection.insert, which PyMongo 4 removed
    if db['Baike'].insert_one(dict(item)).acknowledged:
        print("is OK ...")
    else:
        print('Fail')
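The chained replace calls in get_url strip spaces, tabs and line breaks from the summary text. As a small illustration, the same cleanup can be written with a single regular expression. clean_text is a hypothetical helper, and the list of fragments stands in for what the xpath call returns.

```python
import re

# The same four characters the replace() chain removes
_WHITESPACE = re.compile(r"[ \t\n\r]+")


def clean_text(parts) -> str:
    """Join text fragments and remove spaces, tabs and line breaks."""
    return _WHITESPACE.sub("", "".join(parts))
```

Removing every space suits Chinese text, which does not separate words with spaces; for English pages the words would run together, so collapsing whitespace to a single space would be the safer choice there.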

run.py sets it all in motion

from parse import get_url 
 
def main(url): 
    result = get_url.delay(url) 
    return result 
 
def run(): 
    with open('./url.txt', 'r') as f: 
        for url in f.readlines(): 
            main(url.strip('\n')) 
 
if __name__ == '__main__': 
    run()

In a terminal, enter

celery -A parse worker -l info -P gevent -c 10

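There you go: a Celery task queue running in gevent mode, where -c 10 sets the concurrency to 10 (greenlets rather than OS threads under gevent), and the crawl races along.

Distributed? Just add more machines: copy the code to each target server and let them share the queue through redis to crawl cooperatively.

Here the data is stored in MongoDB first (a personal habit). You could write straight to ES instead, but inserting documents one at a time is worryingly slow (optimizations will be covered later).

Use the earlier example to import the data from Mongo into ES, and that's it.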
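One common way to avoid a round trip per document is the client's bulk helper. The sketch below is an assumption about how the import could be batched, not the author's implementation. to_actions and bulk_import are hypothetical names, the field mapping mirrors the earlier import script, and elasticsearch.helpers.bulk is only imported when bulk_import is called.

```python
from datetime import datetime


def to_actions(records, index_name: str):
    """Yield one bulk action per MongoDB record, in the format helpers.bulk accepts."""
    for record in records:
        yield {
            "_index": index_name,
            "_source": {
                # .get() tolerates records that lack a field (assumption:
                # a missing field should become an empty string)
                "title": record.get("title", ""),
                "content": record.get("data", ""),
                "link": record.get("link", ""),
                "create_time": datetime.now(),
            },
        }


def bulk_import(es, records, index_name: str):
    """Send all records through the bulk helper; returns (success_count, errors)."""
    from elasticsearch import helpers  # lazy import; requires the client package

    return helpers.bulk(es, to_actions(records, index_name))
```

With the objects from the earlier script, a call could look like bulk_import(es.es, db.get_collection('Spider').find({}, {'_id': 0}), "spider"). Because to_actions is a generator, the records are streamed rather than loaded into memory at once.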

That completes a simple data crawl.

Now that ES has data in it, the next step is the Flask web side. Django and FastAPI are excellent too, of course; use whichever you like.

Editor: 姜華; Source: Python爬蟲與數據挖掘
