Building a Crawler from 0 to 1 with TypeScript and Node

2022-01-27 13:02:46

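Preface

Today we will build a crawler in TypeScript. Which site should we target? After searching online and weighing a few candidates, we settled on this one:

https://www.hanju.run/

It is a video site, and our goal is to scrape the playback links of its videos. Let's start with the first step.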

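Step 1

As the saying goes, the first step is always the hardest. For this project, though, the opposite is true. You need to do the following:

Create a project folder.
Run the following command to initialize the project: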

npm init -y

Install TypeScript locally:

npm install typescript -D

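Next, generate the TypeScript configuration file (tsconfig.json). Because TypeScript was installed locally rather than globally, run the compiler through npx:

npx tsc --init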

Install ts-node locally so that TypeScript files can be run directly from the command line:

npm install -D ts-node

Create a src folder inside the project folder.

Then create a crawler.ts file inside the src folder.

Add a shortcut start command to package.json:

"scripts": {
    "dev-t": "ts-node ./src/crawler.ts"
  }

Step 2

Now for the hands-on part. The crawler.ts file created above is where the real work happens.

We first need to import the following dependencies:

import superagent from "superagent";
import cheerio from "cheerio";
import fs from "fs";
import path from "path";

Install them as follows.

superagent fetches the HTML content of a remote URL.

npm install superagent

cheerio lets us query page nodes using jQuery syntax.

npm install cheerio

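The remaining two dependencies, fs and path, are built into Node and can be imported directly.

With the dependencies installed, you will notice red error markers on the imports. The reason is that superagent and cheerio are written in JavaScript, not TypeScript, while our project is a TypeScript one. We therefore need a "translation" layer, known as type definition files (with the .d.ts extension). Install them with the following commands: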

npm install -D @types/superagent

npm install -D @types/cheerio
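
To make the idea of a type definition more concrete, here is a minimal sketch. It does not use the real superagent or cheerio typings: legacyLib and LegacyLibTypes are hypothetical names that stand in for an untyped JavaScript module and for the shape that a .d.ts file would describe.

```typescript
// Hypothetical stand-in for a library written in plain JavaScript:
// the compiler knows nothing about its shape, so it is typed as `any`.
const legacyLib: any = {
  add: (a: number, b: number) => a + b,
};

// What a type definition contributes: a description of the library's API.
// (Real .d.ts files use `declare` statements; an interface is used here
// so that the sketch runs as a single file.)
interface LegacyLibTypes {
  add(a: number, b: number): number;
}

// From here on, the compiler checks every call against the description.
const typedLib: LegacyLibTypes = legacyLib;

const sum: number = typedLib.add(1, 2);
console.log(sum);
// typedLib.add("1", 2); // would be rejected at compile time
```

The runtime behavior is unchanged; the type information only affects what the compiler and the editor can check.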

Now let's walk through the source code carefully.

With both dependencies installed, we create a Crawler class and instantiate it.

import superagent from "superagent";
import cheerio from "cheerio";
import fs from "fs";
import path from "path";
class Crawler {
  constructor() {}
}
const crawler = new Crawler();

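We pick the URL to crawl and assign it to a private field. We then wrap the request in a getRawHtml method that fetches the content of that URL.

getRawHtml uses the async/await keywords to fetch the page content asynchronously and return it.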

import superagent from "superagent";
import cheerio from "cheerio";
import fs from "fs";
import path from "path";
class Crawler {
  private url = "https://www.hanju.run/play/39221-4-0.html";
  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }
  async initSpiderProcess() {
    const html = await this.getRawHtml();
  }
  constructor() {
    this.initSpiderProcess();
  }
}
const crawler = new Crawler();
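
For readers new to async/await, the following is a small sketch of the same flow without any network access. fakeFetch is a hypothetical stand-in for superagent.get, and the try/catch is an addition that the article's code does not have; it shows where a failed request would surface.

```typescript
// Hypothetical stand-in for superagent.get(url): resolves with a fake
// response after a short delay instead of touching the network.
function fakeFetch(url: string): Promise<{ text: string }> {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ text: `<title>page at ${url}</title>` }), 10);
  });
}

// An async function always returns a Promise; `await` pauses this function
// (not the whole program) until the awaited Promise settles.
async function getRawHtml(url: string): Promise<string> {
  const result = await fakeFetch(url);
  return result.text;
}

async function main(): Promise<string> {
  try {
    const html = await getRawHtml("https://example.com/");
    console.log(html);
    return html;
  } catch (err) {
    // A rejected Promise surfaces here as a thrown error.
    console.error("request failed:", err);
    throw err;
  }
}

const pending = main();
```

Because a constructor cannot be async, the Crawler class starts initSpiderProcess without awaiting it; wrapping the body in a try/catch like the one above is one way to avoid an unhandled rejection.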

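Next, we use cheerio's built-in methods to get the content of the relevant nodes.

We fetch the page content asynchronously with getRawHtml and pass it, as a string, to the getJsonInfo method. Once cheerio.load(html) has processed the HTML, we can select nodes with jQuery syntax. We extract the video's title and link from the page and add them to an object as key-value pairs. Note: we define an interface here that describes the types of those key-value pairs.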

import superagent from "superagent";
import cheerio from "cheerio";
import fs from "fs";
import path from "path";
interface Info {
  name: string;
  url: string;
}
class Crawler {
  private url = "https://www.hanju.run/play/39221-4-0.html";
  getJsonInfo(html: string) {
    const $ = cheerio.load(html);
    const info: Info[] = [];
    const scpt: string = String($(".play>script:nth-child(1)").html());
    const url = unescape(
      scpt.split(";")[3].split("(")[1].split(")")[0].replace(/\"/g, "")
    );
    const name: string = String($("title").html());
    info.push({
      name,
      url,
    });
    const result = {
      time: new Date().getTime(),
      data: info,
    };
    return result;
  }
  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }
  async initSpiderProcess() {
    const html = await this.getRawHtml();
    const info = this.getJsonInfo(html);
  }
  constructor() {
    this.initSpiderProcess();
  }
}
const crawler = new Crawler();
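
The chain of split calls in getJsonInfo is easier to follow on a concrete input. The sketch below runs the same steps on a made-up inline script. sampleScript is an assumption about what the page's script roughly looks like, not a copy of the real page, and extractUrl is a hypothetical helper. It uses decodeURIComponent in place of the deprecated unescape, which behaves the same for ordinary percent-encoded URLs such as this one.

```typescript
// Hypothetical inline script; the real page's script may differ.
const sampleScript =
  'var a="1";var b="2";var c="3";' +
  'var now=unescape("https%3A%2F%2Fexample.com%2Fvideo%2Findex.m3u8");' +
  'var pn="x";';

// Same steps as getJsonInfo, with guards for an unexpected script layout.
function extractUrl(script: string): string | null {
  // 1. The fourth statement is expected to hold the encoded URL.
  const statement = script.split(";")[3];
  if (statement === undefined) return null;
  // 2. Keep what follows the opening parenthesis of unescape(...).
  const afterParen = statement.split("(")[1];
  if (afterParen === undefined) return null;
  // 3. Cut at the closing parenthesis and strip the quotes.
  const encoded = afterParen.split(")")[0].replace(/"/g, "");
  // 4. Decode the percent-encoded URL.
  return decodeURIComponent(encoded);
}

console.log(extractUrl(sampleScript));
// https://example.com/video/index.m3u8
console.log(extractUrl("no statements here"));
// null
```

Position-based parsing like this is brittle: if the site reorders its script statements, index 3 will point at the wrong one, so the guards only protect against missing pieces, not against a changed layout.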

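First, create a data folder in the project root. The scraped content is then stored in a url.json file inside that folder (the file itself is generated automatically).

We wrap this logic in a getJsonContent method. It uses path.resolve to build the file path, fs.readFileSync to read the file content, and fs.writeFileSync to write the content back. Note: we also define two more interfaces, objJson and InfoResult.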

import superagent from "superagent";
import cheerio from "cheerio";
import fs from "fs";
import path from "path";
interface objJson {
  [propName: number]: Info[];
}
interface Info {
  name: string;
  url: string;
}
interface InfoResult {
  time: number;
  data: Info[];
}
class Crawler {
  private url = "https://www.hanju.run/play/39221-4-0.html";
  getJsonInfo(html: string) {
    const $ = cheerio.load(html);
    const info: Info[] = [];
    const scpt: string = String($(".play>script:nth-child(1)").html());
    const url = unescape(
      scpt.split(";")[3].split("(")[1].split(")")[0].replace(/\"/g, "")
    );
    const name: string = String($("title").html());
    info.push({
      name,
      url,
    });
    const result = {
      time: new Date().getTime(),
      data: info,
    };
    return result;
  }
  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }
  getJsonContent(info: InfoResult) {
    const filePath = path.resolve(__dirname, "../data/url.json");
    let fileContent: objJson = {};
    if (fs.existsSync(filePath)) {
      fileContent = JSON.parse(fs.readFileSync(filePath, "utf-8"));
    }
    fileContent[info.time] = info.data;
    fs.writeFileSync(filePath, JSON.stringify(fileContent));
  }
  async initSpiderProcess() {
    const html = await this.getRawHtml();
    const info = this.getJsonInfo(html);
    this.getJsonContent(info);
  }
  constructor() {
    this.initSpiderProcess();
  }
}
const crawler = new Crawler();
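
The read, merge, and write cycle in getJsonContent can be tried in isolation. The sketch below is a simplified variant under a few assumptions: appendEntry and Store are hypothetical names, the file lives in a temporary directory rather than in data/, and the timestamp is passed in explicitly so the result is predictable.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

interface Info {
  name: string;
  url: string;
}
interface Store {
  [timestamp: number]: Info[];
}

// Read the existing JSON file (if any), add one entry, write it back.
function appendEntry(filePath: string, time: number, data: Info[]): Store {
  let store: Store = {};
  if (fs.existsSync(filePath)) {
    store = JSON.parse(fs.readFileSync(filePath, "utf-8"));
  }
  store[time] = data;
  fs.writeFileSync(filePath, JSON.stringify(store, null, 2));
  return store;
}

// Use a throwaway directory so the sketch does not touch the project.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "crawler-demo-"));
const demoFile = path.join(demoDir, "url.json");

appendEntry(demoFile, 1, [{ name: "first", url: "https://example.com/1.m3u8" }]);
const merged = appendEntry(demoFile, 2, [
  { name: "second", url: "https://example.com/2.m3u8" },
]);
console.log(Object.keys(merged)); // [ '1', '2' ]
```

Each run adds a new key instead of overwriting the file, which is why the generated url.json accumulates one entry per crawl.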

Run the command:

npm run dev-t

Then check the generated file:

{
  "1610738046569": [
    {
      "name": "《復仇者聯盟4：終局之戰》HD1080P中字m3u8在線觀看-韓劇網",
      "url": "https://wuxian.xueyou-kuyun.com/20190728/16820_302c7858/index.m3u8"
    }
  ],
  "1610738872042": [
    {
      "name": "《鋼鐵俠2》HD高清m3u8在線觀看-韓劇網",
      "url": "https://www.yxlmbbs.com:65/20190920/54uIR9hI/index.m3u8"
    }
  ],
  "1610739069969": [
    {
      "name": "《鋼鐵俠2》中英特效m3u8在線觀看-韓劇網",
      "url": "https://tv.youkutv.cc/2019/11/12/mjkHyHycfh0LyS4r/playlist.m3u8"
    }
  ]
}
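
Since the file is keyed by timestamp, picking out the most recent crawl takes only a few lines. The following is a sketch under stated assumptions: latestEntry is a hypothetical helper, and fileText is made-up content in the same shape as url.json rather than the real file.

```typescript
interface Info {
  name: string;
  url: string;
}

// Hypothetical file content, in the same shape as url.json.
const fileText = JSON.stringify({
  "1610738046569": [{ name: "first run", url: "https://example.com/a.m3u8" }],
  "1610738872042": [{ name: "second run", url: "https://example.com/b.m3u8" }],
});

// Return the entry with the largest timestamp key, if there is one.
function latestEntry(json: string): Info[] | undefined {
  const store: Record<string, Info[]> = JSON.parse(json);
  // JSON object keys are strings, so compare them as numbers.
  const keys = Object.keys(store).sort((a, b) => Number(a) - Number(b));
  const newest = keys[keys.length - 1];
  return newest === undefined ? undefined : store[newest];
}

const latest = latestEntry(fileText);
console.log(latest);
```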

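Not Quite the Conclusion

Is that really the end?

No!

We are not done yet.

Look at the pile of code above: it really smells.

We will refactor it using the composite pattern and the singleton pattern in turn.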

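Optimization 1: Composite Pattern

The composite pattern, also called the part-whole pattern, treats a group of similar objects as a single object. It composes objects into a tree structure to represent part-whole hierarchies. It is a structural design pattern that builds a tree of object groups.

The pattern defines a class that contains a group of its own objects and provides ways to modify that group.

In short, it lets you handle complex elements the same way you handle simple ones.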
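
As a minimal sketch of the pattern just described, consider a file and folder tree. FsNode, FileLeaf, and Folder are illustrative names chosen for this example and are not part of the crawler.

```typescript
// Common interface: callers treat leaves and composites alike.
interface FsNode {
  name: string;
  size(): number;
}

// Leaf: a simple element with no children.
class FileLeaf implements FsNode {
  constructor(public name: string, private bytes: number) {}
  size(): number {
    return this.bytes;
  }
}

// Composite: holds a group of FsNode objects and is itself an FsNode.
class Folder implements FsNode {
  private children: FsNode[] = [];
  constructor(public name: string) {}
  add(child: FsNode): this {
    this.children.push(child);
    return this;
  }
  // The composite answers by delegating to its children.
  size(): number {
    return this.children.reduce((total, child) => total + child.size(), 0);
  }
}

const root = new Folder("root")
  .add(new FileLeaf("a.txt", 10))
  .add(
    new Folder("sub")
      .add(new FileLeaf("b.txt", 5))
      .add(new FileLeaf("c.txt", 7))
  );

console.log(root.size()); // 22
```

The caller asks root for its size exactly as it would ask a single file, which is the "complex elements handled like simple ones" idea.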

First, create a combination folder under src, and inside it create two files: crawler.ts and urlAnalyzer.ts.

crawler.ts

crawler.ts is mainly responsible for fetching the page content and writing the result to the file.

import superagent from "superagent";
import fs from "fs";
import path from "path";
// Import without the ".ts" extension; TypeScript resolves it automatically.
import UrlAnalyzer from "./urlAnalyzer";
// Any analyzer passed to the crawler must implement this interface.
export interface Analyzer {
  analyze: (html: string, filePath: string) => string;
}
class Crawler {
  private filePath = path.resolve(__dirname, "../../data/url.json");
  // Fetch the raw HTML of the target page.
  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }
  // Write the analyzer's output to data/url.json.
  writeFile(content: string) {
    fs.writeFileSync(this.filePath, content);
  }
  // Fetch, analyze, and store.
  async initSpiderProcess() {
    const html = await this.getRawHtml();
    const fileContent = this.analyzer.analyze(html, this.filePath);
    this.writeFile(fileContent);
  }
  constructor(private analyzer: Analyzer, private url: string) {
    this.initSpiderProcess();
  }
}
const url = "https://www.hanju.run/play/39257-1-1.html";
const analyzer = new UrlAnalyzer();
new Crawler(analyzer, url);

urlAnalyzer.ts

urlAnalyzer.ts holds the actual logic for extracting content from the page nodes.

import cheerio from "cheerio";
import fs from "fs";
import { Analyzer } from "./crawler";
interface objJson {  
  [propName: number]: Info[];
}
interface InfoResult {  
  time: number;  
  data: Info[];
}
interface Info {  
  name: string;  
  url: string;
}
export default class UrlAnalyzer implements Analyzer {  
  private getJsonInfo(html: string) {  
    const $ = cheerio.load(html);  
    const info: Info[] = [];  
    const scpt: string = String($(".play>script:nth-child(1)").html());  
    const url = unescape(  
      scpt.split(";")[3].split("(")[1].split(")")[0].replace(/\"/g, "")  
    );  
    const name: string = String($("title").html());    
    info.push({    
      name,    
      url,  
    });  
    const result = {   
      time: new Date().getTime(),    
      data: info,  
    };  
    return result;  
  }  
  private getJsonContent(info: InfoResult, filePath: string) {    
    let fileContent: objJson = {};    
    if (fs.existsSync(filePath)) {    
      fileContent = JSON.parse(fs.readFileSync(filePath, "utf-8"));    
    }  
    fileContent[info.time] = info.data;  
    return fileContent;  
  }  
  public analyze(html: string, filePath: string) {  
    const info = this.getJsonInfo(html);  
    console.log(info);  
    const fileContent = this.getJsonContent(info, filePath);    
    return JSON.stringify(fileContent);
  }
}
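
The benefit of this split is that the crawler depends only on the Analyzer interface, so the analysis logic can be swapped without touching the crawling code. The sketch below illustrates this with two hypothetical analyzers (TitleAnalyzer and LengthAnalyzer) and a stand-in function runAnalyzer that plays the role of the crawler; none of these names come from the article's repository, and no network or file access is involved.

```typescript
// Same interface as in crawler.ts.
interface Analyzer {
  analyze: (html: string, filePath: string) => string;
}

// Hypothetical analyzer: extracts only the <title> text with a regex.
class TitleAnalyzer implements Analyzer {
  analyze(html: string, _filePath: string): string {
    const match = /<title>(.*?)<\/title>/i.exec(html);
    return JSON.stringify({ title: match ? match[1] : null });
  }
}

// Hypothetical analyzer: reports the length of the page.
class LengthAnalyzer implements Analyzer {
  analyze(html: string, _filePath: string): string {
    return JSON.stringify({ length: html.length });
  }
}

// Stand-in for the crawler: it only knows about the interface.
function runAnalyzer(analyzer: Analyzer, html: string): string {
  return analyzer.analyze(html, "unused-in-this-sketch.json");
}

const page = "<html><title>Demo</title></html>";
const titleJson = runAnalyzer(new TitleAnalyzer(), page);
const lengthJson = runAnalyzer(new LengthAnalyzer(), page);
console.log(titleJson); // {"title":"Demo"}
console.log(lengthJson); // {"length":32}
```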

You can define a shortcut start command in package.json.

Then start it with npm run dev-c.

"scripts": {  
    "dev-c": "ts-node ./src/combination/crawler.ts"
  },

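Optimization 2: Singleton Pattern

The singleton pattern is one of the simplest design patterns and is commonly introduced through Java. It is a creational pattern that provides a controlled way to create an object.

The pattern involves a single class that is responsible for creating its own object while ensuring that only one object is ever created. The class provides a way to access that unique object directly, without callers having to instantiate the class.

Examples:

1. A class has only one head teacher.
2. Windows is multi-process and multi-threaded, so when a file is being operated on, several processes or threads will inevitably try to operate on it at the same time. All handling of the file must therefore go through a single instance.
3. Device managers are often designed as singletons. For example, if a computer has two printers, output must be coordinated so that both printers do not print the same file.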
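
A minimal sketch of the pattern in TypeScript, borrowing the head teacher example from the list above. HeadTeacher and the name "Ms. Example" are purely illustrative.

```typescript
class HeadTeacher {
  // The single shared instance, created lazily on first access.
  private static instance: HeadTeacher | undefined;

  // A private constructor prevents `new HeadTeacher()` from outside.
  private constructor(public readonly name: string) {}

  static getInstance(): HeadTeacher {
    if (!HeadTeacher.instance) {
      HeadTeacher.instance = new HeadTeacher("Ms. Example");
    }
    return HeadTeacher.instance;
  }
}

const first = HeadTeacher.getInstance();
const second = HeadTeacher.getInstance();
console.log(first === second); // true
// new HeadTeacher("x"); // compile error: the constructor is private
```

The static method is the only way in, so every caller receives the same object.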

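Likewise, create a singleton folder under src, and inside it create two files: crawler1.ts and urlAnalyzer.ts.

These two files play the same roles as the ones above; only the way the code is written differs.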

crawler1.ts

import superagent from "superagent";
import fs from "fs";
import path from "path";
// Import without the ".ts" extension; TypeScript resolves it automatically.
import UrlAnalyzer from "./urlAnalyzer";
export interface Analyzer {
  analyze: (html: string, filePath: string) => string;
}
class Crawler {
  private filePath = path.resolve(__dirname, "../../data/url.json");
  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }
  private writeFile(content: string) {
    fs.writeFileSync(this.filePath, content);
  }
  private async initSpiderProcess() {
    const html = await this.getRawHtml();
    // analyze() already returns a JSON string, so write it as-is;
    // calling JSON.stringify again would double-encode the content.
    const fileContent = this.analyzer.analyze(html, this.filePath);
    this.writeFile(fileContent);
  }
  constructor(private analyzer: Analyzer, private url: string) {
    this.initSpiderProcess();
  }
}
const url = "https://www.hanju.run/play/39257-1-1.html";
// The analyzer is obtained through getInstance() instead of `new`.
const analyzer = UrlAnalyzer.getInstance();
new Crawler(analyzer, url);

urlAnalyzer.ts

import cheerio from "cheerio";
import fs from "fs";
import { Analyzer } from "./crawler1";
interface objJson {
  [propName: number]: Info[];
}
interface InfoResult {
  time: number;
  data: Info[];
}
interface Info {
  name: string;
  url: string;
}
export default class UrlAnalyzer implements Analyzer {
  static instance: UrlAnalyzer;
  static getInstance() {
    if (!UrlAnalyzer.instance) {
      UrlAnalyzer.instance = new UrlAnalyzer();
    }
    return UrlAnalyzer.instance;
  }
  private getJsonInfo(html: string) {
    const $ = cheerio.load(html);
    const info: Info[] = [];
    const scpt: string = String($(".play>script:nth-child(1)").html());
    const url = unescape(
      scpt.split(";")[3].split("(")[1].split(")")[0].replace(/\"/g, "")
    );
    const name: string = String($("title").html());
    info.push({
      name,
      url,
    });
    const result = {
      time: new Date().getTime(),
      data: info,
    };
    return result;
  }
  private getJsonContent(info: InfoResult, filePath: string) {
    let fileContent: objJson = {};
    if (fs.existsSync(filePath)) {
      fileContent = JSON.parse(fs.readFileSync(filePath, "utf-8"));
    }
    fileContent[info.time] = info.data;
    return fileContent;
  }
  public analyze(html: string, filePath: string) {
    const info = this.getJsonInfo(html);
    console.log(info);
    const fileContent = this.getJsonContent(info, filePath);
    return JSON.stringify(fileContent);
  }
  private constructor() {}
}

You can define a shortcut start command in package.json.

Then start it with npm run dev-s.

"scripts": {
     "dev-s": "ts-node ./src/singleton/crawler1.ts",
  },

Conclusion

This time it really is the end. Thanks for reading; I hope it helps.

Full source code:

https://github.com/maomincoding/TsCrawler

Editor: 龐桂玉. Source: 前端大全
