Hand-Writing a Simple Browser: The HTML Parser

Author: 神說要有光zxg, 2021-06-07 00:15:26


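This post implements the html parser of a simple browser. Self-closing tags are not handled yet; that only needs one more if else, which will be added later.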


This article is reposted from the WeChat official account 「神光的編程秘籍」, written by 神說要有光zxg. To repost it, please contact the 神光的編程秘籍 official account.

Approach

An html parser is implemented in two steps: lexical analysis and syntax analysis.

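Lexical analysis

Lexical analysis has to recognize every type of token. The types are:

start tag, e.g. <div>
end tag, e.g. </div>
comment, e.g. <!-- aaa -->
doctype, e.g. <!doctype html>
text, e.g. aaa

These are the outermost tokens. Inside a start tag, attributes such as id="aaa" also have to be split out.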

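So the cases are as follows. The first check is whether the remaining input starts with <. If it does not, the token is text. If it does, we decide which kind of tag it is. For a start tag we keep extracting attributes from its content until we reach >, and then start classifying again.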
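The decision order above can be sketched without any regular expressions. This is only an illustration: classifyHead is a hypothetical helper that is not part of the article's code, and it only names the kind of token at the head of the input without consuming it.

```javascript
// Hypothetical helper (not from the article): name the token kind
// found at the very start of the remaining input.
function classifyHead(html) {
    if (!html.startsWith('<')) return 'text';        // no leading <, so plain text
    if (html.startsWith('<!--')) return 'comment';
    if (/^<!doctype/i.test(html)) return 'docType';
    if (html.startsWith('</')) return 'tagEnd';
    return 'tagStart';                               // attributes are read afterwards
}

console.log(classifyHead('aaa<div>'));        // text
console.log(classifyHead('<!-- note -->'));   // comment
console.log(classifyHead('<!doctype html>')); // docType
console.log(classifyHead('</div>'));          // tagEnd
console.log(classifyHead('<div id="aaa">'));  // tagStart
```

The order of the checks matters: the comment and doctype tests must come before the generic tag tests, because all of them begin with <.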

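Syntax analysis

Syntax analysis assembles the tokens produced above into an ast.

Assembling an html ast is mostly about parent-child relationships: we record the current parent and attach text and children to it.

Let's implement this in code: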
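As a rough sketch of that idea, the snippet below assembles a tree from a hand-written token list. buildTree and the stack it uses are assumptions made for illustration, not the article's code; the token shapes mirror the ones used later in this post.

```javascript
// Hand-written tokens for <div><p>hi</p></div>.
const tokens = [
    { type: 'tagStart', value: 'div' },
    { type: 'tagStart', value: 'p' },
    { type: 'text', value: 'hi' },
    { type: 'tagEnd', value: 'p' },
    { type: 'tagEnd', value: 'div' }
];

// Hypothetical helper: the top of the stack is always the current parent.
function buildTree(tokenList) {
    const root = { children: [] };
    const stack = [root];
    for (const token of tokenList) {
        const parent = stack[stack.length - 1];
        if (token.type === 'tagStart') {
            const node = { tagName: token.value, text: '', children: [] };
            parent.children.push(node);
            stack.push(node);               // the new tag becomes the parent
        } else if (token.type === 'tagEnd') {
            if (stack.length > 1) stack.pop(); // back to the enclosing tag
        } else if (token.type === 'text') {
            parent.text = (parent.text || '') + token.value;
        }
    }
    return root.children[0];
}

const tree = buildTree(tokens);
console.log(JSON.stringify(tree));
// {"tagName":"div","text":"","children":[{"tagName":"p","text":"hi","children":[]}]}
```

A stack is one way to remember every enclosing parent, so that closing several tags in a row walks back up one level at a time.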

Implementation

Lexical analysis

First we write the regular expressions for startTag, endTag, comment, docType, and attribute:

Regular expressions

An end tag is </, then one or more characters from a-zA-Z0-9 or -, then >

const endTagReg = /^<\/([a-zA-Z0-9\-]+)>/;

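A comment is <!--, then any characters up to the first -->. A negated character class cannot express "anything except the sequence -->", so a lazy match is used instead

const commentReg = /^<!--[\s\S]*?-->/;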

A doctype tag is <!doctype followed by one or more non-> characters, then >. The i flag also accepts <!DOCTYPE

const docTypeReg = /^<!doctype [^>]+>/i;

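An attribute starts with one or more spaces, then one or more characters from a-zA-Z0-9 or -, then =, then the value. The value is either a quoted string or a run of characters that are neither > nor a space, so that each match stops at the end of one attribute

const attributeReg = /^[ ]+([a-zA-Z0-9\-]+=(?:"[^"]*"|'[^']*'|[^> ]+))/;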

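A start tag begins with <, then one or more characters from a-zA-Z0-9 or -, then zero or more attributes in the form described above, and ends with >

const startTagReg = /^<([a-zA-Z0-9\-]+)(?:[ ]+[a-zA-Z0-9\-]+=(?:"[^"]*"|'[^']*'|[^> ]+))*>/;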
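When writing patterns like these, one pitfall is worth checking by hand: a negated character class such as [^(-->)] excludes individual characters, not the sequence -->. The snippet below is a small sketch of that difference; classComment and lazyComment are names chosen here for illustration, and the patterns are restated so that the snippet runs on its own.

```javascript
// End tag pattern, restated so the snippet is self-contained.
const endTagReg = /^<\/([a-zA-Z0-9\-]+)>/;

// Two ways to write a comment pattern (names are illustrative).
const classComment = /^<!--[^(-->)]*-->/;  // negated character class
const lazyComment = /^<!--[\s\S]*?-->/;    // lazy match up to the first -->

console.log(endTagReg.exec('</div>rest')[1]);          // div
console.log(classComment.test('<!--button-->'));       // true
console.log(classComment.test('<!--a-b-->'));          // false: the class rejects the inner -
console.log(lazyComment.test('<!--a-b-->'));           // true
console.log(lazyComment.exec('<!--a--><!--b-->')[0]);  // <!--a-->
```

The lazy quantifier stops at the first -->, so two adjacent comments are matched one at a time.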

Tokenizing

With these patterns we can tokenize. The first layer separates < from text:

function parse(html, options) {
    // Drop the first num characters, which have just been consumed.
    function advance(num) {
        html = html.slice(num);
    }

    while (html) {
        if (html.startsWith('<')) {
            //...
        } else {
            // Text runs up to the next <, or to the end of the input.
            let textEndIndex = html.indexOf('<');
            textEndIndex = textEndIndex === -1 ? html.length : textEndIndex;
            options.onText({
                type: 'text',
                value: html.slice(0, textEndIndex)
            });
            advance(textEndIndex);
        }
    }
}

The second layer, inside the < branch, handles <!--, <!doctype, end tags, and start tags:

const commentMatch = html.match(commentReg);
if (commentMatch) {
    options.onComment({
        type: 'comment',
        value: commentMatch[0]
    });
    advance(commentMatch[0].length);
    continue;
}

const docTypeMatch = html.match(docTypeReg);
if (docTypeMatch) {
    options.onDoctype({
        type: 'docType',
        value: docTypeMatch[0]
    });
    advance(docTypeMatch[0].length);
    continue;
}

const endTagMatch = html.match(endTagReg);
if (endTagMatch) {
    options.onEndTag({
        type: 'tagEnd',
        value: endTagMatch[1]
    });
    advance(endTagMatch[0].length);
    continue;
}

const startTagMatch = html.match(startTagReg);
if (startTagMatch) {
    options.onStartTag({
        type: 'tagStart',
        value: startTagMatch[1]
    });

    // Skip < and the tag name, then read the attributes one by one.
    advance(startTagMatch[1].length + 1);
    let attributeMatch;
    while ((attributeMatch = html.match(attributeReg))) {
        options.onAttribute({
            type: 'attribute',
            value: attributeMatch[1]
        });
        advance(attributeMatch[0].length);
    }
    // Skip the closing >.
    advance(1);
    continue;
}

// Nothing matched: emit the stray < as text so that the loop always advances.
options.onText({
    type: 'text',
    value: '<'
});
advance(1);

After lexical analysis we have every token.
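To see a token stream concretely, here is a self-contained sketch that collects the tokens into an array instead of firing callbacks. tokenize is a hypothetical helper written for this illustration, and the patterns are restated (with quoted attribute values allowed to contain spaces) so that the snippet runs on its own; treat them as assumptions rather than the article's exact code.

```javascript
const endTagReg = /^<\/([a-zA-Z0-9\-]+)>/;
const commentReg = /^<!--[\s\S]*?-->/;
const docTypeReg = /^<!doctype [^>]+>/i;
const attributeReg = /^[ ]+([a-zA-Z0-9\-]+=(?:"[^"]*"|'[^']*'|[^> ]+))/;
const startTagReg = /^<([a-zA-Z0-9\-]+)(?:[ ]+[a-zA-Z0-9\-]+=(?:"[^"]*"|'[^']*'|[^> ]+))*>/;

// Hypothetical helper: return all tokens as an array.
function tokenize(html) {
    const tokens = [];
    while (html) {
        let match;
        if ((match = html.match(commentReg))) {
            tokens.push({ type: 'comment', value: match[0] });
            html = html.slice(match[0].length);
        } else if ((match = html.match(docTypeReg))) {
            tokens.push({ type: 'docType', value: match[0] });
            html = html.slice(match[0].length);
        } else if ((match = html.match(endTagReg))) {
            tokens.push({ type: 'tagEnd', value: match[1] });
            html = html.slice(match[0].length);
        } else if ((match = html.match(startTagReg))) {
            tokens.push({ type: 'tagStart', value: match[1] });
            html = html.slice(match[1].length + 1); // skip < and the tag name
            while ((match = html.match(attributeReg))) {
                tokens.push({ type: 'attribute', value: match[1] });
                html = html.slice(match[0].length);
            }
            html = html.slice(1); // skip the closing >
        } else {
            // Text runs to the next <; a stray < that matched nothing stays in the text.
            let end = html.indexOf('<', 1);
            if (end === -1) end = html.length;
            tokens.push({ type: 'text', value: html.slice(0, end) });
            html = html.slice(end);
        }
    }
    return tokens;
}

console.log(tokenize('<div id="box" class="a b">hi</div>'));
```

For this input the result is five tokens in order: a tagStart for div, an attribute id="box", an attribute class="a b", a text token hi, and a tagEnd for div.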

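Syntax analysis

Once the tokens are split out, we assemble them again. Only startTag, endTag, and text tokens (plus the attributes of a start tag) are handled. curParent records the current tag.

On a startTag, create an AST node, attach it to the children of curParent, and make the new node curParent
On an endTag, set curParent back to the parent of the tag being closed
On text, attach the text to curParent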

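function htmlParser(str) {
    const ast = {
        children: []
    };
    // Stack of open tags; its top is always the current parent.
    const parents = [ast];
    let curParent = ast;
    parse(str, {
        onComment(node) {
        },
        onStartTag(token) {
            const tag = {
                tagName: token.value,
                attributes: [],
                text: '',
                children: []
            };
            curParent.children.push(tag);
            parents.push(tag);
            curParent = tag;
        },
        onAttribute(token) {
            // Split on the first = only, so that values may contain =.
            const index = token.value.indexOf('=');
            const name = token.value.slice(0, index);
            const value = token.value.slice(index + 1);
            curParent.attributes.push({
                name,
                value: value.replace(/^['"]/, '').replace(/['"]$/, '')
            });
        },
        onEndTag(token) {
            // Go back to the parent of the tag being closed; never pop the root.
            if (parents.length > 1) {
                parents.pop();
            }
            curParent = parents[parents.length - 1];
        },
        onDoctype(token) {
        },
        onText(token) {
            // Skip whitespace-only text and append, so that earlier text is kept.
            if (token.value.trim()) {
                curParent.text = (curParent.text || '') + token.value;
            }
        }
    });
    return ast.children[0];
}

// Exported so that require('./htmlParser') returns the function.
module.exports = htmlParser;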

Let's try it out:

const htmlParser = require('./htmlParser'); 
 
const domTree = htmlParser(` 
<!doctype html> 
<body> 
    <div> 
        <!--button--> 
        <button>按鈕</button> 
        <div id="container"> 
            <div class="box1"> 
                <p>box1 box1 box1</p> 
            </div> 
            <div class="box2"> 
                <p>box2 box2 box2</p> 
            </div> 
        </div> 
    </div> 
</body> 
`); 
 
console.log(JSON.stringify(domTree, null, 4));

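The AST is generated correctly.

Summary

This post implemented the html parser of a simple browser. Self-closing tags are not handled yet; that only needs one more if else, which will be added later.

We went through the approach and then implemented it: regular expressions split the input into tokens, the tokens are exposed through callback functions, and the AST is then assembled while recording the current parent, so that the parent-child relationships come out right.

The html parser has in fact been a long-standing interview question for front-end teams in the Taobao family, and the vue template compiler and jsx parsers use similar ideas, so it is worth mastering. I hope this post helps you get the approach straight.

The code is on GitHub: https://github.com/QuarkGluonPlasma/tiny-browser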

Editor: 武曉燕. Source: 神光的編程秘籍
