String-Splitting Techniques You May Not Know

Author: 前端小智 2022-12-21 08:05:04

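Intl.Segmenter is a JavaScript object for locale-sensitive text segmentation. It helps us extract meaningful items from a string, such as words, sentences, or graphemes.

Recently I came across a new way to split strings: using Intl.Segmenter to split an emoji string into graphemes.

I had never used the Intl object before, so let's take a look together.

Suppose you want to split user input into sentences. It looks like a simple split() task... but the problem has many subtleties.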

'Hello! How are you?'.split(/[.!?]/);
// ['Hello', ' How are you', '']

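split() drops the delimiters it matches and leaves stray whitespace in the results. And because it relies on hard-coded delimiters, it is not language-aware.

I don't know Japanese, but how would you try to split the following string into words or sentences?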

// I am a cat. My name is Tanuki.
'吾輩は猫である。名前はたぬき。'

Ordinary string methods are no help here, but the Intl JavaScript API can solve this problem.

Intl.Segmenter to the rescue

Intl.Segmenter is used like any other constructor: create an instance with the new keyword, passing a locale and an options object.

const segmenter = new Intl.Segmenter(locale, { granularity: "word" });

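In the code above, locale is a string (a BCP 47 language tag such as "en-US") naming the locale to use. granularity is a string that sets the level of segmentation. It is one of "grapheme", "word", or "sentence", and defaults to "grapheme" when omitted.

An Intl.Segmenter instance has one very useful method, segment(), which splits text into meaningful segments.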

const segments = segmenter.segment(text);

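In the code above, text is the string to split, and segments is the iterable Segments object that is returned. You can walk through the segments with a for-of loop, or turn them into an array with Array.from().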

const text = "Hello, world! How are you today?";
const segmenter = new Intl.Segmenter("en-US", { granularity: "sentence" });
const segments = segmenter.segment(text);

for (const segment of segments) {
  console.log(segment);
}

// Output:
// { segment: "Hello, world! ", index: 0, input: "Hello, world! How are you today?" }
// { segment: "How are you today?", index: 14, input: "Hello, world! How are you today?" }
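
Because the value returned by segment() is iterable, Array.from() with a mapping function is a compact way to get plain strings. The sketch below applies that to the Japanese string from earlier. The variable names are illustrative only, and the boundaries come from the locale data shipped with the JavaScript engine.

```javascript
// Sketch: sentence-level segmentation of the Japanese sample string.
const japanese = '吾輩は猫である。名前はたぬき。';

const sentenceSegmenter = new Intl.Segmenter('ja', { granularity: 'sentence' });

// Array.from accepts any iterable plus a mapping function,
// so we can pull out just the text of each segment.
const sentences = Array.from(
  sentenceSegmenter.segment(japanese),
  (s) => s.segment
);

console.log(sentences);
// ['吾輩は猫である。', '名前はたぬき。']
```

Unlike split(), nothing is lost: joining the segments gives back the original string.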

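Each item produced by segment() is a plain data object with three properties: segment (the text of the piece), index (where it starts in the original string, counted in UTF-16 code units), and input (the original string). The standardized API has no breakType property, so reading it returns undefined. Another useful method lives on the object returned by segment(): containing(index) returns the segment that covers a given position.

For example:

const text = "Hello, world! How are you today?";
const segmenter = new Intl.Segmenter("en-US", { granularity: "sentence" });
const segments = segmenter.segment(text);

for (const segment of segments) {
  console.log(segment.index, JSON.stringify(segment.segment));
}

// Output:
// 0 "Hello, world! "
// 14 "How are you today?"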
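
The object returned by segment() also has a containing(index) method, which looks up the segment covering a given position without iterating over everything. A minimal sketch, reusing the sample sentence (the variable names are my own):

```javascript
const text = 'Hello, world! How are you today?';
const segments = new Intl.Segmenter('en-US', { granularity: 'sentence' }).segment(text);

// Position 20 (a UTF-16 code unit index) falls inside the second sentence.
const hit = segments.containing(20);
console.log(hit.segment); // "How are you today?"
console.log(hit.index);   // 14

// A position outside the string yields undefined.
console.log(segments.containing(text.length)); // undefined
```

This is handy for things like "select the sentence under the cursor", where you already know an offset into the text.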

Intl.Segmenter also has a useful static method, supportedLocalesOf(). Given a list of locales, it returns the ones the current runtime supports for segmentation.

const supported = Intl.Segmenter.supportedLocalesOf(["en-US", "zh-CN"]);
console.log(supported);

// Output:
// ['en-US', 'zh-CN']

In the code above, the supported array contains those of the requested locales that the runtime supports.
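
In practice you may want to check for Intl.Segmenter itself before using it, since older runtimes do not ship it. The helper below is a hypothetical sketch (splitSentences is not part of any API): it uses Intl.Segmenter when available and otherwise falls back to a rough regular expression that only knows about ".", "!" and "?".

```javascript
// Hypothetical helper: split text into trimmed sentences.
function splitSentences(text, locale = 'en') {
  const hasSegmenter =
    typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function';

  if (hasSegmenter && Intl.Segmenter.supportedLocalesOf([locale]).length > 0) {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
    return Array.from(segmenter.segment(text), (s) => s.segment.trim())
      .filter(Boolean);
  }

  // Fallback (assumption: Latin-style punctuation only).
  // Each match keeps its terminator, unlike a plain split().
  return (text.match(/[^.!?]+[.!?]*/g) || [])
    .map((s) => s.trim())
    .filter(Boolean);
}

console.log(splitSentences('Hello, world! How are you today?', 'en-US'));
// ['Hello, world!', 'How are you today?']
```

Note that supportedLocalesOf() throws a RangeError for a malformed language tag, so validate user-supplied locales before passing them in.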

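A related tool that works in the opposite direction is Intl.ListFormat. It does not split text. Instead, it joins the items of an array into a list string that follows the conventions of a locale.

It is used much like Intl.Segmenter: create an Intl.ListFormat object with the new keyword.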

const listFormat = new Intl.ListFormat(locale, { style: "long", type: "conjunction" });

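In the code above, locale is a string naming the locale to use. style and type are options that control the list format. style can be "long", "short", or "narrow". type can be "conjunction" (an "and" list), "disjunction" (an "or" list), or "unit" (a list of values with units).

Intl.ListFormat has a useful method, format(), which turns an array into a readable list string.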

const list = ["apple", "banana", "orange"];
const formatted = listFormat.format(list);
console.log(formatted);

// Output:
// "apple, banana, and orange"

In the code above, formatted is the resulting list string (shown here for an English locale).
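
To make the type option concrete, here is a small sketch that formats the same array as an "or" list, and uses formatToParts() to see which pieces are list elements and which are separators added by the formatter. The variable names are illustrative.

```javascript
const fruits = ['apple', 'banana', 'orange'];

// "disjunction" produces an "or" list instead of an "and" list.
const orList = new Intl.ListFormat('en', { style: 'long', type: 'disjunction' });
console.log(orList.format(fruits));
// "apple, banana, or orange"

// formatToParts() returns { type, value } objects, where type is
// either 'element' (an input item) or 'literal' (inserted separator text).
const andList = new Intl.ListFormat('en', { style: 'long', type: 'conjunction' });
const parts = andList.formatToParts(fruits);
const elements = parts.filter((p) => p.type === 'element').map((p) => p.value);
console.log(elements);
// ['apple', 'banana', 'orange']
```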

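Word granularity comes with an extra isWordLike property

When you split a string into words, the result also contains segments for spaces, line breaks, and punctuation. Use the isWordLike property to filter them out.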

const segmenterDe = new Intl.Segmenter('de', {
  granularity: 'word'
});
const segmentsDe = segmenterDe.segment('Was geht ab?');

console.log([...segmentsDe]);
// [
//   { segment: 'Was', index: 0, input: 'Was geht ab?', isWordLike: true },
//   { segment: ' ', index: 3, input: 'Was geht ab?', isWordLike: false },
//   ...
// ]

console.log([...segmentsDe].filter(s => s.isWordLike));
// [
//   { segment: 'Was', index: 0, input: 'Was geht ab?', isWordLike: true},
//   { segment: 'geht', index: 4, input: 'Was geht ab?', isWordLike: true },
//   { segment: 'ab', index: 9, input: 'Was geht ab?', isWordLike: true }
// ]

Filtering by isWordLike, as above, removes whitespace segments as well as punctuation such as ".", "-", or "?".
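
A common use of isWordLike is a locale-aware word count. The function below is a minimal sketch (countWords is a name I made up for illustration), not a definitive implementation:

```javascript
// Hypothetical helper: count word-like segments in a string.
function countWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  let count = 0;
  for (const { isWordLike } of segmenter.segment(text)) {
    // Skip whitespace and punctuation segments.
    if (isWordLike) count += 1;
  }
  return count;
}

console.log(countWords('Was geht ab?', 'de')); // 3
console.log(countWords('Hello, world! How are you today?', 'en')); // 6
```

For languages written without spaces, such as Japanese, the count depends on the dictionary data bundled with the engine, so treat the result as an estimate there.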

Using Intl.Segmenter to split emojis

If you want to split a string into the emojis a reader actually sees, Intl.Segmenter is also a great help.

const emojis = '🫣🫵👨‍👨‍👦‍👦';

// ----
// Split by code units
console.log(emojis.split(''));
// ['\uD83E', '\uDEE3', '\uD83E', '\uDEF5', '\uD83D', ...]

// ----
// Split by code points
console.log([...emojis]);
// ['🫣', '🫵', '👨', '\u200D', '👨', '\u200D', '👦', '\u200D', '👦']

// ----
// Split by graphemes
const segmenter = new Intl.Segmenter('en', {
  granularity: 'grapheme'
});
const segments = segmenter.segment(emojis);

console.log(Array.from(segments, s => s.segment));
// ['🫣', '🫵', '👨‍👨‍👦‍👦']

Note that graphemes are not limited to emojis: spaces and "normal" characters are each returned as graphemes too.
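
Grapheme segmentation is useful whenever you need the length a reader would perceive, for example when enforcing a character limit or truncating text without cutting an emoji in half. The helpers below are a sketch under that assumption. toGraphemes and truncateGraphemes are hypothetical names, and the sample string is written with escapes so it survives any encoding.

```javascript
// Hypothetical helper: split a string into user-perceived characters.
function toGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// Hypothetical helper: keep at most `max` graphemes.
function truncateGraphemes(text, max, locale) {
  return toGraphemes(text, locale).slice(0, max).join('');
}

// Two single emojis followed by a four-person family joined with
// zero-width joiners (U+200D).
const sample =
  '\u{1FAE3}\u{1FAF5}\u{1F468}\u200D\u{1F468}\u200D\u{1F466}\u200D\u{1F466}';

console.log(sample.length);              // 15 (UTF-16 code units)
console.log([...sample].length);         // 9  (code points)
console.log(toGraphemes(sample).length); // 3  (graphemes)

// Truncating by graphemes never splits a surrogate pair or a ZWJ sequence.
console.log(truncateGraphemes(sample, 2) === '\u{1FAE3}\u{1FAF5}'); // true
```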


References

https://www.stefanjudis.com/today-i-learned/how-to-split-javascript-strings-with-intl-segmenter/
https://2ality.com/2022/11/regexp-v-flag.html

Original article: https://www.stefanjudis.com/today-i-learned/how-to-split-javascript-strings-with-intl-segmenter/


Editor: 武曉燕  Source: 大遷世界
