Exact Count_Distinct over Hundreds of Millions of Strings in Apache Kylin: An Example

Author: Anonymous 2017-05-11 10:44:19

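Because the Global Dictionary is built on bitmaps, its maximum capacity is Integer.MAX_VALUE, a little over 2.1 billion. If the cumulative number of values in the global dictionary exceeds Integer.MAX_VALUE, the Build fails with an error.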
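To get a feel for that ceiling, the arithmetic can be sketched as follows. The growth rate of 5 million new IDs per day is purely an illustrative assumption, and days_until_full is a hypothetical helper, not part of Kylin.

```python
# Java's Integer.MAX_VALUE is 2**31 - 1.
INT_MAX = 2**31 - 1


def days_until_full(new_ids_per_day: int, already_used: int = 0) -> int:
    """Whole days of growth a dictionary capped at INT_MAX can still absorb.

    Hypothetical helper for illustration only.
    """
    if new_ids_per_day <= 0:
        raise ValueError("new_ids_per_day must be positive")
    remaining = max(INT_MAX - already_used, 0)
    return remaining // new_ids_per_day


print(INT_MAX)                     # capacity of the global dictionary
print(days_until_full(5_000_000))  # assumed growth: 5 million new IDs per day
```

The point is only that the limit applies to the cumulative number of distinct values across all builds, so it is worth estimating how quickly new values arrive.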


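If the business can tolerate an error of 1.22%, the approximate algorithm is the clear first choice, because it saves a great deal of resources and time. If exact deduplication is mandatory, the example in this article (exact deduplication over hundreds of millions of strings) shows how to do it.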
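As a rough way to judge whether 1.22% is acceptable, the sketch below turns that figure into an absolute range for a given UV. It treats 1.22% as a simple relative bound, which is an assumption about how the figure is meant; error_band is a hypothetical helper.

```python
from typing import Tuple


def error_band(true_count: int, error_bp: int = 122) -> Tuple[int, int]:
    """Return (low, high) for an estimate within error_bp basis points.

    122 basis points = 1.22%. Integer arithmetic avoids float rounding.
    """
    delta = true_count * error_bp // 10_000
    return true_count - delta, true_count + delta


low, high = error_band(100_000_000)
print(low, high)
```

For a true UV of 100 million, the estimate could be off by about 1.22 million in either direction under this assumption. Whether that is tolerable is a business decision.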

Fact Table

hive> desc test_t_pbs_uv_fact;
OK
ad_id                   string    // dimension
material_id             string    // dimension
city_code               string    // dimension
user_id                 string    // measure, needs exact Count Distinct
bid_request             bigint    // measure, SUM
device_bid_request      bigint    // measure, SUM
win                     bigint    // measure, SUM
ck                      bigint    // measure, SUM
pt                      string    // dimension, date, yyyy-MM-dd

This fact table receives roughly 150 million or more records per day. The user_id column is a string that looks like an MD5 hash.

Creating the Model

In Kylin, create a model named lxw1234_uv_model.

Select the dimension and measure columns:

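Creating the Cube

Create a Cube named lxw1234_uv_cube, with the measures defined as follows:

Configure everything else according to your actual business requirements.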

Manually Editing the Cube (JSON)

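Left unmodified, exact Count Distinct uses the Default dictionary to store the encoded user_id values. The Default dictionary holds at most 5 million entries, and a separate one is generated for every Segment, so a UV analysis that spans several days returns wrong results. In addition, if the number of distinct user_id values in a single day exceeds 5 million, the build fails with: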

java.lang.IllegalArgumentException: Too high cardinality is not suitable for dictionary -- cardinality: 43377845
at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:96) 
at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:73)
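Why per-Segment dictionaries give wrong cross-day results can be illustrated with a toy model. This is a minimal sketch, not Kylin's actual dictionary or bitmap code: Python sets of integers stand in for bitmaps, and the user IDs are made up.

```python
def build_segment_dictionary(values):
    """Assign IDs 0..n-1 to the sorted distinct values of ONE segment."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Stand-in for a bitmap: the set of integer IDs seen in the segment."""
    return {dictionary[v] for v in values}


# Two days of made-up data; only u_c appears on both days.
day1 = ["u_a", "u_b", "u_c"]
day2 = ["u_c", "u_d", "u_e"]

# Each segment builds its own dictionary, so IDs restart at 0 every day.
bitmap1 = encode(day1, build_segment_dictionary(day1))
bitmap2 = encode(day2, build_segment_dictionary(day2))

merged_uv = len(bitmap1 | bitmap2)    # what merging the two bitmaps yields
true_uv = len(set(day1) | set(day2))  # the real number of distinct users

print(merged_uv, true_uv)
```

Both segments encode their users as IDs 0, 1 and 2, so the merged bitmap contains only three IDs although five different users exist. The same integer means a different user in each segment, which makes the union meaningless.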

This limit is controlled by the parameter kylin.dictionary.max.cardinality. You can of course raise it to 100 million, but the Build may then run out of memory and bring down the Kylin Server:

# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 16193"...
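A back-of-envelope estimate suggests one plausible reason for this kind of failure. The sketch assumes each user_id is a 32-character hex string stored at one byte per character, and that all values end up in a single Java array. Both are assumptions for illustration; the source does not say how Kylin lays out the dictionary in memory.

```python
# Java's Integer.MAX_VALUE; a JVM array cannot be longer than roughly this.
INT_MAX = 2**31 - 1


def raw_bytes(cardinality: int, bytes_per_value: int = 32) -> int:
    """Bytes needed for the raw values alone (assumed 32 bytes each)."""
    return cardinality * bytes_per_value


for n in (5_000_000, 43_377_845, 100_000_000):
    size = raw_bytes(n)
    print(n, size, "exceeds INT_MAX" if size > INT_MAX else "fits")
```

Under these assumptions, 100 million values need about 3.2 billion bytes for the raw strings alone, more than a single array can hold, before counting any index structures or JVM overhead.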

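For this kind of requirement we therefore have to enable the Global Dictionary by hand. As the name implies, it is a single dictionary shared across all Segments: a given user_id maps to exactly one ID in the global dictionary.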
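The idea of one shared dictionary can be sketched with a toy append-only mapping. GlobalDictionarySketch is a hypothetical class written for this illustration and says nothing about how Kylin implements GlobalDictionaryBuilder; sets of integers again stand in for bitmaps.

```python
class GlobalDictionarySketch:
    """Toy append-only dictionary shared by all segments (illustration only)."""

    def __init__(self):
        self._ids = {}

    def get_or_assign(self, value):
        """Return the existing ID of value, or assign the next free one."""
        if value not in self._ids:
            self._ids[value] = len(self._ids)
        return self._ids[value]

    def encode(self, values):
        """Stand-in for a bitmap: the set of global IDs seen in a segment."""
        return {self.get_or_assign(v) for v in values}


gd = GlobalDictionarySketch()
day1 = ["u_a", "u_b", "u_c"]
day2 = ["u_c", "u_d", "u_e"]

bitmap1 = gd.encode(day1)  # first segment
bitmap2 = gd.encode(day2)  # second segment reuses the ID already given to u_c

uv = len(bitmap1 | bitmap2)
print(uv)
```

Because u_c keeps the same ID in both segments, merging the two sets counts it once and the cross-day UV comes out as five, matching the true number of distinct users.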

At present the Kylin UI offers no place to configure the Global Dictionary directly, so the Cube's JSON description has to be edited by hand:

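In the Cube list, find the Cube whose status is DISABLED, click "Edit(JSON)" under the "Admins" menu to open the editing page for the Cube's JSON description, and add the following JSON.

In it, two Cube-level configuration parameters are added under override_kylin_properties to increase the memory available to the Mappers.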

"dictionaries": [ 
    { 
      "column": "USER_ID", 
      "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder" 
    } 
  ]

This declares that the USER_ID column uses the global dictionary.

Then save the JSON.
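If the descriptor is edited outside the UI, the same change can be expressed as a small transformation on the JSON. This is a minimal sketch: add_global_dictionary is a hypothetical helper, and the cube_desc dictionary is a heavily trimmed stand-in for a real Cube description, which contains many more fields.

```python
import json

GLOBAL_BUILDER = "org.apache.kylin.dict.GlobalDictionaryBuilder"


def add_global_dictionary(cube_desc: dict, column: str) -> dict:
    """Add a global-dictionary entry for column unless one already exists.

    Mutates cube_desc in place and returns it for convenience.
    """
    dictionaries = cube_desc.setdefault("dictionaries", [])
    if not any(d.get("column") == column for d in dictionaries):
        dictionaries.append({"column": column, "builder": GLOBAL_BUILDER})
    return cube_desc


# Hypothetical, trimmed-down descriptor used only for this example.
cube_desc = {"name": "lxw1234_uv_cube"}
patched = add_global_dictionary(cube_desc, "USER_ID")
print(json.dumps(patched, indent=2))
```

Checking for an existing entry before appending keeps the helper safe to run more than once on the same descriptor.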

Build and Query

Once the Build has finished, run the following query in both Hive and Kylin:

SELECT city_code,
       SUM(bid_request) AS bid_request,
       COUNT(DISTINCT user_id) AS uv
FROM liuxiaowen.TEST_T_PBS_UV_FACT
GROUP BY city_code
ORDER BY uv DESC
LIMIT 30;

Time taken in Hive: 181.134 seconds

Time taken in Kylin: 9 seconds
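The ratio between the two timings can be worked out directly. The figures are the ones reported above; note that the Kylin time is given only to whole seconds, so the result is approximate.

```python
hive_seconds = 181.134  # reported Hive query time
kylin_seconds = 9.0     # reported Kylin query time (whole seconds only)

speedup = hive_seconds / kylin_seconds
print(f"Kylin answered about {speedup:.1f}x faster than Hive")
```

This is a single measurement on the author's cluster, so it is best read as an order-of-magnitude indication rather than a benchmark.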

The query results from the two engines are exactly the same.