從MySQL JOIN 算法角度看如何優化SQL

作者：陸宗藝 2024-12-10 00:00:10

Block Nested-Loop Join （Hash Join）與 Index Nested-Loop Join 對比，并沒有哪一種算法更優一說，只要其整體成本比另一種低，那他就是最合適的。

一、前言

在做MySQL的SQL優化時，如果只涉及到單表查詢，那么大部分慢SQL都只需從索引上入手優化即可，通過添加合適的索引來消除全表掃描或者排序操作，執行效果，大概率能實現質的飛躍。

然而，在實際生產中，除了單表查詢，更多的是多個表的聯合查詢，這樣的查詢通常是慢SQL的重災區，查詢速度慢，且使用服務器資源較多，如果能將這類SQL優化掉，那必將大大減輕數據庫服務器壓力。現在，咱就通過多表關聯內部數據操作的角度，看看如何進行SQL優化。

二、準備工作

現在線上環境大部分使用的都是MySQL 5.7.x版本，那咱就以5.7版本為主，適當延伸MySQL 8.0版本為輔進行講解測試。

創建測試表：

# 創建兩個表結構一模一樣的表：t1、t2
create table t1(
  id int not null auto_increment,
  a int,
  b int,
  c int,
  primary key(id),
  key idx_a(a)
);


create table t2 like t1;

構造測試數據：

# 創建2個存儲過程用于構造測試數據


# 構造t1表數據的存儲過程，數據為3的整除數，1000條
delimiter //
create procedure t1_proc()
begin
    declare i int default 1;
    while (i<=3000) do
        if (i%3) = 0 then
            insert into t1(a,b,c) values(i, i, i);
        end if;
        set i=i+1;
    end while;
end //
delimiter ;




# 構造t2表數據的存儲過程，數據為2的整除數，100000條
delimiter //
create procedure t2_proc()
begin
    declare i int default 1;
    while (i<=200000) do
        if (i%2) = 0 then
            insert into t2(a,b,c) values(i, i, i);
        end if;
        set i=i+1;
    end while;
end //
delimiter ;


# 調用存儲過程，生成測試數據
call t1_proc();
call t2_proc();


# 刪除存儲過程
drop procedure t1_proc;
drop procedure t2_proc;

數據樣例：

[5.7.37-log localhost:mysql.sock]>select * from t1 limit 5;
+----+------+------+------+
| id | a    | b    | c    |
+----+------+------+------+
|  1 |    3 |    3 |    3 |
|  2 |    6 |    6 |    6 |
|  3 |    9 |    9 |    9 |
|  4 |   12 |   12 |   12 |
|  5 |   15 |   15 |   15 |
+----+------+------+------+
5 rows in set (0.00 sec)


[5.7.37-log localhost:mysql.sock]>select * from t2 limit 5;
+----+------+------+------+
| id | a    | b    | c    |
+----+------+------+------+
|  1 |    2 |    2 |    2 |
|  2 |    4 |    4 |    4 |
|  3 |    6 |    6 |    6 |
|  4 |    8 |    8 |    8 |
|  5 |   10 |   10 |   10 |
+----+------+------+------+
5 rows in set (0.00 sec)

三、MySQL JOIN算法

MySQL對兩表關聯，支持多種Join算法，咱就以下面這個SQL為例，深入探討一下。

測試SQL：

select * from t1 join t2 on t1.b=t2.b;

1、Simple Nested-Loop Join

設想一下，如果兩表關聯，在沒有任何干預的情況下，他像不像下面這個偽代碼的嵌套循環：

for row_1 in t1: # 循環1000次
    for row_2 in t2: # 對應每個外層循環10w次
        if row_1.b == row_2.b:
            do something

從上面的偽代碼中，我們可以看到，其就是簡單粗暴的嵌套循環，我們將其稱為 Simple Nested-Loop Join。回到數據庫層面，在測試SQL兩個表關聯的過程中，t1表中的每一行數據，都會觸發掃描一次t2表的數據，然后進行數據匹配。總的來講就是，因為t1表有1000行數據，所以t2表會被掃描1000次，并進行1000 * 10w = 1億次數據比較。

圖片

很顯然，如果使用這種方式，當 t2 表足夠大時，反復掃描數據的過程中，磁盤必然會被拉爆，服務器性能會急劇下降。像MySQL這樣優秀的產品，必然會想方設法的避免這種情況的發生。

2、Block Nested-Loop Join

緊接上面所說，既然 Simple Nested-Loop Join最大的弊端是被驅動表被反復掃描，那是不是可以從這方面入手，減少被驅動表的掃描次數，以達到優化目的。咱繼續往下看，看他是怎么實現的。

一般情況下，兩表關聯，MySQL都會將結果集小（指根據條件過濾后）的表做驅動表，結果集大的表當被驅動表，那是不是可以嘗試一下，把驅動表的結果集放到內存中（Join Buffer），然后一次性掃描被驅動表的所有數據，反過來與Join Buffer中的驅動表結果集進行比較。這方式，驅動表和被驅動表都只掃描一次，但在內存中進行數據比較的次數依然為 10w * 1000 = 1億次。很顯然，這方式，相對于Simple Nested-Loop Join而言，優勢非常明顯，MySQL管這個叫Block Nested-Loop Join。

圖片

聰明的你，是不是在想：如果驅動表t1的結果集，無法一次性全部存放到Join Buffer內存中時，怎么辦？

Join Buffer 的大小由參數 join_buffer_size 控制，默認為256K。在使用Join Buffer時，如果無法一次性存放所有結果集，他會分多次進行，比如：

1）讀取驅動表t1的數據，存放到Join Buffer中，假設，存放400條后，Join Buffer滿了，停止讀取

2）讀取被驅動表t2的數據，每一行數據都與Join Buffer中的數據進行比較，并返回符合條件的結果集

3）清空Join Buffer

4）繼續讀取驅動表t1的數據，將401-800的數據存放到Join Buffer，直到存滿

5）...... 繼續重復相似的動作，直到所有數據都比對完

圖片

在上述假設情況下，因Join Buffer大小限制的原因，被驅動表 t2 被掃描了3次。總的來講，雖然不算完美，但顯然比使用Simple Nested-Loop Join的方式容易接受多了。也就是說，MySQL在經過對表鏈接進行優化后，就不會再出現使用Simple Nested-Loop Join的情況了。

執行計劃：

[5.7.37-log localhost:mysql.sock]>explain select * from t1 join t2 on t1.b=t2.b\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1000
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 100256
     filtered: 10.00
        Extra: Using where; Using join buffer (Block Nested Loop)
2 rows in set, 1 warning (0.00 sec)

3、Hash Join

雖然經過MySQL優化后的 Block Nested-Loop Join 算是改進了不少，但是，對各位程序員大拿而言，必然是一眼就看出了還有改進的余地。

Block Nested-Loop Join 在將驅動表結果集存放到Join Buffer后，被驅動表的數據與其比對時，被驅動表的每一行數據，都要與Join Buffer中所有數據進行比對，才能得出匹配的結果。這像不像對MySQL實體表進行條件查詢時，進行了全表掃描的操作一樣。這情況，如果給條件列加個索引，查詢速度是不是要瞬間起飛。

想法很好，但很不幸，MySQL 5.7.x 版本不支持；但也很慶幸，MySQL 8.0版本實現了，他會根據驅動表結果集，將關聯列映射為哈希值后鍵創建哈希表，被驅動表的數據在與哈希表進行比較時，就大大降低了比較次數，這也達到了優化的目的，我們管其叫Hash Join。

圖片

咱看看其執行計劃：

[8.0.27 127.0.0.1:3380]>explain select * from t1 join t2 on t1.b=t2.b\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1000
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 100400
     filtered: 10.00
        Extra: Using where; Using join buffer (hash join)
2 rows in set, 1 warning (0.00 sec)

咱再來對比一下Block Nested-Loop Join和Hash Join的執行速度：

[5.7.37-log localhost:mysql.sock]>select * from t1 join t2 on t1.b=t2.b;
......
+------+------+------+------+------+------+
500 rows in set (4.90 sec)






[8.0.27 127.0.0.1:3380]>select * from t1 join t2 on t1.b=t2.b;
......
+------+------+------+------+------+------+
500 rows in set (0.02 sec)

從執行邏輯和執行結果上，都印證了Hash Join必然會比Block Nested-Loop Join要好。所以，在MySQL 8.0版本，Block Nested-Loop Join將不復存在，所有原先使用其算法的表關聯SQL，最終都會被優化成選擇Hash Join進行表關聯。

4、Index Nested-Loop Join

前面提到的 Block Nested-Loop Join 和 Hash Join，都是MySQL自己內部實現的優化，如果沒有其他更好的算法，那么基于這兩種算法基礎上的表關聯慢SQL，人為干預改進的可能性，是不是就微無其微了。

我們仔細分析一下前面這兩種算法的特點，Block Nested-Loop Join 的改進是降低了表掃描次數， Hash Join的改進是降低了數據對比的次數，但他兩，依然有一個致命的共同點，如果被驅動表足夠大（大表）時，比如有N億數據量，那么，哪怕掃描一次被驅動表，也會引起數據庫性能急劇下降。

知道了問題在哪，自然就有了優化的方向。設想一下，如果被驅動表的關聯列，像Hash Join中的哈希表一樣，存在索引，會是個什么情況呢？

圖片

驅動表中的每一行記錄，都可以通過被驅動表的索引列，進行索引查找（與關聯列有關，可以是主鍵，也可以是二級索引），這瞬間就解決了被驅動表被掃描的問題。其本質，和單表查詢中，通過建立合適索引的方式進行優化，是不是很相似。哪怕驅動表再大，如果索引列每個鍵值對應的數據量不大，那么索引查找速度依然可以快到起飛，這算法就叫 Index Nested-Loop Join。

先前咱兩個測試表中，a列和b列數據是一樣的，a列有索引，b列無索引，所以，咱將測試SQL變通一下

select * from t1 join t2 on t1.b=t2.b;
# 替換為
select * from t1 join t2 on t1.b=t2.a;

執行計劃：

[5.7.37-log localhost:mysql.sock]>explain select * from t1 join t2 on t1.b=t2.a\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1000
     filtered: 100.00
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ref
possible_keys: idx_a
          key: idx_a
      key_len: 5
          ref: db1.t1.b
         rows: 1
     filtered: 100.00
        Extra: NULL
2 rows in set, 1 warning (0.00 sec)

執行速度：

[5.7.37-log localhost:mysql.sock]>select * from t1 join t2 on t1.b=t2.a;
......
+------+------+------+------+------+------+
500 rows in set (0.01 sec)

你是不是在疑惑，看這執行速度，和Hash Join差別也不大，那是因為咱的被驅動表t2數據量太少，隨著測試數據量的增大，差距會越來越明顯。

四、優化思路

前面的測試SQL，相對來講，簡化的有點過于簡單了，實際應用中，必然會有一大堆查詢條件跟在其后，那這一堆查詢條件，在進行SQL優化時，會不會對你造成干擾呢？

1、初始SQL

測試SQL做個變化，讓其他稍微貼近實際情況：

select *
from t1 join t2 on t1.b = t2.b
where
    t1.c in (6, 12, 18, 24, 30)
    and t2.c in (6, 12, 18, 24, 30);

執行計劃：

[5.7.37-log localhost:mysql.sock]>explain select *
    -> from t1 join t2 on t1.b = t2.b
    -> where
    ->     t1.c in (6, 12, 18, 24, 30)
    ->     and t2.c in (6, 12, 18, 24, 30)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1000
     filtered: 50.00
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 100345
     filtered: 5.00
        Extra: Using where; Using join buffer (Block Nested Loop)
2 rows in set, 1 warning (0.00 sec)

從上面的執行計劃可以看到，t1表較小為驅動表，t2表較大為被驅動表。咱一步一步分析，暫時剔除t2表，先看t1表是否有優化的空間，其現在是全表掃描，并通過t1.c列進行數據過濾。單表查詢，如果查詢條件列有索引，必然會加快查詢速度對吧。

2、SQL優化1

t1表中，a、b、c列數據是一樣的，a列有索引，所以咱不額外創建索引了，直接使用a列替代c列，重寫測試SQL：

select *
from t1 join t2 on t1.b = t2.b
where
    t1.a in (6, 12, 18, 24, 30)
    and t2.c in (6, 12, 18, 24, 30);

查看新的執行計劃：

[5.7.37-log localhost:mysql.sock]>explain select *
    -> from t1 join t2 on t1.b = t2.b
    -> where
    ->     t1.a in (6, 12, 18, 24, 30)
    ->     and t2.c in (6, 12, 18, 24, 30)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: range
possible_keys: idx_a
          key: idx_a
      key_len: 5
          ref: NULL
         rows: 5
     filtered: 100.00
        Extra: Using index condition
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 100345
     filtered: 5.00
        Extra: Using where; Using join buffer (Block Nested Loop)
2 rows in set, 1 warning (0.00 sec)

t1表從原先的全表掃描，變成了索引查找，預估讀取的數據行，也從原來的1000行變成了5行，優化效果明顯。此時，再看看t2表，因為關聯列t2.b沒有索引，查詢列t2.c也沒有索引，所以t2表是掃描一次后，通過Block Nested-Loop Join算法與Join Buffer中的數據進行匹配。

在前面講解Index Nested-Loop Join時，咱知道，如果關聯列 t2.b 有索引，就會使用Index Nested-Loop Join算法進行數據匹配，那，如果關聯列沒索引，但是查詢過濾列 t2.c 有索引，會是怎樣的？

3、SQL優化2

同樣的，咱用 t2.a 列替代 t2.c 列，重寫測試SQL：

select *
from t1 join t2 on t1.b = t2.b
where
    t1.a in (6, 12, 18, 24, 30)
    and t2.a in (6, 12, 18, 24, 30);

執行計劃：

[5.7.37-log localhost:mysql.sock]>explain select *
    -> from t1 join t2 on t1.b = t2.b
    -> where
    ->     t1.a in (6, 12, 18, 24, 30)
    ->     and t2.a in (6, 12, 18, 24, 30)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: range
possible_keys: idx_a
          key: idx_a
      key_len: 5
          ref: NULL
         rows: 5
     filtered: 100.00
        Extra: Using index condition
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: range
possible_keys: idx_a
          key: idx_a
      key_len: 5
          ref: NULL
         rows: 5
     filtered: 10.00
        Extra: Using index condition; Using where; Using join buffer (Block Nested Loop)
2 rows in set, 1 warning (0.00 sec)

與前面的執行計劃對比發現，其依然是使用Block Nested-Loop Join算法，只不過原先t2表，從全表掃描，變成了通過 t2.a 列索引，一次性查找出全部數據后，再與Join Buffer中t1表的結果集進行匹配，如果 t2.a 列根據查詢條件過濾出來的數據，足夠少，這也不失為一個較好的優化思路。

4、SQL優化3

當然了，如果關聯列有索引，查詢列沒索引，你已經知道了是使用Index Nested-Loop Join算法，繼續重寫測試SQL：

select *
from t1 join t2 on t1.b = t2.a
where
    t1.a in (6, 12, 18, 24, 30)
    and t2.c in (6, 12, 18, 24, 30);

執行計劃：

[5.7.37-log localhost:mysql.sock]>explain select *
    -> from t1 join t2 on t1.b = t2.a
    -> where
    ->     t1.a in (6, 12, 18, 24, 30)
    ->     and t2.c in (6, 12, 18, 24, 30)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: range
possible_keys: idx_a
          key: idx_a
      key_len: 5
          ref: NULL
         rows: 5
     filtered: 100.00
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ref
possible_keys: idx_a
          key: idx_a
      key_len: 5
          ref: db1.t1.b
         rows: 1
     filtered: 50.00
        Extra: Using where
2 rows in set, 1 warning (0.00 sec)

被驅動表關聯列有索引，查詢列無索引，使用Index Nested-Loop Join算法。

5、疑問

如果t2表中，關聯列和查詢列，都有索引，他會怎么選？為了更好的比較，咱給 t2.c 列創建一個索引，并對 t2.a 列的數據進行適當的調整。

# 添加c列索引
alter table t2 add index idx_c(c);


# 調整t2表a列數據，a列查詢條件中的值，每個值對應的數據量為4000
update t2 set a=a%50;


# 消除表碎片，避免被其干擾
alter table t2 engine=innodb;


# 驅動表傳過來的鍵值，每個鍵值對應的數據為4000行
[5.7.37-log localhost:mysql.sock]>select a,count(a) cnt
    -> from t2
    -> where a in (6, 12, 18, 24, 30)
    -> group by a;
+------+------+
| a    | cnt  |
+------+------+
|    6 | 4000 |
|   12 | 4000 |
|   18 | 4000 |
|   24 | 4000 |
|   30 | 4000 |
+------+------+
5 rows in set (0.01 sec)




# 總共符合條件的數據，5行
[5.7.37-log localhost:mysql.sock]>select * from t2 where c in (6, 12, 18, 24, 30);
+----+------+------+------+
| id | a    | b    | c    |
+----+------+------+------+
|  3 |    6 |    6 |    6 |
|  6 |   12 |   12 |   12 |
|  9 |   18 |   18 |   18 |
| 12 |   24 |   24 |   24 |
| 15 |   30 |   30 |   30 |
+----+------+------+------+
5 rows in set (0.01 sec)

重寫測試SQL：

select *
from t1 join t2 on t1.b = t2.a
where
    t1.a in (6, 12, 18, 24, 30)
    and t2.c in (6, 12, 18, 24, 30);

執行計劃：

[5.7.37-log localhost:mysql.sock]>explain select *
    -> from t1 join t2 on t1.b = t2.a
    -> where
    ->     t1.a in (6, 12, 18, 24, 30)
    ->     and t2.c in (6, 12, 18, 24, 30)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: range
possible_keys: idx_a
          key: idx_a
      key_len: 5
          ref: NULL
         rows: 5
     filtered: 100.00
        Extra: Using index condition
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: range
possible_keys: idx_a,idx_c
          key: idx_c
      key_len: 5
          ref: NULL
         rows: 5
     filtered: 4.55
        Extra: Using index condition; Using where; Using join buffer (Block Nested Loop)
2 rows in set, 1 warning (0.00 sec)

由此可見，Block Nested-Loop Join （Hash Join）與 Index Nested-Loop Join 對比，并沒有哪一種算法更優一說，只要其整體成本比另一種低，那他就是最合適的。當然了，前面所有例子，都是只有2個表關聯，對于3表及以上的關聯SQL而言，如果你把前2個表的關聯結果，當成一個新的驅動表看待，那么所有后面的表關聯，是不是都只需分析兩表關聯的情況即可。

五、最后

至此，對于想學習SQL優化的你，功力是不是又有長進了。如果你還有其他疑問，可以寫在評論區，咱后面再繼續探討。

責任編輯：武曉燕來源：京東云開發者

成人免费xxxxx在线视频软件_久久精品久久久_亚洲国产精品久久久_天天色天天色_亚洲人成一区_欧美一级欧美三级在线观看

從MySQL JOIN 算法角度看如何優化SQL

一、前言

二、準備工作

三、MySQL JOIN算法

1、Simple Nested-Loop Join

2、Block Nested-Loop Join

3、Hash Join

4、Index Nested-Loop Join

四、優化思路

1、初始SQL

2、SQL優化1

3、SQL優化2

4、SQL優化3

5、疑問

五、最后