Surprise! Go Underperforms Java on a Billion Loop Iterations and a Million Tasks. Why?

Author: 白明的贊賞賬戶 2024-12-02 10:47:45

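This article looks at why Go performed unexpectedly worse than Java and C in the billion-loop and million-task tests. In my view, Go's slowness in the loop test comes mainly from insufficient compiler optimization, which hurts execution efficiency.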

Programming language comparisons never fail to catch programmers' attention!

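Two language-comparison articles from overseas have recently sparked heated discussion among programmers in China. One is a language performance test[2] by Ben Dicken (@BenjDicken)[1], which compares how fast more than ten mainstream languages run 1 billion loop iterations (a two-level nested loop: 10,000 * 100,000). The other is a memory overhead test[3] by a developer named hez2010, which compares the memory footprint of several languages when handling one million tasks.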

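The results of the two tests are shown below:

Figure: results of the 1-billion-iteration loop test

Figure: results of the million-task memory overhead test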

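In both tests, Go not only trails the non-GC languages C and Rust by a wide margin, it also falls behind Java. In the memory overhead test in particular, Go uses noticeably more memory than Java, a language with a reputation for being memory-hungry. The result surprised many developers, because Go is usually regarded as a lightweight language, yet the measurements expose its "memory inefficiency" in highly concurrent scenarios.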

So why did Go fall short of expectations in both tests? In this article I explore the likely causes, for your reference.

Let's start with the billion-loop test.

1. The loop test runs slowly because compiler optimization is not yet good enough

Here is the Go test program[4] provided by the original author:

// why-go-sucks/billion-loops/go/code.go 

package main

import (
 "fmt"
 "math/rand"
 "os"
 "strconv"
)

func main() {
 input, e := strconv.Atoi(os.Args[1]) // Get an input number from the command line
 if e != nil {
  panic(e)
 }
 u := int32(input)
 r := int32(rand.Intn(10000))        // Get a random number 0 <= r < 10k
 var a [10000]int32                  // Array of 10k elements initialized to 0
 for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
  for j := int32(0); j < 100000; j++ { // 100k inner loop iterations, per outer loop iteration
   a[i] = a[i] + j%u // Simple sum
  }
  a[i] += r // Add a random value to each element in array
 }
 fmt.Println(a[r]) // Print out a single element from the array
}

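The code reads an integer from the command line, generates a random number, accumulates into every element of an array through two nested loops, and finally prints the array element whose index is the random number.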

Now let's look at the test code of the "competitors". The C test code is:

// why-go-sucks/billion-loops/c/code.c

#include "stdio.h"
#include "stdlib.h"
#include "stdint.h"

int main (int argc, char** argv) {
  int u = atoi(argv[1]);               // Get an input number from the command line
  int r = rand() % 10000;              // Get a random integer 0 <= r < 10k
  int32_t a[10000] = {0};              // Array of 10k elements initialized to 0
  for (int i = 0; i < 10000; i++) {    // 10k outer loop iterations
    for (int j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
      a[i] = a[i] + j%u;               // Simple sum
    }
    a[i] += r;                         // Add a random value to each element in array
  }
  printf("%d\n", a[r]);                // Print out a single element from the array
}

And here is the Java test code:

// why-go-sucks/billion-loops/java/code.java

package jvm;

import java.util.Random;

public class code {

    public static void main(String[] args) {
        var u = Integer.parseInt(args[0]); // Get an input number from the command line
        var r = new Random().nextInt(10000); // Get a random number 0 <= r < 10k
        var a = new int[10000]; // Array of 10k elements initialized to 0
        for (var i = 0; i < 10000; i++) { // 10k outer loop iterations
            for (var j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
                a[i] = a[i] + j % u; // Simple sum
            }
            a[i] += r; // Add a random value to each element in array
        }
        System.out.println(a[r]); // Print out a single element from the array
    }
}

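You may not be familiar with C or Java, but judging by the shape of the code, the C, Java and Go versions really are on an "equal footing". That means more than running on the same hardware and software environment: they also use the same computation and algorithm, and handle the input argument in the same way.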

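To confirm the original author's results, I ran the three programs above on an Alibaba Cloud ECS instance (amd64, 8c32g, CentOS 7.9), measuring elapsed time with the time command, to get a baseline. The versions of the C, Java and Go toolchains in my environment are: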

$go version
go version go1.23.0 linux/amd64

$java -version
openjdk version "17.0.9" 2023-10-17 LTS
OpenJDK Runtime Environment Zulu17.46+19-CA (build 17.0.9+8-LTS)
OpenJDK 64-Bit Server VM Zulu17.46+19-CA (build 17.0.9+8-LTS, mixed mode, sharing)

$gcc -v
使用內建 specs。
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
目標：x86_64-redhat-linux
配置為：../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
線程模型：posix
gcc 版本 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

The test steps and results are as follows.

Go test:

$cd why-go-sucks/billion-loops/go
$go build -o code code.go
$time ./code 10
456953

real 0m3.766s
user 0m3.767s
sys 0m0.007s

C test:

$cd why-go-sucks/billion-loops/c
$gcc -O3 -std=c99 -o code code.c
$time ./code 10
459383

real 0m3.005s
user 0m3.005s
sys 0m0.000s

Java test:

$javac -d . code.java
$time java -cp . jvm.code 10
456181

real 0m3.105s
user 0m3.092s
sys 0m0.027s

Going by the real time, the C code built with -O3 is the fastest, Java is one step behind, and **Go is 25% slower than C and 21% slower than Java**.

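Note: the output of the time command has three main parts: real, user and sys. real is the wall-clock time from start to finish, and it is the metric used here. user is the CPU time the program spent in user mode. sys is the CPU time spent in kernel mode on the program's behalf (for example in system calls). If real is noticeably larger than user plus sys, the program spent part of its time descheduled or waiting, for example on I/O or on another resource. If real is clearly smaller than user, the program is usually running on several cores at once: CPU time accumulates across threads or processes, so total user time can far exceed wall-clock time. A low sys time means the program makes few system calls and spends its time computing rather than interacting with the kernel.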

At this point, as a Gopher, you might say: the original Go test code is not well optimized, and we can make it faster than C!

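Everyone knows the original code is not optimal, and that casually changing the computation can bring a large speedup. But we must not forget the "equal footing" premise: any optimization you apply is equally available to the other languages (C, Java).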

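So what can we still optimize without breaking the "equal footing" premise? In the spirit of taking whatever gain we can get, let's try the following and see what happens:

Remove the unnecessary if check
Use a faster rand call
Disable bounds checking
Avoid escape to the heap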

Here is the modified code:

// why-go-sucks/billion-loops/go/code_optimize.go 

package main

import (
 "fmt"
 "math/rand"
 "os"
 "strconv"
)

func main() {
 input, _ := strconv.Atoi(os.Args[1]) // Get an input number from the command line
 u := int32(input)
 r := int32(rand.Uint32() % 10000)   // Use Uint32 for faster random number generation
 var a [10000]int32                  // Array of 10k elements initialized to 0
 for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
  for j := int32(0); j < 100000; j++ { // 100k inner loop iterations, per outer loop iteration
   a[i] = a[i] + j%u // Simple sum
  }
  a[i] += r // Add a random value to each element in array
 }
 z := a[r]
 fmt.Println(z) // Print out a single element from the array
}

Let's build it and run the test:

$cd why-go-sucks/billion-loops/go
$go build -o code_optimize -gcflags '-B' code_optimize.go
$time ./code_optimize 10
459443

real 0m3.761s
user 0m3.759s
sys 0m0.011s

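Compared with the first run, these so-called optimizations made no real difference. You could probably have guessed that beforehand: apart from bounds checking, none of them sits on the hot path of the loop.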

Note: rand.Uint32() % 10000 is indeed faster than rand.Intn(10000); in my own benchmark it was roughly twice as fast.
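The author's benchmark for the note above is not shown, so the following is only a minimal sketch of how such a comparison could be set up with testing.Benchmark from a plain main package. The helper names (benchIntn, benchUint32Mod, runBoth) are hypothetical, and the measured ratio will depend on the Go version and machine.

```go
package main

import (
	"fmt"
	"math/rand"
	"testing"
)

// sink keeps the results alive so the compiler cannot discard the calls.
var sink int32

// benchIntn times rand.Intn(10000), the call used in code.go.
func benchIntn(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = int32(rand.Intn(10000))
	}
}

// benchUint32Mod times rand.Uint32() % 10000, the call used in code_optimize.go.
func benchUint32Mod(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = int32(rand.Uint32() % 10000)
	}
}

// runBoth runs the two benchmarks one after the other and returns both results.
func runBoth() (intn, mod testing.BenchmarkResult) {
	return testing.Benchmark(benchIntn), testing.Benchmark(benchUint32Mod)
}

func main() {
	intn, mod := runBoth()
	fmt.Println("rand.Intn(10000):      ", intn)
	fmt.Println("rand.Uint32() % 10000: ", mod)
}
```

Either way, this call runs once per program execution, so it cannot affect the loop timing. Note also that the modulo form is very slightly biased, because 2^32 is not a multiple of 10000.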

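So where exactly is the Go program slow? Under "equal footing", the only explanation I can think of is that the Go compiler's back end does not optimize the code well enough yet, and still lags clearly behind long-established compilers such as GCC and Java's.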

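For example, the original code accesses a[i] over and over in the inner loop, which means many array reads and writes (load a[i] from memory, update the value, write it back). GCC and the Java compiler very likely perform this optimization in their back ends: accumulate the array element in a temporary variable and write it back to the array once the inner loop finishes. That reduces memory reads and writes inside the inner loop, makes better use of CPU caches and registers, and speeds up the computation.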

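Note: an array element is read from memory or cache, while a temporary variable is very likely read from a register, and the difference in read latency is large.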

What happens if we apply this optimization by hand in Go? We modify the first version of the Go code (code.go) as follows:

// why-go-sucks/billion-loops/go/code_local_var.go 

package main

import (
 "fmt"
 "math/rand"
 "os"
 "strconv"
)

func main() {
 input, e := strconv.Atoi(os.Args[1]) // Get an input number from the command line
 if e != nil {
  panic(e)
 }
 u := int32(input)
 r := int32(rand.Intn(10000))        // Get a random number 0 <= r < 10k
 var a [10000]int32                  // Array of 10k elements initialized to 0
 for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
  temp := a[i]
  for j := int32(0); j < 100000; j++ { // 100k inner loop iterations, per outer loop iteration
   temp += j % u // Simple sum
  }
  temp += r // Add a random value to each element in array
  a[i] = temp
 }
 fmt.Println(a[r]) // Print out a single element from the array
}

Build and run the test:

$go build -o code_local_var code_local_var.go 
$time ./code_local_var 10
459169

real 0m3.017s
user 0m3.017s
sys 0m0.007s

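As we can see, the result is now slightly better than Java's. Clearly the Go compiler does not perform this optimization, which can also be roughly seen in the assembly for code.go: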

Figure: mapping between the assembly and the Go source, generated with [lensm](https://github.com/loov/lensm "lensm")
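To try the array-versus-local-variable comparison at a smaller scale, a sketch along these lines could be used. The function names and the reduced loop sizes are assumptions made for illustration, and u is hard-coded rather than read from the command line, so the compiler may treat the modulo differently than in the article's programs. Timings from this sketch are not a substitute for the measurements above.

```go
package main

import (
	"fmt"
	"time"
)

// accumulateInArray mirrors code.go: the inner loop reads and writes a[i] directly.
func accumulateInArray(a []int32, inner, u int32) {
	for i := range a {
		for j := int32(0); j < inner; j++ {
			a[i] = a[i] + j%u
		}
	}
}

// accumulateInLocal mirrors code_local_var.go: the inner loop works on a local
// variable and the array element is written once per outer iteration.
func accumulateInLocal(a []int32, inner, u int32) {
	for i := range a {
		temp := a[i]
		for j := int32(0); j < inner; j++ {
			temp += j % u
		}
		a[i] = temp
	}
}

// equal reports whether two slices hold the same values.
func equal(x, y []int32) bool {
	if len(x) != len(y) {
		return false
	}
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

func main() {
	// 1,000 * 100,000 = 100 million iterations, ten times fewer than the article's test.
	const outer, inner, u = 1000, 100000, 10
	x := make([]int32, outer)
	y := make([]int32, outer)

	start := time.Now()
	accumulateInArray(x, inner, u)
	fmt.Println("array element: ", time.Since(start))

	start = time.Now()
	accumulateInLocal(y, inner, u)
	fmt.Println("local variable:", time.Since(start))

	fmt.Println("same result:", equal(x, y))
}
```

Both functions compute the same values, so any timing difference comes from how the inner loop is compiled.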

Java evidently does perform this kind of optimization. To check, we applied the same change to the original Java code:

// why-go-sucks/billion-loops/java/code_local_var.java

package jvm;

import java.util.Random;

public class code {

    public static void main(String[] args) {
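        var u = Integer.parseInt(args[0]); // Get an input number from the command line
        var r = new Random().nextInt(10000); // Get a random number 0 <= r < 10k
        var a = new int[10000]; // Array of 10k elements initialized to 0

        for (var i = 0; i < 10000; i++) { // 10k outer loop iterations
            var temp = a[i]; // Keep the value of a[i] in a local variable
            for (var j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
                temp += j % u; // Update the local variable
            }
            a[i] = temp + r; // Add r to the local variable and write it back to the array
        }
        System.out.println(a[r]); // Print out a[r]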
    }
}

Judging by the run of this "optimized" program, however, the improvement for the Java code is almost negligible:

$time java -cp . jvm.code 10
450375

real 0m3.043s
user 0m3.028s
sys 0m0.027s

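This shows directly that even for the original Java code, the Java compiler (in practice most likely the JVM's JIT compiler rather than javac) already produces executable code with this local-variable extraction applied, so Java programmers do not need to do it by hand.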

There are plenty of other compiler optimizations like this. The familiar loop unrolling, for example, can also improve the Go program's performance:

// why-go-sucks/billion-loops/go/code_loop_unrolling.go

package main

import (
 "fmt"
 "math/rand"
 "os"
 "strconv"
)

func main() {
 input, e := strconv.Atoi(os.Args[1]) // Get an input number from the command line
 if e != nil {
  panic(e)
 }
 u := int32(input)
 r := int32(rand.Intn(10000))        // Get a random number 0 <= r < 10k
 var a [10000]int32                  // Array of 10k elements initialized to 0
 for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
  var sum int32
  // Unroll inner loop in chunks of 4 for optimization
  for j := int32(0); j < 100000; j += 4 {
   sum += j % u
   sum += (j + 1) % u
   sum += (j + 2) % u
   sum += (j + 3) % u
  }
  a[i] = sum + r // Add the accumulated sum and random value
 }

 fmt.Println(a[r]) // Print out a single element from the array
}

Running this Go test program gives:

$go build -o code_loop_unrolling code_loop_unrolling.go
$time ./code_loop_unrolling 10
458908

real 0m2.937s
user 0m2.940s
sys 0m0.002s

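Loop unrolling can increase instruction-level parallelism, because the unrolled body contains more independent instructions. In this example the computations j % u, (j+1) % u, (j+2) % u and (j+3) % u do not depend on one another and can execute in parallel, which breaks the dependency chain and makes better use of the processor's parallel pipelines. In the original Go code, every iteration updates a[i] based on the result of the previous iteration, forming a dependency chain. That ordering forces the processor to execute the instructions one after another and causes pipeline stalls.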
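The hand-unrolled loop above works because 100000 is a multiple of 4. As a sketch of the general technique, assuming an arbitrary iteration count n, the unrolled body is followed by a cleanup loop for the leftover iterations. The function names here are hypothetical.

```go
package main

import "fmt"

// sumModSimple adds j % u for every j in [0, n), one iteration at a time.
func sumModSimple(n, u int32) int32 {
	var sum int32
	for j := int32(0); j < n; j++ {
		sum += j % u
	}
	return sum
}

// sumModUnrolled4 does the same work four iterations per loop pass, then
// finishes the 0 to 3 leftover iterations in a cleanup loop.
func sumModUnrolled4(n, u int32) int32 {
	var sum int32
	j := int32(0)
	for ; j+3 < n; j += 4 {
		sum += j % u
		sum += (j + 1) % u
		sum += (j + 2) % u
		sum += (j + 3) % u
	}
	for ; j < n; j++ { // cleanup for n not divisible by 4
		sum += j % u
	}
	return sum
}

func main() {
	for _, n := range []int32{0, 1, 3, 4, 7, 100000, 100003} {
		fmt.Println(n, sumModSimple(n, 10), sumModUnrolled4(n, 10))
	}
}
```

Both columns should match for every n, which is the property a compiler must also preserve when it unrolls a loop automatically.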

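Other languages can apply the same manual optimization, though. Applying it to the C code (why-go-sucks/billion-loops/c/code_loop_unrolling.c) brings the C test program down to about 2.7s, which also shows that the compiler did not apply this optimization automatically to the original C program even at -O3: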

$time ./code_loop_unrolling 10
459383

real 0m2.723s
user 0m2.722s
sys 0m0.001s

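We will not dig further into the billion-loop performance question here. The preliminary conclusion from the exploration above is that the Go compiler's optimizations are not yet up to par. I hope the Go team will invest more in compiler optimization and catch up soon with mature compilers such as GCC/Clang and Java's.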

Next, let's look at the "problem" of Go's high memory overhead in the million-task scenario.

2. High memory usage comes down to how goroutines are implemented

Let's first look at the test code for the second problem:

package main

import (
 "os"
 "strconv"
 "sync"
 "time"
)

func main() {
 numRoutines := 100000
 if len(os.Args) > 1 {
  n, err := strconv.Atoi(os.Args[1])
  if err == nil {
   numRoutines = n
  }
 }

 var wg sync.WaitGroup
 for i := 0; i < numRoutines; i++ {
  wg.Add(1)
  go func() {
   time.Sleep(10 * time.Second)
   wg.Done()
  }()
 }
 wg.Wait()
}

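All this code does is start as many goroutines as the task count passed in, and each goroutine simulates its workload by sleeping for 10s. This is equivalent to a million long-lived connections that are open but never send or receive a message.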

Compared with the previous problem, this one is easier to explain.

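Go's goroutines are stackful coroutines. The test uses a one-goroutine-per-task model and keeps a million tasks alive for a while, so the runtime really does create a million goroutines (G structures) and allocates a stack for each (starting at 2 KB). Even ignoring every other structure, the stack space alone adds up to a very large amount of memory: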

mem = 1,000,000 * 2,000 bytes = 2,000,000,000 bytes = 2 GB (taking 2 KB as roughly 2,000 bytes)

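So starting 1 million goroutines costs at least about 2 GB of memory, which fits the original author's result well (the original article reports a little over 2.5 GB). Memory also grows linearly with the number of goroutines.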
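The estimate above can be reproduced with a few lines of Go. This is only a worked version of the arithmetic, assuming an initial stack of exactly 2 KiB (2048 bytes) per goroutine and ignoring the G structure and all other runtime bookkeeping; estimateStackBytes is a hypothetical helper.

```go
package main

import "fmt"

// estimateStackBytes returns the memory needed for the initial stacks alone,
// ignoring the G struct and every other runtime structure.
func estimateStackBytes(goroutines, stackBytes uint64) uint64 {
	return goroutines * stackBytes
}

func main() {
	const goroutines = 1_000_000
	const initialStack = 2048 // assumption: 2 KiB initial stack, the "2k" figure from the text

	total := estimateStackBytes(goroutines, initialStack)
	fmt.Printf("%d bytes = %.2f GB = %.2f GiB\n",
		total, float64(total)/1e9, float64(total)/(1<<30))
	// prints: 2048000000 bytes = 2.05 GB = 1.91 GiB
}
```

The same function with 100,000 or 10,000 goroutines gives roughly 200 MB and 20 MB, the theoretical figures used later in the article.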

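How can memory usage be reduced, then? With a one-goroutine-per-task model this cost is hard to avoid, unless the Go team makes major changes to the goroutine implementation in the future.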

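If the tasks are network related, you can use a framework built directly on epoll, such as gnet. The main saving is that it does not start that many goroutines. Instead it processes data with a pool of goroutines, where each goroutine in the pool is responsible for a batch of network connections or requests.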
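The pooling idea can be sketched with a plain channel-based worker pool. This is not gnet's API and does no network I/O; it only illustrates, under that assumption, how a fixed number of goroutines can serve a much larger number of tasks. runWithPool and its parameters are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runWithPool processes numTasks tasks with a fixed number of worker
// goroutines and returns how many tasks were handled.
func runWithPool(numTasks, numWorkers int) int64 {
	tasks := make(chan int, numWorkers)
	var handled atomic.Int64 // requires Go 1.19 or later
	var wg sync.WaitGroup

	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range tasks { // each worker serves many tasks in turn
				handled.Add(1) // placeholder for the real work
			}
		}()
	}

	for i := 0; i < numTasks; i++ {
		tasks <- i
	}
	close(tasks) // lets the workers' range loops end
	wg.Wait()
	return handled.Load()
}

func main() {
	// One million tasks, but only 64 goroutine stacks alive at any time.
	fmt.Println(runWithPool(1_000_000, 64))
}
```

The trade-off is that at most numWorkers tasks make progress at once, so this model suits short, event-driven work rather than tasks that each block for a long time.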

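Some Gophers believe that goroutines are never reclaimed once allocated, which leads them to assume that after creating 1 million goroutines the 2.5 GB stays occupied forever. Is that really the case? A sample program is enough to check: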

// why-go-sucks/million-tasks/million-tasks.go

package main

import (
 "fmt"
 "log"
 "os"
 "os/signal"
 "runtime"
 "sync"
 "syscall"
 "time"
)

// printMemoryUsage prints current memory usage and related runtime figures
func printMemoryUsage() {
 var m runtime.MemStats
 runtime.ReadMemStats(&m)

 // Get the current number of goroutines
 numGoroutines := runtime.NumGoroutine()

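 // Stand-in for the thread count: NumCPU returns the number of logical CPUs, not OS threads
 numThreads := runtime.NumCPU() // the runtime package has no direct call for the live thread count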

 fmt.Printf("======>\n")
 fmt.Printf("Alloc = %v MiB", bToMb(m.Alloc))
 fmt.Printf("\tTotalAlloc = %v MiB", bToMb(m.TotalAlloc))
 fmt.Printf("\tSys = %v MiB", bToMb(m.Sys))
 fmt.Printf("\tNumGC = %v", m.NumGC)
 fmt.Printf("\tNumGoroutines = %v", numGoroutines)
 fmt.Printf("\tNumThreads = %v\n", numThreads)
 fmt.Printf("<======\n\n")
}

// bToMb converts bytes to MiB
func bToMb(b uint64) uint64 {
 return b / 1024 / 1024
}

func main() {
 const signal1Goroutines = 900000
 const signal2Goroutines = 90000
 const signal3Goroutines = 10000

 // Receives the exit signals
 sigChan := make(chan os.Signal, 1)
 signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

 // Channels that tell each class of goroutines to exit
 signal1Chan := make(chan struct{})
 signal2Chan := make(chan struct{})
 signal3Chan := make(chan struct{})

 var wg sync.WaitGroup
 ticker := time.NewTicker(5 * time.Second)
 go func() {
  for range ticker.C {
   printMemoryUsage()
  }
 }()

 // Wait for exit signals
 go func() {
  count := 0
  for {
   <-sigChan
   count++
   if count == 1 {
    log.Println("收到第一類goroutine退出信號")
    close(signal1Chan) // Close signal1Chan to tell the first class of goroutines to exit
    continue
   }
   if count == 2 {
    log.Println("收到第二類goroutine退出信號")
    close(signal2Chan) // Close signal2Chan to tell the second class of goroutines to exit
    continue
   }
   log.Println("收到第三類goroutine退出信號")
   close(signal3Chan) // Close signal3Chan to tell the third class of goroutines to exit
   return
  }
 }()

 // Start the first class of goroutines (exit on signal1)
 log.Println("開始啟動第一類goroutine...")
 for i := 0; i < signal1Goroutines; i++ {
  wg.Add(1)
  go func(id int) {
   defer wg.Done()
   // Simulate work
   for {
    select {
    case <-signal1Chan:
     return
    default:
     time.Sleep(10 * time.Second) // Simulate some work
    }
   }
  }(i)
 }
 log.Println("啟動第一類goroutine(900000) ok")

 time.Sleep(time.Second * 5)

 // Start the second class of goroutines (exit on signal2)
 log.Println("開始啟動第二類goroutine...")
 for i := 0; i < signal2Goroutines; i++ {
  wg.Add(1)
  go func(id int) {
   defer wg.Done()
   // Simulate work
   for {
    select {
    case <-signal2Chan:
     return
    default:
     time.Sleep(10 * time.Second) // Simulate some work
    }
   }
  }(i)
 }
 log.Println("啟動第二類goroutine(90000) ok")

 time.Sleep(time.Second * 5)

 // Start the third class of goroutines (exit on the third signal, as the program shuts down)
 log.Println("開始啟動第三類goroutine...")
 for i := 0; i < signal3Goroutines; i++ {
  wg.Add(1)
  go func(id int) {
   defer wg.Done()
   // Simulate work
   for {
    select {
    case <-signal3Chan:
     return
    default:
     time.Sleep(10 * time.Second) // Simulate some work
    }
   }
  }(i)
 }
 log.Println("啟動第三類goroutine(90000) ok") // the message says 90000, but signal3Goroutines is 10000

 // Wait for all goroutines to finish
 wg.Wait()
 fmt.Println("所有 goroutine 已退出，程序結束")
}

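I won't explain million-tasks.go in detail. Roughly, it has three classes of goroutines: the first class has 900,000, which exit after I send the first ctrl+c signal; the second has 90,000, which exit after the second ctrl+c; and the last has 10,000, which exit when the program exits.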
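One detail of million-tasks.go is worth a remark: the NumThreads value it prints comes from runtime.NumCPU(), so it is the number of logical CPUs rather than a thread count. As a sketch of one alternative, the built-in "threadcreate" profile in runtime/pprof reports how many OS threads the runtime has created so far (created, not necessarily still alive). The helper name threadsCreated is hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
)

// threadsCreated returns the number of OS threads the Go runtime has created
// so far, taken from the built-in "threadcreate" profile.
func threadsCreated() int {
	return pprof.Lookup("threadcreate").Count()
}

func main() {
	fmt.Println("NumCPU:         ", runtime.NumCPU())
	fmt.Println("threads created:", threadsCreated())
}
```

The two numbers measure different things, so they will generally not be equal.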

I compiled and ran million-tasks.go in my environment, and watched its memory usage through the runtime output together with top -p pid:

$go build million-tasks.go
$./million-tasks 

2024/12/01 22:07:03 開始啟動第一類goroutine...
2024/12/01 22:07:05 啟動第一類goroutine(900000) ok
======>
Alloc = 511 MiB TotalAlloc = 602 MiB Sys = 2311 MiB NumGC = 9 NumGoroutines = 900004 NumThreads = 8
<======

2024/12/01 22:07:10 開始啟動第二類goroutine...
2024/12/01 22:07:11 啟動第二類goroutine(90000) ok
======>
Alloc = 577 MiB TotalAlloc = 668 MiB Sys = 2553 MiB NumGC = 9 NumGoroutines = 990004 NumThreads = 8
<======

2024/12/01 22:07:16 開始啟動第三類goroutine...
2024/12/01 22:07:16 啟動第三類goroutine(90000) ok
======>
Alloc = 597 MiB TotalAlloc = 688 MiB Sys = 2593 MiB NumGC = 9 NumGoroutines = 1000004 NumThreads = 8
<======

======>
Alloc = 600 MiB TotalAlloc = 690 MiB Sys = 2597 MiB NumGC = 9 NumGoroutines = 1000004 NumThreads = 8
<======
... ...

======>
Alloc = 536 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 10 NumGoroutines = 1000004 NumThreads = 8
<======

Once all 1 million goroutines had been created, here is the top output:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                      
 5800 root      20   0 3875556   2.5g    988 S  54.0  8.2   0:30.92 million-tasks

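RES is 2.5g, as expected!

Next, we stop the first batch of 900,000 goroutines to see whether RES drops, and when.

Press ctrl+c to stop the first batch of 900,000 goroutines (the log line reports that the exit signal for the first class was received):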

^C2024/12/01 22:10:15 收到第一類goroutine退出信號
======>
Alloc = 536 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 10 NumGoroutines = 723198 NumThreads = 8
<======

======>
Alloc = 536 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 10 NumGoroutines = 723198 NumThreads = 8
<======

======>
Alloc = 536 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 10 NumGoroutines = 100004 NumThreads = 8
<======

======>
Alloc = 536 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 10 NumGoroutines = 100004 NumThreads = 8
<======
... ...

But top at the same moment shows RES unchanged:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                      
 5800 root      20   0 3875812   2.5g    988 S   0.0  8.2   0:56.38 million-tasks

After waiting two GC intervals (about 4 minutes), the goroutines' stack space is released:

======>
Alloc = 465 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 12 NumGoroutines = 100004 NumThreads = 8
<======

======>
Alloc = 465 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 12 NumGoroutines = 100004 NumThreads = 8
<======

======>
Alloc = 465 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 12 NumGoroutines = 100004 NumThreads = 8
<======

======>
Alloc = 465 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 12 NumGoroutines = 100004 NumThreads = 8
<======

top shows RES dropping from 2.5g to a bit over 700 MB (the RES column is in KB):

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                      
 5800 root      20   0 3875812 764136    992 S   0.0  2.4   1:01.87 million-tasks

Next, we stop the second batch of 90,000 goroutines (the log line reports the exit signal for the second class):

^C2024/12/01 22:16:21 收到第二類goroutine退出信號
======>
Alloc = 465 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 13 NumGoroutines = 100004 NumThreads = 8
<======

======>
Alloc = 465 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 13 NumGoroutines = 100004 NumThreads = 8
<======

======>
Alloc = 465 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 13 NumGoroutines = 10004 NumThreads = 8
<======

======>
Alloc = 465 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 13 NumGoroutines = 10004 NumThreads = 8
<======

Again, the value in top does not change right away:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                      
 5800 root      20   0 3875812 764136    992 S   0.0  2.4   1:05.99 million-tasks

After roughly one GC interval (2 minutes), RES in top drops:

======>
Alloc = 458 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 14 NumGoroutines = 10004 NumThreads = 8
<======

======>
Alloc = 458 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 14 NumGoroutines = 10004 NumThreads = 8
<======

======>
Alloc = 458 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 14 NumGoroutines = 10004 NumThreads = 8
<======

RES is now just under 700 MB:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                      
 5800 root      20   0 3875812 699156    992 S   0.0  2.2   1:06.75 million-tasks

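Pressing ctrl+c a third time makes the program exit (the last line of output says that all goroutines have exited and the program has ended):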

^C2024/12/01 22:18:46 收到第三類goroutine退出信號
======>
Alloc = 458 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 14 NumGoroutines = 10003 NumThreads = 8
<======

======>
Alloc = 458 MiB TotalAlloc = 695 MiB Sys = 2606 MiB NumGC = 14 NumGoroutines = 10003 NumThreads = 8
<======

所有 goroutine 已退出，程序結束

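We can see that Go does reclaim the memory used by goroutines and returns it to the OS, but it does so lazily. Notably, before the second stop the program still had 100,000 goroutines, which in theory need about 200 MB, yet RES was more than 700 MB. After the second stop the count dropped to 10,000, which in theory needs about 20 MB, yet RES was still just under 700 MB. This lazy return of memory to the OS may well be deliberate on the runtime's part, to avoid repeatedly requesting memory from the OS and handing it back.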
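For readers who want to watch this behaviour without waiting for the periodic GC, the following sketch uses debug.FreeOSMemory from runtime/debug, which forces a garbage collection and then tries to return as much memory to the OS as possible. It was not part of the author's experiment. The helper names (stackInUse, measure) and the goroutine count are assumptions, and the exact numbers will differ between Go versions and platforms.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"sync"
)

// stackInUse returns the bytes currently held in goroutine stack spans.
func stackInUse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.StackInuse
}

// measure starts n goroutines that block until released. It returns
// StackInuse while they are parked, and again after they have been
// released and debug.FreeOSMemory has run.
func measure(n int) (during, after uint64) {
	release := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release // park until the channel is closed
		}()
	}
	during = stackInUse()

	close(release)
	wg.Wait()
	debug.FreeOSMemory() // force a GC and try to hand freed memory back to the OS
	after = stackInUse()
	return during, after
}

func main() {
	during, after := measure(100_000)
	fmt.Printf("StackInuse with goroutines parked:    %d KiB\n", during/1024)
	fmt.Printf("StackInuse after exit + FreeOSMemory: %d KiB\n", after/1024)
}
```

Calling debug.FreeOSMemory in production code is rarely necessary; the sketch only makes the reclamation visible on demand, while RES as seen by the OS can still lag behind the runtime's own counters.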

3. Summary

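This article looked at why Go performed unexpectedly worse than Java and C in the billion-loop and million-task tests. In my view, Go's slowness in the loop test comes mainly from insufficient compiler optimization, which hurts execution efficiency. On the memory side, the goroutine implementation is the "culprit" behind the large increase in memory usage, because every goroutine is initially given a fixed-size stack.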

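My main aim with this article is that people stop passing on hearsay, understand the real reasons behind these results, and face up to Go's shortcomings in some areas as well as its current limitations in certain application contexts. I also hope the Go team will invest more in compiler optimization to make Go more competitive in high-performance computing.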

The source code used in this article can be downloaded here[5].

4. References

Billion nested loop iterations[6] - https://benjdd.com/languages/
How Much Memory Do You Need in 2024 to Run 1 Million Concurrent Tasks?[7] - https://hez2010.github.io/async-runtimes-benchmarks-2024/

Reference links

[1] Ben Dicken (@BenjDicken): https://benjdd.com

[2] Language performance test: https://benjdd.com/languages/

[3] Memory overhead test: https://hez2010.github.io/async-runtimes-benchmarks-2024/

[4] Go test program: https://github.com/bddicken/languages/blob/main/loops/go/code.go

[5] here: https://github.com/bigwhite/experiments/tree/master/why-go-sucks

[6] Billion nested loop iterations: https://benjdd.com/languages/

[7] How Much Memory Do You Need in 2024 to Run 1 Million Concurrent Tasks?: https://hez2010.github.io/async-runtimes-benchmarks-2024/

Editor: 武曉燕. Source: TonyBai

