A Deeper Look at kube-apiserver Memory Optimization

Author: Kaku 2023-12-02 20:41:32

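Counting the groundwork laid earlier, the kube-apiserver memory optimization series now runs to six articles. If you work through everything they cover, your understanding of kube-apiserver should move up a level. I will keep following this area and add to the series from time to time.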

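Principles

Memory optimization is a classic problem. Before looking at what Kubernetes actually did, it helps to abstract the situation and ask how we would approach it ourselves. Reduced to its essentials: external clients concurrently request data from a server, so how do we lower the server's memory consumption without hurting throughput? There are generally two approaches:

Cache the serialization results
Reduce memory allocations during serialization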

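Data compression does not really help in this scenario. Compression reduces the bandwidth used on the network and so can speed up responses, but it does little for server-side memory. kube-apiserver already supports gzip compression; the client only needs to set Accept-Encoding to gzip. See the official documentation[1] for details.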

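Caching serialization results pays off when there are many client requests, especially when the server has to send the same data to many clients at once. The object is serialized only once, and every later send reuses that result, which saves the CPU and memory that repeated serialization would cost. The cache itself takes some memory, though, so with only a few clients the gain may be small, and the cache may even cost more memory than it saves. kube-apiserver watch requests match this scenario very closely.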
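To make the trade-off concrete, here is a minimal sketch of the idea, not kube-apiserver code. The event type, the in-memory "clients", and the function names are all hypothetical; the only point is that the cached variant calls the serializer once no matter how many clients receive the data.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// event is a hypothetical payload standing in for a watch event.
type event struct {
	Type string `json:"type"`
	Name string `json:"name"`
}

// fanOutNaive serializes the event again for every client and
// returns how many serializations were performed.
func fanOutNaive(ev event, clients []*bytes.Buffer) (int, error) {
	n := 0
	for _, c := range clients {
		raw, err := json.Marshal(ev)
		if err != nil {
			return n, err
		}
		n++
		c.Write(raw)
	}
	return n, nil
}

// fanOutCached serializes once and writes the same bytes to every client.
func fanOutCached(ev event, clients []*bytes.Buffer) (int, error) {
	raw, err := json.Marshal(ev)
	if err != nil {
		return 0, err
	}
	for _, c := range clients {
		c.Write(raw)
	}
	return 1, nil
}

// newClients creates n in-memory buffers that play the role of connections.
func newClients(n int) []*bytes.Buffer {
	out := make([]*bytes.Buffer, n)
	for i := range out {
		out[i] = &bytes.Buffer{}
	}
	return out
}

func main() {
	ev := event{Type: "MODIFIED", Name: "big-endpoints"}
	naive, _ := fanOutNaive(ev, newClients(5000))
	cached, _ := fanOutCached(ev, newClients(5000))
	fmt.Println(naive, cached) // prints "5000 1"
}
```

With a single client the two variants do the same work, which is why the cache only earns its keep once the number of receivers grows.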

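The rest of this article describes how kube-apiserver optimizes both points.

Implementation

The issues and optimizations in the timelines below are very likely only a subset of all the related work.

Caching serialization results

Timeline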

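1. Back in 2019, someone in the community reported an issue[2]: in a 5000-node cluster, creating one large Endpoints object (5000 Pods, close to 1 MB) could completely overload kube-apiserver within 5 seconds.
2. The community tracked the problem down and proposed KEP 1152 less object serializations[3], which lowers kube-apiserver load and the number of memory allocations by not serializing the same object again for each watcher. The feature shipped in v1.17; in 5000-node tests it reduced memory allocations by ~15% and CPU usage by ~5%. The optimization only took effect for HTTP, not for WebSocket.
3. About four years later, in 2023, Refactor apiserver endpoint transformers to more natively use Encoders #119801[4] refactored the serialization logic so that all serialization goes through the Encoder interface. The corresponding issue 83898[5] had been open since 2019. The refactor also fixed the WebSocket gap mentioned in item 2, and shipped in 1.29.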

So unless you watch over WebSocket (the default is HTTP with Transfer-Encoding: chunked), upgrading to 1.17 or later should in theory be enough.

How it works

[Image]

A CacheableObject interface was added, and every Encoder gained support for CacheableObject, as follows:

// Identifier represents an identifier.
// Identifier of two different objects should be equal if and only if for every
// input the output they produce is exactly the same.
type Identifier string

type Encoder interface {
 ...
 // Identifier returns an identifier of the encoder.
 // Identifiers of two different encoders should be equal if and only if for every input
 // object it will be encoded to the same representation by both of them.
 Identifier() Identifier
}


// CacheableObject allows an object to cache its different serializations
// to avoid performing the same serialization multiple times.
type CacheableObject interface {
 // CacheEncode writes an object to a stream. The <encode> function will
 // be used in case of cache miss. The <encode> function takes ownership
 // of the object.
 // If CacheableObject is a wrapper, then deep-copy of the wrapped object
 // should be passed to <encode> function.
 // CacheEncode assumes that for two different calls with the same <id>,
 // <encode> function will also be the same.
 CacheEncode(id Identifier, encode func(Object, io.Writer) error, w io.Writer) error

 // GetObject returns a deep-copy of an object to be encoded - the caller of
 // GetObject() is the owner of returned object. The reason for making a copy
 // is to avoid bugs, where caller modifies the object and forgets to copy it,
 // thus modifying the object for everyone.
 // The object returned by GetObject should be the same as the one that is supposed
 // to be passed to <encode> function in CacheEncode method.
 // If CacheableObject is a wrapper, the copy of wrapped object should be returned.
 GetObject() Object
}

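func (e *Encoder) Encode(obj Object, stream io.Writer) error {
 // Objects that can cache their serializations take the cached path;
 // doEncode is only invoked on a cache miss.
 if co, ok := obj.(CacheableObject); ok {
  return co.CacheEncode(e.Identifier(), e.doEncode, stream)
 }
 return e.doEncode(obj, stream)
}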

func (e *Encoder) doEncode(obj Object, stream io.Writer) error {
 // Existing encoder logic.
}

// serializationResult captures a result of serialization.
type serializationResult struct {
 // once should be used to ensure serialization is computed once.
 once sync.Once

 // raw is serialized object.
 raw []byte
 // err is error from serialization.
 err error
}

// metaRuntimeInterface implements runtime.Object and
// metav1.Object interfaces.
type metaRuntimeInterface interface {
 runtime.Object
 metav1.Object
}

// cachingObject is an object that is able to cache its serializations
// so that each of those is computed exactly once.
//
// cachingObject implements the metav1.Object interface (accessors for
// all metadata fields). However, setters for all fields except from
// SelfLink (which is set lately in the path) are ignored.
type cachingObject struct {
 lock sync.RWMutex

 // Object for which serializations are cached.
 object metaRuntimeInterface

 // serializations is a cache containing object's serializations.
 // The value stored in atomic.Value is of type serializationsCache.
 // The atomic.Value type is used to allow fast-path.
 serializations atomic.Value
}

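cachingObject implements the CacheableObject interface. Its object field is the object the event is about (a Pod, for example), and serializations holds the results once they are serialized. Identifier is a key that identifies the kind of serialization, since there are three formats: json, yaml, and protobuf.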

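A cachingObject is created where, in the figure above, Cacher dispatchEvent consumes its own incoming chan and sends each event to all relevant cacheWatchers: the event object is wrapped in a cachingObject before being sent to each cacheWatcher's input chan. The actual Encode call happens in the serveWatch method when the final object is serialized. It first checks whether a serialized result already exists and, if so, reuses it instead of serializing again.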

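Notes:

The condition wrap into cachingObject if len(watchers) >= 3 in the figure above is now history. The current code has dropped the check and wraps the object in a cachingObject regardless of how many watchers there are.

Init Events (the full data set held in watchcache) are not wrapped in a cachingObject; only data sent to the Cacher incoming chan is. This means the optimization does nothing at all for Get/List requests, because they return data straight from watchcache. For Watch requests, part of the data is also returned without reusing an existing serialization, because some Init Event data may still be fetched from watchcache and returned. One surprising consequence: the object of an event on a cacheWatcher's input chan may be a plain resource object such as a Pod, or it may be a CacheableObject, in which case the real resource object sits in the CacheableObject's object field.

Why not cover Init Events as well? The explanation in KEP 1152 is that covering the Cacher incoming chan already gave a sizable gain and solved the problem that had been found; covering Init Events would be re-evaluated if further optimization was needed. The comments on Refactor streaming watch encoder to enable caching #120300[6] contain a related discussion: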

[Image]

KEP 3157 watch-list[7] also mentions this as a pending optimization.

Optimizing memory allocation

Timeline

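1. reduce the number of allocations in the WatchServer during objects serialisation #108186[8] mainly optimizes protobuf and has no effect on json or yaml serialization. It shipped with v1.24 in 2022. protobuf is mostly used by internal components, while external components usually talk to k8s using json or yaml.
2. Do not copy bytes for cached serializations #118362[9] introduces a custom SpliceBuffer that avoids deep-copying the serialization result of a cachingObject. It shipped with v1.28 in 2023.
3. Refactor streaming watch encoder to enable caching #120300 builds on the existing cache of serialized resource objects and also caches the serialized Event, because what is finally returned to the client is an Event, not the resource object.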

How it works

For item 2, the SpliceBuffer neatly reduces memory allocation by using a shallow copy, which spares embeddedEncodeFn from deep-copying the already serialized []byte:

// A spliceBuffer implements Splice and io.Writer interfaces.
type spliceBuffer struct {
 raw []byte
 buf *bytes.Buffer
}

// Splice implements the Splice interface.
func (sb *spliceBuffer) Splice(raw []byte) {
 sb.raw = raw
}
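The difference between splicing and writing can be seen in a small self-contained sketch. The Write and Bytes methods and the sharesMemory helper below are assumptions added for illustration and may differ from the upstream implementation; the sketch only demonstrates that Splice keeps a reference to the cached bytes while Write copies them into a separate buffer.

```go
package main

import (
	"bytes"
	"fmt"
)

// spliceBuffer is a simplified stand-in for the type shown above.
type spliceBuffer struct {
	raw []byte
	buf *bytes.Buffer
}

// Splice stores a reference to raw without copying it.
func (sb *spliceBuffer) Splice(raw []byte) {
	sb.raw = raw
}

// Write copies p into an internal buffer, allocating it on first use.
func (sb *spliceBuffer) Write(p []byte) (int, error) {
	if sb.buf == nil {
		sb.buf = &bytes.Buffer{}
	}
	return sb.buf.Write(p)
}

// Bytes returns the written bytes if any, otherwise the spliced slice.
func (sb *spliceBuffer) Bytes() []byte {
	if sb.buf != nil && sb.buf.Len() > 0 {
		return sb.buf.Bytes()
	}
	return sb.raw
}

// sharesMemory reports whether two non-empty slices start at the same address.
func sharesMemory(a, b []byte) bool {
	return len(a) > 0 && len(b) > 0 && &a[0] == &b[0]
}

func main() {
	cached := []byte(`{"kind":"Pod"}`) // pretend this came from a cachingObject

	spliced := &spliceBuffer{}
	spliced.Splice(cached)

	copied := &spliceBuffer{}
	if _, err := copied.Write(cached); err != nil {
		panic(err)
	}

	// The spliced buffer aliases the cached bytes; the written one holds a copy.
	fmt.Println(sharesMemory(cached, spliced.Bytes()), sharesMemory(cached, copied.Bytes()))
	// prints "true false"
}
```

Because the spliced slice is shared, it must be treated as read-only by whoever receives it.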

The benchmark shows a clear improvement:

go test -benchmem -run=^$ -bench ^BenchmarkWrite k8s.io/apimachinery/pkg/runtime -v -count 1
goos: linux
goarch: amd64
pkg: k8s.io/apimachinery/pkg/runtime
cpu: AMD EPYC 7B12
BenchmarkWriteSplice
BenchmarkWriteSplice-48         151164015                7.929 ns/op           0 B/op          0 allocs/op
BenchmarkWriteBuffer
BenchmarkWriteBuffer-48          3476392               357.8 ns/op          1024 B/op          1 allocs/op
PASS
ok      k8s.io/apimachinery/pkg/runtime 3.619s

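For item 3, strictly speaking this PR is not about memory allocation. It addresses the high CPU usage caused by json.compact during json serialization, reported in issue 110146[10], and shipped with v1.29. The root cause is that, although cachingObject caches the serialized resource object as described above, what goes back to the client is an Event object, so the Event still has to be serialized. In Go's built-in json implementation, json.compact runs on the output of every MarshalJSON call; see the golang json source[11]. The fix builds on the cached resource-object serialization and also caches the serialized Event, which sidesteps the cost of json.compact.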
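The effect is easy to reproduce with the standard library alone. In this sketch, which is not the code from the PR, the watchEvent envelope and the eventCache type are hypothetical; json.RawMessage stands in for an already serialized object. Because RawMessage implements json.Marshaler, encoding/json rescans and compacts its bytes each time the envelope is marshaled, and caching the encoded envelope avoids repeating that work.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// watchEvent is a hypothetical envelope around an already serialized object.
type watchEvent struct {
	Type   string          `json:"type"`
	Object json.RawMessage `json:"object"`
}

// marshalEvent encodes the envelope. The embedded bytes are scanned and
// compacted by encoding/json on every call, even if they were cached.
func marshalEvent(eventType string, object []byte) ([]byte, error) {
	return json.Marshal(watchEvent{Type: eventType, Object: object})
}

// eventCache remembers the encoded envelope per event type for one object.
type eventCache struct {
	encoded map[string][]byte
	misses  int
}

// get returns the cached envelope, encoding it only on the first request.
func (c *eventCache) get(eventType string, object []byte) ([]byte, error) {
	if raw, ok := c.encoded[eventType]; ok {
		return raw, nil
	}
	raw, err := marshalEvent(eventType, object)
	if err != nil {
		return nil, err
	}
	c.misses++
	c.encoded[eventType] = raw
	return raw, nil
}

func main() {
	// Extra whitespace makes the compaction step visible in the output.
	object := []byte(`{ "kind": "Pod",  "metadata": {"name": "nginx"} }`)

	raw, err := marshalEvent("ADDED", object)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
	// prints {"type":"ADDED","object":{"kind":"Pod","metadata":{"name":"nginx"}}}

	cache := &eventCache{encoded: map[string][]byte{}}
	for i := 0; i < 3; i++ { // three watchers ask for the same event
		if _, err := cache.get("ADDED", object); err != nil {
			panic(err)
		}
	}
	fmt.Println("envelope encodings:", cache.misses) // prints "envelope encodings: 1"
}
```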

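This PR changes a lot, and I still have some open questions about its implementation. I have filed issue 122153[12] to ask the community, and once things are clear I may write a separate article on it. It touches the entire serialization logic of the watch handler, and the Encoders are nested very deeply. Even an expert from Google reviewing the code remarked on it:

[Image]

I got lost jumping back and forth between interfaces while reading this code, and only sorted out the Encoders by writing a unit test and stepping through it. They really are nested layer upon layer. Here is the chain, five layers deep: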

watchEncoder

-> watchEmbeddedEncoder

-> encoderWithAllocator

-> codec

-> json.Serializer

They all implement the Encoder interface...

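As with cachingObject serialization, caching the serialized Event takes extra memory, but it avoids the memory and CPU cost of serializing each Event multiple times, so it works as a memory optimization too.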

Results

With WatchList plus the optimizations above, the community published the following results.

Before

[Image]

After

[Image]

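Closing thoughts

Serialization sounds simple, just a method call, but using it well is not easy. Places like this are where skill shows: mastery is in the ordinary, and quality is in the details. Read code written by experts, take in the design and the thinking behind it, and absorb it for your own use.

k8s is easy to use, hard to use well, and harder still to understand underneath. After some ten years of iteration, the project's code size and complexity have become daunting, and it keeps changing. But however long the road, walking it gets you there; however hard the task, doing it may not guarantee success, but not doing it guarantees failure.

Talk is cheap, Show me the code and PPT

Finally, feel free to add me on WeChat at YlikakuY to talk about new technology and industry news.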

References

[1] kubernetes-api: https://kubernetes.io/zh-cn/docs/concepts/overview/kubernetes-api/

[2] issue#75294: https://github.com/kubernetes/kubernetes/issues/75294

[3] kep#1152-less-object-serializations: https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1152-less-object-serializations

[4] pr#119801: https://github.com/kubernetes/kubernetes/pull/119801

[5] issue#83898: https://github.com/kubernetes/kubernetes/issues/83898

[6] pr#120300: https://github.com/kubernetes/kubernetes/pull/120300

[7] kep#3157 watch-list: https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/3157-watch-list/README.md

[8] pr#108186: https://github.com/kubernetes/kubernetes/pull/108186

[9] pr#118362: https://github.com/kubernetes/kubernetes/pull/118362

[10] issue#110146: https://github.com/kubernetes/kubernetes/issues/110146

[11] golang#json: https://github.com/golang/go/blob/d8762b2f4532cc2e5ec539670b88bbc469a13938/src/encoding/json/encode.go#L498

[12] issue#122153: https://github.com/kubernetes/kubernetes/issues/122153

Editor: 武曉燕 Source: 云原生散修
