UCL's Jun Wang Calls for Innovation: Post-ChatGPT Theories of Artificial General Intelligence and Their Applications
*This article was originally written in English; the accompanying Chinese translation was produced by ChatGPT.*
ChatGPT has recently caught the eye of the research community, the commercial sector, and the general public. It is a general-purpose chatbot that can respond to open-ended prompts or questions from users. Curiosity has been piqued by its superior, human-like language skills, delivering coherent, consistent, and well-structured responses. Thanks to a large pre-trained generative language model, its multi-turn dialogue interaction supports a wide range of text- and code-based tasks, including novel creation, letter composition, textual gameplay, and even robot manipulation through code generation. This has given the public faith that generalist machine learning and machine understanding are achievable very soon.
If one digs deeper, one may discover that when programming code is added as training data and models reach a particular size, certain reasoning abilities, common-sense understanding, and even chain of thought (a series of intermediate reasoning steps) may appear as emergent abilities [1]. While this new finding is exciting and opens up new possibilities for AI research and applications, it provokes more questions than it resolves. Can these emergent abilities, for example, serve as an early indicator of higher intelligence, or are they simply naive mimicry of human behaviour hidden in the data? Would continuing to expand already enormous models lead to the birth of artificial general intelligence (AGI), or are these models merely superficially intelligent, with constrained capability? If answered, these questions may lead to fundamental shifts in artificial-intelligence theory and applications.
We therefore urge not just replicating ChatGPT's successes but, more importantly, pushing forward ground-breaking research and novel application development in the following areas of artificial intelligence (by no means an exhaustive list):
1. New machine learning theory that goes beyond the established paradigm of task-specific machine learning
Inductive reasoning is a type of reasoning in which we draw conclusions about the world from past observations. Machine learning can be loosely regarded as inductive reasoning in the sense that it leverages past (training) data to improve performance on new tasks. Taking machine translation as an example, a typical machine learning pipeline involves the following four major steps:
1. Define the specific problem, e.g., translating English sentences into Chinese: E → C.
2. Collect the data, e.g., sentence pairs {E → C}.
3. Train a model, e.g., a deep neural network with inputs {E} and outputs {C}.
4. Apply the model to an unseen data point, e.g., input a new English sentence E′, output a Chinese translation C′, and evaluate the result.
As shown above, traditional machine learning isolates the training for each specific task. Hence, for each new task one must reset and redo the process from step 1 to step 4, losing all knowledge (data, models, etc.) acquired from previous tasks. For instance, you would need a different model to translate French, rather than English, into Chinese.
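The four steps above can be sketched in code. This toy substitutes a word-lookup table for the deep network of step 3 (the data and model here are hypothetical, purely for illustration), but it exhibits the same isolation: a model trained on English pairs is useless for French.

```python
# A minimal sketch of the four-step task-specific pipeline, using a toy
# word-lookup "model" in place of a deep neural network.

def train_model(pairs):
    """Step 3: 'train' by memorising word alignments from sentence pairs."""
    model = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            model[s] = t
    return model

def apply_model(model, sentence):
    """Step 4: apply to an unseen input; unknown words expose the model's limits."""
    return " ".join(model.get(w, "<unk>") for w in sentence.split())

# Steps 1-2: the task is fixed as English -> Chinese; data is {E -> C} pairs.
en_zh_pairs = [("I love tea", "我 爱 茶"), ("you love tea", "你 爱 茶")]
en_zh_model = train_model(en_zh_pairs)

print(apply_model(en_zh_model, "you love tea"))   # 你 爱 茶
print(apply_model(en_zh_model, "j'aime le thé"))  # all <unk>: a French task needs a new model
```

The same restart-from-scratch cost applies to any real pipeline: the French task would require new data collection and a new training run.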
Under this paradigm, the job of machine learning theorists has focused chiefly on understanding the generalisation ability of a learning model from the training data to unseen test data [2, 3]. A common question, for instance, is how many training samples are needed to achieve a certain error bound when predicting unseen test data. We know that inductive bias (i.e., prior knowledge or prior assumptions) is required for a learning model to predict outputs it has not encountered: the output value in unknown circumstances is completely arbitrary, making it impossible to address the problem without making certain assumptions. The celebrated no-free-lunch theorem [5] further states that any inductive bias has limitations; it is suitable only for a certain group of problems, and it may fail elsewhere if the assumed prior knowledge is incorrect.
Figure 1 A screenshot of ChatGPT used for machine translation. The prompt contains an instruction only; no demonstration example is necessary.
While the above theories still hold, the arrival of foundation language models may have altered our approach to machine learning. The new machine learning pipeline could be the following (using the same machine translation problem as an example; see Figure 1):
1. Access, via an API, a foundation language model trained elsewhere by others, e.g., a model trained on diverse documents including paired English/Chinese corpora.
2. With a few examples, or none at all, design a suitable text description (known as a prompt) for the task at hand, e.g., Prompt = {a few examples E → C}.
3. Conditioned on the prompt and a given new test data point, the language model generates the answer, e.g., append E′ to the prompt and generate C′ from the model.
4. Interpret the answer as the predicted result.
As shown in step 1, the foundation language model serves as a one-size-fits-all knowledge repository. The prompt (and context) presented in step 2 allows the foundation language model to be customised to a specific goal or problem with only a few demonstration instances. While this pipeline is primarily limited to text-based problems, it is reasonable to assume that, as cross-modality foundation pre-trained models develop (see Section 3), it will become the standard for machine learning in general. This could break down the task barriers that currently seem necessary and pave the way for AGI.
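The prompt-based pipeline can be sketched as follows. Only the prompt assembly of step 2 is concrete here; the model call of step 3 is a hypothetical stand-in for whatever API serves the foundation model.

```python
# A sketch of the new prompt-based pipeline. Only the prompt assembly is real;
# sending the prompt to a foundation-model API is left as a comment.

def build_prompt(instruction, demos, query):
    """Step 2: wrap an instruction, optional demonstrations, and the new
    test input E' into a single text prompt."""
    lines = [instruction]
    for src, tgt in demos:                       # few-shot examples E -> C
        lines.append(f"English: {src}\nChinese: {tgt}")
    lines.append(f"English: {query}\nChinese:")  # append E'; the model completes C'
    return "\n\n".join(lines)

prompt = build_prompt(
    "Translate English to Chinese.",
    demos=[("I love tea", "我爱茶")],  # zero demos also works, as in Figure 1
    query="Good morning",
)
print(prompt)
# Step 3 would send `prompt` to the model, e.g. answer = generate(prompt);
# step 4 interprets the completion as the translation C'.
```

Switching to French-to-Chinese now means editing one string, not collecting data and retraining, which is exactly the task-barrier collapse discussed above.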
However, we are still early in understanding how the demonstration examples in a prompt operate. From some early work [2] we now understand empirically that the format of demonstration samples matters more than the correctness of their labels (for instance, as illustrated in Figure 1, we need not provide example translations, only the language instruction). But are there theoretical limits to this adaptability, as stated in the no-free-lunch theorem? Can the context and instruction-based knowledge stated in prompts (step 2) be integrated into the model for future use? We are only scratching the surface with these inquiries. We therefore call for a new understanding of, and new principles behind, this new form of in-context learning and its theoretical limitations and properties, such as generalisation bounds.
Figure 2 An illustration of AIGA for designing computer games.
2. Developing reasoning skills
We are on the edge of an exciting era in which all our linguistic and behavioural data can be mined to train, and be absorbed by, an enormous computerised model. It is a tremendous accomplishment that our whole collective experience and civilisation could be digested into a single (hidden) knowledge base, in the form of artificial neural networks, for later use. Indeed, ChatGPT and large foundation models are said to demonstrate some form of reasoning capacity, and arguably may even grasp the mental states of others to some extent (theory of mind) [6]. This is accomplished by data fitting (predicting masked language tokens as training signals) and imitation (of human behaviours). Yet it is debatable whether this entirely data-driven strategy will bring us greater intelligence.
To illustrate this notion, consider teaching an agent to play chess. Even with access to a limitless amount of human play data, it would be very difficult for the agent, by merely imitating existing policies, to generate new policies better than those already present in the data. Using the data, however, one can develop an understanding of the world (e.g., the rules of the game) and use it to "think": construct a simulator in its brain to gather feedback and create more optimal policies. This highlights the importance of inductive bias; rather than relying on simple brute force, a learning agent needs some model of the world, inferred from the data, in order to improve itself.
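The contrast can be made concrete on a game far simpler than chess (a toy of our own, not from the article): players alternately take 1 or 2 stones, and whoever takes the last stone wins. An imitation policy can only repeat what the data contains, while a planner that knows the rules, a world model, simulates moves and finds the winning play.

```python
# Imitation vs. planning with a world model, on a toy take-1-or-2-stones game
# (taking the last stone wins). Illustrative only.
import functools

def world_model(n, move):
    """The game rules: removing `move` stones from a pile of n."""
    return n - move

@functools.lru_cache(maxsize=None)
def best_move(n):
    """Plan by simulating the rules: a move wins if it empties the pile or
    leaves the opponent with no winning reply."""
    for move in (1, 2):
        if move <= n:
            rest = world_model(n, move)
            if rest == 0 or best_move(rest) is None:
                return move
    return None  # every reachable position lets the opponent win

def imitation_policy(n):
    return 1  # mimics suboptimal human data: always take one stone

print(best_move(5))          # the planner finds the winning move (take 2)
print(imitation_policy(5))   # imitation repeats the data, optimal or not
```

From a pile of 5, taking 2 leaves the opponent the losing position 3; imitation of the "always take 1" data hands the win away, however much such data is collected.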
Thus, there is an urgent need to thoroughly investigate and understand the emerging capabilities of foundation models. Beyond language skills, we advocate research into acquiring actual reasoning ability by investigating the underlying mechanisms [9]. One promising approach is to draw inspiration from neuroscience and brain science to decipher the mechanics of human reasoning and advance language-model development. At the same time, building a solid theory of mind may also require in-depth knowledge of multiagent learning [10, 11] and its underlying principles.
3. From AI Generating Content (AIGC) to AI Generating Action (AIGA)
The implicit semantics developed on top of human languages is integral to foundation language models, and how to utilise it is a crucial topic for generalist machine learning. For example, once the semantic space is aligned with other media (such as photos, videos, and sounds), or with other forms of human and machine behavioural data such as robotic trajectories/actions, we acquire semantic-interpretation power for them at no additional cost [7, 14]. In this manner, machine learning (prediction, generation, and decision-making) becomes generic and decomposable. Yet cross-modality alignment is a substantial hurdle, owing to the labour-intensive nature of labelling the relationships. Additionally, human-value alignment becomes difficult when numerous parties have conflicting interests.
A fundamental drawback of ChatGPT is that it can communicate directly only with humans. Yet, once sufficient alignment with the external world has been established, foundation language models should be able to learn how to interact with various parties and environments [7, 14]. This is significant because it will extend their reasoning ability and language-based semantics to applications and capabilities far beyond conversation. For instance, such a model may evolve into a generalist agent capable of navigating the Internet [7], controlling computers [13], and manipulating robots [12]. It therefore becomes all the more important to implement procedures ensuring that the agent's responses (often in the form of generated actions) are secure, reliable, unbiased, and trustworthy.
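One minimal sketch of the AIGC-to-AIGA step, under our own hypothetical interfaces: the model's output is parsed as actions to execute in an environment rather than returned as text, with a whitelist gate standing in for the safety procedures just mentioned.

```python
# A toy AIGA loop: model output becomes executed actions, behind a safety gate.
# `plan` is a hypothetical stand-in for a foundation model generating actions.

def plan(goal):
    return ["move_to(door)", "open(door)"]  # model-generated action strings

ALLOWED = {"move_to", "open"}  # safety gate: only whitelisted actions may run

def execute(action, log):
    name = action.split("(")[0]
    if name not in ALLOWED:
        raise ValueError(f"blocked unsafe action: {action}")
    log.append(action)  # stand-in for actually driving the environment

log = []
for act in plan("leave the room"):
    execute(act, log)
print(log)
```

Anything outside the whitelist, say a generated `delete(all)`, is refused before it reaches the environment; real agent systems need far richer verification, but the gate's position in the loop is the point.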
Figure 2 provides a demonstration of AIGA [7] interacting with a game engine to automate the process of designing a video game.
4. Multiagent theories of interactions with foundation language models
ChatGPT uses in-context learning and prompt engineering to drive multi-turn dialogue with people in a single session: given a question or prompt, the entire prior conversation (questions and responses) is sent to the system as extra context for constructing the response. It is a straightforward Markov decision process (MDP) model of conversation:
{State = context, Action = response, Reward = thumbs-up/down rating}.
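This MDP view of a session can be written out directly; `respond` below is a placeholder for the language model, and `rate` for the user's thumbs-up/down.

```python
# A minimal sketch of dialogue as an MDP: state = accumulated conversation,
# action = generated response, reward = the user's rating.

def respond(context):
    """Placeholder for the language model conditioned on the full context."""
    return f"[reply to: {context[-1]}]"

def dialogue_step(context, user_message, rate):
    """One MDP transition."""
    state = context + [user_message]  # State: the entire prior conversation
    action = respond(state)           # Action: the model's response
    reward = rate(action)             # Reward: thumbs up (+1) / down (-1)
    return state + [action], reward

context, r = dialogue_step([], "Hello", rate=lambda a: +1)
context, r = dialogue_step(context, "Tell me more", rate=lambda a: +1)
print(context)  # the whole history is re-sent as state on every turn
```

Note that the state grows by two entries per turn, which is why context-length limits bound how long such a session can remain faithful to this model.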
While effective, this strategy has drawbacks. First, a prompt only describes the desired response; the user's genuine intent may not be explicitly stated and must be inferred. Perhaps a more robust model, as proposed previously for conversation bots, would be a partially observable Markov decision process (POMDP) that explicitly models the hidden user intent.
Second, ChatGPT is first trained on a language-fitting objective and then fine-tuned with human labels for conversation goals. Owing to the platform's open-ended nature, actual users' aims and objectives may not align with the trained/fine-tuned rewards. To examine the equilibria and conflicting interests of humans and agents, it may be worthwhile to adopt a game-theoretic perspective [9].
5. Novel applications
As demonstrated by ChatGPT, foundation language models have two distinctive characteristics that we believe will drive future machine learning and foundation-language-model applications. The first is their superior linguistic skill; the second is their embedded semantics and early reasoning abilities (in the form of human language). As an interface, the former will greatly lower the entry barrier to applied machine learning, while the latter will significantly generalise how machine learning is applied.
As demonstrated by the new learning pipeline presented in Section 1, prompts and in-context learning eliminate the bottleneck of data engineering and the effort required to construct and train a model. Moreover, exploiting the reasoning capabilities could enable us to automatically decompose a hard task and solve each subtask. Hence, it will dramatically transform numerous industries and application sectors. In Internet-based enterprises, the dialogue-based interface is an obvious application for web and mobile search, recommender systems, and advertising. Yet, as we are accustomed to keyword-based inverted-index URL search systems, the change is not straightforward: people must be retaught to use longer queries and natural language as queries. In addition, foundation language models are typically rigid and inflexible. They lack access to current information on recent events, often hallucinate facts, and provide no retrieval capability or verification. We therefore need a just-in-time foundation model capable of evolving dynamically over time.
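One route to such a just-in-time behaviour, sketched under our own assumptions rather than any system described in the article, is to retrieve fresh documents at query time and place them in the prompt, so answers can be grounded in, and verified against, current sources. The index, retriever, and downstream model call here are all toy stand-ins.

```python
# A sketch of retrieval-grounded prompting against a freshly updated index.

def retrieve(query, index):
    """Toy retriever: rank documents by word overlap with the query."""
    words = set(query.lower().split())
    return max(index, key=lambda doc: len(words & set(doc.lower().split())))

def grounded_prompt(query, index):
    """Assemble a prompt whose answer must come from retrieved context."""
    doc = retrieve(query, index)
    return (f"Context: {doc}\n"
            f"Answer using only the context above.\n"
            f"Question: {query}")

index = [  # hypothetical documents, refreshed independently of the model
    "The 2024 conference was held in Vienna.",
    "Foundation models are trained on large corpora.",
]
print(grounded_prompt("Where was the 2024 conference held?", index))
# The assembled prompt would then be sent to the model; the cited context
# gives the user something to verify the answer against.
```

Because the index is updated outside the frozen model, the pipeline stays current without retraining, and the quoted context offers a handle against hallucinated facts.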
We therefore call for novel applications, including but not limited to the following areas:
- Novel prompt engineering, its procedures, and software support.
- Generative and model-based web search, recommendation, and advertising; novel business models for conversational advertising.
- Techniques for dialogue-based IT services, software systems, wireless communications (personalised messaging systems), and customer-service systems.
- Automation generation from foundation language models for robotic process automation (RPA) and software testing and verification.
- AI-assisted programming.
- Novel content-generation tools for the creative industries.
- Unifying language models with operations research, enterprise intelligence, and optimisation.
- Efficient and cost-effective methods of serving large foundation models in cloud computing.
- Foundation models for reinforcement learning, multiagent learning, and other decision-making domains.
- Language-assisted robotics.
- Foundation models and reasoning for combinatorial optimisation, electronic design automation (EDA), and chip design.
About the author
Jun Wang is a professor in the Department of Computer Science at University College London (UCL) and co-founder and dean of the Shanghai Digital Brain Research Institute. His research focuses on decision intelligence and large models, spanning machine learning, reinforcement learning, multiagent systems, data mining, computational advertising, and recommender systems. He has published more than 200 academic papers and two monographs, received multiple best-paper awards, and led the team that developed the world's first large multiagent decision-making model and a globally first-tier multimodal decision-making model.