First Steps in Machine Learning: A Hands-On Introduction to Random Forests

Author: compiled by 机器之心 (Jiqizhixin) 2020-02-17 15:05:28


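By 2020 there is no shortage of fun machine learning tutorials. This one starts from the highly popular random forest and walks you step by step through building a model, so you can see what the complete workflow looks like.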


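As data scientists, we have many ways to build a classification model. One of the most popular is the random forest, whose hyperparameters we can tune to optimize the model's performance.

It is also common to try principal component analysis (PCA) before fitting the model. But why add this step? Isn't the point of a random forest to help us interpret feature importance more easily?

PCA does make each "feature" harder to interpret when we analyze the random forest's "feature importance". However, PCA reduces dimensionality, which cuts the number of features the random forest has to process, so it may help speed up training.

Note that high computational cost is one of the biggest drawbacks of random forests (running the model can take a long time). PCA becomes especially important when you work with hundreds or even thousands of predictive features. So if you simply want the best-performing model and can sacrifice the interpretability of feature importance, PCA may be useful.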

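Now let's work through an example. We will use Scikit-learn's "breast cancer" dataset and build 3 models to compare their performance:

1. Random forest

2. Random forest with PCA dimensionality reduction

3. Random forest with PCA dimensionality reduction and hyperparameter tuning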

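Importing the data

First, we load the data and create a DataFrame. This is a pre-cleaned "toy" dataset from Scikit-learn, so we can move on to modeling quickly. As a best practice, though, we should do the following:

Use df.head() to look at the new DataFrame and make sure it looks as expected.
Use df.info() to see the data type and number of entries in each column. Data types may need to be converted.
Use df.isna() to make sure there are no NaN values. Missing values may need to be handled, or rows dropped.
Use df.describe() to see each column's minimum, maximum, mean, median, standard deviation, and interquartile range.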

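The column named "cancer" is the target variable we want the model to predict. In Scikit-learn's encoding (see dataset.target_names), 0 means malignant and 1 means benign.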

import pandas as pd
from sklearn.datasets import load_breast_cancer

columns = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']
dataset = load_breast_cancer()
data = pd.DataFrame(dataset['data'], columns=columns)
data['cancer'] = dataset['target']

# display() is available by default in Jupyter/IPython notebooks
display(data.head())
data.info()  # prints its summary directly and returns None
display(data.isna().sum())
display(data.describe())


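The output above shows part of the breast cancer DataFrame. Each row holds the observations for one patient. The last column, "cancer", is the target variable we want to predict: in Scikit-learn's encoding, 0 means malignant and 1 means benign.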
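Since everything downstream depends on how the target is encoded, it can be worth checking the label names and class counts directly. The following is a minimal sketch; the names label_names and class_counts are introduced here for illustration only.

```python
from collections import Counter
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# target_names[i] is the class name that belongs to label i
label_names = {i: str(name) for i, name in enumerate(dataset.target_names)}
# Count how many rows carry each class name
class_counts = Counter(label_names[int(t)] for t in dataset.target)

print(label_names)
print(dict(class_counts))
```

On the bundled dataset this reports 212 malignant rows (label 0) and 357 benign rows (label 1), out of 569.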

Train/test split

Now we split the data with Scikit-learn's train_test_split function. We want to give the model as much data as possible to train on, but we also need enough data to test it. In general, the more rows a dataset has, the larger the share we can give to the training set.

For example, with millions of rows we could use 90% for training and 10% for testing. Our dataset has only 569 rows, which is not a lot. To suit such a small dataset, we split the data into 50% training and 50% testing. We set stratify=y so that the training and test sets keep the same ratio of 0s and 1s as the original dataset.

from sklearn.model_selection import train_test_split

X = data.drop('cancer', axis=1)
y = data['cancer']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=2020, stratify=y)
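To see what stratify=y buys us, we can compare the share of label 1 in the full dataset with the share in each half. This is a small self-contained sketch that reloads the data as NumPy arrays; the share_* names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=2020, stratify=y)

# Share of label 1 in the full data and in each half of the split
share_all = y.mean()
share_train = y_train.mean()
share_test = y_test.mean()

print(len(X_train), len(X_test))
print(round(share_all, 3), round(share_train, 3), round(share_test, 3))
```

With stratification the three shares should agree to within a fraction of a percent, which is not guaranteed for a purely random split of a small dataset.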

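Standardizing the data

Before modeling, we need to "center" and "standardize" the data so that the different variables are measured on the same scale. We scale so that the features that drive the predictions can "compete fairly" with each other. We also convert y_train from a Pandas Series to a NumPy array, so the model can take it as training data later.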

import numpy as np
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)
y_train = np.array(y_train)
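As a quick sanity check on what the scaler does, the sketch below confirms that the scaled training columns have mean 0 and standard deviation 1, and that the test set is transformed with the training statistics rather than its own. It reloads the data so that it runs on its own; manual_test is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=2020, stratify=y)

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # learns mean/std from the training rows only
X_test_scaled = ss.transform(X_test)        # reuses the training mean/std

# Scaled training columns are centered at 0 with unit standard deviation
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
# The same transform by hand: (x - mean) / std with the training statistics
manual_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(train_means, 0), np.allclose(train_stds, 1))
print(np.allclose(manual_test, X_test_scaled))
```

Fitting the scaler on the training rows only keeps information from the test set out of the preprocessing step.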

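Fitting a "baseline" random forest model

Now we create a "baseline" random forest model. It uses all the predictive features and the default settings defined in the Scikit-learn random forest classifier documentation. First, we instantiate the model and fit it on the scaled data. We can then measure the model's accuracy on the training data.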

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

# Baseline model; named rfc_1 to match the later snippets
rfc_1 = RandomForestClassifier()
rfc_1.fit(X_train_scaled, y_train)
display(rfc_1.score(X_train_scaled, y_train))  # 1.0
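A training accuracy of 1.0 mostly tells us that the forest has memorized the training rows. As a complement (not part of the original workflow), here is a minimal sketch that estimates accuracy with 5-fold cross-validation on the training half. The fixed random_state and the choice of 5 folds are assumptions made for repeatability, and scaling is skipped because tree splits do not depend on feature scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=2020, stratify=y)

# random_state is fixed here only to make the sketch repeatable
model = RandomForestClassifier(n_estimators=100, random_state=0)
# Accuracy on 5 held-out folds of the training data
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
# Accuracy on the same rows the model was fitted on
train_score = model.fit(X_train, y_train).score(X_train, y_train)

print(train_score, cv_scores.mean())
```

The cross-validated mean is usually somewhat lower than the training score and is the more honest of the two numbers.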

If we want to know which features matter most to the random forest when predicting breast cancer, we can visualize and quantify them by reading the model's feature_importances_ attribute:

import matplotlib.pyplot as plt
import seaborn as sns

# Map each feature name to its importance in the baseline model
feats = {}
for feature, importance in zip(X.columns, rfc_1.feature_importances_):
    feats[feature] = importance

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-Importance'})
importances = importances.sort_values(by='Gini-Importance', ascending=False)
importances = importances.reset_index()
importances = importances.rename(columns={'index': 'Features'})

sns.set(style="whitegrid", color_codes=True, font_scale=1.7)
fig, ax = plt.subplots()
fig.set_size_inches(30, 15)
sns.barplot(x='Gini-Importance', y='Features', data=importances, color='skyblue')
plt.xlabel('Importance', fontsize=25, weight='bold')
plt.ylabel('Features', fontsize=25, weight='bold')
plt.title('Feature Importance', fontsize=25, weight='bold')
plt.show()
display(importances)
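For a quick look without plotting, the same ranking can be held in a Pandas Series. The sketch below is self-contained and, for brevity, fits on the full dataset with a fixed random_state; both choices are assumptions of this example, not the setup used above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

dataset = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(dataset.data, dataset.target)

# One importance value per input feature, indexed by feature name
importances = pd.Series(model.feature_importances_, index=dataset.feature_names)
importances = importances.sort_values(ascending=False)

print(importances.head(5))
print(importances.sum())
```

The importances are normalized to sum to 1, so each value can be read as a share of the total impurity reduction.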


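Principal component analysis (PCA)

Now, how can we improve on the baseline model? With dimensionality reduction, we can represent the original dataset with fewer variables while lowering the computational cost of running the model. With PCA, we can examine the cumulative explained variance ratio of the components to see which ones capture the most variance in the data.

We instantiate PCA and set the number of components (features) we want to consider. Here we set it to 30 so we can see the explained variance of every generated component and decide where to cut off. Then we fit PCA on the scaled X_train data.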

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

# Fit PCA with all 30 components to inspect how the variance is distributed
pca_test = PCA(n_components=30)
pca_test.fit(X_train_scaled)

sns.set(style='whitegrid')
plt.plot(np.cumsum(pca_test.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.axvline(linewidth=4, color='r', linestyle='--', x=10, ymin=0, ymax=1)
plt.show()

evr = pca_test.explained_variance_ratio_
cvr = np.cumsum(pca_test.explained_variance_ratio_)

pca_df = pd.DataFrame()
pca_df['Cumulative Variance Ratio'] = cvr
pca_df['Explained Variance Ratio'] = evr
display(pca_df.head(10))


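The plot shows that beyond 10 components we gain little additional explained variance. The DataFrame shows the cumulative variance ratio (how much of the data's total variance is explained so far) and the explained variance ratio (how much of the total variance each individual PCA component explains).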


From the DataFrame above, we can see that when we use PCA to reduce the 30 predictors to 10 components, we still explain more than 95% of the variance. The other 20 components explain less than 5% of the variance, so we can discard them. Following this logic, we use PCA to reduce X_train and X_test from 30 to 10 components. We assign these re-created, reduced-dimension datasets to X_train_scaled_pca and X_test_scaled_pca.
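The cutoff can also be read off programmatically rather than from the plot. The following sketch finds the smallest number of components whose cumulative ratio reaches 95% and compares it with Scikit-learn's own choice when n_components is given as a fraction. For brevity it uses the full scaled dataset rather than the training half, which is an assumption of this example; k and pca_95 are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

cvr = np.cumsum(PCA(n_components=30).fit(X_scaled).explained_variance_ratio_)
# Index of the first cumulative ratio at or above 0.95, converted to a count
k = int(np.argmax(cvr >= 0.95)) + 1

# Scikit-learn can make the same choice when n_components is a fraction
pca_95 = PCA(n_components=0.95).fit(X_scaled)

print(k, pca_95.n_components_)
```

Passing a fraction keeps the threshold explicit in the code, which may be easier to maintain than a hard-coded component count.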

pca = PCA(n_components=10)
pca.fit(X_train_scaled)

X_train_scaled_pca = pca.transform(X_train_scaled)
X_test_scaled_pca = pca.transform(X_test_scaled)

Each component is a linear combination of the original variables with corresponding "weights". By creating a DataFrame, we can see the "weights" of each PCA component.

pca_dims = []
for x in range(0, len(pca_df)):
    pca_dims.append('PCA Component {}'.format(x))

pca_test_df = pd.DataFrame(pca_test.components_, columns=columns, index=pca_dims)
pca_test_df.head(10).T
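To make "linear combination" concrete, the sketch below reproduces pca.transform by hand: center the data, then multiply by the weight matrix. It is self-contained and uses the full scaled dataset for brevity; weights and manual are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=10).fit(X_scaled)
# components_ holds one row of 30 weights per principal component
weights = pca.components_
# Projecting by hand: center the data, then take weighted sums of the columns
manual = (X_scaled - pca.mean_) @ weights.T

print(weights.shape)
print(np.allclose(manual, pca.transform(X_scaled)))
```

Each row of weights has unit length, so a large absolute weight means the corresponding original feature contributes strongly to that component.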


Fitting a "baseline" random forest model after PCA

Now we can fit X_train_scaled_pca and y_train to another "baseline" random forest model and test whether our predictions improve.

# Second model, trained on the 10 PCA components; named rfc_2 to match the later snippets
rfc_2 = RandomForestClassifier()
rfc_2.fit(X_train_scaled_pca, y_train)
display(rfc_2.score(X_train_scaled_pca, y_train))  # 1.0
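The scale, project, fit sequence can also be chained in a Scikit-learn Pipeline, which fits every step on the training rows only and replays it on new data. This is one possible way to package the steps, not the approach used in this walkthrough; the step names and the fixed random_state are assumptions of the sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=2020, stratify=y)

# Each step is fitted on the training rows only, then reused on the test rows
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('forest', RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)

print(test_accuracy)
```

A pipeline like this can be passed directly to a search object, so that scaling and PCA are refitted inside every cross-validation fold.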

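Round 1 of hyperparameter tuning: RandomizedSearchCV

After PCA, we can also tune our random forest's hyperparameters to get better predictions. Hyperparameters can be thought of as the model's "settings". The ideal settings differ from one dataset to another, so we have to "tune" the model.

We can start with RandomizedSearchCV to consider a wide range of hyperparameter values. All random forest hyperparameters are listed in the Scikit-learn random forest classifier documentation.

We create a param_dist with a range of values for each hyperparameter. We then instantiate RandomizedSearchCV, passing in first our random forest model, then param_dist, the number of iterations to try, and the number of cross-validation folds.

The n_jobs parameter determines how many processor cores are used to run the search. Setting n_jobs=-1 uses all available cores, which typically makes it run fastest.

We will tune these hyperparameters:

n_estimators: the number of "trees" in the random forest.
max_features: the number of features considered at each split.
max_depth: the maximum depth of each tree, that is, the longest chain of splits allowed from the root to a leaf.
min_samples_split: the minimum number of observations required before a node can be split.
min_samples_leaf: the minimum number of observations required in each leaf node at the end of a tree.
bootstrap: whether to use bootstrapping to supply data to each tree in the random forest. (Bootstrapping is random sampling from the dataset with replacement.)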
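Sampling with replacement can be illustrated in a few lines of NumPy. This sketch only demonstrates the sampling idea and is not how Scikit-learn implements it internally; n_rows and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000

# Draw n_rows row indices with replacement, as one tree's bootstrap sample
sample_idx = rng.integers(0, n_rows, size=n_rows)
unique_share = len(np.unique(sample_idx)) / n_rows

# For large n, about 1 - 1/e (roughly 63%) of the distinct rows end up in the sample
print(round(unique_share, 3), round(1 - np.exp(-1), 3))
```

Because each tree sees a different subset of rows (with some rows repeated), the trees differ from one another, which is what lets averaging them reduce variance.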

from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start=100, stop=1000, num=10)]
max_features = ['log2', 'sqrt']
max_depth = [int(x) for x in np.linspace(start=1, stop=15, num=15)]
min_samples_split = [int(x) for x in np.linspace(start=2, stop=50, num=10)]
min_samples_leaf = [int(x) for x in np.linspace(start=2, stop=50, num=10)]
bootstrap = [True, False]

param_dist = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap}

rs = RandomizedSearchCV(rfc_2,
                        param_dist,
                        n_iter=100,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        random_state=0)
rs.fit(X_train_scaled_pca, y_train)
rs.best_params_

# Output:
# {'n_estimators': 700,
# 'min_samples_split': 2,
# 'min_samples_leaf': 2,
# 'max_features': 'log2',
# 'max_depth': 11,
# 'bootstrap': True}

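With n_iter=100 and cv=3, we fit 300 random forest models: 100 randomly sampled combinations of the hyperparameters above, each evaluated on 3 folds. We can read best_params_ to get the parameters of the best-performing model (shown at the bottom of the code block above).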
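To put the 300 fits in perspective, we can count how many combinations the search space contains in total. The sketch below rebuilds the same value lists and uses Scikit-learn's ParameterGrid purely as a counting tool; total_combinations and random_search_fits are illustrative names.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_dist = {
    'n_estimators': [int(x) for x in np.linspace(100, 1000, num=10)],
    'max_features': ['log2', 'sqrt'],
    'max_depth': [int(x) for x in np.linspace(1, 15, num=15)],
    'min_samples_split': [int(x) for x in np.linspace(2, 50, num=10)],
    'min_samples_leaf': [int(x) for x in np.linspace(2, 50, num=10)],
    'bootstrap': [True, False],
}

total_combinations = len(ParameterGrid(param_dist))  # every possible setting
n_iter, cv = 100, 3
random_search_fits = n_iter * cv                     # fits actually run

print(total_combinations, random_search_fits)
```

The space holds 60,000 combinations, so the randomized search samples well under 1% of it.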

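However, best_params_ at this stage may not give us the most useful information for choosing the range of values to try in the next round of tuning. To look across a wider range of values, we can easily turn the RandomizedSearchCV results into a DataFrame.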

rs_df = pd.DataFrame(rs.cv_results_).sort_values('rank_test_score').reset_index(drop=True) 
rs_df = rs_df.drop([ 
'mean_fit_time',  
'std_fit_time',  
'mean_score_time', 
'std_score_time',  
'params',  
'split0_test_score',  
'split1_test_score',  
'split2_test_score',  
'std_test_score'], 
axis=1) 
rs_df.head(10)


Now let's create bar plots with each hyperparameter's values on the x-axis and the mean score of the models at each value on the y-axis, to see which values perform best on average:

sns.set(style="whitegrid", color_codes=True, font_scale=2)
fig, axs = plt.subplots(ncols=3, nrows=2)
fig.set_size_inches(30, 25)

# (hyperparameter, y-axis limits, bar color) for each panel, left to right, top to bottom
panels = [
    ('n_estimators', [.83, .93], 'lightgrey'),
    ('min_samples_split', [.85, .93], 'coral'),
    ('min_samples_leaf', [.80, .93], 'lightgreen'),
    ('max_features', [.88, .92], 'wheat'),
    ('max_depth', [.80, .93], 'lightpink'),
    ('bootstrap', [.88, .92], 'skyblue'),
]
for ax, (name, ylim, color) in zip(axs.flat, panels):
    sns.barplot(x='param_' + name, y='mean_test_score', data=rs_df, ax=ax, color=color)
    ax.set_ylim(ylim)
    ax.set_title(label=name, size=30, weight='bold')
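The averages behind bar plots like these can also be computed directly with a groupby on the results table. The sketch below runs a deliberately tiny search of its own so that it finishes in seconds; small_dist, the raw unscaled features, and the fixed seeds are all assumptions of this example, not the search configured above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)
# A deliberately tiny search space so the sketch runs in seconds
small_dist = {'n_estimators': [10, 20], 'max_depth': [2, 5, 8]}
rs = RandomizedSearchCV(RandomForestClassifier(random_state=0), small_dist,
                        n_iter=6, cv=3, random_state=0)
rs.fit(X, y)

results = pd.DataFrame(rs.cv_results_)
# Average the cross-validated score over every sampled model sharing a value
depth_means = (results.assign(max_depth=results['param_max_depth'].astype(int))
                      .groupby('max_depth')['mean_test_score'].mean())

print(depth_means)
```

Reading the numbers this way avoids misjudging small differences that truncated y-axes can exaggerate.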


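From the plots above, we can see how each hyperparameter value performs on average.

n_estimators: 300, 500, and 700 have close to the highest mean scores;

min_samples_split: smaller values such as 2 and 7 score higher, and 23 also scores well. We can try a few values just above 2 and a few around 23;

min_samples_leaf: smaller values seem to score higher, so we can try values between 2 and 7;

max_features: "sqrt" has the highest mean score;

max_depth: there is no clear pattern, but 2, 3, 7, 11, and 15 work well;

bootstrap: "False" has the highest mean score.

Now we can take these findings into a second round of hyperparameter tuning to narrow the choices further.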

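Round 2 of hyperparameter tuning: GridSearchCV

After RandomizedSearchCV, we can use GridSearchCV to run a finer search around the best hyperparameters found so far. The hyperparameters are the same, but now we run a more "exhaustive" search with GridSearchCV.

In GridSearchCV, every single combination of hyperparameter values is tried, which takes far more compute than RandomizedSearchCV, where we directly control how many iterations to try. For example, searching 10 different values for each of just 6 parameters with 3-fold cross-validation would require fitting the model 3,000,000 times! That is why we run GridSearchCV after RandomizedSearchCV, which helps us narrow the search range first.

So, using what we learned from RandomizedSearchCV, we plug in the value ranges that performed best on average for each hyperparameter: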

from sklearn.model_selection import GridSearchCV

n_estimators = [300, 500, 700]
max_features = ['sqrt']
max_depth = [2, 3, 7, 11, 15]
min_samples_split = [2, 3, 4, 22, 23, 24]
min_samples_leaf = [2, 3, 4, 5, 6, 7]
bootstrap = [False]

param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap}

gs = GridSearchCV(rfc_2, param_grid, cv=3, verbose=1, n_jobs=-1)
gs.fit(X_train_scaled_pca, y_train)
rfc_3 = gs.best_estimator_
gs.best_params_

# Output:
# {'bootstrap': False,
# 'max_depth': 7,
# 'max_features': 'sqrt',
# 'min_samples_leaf': 3,
# 'min_samples_split': 2,
# 'n_estimators': 500}

Here we run 3-fold cross-validation on 3 x 1 x 5 x 6 x 6 x 1 = 540 models, for a total of 1,620 model fits! Now, after running both RandomizedSearchCV and GridSearchCV, we can read best_params_ to get the single best model for predicting our data (shown at the bottom of the code block above).
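The fit counts quoted in this section are easy to verify. The sketch below counts the combinations in the narrowed grid with ParameterGrid and repeats the arithmetic for the unrestricted example of 10 values for each of 6 parameters; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [300, 500, 700],
    'max_features': ['sqrt'],
    'max_depth': [2, 3, 7, 11, 15],
    'min_samples_split': [2, 3, 4, 22, 23, 24],
    'min_samples_leaf': [2, 3, 4, 5, 6, 7],
    'bootstrap': [False],
}
cv = 3

grid_combinations = len(ParameterGrid(param_grid))
grid_fits = grid_combinations * cv
# The unrestricted example from the text: 10 values for each of 6 parameters
full_grid_fits = 10 ** 6 * cv

print(grid_combinations, grid_fits, full_grid_fits)
```

Narrowing the ranges first cuts the work from 3,000,000 fits to 1,620, which is what makes the exhaustive search practical.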

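Evaluating model performance on the test data

Now we can evaluate the models we built on the test data. We will test 3 models:

Baseline random forest
Baseline random forest with PCA dimensionality reduction
Random forest with PCA dimensionality reduction and hyperparameter tuning

Let's generate predictions for each model: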

y_pred = rfc_1.predict(X_test_scaled)          # baseline model, all 30 scaled features
y_pred_pca = rfc_2.predict(X_test_scaled_pca)  # baseline model on the 10 PCA components
y_pred_gs = gs.best_estimator_.predict(X_test_scaled_pca)

Then we create a confusion matrix for each model to see how well each one predicts breast cancer:

from sklearn.metrics import confusion_matrix

# Shared row/column labels for the three confusion matrices
labels = dict(index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])
conf_matrix_baseline = pd.DataFrame(confusion_matrix(y_test, y_pred), **labels)
conf_matrix_baseline_pca = pd.DataFrame(confusion_matrix(y_test, y_pred_pca), **labels)
conf_matrix_tuned_pca = pd.DataFrame(confusion_matrix(y_test, y_pred_gs), **labels)
display(conf_matrix_baseline)
display('Baseline Random Forest recall score', recall_score(y_test, y_pred))
display(conf_matrix_baseline_pca)
display('Baseline Random Forest With PCA recall score', recall_score(y_test, y_pred_pca))
display(conf_matrix_tuned_pca)
display('Hyperparameter Tuned Random Forest With PCA Reduced Dimensionality recall score', recall_score(y_test, y_pred_gs))

Here are the prediction results:


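We use recall as our performance metric because we are dealing with cancer diagnosis, where what we care about most is minimizing false negative predictions.

With that in mind, our baseline random forest model appears to perform best, with a recall score of 94.97%. On our test dataset, the baseline model correctly identifies 170 of the 179 samples labeled 1. Note that recall_score treats label 1 as the positive class by default, and Scikit-learn encodes label 1 as benign in this dataset, so measuring recall on the malignant cases would require passing pos_label=0.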
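Recall can be read straight off a confusion matrix, and the class it is computed for depends on pos_label. The sketch below uses made-up labels purely for illustration (they are not results from this study) and then checks the arithmetic behind the reported score.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration only, not results from the article
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes, in label order 0, 1
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_label_1 = tp / (tp + fn)  # share of actual 1s predicted as 1
recall_label_0 = tn / (tn + fp)  # share of actual 0s predicted as 0

# pos_label defaults to 1; pass pos_label=0 to score the other class
print(round(recall_label_1, 4), round(recall_score(y_true, y_pred), 4))
print(round(recall_label_0, 4), round(recall_score(y_true, y_pred, pos_label=0), 4))

# The reported score: 170 correct out of 179 samples labeled 1
reported_recall = 170 / 179
print(round(reported_recall, 4))
```

Stating pos_label explicitly in the evaluation code removes any ambiguity about which class counts as positive.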

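This case study raises an important caveat: sometimes, after PCA and even after extensive hyperparameter tuning, the tuned model may not perform as well as the plain "vanilla" model. But trying matters: if you don't try, you will never know which model is best. When it comes to predicting cancer, the better the model, the more lives can be saved.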

Editor: Zhang Yanni. Source: 机器之心 (Jiqizhixin)
