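I have found the secret to writing clean code in Python!
As data scientists, we often use Jupyter Notebooks for data exploration and model development. At this stage, the focus is on validating ideas quickly and proving concepts. Once the model is ready, however, it has to be deployed to production, and that is when code quality becomes especially important.
Production code must be robust, readable, and easy to maintain. Unfortunately, the prototype code that data scientists write usually falls short of these requirements. As a machine learning engineer, my job is to make sure code moves smoothly from the proof-of-concept stage into production.
Writing clean code is therefore essential for improving development efficiency and reducing maintenance costs. In this article I share some Python programming tips and best practices and, through short code examples, show how to improve the readability and maintainability of your code.
I sincerely hope this article offers useful insights to Python enthusiasts and, in particular, encourages more data scientists to take code quality seriously, because high-quality code not only helps during development but also helps get models into production successfully.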
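Meaningful names
Many developers do not follow the best practice of giving variables and functions meaningful names, and the readability and maintainability of their code suffer as a result.
Naming is critical to code quality. A good name expresses what the code does directly and makes many comments and explanations unnecessary, which keeps the code tidy. A descriptive name makes the purpose of a function obvious at a glance.
Take a common machine learning task: loading a dataset and splitting it into training and test sets. With meaningful function names such as load_dataset() and split_into_train_test(), the purpose of both functions is immediately clear without consulting any comments.
Readable code is faster for other developers to understand, and it also saves you effort when you maintain it later. We should therefore build good naming habits and write simple, straightforward code.
Here is that typical machine learning example in code: load a dataset and split it into training and test sets.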
import pandas as pd
from sklearn.model_selection import train_test_split


def load_and_split(d):
    df = pd.read_csv(d)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=0.2,
                                                        random_state=42)
    return X_train, X_test, y_train, y_test
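People who work in data science mostly know the concepts and conventions involved, such as X and y. But for someone new to the field, is it good practice to name the path to the CSV file d? And is it good practice to name the features X and the target y? A version with more meaningful names makes the point: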
import pandas as pd
from sklearn.model_selection import train_test_split


def load_data_and_split_into_train_test(dataset_path):
    data_frame = pd.read_csv(dataset_path)
    features = data_frame.iloc[:, :-1]
    target = data_frame.iloc[:, -1]
    features_train, features_test, target_train, target_test = train_test_split(
        features, target, test_size=0.2, random_state=42)
    return features_train, features_test, target_train, target_test
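This is much easier to understand. Even without any experience with pandas or train_test_split, it is now clear that the function loads data from a CSV file (at the path given in dataset_path), extracts the features and the target from the data frame, and finally produces the features and targets for the training and test sets.
These changes make the code easier to read and understand, especially for people who are not familiar with the conventions of machine learning code, where the features are usually called X and the target y.
But do not overdo the naming either, because overly long names add no extra information.
Consider another example snippet: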
import pandas as pd
from sklearn.model_selection import train_test_split


def load_data_from_csv_and_split_into_training_and_testing_sets(dataset_path_csv):
    data_frame_from_csv = pd.read_csv(dataset_path_csv)
    features_columns_data_frame = data_frame_from_csv.iloc[:, :-1]
    target_column_data_frame = data_frame_from_csv.iloc[:, -1]
    features_columns_data_frame_for_training, features_columns_data_frame_for_testing, target_column_data_frame_for_training, target_column_data_frame_for_testing = train_test_split(features_columns_data_frame, target_column_data_frame, test_size=0.2, random_state=42)
    return features_columns_data_frame_for_training, features_columns_data_frame_for_testing, target_column_data_frame_for_training, target_column_data_frame_for_testing
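This code feels overloaded: the names carry a lot of words but no additional information, and they distract the reader. The goal is names that balance descriptiveness and brevity. Whether the function name needs to say that the dataset is loaded from a CSV path depends on the context of the code and on what is actually needed.
Functions
Functions should be appropriately sized and scoped. They should be short, no more than 20 lines, with larger blocks extracted into new functions. More importantly, a function should do exactly one thing, not several. If something else needs to be done, it belongs in another function. For example: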
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_clean_feature_engineer_and_split(data_path):
    # Load data
    df = pd.read_csv(data_path)
    # Clean data
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    # Feature engineering
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    # Data preprocessing
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    # Split data
    features = df.drop('Survived', axis=1)
    target = df['Survived']
    features_train, features_test, target_train, target_test = train_test_split(
        features, target, test_size=0.2, random_state=42)
    return features_train, features_test, target_train, target_test
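Did you notice which of the rules above are violated?
The function is not long, but it clearly violates the rule that a function should do only one thing. The comments also signal that these blocks could each live in a separate function, at which point the single-line comments would not be needed at all (more on this in the next section).
A refactored example: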
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(data_path):
    return pd.read_csv(data_path)


def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df


def feature_engineering(df):
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df


def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df


def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)


if __name__ == "__main__":
    data_path = 'data.csv'
    df = load_data(data_path)
    df = clean_data(df)
    df = feature_engineering(df)
    df = preprocess_features(df)
    X_train, X_test, y_train, y_test = split_data(df)
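In this refactored snippet each function does only one thing, which makes the code easier to read. Testing also becomes easier, because each function can be tested independently of the others.
Even the comments are no longer needed, because the function names now act as the comments.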
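To make the point about independent testing concrete, here is a minimal sketch that uses only the standard library. The functions drop_incomplete_rows and keep_positive_ages are hypothetical stand-ins for the cleaning steps and work on plain lists of dictionaries rather than a pandas DataFrame, so the checks run without any external dependency.

```python
def drop_incomplete_rows(rows):
    """Return only the rows in which no value is None."""
    return [row for row in rows if None not in row.values()]


def keep_positive_ages(rows):
    """Return only the rows whose 'Age' is greater than zero."""
    return [row for row in rows if row["Age"] > 0]


# Each function is checked on its own, with a tiny hand-made input.
sample = [
    {"Age": 30, "Fare": 7.25},
    {"Age": None, "Fare": 8.05},
    {"Age": -1, "Fare": 9.50},
]

complete = drop_incomplete_rows(sample)
assert len(complete) == 2

assert keep_positive_ages(complete) == [{"Age": 30, "Fare": 7.25}]
```

Because neither function knows about the other, a failing assertion points directly at the step that is broken.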
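Comments
Sometimes comments are useful, but sometimes they are just a sign of bad code.
The proper use of comments is to compensate for what we cannot express in code.
When you feel the need to add a comment, consider whether it is really necessary, or whether the code could instead be moved into a new function with a name that makes clear what is going on, so that the comment is not needed.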
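As a small illustration of a comment that does earn its place, the sketch below explains why a comparison is written the way it is, which the code alone cannot say. The function is_adult and the constant ADULT_AGE_THRESHOLD are hypothetical names chosen for this example; the remark about intervals relies on pd.cut using right-closed bins by default.

```python
ADULT_AGE_THRESHOLD = 18


def is_adult(age):
    """Return True if `age` is strictly above the adult threshold."""
    # Why, not what: the comparison is strict so that an age of exactly 18
    # is treated the same way here as in the (0, 18] 'child' bin that
    # pd.cut produces with its default right-closed intervals.
    return age > ADULT_AGE_THRESHOLD
```

The function name and docstring already say what happens; the comment only records the reasoning behind the boundary.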
Let's revisit the code example from the Functions section:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_clean_feature_engineer_and_split(data_path):
    # Load data
    df = pd.read_csv(data_path)
    # Clean data
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    # Feature engineering
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    # Data preprocessing
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    # Split data
    features = df.drop('Survived', axis=1)
    target = df['Survived']
    features_train, features_test, target_train, target_test = train_test_split(
        features, target, test_size=0.2, random_state=42)
    return features_train, features_test, target_train, target_test
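The comments describe what each block does, but here they are really just an indicator of bad code. Following the advice of the previous section, move these blocks into separate functions and give each a descriptive name. This improves readability and reduces the need for comments.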
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(data_path):
    return pd.read_csv(data_path)


def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df


def feature_engineering(df):
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df


def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df


def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)


if __name__ == "__main__":
    data_path = 'data.csv'
    df = load_data(data_path)
    df = clean_data(df)
    df = feature_engineering(df)
    df = preprocess_features(df)
    X_train, X_test, y_train, y_test = split_data(df)
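The code now reads like a coherent story, and it is clear what happens without any comments. One last piece is still missing: docstrings. Docstrings are a Python standard intended to make code readable and understandable. Every function in production code should have a docstring that describes its intent, its input parameters, and its return value. Tools such as Sphinx can use these docstrings directly to generate documentation for the code.
Adding docstrings to the snippet above: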
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(data_path):
    """
    Load data from a CSV file into a pandas DataFrame.
    Args:
        data_path (str): The file path to the dataset.
    Returns:
        DataFrame: The loaded dataset.
    """
    return pd.read_csv(data_path)


def clean_data(df):
    """
    Clean the DataFrame by removing rows with missing values and
    filtering out non-positive ages.
    Args:
        df (DataFrame): The input dataset.
    Returns:
        DataFrame: The cleaned dataset.
    """
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df


def feature_engineering(df):
    """
    Perform feature engineering on the DataFrame, including age
    grouping and adult identification.
    Args:
        df (DataFrame): The input dataset.
    Returns:
        DataFrame: The dataset with new features added.
    """
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df


def preprocess_features(df):
    """
    Preprocess features by standardizing the 'Age' and 'Fare'
    columns using StandardScaler.
    Args:
        df (DataFrame): The input dataset.
    Returns:
        DataFrame: The dataset with standardized features.
    """
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df


def split_data(df, target_name='Survived'):
    """
    Split the dataset into training and testing sets.
    Args:
        df (DataFrame): The input dataset.
        target_name (str): The name of the target variable column.
    Returns:
        tuple: Contains the training features, testing features,
            training target, and testing target datasets.
    """
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)


if __name__ == "__main__":
    data_path = 'data.csv'
    df = load_data(data_path)
    df = clean_data(df)
    df = feature_engineering(df)
    df = preprocess_features(df)
    X_train, X_test, y_train, y_test = split_data(df)
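IDEs such as VSCode usually offer docstring extensions that insert a docstring template automatically when you start a multi-line string below a function definition.
This helps you quickly get the correct format for the style you have chosen.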
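A docstring is more than a comment: it is stored on the function object, which is how help(), IDE tooltips, and documentation generators find it. The following minimal sketch shows this with the standard-library inspect module; split_ratio is a hypothetical function invented for the illustration.

```python
import inspect


def split_ratio(test_size):
    """
    Return the train and test fractions for a given test size.
    Args:
        test_size (float): Fraction of the data reserved for testing.
    Returns:
        tuple: The (train_fraction, test_fraction) pair.
    """
    return 1.0 - test_size, test_size


# The docstring travels with the function object; inspect.getdoc returns it
# with the surrounding blank lines and indentation cleaned up.
doc = inspect.getdoc(split_ratio)
first_line = doc.splitlines()[0]
print(first_line)
```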
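Formatting
Formatting is a very important concept.
Code is read more often than it is written. Do not make people read code that is badly formatted and hard to understand.
Python has the PEP 8 style guide [1], which can be used to improve the readability of code.
Important rules in the style guide include:
- Use four spaces per indentation level
- Limit lines to 79 characters
- Avoid extraneous whitespace in certain situations (for example, immediately inside brackets, or between a trailing comma and a closing bracket)
Keep in mind, though, that formatting rules exist to make code more readable. Sometimes following a rule strictly makes no sense and would make the code less readable. In those cases the rule can be ignored.
Other important formatting rules mentioned in the book Clean Code include:
- Keep files at a reasonable size (roughly 200 to 500 lines) so that they are easier to understand
- Use blank lines to separate different concepts (for example, between the block that initializes the ML model and the block that runs the training)
- Define caller functions above the functions they call, which creates a natural reading flow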
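The last two rules can be sketched in a few lines. In this hypothetical example (run_training, remove_missing, and average are names made up for the illustration), the caller is defined above its helpers and blank lines separate the concepts. This works in Python because names are looked up when a function is called, not when it is defined.

```python
def run_training(raw_values):
    """High-level story first: the caller sits above its helpers."""
    cleaned = remove_missing(raw_values)
    return average(cleaned)


def remove_missing(values):
    """Drop None entries."""
    return [value for value in values if value is not None]


def average(values):
    """Return the arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# The call happens after all three definitions have been executed,
# so the helpers may appear further down the file than their caller.
result = run_training([1.0, None, 3.0])
```

A reader can stop after run_training if the overview is enough, or keep reading downward for the details.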
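So agree with your team on the rules to follow, and stick to them! IDE extensions can help you comply with the guidelines; VSCode, for example, offers several. You can also use Python packages such as Pylint [2] and autopep8 [3] on your Python scripts. Pylint is a static code analyzer that automatically rates your code, while autopep8 automatically formats code so that it conforms to PEP 8.
Let's use the earlier snippet to see how this works.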
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(data_path):
    return pd.read_csv(data_path)

def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df

def feature_engineering(df):
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df

def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df

def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)

if __name__ == "__main__":
  data_path = 'data.csv'
  df = load_data(data_path)
  df = clean_data(df)
  df = feature_engineering(df)
  df = preprocess_features(df)
  X_train, X_test, y_train, y_test = split_data(df)
Save it to a file named train.py and run Pylint to check the score of this snippet:
pylint train.py
Output:
************* Module train
train.py:29:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:30:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:31:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:32:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:33:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:34:0: C0304: Final newline missing (missing-final-newline)
train.py:34:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:1:0: C0114: Missing module docstring (missing-module-docstring)
train.py:5:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:5:14: W0621: Redefining name 'data_path' from outer scope (line 29) (redefined-outer-name)
train.py:8:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:8:15: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:13:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:13:24: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:18:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:18:24: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:23:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:23:15: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:29:2: C0103: Constant name "data_path" doesn't conform to UPPER_CASE naming style (invalid-name)
------------------------------------------------------------------
Your code has been rated at 3.21/10
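Only 3.21 out of 10.
You can either fix these issues by hand and run Pylint again, or use the autopep8 package to resolve some of them automatically. Here we take the second approach.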
autopep8 --in-place --aggressive --aggressive train.py
The train.py script now looks like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(data_path):
    return pd.read_csv(data_path)


def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df


def feature_engineering(df):
    df['AgeGroup'] = pd.cut(
        df['Age'], bins=[
            0, 18, 65, 99], labels=[
            'child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df


def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df


def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)


if __name__ == "__main__":
    data_path = 'data.csv'
    df = load_data(data_path)
    df = clean_data(df)
    df = feature_engineering(df)
    df = preprocess_features(df)
    X_train, X_test, y_train, y_test = split_data(df)
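Running Pylint again gives a score of 5.71 out of 10. The remaining messages are mainly about missing docstrings, plus names that shadow variables from the outer scope: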
************* Module train
train.py:1:0: C0114: Missing module docstring (missing-module-docstring)
train.py:6:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:6:14: W0621: Redefining name 'data_path' from outer scope (line 38) (redefined-outer-name)
train.py:10:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:10:15: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:16:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:16:24: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:25:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:25:24: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:31:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:31:15: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:38:4: C0103: Constant name "data_path" doesn't conform to UPPER_CASE naming style (invalid-name)
------------------------------------------------------------------
Your code has been rated at 5.71/10 (previous run: 3.21/10, +2.50)
I have now added the docstrings and fixed the last remaining issues.
The final code looks like this:
"""
This script aims at providing an end-to-end training pipeline.
Author: Patrick
Date: 2/14/2024
"""
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.
    Args:
        data_path (str): The file path to the dataset.
    Returns:
        DataFrame: The loaded dataset.
    """
    return pd.read_csv(data_path)


def clean_data(df):
    """
    Clean the input DataFrame by removing rows with
    missing values and filtering out entries with non-positive ages.
    Args:
        df (DataFrame): The input dataset.
    Returns:
        DataFrame: The cleaned dataset.
    """
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df


def feature_engineering(df):
    """
    Perform feature engineering on the DataFrame,
    including creating age groups and determining if the individual is an adult.
    Args:
        df (DataFrame): The input dataset.
    Returns:
        DataFrame: The dataset with new features added.
    """
    df['AgeGroup'] = pd.cut(
        df['Age'], bins=[
            0, 18, 65, 99], labels=[
            'child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df


def preprocess_features(df):
    """
    Preprocess the 'Age' and 'Fare' features of the
    DataFrame using StandardScaler to standardize the features.
    Args:
        df (DataFrame): The input dataset.
    Returns:
        DataFrame: The dataset with standardized features.
    """
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df


def split_data(df, target_name='Survived'):
    """
    Split the DataFrame into training and testing sets.
    Args:
        df (DataFrame): The dataset to split.
        target_name (str, optional): The name of the target variable column. Defaults to 'Survived'.
    Returns:
        tuple: The training and testing features and target datasets.
    """
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)


if __name__ == "__main__":
    data = load_data("data.csv")
    data = clean_data(data)
    data = feature_engineering(data)
    data = preprocess_features(data)
    X_train, X_test, y_train, y_test = split_data(data)
Pylint now returns a score of 10:
pylint train.py
-------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 7.50/10, +2.50)
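This shows how powerful Pylint is: it helps you clean up your code and bring it in line with PEP 8 quickly.
Error handling
Error handling is another key concept. It ensures that your code does not crash or produce wrong results when something unexpected happens.
Suppose you have deployed a model behind an API, and users can send data to the deployed model. A user may send invalid data, and if your application crashes as a result, it leaves a bad impression, and the user may blame your application for being poorly built.
It would be much better if users received a specific error code and message that clearly points out their mistake. This is exactly what exceptions in Python are for.
For example, a user uploads a CSV file to your application, which loads it into a pandas data frame and passes the data to the model for prediction. You might then have a function like this: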
import pandas as pd


def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.
    Args:
        data_path (str): The file path to the dataset.
    Returns:
        DataFrame: The loaded dataset.
    """
    return pd.read_csv(data_path)
So far, so good. But what happens if the user did not provide the CSV file?
Your program crashes with the following error message:
FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'
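Because this runs behind an API, the user only gets an HTTP 500 response saying "Internal Server Error". Users may blame your application because they cannot tell whether they caused the error themselves. A better approach is to add a try-except block and catch the FileNotFoundError so that this case is handled properly.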
import logging

import pandas as pd


def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.
    Args:
        data_path (str): The file path to the dataset.
    Returns:
        DataFrame: The loaded dataset.
    """
    try:
        return pd.read_csv(data_path)
    except FileNotFoundError:
        # The error is only logged here, so the function returns None.
        logging.error("The file at path %s does not exist. Please ensure that you have uploaded the file properly.", data_path)
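So far this only logs the error message. A better practice is to define a custom exception and handle it in the API, so that a specific error code can be returned to the user.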
import logging

import pandas as pd


class DataLoadError(Exception):
    """Exception raised when the data cannot be loaded."""

    def __init__(self, message="Data could not be loaded"):
        self.message = message
        super().__init__(self.message)


def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.
    Args:
        data_path (str): The file path to the dataset.
    Returns:
        DataFrame: The loaded dataset.
    """
    try:
        return pd.read_csv(data_path)
    except FileNotFoundError:
        logging.error("The file at path %s does not exist. Please ensure that you have uploaded the file properly.", data_path)
        raise DataLoadError(f"The file at path {data_path} does not exist. Please ensure that you have uploaded the file properly.")
Then, in the main function of the API:
try:
    df = load_data('path/to/data.csv')
    # Further processing and model prediction
except DataLoadError as e:
    # Return a response to the user with the error message
    # For example: return Response({"error": str(e)}, status=400)
    ...  # placeholder so that the except block is valid Python
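The user receives a 400 error code (Bad Request) together with a message that explains the cause of the error.
Now they know what to do and will no longer blame the program for not working properly.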
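The flow from exception to status code can be tried out without any web framework. The following is a minimal sketch under stated assumptions: handle_request is a hypothetical handler that returns a (status_code, body) pair instead of a real HTTP response, and load_rows uses the standard-library csv module in place of pandas so that the example has no external dependencies.

```python
import csv


class DataLoadError(Exception):
    """Raised when the uploaded data cannot be loaded."""


def load_rows(data_path):
    """Read a CSV file into a list of dicts, or raise DataLoadError."""
    try:
        with open(data_path, newline="") as csv_file:
            return list(csv.DictReader(csv_file))
    except FileNotFoundError as error:
        message = f"The file at path {data_path} does not exist."
        raise DataLoadError(message) from error


def handle_request(data_path):
    """Hypothetical handler: returns a (status_code, body) pair."""
    try:
        rows = load_rows(data_path)
    except DataLoadError as error:
        # The caller made the mistake, so answer with 400 rather than 500.
        return 400, {"error": str(error)}
    return 200, {"row_count": len(rows)}


status, body = handle_request("no_such_file.csv")
```

Keeping the translation from exception to status code in one place means the loading code never needs to know that it runs behind an API.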
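Object-oriented programming
Object-oriented programming (OOP) is an important programming paradigm in Python that even beginners should be familiar with. So what is OOP?
Object-oriented programming is a way of programming that bundles data and behavior into single objects, giving a program a clear structure.
Adopting OOP has several main benefits:
- Encapsulation hides internal details and makes the code more modular.
- Inheritance allows code to be reused, which improves development efficiency.
- Complex problems are broken down into small objects that can be solved one at a time.
- Readability and maintainability improve.
OOP has other advantages too, but these are the most important ones.
Now let's look at a simple example: a class named TrainingPipeline with a few basic methods: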
from abc import ABC, abstractmethod


class TrainingPipeline(ABC):
    def __init__(self, data_path, target_name):
        """
        Initialize the TrainingPipeline.
        Args:
            data_path (str): The file path to the dataset.
            target_name (str): Name of the target column.
        """
        self.data_path = data_path
        self.target_name = target_name
        self.data = None
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None

    @abstractmethod
    def load_data(self):
        """Load dataset from data path."""
        pass

    @abstractmethod
    def clean_data(self):
        """Clean the data."""
        pass

    @abstractmethod
    def feature_engineering(self):
        """Perform feature engineering."""
        pass

    @abstractmethod
    def preprocess_features(self):
        """Preprocess features."""
        pass

    @abstractmethod
    def split_data(self):
        """Split data into training and testing sets."""
        pass

    def run(self):
        """Run the training pipeline."""
        self.load_data()
        self.clean_data()
        self.feature_engineering()
        self.preprocess_features()
        self.split_data()
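This is an abstract base class. It only declares the abstract methods that classes derived from it must implement.
This is very useful for defining a blueprint or template that all subclasses must follow.
Here is an example subclass: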
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class ChurnPredictionTrainPipeline(TrainingPipeline):
    def load_data(self):
        """Load dataset from data path."""
        self.data = pd.read_csv(self.data_path)

    def clean_data(self):
        """Clean the data."""
        self.data.dropna(inplace=True)

    def feature_engineering(self):
        """Perform feature engineering."""
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns
        self.data = pd.get_dummies(self.data, columns=categorical_cols, drop_first=True)

    def preprocess_features(self):
        """Preprocess features."""
        numerical_cols = self.data.select_dtypes(include=['int64', 'float64']).columns
        scaler = StandardScaler()
        self.data[numerical_cols] = scaler.fit_transform(self.data[numerical_cols])

    def split_data(self):
        """Split data into training and testing sets."""
        features = self.data.drop(self.target_name, axis=1)
        target = self.data[self.target_name]
        # Store the splits in the attributes declared in TrainingPipeline.__init__.
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            features, target, test_size=0.2, random_state=42)
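The advantage is that you can build an application that calls the training pipeline's methods automatically, and you can create different training pipeline classes. They are always compatible with one another, because they must follow the blueprint defined in the abstract base class.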
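How the blueprint is enforced can be seen in a minimal sketch. Pipeline below is a deliberately simplified stand-in for TrainingPipeline with only two steps, and ChurnPipeline, IncompletePipeline, and run_all are hypothetical names introduced for this illustration.

```python
from abc import ABC, abstractmethod


class Pipeline(ABC):
    """Simplified stand-in for the TrainingPipeline blueprint."""

    @abstractmethod
    def load_data(self):
        """Load the data."""

    @abstractmethod
    def split_data(self):
        """Split the data."""

    def run(self):
        """Run all steps in order."""
        self.load_data()
        self.split_data()


class ChurnPipeline(Pipeline):
    def __init__(self):
        self.steps = []

    def load_data(self):
        self.steps.append("load")

    def split_data(self):
        self.steps.append("split")


class IncompletePipeline(Pipeline):
    def load_data(self):
        pass  # split_data is left out on purpose


def run_all(pipelines):
    """Run every pipeline through the shared interface."""
    for pipeline in pipelines:
        pipeline.run()


churn = ChurnPipeline()
run_all([churn])

try:
    IncompletePipeline()
    instantiated = True
except TypeError:
    # A subclass that leaves an abstract method unimplemented
    # cannot be instantiated.
    instantiated = False
```

The application code in run_all relies only on the base-class interface, so any subclass that can be instantiated at all is guaranteed to provide every step.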
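Testing
Tests can decide the success or failure of an entire project.
Writing tests does add some development time, but in the long run it greatly improves code quality, maintainability, and reliability.
Writing tests takes time at first, but it is a very worthwhile investment. Skipping them may speed up development in the short term, but in the long run a lack of tests is costly:
- Once the code base has grown, any small change can break something unexpectedly
- New releases need many fixes, which gives customers a bad experience
- Developers become afraid to touch the code base, and new features are held up
Following the principles of test-driven development (TDD) is therefore very valuable for code quality and development efficiency. The three core rules of TDD are:
- Write a failing unit test before you write any production code
- Do not write more of a unit test than is sufficient to fail
- Do not write more production code than is sufficient to pass the currently failing test
This test-first approach pushes developers to think about the design of the code before writing it.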
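The three rules can be walked through in a single file. This is a minimal sketch with plain assertions rather than a test framework; min_max_scale is a hypothetical function chosen for the illustration, and the try-except around the first run exists only to record that the test fails before the production code is written.

```python
def test_scale_maps_min_to_zero_and_max_to_one():
    assert min_max_scale([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


# Step 1 (red): min_max_scale does not exist yet at this point,
# so running the test fails with a NameError.
try:
    test_scale_maps_min_to_zero_and_max_to_one()
    first_run = "passed"
except NameError:
    first_run = "failed"


# Step 2 (green): write only as much production code as the test demands.
def min_max_scale(values):
    """Scale values linearly so that the minimum is 0 and the maximum is 1."""
    lowest, highest = min(values), max(values)
    return [(value - lowest) / (highest - lowest) for value in values]


test_scale_maps_min_to_zero_and_max_to_one()
second_run = "passed"
```

A next failing test, for example for a list in which all values are equal, would then drive the next small piece of production code.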
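Python has excellent testing frameworks such as unittest and pytest; pytest is the easier of the two to use thanks to its concise syntax. Tests add work in the short term, but they are essential to the long-term success of a project.
Let's look again at the ChurnPredictionTrainPipeline class from the previous section: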
import pandas as pd
from sklearn.preprocessing import StandardScaler


class ChurnPredictionTrainPipeline(TrainingPipeline):
    def load_data(self):
        """Load dataset from data path."""
        self.data = pd.read_csv(self.data_path)
    ...
Using pytest, add unit tests for loading the data:
from unittest.mock import patch

import pandas as pd
import pytest

# Assumption: ChurnPredictionTrainPipeline is defined in the churn_library module.
from churn_library import ChurnPredictionTrainPipeline


@pytest.fixture
def path():
    """
    Return the path to the test csv data file.
    """
    return r"./data/bank_data.csv"


def test_import_data_returns_dataframe(path):
    """
    Test that import data can load the CSV file into a pandas dataframe.
    """
    churn_predictor = ChurnPredictionTrainPipeline(path, "Churn")
    churn_predictor.load_data()
    assert isinstance(churn_predictor.data, pd.DataFrame)


def test_import_data_raises_exception():
    """
    Test that exception of "FileNotFoundError" gets raised in case the CSV
    file does not exist.
    """
    with pytest.raises(FileNotFoundError):
        churn_predictor = ChurnPredictionTrainPipeline("non_existent_file.csv",
                                                       "Churn")
        churn_predictor.load_data()


def test_import_data_reads_csv(path):
    """
    Test that the pandas.read_csv function gets called.
    """
    with patch("pandas.read_csv") as mock_csv:
        churn_predictor = ChurnPredictionTrainPipeline(path, "Churn")
        churn_predictor.load_data()
        mock_csv.assert_called_once_with(path)
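These unit tests check:
- that the CSV file can be loaded into a pandas data frame.
- that a FileNotFoundError is raised when the CSV file does not exist.
- that the pandas read_csv function is called.
This process is not strictly TDD, because I had already written the code before adding the unit tests. Ideally, you would write these unit tests even before implementing the load_data function.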
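Conclusion
There are four rules of simple design whose purpose is to make code cleaner, more readable, and easier to maintain. The four rules are:
- Run all the tests (most important)
- Eliminate duplicated code
- Express the programmer's intent
- Minimize the number of classes and methods (least important)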
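The last three rules focus on refactoring. When you first write the code, do not aim for perfection: start with simple or even "ugly" code, and once it runs, refactor it to follow the rules above until it becomes elegant.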
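The recommended way of working is "implement first, refactor afterwards": get the code running and the feature working, then refactor repeatedly and move step by step toward these four rules of simple design.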
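As a minimal sketch of this workflow, the example below starts with a working but repetitive first version and then removes the duplication, using the same check to confirm that the behavior did not change. All function names are hypothetical, and the standardization is written in plain Python (population standard deviation) rather than with StandardScaler so that the example is self-contained.

```python
# First version: it works, but the scaling logic is written twice.
def standardize_age_and_fare_v1(ages, fares):
    age_mean = sum(ages) / len(ages)
    age_std = (sum((a - age_mean) ** 2 for a in ages) / len(ages)) ** 0.5
    fare_mean = sum(fares) / len(fares)
    fare_std = (sum((f - fare_mean) ** 2 for f in fares) / len(fares)) ** 0.5
    return ([(a - age_mean) / age_std for a in ages],
            [(f - fare_mean) / fare_std for f in fares])


# Refactored version: the duplicated logic lives in one helper.
def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


def standardize_age_and_fare(ages, fares):
    return standardize(ages), standardize(fares)


# The same check guards both versions, so the refactoring can be trusted.
ages, fares = [20.0, 40.0], [10.0, 30.0]
expected = standardize_age_and_fare_v1(ages, fares)
assert standardize_age_and_fare(ages, fares) == expected
```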
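Writing clean code is essential to the success of a software project, but it takes discipline and continuous practice. As data scientists, we tend to focus on running code in Jupyter Notebooks, finding a good model, and getting good metrics, and we neglect the cleanliness of the code. But writing clean code is a required skill for data scientists too, because it helps models reach production faster.
Whenever we write code that will be reused, we should insist on writing it cleanly. Start simple, do not aim for perfection at the outset, and polish the code iteratively. Never forget to write unit tests for your functions, so that they keep working as intended and major problems are avoided when the code is extended later.
Sticking to a few principles, such as eliminating duplicated code and expressing the intent of the code, frees you from the "never change a running system" mindset. I am still learning these principles and applying them in my daily work. They really help, but mastering them fully is a long process that takes sustained effort.
Finally, automate as much as possible, and use the extensions your IDE offers to help you follow clean-code rules and work more efficiently.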
References
[1] PEP 8 style guide: https://peps.python.org/pep-0008/
[2] Pylint: https://pylint.readthedocs.io/en/stable/
[3] autopep8: https://pypi.org/project/autopep8/