A Python Developer's Handbook: 10 Useful Machine Learning Practices

Translation

Author: compiled and translated by 布加迪, 2020-05-29 07:00:00


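You may be a data scientist, but at heart you are still a developer. That means your programming skills should be solid. Follow these 10 tips to ship bug-free machine learning solutions quickly.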


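[51CTO.com quick translation]

As data scientists, we sometimes forget what we are paid for. We are developers first, researchers second, and perhaps mathematicians third. Our first responsibility is to quickly develop solutions that are free of bugs.

Just because we can build models does not make us gods, and it does not give us the freedom to write garbage code.

I have made many mistakes since I started, and I want to share what I see as the most common skills in machine learning engineering. In my opinion, they are also the skills the industry lacks most right now.

I call them software-illiterate data scientists, because many of them are engineers without a formal computer science background. I was one of them myself.

If I had to choose between hiring a great data scientist and a great machine learning engineer, I would hire the latter.

1. Learn to write abstract classes.

Once you start writing abstract classes, you will see how much clearer they make your codebase. They enforce the same methods and method names. If many people work on the same project, each of them starts writing different methods, and that creates serious confusion.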
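What "enforce" means here can be seen in a small, self-contained sketch that uses only the standard library. The class names BaseProcessor, ListProcessor and BrokenProcessor and the helper can_instantiate are hypothetical, chosen for illustration and not taken from the article: a subclass that skips an abstract method cannot be instantiated at all.

```python
from abc import ABCMeta, abstractmethod


class BaseProcessor(metaclass=ABCMeta):
    """Hypothetical, trimmed-down base class with two required methods."""

    @abstractmethod
    def read(self):
        """Read raw data."""

    @abstractmethod
    def process(self):
        """Turn raw data into features."""


class ListProcessor(BaseProcessor):
    """Hypothetical subclass that implements every abstract method."""

    def __init__(self, rows):
        self.rows = rows

    def read(self):
        return list(self.rows)

    def process(self):
        return [row.strip().lower() for row in self.read()]


class BrokenProcessor(BaseProcessor):
    """Hypothetical subclass that forgets to implement process()."""

    def read(self):
        return []


def can_instantiate(cls, *args):
    """Return True if cls can be instantiated, False if abc blocks it."""
    try:
        cls(*args)
        return True
    except TypeError:
        return False


processed = ListProcessor([" Foo ", "BAR"]).process()
print(processed)
print(can_instantiate(BrokenProcessor))
```

The failure happens when the object is created, not when the missing method is first called, so a teammate who forgets a method finds out immediately.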

import os
from abc import ABCMeta, abstractmethod

class DataProcessor(metaclass=ABCMeta):
    """Base processor to be used for all preparation."""

    def __init__(self, input_directory, output_directory):
        self.input_directory = input_directory
        self.output_directory = output_directory

    @abstractmethod
    def read(self):
        """Read raw data."""

    @abstractmethod
    def process(self):
        """Processes raw data. This step should create the raw dataframe with all the required features. Shouldn't implement statistical or text cleaning."""

    @abstractmethod
    def save(self):
        """Saves processed data."""

class Trainer(metaclass=ABCMeta):
    """Base trainer to be used for all models."""

    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def preprocess(self):
        """This takes the preprocessed data and returns clean data. This is more about statistical or text cleaning."""

    @abstractmethod
    def set_model(self):
        """Define model here."""

    @abstractmethod
    def fit_model(self):
        """This takes the vectorised data and returns a trained model."""

    @abstractmethod
    def generate_metrics(self):
        """Generates metric with trained model and test data."""

    @abstractmethod
    def save_model(self, model_name):
        """This method saves the model in our required format."""

class Predict(metaclass=ABCMeta):
    """Base predictor to be used for all models."""

    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def load_model(self):
        """Load model here."""

    @abstractmethod
    def preprocess(self):
        """This takes the raw data and returns clean data for prediction."""

    @abstractmethod
    def predict(self):
        """This is used for prediction."""

class BaseDB(metaclass=ABCMeta):
    """Base database class to be used for all DB connectors."""

    @abstractmethod
    def get_connection(self):
        """This creates a new DB connection."""

    @abstractmethod
    def close_connection(self):
        """This closes the DB connection."""

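2. Set your seed at the top.

Reproducible experiments matter, and the seed is the enemy. Get a grip on it. Otherwise it leads to a different train/test split and a different weight initialization in the neural network on every run, and that produces inconsistent results.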

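import random
import numpy as np
import torch

def set_seed(args):
    """Seed every random number generator in use; args must provide .seed and .n_gpu."""
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)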
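The effect of a fixed seed on a train/test split can be checked with the standard library alone. The following is a minimal sketch; the function sample_split and its parameters are hypothetical names chosen for illustration, and a seeded random.Random instance stands in for whatever splitting utility a project really uses.

```python
import random


def sample_split(seed, n_items=10, n_test=3):
    """Shuffle indices with a private, seeded generator and split them."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    # The first n_test shuffled indices become the test set.
    return indices[n_test:], indices[:n_test]


first = sample_split(seed=42)
second = sample_split(seed=42)

# Same seed, same split: the comparison holds on every run.
print(first == second)
```

Using a private generator instead of the global one keeps the split stable even when other code draws random numbers in between.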

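3. Start with a few rows.

If your data is very large and you are in a later stage of coding (such as cleaning the data or modeling), use nrows to avoid loading the full dataset every time. Use this when you only want to test your code and not actually run the whole thing.

This is especially useful when your local PC is not powerful enough for the full data size but you like developing locally in Jupyter, VS Code or Atom.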

import pandas as pd
df_train = pd.read_csv('train.csv', nrows=1000)

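4. Anticipate failures (the sign of a mature developer).

Always check for NA values in your data, because they will cause you problems later. Even if your current data has none, that does not mean none will appear in a future retraining loop. So check anyway.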

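print(len(df))           # rows before cleaning
print(df.isna().sum())   # NA count per column
df = df.dropna()         # dropna() returns a new frame, so reassign it
print(len(df))           # rows after cleaning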

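5. Show the progress of processing.

When you are processing a huge amount of data, it is reassuring to know how long it will take and where you are in the overall process.

Method 1: tqdm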

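import time
import pandas as pd
from tqdm import tqdm

df = pd.DataFrame({'col': range(1000)})  # example data so the snippet runs on its own
tqdm.pandas()  # registers progress_apply on pandas objects
df['col'] = df['col'].progress_apply(lambda x: x**2)

text = ""
for char in tqdm(["a", "b", "c", "d"]):  # wrap any iterable to get a progress bar
    time.sleep(0.25)
    text = text + char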

Method 2: fastprogress

from fastprogress.fastprogress import master_bar, progress_bar
from time import sleep
mb = master_bar(range(10))
for i in mb:
    for j in progress_bar(range(100), parent=mb):
        sleep(0.01)
        mb.child.comment = f'second bar stat'
    mb.first_bar.comment = f'first bar stat'
    mb.write(f'Finished loop {i}.')

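6. Pandas can be slow.

If you have worked with pandas, you know how slow it can get at times, especially with groupby operations. Rather than racking your brain for a "brilliant" speed-up, just change one line of code and use modin.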

import modin.pandas as pd

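7. Time your functions.

Not all functions are created equal.

Even if all of your code works, that does not mean you have written great code. Some soft bugs make your code slower than it should be, and you need to find them. Use this decorator to log how long functions take.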

import time
from functools import wraps

def timing(f):
    """Decorator for timing functions.

    Usage:
        @timing
        def function(a):
            pass
    """
    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = f(*args, **kwargs)
        end = time.time()
        print('function:%r took: %2.2f sec' % (f.__name__, end - start))
        return result
    return wrapper

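8. Don't burn money on the cloud.

Nobody likes an engineer who wastes cloud resources.

Some experiments run for hours. It is hard to keep track of them and shut down the cloud instance once they finish. I have made this mistake myself, and I have seen people leave instances running for days.

Just call the function below at the end of execution and you will never be in trouble!

But also wrap your main code in try and call the function again in except, so that the server is not left running if an error occurs. Yes, I have dealt with that case too.

Let's be a bit responsible and not generate CO2.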

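import math
import os

def run_command(cmd):
    return os.system(cmd)

def shutdown(seconds=0, os='linux'):
    """Shutdown system after seconds given. Useful for shutting EC2 to save costs."""
    # Note: the `os` parameter shadows the os module inside this function only.
    if os == 'linux':
        # Linux shutdown takes its delay in minutes (+m), so round the seconds up.
        run_command('sudo shutdown -h +%d' % math.ceil(seconds / 60))
    elif os == 'windows':
        run_command('shutdown -s -t %s' % seconds)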
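The advice about error handling can be sketched as follows. This is a minimal illustration, not the author's code: it uses try/finally rather than the try/except the text mentions, so the shutdown runs on both success and failure, and the names run_then_shutdown, fake_shutdown and failing_experiment are hypothetical. A recording function stands in for the real shutdown so the example is safe to run.

```python
def run_then_shutdown(main, shutdown_fn):
    """Run main() and always call shutdown_fn(), even if main() raises."""
    try:
        return main()
    finally:
        shutdown_fn()


calls = []


def fake_shutdown():
    """Stand-in for a real shutdown call; it only records that it ran."""
    calls.append("shutdown")


def failing_experiment():
    """Simulates an experiment that crashes partway through."""
    raise RuntimeError("training crashed")


try:
    run_then_shutdown(failing_experiment, fake_shutdown)
except RuntimeError as exc:
    print("experiment failed:", exc)

# The shutdown stand-in ran even though the experiment raised.
print(calls)
```

In real use, shutdown_fn would be the shutdown function from this section, and the exception still propagates so the failure shows up in the logs.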

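9. Create and save reports.

After a certain point in modeling, all the valuable insights come from error and metric analysis. Make sure you create and save well-formatted reports for yourself and your manager.

Management loves reports anyway, right?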

import json
import os
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, fbeta_score)

def get_metrics(y, y_pred, beta=2, average_method='macro', y_encoder=None):
    if y_encoder:
        y = y_encoder.inverse_transform(y)
        y_pred = y_encoder.inverse_transform(y_pred)
    return {
        'accuracy': round(accuracy_score(y, y_pred), 4),
        'f1_score_macro': round(f1_score(y, y_pred, average=average_method), 4),
        # beta is keyword-only in recent scikit-learn releases
        'fbeta_score_macro': round(fbeta_score(y, y_pred, beta=beta, average=average_method), 4),
        'report': classification_report(y, y_pred, output_dict=True),
        'report_csv': classification_report(y, y_pred, output_dict=False).replace('\n', '\r\n')
    }

def save_metrics(metrics: dict, model_directory, file_name):
    path = os.path.join(model_directory, file_name + '_report.txt')
    # classification_report_to_csv is not defined in the article; it is
    # assumed to write the text report to `path`.
    classification_report_to_csv(metrics['report_csv'], path)
    metrics.pop('report_csv')
    path = os.path.join(model_directory, file_name + '_metrics.json')
    with open(path, 'w') as f:  # close the file once the dump finishes
        json.dump(metrics, f, indent=4)
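The snippet above calls classification_report_to_csv, which the article never defines. One possible shape for it, offered only as an assumption, is a helper that writes the text report to disk unchanged; the body below and the demo file name are hypothetical.

```python
import os
import tempfile


def classification_report_to_csv(report_text, path):
    """Hypothetical helper: write the plain-text report to `path` unchanged."""
    # newline='' stops Python from translating the '\r\n' already in the text.
    with open(path, 'w', newline='') as handle:
        handle.write(report_text)
    return path


# Round-trip a tiny made-up report through a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    report_path = os.path.join(tmp_dir, 'demo_report.txt')
    classification_report_to_csv('precision recall\r\n0.90 0.80\r\n', report_path)
    with open(report_path, newline='') as handle:
        saved = handle.read()

print(repr(saved))
```

Despite its name, the original call writes to a file ending in _report.txt, which is why this sketch writes plain text rather than building real CSV rows.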

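10. Write great APIs.

All that ends bad is bad.

You can do great data cleaning and modeling and still make a huge mess at the end. Experience tells me that many people are not clear on how to write good APIs, documentation and server setup.

Below is a good approach for a typical machine learning or deep learning deployment under moderate load, say 1000 requests per minute.

Meet the combo: Fastapi + uvicorn

Fastest: write the API with fastapi, because it is the fastest for I/O-bound operations according to this benchmark (https://www.techempower.com/benchmarks/#section=test&runid=7464e520-0dc2-473d-bd34-dbdfd7e85911&hw=ph&test=query&l=zijzen-7), for reasons explained here (https://fastapi.tiangolo.com/benchmarks/).
Documentation: writing the API with fastapi gives us free documentation and test endpoints at http:url/docs, which fastapi generates and updates automatically whenever we change the code.
Workers: deploy the API with uvicorn.

Run these commands to deploy with 4 workers. Tune the number of workers through load testing.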

pip install fastapi uvicorn  
uvicorn main:app --workers 4 --host 0.0.0.0 --port 8000

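Original title: 10 Useful Machine Learning Practices For Python Developers, by Pratik Bhavsar

[51CTO translation. Partner sites that reprint this article must credit the original translator and cite 51CTO.com as the source.]

Editor: 龐桂玉 Source: 51CTO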
