Scikit-Learn began as a Google Summer of Code project in 2007. After a major rewrite, it was re-released in 2010, and it now bundles implementations of many classic machine learning algorithms. Beyond its well-designed API, Scikit-Learn is also fast, because much of its lowest-level code is implemented in C. The snippet below builds a StackingClassifier that combines a random forest and a decision tree, using logistic regression as the final estimator.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris  # needed for the dataset used below
X, y = load_iris(return_X_y=True)
estimators = [
('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
('dt', DecisionTreeClassifier(random_state=42))
]
clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
clf.fit(X_train, y_train).score(X_test, y_test)
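To check whether the meta-learner actually helps, you can score the stacked model and each base learner under the same cross-validation scheme. A minimal sketch (the 5-fold CV and the `max_iter` setting are illustrative choices, not from the original example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('dt', DecisionTreeClassifier(random_state=42)),
]
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(max_iter=1000))

# Score each base learner and the stacked ensemble with the same 5-fold CV
for name, model in estimators + [('stack', stack)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

If the stacked score does not beat the best base learner, the extra complexity is probably not worth it for that dataset.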
Permutation-based feature importance
As the name suggests, this technique assigns an importance to each feature by randomly permuting that feature's values and measuring the resulting drop in model performance.
import matplotlib.pyplot as plt  # needed for the plot below
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.datasets import make_classification
# Use Sklearn make classification to create a dummy dataset with 3 important variables out of 7
X, y = make_classification(random_state=0, n_features=7, n_informative=3)
rf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(rf, X, y,
n_repeats=10, # Number of times for which each feature must be shuffled
random_state=0, # fix the random state for reproducibility
n_jobs=-1) # Parallel processing using all cores
fig, ax = plt.subplots()
sorted_idx = result.importances_mean.argsort()
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=sorted_idx)  # label each box with the index of the permuted feature
ax.set_title("Permutation Importance of each feature")
ax.set_xlabel("Decrease in accuracy score")
ax.set_ylabel("Features")
fig.tight_layout()
plt.show()
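If you only need the numbers rather than the plot, the same `result` object exposes `importances_mean` (and `importances_std`) directly. A minimal sketch that prints a ranked list instead:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(random_state=0, n_features=7, n_informative=3)
rf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)

# Rank features by mean performance drop, most important first
ranking = sorted(enumerate(result.importances_mean),
                 key=lambda t: t[1], reverse=True)
for idx, mean_drop in ranking:
    print(f"feature {idx}: mean accuracy drop {mean_drop:.3f}")
```

With 3 informative features out of 7, the top of this list should be dominated by the informative ones.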
Tree pruning with cost-complexity pruning
Tree-based models in Scikit-Learn support minimal cost-complexity pruning through the ccp_alpha parameter: larger values prune more aggressively and produce smaller trees, as the node counts below show.
import numpy as np  # needed for the node-count averages below
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(random_state=0)
rf = RandomForestClassifier(random_state=0, ccp_alpha=0).fit(X, y)
print("Average number of nodes without pruning {:.1f}".format(
np.mean([e.tree_.node_count for e in rf.estimators_])))
rf = RandomForestClassifier(random_state=0, ccp_alpha=0.1).fit(X, y)
print("Average number of nodes with pruning {:.1f}".format(
np.mean([e.tree_.node_count for e in rf.estimators_])))
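Pruning trades model size against accuracy, so in practice you would sweep a few ccp_alpha values and keep the largest one that does not hurt cross-validated performance. A quick sketch (the alpha grid here is an arbitrary illustration, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(random_state=0)

# Evaluate a few pruning strengths with 5-fold cross-validation
for alpha in (0.0, 0.01, 0.1):
    rf = RandomForestClassifier(random_state=0, ccp_alpha=alpha)
    scores = cross_val_score(rf, X, y, cv=5)
    print(f"ccp_alpha={alpha}: CV accuracy {scores.mean():.3f}")
```

For a finer search, the same sweep fits naturally into `GridSearchCV` with `ccp_alpha` in the parameter grid.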