# ML | Extra Trees Classifier for Feature Selection

May 2, 2021

Each decision tree in an Extra Trees forest is built from the original training sample (rather than a bootstrap sample). Then, at each test node, every tree is given a random sample of k features from the feature set, and it must select the best feature from that sample to split the data according to some mathematical criterion (typically the Gini index). These random samples of features lead to the creation of multiple de-correlated decision trees.
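The node-level feature sampling described above can be sketched in a few lines. This is a toy illustration, not the scikit-learn implementation; the function names (`gini`, `split_feature`) and the categorical-split scoring are assumptions made for the example:

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_feature(rows, labels, k, seed=0):
    """At one test node: sample k candidate features at random, then return
    the index of the feature whose (categorical) split leaves the lowest
    weighted Gini impurity."""
    rng = random.Random(seed)
    candidates = rng.sample(range(len(rows[0])), k)

    def weighted_gini(f):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())

    return min(candidates, key=weighted_gini)

# Feature 0 separates the classes perfectly; feature 1 not at all.
rows = [('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')]
labels = ['Yes', 'Yes', 'No', 'No']
print(split_feature(rows, labels, k=2))  # -> 0
```

Because each tree only ever sees a random subset of candidate features at each node, trees end up making different choices, which is what de-correlates them.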

The 1st decision tree gets the data with the Outlook and Temperature features:

The 2nd decision tree gets the data with the Temperature and Wind features:

The 3rd decision tree gets the data with the Outlook and Humidity features:

The 4th decision tree gets the data with the Temperature and Humidity features:

The 5th decision tree gets the data with the Wind and Humidity features:

Summing the information gain each feature contributes across the five trees:

```
Total Info Gain for Outlook     =     0.246+0.246   = 0.492
Total Info Gain for Temperature = 0.029+0.029+0.029 = 0.087
Total Info Gain for Humidity    = 0.151+0.151+0.151 = 0.453
Total Info Gain for Wind        =     0.048+0.048   = 0.096
```
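The per-feature gains used above (0.246, 0.029, 0.151, 0.048) can be reproduced from the classic 14-row Play Tennis table, which is assumed here to be the dataset behind these trees; a small entropy / information-gain calculation:

```python
import math
from collections import Counter

# The classic 14-row Play Tennis dataset (Outlook, Temperature, Humidity,
# Wind, Play Tennis) -- assumed to match the data used in this article.
data = [
    ('Sunny', 'Hot', 'High', 'Weak', 'No'),
    ('Sunny', 'Hot', 'High', 'Strong', 'No'),
    ('Overcast', 'Hot', 'High', 'Weak', 'Yes'),
    ('Rain', 'Mild', 'High', 'Weak', 'Yes'),
    ('Rain', 'Cool', 'Normal', 'Weak', 'Yes'),
    ('Rain', 'Cool', 'Normal', 'Strong', 'No'),
    ('Overcast', 'Cool', 'Normal', 'Strong', 'Yes'),
    ('Sunny', 'Mild', 'High', 'Weak', 'No'),
    ('Sunny', 'Cool', 'Normal', 'Weak', 'Yes'),
    ('Rain', 'Mild', 'Normal', 'Weak', 'Yes'),
    ('Sunny', 'Mild', 'Normal', 'Strong', 'Yes'),
    ('Overcast', 'Mild', 'High', 'Strong', 'Yes'),
    ('Overcast', 'Hot', 'Normal', 'Weak', 'Yes'),
    ('Rain', 'Mild', 'High', 'Strong', 'No'),
]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(rows, feature_idx):
    """Information gain of splitting the rows on one categorical feature."""
    labels = [r[-1] for r in rows]
    by_value = {}
    for r in rows:
        by_value.setdefault(r[feature_idx], []).append(r[-1])
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in by_value.values())
    return entropy(labels) - remainder

for i, name in enumerate(['Outlook', 'Temperature', 'Humidity', 'Wind']):
    print(f'{name:12s} {info_gain(data, i):.3f}')
```

Running this reproduces the familiar values (Outlook highest at roughly 0.246, Temperature lowest at roughly 0.029), matching the per-tree contributions summed above.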

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier
```

```python
import os

# Changing the working directory to the location of the data file
os.chdir(r'C:\Users\Dev\Desktop\Kaggle')

# Loading the dataset (the CSV filename here is an assumption)
df = pd.read_csv('play_tennis.csv')

# Separating the dependent and independent variables
y = df['Play Tennis']
X = df.drop('Play Tennis', axis=1)
```

```python
# Building the model
extra_tree_forest = ExtraTreesClassifier(n_estimators=5,
                                         criterion='entropy',
                                         max_features=2)

# Training the model
extra_tree_forest.fit(X, y)

# Computing the importance of each feature
# (feature_importances_ is already averaged over the trees)
feature_importance = extra_tree_forest.feature_importances_

# Standard deviation of the per-tree importances -- a measure of how much
# the trees disagree about each feature, not a normalization
feature_importance_normalized = np.std(
    [tree.feature_importances_ for tree in extra_tree_forest.estimators_],
    axis=0)
```

```python
# Plotting a bar graph to compare the feature importances
plt.bar(X.columns, feature_importance_normalized)
plt.xlabel('Feature Labels')
plt.ylabel('Feature Importances')
plt.title('Comparison of different Feature Importances')
plt.show()
```
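For a fully self-contained run, the same pipeline can be sketched with the classic 14-row Play Tennis table inlined (an assumption about the CSV's contents) and a simple integer encoding via `pd.factorize`, since scikit-learn's trees require numeric inputs:

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Inlined Play Tennis dataset (assumed to match the CSV used above)
df = pd.DataFrame({
    'Outlook':     ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain',
                    'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny',
                    'Overcast', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
                    'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity':    ['High', 'High', 'High', 'High', 'Normal', 'Normal',
                    'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High',
                    'Normal', 'High'],
    'Wind':        ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong',
                    'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
                    'Weak', 'Strong'],
    'Play Tennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No',
                    'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

y = df['Play Tennis']
# Encode each categorical column as integer codes
X = df.drop('Play Tennis', axis=1).apply(lambda col: pd.factorize(col)[0])

model = ExtraTreesClassifier(n_estimators=5, criterion='entropy',
                             max_features=2, random_state=0)
model.fit(X, y)

# Mean importance per feature across the five trees
print(dict(zip(X.columns, model.feature_importances_.round(3))))
```

Note that `feature_importances_` is already the across-tree mean of the per-tree importances (each of which sums to 1), whereas the `np.std` computed earlier measures the spread of those per-tree importances rather than normalizing them.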