# Get Started with Machine Learning in 30 Minutes

March 22, 2021, 14:59:21

1. Download, install, and start Python SciPy

1.1 Install the SciPy libraries

• scipy
• numpy
• matplotlib
• pandas
• sklearn

1.2 Start Python and check the versions

```python
# Check the versions of libraries

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
```

2. Load the data

2.1 Import libraries

```python
# Load libraries

import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
```

2.2 Load the dataset

```python
# Load the iris CSV; the file has no header row, so column names are supplied
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
```

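3. Summarize the dataset

• Dimensions of the dataset.
• A peek at the data itself.
• Statistical summary of all attributes.
• Breakdown of the data by the class variable.

3.1 Dimensions of the dataset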

```python
# shape
print(dataset.shape)
```
```
(150, 5)
```

3.2 Peek at the data

```python
# head
print(dataset.head(20))
```
```
    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa
```

3.3 Statistical summary

```python
# descriptions
print(dataset.describe())
```

```
       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
```

3.4 Class distribution

```python
# class distribution
print(dataset.groupby('class').size())
```
```
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
```

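4. Data visualization

1. Univariate plots, to better understand each attribute.
2. Multivariate plots, to better understand the relationships between attributes.

4.1 Univariate plots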

```python
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)
plt.show()
```

```python
# histograms
dataset.hist()
plt.show()
```

4.2 Multivariate plots

```python
# scatter plot matrix
scatter_matrix(dataset)
plt.show()
```

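5. Evaluate some algorithms

1. Separate out a validation dataset.
2. Set up the test harness to use 10-fold cross-validation.
3. Build six different models to predict species from flower measurements.
4. Select the best model.

5.1 Create a validation dataset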

```python
# Split-out validation dataset
array = dataset.values
X = array[:, 0:4]   # the four measurement columns
Y = array[:, 4]     # the class label column
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
```

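X_train and Y_train are the training data used to prepare the models. X_validation and Y_validation are held back so that they can be used later to check the chosen model.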
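As a rough illustration of what the 80/20 split produces, the sketch below repeats it on the copy of the iris data bundled with scikit-learn. This is an assumption made so the example runs without a network connection: the bundled copy encodes the classes as integers rather than the `Iris-...` strings in the CSV, so the exact rows selected need not match the ones above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for the CSV loaded earlier.
X, Y = load_iris(return_X_y=True)

# Same settings as the tutorial: hold back 20% of the rows, fixed seed of 7.
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.20, random_state=7)

# 150 rows split into 120 for training and 30 for validation.
print(X_train.shape, X_validation.shape)
```

With 150 rows, a 20% hold-out leaves 120 rows for training and 30 for validation.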

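5.2 Test harness

```python
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'
```

The "accuracy" metric is used to evaluate the models. It is the number of correctly predicted instances divided by the total number of instances in the dataset; multiplied by 100 it gives a percentage (e.g., 95% accuracy). Note that scikit-learn reports it as a fraction between 0 and 1, as in the results below.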

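The accuracy definition can be checked by hand. The sketch below uses a handful of made-up labels (hypothetical, not taken from the iris results) and compares a manual count against `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels, made up purely for illustration.
y_true = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]

# Accuracy by hand: correct predictions divided by total predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

# The same quantity from scikit-learn, returned as a fraction.
sk_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sk_accuracy)
print("{:.0f}%".format(100 * sk_accuracy))
```

Four of the five predictions match, so both values come out to 0.8, or 80% as a percentage.
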
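5.3 Build models

• Logistic Regression (LR).
• Linear Discriminant Analysis (LDA).
• K-Nearest Neighbors (KNN).
• Classification and Regression Trees (CART).
• Gaussian Naive Bayes (NB).
• Support Vector Machines (SVM).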

```python
# Spot-check algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# Evaluate each model in turn
results = []
names = []

for name, model in models:
    # Unshuffled 10-fold split. random_state only matters when shuffle=True,
    # and recent scikit-learn releases raise an error if it is set without it.
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(
        model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    # Report the mean accuracy and its standard deviation across the folds
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
```
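To make the 10-fold procedure more concrete, the sketch below shows how `KFold` would carve up a training set of 120 rows. The array `X_demo` is a placeholder of zeros (an assumption for illustration); only its length matters here.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder standing in for a 120-row training set; the values are irrelevant.
X_demo = np.zeros((120, 4))

kfold = KFold(n_splits=10)

# Each split yields index arrays for the training part and the held-out fold.
fold_sizes = [(len(train_idx), len(test_idx))
              for train_idx, test_idx in kfold.split(X_demo)]

print(len(fold_sizes))
print(fold_sizes[0])
```

Each of the 10 rounds trains on 108 rows and scores on the remaining 12, so `cross_val_score` returns 10 accuracy values per model. The mean and standard deviation of those values are what the evaluation loop prints.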

5.4 Select the best model

```
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)
```

```python
# Compare algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
```

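6. Make predictions

The KNN algorithm is very simple and was one of the most accurate models in our tests. We now fit it on the training data and check its accuracy on the validation set.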

```python
# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
```

```
0.9
[[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.85      0.92      0.88        12
 Iris-virginica       0.90      0.82      0.86        11

      micro avg       0.90      0.90      0.90        30
      macro avg       0.92      0.91      0.91        30
   weighted avg       0.90      0.90      0.90        30
```
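The numbers in the report can be traced back to the confusion matrix. The sketch below copies the matrix printed above into a NumPy array and recomputes accuracy, precision, and recall by hand, relying on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = np.array([[7, 0, 0],
               [0, 11, 1],
               [0, 2, 9]])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Precision: correct predictions of a class over everything predicted as that class.
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions of a class over all true members of that class.
recall = np.diag(cm) / cm.sum(axis=1)

print(round(accuracy, 2))
print(np.round(precision, 2))
print(np.round(recall, 2))
```

The diagonal holds 27 correct predictions out of 30, which gives the 0.9 accuracy. The three misclassifications all fall between Iris-versicolor and Iris-virginica, which is why only those two classes have precision and recall below 1.00.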