# How to Predict Stock Prices in Python Using TensorFlow 2 and Keras

November 11, 2021, 17:28:52

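Predicting stock prices has long been an attractive topic for both investors and researchers. Investors constantly ask whether a stock's price will go up. Because many of the financial indicators involved are complex and understood only by investors and people with solid financial knowledge, stock market movements look inconsistent, and almost random, to ordinary people.

The goal of this tutorial is to build a neural network in TensorFlow 2 and Keras that predicts stock market prices. More specifically, we will build a recurrent neural network with LSTM cells, which are among the most widely used architectures for time series forecasting.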

First, install the required libraries:

``pip3 install tensorflow pandas numpy matplotlib yahoo_fin scikit-learn``

Once everything is set up, open a new Python file (or notebook) and import the following libraries:

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from yahoo_fin import stock_info as si
from collections import deque

import os
import numpy as np
import pandas as pd
import random
```

```python
# set seed, so we can get the same results after rerunning several times
np.random.seed(314)
tf.random.set_seed(314)
random.seed(314)
```

## Preparing the Dataset

```python
def shuffle_in_unison(a, b):
    # shuffle two arrays in the same way
    state = np.random.get_state()
    np.random.shuffle(a)
    np.random.set_state(state)
    np.random.shuffle(b)

def load_data(ticker, n_steps=50, scale=True, shuffle=True, lookup_step=1, split_by_date=True,
              test_size=0.2, feature_columns=['adjclose', 'volume', 'open', 'high', 'low']):
    """
    Loads data from Yahoo Finance source, as well as scaling, shuffling, normalizing and splitting.
    Params:
        ticker (str/pd.DataFrame): the ticker you want to load, examples include AAPL, TSLA, etc.
        n_steps (int): the historical sequence length (i.e window size) used to predict, default is 50
        scale (bool): whether to scale prices from 0 to 1, default is True
        shuffle (bool): whether to shuffle the dataset (both training & testing), default is True
        lookup_step (int): the future lookup step to predict, default is 1 (e.g next day)
        split_by_date (bool): whether we split the dataset into training/testing by date, setting it
            to False will split datasets in a random way
        test_size (float): ratio for test data, default is 0.2 (20% testing data)
        feature_columns (list): the list of features to use to feed into the model, default is everything grabbed from yahoo_fin
    """
    # see if ticker is already a loaded stock from yahoo finance
    if isinstance(ticker, str):
        # load it from yahoo_fin library
        df = si.get_data(ticker)
    elif isinstance(ticker, pd.DataFrame):
        df = ticker
    else:
        raise TypeError("ticker can be either a str or a `pd.DataFrame` instance")
    # this will contain all the elements we want to return from this function
    result = {}
    # we will also return the original dataframe itself
    result['df'] = df.copy()
    # make sure that the passed feature_columns exist in the dataframe
    for col in feature_columns:
        assert col in df.columns, f"'{col}' does not exist in the dataframe."
    # add date as a column
    if "date" not in df.columns:
        df["date"] = df.index
    if scale:
        column_scaler = {}
        # scale the data (prices) from 0 to 1
        for column in feature_columns:
            scaler = preprocessing.MinMaxScaler()
            df[column] = scaler.fit_transform(np.expand_dims(df[column].values, axis=1))
            column_scaler[column] = scaler
        # add the MinMaxScaler instances to the result returned
        result["column_scaler"] = column_scaler
    # add the target column (label) by shifting by `lookup_step`
    df['future'] = df['adjclose'].shift(-lookup_step)
    # last `lookup_step` rows contain NaN in the future column
    # get them before dropping NaNs
    last_sequence = np.array(df[feature_columns].tail(lookup_step))
    # drop NaNs
    df.dropna(inplace=True)
    sequence_data = []
    sequences = deque(maxlen=n_steps)
    for entry, target in zip(df[feature_columns + ["date"]].values, df['future'].values):
        sequences.append(entry)
        if len(sequences) == n_steps:
            sequence_data.append([np.array(sequences), target])
    # get the last sequence by appending the last `n_step` sequence with `lookup_step` sequence
    # for instance, if n_steps=50 and lookup_step=10, last_sequence should be of 60 (that is 50+10) length
    # this last_sequence will be used to predict future stock prices that are not available in the dataset
    last_sequence = list([s[:len(feature_columns)] for s in sequences]) + list(last_sequence)
    last_sequence = np.array(last_sequence).astype(np.float32)
    result['last_sequence'] = last_sequence
    # construct the X's and y's
    X, y = [], []
    for seq, target in sequence_data:
        X.append(seq)
        y.append(target)
    # convert to numpy arrays
    X = np.array(X)
    y = np.array(y)
    if split_by_date:
        # split the dataset into training & testing sets by date (not randomly splitting)
        train_samples = int((1 - test_size) * len(X))
        result["X_train"] = X[:train_samples]
        result["y_train"] = y[:train_samples]
        result["X_test"]  = X[train_samples:]
        result["y_test"]  = y[train_samples:]
        if shuffle:
            # shuffle the datasets for training (if shuffle parameter is set)
            shuffle_in_unison(result["X_train"], result["y_train"])
            shuffle_in_unison(result["X_test"], result["y_test"])
    else:
        # split the dataset randomly
        result["X_train"], result["X_test"], result["y_train"], result["y_test"] = train_test_split(X, y,
                                                                                    test_size=test_size, shuffle=shuffle)
    # get the list of test set dates
    dates = result["X_test"][:, -1, -1]
    # retrieve test features from the original dataframe
    result["test_df"] = result["df"].loc[dates]
    # remove duplicated dates in the testing dataframe
    result["test_df"] = result["test_df"][~result["test_df"].index.duplicated(keep='first')]
    # remove dates from the training/testing sets & convert to float32
    result["X_train"] = result["X_train"][:, :, :len(feature_columns)].astype(np.float32)
    result["X_test"] = result["X_test"][:, :, :len(feature_columns)].astype(np.float32)
    return result
```

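• The `ticker` parameter is the stock we want to load; for example, you can use TSLA for Tesla, AAPL for Apple, and so on. It can also be a pandas DataFrame, provided it contains the `feature_columns` columns and uses the date as its index.
• `n_steps` is an integer giving the length of the historical sequence we want to use, which some call the window size. Recall that we are using a recurrent neural network, so we need to feed the network sequence data; choosing 50 means we use 50 days of stock prices to predict the next lookup time step.
• `scale` is a boolean indicating whether to scale prices to the range 0 to 1. We set it to `True` because scaling large values into that range helps the neural network learn faster and more effectively.
• `lookup_step` is the future lookup step to predict; the default is 1 (i.e. the next day). A value of 15 means 15 days ahead, and so on.
• `split_by_date` is a boolean indicating whether we split the training and test sets by date. Setting it to `False` means the data is split randomly into training and test sets using sklearn's `train_test_split()` function. If it is `True` (the default), we split the data in date order.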
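To make `n_steps` and `lookup_step` concrete, here is a minimal pure-Python sketch of the windowing idea on a toy price list. The helper name `make_windows` and the toy prices are illustrative assumptions, not part of the tutorial's code; the real `load_data()` works on a DataFrame with several feature columns.

```python
from collections import deque

def make_windows(prices, n_steps, lookup_step):
    """Pair each window of `n_steps` prices with the price `lookup_step` steps ahead.

    Hypothetical helper for illustration only.
    """
    samples = []
    window = deque(maxlen=n_steps)  # the oldest entry drops out automatically
    # only positions that still have a known future price can become samples
    for i in range(len(prices) - lookup_step):
        window.append(prices[i])
        if len(window) == n_steps:
            samples.append((list(window), prices[i + lookup_step]))
    return samples

prices = [10, 11, 12, 13, 14, 15, 16]
samples = make_windows(prices, n_steps=3, lookup_step=2)
for window, target in samples:
    print(window, "->", target)
```

With `n_steps=3` and `lookup_step=2`, the window ending at 12 is paired with the price two steps later (14), and the last two prices never end a window because their future value is unknown.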

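We will use all the features available in this dataset: open, high, low, volume, and adjusted close. The function above does the following:

• First, it loads the dataset using the `stock_info.get_data()` function from the yahoo_fin module.
• It adds the `"date"` column from the index if it does not already exist, which will help us retrieve the features of the test set later.
• If the `scale` argument is passed as `True`, it scales all prices (including volume) to the range 0 to 1 using sklearn's `MinMaxScaler` class. Note that each column has its own scaler.
• It then adds the future column, which holds the target values (the labels, or y values, to predict), by shifting the adjusted close column by `lookup_step`.
• After that, it shuffles the data, splits it into training and test sets, and finally returns the result.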
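The difference between splitting by date and splitting randomly can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: `split_samples` is a hypothetical helper, and its random branch is a simple shuffle-and-cut rather than sklearn's `train_test_split()` implementation.

```python
import random

def split_samples(samples, test_size=0.2, by_date=True, seed=314):
    """Split a chronologically ordered list into (train, test).

    Hypothetical helper for illustration only.
    """
    n_train = int((1 - test_size) * len(samples))
    if by_date:
        # the oldest samples train the model, the most recent ones test it
        return samples[:n_train], samples[n_train:]
    # random split: shuffle a copy so the caller's list is left untouched
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

days = list(range(10))  # stand-in for 10 samples ordered by date
train, test = split_samples(days, test_size=0.2, by_date=True)
print(train, test)

rand_train, rand_test = split_samples(days, test_size=0.2, by_date=False)
print(sorted(rand_test))
```

A date-based split always tests on the most recent samples, which is closer to how the model would be used in practice; a random split mixes past and future samples across both sets.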

## Creating the Model

Now that we have a proper function to load and prepare the dataset, we need another core function to build our model:

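```python
def create_model(sequence_length, n_features, units=256, cell=LSTM, n_layers=2, dropout=0.3,
                 loss="mean_absolute_error", optimizer="rmsprop", bidirectional=False):
    model = Sequential()
    # declare the input shape explicitly: (time steps, features per step)
    model.add(tf.keras.Input(shape=(sequence_length, n_features)))
    for i in range(n_layers):
        # first and hidden layers pass the full sequence on to the next RNN layer;
        # only the last layer collapses it into a single vector
        return_sequences = i < n_layers - 1
        layer = cell(units, return_sequences=return_sequences)
        if bidirectional:
            layer = Bidirectional(layer)
        model.add(layer)
        # add dropout after each layer
        model.add(Dropout(dropout))
    # single linear output neuron: the (scaled) future price
    model.add(Dense(1, activation="linear"))
    model.compile(loss=loss, metrics=["mean_absolute_error"], optimizer=optimizer)
    return model
```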

## Training the Model

Now that all the core functions are ready, let's train our model. Before that, let's initialize all the parameters (so you can edit them later as needed):

```python
import os
import time
from tensorflow.keras.layers import LSTM

# Window size or the sequence length
N_STEPS = 50
# Lookup step, 1 is the next day
LOOKUP_STEP = 15
# whether to scale feature columns & output price as well
SCALE = True
scale_str = f"sc-{int(SCALE)}"
# whether to shuffle the dataset
SHUFFLE = True
shuffle_str = f"sh-{int(SHUFFLE)}"
# whether to split the training/testing set by date
SPLIT_BY_DATE = False
split_by_date_str = f"sbd-{int(SPLIT_BY_DATE)}"
# test ratio size, 0.2 is 20%
TEST_SIZE = 0.2
# features to use
FEATURE_COLUMNS = ["adjclose", "volume", "open", "high", "low"]
# date now
date_now = time.strftime("%Y-%m-%d")
### model parameters
N_LAYERS = 2
# LSTM cell
CELL = LSTM
# 256 LSTM neurons
UNITS = 256
# 40% dropout
DROPOUT = 0.4
# whether to use bidirectional RNNs
BIDIRECTIONAL = False
### training parameters
# mean absolute error loss
# LOSS = "mae"
# huber loss
LOSS = "huber_loss"
# optimization algorithm (referenced in model_name and create_model below)
OPTIMIZER = "adam"
BATCH_SIZE = 64
EPOCHS = 500
# Amazon stock market
ticker = "AMZN"
ticker_data_filename = os.path.join("data", f"{ticker}_{date_now}.csv")
# model name to save, making it as unique as possible based on parameters
model_name = f"{date_now}_{ticker}-{shuffle_str}-{scale_str}-{split_by_date_str}-\
{LOSS}-{OPTIMIZER}-{CELL.__name__}-seq-{N_STEPS}-step-{LOOKUP_STEP}-layers-{N_LAYERS}-units-{UNITS}"
if BIDIRECTIONAL:
    model_name += "-b"
```

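• `TEST_SIZE`: the test set ratio. For example, `0.2` means 20% of the total dataset.
• `FEATURE_COLUMNS`: the features we will use to predict the next price.
• `N_LAYERS`: the number of RNN layers to use.
• `CELL`: the RNN cell to use; the default is LSTM.
• `UNITS`: the number of `cell` units.
• `DROPOUT`: the dropout rate is the probability that a given node in a layer is not trained, where 0.0 means no dropout at all. This type of regularization helps keep the model from overfitting the training data.
• `BIDIRECTIONAL`: whether to use bidirectional recurrent neural networks.
• `LOSS`: the loss function to use for this regression problem. We use Huber loss; you can also use mean absolute error (`mae`) or mean squared error (`mse`).
• `OPTIMIZER`: the optimization algorithm to use; the default is Adam.
• `BATCH_SIZE`: the number of data samples used in each training iteration.
• `EPOCHS`: the number of times the learning algorithm passes through the entire training dataset. We used 500 here, but try increasing it further.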
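To see how the Huber loss mentioned above relates to `mae` and `mse`, here is a small pure-Python sketch of the per-sample formula. It assumes a threshold `delta` of 1.0 (the default documented for Keras's Huber loss); the function name `huber` is only for illustration.

```python
def huber(error, delta=1.0):
    """Per-sample Huber loss: quadratic for small errors, linear for large ones."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error ** 2
    return delta * (abs_err - 0.5 * delta)

for e in (0.1, 1.0, 3.0):
    # columns: error, Huber loss, absolute error (MAE term), squared error (MSE term)
    print(e, huber(e), abs(e), e ** 2)
```

For errors below `delta` the loss behaves like half the squared error, and above it the loss grows linearly like the absolute error, which makes it less sensitive to outliers than `mse`. Since prices are scaled to the range 0 to 1 here, most errors are likely to fall in the quadratic region.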

Before training, let's make sure the `results`, `logs`, and `data` folders exist:

```python
# create these folders if they do not exist
if not os.path.isdir("results"):
    os.mkdir("results")
if not os.path.isdir("logs"):
    os.mkdir("logs")
if not os.path.isdir("data"):
    os.mkdir("data")
```

```python
# load the data
data = load_data(ticker, N_STEPS, scale=SCALE, split_by_date=SPLIT_BY_DATE,
                 shuffle=SHUFFLE, lookup_step=LOOKUP_STEP, test_size=TEST_SIZE,
                 feature_columns=FEATURE_COLUMNS)
# save the dataframe
data["df"].to_csv(ticker_data_filename)
# construct the model
model = create_model(N_STEPS, len(FEATURE_COLUMNS), loss=LOSS, units=UNITS, cell=CELL, n_layers=N_LAYERS,
                     dropout=DROPOUT, optimizer=OPTIMIZER, bidirectional=BIDIRECTIONAL)
# some tensorflow callbacks
checkpointer = ModelCheckpoint(os.path.join("results", model_name + ".h5"), save_weights_only=True, save_best_only=True, verbose=1)
tensorboard = TensorBoard(log_dir=os.path.join("logs", model_name))
# train the model and save the weights whenever we see
# a new optimal model using ModelCheckpoint
history = model.fit(data["X_train"], data["y_train"],
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    validation_data=(data["X_test"], data["y_test"]),
                    callbacks=[checkpointer, tensorboard],
                    verbose=1)
```

```
Train on 4696 samples, validate on 1175 samples
Epoch 1/500
4608/4696 [============================>.] - ETA: 0s - loss: 0.0011 - mean_absolute_error: 0.0211
Epoch 00001: val_loss improved from inf to 0.00011, saving model to results\2020-12-11_AMZN-sh-1-sc-1-sbd-0-huber_loss-adam-LSTM-seq-50-step-15-layers-2-units-256.h5
4696/4696 [==============================] - 7s 2ms/sample - loss: 0.0011 - mean_absolute_error: 0.0211 - val_loss: 1.0943e-04 - val_mean_absolute_error: 0.0071
Epoch 2/500
4544/4696 [============================>.] - ETA: 0s - loss: 4.3212e-04 - mean_absolute_error: 0.0146
Epoch 00002: val_loss did not improve from 0.00011
4696/4696 [==============================] - 2s 411us/sample - loss: 4.2579e-04 - mean_absolute_error: 0.0144 - val_loss: 1.5914e-04 - val_mean_absolute_error: 0.0104
```
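The "improved" and "did not improve" messages in the log come from the `save_best_only=True` setting. The bookkeeping behind it can be sketched in plain Python as below; this is a simplified illustration, not Keras's actual implementation, and `best_only` is a hypothetical name.

```python
def best_only(val_losses):
    """Return the best validation loss and the epochs at which weights would be saved."""
    best = float("inf")  # nothing seen yet, so any first loss is an improvement
    saved_epochs = []
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best:
            best = val_loss
            saved_epochs.append(epoch)  # a checkpoint would be written here
    return best, saved_epochs

# validation losses of the two epochs shown in the log above
best, saved = best_only([1.0943e-04, 1.5914e-04])
print(best, saved)
```

Only epoch 1 triggers a save, because the validation loss of epoch 2 is higher than the best value seen so far.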

After training finishes (or while it is running), you can start TensorBoard with:

``tensorboard --logdir="logs"``

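## Testing the Model

Now that we have trained our model, let's evaluate it and see how it performs on the test set. The function below takes a pandas DataFrame and uses matplotlib to plot the true and predicted prices in the same figure; we will use it later: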

```python
import matplotlib.pyplot as plt

def plot_graph(test_df):
    """
    This function plots true close price along with predicted close price
    with blue and red colors respectively
    """
    plt.plot(test_df[f'true_adjclose_{LOOKUP_STEP}'], c='b')
    plt.plot(test_df[f'adjclose_{LOOKUP_STEP}'], c='r')
    plt.xlabel("Days")
    plt.ylabel("Price")
    plt.legend(["Actual Price", "Predicted Price"])
    plt.show()
```

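The function below takes the `model` and `data` returned by `create_model()` and `load_data()` respectively, and constructs a DataFrame that contains the predicted adjclose alongside the true future adjclose, together with the computed buy and sell profit. We will see it in action in a moment: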

```python
def get_final_df(model, data):
    """
    This function takes the `model` and `data` dict to
    construct a final dataframe that includes the features along
    with true and predicted prices of the testing dataset
    """
    # if predicted future price is higher than the current,
    # then calculate the true future price minus the current price, to get the buy profit
    buy_profit  = lambda current, pred_future, true_future: true_future - current if pred_future > current else 0
    # if the predicted future price is lower than the current price,
    # then subtract the true future price from the current price
    sell_profit = lambda current, pred_future, true_future: current - true_future if pred_future < current else 0
    X_test = data["X_test"]
    y_test = data["y_test"]
    # perform prediction and get prices
    y_pred = model.predict(X_test)
    if SCALE:
        # invert the 0-1 scaling to get back real prices
        scaler = data["column_scaler"]["adjclose"]
        y_test = np.squeeze(scaler.inverse_transform(y_test.reshape(-1, 1)))
        y_pred = np.squeeze(scaler.inverse_transform(y_pred))
    test_df = data["test_df"]
    # add predicted future prices to the dataframe
    test_df[f"adjclose_{LOOKUP_STEP}"] = y_pred
    # add true future prices to the dataframe
    test_df[f"true_adjclose_{LOOKUP_STEP}"] = y_test
    # sort the dataframe by date
    test_df.sort_index(inplace=True)
    final_df = test_df
    # add the buy profit column
    final_df["buy_profit"] = list(map(buy_profit,
                                      final_df["adjclose"],
                                      final_df[f"adjclose_{LOOKUP_STEP}"],
                                      final_df[f"true_adjclose_{LOOKUP_STEP}"])
                                  # since we don't have profit for last sequence, add 0's
                                  )
    # add the sell profit column
    final_df["sell_profit"] = list(map(sell_profit,
                                       final_df["adjclose"],
                                       final_df[f"adjclose_{LOOKUP_STEP}"],
                                       final_df[f"true_adjclose_{LOOKUP_STEP}"])
                                   # since we don't have profit for last sequence, add 0's
                                   )
    return final_df
```
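The two profit rules used in `get_final_df()` are easier to follow on a few made-up trades. The sketch below restates them as named functions (same logic as the lambdas above); the trade values are invented for illustration.

```python
def buy_profit(current, pred_future, true_future):
    # we only buy when the model predicts a rise; the realized profit
    # is the true future price minus today's price (it can be negative)
    return true_future - current if pred_future > current else 0

def sell_profit(current, pred_future, true_future):
    # we only sell when the model predicts a fall; the realized profit
    # is today's price minus the true future price (it can be negative)
    return current - true_future if pred_future < current else 0

# (current price, predicted future price, true future price), illustrative values
trades = [(100, 110, 105), (100, 90, 95), (100, 110, 97)]
buy = [buy_profit(*t) for t in trades]
sell = [sell_profit(*t) for t in trades]
print(buy)
print(sell)
```

The third trade shows a wrong call: the model predicted a rise, we bought, and the price fell, so the buy profit is negative.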

```python
def predict(model, data):
    # retrieve the last sequence from data
    last_sequence = data["last_sequence"][-N_STEPS:]
    # expand dimension
    last_sequence = np.expand_dims(last_sequence, axis=0)
    # get the prediction (scaled from 0 to 1)
    prediction = model.predict(last_sequence)
    # get the price (by inverting the scaling)
    if SCALE:
        predicted_price = data["column_scaler"]["adjclose"].inverse_transform(prediction)[0][0]
    else:
        predicted_price = prediction[0][0]
    return predicted_price
```
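Inverting the scaling, as `predict()` does, is the reverse of the min-max formula applied when the data was loaded. Here is a minimal pure-Python sketch of that round trip; the `MinMax` class is a hypothetical stand-in for illustration and is not sklearn's `MinMaxScaler`.

```python
class MinMax:
    """Toy per-column min-max scaler (illustrative stand-in, not sklearn's class)."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, value):
        # map [lo, hi] onto [0, 1]
        return (value - self.lo) / (self.hi - self.lo)

    def inverse(self, scaled):
        # map a value in [0, 1] back to the original price range
        return scaled * (self.hi - self.lo) + self.lo

scaler = MinMax().fit([50.0, 100.0, 150.0])
scaled_price = scaler.transform(100.0)
restored = scaler.inverse(0.75)
print(scaled_price, restored)
```

A model trained on scaled data outputs values in the scaled range, so a prediction such as 0.75 has to be mapped back through the same column's scaler before it can be read as a price.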

```python
# load optimal model weights from results folder
model_path = os.path.join("results", model_name) + ".h5"
model.load_weights(model_path)
```

```python
# evaluate the model
loss, mae = model.evaluate(data["X_test"], data["y_test"], verbose=0)
# calculate the mean absolute error (inverse scaling)
if SCALE:
    mean_absolute_error = data["column_scaler"]["adjclose"].inverse_transform([[mae]])[0][0]
else:
    mean_absolute_error = mae
```

```python
# get the final dataframe for the testing set
final_df = get_final_df(model, data)
```

```python
# predict the future price
future_price = predict(model, data)
```

The code below computes the accuracy score by counting the number of positive profits (both buy profit and sell profit):

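```python
# we calculate the accuracy by counting the number of positive profits
accuracy_score = (len(final_df[final_df['sell_profit'] > 0]) + len(final_df[final_df['buy_profit'] > 0])) / len(final_df)
# calculating total buy & sell profit
total_buy_profit  = final_df["buy_profit"].sum()
total_sell_profit = final_df["sell_profit"].sum()
# total profit by adding sell & buy together
total_profit = total_buy_profit + total_sell_profit
# dividing total profit by number of testing samples (number of trades)
profit_per_trade = total_profit / len(final_df)
```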

```python
# printing metrics
print(f"Future price after {LOOKUP_STEP} days is {future_price:.2f}$")
print(f"{LOSS} loss:", loss)
print("Mean Absolute Error:", mean_absolute_error)
print("Accuracy score:", accuracy_score)
print("Total sell profit:", total_sell_profit)
print("Total profit:", total_profit)
```

```
Future price after 15 days is 3232.24$
huber_loss loss: 8.655239071231335e-05
Mean Absolute Error: 24.113272707281315
Accuracy score: 0.5884808013355592
Total sell profit: 2095.779877185823
Total profit: 12806.088417530062
```

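• Mean absolute error: we get an error of about 24, which means that, on average, the model's predictions are more than 24 dollars away from the true price. This varies with the `ticker`: as prices get larger, the error grows too. You should therefore use this metric to compare models only when they are evaluated on the same ticker (e.g. AMZN).
• Buy/sell profit: the profit we would make if we opened trades on all the test samples; we computed these in the `get_final_df()` function.
• Total profit: simply the sum of the buy and sell profits.
• Profit per trade: the total profit divided by the total number of test samples.
• Accuracy score: a score for how accurate our predictions are, based on the positive profits we get from all the trades on the test samples.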
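As a worked example of how these metrics fit together, the sketch below computes them on four made-up trades using plain lists instead of a DataFrame. The profit values are invented for illustration.

```python
# per-sample profits for four hypothetical test samples
buy_profits = [5.0, 0.0, -3.0, 0.0]
sell_profits = [0.0, 4.0, 0.0, 0.0]
n_trades = len(buy_profits)

# accuracy: share of samples whose buy or sell trade ended with a positive profit
positive_trades = sum(p > 0 for p in buy_profits) + sum(p > 0 for p in sell_profits)
accuracy_score = positive_trades / n_trades

total_buy_profit = sum(buy_profits)
total_sell_profit = sum(sell_profits)
total_profit = total_buy_profit + total_sell_profit
profit_per_trade = total_profit / n_trades

print(accuracy_score, total_profit, profit_per_trade)
```

Two of the four samples produced a positive profit, so the accuracy score is 0.5. The losing buy (-3.0) lowers the total profit but does not count toward accuracy, and neither does the last sample, where no profit was made.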

```python
# plot true/pred prices graph
plot_graph(final_df)
```

```python
print(final_df.tail(10))
# save the final dataframe to csv-results folder
csv_results_folder = "csv-results"
if not os.path.isdir(csv_results_folder):
    os.mkdir(csv_results_folder)
csv_filename = os.path.join(csv_results_folder, model_name + ".csv")
final_df.to_csv(csv_filename)
```

```
            open         high         low          close        adjclose     volume   ticker  adjclose_15  true_adjclose_15  buy_profit  sell_profit
2021-03-10  3098.449951  3116.459961  3030.050049  3057.639893  3057.639893  3012500  AMZN    3239.598633  3094.080078       36.440186   0.000000
2021-03-11  3104.010010  3131.780029  3082.929932  3113.590088  3113.590088  2776400  AMZN    3238.842773  3161.000000       47.409912   0.000000
2021-03-12  3075.000000  3098.979980  3045.500000  3089.489990  3089.489990  2421900  AMZN    3238.662598  3226.729980       137.239990  0.000000
2021-03-15  3074.570068  3082.239990  3032.090088  3081.679932  3081.679932  2913600  AMZN    3238.824219  3223.820068       142.140137  0.000000
2021-03-17  3073.219971  3173.050049  3070.219971  3135.729980  3135.729980  3118600  AMZN    3238.115234  3299.300049       163.570068  0.000000
2021-03-18  3101.000000  3116.629883  3025.000000  3027.989990  3027.989990  3649600  AMZN    3238.491943  3372.199951       344.209961  0.000000
2021-03-25  3072.989990  3109.780029  3037.139893  3046.260010  3046.260010  3563500  AMZN    3238.083740  3399.439941       353.179932  0.000000
2021-04-15  3371.000000  3397.000000  3352.000000  3379.090088  3379.090088  3233600  AMZN    3223.817627  3306.370117       0.000000    72.719971
2021-04-23  3319.100098  3375.000000  3308.500000  3340.879883  3340.879883  3192800  AMZN    3226.480957  3222.899902       0.000000    117.979980
2021-05-03  3484.729980  3486.649902  3372.699951  3386.489990  3386.489990  5875500  AMZN    3217.589844  3244.989990       0.000000    141.500000
```

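• `adjclose_15`: the predicted `adjclose` price 15 days later (because `LOOKUP_STEP` is set to 15), produced by our trained model.
• `true_adjclose_15`: the true `adjclose` price 15 days later, which we obtain by shifting the test dataset.
• `buy_profit`: the profit we make if we buy the stock on that date. A negative profit means we lost money (it should have been a sell trade, but we bought).
• `sell_profit`: the profit we make if we sell the stock on that date.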

## Summary

You can also change the model parameters, for example by increasing the number of layers or the number of LSTM units, or even try GRU cells in place of LSTM.