How to Fine-Tune BERT for Text Classification with HuggingFace Transformers in Python?


Learn how to fine-tune BERT and other transformer models for text classification tasks in Python using the HuggingFace Transformers library.

Transformer models have shown incredible results on most tasks in natural language processing. The power of transfer learning combined with large-scale transformer language models has become the standard in state-of-the-art NLP.

One of the biggest milestones in the evolution of NLP was the release of Google's BERT model in late 2018, which has been described as the beginning of a new era in NLP.

In this tutorial, we will walk through an example of fine-tuning BERT (and other transformer models) for text classification on a dataset of your choice using the HuggingFace Transformers library.

Note that this tutorial is about fine-tuning a BERT model on a downstream task (such as text classification); if you want to train BERT from scratch, that is called pre-training, and a separate tutorial covers that.

We will use the 20 Newsgroups dataset as the demonstration for this tutorial. It is a dataset of roughly 18,000 news posts on 20 different topics.

To get started, let's install the HuggingFace Transformers library along with the other dependencies:

pip3 install transformers numpy torch scikit-learn

Open up a new notebook/Python file and import the necessary modules:

import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
import random
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

Next, let's create a function that sets the seed, so we get the same results across different runs:

def set_seed(seed: int):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available
    if is_tf_available():
        import tensorflow as tf

        tf.random.set_seed(seed)

set_seed(1)

As mentioned earlier, we will use the BERT model. More specifically, we will use the pre-trained weights of bert-base-uncased. Again, if you want to pre-train BERT on your own large dataset, the separate pre-training tutorial should help you do that.

We will also use a max_length of 512:

# the model we gonna train, base uncased BERT
# check text classification models here: https://huggingface.co/models?filter=text-classification
model_name = "bert-base-uncased"
# max sequence length for each document/sentence sample
max_length = 512

max_length is the maximum length of our sequences. In other words, we will only pick the first 512 tokens of each document or post, and you can always change it to whatever you want. However, if you increase it, make sure it fits in memory during training, even with a smaller batch size.

Related: Conversational AI Chatbot with Transformers in Python.

Loading the Dataset

Next, let's download and load the tokenizer responsible for converting our text into sequences of tokens:

# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

We also set do_lower_case to True to make sure all text is lowercased (remember, we are using the uncased model).
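If you want to see what the tokenizer does, here is a minimal sanity check (not part of the original walkthrough) that encodes a short sentence and decodes it back, showing the lowercasing and the special [CLS]/[SEP] tokens BERT expects:

# quick sanity check: encode a short sentence and decode it back
sample = tokenizer("Hello, BERT!")
print(sample["input_ids"])
# decoding shows the lowercased text wrapped in [CLS] ... [SEP]
print(tokenizer.decode(sample["input_ids"]))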

The following code downloads and loads the dataset:

def read_20newsgroups(test_size=0.2):
  # download & load 20newsgroups dataset from sklearn's repos
  dataset = fetch_20newsgroups(subset="all", shuffle=True, remove=("headers", "footers", "quotes"))
  documents = dataset.data
  labels = dataset.target
  # split into training & testing sets and return the data as well as the label names
  return train_test_split(documents, labels, test_size=test_size), dataset.target_names
  
# call the function
(train_texts, valid_texts, train_labels, valid_labels), target_names = read_20newsgroups()

train_texts and valid_texts are the lists of documents (lists of strings) for the training and validation sets respectively, and the same goes for train_labels and valid_labels, each of which is a list of integer labels from 0 to 19. target_names is the list of the names of our 20 labels.
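If you want to double-check the split before moving on, a quick inspection (just a sketch using the variables returned above) could look like this:

# sanity check: sizes of the splits and a few label names
print(len(train_texts), len(valid_texts))   # roughly an 80/20 split of ~18,000 documents
print(len(target_names))                    # 20
print(target_names[:3])                     # e.g. ['alt.atheism', 'comp.graphics', ...]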

Now let's use our tokenizer to encode the corpus:

# tokenize the dataset, truncate when passed `max_length`, 
# and pad with 0's when less than `max_length`
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)

We set truncation to True so that tokens beyond max_length are removed, and we also set padding to True to pad documents shorter than max_length with padding tokens.
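The resulting encodings behave like dictionaries of lists; a quick optional look at what they contain:

# each encoding holds input_ids, token_type_ids and attention_mask
print(train_encodings.keys())
# every sequence ends up with max_length tokens here, since the longest documents exceed 512 tokens
print(len(train_encodings["input_ids"][0]))  # 512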

The code below wraps our tokenized text data into a torch Dataset:

class NewsGroupsDataset(torch.utils.data.Dataset):
    """Wraps the tokenized encodings and labels as a torch Dataset."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # build a dict of tensors for one sample (input_ids, attention_mask, etc.)
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        # add the label under the key the Trainer expects
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)

Since we will be using the Trainer from the Transformers library, it expects our dataset to be a torch.utils.data.Dataset, so we create a simple class that implements the __len__() method, which returns the number of samples, and the __getitem__() method, which returns the data sample at a specific index.
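A quick way to confirm the wrapper behaves the way the Trainer expects is to index it directly (an optional sanity check):

# each sample is a dict of tensors that the Trainer can batch directly
print(len(train_dataset))
sample = train_dataset[0]
print(sample.keys())      # input_ids, token_type_ids, attention_mask, labels
print(sample["labels"])   # the integer class id of this document, as a tensor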

Training the Model

Now that we have our data prepared, let's download and load our BERT model along with its pre-trained weights:

# load the model and pass to CUDA
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(target_names)).to("cuda")

We use the BertForSequenceClassification class from the Transformers library and set num_labels to the number of available labels, which in this case is 20.

We also move our model to the CUDA GPU. If you are on a CPU (not recommended), simply remove the to() call.
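If you are not sure whether a GPU is available, a device-agnostic variant of the same loading line (functionally equivalent, it just falls back to the CPU) would be:

# pick the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(target_names)).to(device)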

Before we start fine-tuning our model, let's create a simple function to compute the metric we care about, in this case accuracy:

from sklearn.metrics import accuracy_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # calculate accuracy using sklearn's function
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }

Feel free to include any metric you want; I've included accuracy, but you can also add precision, recall, and so on.
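For example, a variant of compute_metrics that also reports macro-averaged precision, recall, and F1 using scikit-learn (just a sketch, not required for the rest of the tutorial) might look like this:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # macro-average the per-class scores over the 20 classes
  precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="macro")
  return {
      "accuracy": accuracy_score(labels, preds),
      "precision": precision,
      "recall": recall,
      "f1": f1,
  }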

The code below uses the TrainingArguments class to specify our training arguments, such as the number of epochs, the batch size, and a few other parameters:

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=200,               # log & save weights each logging_steps
    evaluation_strategy="steps",     # evaluate each `logging_steps`
)

Each argument is explained in the code comments. I've specified 16 as the training batch size because that's the maximum I could fit into the memory of a Google Colab environment.

You can also tweak other parameters, such as increasing the number of epochs for better training.
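If you run out of GPU memory even with a batch size of 16, one common option (not used in this tutorial, just a hedged alternative) is to lower per_device_train_batch_size and compensate with gradient_accumulation_steps, which keeps the effective batch size at 16:

# hypothetical alternative: same effective batch size of 16, with less GPU memory per step
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,   # half the batch size per step...
    gradient_accumulation_steps=2,   # ...but accumulate gradients over 2 steps
    per_device_eval_batch_size=20,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    load_best_model_at_end=True,
    logging_steps=200,
    evaluation_strategy="steps",
)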

We then pass our training arguments, datasets, and the compute_metrics callback to our Trainer:

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)

Training the model:

# train the model
trainer.train()

This will take several minutes or hours depending on your environment; here is my output on Google Colab:

[######################] [2829/2829 58:39, Epoch 3/3]
Step	Training Loss	Validation Loss	Accuracy
200 	2.799619	    2.147746	    0.475066
400 	1.660876	    1.215588	    0.648011
600 	1.204610	    1.035250	    0.706101
800 	1.053862	    0.946825	    0.717507
1000	0.963572	    0.894024	    0.729973
1200	0.765880	    0.860701	    0.746419
1400	0.743791	    0.831061	    0.751989
1600	0.710643	    0.808310	    0.756233
1800	0.675188	    0.814872	    0.760477
2000	0.542912	    0.819911	    0.768700
2200	0.425509	    0.801369	    0.768435
2400	0.401201	    0.834266	    0.771883
2600	0.402379	    0.811787	    0.773210
2800	0.393575	    0.800710	    0.775862
TrainOutput(global_step=2829, training_loss=0.9052972534007089)

As you can see, the validation loss gradually decreases and the accuracy climbs above 77.5%.

Remember that we set load_best_model_at_end to True; this automatically loads the best-performing model once training is finished. Let's confirm that with the evaluate() method:

# evaluate the current model after training
trainer.evaluate()

This will take a few seconds and output something like the following:

{'epoch': 3.0,
 'eval_accuracy': 0.7758620689655172,
 'eval_loss': 0.80070960521698}

Now that we've trained our model, let's save it:

# saving the fine tuned model & tokenizer
model_path = "20newsgroups-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
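Later on, for example in a separate script or notebook, you can load the fine-tuned model and tokenizer back from that directory; a minimal sketch:

# reload the fine-tuned model & tokenizer from disk
# (num_labels is already stored in the saved config, so we don't pass it again)
model = BertForSequenceClassification.from_pretrained(model_path).to("cuda")  # drop .to("cuda") on CPU
tokenizer = BertTokenizerFast.from_pretrained(model_path)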

Performing Inference

Now that we have a trained model on our dataset, let's have some fun with it!

The function below takes a text as a string, tokenizes it with our tokenizer, computes the output probabilities using the softmax function, and returns the actual label:

def get_prediction(text):
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to("cuda")
    # perform inference to our model
    outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)
    # executing argmax function to get the candidate label
    return target_names[probs.argmax()]

Here is an example:

# Example #1
text = """
The first thing is first. 
If you purchase a Macbook, you should not encounter performance issues that will prevent you from learning to code efficiently.
However, in the off chance that you have to deal with a slow computer, you will need to make some adjustments. 
Having too many background apps running in the background is one of the most common causes. 
The same can be said about a lack of drive storage. 
For that, it helps if you uninstall xcode and other unnecessary applications, as well as temporary system junk like caches and old backups.
"""
print(get_prediction(text))

Output:

comp.sys.mac.hardware

As expected, we're talking about MacBooks. Here is a second example:

# Example #2
text = """
A black hole is a place in space where gravity pulls so much that even light can not get out. 
The gravity is so strong because matter has been squeezed into a tiny space. This can happen when a star is dying.
Because no light can get out, people can't see black holes. 
They are invisible. Space telescopes with special tools can help find black holes. 
The special tools can see how stars that are very close to black holes act differently than other stars.
"""
print(get_prediction(text))

Output:

sci.space

As expected, this gets the science -> space label!

One more example:

# Example #3
text = """
Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus.
Most people infected with the COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment.  
Older people, and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness.
"""
print(get_prediction(text))

Output:

sci.med

Conclusion

In this tutorial, you learned how to train a BERT model on your own dataset using the Huggingface Transformers library.

Note that you can also use other transformer models, such as GPT-2 with GPT2ForSequenceClassification, RoBERTa with RobertaForSequenceClassification, DistilBERT with DistilBertForSequenceClassification, and more. Head over to the official documentation for the list of available models.
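One convenient way to swap models is to use the Auto classes, which pick the right tokenizer and model classes from the checkpoint name; a sketch, assuming the tokenization and training steps above are re-run with the new tokenizer and model:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# e.g. swapping in DistilBERT; only the checkpoint name changes
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(target_names)).to("cuda")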

Also, if your dataset is in a language other than English, make sure you pick pre-trained weights for your language; this will help a lot during training. Check this link and use the filter to get the model weights you need.
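For instance, for a non-English (or mixed-language) dataset you could start from a multilingual checkpoint instead; a hedged example:

# hypothetical example: multilingual BERT weights for non-English text
model_name = "bert-base-multilingual-uncased"
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(target_names)).to("cuda")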
