How to Generate Text Summaries with Python and Machine Learning

2022-05-09   Source: nlpcloud.io   Author/Translator: Julien Salinas/lukeaxu

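Many developers want to generate text summaries automatically, for example to create a summary of every blog post or to condense documents for employees.

Transformer-based models such as Bart Large CNN make it easy to summarize text. These machine learning models are simple to use but harder to scale. This article shows how to use Bart Large CNN and how to improve its performance.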

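Transformers and Bart Large CNN

Transformers have made advanced natural language processing tasks, such as text summarization, possible.

Some solutions existed before Transformers and neural networks were introduced, but none of them were truly satisfactory.

In recent years, many high-performing pre-trained models have been built on the Transformer architecture. One of them is Bart Large CNN, released by Facebook, which performs very well on text summarization.

Here is how to use Bart Large CNN in Python code.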

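Generating a Summary in Python

The simplest way to use Bart Large CNN is to download it from the Hugging Face repository and call the existing summarization pipeline from the Transformers library: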

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18."""

summary = summarizer(article, max_length=130, min_length=30)

Output:

Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.
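The summarization pipeline returns a list with one dictionary per input text, and the summary itself is stored under the "summary_text" key. The sketch below shows one way to pull the text out. The helper name extract_summaries is made up for illustration, and the example list is hand-written to mimic the shape of the pipeline output rather than produced by the model.

```python
from typing import Dict, List


def extract_summaries(results: List[Dict[str, str]]) -> List[str]:
    """Return the summary strings from a summarization pipeline result.

    extract_summaries is a hypothetical helper, not part of Transformers.
    """
    return [item["summary_text"] for item in results]


# Hand-written stand-in with the same shape as the pipeline's return value.
example = [{"summary_text": "Liana Barrientos, 39, is charged with two counts."}]
print(extract_summaries(example)[0])
```

With the real pipeline, the equivalent call would be extract_summaries(summary)[0], or simply summary[0]["summary_text"].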

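As you can see, the summary is very good even though it took only 4 lines of Python code! You may also have noticed that the first run takes a long time, because the model is large and has to be downloaded.

The min_length and max_length parameters set the allowed length of the summary in tokens. A token can be a word, a punctuation mark, or a subword. As a rule of thumb, 100 tokens are roughly equal to 75 words.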
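To make the token arithmetic concrete, here is a minimal sketch. estimate_tokens applies the rule of thumb above (100 tokens for roughly 75 words) and is only an approximation. count_tokens is a hypothetical helper that asks the model's own tokenizer for an exact count. It assumes the transformers package is installed and that the tokenizer files can be downloaded, and the count it returns includes the special start and end tokens added by the tokenizer.

```python
def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from a word count (100 tokens ~ 75 words)."""
    return round(word_count * 100 / 75)


def count_tokens(text: str, model_name: str = "facebook/bart-large-cnn") -> int:
    """Exact token count according to the model's tokenizer.

    Imported lazily so that estimate_tokens works without transformers.
    The result includes the special tokens the tokenizer adds.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return len(tokenizer(text)["input_ids"])


# A 75-word text is estimated at about 100 tokens.
print(estimate_tokens(75))
```

When the exact length matters, for example to check a text against the model's input limit, the tokenizer-based count is the one to trust.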

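Important: the input text cannot exceed 1024 tokens (about 800 words), which is an internal limit of the model. To summarize a longer text, one workable approach is to split it into several parts, summarize each part separately, and then concatenate the resulting summaries. You can even summarize the summaries!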
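The split-and-concatenate approach can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the function names are made up, the 700-word default is an arbitrary margin below the roughly 800-word limit, and word counts are used as a cheap stand-in for token counts. A single sentence longer than max_words still ends up as one oversized chunk.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 700) -> List[str]:
    """Group whole sentences into chunks of roughly max_words words or fewer."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_long_text(
    text: str, summarize: Callable[[str], str], max_words: int = 700
) -> str:
    """Summarize each chunk with `summarize` and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in split_into_chunks(text, max_words))


# With the pipeline from the article, `summarize` could be:
#   lambda chunk: summarizer(chunk, max_length=130, min_length=30)[0]["summary_text"]
```

Passing the summarizer in as a callable keeps the chunking logic independent of the model. To get a summary of the summaries, the joined result can be fed through the same summarizer once more.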

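Model Performance

Bart Large CNN has two issues that cannot be ignored.

First, like many deep learning models, it needs a lot of disk space and RAM (about 1.5GB). That said, it is still small compared to large deep learning models such as GPT-3, GPT-J, or T5 11B.

Second, inference is slow. Summarizing an 800-word text takes about 20 seconds, even on a powerful CPU.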

The solution is to deploy Bart Large CNN on a GPU. On an NVIDIA Tesla T4, for example, you can expect roughly a 10x speedup, so summarizing an 800-word text takes only about 2 seconds.
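A minimal sketch of running the same pipeline on a GPU is shown below. It assumes PyTorch is installed with CUDA support and falls back to the CPU otherwise. The helper names pick_device and build_summarizer are made up for illustration. The pipeline function accepts a device argument, where 0 selects the first GPU and -1 selects the CPU.

```python
def pick_device(cuda_available: bool) -> int:
    """Map GPU availability to the pipeline's device index (0 = first GPU, -1 = CPU)."""
    return 0 if cuda_available else -1


def build_summarizer(model_name: str = "facebook/bart-large-cnn"):
    """Create the summarization pipeline on the GPU when one is available.

    Imports are done lazily so that pick_device can be used on its own.
    """
    import torch
    from transformers import pipeline

    device = pick_device(torch.cuda.is_available())
    return pipeline("summarization", model=model_name, device=device)


# Usage (downloads the model on first run):
#   summarizer = build_summarizer()
#   summary = summarizer(article, max_length=130, min_length=30)
```

The actual speedup depends on the hardware and the input length, so it is worth timing both configurations on your own data.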

GPUs are still expensive, so whether to provision one should depend on your actual needs.

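Using an External API

Summarizing text with Bart Large CNN is easy to do in a script, but what if you need to handle a large volume of requests in production?

As mentioned above, the first solution is to use a GPU and apply some optimizations to speed up inference.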

The second solution is to delegate the task to a third-party service such as NLP Cloud, which exposes Bart Large CNN through an API.
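Calling a hosted summarization service usually comes down to one authenticated HTTP request. The sketch below uses only the standard library and is purely illustrative: the URL, the "Token" authorization scheme, the {"text": ...} request body, and the "summary_text" response field are all placeholders, not the documented API of NLP Cloud or any other provider. Check the provider's documentation for the real endpoint and payload format.

```python
import json
import urllib.request

# Hypothetical endpoint, used here only as a placeholder.
API_URL = "https://api.example.com/v1/summarization"


def build_request(text: str, api_token: str) -> urllib.request.Request:
    """Build a POST request carrying the text to summarize as JSON."""
    payload = json.dumps({"text": text}).encode("utf-8")
    headers = {
        "Authorization": f"Token {api_token}",  # assumed auth scheme
        "Content-Type": "application/json",
    }
    return urllib.request.Request(API_URL, data=payload, headers=headers, method="POST")


def summarize_remote(text: str, api_token: str, timeout: float = 30.0) -> str:
    """Send the request and return the summary from the JSON response.

    Assumes the service answers with {"summary_text": "..."}.
    """
    with urllib.request.urlopen(build_request(text, api_token), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["summary_text"]
```

Keeping request construction separate from the network call makes the client easy to test without contacting the service.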

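Conclusion

With Transformers and Bart Large CNN, you can generate text summaries in Python with very little effort.

More and more companies now build automatic text summarization into their applications. The hard part is the performance cost that comes with such a complex model, although there are techniques for speeding up Bart Large CNN.

Author: Julien Salinas, CTO of NLPCloud.io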

