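I found "Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?", which explains how to feed text into texttiling, but I am unable to actually get back text segmented at paragraph/topic changes, as shown under texttiling at http://www.nltk.org/api/nltk.tokenize.html.
When I feed my text into texttiling, I get the same unsegmented text back, only wrapped in a list, which is of no use to me.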
import nltk
tt = nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)
tiles = tt.tokenize(text)  # same text returned
What I have are emails that follow this basic structure:
From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL
If we call this email string s, it would look like:
s = "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\nSome text here representing the body of the text. Regards,\nX\n\n*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\nIF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"
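What I want to do is return these 5 sections/paragraphs of the string s (LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER) separately, so that I can remove everything except the BODY of the text. How can I return these 5 sections separately using nltk texttiling?
***Not all emails follow the same structure or use the same wording, so I cannot use regular expressions.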
Answer 0 (score: 1)
How about using splitlines? Or do you have to use the nltk package?
email = """ From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""
y = [s.strip() for s in email.splitlines()]
print(y)
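Building on the splitlines idea, here is a minimal sketch that groups the stripped lines into blocks separated by blank lines. It assumes sections are delimited by blank lines, which the sample string s from the question only partly satisfies (it contains a single blank line, before the disclaimer). The helper name split_blocks is hypothetical and not part of any library.

```python
from itertools import groupby


def split_blocks(text):
    """Group consecutive non-empty lines into blocks separated by blank lines."""
    lines = [line.strip() for line in text.splitlines()]
    return [
        "\n".join(group)
        for has_text, group in groupby(lines, key=bool)  # bool("") is False
        if has_text
    ]


# The sample string s from the question.
s = (
    "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\n"
    "Some text here representing the body of the text. Regards,\nX\n\n"
    "*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\n"
    "IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"
)

blocks = split_blocks(s)
for block in blocks:
    print(repr(block))
```

On this sample the sketch only separates the disclaimer from everything else, so it would help with the five-way split only if the real emails keep blank lines between their sections.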
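Answer 1 (score: 0)
"What I want to do is return these 5 sections/paragraphs of the string s (LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER) separately, so that I can remove everything except the BODY of the text. How can I return these 5 sections separately using nltk texttiling?"
The TextTiling algorithm {1,4,5} is not designed to perform sequential text classification {2,3}, which is the task you describe. Instead, quoting http://people.ischool.berkeley.edu/~hearst/research/tiling.html:
"TextTiling is a[n unsupervised] technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics."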
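To illustrate why this differs from labeling sections, here is a rough sketch of the lexical-cohesion idea behind TextTiling: score each gap between adjacent blocks by how much vocabulary the two sides share. This is a deliberately simplified stand-in, not NLTK's implementation; it skips stopword removal, smoothing and depth scoring, and the function names and sample blocks are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(block):
    """Count lowercase word tokens in a block of text."""
    return Counter(re.findall(r"[a-z']+", block.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(blocks):
    """Similarity across each gap between adjacent blocks."""
    bags = [bag_of_words(b) for b in blocks]
    return [cosine(x, y) for x, y in zip(bags, bags[1:])]


# Hypothetical blocks: the first two share a topic, the third does not.
blocks = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices fell sharply today",
]
scores = gap_scores(blocks)
print(scores)  # a low score suggests a topic shift at that gap
```

The scores only indicate where the vocabulary changes. They attach no label such as INTRO or BODY to a block, which is why an unsupervised segmenter alone cannot return the five named sections.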
References: