Question

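The text in question is at http://pastebin.com/gD65sS22; it is the abstract of a paper on viral media. I am using the example code from http://www.nltk.org/api/nltk.tokenize.html: I read the text file, load the punkt/english.pickle tokenizer, and print the resulting sentences.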

Basically, the output is poor. Hardly any 'e.g.' is correctly ignored, and several citations get mangled...

Is this just a general NLTK weakness, or am I doing something wrong? Should I look into using regular expressions instead?

Answer 1

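First, your text is a little noisy: it contains \r\n line breaks, which can throw off nltk.sent_tokenize, so it helps to join the lines into one string first. Second, sent_tokenize is not very good with text that has sentence-internal periods. For example: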

from urllib.request import urlopen, Request
from nltk import sent_tokenize

request = Request("http://pastebin.com/raw.php?i=gD65sS22")
response = urlopen(request)

text = " ".join(response.read().decode('utf8').split('\r\n'))

for sent in sent_tokenize(text):
    print(sent)

[OUT]:

Testing text / paper abstract taken from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10505  Viral videos have become a staple of the social Web.
The term refers to videos that are uploaded to video sharing sites such as YouTube, Vimeo, or Blip.tv and more or less quickly gain the attention of millions of people.
Viral videos mainly contain humorous content such as bloopers in television shows (eg.
boom goes the dynamite) or quirky Web productions (eg.
nyan cat).
Others show extraordinary events caught on video (eg.
battle at Kruger) or contain political messages (eg.
kony 2012).
The arguably most prominent example, however, is the music video Gangnam style by PSY which, as of January 2015, has been viewed over 2 billion times on YouTube.
Yet, while the recent surge in viral videos has been attributed to the availability of affordable digital cameras and video sharing sites (Grossman 2006), viral Web videos predate modern social media.
An example is the dancing baby which appeared in 1996 and was mainly shared via email.
The fact that videos became Internet phenomena already before the first video sharing sites appeared suggests that collective attention to viral videos may spread in form of a contact process.
Put differently, it seems reasonable to surmise that attention to viral videos spreads through the Web very much as viruses spread through the world.
Indeed, the times series shown in Fig.
1 support this intuition.
They show exemplary developments of YouTube view counts and Google searches related to recent viral videos and closely resemble the progress of infection counts often observed in epidemic outbreaks.
However, although viral videos attract growing research efforts, the suitability of the viral metaphor was apparently not studied systematically yet.
In this paper, we therefore ask to what extend the dynamics in Fig.
1 can be explained in terms of the dynamics of epidemics?
This question extends existing viral video research which, so far, can be distinguished into two broad categories: On the one hand, researchers especially in the humanities and in marketing, ask for what it is that draws attention to viral videos (Burgess 2008; Southgate, Westoby, and Page 2010).
In a recent study, Shifman (2012) looked at attributes common to viral videos and, based on a corpus of 30 prominent examples, identified six predominant features, namely: focus on ordinary people, flawed masculinity, humor, simplicity, repetitiveness, and whimsical content.
However, while he argues that these attributes mark a video as incomplete or flawed and therefore invoke further attention or creative dialogue, the presence of these key signifiers does not im- ply virality.
After all, there are millions of videos that show these attributes but never attract significant viewership Another popular line of research, especially among data scientists, therefore consists in analyzing viewing patterns of viral videos.
For instance, Figueiredo et al.
(2011) found that the temporal dynamics of view counts of YouTube videos seem to depend on whether or not the material is copyrighted.
While copyrighted videos (typically music videos) were observed to reach peak popularity early in their lifetime, other viral videos had been available for quite some time  before  they  experienced  sudden  significant  bursts  in popularity.
In addition, the authors observed that these bursts depended  on  external  factors  such  as  being  listed  on  the YouTube front page.
The importance of external effects for the viral success of a video was also noted by Broxton et al.
(2013) who found that viewership patterns of YouTube videos strongly depend on referrals from sites such as Face- book or Twitter.
In particular, they observed that ‘social’ videos with many outside referrals rise to and fall from peak popularity much quicker than ‘less social’ ones.
Sudden bursts in view counts seem to be suitable predictors of a video’s future popularity (Crane and Sornette 2008; Pinto, Almeida, and Goncalves 2013; Jiang et al.
2014).
In fact, it appears that initial view count statistics combined with additional information as to, say, video related sharing activities in other social media, allow for predicting whether or not a video will ’go viral’ soon (Shamma et al.
2011; Jain, Manweiler, and Choudhury 2014).
Yet, Broxton et al.
(2013) point out that not all ‘social’ videos go viral and not all viral videos are indeed ‘social’.
Given this interest in video related time series analysis, it is surprising that the viral metaphor has not been scrutinized from this angle.
To the best of our knowledge, the most closely related work is found in a recent report by CintroArias (2014) who attempted to match an intricate infectious disease model to view count data for the video Gangnam Style.
We, too, investigate the attention dynamics of viral videos from the point of view of mathematical epidemiology and present results based on a data set of more than 800 time series.
Our contributions are of theoretical and empirical nature, namely: 1) we introduce a simple yet expressive probabilistic model of the dynamics of epidemics; in contrast to traditional approaches, our model admits a closed form expression for the evolution of infected counts and we show that it amounts to the convolution of two geometric distributions 2) we introduce a time continuous characterization of this result; major advantages of this continuous model are that it is analytically tractable and allows for the use of highly robust maximum likelihood techniques in model fitting as well as for easily interpretable results 3) we fit our model to YouTube view count data and Google Trends time series which reflect collective attention to prominent viral videos and find it to fit well.
Our work therefore constitutes a data scientific approach towards viral video research.
However, it is model- rather than data driven.
This way, we follow arguments brought forth, for instance, by Bauckhage et al.
(2013) or Lazer et al.
(2014) who criticized the lack of interpretability and the ‘big data hubris’ of purely data driven approaches for their potential of over-fitting and misleading results.
Our presentation proceeds as follows: Next, we review concepts from mathematical epidemiology, briefly discuss approaches based on systems of differential equations, and introduce the probabilistic model that forms the basis for our study; mathematical details behind this model are deferred to the Appendix.
Then, we present the data we analyzed and discuss our empirical results.
We conclude by summarizing our approach, results, and implications of our findings.

Now let's try a small hack:

from urllib.request import urlopen, Request
from nltk import sent_tokenize

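def hack(text):
    # Hide the period of "et al." so the tokenizer does not split on it.
    return text.replace('et al. ', 'et_al._')

def unhack(text):
    # Restore the original "et al. " after tokenization.
    return text.replace('et_al._', 'et al. ')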


request = Request("http://pastebin.com/raw.php?i=gD65sS22")
response = urlopen(request)

text = " ".join(response.read().decode('utf8').split('\r\n'))
text = hack(text)

for sent in sent_tokenize(text):
    print(unhack(sent))

[OUT]:

Testing text / paper abstract taken from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10505  Viral videos have become a staple of the social Web.
The term refers to videos that are uploaded to video sharing sites such as YouTube, Vimeo, or Blip.tv and more or less quickly gain the attention of millions of people.
Viral videos mainly contain humorous content such as bloopers in television shows (eg.
boom goes the dynamite) or quirky Web productions (eg.
nyan cat).
Others show extraordinary events caught on video (eg.
battle at Kruger) or contain political messages (eg.
kony 2012).
The arguably most prominent example, however, is the music video Gangnam style by PSY which, as of January 2015, has been viewed over 2 billion times on YouTube.
Yet, while the recent surge in viral videos has been attributed to the availability of affordable digital cameras and video sharing sites (Grossman 2006), viral Web videos predate modern social media.
An example is the dancing baby which appeared in 1996 and was mainly shared via email.
The fact that videos became Internet phenomena already before the first video sharing sites appeared suggests that collective attention to viral videos may spread in form of a contact process.
Put differently, it seems reasonable to surmise that attention to viral videos spreads through the Web very much as viruses spread through the world.
Indeed, the times series shown in Fig.
1 support this intuition.
They show exemplary developments of YouTube view counts and Google searches related to recent viral videos and closely resemble the progress of infection counts often observed in epidemic outbreaks.
However, although viral videos attract growing research efforts, the suitability of the viral metaphor was apparently not studied systematically yet.
In this paper, we therefore ask to what extend the dynamics in Fig.
1 can be explained in terms of the dynamics of epidemics?
This question extends existing viral video research which, so far, can be distinguished into two broad categories: On the one hand, researchers especially in the humanities and in marketing, ask for what it is that draws attention to viral videos (Burgess 2008; Southgate, Westoby, and Page 2010).
In a recent study, Shifman (2012) looked at attributes common to viral videos and, based on a corpus of 30 prominent examples, identified six predominant features, namely: focus on ordinary people, flawed masculinity, humor, simplicity, repetitiveness, and whimsical content.
However, while he argues that these attributes mark a video as incomplete or flawed and therefore invoke further attention or creative dialogue, the presence of these key signifiers does not im- ply virality.
After all, there are millions of videos that show these attributes but never attract significant viewership Another popular line of research, especially among data scientists, therefore consists in analyzing viewing patterns of viral videos.
For instance, Figueiredo et al. (2011) found that the temporal dynamics of view counts of YouTube videos seem to depend on whether or not the material is copyrighted.
While copyrighted videos (typically music videos) were observed to reach peak popularity early in their lifetime, other viral videos had been available for quite some time  before  they  experienced  sudden  significant  bursts  in popularity.
In addition, the authors observed that these bursts depended  on  external  factors  such  as  being  listed  on  the YouTube front page.
The importance of external effects for the viral success of a video was also noted by Broxton et al. (2013) who found that viewership patterns of YouTube videos strongly depend on referrals from sites such as Face- book or Twitter.
In particular, they observed that ‘social’ videos with many outside referrals rise to and fall from peak popularity much quicker than ‘less social’ ones.
Sudden bursts in view counts seem to be suitable predictors of a video’s future popularity (Crane and Sornette 2008; Pinto, Almeida, and Goncalves 2013; Jiang et al. 2014).
In fact, it appears that initial view count statistics combined with additional information as to, say, video related sharing activities in other social media, allow for predicting whether or not a video will ’go viral’ soon (Shamma et al. 2011; Jain, Manweiler, and Choudhury 2014).
Yet, Broxton et al. (2013) point out that not all ‘social’ videos go viral and not all viral videos are indeed ‘social’.
Given this interest in video related time series analysis, it is surprising that the viral metaphor has not been scrutinized from this angle.
To the best of our knowledge, the most closely related work is found in a recent report by CintroArias (2014) who attempted to match an intricate infectious disease model to view count data for the video Gangnam Style.
We, too, investigate the attention dynamics of viral videos from the point of view of mathematical epidemiology and present results based on a data set of more than 800 time series.
Our contributions are of theoretical and empirical nature, namely: 1) we introduce a simple yet expressive probabilistic model of the dynamics of epidemics; in contrast to traditional approaches, our model admits a closed form expression for the evolution of infected counts and we show that it amounts to the convolution of two geometric distributions 2) we introduce a time continuous characterization of this result; major advantages of this continuous model are that it is analytically tractable and allows for the use of highly robust maximum likelihood techniques in model fitting as well as for easily interpretable results 3) we fit our model to YouTube view count data and Google Trends time series which reflect collective attention to prominent viral videos and find it to fit well.
Our work therefore constitutes a data scientific approach towards viral video research.
However, it is model- rather than data driven.
This way, we follow arguments brought forth, for instance, by Bauckhage et al. (2013) or Lazer et al. (2014) who criticized the lack of interpretability and the ‘big data hubris’ of purely data driven approaches for their potential of over-fitting and misleading results.
Our presentation proceeds as follows: Next, we review concepts from mathematical epidemiology, briefly discuss approaches based on systems of differential equations, and introduce the probabilistic model that forms the basis for our study; mathematical details behind this model are deferred to the Appendix.
Then, we present the data we analyzed and discuss our empirical results.
We conclude by summarizing our approach, results, and implications of our findings.

It looks better now, but "(eg. ..." is still a problem, and so is "Fig. ...". Let's keep hacking:

from urllib.request import urlopen, Request
from nltk import sent_tokenize

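def hack(text):
    # Hide the periods of known abbreviations before tokenizing.
    text = text.replace('et al. ', 'et_al._')
    text = text.replace('eg. ', 'eg._')
    text = text.replace('Fig. ', 'Fig._')
    return text

def unhack(text):
    # Restore the abbreviations after tokenizing.
    text = text.replace('et_al._', 'et al. ')
    # Note: 'eg._' and 'Fig._' are restored without the trailing space, so the
    # output below reads "eg.boom" and "Fig.1"; use 'eg. ' and 'Fig. ' to keep it.
    text = text.replace('eg._', 'eg.')
    text = text.replace('Fig._', 'Fig.')
    return text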


request = Request("http://pastebin.com/raw.php?i=gD65sS22")
response = urlopen(request)

text = " ".join(response.read().decode('utf8').split('\r\n'))
text = hack(text)

for sent in sent_tokenize(text):
    print(unhack(sent))

[OUT]:

Testing text / paper abstract taken from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10505  Viral videos have become a staple of the social Web.
The term refers to videos that are uploaded to video sharing sites such as YouTube, Vimeo, or Blip.tv and more or less quickly gain the attention of millions of people.
Viral videos mainly contain humorous content such as bloopers in television shows (eg.boom goes the dynamite) or quirky Web productions (eg.nyan cat).
Others show extraordinary events caught on video (eg.battle at Kruger) or contain political messages (eg.kony 2012).
The arguably most prominent example, however, is the music video Gangnam style by PSY which, as of January 2015, has been viewed over 2 billion times on YouTube.
Yet, while the recent surge in viral videos has been attributed to the availability of affordable digital cameras and video sharing sites (Grossman 2006), viral Web videos predate modern social media.
An example is the dancing baby which appeared in 1996 and was mainly shared via email.
The fact that videos became Internet phenomena already before the first video sharing sites appeared suggests that collective attention to viral videos may spread in form of a contact process.
Put differently, it seems reasonable to surmise that attention to viral videos spreads through the Web very much as viruses spread through the world.
Indeed, the times series shown in Fig.1 support this intuition.
They show exemplary developments of YouTube view counts and Google searches related to recent viral videos and closely resemble the progress of infection counts often observed in epidemic outbreaks.
However, although viral videos attract growing research efforts, the suitability of the viral metaphor was apparently not studied systematically yet.
In this paper, we therefore ask to what extend the dynamics in Fig.1 can be explained in terms of the dynamics of epidemics?
This question extends existing viral video research which, so far, can be distinguished into two broad categories: On the one hand, researchers especially in the humanities and in marketing, ask for what it is that draws attention to viral videos (Burgess 2008; Southgate, Westoby, and Page 2010).
In a recent study, Shifman (2012) looked at attributes common to viral videos and, based on a corpus of 30 prominent examples, identified six predominant features, namely: focus on ordinary people, flawed masculinity, humor, simplicity, repetitiveness, and whimsical content.
However, while he argues that these attributes mark a video as incomplete or flawed and therefore invoke further attention or creative dialogue, the presence of these key signifiers does not im- ply virality.
After all, there are millions of videos that show these attributes but never attract significant viewership Another popular line of research, especially among data scientists, therefore consists in analyzing viewing patterns of viral videos.
For instance, Figueiredo et al. (2011) found that the temporal dynamics of view counts of YouTube videos seem to depend on whether or not the material is copyrighted.
While copyrighted videos (typically music videos) were observed to reach peak popularity early in their lifetime, other viral videos had been available for quite some time  before  they  experienced  sudden  significant  bursts  in popularity.
In addition, the authors observed that these bursts depended  on  external  factors  such  as  being  listed  on  the YouTube front page.
The importance of external effects for the viral success of a video was also noted by Broxton et al. (2013) who found that viewership patterns of YouTube videos strongly depend on referrals from sites such as Face- book or Twitter.
In particular, they observed that ‘social’ videos with many outside referrals rise to and fall from peak popularity much quicker than ‘less social’ ones.
Sudden bursts in view counts seem to be suitable predictors of a video’s future popularity (Crane and Sornette 2008; Pinto, Almeida, and Goncalves 2013; Jiang et al. 2014).
In fact, it appears that initial view count statistics combined with additional information as to, say, video related sharing activities in other social media, allow for predicting whether or not a video will ’go viral’ soon (Shamma et al. 2011; Jain, Manweiler, and Choudhury 2014).
Yet, Broxton et al. (2013) point out that not all ‘social’ videos go viral and not all viral videos are indeed ‘social’.
Given this interest in video related time series analysis, it is surprising that the viral metaphor has not been scrutinized from this angle.
To the best of our knowledge, the most closely related work is found in a recent report by CintroArias (2014) who attempted to match an intricate infectious disease model to view count data for the video Gangnam Style.
We, too, investigate the attention dynamics of viral videos from the point of view of mathematical epidemiology and present results based on a data set of more than 800 time series.
Our contributions are of theoretical and empirical nature, namely: 1) we introduce a simple yet expressive probabilistic model of the dynamics of epidemics; in contrast to traditional approaches, our model admits a closed form expression for the evolution of infected counts and we show that it amounts to the convolution of two geometric distributions 2) we introduce a time continuous characterization of this result; major advantages of this continuous model are that it is analytically tractable and allows for the use of highly robust maximum likelihood techniques in model fitting as well as for easily interpretable results 3) we fit our model to YouTube view count data and Google Trends time series which reflect collective attention to prominent viral videos and find it to fit well.
Our work therefore constitutes a data scientific approach towards viral video research.
However, it is model- rather than data driven.
This way, we follow arguments brought forth, for instance, by Bauckhage et al. (2013) or Lazer et al. (2014) who criticized the lack of interpretability and the ‘big data hubris’ of purely data driven approaches for their potential of over-fitting and misleading results.
Our presentation proceeds as follows: Next, we review concepts from mathematical epidemiology, briefly discuss approaches based on systems of differential equations, and introduce the probabilistic model that forms the basis for our study; mathematical details behind this model are deferred to the Appendix.
Then, we present the data we analyzed and discuss our empirical results.
We conclude by summarizing our approach, results, and implications of our findings.

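But yes, cleaning the text before sentence tokenization is not difficult; just look for a few general patterns that break the tokenizer. I hope the examples above give you the general idea.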
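The same protect-then-restore idea can be written a bit more generally. This is only a sketch: the abbreviation list, the placeholder character, and the names protect, restore, split_sentences and naive_split are my own choices for illustration, and naive_split is a crude regex stand-in so that the example runs without NLTK.

```python
import re

# Illustrative abbreviation list (an assumption); extend it for your own corpus.
ABBREVIATIONS = ["et al.", "e.g.", "eg.", "i.e.", "Fig."]
# Placeholder character (ONE DOT LEADER), assumed never to occur in the input.
PLACEHOLDER = "\u2024"

# Longest alternatives first; \b keeps "eg." from matching inside "leg.".
_ABBR_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")"
)


def protect(text):
    """Swap the periods inside known abbreviations for the placeholder."""
    return _ABBR_RE.sub(lambda m: m.group(0).replace(".", PLACEHOLDER), text)


def restore(sentence):
    """Put the real periods back after tokenization."""
    return sentence.replace(PLACEHOLDER, ".")


def split_sentences(text, tokenize):
    """Run any sentence tokenizer on the protected text, then restore it."""
    return [restore(s) for s in tokenize(protect(text))]


def naive_split(text):
    """Crude stand-in for a real tokenizer: split after ., ! or ? plus whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


sample = (
    "Viral videos contain bloopers (eg. boom goes the dynamite). "
    "Indeed, the series in Fig. 1 support this. "
    "Broxton et al. (2013) agree."
)

for sent in split_sentences(sample, naive_split):
    print(sent)
```

With NLTK installed, passing nltk.sent_tokenize as the tokenize argument should work the same way, since the placeholder is not a sentence-ending character. Because only the periods are swapped, the spacing after each abbreviation is preserved.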

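So the hack works, but only for this dataset. How can it be generalized? A more general solution is to retrain the punkt tokenizer to get a sentence tokenizer tailored to academic text; see "training data format for nltk punkt".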

But note that you might need a small set of sentence-tokenized text to train the tokenizer. Have fun!
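For what it's worth, retraining could look roughly like the sketch below. It assumes the PunktTrainer class from nltk.tokenize.punkt, which as far as I know learns its parameters unsupervised from raw text. The helper name train_punkt and the toy corpus are made up for illustration; a real run would need a much larger body of academic text, and what gets learned from a corpus this small is not meaningful.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer


def train_punkt(raw_text):
    """Learn Punkt parameters from raw text and wrap them in a tokenizer."""
    trainer = PunktTrainer()
    trainer.train(raw_text)  # finalizes training by default
    return PunktSentenceTokenizer(trainer.get_params())


# Made-up toy corpus, only so the example runs end to end.
toy_corpus = " ".join(
    [
        "Broxton et al. (2013) studied referrals.",
        "Figueiredo et al. (2011) studied view counts.",
        "The series in Fig. 1 support this.",
        "Results appear in Fig. 2 and Fig. 3.",
        "Some videos spread quickly.",
        "Others never attract viewers.",
    ]
    * 20
)

tokenizer = train_punkt(toy_corpus)
sample = "Jiang et al. (2014) agree. The curve in Fig. 4 is similar."
sentences = tokenizer.tokenize(sample)
for sent in sentences:
    print(sent)
```

Whether "et al." and "Fig." end up being treated as abbreviations depends entirely on the training text, so inspect the output on your own data before relying on it.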

Answer 2

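Why don't I get an email when people answer these... Anyway. I kept looking into this and eventually stumbled, via Google, on an answer that uses NLTK functionality to add more "known" abbreviations: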

tokenizer._params.abbrev_types.update(extra_abbreviations)

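extra_abbreviations is language- and context-specific. But even using ['eg', 'al', 'ie'] improved my case considerably, which makes me wonder why the "trained" English pickle does not seem to include these.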
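A minimal sketch of that call in context follows. To keep it runnable without downloading a model, it uses a fresh, untrained PunktSentenceTokenizer instead of the loaded english.pickle, which is an assumption on my part; with the pretrained tokenizer the update line is the same. Note that _params is a private attribute, so this may change between NLTK versions. The sample text and the extra 'fig' entry are for illustration only.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Abbreviations are stored lower-case and without the trailing period.
extra_abbreviations = {"eg", "al", "ie", "fig"}

# Untrained tokenizer: starts with empty parameters, no model download needed.
tokenizer = PunktSentenceTokenizer()
tokenizer._params.abbrev_types.update(extra_abbreviations)

sample = (
    "Viral videos contain bloopers (eg. boom goes the dynamite). "
    "Indeed, the series in Fig. 1 support this. "
    "Broxton et al. (2013) agree."
)

sentences = tokenizer.tokenize(sample)
for sent in sentences:
    print(sent)
```

An untrained tokenizer knows nothing beyond the abbreviations you give it, so for real text it is probably better to apply the same update to the pretrained English tokenizer.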

Python 2.7 x32 - NLTK Punkt Tokenizer does not detect sentences correctly
