Question

I want to split text on punctuation but not inside emails - please bear in mind that Unicode must be used, because not everyone speaks English.

import re

example = 'My email is John@gmail.com. My name is John. Her email is Anna@gmail.com'
# flags must be passed by keyword; the third positional argument of re.split is maxsplit
print(re.split(r'[.]\s*', example, flags=re.UNICODE))
# gives ['My email is John@gmail', 'com', 'My name is John', 'Her email is Anna@gmail', 'com']
# required ['My email is John@gmail.com', 'My name is John', 'Her email is Anna@gmail.com']

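How do I split this correctly? I know regular expressions but not how to solve this - I don't think lookbehind will work, because the number of characters is not fixed.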

I could write a pattern that matches emails alongside separators, and treat the email as always winning over the separator.

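Consider that humans are imperfect and this is natural language, so input like the following can occur - we should help with their simple mistakes, but not all of them: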

'My email is john@www.mysite.pl.I am teenager.'
'My email is john@www.mysite.pl. I am teenager.'

Top-level domain endings can be learned and stored in some dictionary, e.g. '.com | .pl | ...'.
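
The idea described above (the email alternative wins over the separator, and valid endings come from a dictionary of top-level domains) could be sketched roughly as follows. The names KNOWN_TLDS, EMAIL, TOKEN and split_sentences are hypothetical, the TLD list is only a placeholder, and the email pattern is a deliberate simplification rather than a full address grammar.

```python
import re

# Hypothetical dictionary of known top-level domains; a real one would be larger.
KNOWN_TLDS = ["com", "pl", "edu", "gov"]

# Simplified email pattern: local part, "@", dotted labels, then a known TLD.
EMAIL = r"[\w.+-]+@(?:[\w-]+\.)+(?:%s)\b" % "|".join(KNOWN_TLDS)

# An email is consumed whole when the scan reaches it, so the dots inside it
# are never offered to the separator branch.
TOKEN = re.compile(r"(?P<email>%s)|(?P<sep>[.!?]\s*)" % EMAIL, flags=re.UNICODE)


def split_sentences(text):
    sentences = []
    start = 0
    for match in TOKEN.finditer(text):
        if match.group("email") is not None:
            continue  # skip over the email, it is not a boundary
        sentence = text[start:match.start()]
        if sentence:
            sentences.append(sentence)
        start = match.end()
    if text[start:]:
        sentences.append(text[start:])
    return sentences


print(split_sentences('My email is john@www.mysite.pl.I am teenager.'))
```

Because the address has to end in a known top-level domain, the missing space in 'john@www.mysite.pl.I am teenager.' is tolerated: the regex backtracks to 'john@www.mysite.pl' and the following dot is treated as a separator.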

Answer 1

For your current problem, note that this solves it:

re.split(r'[.]\s+', example, flags=re.UNICODE)

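Beyond that, people do a couple of things:

Stop patterns and dictionaries of abbreviations, for things like Dr., et al., a.k.a. You can look at an example here.
Machine learning algorithms. They detect all possible sentence endings, such as . ! ? and so on, and run a classifier to guess which ones really end a sentence. For example, see nltk in Python.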
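
A minimal sketch of the first approach, assuming a tiny hand-written abbreviation list (ABBREVIATIONS and split_sentences are hypothetical names, and a real stop list would be much longer):

```python
import re

# Hypothetical stop list of abbreviations, stored without the trailing dot.
ABBREVIATIONS = {"dr", "mr", "mrs", "al", "a.k.a"}


def split_sentences(text):
    sentences = []
    start = 0
    # Candidate boundaries: '.', '!' or '?' followed by whitespace.
    for match in re.finditer(r"[.!?]\s+", text, flags=re.UNICODE):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # the dot belongs to an abbreviation, not a sentence end
        sentences.append(text[start:match.start()].strip())
        start = match.end()
    # Whatever is left is the last sentence; drop its closing punctuation.
    tail = text[start:].strip().rstrip(".!?")
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith wrote to john@gmail.com. He replied."))
```

Requiring whitespace after the punctuation is what keeps the dots inside an email address from being treated as boundaries here; the stop list only handles abbreviations.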

Answer 2

It is not that easy, but for the example provided it can be done with a negative lookahead:

>>> import re
>>>
>>> print(re.split(r'\.(?!com)', example, flags=re.UNICODE))
['My email is John@gmail.com', ' My name is John', ' Her email is Anna@gmail.com']

Assuming .com is the only top-level domain that occurs, this is enough for a solution.

Update

The other example, john@www.mysite.pl.I, is still split incorrectly, but you wrote:

We should help with their simple mistakes, but not all of them...

example = [
    'Hello John.Doe@gmail.com, Jane.Doe@mail.edu.pl and Anna_Karenina@mail.gov.pl',
    'My email is john@www.mysite.pl.I am teenager.',
    'My email is john@www.mysite.pl. I am teenager.']

for sentence in example:
    # split on '.' or ',' unless a word character or another dot follows
    for token in re.split(r'[.,](?![\w.]+)', sentence, flags=re.UNICODE):
        for word in filter(None, token.split(' ')):
            print(word)

>>> example = [
...     'Hello John.Doe@gmail.com, Jane.Doe@mail.edu.pl and Anna_Karenina@mail.gov.pl',
...     'My email is john@www.mysite.pl.I am teenager.',
...     'My email is john@www.mysite.pl. I am teenager.']
>>>
>>> for sentence in example:
...     for token in re.split(r'[.,](?![\w.]+)', sentence, flags=re.UNICODE):
...         for word in filter(None, token.split(' ')):
...             print(word)
...
Hello
John.Doe@gmail.com
Jane.Doe@mail.edu.pl
and
Anna_Karenina@mail.gov.pl
My
email
is
john@www.mysite.pl.I
am
teenager
My
email
is
john@www.mysite.pl
I
am
teenager

;))))

Answer 3

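The usual way to do this in Java and C is to use the ICU library, which provides a mechanism called Break Iterator. It can be configured through a file of regular expressions to recognize any number of regular patterns in the text that you want treated as whole tokens (emails, phone numbers, and so on).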

I can see there is a Python version at https://pypi.python.org/pypi/PyICU

It is also the library you should be using to process Unicode text.

How to split text on punctuation, but not inside emails or other expressions?
