Question

I'm trying to clean up some very noisy user-generated web data. Some people don't add a space after the period that ends a sentence. For example,

"Place order.Call us if you have any questions."

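I want to extract each sentence, but when I try to parse the sentences with nltk, it fails to recognize these as two separate sentences. I'd like to use a regular expression in Python to find all patterns of the form "some_word.some_other_word" as well as all patterns of the form "some_word:some_other_word".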

At the same time, I want to avoid matching patterns like "U.S.A", i.e. avoid just_a_character.just_another_character

Thanks a lot for your help :)

Answer 1

The simplest solution:

>>> import re
>>> re.sub(r'([.:])([^\s])', r'\1 \2', 'This is a test. Yes, test.Hello:world.')
'This is a test. Yes, test. Hello: world.'

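The first argument, the pattern, says we want to match a period or colon followed by a non-whitespace character. The second argument is the replacement: the first captured group (the punctuation), then a space, then the second captured group (the character that followed it).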
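One caveat worth checking against the question's "U.S.A" requirement: this pattern cannot tell an abbreviation or a decimal number from a sentence boundary. The sketch below wraps the same substitution in a function (the name `add_missing_spaces` is made up for illustration, and `\S` is used as an equivalent of `[^\s]`) and shows what happens on such inputs:

```python
import re

def add_missing_spaces(text):
    """Insert a space after '.' or ':' whenever a non-space character follows."""
    return re.sub(r'([.:])(\S)', r'\1 \2', text)

print(add_missing_spaces('Place order.Call us'))  # Place order. Call us
print(add_missing_spaces('U.S.A'))                # U. S. A
print(add_missing_spaces('section 1.3'))          # section 1. 3
```

If abbreviations and numbers are rare in your data this may be acceptable; otherwise the pattern needs extra conditions.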

Answer 2

You seem to be asking two different questions:

1) If you want to find all patterns like "some_word.some_other_word" or "some_word:some_other_word"

import re
re.findall(r'\w+[.:?!]\w+', your_text)

This finds all such patterns in the text your_text.

2) If you want to extract all the sentences, you can do

import re
re.split(r'[.!?]', your_text)

This should return a list of sentences. For example,

text = 'Hey, this is a test. How are you?Fine, thanks.'
import re
re.findall(r'\w+[.:?!]\w+', text)  # returns ['you?Fine']
re.split(r'[.!?]', text)  # returns ['Hey, this is a test', ' How are you', 'Fine, thanks', '']
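As the output above shows, `re.split` drops the punctuation, keeps leading spaces, and leaves an empty string at the end. A possible variation, sketched here with a hypothetical helper name `split_sentences`, uses `re.findall` to keep each sentence together with its closing punctuation and to skip empty pieces. Like the split above, it treats every period as a boundary, so abbreviations such as "U.S.A" would still be broken up:

```python
import re

def split_sentences(text):
    """Return sentences with their closing punctuation, skipping empty pieces."""
    # A run of non-terminator characters followed by any run of terminators.
    pieces = re.findall(r'[^.!?]+[.!?]*', text)
    return [p.strip() for p in pieces if p.strip()]

text = 'Hey, this is a test. How are you?Fine, thanks.'
print(split_sentences(text))
# ['Hey, this is a test.', 'How are you?', 'Fine, thanks.']
```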

Answer 3

Here are some of the cases that may occur in your text:

import re

sample = """
   Place order.Call us (period: split)  
   ever after.(The end) (period: split)  
   U.S.A.(abbreviation: don't split internally)
   1.3 How to work with computers (dotted numeral: don't split)  
   ever after...The end (ellipsis: don't split internally)
   (This is the end.)   (period inside parens: don't split)  
   """

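So: don't add a space after a period that follows a digit or a single capital letter, or that comes before a paren or another period. Otherwise, add a space. This will do all of that: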

sample = re.sub(r"(\w[A-Z]|[a-z.])\.([^.)\s])", r"\1. \2", sample)

Result:

Place order. Call us (period: split)  
ever after. (The end) (period: split)  
U.S.A.(abbreviation: don't split internally)
1.3 How to work with computers (dotted numeral: don't split)  
ever after... The end (ellipsis: don't split internally)
(This is the end.)   (period inside parens: don't split)

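This handles every problem in the sample except the last period after U.S.A., which should get a space added after it. I left that out because the combination of conditions is tricky. The following regexp will handle everything, but I don't recommend it: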

sample = re.sub(r"(\w[A-Z]|[a-z.]|\b[A-Z](?!\.[A-Z]))\.([^.)\s])", r"\1. \2", sample)

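A complex regexp like this is a maintainability nightmare: just try adding another pattern, or restricting it to leave out a few more cases. Instead, I recommend using a separate regexp to catch the missing case: a period after a single capital letter, but not followed by a single capital, a paren, or another period.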

sample = re.sub(r"(\b[A-Z]\.)([^.)A-Z])", r"\1 \2", sample)

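For a complex task like this, it makes sense to use a separate regexp for each type of replacement. I would split the original into sub-cases, each of which adds spaces only for one very specific pattern. You can have as many of them as you like, and it won't get out of hand (at least, not too much...)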
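One way to organize the separate passes is a list of (pattern, replacement) rules applied in order. This is only a sketch: the names `SPACING_RULES` and `fix_spacing` are invented for illustration, the first rule is the regexp from this answer, and the second is the single-capital regexp with `\s` added to the excluded class, an assumption made here so that a space that is already present is not doubled.

```python
import re

# Each rule is (compiled pattern, replacement); rules run in order.
SPACING_RULES = [
    # Period after a lowercase letter, after a capital preceded by another
    # word character, or after another period (end of an ellipsis).
    (re.compile(r"(\w[A-Z]|[a-z.])\.([^.)\s])"), r"\1. \2"),
    # Period after a single capital letter, e.g. the end of an abbreviation.
    # Assumption: \s is excluded so existing spaces are not doubled.
    (re.compile(r"(\b[A-Z]\.)([^.)A-Z\s])"), r"\1 \2"),
]

def fix_spacing(text):
    """Apply every spacing rule to the text, one pass per rule."""
    for pattern, replacement in SPACING_RULES:
        text = pattern.sub(replacement, text)
    return text

print(fix_spacing("Place order.Call us"))   # Place order. Call us
print(fix_spacing("U.S.A.(abbreviation)"))  # U.S.A. (abbreviation)
print(fix_spacing("ever after...The end"))  # ever after... The end
print(fix_spacing("1.3 How to work"))       # 1.3 How to work
```

Adding a new case then means appending one more rule rather than editing an existing regexp.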

Answer 4

You can use something like

import re
test = "some_word.some_other_word"
r = re.compile(r'(\D+)\.(\D+)')
print(r.match(test).groups())
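Note that `match` only tests the start of the string, and `\D` matches any non-digit character, including spaces and punctuation, so the snippet above works on the isolated example but not on running text. A rough sketch for scanning a whole text is shown below; it assumes a "word" is two or more ASCII letters (which is what keeps "U.S.A" out), and the names `GLUED` and `find_glued_words` are hypothetical:

```python
import re

# Assumption: a "word" here means two or more ASCII letters.
GLUED = re.compile(r'([A-Za-z]{2,})([.:])([A-Za-z]{2,})')

def find_glued_words(text):
    """Return every word.word or word:word substring found in the text."""
    return [m.group(0) for m in GLUED.finditer(text)]

text = "Place order.Call us. Based in U.S.A, see note:details"
print(find_glued_words(text))
# ['order.Call', 'note:details']
```

Each match object also exposes the two words and the separator through `m.groups()` if you need them separately.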

How to find all substrings with this pattern: some_word.some_other_word with python?
