Question

I am trying to clean up text for use in a machine learning application. Basically, these are "semi-structured" specification documents, and I am trying to remove the section numbers that interfere with NLTK's sent_tokenize() function.

Here is a sample of the text I am working with:

and a Contract for the work and/or material is entered into with some other person for a greater amount, the undersigned hereby agrees to forfeit all right and title to the aforementioned deposit, and the same is forfeited to the Crown. 2.3.3 ... (b) until thirty-five days after the time fixed for receiving this tender, whichever first occurs. 2.4 AGREEMENT Should this tender be accepted, the undersigned agrees to enter into written agreement with the Minister of Transportation of the Province of Alberta for the faithful performance of the works covered by this tender, in accordance with the said plans and specifications and complete the said work on or before October 15, 2019.

I am trying to remove all section markers (e.g. 2.3.3, 2.4, (b)) without removing the numbers in dates.

This is the regular expression I have so far: [0-9]*\.[0-9]|[0-9]\.

Unfortunately, it also matches part of the date in the last paragraph (2019. becomes 201), and I really don't know how to fix this, as I am not a regex expert.

Thanks for your help!

Answer 1

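You can try replacing the following pattern with an empty string:

((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))

import re
# `text` is the string to clean (renamed from `input`, which shadows a Python built-in)
output = re.sub(r'((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))', '', text)
print(output)

This pattern works by matching a section number with \d+(?:\.\d+)*, but only if it appears at the start of a line. It also matches lettered section headings with \([a-z]+\).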
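As a minimal sketch of how this behaves, the snippet below applies the pattern to a shortened version of the question's text. The line breaks in `sample` are an assumption (the question does not show where they fall), and the names `SECTION_AT_LINE_START`, `sample`, and `cleaned` are only illustrative.

```python
import re

# Pattern from this answer: a section marker directly at the start of a line.
SECTION_AT_LINE_START = r'((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))'

# Hypothetical sample: the question's text with each marker starting a line.
sample = (
    "2.3.3 ...\n"
    "(b) until thirty-five days after the time fixed for receiving this tender\n"
    "2.4 AGREEMENT\n"
    "complete the said work on or before October 15, 2019."
)

cleaned = re.sub(SECTION_AT_LINE_START, '', sample)
print(cleaned)
```

Because the match is tied to the start of a line, the date at the end of the last line is left alone. Note that \d+(?:\.\d+)* also accepts a bare number with no dot, so a line that happens to begin with an ordinary number would lose it.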

Answer 2

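The pattern you tried, [0-9]*\.[0-9]|[0-9]\., is not anchored. It will match 0+ digits, a dot and a single digit, or a single digit and a dot.

It also does not account for the markers between parentheses.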
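To see why the date gets damaged, here is a small sketch that runs the attempted pattern on fragments of the question's text. The variable names are illustrative only.

```python
import re

# The unanchored pattern from the question.
attempted = r"[0-9]*\.[0-9]|[0-9]\."

# A section number is removed as intended...
heading = re.sub(attempted, "", "2.4 AGREEMENT")

# ...but the "9." that ends the sentence matches the second alternative.
date = re.sub(attempted, "", "October 15, 2019.")

# Listing the individual matches shows what is being picked up.
matches = re.findall(attempted, "2.3.3 on October 15, 2019.")

print(repr(heading))  # ' AGREEMENT'
print(repr(date))     # 'October 15, 201'
print(matches)        # ['2.3', '.3', '9.']
```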

Assuming the section markers are at the start of the string, possibly preceded by spaces or tabs, you could update the pattern to use an alternation:

^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))

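^ Start of string
[\t ]* Match 0+ spaces or tabs
(?: Non-capturing group
- \d+(?:\.\d+)+ Match 1+ digits, then repeat 1+ times a dot followed by 1+ digits, so that at least one dot is required, to match 2.3.3 or 2.4
- | Or
- \([a-z]+\) Match 1+ characters a-z between parentheses
) Close non-capturing group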


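For example, using re.MULTILINE, where s is your string:

import re
pattern = r"^(?:\d+(?:\.\d+)+|\([a-z]+\))"
# re.MULTILINE makes ^ match at the start of every line, not only at the start of the string
result = re.sub(pattern, "", s, flags=re.MULTILINE)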
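A self-contained sketch of the same idea, this time with the optional leading spaces or tabs included in the pattern. The line breaks, the tab before (b), and the "15 days later" wording in `sample` are assumptions made up for illustration, not part of the question's text.

```python
import re

# Pattern from this answer, compiled with MULTILINE so ^ matches at every line start.
pattern = re.compile(r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))", re.MULTILINE)

# Hypothetical sample based on the question's text.
sample = (
    "forfeited to the Crown.\n"
    "2.3.3 ...\n"
    "\t(b) until thirty-five days after the time fixed for receiving this tender\n"
    "2.4 AGREEMENT\n"
    "15 days later, on or before October 15, 2019."
)

result = pattern.sub("", sample)
print(result)
```

Since (?:\.\d+)+ requires at least one dot, the bare "15" at the start of the last line is kept, and the date is untouched because it is not at the start of a line.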

Answer 3

For your specific case, I think \n[\d+\.]+|\n\(\w\) should work. The \n helps to distinguish the section markers.
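A rough sketch of this suggestion, reading the second alternative as \n\(\w\) and assuming each marker sits on its own line after a newline. The `sample` text and variable names are hypothetical.

```python
import re

# Pattern from this answer: a newline followed by a run of digits, "+" or dots,
# or a newline followed by a single word character in parentheses.
pattern = r"\n[\d+\.]+|\n\(\w\)"

# Hypothetical sample with each marker on its own line.
sample = (
    "forfeited to the Crown.\n"
    "2.3.3\n"
    "(b)\n"
    "until thirty-five days\n"
    "2.4\n"
    "AGREEMENT\n"
    "on or before October 15, 2019."
)

result = re.sub(pattern, "", sample)
print(result)

# A marker at the very start of the string has no preceding newline, so it is kept.
unchanged = re.sub(pattern, "", "2.4\nAGREEMENT")
```

The newline is consumed together with the marker, so the marker lines disappear entirely. A marker at the very beginning of the text is not matched, and \(\w\) only covers single-character labels such as (b).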
