Question

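I am trying to format blocks of text to make them searchable. Most of the text is formatted fine, but occasionally there are blocks in which words are split into groups of 1 and 2 characters (the text comes from PDFs). I want to remove all whitespace characters in those blocks only.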

The target language is Python. I have written a regex to identify the blocks:
((?:\s\b[a-zA-Z]{1,2}\b){3,})
But I am stuck on how to match/remove only the whitespace inside those blocks.

A solution that uses the Python regex functions (i.e. re.sub) would be great, although all suggestions are welcome.

Here is a sample of the text, containing a block of split characters:

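Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Registration Counter - Welcome Area Level 1 09:00 - 12:00 Registration 12:00 - 13:00 Lunch 13:00 - 19:00 Registration PROGRAM AT GLANCE Saturday, September 1, 2 0 1 8 B al l r o o m 1 B al l r o o m 2 B al l r o o m 3 J u ni o r B al l r o o m 2 M e e ti n g R o o m 1 M e e ti n g R o o m 2 07: 0 0 - 08: 0 0 R e gi s t r ati o n O r al Pr es e nt ati o n 3 O r al Pr es e nt ati o n 4 15: 1 0 - 1 6: 20 S y m p o si u m 1 0 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.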

Here is a link to the above regex applied to the sample text: https://regex101.com/r/utVcRy/1

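When applied to the sample text block, an acceptable answer might look something like this (numbers and punctuation can be ignored; the regex I have written so far does not pick them up, and that is fine):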

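Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Registration Counter - Welcome Area Level 1 09:00 - 12:00 Registration 12:00 - 13:00 Lunch 13:00 - 19:00 Registration PROGRAM AT GLANCE Saturday, September 1, 2018 Ballroom 1 Ballroom 2 Ballroom 3 Junior Ballroom 2 Meeting Room 1 Meeting Room 2 07:00 - 08:00 Registration Oral Presentation 3 Oral Presentation 4 15:10 - 16:20 Symposium 10 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.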

Answer 1

https://docs.python.org/3.3/howto/regex.html

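Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as "Does this string match the pattern?", or "Is there a match for the pattern anywhere in this string?". You can also use REs to modify a string or to split it apart in various ways.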
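As a rough illustration of the four operations that passage mentions (match, search, modify, split), here is a minimal sketch using the standard re module. The sample string and the simplified e-mail pattern are made up for the example and are not meant as a robust e-mail matcher.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Contact: alice@example.com, bob@example.org"

# Deliberately simplified e-mail pattern (an assumption, not a full validator).
EMAIL = r"[\w.]+@[\w.]+"

# "Does this string match the pattern?" (re.match anchors at the start)
starts_with_contact = re.match(r"Contact", text) is not None

# "Is there a match for the pattern anywhere in this string?"
first_email = re.search(EMAIL, text)

# Modify a string: replace every match with a placeholder.
masked = re.sub(EMAIL, "<email>", text)

# Split a string apart on a comma followed by optional whitespace.
parts = re.split(r",\s*", text)

print(starts_with_contact, first_email.group(), masked, parts, sep="\n")
```

For the question at hand, re.sub is the relevant call, since it also accepts a function as the replacement argument.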

Answer 2

I solved this by using the regex from the question to extract the problem blocks, and then using Python to remove the spaces:

import re

# `content` is assumed to already hold the text extracted from the PDF
spaceless_blocks = set(re.findall(r"((?:\s\b[a-zA-Z]{1,2}\b){3,})", content))

# Sort by descending length so that shorter blocks which are substrings of
# longer ones don't prevent replace from finding the longer blocks
spaceless_blocks = sorted(spaceless_blocks, key=len, reverse=True)

for block in spaceless_blocks:
    # Only attempt a fix if the block is large enough to be worth it
    if len(block) > 10:
        # Note: this removes plain spaces only, not tabs or newlines
        compressed = block.replace(' ', '')
        # then do stuff with the compressed block...
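If the goal is simply to write the compressed blocks back into the text, one possible variant is to pass a function to re.sub, which does the find and the replace in a single pass. This is only a sketch, not the code I actually ran: the names BLOCK_RE, join_split_words and min_len are made up here, and min_len just mirrors the len(block) > 10 check above.

```python
import re

# Pattern from the question: three or more consecutive 1-2 letter "words",
# each preceded by a single whitespace character.
BLOCK_RE = re.compile(r"((?:\s\b[a-zA-Z]{1,2}\b){3,})")


def join_split_words(text, min_len=10):
    """Remove the whitespace inside each matched block (hypothetical helper)."""

    def _compress(match):
        block = match.group(1)
        if len(block) <= min_len:
            # Too short to be worth fixing; leave it unchanged.
            return block
        # Keep the leading whitespace character so the joined word stays
        # separated from the preceding text, then drop all whitespace.
        return block[0] + re.sub(r"\s", "", block)

    return BLOCK_RE.sub(_compress, text)


# Made-up sample line containing one split word.
sample = "Saturday B al l r o o m one and so on"
print(join_split_words(sample))  # Saturday Ballroom one and so on
```

Be aware that this joins everything in a matched block into one token, so a run of genuinely short words (or two split words in a row) would get fused as well. The length threshold reduces that risk but does not remove it.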

Regex to match/remove whitespace within groups
