How to remove digits that may appear at the end of words in a text

Time: 2019-01-02 15:55:17

Tags: python regex regex-group

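I have some text data that I want to clean up with regular expressions. However, some words in the text are immediately followed by digits that I want to remove.

For example, one line of the text is: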

  

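Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons learnt from the RUPES project12 Payment for environmental service and it potential and example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams policy and programmes17 Chapter Creating incentive for Tri An watershed protection20 Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32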

The first word in the text above should be "Preface", not "Preface2", and so on.

line = re.sub(r"[A-Za-z]+(\d+)", "", line)

This removes the word along with the digits, as seen here:

  

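Lessons learnt from the RUPES Payment for environmental service and it potential and example in Chapter Integrating payment for ecosystem service into Vietnams policy and Chapter Creating incentive for Tri An watershed Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong district of Hoa Binh province Chapter 5 Local revenue sharing Nha Trang Bay Marine Protected Area Synthesis and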

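How can I capture only the digits that immediately follow a word?

5 answers:

Answer 0: (score: 1)

You can try a lookbehind assertion to check for a word character before the digits. To force your regex to match only digits at the end of a word, add a word boundary (\b):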

re.sub(r'(?<=\w)\d+\b', '', line)

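Hope this helps.

Edit: Sorry for the glitch. As the comments pointed out, this also matches digits that do not follow a word. That is because (sorry again) \w matches alphanumeric characters, not only alphabetic ones. Depending on what you want to remove, you can use the positive version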

re.sub(r'(?<=[a-zA-Z])\d+\b', '', line)

which only accepts an English letter before the digits (you can add more characters to the [a-zA-Z] list), or the negative version

re.sub(r'(?<![\d\s])\d+\b', '', line)

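which matches digits preceded by anything other than \d (a digit) or \s (whitespace). Note that this also matches digits that follow punctuation.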
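To see how the three lookbehind variants differ, here is a small sketch. The sample string is made up for illustration (it is not from the question) and was chosen to include a standalone number and digits after punctuation.

```python
import re

# Hypothetical sample, chosen to exercise the edge cases discussed above.
sample = "Preface2 Park 24 Chapter 5 (see p.12) Vietnam26"

# \w also matches digits, so the last digit of a standalone number is removed.
after_word_char = re.sub(r"(?<=\w)\d+\b", "", sample)

# Only digits directly preceded by an ASCII letter are removed.
after_letter = re.sub(r"(?<=[a-zA-Z])\d+\b", "", sample)

# Digits preceded by anything except a digit or whitespace are removed,
# which includes digits that follow punctuation such as ".".
not_after_digit_or_space = re.sub(r"(?<![\d\s])\d+\b", "", sample)

print(after_word_char)           # Preface Park 2 Chapter 5 (see p.1) Vietnam
print(after_letter)              # Preface Park 24 Chapter 5 (see p.12) Vietnam
print(not_after_digit_or_space)  # Preface Park 24 Chapter 5 (see p.) Vietnam
```

Note that Python's built-in re module only accepts fixed-width lookbehinds, so a pattern such as (?<=\w+) is rejected when it is compiled.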

Answer 1: (score: 1)

You can capture the text part and then replace the whole match with the captured part. It is simply written:

re.sub(r"([A-Za-z]+)\d+", r"\1", line)
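A quick check of the capture-group approach on a shortened, made-up line (an illustrative assumption, not the full text from the question). The second part is an optional variant, not part of the answer: adding \b restricts the match to digits at the very end of a word.

```python
import re

# Shortened, hypothetical sample line.
line = "Preface2 Contributors4 Chapter 5 Vietnam26"

# Keep the letters (group 1) and drop the digits that follow them.
cleaned = re.sub(r"([A-Za-z]+)\d+", r"\1", line)
print(cleaned)  # Preface Contributors Chapter 5 Vietnam

# Without a word boundary, digits in the middle of a token are removed too.
mid_token = re.sub(r"([A-Za-z]+)\d+", r"\1", "abc12def")
print(mid_token)  # abcdef

# Optional variant: \b leaves digits in the middle of a token untouched.
end_only = re.sub(r"([A-Za-z]+)\d+\b", r"\1", "abc12def")
print(end_only)  # abc12def
```

The standalone "5" survives because it is not directly preceded by a letter.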

Answer 2: (score: 0)

Try this:

line = re.sub(r"([A-Za-z]+)(\d+)", "\\2", line) #just keep the number    
line = re.sub(r"([A-Za-z]+)(\d+)", "\\1", line) #just keep the word
line = re.sub(r"([A-Za-z]+)(\d+)", r"\2", line) #same as first one    
line = re.sub(r"([A-Za-z]+)(\d+)", r"\1", line) #same as second one

\\1 refers to the word and \\2 to the number. See: How to use python regex to replace using captured group?

Answer 3: (score: 0)

Below I propose a code sample that may solve your problem.

Here is the snippet:

import re

# Write a function that takes the test data as input and returns the
# desired result as stated in your question.

def transform(data):
    """Strip trailing digits from words in the given text."""
    # First, construct a pattern matching the words we are looking for.
    pattern1 = r"([A-Za-z]+\d+)"

    # Construct another pattern matching the trailing digits to strip from
    # each matched word.
    pattern2 = r"\d+$"

    # Find all matching words.
    matches = re.findall(pattern1, data)

    # Build the replacement for each word by removing its trailing digits.
    replacements = []
    for match in matches:
        replacements.append(re.sub(pattern2, '', match))

    # Pair each word with its replacement for use with str.replace.
    changers = zip(matches, replacements)

    # Apply each replacement in turn; str.replace returns a new string,
    # so the result must be reassigned.
    output = data
    for changer in changers:
        output = output.replace(*changer)

    # The work is done, we can return the result.
    return output

For testing purposes, we run the function above on your test data:

data = """
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons     
learnt from the RUPES project12 Payment for environmental service and it potential and 
example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams 
policy and programmes17 Chapter Creating incentive for Tri An watershed protection20 
Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter 
Building payment mechanism for carbon sequestration in forestry a pilot project in Cao 
Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang 
Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
"""

result = transform(data)

print(result)

The result is as follows:

Preface Contributors Abrreviations Acknowledgements Pes terminology Lessons learnt from 
the RUPES project Payment for environmental service and it potential and example in 
Vietnam Chapter Integrating payment for ecosystem service into Vietnams policy and 
programmes Chapter Creating incentive for Tri An watershed protection Chapter 
Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building 
payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong 
district of Hoa Binh province Vietnam Chapter 5 Local revenue sharing Nha Trang Bay 
Marine Protected Area Vietnam Synthesis and Recommendations References
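One caveat with a find-then-replace loop is that str.replace works on substrings, so a shorter match can clip a longer word that starts with it. The sketch below reduces the approach to its core steps and compares it with a single-pass re.sub. The function names and the sample are hypothetical, chosen only to show the difference.

```python
import re

def replace_each_match(text):
    """Find-then-replace, reduced to its core steps (illustrative only)."""
    out = text
    for word in re.findall(r"[A-Za-z]+\d+", text):
        # str.replace substitutes every occurrence of the substring.
        out = out.replace(word, re.sub(r"\d+$", "", word))
    return out

def strip_trailing_digits(text):
    """Single pass: keep the letters, drop the digits that follow them."""
    return re.sub(r"([A-Za-z]+)\d+", r"\1", text)

# Hypothetical sample where one match is a prefix of another.
sample = "Vietnam2 Vietnam26"

looped = replace_each_match(sample)
single_pass = strip_trailing_digits(sample)

print(looped)       # Vietnam Vietnam6
print(single_pass)  # Vietnam Vietnam
```

On the test data from the question no match is a prefix of another, so both approaches should agree there.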

Answer 4: (score: -1)

You can also use a character range of digits:

re.sub(r"[0-9]", "", line)
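For comparison, a short sketch of what this pattern does on a made-up sample (not the text from the question): it removes every digit, including standalone numbers such as "5" and "24", which the question wants to keep.

```python
import re

# Hypothetical sample with both attached and standalone numbers.
line = "Preface2 Chapter 5 Park 24 Vietnam26"

# [0-9] matches each digit anywhere in the string.
no_digits = re.sub(r"[0-9]", "", line)

# repr() makes the leftover double spaces visible.
print(repr(no_digits))  # 'Preface Chapter  Park  Vietnam'
```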