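I have some text data that I want to clean up with regular expressions. However, some words in the text are immediately followed by digits that I want to remove.
For example, one line of the text is:
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons learnt from the RUPES project12 Payment for environmental service and it potential and example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams policy and programmes17 Chapter Creating incentive for Tri An watershed protection20 Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
The first word in the text above should be "Preface", not "Preface2", and so on.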
line = re.sub(r"[A-Za-z]+(\d+)", "", line)
This removes the word along with the digits. What I get is:
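Lessons learnt from the RUPES Payment for environmental service and it potential and example in Chapter Integrating payment for ecosystem service into Vietnams policy and Chapter Creating incentive for Tri An watershed Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong district of Hoa Binh province Chapter 5 Local revenue sharing Nha Trang Bay Marine Protected Area Synthesis and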
How can I capture only the digits that immediately follow a word?
Answer 0 (score: 1)
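You can try a lookbehind assertion to check that a word character precedes the digits. To force the regex to match only digits at the end of a word, add a word boundary (\b). Note that Python's re module only accepts fixed-width lookbehinds, so the assertion uses \w rather than \w+:
re.sub(r'(?<=\w)\d+\b', '', line)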
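Hope this helps.
Edit: Sorry for the glitch. As the comments pointed out, this also matches digits that do not follow a word. That is because (sorry again) \w matches alphanumeric characters, not just letters. Depending on what you want to remove, you can use the positive version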
re.sub(r'(?<=[a-zA-Z])\d+\b', '', line)
which only checks for an English letter before the digits (you can add more characters to the [a-zA-Z] list), or the negative version
re.sub(r'(?<![\d\s])\d+\b', '', line)
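which matches the digits as long as they are not preceded by \d (a digit) or \s (whitespace). Be aware that this also matches digits that follow punctuation.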
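To make the difference between the two lookbehind variants concrete, here is a minimal sketch that runs both on a made-up sample string. The sample and the variable names `positive` and `negative` are assumptions for illustration and are not part of the question's data.

```python
import re

# Hypothetical sample: word+digits, standalone numbers, and punctuation+digits.
sample = "Preface2 Park 24 Chapter 5 (see p.12) References32"

# Positive lookbehind: the digits must directly follow an ASCII letter.
positive = re.sub(r"(?<=[a-zA-Z])\d+\b", "", sample)

# Negative lookbehind: the digits must not directly follow a digit or whitespace.
negative = re.sub(r"(?<![\d\s])\d+\b", "", sample)

print(positive)  # Preface Park 24 Chapter 5 (see p.12) References
print(negative)  # Preface Park 24 Chapter 5 (see p.) References
```

Both variants leave the standalone "24" and "5" alone. Only the negative version strips the "12" after the period, which is the punctuation caveat mentioned above.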
Answer 1 (score: 1)
You can capture the text part and then replace the whole match with the captured part. It is simply:
re.sub(r"([A-Za-z]+)\d+", r"\1", line)
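As a quick check of the capture-group approach, the sketch below applies the same pattern to a shortened excerpt of the question's text. The shortened `line` and the name `cleaned` are assumptions chosen for brevity.

```python
import re

# Shortened excerpt of the sample line (an assumption for brevity).
line = "Preface2 Contributors4 Chapter 5 Local revenue sharing Vietnam28"

# ([A-Za-z]+) captures the word; \d+ consumes the digits glued to it.
# The replacement r"\1" puts back only the captured word.
cleaned = re.sub(r"([A-Za-z]+)\d+", r"\1", line)

print(cleaned)  # Preface Contributors Chapter 5 Local revenue sharing Vietnam
```

The "5" in "Chapter 5" survives because the pattern requires the digits to follow the letters with nothing in between.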
Answer 2 (score: 0)
Try this:
line = re.sub(r"([A-Za-z]+)(\d+)", "\\2", line)  # keep only the number
line = re.sub(r"([A-Za-z]+)(\d+)", "\\1", line)  # keep only the word
line = re.sub(r"([A-Za-z]+)(\d+)", r"\2", line)  # same as the first one
line = re.sub(r"([A-Za-z]+)(\d+)", r"\1", line)  # same as the second one
\\1 refers to the captured word and \\2 to the captured number. See: How to use python regex to replace using captured group?
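If numbered backreferences are hard to read, named groups are one alternative. This is a sketch only: the group names `word` and `number` and the two-word sample are assumptions for illustration.

```python
import re

# Two-word sample (an assumption for illustration).
line = "Preface2 Contributors4"

# Same pattern as above, with hypothetical group names instead of numbers.
pattern = re.compile(r"(?P<word>[A-Za-z]+)(?P<number>\d+)")

# \g<name> in the replacement refers to a named group.
words_only = pattern.sub(r"\g<word>", line)
numbers_only = pattern.sub(r"\g<number>", line)

print(words_only)    # Preface Contributors
print(numbers_only)  # 2 4
```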
Answer 3 (score: 0)
Below is a code sample that may solve your problem.
Here is the snippet:
import re

# A function that takes the test data as input and returns the
# desired result as stated in your question.
def transform(data):
    """Strip the trailing digits from words that end with a number."""
    # First, let's construct a pattern matching the words we're looking for.
    pattern1 = r"([A-Za-z]+\d+)"
    # Let's construct another pattern matching the trailing digits, which
    # will be removed from each matched word.
    pattern2 = r"\d+$"
    # Find all matching words.
    matches = re.findall(pattern1, data)
    # Build the replacement for each word by stripping its digits.
    replacements = []
    for match in matches:
        replacements.append(re.sub(pattern2, '', match))
    # Pair each word with its replacement for use with the string
    # method 'replace'.
    changers = zip(matches, replacements)
    # Apply each replacement in turn; str.replace returns a new string,
    # so the result must be reassigned.
    output = data
    for changer in changers:
        output = output.replace(*changer)
    # The work is done, we can return the result.
    return output
For testing purposes, we run the function above on your test data:
data = """
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons
learnt from the RUPES project12 Payment for environmental service and it potential and
example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams
policy and programmes17 Chapter Creating incentive for Tri An watershed protection20
Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter
Building payment mechanism for carbon sequestration in forestry a pilot project in Cao
Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang
Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
"""
result = transform(data)
print(result)
The result is as follows:
Preface Contributors Abrreviations Acknowledgements Pes terminology Lessons learnt from
the RUPES project Payment for environmental service and it potential and example in
Vietnam Chapter Integrating payment for ecosystem service into Vietnams policy and
programmes Chapter Creating incentive for Tri An watershed protection Chapter
Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building
payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Vietnam Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam Synthesis and Recommendations References
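Answer 4 (score: -1)
You can also match a range of digits (note that this removes every digit in the line, including standalone numbers such as the 5 in "Chapter 5"):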
re.sub(r"[0-9]", "", line)
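To see how the plain character class differs from the lookbehind pattern suggested in Answer 0, the sketch below runs both on a short made-up string. The sample and variable names are assumptions for illustration.

```python
import re

# Hypothetical sample with glued digits and standalone numbers.
sample = "Chapter 5 Vietnam28 Park 24"

# Character class: removes every digit, wherever it appears.
all_digits_removed = re.sub(r"[0-9]", "", sample)

# Lookbehind variant: removes only digits that directly follow a letter.
glued_digits_removed = re.sub(r"(?<=[a-zA-Z])\d+\b", "", sample)

print(repr(all_digits_removed))    # 'Chapter  Vietnam Park '
print(repr(glued_digits_removed))  # 'Chapter 5 Vietnam Park 24'
```

The character class also drops the chapter number and the standalone "24", leaving stray spaces behind, so it only fits when no number in the text needs to be kept.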