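I am relatively new to Python, and also new to nltk and regex. I have been looking for guidance but have not figured this out. I just want to remove any x or X that follows an integer (it should always be an integer) in my text, so that only the number is left. The code for what I need to do after the x or X is removed already works, so now I am trying to extend it to strip the x or X from numbers but not from ordinary text (words such as exited and matrix below).
For example, if my text string is: 'It was a beautiful day and 710x birds exited their habitats and flew overhead. 130X of them dove down and landed on the grass while 21X of them were shot by 7 hunters. 9x birds vanished into the matrix. The remaining 550x birds kept flying away.'
I want this:
'It was a beautiful day and 710 birds exited their habitats and flew overhead. 130 of them dove down and landed on the grass while 21 of them were shot by 7 hunters. 9 birds vanished into the matrix. The remaining 550 birds kept flying away.'
I don't know whether this is best handled with regex (regular expressions), with nltk (Natural Language Toolkit), or just with some kind of if statement. I tokenize all of the text, which can run to more than 20,000 to 30,000 tokens/words extracted from PDF files, but I am happy to remove those x's either while the text is still one huge string or after it has been split into tokens. Either is fine. Thank you very much for your help...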
Answer 0 (score: 5)
This matches an x or X using a lookbehind assertion (the preceding character must be a digit) and replaces it with nothing.
re.sub(r'(?<=\d)[xX]', '', s)
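As a minimal sketch, the lookbehind can be tried on a shortened version of the sample sentence from the question. The helper name strip_x_after_digits is hypothetical and used only for illustration; it is not part of the re module.

```python
import re

def strip_x_after_digits(s):
    # (?<=\d) asserts that the previous character is a digit without
    # consuming it, so only the x or X itself is matched and removed.
    return re.sub(r'(?<=\d)[xX]', '', s)

sample = '710x birds exited. 130X of them landed. 9x birds vanished into the matrix.'
print(strip_x_after_digits(sample))
```

Words such as exited and matrix are left alone because their x is preceded by a letter, not a digit.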
Answer 1 (score: 1)
Try this.
import re
text = 'It was a beautiful day and 710x birds exited their habitats and flew overhead. 130X of them dove down and landed on the grass while 21X of them were shot by 7 hunters. 9x birds vanished into the matrix. The remaining 550x birds kept flying away.'
re.sub(r'(\d+)[xX]', r'\1', text)
# >>> 'It was a beautiful day and 710 birds exited their habitats and flew overhead. 130 of them dove down and landed on the grass while 21 of them were shot by 7 hunters. 9 birds vanished into the matrix. The remaining 550 birds kept flying away.'
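re.sub performs a regular-expression substitution. The first argument is the regex to search for, and the second argument is the replacement string.
r'(\d+)[xX]'
consists of
\d+ <= a sequence of 1 or more digits
[xX] <= a single x or X
() <= captures the match so it can be used afterwards
r'\1'
refers to the text captured by the first group, i.e. the digits.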
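A slightly stricter variant of the same capture-group idea, offered as a sketch only: it assumes the x or X to drop always ends the token, and adds a trailing \b so that strings such as 0x1F are left untouched. The names PATTERN and drop_trailing_x are hypothetical.

```python
import re

# Assumption: the x or X to remove is the last character of its token.
# (\d+) captures the digits, [xX] matches the suffix, and \b requires
# a word boundary right after it.
PATTERN = re.compile(r'(\d+)[xX]\b')

def drop_trailing_x(text):
    # \1 puts back only the captured digits.
    return PATTERN.sub(r'\1', text)

print(drop_trailing_x('21X of them were shot; address 0x1F stays.'))
```

Whether the extra boundary is wanted depends on the data; for the sample text in the question both patterns give the same result.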
Answer 2 (score: 0)
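def parseNumeric(data):
    # data is a list of tokens; it is modified in place.
    for idx, each in enumerate(data):
        # Keep only the digit characters of this token.
        noX = ''.join(ch for ch in each if ch.isdigit())
        # Any token containing a digit is reduced to its digits, so this
        # also drops punctuation attached to a number (e.g. "21X." -> "21").
        if noX != '':
            data[idx] = noX
    return " ".join(data)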
theData = "It was a beautiful day and 710x birds exited their habitats and flew overhead. 130X of them dove down and landed on the grass while 21X of them were shot by 7 hunters. 9x birds vanished into the matrix. The remaining 550x birds kept flying away."
print("\n BEFORE \n")
print(theData)
print("\n AFTER \n")
print(parseNumeric(theData.split()))
Check the DEMO. I know this is not the best solution, but I hope it helps.
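For text that has already been tokenized, as the question mentions, a per-token check is another option. This is a sketch under the assumption that punctuation has been split into separate tokens (as an nltk-style tokenizer typically does); clean_tokens and TOKEN_RE are hypothetical names.

```python
import re

TOKEN_RE = re.compile(r'(\d+)[xX]')

def clean_tokens(tokens):
    cleaned = []
    for tok in tokens:
        # fullmatch succeeds only if the whole token is digits followed
        # by a single x or X; every other token is kept unchanged.
        m = TOKEN_RE.fullmatch(tok)
        cleaned.append(m.group(1) if m else tok)
    return cleaned

print(clean_tokens(['710x', 'birds', 'exited', '9X', 'matrix', '7']))
```

Unlike stripping every non-digit character, this leaves tokens such as 7th or 3x4 as they are.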