Question

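I am using Python 2.4 and I am having some problems with Unicode regular expressions. I have tried to put together a very clear and concise example of my problem. It looks as though Python has some trouble recognizing the different character encodings, or there is a problem with my understanding. Thank you very much for taking a look!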

#!/usr/bin/python
#
# This is a simple python program designed to show my problems with regular expressions and character encoding in python
# Written by Brian J. Stinar
# Thanks for the help! 

import urllib # To get files off the Internet
import chardet # To identify character encodings
import re # Python Regular Expressions
#import ponyguruma # Python Oniguruma Regular Expressions - this can be uncommented if you feel like messing with it, but I have the same issue no matter which RE's I'm using

rawdata = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
print (chardet.detect(rawdata))
#print (rawdata)

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2') # Let's grab this as text
UTF_8_encoded = ISO_8859_2_encoded.encode('utf-8') # and encode the text as UTF-8
print(chardet.detect(UTF_8_encoded)) # Looks good

# This totally doesn't work, even though you can see UNSUBSCRIBE in the HTML
# Eventually, I want to recognize the entire physical address and UNSUBSCRIBE above it
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE)
print (str(re_UNSUB_amsterdam.match(UTF_8_encoded)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on UTF-8")
print (str(re_UNSUB_amsterdam.match(rawdata)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on raw data")

re_amsterdam = re.compile(".*Adobe.*", re.UNICODE)
print (str(re_amsterdam.match(rawdata)) + "\t--- RE for 'Adobe' on raw data") # However, this works?!?
print (str(re_amsterdam.match(UTF_8_encoded)) + "\t--- RE for 'Adobe' on UTF-8")

'''
# In addition, I tried this regular expression library, with much the same unsatisfactory result
new_re = ponyguruma.Regexp(".*UNSUBSCRIBE.*")
if new_re.match(UTF_8_encoded) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on UTF-8")
else:
   print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on UTF-8")

if new_re.match(rawdata) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on raw data")
else:
   print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on raw data")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(UTF_8_encoded) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on UTF-8")
else:
   print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on UTF-8")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(rawdata) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on raw data")
else:
   print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on raw data")
'''

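I'm working on a substitution project and am having difficulty with non-ASCII encoded files. This problem is part of a larger project: eventually I would like to replace the text with other text. (I have this working for ASCII, but I can't identify occurrences in other encodings yet.) Thanks again.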

http://brian-stinar.blogspot.com

-Brian J. Stinar -

Answer 1

You probably want either to enable the DOTALL flag or to use the search method instead of the match method. That is:

# DOTALL makes . match newlines 
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL)

or:

# search will find matches even if they aren't at the start of the string
... re_UNSUB_amsterdam.search(foo) ...

These will give you different results, but both should give you a match. (See which one is the kind you want.)
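To make the difference concrete, here is a small sketch on a made-up string (a stand-in for the downloaded page, not the real data). The names text, plain and dotall are only for illustration, and the snippet is written to run under Python 3:

```python
import re

# Hypothetical stand-in for the page: the word sits after the first newline.
text = u"Adobe Systems\nClick here to UNSUBSCRIBE\nAmsterdam"

plain = re.compile(u".*UNSUBSCRIBE.*")
dotall = re.compile(u".*UNSUBSCRIBE.*", re.DOTALL)

# match() is anchored at position 0 and "." stops at "\n", so this fails.
no_match = plain.match(text)

# With DOTALL, "." crosses newlines and the match covers the whole string.
dotall_hit = dotall.match(text).group(0)

# search() scans forward; without DOTALL it returns only the matching line.
search_hit = plain.search(text).group(0)

print(no_match)
print(repr(dotall_hit))
print(repr(search_hit))
```

So DOTALL hands you the entire text, while search without DOTALL hands you just the line containing the word.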

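As an aside: you seem to be getting encoded text (i.e. bytes) and decoded text (characters) confused. This isn't uncommon, especially in pre-3.x Python. In particular, this is very suspicious: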

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')

You are de-coding with ISO-8859-2, not en-coding, so call this variable "decoded". (Why not "ISO_8859_2_decoded"? Because ISO_8859_2 is an encoding. A decoded string no longer has an encoding.)

The rest of your code tries to match against rawdata and UTF_8_encoded (both encoded strings) when it should be using the decoded unicode string.
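A rough sketch of the distinction, using a made-up sample string rather than the real page (the variable names are my own, and this is Python 3 syntax, where the bytes/text split is explicit):

```python
import re

# Hypothetical sample text; \u0161 is "s with caron", which ISO-8859-2 can encode.
original = u"Po\u0161ta: UNSUBSCRIBE"

raw = original.encode("iso-8859-2")   # bytes, the kind of thing read() returns
decoded = raw.decode("iso-8859-2")    # text (code points); no encoding attached
as_utf8 = decoded.encode("utf-8")     # bytes again, just in a different encoding

# The same text occupies a different number of bytes in each encoding.
lengths = (len(raw), len(decoded), len(as_utf8))

# Match a text pattern against the decoded text, not against either byte string.
hit = re.search(u"UNSUBSCRIBE", decoded)

print(lengths)
print(hit.group(0))
```

The decoded value is the one thing that stays the same whichever encoding the bytes arrived in, which is why it is the right thing to search.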

Answer 2

This might help: http://www.daa.com.au/pipermail/pygtk/2009-July/017299.html

Answer 3

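With the default flag settings, .* does not match newlines. UNSUBSCRIBE appears only once, after the first newline. Adobe appears before the first newline. You could fix that by using re.DOTALL.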

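However, you haven't inspected what the Adobe match actually contains: it is 1478 bytes wide! Turn on re.DOTALL and it (and the corresponding UNSUBSCRIBE pattern) will match the whole text!!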

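You definitely need to lose the trailing .* since you're not interested in it and it slows the match down. You should also lose the leading .* and use search() instead of match().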
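A quick sketch of what that buys you, on a made-up two-line string (not the real page; names are illustrative, Python 3 syntax):

```python
import re

# Hypothetical stand-in for the page contents.
text = u"Copyright Adobe Systems\nUNSUBSCRIBE here"

# The wrapped pattern swallows the whole first line, not just the word.
greedy = re.match(u".*Adobe.*", text)

# A bare pattern with search() pins down exactly the word and its position.
bare = re.search(u"Adobe", text)

# The same bare approach finds a word on a later line without needing DOTALL.
unsub = re.search(u"UNSUBSCRIBE", text)

print(repr(greedy.group(0)))
print(bare.span())
print(unsub.start())
```

With the bare pattern the match span tells you where the word is, which is what you will need later for substitutions.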

The re.UNICODE flag is of no use to you in this case; read the manual and see what it does.
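For illustration, the flag only changes what classes such as \w, \d, \s and \b accept. The sketch below is written for Python 3, where the Unicode-aware behaviour is already the default for text patterns and re.ASCII switches it off, so re.ASCII here plays the role of "no re.UNICODE" in Python 2. The sample strings are made up:

```python
import re

# Hypothetical Czech word: Z-caron, l, u, t-caron, o, u, c-caron, k, y-acute.
word = u"\u017dlu\u0165ou\u010dk\xfd"

# The flag decides whether non-ASCII letters count as \w.
unicode_aware = re.search(r"\w+", word, re.UNICODE).group(0)
ascii_only = re.search(r"\w+", word, re.ASCII).group(0)

# A literal pattern uses none of those classes, so the flag changes nothing.
text = u"please UNSUBSCRIBE"
with_unicode = re.search(u"UNSUBSCRIBE", text, re.UNICODE).span()
with_ascii = re.search(u"UNSUBSCRIBE", text, re.ASCII).span()

print(unicode_aware == word)
print(ascii_only)
print(with_unicode == with_ascii)
```

Since ".*UNSUBSCRIBE.*" contains none of the affected classes, the flag makes no difference to whether it matches.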

Why are you transcoding your data into UTF-8 and searching on that? Just leave it as Unicode.

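Others have pointed out that you need to decode numeric character references such as &#1234; before doing any serious work on your data ... but didn't mention the named entities such as &laquo; with which your data is peppered :-)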
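As a rough sketch of that decoding step: Python 2.4 has no one-call helper for this that I would vouch for, but in Python 3.4 and later the standard library's html.unescape handles both numeric references and named entities. The sample string below is made up:

```python
import html  # standard library, Python 3.4+

# Hypothetical snippet containing named entities and a numeric reference.
sample = u"&laquo;UNSUBSCRIBE&raquo; &#1234;"

# Named entities and numeric references both become real characters.
unescaped = html.unescape(sample)

print(repr(unescaped))
```

Do this after decoding the bytes to text and before searching, so that your patterns see the actual characters.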

Answer 4

Your question is about regular expressions, but your problem can possibly be solved without them; instead, use the standard string replace method.

import urllib
raw = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
decoded = raw.decode('iso-8859-2')
type(decoded)    # decoded is now <type 'unicode'>
substituted = decoded.replace(u'UNSUBSCRIBE', u'whatever you prefer')

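If nothing else, the above shows how to deal with the encoding: simply decode to a unicode string and work with that. Note, though, that this only works well when you have one or very few replacements (and those replacements are not pattern-based), because replace() can handle only one replacement at a time.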

For both string-based and pattern-based replacements, you can do something like this to carry out multiple replacements in one pass:

import re
REPLACEMENTS = ((u'[aA]dobe', u'!twiddle!'),
                (u'UNS.*IBE', u'@wobble@'),
                (u'Dublin', u'Sydney'))

def replacer(m):
    return REPLACEMENTS[list(m.groups()).index(m.group(0))][1]

r = re.compile('|'.join('(%s)' % t[0] for t in REPLACEMENTS))
substituted = r.sub(replacer, decoded)
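Here is a self-contained variant you can run without fetching the page. It is only a sketch: the decoded string is made up, and I have swapped the group lookup for m.lastindex, which is not what the code above does. Both versions assume the individual patterns contain no capturing groups of their own. Python 3 syntax:

```python
import re

REPLACEMENTS = ((u"[aA]dobe", u"!twiddle!"),
                (u"UNS.*IBE", u"@wobble@"),
                (u"Dublin", u"Sydney"))

def replacer(m):
    # Each alternative is wrapped in exactly one group, so lastindex is the
    # 1-based number of the alternative that matched.
    return REPLACEMENTS[m.lastindex - 1][1]

# Builds (pattern1)|(pattern2)|(pattern3) as a single text pattern.
r = re.compile(u"|".join(u"(%s)" % pattern for pattern, _ in REPLACEMENTS))

# Hypothetical stand-in for the decoded page text.
decoded = u"adobe, Dublin: UNSUBSCRIBE"
substituted = r.sub(replacer, decoded)

print(substituted)
```

All three replacements happen in a single pass over the text, so an earlier replacement's output is never re-scanned by a later pattern.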

Python Unicode regular expressions
