Regex works on pythex but not in Python 2.7: finding unicode representations with a regex

Asked: 2016-01-23 04:17:23

Tags: python regex python-2.7

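I have a strange regex problem: my regex works on pythex but not in Python itself. I'm currently using 2.7. I want to remove every unicode instance that looks like \x92, and there are a lot of them (for example 'Thomas Bradley \x93Brad\x94 Garza').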

import re, requests

def purify(string):
    strange_issue = r"""\\t<td><font size=2>G<td><a href="http://facebook.com/KilledByPolice/posts/625590984135709" target=new><font size=2><center>facebook.com/KilledByPolice/posts/625590984135709\t</a><td><a href="http://www.orlandosentinel.com/news/local/lake/os-leesburg-officer-involved-shooting-20130507"""
    unicode_chars_rgx = r"[\\][x]\d+"
    unicode_matches = re.findall(unicode_chars_rgx, string)
    bad_list = [strange_issue]
    bad_list.extend(unicode_matches)
    for item in bad_list:
        string = string.replace(item, "")
    return string

name_rgx = r"(?:[<][TDtd][>])|(?:target[=]new[>])(?P<the_deceased>[A-Z].*?)[,]"

urls = {2013: "http://www.killedbypolice.net/kbp2013.html",
        2014: "http://www.killedbypolice.net/kbp2014.html",
        2015: "http://www.killedbypolice.net/" }

names_of_the_dead = []

for url in urls.values():
    response = requests.get(url)
    content = response.content
    people_killed_by_police_that_year_alone = re.findall(name_rgx, content)
    for dead_person in people_killed_by_police_that_year_alone:
        names_of_the_dead.append(purify(dead_person))

dead_americans_as_string = ", ".join(names_of_the_dead)
print("RIP, {} since 2013:\n".format(len(names_of_the_dead))) # 3085! :)
print(dead_americans_as_string)



In [95]: unicode_chars_rgx = r"[\\][x]\d+"

In [96]: testcase = "Myron De\x92Shawn May"

In [97]: x = purify(testcase)

In [98]: x
Out[98]: 'Myron De\x92Shawn May'

In [103]: match = re.match(unicode_chars_rgx, testcase)

In [104]: match

How can I match these \x00-style characters? Thanks.

1 answer:

Answer 0 (score: 1)

Certainly not by trying to find something resembling "\\x00".
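To illustrate the point: in the question's test case, the `\x92` inside the string literal is an escape sequence for a single byte with value 0x92, and the `\x92` in the output is only how the repr displays that byte. The pattern `[\\][x]\d+` looks for a literal backslash followed by `x` and digits, and those characters never occur in the data. It presumably matched on pythex because the text pasted there contained the four literal characters; that is an assumption, since the question does not show the pythex input. A minimal sketch, using byte strings so it should behave the same on Python 2.7 and Python 3 (the names `literal_escape_rgx` and `raw_byte_rgx` are made up for illustration):

```python
import re

# The question's test case as a byte string: "\x92" is ONE byte (0x92).
testcase = b"Myron De\x92Shawn May"

# The same text with the escape written out literally (raw string):
# here the data really contains backslash, "x", "9", "2".
# Assumption: this is roughly what was pasted into pythex.
pasted_text = br"Myron De\x92Shawn May"

# Pattern from the question, as bytes: a literal backslash, "x", then digits.
literal_escape_rgx = re.compile(br"[\\][x]\d+")

# Pattern that targets the byte value itself.
raw_byte_rgx = re.compile(b"\x92")

print(len(b"\x92"))                              # length of the escaped byte
print(literal_escape_rgx.findall(testcase))     # search the real data
print(literal_escape_rgx.findall(pasted_text))  # search the literal text
print(raw_byte_rgx.findall(testcase))           # search for the byte itself
```

The question's pattern only finds something in `pasted_text`; in `testcase` it has nothing to match, which is why `purify` returned the string unchanged.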

If you are willing to destroy data:

>>> re.sub('[\x7f-\xff]', '', "Myron De\x92Shawn May")
'Myron DeShawn May'
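The same idea could be wrapped in a small helper and applied to each name in place of `purify`. This is only a sketch: `strip_non_ascii` and `NON_ASCII_BYTES` are hypothetical names, and it works on byte strings (which is what `response.content` from requests returns), so it should run on both Python 2.7 and Python 3. It is lossy: the curly quotes around a nickname disappear entirely rather than becoming plain quotes.

```python
import re

# Every byte from 0x7f through 0xff: DEL plus everything outside ASCII.
NON_ASCII_BYTES = re.compile(b"[\x7f-\xff]")

def strip_non_ascii(raw):
    """Return the byte string with all bytes >= 0x7f removed (lossy)."""
    return NON_ASCII_BYTES.sub(b"", raw)

print(strip_non_ascii(b"Myron De\x92Shawn May"))
print(strip_non_ascii(b"Thomas Bradley \x93Brad\x94 Garza"))
```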

More work, but it preserves as much of the text as possible:

>>> import unidecode
>>> unidecode.unidecode("Myron De\x92Shawn May".decode('cp1251'))
"Myron De'Shawn May"