Question

I have a simple problem that is driving me crazy, and it seems to come from how Python handles Unicode characters.

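I have LaTeX tables stored on my disk (very similar to http://www.jwe.cc/downloads/table.tex), and I want to apply some regular expressions to them so that hyphens - (\u2212) are replaced by en-dashes – (alt 0150 or \u2013).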

I am using the following function, which performs two different regex substitutions.

import re
import glob

def mychanger(fileName):
  with open(fileName,'r') as file:
    str = file.read()
    str = str.decode("utf-8")
    str = re.sub(r"((?:^|[^{])\d+)\u2212(\d+[^}])","\\1\u2013\\2", str).encode("utf-8")
    str = re.sub(r"(^|[^0-9])\u2212(\d+)","\\1\u2013\\2", str).encode("utf-8")
  with open(fileName,'wb') as file:
    file.write(str)

myfile = glob.glob("C://*.tex")
for file in myfile: mychanger(file)

Unfortunately, this does not change anything.

However, it does work if I use a non-Unicode character such as $ instead of \u2013, which suggests that the regex itself is correct.

I am lost here. I also tried re.sub(ur"((?:^|[^{])\d+)\u2212(\d+[^}])","\\1\u2013\\2", str).encode("utf-8"), but it still changes nothing.

What is wrong here? Thanks!

Answer 1

Your example file actually contains HYPHEN-MINUS (U+002D), not MINUS SIGN (U+2212).
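One way to check which dash-like characters a file really contains is to count them and print their Unicode names. This is only a sketch, written for Python 3; the helper name dash_like_characters and the sample string are made up for illustration, and in practice you would pass in the decoded contents of the .tex file.

```python
import unicodedata
from collections import Counter

def dash_like_characters(text):
    """Count the three candidate characters and label each with its code point and name."""
    candidates = {"\u002d", "\u2212", "\u2013"}
    counts = Counter(ch for ch in text if ch in candidates)
    return {
        "U+%04X %s" % (ord(ch), unicodedata.name(ch)): n
        for ch, n in counts.items()
    }

# Illustrative sample containing one of each character.
sample = "1-2 & \u22123 & 4\u20135"
report = dash_like_characters(sample)
for label, n in sorted(report.items()):
    print(label, n)
```

If only the U+002D entry shows up for your file, a pattern written with \u2212 has nothing to match.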

Even if it did contain the right character, you would still be hitting all the classic beginner problems with Unicode in Python 2.x:

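Decoding and encoding inline. You are actually encoding twice!
Using a Unicode escape (\u2212) in a string that is not a Unicode string
Using the r raw modifier unnecessarily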
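The first two points can be illustrated with a short sketch. It is written for Python 3, where text and bytes are separate types, so it shows the underlying idea rather than reproducing the exact Python 2 behavior; all variable names are made up for illustration.

```python
import re

text = "5\u22123"                  # '5', MINUS SIGN, '3' as decoded text
encoded = text.encode("utf-8")     # bytes: b'5\xe2\x88\x923'

# 1. Encoding produces bytes; there is nothing left to encode a second time.
can_encode_again = hasattr(encoded, "encode")

# 2. Without escape processing, \u2212 is six separate characters
#    (backslash, u, 2, 2, 1, 2), which is what a Python 2 byte-string
#    literal "\u2212" holds. That sequence does not occur in the text.
six_chars = "\\u2212"
literal_found = six_chars in text

# 3. A Unicode pattern applied to decoded text does find the character.
match = re.search("\u2212", text)

print(can_encode_again, len(six_chars), literal_found, match is not None)
```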

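My advice is to remove all the decoding and encoding and let Python do it for you. The io module backports the Python 3.x behavior and decodes the file for you. I have also renamed str to my_str to avoid clashing with Python's own str class.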

import re
import glob
import io

def mychanger(fileName):
    with io.open(fileName,'r', encoding="utf-8") as file:
        my_str = file.read()

        my_str = re.sub(u"((?:^|[^{])\d+)\u002d(\d+[^}])", u"\\1\u2013\\2", my_str)
        my_str = re.sub(u"(^|[^0-9])\u002d(\d+)",          u"\\1\u2013\\2", my_str)

    with io.open(fileName, 'w', encoding="utf-8") as file:
        file.write(my_str)

myfile = glob.glob("C://*.tex")

for file in myfile: mychanger(file)
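For readers on Python 3, the same idea might look roughly like the sketch below. It is not part of the original answer: the names hyphens_to_en_dashes and change_file are hypothetical, and splitting the substitution from the file handling is just one possible design that makes the regexes easy to try on plain strings.

```python
import re
from pathlib import Path

EN_DASH = "\u2013"

def hyphens_to_en_dashes(text):
    """Apply the answer's two substitutions to an already-decoded string."""
    # Hyphen between two digit runs, e.g. 10-20.
    # The re module itself understands \u002d inside a raw pattern.
    text = re.sub(r"((?:^|[^{])\d+)\u002d(\d+[^}])", "\\1" + EN_DASH + "\\2", text)
    # Hyphen in front of a number and not preceded by a digit, e.g. -5.
    text = re.sub(r"(^|[^0-9])\u002d(\d+)", "\\1" + EN_DASH + "\\2", text)
    return text

def change_file(path):
    """Hypothetical helper: rewrite one UTF-8 encoded file in place."""
    p = Path(path)
    original = p.read_text(encoding="utf-8")
    p.write_text(hyphens_to_en_dashes(original), encoding="utf-8")

print(hyphens_to_en_dashes("10-20 and -5"))
```

The en dash is inserted into the replacement as a real character because replacement templates in re.sub are not guaranteed to accept \u escapes the way patterns do.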

For a detailed explanation of Python 2.x and Unicode, see How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"
