Question

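This is not a duplicate question: I have searched Stack Overflow and tried the various answers there, without success.

I am converting a .docx file to plain text in Python, but when I print the result in CMD, the apostrophe character shows up as strange characters (e.g. canΓÇÖt). Here is my code: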

if file.endswith('.docx'):
        docx = zipfile.ZipFile(fullpath)
        content = docx.read('word/document.xml')
        cleaned = re.sub('<(.|\n)*?>','',content)
        text=unescape(cleaned)
        newtext = text.replace("'", " ")
        print newtext

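Answers on Stack Overflow led me to add the lines text = unescape(cleaned) and text.replace("'", " "), but neither made any difference.

How can I remove the apostrophe from a string variable? Or, better yet, how can I make sure the apostrophe is displayed correctly?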

Answer 1

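My guess is that you are not looking at 'some_text_here' but at ‘some_text_here’, that is, single curly (or "smart") quotes. That is why replacing the plain apostrophe "'" has no effect.

Try this: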

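if file.endswith('.docx'):
    ...
    cleaned = re.sub('<(.|\n)*?>','',content)
    # Python 2: decode the UTF-8 bytes, then delete both curly single quotes
    cleaner = cleaned.decode('utf-8').translate({0x2018: None, 0x2019: None})
    # Python 3: decode content before the re.sub, then use:
    # cleaner = cleaned.translate(str.maketrans({'‘':'','’':''}))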
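If you would rather keep the apostrophes than delete them, one option is to map the curly quotes to the plain ASCII apostrophe. Below is a minimal Python 3 sketch of that idea. The function name docx_to_text and the CURLY_TO_ASCII table are made up for illustration, and the code assumes word/document.xml is UTF-8 encoded, which is the usual case but is not checked here.

```python
import re
import zipfile
from html import unescape

# Hypothetical helper table: map both curly single quotes to the ASCII apostrophe.
CURLY_TO_ASCII = str.maketrans({"\u2018": "'", "\u2019": "'"})


def docx_to_text(path_or_file):
    """Return the text of a .docx with curly single quotes made ASCII."""
    with zipfile.ZipFile(path_or_file) as docx:
        # read() returns bytes; assumption: document.xml is UTF-8 encoded.
        xml = docx.read("word/document.xml").decode("utf-8")
    # Crude tag stripping, in the same spirit as the regex in the question.
    stripped = re.sub(r"<[^>]*>", "", xml)
    return unescape(stripped).translate(CURLY_TO_ASCII)
```

After this normalization the apostrophes are plain ASCII, so they should print the same way regardless of the console's code page. Other non-ASCII characters in the document are left alone and may still be garbled by the console.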

For reference:

>>> ord("‘") # left single smart quote
# 8216
>>> ord("’") # right single smart quote
# 8217
>>> ord("'") # single apostrophe
# 39
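The garbled output in the question is consistent with this guess. The short Python 3 sketch below encodes the right curly quote as UTF-8 and then decodes those bytes the way a console using code page 437 would. Code page 437 is an assumption here (the question does not say which code page CMD is using), and the variable names are only illustrative.

```python
RIGHT_QUOTE = "\u2019"  # right single curly quote, code point 8217

# In UTF-8 the curly quote is three bytes: E2 80 99.
utf8_bytes = ("can" + RIGHT_QUOTE + "t").encode("utf-8")
print(utf8_bytes)  # b'can\xe2\x80\x99t'

# Assumption: the console interprets those bytes as code page 437.
garbled = utf8_bytes.decode("cp437")
# garbled is now the string from the question: "can", then Γ, Ç, Ö, then "t".
```

So the three odd characters are the three UTF-8 bytes of a single curly quote, each shown as a separate code page 437 character.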
