Question

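I have a string like this:

"But that gentleman," looking at Darcy, "seemed to think the country was nothing at all."

I want this output:

“But that gentleman,” looking at Darcy, “seemed to think the country was nothing at all.”

Similarly, dumb single quotes should be converted to their curly equivalents. Read about the typographic rules here if you are interested.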

My guess is that this has been solved before, but I cannot find a library or script that does it. SmartyPants (Perl) is the mother of all libraries for this, and there is a Python port. But its output is HTML entities: &#x201C;But that gentleman,&#x201D; I just want a plain string with curly quotes. Any ideas?

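Update

I solved the problem as suggested by Padraig Cunningham:

Use smartypants to make the typographic corrections
Use HTMLParser().unescape to convert the HTML entities back to Unicode

This approach can cause problems if your input text contains HTML entities that you do not want converted, but in my case that is not an issue.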
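The caveat about unwanted conversions can be worked around by unescaping only the numeric references that stand for typographic characters and leaving every other entity untouched. The following is a minimal Python 3 sketch under that assumption; the name unescape_typographic and the set of code points are hypothetical choices made for this illustration, not part of smartypants or the standard library.

```python
import re

# Code points assumed to be the ones of interest: curly single and double
# quotes, en dash, em dash and ellipsis. Adjust the set to your needs.
TYPOGRAPHIC_CODEPOINTS = {0x2018, 0x2019, 0x201C, 0x201D, 0x2013, 0x2014, 0x2026}

# Matches decimal (&#8220;) and hexadecimal (&#x201C;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def unescape_typographic(text):
    """Replace only typographic numeric references with their characters."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if codepoint in TYPOGRAPHIC_CODEPOINTS:
            return chr(codepoint)
        # Any other reference is returned unchanged.
        return match.group(0)

    return _NUMERIC_REF.sub(replace, text)


print(unescape_typographic("&#8220;AT&amp;T&#8221; &#169;"))
```

Named entities such as &amp;amp; and unrelated numeric references such as &amp;#169; pass through unchanged, so existing markup in the input survives.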

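End of update

Can the input be trusted?

The input can only be trusted up to a point. The string may contain an unclosed double quote: "But be that gentleman, looking at Dary. It may also contain an unclosed single quote: 'But be that gentleman, looking at Dary. Finally, it may contain a single quote that is meant as an apostrophe: Don't go there.

I have already implemented an algorithm that tries to close these missing quotes correctly, so that is not part of the question; the code for that step is not reproduced here.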

Answer 1

You can use HTMLParser to unescape the HTML entities returned by smartypants:

In [32]: from HTMLParser import HTMLParser

In [33]: s = "&#x201C;But that gentleman,&#x201D;"

In [34]: print HTMLParser().unescape(s)
“But that gentleman,”
In [35]: HTMLParser().unescape(s)
Out[35]: u'\u201cBut that gentleman,\u201d'
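The session above is Python 2. On Python 3, HTMLParser.unescape was deprecated and has been removed since 3.9, and the standard library offers html.unescape for the same job. A minimal sketch of the equivalent step:

```python
import html

# The entity string from the session above.
escaped = "&#x201C;But that gentleman,&#x201D;"

# html.unescape converts numeric and named character references to text.
unescaped = html.unescape(escaped)
print(unescaped)
```

In Python 3 the result is an ordinary str, so no u'' prefix or explicit decoding is involved.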

To avoid encoding errors, use io.open when opening the file and specify encoding="the_encoding", or decode the string to unicode:

In [11]: s
Out[11]: '&#x201C;But that gentleman,&#x201D;\xe2'

In [12]: print HTMLParser().unescape(s.decode("latin-1"))
“But that gentleman,”â
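As a rough Python 3 illustration of the io.open route, the sketch below writes the same bytes to a temporary file and reads them back with an explicit encoding before unescaping. The temporary file and the choice of latin-1 are assumptions made only so the example is self-contained; use whatever encoding your data really has.

```python
import html
import io
import os
import tempfile

# The byte string from the session above, including the stray 0xE2 byte.
raw = b"&#x201C;But that gentleman,&#x201D;\xe2"

# Write the bytes to a temporary file so there is something to open.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(raw)
    path = handle.name

try:
    # Naming the encoding makes the decoding step explicit.
    with io.open(path, encoding="latin-1") as source:
        text = html.unescape(source.read())
finally:
    os.remove(path)

print(text)
```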

Answer 2

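Since the question was originally asked, Python smartypants has gained an option to output the replacement Unicode characters directly:

u = 256

Output Unicode characters instead of numeric character references, for example, from &#8220; to left double quotation mark (“) (U+201C).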
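A possible way to use this flag is sketched below. It assumes the third-party smartypants package in a version that exposes smartypants.Attr with the set1 and u attributes and accepts the flags as the second argument of smartypants.smartypants; check the documentation of your installed version before relying on it. The wrapper name curl_quotes is made up for this illustration.

```python
def curl_quotes(text):
    """Run smartypants and ask for Unicode output instead of HTML entities."""
    # Imported here so that the sketch can be loaded without the package.
    import smartypants  # third-party: pip install smartypants

    # Assumption: set1 is the default set of transformations and u (256)
    # switches the output from numeric references to Unicode characters.
    attrs = smartypants.Attr.set1 | smartypants.Attr.u
    return smartypants.smartypants(text, attrs)
```

With the package installed, calling curl_quotes on the sentence from the question should return the curly-quoted string without a separate unescaping step.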

Answer 3

Looking through the documentation, it looks like you are stuck with calling .replace on the output of smartypants:

smartypants(r'"smarty" \"pants\"').replace('&#x201C;', '“').replace('&#x201D;', '”')

It might read better if you give the magic strings names:

html_open_quote = '&#x201C;'
html_close_quote = '&#x201D;'
smart_open_quote = '“'
smart_close_quote = '”'
smartypants(r'"smarty" \"pants\"') \
    .replace(html_open_quote, smart_open_quote)  \
    .replace(html_close_quote, smart_close_quote)

Answer 4

Assuming well-formed input, this can be done with regular expressions:

# coding=utf8
import re
sample = '\'Sample Text\' - "But that gentleman," looking at Darcy, "seemed to think the \'country\' was nothing at all." \'Don\'t convert here.\''
print re.sub(r"(\s|^)\'(.*?)\'(\s|$)", r"\1‘\2’\3", re.sub(r"\"(.*?)\"", r"“\1”", sample))

Output:

‘Sample Text’ - “But that gentleman,” looking at Darcy, “seemed to think the ‘country’ was nothing at all.” ‘Don't convert here.’

Here I pick out the single quotes by assuming that they sit at the start or end of a line or have whitespace around them.
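One limitation of the pattern above is that each match consumes the whitespace on both sides, so two single-quoted spans separated by a single space, such as 'a' 'b', are not both converted. A possible variant, written for Python 3 and offered only as a sketch, uses lookarounds so that the surrounding whitespace is checked but not consumed. The function name curl_quotes_regex is hypothetical.

```python
import re


def curl_quotes_regex(text):
    """Curl paired double quotes, then whitespace-delimited single quotes."""
    # Double quotes: pair each opening quote with the next closing one.
    text = re.sub(r'"(.*?)"', r"“\1”", text)
    # Single quotes: the opening quote must not follow a non-space character
    # and the closing quote must not precede one, so the apostrophe in
    # "Don't" is never treated as a delimiter.
    return re.sub(r"(?<!\S)'(.*?)'(?!\S)", r"‘\1’", text)


print(curl_quotes_regex("'a' 'b' and \"c\""))
```

Like the original, this still assumes well-formed input with balanced quotes.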

Answer 5

For the simplest use case, no regular expressions are needed:

def to_smart_quotes(s):
    # Count each quote character per call, so that no state leaks
    # from one call to the next.
    quote_chars_counts = {
        '"': 0,
        "'": 0,
        "`": 0
    }
    output = []

    for c in s:
        if c in quote_chars_counts:
            # Even occurrences open a quotation, odd occurrences close it.
            new_ch = '“' if quote_chars_counts[c] % 2 == 0 else '”'
            quote_chars_counts[c] += 1
        else:
            new_ch = c
        output.append(new_ch)

    return ''.join(output)

If needed, it is trivial to modify this to pull the replacements from a replacement map instead of using literals.
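The replacement-map variant mentioned above might look like the following Python 3 sketch. The names QUOTE_REPLACEMENTS and to_smart_quotes_mapped are made up for this illustration. Note that the parity-based approach still treats every apostrophe as a quote, so it only suits input without apostrophes.

```python
# Maps each dumb quote character to its (opening, closing) replacement.
QUOTE_REPLACEMENTS = {
    '"': ("“", "”"),
    "'": ("‘", "’"),
}


def to_smart_quotes_mapped(s, replacements=QUOTE_REPLACEMENTS):
    """Alternate between opening and closing quotes per quote character."""
    counts = dict.fromkeys(replacements, 0)
    output = []

    for c in s:
        if c in replacements:
            opening, closing = replacements[c]
            # Even occurrences open a quotation, odd occurrences close it.
            output.append(opening if counts[c] % 2 == 0 else closing)
            counts[c] += 1
        else:
            output.append(c)

    return "".join(output)


print(to_smart_quotes_mapped('"a" and \'b\''))
```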

Python: Replace "dumb quotes" with "curly" ones in a string