Question

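I'm using the Beautiful Soup library to parse the content of web pages and print the results to a .txt file. This mostly works, but I can't get rid of certain Unicode character codes that show up in the text output. For example:

"Unable to look into problems at the customer's end."

I've been using the io library to encode the output as UTF-8. I tried changing the encoding to ASCII, but that didn't work either.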

def open_file(file):
    with open((file), encoding='utf-8') as input_data:
        global soup
        soup = BeautifulSoup(input_data)
        return soup

# stuff happens here to parse the html and prepare a list of dictionaries containing the content I want to print.

# this prepares the output

def dict_writer(dict_list, filename):
    with io.open('%s.txt' % filename, 'w', encoding="utf-8") as f:
        for dict in dict_list:
            content = json.dumps(dict.get("content"))
            loc_no = json.dumps(dict.get("location_number"))
            page_no = json.dumps(dict.get("page_number"))
            f.write("\n")
            f.write(content + " " + "(" + page_no + ", " + loc_no + ")" +"\n")
            f.write("\n")

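I read the following article to get a general understanding of how character encoding works. It seems that if I encode the content in the open_file function, I should be able to encode the output with the same standard in the dict_writer function.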

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Answer 1

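The reason you're getting non-ASCII characters written as \u escapes is that you're using json.dumps. As the docs explain, the ensure_ascii parameter defaults to True, and if it is true, "the output is guaranteed to have all incoming non-ASCII characters escaped".

So you could just add ensure_ascii=False to all of your dumps calls.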
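A minimal sketch of the difference, assuming a made-up sample string (the text value below is hypothetical, not taken from the asker's data):

```python
import json

# Hypothetical sample text containing a non-ASCII apostrophe (U+2019).
text = "the customer\u2019s end"

# Default behaviour: ensure_ascii=True, so U+2019 is written as a \u escape.
escaped = json.dumps(text)

# With ensure_ascii=False the character is kept as-is.
readable = json.dumps(text, ensure_ascii=False)

print(escaped)   # the apostrophe appears as the six characters \u2019
print(readable)  # the apostrophe appears as a real character
```

Note that both results are still wrapped in double quotes, because json.dumps produces a JSON string literal rather than plain text.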

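But really, why are you using json.dumps in the first place? The format you're writing isn't a JSON file. In fact, it looks like it's designed for human readers rather than for a computer. So why add the extra quotes, escape characters, and so on to make individual pieces of it JSON-parseable when the file as a whole isn't? It would be a lot simpler, and would probably produce nicer output, if you didn't: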

content = dict.get("content")
loc_no = str(dict.get("location_number"))
page_no = str(dict.get("page_number"))
f.write("\n")
f.write(content + " " + "(" + page_no + ", " + loc_no + ")" +"\n")

…or, even better:

content = dict.get("content")
loc_no = dict.get("location_number")
page_no = dict.get("page_number")
f.write("\n")
f.write("{} ({}, {})\n".format(content, page_no, loc_no))

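While we're at it, calling your dictionary dict is confusing (and it means you can't access the dict constructor in the rest of the function without getting one of those errors that you spend all night debugging and then feel like an idiot about).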
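To illustrate the kind of error that shadowing causes, here is a small sketch. The function name shadowing_demo and the sample record are hypothetical:

```python
def shadowing_demo(dict_list):
    """Collect the error messages raised when 'dict' is shadowed by a loop variable."""
    errors = []
    for dict in dict_list:        # 'dict' now names one record, not the built-in type
        try:
            dict(dict)            # intended: copy the record with the dict constructor
        except TypeError as exc:  # actual: Python tries to call the record itself
            errors.append(str(exc))
    return errors

errors = shadowing_demo([{"content": "x"}])
print(errors)
```

Renaming the loop variable to something like ref avoids the problem entirely.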

Also, why use get("content") here?

If you don't need to worry about the case where there is no content, just use ["content"], or, even simpler, pass the dictionary to format_map:

for ref in refs:
    f.write("\n{content} ({page_number}, {location_number})\n".format_map(ref))
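One thing to keep in mind with format_map is that a missing key raises KeyError. A quick sketch, using hypothetical records:

```python
template = "\n{content} ({page_number}, {location_number})\n"

# Hypothetical record with every key present.
complete = {"content": "some text", "page_number": 3, "location_number": 41}
line = template.format_map(complete)

# Hypothetical record with no "content" key.
incomplete = {"page_number": 7, "location_number": 90}
try:
    template.format_map(incomplete)
    missing = None
except KeyError as exc:
    missing = exc.args[0]  # name of the key that could not be found

print(repr(line))
print(missing)
```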

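If you do need to worry about such cases, you presumably want some appropriate string that means something to a human reader, rather than None. For example: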

for ref in refs:
    content = ref.get("content", "-- content missing --")
    page_no = ref.get("page_number", "N/A")
    loc_no = ref.get("location_number", "N/A")
    f.write("\n{} ({}, {})\n".format(content, page_no, loc_no))
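Putting the pieces together, a self-contained version of the writer might look like the sketch below. The function name write_refs, the sample records, and the temporary file path are all assumptions made for illustration:

```python
import os
import tempfile

def write_refs(refs, path):
    """Write each reference as 'content (page, location)' to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as f:
        for ref in refs:
            # Fall back to human-readable placeholders instead of None.
            content = ref.get("content", "-- content missing --")
            page_no = ref.get("page_number", "N/A")
            loc_no = ref.get("location_number", "N/A")
            f.write("\n{} ({}, {})\n".format(content, page_no, loc_no))

# Hypothetical sample data: one complete record and one with missing keys.
refs = [
    {"content": "the customer\u2019s end", "page_number": 3, "location_number": 41},
    {"page_number": 7},
]

path = os.path.join(tempfile.mkdtemp(), "refs.txt")
write_refs(refs, path)

# Read the file back to check what was actually written.
with open(path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

Because no json.dumps call is involved, the non-ASCII apostrophe goes into the file as a real character, and the UTF-8 encoding passed to open handles the rest.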

How to replace Unicode character codes in TXT output
