Question

I have a text file in this format:

b'Chapter 1 \xe2\x80\x93 BlaBla'
b'Boy\xe2\x80\x99s Dead.'

I want to read this content and convert it to

Chapter 1 - BlaBla
Boy's Dead.

and write the result back to the same file. I tried encoding and decoding with print (line.encode("UTF-8", "replace")), but that doesn't work.

Answer 1

strings = [
    b'Chapter 1 \xe2\x80\x93 BlaBla',
    b'Boy\xe2\x80\x99s Dead.',
]

for string in strings:
    print(string.decode('utf-8', 'ignore'))

--output:--
Chapter 1 – BlaBla
Boy’s Dead.
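
The decoded text still contains the typographic en dash and curly apostrophe, whereas the question asks for a plain hyphen and a straight apostrophe. Here is a minimal sketch of one way to get there, assuming only these two characters need mapping. The names ASCII_PUNCTUATION and to_plain are illustrative, not part of the original answer:

```python
# Map typographic punctuation to ASCII equivalents after decoding.
# Assumption: only the en dash and right single quote need replacing.
ASCII_PUNCTUATION = str.maketrans({
    "\u2013": "-",   # EN DASH
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
})


def to_plain(raw: bytes) -> str:
    """Decode UTF-8 bytes and swap typographic punctuation for ASCII."""
    return raw.decode("utf-8").translate(ASCII_PUNCTUATION)


for raw in (b'Chapter 1 \xe2\x80\x93 BlaBla', b'Boy\xe2\x80\x99s Dead.'):
    print(to_plain(raw))
```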

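"...and write the result back to the same file."

No programming language can really insert or delete bytes in the middle of a file. The usual approach is to write the output to a new file, delete the old file, and rename the new file to the old file's name. Python's fileinput module can carry out that process for you: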

import fileinput as fi
import sys

with open('data.txt', 'wb') as f:
    f.write(b'Chapter 1 \xe2\x80\x93 BlaBla\n')
    f.write(b'Boy\xe2\x80\x99s Dead.\n')

with open('data.txt', 'rb') as f:
    for line in f:
        print(line)

with fi.input(
        files = 'data.txt', 
        inplace = True,
        backup = '.bak',
        mode = 'rb') as f:

    for line in f:
        string = line.decode('utf-8', 'ignore')
        print(string, end="")

~/python_programs$ python3.4 prog.py
b'Chapter 1 \xe2\x80\x93 BlaBla\n'
b'Boy\xe2\x80\x99s Dead.\n'

~/python_programs$ cat data.txt
Chapter 1 – BlaBla
Boy’s Dead.
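
For comparison, the write-a-new-file-then-rename procedure can also be spelled out by hand. This is a rough sketch rather than a replacement for the fileinput version. The helper name rewrite_file and its transform parameter are made up for illustration, and it assumes the file is UTF-8 text:

```python
import os
import tempfile


def rewrite_file(path, transform):
    """Replace the file at `path` with transform(line) for each line."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the converted lines to a temporary file next to the original.
    with open(path, encoding='utf-8') as src, tempfile.NamedTemporaryFile(
            'w', encoding='utf-8', dir=directory, delete=False) as dst:
        for line in src:
            dst.write(transform(line))
    # Rename the new file over the old one.
    os.replace(dst.name, path)


# Example: swap the en dash for a plain hyphen in every line.
# rewrite_file('data.txt', lambda line: line.replace('\u2013', '-'))
```

Creating the temporary file in the same directory keeps the final rename on one filesystem. Unlike fileinput with backup='.bak', this sketch keeps no backup copy.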

Edit:

import fileinput as fi
import re

pattern = r"""
    \\           #Match a literal backslash...
    x            #Followed by an x...
    [a-f0-9]{2}  #Followed by any hex character, 2 times
"""

repl = ''

with open('data.txt', 'w') as f:
    print(r"b'Chapter 1 \xe2\x80\x93 BlaBla'", file=f)
    print(r"b'Boy\xe2\x80\x99s Dead.'", file=f)

with open('data.txt') as f:
    for line in f:
        print(line.rstrip())  #Output goes to terminal window

with fi.input(
        files = 'data.txt',
        inplace = True,
        backup = '.bak') as f:

    for line in f:
        line = line.rstrip()[2:-1]
        new_line = re.sub(pattern, repl, line, flags=re.X)
        print(new_line)  #Writes to file, not your terminal window

~/python_programs$ python3.4 prog.py
b'Chapter 1 \xe2\x80\x93 BlaBla'
b'Boy\xe2\x80\x99s Dead.'

~/python_programs$ cat data.txt
Chapter 1  BlaBla
Boys Dead.
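
Stripping the escape sequences throws away the dash and the apostrophe. If the goal is to recover those characters, one alternative (not part of the original answer) is to let Python parse each line as a bytes literal with ast.literal_eval and then decode the result. Here is a minimal sketch, assuming every line is a well-formed b'...' literal. The name literal_line_to_text is a hypothetical helper:

```python
import ast


def literal_line_to_text(line):
    """Turn the text "b'...'" into the string those bytes encode."""
    value = ast.literal_eval(line.strip())  # safely evaluates the literal
    if not isinstance(value, bytes):
        raise ValueError('expected a bytes literal: %r' % line)
    return value.decode('utf-8')


print(literal_line_to_text(r"b'Chapter 1 \xe2\x80\x93 BlaBla'"))
print(literal_line_to_text(r"b'Boy\xe2\x80\x99s Dead.'"))
```

ast.literal_eval accepts only Python literals, so it does not execute arbitrary code from the file the way eval() would.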

Your file doesn't contain binary data, so you can read (or write) it in text mode. It's just a matter of escaping things correctly.

Here is the first part:

print(r"b'Chapter 1 \xe2\x80\x93 BlaBla'", file=f)

Python converts certain backslash escape sequences in a string literal into something else. One of the backslash escape sequences that Python converts has the format:

\xNN #=> e.g. \xe2

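That backslash escape sequence is four characters long, but Python converts it into a single character.

However, I needed each of the four characters to be written to the sample file I created. To keep Python from converting the backslash escape sequence into one character, you can escape the leading '\' with another '\':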

\\xNN

But being lazy, I didn't want to go through your strings and escape each backslash escape sequence by hand, so I used:

r"...."

An r string escapes all the backslashes for you. As a result, Python writes all four characters of the \xNN sequence to the file.
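
A quick check of the three spellings makes the difference concrete. The snippet is only an illustration, and the variable names are arbitrary:

```python
converted = "\xe2"    # escape sequence converted: one character (U+00E2)
escaped = "\\xe2"     # backslash escaped by hand: four characters
raw = r"\xe2"         # raw string: the same four characters

print(len(converted))   # 1
print(len(escaped))     # 4
print(raw == escaped)   # True
```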

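The next problem is replacing a backslash in a string using a regex, which I think is your problem. When a file contains a \, Python reads it into the string as a single literal backslash, which is written as \\ in a string literal (and shown as \\ by repr()). So if the file contains the four characters: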

\xe2

Python reads them into a string that, written as a literal, is:

"\\xe2"

which looks like this when printed:

\xe2

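Bottom line: if you see a '\' when you print a string, the string contains a literal backslash, which has to be written as '\\' in a string literal. To see exactly what is in a string, always use repr().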

string = "\\xe2"
print(string)
print(repr(string))

--output:--
\xe2
'\\xe2'

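Note that when the output has quotes around it, you are seeing everything in the string. When the output has no quotes around it, you can't be sure exactly what the string contains.

To construct a regex pattern that matches a literal backslash in a string, the short answer is: you need twice as many backslashes as you would think. With the string: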

"\\xe2"

you would think the pattern is:

pattern = "\\x"

But based on the doubling rule, you actually need:

pattern = "\\\\x"

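Remember r strings? If you use an r string for the pattern, you can write what looks reasonable, and the r string will escape all the backslashes, doubling them:

pattern = r"\\x" #=> equivalent to "\\\\x"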
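
To see the doubling rule at work, the two spellings can be compared directly and then used in a substitution. This is a small sketch on one of the sample lines; the variable names are chosen only for illustration:

```python
import re

line = r"Boy\xe2\x80\x99s Dead."   # text containing literal backslashes

# Both patterns are the same regex: backslash, 'x', two hex digits.
doubled = "\\\\x[a-f0-9]{2}"
raw = r"\\x[a-f0-9]{2}"

print(doubled == raw)           # True
print(re.sub(raw, "", line))    # Boys Dead.
```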

Python: converting a text file of binary literals to plain text
