Question

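I've run into a strange problem: some of the files I'm parsing have odd characters injected in the middle of a line, which breaks my parsing of what readline() returns. The line looks normal when viewed in a text editor, but readline() reads an '=' and two '\n' characters in the middle of the IP.

For example: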

Normal return from readline():
"IP Address: xxx.xxx.xxx.xxx"

Broken readline() return:
"IP Address: xxx.xxx.xxx="

The next two lines after that being:
""
".xxx"

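Any ideas on how to fix this? I can't really control what might be causing it; I just need to deal with it without getting too crazy.

The relevant function, for reference (I know it's a mess):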

import codecs
import re

IP_PATTERN = r'[0-9]+(?:\.[0-9]+){3}'

def getIP(em):
    # The with block closes the file on every return path.
    with codecs.open(em, encoding='latin1') as ce:
        # Skip ahead to the line containing the hash label.
        iplabel = ""
        while "Torrent Hash Value: " not in iplabel:
            iplabel = ce.readline()

        # The IP is on the next line, or the one after a "File Size" line.
        ipraw = ce.readline()
        if "File Size" in ipraw:
            ipraw = ce.readline()

        ip = re.findall(IP_PATTERN, ipraw)
        if ip:
            return ip[0]

        # Fall back to the following line before giving up.
        ipraw = ce.readline()
        ip = re.findall(IP_PATTERN, ipraw)
        if ip:
            return ip[0]
        return "No IP found in: " + ipraw

Answer 1

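Solved. If anyone else runs into a similar problem: save each line as a string, join them together, and re-run the regex on the result with a substitution (re.sub()), keeping the \r and \n characters in mind. My solution is a bit spaghetti, but it avoids running unnecessary regexes on every file.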
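
A minimal sketch of that join-and-substitute idea, not the code from this answer. It assumes the only artifact is an "=" followed by one or more \r or \n characters, so it also swallows a blank line that directly follows the "="; the helper name `strip_soft_breaks` and the sample address are made up for illustration.

```python
import re

# Assumption: a stray "=" at the end of a line, plus the line-ending
# characters after it, is the only debris to remove.
SOFT_BREAK = re.compile(r'=[\r\n]+')
IP_PATTERN = re.compile(r'[0-9]+(?:\.[0-9]+){3}')

def strip_soft_breaks(lines):
    """Join raw lines and delete '=' plus trailing line endings (hypothetical helper)."""
    return SOFT_BREAK.sub('', ''.join(lines))

# Three readline() results shaped like the broken case in the question.
lines = ["IP Address: 192.0.2=\r\n", "\r\n", ".17\r\n"]
cleaned = strip_soft_breaks(lines)
ips = IP_PATTERN.findall(cleaned)
print(ips)
```

This only needs to run when the first regex attempt on a single line finds nothing, which keeps the extra work off files that are not affected.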


Answer 2

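It looks like at least some of the emails you are processing have been encoded as quoted-printable.

This encoding makes 8-bit character data transportable over 7-bit (ASCII-only) systems, but it also enforces a maximum line length of 76 characters. It does so by inserting soft line breaks, each consisting of an "=" followed by an end-of-line marker.

Python provides the quopri module for encoding to and decoding from quoted-printable. Decoding quoted-printable data removes these soft line breaks.

As an example, let's use the first paragraph of the question.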

>>> import quopri
>>> s = """I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.)."""

>>> # Encode to latin-1 as quopri deals with bytes, not strings.
>>> bs = s.encode('latin-1')

>>> # Encode
>>> encoded = quopri.encodestring(bs)
>>> # Observe the "=\n" inserted into the text.
>>> encoded
b"I'm writing a small script to run through large folders of copyright notice=\n emails and finding relevant information (IP and timestamp). I've already f=\nound ways around a few little formatting hurdles (sometimes IP and TS are o=\nn different lines, sometimes on same, sometimes in different places, timest=\namps come in 4 different formats, etc.)."

>>> # Printing without decoding from quoted-printable shows the "=".
>>> print(encoded.decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice=
 emails and finding relevant information (IP and timestamp). I've already f=
ound ways around a few little formatting hurdles (sometimes IP and TS are o=
n different lines, sometimes on same, sometimes in different places, timest=
amps come in 4 different formats, etc.).

>>> # Decode from quoted-printable to remove soft line breaks.
>>> print(quopri.decodestring(encoded).decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.).
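
The same decoding can be applied to a line shaped like the broken one in the question. This is a small sketch with a made-up address, assuming the file really is quoted-printable and the break is a plain "=" followed by a newline:

```python
import quopri

# An address split by a soft line break: "=" followed by a line ending.
raw = b"IP Address: 192.0.2=\n.17\n"

# decodestring works on bytes; decode to str afterwards.
decoded = quopri.decodestring(raw).decode('latin-1')
print(repr(decoded))
```

After decoding, the address sits on a single line again, so a per-line regex can match it.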

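To decode correctly, the entire message body needs to be processed, which conflicts with your readline-based approach. One way around this is to load the decoded string into a buffer: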

import io
import quopri

def getIP(em):
    with open(em, 'rb') as f:
        bs = f.read()
    decoded = quopri.decodestring(bs).decode('latin-1')

    ce = io.StringIO(decoded)
    iplabel = ""
    while  not ("Torrent Hash Value: " in iplabel):
        iplabel = ce.readline()
        ...

If your files contain complete emails, including headers, the tools in the email module will handle this decoding automatically.

import email
from email import policy

with open('message.eml') as f:
    s = f.read()
msg = email.message_from_string(s, policy=policy.default)
body = msg.get_content()
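
As a self-contained illustration, the sketch below parses a small hand-written message instead of a file. The headers, addresses, and body are invented for the example, and it assumes the message declares `Content-Transfer-Encoding: quoted-printable`, since that header is what tells the email module to decode the body. `get_body()` is used so that the same lines also pick the plain-text part of a multipart message.

```python
import email
from email import policy

# Hypothetical message; real notice emails will carry more headers.
raw = (
    "From: sender@example.com\n"
    "Subject: Notice\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="iso-8859-1"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "IP Address: 192.0.2=\n"
    ".17\n"
)

msg = email.message_from_string(raw, policy=policy.default)

# Pick the plain-text part; for a simple message this is the message itself.
part = msg.get_body(preferencelist=('plain',))
body = part.get_content() if part is not None else ""
print(body)
```

The decoded body can then be wrapped in io.StringIO if the existing readline-based parsing should stay as it is.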
