Question

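I'm trying to delete the text between these two delimiters: '<' and '>'. I'm reading the contents of an email and then writing that content to a .txt file. I get a lot of junk between those two delimiters, plus blank lines between the lines in my .txt file. How do I get rid of this? Below is what my script reads from the data written to my .txt file: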

 First Name</td>

                <td bgcolor='white' style='padding:5px

 !important;'>Austin</td>

                </tr><tr>

                <td bgcolor='#f9f9f9' style='padding:5px !important;'

 valign='top' width=170>Last Name</td>

Below is the code I currently use to read from the .txt file and remove the blank lines:

    # Get file contents
    fd = open('emailtext.txt','r')
    contents = fd.readlines()
    fd.close()

    new_contents = []

    # Get rid of empty lines
    for line in contents:
        # Strip whitespace; leaves nothing if the line was just "\n"
        if not line.strip():
            continue
        # We got something, save it
        else:
            new_contents.append(line)

    for element in new_contents:
        print(element)

Here is the expected output:

 First Name     Austin      


 Last Name      Jones

Answer 1

You can use BeautifulSoup:

    from bs4 import BeautifulSoup

    markup = "<td bgcolor='#f9f9f9' style='padding:5px !important;' valign='top' width=170>Last Name</td>"
    soup = BeautifulSoup(markup, 'html.parser')
    soup.get_text()

Answer 2

You should consider using a regular expression and the re.sub function:

import re
# Pass flags by keyword: the fourth positional argument of re.sub is count.
print(re.sub(r'<.*?>', '', text, flags=re.DOTALL))

That said, the usual advice not to parse HTML with a custom parser still applies.
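
To illustrate the regex approach on a fragment like the one in the question, here is a minimal sketch, assuming Python 3. The names `strip_tags` and `non_blank_lines` are illustrative and not part of the original answer. With `re.DOTALL`, `.` also matches newlines, so a tag that wraps across several lines is removed in one piece; the blank lines left behind are then filtered out.

```python
import re

# Sample modeled on the data shown in the question: one tag spans several lines.
raw = """ First Name</td>

                <td bgcolor='white' style='padding:5px

 !important;'>Austin</td>

                </tr><tr>
"""

# Non-greedy match from '<' to the nearest '>'; DOTALL lets '.' cross newlines.
TAG_RE = re.compile(r'<.*?>', re.DOTALL)


def strip_tags(text):
    """Remove every <...> span, including tags that wrap across lines."""
    return TAG_RE.sub('', text)


def non_blank_lines(text):
    """Return the stripped lines of text, skipping lines that are empty."""
    return [line.strip() for line in text.splitlines() if line.strip()]


cleaned = non_blank_lines(strip_tags(raw))
print(cleaned)  # ['First Name', 'Austin']
```

This pattern will misfire if an attribute value itself contains a `>` character, which is one reason a real HTML parser is usually the safer choice.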

Answer 3

You need to assign the result of line.strip() to a variable and append that to new_contents. Otherwise, you just save the unstripped line.

for line in contents:
    line = line.strip()
    if not line:
        continue
    # We got something, save it
    else:
        new_contents.append(line)

Answer 4

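It looks like you are trying to remove all HTML tags from the text. You could do it by hand, but tags can be complex and can even span multiple lines.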

My suggestion is to use BeautifulSoup, which is designed specifically for handling XML and HTML:

import bs4

# extract content... then
new_content = bs4.BeautifulSoup(content, 'html.parser').text
print(new_content)

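The bs4 module has been extensively tested, copes with many corner cases, and will greatly reduce the amount of code you have to write yourself...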
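
If installing bs4 is not an option, the same parser-based idea can be sketched with the standard library's `html.parser` module. This is a swapped-in alternative, not what the answer above uses, and it assumes Python 3. The names `TextExtractor` and `extract_text` are hypothetical. The parser calls `handle_data` for the text between tags, so multi-line tags and blank lines never reach the output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only runs such as the blank lines between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(markup):
    """Return the non-blank text chunks of an HTML fragment, in order."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of input
    return parser.chunks


# Fragment copied from the question; tags wrap across lines.
sample = """ First Name</td>

                <td bgcolor='white' style='padding:5px

 !important;'>Austin</td>

                </tr><tr>

                <td bgcolor='#f9f9f9' style='padding:5px !important;'

 valign='top' width=170>Last Name</td>
"""

chunks = extract_text(sample)
print(chunks)  # ['First Name', 'Austin', 'Last Name']
```

Each chunk is one cell's text, so labels and values can then be paired or joined into whatever line layout the .txt file needs.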

How do I remove the text between two delimiters (including blank lines)?