Question

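I'm running into some problems parsing IM chat logs with Python 2.7. I'm currently using BeautifulSoup.get_text. This usually works, but it sometimes hides the interesting parts. For example: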

<font color="#A82F2F"><font size="2">(3/11/2016 3:11:57 PM)</font> <b>user name:</b></font> <html xmlns='http://jabber.org/protocol/xhtml-im'><body xmlns='http://www.w3.org/1999/xhtml'><p>Have you posted the key to <a href="https://___.edu/sshkeys/?">https://___.edu/sshkeys/?</a></p></body></html><br/>

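In this case I get the "Have you posted the key to" part, but the https:________ part is dropped.

Most (but not all) lines share the same format: date and time, user, then the interesting content.

Is there a better way to parse this so that I get both the text and all the interesting parts?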

Answer 1

You can use find_all:

# Visit every <a> tag that has an href attribute.
for anchor in soup.find_all('a', href=True):
    print("The anchor url={} text={}".format(anchor['href'], anchor.get_text()))

Depending on how you want to output this information, you will need to be more or less clever about it.
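As a rough sketch of what "more clever" could look like, the snippet below splits one log line into timestamp, user, message text, and link targets. It assumes the bs4 package with the built-in "html.parser" backend, and it assumes the layout of the sample line in the question (timestamp in a font tag with size="2", user name in a b tag, message inside body). The names parse_line and SAMPLE are made up for illustration. Lines that do not match that layout fall back to the text of the whole line.

```python
from bs4 import BeautifulSoup

# The sample line from the question, split across string literals for readability.
SAMPLE = (
    '<font color="#A82F2F"><font size="2">(3/11/2016 3:11:57 PM)</font> '
    '<b>user name:</b></font> '
    "<html xmlns='http://jabber.org/protocol/xhtml-im'>"
    "<body xmlns='http://www.w3.org/1999/xhtml'>"
    '<p>Have you posted the key to '
    '<a href="https://___.edu/sshkeys/?">https://___.edu/sshkeys/?</a></p>'
    '</body></html><br/>'
)


def parse_line(markup):
    """Split one chat-log line into its parts (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")

    # Assumed layout: timestamp in <font size="2">, user name in <b>.
    stamp = soup.find("font", attrs={"size": "2"})
    user = soup.find("b")

    # The message lives in <body>; if a line has no <body>, use the whole
    # line, in which case the text also contains the timestamp and user.
    body = soup.find("body") or soup

    return {
        "timestamp": stamp.get_text(strip=True).strip("()") if stamp else None,
        "user": user.get_text(strip=True).rstrip(":") if user else None,
        # Join the text fragments with a space so words around tags stay apart.
        "text": body.get_text(" ", strip=True),
        # Keep the href values separately in case they differ from the link text.
        "links": [a["href"] for a in body.find_all("a", href=True)],
    }


record = parse_line(SAMPLE)
print(record["text"])
```

Keeping the href values in their own list means a URL is still available when the visible link text is shortened or replaced by a label.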
