Question

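I have been trying to scrape information from previously saved HTML pages.

I have been working with BeautifulSoup and Selenium to try to automate this. I am now working with an HTML file saved on my PC and trying to extract data from a forum.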

from lxml import html
from lxml import etree

root = etree.parse(r'C:\...\testFile.html')
tree = html.fromstring(root)

comment = tree.xpath('//*[@id="region-main"]/div/div[3]/div[1]/div[2]/div[2]/div/div/p/text()')

print(comment)

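I want to get the data from the forum comments as text, so that I can save it to a text file later.

Here is a sample comment, with all personal data removed: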

<div class="indent"><a id="p170083"></a><div class="forumpost clearfix" role="region" aria-label="Re: JS por JSOR"><div class="row header clearfix"><div class="left picture"><a href="http://SiteExemplo/user/view.php?id=40297&amp;course=38000"><img src="http://SiteExemplo/theme/image.php/adaptable/core/1560540164/u/f1" alt="Imagem de JSOR" title="Imagem de JSOR" class="userpicture defaultuserpic" width="100" height="100" /></a></div><div class="topic"><div class="subject" role="heading" aria-level="2">Re:JS </div><div class="author" role="heading" aria-level="2">por <a href="http://SiteExemplo/user/view.php?id=40297&amp;course=38000">JSOR</a> - terça, 16 abr 2019, 20:54</div></div></div><div class="row maincontent clearfix"><div class="left"><div class="grouppictures">&nbsp;</div></div><div class="no-overflow"><div class="content"><div class="posting fullpost"><p>THIS IS THE TEXT, I WAS TRYING TO RETRIEVE.</p><div class="attachedimages"></div></div></div></div></div><div class="row side"><div class="left">&nbsp;</div><div class="options clearfix"><div class="commands"><a href="http://siteExample/mod/forum/discuss.php?d=42778#p170083">Link direto</a> | <a href="http://SiteExemplo/mod/forum/discuss.php?d=42778#p98677">Mostrar principal</a> | <a href="http://SiteExemplo/mod/forum/post.php?edit=170083">Editar</a> | <a href="http://SiteExemplo/mod/forum/post.php?delete=170083">Excluir</a> | <a href="http://SiteExemplo/mod/forum/post.php?reply=170083#mformforum">Responder</a></div></div></div></div>

Error "Specification mandate value for attribute async" when parsing HTML with lxml
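
A likely cause, offered as a sketch rather than a confirmed diagnosis: `etree.parse` uses lxml's strict XML parser by default, and saved web pages often contain valueless attributes such as `<script async>`, which are valid HTML but not valid XML. The example below uses a shortened stand-in for the saved page (the `SAMPLE` string, the `x.js` script name, and the `extract_comments` helper are made up for illustration) and parses it with `lxml.html` instead. It also swaps the long positional XPath for a class-based one, assuming the comment text always sits in a `<p>` inside a `div` whose class contains `posting`, as in the sample comment.

```python
from lxml import etree, html

# Shortened, hypothetical stand-in for the saved forum page.
# The <script async> tag is an assumption about what triggers the error.
SAMPLE = """
<html><head><script async src="x.js"></script></head>
<body><div id="region-main">
<div class="forumpost clearfix"><div class="row maincontent clearfix">
<div class="no-overflow"><div class="content"><div class="posting fullpost">
<p>THIS IS THE TEXT, I WAS TRYING TO RETRIEVE.</p>
<div class="attachedimages"></div>
</div></div></div></div></div>
</div></body></html>
"""


def extract_comments(markup):
    """Return the text of every <p> found inside a 'posting' div."""
    # html.fromstring uses the lenient HTML parser, which accepts
    # valueless attributes such as async.
    tree = html.fromstring(markup)
    paragraphs = tree.xpath('//div[contains(@class, "posting")]//p')
    return [p.text_content().strip() for p in paragraphs]


# The strict XML parser rejects the same markup.
try:
    etree.fromstring(SAMPLE)
    xml_parse_failed = False
except etree.XMLSyntaxError:
    xml_parse_failed = True

comments = extract_comments(SAMPLE)
print(xml_parse_failed)
print(comments)
```

For a file on disk, `html.parse(path).getroot()` should give an equivalent tree to run the same XPath against, where `path` is the location of the saved page.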

0 answers: