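I have been trying to scrape information from HTML pages that I saved earlier.
I have been working with BeautifulSoup and Selenium to automate this. Right now I am trying to extract data from a forum, using an HTML file stored on my PC.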
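from lxml import html

# html.parse() reads the file with lxml's HTML parser and returns an ElementTree.
# (html.fromstring() expects markup as a string, not an already-parsed tree.)
tree = html.parse(r'C:\...\testFile.html')
# Positional XPath down to the paragraph text of the post body
comment = tree.xpath('//*[@id="region-main"]/div/div[3]/div[1]/div[2]/div[2]/div/div/p/text()')
print(comment)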
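I want to get the data from the forum comments as text, so that I can save it as a text file later.
Here is an example comment, with all personal data removed: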
<div class="indent"><a id="p170083"></a><div class="forumpost clearfix" role="region" aria-label="Re: JS por JSOR"><div class="row header clearfix"><div class="left picture"><a href="http://SiteExemplo/user/view.php?id=40297&course=38000"><img src="http://SiteExemplo/theme/image.php/adaptable/core/1560540164/u/f1" alt="Imagem de JSOR" title="Imagem de JSOR" class="userpicture defaultuserpic" width="100" height="100" /></a></div><div class="topic"><div class="subject" role="heading" aria-level="2">Re:JS </div><div class="author" role="heading" aria-level="2">por <a href="http://SiteExemplo/user/view.php?id=40297&course=38000">JSOR</a> - terça, 16 abr 2019, 20:54</div></div></div><div class="row maincontent clearfix"><div class="left"><div class="grouppictures"> </div></div><div class="no-overflow"><div class="content"><div class="posting fullpost"><p>THIS IS THE TEXT, I WAS TRYING TO RETRIEVE.</p><div class="attachedimages"></div></div></div></div></div><div class="row side"><div class="left"> </div><div class="options clearfix"><div class="commands"><a href="http://siteExample/mod/forum/discuss.php?d=42778#p170083">Link direto</a> | <a href="http://SiteExemplo/mod/forum/discuss.php?d=42778#p98677">Mostrar principal</a> | <a href="http://SiteExemplo/mod/forum/post.php?edit=170083">Editar</a> | <a href="http://SiteExemplo/mod/forum/post.php?delete=170083">Excluir</a> | <a href="http://SiteExemplo/mod/forum/post.php?reply=170083#mformforum">Responder</a></div></div></div></div>
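For a post shaped like the sample above, selecting by class name may be more robust than a long positional path. The following is only a minimal sketch under assumptions: `SAMPLE` is a hypothetical, trimmed-down stand-in for one saved post (it reuses the class names from the sample), and `extract_comments` is an illustrative helper name, not part of lxml.

```python
from lxml import html

# Hypothetical, trimmed-down stand-in for one saved forum post.
# It reuses the class names that appear in the sample comment.
SAMPLE = """
<div class="forumpost clearfix">
  <div class="row maincontent clearfix">
    <div class="no-overflow"><div class="content">
      <div class="posting fullpost">
        <p>THIS IS THE TEXT, I WAS TRYING TO RETRIEVE.</p>
        <div class="attachedimages"></div>
      </div>
    </div></div>
  </div>
</div>
"""

def extract_comments(markup):
    """Return the text of every <p> found inside a div whose class contains 'posting'."""
    tree = html.fromstring(markup)
    # contains() is a substring match, so it also matches "posting fullpost"
    paragraphs = tree.xpath('//div[contains(@class, "posting")]//p')
    # text_content() also collects text nested in inline tags such as <b> or <a>
    return [p.text_content().strip() for p in paragraphs]

comments = extract_comments(SAMPLE)
print(comments)
```

For a file on disk, the same XPath can be run on the tree returned by `html.parse(path)`. The resulting list can then be joined with `'\n'.join(comments)` and written to a text file.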