I'm trying to get all the text from elements with a particular class, but it returns an empty list:
>>> soup.find_all(' dataRow odd')
[]
HTML:
<tr class=" dataRow odd" onblur="if (window.hiOff){hiOff(this);}"
onfocus="if (window.hiOn){hiOn(this);}" onmouseout="if (window.hiOff){hiOff(this);}"
onmouseover="if (window.hiOn){hiOn(this);}"><td class='actionColumn'> </td><th scope="row" class=" dataCell ">
<a href="/a0I9000000hHJIN?btdid=0019000001piFE9">textexttext</a></th><td class=" dataCell ">Active</td><td class=" dataCell ">
<a href="/a089000001nOvG8?btdid=0019000001piFE9">BIG TEXT</a></td>
<td class=" dataCell ">TEXTTEXTTEXT</td><td class=" dataCell ">TEXTTEXTTEXT</td>
<td class=" dataCell "> </td><td class=" dataCell "> </td><td class=" dataCell DateElement">8/02/2019</td></tr>
I'm trying to scrape all the text in that markup. But when I run my code it returns [], as if it found nothing.
import requests, bs4, re
html = open('2.html')
soup = bs4.BeautifulSoup(exampleFile, "lxml")
duh = soup .find_all(' dataRow odd')
print (duh)
Where am I going wrong? Also, ideally the code would print each separate piece of text on its own line.
Answer 0 (score: 0)
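find_all(' dataRow odd') returns nothing because the first argument to find_all() is a tag name, not a class. Querying by the class dataRow odd instead yields the surrounding <tr>, which contains all the other elements such as <td> and <a>. You can grab just the text by accessing the .text attribute like this, which gives you one big chunk of text rather than HTML: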
for d in duh:
    print(d.text)
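To make the difference between a tag-name search and a class search concrete, here is a minimal sketch. It assumes a trimmed-down inline snippet (the hypothetical sample_html below) rather than the asker's actual file, and it uses a CSS selector plus get_text() with a separator, which is one possible way to get each piece of text on its own line.

```python
import bs4

# Hypothetical stand-in for the asker's page, trimmed to a few cells.
sample_html = '''
<table><tr class=" dataRow odd"><th class=" dataCell "><a href="#">textexttext</a></th>
<td class=" dataCell ">Active</td><td class=" dataCell DateElement">8/02/2019</td></tr></table>
'''

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# The first positional argument of find_all() is a tag name, so this looks
# for elements whose tag is literally " dataRow odd" and finds none.
by_tag_name = soup.find_all(' dataRow odd')

# A CSS selector matches on the individual class names, so the extra
# spaces inside the class attribute do not matter.
rows = soup.select('tr.dataRow.odd')

# separator='\n' with strip=True puts each non-empty text fragment on its own line.
row_text = rows[0].get_text(separator='\n', strip=True)
print(row_text)
```

The selector form is used here only because it does not depend on how the class string is spelled in the markup; find_all() with a class filter is the approach the answer itself uses.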
Instead of that, you can get the cell elements inside the <tr> individually (<td>, plus <th> for the header cell) and then take .text from each:
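import bs4

with open('test.html') as html:
    soup = bs4.BeautifulSoup(html, "html.parser")  # Python's built-in HTML parser
# bs4 splits the class attribute on whitespace, so match without the leading space
duh = soup.find_all('tr', {'class': 'dataRow odd'})  # using ktb's suggestion from comments
for d in duh:
    # only the cell elements; a bare find_all() also returns the nested <a> tags and repeats their text
    tds = d.find_all(['th', 'td'])
    for td in tds:
        cleaned = td.text.strip()  # remove surrounding whitespace and newlines
        if cleaned != '':
            print(cleaned)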
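If the page has several such rows, the same per-cell idea can be wrapped in a small helper. This is only a sketch under assumptions: extract_rows and sample_html are hypothetical names made up for illustration, and the selector tr.dataRow is assumed to cover both the odd and even rows.

```python
import bs4

def extract_rows(html, selector='tr.dataRow'):
    """Return one list of non-empty cell texts per matching row."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.select(selector):
        # recursive=False keeps only the row's direct <th>/<td> children,
        # so text inside a nested <a> is not reported twice.
        cells = tr.find_all(['th', 'td'], recursive=False)
        texts = [cell.get_text(strip=True) for cell in cells]
        rows.append([t for t in texts if t])  # drop blank cells
    return rows

# Hypothetical two-row sample modelled on the markup in the question.
sample_html = '''
<table>
<tr class=" dataRow odd"><td class="actionColumn"> </td><th class=" dataCell "><a href="#">textexttext</a></th><td class=" dataCell ">Active</td><td class=" dataCell DateElement">8/02/2019</td></tr>
<tr class=" dataRow even"><th class=" dataCell "><a href="#">other row</a></th><td class=" dataCell "> </td><td class=" dataCell ">Inactive</td></tr>
</table>
'''

result = extract_rows(sample_html)
for row in result:
    print('\n'.join(row))
```

Returning a list per row rather than printing directly keeps the row boundaries, which can help if the text later needs to go into a CSV or a table.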