I have the following code that tries to return data from some HTML, but I can't get it to return what I need...
import urllib2
from bs4 import BeautifulSoup
from time import sleep

def getData():
    htmlfile = open('C:/html.html', 'rb')
    html = htmlfile.read()
    soup = BeautifulSoup(html)
    items = soup.find_all('div', class_="blocks")
    for item in items:
        links = item.find_all('h3')
        for link in links:
            print link

getData()
This returns the following list:
<h3>
<a href="http://www.mywebsite.com/titles" title="Click for details(x)">
TITLE STUFF HERE (YES)
</a>
</h3>
<h3>
<a href="http://www.mywebsite.com/titles" title="Click for details(x)">
TITLE STUFF HERE (MAYBE)
</a>
</h3>
I would like it to return only the titles: TITLE STUFF HERE (YES)
and TITLE STUFF HERE (MAYBE)
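The other thing I would like to be able to use is the
soup.find_all("a", limit=2)
function, but instead of "limit" returning the first two results, I want it to return only the second link... so a "select" feature rather than a limit? (Does such a feature exist?)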
Answer 0 (score: 5)
import urllib2
from bs4 import BeautifulSoup
from time import sleep

def getData():
    htmlfile = open('C:/html.html', 'rb')
    html = htmlfile.read()
    soup = BeautifulSoup(html)
    items = soup.find_all('div', class_="blocks")
    for item in items:
        # Search for the <a> tags rather than the <h3> tags
        links = item.find_all('a')
        for link in links:
            # Keep only links whose direct parent is an <h3>
            if link.parent.name == 'h3':
                # .text is the link's text content without the tags
                print(link.text)

getData()
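You could also look up all the links from the start and check that the parent is an h3 and that the parent's parent is a div with the class "blocks".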
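As a rough sketch of that alternative, the snippet below searches for every link first and then checks its parent and grandparent. The helper name get_titles and the SAMPLE_HTML string are hypothetical: the markup is modeled on the output shown in the question, since the real page is not shown in full, and the "html.parser" argument is an assumption about which parser is available.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the output shown in the question.
SAMPLE_HTML = """
<div class="blocks">
  <h3><a href="http://www.mywebsite.com/titles" title="Click for details(x)">
    TITLE STUFF HERE (YES)
  </a></h3>
  <h3><a href="http://www.mywebsite.com/titles" title="Click for details(x)">
    TITLE STUFF HERE (MAYBE)
  </a></h3>
</div>
<p><a href="http://www.mywebsite.com/other">NOT A TITLE</a></p>
"""

def get_titles(html):
    """Return the text of every <a> that sits in an <h3> inside div.blocks."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for link in soup.find_all("a"):
        h3 = link.parent
        # Skip links whose direct parent is not an <h3>
        if h3 is None or h3.name != "h3":
            continue
        div = h3.parent
        # The grandparent must be a <div> whose class list contains "blocks"
        if div is not None and div.name == "div" and "blocks" in div.get("class", []):
            # strip=True drops the newlines and indentation around the title
            titles.append(link.get_text(strip=True))
    return titles

titles = get_titles(SAMPLE_HTML)
print(titles)
# Plain indexing picks out only the second match
print(titles[1])
```

On the second part of the question: find_all returns a list-like result, so ordinary indexing such as soup.find_all("a")[1] selects only the second link, with no special "select" argument needed. It raises IndexError if fewer than two links match.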