HTML:
<li class="dropdown menu-large">
<a href="/nephrology?cat=879" class="dropdown-toggle" data-toggle="dropdown" title="A">A<b class="caret"></b></a>
<ul class="dropdown-menu megamenu row">
<li class="col-sm-3 col-lg-2">
<ul>
<li class="dropdown-header">
<a href="javascript:void(0);" style="cursor:default;" title="A1">A1</a>
</li>
<li class="divider"></li>
<li><a href="/nephrology?p=3061" title="Apple">Apple</a></li>
<li><a href="/nephrology?p=3062" title="Alien">Alien</a></li>
<li><a href="/nephrology?p=3064" title="AI">AI</a></li>
<li><a href="/nephrology?p=3063" title="April">April</a></li>
</ul>
</li>
</ul>
</li>
<li class="dropdown menu-large">
<a href="/nephrology?cat=874" class="dropdown-toggle" data-toggle="dropdown" title="B">B<b class="caret"></b></a>
<ul class="dropdown-menu megamenu row">
<li class="col-sm-3 col-lg-2">
<ul>
<li class="dropdown-header">
<a href="javascript:void(0);" style="cursor:default;" title="B1">B1</a>
</li>
<li class="divider"></li>
<li><a href="/nephrology?p=3072" title="Banana">Banana</a></li>
<li><a href="/nephrology?p=3048" title="Babe">Babe</a></li>
<li><a href="/nephrology?p=3036" title="Bamboo">Bamboo</a></li>
<li><a href="/nephrology?p=2771" title="Berry">Berry</a></li>
</ul>
</li>
</ul>
</li>
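I want to scrape the URLs for Apple, Alien, AI and April, but I don't know how. My code below only gets the URL for A, "/nephrology?cat=879". How can I make it grab the URLs that sit under the class "divider"? When I tried using only the class "divider", it also pulled out Banana and the other URLs, which I don't need. Thanks in advance!
My code: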
for item in soup.find_all(attrs={'class':'dropdown menu-large'}):
    for link in item.find_all('a', {'title' : 'A'}):
        href = link.get('href')  # it gets "/nephrology?cat=879"
Answer 0 (score: 0)
You can do this with the following steps.
First, find all the <li> elements in the soup:
soup.find_all("li")
Then keep only those that have exactly one child, and whose child is an <a>:
len(list(soup_li.children)) == 1 and soup_li.a
The complete program looks like this:
from bs4 import BeautifulSoup

# sample.html holds the markup from the question
with open("./sample.html", "r") as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for soup_li in soup.find_all("li"):
    # keep only <li> elements whose single child is a link
    if len(list(soup_li.children)) == 1 and soup_li.a:
        print(soup_li.a["href"])
Output:
/nephrology?p=3061
/nephrology?p=3062
/nephrology?p=3064
/nephrology?p=3063
/nephrology?p=3072
/nephrology?p=3048
/nephrology?p=3036
/nephrology?p=2771
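Note that the loop above prints every link, including the ones under B. If only the links under A are wanted, one possible approach is to locate the "A" toggle first and then walk the siblings that follow the divider. The following is a minimal sketch under that assumption, not the answerer's code: the helper name `links_under_menu` is hypothetical, and the markup is a trimmed copy of the question's HTML wrapped in a `<ul>`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (assumption: wrapped in a <ul>).
HTML = """
<ul>
<li class="dropdown menu-large">
<a href="/nephrology?cat=879" class="dropdown-toggle" data-toggle="dropdown" title="A">A<b class="caret"></b></a>
<ul class="dropdown-menu megamenu row">
<li class="col-sm-3 col-lg-2">
<ul>
<li class="dropdown-header"><a href="javascript:void(0);" title="A1">A1</a></li>
<li class="divider"></li>
<li><a href="/nephrology?p=3061" title="Apple">Apple</a></li>
<li><a href="/nephrology?p=3062" title="Alien">Alien</a></li>
</ul>
</li>
</ul>
</li>
<li class="dropdown menu-large">
<a href="/nephrology?cat=874" class="dropdown-toggle" data-toggle="dropdown" title="B">B<b class="caret"></b></a>
<ul class="dropdown-menu megamenu row">
<li class="col-sm-3 col-lg-2">
<ul>
<li class="dropdown-header"><a href="javascript:void(0);" title="B1">B1</a></li>
<li class="divider"></li>
<li><a href="/nephrology?p=3072" title="Banana">Banana</a></li>
</ul>
</li>
</ul>
</li>
</ul>
"""


def links_under_menu(soup, menu_title):
    """Hypothetical helper: hrefs of the <li> items after the divider in one menu."""
    # Find the top-level toggle link, e.g. the one with title="A".
    toggle = soup.find("a", class_="dropdown-toggle", title=menu_title)
    if toggle is None:
        return []
    # The enclosing <li class="dropdown menu-large"> scopes the search to one menu.
    menu = toggle.find_parent("li")
    hrefs = []
    for divider in menu.find_all("li", class_="divider"):
        # Only siblings that come after the divider are the wanted entries.
        for li in divider.find_next_siblings("li"):
            if li.a and li.a.get("href"):
                hrefs.append(li.a["href"])
    return hrefs


soup = BeautifulSoup(HTML, "html.parser")
a_links = links_under_menu(soup, "A")
print(a_links)
```

Scoping the search to the parent `<li>` of the toggle is what keeps Banana and the other B entries out of the result.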
Answer 1 (score: 0)
Try this. It should produce exactly the result you mentioned above.
from lxml.html import fromstring

# `html` is assumed to hold the markup from the question as a string
root = fromstring(html)
for li in root.cssselect(".dropdown:nth-child(1) .dropdown-header+.divider ~ li"):
    item = ' '.join([a.text for a in li.cssselect("a")])
    print(item)
Result:
Apple
Alien
AI
April
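The snippet above prints the link text, while the question asks for the URLs. A rough sketch of how the hrefs could be read with lxml follows. It swaps the CSS selector for an XPath expression (so the separate cssselect package is not needed), and it assumes the markup is wrapped in a `<ul>`; the helper name `hrefs_for_menu` is hypothetical.

```python
from lxml.html import fromstring

# Trimmed copy of the question's markup (assumption: wrapped in a <ul>).
HTML = """
<ul>
<li class="dropdown menu-large">
<a href="/nephrology?cat=879" class="dropdown-toggle" title="A">A<b class="caret"></b></a>
<ul class="dropdown-menu megamenu row">
<li class="col-sm-3 col-lg-2">
<ul>
<li class="dropdown-header"><a href="javascript:void(0);" title="A1">A1</a></li>
<li class="divider"></li>
<li><a href="/nephrology?p=3061" title="Apple">Apple</a></li>
<li><a href="/nephrology?p=3062" title="Alien">Alien</a></li>
</ul>
</li>
</ul>
</li>
<li class="dropdown menu-large">
<a href="/nephrology?cat=874" class="dropdown-toggle" title="B">B<b class="caret"></b></a>
<ul class="dropdown-menu megamenu row">
<li class="col-sm-3 col-lg-2">
<ul>
<li class="dropdown-header"><a href="javascript:void(0);" title="B1">B1</a></li>
<li class="divider"></li>
<li><a href="/nephrology?p=3072" title="Banana">Banana</a></li>
</ul>
</li>
</ul>
</li>
</ul>
"""


def hrefs_for_menu(root, menu_title):
    """Hypothetical helper: hrefs of the items after the divider in one menu."""
    # $t is an XPath variable filled from the keyword argument below.
    expr = (
        './/li[a[@title=$t]]'            # the top-level <li> whose toggle has this title
        '//li[@class="divider"]'         # the divider inside that menu
        '/following-sibling::li/a/@href' # links in the items after the divider
    )
    return [str(h) for h in root.xpath(expr, t=menu_title)]


root = fromstring(HTML)
a_hrefs = hrefs_for_menu(root, "A")
print(a_hrefs)
```

Selecting by the toggle's title rather than by position (`:nth-child(1)`) is a design choice here: it keeps working if the menus are reordered.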