I am new to Python and I am learning it for scraping purposes. I am using BeautifulSoup to collect links (the href values of 'a' tags). I am trying to collect the links under the "Upcoming Events" tab of the site http://allevents.in/lahore/. I am using Firebug to inspect the element and get the CSS path, but this code returns nothing. I am looking for a fix, as well as some advice on how to choose appropriate CSS selectors to retrieve the desired links from any site. I wrote this code:
from bs4 import BeautifulSoup
import requests
url = "http://allevents.in/lahore/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
# CSS path copied from Firebug
for link in soup.select('html body div.non-overlay.gray-trans-back div.container div.row div.span8 div#eh-1748056798.events-horizontal div.eh-container.row ul.eh-slider li.h-item div.h-meta div.title a[href]'):
    print(link.get('href'))
Answer 0 (score: 24)
The page is not the most friendly in its use of classes and markup, but even so your CSS selector is far too specific to be of any use here.
If you want the Upcoming Events, you want just the first <div class="events-horizontal">; then just grab the <div class="title"><a href="..."></div> tags, i.e. the links on the titles:
upcoming_events_div = soup.select_one('div#events-horizontal')
for link in upcoming_events_div.select('div.title a[href]'):
    print(link['href'])
Note that you should not use r.text here; use r.content instead and leave the decoding to BeautifulSoup. See Encoding issue of a character in utf-8.
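A minimal sketch putting this answer together (the 'html.parser' choice and the guard against a missing div are assumptions added here, not part of the original answer; any installed parser would do):

from bs4 import BeautifulSoup
import requests

r = requests.get("http://allevents.in/lahore/")
# Pass the raw bytes (r.content) and let BeautifulSoup handle the decoding
soup = BeautifulSoup(r.content, 'html.parser')

upcoming_events_div = soup.select_one('div#events-horizontal')
if upcoming_events_div is not None:
    for link in upcoming_events_div.select('div.title a[href]'):
        print(link['href'])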
Answer 1 (score: 12)
soup.select('div') - all elements named <div>
soup.select('#author') - the element with an id attribute of author
soup.select('.notice') - all elements that use a CSS class attribute named notice
soup.select('div span') - all elements named <span> that are within an element named <div>
soup.select('div > span') - all elements named <span> that are directly within an element named <div>, with no other element in between
soup.select('input[name]') - all elements named <input> that have a name attribute with any value
soup.select('input[type="button"]') - all elements named <input> that have an attribute named type with value button
You might also be interested in this book.
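For illustration, here is a small self-contained sketch (the HTML snippet is made up for this example, not taken from the site in the question) showing how a few of these selectors behave:

from bs4 import BeautifulSoup

# A made-up snippet just to demonstrate the selectors above
html = """
<div id="author">Al</div>
<div class="notice">hello</div>
<div><p><span>nested span</span></p></div>
<input type="button" name="go">
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('#author'))               # the element with id="author"
print(soup.select('.notice'))               # elements with class "notice"
print(soup.select('div span'))              # <span> anywhere inside a <div>
print(soup.select('div > span'))            # <span> directly inside a <div> (none here)
print(soup.select('input[type="button"]'))  # <input> elements whose type is "button"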
Answer 2 (score: 8)
import bs4, requests

res = requests.get("http://allevents.in/lahore/")
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.select('a[property="schema:url"]'):
    print(link.get('href'))
This code works just fine!