I want to extract every href and src from all the divs on a page that have class='news_item'.
The HTML looks like this:
<div class="col">
  <div class="group">
    <h4>News</h4>
    <div class="news_item">
      <a href="www.link.com">
        <h2 class="link">
          here is a link-heading
        </h2>
        <div class="Img">
          <img border="0" src="/image/link" />
        </div>
        <p></p>
      </a>
    </div>
What I want to extract from this is:
www.link.com, here is a link-heading, and /image/link
My code is:
def scrape_a(url):
    news_links = soup.select("div.news_item [href]")
    for links in news_links:
        if news_links:
            return 'http://www.web.com' + news_links['href']

def scrape_headings(url):
    for news_headings in soup.select("h2.link"):
        return str(news_headings.string.strip())

def scrape_images(url):
    images = soup.select("div.Img[src]")
    for image in images:
        if images:
            return 'http://www.web.com' + news_links['src']

def top_stories():
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    link = scrape_a(soup)
    heading = scrape_headings(soup)
    image = scrape_images(soup)
    message = {'heading': heading, 'link': link, 'image': image}
    print message
The problem is that it gives me this error:
**TypeError: 'NoneType' object is not callable**
Here is the traceback:
Traceback (most recent call last):
  File "web_parser.py", line 40, in <module>
    top_stories()
  File "web_parser.py", line 32, in top_stories
    link = scrape_a('www.link.com')
  File "web_parser.py", line 10, in scrape_a
    news_links = soup.select_all("div.news_item [href]")
Answer 0 (score: 1)
Most of your errors come from the fact that news_link is not found correctly, so you are not getting the tag you expect.

Change this:
news_links = soup.select("div.news_item [href]")
for links in news_links:
    if news_links:
        return 'http://www.web.com' + news_links['href']
to this, and see if it helps:
news_links = soup.find_all("div", class_="news_item")
for links in news_links:
    if news_links:
        return 'http://www.web.com' + links.find("a").get('href')
Also note that your return statement will give you something like http://www.web.comwww.link.com, which I don't think you want.
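If you only want to prefix relative paths, a minimal sketch using urljoin (assuming http://www.web.com is the site's base URL) avoids that double-prefix problem, since urljoin resolves relative paths against the base and leaves fully qualified URLs alone:

from urlparse import urljoin  # Python 2; on Python 3 this lives in urllib.parse

base_url = 'http://www.web.com'
# A relative src like the one in your HTML is resolved against the base:
print urljoin(base_url, '/image/link')              # http://www.web.com/image/link
# An absolute URL is returned unchanged instead of being concatenated:
print urljoin(base_url, 'http://www.link.com/a/b')  # http://www.link.com/a/b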
Answer 1 (score: 1)
You should grab all the news items at once and then iterate over them. That makes it easy to organize the data you get into manageable chunks (dicts, in this case). Try something like this:
url = "http://www.web.com"
r = requests.get(url)
soup = BeautifulSoup(r.text)

messages = []
news_links = soup.select("div.news_item")  # selects all .news_item's
for l in news_links:
    message = {}
    message['heading'] = l.find("h2").text.strip()
    link = l.find("a")
    if link:
        message['link'] = link['href']
    else:
        continue
    image = l.find('img')
    if image:
        message['image'] = "http://www.web.com{}".format(image['src'])
    else:
        continue
    messages.append(message)

print messages
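Run against the sample HTML in the question (assuming that one .news_item is all the page contains), this should print something like:

[{'heading': 'here is a link-heading', 'link': 'www.link.com', 'image': 'http://www.web.com/image/link'}]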
Answer 2 (score: 0)
Your idea of splitting the task into different methods is a very good one: it is easy to read, change, and reuse.
The error is almost tracked down already: the traceback shows select_all, but that method does not exist in BeautifulSoup, and it is not in the code you posted either, plus a few other things... Long story short, I would do it like this:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from urlparse import urljoin
import requests


def news_links(url, soup):
    links = []
    for text in soup.select("div.news_item"):
        for x in text.find_all(href=True):
            links.append(urljoin(url, x['href']))
    return links


def news_headings(soup):
    headings = []
    for news_heading in soup.select("h2.link"):
        headings.append(str(news_heading.string.strip()))
    return headings


def news_images(url, soup):
    sources = []
    for image in soup.select("img[src]"):
        sources.append(urljoin(url, image['src']))
    return sources


def top_stories():
    url = 'http://www.web.com/'
    r = requests.get(url)
    content = r.content
    soup = BeautifulSoup(content)
    message = {'heading': news_headings(soup),
               'link': news_links(url, soup),
               'image': news_images(url, soup)}
    return message

print top_stories()
The soup is robust: if you find or select something that does not exist, it returns an empty list (or None for find). It looks like you are parsing a list of items, and the code is very close to doing that already.
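A quick sketch of that behavior, and of where your original TypeError likely comes from (assuming the bs4 behavior contemporary with this code, where an unknown attribute is treated as a tag lookup):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body></body></html>")
print soup.select("div.news_item")  # [] -- select on a missing element returns an empty list
print soup.find("h2")               # None -- find returns None instead of raising

# bs4 treats an undefined attribute such as select_all as a tag lookup,
# so soup.select_all is None, and calling it raises exactly
# "TypeError: 'NoneType' object is not callable" -- the error in the question.
soup.select_all("div.news_item [href]")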