Looping through elements with BeautifulSoup

Asked: 2017-10-19 15:12:28

Tags: python web-scraping beautifulsoup

I'm trying to store some data scraped from a website. The data I need is the text inside the elements, which I then want to store in a csv to query later.

In the code below, I find all references to the class 'vip'. I then want to loop through those, stripping the unnecessary HTML to get only the text data. Finally I encode it as utf-8, ready to insert into the csv.

# parse the page and store in var soup
soup = BeautifulSoup(page, 'html.parser')

# find the title
title_box = soup.findAll('a', attrs={'class': 'vip'})

print title_box

# loop through each iteration
for each in title_box:
    if each.find('title_box'):
        title = title_box.text.strip().encode('utf-8')

# print the result
print title

But whenever I print out the result of `title` I receive the following error:

Traceback (most recent call last):
  File "/Users/XXXX/Projects/project-kitchenaid/scaper.py", line 28, in <module>
    print title
NameError: name 'title' is not defined

From my understanding, `title` is out of scope. How do I retrieve the data from within the loop and make it available to the print call?

For context, this is just one result of `print title_box`:
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>]

3 Answers:

Answer 0 (score: 0)

As I said in the comments, `each.find('title_box')` won't get anything, because there is no `title_box` tag.

Since you want the `a` elements whose `class` attribute is `vip`, you'd need to check:

if 'vip' in each['class']:
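As a side note, the membership test is needed because BeautifulSoup treats `class` as a multi-valued attribute and returns it as a list of class names rather than a single string. A minimal sketch, assuming `bs4` is installed (the two-class markup here is invented for illustration):

```python
from bs4 import BeautifulSoup

# An element can carry several classes at once
soup = BeautifulSoup('<a class="vip featured" href="#">item</a>', 'html.parser')
each = soup.find('a')

print(each['class'])           # ['vip', 'featured'] — a list, not a string
print('vip' in each['class'])  # True
```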

Also, when this line of the code runs:

title_box = soup.findAll('a', attrs={'class': 'vip'})

the `title_box` list is already populated with the `a` elements whose `class` attribute is `vip`. So you don't have to check the same condition again inside the `for` loop.

Here's the code you should try:

for each in title_box:
    title = each.text.strip().encode('utf-8')
    print title

Of course, you could skip assigning the text to a variable entirely and print it directly:

print each.text.strip().encode('utf-8')
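Putting the pieces together, here is an end-to-end sketch of the fixed loop in Python 3 syntax (where `print` is a function and the utf-8 encode step is usually unnecessary). It assumes `bs4` is installed, and the `page` string is hypothetical stand-in markup rather than the real eBay listing:

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for the scraped listing
page = '''
<a class="vip" href="http://example.com/1">KITCHENAID CLASSIC MIXER</a>
<a class="vip" href="http://example.com/2">KITCHENAID ATTACHMENTS</a>
'''

soup = BeautifulSoup(page, 'html.parser')
title_box = soup.findAll('a', attrs={'class': 'vip'})

# Collect every title inside the loop, rather than printing once afterwards
titles = []
for each in title_box:
    titles.append(each.text.strip())

print(titles)
```

Collecting the values into a list this way also sidesteps the original scope problem: the results exist after the loop ends, ready to be written out.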

Answer 1 (score: 0)

Here are the steps:

  1. title_box = soup.findAll('a', attrs={'class': 'vip'}) — this line finds all of the HTML with the "a" tag, further filtered by the required class vip.
  2. You can't do if each.find('title_box'):, because there is no HTML tag named title_box.
  3. You can get the text using:

    for each in title_box: print(each.text.strip().encode('utf-8'))

  4. There is no need for any further conditional statements beyond the snippet above.

Answer 2 (score: 0)

I made an HTML file containing five copies of your `a` element and named it "temp.htm":

<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>

Then I ran this code to get the text inside those links:

>>> page = open('temp.htm').read()
>>> import bs4
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> for link in soup.select('.vip'):
...     link.text
... 
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'

You may still need to encode this text before storing it in your csv file.
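For the csv step itself: in Python 3 the `csv` module accepts text directly when the file is opened with an encoding, so the manual `.encode('utf-8')` is not needed there. A minimal sketch (the filename `titles.csv` and the sample data are assumptions):

```python
import csv

# Hypothetical scraped titles standing in for the link.text values
titles = [
    "KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS",
    "KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS",
]

# newline='' is recommended by the csv docs to avoid blank rows on Windows
with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])  # header row
    for t in titles:
        writer.writerow([t])
```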