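I'm trying to store some data scraped from a website. The data I need is the text inside certain elements, which I then want to store in a CSV so I can query it later.
In the code below, I find all references to the class 'vip'. I then want to loop over those, stripping out the unnecessary HTML to get only the text data. Finally I encode it as utf-8, ready to insert into the CSV.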
# parse the page and store in var soup
soup = BeautifulSoup(page, 'html.parser')
# find the title
title_box = soup.findAll('a', attrs={'class': 'vip'})
print title_box
# loop through each iteration
for each in title_box:
    if each.find('title_box'):
        title = title_box.text.strip().encode('utf-8')
# print the result
print title
However, whenever I print the result of title, I get the following error:
Traceback (most recent call last):
  File "/Users/XXXX/Projects/project-kitchenaid/scaper.py", line 28, in <module>
    print title
NameError: name 'title' is not defined
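As I understand it, title is out of scope. How can I retrieve the data from the loop and pass it to the print call?
For context, this is the output of print title_box: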
[<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>]
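Answer 0 (score: 0)
As I said in the comments, each.find('title_box') will not find anything, because there is no title_box tag. The if condition is therefore never true and title is never assigned.
Since you want the <a> elements whose class attribute is vip, you would need to check: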
if 'vip' in each['class']:
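Moreover, once this line of your code has run:
title_box = soup.findAll('a', attrs={'class': 'vip'})
the title_box list is already populated with the <a> elements whose class attribute is vip. So you don't have to check the same condition again inside the for loop.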
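To make the two points above concrete, here is a minimal sketch, assuming Python 3 and the bs4 package. The one-line html string is a shortened stand-in for the element in the question, and the names link and no_match are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the <a> element shown in the question.
html = '<a class="vip" href="#" title="Click this link">KITCHENAID CLASSIC MIXER</a>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.find('a', attrs={'class': 'vip'})

# find() looks for a child *tag* with that name; there is no <title_box>
# tag, so the result is None and an `if` on it never runs its body.
no_match = link.find('title_box')
print(no_match)

# class is a multi-valued attribute, so bs4 returns it as a list.
print(link['class'])

# The text is available directly on the element.
print(link.get_text(strip=True))
```

Because the if body never executes, the assignment to title is skipped, which is why the later print fails with a NameError.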
Here is the code you should try:
for each in title_box:
    title = each.text.strip().encode('utf-8')
    print title
Of course, you can skip assigning the text to a variable and print it directly:
for each in title_box:
    print each.text.strip().encode('utf-8')
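If the goal is to keep the titles for later use rather than only print them, one option is to collect them in a list. The following is a rough sketch, assuming Python 3 and bs4. The page string is made-up sample markup (only the first title comes from the question), and titles is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: two matching links and one that should be skipped.
page = '''
<a class="vip" href="#">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="other" href="#">Not a listing</a>
<a class="vip" href="#">  Another listing title  </a>
'''
soup = BeautifulSoup(page, 'html.parser')

# find_all already filters by tag name and class, so the loop needs no `if`.
titles = [link.get_text(strip=True) for link in soup.find_all('a', class_='vip')]

for title in titles:
    print(title)
```

The list stays available after the loop, so it can be handed to whatever writes the CSV.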
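Answer 1 (score: 0)
Here are the steps:
title_box = soup.findAll('a', attrs={'class': 'vip'})
This line finds all the "a" tags in the HTML and filters them further by the required class vip.
The check if each.find('title_box'): never succeeds, because there is no tag named title_box.
You can get the text with:
for each in title_box:
    print(each.text.strip().encode('utf-8'))
No further conditional statement is needed, as the snippet above shows.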
Answer 2 (score: 0)
I made an HTML file containing five copies of the <a> element and named it "temp.htm":
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
Then I ran this code to get the text in those links:
>>> page = open('temp.htm').read()
>>> import bs4
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> for link in soup.select('.vip'):
...     link.text
...
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
You may still need to encode these texts before storing them in your csv file.
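As a rough sketch of that last step, assuming Python 3: there the csv module works with text, and the encoding is chosen when the file is opened, so no manual .encode('utf-8') is needed (in Python 2 the csv module expects byte strings, which is where the explicit encode comes in). The helper write_titles, the column name title, and the file name in the comment are all hypothetical; an in-memory buffer is used so the example runs without touching the disk.

```python
import csv
import io


def write_titles(titles, out):
    """Write one title per row to an open text-mode file-like object."""
    writer = csv.writer(out)
    writer.writerow(['title'])  # header row (column name is illustrative)
    for title in titles:
        writer.writerow([title])


titles = ['KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS']

# In-memory stand-in for a real file. For a file on disk, something like
# open('titles.csv', 'w', newline='', encoding='utf-8') would be used instead.
buffer = io.StringIO()
write_titles(titles, buffer)
print(buffer.getvalue())
```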