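I've only just started dabbling in Python and, as many people do, I'm trying the language out with a web-scraping example. I want to collect the contents of every tag of a certain type and return them as a list. For this I'm using BeautifulSoup and requests. The site used for this test is the blog of a small game called "Staxel".
I can get my code to output the first occurrence of the tag using [soup.find] and [print], but when I change the code to the version below, I get a warning about printing a list as a fixed variable.
Can someone point out what I should be using for this?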
# import libraries
import requests
import ssl
from bs4 import BeautifulSoup
# set the URL string
quote_page = 'https://blog.playstaxel.com'
# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)
# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')
# Remove the 'div' of name and get it's value
name_box = soup.find_all('h1',attrs={'class':'entry-title'})
name = name_box.text.strip() #strip() is used to remove the starting and trailing
print ("Title {}".format(name))
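Answer 0 (score: 1)
By using .find_all(), you create a list of all occurrences of h1. You just need to wrap the print statement in a for loop. The code with that structure looks like this: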
# import libraries
import requests
import ssl
from bs4 import BeautifulSoup
# set the URL string
quote_page = 'https://blog.playstaxel.com'
# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)
# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')
# Collect every 'h1' tag with the class 'entry-title'
name_box = soup.find_all('h1', attrs={'class': 'entry-title'})
# find_all() returns a list of tags, so read .text from each one in turn
for name in name_box:
    print("Title {}".format(name.text.strip()))
Output:
Title Magic update – feature preview
Title New Years
Title Staxel Changelog for 1.3.52
Title Staxel Changelog for 1.3.49
Title Staxel Changelog for 1.3.48
Title Halloween Update & GOG
Title Staxel Changelog for 1.3.44
Title Staxel Changelog for 1.3.42
Title Staxel Changelog for 1.3.40
Title Staxel Changelog for 1.3.34 to 1.3.39
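To make the difference between the two calls concrete, here is a minimal offline sketch. The HTML snippet and the names SAMPLE_HTML, first_title and all_titles are made up for illustration and are not taken from the Staxel blog; it also uses the built-in 'html.parser' instead of 'lxml' so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the blog's HTML, so the example runs without a network call.
SAMPLE_HTML = """
<html><body>
  <h1 class="entry-title"> First post </h1>
  <h1 class="entry-title">Second post</h1>
  <h1 class="site-title">Not an entry</h1>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns a single Tag (or None), so .text can be read directly.
first = soup.find("h1", attrs={"class": "entry-title"})
first_title = first.text.strip()

# find_all() returns a list-like ResultSet of Tags; .text has to be read per element.
matches = soup.find_all("h1", attrs={"class": "entry-title"})
all_titles = [tag.text.strip() for tag in matches]

print(first_title)
print(all_titles)
```

Calling matches.text here would fail, because the attribute lives on each Tag and not on the collection that holds them.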
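Answer 1 (score: 0)
This is because soup.find_all returns a list of tags, not a single tag as soup.find does.
The snippets below should avoid the error and print all the titles found, in Python 2.7 and 3.* respectively:
Python 3.*: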
name_box = soup.find_all('h1',attrs={'class':'entry-title'})
titles = [name.text.strip() for name in name_box]  # loop over the results and strip extra whitespace
for title in titles:  # loop over the titles and print each one
    print("Title {}".format(title))
Python 2.7:
name_box = soup.find_all('h1',attrs={'class':'entry-title'})
titles = [name.text.strip() for name in name_box]  # loop over the results and strip extra whitespace
for title in titles:  # loop over the titles and print each one
    print("Title {}".format(title.encode('utf-8')))
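As mentioned in @Vantagilt's comment, his output had a "b" prepended to each string. This is due to the different ways Python 2.7 and Python 3 handle strings. Here is a good blog on the subject.
The gist is that strings in Python 3 are Unicode by default, so the encode step can be dropped. In Python 2.7, strings are stored as bytes and need to be encoded explicitly, otherwise we see an error like the following:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 13: ordinal not in range(128)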
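The "b" prefix can be reproduced without any scraping. The following is a small Python 3 sketch; the title string is copied from the output shown earlier in this thread, and the variable names are only illustrative.

```python
# Python 3 only: str is Unicode text, and .encode() produces a separate bytes object.
title = "Magic update \u2013 feature preview"  # \u2013 is the en dash from the first title

encoded = title.encode("utf-8")  # bytes, not str

# Formatting the str gives the readable title.
as_text = "Title {}".format(title)

# Formatting the bytes object inserts its repr, which is where the b'...' comes from.
as_bytes_repr = "Title {}".format(encoded)

print(as_text)
print(as_bytes_repr)
```

So in Python 3 the .encode('utf-8') call should simply be left out when the goal is to print readable text.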
Answer 2 (score: 0)
Instead of attrs, you can use the class_ keyword argument.
Since find_all returns a list, you have to iterate over it and format each value.
Python 2.7
name_box = soup.find_all('h1', class_='entry-title')
# name_box is a list containing every `h1` tag with the given class
for name in name_box:
    title = name.text.strip()
    print("Title {}".format(title.encode('utf-8')))
Python 3.*
name_box = soup.find_all('h1', class_='entry-title')
# name_box is a list containing every `h1` tag with the given class
for name in name_box:
    title = name.text.strip()
    print("Title {}".format(title))
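Since the original question asked for the results as a list, the loop can also be wrapped in a small helper. This is only a sketch under assumptions: collect_titles and SAMPLE_HTML are hypothetical names, the HTML is invented so the example runs offline, and 'html.parser' is used in place of 'lxml'. It also checks that class_ and attrs select the same tags here.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the <h2> shares the class but is not an <h1>.
SAMPLE_HTML = (
    '<h1 class="entry-title"> Alpha </h1>'
    '<h2 class="entry-title">Skipped</h2>'
    '<h1 class="entry-title">Beta</h1>'
)


def collect_titles(html, tag="h1", css_class="entry-title"):
    """Return the stripped text of every matching tag as a list of strings."""
    soup = BeautifulSoup(html, "html.parser")
    return [node.text.strip() for node in soup.find_all(tag, class_=css_class)]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# The keyword form and the attrs dictionary form are two spellings of the same filter.
by_keyword = [t.text.strip() for t in soup.find_all("h1", class_="entry-title")]
by_attrs = [t.text.strip() for t in soup.find_all("h1", attrs={"class": "entry-title"})]

titles = collect_titles(SAMPLE_HTML)
for title in titles:
    print("Title {}".format(title))
```

The trailing underscore in class_ is needed because class is a reserved word in Python.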