Question

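I have only just started dabbling in Python and, as many do, I am starting with web-scraping examples to try out the language. I am trying to collect all content of a certain tag type and return it as a list. For this I am using BeautifulSoup and requests. The website used for this test is the blog of a small game called "Staxel".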

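I can get my code to output the first occurrence of the tag using [soup.find] and [print], but when I change the code to the version below, I get a warning about printing a list as a fixed variable.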

Can anyone point out what I should be using for this?

# import libraries
import requests
import ssl
from bs4 import BeautifulSoup

# set the URL string
quote_page = 'https://blog.playstaxel.com'

# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)


# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')

# Find every 'h1' with class 'entry-title' and get its text
name_box = soup.find_all('h1',attrs={'class':'entry-title'})
name = name_box.text.strip() # strip() removes leading and trailing whitespace
print ("Title {}".format(name))

Answer 1

By using .find_all() you create a list of all occurrences of h1. You just need to wrap your print statement in a for loop. The code with this structure is below:

# import libraries
import requests
import ssl
from bs4 import BeautifulSoup

# set the URL string
quote_page = 'https://blog.playstaxel.com'

# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)


# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')

# Find every 'h1' with class 'entry-title' and print each title
name_box = soup.find_all('h1',attrs={'class':'entry-title'})
for name in name_box:
    print ("Title {}".format(name.text.strip()))

Output:

Title Magic update – feature preview
Title New Years
Title Staxel Changelog for 1.3.52
Title Staxel Changelog for 1.3.49
Title Staxel Changelog for 1.3.48
Title Halloween Update & GOG
Title Staxel Changelog for 1.3.44
Title Staxel Changelog for 1.3.42
Title Staxel Changelog for 1.3.40
Title Staxel Changelog for 1.3.34 to 1.3.39

Answer 2

This is because soup.find_all returns a list of tags, not a single tag as soup.find does, and a list has no .text attribute.
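The difference can be reproduced without touching the network. The following is a minimal sketch, assuming a small hand-written HTML string (hypothetical, not the real blog markup) and the built-in html.parser instead of lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the blog page (not the real site content).
html = """
<h1 class="entry-title"> First post </h1>
<h1 class="entry-title"> Second post </h1>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns a single tag, so .text works directly.
first = soup.find("h1", attrs={"class": "entry-title"})
print(first.text.strip())

# find_all() returns a list-like collection of tags.
every = soup.find_all("h1", attrs={"class": "entry-title"})
print(len(every))

# Asking the collection itself for .text fails, as in the question.
raised = False
try:
    every.text
except AttributeError as exc:
    raised = True
    print("AttributeError:", exc)

# Reading .text from each element works.
titles = [tag.text.strip() for tag in every]
print(titles)
```

The last two lines show the same fix the answers use: ask each element for its text rather than the collection.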

The snippets below should avoid the error and print all titles found, in both Python 2.7 and 3.*:

Python 3.*:

name_box = soup.find_all('h1',attrs={'class':'entry-title'})
titles = [name.text.strip() for name in name_box]  # loop over results and strip extra whitespace
for title in titles:  # loop over titles and print
    print ("Title {}".format(title))

Python 2.7:

name_box = soup.find_all('h1',attrs={'class':'entry-title'})
titles = [name.text.strip() for name in name_box]  # loop over results and strip extra whitespace
for title in titles:  # loop over titles and print
    print ("Title {}".format(title.encode('utf-8')))

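As mentioned in @Vantagilt's comment, his output had a "b" prepended to each string. This is due to the difference in how strings are handled between Python 2.7 and Python 3. Here is a good blog post about it.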

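The gist is that strings in Python 3 are unicode by default, so the encoding part can be removed. In Python 2.7 the default str type is a byte string, so the unicode text returned by BeautifulSoup has to be encoded explicitly; otherwise the implicit ASCII conversion fails with an error like this: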

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 13: ordinal not in range(128)
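Both symptoms can be seen under Python 3 with the standard library alone. This is a small sketch, assuming the first title from the output above is hard-coded as a string rather than scraped:

```python
# Python 3 sketch. The title is hard-coded here (an assumption for illustration);
# \u2013 is the en dash that appears in the first blog title.
title = "Magic update \u2013 feature preview"

# The en dash sits at index 13, the position named in the error message.
print(title.index("\u2013"))

# Formatting the str directly gives the expected line.
line_str = "Title {}".format(title)
print(line_str)

# Encoding first produces a bytes object, and formatting bytes into a str
# inserts its repr, which is where the leading b'...' comes from.
line_bytes = "Title {}".format(title.encode("utf-8"))
print(line_bytes)

# Forcing an ASCII encoding reproduces the same class of error.
ascii_failed = False
try:
    title.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_failed = True
    print(exc)
```

So under Python 3 the .encode('utf-8') call is what introduces the "b" prefix, and dropping it restores plain text output.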

Answer 3

Instead of attrs, you can use the class_ keyword argument.

Since find_all returns a list, you have to iterate over it and format each value.

Python 2.7

name_box = soup.find_all('h1', class_='entry-title')
# name_box is a list containing every `h1` tag with the given class value

for name in name_box:
  title = name.text.strip() 
  print ("Title {}".format(title.encode('utf-8')))

Python 3.*

name_box = soup.find_all('h1', class_='entry-title')
# name_box is a list containing every `h1` tag with the given class value

for name in name_box:
  title = name.text.strip() 
  print ("Title {}".format(title))
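To check that the class_ keyword and the attrs dictionary select the same tags, here is a short sketch on a hand-written HTML string (hypothetical markup, parsed with the built-in html.parser rather than lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two matching h1 tags plus two tags that should be ignored.
html = """
<h1 class="entry-title">First post</h1>
<h2 class="entry-title">Not an h1</h2>
<h1 class="other">Different class</h1>
<h1 class="entry-title">Second post</h1>
"""
soup = BeautifulSoup(html, "html.parser")

# Two spellings of the same filter: tag name h1 and CSS class entry-title.
by_attrs = soup.find_all("h1", attrs={"class": "entry-title"})
by_kwarg = soup.find_all("h1", class_="entry-title")

titles_attrs = [tag.text.strip() for tag in by_attrs]
titles_kwarg = [tag.text.strip() for tag in by_kwarg]

print(titles_attrs)
print(titles_kwarg)
```

The trailing underscore in class_ is needed because class is a reserved word in Python.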

Python web scraping: using 'soup.findall' to grab all tags
