Python website scrape using 'soup.findall' to grab all tags

Time: 2019-02-06 01:30:07

Tags: python python-3.x beautifulsoup

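I have only just started dabbling in Python and, like many people, I am trying out the language with a web-scraping example. I want to collect the contents of every tag of a certain type and return them as a list. For this I am using BeautifulSoup and requests. The website used for this test is the blog of a small game called "Staxel".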

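I can get my code to output the first occurrence of the tag using [soup.find] and [print], but when I change the code to the version below, I get a warning about printing a list as a fixed variable.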

Can someone point out what I should be using for this?

# import libraries
import requests
import ssl
from bs4 import BeautifulSoup

# set the URL string
quote_page = 'https://blog.playstaxel.com'

# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)


# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')

# Remove the 'div' of name and get it's value
name_box = soup.find_all('h1',attrs={'class':'entry-title'})
name = name_box.text.strip() #strip() is used to remove the starting and trailing
print ("Title {}".format(name))

3 Answers:

Answer 0 (score: 1)

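By using .find_all(), you create a list of all occurrences of h1. You just need to wrap the print statement in a for loop. The code with that structure looks like this: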

# import libraries
import requests
import ssl
from bs4 import BeautifulSoup

# set the URL string
quote_page = 'https://blog.playstaxel.com'

# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)


# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')

# Find every 'h1' with class 'entry-title' and print each one's stripped text
name_box = soup.find_all('h1',attrs={'class':'entry-title'})
for name in name_box:
    print ("Title {}".format(name.text.strip()))

Output:

Title Magic update – feature preview
Title New Years
Title Staxel Changelog for 1.3.52
Title Staxel Changelog for 1.3.49
Title Staxel Changelog for 1.3.48
Title Halloween Update & GOG
Title Staxel Changelog for 1.3.44
Title Staxel Changelog for 1.3.42
Title Staxel Changelog for 1.3.40
Title Staxel Changelog for 1.3.34 to 1.3.39

Answer 1 (score: 0)

This is because soup.find_all returns a list of tags, not a single tag as soup.find does, so the result has no .text attribute of its own.
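To make the difference concrete, here is a minimal sketch that runs without network access. The inline HTML is a hypothetical stand-in for the blog page (not its real markup), and 'html.parser' is used instead of 'lxml' only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real blog page will differ.
SAMPLE_HTML = """
<html><body>
  <h1 class="entry-title"> First post </h1>
  <h1 class="entry-title"> Second post </h1>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns a single tag (or None), so .text works directly.
first = soup.find("h1", attrs={"class": "entry-title"})
print(first.text.strip())

# find_all() returns a list-like collection of tags.
every = soup.find_all("h1", attrs={"class": "entry-title"})

# Asking the collection itself for .text fails, as in the question.
try:
    every.text
    failed = False
except AttributeError as exc:
    failed = True
    print("AttributeError:", exc)

# Each element of the collection is a tag, so .text works per item.
titles = [tag.text.strip() for tag in every]
print(titles)
```

The list comprehension at the end is the same pattern used in the snippets that follow.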

The snippets below should avoid the error and print all the titles found, in both Python 2.7 and 3.*:

Python 3.*:

name_box = soup.find_all('h1',attrs={'class':'entry-title'})
titles = [name.text.strip() for name in name_box]  # loop over results and strip extra whitespace
for title in titles:  # loop over titles and print
    print ("Title {}".format(title))

Python 2.7:

name_box = soup.find_all('h1',attrs={'class':'entry-title'})
titles = [name.text.strip() for name in name_box]  # loop over results and strip extra whitespace
for title in titles:  # loop over titles and print
    print ("Title {}".format(title.encode('utf-8')))

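As mentioned in @Vantagilt's comment, his output had a "b" prepended to each string. This is due to the difference in how strings are handled between Python 2.7 and Python 3. Here is a good blog post on the subject.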

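The gist is that strings in Python 3 are Unicode by default, so the encode part can be removed. In Python 2.7 the default str type is a byte string, so the Unicode text has to be encoded explicitly; otherwise we would see an error like the following: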

  

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 13: ordinal not in range(128)
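Both effects can be reproduced in Python 3 without any scraping. The sketch below assumes Python 3 and reuses the first title from the output above, written with an explicit \u2013 escape for the en dash. It shows where the "b" prefix comes from, and forces an ASCII encode to trigger the same kind of error.

```python
# Python 3 sketch: the title text contains an en dash (U+2013).
title = "Magic update \u2013 feature preview"

# Formatting the str directly gives the expected text.
as_text = "Title {}".format(title)
print(as_text)

# Formatting the encoded bytes inserts the bytes repr, hence the b'...' prefix.
as_bytes = "Title {}".format(title.encode("utf-8"))
print(as_bytes)

# Encoding to ASCII fails at the en dash, as in the Python 2.7 error.
try:
    title.encode("ascii")
    error_position = None
except UnicodeEncodeError as exc:
    error_position = exc.start  # index of the first character that cannot be encoded
    print("UnicodeEncodeError at position", error_position)
```

So in Python 3 the .encode('utf-8') call is the cause of the stray "b", and dropping it is enough.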

Answer 2 (score: 0)

You can use the class_ keyword argument instead of attrs.

Since find_all returns a list, you have to iterate over it and format each value.

Python 2.7

name_box = soup.find_all('h1', class_='entry-title')
# name_box is a list containing every `h1` tag that has the given class

for name in name_box:
  title = name.text.strip() 
  print ("Title {}".format(title.encode('utf-8')))

Python 3.*

name_box = soup.find_all('h1', class_='entry-title')
# name_box is a list containing every `h1` tag that has the given class

for name in name_box:
  title = name.text.strip() 
  print ("Title {}".format(title))
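As a quick check that the two spellings are interchangeable, the following sketch runs both queries against a small inline document. The HTML and the titles_of helper are hypothetical and exist only for illustration; 'html.parser' is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with one non-matching heading.
SAMPLE_HTML = """
<div>
  <h1 class="entry-title">Alpha</h1>
  <h1 class="other">Not a post title</h1>
  <h1 class="entry-title featured">Beta</h1>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def titles_of(tags):
    """Return the stripped text of each tag (illustrative helper)."""
    return [tag.get_text(strip=True) for tag in tags]


# The attrs dictionary and the class_ keyword express the same filter.
by_attrs = titles_of(soup.find_all("h1", attrs={"class": "entry-title"}))
by_keyword = titles_of(soup.find_all("h1", class_="entry-title"))

print(by_attrs)
print(by_keyword)
```

The keyword is spelled class_ because class is a reserved word in Python. A tag with several classes, like the third heading here, still matches on any one of them.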