BeautifulSoup: ignore <a> </a> tags and get <p> </p>

Date: 2017-08-18 10:56:15

Tags: python-2.7 web-scraping beautifulsoup

I want to get all the text inside every <p> tag that belongs to news1.

import requests
from bs4 import BeautifulSoup
r1  = requests.get("http://www.metalinjection.net/shocking-revelations/machine-heads-robb-flynn-addresses-controversial-photo-from-his-past-in-the-wake-of-charlottesville")
data1 = r1.text
soup1 = BeautifulSoup(data1, "lxml")
news1 = soup1.find_all("div", {"class": "article-detail"})

for x in news1:
    print x.find("p").text

This gets the text of only the first <p>. When I call find_all instead, I get the following error:

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

So I made a list, but I still get the same error??

text1 = []
for x in news1:
    text1.append(x.find_all("p").text)

print text1

1 Answer:

Answer 0 (score: 1):

The error you get when running that code is AttributeError: 'ResultSet' object has no attribute 'text', which makes sense because a bs4 ResultSet is basically a list of Tag elements. You can get each 'p' tag by looping over that iterable:

text1 = []
for x in news1:
    for i in x.find_all("p"):
        text1.append(i.text)

Or, as a one-liner using a list comprehension:

text1 = [i.text for x in news1 for i in x.find_all("p")]
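As for the <a> tags mentioned in the title: .text also includes the text of any links nested inside a paragraph. A minimal sketch of one way to drop them, assuming the links sit inside the <p> tags and reusing the news1 variable from above, is to strip each nested <a> with decompose() before reading .text:

text1 = []
for x in news1:
    for p in x.find_all("p"):
        # remove nested <a> tags so their text is not included
        for a in p.find_all("a"):
            a.decompose()
        text1.append(p.text)

print text1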