BeautifulSoup: ignore <a> </a> tags and get <p> </p>

Date: 2017-08-18 10:56:15

Tags: python-2.7 web-scraping beautifulsoup

I want to get all the text inside every <p> tag that belongs to news1.

import requests
from bs4 import BeautifulSoup
r1  = requests.get("http://www.metalinjection.net/shocking-revelations/machine-heads-robb-flynn-addresses-controversial-photo-from-his-past-in-the-wake-of-charlottesville")
data1 = r1.text
soup1 = BeautifulSoup(data1, "lxml")
news1 = soup1.find_all("div", {"class": "article-detail"})

for x in news1:
    print x.find("p").text

This gets the text of only the first <p>. When I call find_all instead, I get the following error:

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

So I made a list, but I still get the same error??

text1 = []
for x in news1:
    text1.append(x.find_all("p").text)

print text1

1 Answer:

Answer 0 (score: 1):

The error you get when running that code is AttributeError: 'ResultSet' object has no attribute 'text', which makes sense because a bs4 ResultSet is basically a list of Tag elements. You can get each 'p' tag by looping over that iterable:

text1 = []
for x in news1:
    for i in x.find_all("p"):
        text1.append(i.text)

Or, as a one-liner using a list comprehension:

text1 = [i.text for x in news1 for i in x.find_all("p")]
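As for the <a> tags mentioned in the title: .text also includes the text of any links nested inside a paragraph. A minimal sketch of one way to drop them, assuming the links sit inside the <p> tags and reusing the news1 variable from above, is to strip each nested <a> with decompose() before reading .text:

text1 = []
for x in news1:
    for p in x.find_all("p"):
        # remove nested <a> tags so their text is not included
        for a in p.find_all("a"):
            a.decompose()
        text1.append(p.text)

print text1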