Question

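I'm using BeautifulSoup to extract content from HTML files. I have thousands of extracted HTML files and want to pull out the content between the p tags in every one of them. Here is the relevant code: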

for line in text:
    soup = bs(line, 'html.parser')
    autor = soup.find_all('p').text
    s = autor.replace('\\n', '')
    l.append(s)

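I wanted to use find_all().text to extract the text between all the p tags, but I get this error:

ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

If I use find().text instead: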

autor = soup.find('p').text

I only get the first p tag of each file.

Can anyone help?

Answer 1

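find_all() returns a list of tags, so read .text from each tag rather than from the list. To get the text with paragraphs separated by blank lines: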

paragraph_text = '\n\n'.join(p.text for p in soup.find_all('p'))

Or, for example, if you want to join the paragraphs with spaces:

paragraph_text = ' '.join(p.text for p in soup.find_all('p'))

A list of the text of every <p>:

paragraphs = [p.text for p in soup.find_all('p')]
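
Putting this together for many files, here is a minimal sketch of how the loop from the question might look. The helper name extract_paragraphs and the sample documents list are hypothetical stand-ins for illustration, and newlines are removed as the question's replace() call appears to intend:

```python
from bs4 import BeautifulSoup


def extract_paragraphs(html):
    """Return the text of every <p> element in an HTML string, newlines removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a ResultSet (a list of Tag objects), so read the
    # text of each tag individually instead of calling .text on the list.
    return [p.get_text().replace("\n", "") for p in soup.find_all("p")]


# Hypothetical stand-ins for the contents of the extracted HTML files.
documents = [
    "<html><body><p>First</p><p>Second\nline</p></body></html>",
    "<div><p>Only one</p><span>ignored</span></div>",
]

all_paragraphs = []
for html in documents:
    all_paragraphs.append(extract_paragraphs(html))

print(all_paragraphs)
```

This keeps one list of paragraphs per file. If a single flat list is preferred, all_paragraphs.extend(...) could be used in place of append(...).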