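I'm trying to extract the content of any NYTimes article and put it into a string so I can count certain words. All of the article content is found in HTML "p" tags. I can get the paragraphs one at a time (commented out in the code), but I can't iterate over the variable paragraphs because I keep getting the following error: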
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-52-ccc2f7cf5763> in <module>()
     16
     17 for i in paragraphs:
---> 18     article = article + paragraphs[i].get_text()
     19
     20 print(article)
TypeError: list indices must be integers, not Tag
Here is the code:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
url = "http://www.nytimes.com/2015/01/02/world/europe/turkey-police-thwart-attack-on-prime-ministers-office.html"
req = session.get(url)
soup = BeautifulSoup(req.text)
paragraphs = soup.find_all('p', class_='story-body-text story-content')
#article = paragraphs[0].get_text()
#article = article + paragraphs[1].get_text()
#article = article + paragraphs[2].get_text()
#article = article + paragraphs[3].get_text()
#article = article + paragraphs[4].get_text()
#article = article + paragraphs[5].get_text()
#article = article + paragraphs[6].get_text()
for i in paragraphs:
    article = article + paragraphs[i].get_text()

print(article)
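Thank you very much for your help. I'm an economist just starting to learn how to code. Thanks for your patience in helping me with this problem.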
Answer 0 (score: 1)
You want:
for p in paragraphs:
    article = article + p.get_text()
Or:
for i in range(len(paragraphs)):
    article = article + paragraphs[i].get_text()
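The error arises because `for i in paragraphs` yields the list's elements (the `Tag` objects), not integer positions, so `paragraphs[i]` ends up indexing the list with a `Tag`. Below is a minimal sketch of the difference. It uses a hypothetical `FakeTag` stand-in class (not part of BeautifulSoup) so that it runs without a network request. Note that `article` also has to be initialized to an empty string before the loop, which the snippets above leave out.

```python
class FakeTag:
    """Hypothetical stand-in for a bs4 Tag; only get_text() is mimicked."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


paragraphs = [FakeTag("First paragraph. "), FakeTag("Second paragraph.")]

# Iterating over the list yields the elements themselves.
article = ""
for p in paragraphs:
    article += p.get_text()

# enumerate() yields (index, element) pairs when the position is also needed.
article_by_index = ""
for i, p in enumerate(paragraphs):
    article_by_index += paragraphs[i].get_text()

# Using an element as an index reproduces the TypeError from the question.
try:
    paragraphs[paragraphs[0]]
except TypeError as exc:
    error_message = str(exc)

print(article)
```

With real BeautifulSoup results the same loops should apply unchanged, since `find_all` returns a list-like collection of `Tag` objects.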
Answer 1 (score: 0)
p_tags = soup.find_all(class_="story-body-text story-content")
# method 1
article = ''
for p_tag in p_tags:
    p_text = p_tag.get_text()
    article += p_text
print(article)
# method 2
article2 = ''.join(p_tag.get_text() for p_tag in p_tags)
print(article2)
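Since the stated goal is to count certain words, here is a rough sketch of one way to do that once the article is a single string. The sample text and the `targets` list are hypothetical placeholders, and the regular expression is only a simple approximation of what counts as a word. One thing to watch when building the string: joining paragraphs with no separator can fuse the last word of one paragraph with the first word of the next, so `' '.join(...)` is probably the safer choice before counting.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the scraped article string.
article = "Police thwart attack. The police said the attack failed."

# Lowercase and split into words so "Police" and "police" count together.
words = re.findall(r"[a-z']+", article.lower())
counts = Counter(words)

# Hypothetical list of words of interest; Counter returns 0 for absent words.
targets = ["police", "attack", "turkey"]
target_counts = {word: counts[word] for word in targets}
print(target_counts)
```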