How do I get only the text that belongs to the article? I don't want any unrelated content.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
test1 = 'https://www.sfchronicle.com/news/bayarea/heatherknight/article/Special-education-teacher-a-prime-example-of-13560483.php'
# Opening up the connection, grabbing the page
uClient = uReq(test1)
page_html = uClient.read()
uClient.close()
# HTML parsing
page_soup = soup(page_html, "html.parser")
#print(page_soup.prettify())
# text of article
text = page_soup.find_all('p')
print(text)
Answer 0 (score: 1)
What you need to do is iterate over the result of page_soup.find_all('p') and read the text of each paragraph.
for p in page_soup.find_all('p'):
    print(p.text, p.next_sibling)
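The loop above still prints every paragraph on the page, including navigation and footer text. One possible way to narrow it down is sketched below: restrict the search to a container element before collecting the paragraphs. This is only a sketch under assumptions. The sample HTML, the helper name extract_article_text, and the idea that the story sits inside an <article> tag are all hypothetical; inspect the real page to find the element that actually wraps the article body.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup will differ.
SAMPLE_HTML = """
<html><body>
  <nav><p>Subscribe now</p></nav>
  <article>
    <p>First paragraph of the story.</p>
    <p>   </p>
    <p>Second paragraph with a <a href="#">link</a>.</p>
  </article>
  <footer><p>Copyright notice</p></footer>
</body></html>
"""


def extract_article_text(html, container="article"):
    """Return the non-empty <p> texts found inside the first `container` tag."""
    page_soup = BeautifulSoup(html, "html.parser")
    # Assumption: the story is wrapped in `container`; otherwise search the whole page.
    root = page_soup.find(container)
    if root is None:
        root = page_soup
    paragraphs = []
    for p in root.find_all("p"):
        # Collapse runs of whitespace and drop paragraphs that end up empty.
        text = " ".join(p.get_text().split())
        if text:
            paragraphs.append(text)
    return paragraphs


paragraphs = extract_article_text(SAMPLE_HTML)
print("\n\n".join(paragraphs))
```

To use this with the code from the question, pass page_html in place of SAMPLE_HTML and change the container argument to whatever tag wraps the article on that site.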