I'm new to web scraping, and I'm really struggling to extract certain paragraphs from a URL. From the following link I'm trying to print all the paragraphs under the Cover Page and Short Summary headings, but my program doesn't work.
Here is my code:
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import requests
import bs4
url = 'http://onepiece.wikia.com/wiki/Chapter_863'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find_all('div',attrs={"mw-content-ltr mw-content-text"})
for x in table:
    if (x.get(id) == "Cover Page"):
        print (x.get('p').get_text())
    elif(x.get(id) == "Short Summary"):
        print (x.get('p').get_text())
When I run the program it doesn't print anything, and no error message appears. Is there any way to print only the paragraphs under the Cover Page and Short Summary sections?
Answer 0 (score: 1)
If we analyze the page's HTML source, we can see which elements hold the Cover Page and Short Summary sections.
In the code, we need to find all the h2 and p tags, then record the index of each h2 as a marker. Once we have the markers, we loop over the tree again and can collect all the paragraphs that sit between the h2 headings we want.
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import requests
import bs4
url = 'http://onepiece.wikia.com/wiki/Chapter_863'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find_all('div',attrs={"mw-content-ltr mw-content-text"})
for x in table:
    i = 0
    cover_page_mark = 0
    short_summary_mark = 0
    long_summary_mark = 0
    cover_page = ''
    short_summary = ''
    # First pass: note the position of each heading we care about.
    for el in x.find_all(['h2', 'p']):
        if el.name == 'h2':
            if "Cover Page" in el.get_text():
                cover_page_mark = i
            if "Short Summary" in el.get_text():
                short_summary_mark = i
            if "Long Summary" in el.get_text():
                long_summary_mark = i
        i += 1
    i = 0
    # Second pass: keep the paragraphs that fall between the markers.
    for el in x.find_all(['h2', 'p']):
        if el.name == 'p':
            if cover_page_mark < i < short_summary_mark:
                cover_page += el.get_text()
            if short_summary_mark < i < long_summary_mark:
                short_summary += el.get_text()
        i += 1
    print(cover_page)
    print(short_summary)
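If you would rather not track indices at all, a sibling walk gives the same result: locate the h2 whose text matches the section name, then collect p tags until the next h2 starts. This is only a sketch and assumes the paragraphs sit next to the headings as siblings in the page markup, which is the usual MediaWiki layout:

import requests
from bs4 import BeautifulSoup

url = 'http://onepiece.wikia.com/wiki/Chapter_863'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

def section_paragraphs(heading_text):
    # Find the h2 whose text contains heading_text, then gather the
    # p tags that follow it until the next h2 begins a new section.
    for h2 in soup.find_all('h2'):
        if heading_text in h2.get_text():
            paragraphs = []
            for sibling in h2.find_next_siblings():
                if sibling.name == 'h2':
                    break
                if sibling.name == 'p':
                    paragraphs.append(sibling.get_text().strip())
            return '\n'.join(paragraphs)
    return ''

print(section_paragraphs('Cover Page'))
print(section_paragraphs('Short Summary'))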
Answer 1 (score: 1)
To get the desired results and keep your script concise, you could also do something like this. Run it and see the magic.
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("http://onepiece.wikia.com/wiki/Chapter_863").text,"html.parser")
for item in soup.select("#mw-content-text"):
    # The [1:4] slice picks the 2nd-4th paragraphs of the article body,
    # which hold the Cover Page and Short Summary text on this page.
    required_data = [p_item.text.strip() for p_item in item.select("p")][1:4]
    print('\n'.join(required_data))
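Note that the [1:4] slice is purely positional, so it would need adjusting on a page with a different layout. If you are unsure which indices you need, a quick debugging sketch is to print every paragraph together with its index first:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://onepiece.wikia.com/wiki/Chapter_863").text, "html.parser")

# Print each paragraph with its index so you can see which slice you need.
for index, p_item in enumerate(soup.select("#mw-content-text p")):
    text = p_item.text.strip()
    if text:
        print(index, text[:80])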