Web scraping news articles

Date: 2017-12-20 06:40:02

Tags: python beautifulsoup

For the past few months I have been learning Python and BeautifulSoup, mainly trying to use them to web-scrape news articles for my own research purposes.

However, I have been struggling to print out the text content of a Chinese news website.

Which tag should I use to get the content of the article?

<<div class="w980 wbnav clear"><a 
href="http://english.peopledaily.com.cn/" 
target="_blank">English</a>&gt;&gt;</div>
<div class="w980 wb_10 clear">
<h1>DPRK launches ballistic missile 'capable of hitting US 
mainland'</h1>
<div> (<a 



</div>
<div class="wb_12 clear">
<p style="text-align: center;">
<img alt="" src="/NMediaFile/2017/1129/FOREIGN201711291331000220555852915.jpg" style="width: 900px; height: 783px;" /></p>
<p>
The Democratic People's Republic of Korea (DPRK) has confirmed that it successfully tested a "Hwasong 15" intercontinental ballistic missile (ICBM) on Wednesday.</p>
<p>
A Korean Central News Agency (KCNA) statement, which confirms earlier assessments from the United States and the Republic of Korea (ROK), claims the new type of ICBM "is capable of striking the whole mainland of the US."
</p>
<p>
It was Pyongyang's first test launch since a missile was fired in mid-September, days after its sixth-nuclear test.</p>
<p>
The ICBM was launched at 02:48 local time on Wednesday, according to the KCNA statement, and flew to an altitude of 4,475 km and then a distance of 950 km.</p>
<p>
It was launched from Sain Ni in the DPRK and flew for 53 minutes before splashing down into the Sea of Japan, said Pentagon spokesman Robert Manning.</p>
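
One common reason for garbled or unprintable text from a Chinese site is that the response bytes get decoded with the wrong charset. Below is a minimal sketch of a fetch that lets BeautifulSoup detect the page's declared encoding; the URL and the h1 lookup are only illustrative, not taken from the original question's code.

import requests
from bs4 import BeautifulSoup

# Passing the raw bytes (r.content) instead of r.text lets BeautifulSoup
# detect the charset declared by the page, which avoids garbled characters
# when the text is printed.
r = requests.get('http://english.peopledaily.com.cn/')
soup = BeautifulSoup(r.content, 'html.parser')

headline = soup.find('h1')          # the h1 lookup is only illustrative
if headline is not None:
    print(headline.get_text(strip=True))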

1 Answer:

Answer 0 (score: 0):

I opened the site link (http://en.people.cn/index.html) and looked at the articles.

If you only want to scrape data from a specific article (for example this one: http://en.people.cn/n3/2017/1220/c90000-9306707.html), then you can use the following code -

import requests
from bs4 import BeautifulSoup

# Fetch the article page
r = requests.get('http://en.people.cn/n3/2017/1220/c90000-9306707.html')

# Parse the raw bytes so BeautifulSoup can detect the page's encoding
soup = BeautifulSoup(r.content, 'html.parser')

# This div holds the article heading and body on en.people.cn article pages
article = soup.find("div", {"class": "d2p3_left wb_left fl"})

d = {}
d["heading"] = article.find("h2").text

# Collect every paragraph inside the article div
d["content"] = article.find_all("p")

p = ''
for item in d["content"]:
    p = p + item.text

# str.replace returns a new string, so the result has to be reassigned
p = p.replace("\t", "")
d["content"] = p

# Write the heading and content to a text file; UTF-8 avoids encoding
# errors when the text contains non-ASCII characters
f = open('article1.txt', 'w', encoding='utf-8')

for item in d.values():
    f.write(item)

f.close()

I also checked a few other articles, and they all seem to use the class d2p3_left wb_left fl on the HTML div tag that contains the actual article content.

So I pulled the content out of this particular tag and put it into a dictionary with the keys 'heading' and 'content', so that you can format it however you like.
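
For instance, once the pieces are in a dictionary they can just as easily be written out in a structured format. Here is a small sketch using the standard json module; the dictionary contents and the output filename are placeholders, not part of the code above.

import json

# Placeholder dictionary in the same shape as the one built in the code above
d = {"heading": "Example headline", "content": "Example article text."}

# ensure_ascii=False keeps any non-ASCII characters readable in the file
with open('article1.json', 'w', encoding='utf-8') as out:
    json.dump(d, out, ensure_ascii=False, indent=2)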

Then I wrote all the values of the dictionary out to a text file.

If you want to scrape all the articles from the home page, you can just collect the links into a list and then loop over the list items, passing each one as the argument to the requests.get() method.
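
A minimal sketch of that loop, reusing the requests/BeautifulSoup setup from the code above; the '/n3/' URL filter used to pick out article links is an assumption and may need adjusting after inspecting the real home-page links.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Fetch the home page and collect candidate article links.
# NOTE: the '/n3/' filter is an assumption about how article URLs on this
# site look; inspect the actual links and adjust it if necessary.
home_url = 'http://en.people.cn/index.html'
home = requests.get(home_url)
home_soup = BeautifulSoup(home.content, 'html.parser')

links = set()
for a in home_soup.find_all('a', href=True):
    href = urljoin(home_url, a['href'])   # make relative links absolute
    if '/n3/' in href:
        links.add(href)

# Visit each link and pull out the article the same way as in the code above
for url in links:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    article = soup.find("div", {"class": "d2p3_left wb_left fl"})
    if article is None:
        continue  # skip pages that do not use the article layout
    heading = article.find("h2")
    print(heading.text.strip() if heading else url)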

Hope this helps.