I am trying to access the abstracts of NYT articles using the NewsWire API and Python 2.7. Here is the code:
from urllib2 import urlopen
import urllib2
from json import loads
import codecs
import time
import newspaper
# Needed for the except clause below; newspaper parses pages with lxml.
from lxml.etree import XMLSyntaxError

posts = list()      # abstracts returned by the API
articles = list()   # full article texts extracted by newspaper
i=30
keys= dict()        # running index -> article URL
count=0
offset=0
while(offset<40000):
    if(len(posts)>=30000): break
    if(700<offset<800):
        offset=offset + 100
    #for p in xrange(100):
    try:
        url = "http://api.nytimes.com/svc/news/v3/content/nyt/all.json?offset="+str(offset)+"&api-key=ACCESSKEY"
        data= loads(urlopen(url).read())
        print str(len(posts) )+ " offset=" + str(offset)
        # Dump everything collected so far on each pass.
        if posts and articles and keys:
            outfile= open("articles_next.tsv", "w")
            for s in articles:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            outfile= open("summary_next.tsv", "w")
            for s in posts:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            indexfile=open("ind2_next.tsv", "w")
            for x in keys.keys():
                indexfile.write('\n' + str(x) + " " + str(keys[x]))
            indexfile.close()
        for item in data["results"]:
            if(('url' in item) and ('abstract' in item)) :
                url= item["url"]
                abst=item["abstract"]
                if(url not in keys.values()):
                    keys[count]=url
                    article = newspaper.Article(url)
                    article.download()
                    article.parse()
                    try:
                        el_post = article.text.replace('\n\n',' ').replace("Advertisement Continue reading the main story",'')
                    except XMLSyntaxError, e:
                        continue
                    articles.append(el_post)
                    count=count + 1
                    res= abst # url + " " + abst
                    # print res.encode("utf-8")
                    posts.append(res) # Here is the appending statement.
                    if(len(posts)>=30000):
                        break
    except urllib2.HTTPError, e:
        print e
        time.sleep(1)
        offset=offset + 21
        continue
    except urllib2.URLError,e:
        print e
        time.sleep(1)
        offset=offset + 21
        continue
    offset=offset + 19
print str(len(posts))
print str(len(keys))
Most of the abstracts I get are good. But sometimes I find odd sentences among them. Here are some examples:
Here’s what you need to know to start your day.
Corrections appearing in print on Monday, August 28, 2017.
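These are returned as the abstract of some article. Please help me extract a proper abstract for each article from NYT news. I thought about falling back to the titles when this happens, but the titles are odd as well.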
Answer 0 (score: 0)
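So, I had a look at the abstract results.
Repeated statements can be removed, for example Corrections appearing in print on Monday, August 28, 2017., where only the date differs.
The simplest way to do this is to check whether the statement occurs in the variable itself. For example,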
# declare at the top
# create a list that consists of repetitive statements. I found 'quotation of the day' being repeated as well
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"]
Then,
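# any() is required here: a bare generator expression is always truthy,
# so testing it directly would never filter anything out.
if not any(statement in res for statement in REMOVE_STATEMENTS):
    posts.append(res)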
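As for the remaining unwanted statements, there is no way to tell them apart from real abstracts unless they are repeated or you search res for keywords to ignore. If you find any more, just add them to the list I created.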
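The check described above can also be pulled into a small helper so it is easy to test apart from the download loop. This is only a sketch: the names is_boilerplate and filter_abstracts are hypothetical, the sample abstracts are made up for illustration, and the extra phrase passed in the last call is an assumption based on the example from the question.

```python
from __future__ import print_function

# Phrases taken from the list above; extend it as more boilerplate turns up.
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"]


def is_boilerplate(abstract, phrases=REMOVE_STATEMENTS):
    """Return True if the abstract contains any of the known phrases."""
    return any(phrase in abstract for phrase in phrases)


def filter_abstracts(abstracts, phrases=REMOVE_STATEMENTS):
    """Keep only the abstracts that contain none of the phrases."""
    return [a for a in abstracts if not is_boilerplate(a, phrases)]


# Made-up sample data, not real API output.
sample = [
    "Corrections appearing in print on Monday, August 28, 2017.",
    "Quotation of the Day for Tuesday, August 29, 2017.",
    "A hypothetical abstract that describes an actual article.",
    "This is what you need to know to start your day.",
]

kept = filter_abstracts(sample)
print(kept)

# Adding one more phrase (an assumption) also drops the daily briefing line.
extra = REMOVE_STATEMENTS + ["what you need to know to start your day"]
kept_strict = filter_abstracts(sample, extra)
print(kept_strict)
```

In the loop from the question, the call would wrap the existing append, for example by appending res only when is_boilerplate(res) is false. Substring matching is deliberately crude: a real abstract that happens to contain one of the phrases would be dropped too.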