NYT summary extractor in Python 2

Asked: 2017-08-29 06:22:26

Tags: python python-2.7 api

I am trying to fetch the summaries of NYT articles using the NewsWire API and Python 2.7. Here is the code:

import time
import urllib2
from json import loads
from urllib2 import urlopen

import newspaper
# lxml's parse error; needed by the except clause in the article loop
from lxml.etree import XMLSyntaxError

posts = list()
articles = list()
i=30
keys= dict()
count=0
offset=0
while(offset<40000):
    if(len(posts)>=30000): break
    if(700<offset<800):
        offset=offset + 100
    #for p in xrange(100):    
    try:
        url = "http://api.nytimes.com/svc/news/v3/content/nyt/all.json?offset="+str(offset)+"&api-key=ACCESSKEY"    
        data= loads(urlopen(url).read())
        print str(len(posts) )+ "  offset=" + str(offset) 
        if posts and articles and keys:
            outfile= open("articles_next.tsv", "w")
            for s in articles:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()

            outfile= open("summary_next.tsv", "w")
            for s in posts:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()    

            indexfile=open("ind2_next.tsv", "w")
            for x in keys.keys():
                indexfile.write('\n' + str(x) + "    " + str(keys[x]))
            indexfile.close()

        for item in data["results"]:
            if 'url' in item and 'abstract' in item:

                url= item["url"]
                abst=item["abstract"]
                if(url not in keys.values()):
                    keys[count]=url
                    article = newspaper.Article(url)
                    try:
                        # download() and parse() are the calls that can fail,
                        # so they belong inside the try block
                        article.download()
                        article.parse()
                        el_post = article.text.replace('\n\n',' ').replace("Advertisement Continue reading the main story",'')
                    except XMLSyntaxError:
                        continue
                    articles.append(el_post)
                    count=count + 1
                    res= abst # url + "    " + abst 
                    # print res.encode("utf-8")               
                    posts.append(res) # Here is the appending statement.

            if(len(posts)>=30000): 
                break

    except urllib2.HTTPError, e:
        print e
        time.sleep(1)
        offset=offset + 21
        continue
    except urllib2.URLError,e:
        print e
        time.sleep(1)
        offset=offset + 21
        continue

    offset=offset + 19
print str(len(posts))
print str(len(keys))

Most of the summaries I get are good. But sometimes I find strange sentences returned as the summary. Here are some examples:

Here’s what you need to know to start your day.
Corrections appearing in print on Monday, August 28, 2017.

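These are treated as the summary of an article. Please help me extract proper article summaries from NYT news. I thought about falling back to the headlines when this happens, but the headlines are odd too.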

1 answer:

Answer 0: (score: 0)

So, I took a look at the summary results.

Repeated statements such as Corrections appearing in print on Monday, August 28, 2017., where only the date differs, can be removed.
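Because only the date changes, one stricter option is a regular expression that matches the whole notice. The following is a minimal sketch, not part of the original code: the name `is_corrections_notice` is hypothetical, and the pattern assumes the notice always has the form weekday, month name, day, four-digit year, as in the example from the question.

```python
import re

# Assumed shape of the notice: "<Weekday>, <Month> <day>, <year>" with an
# optional trailing period. Adjust the pattern if the real data differs.
CORRECTIONS_RE = re.compile(
    r"^Corrections appearing in print on \w+, \w+ \d{1,2}, \d{4}\.?$"
)


def is_corrections_notice(abstract):
    """Return True if the whole abstract is a dated corrections notice."""
    return CORRECTIONS_RE.match(abstract.strip()) is not None
```

Matching the whole string means a real abstract that merely mentions corrections is not discarded.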

The simplest approach is to check whether one of these statements occurs in the variable itself. For example,

# declare at the top
# create a list that consists of repetitive statements. I found 'quotation of the day' being repeated as well
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"] 

Then, at the append statement:

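# any() is needed here: a bare generator expression is always truthy,
# so nothing would ever be filtered out
if not any(statement in res for statement in REMOVE_STATEMENTS):
    posts.append(res)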

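The remaining unwanted statements cannot be told apart from real summaries unless you search res for keywords to ignore, or unless they are repeated. If you find any more, just add them to the list I created.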
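The keyword check described above can be sketched as a small self-contained helper. The function names `is_boilerplate` and `keep_real_abstracts` are hypothetical. The third phrase in the list is an assumed addition based on the example in the question, and the last sample abstract is made up for illustration.

```python
# Phrases that mark an abstract as boilerplate. The first two come from the
# answer; the third is an assumption based on the question's example.
REMOVE_STATEMENTS = [
    "Corrections appearing in print on",
    "Quotation of the Day for",
    "what you need to know to start your day",
]


def is_boilerplate(abstract, phrases=REMOVE_STATEMENTS):
    """Return True if any phrase occurs in the abstract (case-insensitive)."""
    lowered = abstract.lower()
    return any(phrase.lower() in lowered for phrase in phrases)


def keep_real_abstracts(abstracts, phrases=REMOVE_STATEMENTS):
    """Return only the abstracts that contain none of the phrases."""
    return [a for a in abstracts if not is_boilerplate(a, phrases)]


sample = [
    "Corrections appearing in print on Monday, August 28, 2017.",
    "Here's what you need to know to start your day.",
    "A hypothetical abstract about a city council vote.",  # made-up example
]
kept = keep_real_abstracts(sample)
print(kept)
```

In the loop from the question, it is probably best to run this check before the article is downloaded and appended. Otherwise articles, posts and keys drift out of step whenever a summary is skipped.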