Question

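My Python script parses titles and links from several RSS feeds. I store the titles in a list, and I want to make sure I never print a duplicate. How can I do this?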

    #!/usr/bin/python
    from twitter import *
    from goose import Goose
    import feedparser
    import time
    from pyshorteners import Shortener
    import pause
    import newspaper

    dr = feedparser.parse("http://www.darkreading.com/rss_simple.asp")
    sm = feedparser.parse("http://www.securitymagazine.com/rss/topic/2654-cyber-tactics.rss")

    # Titles that have already been posted, one list per feed
    dr_posts = [
        "CISO Playbook: Games of War & Cyber Defenses",
        "SWIFT Confirms Cyber Heist At Second Bank; Researchers Tie Malware Code to Sony Hack",
        "The 10 Worst Vulnerabilities of The Last 10 Years",
        "GhostShell Leaks Data From 32 Sites In 'Light Hacktivism' Campaign",
        "OPM Breach: 'Cyber Sprint' Response More Like A Marathon",
        "Survey: Customers Lose Trust In Brands After A Data Breach",
        "Domain Abuse Sinks 'Anchors Of Trust'",
        "The 10 Worst Vulnerabilities of The Last 10 Years",
    ]

    sm_posts = ["10 Steps to Building a Better Cybersecurity Plan"]

    x = 1

    while True:
        try:
            drtitle = dr.entries[x]["title"]
            drlink = dr.entries[x]["link"]
            if drtitle in dr_posts:
                # Duplicate: skip ahead one entry (which may itself be a duplicate)
                x += 1
                drtitle = dr.entries[x]["title"]
                drlink = dr.entries[x]["link"]
                print(drtitle + "\n" + drlink)
                dr_posts.append(drtitle)
                x -= 1
                pause.seconds(10)
            else:
                print(drtitle + "\n" + drlink)
                dr_posts.append(drtitle)
                pause.seconds(10)

            smtitle = sm.entries[x]["title"]
            smlink = sm.entries[x]["link"]
            if smtitle in sm_posts:
                x += 1
                smtitle = sm.entries[x]["title"]
                smlink = sm.entries[x]["link"]
                print(smtitle + "\n" + smlink)
                sm_posts.append(smtitle)
                pause.seconds(10)
            else:
                print(smtitle + "\n" + smlink)
                sm_posts.append(smtitle)
                x += 1
                pause.seconds(10)

        except IndexError:
            print("FAILURE")
            break

For now I just skip the entry. That is a problem: if the next entry in the RSS feed is also a duplicate, I will still end up printing duplicates.

Answer 1

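You can take advantage of the "uniqueness" property of the set data structure, which will do the work for you. Basically, you turn your list into a set and then turn the set back into a list, which guarantees that the list now contains only unique values.

If you have a list l, you can make its values unique with: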

l = list(set(l))
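
One caveat: a set does not keep the original order of the list, and converting the list removes duplicates after the fact rather than at print time. A minimal sketch of an alternative that checks each title against a set of already-seen titles before printing, assuming the feed entries behave like dictionaries with "title" and "link" keys as in the question's code. The function name `print_new_entries` and the sample data are hypothetical.

```python
def print_new_entries(entries, seen_titles):
    """Print each entry whose title is not in seen_titles; return the printed titles."""
    printed = []
    for entry in entries:
        title = entry["title"]
        if title in seen_titles:
            continue  # already printed: skip it, however many repeats follow
        seen_titles.add(title)
        print(title + "\n" + entry["link"])
        printed.append(title)
    return printed


# Hypothetical sample data standing in for feedparser's `entries` list.
sample_entries = [
    {"title": "Post A", "link": "http://example.com/a"},
    {"title": "Post B", "link": "http://example.com/b"},
    {"title": "Post A", "link": "http://example.com/a"},
]
seen = set(["Post B"])  # titles that were printed on an earlier run
new_titles = print_new_entries(sample_entries, seen)
```

Because the check happens for every entry, consecutive duplicates in a feed are all skipped, which is the case the question's skip-one-entry approach misses.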

Answer 2

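If you don't want to print duplicate links, you can use a Counter or a defaultdict:

from collections import defaultdict

sm_posts = defaultdict(int)
sm_posts[sm_links] += 1  # sm_links is a link parsed from the feed
print(sm_posts.keys())  # prints all the unique links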

A nice bonus is that you can also look up how many times a link has appeared:

sm_posts[sm_links]  # evaluates to the count for that link
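
The same counting idea can be sketched with `collections.Counter`, the other option this answer mentions. This is only an illustration: the links below are hypothetical stand-ins for the ones parsed from the feeds, and the variable names are not from the original script.

```python
from collections import Counter

# Hypothetical links standing in for the ones parsed from the feeds.
links = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",
]

link_counts = Counter(links)      # maps each link to the number of times it occurred
unique_links = list(link_counts)  # each link appears exactly once among the keys
duplicates = [link for link, count in link_counts.items() if count > 1]

for link in unique_links:
    print(link)
```

Looking up `link_counts[some_link]` gives the count for that link, and a link that never occurred gives 0 rather than raising a KeyError.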

Give it a try.

How to tell Python not to print an item in a list?
