How do I remove duplicates from the output using Python?

Time: 2015-10-01 11:42:48

Tags: python web-crawler

I have run into a problem here.

Take the following example:

for item in g_data:
        Header = item.find_all("div", {"class": "InnprodInfos"})
        print(Header[0].contents[0].text.strip())

Output:

DMZ 3rd Tunnel - Korean Demilitarized Zone Day Tour from Seoul
Panmunjeom Day Tour
Seoul City Half Day Private Tour
The Soul of Seoul - Small Group Tour
Seoul Helicopter Tour
Seoul City Full Day Tour
Seoul City Half Day Tour
The Street Museum in the Urban Core - Small Group Tour
Korean Folk Village Day Tour
DMZ 3rd Tunnel - Korean Demilitarized Zone Day Tour from Seoul
Panmunjeom Day Tour
Seoul City Half Day Private Tour
The Soul of Seoul - Small Group Tour
Seoul Helicopter Tour
Seoul City Full Day Tour
Seoul City Half Day Tour
The Street Museum in the Urban Core - Small Group Tour
Korean Folk Village Day Tour

As shown above, the output is printed twice, so the second copy of each title should be removed.

The result should look like this:

DMZ 3rd Tunnel - Korean Demilitarized Zone Day Tour from Seoul
Panmunjeom Day Tour
Seoul City Half Day Private Tour
The Soul of Seoul - Small Group Tour
Seoul Helicopter Tour
Seoul City Full Day Tour
Seoul City Half Day Tour
The Street Museum in the Urban Core - Small Group Tour
Korean Folk Village Day Tour

Can anyone tell me how to remove the duplicates? Any feedback is appreciated.

4 Answers:

Answer 0: (score: 0)

You can use a list, or a set if the order does not matter:

Using a list:

result = []
for item in g_data:
    header = item.find_all("div", {"class": "InnprodInfos"})
    item = header[0].contents[0].text.strip()
    if item not in result:
        result.append(item)

print('\n'.join(result))

Using a set:

result = set()
for item in g_data:
    header = item.find_all("div", {"class": "InnprodInfos"})
    result.add(header[0].contents[0].text.strip())

print('\n'.join(result))
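As a possible middle ground between the two variants above, here is a minimal sketch that keeps the first-seen order without scanning a list on every check. It relies on the fact that dict keys are unique and, since Python 3.7, keep insertion order. The `titles` list is hypothetical sample data standing in for the scraped text.

```python
# Hypothetical sample data standing in for the scraped titles.
titles = [
    "Panmunjeom Day Tour",
    "Seoul Helicopter Tour",
    "Panmunjeom Day Tour",
    "Seoul Helicopter Tour",
]

# dict keys are unique and (Python 3.7+) remember insertion order,
# so this drops repeats while keeping the first occurrence of each title.
unique_titles = list(dict.fromkeys(titles))

print("\n".join(unique_titles))
```

In the scraping loop, the same idea would apply if the stripped titles are first collected into a list.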

Answer 1: (score: 0)

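You should store the output in a set, which guarantees that each item is kept only once no matter how often it appears. Afterwards, you print the elements of the set. Note that a set does not preserve the original order.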

g_data = ["foo", "bar", "foo"]
g_unique = set()
for item in g_data:
    g_unique.add(item)  # adding an element that is already in the set has no effect

for item in g_unique:
    print(item)  # prints 'foo' and 'bar', one per line, in arbitrary order
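Because the iteration order of a set is not guaranteed, the printed order may vary between runs. If a stable order is needed and alphabetical order is acceptable (an assumption, since the question asks for the original order), a small sketch using `sorted` could look like this:

```python
g_data = ["foo", "bar", "foo"]

# Passing the list to set() removes the duplicates in one step.
g_unique = set(g_data)

# sorted() returns a new list, so the output order is deterministic.
for item in sorted(g_unique):
    print(item)
```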

Answer 2: (score: 0)

You can use a set to keep track of the items you have already printed. This preserves the original order.

already_printed = set()
for item in g_data:
    header = item.find_all("div", {"class": "InnprodInfos"})
    item = header[0].contents[0].text.strip()
    if item not in already_printed:
        print(item)
        already_printed.add(item)
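The same seen-set pattern can be wrapped in a small generator so that it is reusable outside the scraping loop. This is only a sketch: the name `iter_unique` and the sample `titles` list are hypothetical and not part of the original code.

```python
def iter_unique(items):
    """Yield each item the first time it appears, preserving order."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


# Hypothetical sample data standing in for the scraped titles.
titles = [
    "Seoul City Full Day Tour",
    "Korean Folk Village Day Tour",
    "Seoul City Full Day Tour",
]

for title in iter_unique(titles):
    print(title)
```

The set makes each membership check roughly constant time, while the generator yields items in the order they were first seen.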

Answer 3: (score: 0)

There is a simple way to do this by splitting the text into lines and building a set :)

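# Assumes Header[0].contents[0].text holds all titles, one per line.
# Building the set directly avoids a comprehension used only for side effects
# (the original also referenced an undefined name, `text`).
s = set(Header[0].contents[0].text.strip().split('\n'))
print('\n'.join(s))  # order is not preserved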