Question

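I have a list of URLs (unicode), and it contains a lot of duplicates. For example, the URLs http://www.myurlnumber1.com and http://www.myurlnumber1.com/foo+%bar%baz%qux lead to the same place.

So I need to weed out all of those duplicates.

My first idea was to check whether a substring of each element is in the list, like so: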

for url in list:
    if url[:30] not in list:
        print(url)

However, it tries to match the literal url[:30] against the list elements and obviously returns all of them, because no element exactly matches url[:30].

Is there an easy way to solve this problem?

Edit:

The host and path in the URLs usually stay the same, but the parameters differ. For my purposes, URLs with the same hostname and path but different parameters are still the same URL and count as duplicates.

Answer 1

If you treat all URLs with the same netloc as duplicates, you can parse them with urllib.parse:

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

u = "http://www.myurlnumber1.com/foo+%bar%baz%qux"

print(urlparse(u).netloc)

Which would give you:

www.myurlnumber1.com

So to get the unique netlocs you could do something like:

unique = {urlparse(u).netloc for u in urls}

If you want to keep the scheme:

urls = ["http://www.myurlnumber1.com/foo+%bar%baz%qux", "http://www.myurlnumber1.com"]

unique = {"{}://{}".format(u.scheme, u.netloc) for u in map(urlparse, urls)}
print(unique)

This presumes that they all have schemes, and that you don't have both http and https URLs for the same netloc that you want to treat as the same.
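If some entries have no scheme, urlparse leaves netloc empty and puts the host into path. A minimal sketch of one way to cope with that, assuming any string without "://" starts with the host name; the helper name parse_with_default_scheme and the default of "http" are made up for illustration:

```python
from urllib.parse import urlparse

def parse_with_default_scheme(url, default_scheme="http"):
    """Parse url, treating a scheme-less string as a host followed by a path."""
    if "://" not in url:
        # Assumption: a string without "://" begins with the host name.
        url = "{}://{}".format(default_scheme, url)
    return urlparse(url)

urls = ["http://www.myurlnumber1.com/foo", "www.myurlnumber1.com/bar"]

# Without the helper, the host of the second entry ends up in .path.
print(repr(urlparse(urls[1]).netloc))

unique = {
    "{}://{}".format(u.scheme, u.netloc)
    for u in map(parse_with_default_scheme, urls)
}
print(unique)
```

Both entries then collapse to the same scheme and netloc. The "://" check is deliberately crude and would misfire on a scheme-less URL that carries another URL in its query string.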

If you also want to include the path:

unique = {(u.netloc, u.path) for u in map(urlparse, urls)}

The table of attributes is listed in the docs:

Attribute  Index  Value                               Value if not present
scheme     0      URL scheme specifier                scheme parameter
netloc     1      Network location part               empty string
path       2      Hierarchical path                   empty string
params     3      Parameters for last path element    empty string
query      4      Query component                     empty string
fragment   5      Fragment identifier                 empty string
username          User name                           None
password          Password                            None
hostname          Host name (lower case)              None
port              Port number as integer, if present  None

You just need to use whichever parts you consider to make a URL unique.

In [1]: from urllib.parse import urlparse

In [2]: urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz", "www.url.com/baz-qux", "www.url.com/foo-bar?t=baz"]

In [3]: unique = {"".join((u.netloc, u.path)) for u in map(urlparse, urls)}

In [4]: print(unique)
{'www.url.com/baz-qux', 'www.url.com/foo-bar'}
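The sets above keep only the parsed parts and lose the original order. If you would rather keep the first full URL seen for each host and path, something along these lines might work. It is only a sketch: the function name dedupe_by_host_and_path is made up, and it assumes every URL carries a scheme.

```python
from urllib.parse import urlparse

def dedupe_by_host_and_path(urls):
    """Return the first URL seen for each (netloc, path) pair, in input order."""
    seen = set()
    kept = []
    for url in urls:
        parts = urlparse(url)
        key = (parts.netloc, parts.path)  # query and fragment are ignored
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept

urls = [
    "http://www.url.com/foo-bar",
    "http://www.url.com/foo-bar?t=baz",
    "http://www.url.com/baz-qux",
    "http://www.url.com/foo-bar?t=qux",
]
result = dedupe_by_host_and_path(urls)
print(result)
```

Here the two variants of /foo-bar with parameters are dropped because they share a key with the first entry.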

Answer 2

You could try adding another for loop, if you're OK with that. Something like:

for url in list:
    for i in range(len(list)):
        if url[:30] not in list[i]:
            print(url)

This compares every element against every other element to check for sameness. It's only an example; I'm sure you could make it more robust.
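As written, the loop prints a URL once for every element that does not contain its prefix, so most URLs are printed many times. One possible way to tighten it up is to remember the prefixes already seen and keep only the first URL for each. This is a sketch, not a drop-in fix; the name dedupe_by_prefix and the prefix_len parameter are made up for illustration.

```python
def dedupe_by_prefix(urls, prefix_len=30):
    """Keep the first URL for each distinct prefix of prefix_len characters."""
    seen_prefixes = set()
    kept = []
    for url in urls:
        prefix = url[:prefix_len]  # a shorter URL is compared as a whole
        if prefix not in seen_prefixes:
            seen_prefixes.add(prefix)
            kept.append(url)
    return kept

urls = [
    "http://www.myurlnumber1.com/foo?a=1",
    "http://www.myurlnumber1.com/foo?a=2",
    "http://www.myurlnumber2.com/foo?a=1",
]
result = dedupe_by_prefix(urls)
print(result)
```

A fixed prefix length is fragile: http://www.myurlnumber1.com is only 27 characters long, so it would not match the first 30 characters of http://www.myurlnumber1.com/foo+%bar%baz%qux and both would be kept. Comparing parsed host and path avoids that problem.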

Check elements in a list by substring
