Question

I have a large number of URLs. Some of them are similar to each other, i.e. they represent similar sets of pages. For example,

    http://example.com/product/1/
    http://example.com/product/2/
    http://example.com/product/40/
    http://example.com/product/33/

are similar. Similarly,

    http://example.com/showitem/apple/
    http://example.com/showitem/banana/
    http://example.com/showitem/grapes/

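are also similar. So I need to represent them as http://example.com/product/(Integers)/ where (Integers) = 1,2,40,33, and http://example.com/showitem/(strings)/ where (strings) = apple,banana,grapes ... and so on.

Is there a built-in function or library in Python that can find such similar URLs in a large mixed set of URLs? How can this be done efficiently? Please advise. Thanks in advance.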

Answer 1

Store the first part of the URL in a string and handle only the IDs, for example:

    In [1]: PRODUCT_URL = 'http://example.com/product/%(id)s/'

    In [2]: _ids = '1 2 40 33'.split()  # split string into list of IDs

    In [3]: for id in _ids:
       ...:     print(PRODUCT_URL % {'id': id})
       ...:
    http://example.com/product/1/
    http://example.com/product/2/
    http://example.com/product/40/
    http://example.com/product/33/

The statement print(PRODUCT_URL % {'id': id}) uses Python string formatting to build the product URL from the id variable passed in.

Update

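I see you have changed your question. The solution to your problem is domain-specific and depends on your data set. There are several approaches, some more manual than others. One such approach is to take the top-level part of the URL, i.e. extract the domain: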

    In [7]: _url = 'http://example.com/product/33/'  # url we're testing with

    In [8]: '/'.join(_url.split('/')[:3])  # get domain
    Out[8]: 'http://example.com'

    In [9]: '/'.join(_url.split('/')[:4])  # get domain + first URL sub-part
    Out[9]: 'http://example.com/product'

The [:3] and [:4] above simply slice the list produced by split('/').

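You can use the result as a key in a dict and increment its value each time you encounter that URL part, then move on to the next URL. Again, the solution depends on your data. If it gets more complicated than the above, I suggest you look at regular expressions, as the other answers recommend.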
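The dict idea above could be sketched roughly as follows. This is only a minimal illustration, not part of the original answer: the function name group_by_prefix and the depth parameter are made up for the example, and it stores the varying parts in a list per prefix rather than a plain counter, so that both the count and the values are available. It assumes every URL has the shape scheme://host/section/value/.

```python
from collections import defaultdict


def group_by_prefix(urls, depth=4):
    """Group URLs by their first `depth` '/'-separated parts.

    With depth=4 the key is 'scheme://host/first-path-part'.
    (Hypothetical helper, named only for this illustration.)
    """
    groups = defaultdict(list)
    for url in urls:
        parts = url.split('/')
        prefix = '/'.join(parts[:depth])
        # Whatever follows the prefix is treated as the varying part.
        remainder = '/'.join(parts[depth:]).strip('/')
        groups[prefix].append(remainder)
    return dict(groups)


urls = [
    'http://example.com/product/1/',
    'http://example.com/showitem/apple/',
    'http://example.com/product/2/',
    'http://example.com/product/40/',
    'http://example.com/showitem/banana/',
    'http://example.com/product/33/',
    'http://example.com/showitem/grapes/',
]

groups = group_by_prefix(urls)
for prefix, values in groups.items():
    # len(values) is the count the dict would otherwise hold directly.
    print(prefix, len(values), values)
```

If the interesting part of the URL sits deeper in the path, a different depth would be needed, which is where the data-dependence mentioned above comes in.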

Answer 2

You can use regular expressions to handle this. See the Python documentation for how to work with them.

You can also look at how Django implements this in its routing system.

Answer 3

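I am not sure what you are looking for. It sounds like you want something that matches URLs. If that is indeed what you want, I suggest using something built on regular expressions. An example can be found here.

I also suggest you take a look at Django and its routing system.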

Answer 4

Not in Python, but I have created a Ruby library (and an accompanying application):

https://rubygems.org/gems/LinkGrouper

It works on all links (there is no need to know any patterns in advance).

Grouping a list of similar URLs in Python
