Extracting a segment from a URL in Python

Time: 2014-09-21 21:23:12

Tags: python urllib2

I am iterating over multiple URLs in a csv file; the URLs have the following structure:

http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21
http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml

etc.

I need to get the article category (the part after the fourth slash of the path, in this case "AMSTERDAM-CENTRUM" and "POLITIEK") and append each one to a list.

I am using urllib2:

reader=CsvUnicodeReader(open("my.csv","r"))
for row in reader:
    url = row[0]
    req=urllib2.Request(url)

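Is there a way to parse the URL?

4 answers:

Answer 0: (score: 2)

You can use urlparse.urlparse to split the URL into its components and reliably extract the path component, then use a regular expression to extract the category part of the path you are interested in: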

from urlparse import urlparse
import re


URLS = ["http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21",
        "http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml"]

# Raw string, so the backslash in \d is passed to the regex engine unchanged
pattern = re.compile(r"/parool/nl/\d*/(.*?)/article/detail/.*$")


for url in URLS:
    parsed = urlparse(url)
    match = pattern.match(parsed.path)
    if match:
        category = match.group(1)
        print category

Output:

AMSTERDAM-CENTRUM
POLITIEK

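Notes on the regular expression:

  • \d* matches any digit (0-9) zero or more times
  • /(.*?)/ matches any character zero or more times between two slashes, non-greedily, and creates a group for the part between the slashes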
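
The pattern described in these notes can be checked in isolation. The following is only a minimal sketch, assuming Python 3 (where urlparse lives in urllib.parse rather than in the urlparse module); the helper name extract_category and the "/about" URL are made up for illustration and are not part of the original answer.

```python
import re
from urllib.parse import urlparse  # Python 3 location of urlparse (assumption)

# Same pattern as in the answer, written as a raw string.
PATTERN = re.compile(r"/parool/nl/\d*/(.*?)/article/detail/.*$")


def extract_category(url):
    """Return the category segment of an article URL, or None if it does not match."""
    match = PATTERN.match(urlparse(url).path)
    return match.group(1) if match else None


print(extract_category(
    "http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21"))
# A URL without the /article/detail/ part does not match and yields None.
print(extract_category("http://www.parool.nl/about"))
```

Returning None for non-matching URLs keeps the calling loop simple: it can skip those rows instead of failing on them.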

Answer 1: (score: 1)

If all the URLs have a similar structure, you can simply use:

url.rsplit('/')[6]
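
To see why index 6 is the right one, it can help to print the pieces the split produces. This is a small sketch, assuming Python 3 and a full URL that includes the http:// scheme. Note that rsplit('/') without a maxsplit argument returns the same list as split('/').

```python
url = "http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21"

parts = url.split('/')
# "http:" and the empty string between the two slashes of "//" occupy
# indices 0 and 1, so the category lands at index 6.
for index, part in enumerate(parts[:7]):
    print(index, repr(part))

category = parts[6]
print(category)
```

If a URL in the file lacks the scheme, the category moves to index 4 instead, which is one reason parsing the URL first is more robust.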

Answer 2: (score: 0)

You don't really need a regular expression.

>>> import csv
>>> a=[]
>>> with open('in','r') as f:
...     r=csv.reader(f,delimiter='/')
...     for row in r:
...             a.append(row[6])
... 
>>> a
['AMSTERDAM-CENTRUM', 'POLITIEK']

Alternatively, read each row normally and split its first field:

>>> a=[]
>>> with open('in','r') as f:
...     r=csv.reader(f)
...     for row in r:
...             a.append(row[0].split('/')[6])
... 
>>> a
['AMSTERDAM-CENTRUM', 'POLITIEK']

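Answer 3: (score: 0)

You can use the urlparse module to parse the URL and read its path attribute, then split the path on '/' with split('/') and use index [4] to access the fifth field, which is the article category. (The first field is an empty string, because the path starts with a slash.)

Demo: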

>>> from urlparse import urlparse
>>> your_url=['http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21','http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml']
>>> [urlparse(ul).path.split('/')[4] for ul in your_url]
['AMSTERDAM-CENTRUM', 'POLITIEK']
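
The same idea can be combined with the CSV loop from the question. What follows is a minimal sketch under stated assumptions: Python 3 (urllib.parse instead of urlparse), the standard csv module in place of the question's CsvUnicodeReader, and an in-memory stand-in for my.csv. The function name categories_from_csv is hypothetical.

```python
import csv
import io
from urllib.parse import urlparse  # Python 3 location of urlparse (assumption)


def categories_from_csv(csv_file):
    """Collect the category (fifth field of the split path) from the first column."""
    categories = []
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        segments = urlparse(row[0]).path.split('/')
        # segments[0] is '' because the path starts with '/'
        if len(segments) > 4:
            categories.append(segments[4])
    return categories


# Stand-in for open("my.csv"); the rows mirror the URLs from the question.
sample = io.StringIO(
    "http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21\n"
    "http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/"
    "VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml\n"
)
print(categories_from_csv(sample))
```

The length check means a URL with a shorter path is skipped rather than raising an IndexError; whether skipping is the right behaviour depends on the data.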