I am iterating over several URLs stored in a CSV file. The URLs have the following structure:
http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21
http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml
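and so on.
I need to get the article category (the part after the fourth slash of the path, "AMSTERDAM-CENTRUM" and "POLITIEK" in these cases) and append each one to a list.
I am using urllib2: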
reader = CsvUnicodeReader(open("my.csv", "r"))
for row in reader:
    url = row[0]
    req = urllib2.Request(url)
Is there a way to parse the URL?
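Answer 0 (score: 2)
You can use urlparse.urlparse to split the URL into its components and reliably extract the path component, then use a regular expression to pull out the category part of the path: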
from urlparse import urlparse  # Python 2 module; Python 3 has it in urllib.parse
import re

URLS = ["http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21",
        "http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml"]

# Raw string so that \d reaches the regex engine unchanged.
pattern = re.compile(r"/parool/nl/\d*/(.*?)/article/detail/.*$")

for url in URLS:
    parsed = urlparse(url)
    # Match against the path only, not the scheme or host.
    match = pattern.match(parsed.path)
    if match:
        category = match.group(1)
        print category
Output:
AMSTERDAM-CENTRUM
POLITIEK
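For readers on Python 3, where urlparse lives in urllib.parse, the same idea might look roughly like this. It is only a sketch: the helper name extract_category and the constant CATEGORY_RE are made up for illustration, and it assumes every article URL follows the /parool/nl/<number>/<category>/article/detail/ layout shown in the question.

```python
import re
from urllib.parse import urlparse  # Python 3 location of urlparse

# Same pattern as in the answer, written as a raw string.
CATEGORY_RE = re.compile(r"/parool/nl/\d*/(.*?)/article/detail/.*$")


def extract_category(url):
    """Return the category segment of the URL path, or None if it does not match.

    Hypothetical helper, not part of any library.
    """
    match = CATEGORY_RE.match(urlparse(url).path)
    return match.group(1) if match else None


urls = [
    "http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21",
    "http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml",
]

# Collect the categories into a list, as the question asks.
categories = [extract_category(u) for u in urls]
print(categories)
```

Returning None for a non-matching URL lets the caller decide whether to skip the row or report it.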
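Notes on the regular expression:
\d* matches any digit (0-9) zero or more times.
/(.*?)/ matches any character zero or more times between two slashes, non-greedily, and creates a group for the part between the slashes.
Answer 1 (score: 1)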
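If all the URLs have a similar structure, you can simply use url.rsplit('/')[6]. Here the whole URL is split, so "http:" and the empty string between the two slashes after it count as the first two pieces.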
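To see where index 6 comes from, it may help to print the pieces. A small sketch, using Python 3 print syntax and one of the URLs from the question:

```python
url = "http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21"

# Without a maxsplit argument, rsplit('/') yields the same list as split('/').
parts = url.rsplit('/')

# Show the first seven pieces with their indexes.
for index, part in enumerate(parts[:7]):
    print(index, repr(part))

category = parts[6]
print(category)
```

The approach breaks as soon as a URL has a different number of leading segments, so it is only safe when the structure really is uniform.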
Answer 2 (score: 0)
You don't really need a regular expression.
>>> import csv
>>> # Option 1: let csv.reader split each line on '/' directly.
>>> a=[]
>>> with open('in','r') as f:
...     r=csv.reader(f,delimiter='/')
...     for row in r:
...         a.append(row[6])
...
>>> a
['AMSTERDAM-CENTRUM', 'POLITIEK']
>>> # Option 2: read the rows normally and split the first column on '/'.
>>> a=[]
>>> with open('in','r') as f:
...     r=csv.reader(f)
...     for row in r:
...         a.append(row[0].split('/')[6])
...
>>> a
['AMSTERDAM-CENTRUM', 'POLITIEK']
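Answer 3 (score: 0)
You can use the urlparse module to parse the URL and read its path attribute, which contains the article category. Then call split('/') to split the path on '/' and use index [4] to access the fifth field (the first field is the empty string before the leading slash).
Demo: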
>>> from urlparse import urlparse
>>> your_url=['http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21',
...           'http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml']
>>> [urlparse(ul).path.split('/')[4] for ul in your_url]
['AMSTERDAM-CENTRUM', 'POLITIEK']
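Putting this together with the CSV loop from the question, a guarded version could look something like the sketch below. It assumes Python 3 and one URL in the first column of each row; io.StringIO stands in for the real my.csv so the example runs on its own, and the third row is an invented URL with no category, included only to show the length check.

```python
import csv
import io
from urllib.parse import urlparse

# Stand-in for the contents of my.csv (assumption: one URL per row, first column).
csv_text = (
    "http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21\n"
    "http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml\n"
    "http://www.parool.nl/\n"
)

categories = []
for row in csv.reader(io.StringIO(csv_text)):
    if not row:
        continue  # skip blank lines
    # path.split('/') starts with '' because the path begins with a slash.
    segments = urlparse(row[0]).path.split('/')
    # Only index into the list when the path is long enough.
    if len(segments) > 4 and segments[4]:
        categories.append(segments[4])

print(categories)
```

With a real file, replacing io.StringIO(csv_text) with an open file object should be the only change needed.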