I have a CSV file with 10,000 rows. Each row has 8 columns. One of these columns contains text like the following:
this is a row: http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
this is a row: http://yetanotherdomain.net
this is a row: https://hereisadomain.org | some_text
I am currently accessing the data in this column like this:
import re

# csv_reader is assumed to be a csv.reader object over the file
for row in csv_reader:
    the_url = row[3]
    # this regex is used to find the hrefs
    href_regex = re.findall(r'(?:http|ftp)s?://.*', the_url)
    for link in href_regex:
        print(link)
Output of the print statement:
http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
http://yetanotherdomain.net
https://hereisadomain.org | some_text
How can I get only the URLs?
http://somedomain.com
http://someanotherdomain.com
http://yetanotherdomain.net
https://hereisadomain.org
Answer 0 (score: 2)
Answer 1 (score: 1)
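Don't end the pattern by repeating any character. The trailing .* is greedy and matches everything up to the end of the line: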
'(?:http|ftp)s?://.*'
                  ^
Instead, repeat any character except a space, so that the match stops at the end of the URL:
'(?:http|ftp)s?://[^ ]*'
                  ^^^^
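As a quick check, here is a minimal sketch that applies the corrected pattern to the three sample values from the question. The names `rows`, `URL_PATTERN`, and `extract_urls` are made up for illustration. In the real script the text would come from `row[3]` of the CSV reader.

```python
import re

# Sample values copied from the question; in practice these come from row[3].
rows = [
    "this is a row: http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text",
    "this is a row: http://yetanotherdomain.net",
    "this is a row: https://hereisadomain.org | some_text",
]

# [^ ]* repeats any character except a space, so each match ends at the first space.
URL_PATTERN = re.compile(r'(?:http|ftp)s?://[^ ]*')


def extract_urls(text):
    """Return every URL found in text (hypothetical helper for this example)."""
    return URL_PATTERN.findall(text)


links = [link for row in rows for link in extract_urls(row)]
for link in links:
    print(link)
```

This assumes, as in the sample data, that a URL is always followed by a space or the end of the line. If tabs or other whitespace can follow a URL, `\S*` (any non-whitespace character) may be a safer choice than `[^ ]*`.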