Question

I have a CSV file with 10,000 rows. Each row has 8 columns. One of those columns contains text similar to the following:

this is a row:   http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
this is a row:   http://yetanotherdomain.net
this is a row:   https://hereisadomain.org | some_text

I am currently accessing the data in this column like this:

for row in csv_reader:
    the_url = row[3]

    # this regex is used to find the hrefs
    href_regex = re.findall('(?:http|ftp)s?://.*', the_url)
    for link in href_regex:
        print(link)

Output of the print statement:

http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
http://yetanotherdomain.net
https://hereisadomain.org | some_text

How do I get only the URLs?

http://somedomain.com
http://someanotherdomain.com 
http://yetanotherdomain.net
https://hereisadomain.org

Answer 1

Simply change your pattern to:

\b(?:http|ftp)s?://\S+

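Instead of matching .*, which greedily consumes the rest of the line, match one or more non-whitespace characters with \S+. You may also want to add a word boundary (\b) before the non-capturing group.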
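As a minimal sketch of how this pattern behaves, the snippet below runs it over the three sample values from the question. The names URL_PATTERN, cells, and urls are illustrative and not part of the original code.

```python
import re

# Pattern from this answer: word boundary, scheme, then non-whitespace characters.
URL_PATTERN = re.compile(r'\b(?:http|ftp)s?://\S+')

# Sample cell values copied from the question (stand-ins for row[3]).
cells = [
    'http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text',
    'http://yetanotherdomain.net',
    'https://hereisadomain.org | some_text',
]

# findall returns the full match because the only group is non-capturing.
urls = [link for cell in cells for link in URL_PATTERN.findall(cell)]
for link in urls:
    print(link)
```

Because each match stops at the first whitespace character, the text after " | " is no longer included.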


Answer 2

Don't repeat "any character" (the dot) at the end:

'(?:http|ftp)s?://.*'
                  ^

Instead, repeat any character except a space, so that the match stops at the end of the URL:

'(?:http|ftp)s?://[^ ]*'
                  ^^^^
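A minimal sketch of this pattern applied inside a CSV loop like the one in the question. The in-memory sample and the filler column values (a, b, c, ...) are hypothetical stand-ins for the real file; only the column index 3 comes from the question.

```python
import csv
import io
import re

# Pattern from this answer: after the scheme, consume everything up to the next space.
URL_PATTERN = re.compile(r'(?:http|ftp)s?://[^ ]*')

# Hypothetical in-memory stand-in for the CSV file; column index 3 holds the text.
sample = io.StringIO(
    'a,b,c,http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text,e,f,g,h\n'
    'a,b,c,http://yetanotherdomain.net,e,f,g,h\n'
    'a,b,c,https://hereisadomain.org | some_text,e,f,g,h\n'
)

urls = []
for row in csv.reader(sample):
    # Collect every URL found in the fourth column of this row.
    urls.extend(URL_PATTERN.findall(row[3]))

for link in urls:
    print(link)
```

Note that [^ ] excludes only the space character, so tabs and newlines would still be matched; if the column can contain those, \S+ is the safer choice.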

Output only the text matched by the regex pattern
