Question

I have text files containing a large number of URLs, but each URL ends with a timestamp, which I don't need.

    http://techcrunch.com/2012/02/10/vevo-ceo-tries-to-explain-their-hypocritical-act-of-piracy-at-sundance/)16:55:40
    http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45

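I thought a regular expression in Python would help me get rid of it, but I could also use Python's string split and replace operations to strip the trailing timestamp, which gives output like the following for a given URL: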

    >>> url.split(")")[0]
    'http://techcrunch.com/2012/04/30/edmodo-hits-7m/'

My question is: in terms of space and time, which performs better, the regular expression approach or Python's string methods? Or is there some better way?

Answer 1

I wouldn't use a regex for a task like this; it is simple enough without one:

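for line in lines:
    # Keep everything before the first ")".
    print(line.split(')')[0])

Or, if the URL itself may contain a ")":

for line in lines:
    # Drop only the piece after the last ")".
    print(')'.join(line.split(')')[:-1]))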
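
A small sketch of the two variants above on sample lines. The second sample line is made up for illustration (a URL that contains a ")"), and rsplit(')', 1) is used here as an equivalent of the split/join form, since it splits only at the last ")".

```python
# Sample lines; the second one is a hypothetical URL that contains ")".
lines = [
    "http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45",
    "http://example.com/wiki/Foo_(bar)/)09:01:02",
]

def strip_timestamp(line):
    # Split once, from the right, at the last ")" and keep the left part.
    return line.rsplit(')', 1)[0]

def strip_timestamp_join(line):
    # The split/join form from the answer: drop only the last ")"-separated piece.
    return ')'.join(line.split(')')[:-1])

for line in lines:
    print(strip_timestamp(line))
```

Both helpers return the same result on these two lines; they differ only when a line contains no ")" at all, where the split/join form returns an empty string.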

Answer 2

This should be faster than looping over every line:

import re

my_str = "http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45"
# Capture everything before the trailing "/)HH:MM:SS".
print(re.findall(r'([\w./:\d-]+)/\)\d\d:\d\d:\d\d', my_str))
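
Whether the regex is actually faster depends on the data, so it is worth measuring. Below is a rough sketch using timeit, assuming the whole file is held in one string. The input is synthetic (the two sample lines from the question, repeated), the function names are only illustrative, and no timings are claimed here.

```python
import re
import timeit

# Synthetic input: the two sample lines from the question, repeated.
sample = (
    "http://techcrunch.com/2012/02/10/vevo-ceo-tries-to-explain-their-hypocritical-act-of-piracy-at-sundance/)16:55:40\n"
    "http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45\n"
) * 1000

pattern = re.compile(r'([\w./:\d-]+)/\)\d\d:\d\d:\d\d')

def with_regex(text):
    # One findall pass over the whole text.
    return pattern.findall(text)

def with_rsplit(text):
    # Per-line string method, splitting at the last "/".
    return [line.rsplit('/', 1)[0] for line in text.splitlines()]

# Both approaches should agree before their timings are compared.
assert with_regex(sample) == with_rsplit(sample)

print("regex :", timeit.timeit(lambda: with_regex(sample), number=20))
print("rsplit:", timeit.timeit(lambda: with_rsplit(sample), number=20))
```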

Answer 3

Another possibility:

for line in lines:
    # Everything before the last "/" is the URL; the rest is ")HH:MM:SS".
    url = line.rsplit('/', 1)[0]

Answer 4

If the part you want to remove has a fixed length, why not simply

L[:-9]

In Python, L[a:b] means the part of L (a list, string, or tuple) from index a up to index b, with b excluded.

If a is omitted, the slice starts at the beginning; if b is negative, it is counted from the end.

So L[:-9] means "everything in L except the last nine elements".
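
A quick sketch of that slice on one of the sample lines. The suffix ")15:18:45" is exactly nine characters. Lines read from a file usually still carry a trailing newline, so this sketch assumes the newline is stripped first.

```python
line = "http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45\n"

# Drop the trailing newline, then the fixed-length ")HH:MM:SS" suffix (9 characters).
url = line.rstrip("\n")[:-9]
print(url)  # http://techcrunch.com/2012/04/30/edmodo-hits-7m/
```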

Answer 5

import re

# If you want to remove the extra / at the end of the url, use this regex instead:
# r"^(?P<url>.*[^/])/?\)(?P<timestamp>\d{2}:\d{2}:\d{2})$"
url_timestamp_pattern = re.compile(r"^(?P<url>.*)\)(?P<timestamp>\d{2}:\d{2}:\d{2})$")

# Iterate over the file lazily and close it automatically when done.
with open('urls.txt') as f:
    for line in f:
        match = url_timestamp_pattern.match(line)
        if match:
            print(match.group('url'))
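
To try the pattern without a urls.txt file, here is a small sketch that runs the alternative regex from the comment over an in-memory list. The list stands in for the file and is an assumption for illustration.

```python
import re

# The variant from the comment above: also drops a trailing "/" from the URL.
pattern = re.compile(r"^(?P<url>.*[^/])/?\)(?P<timestamp>\d{2}:\d{2}:\d{2})$")

# Stand-in for the file contents; the second line has no timestamp.
lines = [
    "http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45\n",
    "not a url line\n",
]

urls = []
for line in lines:
    match = pattern.match(line)
    if match:  # lines without a trailing timestamp are skipped
        urls.append(match.group('url'))

print(urls)
```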

How can I use a regular expression in Python to remove a timestamp from the end of a URL?
