Question

I put together the following regular expression to extract the image ID from a URL:

''' Parse the post details from the full story page '''
def parsePostFromPermalink(session, permalink):

    r = session.get('https://m.facebook.com{0}'.format(permalink))
    dom = pq(r.content)

    # Parse the images, extract the ID's, and construct large image URL
    images = []
    for img in dom('a img[src*="jpg"]').items():
        if img.attr('src'):
            m = re.match(r'/([0-9_]+)n\.jpg/', img.attr('src'))
            images.append(m)
    return images

Example URL:

https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/13645330_275977022775421_8826465145232985957_n.jpg?efg=eyJpIjoiYiJ9&oh=ed5b4593ed9c8b6cfe683f9c6932acc7&oe=57EE1224

I want this part:

13645330_275977022775421_8826465145232985957

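I have tested it on regex101 and it works there: https://regex101.com/r/eS6eS7/2

img.attr('src') contains the correct URL and is not empty; I checked this. When I try to use m.group(0), I get an exception saying that group is not a function, because m is None.

What am I doing wrong?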

Answer 1

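Two problems:

The enclosing /.../ delimiters are not part of Python regex syntax
You should use search instead of match, because match only tries to match at the start of the string

Working example: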

>>> import re
>>> url = "https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/13645330_275977022775421_8826465145232985957_n.jpg?efg=eyJpIjoiYiJ9&oh=ed5b4593ed9c8b6cfe683f9c6932acc7&oe=57EE1224"
>>> re.search(r'([0-9_]+)n\.jpg', url).group(0)
'13645330_275977022775421_8826465145232985957_n.jpg'

If you only want the numeric part, use this instead (group(1), and note the additional _ in the pattern):

>>> re.search(r'([0-9_]+)_n\.jpg', url).group(1)
'13645330_275977022775421_8826465145232985957'
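
To see why the original call returned None, here is a small sketch that runs the three variants side by side on the example URL from the question. The variable names (original, with_slashes, fixed) are only for illustration and do not come from the question.

```python
import re

url = ("https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/s200x200/"
       "13645330_275977022775421_8826465145232985957_n.jpg"
       "?efg=eyJpIjoiYiJ9&oh=ed5b4593ed9c8b6cfe683f9c6932acc7&oe=57EE1224")

# Original call: the slashes are treated as literal characters, and
# match() only tries to match at position 0 of the string.
original = re.match(r'/([0-9_]+)n\.jpg/', url)

# search() scans the whole string, but the trailing "/" still cannot match,
# because ".jpg" is followed by "?" in this URL.
with_slashes = re.search(r'/([0-9_]+)n\.jpg/', url)

# No delimiters, search() instead of match(), and "_" kept outside the group.
fixed = re.search(r'([0-9_]+)_n\.jpg', url)

print(original)        # None
print(with_slashes)    # None
print(fixed.group(1))  # 13645330_275977022775421_8826465145232985957
```

Either problem on its own is enough to make the result None, which is why changing only match to search would not have helped.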

Answer 2

This is the correct Python code from Regex101 (there is a code generator on the left). Note that there are no slashes around the regular expression...

import re
p = re.compile(r'([\d_]+)n\.jpg')
test_str = u"https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/c3.0.103.105/p110x80/13700209_937389626383181_6033441713767984695_n.jpg?efg=eyJpIjoiYiJ9&oh=a0b90ec153211eaf08a6b7c4cc42fb3b&oe=581E2EB8"

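# findall returns the captured group, which still ends in "_" with this pattern
print(re.findall(p, test_str))  # ['13700209_937389626383181_6033441713767984695_']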

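I'm not sure how you ended up with m being None, but you may need to compile the pattern and match with the compiled object. Otherwise, try fixing the expression first.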
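
As a rough sketch of how a compiled pattern could be used in the loop from the question, the helper below returns None when a source has no ID, so the caller can skip it instead of calling group on None. IMAGE_ID_RE, extract_image_id, and the second URL are made up for this example, and the underscore is moved outside the group so that it is not captured.

```python
import re

# Compiled once; the "_" before "n.jpg" sits outside the group so it is not captured.
IMAGE_ID_RE = re.compile(r'([0-9_]+)_n\.jpg')


def extract_image_id(src):
    """Return the image ID found in src, or None if there is no such part."""
    m = IMAGE_ID_RE.search(src)
    return m.group(1) if m else None


srcs = [
    "https://scontent-lhr3-1.xx.fbcdn.net/v/t1.0-0/cp0/e15/q65/c3.0.103.105/p110x80/"
    "13700209_937389626383181_6033441713767984695_n.jpg"
    "?efg=eyJpIjoiYiJ9&oh=a0b90ec153211eaf08a6b7c4cc42fb3b&oe=581E2EB8",
    "https://example.com/static/logo.jpg",  # made-up URL without an ID
]

# Keep only the sources that actually matched.
images = [image_id for image_id in map(extract_image_id, srcs) if image_id]
print(images)  # ['13700209_937389626383181_6033441713767984695']
```

In the original function, the same check would go inside the for loop, with img.attr('src') passed to the helper.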
