Question

I need a regular expression for matching links. This is what I have so far, followed by two example links:

re.compile('userpage')


href="www.example.com?u=userpage&as=233&p=1"
href="www.example.com?u=userpage&as=233&p=2"

I want to get all URLs that contain both u=userpage and p=1.

How can I modify the regular expression above to find both u=userpage and p=1?

Answer 1

If you want to use what is, in my opinion, a more appropriate approach than a regular expression:

from urllib.parse import urlparse, parse_qsl  # Python 3; in Python 2 both live in the urlparse module
urlparsed = urlparse('www.example.com?u=userpage&as=233&p=1')
# -> ParseResult(scheme='', netloc='', path='www.example.com', params='', query='u=userpage&as=233&p=1', fragment='')
qdict = dict(parse_qsl(urlparsed.query))
# -> {'u': 'userpage', 'as': '233', 'p': '1'}
qdict.get('p') == '1' and qdict.get('u') == 'userpage'
# -> True
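
The check above can be wrapped in a small helper and applied to a list of links. This is only a minimal sketch for Python 3: the helper name `wanted` and the sample list are made up for illustration. Note that `dict(parse_qsl(...))` keeps only the last value when a key is repeated in the query string.

```python
from urllib.parse import urlparse, parse_qsl

def wanted(href, u='userpage', p='1'):
    """Return True if the query string has u=<u> and p=<p> (hypothetical helper)."""
    qdict = dict(parse_qsl(urlparse(href).query))
    return qdict.get('u') == u and qdict.get('p') == p

# Sample links taken from the question.
hrefs = [
    'www.example.com?u=userpage&as=233&p=1',
    'www.example.com?u=userpage&as=233&p=2',
]
matching = [h for h in hrefs if wanted(h)]
print(matching)  # ['www.example.com?u=userpage&as=233&p=1']
```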

Answer 2

import lxml.html
from urllib.parse import urlparse, parse_qs  # Python 3; in Python 2 both live in the urlparse module

d = lxml.html.parse(...)
for link in d.xpath('//a/@href'):
    url = urlparse(link)
    if not url.query:
        continue
    # parse_qs maps each key to a list of values
    params = parse_qs(url.query)
    if 'userpage' in params.get('u', []) and '1' in params.get('p', []):
        print(link)
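
If lxml is not installed, the same idea can be tried with the standard library only. This is a sketch that swaps lxml for `html.parser`, not the approach of the answer itself; the class name `HrefCollector` and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.hrefs.extend(v for k, v in attrs if k == 'href' and v)

# Made-up sample markup; "&" is written as "&amp;" as HTML requires.
sample_html = '''
<a href="www.example.com?u=userpage&amp;as=233&amp;p=1">one</a>
<a href="www.example.com?u=userpage&amp;as=233&amp;p=2">two</a>
<a href="www.example.com">no query</a>
'''

collector = HrefCollector()
collector.feed(sample_html)

selected = []
for link in collector.hrefs:
    # Links without a query string yield an empty dict and are skipped.
    params = parse_qs(urlparse(link).query)
    if 'userpage' in params.get('u', []) and '1' in params.get('p', []):
        selected.append(link)
print(selected)
```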

Answer 3

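A regular expression is not a good choice here because 1) the params can appear in any order, and 2) you need extra checks on the query separators so that you don't match potential oddities like "flu=userpage", "sp=1", "u=userpage%20haha", or "s=123". (Note: I missed two of those cases on my first pass! So did others.) Also: 3) Python already has a good URL parsing library that does the work for you.

With a regular expression, you would need something clumsy like: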

import re
q = re.compile(r'([?&]u=userpage&(.*&)?p=1(&|$))|([?&]p=1&(.*&)?u=userpage(&|$))')
def matches_regex(href):
    return q.search(href) is not None

With urlparse, you can do it like this. urlparse gives you more than you need, but a helper function keeps the result simple:

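from urllib.parse import urlparse, parse_qs  # Python 3; in Python 2 both live in the urlparse module

def has_qparam(qs, key, value):
    return value in qs.get(key, [])

def matches_urlparse(href):
    qs = parse_qs(urlparse(href).query)
    return has_qparam(qs, 'u', 'userpage') and has_qparam(qs, 'p', '1')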
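
To see how the two approaches behave on the oddities listed above, they can be run side by side. This is a rough sketch for Python 3; the function names `by_regex` and `by_parse` and the test URLs are made up for illustration.

```python
import re
from urllib.parse import urlparse, parse_qs

q = re.compile(r'([?&]u=userpage&(.*&)?p=1(&|$))|([?&]p=1&(.*&)?u=userpage(&|$))')

def by_regex(href):
    return q.search(href) is not None

def by_parse(href):
    qs = parse_qs(urlparse(href).query)
    return 'userpage' in qs.get('u', []) and '1' in qs.get('p', [])

cases = [
    'www.example.com?u=userpage&as=233&p=1',   # wanted
    'www.example.com?p=1&as=233&u=userpage',   # wanted, params in the other order
    'www.example.com?u=userpage&as=233&p=2',   # wrong value for p
    'www.example.com?flu=userpage&p=1',        # key only ends in "u"
    'www.example.com?u=userpage&sp=1',         # key only ends in "p"
    'www.example.com?u=userpage%20haha&p=1',   # value only starts with "userpage"
]
results = [(href, by_regex(href), by_parse(href)) for href in cases]
for href, r, p in results:
    print(href, r, p)
```

On these inputs both functions accept the first two URLs and reject the rest, but the parsing version gets there without any hand-written separator checks.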

Answer 4

/((u=userpage).*?(p=1))|((p=1).*?(u=userpage))/

This will get all strings that contain the two bits you are looking for.
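
A quick check of this pattern in Python, assuming the `re` module and dropping the /.../ delimiters. The third URL is a made-up example; it shows that the pattern looks for the two substrings anywhere, so longer keys or values that happen to contain them also match.

```python
import re

# The pattern from this answer, without the /.../ delimiters.
loose = re.compile(r'((u=userpage).*?(p=1))|((p=1).*?(u=userpage))')

hit = bool(loose.search('www.example.com?u=userpage&as=233&p=1'))
miss = bool(loose.search('www.example.com?u=userpage&as=233&p=2'))
# Made-up URL: "flu=userpage" contains "u=userpage", "sp=123" contains "p=1".
accidental = bool(loose.search('www.example.com?flu=userpage&sp=123'))

print(hit)         # True
print(miss)        # False
print(accidental)  # True, although neither u=userpage nor p=1 is really present
```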

Answer 5

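To make sure you don't accidentally match parts like bu=userpage, u=userpagezap, p=111, or zap=1, you need to make generous use of the \b "word boundary" RE pattern element. That is: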

re.compile(r'\bp=1\b.*\bu=userpage\b|\bu=userpage\b.*\bp=1\b')

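The word-boundary elements in the RE pattern prevent the presumably undesirable "accidental" matches listed above. Of course, if they are not "undesirable" in your application, i.e. if you definitely do want to match p=123 and the like, you can easily remove some or all of the word-boundary elements above!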
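
The effect of \b can be checked against the cases named above. This is a small sketch with made-up URLs. The last case is an addition for illustration: \b treats any non-word character as a boundary, so a value such as userpage%20haha still matches.

```python
import re

pat = re.compile(r'\bp=1\b.*\bu=userpage\b|\bu=userpage\b.*\bp=1\b')

# Made-up URLs mapped to whether the pattern finds a match.
expected = {
    'www.example.com?u=userpage&as=233&p=1': True,
    'www.example.com?p=1&u=userpage': True,
    'www.example.com?bu=userpage&p=1': False,      # no boundary between "b" and "u"
    'www.example.com?u=userpagezap&p=1': False,    # no boundary between "e" and "z"
    'www.example.com?u=userpage&p=111': False,     # no boundary between "1" and "1"
    'www.example.com?u=userpage&zap=1': False,     # no boundary between "a" and "p"
    'www.example.com?u=userpage%20haha&p=1': True, # "%" counts as a boundary
}
actual = {url: pat.search(url) is not None for url in expected}
for url, found in actual.items():
    print(url, found)
```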

Answer 6

You could do this with string hacking, but you shouldn't. It is already in the standard library:

>>> import urllib.parse
>>> urllib.parse.parse_qs("u=userpage&as=233&p=1")
{'u': ['userpage'], 'as': ['233'], 'p': ['1']}

So:

import urllib.parse

def filtered_urls(urls):
    for url in urls:
        try:
            # split("?")[1] is the query string; it raises IndexError when there is no "?"
            attrs = urllib.parse.parse_qs(url.split("?")[1])
        except IndexError:
            continue

        # parse_qs maps each key to a list of values, so default to an empty list
        if "userpage" in attrs.get("u", []) and "1" in attrs.get("p", []):
            yield url

foo = ["www.example.com?u=userpage&as=233&p=1", "www.example.com?u=userpage&as=233&p=2"]

print(list(filtered_urls(foo)))

Note that this is Python 3; in Python 2, parse_qs lives in the urlparse module instead.
