Question

I have a list of URLs like the following:

http://example.com/php?id=2
https://example.com/?
http://example.com/ip/admin/navigate?
http://example.com/admin?page=2&id=3
https://www.google.com/#q=query

What I need to do is scan these URLs for query strings and output only the URLs that actually contain one. For example, the expected output is:

http://example.com/php?id=2
http://example.com/admin?page=2&id=3

This is what I came up with:

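def get_query_urls(path='textfile.txt'):
    res = []
    # Open read-only: 'a+' positions the file pointer at the end, so nothing would be read.
    with open(path, 'r') as data:
        for line in data:
            if '?' in line:
                res.append(line)
    return res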

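However, this catches everything that contains a ?, including this URL: https://example.com/? Is there a way to grab all links that have a query string while skipping those that only have a bare question mark?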

Answer 1

A simple approach is to check that the question mark is in the string but is not its last character.

You could also use a regular expression or some other approach, but I think this is the simplest solution.
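
The check described in this answer can be sketched roughly as follows. The function name `has_query` and the in-memory `urls` list are illustrative assumptions, not code from the original answer; whitespace is stripped first because lines read from a file end with a newline, which would otherwise hide a trailing question mark.

```python
def has_query(url):
    """Return True if url contains '?' and '?' is not its last character."""
    url = url.strip()  # drop the trailing newline left by file reads
    return '?' in url and not url.endswith('?')


# Sample data taken from the question.
urls = [
    "http://example.com/php?id=2",
    "https://example.com/?",
    "http://example.com/ip/admin/navigate?",
    "http://example.com/admin?page=2&id=3",
    "https://www.google.com/#q=query",
]

result = [u for u in urls if has_query(u)]
for u in result:
    print(u)
```

Note that this only looks at the position of the question mark; it does not verify that what follows is a well-formed query.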

Answer 2

Using a regular expression:

import re

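# Require a '?' followed by a key=value part before any '#'.
# A raw string avoids the invalid "\=" escape, and '#' is no longer accepted in
# place of '?', so a fragment such as "#q=query" is not treated as a query string.
query_regex = re.compile(r"[^?#]*\?[^#]*=.*")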
urls = """http://example.com/php?id=2
https://example.com/?
http://example.com/ip/admin/navigate?
http://example.com/admin?page=2&id=3
https://www.google.com/#q=query""".split("\n")

for url in urls:
    match = query_regex.match(url)
    if match:
        print(match.group())

Answer 3

This may give wrong results in some cases, but you could also test for the = sign:

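def get_query_urls(path='textfile.txt'):
    res = []
    # Open read-only: 'a+' positions the file pointer at the end, so nothing would be read.
    with open(path, 'r') as data:
        for line in data:
            if '=' in line:
                res.append(line)
    return res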
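
One case where the = test goes wrong is https://www.google.com/#q=query, where the = sits in the fragment rather than in a query string. As a hedged alternative, assuming the standard-library `urllib.parse.urlsplit` is acceptable here (the helper name `has_query_string` is hypothetical and not part of this answer), the parsed query component can be inspected directly:

```python
from urllib.parse import urlsplit


def has_query_string(url):
    """Return True if the URL's parsed query component is non-empty."""
    # urlsplit separates the query (after '?') from the fragment (after '#').
    return bool(urlsplit(url.strip()).query)


# Sample data taken from the question.
sample_urls = [
    "http://example.com/php?id=2",
    "https://example.com/?",
    "http://example.com/ip/admin/navigate?",
    "http://example.com/admin?page=2&id=3",
    "https://www.google.com/#q=query",
]

with_query = [u for u in sample_urls if has_query_string(u)]
for u in with_query:
    print(u)
```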

Find all URLs that contain a query string
