Question

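I have a small script in Python 3.7 (see the related question here) that scrapes links from a website (http://digesto.asamblea.gob.ni/consultas/coleccion/) and saves them in a list. Unfortunately, the scraped values are only partial links, and I have to trim them before I can use them as URLs.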

This is the relevant part of the script:

list_of_links = []    # will hold the scraped links
tld = 'http://digesto.asamblea.gob.ni'
current_url = driver.current_url   # for any links not starting with /
table_id = driver.find_element(By.ID, 'tableDocCollection')
rows = table_id.find_elements_by_css_selector("tbody tr") # get all table rows
for row in rows:
    row.find_element_by_css_selector('button').click()
    link = row.find_element_by_css_selector('li a[onclick*=pdf]').get_attribute("onclick") # href
    print(list_of_links)# trim
    if link.startswith('/'):
        list_of_links.append(tld + link)
    else:
        list_of_links.append(current_url + link)
    row.find_element_by_css_selector('button').click()

print(list_of_links)

How can I transform the list (only three entries are shown here as an example)

["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;"]

so that it looks like

["http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D", "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D", "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D"]

In short: taking the first link as an example, what I get from the website is essentially

http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;

and it needs to be trimmed to

http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D

How can I do this for the whole list in Python?

Answer 1

This should do the trick:

s = "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"
s = s.replace("/consultas/coleccion/window.open('", "").replace("');return false;", "")
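To apply the same two replacements to every entry, a list comprehension should be enough. This is a minimal sketch that assumes every entry has the shape shown in the question; the names PREFIX, SUFFIX and trimmed are chosen here for illustration and do not come from the original script:

```python
# Illustrative names (not from the original script): PREFIX, SUFFIX, trimmed.
PREFIX = "/consultas/coleccion/window.open('"
SUFFIX = "');return false;"

list_of_links = [
    "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;",
    "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D');return false;",
]

# Strip the JavaScript wrapper from every entry in one pass.
trimmed = [link.replace(PREFIX, "").replace(SUFFIX, "") for link in list_of_links]
print(trimmed)
```

Because str.replace leaves a string untouched when the search text is absent, entries that are already clean pass through unchanged.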

Answer 2

One approach is to split each string on /consultas/coleccion/window.open(', strip the excess from the second piece, and then join the two processed pieces to get the result.

This should do it:

new_links = []

for link in list_of_links:

    current_strings = link.split("/consultas/coleccion/window.open('")
    current_strings[1] = current_strings[1].split("');return")[0]
    new_link = current_strings[0] + current_strings[1]
    new_links.append(new_link)
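Note that the loop above raises an IndexError for any entry that does not contain the separator, because split then returns a single-element list. A possible variant uses str.partition, which always returns three parts; this is only a sketch, and MARKER and trim_link are hypothetical names introduced here:

```python
# Hypothetical names for illustration: MARKER, trim_link.
MARKER = "/consultas/coleccion/window.open('"

def trim_link(link):
    """Return the cleaned URL, or the input unchanged if MARKER is absent."""
    head, sep, tail = link.partition(MARKER)
    if not sep:
        # MARKER was not found, so there is nothing to trim.
        return link
    # Keep only the part of the tail before the closing quote of window.open('...').
    return head + tail.split("'", 1)[0]

list_of_links = [
    "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;",
    "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D",
]
new_links = [trim_link(link) for link in list_of_links]
print(new_links)
```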

Answer 3

You can use a regular expression to split the URLs in the list and let urllib.parse.urljoin() do the rest for you:

import re
from urllib.parse import urljoin

PATTERN = r"^(\S+)window\.open\('(\S+)'"  # the dot is escaped so it only matches a literal "."

links = ["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"]
result = "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D"

for link in links:
    m = re.match(PATTERN, link)
    if m:  # re.match returns None for entries that do not have the expected shape
        #  m.groups() is now: ('http://digesto.asamblea.gob.ni/consultas/coleccion/', '/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D')
        newLink = urljoin(*m.groups())
        print(newLink)
        assert newLink == result

This prints:

http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D
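urljoin works here because the second argument starts with /, so it replaces the whole path of the base URL instead of being appended to it. The same idea could also be applied to the raw onclick value before any prefix is added. The following is a sketch under that assumption; ONCLICK_PATTERN and onclick_to_url are hypothetical names, not part of the original script:

```python
import re
from urllib.parse import urljoin

# Hypothetical helper names for illustration: ONCLICK_PATTERN, onclick_to_url.
ONCLICK_PATTERN = re.compile(r"window\.open\('([^']+)'")

def onclick_to_url(onclick, base_url):
    """Extract the path passed to window.open('...') and resolve it against base_url."""
    match = ONCLICK_PATTERN.search(onclick)
    if match is None:
        return None  # the attribute does not have the expected shape
    return urljoin(base_url, match.group(1))

base = "http://digesto.asamblea.gob.ni/consultas/coleccion/"
onclick = "window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"
url = onclick_to_url(onclick, base)
print(url)
```

With this approach the startswith('/') check in the scraping loop would no longer be needed, since urljoin handles both absolute and relative paths.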

Answer 4

You can use a regular expression for this:

Consider the following code:

import re
out = list()
lst = ["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;"]

for el in lst:
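    # /consultas/coleccion is part of the discarded text; otherwise it would remain in the result
    temp = re.sub(r"(.*?)/consultas/coleccion/window\.open\('(.*?)'\).*", r"\1\2", el)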
    out.append(temp)
    print(temp)

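The sub function replaces the parts of a string that match the given pattern. The pattern reads as follows:

The first (.*?): captures everything before the literal part of the pattern (group \1, kept).
The literal part ending in window.open\(': this text must be present in the input string, but it is not kept.
The second (.*?): captures everything after that literal part, up to the closing ') (written '\) in the pattern); this is group \2, also kept.
The trailing .*: consumes the rest of the string, which is discarded.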
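To see what each group captures, it can help to inspect the match object directly. The sketch below uses a variant of the pattern whose literal part also covers /consultas/coleccion (an assumption based on the target URLs in the question, where that path segment is absent); the variable names are illustrative:

```python
import re

# Variant of the pattern: /consultas/coleccion belongs to the discarded literal part,
# so it does not end up in the result.
pattern = re.compile(r"(.*?)/consultas/coleccion/window\.open\('(.*?)'\).*")

el = "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"

m = pattern.match(el)
print(m.group(1))  # text before the literal part (scheme and host)
print(m.group(2))  # the path passed to window.open('...')

# Replacing the whole match with r"\1\2" concatenates the two kept groups.
cleaned = pattern.sub(r"\1\2", el)
print(cleaned)
```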

Trimming links in a list with Python 3.7
