Trimming links in a list with Python 3.7

Time: 2019-01-12 19:21:28

Tags: python python-3.x

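I have a small script in Python 3.7 (see the related question) that scrapes links from a website (http://digesto.asamblea.gob.ni/consultas/coleccion/) and saves them in a list. Unfortunately, the scraped values are only partial links, so I have to trim them before I can use them as links.

This is the relevant part of the script: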

list_of_links = []    # will hold the scraped links
tld = 'http://digesto.asamblea.gob.ni'
current_url = driver.current_url   # for any links not starting with /
table_id = driver.find_element(By.ID, 'tableDocCollection')
rows = table_id.find_elements_by_css_selector("tbody tr") # get all table rows
for row in rows:
    row.find_element_by_css_selector('button').click()
    link = row.find_element_by_css_selector('li a[onclick*=pdf]').get_attribute("onclick") # href
    print(list_of_links)# trim
    if link.startswith('/'):
        list_of_links.append(tld + link)
    else:
        list_of_links.append(current_url + link)
    row.find_element_by_css_selector('button').click()

print(list_of_links)

How can I manipulate the list (shown here with only three entries as an example)

["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;"]

so that it looks like this:

["http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D", "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D", "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D"]

In short: taking the first link as an example, what I get from the website is essentially

http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;

and it needs to be trimmed to

http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D

How can I achieve this in Python for the whole list?

4 answers:

Answer 0: (score: 1)

This should do the trick:

s = "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"
s = s.replace("/consultas/coleccion/window.open('", "").replace("');return false;", "")
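
The snippet above handles a single string. To apply the same two replacements to every entry, a list comprehension is one option. This is only a minimal sketch: the sample data is copied from the question (shortened to two entries), and the names PREFIX, SUFFIX and trimmed_links are chosen here for illustration.

```python
# Sample data copied from the question (two entries for brevity).
list_of_links = [
    "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;",
    "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D');return false;",
]

PREFIX = "/consultas/coleccion/window.open('"  # text between the host and the real path
SUFFIX = "');return false;"                    # trailing JavaScript

# trimmed_links is a hypothetical name, not part of the original script.
trimmed_links = [
    link.replace(PREFIX, "").replace(SUFFIX, "")
    for link in list_of_links
]

for link in trimmed_links:
    print(link)
```

Note that replace() leaves a string untouched when the prefix or suffix is missing, so entries that do not follow this exact shape pass through unchanged.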

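Answer 1: (score: 1)

One approach is to split each string on /consultas/coleccion/window.open(', strip the unwanted tail from the second piece, and then concatenate the two processed pieces to get the result.

This should do it: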

new_links = []

for link in list_of_links:

    current_strings = link.split("/consultas/coleccion/window.open('")
    current_strings[1] = current_strings[1].split("');return")[0]
    new_link = current_strings[0] + current_strings[1]
    new_links.append(new_link)
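
The loop above indexes current_strings[1], which raises an IndexError for any entry that does not contain the separator. If that can happen, a variant based on str.partition may be more forgiving. This is a sketch under that assumption; trim_link and MARKER are hypothetical names, and the third list entry is made up to show the fallback.

```python
MARKER = "/consultas/coleccion/window.open('"

def trim_link(link):
    """Return the cleaned URL, or the link unchanged if MARKER is absent."""
    head, sep, tail = link.partition(MARKER)
    if not sep:
        # MARKER was not found, so there is nothing to trim.
        return link
    # Keep only the path that precedes the trailing JavaScript.
    path = tail.split("');return")[0]
    return head + path

list_of_links = [
    "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;",
    "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;",
    "http://digesto.asamblea.gob.ni/already/clean",  # made-up entry without the marker
]

new_links = [trim_link(link) for link in list_of_links]
print(new_links)
```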

Answer 2: (score: 1)

You can use a regular expression to split each URL in the list into its two parts and let urllib.parse.urljoin() do the rest for you:

import re
from urllib.parse import urljoin

PATTERN = r"^(\S+)window\.open\('(\S+)'"  # group 1: base URL, group 2: path inside window.open('...')

links = ["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"]
result = "http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D"

for link in links:
    match = re.match(PATTERN, link)
    if match:  # skip entries that do not follow the expected pattern
        # match.groups() is now: ('http://digesto.asamblea.gob.ni/consultas/coleccion/', '/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D')
        newLink = urljoin(*match.groups())
        print(newLink)
        assert newLink == result

Output:

http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D
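
The same idea could also be applied one step earlier, to the raw onclick value, so that malformed links never enter the list. The sketch below assumes the attribute looks like window.open('...');return false; (inferred from the list in the question) and that current_url is the collection page; onclick_to_url and ONCLICK_URL are hypothetical names.

```python
import re
from urllib.parse import urljoin

# Captures the first single-quoted argument of a window.open(...) call.
ONCLICK_URL = re.compile(r"window\.open\('([^']+)'")

def onclick_to_url(onclick, base_url):
    """Resolve the URL found in an onclick handler against base_url.

    Returns None when the handler contains no window.open('...') call.
    """
    match = ONCLICK_URL.search(onclick)
    if match is None:
        return None
    # urljoin handles both root-relative ('/a/b') and relative ('a/b') paths.
    return urljoin(base_url, match.group(1))

# Assumed values, modelled on the question.
current_url = "http://digesto.asamblea.gob.ni/consultas/coleccion/"
onclick = "window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;"

print(onclick_to_url(onclick, current_url))
```

Because urljoin resolves the path against the base URL, the startswith('/') branch in the question's loop would no longer be needed with this approach.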

Answer 3: (score: 1)

You can use a regular expression for this.

Consider the following code:

import re
out = list()
lst = ["http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=p2%2FHzlqau8A%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=Z%2FgLeZxynkg%3D');return false;", "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;"]

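for el in lst:
    # \1 is the scheme and host, \2 is the path inside window.open('...')
    temp = re.sub(r"(.*?)/consultas/coleccion/window\.open\('(.*?)'\).*", r"\1\2", el)
    out.append(temp)
    print(temp)

The sub function replaces the parts of a string that match the given pattern. Essentially, the pattern says:

  • (.*?): keep all characters before /consultas/coleccion/window.open...
  • /consultas/coleccion/window\.open\(': the input string must contain the literal text /consultas/coleccion/window.open(', but it is not kept
  • (.*?): keep all characters after that literal text, up to the closing ') (written as '\) in the pattern)
  • .*: match the rest of the string so that it is dropped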
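
To see what each capture group holds, the groups can be inspected with match() before substituting. This is a minimal sketch that spells out the /consultas/coleccion directory in the pattern, on the assumption that every scraped link shares it; PATTERN, host and path are names chosen for illustration.

```python
import re

# Assumption: every scraped link contains this exact directory before window.open.
PATTERN = re.compile(r"(.*?)/consultas/coleccion/window\.open\('(.*?)'\).*")

el = "http://digesto.asamblea.gob.ni/consultas/coleccion/window.open('/consultas/util/pdf.php?type=rdd&rdd=9rka%2BmYwvYM%3D');return false;"

match = PATTERN.match(el)
host, path = match.groups()
print(host)  # text captured before the literal part
print(path)  # text captured between the single quotes

# The backreferences \1 and \2 stitch the two groups back together.
trimmed = PATTERN.sub(r"\1\2", el)
print(trimmed)
```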