Scraping a page with Python

Date: 2019-01-23 14:06:44

Tags: python regex beautifulsoup python-requests

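I am trying to extract the links (href) that start with a specific word, but I get an empty list back even though the page source contains many links that satisfy the condition. I must be missing something. Here is my code: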

01/11/2019 06:00 PM  USO-FOX-USO  E10           8.9929     0.0000
01/11/2019 06:00 PM  USO-FOX-USO  CON8HE10      1.3212    -0.0244
01/11/2019 06:00 PM  USO-FOX-USO  CON8HE10TT    1.3232    -0.0244

1 answer:

Answer 0 (score: 0)

Try this:

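import requests
from bs4 import BeautifulSoup
import string
import os
import re

def extract_href_page(page):
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(page, "html.parser")
    all_links = []
    # Keep only <a> tags that actually carry an href attribute.
    links = soup.find_all('a', href=True)
    # pattern = re.compile(r'\w*recette')
    print(links)  # debug output: every candidate link found on the page
    for link in links:
        # re.match only matches at the start of the href;
        # "first_word" is presumably a placeholder for the word being searched for.
        if re.match(r"\w*first_word", link["href"], re.I):
            all_links.append(link.get("href"))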
...
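
One possible cause of the empty list, offered as an assumption because the original page is not shown: `re.match` only matches at the beginning of the string, and `\w*` cannot step over characters such as `/` or `:`. An href like `/recette/soupe` or `https://example.com/recette/pain` therefore never matches, whereas `re.search` finds the word anywhere in the href. The sketch below illustrates the difference on made-up hrefs; the function name `filter_hrefs`, the `anywhere` flag, and the sample data are all hypothetical.

```python
import re

# Hypothetical hrefs, as they might come out of soup.find_all('a', href=True).
SAMPLE_HREFS = [
    "recette-tarte.html",                # relative link starting with the word
    "/recette/soupe",                    # root-relative link
    "https://example.com/recette/pain",  # absolute link
    "/about",                            # unrelated link
]

def filter_hrefs(hrefs, word, anywhere=False):
    """Return the hrefs matching `word`, case-insensitively.

    anywhere=False uses re.match semantics (anchored at the start of the href);
    anywhere=True uses re.search semantics (the word may appear at any position).
    """
    pattern = re.compile(r"\w*" + re.escape(word), re.I)
    matcher = pattern.search if anywhere else pattern.match
    return [href for href in hrefs if matcher(href)]

print(filter_hrefs(SAMPLE_HREFS, "recette"))
# ['recette-tarte.html']
print(filter_hrefs(SAMPLE_HREFS, "recette", anywhere=True))
# ['recette-tarte.html', '/recette/soupe', 'https://example.com/recette/pain']
```

If the links on the real page are absolute or begin with a slash, switching from `re.match` to `re.search` (or anchoring the pattern on the path explicitly) may be enough to get a non-empty result.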