Scraping a web page with Beautiful Soup

Date: 2021-05-29 16:29:08

Tags: web-scraping beautifulsoup

I want to automatically extract some information (e.g. the date, the court, the street, ...) from a web page, and I would like to use Beautiful Soup for this.

However, I run into a problem with the following code:

from urllib.request import urlopen as uReq 
from bs4 import BeautifulSoup as soup

my_url = 'https://www.licitor.com/annonce/08/45/23/vente-aux-encheres/un-pavillon-a-usage-d-habitation/epinay-sur-seine/seine-saint-denis/084523.html'
uClient = uReq(my_url) 
page_html = uClient.read()
uClient.close()
page_soupe = soup(page_html,"html.parser")
page_soupe.findAll("article", {"class":"LegalAd"})

The result is

[<article class="LegalAd"></article>]

and it does not show everything inside the "article" tag. Any idea how to solve this?

1 answer:

Answer 0: (score: 2)


Here you go, mate:

import xrzz
import re

url = 'https://www.licitor.com/annonce/08/45/23/vente-aux-encheres/un-pavillon-a-usage-d-habitation/epinay-sur-seine/seine-saint-denis/084523.html'
req = xrzz.http("GET", url=url,
    headers={
        "Host": "www.licitor.com",
        "Connection": "Close",
        "User-Agent": "Mozilla/5.0 (Linux; Android 10; SM-J400F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.66 Mobile Safari/537.36"
    }, tls=True).body()

# .body() holds the raw response bytes; decode before regex matching.
# The second <h3> on the page contains the adjudication price.
print(re.findall("<h3>(.*?)</h3>", req.decode())[1])

Output

Adjudication: 285 000 Euros



Using beautifulsoup

import requests
import re
import bs4

url = 'https://www.licitor.com/annonce/08/45/23/vente-aux-encheres/un-pavillon-a-usage-d-habitation/epinay-sur-seine/seine-saint-denis/084523.html'
req = requests.get(url,
    headers={
        "Host": "www.licitor.com",
        "User-Agent": "Mozilla/5.0 (Linux; Android 10; SM-J400F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.66 Mobile Safari/537.36"
    })

# Parse the response body that was fetched above.
pg = bs4.BeautifulSoup(req.text, 'lxml')
page_soup = pg.findAll("article", {"class":"LegalAd"})
for i in page_soup:
    print(i.find("h3").text)

Output - (bs4)
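The non-greedy regex used in the first snippet can be sanity-checked offline against a minimal HTML fragment shaped like the live page; the tag names mirror the page structure, but the sample values below are invented for illustration.

```python
import re

# Minimal stand-in for the fetched page: an article with two <h3> headings,
# as on the listing page (values here are made up, not scraped).
sample = (
    '<article class="LegalAd">'
    '<h3>Vente aux encheres</h3>'
    '<h3>Adjudication : 285 000 Euros</h3>'
    '</article>'
)

# Non-greedy (.*?) captures each <h3>...</h3> body separately, in order.
matches = re.findall(r"<h3>(.*?)</h3>", sample)
print(matches[1])  # index 1 = the second heading, holding the price
```

A greedy `(.*)` would instead swallow everything from the first `<h3>` to the last `</h3>`, which is why the answer uses `(.*?)`.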