I want to automatically extract some pieces of information (e.g. "date", "court", "street"...) from a web page, using Beautiful Soup.
However, I am running into a problem with the following code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.licitor.com/annonce/08/45/23/vente-aux-encheres/un-pavillon-a-usage-d-habitation/epinay-sur-seine/seine-saint-denis/084523.html'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soupe = soup(page_html,"html.parser")
page_soupe.findAll("article", {"class":"LegalAd"})
The result is:
[<article class="LegalAd"></article>]
Answer 0 (score: 2)
Here you go:
import xrzz
import re

url = 'https://www.licitor.com/annonce/08/45/23/vente-aux-encheres/un-pavillon-a-usage-d-habitation/epinay-sur-seine/seine-saint-denis/084523.html'
# Send explicit Host and User-Agent headers so the server returns the full page
req = xrzz.http("GET", url=url,
    headers={
        "Host": "www.licitor.com",
        "Connection": "Close",
        "User-Agent": "Mozilla/5.0 (Linux; Android 10; SM-J400F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.66 Mobile Safari/537.36"
    }, tls=True).body()
# req holds the raw response body; the second <h3> contains the price
print(re.findall("<h3>(.*?)</h3>", req.decode())[1])
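The regex step itself can be checked on a static snippet, without any network call. This is a minimal sketch on hypothetical HTML (the real page's markup may differ); it shows why index `[1]` picks the second heading:

```python
import re

# Hypothetical HTML standing in for the downloaded response body
html = b"<h3>Un pavillon a usage d'habitation</h3><h3>Adjudication : 285 000 Euros</h3>"

# re.findall returns every <h3> body in document order,
# so [1] selects the second heading
matches = re.findall("<h3>(.*?)</h3>", html.decode())
print(matches[1])  # Adjudication : 285 000 Euros
```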
Output:

Adjudication: 285 000 Euros
Using BeautifulSoup:
import requests
import bs4

url = 'https://www.licitor.com/annonce/08/45/23/vente-aux-encheres/un-pavillon-a-usage-d-habitation/epinay-sur-seine/seine-saint-denis/084523.html'
# Same idea: pass the headers so the article content is included in the response
req = requests.get(url,
    headers={
        "Host": "www.licitor.com",
        "User-Agent": "Mozilla/5.0 (Linux; Android 10; SM-J400F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.66 Mobile Safari/537.36"
    })
pg = bs4.BeautifulSoup(req.text, 'lxml')
page_soup = pg.findAll("article", {"class": "LegalAd"})
for i in page_soup:
    print(i.find("h3").text)
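Once the article markup is actually present in the response, the same BeautifulSoup calls can pull out individual fields. A minimal sketch on a static snippet; the tags inside the real `LegalAd` article are assumptions, so adapt the selectors to what the live page actually contains:

```python
import bs4

# Hypothetical markup standing in for one "LegalAd" article;
# the real page's inner structure may differ
html = """
<article class="LegalAd">
  <h3>Adjudication : 285 000 Euros</h3>
  <p class="date">Jeudi 10 Juin 2021</p>
</article>
"""

page = bs4.BeautifulSoup(html, "html.parser")
for ad in page.find_all("article", {"class": "LegalAd"}):
    print(ad.find("h3").text)                     # the heading
    print(ad.find("p", {"class": "date"}).text)   # an assumed date field
```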