My code won't output the extracted data

Asked: 2019-03-26 15:12:23

Tags: python web-scraping

I want to extract multiple links from a website (the German Yellow Pages), but when I click the Run button my code does nothing: the scraper produces no output and no error message. How can I fix this? Where is the problem?

I tried the code on the reddit front page and it worked fine there, producing data output. But on the page I'm interested in, https://www.gelbeseiten.de/arzt/heilbronn-neckar, it doesn't succeed.

In this screenshot you can see what I want to extract.

From the div tag with id="gs_treffer", I want to extract the data-href links from the article tags.
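Based on the expected output further down, the relevant markup presumably looks roughly like this (a simplified sketch of the assumed structure, not the actual page source):

<div id="gs_treffer">
    <article data-href="https://www.gelbeseiten.de/gsbiz/...">...</article>
    <article data-href="https://www.gelbeseiten.de/gsbiz/...">...</article>
</div>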

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.gelbeseiten.de/arzt/heilbronn-neckar/"

#download the URL and extract the content to the variable html 
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()

#pass the HTML to Beautifulsoup.
soup = BeautifulSoup(html,'html.parser')

#get the HTML of the div with id "gs_treffer" where all the links are displayed
main_table = soup.find("div",attrs={'id':'gs_treffer'})

#now we go into main_table and get every article element in it that has the class "data-href"
links = main_table.find_all("article", class_="data-href")

#from each article, extract the data-href link
#list to store a dict for each link we extract

extracted_records = []
for link in links:
    url = link['data-href']
    record = {
        'url': url
    }
    extracted_records.append(record)
print(extracted_records)

2 Answers:

Answer 0 (score: 1):

You want to get rid of the class_="data-href" argument in the find_all method, because "data-href" is not a class.

links = main_table.find_all("article")

I now get the list of dicts with the URLs:

[{'url': 'https://www.gelbeseiten.de/gsbiz/b1f40122-810e-4e51-9915-0e5ac98e32a5'}, {'url': 'https://www.gelbeseiten.de/gsbiz/44beddcf-a428-452c-ade1-a2e4e7807b23'}, {'url': 'https://www.gelbeseiten.de/gsbiz/d3268940-07f3-41c4-bcbd-e33d341ba379'}, {'url': 'https://www.gelbeseiten.de/gsbiz/3fe695df-8695-4940-81f5-bee17fbdf168'}, {'url': 'https://www.gelbeseiten.de/gsbiz/f8a8f769-6806-4742-b62b-b46753bcebe0'}, {'url': 'https://www.gelbeseiten.de/gsbiz/aa19c150-da60-4ef6-ba00-ef672fbf34da'}, {'url': 'https://www.gelbeseiten.de/gsbiz/3e7b5aa8-7ae0-4779-a4ad-e2a51b4d7315'}, {'url': 'https://www.gelbeseiten.de/gsbiz/5d9e76b0-85ea-4316-88b2-b25f417b6d58'}, {'url': 'https://www.gelbeseiten.de/gsbiz/ca1d47eb-22e3-44bf-95de-0cf93f39761a'}, {'url': 'https://www.gelbeseiten.de/gsbiz/caf662da-d8ad-43b0-83c5-8b6c962195ba'}, {'url': 'https://www.gelbeseiten.de/gsbiz/346bf41b-e415-47cc-9609-788311322ab6'}, {'url': 'https://www.gelbeseiten.de/gsbiz/9f73cee9-a1dc-47b8-ab9e-e1855512cdc6'}, {'url': 'https://www.gelbeseiten.de/gsbiz/057ba124-aa45-40b9-a033-bf83ecc7c3ef'}, {'url': 'https://www.gelbeseiten.de/gsbiz/69b0e77e-9ae4-4f8f-82f7-9aa7cbab1a75'}, {'url': 'https://www.gelbeseiten.de/gsbiz/7a3de200-08c3-48ee-ac0c-fcfc183d35c3'}]
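As a side note: data-href is an attribute of the article tags, not a class. If you ever need to match only those articles that actually carry the attribute (rather than every article in the container), BeautifulSoup accepts an attribute filter — a minimal sketch, assuming the same page structure:

#match only article tags that have a data-href attribute, whatever its value
links = main_table.find_all("article", attrs={"data-href": True})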

Answer 1 (score: 0):

Your line links = main_table.find_all("article", class_="data-href") doesn't work. I think it's because "data-href" is not a class on those elements, so the class filter matches nothing.

If you replace that line with links = main_table.find_all("article"), the script works.
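For completeness, the full script with that one-line fix applied might look like this (an untested sketch; the site's markup may have changed since this was written):

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.gelbeseiten.de/arzt/heilbronn-neckar/"

#download the page and parse it
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')

#the container div that holds all the result entries
main_table = soup.find("div", attrs={'id': 'gs_treffer'})

#every article tag inside it carries its link in the data-href attribute
links = main_table.find_all("article")

#collect one dict per link; .get() avoids a KeyError if an article lacks the attribute
extracted_records = []
for link in links:
    url = link.get('data-href')
    if url:
        extracted_records.append({'url': url})
print(extracted_records)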