How can I search within a website using the requests module?

Asked: 2019-03-03 07:40:16

Tags: python web-scraping beautifulsoup request

I want to search for various company names on a website. Website link: https://www.firmenwissen.de/index.html

On this website, I want to use the search engine to look up companies. This is the code I am using:

from bs4 import BeautifulSoup as BS
import requests
import re

companylist = ['ABEX Dachdecker Handwerks-GmbH']

url = 'https://www.firmenwissen.de/index.html'

payloads = {
    'searchform': 'UFT-8',
    'phrase': 'ABEX Dachdecker Handwerks-GmbH',
    'mainSearchField__button': 'submit'
}

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

html = requests.post(url, data=payloads, headers=headers)
soup = BS(html.content, 'html.parser')
link_list= []

links = soup.findAll('a')

for li in links:
    link_list.append(li.get('href'))
print(link_list)

This code should take me to the next page with the company's information. Unfortunately, it only returns the homepage. What should I do?
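One way to see what is going wrong is to inspect where the POST actually lands: requests records every redirect it followed in `response.history`. A small sketch (the `describe_response` helper is illustrative, not part of the original code):

```python
# import requests  # needed only for the live request sketched below

def describe_response(resp):
    """Summarize where a POST actually landed: HTTP status, the final URL,
    and every redirect hop requests followed on the way there."""
    hops = [r.url for r in resp.history] + [resp.url]
    return {'status': resp.status_code, 'hops': hops}

# Hypothetical usage against the question's request (not run here):
# resp = requests.post('https://www.firmenwissen.de/index.html',
#                      data=payloads, headers=headers)
# print(describe_response(resp))
# If 'hops' ends back at index.html, the form was never submitted the
# way the browser submits it; that is exactly the symptom described.
```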

1 Answer:

Answer 0 (score: 0)

Change the initial URL you post the search to. Grab only the appropriate hrefs and add them to a set to ensure there are no duplicates (or, where possible, change the selector so only one match is returned); loop over that final set so you only visit as many links as required. I use a Session on the assumption that you will repeat this for many companies.

Iterate over the set with selenium to navigate to each company URL and extract whatever information you require.

Here is an outline:

from bs4 import BeautifulSoup as BS
import requests
from selenium import webdriver

d = webdriver.Chrome()
companyList = ['ABEX Dachdecker Handwerks-GmbH','SUCHMEISTEREI GmbH']

url = 'https://www.firmenwissen.de/ergebnis.html'
baseUrl = 'https://www.firmenwissen.de'
headers = {'User-Agent': 'Mozilla/5.0'}

finalLinks = set()

## searches section; gather into set

with requests.Session() as s:
    for company in companyList:
        payloads = {
            'searchform': 'UFT-8',
            'phrase': company,
            'mainSearchField__button': 'submit'
        }

        html = s.post(url, data=payloads, headers=headers)
        soup = BS(html.content, 'lxml')

        companyLinks = {baseUrl + item['href'] for item in soup.select("[href*='firmeneintrag/']")}
        # print(soup.select_one('.fp-result').text)
        finalLinks = finalLinks.union(companyLinks)

## visit each collected company page and print summary + address
for item in finalLinks:
    d.get(item)
    info = d.find_element_by_css_selector('.yp_abstract_narrow')
    address = d.find_element_by_css_selector('.yp_address')
    print(info.text, address.text)

d.quit()
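The link-gathering step in the outline can be isolated as a small pure function, which makes the set-based de-duplication easy to see (`collect_company_links` is a hypothetical name, not something from the site or the code above):

```python
def collect_company_links(hrefs, base_url):
    """Keep only hrefs pointing at a company entry ('firmeneintrag/')
    and prefix the base URL; a set collapses duplicate results."""
    return {base_url + h for h in hrefs if 'firmeneintrag/' in h}

links = collect_company_links(
    ['/firmeneintrag/abc', '/firmeneintrag/abc', '/impressum'],
    'https://www.firmenwissen.de')
print(sorted(links))  # ['https://www.firmenwissen.de/firmeneintrag/abc']
```

The duplicate entry collapses to one link and the unrelated href is dropped, so the selenium loop only visits each company page once.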

Just the first link:

from bs4 import BeautifulSoup as BS
import requests
from selenium import webdriver

d = webdriver.Chrome()
companyList = ['ABEX Dachdecker Handwerks-GmbH','SUCHMEISTEREI GmbH', 'aktive Stuttgarter']

url = 'https://www.firmenwissen.de/ergebnis.html'
baseUrl = 'https://www.firmenwissen.de'
headers = {'User-Agent': 'Mozilla/5.0'}

finalLinks = []

## searches section; add to list

with requests.Session() as s:
    for company in companyList:
        payloads = {
            'searchform': 'UFT-8',
            'phrase': company,
            'mainSearchField__button': 'submit'
        }

        html = s.post(url, data=payloads, headers=headers)
        soup = BS(html.content, 'lxml')

        companyLink = baseUrl + soup.select_one("[href*='firmeneintrag/']")['href']
        finalLinks.append(companyLink)

## keep only unique links, then visit each with selenium
for item in set(finalLinks):
    d.get(item)
    info = d.find_element_by_css_selector('.yp_abstract_narrow')
    address = d.find_element_by_css_selector('.yp_address')
    print(info.text, address.text)

d.quit()
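One caveat with this first-link variant: `select_one` returns `None` when a search produces no results, and indexing `None` with `['href']` raises a `TypeError`. A guarded sketch (the HTML strings here are stand-ins, not the real page markup):

```python
from bs4 import BeautifulSoup as BS

baseUrl = 'https://www.firmenwissen.de'
hit = '<a href="/firmeneintrag/abex">ABEX</a>'  # stand-in results page
miss = '<p>Keine Treffer</p>'                   # stand-in no-results page

finalLinks = []
for page in (hit, miss):
    node = BS(page, 'html.parser').select_one("[href*='firmeneintrag/']")
    if node is not None:  # skip companies with no match instead of crashing
        finalLinks.append(baseUrl + node['href'])

print(finalLinks)  # ['https://www.firmenwissen.de/firmeneintrag/abex']
```

The company with no hit is simply skipped, so one bad search term no longer aborts the whole run.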