Web scraping with a webdriver in BeautifulSoup

Time: 2020-01-16 03:05:05

Tags: python beautifulsoup webdriver

I am trying to do paginated web scraping with BeautifulSoup, so I used a webdriver to move through the other pages. But I don't really know any other way to use the webdriver to get the content from a dynamic web page and fit it into my code. Below is my full code where I tried to use the WebDriver, but the WebDriver doesn't work properly. The website I want to scrape is [link here] [1]

for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
    raw = requests.get('').text

driver.get(raw)

raw = raw.replace("</br>", "")

soup = BeautifulSoup(raw, 'html.parser')

name = soup.find_all('div', {'class' :'cbp-vm-companytext'})
phone = [re.findall('\>.*?<',d.find('span')['data-content'])[0][1:][:-1] for d in soup.find_all('div',{'class':'cbp-vm-cta'})]
addresses = [x.text.strip().split("\r\n")[-1].strip() for x in soup.find_all("div", class_='cbp-vm-address')]

print(addresses)
print(name)

num_page_items = len(addresses)
with open('results.csv', 'a') as f:
    for i in range(num_page_items):
        f.write(name[i].text + "," + phone[i] + "," + addresses[i] + "," +  "\n")

Of course, I added the webdriver to my code incorrectly. How can I fix it so that the webdriver works properly?

1 answer:

Answer 0: (score: 1)

If you use Selenium to read the page, then you can also use Selenium to search for the elements on the page.

Some elements don't have companytext, so if you get companytext and address/phone separately, then you may create wrong pairs: (second name, first phone, first address), (third name, second phone, second address), etc. It is better to find the element which groups name, phone and address together, and then search for name, phone and address inside this element - and if a name is not found, then put an empty name for this group, or search a different element for the name. I found that some elements display an image with a logo instead of the name, and they have the name in <img alt="...">
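To illustrate the pairing problem, here is a minimal sketch with a made-up HTML fragment (the class names match the page, but the content is invented): searching the whole document gives lists of different lengths, while searching inside each listing keeps name and address together.

```python
from bs4 import BeautifulSoup

# Hypothetical listing where the second item has no companytext <div>:
html = """
<li><div class="cbp-vm-companytext">Shop A</div><div class="cbp-vm-address">Addr A</div></li>
<li><div class="cbp-vm-address">Addr B</div></li>
"""
soup = BeautifulSoup(html, "html.parser")

# Searching globally produces lists of different lengths -> wrong pairs.
names = soup.find_all("div", class_="cbp-vm-companytext")
addresses = soup.find_all("div", class_="cbp-vm-address")
print(len(names), len(addresses))  # 1 2

# Searching inside each <li> keeps every name next to its own address.
for li in soup.find_all("li"):
    name = li.find("div", class_="cbp-vm-companytext")
    addr = li.find("div", class_="cbp-vm-address")
    print(name.text if name else "", "|", addr.text)
```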

Writing CSV data to the file with the standard write() is not a good idea, because an address may contain many commas, and that would create many columns. The csv module will put the address in " " so it stays a single column.
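A quick sketch of that quoting behavior (the shop data here is invented): csv.writer automatically wraps the comma-containing address in quotes, so a CSV reader sees three columns, not five.

```python
import csv
import io

# A field containing commas would break naive write() output,
# but csv.writer quotes it so it stays one column.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Shop A", "03-1234 5678", "12, Jalan Example, Selangor"])
print(buf.getvalue().strip())  # Shop A,03-1234 5678,"12, Jalan Example, Selangor"
```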

from selenium import webdriver
import csv

MAX_PAGE_NUM = 5

#driver = webdriver.Chrome()
driver = webdriver.Firefox()

with open('results.csv', 'w', newline='') as f:  # newline='' prevents blank rows on Windows
    csv_writer = csv.writer(f)
    csv_writer.writerow(["Business Name", "Phone Number", "Address"])

    for page_num in range(1, MAX_PAGE_NUM+1):
        #page_num = '{:03}'.format(page_num)
        url = 'https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen={}'.format(page_num)
        driver.get(url)
        for item in driver.find_elements_by_xpath('//div[@id="content_listView"]//li'):
            try:
                name = item.find_element_by_xpath('.//div[@class="cbp-vm-companytext"]').text
            except Exception as ex:
                #print('ex:', ex)
                name = item.find_element_by_xpath('.//a[@class="cbp-vm-image"]/img').get_attribute('alt')

            phone = item.find_element_by_xpath('.//div[@class="cbp-vm-cta"]//span[@data-original-title="Phone"]').get_attribute('data-content')
            phone = phone[:-4].split(">")[-1]

            address = item.find_element_by_xpath('.//div[@class="cbp-vm-address"]').text
            address = address.split('\n')[-1]

            print(name, '|', phone, '|', address)
            csv_writer.writerow([name, phone, address])

BTW: you don't have to convert the page number to three digits - i.e. 001 - it works with 1 too. But if you want to convert it, then use string formatting

page_num = '{:03}'.format(i)

It could also work with only requests and BeautifulSoup, without Selenium

If you have to get the HTML from Selenium, then you have driver.page_source - but driver.get() needs a url, so you don't need requests

driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

EDIT: version which uses requests and BeautifulSoup instead of Selenium. I could get it to work only when I used "lxml" instead of "html.parser". The HTML seems to have some errors, and "html.parser" doesn't parse it correctly

import requests
from bs4 import BeautifulSoup as BS
import csv
#import webbrowser

MAX_PAGE_NUM = 5

#headers = {
#  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0"
#}

with open('results.csv', 'w', newline='') as f:  # newline='' prevents blank rows on Windows
    csv_writer = csv.writer(f)
    csv_writer.writerow(["Business Name", "Phone Number", "Address"])

    for page_num in range(1, MAX_PAGE_NUM+1):
        #page_num = '{:03}'.format(page_num)
        url = 'https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen={}'.format(page_num)

        response = requests.get(url) #, headers=headers)
        soup = BS(response.text, 'lxml')
        #soup = BS(response.text, 'html.parser')

        #with open('temp.html', 'w') as fh:
        #    fh.write(response.text)
        #webbrowser.open('temp.html')

        #all_items = soup.find('div', {'id': 'content_listView'}).find_all('li')
        #print('len:', len(all_items))

        #for item in all_items:
        for item in soup.find('div', {'id': 'content_listView'}).find_all('li'):
            try:
                name = item.find('div', {'class': 'cbp-vm-companytext'}).text
            except Exception as ex:
                #print('ex:', ex)
                name = item.find('a', {'class': 'cbp-vm-image'}).find('img')['alt']

            phone = item.find('div', {'class': 'cbp-vm-cta'}).find('span', {'data-original-title': 'Phone'})['data-content']
            phone = phone[:-4].split(">")[-1].strip()

            address = item.find('div', {'class': 'cbp-vm-address'}).text
            address = address.split('\n')[-1].strip()

            print(name, '|', phone, '|', address)
            csv_writer.writerow([name, phone, address])
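As a side note, the phone extraction in both versions assumes data-content holds a small HTML fragment ending in a closing tag; this sketch uses an invented sample value to show what the slicing and splitting do.

```python
# Hypothetical data-content value: a link whose text is the phone number.
data_content = '<a href="tel:+60312345678">03-1234 5678</a>'

# [:-4] drops the trailing "</a>"; split(">")[-1] keeps the text after the last tag.
phone = data_content[:-4].split(">")[-1].strip()
print(phone)  # 03-1234 5678
```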