Web scraping with Selenium from a list of URLs

Date: 2021-06-10 14:52:19

Tags: python selenium web-scraping

I have a list of URLs in a csv file from which I want to scrape content. The csv has 200+ URLs. The code I'm running picks up the first URL and then fails. The code is as follows:

import csv
from selenium import webdriver

with open('Godzilla1.csv', 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(["Title", "Content"])

f = open("links.csv")
urls = [url.strip() for url in f.readlines()]
driver = webdriver.Firefox()

for url in urls:
    
    driver.get(url)
    
    titles = driver.find_elements_by_xpath('//h2[@class="entry-title"]')
    contents = driver.find_elements_by_class_name("et_pb_post")
    
    num_page_items = len(titles)
    with open('Godzilla1.csv', 'a') as f:
        for i in range(num_page_items):
            f.write(titles[i].text + "," + contents[i].text + "\n")

# Clean up (close browser once completed task).
driver.close()

When that code runs, the error reported is:

f.write(titles[i].text + "," + contents[i].text + "\n")
IndexError: list index out of range

1 Answer:

Answer 0 (score: -1)

The problem is that when you get 2 items in titles, there is only 1 element in contents. So when you iterate to the second item in titles, contents goes out of range (hence the error). It looks like the title content is duplicated twice, so rather than getting all the elements with the entry-title class, only get the first element.
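(If you did want to keep pairing the two lists, the built-in zip() stops at the shorter one, which would also sidestep the IndexError; a minimal sketch of your original loop with that change:)

for url in urls:
    driver.get(url)
    titles = driver.find_elements_by_xpath('//h2[@class="entry-title"]')
    contents = driver.find_elements_by_class_name("et_pb_post")
    # zip() pairs items only up to the end of the shorter list, so no IndexError
    with open('Godzilla1.csv', 'a') as f:
        for title, content in zip(titles, contents):
            f.write(title.text + "," + content.text + "\n")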

You're also going to have issues using , as the delimiter here, since there are commas within the content. May I suggest just using pandas?

import pandas as pd
from selenium import webdriver


f = open("links.csv")
urls = [url.strip() for url in f.readlines()]

driver = webdriver.Firefox()


rows = []
for url in urls:
    
    driver.get(url)
    
    # find_element (singular) returns only the first match
    title = driver.find_element_by_xpath('//h2[@class="entry-title"]')
    content = driver.find_element_by_class_name("et_pb_post")

    row = {'Title': title.text,
           'Content': content.text}
    rows.append(row)
    
# Clean up (close browser once completed task).
driver.close()

# pandas quotes any field that contains a comma, so the delimiter issue goes away
df = pd.DataFrame(rows)
df.to_csv('Godzilla1.csv', index=False)
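(Side note: the standard library's csv module would also handle the embedded commas, since csv.writer quotes such fields automatically; a minimal sketch reusing the same rows list, in case you'd rather not add pandas as a dependency:)

import csv

# csv.writer quotes any field containing the delimiter, so commas
# inside Content won't break the file
with open('Godzilla1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Content'])
    for row in rows:
        writer.writerow([row['Title'], row['Content']])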

There's also the option of avoiding Selenium entirely and simply using requests and BeautifulSoup:

import pandas as pd
import requests
from bs4 import BeautifulSoup


f = open("links.csv")
urls = [url.strip() for url in f.readlines()]

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}

rows = []
for url in urls:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    title = soup.find('h2',{'class':"entry-title"})
    content = soup.find('div',{'class':'entry-content'}).find('p')
    
    post_meta = soup.find('p', {'class':'post-meta'})

    try:
        category = post_meta.find('a', {'rel': 'category tag'}).text.strip()
    except AttributeError:
        # post_meta or the category link may be missing on some pages
        category = ''

    row = {'Title': title.text,
           'Content': content.text,
           'Category': category}
    
    print(row)
    rows.append(row)
    

df = pd.DataFrame(rows)
df.to_csv('Godzilla1.csv', index=False)
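(One last hedged suggestion: with 200+ URLs, adding a timeout and a status check keeps one bad link from hanging or crashing the whole run; a sketch of the top of that loop:)

for url in urls:
    try:
        # timeout avoids hanging on a dead server; raise_for_status()
        # turns 4xx/5xx responses into exceptions we can skip past
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f'Skipping {url}: {e}')
        continue
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... rest of the loop unchanged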