I have a list of URLs in a csv file that I want to scrape content from. The csv has 200+ URLs. The code I am running picks up the first URL and then fails. The code is as follows:
import csv
from selenium import webdriver

with open('Godzilla1.csv', 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(["Title", "Content"])

f = open("links.csv")
urls = [url.strip() for url in f.readlines()]

driver = webdriver.Firefox()
for url in urls:
    driver.get(url)
    titles = driver.find_elements_by_xpath('//h2[@class="entry-title"]')
    contents = driver.find_elements_by_class_name("et_pb_post")
    num_page_items = len(titles)
    with open('Godzilla1.csv', 'a') as f:
        for i in range(num_page_items):
            f.write(titles[i].text + "," + contents[i].text + "\n")

# Clean up (close browser once completed task).
driver.close()
When that code runs, the reported error is:

    f.write(titles[i].text + "," + contents[i].text + "\n")
    IndexError: list index out of range
Answer 0 (score: -1)
The problem is that while you get 2 items in titles, contents only has 1 element. So when you iterate to the second item in titles, the index into contents is out of range (hence the error). It looks like the title is simply duplicated on the page, so rather than getting all elements with the entry-title class, only get the first one.
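An alternative to taking only the first element is to pair the two lists with zip, which stops at the shorter list so no index can run past either one. A minimal sketch, with made-up lists standing in for the scraped elements:

```python
titles = ["Godzilla vs. Kong", "Godzilla vs. Kong"]  # title duplicated: 2 items
contents = ["A monster showdown."]                   # only 1 matching element

# zip pairs items up to the length of the shorter list, so the
# duplicate title is dropped instead of raising IndexError.
rows = [(t, c) for t, c in zip(titles, contents)]
print(rows)  # [('Godzilla vs. Kong', 'A monster showdown.')]
```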
You will also run into trouble using , as the delimiter here, because the content itself contains commas. May I suggest just using pandas?
import pandas as pd
from selenium import webdriver

f = open("links.csv")
urls = [url.strip() for url in f.readlines()]

driver = webdriver.Firefox()
rows = []
for url in urls:
    driver.get(url)
    title = driver.find_element_by_xpath('//h2[@class="entry-title"]')
    content = driver.find_element_by_class_name("et_pb_post")

    row = {'Title': title.text,
           'Content': content.text}
    rows.append(row)

# Clean up (close browser once completed task).
driver.close()

df = pd.DataFrame(rows)
df.to_csv('Godzilla1.csv', index=False)
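If you would rather stay with the standard library, the csv module also solves the comma-in-content problem: csv.writer quotes any field that contains the delimiter, unlike raw f.write. A minimal sketch with made-up row data, written to an in-memory buffer so it is easy to inspect:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Title", "Content"])
# The comma inside the content field gets quoted automatically,
# so it is not mistaken for a column delimiter when read back.
writer.writerow(["Godzilla", "Big, radioactive lizard"])
print(buf.getvalue())
```

Reading the buffer back with csv.reader would return the content as a single field again.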
There is also the option of avoiding Selenium altogether and just using requests and BeautifulSoup:
import pandas as pd
import requests
from bs4 import BeautifulSoup

f = open("links.csv")
urls = [url.strip() for url in f.readlines()]

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}
rows = []
for url in urls:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    title = soup.find('h2', {'class': "entry-title"})
    content = soup.find('div', {'class': 'entry-content'}).find('p')
    post_meta = soup.find('p', {'class': 'post-meta'})
    try:
        category = post_meta.find('a', {'rel': 'category tag'}).text.strip()
    except AttributeError:
        # post_meta or the category link may be missing on some pages
        category = ''

    row = {'Title': title.text,
           'Content': content.text,
           'Category': category}
    print(row)
    rows.append(row)

df = pd.DataFrame(rows)
df.to_csv('Godzilla1.csv', index=False)
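Note that the snippet above assumes every page contains the target elements: soup.find returns None when nothing matches, and reading .text on None raises AttributeError. A small helper (an illustrative sketch, the name is made up) makes the extraction tolerant of missing elements:

```python
from bs4 import BeautifulSoup

def text_or_default(node, default=''):
    # soup.find returns None when there is no match, so guard
    # before reading the text out of the node.
    return node.get_text(strip=True) if node is not None else default

html = '<h2 class="entry-title">Godzilla</h2>'
soup = BeautifulSoup(html, 'html.parser')
title = text_or_default(soup.find('h2', {'class': 'entry-title'}))
meta = text_or_default(soup.find('p', {'class': 'post-meta'}), 'n/a')
print(title, meta)  # Godzilla n/a
```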