Question

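With the following code, which "works" as far as extraction goes, the output overwrites the main HTML output "file" on every new page. I'm new to this and I'm sure it's a silly coding mistake, but I just can't see it.

In other words, it processes each page and extracts the information, but every time it finishes a page it overwrites what is already in the HTML file, so at any given time I only have p. 2 or p. 16, etc. I need it either to keep appending to the file or to create one HTML file per page (I think the latter is preferable?).

Any help would be greatly appreciated.

This is only part of a larger script, but I'm trying to make sure each part works before running the whole thing.

Thanks for your time!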

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
import os

allpages=[]
for i in range(2,1575):  # the main page is a different url so starting on p. 2
    allpages.append("url here"+str(i))

completedlist=[]

for eachpage in allpages[0:2]:  # just testing; will change to :1575
    options = Options()
    options.headless = True
    driver = webdriver.Chrome(options=options, executable_path='mypath')
    driver.get(eachpage)
    print('Headless Chrome Initialized: '+eachpage)

    with open("./capture/filenamehere"+str(i)+".html", "w") as f:
        f.write(driver.page_source)

    completedlist.append(eachpage)

Answer 1

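You are opening the file in write mode ('w'), so its contents are overwritten on every iteration. Change the 'w' in open to 'a', which means append mode; the file will then no longer be overwritten, and new content will be appended to the end.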
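The question also mentions writing one file per page as an alternative. In the posted code, `i` is left over from the first loop and keeps its final value (1574), so every iteration opens the same path, which is why 'w' keeps replacing the previous page. Below is a minimal sketch of both options, not a definitive fix. The helper names `save_page` and `append_page` are hypothetical, and the `./capture` directory and `filenamehere` prefix are simply taken from the question.

```python
import os


def save_page(html, page_number, out_dir="./capture", prefix="filenamehere"):
    """Write one page's HTML to its own file and return the path (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)  # create the output directory if missing
    path = os.path.join(out_dir, f"{prefix}{page_number}.html")
    with open(path, "w", encoding="utf-8") as f:  # 'w' is fine: the path is unique per page
        f.write(html)
    return path


def append_page(html, path):
    """Append one page's HTML to a single shared file (hypothetical helper)."""
    with open(path, "a", encoding="utf-8") as f:  # 'a' keeps what is already there
        f.write(html)


# Possible use inside the question's loop (driver setup omitted, assumed as in the question):
# for page_number, eachpage in enumerate(allpages, start=2):
#     driver.get(eachpage)
#     save_page(driver.page_source, page_number)
```

Using `enumerate(allpages, start=2)` ties the number in the file name to the page currently being processed, instead of reusing the stale `i`.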

Python (Selenium) script overwrites file
