How to write rows one by one into an Excel file while web crawling with Python

Time: 2019-02-21 06:14:50

Tags: python web-crawler

I want to write an Excel file containing data scraped from a website (cnn.com). I did manage to write an Excel file, but it doesn't come out the way I want.

I want each page's data to be saved as its own row.

The result I'm getting now looks like this - screenshot

The ideal result would look like this - screenshot2

Here is my code. Thanks!

from selenium import webdriver
from bs4 import BeautifulSoup
from openpyxl import Workbook

path = "/Users/Downloads/chromedriver.exe"
driver = webdriver.Chrome(path)

# section pages to crawl on edition.cnn.com
sections = ['world', 'politics', 'business', 'entertainment', 'sport', 'health', 'videos']
nl = []
for section in sections:
    driver.get("https://edition.cnn.com/" + section)
    driver.implicitly_wait(3)
    soup = BeautifulSoup(driver.page_source, "lxml")
    # collect the text of every headline container on the page
    for tag in soup.select("div.cd__content"):
        nl.append(tag.get_text())

wb = Workbook()
ws = wb.active

# append all results at once, so everything lands in a single row
ws.append(nl)
wb.save("newstopic.xlsx")

1 Answer:

Answer 0 (score: 0)

Instead of appending the list to the worksheet once at the end, append it on every iteration of the outer loop, so each page's results become their own row.
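The key point is that openpyxl's Worksheet.append() writes exactly one row per call. A minimal standalone sketch of that behavior:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["a1", "b1", "c1"])  # becomes row 1
ws.append(["a2", "b2"])        # becomes row 2
wb.save("append_demo.xlsx")    # two rows, not one long row

Applying that to your script, the code below should work: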

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from openpyxl import Workbook

path = "/Users/Downloads/chromedriver.exe"
driver = webdriver.Chrome(path)

# section pages to crawl on edition.cnn.com
sections = ['world', 'politics', 'business', 'entertainment', 'sport', 'health', 'videos']
wb = Workbook()
ws = wb.active

for section in sections:
    nl = []  # start a fresh row for each page
    driver.get("https://edition.cnn.com/" + section)
    # wait until at least one headline container has loaded
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.cd__content"))
    )
    soup = BeautifulSoup(driver.page_source, "lxml")
    for tag in soup.select("div.cd__content"):
        nl.append(tag.get_text())
    ws.append(nl)  # write this page's results as one row

wb.save("newstopic.xlsx")
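
As an optional variation (my own suggestion, not part of the fix above), you can put the section name in the first cell of each row so the spreadsheet records which page each row came from. The headline strings here are stand-ins for the scraped results:

from openpyxl import Workbook

# Hypothetical variation: label each row with its section name.
sections = ['world', 'politics', 'business']

wb = Workbook()
ws = wb.active
for section in sections:
    row = [section]  # first cell identifies the page
    # stand-in values; in the real script, extend with the scraped texts
    row.extend(f"{section} headline {n}" for n in range(1, 4))
    ws.append(row)  # still one worksheet row per page

wb.save("newstopic_labeled.xlsx")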