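I am building a web scraper that scrapes the tables on this website. As you can see, I run it in a loop, and it creates a new CSV file for each web page.
The problem is that, because it creates a new file for every page, I end up with 100 CSV files. How should I combine them? What I want is a single CSV file that contains all the columns (the column from the first CSV file would be column A, the column from the second would be column B, and so on). Each CSV file has only one column, so I just want to merge all the files. Here is my code: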
import csv
import requests
import bs4

count = 1
while count < 1000:
    url = "https://www.techpowerup.com/gpudb/" + str(count)
    response = requests.get(url)
    html = response.content

    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    soup = bs4.BeautifulSoup(html, "lxml")
    tables = soup.findAll("table")

    tableMatrix = []
    for table in tables:
        #Here you can do whatever you want with the data! You can findAll table row headers, etc...
        list_of_rows = []
        for row in table.findAll('tr'):
            list_of_cells = []
            for cell in row.findAll('td'):
                text = cell.text.replace(' ', '')
                list_of_cells.append(text)
            list_of_rows.append(list_of_cells)
        tableMatrix.append((list_of_rows, list_of_cells))

    placeHolder = 0
    excelTable = []
    for table in tableMatrix:
        for row in table:
            if placeHolder == 0:
                for entry in row:
                    excelTable.append(entry)
                placeHolder = 1
            else:
                placeHolder = 0
                excelTable.append('\n')

    for value in excelTable:
        print value
        print '\n'

    count += 1
    fl = open(str(count) + '.csv', 'w')
    writer = csv.writer(fl)
    for values in excelTable:
        writer.writerow(values)
    fl.close()
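Answer 0 (score: 0)
You can use pyexcel. First store the data for each column in a list, then append each of those column lists to another list. The code below shows how to build such a list dynamically. Once all the data is stored in Final_list, you can load it into a pyexcel sheet and save that sheet as a CSV file.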
import pyexcel as pe

Final_list = []
for i in range(6):  # number of columns you want to create
    Final_list.append([])
    for n in range(6):  # number of values in a particular column
        Final_list[i].append('col' + str(n))  # data for the column
print(Final_list)

# Note: pe.Sheet treats each inner list as one row of the sheet.
sheet = pe.Sheet(Final_list)
print(sheet)
sheet.save_as("Final.csv")
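If you would rather not add a dependency, the same idea can be sketched with only the standard library. This is a minimal Python 3 sketch, not a tested drop-in for the scraper above: the helper names `read_single_column`, `write_columns`, and `merge_csv_columns` and the output name `combined.csv` are made up for illustration, and it assumes each input file holds one value per row in its first cell.

```python
import csv
from itertools import zip_longest


def read_single_column(path):
    """Return the first cell of every non-empty row in a one-column CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[0] for row in csv.reader(handle) if row]


def write_columns(columns, out_path):
    """Write a list of columns side by side: columns[0] becomes column A, and so on."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # zip_longest turns the columns into rows and pads shorter columns with "".
        for row in zip_longest(*columns, fillvalue=""):
            writer.writerow(row)


def merge_csv_columns(paths, out_path):
    """Merge several one-column CSV files into a single multi-column CSV file."""
    write_columns([read_single_column(path) for path in paths], out_path)
```

For the files already on disk, a call such as `merge_csv_columns([str(n) + ".csv" for n in range(2, 1001)], "combined.csv")` would combine them, assuming the names follow the `str(count) + '.csv'` pattern from the question (where `count` is incremented before the file is opened). Alternatively, the scraper could append each page's list to one `columns` list inside the loop and call `write_columns` once at the end, which avoids creating the intermediate files at all.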