How do I scrape the subheadings from this link?

Time: 2017-07-24 17:46:04

Tags: python excel web-scraping

I built a web scraper that scrapes data from pages that look like this one (it scrapes the tables): https://www.techpowerup.com/gpudb/2/

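The problem is that, for some reason, my program only scrapes the values and not the subheadings. For example (click the link), it only scrapes "R420", "130 nm", "160 million", etc., but not "GPU Name", "Process Size", "Transistors", etc.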

What do I need to add to my code so that it also scrapes the subheadings? Here is my code:

import csv
import requests
import bs4
url = "https://www.techpowerup.com/gpudb/2"


#obtain HTML and parse through it
response = requests.get(url)
html = response.content
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
soup = bs4.BeautifulSoup(html, "lxml")
tables = soup.findAll("table")

#reading every value in every row in each table and making a matrix 
tableMatrix = []
for table in tables:
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    tableMatrix.append((list_of_rows, list_of_cells))

#(YOU CAN PROBABLY IGNORE THIS)placeHolder used to avoid duplicate data from appearing in list 
placeHolder = 0
excelTable = []
for table in tableMatrix:
    for row in table:
        if placeHolder == 0:
            for entry in row:
                excelTable.append(entry)
            placeHolder = 1
        else:
            placeHolder = 0
    excelTable.append('\n')

for value in excelTable:
    print value
    print '\n'


#create excel file and write the values into a csv 
fl = open(str(count) + '.csv', 'w')
writer = csv.writer(fl)
for values in excelTable:
    writer.writerow(values)
fl.close()   

1 Answer:

Answer 0 (score: 0)

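If you inspect the page source, those cells are header cells, so they use TH tags rather than TD tags. You will probably want to update your loop to include TH cells alongside the TD cells.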
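
To illustrate the suggestion, here is a minimal sketch that collects header and data cells in one pass. It is not a drop-in replacement for the question's script. The sample markup and the `table_rows` helper are hypothetical and only stand in for one table on the page; the real page's markup may differ. It uses `find_all`, the current spelling of `findAll`, and the built-in "html.parser" so that lxml is not required.

```python
import bs4

# Hypothetical sample markup standing in for one table on the page;
# the structure of the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><th>GPU Name</th><td>R420</td></tr>
  <tr><th>Process Size</th><td>130 nm</td></tr>
  <tr><th>Transistors</th><td>160 million</td></tr>
</table>
"""


def table_rows(html):
    """Return each table row as a list of cell texts, header cells included."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("tr"):
        # Passing a list of tag names matches either tag, in document order,
        # so the TH subheading comes before its TD value.
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if cells:  # skip rows that contain no cells at all
            rows.append(cells)
    return rows


rows = table_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

In the question's code, the equivalent change would most likely be to the inner loop, replacing `row.findAll('td')` with `row.findAll(['td', 'th'])`.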