Reorganizing scraped Python indexed data into an HTML table

Time: 2018-03-24 18:37:48

Tags: python html css python-3.x beautifulsoup

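The script below retrieves images and text from a web page. So far its output resembles the original page, with some changes to image size and CSS. Instead of placing every item on its own row in a single column, I would like each table row to hold as many items (columns) as will fit before the end of the row, and then continue filling the next row. In other words, the output HTML page should show a full page and table with multiple columns and rows, rather than a single column with many rows. Any help would be greatly appreciated.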

from bs4 import BeautifulSoup
import requests

urldes = "https://www.johnpyeauctions.co.uk/lot_list.asp?saleid=4764&siteid=1"

# add header
# Adjacent string literals are joined; the trailing space keeps the tokens apart
mozila_agent = ('Mozilla/5.0 (Windows NT 6.3; Win64; x64) '
                'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36')
headers = {'User-Agent': mozila_agent}
r = requests.get(urldes, headers=headers)
soup = BeautifulSoup(r.content, "lxml")


############################################################

the_whole_table = soup.find('table', width='97%')

datalist = []

for tr in the_whole_table.find_all('tr')[1:]:
    # you want to start from the 1st item not the 0th so [1:]
    # Because the first is the thead i.e. Lot no, Picture, Lot Title...
    index_num = tr.find('td', width='8%')
    picture_link = index_num.next_sibling.a['data-img']
    text_info = tr.find('td', width='41%')
    current_bid = tr.find('td', width='13%')
    time_left = tr.find('td', width='19%')
    datalist.append([index_num.text, picture_link,
                     text_info.text, current_bid.text, time_left.text])

    # for pic do ... print(picture_link) as for partial text only first 20
    # characters


index = datalist[0][0]
picture = datalist[0][1]
info = datalist[0][2]
bid = datalist[0][3]
time = datalist[0][4]


df = ['Index Number', 'Picture', 'Informational text',
      'Current BID', 'Time Left now']

theads = BeautifulSoup('<table style="width:50%; color: blue; font-family: verdana; font-size: 60%;"></table>', 'lxml')
thekeys = BeautifulSoup('<thead style="color: blue; font-family: verdana; font-size: 60%;"></thead>', 'html.parser')


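# <th> cells must sit inside a <tr>, which in turn sits inside <thead>
header_row = theads.new_tag('tr')
for i in df:
    tag = theads.new_tag('th')
    tag.append(i)
    header_row.append(tag)
thekeys.thead.append(header_row)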

theads.table.append(thekeys)
###############################################################
# The code above will initiate a table
# after that the for loop will create and populate the first row (thead)

for i in datalist:
#    thedata = BeautifulSoup('<tr style="color: blue; font-family: verdana; font-size: 50%;"></tr>', 'html.parser')
    thedata = BeautifulSoup('<tr></tr>', 'html.parser')
    # we loop through the data we collected
    # initiate a <td> </td> tag everytime we finish with one collection
    for j in i:
        if j.startswith('https'):
            img_tag = theads.new_tag('img', src=j, height='300', width='300')
            td_tag = theads.new_tag('td')
            td_tag.append(img_tag)
            thedata.tr.append(td_tag)

        else:
  #            tag = theads.new_tag('td', style="color: blue; font-family: verdana; font-size: 50%;")
            tag = theads.new_tag('td')
            tag.append(j)
            thedata.tr.append(tag)

    theads.table.append(thedata)

css = "<style>{color: blue; font-family: verdana; font-size: 50%;}</style>"
#css.string = css

with open('asdf.html', 'w+') as f:
    f.write(theads.prettify())

print(css)

# each of these if you print them you'll get a information that you can store
# to test do print(index_num.text, text_info.text)
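
For reference, one way to get several items per row could be sketched as follows: split the scraped records into fixed-size chunks, emit one `<tr>` per chunk, and render each record as a single `<td>` holding the image and its text. This sketch builds the markup with plain strings instead of BeautifulSoup so that it runs without the scraped page. The helper names `chunked` and `build_grid`, the `columns` parameter, and the sample records are assumptions for illustration and are not part of the script above.

```python
from html import escape


def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_grid(records, columns=4):
    """Return an HTML table string with up to `columns` records per row.

    Each record is [index, picture_url, info, bid, time_left], the same
    shape as the entries of `datalist` in the script above.
    """
    rows = []
    for group in chunked(records, columns):
        cells = []
        for index, picture, info, bid, time_left in group:
            # Escape the text fields and stack them under the image
            lines = '<br>'.join(escape(t.strip())
                                for t in (index, info, bid, time_left))
            cells.append('<td><img src="%s" width="150"><br>%s</td>'
                         % (escape(picture, quote=True), lines))
        rows.append('<tr>' + ''.join(cells) + '</tr>')
    return '<table>\n' + '\n'.join(rows) + '\n</table>'


# Hypothetical sample data standing in for the scraped `datalist`
sample = [[str(n), 'https://example.com/%d.jpg' % n, 'Lot %d' % n, '1.00', '1h']
          for n in range(1, 8)]
html = build_grid(sample, columns=3)
```

With seven sample records and `columns=3`, the last row simply holds the one leftover item. The column count is fixed here; it does not adapt to the width of the browser window.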

A quick mockup of the desired output looks something like this (obviously without the same image repeated): [image]
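
If the number of items per row should depend on the width of the page rather than on a fixed count, an HTML table may not be the best fit, because a table row does not wrap by itself. A minimal sketch, assuming a CSS flexbox container is an acceptable substitute for the table: each record becomes a fixed-width `<div>`, and `flex-wrap: wrap` lets the browser place as many as fit on a line before starting the next. The function name `build_flow_page`, the class names `lots` and `lot`, the 200px item width, and the sample records are all hypothetical.

```python
from html import escape

# Styling for the wrapping layout; the colours and font mirror the script above
STYLE = """<style>
.lots { display: flex; flex-wrap: wrap; }
.lot { width: 200px; margin: 4px; color: blue; font-family: verdana; font-size: 60%; }
.lot img { width: 100%; }
</style>"""


def build_flow_page(records):
    """Return a full HTML page where the items wrap to fit the window width.

    Each record is [index, picture_url, info, bid, time_left].
    """
    items = []
    for index, picture, info, bid, time_left in records:
        text = '<br>'.join(escape(t.strip())
                           for t in (index, info, bid, time_left))
        items.append('<div class="lot"><img src="%s"><br>%s</div>'
                     % (escape(picture, quote=True), text))
    return ('<!DOCTYPE html>\n<html><head><meta charset="utf-8">\n' + STYLE +
            '\n</head><body>\n<div class="lots">\n' + '\n'.join(items) +
            '\n</div>\n</body></html>\n')


# Hypothetical sample data standing in for the scraped `datalist`
sample = [[str(n), 'https://example.com/%d.jpg' % n, 'Lot %d' % n, '1.00', '1h']
          for n in range(1, 6)]
page = build_flow_page(sample)
```

The resulting string could be written out with `open('asdf.html', 'w', encoding='utf-8')`, in the same way the script writes `theads.prettify()`.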

0 answers:

No answers