Question

I am saving certain tags from a web page to an Excel file, so I have the following code:

`import requests
from bs4 import BeautifulSoup
import openpyxl

url = "http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")

wb = openpyxl.Workbook()
ws = wb.active

tagiterator = soup.h2

row, col = 1, 1
ws.cell(row=row, column=col, value=tagiterator.getText())
tagiterator = tagiterator.find_next()

while tagiterator.find_next():
    if tagiterator.name == 'h2':
        row += 1
        col = 1
        ws.cell(row=row, column=col, value=tagiterator.getText(strip=True))
    elif tagiterator.name == 'span':
        col += 1
        ws.cell(row=row, column=col, value=tagiterator.getText(strip=True))
    tagiterator = tagiterator.find_next()

wb.save('DG3test.xlsx')`

It works, but I want to exclude some tags. I only want the h2 tags that have the class 'product-name' and the span tags that have the class 'attribute-value'. I tried the following:

tagiterator['class'] == 'product-name'

tagiterator.hasClass('product-name')

tagiterator.get

and a few other things, none of which worked.

The values I want are shown in this rough image I made: https://ibb.co/eWLsoQ. The URL is in the code.

Answer 1

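What I did below does not include writing to the Excel file. Hopefully that is something you can do yourself, but just leave a comment and I will include that code. The same logic applies: write the product information, then do row += 1 and reset the column (why do we do this? So that each product stays on its own row :). That is the part you have already done.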

from bs4 import BeautifulSoup

import requests

header = {'User-agent' : 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}


url = requests.get("http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml", headers=header).text
soup = BeautifulSoup(url, 'lxml')

find_products = soup.findAll('div',{'class':'product-row'})

for item in find_products:
    title_text = item.find('div',{'class':'product-header'}).h2.a.text.strip() # The title / name of the product
    # print(title_text)
    display = item.find('span',{'class':'attribute-value'}).text.strip() # First attribute, e.g. "49 cali, Full HD, 1920 x 1080"
    # print(display)
    functions_item = item.findAll('span',{'class':'attribute-value'})[1] # Second attribute: the functions ('Funkcje')
    list_of_funcs = functions_item.findAll('a') # The individual functions, e.g. wifi
    # Now you can store them or do something else with them...

    for funcs in list_of_funcs:
        print(funcs.text.strip())
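
The Excel step that the code above leaves out could look roughly like the sketch below. It is only an illustration, not tested against the live page: it assumes the markup uses the class names mentioned in this thread (`product-row`, `product-name`, `attribute-value`), and it parses a small made-up `SAMPLE_HTML` string instead of downloading the page. The function name `products_to_rows` is hypothetical.

```python
from bs4 import BeautifulSoup
import openpyxl

# Made-up sample standing in for the downloaded page (assumption).
SAMPLE_HTML = """
<div class="product-row">
  <div class="product-header"><h2 class="product-name"><a href="#">TV A</a></h2></div>
  <span class="attribute-value">49 cali, Full HD</span>
  <span class="attribute-value">Wi-Fi, Smart TV</span>
</div>
<div class="product-row">
  <div class="product-header"><h2 class="product-name"><a href="#">TV B</a></h2></div>
  <span class="attribute-value">55 cali, 4K</span>
</div>
"""


def products_to_rows(html):
    """Return one list per product: [name, attribute, attribute, ...]."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="product-row"):
        name = item.find("h2", class_="product-name").get_text(strip=True)
        attributes = [
            span.get_text(strip=True)
            for span in item.find_all("span", class_="attribute-value")
        ]
        rows.append([name] + attributes)
    return rows


rows = products_to_rows(SAMPLE_HTML)

wb = openpyxl.Workbook()
ws = wb.active
for row in rows:
    # append() writes the list into the next free row, starting at column A,
    # so no manual "row += 1 / reset column" bookkeeping is needed.
    ws.append(row)
wb.save("DG3test.xlsx")
```

With the real page, `SAMPLE_HTML` would be replaced by the text returned from `requests.get(...)`, provided the class names still match.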

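Algorithm:

We find each product.
We find the tags inside each product and extract the relevant information.
We use .text to extract only the text part.
We use for loops to iterate over each product, and then over its functions, i.e. the tags that hold the product's functions.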
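
As for the checks tried in the question: BeautifulSoup treats `class` as a multi-valued attribute, so `tag['class']` is a list such as `['product-name']` and comparing it to a string is always False (and indexing raises `KeyError` on tags that have no class). If you prefer to keep the single pass over the document from the question, a sketch along these lines may work. The helper name `wanted` and the `SAMPLE_HTML` string are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample with some tags that should be skipped (assumption).
SAMPLE_HTML = """
<h2>Telewizory</h2>
<h2 class="product-name"><a href="#">TV A</a></h2>
<span class="price">1999</span>
<span class="attribute-value">49 cali</span>
<span class="attribute-value">Wi-Fi</span>
<h2 class="product-name"><a href="#">TV B</a></h2>
<span class="attribute-value">55 cali</span>
"""


def wanted(tag):
    """True for h2.product-name and span.attribute-value tags only."""
    classes = tag.get("class", [])  # a list; empty if the tag has no class
    return (tag.name == "h2" and "product-name" in classes) or (
        tag.name == "span" and "attribute-value" in classes
    )


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

table = []
for tag in soup.find_all(wanted):  # matches come back in document order
    text = tag.get_text(strip=True)
    if tag.name == "h2":
        table.append([text])      # a product name starts a new row
    elif table:
        table[-1].append(text)    # an attribute goes into the current row

print(table)
```

Each inner list in `table` corresponds to one spreadsheet row, so it can be written out with the same openpyxl calls used in the question.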

Iterating over HTML by tag class with BeautifulSoup
