Parsing by class with BeautifulSoup in Python

Date: 2018-11-26 07:57:44

Tags: python beautifulsoup

I want to get a product list by parsing a website.

from bs4 import BeautifulSoup as bs

soup = bs(product_list_get.text, 'html.parser')
productlist = soup.find_all('td',{'class':'txtCode'})

Part of the result looks like this:

[<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=42" product_no="42" target="_blank" title="새창 열림">P00000BQ</a></td>, <td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=41" product_no="41" target="_blank" title="새창 열림">P00000BP</a></td>

What I want is a list of the product_no values,

so the ideal result would be

[42,41]

I tried

productlist = soup.find_all('td',{'class':'txtCode'}).get('product_no')

but the result was

AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Can someone show me how to handle this?

2 answers:

Answer 0 (score: 1)

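The find_all method returns a list of Tag elements (a ResultSet), not a single Tag, which is why calling .get() on it fails. Your code productlist = soup.find_all('td',{'class':'txtCode'}) therefore returns a list of <td> elements. What you want is the product_no attribute of the inner <a> element of each <td> that was found.

Iterate over productlist and read product_no from each inner <a>: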

productlist = soup.find_all('td', {'class':'txtCode'})
product_nos = [int(p.find('a').get('product_no')) for p in productlist]

Alternatively, you can find the <a> elements that have a product_no attribute directly:

results = soup.find_all('a', {'product_no':True})
product_nos = [int(r.get('product_no')) for r in results]
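
As a self-contained check, both approaches can be run against the two cells quoted in the question. This is only a minimal sketch: the html string is a hand-copied sample, the wrapping <table><tr> is an assumption (the full page is not shown), and the last line adds a CSS-selector variant via select() that is not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the wrapping <table><tr> is assumed.
html = """
<table><tr>
<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=42" product_no="42" target="_blank" title="새창 열림">P00000BQ</a></td>
<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=41" product_no="41" target="_blank" title="새창 열림">P00000BP</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Approach 1: visit each <td class="txtCode"> and read its inner <a>.
cells = soup.find_all("td", {"class": "txtCode"})
via_cells = [int(td.find("a").get("product_no")) for td in cells]

# Approach 2: match <a> tags that carry a product_no attribute.
links = soup.find_all("a", {"product_no": True})
via_links = [int(a.get("product_no")) for a in links]

# Variant (not in the answer): the same lookup as a CSS selector.
via_select = [int(a["product_no"]) for a in soup.select("td.txtCode a[product_no]")]

print(via_cells, via_links, via_select)
```

Note that approach 1 assumes every matching <td> contains an <a>; if some cells might not, td.find("a") returns None and the .get() call would raise, so approach 2 or the selector is the safer choice there.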

Answer 1 (score: 1)

product_no also appears in the href, so another option is to extract the href and then use a regular expression to match product_no:

from bs4 import BeautifulSoup
import re

lists = [
"""<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=42" product_no="42" target="_blank" title="새창 열림">P00000BQ</a></td>""", 
"""<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=41" product_no="41" target="_blank" title="새창 열림">P00000BP</a></td>"""]

for each in lists:
    soup = BeautifulSoup(each,"lxml")
    href = soup.a.get("href")
    product_no = re.search(r"(?<=product_no=)\w+",href).group(0)
    print(product_no)
#42
#41
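
If you do go through the href, a regex is not the only option. As an alternative sketch (not what this answer uses), the standard library's urllib.parse can split the query string, so the result does not depend on the exact text surrounding product_no. The helper name product_no_from_href is made up for illustration, and the hrefs are the two from the question.

```python
from urllib.parse import urlparse, parse_qs

def product_no_from_href(href):
    """Return product_no from the href's query string as an int, or None if absent."""
    # parse_qs maps each query key to a list of values, e.g. {"product_no": ["42"]}.
    query = parse_qs(urlparse(href).query)
    values = query.get("product_no")
    return int(values[0]) if values else None

hrefs = [
    "/disp/admin/shop1/product/ProductRegister?product_no=42",
    "/disp/admin/shop1/product/ProductRegister?product_no=41",
]
print([product_no_from_href(h) for h in hrefs])
```

Unlike the regex version, this returns None instead of raising when a link has no product_no parameter.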