I want to get a product_list by parsing a website.
soup = bs(product_list_get.text, 'html.parser')
productlist = soup.find_all('td',{'class':'txtCode'})
Part of the result looks like this:
[<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=42" product_no="42" target="_blank" title="새창 열림">P00000BQ</a></td>, <td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=41" product_no="41" target="_blank" title="새창 열림">P00000BP</a></td>
What I want is a list of the product_no values, so the ideal result would be:
[42,41]
I tried:
productlist = soup.find_all('td',{'class':'txtCode'}).get('product_no')
but the result is:
AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
Can someone show me how to handle this?
Answer 0 (score: 1)
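The find_all method returns a list of Tag elements (a ResultSet), not a single Tag, which is why calling .get() on it raises AttributeError. Your code productlist = soup.find_all('td',{'class':'txtCode'}) therefore returns a list of <td> elements. What you want is the product_no attribute of the inner <a> element of each <td> that was found.
Iterate over productlist and read product_no from the <a> inside each element: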
productlist = soup.find_all('td', {'class':'txtCode'})
product_nos = [int(p.find('a').get('product_no')) for p in productlist]
Alternatively, you can search directly for the <a> elements that have a product_no attribute:
results = soup.find_all('a', {'product_no':True})
product_nos = [int(r.get('product_no')) for r in results]
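To check both approaches without hitting the real site, here is a minimal self-contained sketch. The html string is an assumption: it is just the two cells from the question wrapped in a table, and the variable names (html, via_cells, via_links, via_select) are made up for illustration. The third variant uses a CSS selector through select(), which is not part of the original answer but should give the same result.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (wrapped in a table for illustration only).
html = """
<table><tr>
<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=42" product_no="42" target="_blank">P00000BQ</a></td>
<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=41" product_no="41" target="_blank">P00000BP</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Approach 1: find the <td> cells, then read the attribute from the <a> inside each one.
cells = soup.find_all("td", {"class": "txtCode"})
via_cells = [int(td.find("a").get("product_no")) for td in cells]

# Approach 2: find every <a> that carries a product_no attribute.
links = soup.find_all("a", {"product_no": True})
via_links = [int(a.get("product_no")) for a in links]

# Variant (not in the original answer): the same lookup as a CSS selector.
via_select = [int(a["product_no"]) for a in soup.select("td.txtCode a[product_no]")]

print(via_cells)
print(via_links)
print(via_select)
```

Note that approach 1 assumes every matching <td> contains an <a>; if one does not, td.find("a") returns None and the .get() call fails, so approach 2 or the selector may be safer on messier pages.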
Answer 1 (score: 1)
product_no is contained in the href, so you can extract the href and then use a regular expression to match the product_no:
from bs4 import BeautifulSoup
import re
lists = [
"""<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=42" product_no="42" target="_blank" title="새창 열림">P00000BQ</a></td>""",
"""<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=41" product_no="41" target="_blank" title="새창 열림">P00000BP</a></td>"""]
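for each in lists:
    # Parse each fragment and take the href of its first <a>
    soup = BeautifulSoup(each,"lxml")
    href = soup.a.get("href")
    # The lookbehind matches the value that follows "product_no="
    product_no = re.search(r"(?<=product_no=)\w+",href).group(0)
    print(product_no)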
#42
#41
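If you prefer not to rely on a regular expression, the query string of the href can also be parsed with the standard library. This is only a sketch of that alternative, not part of the answer above: the hrefs list is copied from the question's sample output, and product_no_from_href is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs

# href values as they appear in the question's sample output.
hrefs = [
    "/disp/admin/shop1/product/ProductRegister?product_no=42",
    "/disp/admin/shop1/product/ProductRegister?product_no=41",
]

def product_no_from_href(href):
    """Return product_no from the href's query string as an int, or None if absent."""
    query = parse_qs(urlparse(href).query)  # e.g. {"product_no": ["42"]}
    values = query.get("product_no")
    return int(values[0]) if values else None

product_nos = [product_no_from_href(h) for h in hrefs]
print(product_nos)
```

Parsing the query string this way keeps working if the link later gains extra parameters, and it returns None instead of raising when product_no is missing.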