How to scrape two nested elements with Python

Asked: 2020-10-30 19:35:07

Tags: python html json web-scraping

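Hi, I want to get some information located under the <del> and <ins> tags, but I couldn't find any solution. Does anyone have an idea about scraping this, or any way of getting this information?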

Here is my Python code:

  import requests
  import json
  from bs4 import BeautifulSoup

  header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}
  
  base_url = "https://www.n11.com/super-firsatlar"
  
  r = requests.get(base_url,headers=header)
  
  if r.status_code == 200:
    soup = BeautifulSoup(r.text, 'html.parser')
    books = soup.find_all('li',attrs={"class":"column"})
    result=[]
    for book in books:
      title=book.find('h3').text
      link=base_url +book.find('a')['href']
      picture = base_url + book.find('img')['src']
      price=book.find('p', {'class': 'del'})
      single ={'title':title,'link':link,'picture':picture,'price':price}
      result.append(single)
      with open('book.json','w', encoding='utf-8') as f:
        json.dump(result,f,indent=4,ensure_ascii=False)
  else:
    print(r.status_code)

Here is my HTML page:

<li class="column">
    <script type="text/javascript">
var customTextOptionMap = {};
    </script>
    <div id="p-457010862" class="columnContent ">
        <div class="pro">
            <a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"
               title="Oppo A73 128 GB (Oppo Türkiye Garantili)" class="plink" data-id="457010862">
                <img data-original="https://n11scdn1.akamaized.net/a1/215/elektronik/cep-telefonu/oppo-a73-128-gb-oppo-turkiye-garantili__1298508275589871.jpg"
                     width="215" height="215"
                     src="https://n11scdn1.akamaized.net/a1/215/elektronik/cep-telefonu/oppo-a73-128-gb-oppo-turkiye-garantili__1298508275589871.jpg"
                     alt="Oppo A73 128 GB (Oppo Türkiye Garantili)" class="lazy" style="">
                <h3 class="productName ">
                    Oppo A73 128 GB (Oppo Türkiye Garantili)</h3>
                <span class="loading"></span>
            </a>
        </div>
        <div class="proDetail">
            <a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"
               class="oldPrice" title="Oppo A73 128 GB (Oppo Türkiye Garantili)">


                <del>2.999, 00 TL</del>


            </a> <a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"
                    class="newPrice" title="Oppo A73 128 GB (Oppo Türkiye Garantili)">


            <ins>2.899, 00<span content="TRY">TL</span></ins>



        </a>
            <div class="discount discountS">
                <div>
                    <span class="percent">%</span>
                    <span class="ratio">3</span>
                </div>
            </div>
            <span class="textImg freeShipping"></span>
            <p class="catalogView-hover-separate"></p>
            <div class="moreOpt">
                <a title="Oppo A73 128 GB (Oppo Türkiye Garantili)" class="textImg moreOptBtn"
                   href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"></a>
            </div>
        </div>
    </div>
</li>

2 answers:

Answer 0 (score: 0)

Unless I'm misunderstanding your question, it should be as simple as this:

del_data = soup.find_all("del")
ins_data = soup.find_all("ins")

Isn't this what you are trying to achieve? If not, please clarify your question.
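As a minimal sketch of the same idea applied per product, the snippet below parses a trimmed copy of the HTML from the question (inlined so that it runs without a network request) and reads the del and ins text inside each li. The variable names, the dictionary keys old_price and new_price, and the use of get_text(strip=True) are illustrative assumptions, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, inlined for illustration.
html = """
<li class="column">
  <div class="pro">
    <a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde" class="plink">
      <h3 class="productName ">
        Oppo A73 128 GB (Oppo Türkiye Garantili)</h3>
    </a>
  </div>
  <div class="proDetail">
    <a class="oldPrice"><del>2.999, 00 TL</del></a>
    <a class="newPrice"><ins>2.899, 00<span content="TRY">TL</span></ins></a>
  </div>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

result = []
for item in soup.find_all("li", attrs={"class": "column"}):
    # del and ins are tag names, so they are passed as the first argument.
    old = item.find("del")
    new = item.find("ins")
    result.append({
        "title": item.find("h3").get_text(strip=True),
        # The href in this markup is already absolute, so it is used as-is.
        "link": item.find("a")["href"],
        # Store plain strings (not Tag objects) so json.dump can serialize them.
        "old_price": old.get_text(strip=True) if old else None,
        "new_price": new.get_text(" ", strip=True) if new else None,
    })

print(result)
```

Looking up the tags on each li rather than on the whole soup keeps every price attached to its own product, and the None fallback covers items that have no discounted price.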

Answer 1 (score: 0)

del and ins are not class names, they are tags. You can find them simply with find_all('del'):
price = book.find_all('del')
for p in price:
    print(p.text)

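This gives:

2.999,00 TL
189,90 TL
8.308,44 TL
499,90 TL
6.999,00 TL
99,00 TL
18,00 TL
499,00 TL
169,99 TL
1.499,90 TL
3.010,00 TL
2.099,90 TL
......

I guess this is what you want. You have to access the text attribute here, because a Tag object itself cannot be written to JSON. So the element is located; how you want to serialize it is another question.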
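For the serialization step, one possible approach is to turn each price string into a number before writing JSON. The sketch below uses only the standard library. The helper name parse_tl_price is hypothetical, and it assumes that '.' is the thousands separator and ',' the decimal separator, as the prices above suggest.

```python
import json
from decimal import Decimal


def parse_tl_price(text):
    """Convert a price string such as '2.999, 00 TL' to a Decimal.

    Assumes '.' separates thousands and ',' marks the decimals;
    other formats are not handled by this sketch.
    """
    # Drop the currency label and any spaces inside the number.
    cleaned = text.replace("TL", "").replace(" ", "")
    # Remove thousands separators, then switch to a decimal point.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)


prices = ["2.999, 00 TL", "189,90 TL", "8.308,44 TL"]

# Decimal is not JSON serializable, so it is stored as a string here.
records = [{"raw": p, "amount": str(parse_tl_price(p))} for p in prices]

print(json.dumps(records, indent=4, ensure_ascii=False))
```

Keeping the raw text next to the parsed amount makes it easy to spot any price that does not fit the assumed format.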