Question

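In this post, alecxe provided a solution for scraping the Product Information / Product Details table on Amazon.com. However, that table's format differs from the one used for many newer items listed on Amazon.

You can see that the old format here differs from the new format here.

What I tried: in the code alecxe gave, he used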

for li in soup.select('table#productDetailsTable div.content ul li'):

I tried changing it to the following (and removed everything after it):

for tr in soup.select('table#productDetails_detailBullets_sections1 tbody tr'):
    print text.tr
    print(repr(tr))

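to see whether I could extract at least something from the product information table. However, nothing was printed.

I also tried the find_all() and find() functions, but I could not extract what I need, or even anything close to it.

The difficulty in solving this comes from the HTML structure of the new table. It looks like: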

<table ... >
<tbody>
.
.
.    
<tr>
    <th class="a-color-secondary a-size-base prodDetSectionEntry">
        Best Sellers Rank
    </th>
    <td>
         <span>

                <span>#8,740 in Toys &amp; Games (<a href="/gp/bestsellers/toys-and-games/ref=pd_dp_ts_toys-and-games_1">See Top 100 in Toys &amp; Games</a>)</span>
        <br>

                <span>#67 in <a href="/gp/bestsellers/toys-and-games/ref=pd_zg_hrsr_toys-and-games_1_1">Toys &amp; Games</a> &gt; <a href="/gp/bestsellers/toys-and-games/166359011/ref=pd_zg_hrsr_toys-and-games_1_2">Puzzles</a> &gt; <a href="/gp/bestsellers/toys-and-games/166363011/ref=pd_zg_hrsr_toys-and-games_1_3_last">Jigsaw Puzzles</a></span>
        <br>

                <span>#87 in <a href="/gp/bestsellers/toys-and-games/ref=pd_zg_hrsr_toys-and-games_2_1">Toys &amp; Games</a> &gt; <a href="/gp/bestsellers/toys-and-games/251909011/ref=pd_zg_hrsr_toys-and-games_2_2">Preschool</a> &gt; <a href="/gp/bestsellers/toys-and-games/251910011/ref=pd_zg_hrsr_toys-and-games_2_3">Pre-Kindergarten Toys</a> &gt; <a href="/gp/bestsellers/toys-and-games/251942011/ref=pd_zg_hrsr_toys-and-games_2_4_last">Puzzles</a></span>
        <br>

        </span>
    </td>
    </tr>
.
. 
.
</tbody>
</table>

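If I want to extract the seller rank "Toys & Games > Puzzles > Jigsaw Puzzles", how do I do that? (That is the text in the second inner span, at least in this case, in the HTML above.)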

Answer 1

I could get your code working with a few small adjustments:

Remove 'tbody' from soup.select; it is a tag generated by the browser
Print tr.text instead of text.tr

Code:

for tr in soup.select('table#productDetails_detailBullets_sections1 tr'):
    if 'Jigsaw Puzzles' in tr.text:
        print(tr.text.strip())

Or, if you prefer find / find_all:

for tr in soup.find('table', id='productDetails_detailBullets_sections1').find_all('tr'):
    if 'Jigsaw Puzzles' in tr.text:
        for span in tr.find('span').find_all('span'):
            if 'Jigsaw Puzzles' in span.text:
                print(span.text.strip())
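
For a version that can be run without fetching a page, here is a minimal sketch that parses a trimmed copy of the row from the question. It makes a few assumptions: the table carries the id productDetails_detailBullets_sections1 (the question's HTML elides the table attributes, so this is taken from the selector), the href values are replaced with "#" placeholders, the html.parser backend is used, and category_paths is a hypothetical helper name. It also shows why the 'tbody' selector printed nothing: the parser does not add a tbody element that is absent from the raw source.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question. The table id is an assumption
# taken from the selector; hrefs are "#" placeholders. No <tbody> in the source.
HTML = """
<table id="productDetails_detailBullets_sections1">
<tr>
    <th class="a-color-secondary a-size-base prodDetSectionEntry">
        Best Sellers Rank
    </th>
    <td>
        <span>
            <span>#8,740 in Toys &amp; Games (<a href="#">See Top 100 in Toys &amp; Games</a>)</span>
            <br>
            <span>#67 in <a href="#">Toys &amp; Games</a> &gt; <a href="#">Puzzles</a> &gt; <a href="#">Jigsaw Puzzles</a></span>
            <br>
        </span>
    </td>
</tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# html.parser does not insert a <tbody>, so this selector matches nothing.
with_tbody = soup.select("table#productDetails_detailBullets_sections1 tbody tr")


def category_paths(soup, table_id="productDetails_detailBullets_sections1"):
    """Return one 'A > B > C' string per rank line that has a category path.

    Hypothetical helper: it relies on each category being its own <a> link.
    """
    paths = []
    for span in soup.select(f"table#{table_id} td > span > span"):
        links = [a.get_text(strip=True) for a in span.find_all("a")]
        # The overall rank line has only the "See Top 100" link; skip it.
        if len(links) > 1:
            paths.append(" > ".join(links))
    return paths


paths = category_paths(soup)
print(len(with_tbody))
print(paths)
```

Joining the link texts avoids matching on a fixed string such as 'Jigsaw Puzzles', so the same loop should work for other categories as long as the page keeps this nested span layout.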

How do I scrape the new format of product information on Amazon.com using BeautifulSoup?
