Question

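I'm trying to build a scraper that collects text from a web page. I'm looking at two specific divs with different class names ("product-image" and "product-details"). I loop over them and grab the text of every "a" and "dd" tag inside each div.

It's worth noting that this is the first Python program I've ever written...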

Here is my code:

When I print list_of_rows, I get the following output for each pass through the loop:

[price]

[title], [author], [publisher], [blah], [blah], [blah]

[price] comes from the "product-image" div block. [title] and the rest come from the "product-details" div block.

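So basically, findAll and the loop I wrote produce a separate row for each div block I loop over. What I want is a single row of output covering both blocks, like this:

[price], [title], [author], [publisher], [blah], [blah], [blah]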

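Is there a way to do this within my current approach, or do I need to break it into multiple loops, extract the data separately, and then merge? I've gone through all the Q&A on StackOverflow and other sites, and while I can find examples of findAll loops over multiple classes, I can't find any example of how to reduce the output to a single row.

Here is an excerpt of the web page I'm parsing. This snippet appears x times in the HTML I parse, where x is the number of products on the page: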

list_of_rows = []
for row in soup.findAll(True, {"class":["product-image", "product-details"]}):
    list_of_cells = []
    for cell in row.findAll(['a', 'dd']):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

Any pointers or help would be greatly appreciated!

Answer 1

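From your question I can see two possible results. I'm not sure which one you're looking for, so I'm posting both cases.

First case - extend the list instead of appending to it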

from bs4 import BeautifulSoup
data = """<div class="product-image">
    <a class="thumb" href="/Store/Details/life-on-the-screen/_/R-9780684833484B"><img src="http://images.bookdepot.com/covers/large/isbn978068/9780684833484-l.jpg" alt="" class="cover" />
        <div class="price "><span>$</span>2.25
        </div>
    </a>
</div>

<div class="product-details">
    <dl>
        <dt><div class="nowrap"><span><a href="/Store/Details/life-on-the-screen/_/R-9780684833484B" title="Life On The Screen">Life On The Screen</a></span></div></dt>
        <dd class="type"><div class="nowrap"><span><a href="/Store/Browse/turkle-sherry/_/N-4294697489/Ne-4">Turkle, Sherry</a></span></div></dd>
        <dd class="type"><div class="nowrap"><a href="/Store/Browse/simon-and-schuster/_/N-4294151338/Ne-5">Simon and Schuster</a></div></dd>
        <dd class="type">(Paperback)</dd>
        <dd class="type">Computers &amp; Internet</dd>
        <dd class="type">ISBN: 9780684833484</dd>
        <dd>List $15.00 - Qty: 9</dd>
           </dl>
</div>"""

soup = BeautifulSoup(data,'lxml')

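list_of_rows = []
for row in soup.findAll(True, {"class":["product-image", "product-details"]}):
    list_of_cells = []
    for cell in row.findAll(['a', 'dd']):
        # BeautifulSoup decodes &nbsp; to U+00A0, so remove that character
        text = cell.text.replace(u'\xa0', u'')
        list_of_cells.append(text)
    # extend() adds the cells themselves, so both divs end up in one flat list
    list_of_rows.extend(list_of_cells)
print(list_of_rows)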

Output

[u'\n$2.25\n        \n', u'Life On The Screen', u'Turkle, Sherry', u'Turkle, Sherry', u'Simon and Schuster', u'Simon and Schuster', u'(Paperback)', u'Computers & Internet', u'ISBN: 9780684833484', u'List $15.00 - Qty: 9']
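To make the difference between append and extend concrete, here is a minimal sketch using plain lists, with no parsing involved. The names image_cells and detail_cells are hypothetical stand-ins for the cells collected from the two div blocks; the values are borrowed from the output above.

```python
# Hypothetical cell lists, standing in for what the inner loop collects per div.
image_cells = ['$2.25']
detail_cells = ['Life On The Screen', 'Turkle, Sherry', 'Simon and Schuster']

appended = []
extended = []
for cells in (image_cells, detail_cells):
    appended.append(cells)  # adds the list object itself: one nested row per div
    extended.extend(cells)  # adds the list's items: everything lands in one flat row

print(appended)
print(extended)
```

With append you get a list of two lists (one per div), which matches what the question describes; with extend you get one flat list.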

Second case - you need to strip the newline characters from the HTML text

list_of_rows = []
for row in soup.findAll(True, {"class":["product-image", "product-details"]}):
    list_of_cells = []
    for cell in row.findAll(['a', 'dd']):
        # BeautifulSoup decodes &nbsp; to U+00A0, so remove that character
        text = cell.text.replace(u'\xa0', u'')
        # strip() drops the leading/trailing newlines and spaces
        list_of_cells.append(text.strip())
    list_of_rows.append(list_of_cells)
print(list_of_rows)

Output

[[u'$2.25'], [u'Life On The Screen', u'Turkle, Sherry', u'Turkle, Sherry', u'Simon and Schuster', u'Simon and Schuster', u'(Paperback)', u'Computers & Internet', u'ISBN: 9780684833484', u'List $15.00 - Qty: 9']]
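Both cases above deal with a single product. If the snippet repeats once per product, extend would flatten every product into one long list, so a possible alternative is to pair each "product-image" div with its "product-details" div and build one row per pair. The following is only a sketch under assumptions: the HTML is a trimmed, made-up two-product sample (the second product is invented), the two kinds of div are assumed to appear in matching order, and html.parser is used instead of lxml just to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical sample with two products (second product is made up).
html = """
<div class="product-image"><a class="thumb" href="#"><div class="price"><span>$</span>2.25</div></a></div>
<div class="product-details"><dl>
  <dt><a href="#">Life On The Screen</a></dt>
  <dd class="type"><a href="#">Turkle, Sherry</a></dd>
  <dd>List $15.00 - Qty: 9</dd>
</dl></div>
<div class="product-image"><a class="thumb" href="#"><div class="price"><span>$</span>3.50</div></a></div>
<div class="product-details"><dl>
  <dt><a href="#">Example Title</a></dt>
  <dd class="type"><a href="#">Example Author</a></dd>
  <dd>List $20.00 - Qty: 1</dd>
</dl></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumption: the n-th image div belongs to the n-th details div.
images = soup.find_all("div", class_="product-image")
details = soup.find_all("div", class_="product-details")

list_of_rows = []
for image, detail in zip(images, details):
    # Price first, then the title (dt) and the remaining fields (dd).
    row = [image.get_text(strip=True)]
    row.extend(cell.get_text(strip=True) for cell in detail.find_all(["dt", "dd"]))
    list_of_rows.append(row)

for row in list_of_rows:
    print(row)
```

Matching "dt" and "dd" rather than "a" and "dd" also avoids the duplicated author and publisher entries seen in the outputs above, which appear because a "dd" and the "a" nested inside it are both matched and return the same text.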

Using BeautifulSoup findAll with multiple classes/tags to combine multi-row output into a single row
