Python crawler: BeautifulSoup decompose() function

Date: 2015-01-09 22:56:33

Tags: python web-scraping beautifulsoup html-parsing web-crawler

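I can't get the decompose() function to work in my scraper, which uses Python and BeautifulSoup.

The problem is as follows. I am trying to get all the specification data from a product page on a website (you can see it in the page source):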

soup = soup_function('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html')
dt = soup.findAll('dt', {'class': 'product-specs--item-title'})

for i in range(0, len(dt)):

    dtRows = dt[i]
    dtRowsStrip = dtRows.text.strip()

    print(dtRows.text.strip())

    # print(dtRows)

    # dtRowsSplit = "".join(dtRowsStrip.split())
    # print(dtRowsSplit)

When I use:

> print(dtRows.text.strip())

I get this output:

Serie
Threads
Socket
Kloksnelheid
Fabrikantcode
Artikelnummer
Merk
Garantie
Garantietype
Serie           


        Serie
Socket          


        Socket
Codenaam            


        Codenaam
Threads         


        Threads
Turbo Frequency         


        Turbo Frequency
Multiplier unlocked         


        Multiplier unlocked
Cache           


        Cache
Geheugencontroller          


        Geheugencontroller
etc ....

The first rows are correct. In the rows after that, I get every value twice, because there is an <a> tag inside the <dt> tag.

An example:

<dt class="product-specs--item-title">
    <a class="product-specs--help-icon js-tooltip" href="#spec_Serie" title="Zowel AMD als Intel produceren processoren in verschillende series. Een serie is bedoeld voor bepaald gebruik. Zo zijn Core i3 processoren geschikt voor internet &amp; office werkzaamheden en Core i7 processoren voor veeleisende multitasking en gaming. Binnen een serie zijn verschillende modellen processoren verkrijgbaar. Van welke serie is deze processor onderdeel?"><i class="icon icon-circle-questionmark"></i><span class="product-specs--help-title">Serie</span></a>
    <span>Serie</span>
</dt>

Can someone help me remove the entire <a> element?

Additional information:

If I use the following code:

soup = soup_function('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html')

for spec in soup.select('dt.product-specs--item-title'):
    print(spec.get_text(strip=True))

The output is as follows:

Serie
Threads
Socket
Kloksnelheid
Fabrikantcode
Artikelnummer
Merk
Garantie
Garantietype
SerieSerie
SocketSocket
CodenaamCodenaam
ThreadsThreads
Turbo FrequencyTurbo Frequency
Multiplier unlockedMultiplier unlocked
CacheCache
GeheugencontrollerGeheugencontroller
ProductieprocesProductieproces
Stroomverbruik maximaalStroomverbruik maximaal
KloksnelheidKloksnelheid
ProcessorkernenProcessorkernen
Type GPUType GPU

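As you can see, from the second <dl> block onward I get every value twice.

Addendum: Thanks... I had found it as well. I know your code is better, but I just wanted to share my solution: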

for spec in soup.select('div.product-specs dl.product-specs--list > dt.product-specs--item-title span.product-specs--help-title'):
    print(spec.get_text(strip=True))

    # Walk back up to the <dt>, then take the <dd> that holds the value
    parent = spec.find_parent('dt')
    value = parent.find_next_sibling("dd", {'class': 'product-specs--item-spec'})
    print(value.text.strip())

1 answer:

Answer 0 (score: 3)

You just need to be more specific about which nodes you want to extract:

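from urllib.request import urlopen  # Python 3 location of urlopen (urllib2 in Python 2)
from bs4 import BeautifulSoup

# Naming the parser explicitly avoids bs4's "no parser specified" warning
soup = BeautifulSoup(urlopen('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html'), 'html.parser')

# The child combinators (>) only match lists directly under div.product-specs
for spec in soup.select('div.product-specs > dl.product-specs--list > dt.product-specs--item-title'):
    print(spec.get_text(strip=True))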

Prints:

Serie
Threads
Socket
Kloksnelheid

Here we get only the first block of specifications, the list that sits directly under div.product-specs.
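The difference between the child combinator (`>`) and the descendant combinator (a space) can be tried offline. This is only a minimal sketch: the class names come from the question, but the nesting of the second list and the `hypothetical-group` wrapper are assumptions made for illustration, not the real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup. Only the product-specs* class names come
# from the question; the nesting and "hypothetical-group" are assumptions.
html = """
<div class="product-specs">
  <dl class="product-specs--list">
    <dt class="product-specs--item-title">Serie</dt>
    <dd class="product-specs--item-spec">Core i7</dd>
  </dl>
  <div class="hypothetical-group">
    <dl class="product-specs--list">
      <dt class="product-specs--item-title">
        <a class="product-specs--help-icon" href="#spec_Cache"><span class="product-specs--help-title">Cache</span></a>
        <span>Cache</span>
      </dt>
      <dd class="product-specs--item-spec">8 MB</dd>
    </dl>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Child combinator: only <dl> elements that are direct children of the div.
direct = [dt.get_text(strip=True) for dt in soup.select(
    "div.product-specs > dl.product-specs--list > dt.product-specs--item-title")]

# Descendant combinator: <dl> elements at any depth below the div.
nested = [dt.get_text(strip=True) for dt in soup.select(
    "div.product-specs dl.product-specs--list > dt.product-specs--item-title")]

print(direct)  # ['Serie']
print(nested)  # ['Serie', 'CacheCache']
```

The second result also reproduces the doubled text from the question: the label appears once inside the help link and once in the plain span, and `get_text(strip=True)` joins both.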


If you need to get all the product specifications and avoid the duplicates, go one level deeper, down to the span with class="product-specs--help-title":

for spec in soup.select('div.product-specs dl.product-specs--list > dt.product-specs--item-title span.product-specs--help-title'):
    print(spec.get_text(strip=True))

Prints:

Serie
Socket
Codenaam
Threads
Turbo Frequency
Multiplier unlocked
Cache
Geheugencontroller
Productieproces
Stroomverbruik maximaal
Kloksnelheid
Processorkernen
Type GPU
Koeler meegeleverd

Here is how to get name: value pairs for the specifications:

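from urllib.request import urlopen  # Python 3 location of urlopen (urllib2 in Python 2)
from bs4 import BeautifulSoup

# Naming the parser explicitly avoids bs4's "no parser specified" warning
soup = BeautifulSoup(urlopen('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html'), 'html.parser')

for spec in soup.select('div.product-specs dl.product-specs--list > dt.product-specs--item-title'):
    # Titles without a <span> belong to the short summary block; skip them
    name = spec.span
    if not name:
        continue

    # The value lives in the <dd> that follows the <dt>
    value = spec.find_next_sibling('dd', class_='product-specs--item-spec')
    print(name.get_text(strip=True), value.get_text(strip=True))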

Prints:

Serie Core i7
Socket 1150
Codenaam Haswell Refresh
Threads 8
Turbo Frequency 4400 MHz
Multiplier unlocked Ja
Cache 8 MB
Geheugencontroller DDR3-1600
Productieproces 22 nm
Stroomverbruik maximaal 88 watt
Kloksnelheid 4000 MHz
Processorkernen Quad-core
Type GPU Intel HD Graphics 4600
Koeler meegeleverd Ja
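Since the question asks about decompose(), that route works too. The following is a minimal sketch rather than a tested scraper for the live page: the markup is shortened from the example in the question (the long title attribute is left out), and the surrounding `<dl>` and the `<dd>` value are assumed here so that the snippet runs without network access.

```python
from bs4 import BeautifulSoup

# Shortened from the example in the question; the <dl> wrapper and the
# <dd> value are assumptions added so the snippet is self-contained.
html = """
<dl>
<dt class="product-specs--item-title">
    <a class="product-specs--help-icon js-tooltip" href="#spec_Serie"><i class="icon icon-circle-questionmark"></i><span class="product-specs--help-title">Serie</span></a>
    <span>Serie</span>
</dt>
<dd class="product-specs--item-spec">Core i7</dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

specs = {}
for dt in soup.select("dt.product-specs--item-title"):
    # Remove every help link, including the duplicate label inside it.
    for link in dt.find_all("a", class_="product-specs--help-icon"):
        link.decompose()
    name = dt.get_text(strip=True)
    dd = dt.find_next_sibling("dd", class_="product-specs--item-spec")
    specs[name] = dd.get_text(strip=True) if dd else None

print(specs)  # {'Serie': 'Core i7'}
```

Keep in mind that decompose() changes the parsed tree in place, so the removed links are gone for any later query on the same soup object. Narrowing the selector, as shown above, leaves the tree untouched.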