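I can't get the decompose() function to work in my scraper, which uses Python and BeautifulSoup.
The problem is as follows. I am trying to get all the specification data from a product page (you can see it in the page source):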
soup = soup_function('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html')
dt = soup.findAll('dt', {'class': 'product-specs--item-title'})
for i in range(0, len(dt)):
    dtRows = dt[i]
    dtRowsStrip = dtRows.text.strip()
    print(dtRows.text.strip())
    # print(dtRows)
    # dtRowsSplit = "".join(dtRowsStrip.split())
    # print(dtRowsSplit)
When I use:
> print(dtRows.text.strip())
I get this output:
Serie
Threads
Socket
Kloksnelheid
Fabrikantcode
Artikelnummer
Merk
Garantie
Garantietype
Serie
Serie
Socket
Socket
Codenaam
Codenaam
Threads
Threads
Turbo Frequency
Turbo Frequency
Multiplier unlocked
Multiplier unlocked
Cache
Cache
Geheugencontroller
Geheugencontroller
etc ....
The first complete block is correct. In the second block I get doubled values, because there is an <a> tag inside the <dt> tag.
An example:
<dt class="product-specs--item-title">
<a class="product-specs--help-icon js-tooltip" href="#spec_Serie" title="Zowel AMD als Intel produceren processoren in verschillende series. Een serie is bedoeld voor bepaald gebruik. Zo zijn Core i3 processoren geschikt voor internet & office werkzaamheden en Core i7 processoren voor veeleisende multitasking en gaming. Binnen een serie zijn verschillende modellen processoren verkrijgbaar. Van welke serie is deze processor onderdeel?"><i class="icon icon-circle-questionmark"></i><span class="product-specs--help-title">Serie</span></a>
<span>Serie</span>
</dt>
Can someone help me remove the entire <a> element?
Additional information:
If I use the following code:
soup = soup_function('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html')
for spec in soup.select('dt.product-specs--item-title'):
    print(spec.get_text(strip=True))
The output is:
Serie
Threads
Socket
Kloksnelheid
Fabrikantcode
Artikelnummer
Merk
Garantie
Garantietype
SerieSerie
SocketSocket
CodenaamCodenaam
ThreadsThreads
Turbo FrequencyTurbo Frequency
Multiplier unlockedMultiplier unlocked
CacheCache
GeheugencontrollerGeheugencontroller
ProductieprocesProductieproces
Stroomverbruik maximaalStroomverbruik maximaal
KloksnelheidKloksnelheid
ProcessorkernenProcessorkernen
Type GPUType GPU
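As you can see, from the second <dl> block onward I get doubled values.
Addition: Thanks... I found it as well. I know your code is better, but I just wanted to share my solution: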
for spec in soup.select('div.product-specs dl.product-specs--list > dt.product-specs--item-title span.product-specs--help-title'):
    print(spec.get_text(strip=True))
    parent = spec.find_parent('dt')
    value = parent.find_next_sibling("dd", {'class': 'product-specs--item-spec'})
    print(value.text.strip())
Answer 0 (score: 3)
You just need to be more specific about which nodes you want to extract:
from urllib.request import urlopen  # Python 3; the original answer used urllib2 (Python 2)
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html'), 'html.parser')
# The ">" combinators only match direct children
for spec in soup.select('div.product-specs > dl.product-specs--list > dt.product-specs--item-title'):
    print(spec.get_text(strip=True))
Prints:
Serie
Threads
Socket
Kloksnelheid
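Here the `>` child combinators restrict the match to the `dl.product-specs--list` block that sits directly under `div.product-specs`.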
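To illustrate why the selector above returns only part of the list, here is a minimal sketch comparing the child combinator (`>`) with the descendant combinator (a space). The HTML fragment is made up for the example, and the wrapper class `product-specs--group` is hypothetical; the real page may nest its lists differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, simplified from the structure described in the question.
# The "product-specs--group" wrapper is an assumption made for this sketch.
html = """
<div class="product-specs">
  <dl class="product-specs--list">
    <dt class="product-specs--item-title"><span>Serie</span></dt>
    <dd class="product-specs--item-spec">Core i7</dd>
  </dl>
  <div class="product-specs--group">
    <dl class="product-specs--list">
      <dt class="product-specs--item-title"><span>Socket</span></dt>
      <dd class="product-specs--item-spec">1150</dd>
    </dl>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ">" requires the dl to be a direct child of div.product-specs.
child_only = [
    dt.get_text(strip=True)
    for dt in soup.select(
        "div.product-specs > dl.product-specs--list > dt.product-specs--item-title"
    )
]

# A space allows the dl to sit at any depth below div.product-specs.
any_depth = [
    dt.get_text(strip=True)
    for dt in soup.select(
        "div.product-specs dl.product-specs--list > dt.product-specs--item-title"
    )
]

print(child_only)  # ['Serie']
print(any_depth)   # ['Serie', 'Socket']
```

The nested list is skipped by the first selector because its `dl` is a grandchild, not a child, of `div.product-specs`.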
If you need to get all of the product specs and avoid the duplicates, go one level down to the span with class="product-specs--help-title":
for spec in soup.select('div.product-specs dl.product-specs--list > dt.product-specs--item-title span.product-specs--help-title'):
    print(spec.get_text(strip=True))
Prints:
Serie
Socket
Codenaam
Threads
Turbo Frequency
Multiplier unlocked
Cache
Geheugencontroller
Productieproces
Stroomverbruik maximaal
Kloksnelheid
Processorkernen
Type GPU
Koeler meegeleverd
Here is how to get the name: spec value pairs:
from urllib.request import urlopen  # Python 3; the original answer used urllib2 (Python 2)
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen('http://www.processorstore.nl/product/476816/category-212194/intel-core-i7-4790k.html'), 'html.parser')
for spec in soup.select('div.product-specs dl.product-specs--list > dt.product-specs--item-title'):
    # The first <span> inside the <dt> holds the spec name
    name = spec.span
    if not name:
        continue
    # The matching value is the next <dd> sibling
    value = spec.find_next_sibling('dd', class_='product-specs--item-spec')
    print(name.get_text(strip=True), value.get_text(strip=True))
Prints:
Serie Core i7
Socket 1150
Codenaam Haswell Refresh
Threads 8
Turbo Frequency 4400 MHz
Multiplier unlocked Ja
Cache 8 MB
Geheugencontroller DDR3-1600
Productieproces 22 nm
Stroomverbruik maximaal 88 watt
Kloksnelheid 4000 MHz
Processorkernen Quad-core
Type GPU Intel HD Graphics 4600
Koeler meegeleverd Ja
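Since the question asked about decompose() specifically, here is a minimal sketch of that route as an alternative to the narrower selectors: remove each help link from the tree first, then read the remaining text of every `dt`. The HTML is an inline fragment modeled on the example in the question, with the long `title` attribute left out; the second entry's value ("Intel") is made up for illustration. This is one possible approach under those assumptions, not the only way to do it.

```python
from bs4 import BeautifulSoup

# Inline fragment modeled on the question's example; the "Merk" value is hypothetical.
html = """
<dl class="product-specs--list">
  <dt class="product-specs--item-title">
    <a class="product-specs--help-icon js-tooltip" href="#spec_Serie"><i class="icon icon-circle-questionmark"></i><span class="product-specs--help-title">Serie</span></a>
    <span>Serie</span>
  </dt>
  <dd class="product-specs--item-spec">Core i7</dd>
  <dt class="product-specs--item-title">
    <span>Merk</span>
  </dt>
  <dd class="product-specs--item-spec">Intel</dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")

# decompose() removes the tag and everything nested inside it from the tree.
for link in soup.select("dt.product-specs--item-title a.product-specs--help-icon"):
    link.decompose()

# With the help links gone, each <dt> contains its name only once.
specs = {}
for dt in soup.select("dt.product-specs--item-title"):
    dd = dt.find_next_sibling("dd", class_="product-specs--item-spec")
    if dd is None:
        continue  # skip titles that have no matching value
    specs[dt.get_text(strip=True)] = dd.get_text(strip=True)

print(specs)  # {'Serie': 'Core i7', 'Merk': 'Intel'}
```

Note that decompose() modifies the parsed tree in place, so the help links are no longer available afterwards; if they are still needed, selecting the specific span as shown earlier leaves the tree untouched.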