Scraping multiple tag values with bs4 and other libraries

Time: 2020-09-20 23:06:19

Tags: python html beautifulsoup screen-scraping

I am trying to scrape the following HTML:

<select id="sizeShoe" name="attributes[&#39;size&#39;]" class="selectFld col-xs-12">
<option value="">Select Size</option>
<option value="025">2.5</option>
<option value="035">3.5</option>
<option value="040">4</option>
<option value="045">4.5</option>
<option value="050">5</option>
<option value="055">5.5</option>
<option value="060">6</option>
<option value="065">6.5</option>
<option value="070">7</option>
<option value="075">7.5</option>
<option value="080">8</option>
<option value="085" selected="selected">8.5</option>
<option value="090">9</option>
</select>

I need to create a dictionary that maps each displayed size to its value attribute:

argument = {"2.5": "025", "3.5": "035", "4": "040", ...}

My attempt:

soup = BeautifulSoup(response.text, "lxml")
soup.prettify()

argument = {}
sizeShoe = soup.find("select", attrs={'id' : 'sizeShoe'})
for a in sizeShoe:
   valor = sizeShoe.get("value")

But valor ends up being None.

How can I scrape this data and save it as a dictionary? Also, is there a library faster than BeautifulSoup?

3 answers:

Answer 0: (score: 1)

Is there a library faster than BeautifulSoup?

Check out Scrapy. See "Difference between BeautifulSoup and Scrapy crawler?"
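As a point of comparison on the library question, the same size-to-value mapping can also be built without any third-party package, using html.parser from Python's standard library. This is neither Scrapy nor a benchmark: it is a minimal sketch, OptionCollector is a hypothetical name introduced here, the HTML is a shortened copy of the question's snippet, and whether it is actually faster on a given page would have to be measured.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collects {visible text: value attribute} for every <option> tag."""

    def __init__(self):
        super().__init__()
        self.options = {}
        self._value = None  # value of the <option> currently open, if any
        self._text = []     # text chunks seen inside that <option>

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            # attrs is a list of (name, value) pairs
            self._value = dict(attrs).get("value")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <option>
        if self._value is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "option":
            if self._value:  # skip the empty "Select Size" placeholder
                self.options["".join(self._text).strip()] = self._value
            self._value = None


# Shortened copy of the <select> from the question
html = """
<select id="sizeShoe" name="attributes[&#39;size&#39;]" class="selectFld col-xs-12">
<option value="">Select Size</option>
<option value="025">2.5</option>
<option value="035">3.5</option>
<option value="085" selected="selected">8.5</option>
</select>
"""

collector = OptionCollector()
collector.feed(html)
print(collector.options)
```

The sketch collects every option on the page; a page with several select elements would need an extra check on the enclosing select's id.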


Try the following code to scrape the data into a dictionary:

from bs4 import BeautifulSoup, NavigableString

html = '''YOUR ABOVE CODE SNIPPET'''

soup = BeautifulSoup(html, 'lxml')

shoe_size = soup.select_one('#sizeShoe')

# Check that 'tag' is not an instance of 'NavigableString'
# Check that the value of 'value' is not an empty string

argument = {
    tag.text: tag['value']
    for tag in shoe_size
    if not isinstance(tag, NavigableString) and tag['value']
}

print(argument)

Output:

{'2.5': '025', '3.5': '035', '4': '040', '4.5': '045', '5': '050', '5.5': '055', '6': '060', '6.5': '065', '7': '070', '7.5': '075', '8': '080', '8.5': '085', '9': '090'}

Answer 1: (score: 0)

Found the code here:

import numpy as np

arr = np.array([9, 8, 7, 8, 9])

_, i = np.unique(arr, return_index=True)  # get the indexes of the first occurrence of each unique value
groups = arr[np.sort(i)]  # sort the indexes and retrieve the values from the array so that they are in the array order
m = {value:ngroup for ngroup, value in enumerate(groups)}  # create a mapping of value:groupnumber
np.vectorize(m.get)(arr)  # use vectorize to create a new array using m

array([0, 1, 2, 1, 0])

result_dict:

{'2.5':'025', '3.5':'035', '4':'040', '4.5':'045', '5':'050', '5.5':'055', '6':'060', '6.5':'065', '7':'070', '7.5':'075', '8':'080', '8.5':'085', '9':'090'}

Answer 2: (score: -1)

You have to use soup.find_all() instead of soup.find(). bs4 is the best.
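One way to read this suggestion is to keep find() for locating the select element and call find_all("option") on it, then read the value attribute from each option rather than from the select itself (the select has no value attribute, which is why the question's attempt returns None). The following is a minimal sketch under those assumptions; the HTML is a shortened copy of the question's snippet, and the built-in "html.parser" is used in place of "lxml" only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened copy of the <select> from the question
html = """
<select id="sizeShoe" name="attributes[&#39;size&#39;]" class="selectFld col-xs-12">
<option value="">Select Size</option>
<option value="025">2.5</option>
<option value="035">3.5</option>
<option value="085" selected="selected">8.5</option>
</select>
"""

# "html.parser" ships with Python; "lxml" also works if it is installed
soup = BeautifulSoup(html, "html.parser")
select = soup.find("select", attrs={"id": "sizeShoe"})

argument = {}
# find_all("option") yields only <option> tags, so the whitespace
# text nodes between them are never visited
for option in select.find_all("option"):
    value = option.get("value")  # attribute of the option, not of the select
    if value:  # skip the empty "Select Size" placeholder
        argument[option.get_text(strip=True)] = value

print(argument)
```

With the shortened HTML above this prints {'2.5': '025', '3.5': '035', '8.5': '085'}; run against the full select from the question, the same loop would cover all of its sizes.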