Question

One of the last steps of my project is to get the price of a product. I can get everything I need except the price.

Source:

<div class="prices">
<div class="price">
    <div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
    <div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>

What I need is the value that comes after ==">

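I don't know whether the encoded part is some kind of protection, but the closest result I get is this: <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div>

I don't know if it is relevant, but I am parsing with "html.parser".

PS. I am not trying to hack anything; this is just a personal project to help me learn.

Edit: if the price is not there when I test the parsing, is there another way to get it without using a different parser?

EDIT2: Here is my code: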

from bs4 import BeautifulSoup as soup  # assumed alias; the original post does not show the import

page_soup = soup(pagehtml, "html.parser")
pricebox = page_soup.findAll("div",{ "id":"stationList"})
links = pricebox[0].findAll("a",)
det = links[0].findAll("div",)

det[7].text
#or 
det[7].get_text()

The result is ''

Answer 1

Using regular expressions

I guess there are several ways to do this with BeautifulSoup; anyway, here is an approach that uses a regex

import regex  # third-party module (pip install regex); the built-in re rejects this variable-width lookbehind

# Assume 'source_code' is the source code posted in the question
prices = regex.findall(r'(?<=data\-price[\=\"\w]+\>)[\d\.]+(?=\<\/div)', source_code)
# ['151.4', '184.4']
# or
[float(p) for p in prices]
# [151.4, 184.4]

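Here is a short explanation of the regular expression:

[\d\.]+ is what we are actually searching for: \d means a digit, \. means a period, and putting the two in square brackets followed by + means we want at least one digit/period
The parenthesized groups before and after it further specify what must come before and after a potential match
(?<=data\-price[\=\"\w]+\>) means the match must be preceded by data-price...>, where ... is at least one of the characters A-z, 0-9, _, = or "
Finally, (?=\<\/div) means that any match must be followed by </div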
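If installing the third-party regex module is not an option, the same idea can be sketched with the standard-library re module. Because re only accepts fixed-width lookbehinds, this variant swaps the lookarounds for a capturing group; the pattern and the variable names are illustrative and not part of the original answer.

```python
import re

# The HTML from the question, used here as sample input
source_code = '''
<div class="price">
    <div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
    <div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''

# data-price="...">  must come first, then the digits/periods are captured,
# then </div must follow. findall returns only the captured group.
pattern = re.compile(r'data-price="[^"]*">([\d.]+)</div')

prices = pattern.findall(source_code)
prices_as_float = [float(p) for p in prices]
print(prices, prices_as_float)
```

Like the lookaround version, this only works when the price text is actually present in the HTML that was downloaded.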

Using lxml

Here is a way to do it with the lxml module

import lxml.html

tree = lxml.html.fromstring(source_code)
[float(p.text_content()) for p in tree.find_class('encoded')]
# [151.4, 184.4]

Answer 2

"html.parser" handles your problem just fine. Since you can already get this <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div> yourself, all you need now is the price, so you can use get_text(), a method built into BeautifulSoup.

This function returns whatever text lies between the tags.

Syntax of get_text(): tag_name.get_text()
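A minimal sketch of what that call returns, assuming two hand-written snippets: one where the div contains the price and one where it is empty, as in the result quoted in the question. The helper name find_encoded is hypothetical. It suggests that an empty string from get_text() means the price text is simply missing from the HTML being parsed, while the data-price attribute can still be read.

```python
from bs4 import BeautifulSoup

# Hypothetical inputs: the same tag with and without the price text
filled = '<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>'
empty = '<div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div>'

def find_encoded(html):
    """Return the first div with class 'encoded' from the given HTML."""
    return BeautifulSoup(html, "html.parser").find("div", class_="encoded")

filled_text = find_encoded(filled).get_text(strip=True)  # text between the tags
empty_text = find_encoded(empty).get_text(strip=True)    # nothing between the tags
attr_value = find_encoded(empty)["data-price"]           # attribute is still available

print(repr(filled_text), repr(empty_text), attr_value)
```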

Solution to your problem:

from bs4 import BeautifulSoup

data ='''
<div class="prices">
<div class="price">
    <div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
    <div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''

soup = BeautifulSoup(data,"html.parser")

# Searching for all the div tags with class:encoded
a = soup.find_all('div', {'class': 'encoded'})

# Using list comprehension to get the price out of the tags
prices = [price.get_text() for price in a]
print(prices)

Output

['151.4', '184.4']
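If the product names are needed alongside the prices, one possible extension is to walk each div with class price and pair its span text with the text of its encoded div. This is only a sketch on the sample HTML from the question (with the closing tag of the outer div added); the name price_by_product is illustrative.

```python
from bs4 import BeautifulSoup

data = '''
<div class="prices">
<div class="price">
    <div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
    <div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
</div>
'''

soup = BeautifulSoup(data, "html.parser")

# Map each product name to its price as a float
price_by_product = {}
for box in soup.find_all("div", class_="price"):
    name = box.find("span").get_text(strip=True)
    value = box.find("div", class_="encoded").get_text(strip=True)
    price_by_product[name] = float(value)

print(price_by_product)
```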

Hope you got what you wanted. :)

Is it possible to get the text when scraping returns HTML with an "encoded" part?
