Question

I am scraping data from a website, and I ran into the following markup:

code = "<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372 
            <span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li> "

I only need to extract "₹ 7,372".

I have tried the following. 1. code.text, but the result is

'\n\n₹ 7,372\xa0\r\n            \n–\n\n'

2. code.text.strip(), but the result is

'₹ 7,372\xa0\r\n            \n–'

Is there any way to do this? Please let me know so that I can finish my project.

Answer 1

OK, I managed to clean up the data you need. This approach is a bit ugly, but it works =)

from bs4 import BeautifulSoup as BS

html= """<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372 
            <span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li> """

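# Name the parser explicitly so bs4 does not have to guess (and warn).
soup = BS(html, 'html.parser')

li = soup.find('li').text

# Each strip(i) call only removes leading/trailing runs of the single
# character i, so several passes are needed to peel off the layers.
for j in range(3):
    for i in ['\n',' ', '–', '\xa0', '\r','\x20','\x0a','\x09','\x0c','\x0d']:
        li = li.strip(i)

print(li)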

Output:

₹ 7,372

In the loop's list I included all the ASCII whitespace characters (as far as I know) plus the symbols that showed up in your output.

The loop runs 3 times because the value is not fully cleaned on the first pass; you can check it after each iteration in the variable explorer.
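The repeated passes are needed because each call strips only one kind of character from the ends. As a possible simplification (a sketch, not part of the original answer): `str.strip` accepts a string containing several characters and removes any of them from both ends, so one call should be enough. The `raw` value is the `.text` output copied from the question; the name `JUNK` is made up for this example.

```python
# Text as returned by `.text` in the question (copied from the post).
raw = '\n\n₹ 7,372\xa0\r\n            \n–\n\n'

# Every character listed here is stripped from both ends; order is irrelevant.
# Assumption: the price itself never starts or ends with one of these.
JUNK = ' \n\r\t\x0c\xa0–'

price = raw.strip(JUNK)
print(repr(price))
```

Stripping stops at the first character on each side that is not in `JUNK`, so the space inside the price is left alone.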

Also, you could try to figure out exactly which characters produce all those pseudo-spaces between the <span> tags.
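One way to identify those characters, sketched with the standard library only: list every distinct non-alphanumeric character in the text together with its code point and Unicode name. The helper name `describe` is hypothetical, and `raw` is again the `.text` output copied from the question.

```python
import unicodedata

raw = '\n\n₹ 7,372\xa0\r\n            \n–\n\n'

def describe(text):
    """Return (repr, code point, name) for each distinct non-alphanumeric character."""
    seen = []
    for ch in text:
        if ch.isalnum() or ch in seen:
            continue
        seen.append(ch)
    # Control characters such as '\n' have no Unicode name, hence the fallback.
    return [(repr(ch), f'U+{ord(ch):04X}', unicodedata.name(ch, 'unnamed control character'))
            for ch in seen]

for row in describe(raw):
    print(*row)
```

This makes the non-breaking space (`\xa0`) and the en dash visible, which look like an ordinary space and a hyphen when printed.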

Answer 2

from bs4 import BeautifulSoup as bs
code = '''<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372 
            <span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>'''
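soup = bs(code, 'html.parser')
w = soup.find_all('li')
l = []
for item in w:
    l.append(item)
# Turn the list of tags back into markup text and split it into lines.
words = str(l)
t = words.split('\n')
# The third line starts with '</span>'; skip those 7 characters.
print(t[2][7:])

Output:

₹ 7,372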
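Slicing with fixed positions such as `t[2][7:]` depends on the exact layout of the markup. A sketch of an alternative, assuming the price is always the text node that directly follows the `price-current-label` span: take that span's `next_sibling` and strip it. The `\xa0` escape in the sample stands for the non-breaking space seen in the question's output.

```python
from bs4 import BeautifulSoup

html = '''<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372\xa0
            <span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>'''

soup = BeautifulSoup(html, 'html.parser')

# Locate the empty label span; the price is the bare text right after it.
label = soup.find('span', class_='price-current-label')

# next_sibling is a text node here; strip() with no arguments also
# removes the non-breaking space, since Python treats it as whitespace.
price = label.next_sibling.strip()
print(price)
```

Because the dash lives inside the second span, it never enters this text node, so no custom character list is needed.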
