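I need to extract some text from a web page, but the anchor tag that holds the text is nested inside several child classes.
I'm new to web scraping, so apologies if this has already been answered. I need to scrape text from this page (https://www.astm.org/search/fullsite-search.html?query=alloy&toplevel=products-and-services&sublevel=standards-and-publications). I have tried parsing the page with bs4, but after creating the soup object I can't get the tags for the individual results.
Here is what I tried with requests and bs4: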
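import requests
from bs4 import BeautifulSoup
# url holds the search page address given above
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)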
At the place where I expected to find the text I need, the output contains only these tags:
<div class="span8 main searchresults">
<div id="results"></div>
I expected to see tags inside the results div, which should look something like this:
<div id="results">
  <div class="res">
    <div class="resTable">
      <h4 class="resTitle">
        <a...
I need to get the title text from each result. For example, the first one would be
"ASTM A506-16 Standard Specification for Alloy and Structural Alloy Steel, Sheet and Strip, Hot-Rolled and Cold-Rolled"
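The problem is that when I try to reference any of these tags in bs4, nothing is returned. How can I traverse these classes to get the text inside the tag?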
Answer 0 (score: 1)
Your data appears to be embedded in the HTML page as JSON. BeautifulSoup can't help you here, but you can use the re module to extract it (and json to parse it):
import re
import json
import requests
from pprint import pprint

url = 'https://www.astm.org/search/fullsite-search.html?query=alloy&toplevel=products-and-services&sublevel=standards-and-publications'
data = json.loads(re.findall(r'var mc_results = ({.*?})\s*;', requests.get(url, verify=False).text, flags=re.DOTALL)[0])

for s in data['resSet']:
    for result in s['results']:
        pprint(result['res']['meta'])
        print('*' * 80)
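The same extraction can be tried offline against a small stand-in for the page source, which makes it easier to see what the regular expression captures. This is only a sketch: SAMPLE_HTML, the extract_meta helper, and the 'title' key inside 'meta' are assumptions made for illustration, and the real page's JSON may be laid out differently.

```python
import json
import re

# Hypothetical stand-in for the page source; the real JSON layout may differ.
SAMPLE_HTML = """
<html><body>
<div id="results"></div>
<script>
var mc_results = {"resSet": [{"results": [
    {"res": {"meta": {"title": "ASTM A506-16 Standard Specification"}}},
    {"res": {"meta": {"title": "Another result"}}}
]}]};
</script>
</body></html>
"""


def extract_meta(html):
    """Return the 'meta' dicts from the embedded mc_results object, or [] if absent."""
    match = re.search(r'var mc_results = ({.*?})\s*;', html, flags=re.DOTALL)
    if match is None:
        return []
    data = json.loads(match.group(1))
    return [result['res']['meta']
            for s in data['resSet']
            for result in s['results']]


metas = extract_meta(SAMPLE_HTML)
for meta in metas:
    # 'title' is an assumed key name, used here only for the sample data
    print(meta.get('title'))
```

Using re.search with a None check avoids the IndexError that indexing [0] on an empty re.findall result would raise if the variable is missing from the page.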
Answer 1 (score: 0)
This is what I do to drill down into different classes:
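from bs4 import BeautifulSoup
# data is assumed to be the requests response for the page being scraped
soup = BeautifulSoup(data.text, 'html.parser')
# collect every <article class="featured"> element
FeaturedArticles = soup.findAll('article', {'class': 'featured'})
print(FeaturedArticles)
for Articles in FeaturedArticles:
    # text of the first <a> inside each article
    title = Articles.a.text
    print(title)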
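Applied to the nesting sketched in the question, the same idea might look like the following. It is a sketch on hypothetical static markup (SAMPLE_HTML is made up to mirror the expected structure) and it only works when the result markup is actually present in the fetched HTML, which the empty results div in the question suggests is not the case for that page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure sketched in the question.
SAMPLE_HTML = """
<div id="results">
  <div class="res">
    <div class="resTable">
      <h4 class="resTitle"><a href="#">ASTM A506-16 Standard Specification</a></h4>
    </div>
  </div>
  <div class="res">
    <div class="resTable">
      <h4 class="resTitle"><a href="#">Second result</a></h4>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# A CSS selector walks the nested classes in one step:
# results div -> each result -> title heading -> anchor.
titles = [a.get_text(strip=True)
          for a in soup.select('div#results div.res h4.resTitle a')]
print(titles)
```

A single selector keeps the traversal readable compared with chaining find calls for every level, and it returns an empty list rather than raising when a level is missing.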