Navigating HTML tags to extract text from an anchor tag

Time: 2019-06-19 16:53:35

Tags: python web-scraping

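I need to extract certain text from a web page, but the anchor tag that holds the text is nested several classes deep.

Apologies if this has already been answered; I am new to web scraping. I need to scrape text from this page (https://www.astm.org/search/fullsite-search.html?query=alloy&toplevel=products-and-services&sublevel=standards-and-publications). I have tried parsing the page with bs4, but after creating the soup object I cannot get the tags from each individual result.

Here is what I tried with requests and bs4: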

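    import requests
    from bs4 import BeautifulSoup
    url = 'https://www.astm.org/search/fullsite-search.html?query=alloy&toplevel=products-and-services&sublevel=standards-and-publications'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    print(soup)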

and this is the output at the spot where the text I need should be:

    <div class ="span8 main searchresults">
    <div id="results"></div>

I expected to see the tags inside the results div, which should look like:

    <div id="results">
    <div class="res">
    <div class="resTable">
    <h4 class="resTitle">
    <a...

I need to get the title text from each result; for example, the first one would be

    "ASTM A506-16 Standard Specification for Alloy and Structural Alloy Steel, Sheet and Strip, Hot-Rolled and Cold-Rolled"

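The problem is that when I try to reference any of these tags in bs4, nothing is returned. How can I traverse these classes to get the text inside the tag?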

2 Answers:

Answer 0 (score: 1)

Your data appears to be encoded as JSON inside the HTML page (BeautifulSoup cannot help you here, but you can use the re module to extract the data):

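    import re
    import json
    import requests
    from pprint import pprint

    url = 'https://www.astm.org/search/fullsite-search.html?query=alloy&toplevel=products-and-services&sublevel=standards-and-publications'

    # verify=False skips TLS certificate verification, as in the original answer
    html = requests.get(url, verify=False).text

    # The results are stored in a JavaScript variable: var mc_results = {...};
    data = json.loads(re.findall(r'var mc_results = ({.*?})\s*;', html, flags=re.DOTALL)[0])

    # Print the metadata of every result in every result set
    for s in data['resSet']:
        for result in s['results']:
            pprint(result['res']['meta'])
            print('*' * 80)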
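To see the regex-and-JSON approach from this answer without hitting the network, here is a minimal offline sketch. The sample HTML is a made-up stand-in for the real page, and the 'title' key inside 'meta' is an assumption for illustration; the real metadata may use different field names, so inspect the pprint output first.

```python
import json
import re

# Hypothetical stand-in for the downloaded page: the #results div is empty,
# and the data sits in a JavaScript variable inside a script tag.
SAMPLE_HTML = """
<div id="results"></div>
<script>
var mc_results = {"resSet": [{"results": [
    {"res": {"meta": {"title": "ASTM A506-16 Standard Specification for Alloy and Structural Alloy Steel, Sheet and Strip, Hot-Rolled and Cold-Rolled"}}},
    {"res": {"meta": {"title": "Second example title"}}}
]}]}  ;
</script>
"""


def extract_titles(html):
    """Return the (assumed) 'title' field of every embedded result."""
    # Capture the object literal assigned to mc_results, up to the closing "};"
    match = re.search(r'var mc_results = (\{.*?\})\s*;', html, flags=re.DOTALL)
    if match is None:
        return []  # the variable is not present in this page
    data = json.loads(match.group(1))
    return [
        result['res']['meta'].get('title')  # 'title' is a hypothetical key
        for result_set in data['resSet']
        for result in result_set['results']
    ]


titles = extract_titles(SAMPLE_HTML)
for title in titles:
    print(title)
```

Returning an empty list when the variable is missing avoids the IndexError that indexing re.findall(...)[0] would raise if the page layout changes.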

Answer 1 (score: 0)

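This is what I do to drill down through the different classes.

Load the page into BeautifulSoup (data is assumed to be the response returned by requests.get):

    soup = BeautifulSoup(data.text, 'html.parser')

Find the header banner in the HTML and parse it: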

    featured_articles = soup.find_all('article', {'class': 'featured'})

    print(featured_articles)

    # Print the text of the first anchor tag inside each featured article
    for article in featured_articles:
        title = article.a.text
        print(title)
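Applied to the structure described in the question, the same idea might look like the sketch below. It only works on HTML that already contains the result tags (for example, markup saved from a rendered page); the sample markup here is hypothetical and simply mirrors the nesting shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the nesting described in the question
SAMPLE_HTML = """
<div id="results">
  <div class="res">
    <div class="resTable">
      <h4 class="resTitle"><a href="#">ASTM A506-16 Standard Specification</a></h4>
    </div>
  </div>
  <div class="res">
    <div class="resTable">
      <h4 class="resTitle"><a href="#">Second example title</a></h4>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# A CSS selector walks the nesting in one step: any anchor inside an
# h4.resTitle that sits somewhere under the element with id="results"
titles = [a.get_text(strip=True) for a in soup.select('div#results h4.resTitle a')]
print(titles)
```

If the selector returns an empty list on the live page, the tags are most likely not in the downloaded HTML at all, which matches the empty results div shown in the question.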