I need help extracting specific text from a web page

Time: 2019-01-20 03:55:34

Tags: python web-scraping beautifulsoup python-requests

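I'm trying to assign the number 11101973 from this HTML file to a variable, and I need a way to get just that number without any of the surrounding information: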

Sub ListFiles()
    Dim FSO As Object
    Dim FSO_Folder As Object
    Dim myPath$
    Dim Obj
    Dim Str$
    Dim k1 As Long

    myPath$ = "C:\Users\jim\Desktop\UIAutomation_VBA-master"
    Set FSO = CreateObject("Scripting.FileSystemObject")
    Set FSO_Folder = FSO.GetFolder(myPath)
    For Each Obj In FSO_Folder.Files
            Str$ = Obj.Path
    Next Obj
End Sub

Sub ReadFiles()
    Dim FSO As Object
    Dim FSO_Folder As Object
    Dim myPath$
    Dim Obj
    Dim Str$
    Dim k1 As Long

    myPath$ = "C:\Users\jim\Desktop\UIAutomation_VBA-master"
    Set FSO = CreateObject("Scripting.FileSystemObject")
    Set FSO_Folder = FSO.GetFolder(myPath)

    Do
        k1 = 0
        For Each Obj In FSO_Folder.Files
            k1 = k1 + AccessRight(Obj.Path)
        Next Obj
        DoEvents
    Loop Until k1 = FSO_Folder.Files.Count
End Sub

Function AccessRight(ByVal FilePath As String) As Long
    On Error GoTo The_end

    AccessRight = 0
    Open FilePath For Binary Lock Read Write As #1
    Close #1
    AccessRight = 1

The_end:
End Function

If more information is needed, the page source is here: view-source:https://www.kickz.com/uk/jordan-basketball-retro-air-jordan-1-retro-high-og-black_varsity_red_sail_university_blue-107840036 Any help is appreciated!

2 answers:

Answer 0: (score: 2)

BeautifulSoup is for parsing HTML elements, not JavaScript variables. There are a few JavaScript parsers out there, but for simple tasks I prefer regex.

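import re
import requests

# URL of the product page from the question
url = 'https://www.kickz.com/uk/jordan-basketball-retro-air-jordan-1-retro-high-og-black_varsity_red_sail_university_blue-107840036'

page = requests.get(url).text
# Capture the digits passed to collectAskInput(...)
theNumber = re.search(r'collectAskInput\((\d+)', page).group(1)
print(theNumber)
# 11101973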

This searches for the number in:

onclick="return ProductDetails.collectAskInput(11101973)
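The regex approach above can be wrapped in a small helper that returns None instead of raising when the pattern is missing. This is a minimal sketch, not the answerer's code: the function name `extract_ask_input_id` and the `<a>` tag in the sample string are hypothetical, and only the `onclick` fragment comes from the page.

```python
import re
from typing import Optional

# Digits passed to collectAskInput(...), as seen in the onclick attribute.
ASK_INPUT_RE = re.compile(r"collectAskInput\((\d+)\)")


def extract_ask_input_id(html: str) -> Optional[int]:
    """Return the collectAskInput argument as an int, or None if absent."""
    match = ASK_INPUT_RE.search(html)
    return int(match.group(1)) if match else None


# Hypothetical markup built around the fragment quoted in the answer.
sample_html = '<a onclick="return ProductDetails.collectAskInput(11101973)">Ask</a>'
print(extract_ask_input_id(sample_html))
```

To use it on the live page, pass `requests.get(url).text` as the `html` argument and check for None before using the result.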

Answer 1: (score: 0)

The value sits inside a script tag in the page source, and you can pull out the dictionary-like string that contains it.

import requests
import bs4
import json

url = 'https://www.kickz.com/uk/jordan-basketball-retro-air-jordan-1-retro-high-og-black_varsity_red_sail_university_blue-107840036'

response = requests.get(url)

soup = bs4.BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script')

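jsonObj = None
for script in scripts:
    # Look for the script that calls ga('ec:addProduct', {...})
    if 'ec:addProduct' in script.text:
        jsonStr = script.text

        # Keep the text after the call's opening, then cut at a closing ");".
        # The [-4] index depends on this page's exact script layout.
        jsonStr = jsonStr.split("ga('ec:addProduct',")[1]
        jsonStr = jsonStr.split(");")[-4]
        # JSON requires double-quoted strings
        jsonStr = jsonStr.replace("'", '"')

        jsonObj = json.loads(jsonStr)

if jsonObj is None:
    raise ValueError("no ec:addProduct script found on the page")

id_var = jsonObj['id']
print(id_var)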

Output:

107840036
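The split-based slicing in this answer depends on how many `);` sequences follow the call, so a regex that captures only the object literal may be less brittle. The following is a sketch under assumptions: `extract_add_product` is a hypothetical helper, the sample script text is invented for illustration (the real page's script was not shown), and it only handles flat objects whose values contain no apostrophes or braces.

```python
import json
import re
from typing import Optional

# Capture the {...} argument of ga('ec:addProduct', {...}); flat objects only.
ADD_PRODUCT_RE = re.compile(r"ga\('ec:addProduct',\s*(\{.*?\})\s*\)", re.DOTALL)


def extract_add_product(script_text: str) -> Optional[dict]:
    """Return the ec:addProduct argument as a dict, or None if not found."""
    match = ADD_PRODUCT_RE.search(script_text)
    if match is None:
        return None
    # JSON needs double quotes; this breaks if a value contains an apostrophe.
    return json.loads(match.group(1).replace("'", '"'))


# Hypothetical script body; the field names other than 'id' are assumptions.
sample_script = """
ga('require', 'ec');
ga('ec:addProduct', {'id': '107840036', 'name': 'Air Jordan 1 Retro High OG'});
ga('send', 'pageview');
"""

product = extract_add_product(sample_script)
print(product['id'])
```

With BeautifulSoup, the same helper could be called on each `script.text` from `soup.find_all('script')` until it returns something other than None.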