Extracting content inside <script type="text/javascript"> and $(function() {

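Time: 2019-05-29 14:58:50

Tags: javascript python-3.x web-scraping beautifulsoup

I am scraping data from a website. I am able to extract the content inside the <script> tag, but it contains '$(function(){' and I want to extract the content within that.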

import urllib.request
from bs4 import BeautifulSoup
import json
url = 'https://www.broadwayinbound.com/shows/'
response = urllib.request.urlopen(url)
data = response.read()      # a `bytes` object
soup = BeautifulSoup(data)
results = soup.findAll('script', {'type': 'text/javascript'})
r = []
for result in results:
    if 'var shows = [' in result.text:
        r.append(result.text)
print(r[0])
 

I want to extract only the content of 'var shows'.

  {"Id":"12680","ClientClassCode":"default","ShowName":"Ain't Too Proud - The Life and Times of The Temptations","ShowCode":"AINTPROUD","SortName":"Ain't Too Proud - The Life and Times of The Temptations","ShowLogo":"/product-resources/Aint-Too-Proud-Temptations-Musical-Broadway-Group-Sales-Show-Tickets-500-102318.jpg","ShowLogoText":"Ain't Too Proud - The Life and Times of The Temptations Tickets | Broadway...
 

2 answers:

Answer 0 (score: 1)

Assuming the rest of your code works fine, a simple regular expression will do the trick :)

import urllib.request
import re
import json
from bs4 import BeautifulSoup

url = 'https://www.broadwayinbound.com/shows/'
response = urllib.request.urlopen(url)
data = response.read()      # a `bytes` object
soup = BeautifulSoup(data, 'html.parser')   # name the parser explicitly to avoid bs4's warning
results = soup.find_all('script', {'type': 'text/javascript'})
r = []
for result in results:
    if 'var shows = [' in result.text:
        # Capture the array literal assigned to `shows`; `.` does not match
        # newlines, so this expects the array to sit on a single line
        x = re.findall(r"var shows = (\[.*\])", result.text)
        if x:
            r.append(x[0])

shows = json.loads(r[0])    # parse the captured text as JSON
print(shows)
print(shows[0]["Id"])
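
The regex idea can be tried offline on a small stand-in string. This is only a sketch: SAMPLE_SCRIPT and extract_shows are hypothetical names, and the sample layout (array split across lines, a second array after it) is an assumption rather than what the real page contains. It swaps in a non-greedy pattern with re.DOTALL that stops at the first "];", so the match can span lines and does not run on into a later array.

```python
import json
import re

# Hypothetical stand-in for the script text; the real page may look different.
SAMPLE_SCRIPT = """
$(function () {
    var shows = [{"Id": "1", "ShowName": "Show A", "Weekdays": ["fr", "mo"]},
                 {"Id": "2", "ShowName": "Show B", "Weekdays": []}];
    var other = [1, 2, 3];
});
"""

def extract_shows(script_text):
    """Return the list assigned to `var shows`, or [] if it is not found."""
    # Non-greedy match that ends at the first "];" after the opening bracket.
    # re.DOTALL lets "." cross line breaks.
    match = re.search(r"var shows = (\[.*?\]);", script_text, re.DOTALL)
    if match is None:
        return []
    return json.loads(match.group(1))

shows = extract_shows(SAMPLE_SCRIPT)
print(shows[0]["Id"])
```

One caveat: this pattern would stop early if a string value inside the array happened to contain "];", so it is worth checking against the actual page text.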

Answer 1 (score: 0)

You will have to manipulate the string. In essence, the script gives you a list of JSON objects:

import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.broadwayinbound.com/shows/'
response = requests.get(url)
data = response.text     # a `str` (decoded text), not bytes
soup = BeautifulSoup(data, 'html.parser')
results = soup.find_all('script', {'type': 'text/javascript'})
r = []


for result in results:
    if 'var shows = [' in result.text:
        jsonStr = result.text

        # Keep only the text between 'var shows = [' and the final '];'
        jsonStr = jsonStr.split('var shows = [')[1]
        jsonStr = jsonStr.rsplit('];', 1)[0]

        # One chunk per show; each chunk has lost its leading '{"Id":'
        jsonStr_list = jsonStr.split('{"Id":')[1:]

        for each in jsonStr_list:
            # Drop the comma (and any whitespace) that separated the objects
            each = each.strip().rstrip(',')

            # Restore the prefix removed by split() and parse the object
            jsonTemp = '{"Id":' + each
            jsonObj = json.loads(jsonTemp)

            r.append(jsonObj)

Output:

print (r)
[{'Id': '12680', 'ClientClassCode': 'default', 'ShowName': "Ain't Too Proud - The Life and Times of The Temptations", 'ShowCode': 'AINTPROUD', 'SortName': "Ain't Too Proud - The Life and Times of The Temptations", 'ShowLogo': '/product-resources/Aint-Too-Proud-Temptations-Musical-Broadway-Group-Sales-Show-Tickets-500-102318.jpg', 'ShowLogoText': "Ain't Too Proud - The Life and Times of The Temptations Tickets | Broadway Inbound", 'ShowPromo': '', 'ShowPromoText': '', 'Description': "<em>Ain't Too Proud</em> is the electrifying new musical that follows The Temptations' extraordinary journey from the streets of Detroit to the Rock & Roll Hall of Fame.<br /><br />Five guys. One dream. And a sound that would make music history. With their signature dance moves and unmistakable harmonies, they rose to the top of the charts creating an amazing 42 Top Ten Hits with 14 reaching number one. The rest is history — how they met, the groundbreaking heights they hit, and how personal and political conflicts threatened to tear the group apart as the United States fell into civil unrest. 
This thrilling story of brotherhood, family, loyalty, and betrayal is set to the beat of the group's treasured hits, including “My Girl,” “Just My Imagination,” “Get Ready,” “Papa Was a Rolling Stone,” and so many more.<br /><br />After breaking house records at Berkeley Rep, The Kennedy Center, and at the Ahmanson Theater, <em>Ain't Too Proud</em>, written by three time Obie Award winner Dominique Morisseau, directed by two-time Tony Award® winner Des McAnuff (<em>Jersey Boys</em>), and featuring choreography by Tony nominee Sergio Trujillo (<em>Jersey Boys</em>, <em>On Your Feet</em>), now brings the untold story of this legendary quintet to irresistible life on Broadway.", 'Category': 'Broadway', 'CategoryCode': 'BW', 'ShowType': 'Musical', 'ShowTypeCode': 'MUSICAL', 'Rating': 'Might not be suitable for younger children', 'RatingCode': 'PT', 'City': 'New York', 'CityCode': 'NYCA', 'FirstPerformance': '2/28/2019', 'NextPerformance': '5/30/2019', 'NextPerformanceTime': '7:00 PM', 'OnSaleThrough': '6/7/2020', 'Weekdays': ['fr', 'mo', 'sa', 'su', 'th', 'tu', 'we'], 'MinPrice': '42.00', 'MaxPrice': '385.90', 'GroupMinimum': '10', 'MaximumTickets': '25', 'VenueName': 'Imperial Theatre', 'Url': '/shows/aint-too-proud-the-life-and-times-of-the-temptations/', 'BroadwayCollectionEN': 'http://www.broadwaycollection.com/shows/https://www.broadwaycollection.com/shows/aint-too-proud/', 'BroadwayCollectionES': 'http://www.broadwaycollection.com/es/shows/https://www.broadwaycollection.com/es/shows/aint-too-proud/', 'BroadwayCollectionDE': 'http://www.broadwaycollection.com/de/shows/https://www.broadwaycollection.com/de/shows/aint-too-proud/', 'BroadwayCollectionJA': 'http://www.broadwaycollection.com/ja/shows/https://www.broadwaycollection.com/ja/shows/aint-too-proud/', 'BroadwayCollectionPT': 'http://www.broadwaycollection.com/pt-br/shows/https://www.broadwaycollection.com/pt-br/shows/aint-too-proud/', 'BroadwayCollectionZH': 
'http://www.broadwaycollection.com/zh-hans/shows/https://www.broadwaycollection.com/zh-hans/shows/aint-too-proud/', 'RunTime': '2 hours and 30 minutes, including intermission', 'ShowLetUsKnow': False}, ...]
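
Since the text after 'var shows = ' is itself a JSON array, another option is to let the json module find where the array ends instead of splitting the string by hand. The sketch below is a different technique from the answer above, not a restatement of it: it uses json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and ignores whatever follows. SAMPLE, MARKER and parse_shows are hypothetical names, and the sample string is made up for illustration.

```python
import json

MARKER = "var shows = "

# Hypothetical stand-in for the script text; note the "];" inside a string value.
SAMPLE = (
    '$(function () { var shows = [{"Id": "1", "ShowName": "A]; B"}, '
    '{"Id": "2", "ShowName": "C"}]; init(shows); });'
)

def parse_shows(script_text):
    """Return the list assigned to `var shows`, or [] if the marker is missing."""
    start = script_text.find(MARKER)
    if start == -1:
        return []
    # raw_decode returns (parsed_value, index_just_past_the_value);
    # the JavaScript that follows the array is simply left unread.
    shows, _end = json.JSONDecoder().raw_decode(script_text, start + len(MARKER))
    return shows

shows = parse_shows(SAMPLE)
print(len(shows), shows[0]["ShowName"])
```

Because the decoder tracks string boundaries itself, brackets or semicolons inside values do not confuse it. On the real page, script_text would be the content of the matching script tag, and the exact spacing around "var shows =" is an assumption to verify.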