Question

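I scraped the HTML of a website (restaurant reviews), and the part I need turns out to be in the form of a dictionary. With the code below I managed to get all the script tags, which share the same tag name, but I don't know how to filter them to keep only the script that contains the reviews, convert it to a dictionary, and finally write it to a CSV file.

This is (most of) the script tag I need to keep:

This is the code I used to download the full HTML of the review pages and store each page in a text file: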

from selenium import webdriver
import codecs
import os


PATH = "C:\\Users\\HCES\\Downloads\\chromedriver.exe"
driver = webdriver.Chrome(PATH)


for i in range(1, 450):
    completeName = os.path.join('C:\\Users\\HCES\\Desktop\\jana\\scraped files', "index{}.txt".format(i))
    driver.get("https://www.zomato.com/beirut/divvy-ashrafieh/reviews?page={}&sort=dd&filter=reviews-dd".format(i))
    # The with block closes (and flushes) each file once the page is saved.
    with codecs.open(completeName, "w", "utf-8") as file_object:
        file_object.write(driver.page_source)
    print("Page {} is written.".format(i))

driver.quit()

This is the code I used to print out only the script tags:

from bs4 import BeautifulSoup

for x in range(1, 2):
    # Read one saved page; the with block closes the file afterwards.
    with open("index{}.txt".format(x), "r", encoding="utf8") as revCode:
        content = revCode.read()
    soup = BeautifulSoup(content, 'lxml')
    for script_tag in soup.find_all('script'):
        print(script_tag.text, script_tag.next_sibling)

Any help would be much appreciated, as I need to get this working.

Answer 1

You can select the script tag by its type attribute and parse its content as JSON with the json library:

import json
...

data = soup.find('script', {"type": "application/ld+json"})
json_data = json.loads(data.string)

You can now access any value by its key.
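
To go from there to a CSV file, something along these lines might work. This is only a sketch: the question does not show the actual script tag, so the sample HTML and the key names (`review`, `author`, `datePublished`, `reviewRating`, `ratingValue`, `description`) are assumptions based on the usual schema.org layout, and `extract_reviews` and `write_csv` are hypothetical helper names. Adjust the keys to whatever your pages really contain.

```python
import csv
import json

from bs4 import BeautifulSoup

# Hypothetical sample page; the real Zomato markup and keys may differ.
SAMPLE_HTML = """
<html><head>
<script type="text/javascript">var x = 1;</script>
<script type="application/ld+json">
{"@type": "Restaurant", "name": "Divvy",
 "review": [
   {"author": "Alice", "datePublished": "2020-01-05",
    "reviewRating": {"ratingValue": 4},
    "description": "Great burgers."},
   {"author": "Bob", "datePublished": "2020-01-07",
    "reviewRating": {"ratingValue": 2},
    "description": "Slow service."}
 ]}
</script>
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
</head><body></body></html>
"""

FIELDS = ["author", "date", "rating", "text"]


def extract_reviews(html):
    """Return one flat dict per review found in the page's JSON-LD scripts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in soup.find_all("script", {"type": "application/ld+json"}):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue  # skip scripts whose content is not valid JSON
        if not isinstance(data, dict) or "review" not in data:
            continue  # keep only the scripts that carry reviews
        reviews = data["review"]
        if isinstance(reviews, dict):
            reviews = [reviews]  # a single review may not be wrapped in a list
        for review in reviews:
            rating = review.get("reviewRating") or {}
            rows.append({
                "author": review.get("author"),
                "date": review.get("datePublished"),
                "rating": rating.get("ratingValue"),
                "text": review.get("description"),
            })
    return rows


def write_csv(rows, path):
    """Write the review rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


rows = extract_reviews(SAMPLE_HTML)
write_csv(rows, "reviews.csv")
```

For the saved pages, you would read each index{}.txt file, call `extract_reviews` on its content, extend one shared list with the result, and call `write_csv` once at the end. If you prefer pandas, `pd.DataFrame(rows).to_csv("reviews.csv", index=False)` should give an equivalent file.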
