Web scraping from a script

Time: 2018-06-01 14:26:03

Tags: python python-3.x

I am trying to use Python to extract the percentage of languages used by a company.

However, this information seems to come from a script rather than from the HTML, and I am having some trouble.

For example, from the following page, when I try

import requests
from bs4 import BeautifulSoup

webpage = "https://www.zippia.com/amazon-com-careers-487/"
page = requests.get(webpage)
soup = BeautifulSoup(page.content, 'lxml')
for links in soup.find_all('div', {'class': 'companyEducationDegrees'}):
    raw_text = links.get_text()
    lines = raw_text.split('\n')
    print(lines)
    print('-------------------')

I get no results, whereas the ideal result would be

Spanish 61.1%, French 9.7%, etc.

1 answer:

Answer 0 (score: 1)

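As you have already found out, the data is put into the page via JS. However, you can still get it, because the entire data set for the company is always loaded along with the page. You can access this data with requests + BeautifulSoup + json (+ re):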

import json
import re

import requests
from bs4 import BeautifulSoup

webpage = "https://www.zippia.com/amazon-com-careers-487/"
page = requests.get(webpage)
soup = BeautifulSoup(page.content, 'lxml')

for script in soup.find_all('script', {'type': 'text/javascript'}):
    if 'getCompanyInfo' in script.text:
        # The company data is a JSON object literal sitting on a single line of the script
        match = re.search(r"\{[^\n]*\}", script.text)
        if match is None:
            continue
        data = json.loads(match.group())
        print(data["companyDiversity"]["languages"])

        # Optional: write the data to a file in a readable format
        # (useful if you want to find the path to an entry)
        with open("test.json", "w") as f:
            json.dump(data, f, indent=2)
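
To try the same idea without hitting the site, here is a minimal offline sketch. It assumes the JSON object sits on a single line inside a script that mentions getCompanyInfo, as in the answer above. The sample HTML, the layout of the "languages" entries, and the helper names ScriptCollector and extract_embedded_json are all made up for illustration and are not taken from the real page. It uses only the standard library in place of requests and BeautifulSoup so that it runs anywhere.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical page: the HTML and the data layout are invented for illustration.
SAMPLE_HTML = """
<html><body>
<div class="companyEducationDegrees"></div>
<script type="text/javascript">
function getCompanyInfo() {
return {"companyDiversity": {"languages": [{"name": "Spanish", "percent": 61.1}, {"name": "French", "percent": 9.7}]}};
}
</script>
</body></html>
"""


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def extract_embedded_json(html, marker):
    """Return the first single-line JSON object found in a script containing marker."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for text in collector.scripts:
        if marker not in text:
            continue
        # Same pattern as in the answer: an opening brace, anything but a newline, a closing brace
        match = re.search(r"\{[^\n]*\}", text)
        if match:
            return json.loads(match.group())
    return None


data = extract_embedded_json(SAMPLE_HTML, "getCompanyInfo")
for entry in data["companyDiversity"]["languages"]:
    print(f"{entry['name']} {entry['percent']}%")
```

On the real page the structure under data["companyDiversity"]["languages"] may look different, so it is worth inspecting the dumped test.json before relying on any particular keys.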