I've been trying to scrape the table here, but it looks to me like BeautifulSoup can't find any tables.
I wrote:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import csv

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'xml')
table = soup.find_all('table')
print table  # prints nothing..
Based on other similar questions, I assume the HTML is somehow broken, but I'm no expert.. I couldn't find the answer in: (Beautiful soup missing some html table tags), (Extracting a table from a website), (Scraping a table using BeautifulSoup), or even (Python+BeautifulSoup: scraping a particular table from a webpage).
Thanks a lot!
Answer 0 (score: 2)
While this won't find tables that aren't in r.text, you are asking BeautifulSoup to use the xml parser instead of html.parser, so I suggest changing that line to:

soup = BeautifulSoup(data, 'html.parser')
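As a minimal sketch of the difference this makes (the HTML snippet below is invented for illustration), the lenient html.parser backend finds tables even in slightly messy HTML:

```python
from bs4 import BeautifulSoup

# A small stand-in document; note the unclosed <tr>, the kind of
# imperfection the forgiving html.parser backend is built to handle.
html = """
<html><body>
  <table id="salaries">
    <tr><th>School</th><th>Median Pay</th></tr>
    <tr><td>Shaw University</td><td>36200</td>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
print(len(tables))      # 1
print(tables[0]['id'])  # salaries
```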
One problem you'll run into with web scraping is what's called "client-side rendering" of a site, as opposed to server-side rendering. Basically, this means that the page you get from a plain HTML request via the requests module or curl is different from the content rendered in a web browser. Some common frameworks for this are React and Angular. If you inspect the source of the page you want to scrape, it has data-react-id attributes on several of its HTML elements. A common indication of an Angular page is similar element attributes prefixed with ng, such as ng-if or ng-bind. You can view the page's source in Chrome or Firefox through their respective dev tools, which can be launched in either browser with the keyboard shortcut Ctrl+Shift+I. It's worth noting that not all React and Angular pages are purely client-side rendered.

To get this kind of content, you need a headless browser tool like Selenium. There are many resources on web scraping with Selenium and Python.
Answer 1 (score: 2)
The data is in a JavaScript variable; you should find the JS text and then extract it with a regular expression. The data you get contains a JSON list of 900+ school dicts, which you should load into a Python list object with the json module.
import requests, bs4, re, json
from pprint import pprint
url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
data = r.text
soup = bs4.BeautifulSoup(data, 'lxml')
var = soup.find(text=re.compile('collegeSalaryReportData'))
table_text = re.search(r'collegeSalaryReportData = (\[.+\]);\n var', var, re.DOTALL).group(1)
table_data = json.loads(table_text)
pprint(table_data)
print('The number of school', len(table_data))
Output:
{'% Female': '0.57',
'% High Job Meaning': 'N/A',
'% Male': '0.43',
'% Pell': 'N/A',
'% STEM': '0.1',
'% who Recommend School': 'N/A',
'Division 1 Basketball Classifications': 'Not Division 1 Basketball',
'Division 1 Football Classifications': 'Not Division 1 Football',
'Early Career Median Pay': '36200',
'IPEDS ID': '199643',
'ImageUrl': '/content/school_logos/Shaw University_50px.png',
'Mid-Career Median Pay': '45600',
'Rank': '963',
'School Name': 'Shaw University',
'School Sector': 'Private not-for-profit',
'School Type': 'Private School, Religious',
'State': 'North Carolina',
'Undergraduate Enrollment': '1664',
'Url': '/research/US/School=Shaw_University/Salary',
'Zip Code': '27601'}]
The number of school 963
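Since the question already imports pandas, a list of dicts like table_data can go straight into a DataFrame. A sketch using two stand-in records (rather than the full 963 fetched from the live page):

```python
import pandas as pd

# Stand-in records mimicking the structure shown in the output above.
table_data = [
    {'School Name': 'Shaw University', 'Rank': '963',
     'Early Career Median Pay': '36200', 'Mid-Career Median Pay': '45600'},
    {'School Name': 'Example College', 'Rank': '1',
     'Early Career Median Pay': '70000', 'Mid-Career Median Pay': '120000'},
]

df = pd.DataFrame(table_data)
# The pay fields arrive as strings; convert them for sorting and maths.
for col in ('Early Career Median Pay', 'Mid-Career Median Pay'):
    df[col] = pd.to_numeric(df[col])
print(df.sort_values('Mid-Career Median Pay', ascending=False))
```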
Answer 2 (score: 1)
You are parsing html, but you used an xml parser.
You should use soup = BeautifulSoup(data, "html.parser").
Your required data is inside a script tag; in fact there is no table tag at all. So you need to find the text within script.
N.B.: if you are using Python 2.x, use "HTMLParser" instead of "html.parser".
Here is the code.
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
scripts = soup.find_all("script")

file_name = open("table.csv", "w", newline="")
writer = csv.writer(file_name)
list_to_write = []
list_to_write.append(["Rank", "School Name", "School Type", "Early Career Median Pay", "Mid-Career Median Pay", "% High Job Meaning", "% STEM"])

for script in scripts:
    text = script.text
    start = 0
    end = 0
    if len(text) > 10000:
        while start > -1:
            start = text.find('"School Name":"', start)
            if start == -1:
                break
            start += len('"School Name":"')
            end = text.find('"', start)
            school_name = text[start:end]

            start = text.find('"Early Career Median Pay":"', start)
            start += len('"Early Career Median Pay":"')
            end = text.find('"', start)
            early_pay = text[start:end]

            start = text.find('"Mid-Career Median Pay":"', start)
            start += len('"Mid-Career Median Pay":"')
            end = text.find('"', start)
            mid_pay = text[start:end]

            start = text.find('"Rank":"', start)
            start += len('"Rank":"')
            end = text.find('"', start)
            rank = text[start:end]

            start = text.find('"% High Job Meaning":"', start)
            start += len('"% High Job Meaning":"')
            end = text.find('"', start)
            high_job = text[start:end]

            start = text.find('"School Type":"', start)
            start += len('"School Type":"')
            end = text.find('"', start)
            school_type = text[start:end]

            start = text.find('"% STEM":"', start)
            start += len('"% STEM":"')
            end = text.find('"', start)
            stem = text[start:end]

            list_to_write.append([rank, school_name, school_type, early_pay, mid_pay, high_job, stem])

writer.writerows(list_to_write)
file_name.close()
This will produce the required table as a CSV. Don't forget to close the file when done.
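As an aside, once the JSON blob has been located in the script tag, json.loads plus csv.DictWriter avoids all of the manual string searching above. A sketch on stand-in data (the field names match those used above; the record itself is invented):

```python
import csv
import json

# Stand-in for the JSON list extracted from the page's script tag.
raw = ('[{"Rank":"963","School Name":"Shaw University",'
       '"School Type":"Private School, Religious",'
       '"Early Career Median Pay":"36200","Mid-Career Median Pay":"45600",'
       '"% High Job Meaning":"N/A","% STEM":"0.1"}]')

fields = ["Rank", "School Name", "School Type", "Early Career Median Pay",
          "Mid-Career Median Pay", "% High Job Meaning", "% STEM"]

with open("table.csv", "w", newline="") as f:
    # extrasaction="ignore" drops any keys not listed in fields.
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(json.loads(raw))
```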