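I have been trying to scrape a table from a website with Python and Beautiful Soup. The problem is that the table is generated by a script, so the markup looks like this: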
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<!DOCTYPE html>
<html>
<head>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.10.1/jquery.min.js"></script>
<meta charset=utf-8 />
<title>JS Bin</title>
</head>
<body>
<select id="ddlCategory">
<option>Choose</option>
<option value="Value1">Value1</option>
<option value="Value2">Value 2</option>
<option value="Others">Others</option>
</select>
<table>
<tr>
<td>
<asp:Label ID="Label1" runat="server" Text="Category: " />
</td>
<td>
</td>
</tr>
<tr id="Other" style="display: none">
<table>
<tbody>
<tr>
<td>
<input id="txtOthers" type="text" runat="server" />
</td>
</tr>
<tr>
<td>
<input id="txtOthers" type="text" runat="server" />
</td>
</tr>
<tr>
<td>
<input id="txtOthers" type="text" runat="server" />
</td>
</tr>
</tbody>
</table>
</tr>
</table>
</body>
</html>
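I would like to know whether scraping the table is still feasible. There is a script tag right before the table, and I am wondering whether it is useful.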
<table class="table table-compact table-striped table-topics">
<thead>
<tr>
<th data-intro="Clicking a topic will allow you to view and ask general technical questions about the topic through SITIS." data-position="bottom">Topic #</th>
<th>Program</th>
<th>Component</th>
<th>Technology Area</th>
<th>Title</th>
<th data-intro="If there is SITIS activity for a topic a clickable 'QA' will appear in this column." data-position="bottom">SITIS</th>
</tr>
</thead>
<tbody>
{{#each this.Results}}
<tr>
<td><a href="/topics?topicId={{this.TopicId}}" target="_blank" data-topicid="{{this.TopicId}}">{{this.TopicNumber}}</a></td>
<td>{{this.ProgramTypeName}}</td>
<td>{{this.AgencyName}}</td>
<td>
<div class="icons">
{{#if this.TechAreaAirPlatform}}
<i class="glyph-icon flaticon-air-platform" data-toggle="tooltip" title="Technology Area: Air Platform"></i>
{{/if}}
{{#if this.TechAreaChemBioDefense }}
<i class="glyph-icon flaticon-chem-bio-defense" data-toggle="tooltip" title="Technology Area: Chem Bio Defense"></i>
{{/if}}
{{#if this.TechAreaInfoSystems}}
<i class="glyph-icon flaticon-info-systems" data-toggle="tooltip" title="Technology Area: Info Systems"></i>
{{/if}}
{{#if this.TechAreaGroundSea }}
<i class="glyph-icon flaticon-ground-sea" data-toggle="tooltip" title="Technology Area: Ground Sea"></i>
{{/if}}
{{#if this.TechAreaMaterials}}
<i class="glyph-icon flaticon-materials" data-toggle="tooltip" title="Technology Area: Materials"></i>
{{/if}}
{{#if this.TechAreaBioMedical }}
<i class="glyph-icon flaticon-bio-med" data-toggle="tooltip" title="Technology Area: Bio Medical"></i>
{{/if}}
{{#if this.TechAreaSensors }}
<i class="glyph-icon flaticon-sensors" data-toggle="tooltip" title="Technology Area: Sensors"></i>
{{/if}}
{{#if this.TechAreaElectronics }}
<i class="glyph-icon flaticon-electronics" data-toggle="tooltip" title="Technology Area: Electronics"></i>
{{/if}}
{{#if this.TechAreaBattlespace }}
<i class="glyph-icon flaticon-battlespace" data-toggle="tooltip" title="Technology Area: Battlespace"></i>
{{/if}}
{{#if this.TechAreaSpacePlatforms }}
<i class="glyph-icon flaticon-space-platform" data-toggle="tooltip" title="Technology Area: Space Platforms"></i>
{{/if}}
{{#if this.TechAreaHumanSystems }}
<i class="glyph-icon flaticon-human-systems" data-toggle="tooltip" title="Technology Area: Human Systems"></i>
{{/if}}
{{#if this.TechAreaWeapons }}
<i class="glyph-icon flaticon-weapons" data-toggle="tooltip" title="Technology Area: Weapons"></i>
{{/if}}
{{#if this.TechAreaNuclear }}
<i class="glyph-icon flaticon-nuclear" data-toggle="tooltip" title="Technology Area: Nuclear"></i>
{{/if}}
</div>
</td>
<td><a href="/topics?topicId={{this.TopicId}}" target="_blank" data-topicid="{{this.TopicId}}">{{this.TopicTitle}}</a></td>
<td>{{#if this.PublishedQuestionCount}}<a href="/topics?topicId={{this.TopicId}}" target="_blank" data-topicid="{{this.TopicId}}">Q&A</a>{{/if}}</td>
</tr>
{{else}}
<tr>
<td colspan="6"><div class="alert alert-warning">No topics were found.</div></td>
</tr>
{{/each}}
</tbody>
</table>
Thanks in advance!
Answer 0 (score: 0)
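The suggestion in the comments to use Selenium WebDriver is probably the simplest way to solve this. It looks like you are trying to scrape a site that generates its content dynamically from a template that is filled in in the browser (the {{#each}} and {{#if}} markers look like Handlebars or a similar client-side template language rather than Django).
So you need to simulate a browser that actually loads everything on the page, because at the moment you are only getting the static HTML, with the placeholders still unfilled. You can install selenium with a package manager, and then you need to install a driver for the browser you want to simulate: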
pip install selenium
pip install chromedriver
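Note: I don't think every web driver can be installed through a package manager, so you may have to download the driver from the web yourself.
Now you can scrape the page you want with something like this function I wrote: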
# Purpose: load a URL in a real browser and return the rendered HTML as a string.
# Depends on Selenium WebDriver (to run the page's JavaScript), BeautifulSoup, and time.
# Signature: pull_html_page(url: str, write: bool = False) -> str
import time

from bs4 import BeautifulSoup
from selenium import webdriver


def pull_html_page(url, write=False):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Crude wait so the page's scripts have time to fill in the table.
        time.sleep(1)
        content = driver.page_source
    finally:
        # Close the browser even if loading the page fails.
        driver.quit()
    if write:
        # Save a prettified copy; out.html must be in a directory you can write to.
        clean = BeautifulSoup(content, "html.parser").prettify()
        with open("out.html", "w", encoding="utf-8") as f:
            f.write(clean)
    return content
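Once the browser has rendered the page, the table can be pulled out of the returned HTML with Beautiful Soup. The following is only a minimal sketch: the helper name table_rows is hypothetical, the class name table-topics is taken from the markup in the question, and sample_html is a made-up stand-in for whatever pull_html_page returns.

```python
from bs4 import BeautifulSoup


def table_rows(html, table_class="table-topics"):
    """Return the body rows of the first matching table as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=table_class)
    if table is None:
        return []
    # Prefer <tbody> so header rows are skipped; fall back to the whole table.
    body = table.find("tbody") or table
    rows = []
    for tr in body.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


# Made-up stand-in for the rendered HTML returned by pull_html_page(url).
sample_html = """
<table class="table table-compact table-striped table-topics">
  <thead><tr><th>Topic #</th><th>Program</th><th>Title</th></tr></thead>
  <tbody>
    <tr><td><a href="/topics?topicId=1">X-001</a></td><td>SBIR</td><td>Example title</td></tr>
  </tbody>
</table>
"""

for row in table_rows(sample_html):
    print(row)
```

If the function returns an empty list on the real page, the table was probably not filled in yet when the page source was read, and a longer wait may be needed.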
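If this solution is not efficient enough for you, or you only need the dynamically generated data and not the static HTML, you can usually use the browser's inspection tools (I prefer the ones in Chrome) to watch the network traffic. Sometimes you will spot a URL that returns a JSON response. Requesting that URL directly saves you the time of loading the page, and you can take the data straight from the response.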
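As a rough sketch of that JSON route, using only the standard library: the field names below come from the template placeholders in the question, while the top-level "Results" key is an assumption inferred from {{#each this.Results}}, and the function names and the sample payload are made up for illustration. The real endpoint URL has to be found in the network tab and passed to fetch_rows.

```python
import json
from urllib.request import Request, urlopen

# Field names taken from the template placeholders in the question.
FIELDS = ("TopicNumber", "ProgramTypeName", "AgencyName", "TopicTitle")


def rows_from_payload(payload):
    """Turn a decoded JSON payload with a "Results" list into table rows."""
    # Assumption: the response is an object with a top-level "Results" list.
    return [
        [str(item.get(field, "")) for field in FIELDS]
        for item in payload.get("Results", [])
    ]


def fetch_rows(api_url):
    """Download the JSON document at api_url and convert it to rows."""
    request = Request(api_url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return rows_from_payload(payload)


# Made-up payload shaped like the assumed response, for a quick offline check.
sample = {
    "Results": [
        {
            "TopicNumber": "X-001",
            "ProgramTypeName": "SBIR",
            "AgencyName": "Example Agency",
            "TopicTitle": "Example title",
        }
    ]
}
print(rows_from_payload(sample))
```

Keeping the conversion in rows_from_payload separate from the download makes it easy to adjust the field list once the real response shape is known.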
Good luck!