Getting information from inside a script function in a web page

Time: 2018-06-29 14:46:58

Tags: python web-scraping beautifulsoup

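I am trying to get information from https://rosettacode.org/wiki/Category:Rascal and similar pages. The information I am interested in is in the box on the right in the upper part of the page, which lists details of the language such as execution method, garbage collected, and so on. This information is contained in the following lines of the page's HTML source: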

<script type="8b5f853f8b614ed469e51514-">window.RLQ = window.RLQ || []; window.RLQ.push( function () {
mw.config.set({"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":14,"wgPageName":"Category:Rascal","wgTitle":"Rascal","wgCurRevisionId":137957,"wgRevisionId":137957,"wgArticleId":11663,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],

"wgCategories":["Execution method/Interpreted","Garbage collection/Yes","Parameter passing/By value","Typing/Safe","Typing/Strong","Typing/Expression/Partially implicit","Typing/Checking/Dynamic","Impl needed","Programming Languages"],

"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Category:Rascal"
,"wgRelevantArticleId":11663,"wgIsProbablyEditable":!0,"wgRestrictionEdit":[],"wgRestrictionMove":[],"sfgAutocompleteValues":[],"sfgAutocompleteOnAllChars":!1,"sfgFieldProperties":[],"sfgDependentFields":[],"sfgShowOnSelect":[],"sfgScriptPath":"/mw/extensions/SemanticForms","sdgDownArrowImage":"/mw/extensions/SemanticDrilldown/skins/down-arrow.png","sdgRightArrowImage":"/mw/extensions/SemanticDrilldown/skins/right-arrow.png"});mw.loader.implement("user.options",function($,jQuery){mw.user.options.set({"variant":"en"});});mw.loader.implement("user.tokens",function($,jQuery){mw.user.tokens.set({"editToken":"+\\","patrolToken":"+\\","watchToken":"+\\"});});mw.loader.load(["ext.smw.style","ext.smw.tooltips","mediawiki.page.startup","mediawiki.legacy.wikibits"]);
} );</script>

The main part is in "wgCategories" (shown in the middle of the code above).

I have the following code to fetch the page:

import requests, sys
lang_url = 'https://rosettacode.org/wiki/Category:Rascal'
rg = requests.get(lang_url)
# requests.get never returns None, so check the HTTP status instead
if not rg.ok:
   print("Could not obtain web page.")
   sys.exit()
else:
   print("length of obtained page:", len(rg.text))

from bs4 import BeautifulSoup

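Which BeautifulSoup features can I use to get this information?

Edit: I have looked at BeautifulSoup. I can get the title via title, paragraphs via p, links via a and a['href'], and so on, but I could not find a way to look inside and search a script function.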

2 answers:

Answer 0 (score: 2)

You can pass the content of the requests response object into the BeautifulSoup constructor, specifying html.parser as the HTML parser, to get it into the right format. You can then use BeautifulSoup's find_all function, which takes an element tag name as its argument and returns a list. See below:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://rosettacode.org/wiki/Category:Rascal')
soup = bs(r.content, 'html.parser')
print(soup.find_all('script'))

If you like that sort of thing, another option is to use .getElementsByTagName("td").
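Since find_all('script') returns every script element on the page, a natural next step is to keep only the one that mentions wgCategories. The following is a minimal sketch of that filtering, not part of the original answer. The SAMPLE_HTML string is a shortened stand-in for the page source quoted in the question, and find_script_text is a hypothetical helper name. It only narrows the search to one block; the script text itself still has to be searched by other means.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page source quoted in the question (an assumption,
# used here so the sketch runs without network access).
SAMPLE_HTML = '''<html><head><title>Category:Rascal</title>
<script>var unrelated = 1;</script>
<script type="8b5f853f8b614ed469e51514-">window.RLQ = window.RLQ || []; window.RLQ.push( function () {
mw.config.set({"wgPageName":"Category:Rascal","wgCategories":["Execution method/Interpreted","Garbage collection/Yes"],"wgBreakFrames":!1});
} );</script></head><body></body></html>'''


def find_script_text(html, marker="wgCategories"):
    """Return the text of the first <script> element containing marker, or None.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        # .string is None for an empty script element, so fall back to ""
        text = script.string or ""
        if marker in text:
            return text
    return None


script_text = find_script_text(SAMPLE_HTML)
print(script_text is not None)
```

With a live page, r.content or rg.text would be passed in place of SAMPLE_HTML.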

Answer 1 (score: 1)

This is not BeautifulSoup, but you may want to use re for this, since HTML parsing will only return the whole script block.

import re
# raw string so that \[ and \] reach the regex engine unchanged
wgcontent = re.findall(r'wgCategories":\[(.+?)\]', rg.text)[0].replace('"', '').split(',')

This returns the following list:

Execution method/Interpreted
Garbage collection/Yes
Parameter passing/By value
Typing/Safe
Typing/Strong
Typing/Expression/Partially implicit
Typing/Checking/Dynamic
Impl needed
Programming Languages
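The one-liner above strips quotes and splits on commas, which would break a category name that itself contains a comma. A possible variant, sketched here under stated assumptions rather than taken from the answer, captures the whole bracketed array and parses it with json.loads. SAMPLE_SCRIPT is a shortened stand-in for rg.text, extract_categories is a hypothetical helper name, and the non-greedy pattern assumes no category name contains a "]" character.

```python
import json
import re

# Shortened stand-in for rg.text (an assumption, so the sketch runs offline).
SAMPLE_SCRIPT = (
    'mw.config.set({"wgUserGroups":["*"],'
    '"wgCategories":["Execution method/Interpreted","Garbage collection/Yes","Typing/Safe"],'
    '"wgBreakFrames":!1});'
)


def extract_categories(text):
    """Return the wgCategories entries found in text as a list of strings.

    Hypothetical helper: assumes the array is valid JSON and that no
    category name contains a closing square bracket.
    """
    match = re.search(r'"wgCategories":(\[.*?\])', text, re.DOTALL)
    if match is None:
        return []
    return json.loads(match.group(1))


categories = extract_categories(SAMPLE_SCRIPT)
print(categories)
```

Only the array is handed to the JSON parser because the surrounding mw.config.set(...) argument uses JavaScript shorthand such as !1 and !0, which is not valid JSON.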