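I am trying to get information from https://rosettacode.org/wiki/Category:Rascal and similar pages. The information I am interested in is in the box at the upper right of the page, which lists details about the language such as execution method and whether it is garbage collected. This information is contained in the following lines of the page's HTML source: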
<script type="8b5f853f8b614ed469e51514-">window.RLQ = window.RLQ || []; window.RLQ.push( function () {
mw.config.set({"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":14,"wgPageName":"Category:Rascal","wgTitle":"Rascal","wgCurRevisionId":137957,"wgRevisionId":137957,"wgArticleId":11663,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],
"wgCategories":["Execution method/Interpreted","Garbage collection/Yes","Parameter passing/By value","Typing/Safe","Typing/Strong","Typing/Expression/Partially implicit","Typing/Checking/Dynamic","Impl needed","Programming Languages"],
"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Category:Rascal"
,"wgRelevantArticleId":11663,"wgIsProbablyEditable":!0,"wgRestrictionEdit":[],"wgRestrictionMove":[],"sfgAutocompleteValues":[],"sfgAutocompleteOnAllChars":!1,"sfgFieldProperties":[],"sfgDependentFields":[],"sfgShowOnSelect":[],"sfgScriptPath":"/mw/extensions/SemanticForms","sdgDownArrowImage":"/mw/extensions/SemanticDrilldown/skins/down-arrow.png","sdgRightArrowImage":"/mw/extensions/SemanticDrilldown/skins/right-arrow.png"});mw.loader.implement("user.options",function($,jQuery){mw.user.options.set({"variant":"en"});});mw.loader.implement("user.tokens",function($,jQuery){mw.user.tokens.set({"editToken":"+\\","patrolToken":"+\\","watchToken":"+\\"});});mw.loader.load(["ext.smw.style","ext.smw.tooltips","mediawiki.page.startup","mediawiki.legacy.wikibits"]);
} );</script>
The relevant part is in "wgCategories" (shown in the middle of the code above).
I have the following code to fetch the page:
import requests, sys
lang_url = 'https://rosettacode.org/wiki/Category:Rascal'
rg = requests.get(lang_url)
# requests.get() never returns None, so check the HTTP status instead.
if not rg.ok:
    print("Could not obtain web page.")
    sys.exit()
else:
    print("length of obtained page:", len(rg.text))
from bs4 import BeautifulSoup
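Which BeautifulSoup features can I use to get this information?
Edit: I have looked into BeautifulSoup. I can get the title, paragraphs via para and p, links via a and a['href'], and so on, but I cannot find a way to look inside the script function and search it.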
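Answer 0 (score: 2)
You can pass the content of the requests response object into the BeautifulSoup constructor, specifying html.parser as the HTML parser for BeautifulSoup, to get it into the right format. You can then use BeautifulSoup's find_all function, which takes an element tag name as a parameter and returns a list. See below:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://rosettacode.org/wiki/Category:Rascal')
# Parse the raw response bytes with Python's built-in HTML parser.
soup = bs(r.content, 'html.parser')
# find_all returns a list of every <script> element on the page.
print(soup.find_all('script'))
If you like that sort of thing, another option is to use .getElementsByTagName("td").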
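Since find_all('script') returns every script element, it may help to narrow the list to the one that mentions wgCategories. The following is a minimal sketch under that assumption; the helper name find_config_script is hypothetical and not part of BeautifulSoup, and it takes the HTML as a string so it can be tried without a network request.

```python
from bs4 import BeautifulSoup


def find_config_script(html):
    """Return the text of the first <script> that mentions wgCategories, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        # .string is None for scripts with no inline text (e.g. src-only scripts).
        text = script.string or ""
        if "wgCategories" in text:
            return text
    return None
```

With the page from the question this could be called as find_config_script(r.text). The result is still the whole script body as one string, so the category list itself has to be pulled out of it in a separate step.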
Answer 1 (score: 1)
This is not BeautifulSoup, but you will probably need re for this, because HTML parsing returns the whole script block.
import re
# Capture the contents of the wgCategories array, then strip the quotes and split on commas.
wgcontent = re.findall(r'wgCategories":\[(.+?)\]', rg.text)[0].replace('"', '').split(',')
This returns the following list:
Execution method/Interpreted
Garbage collection/Yes
Parameter passing/By value
Typing/Safe
Typing/Strong
Typing/Expression/Partially implicit
Typing/Checking/Dynamic
Impl needed
Programming Languages
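Stripping quotes and splitting on commas would break if a category name ever contained a comma. One possible variation, sketched here and not taken from the answer, is to capture the bracketed array and hand it to json.loads, since the array is written in JSON syntax. The function name extract_categories is hypothetical, and the pattern still assumes that no category name contains a ] character.

```python
import json
import re


def extract_categories(page_text):
    """Return the wgCategories list found in page_text, or [] if it is absent."""
    # Capture the array including its brackets so it is valid JSON on its own.
    match = re.search(r'"wgCategories":(\[.*?\])', page_text, re.DOTALL)
    if match is None:
        return []
    return json.loads(match.group(1))
```

With the response object from the question, this could be called as extract_categories(rg.text).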