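I am parsing HTML from the following website: http://www.asusparts.eu/partfinder/Asus/All In One/E Series and I was wondering whether there is any way to explore a parsed attribute in Python. For example, the code below outputs the following: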
datas = s.find(id='accordion')
a = datas.findAll('a')
for data in a:
    if(data.has_attr('onclick')):
        model_info.append(data['onclick'])
        print data
[OUTPUT]
<a href="#Bracket" onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>
These are the values I would like to retrieve:
nCategoryID = Bracket
nModelID = ET10B
family = E Series
Because the page is rendered via AJAX, they use a script source, and the script file builds the following URL:
url = 'http://json.zandparts.com/api/category/GetCategories/' + country + '/' + currency + '/' + nModelID + '/' + family + '/' + nCategoryID + '/' + brandName + '/' + null
How can I retrieve only the 3 values listed above?
[EDIT]
import string, urllib2, urlparse, csv, sys
from urllib import quote
from urlparse import urljoin
from bs4 import BeautifulSoup
from ast import literal_eval

changable_url = 'http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series'
page = urllib2.urlopen(changable_url)
base_url = 'http://www.asusparts.eu'
soup = BeautifulSoup(page)

#Array to hold all options
redirects = []
#Array to hold all data
model_info = []

print "FETCHING OPTIONS"
select = soup.find(id='myselectListModel')
#print select.get_text()
options = select.findAll('option')

for option in options:
    if(option.has_attr('redirectvalue')):
        redirects.append(option['redirectvalue'])

for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    #print s

    print "FETCHING MAIN TITLE"
    #Finding all the headings for each specific Model
    maintitle = s.find(id='puffBreadCrumbs')
    print maintitle.get_text()

    #Find entire HTML container holding all data, rendered by AJAX
    datas = s.find(id='accordion')
    #Find all 'a' tags inside data container
    a = datas.findAll('a')
    #Find all 'span' tags inside data container
    content = datas.findAll('span')

    print "FETCHING CATEGORY"
    #Find all 'a' tags which have an attribute of 'onclick' Error:(doesn't display anything, can't seem to find
    #'onclick' attr
    if(hasattr(a, 'onclick')):
        arguments = literal_eval('(' + a['onclick'].replace(', this', '').split('(', 1)[1])
        model_info.append(arguments)
        print arguments #arguments[1] + " " + arguments[3] + " " + arguments[4]

    print "FETCHING DATA"
    for complete in content:
        #Find all 'class' attributes inside 'span' tags
        if(complete.has_attr('class')):
            model_info.append(complete['class'])
            print complete.get_text()

    #Find all 'table data cells' inside table held in data container
    print "FETCHING IMAGES"
    img = s.find('td')
    #Find all 'img' tags held inside these 'td' cells and print out
    images = img.findAll('img')
    print images
I have marked the line with the error in a comment; the problem is ...
Answer 0: (score: 1)
You can parse that as a Python literal, if you remove the 'this,' part from it and take only everything between the parentheses:
from ast import literal_eval

if data.has_attr('onclick'):
    arguments = literal_eval('(' + data['onclick'].replace(', this', '').split('(', 1)[1])
    model_info.append(arguments)
    print arguments
We remove the 'this' argument because it is not a valid Python string literal, and you don't want to have it anyway.
Demo:
>>> literal_eval('(' + "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')".replace(', this', '').split('(', 1)[1])
('Asus', 'Bracket', 'ET10B', '7138', 'E Series')
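To see why the 'this' argument has to go, here is a minimal sketch that feeds both forms of the argument list to literal_eval. It is written for Python 3, unlike the Python 2 snippets in this thread. The helper names parse_onclick_args and fails_with_this are made up for illustration, and the sketch assumes the onclick text always uses the exact ', this' spacing shown in the question.

```python
from ast import literal_eval

ONCLICK = "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')"


def parse_onclick_args(onclick):
    """Hypothetical helper: return the quoted call arguments as a tuple."""
    # Drop the bare name `this`, which literal_eval cannot evaluate.
    cleaned = onclick.replace(', this', '')
    # Keep everything after the first '(' and put the '(' back, so the
    # remainder reads as a Python tuple literal.
    return literal_eval('(' + cleaned.split('(', 1)[1])


def fails_with_this(onclick):
    """Return True if literal_eval rejects the unmodified argument list."""
    try:
        literal_eval('(' + onclick.split('(', 1)[1])
    except ValueError:
        return True
    return False


args = parse_onclick_args(ONCLICK)
rejected = fails_with_this(ONCLICK)
print(args)
print(rejected)
```

If the site ever changes the spacing around 'this', the replace call would no longer match, so treat the exact string as an assumption to verify against the live markup.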
Now you have a Python tuple and can pick out any value you like.
You want the values at indices 1, 2 and 4, for example:
nCategoryID, nModelID, family = arguments[1], arguments[2], arguments[4]
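As a rough sketch of where these values end up, the Python 3 snippet below fills in the URL template from the question. Several things here are assumptions: the thread never gives country or currency, so 'XX' and 'YYY' are placeholders; the first tuple element ('Asus') is taken to be brandName; the path segments are assumed to need percent-encoding; and build_category_url is a hypothetical helper name. The trailing 'null' segment mirrors what concatenating null onto a string produces in JavaScript.

```python
from urllib.parse import quote

BASE = 'http://json.zandparts.com/api/category/GetCategories/'


def build_category_url(country, currency, arguments):
    """Hypothetical helper: fill the question's URL template."""
    # Positions in the parsed tuple: 0 = brand (assumed), 1 = category,
    # 2 = model, 4 = family. Index 3 ('7138') is not used by the template.
    brand_name, category_id, model_id, family = (
        arguments[0], arguments[1], arguments[2], arguments[4])
    segments = [country, currency, model_id, family,
                category_id, brand_name, 'null']
    # Assumption: each segment is percent-encoded (the space in 'E Series').
    return BASE + '/'.join(quote(part, safe='') for part in segments)


arguments = ('Asus', 'Bracket', 'ET10B', '7138', 'E Series')
url = build_category_url('XX', 'YYY', arguments)
print(url)
```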
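Answer 1: (score: 1)
Similar to Martijn's answer, but makes primitive use of pyparsing
(i.e., it could be refined to recognise the function and take only the quoted strings within the parentheses):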
from bs4 import BeautifulSoup
from pyparsing import QuotedString
from itertools import chain
s = '''<a href="#Bracket" onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>'''
soup = BeautifulSoup(s)
for a in soup('a', onclick=True):
    print list(chain.from_iterable(QuotedString("'", unquoteResults=True).searchString(a['onclick'])))
# ['Asus', 'Bracket', 'ET10B', '7138', 'E Series']
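The refinement mentioned above (recognise the function, then take only the quoted strings inside its parentheses) could also be sketched with the standard library's re module in place of pyparsing. This is a swapped-in technique rather than this answer's method, written for Python 3. The function name in the pattern comes from the question, the helper name quoted_args is made up, and the sketch assumes single-quoted arguments with no escaped quotes or nested parentheses.

```python
import re

# Match the call and capture everything between its parentheses.
CALL_RE = re.compile(r"getProductsBasedOnCategoryID\((.*?)\)")
# Match one single-quoted string and capture its contents.
QUOTED_RE = re.compile(r"'([^']*)'")


def quoted_args(onclick):
    """Hypothetical helper: list the single-quoted arguments of the call."""
    call = CALL_RE.search(onclick)
    if call is None:
        return []  # Some other handler; nothing to extract.
    # The bare name `this` is unquoted, so findall skips it.
    return QUOTED_RE.findall(call.group(1))


onclick = "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')"
result = quoted_args(onclick)
other = quoted_args("alert('x')")
print(result)
```

Because unquoted tokens are simply ignored, this variant does not need the ', this' replacement step at all.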