I'm trying to scrape a web page with Python and Beautiful Soup, but the page source isn't the prettiest. The following is a small part of the page source:
...717301758],"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0,...
I want to get the value '2' that follows the string 'birthdayFriends', but I don't know how. So far I have written the code below, but it only prints an empty list.
import urllib2
from bs4 import BeautifulSoup
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='myWebpage',
                          user='myUsername',
                          passwd='myPassword')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
page = urllib2.urlopen('myWebpage')
soup = BeautifulSoup(page.read())
bf = soup.findAll('birthdayFriends')
print bf
>> []
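Answer 0 (score: 1)
soup.findAll('birthdayFriends') looks for tags named birthdayFriends, which is why it returns an empty list; here the string is part of a piece of text, not a tag. Assuming the HTML contains a script tag like the following: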
<script>
var x = {"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0}}
</script>
Then your code could look something like this:
import json
script = soup.findAll('script')[0] # or whichever index the script has in the file
# take the JSON part: everything after the first '='
j = script.text.split('=', 1)[1]
# load the JSON string into a dictionary
d = json.loads(j, strict=False)
print(d["birthdayFriends"])
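Splitting on '=' assumes that nothing follows the object literal. If the script ends the statement with a semicolon or has more code after it, json.loads will reject the trailing text. A minimal sketch of one way around that, using json.JSONDecoder().raw_decode from the standard library, which parses one JSON value and ignores whatever comes after it. The script_text sample and the first_json_object helper are made up for illustration:

```python
import json

# Hypothetical script content, standing in for script.text from the page.
script_text = '\nvar x = {"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0}};\n'

def first_json_object(text):
    """Parse the first {...} object found in text; return None if there is no '{'."""
    start = text.find('{')
    if start == -1:
        return None
    # raw_decode parses a single JSON value starting at index `start`
    # and returns (value, end_index); trailing text such as ';' is ignored.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj

d = first_json_object(script_text)
print(d["birthdayFriends"])
```

This still assumes the object literal is valid JSON (double-quoted keys, no comments); arbitrary JavaScript object syntax would need a real JavaScript parser.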
If the script tag's content is more complex, consider looping over the lines of the script, or see How can I parse Javascript variables using python?
Also, for parsing JavaScript in Python, see pynoceros.
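When only a single numeric value is needed, a regular expression over the raw page source may be enough, with no need to locate the script tag at all. The following is a rough sketch, not a general JSON parser: page_source stands in for the result of page.read() and reuses the fragment from the question, and extract_int_value is a hypothetical helper name.

```python
import re

# Stand-in for page.read(); this is the fragment shown in the question.
page_source = '717301758],"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0,'

def extract_int_value(source, key):
    """Return the integer that follows "key": in source, or None if the key is absent."""
    pattern = r'"' + re.escape(key) + r'"\s*:\s*(-?\d+)'
    match = re.search(pattern, source)
    return int(match.group(1)) if match else None

print(extract_int_value(page_source, "birthdayFriends"))
```

This only handles integer values and returns the first match, so it is best suited to keys that appear once in the page.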