How to use BeautifulSoup
I want to get the ImageID and CaseID values from the text below:
<a href="GetBinary.aspx?Scene&ImageID=247572954&CaseID=773013618&Version=-1" target="_blank">View to scale Easy Street Draw file*</a>
The code I tried is:
link = '<a href="GetBinary.aspx?Scene&ImageID=247572954&CaseID=773013618&Version=-1" target="_blank">View to scale Easy Street Draw file*</a>'
img_uttp = link.find('ImageID')
I get blank data back.
Answer 0 (score: 1)
In Python 3, the URL can be parsed with parse_qs from urllib.parse.
from urllib.parse import parse_qs
query = parse_qs("GetBinary.aspx?Scene&ImageID=247572954&CaseID=773013618&Version=-1")
The result is:
{'CaseID': ['773013618'], 'ImageID': ['247572954'], 'Version': ['-1']}
You can then get the ImageID. Each value is a list, so take the first element:
query['ImageID'][0]
Or in Python 2:
from urlparse import parse_qs
query = parse_qs("GetBinary.aspx?Scene&ImageID=247572954&CaseID=773013618&Version=-1")
query['ImageID']
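Passing the whole href to parse_qs works here only because the leading "GetBinary.aspx?Scene" fragment contains no "=" and is therefore discarded. A slightly more defensive sketch, assuming Python 3, is to isolate the query string with urlsplit first. The variable names below are illustrative, not part of the original answer.

```python
from urllib.parse import parse_qs, urlsplit

href = "GetBinary.aspx?Scene&ImageID=247572954&CaseID=773013618&Version=-1"

# Keep only the part after "?" so the path never leaks into a parameter name.
query_string = urlsplit(href).query

# parse_qs maps each name to a list of values. "Scene" has no "=" and is
# dropped unless keep_blank_values=True is passed.
params = parse_qs(query_string)

image_id = params["ImageID"][0]
case_id = params["CaseID"][0]
print(image_id, case_id)
```

If the flag-style "Scene" parameter matters, parse_qs(query_string, keep_blank_values=True) keeps it with an empty string as its value.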
Answer 1 (score: 1)
Using BeautifulSoup together with parse_qs (Python 3 shown; on Python 2, use "import urlparse" and "urlparse.parse_qs" instead):
from bs4 import BeautifulSoup as bs
from urllib.parse import parse_qs
# Name the parser explicitly so bs4 does not have to guess one
s = bs('<a href="GetBinary.aspx?Scene&ImageID=247572954&CaseID=773013618&Version=-1" target="_blank">View to scale Easy Street Draw file*</a>', 'html.parser')
url = s.find('a').get('href')
parsed = parse_qs(url)
# {'ImageID': ['247572954'], 'CaseID': ['773013618'], 'Version': ['-1']}
print(parsed['CaseID'][0])
print(parsed['ImageID'][0])
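If BeautifulSoup is not installed, the same href extraction can be approximated with the standard library alone. The sketch below swaps in html.parser.HTMLParser for BeautifulSoup and assumes Python 3. HrefCollector is a hypothetical helper written for this example, not a library class.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


html = ('<a href="GetBinary.aspx?Scene&ImageID=247572954&CaseID=773013618'
        '&Version=-1" target="_blank">View to scale Easy Street Draw file*</a>')

collector = HrefCollector()
collector.feed(html)
collector.close()

# Parse only the query part of the first link found.
params = parse_qs(urlsplit(collector.hrefs[0]).query)
print(params["ImageID"][0], params["CaseID"][0])
```

BeautifulSoup remains the more forgiving choice for messy real-world pages. This variant is only meant for small, well-formed fragments like the one in the question.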
If you skip instantiating the text fragment as a BeautifulSoup object:
>>> link = '<a href="GetBinary.aspx?Scene&ImageID=247572954&CaseID=773013618&Version=-1" target="_blank">View to scale Easy Street Draw file*</a>'
>>> q = link.find('ImageID')
>>> q
30  # index of the 'ImageID' substring in link
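then find is called on a plain string. That is Python's str.find(), which returns the index of the substring (or -1 if it is absent), not the parameter's value.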
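The difference can be sketched as follows, assuming Python 3. The extract_param function is a hypothetical helper added for illustration. It uses a regular expression rather than an HTML or URL parser, so it only suits simple integer-valued parameters like the ones in the question.

```python
import re

link = ('<a href="GetBinary.aspx?Scene&ImageID=247572954&CaseID=773013618'
        '&Version=-1" target="_blank">View to scale Easy Street Draw file*</a>')

# str.find reports where the substring starts, or -1 when it is absent.
position = link.find('ImageID')
missing = link.find('NoSuchKey')


def extract_param(text, name):
    """Hypothetical helper: return the integer text after 'name=', or None."""
    match = re.search(r'[?&]' + re.escape(name) + r'=(-?\d+)', text)
    return match.group(1) if match else None


print(link[position:position + len('ImageID')])
print(extract_param(link, 'ImageID'), extract_param(link, 'CaseID'))
```

For anything beyond a quick one-off, parsing the href with parse_qs as in the answers above is less fragile than a hand-written pattern.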