How do I extract ImageID and CaseID from text?

Time: 2017-08-17 13:48:48

Tags: python beautifulsoup

How can I use BeautifulSoup to get the ImageID and CaseID values from the text below?
<a href="GetBinary.aspx?Scene&amp;ImageID=247572954&amp;CaseID=773013618&amp;Version=-1" target="_blank">View to scale Easy Street Draw file*</a>

The code I tried is:

link = '<a href="GetBinary.aspx?Scene&amp;ImageID=247572954&amp;CaseID=773013618&amp;Version=-1" target="_blank">View to scale Easy Street Draw file*</a>'
img_uttp = link.find('ImageID')

I get blank data.

2 answers:

Answer 0: (score: 1)

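In Python 3, the query string can be parsed with parse_qs from urllib.parse. Unescape the HTML entities (&amp;) first and pass only the query part of the URL: recent Python versions split only on "&", so the raw escaped string would otherwise produce keys such as 'amp;ImageID'.

from html import unescape
from urllib.parse import parse_qs, urlparse
href = unescape("GetBinary.aspx?Scene&amp;ImageID=247572954&amp;CaseID=773013618&amp;Version=-1")
query = parse_qs(urlparse(href).query)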

The result is:

{'CaseID': ['773013618'], 'ImageID': ['247572954'], 'Version': ['-1']}

You can then get the ImageID. Each value is a list, so take the first element:

query['ImageID'][0]

In Python 2, parse_qs lives in the urlparse module:

from urlparse import parse_qs
query = parse_qs("GetBinary.aspx?Scene&amp;ImageID=247572954&amp;CaseID=773013618&amp;Version=-1")
query['ImageID']
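
The parsing steps above can be collected into one small Python 3 function. This is only a sketch: extract_ids is a hypothetical helper name, and it assumes the href may still contain HTML entities such as &amp; and that a missing parameter should come back as None.

```python
from html import unescape
from urllib.parse import parse_qs, urlparse


def extract_ids(href):
    """Return (ImageID, CaseID) from an href that may contain HTML entities.

    extract_ids is a hypothetical helper name, not part of any library.
    """
    # Turn "&amp;" back into "&" so the separators are real ampersands.
    clean = unescape(href)
    # Keep only the part after "?", so the path does not leak into a key.
    params = parse_qs(urlparse(clean).query)
    # parse_qs maps each key to a list of values; take the first, or None.
    image_id = params.get("ImageID", [None])[0]
    case_id = params.get("CaseID", [None])[0]
    return image_id, case_id


href = "GetBinary.aspx?Scene&amp;ImageID=247572954&amp;CaseID=773013618&amp;Version=-1"
print(extract_ids(href))  # ('247572954', '773013618')
```

The bare "Scene" parameter has no value, so parse_qs drops it by default; pass keep_blank_values=True if it needs to be kept.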

Answer 1: (score: 1)

Using BeautifulSoup and the urlparse library (Python 2):

from bs4 import BeautifulSoup as bs
import urlparse

s = bs('<a href="GetBinary.aspx?Scene&amp;ImageID=247572954&amp;CaseID=773013618&amp;Version=-1" target="_blank">View to scale Easy Street Draw file*</a>')
url = s.find('a').get('href')
parsed = urlparse.parse_qs(url)
# {'Version': ['-1'], 'CaseID': ['773013618'], 'ImageID': ['247572954']}
#print parsed['CaseID'][0]
#print parsed['ImageID'][0]
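
For Python 3 without third-party packages, roughly the same idea can be sketched with the standard library's html.parser in place of BeautifulSoup. This is a swapped-in technique, not what the answer uses; HrefCollector is a hypothetical class name, and the sketch assumes the first <a> tag is the one of interest.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; the parser has
            # already decoded entities such as &amp; in the values.
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


snippet = ('<a href="GetBinary.aspx?Scene&amp;ImageID=247572954'
           '&amp;CaseID=773013618&amp;Version=-1" target="_blank">'
           'View to scale Easy Street Draw file*</a>')

collector = HrefCollector()
collector.feed(snippet)
collector.close()

# Parse only the query part of the first collected link.
params = parse_qs(urlparse(collector.hrefs[0]).query)
print(params["ImageID"][0], params["CaseID"][0])  # 247572954 773013618
```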

If you skip turning the text snippet into a BeautifulSoup object,

>>> link = '<a href="GetBinary.aspx?Scene&amp;ImageID=247572954&amp;CaseID=773013618&amp;Version=-1" target="_blank">View to scale Easy Street Draw file*</a>'
>>> q = link.find('ImageID')
>>> q
34 #index of ImageID substring in link

then find is called on a plain string, that is, Python's str.find(), which returns the index of the substring instead of a tag.
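
A minimal sketch of what the plain-string find returns, using the same link text as above. The variable names position and missing are only illustrative.

```python
link = ('<a href="GetBinary.aspx?Scene&amp;ImageID=247572954'
        '&amp;CaseID=773013618&amp;Version=-1" target="_blank">'
        'View to scale Easy Street Draw file*</a>')

# str.find returns the position of the first match, not the matched value.
position = link.find('ImageID')
# A substring that does not occur yields -1 rather than raising an error.
missing = link.find('NoSuchKey')

print(position, missing)  # 34 -1
# Slicing at that position shows it only points at the key name itself.
print(link[position:position + len('ImageID')])  # ImageID
```

So the index alone does not give the ID value; the href still has to be parsed as a query string.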