Question

我的文字看起来像：

Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=478', text='SomeSite -  Professor Rating of Louis Scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=478'), ('h', 'ID=SERP,5105.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=527', text='SomeSite -  Professor Rating of Jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=527'), ('h', 'ID=SERP,5118.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=645', text='SomeSite -  Professor Rating of David Kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=645'), ('h', 'ID=SERP,5131.1')])

问题

有谁知道这个文字的格式？
我如何解析元素url的值（例如，从上面的文字中）： http://www.somesite.com/prof.php?pID=478 http://www.somesite.com/prof.php?pID=527
你建议解析什么python库说这种类型的输出，xml，json等？

我只是尝试loop through the url并仅解析url的值。

请记住我正在使用Django。

感谢您提供任何帮助。

修改 * 当前代码： *

domainLinkOutputAsString = str(domainLinkOutput) 

r = re.compile(" url='(.*?)',", )  ##ERRORENOUS, must be 're' compliant.

ProperDomains = r.findall(domainLinkOutputAsString)

return HttpResponse(ProperDomains)

Answer 1

您只需使用Python Regexp：

即可

import re
text = "Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=478', text='SomeSite -  Professor Rating of Louis Scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=478'), ('h', 'ID=SERP,5105.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=527', text='SomeSite -  Professor Rating of Jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=527'), ('h', 'ID=SERP,5118.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=645', text='SomeSite -  Professor Rating of David Kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=645'), ('h', 'ID=SERP,5131.1')])"

# Create the regexp object to match the value of 'url'
r = re.compile(" url='(.*?)',", )

# Print all matches
print r.findall(text)

>>>['http://www.somesite.com/prof.php?pID=478', 'http://www.somesite.com/prof.php?pID=527', 'http://www.somesite.com/prof.php?pID=645']

在Python中解析文本（Django）

1 个答案: