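I decided to do this small project to learn how to use mechanize. Right now it goes to urbandictionary, fills in the word "skid" in the search form, then submits it and prints out the HTML.
What I want it to do is find the first definition and print only that. How would I go about doing this?
Here is my source code so far: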
import mechanize
br = mechanize.Browser()
page = br.open("http://www.urbandictionary.com/")
br.select_form(nr=0)
br["term"] = "skid"
br.submit()
print br.response().read()
This is where the definition is stored:
<div class="definition">Canadian definition: Commonly used to refer to someone who stopped evolving, and bathing, during the 80's hair band era. Generally can be found wearing AC/DC muscle shirts, leather jackets, and sporting a <a href="/define.php?term=mullet">mullet</a>. The term "skid" is in part derived from "skid row", which is both a band enjoyed by those the term refers to, as well as their address. See also <a href="/define.php?term=white%20trash">white trash</a> and <a href="/define.php?term=trailer%20park%20trash">trailer park trash</a></div><div class="example">The skid next door got drunk and beat up his old lady.</div>
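You can see that it is stored in the div with class "definition". I know how to search the source for that div, but I don't know how to grab everything between the tags and then display it.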
Answer 0 (score: 1)
I think a regular expression is enough for this task (based on your description). Try this code:
import mechanize, re
br = mechanize.Browser()
page = br.open("http://www.urbandictionary.com/")
br.select_form(nr=0)
br["term"] = "skid"
br.submit()
source = br.response().read()
regex = "<div class=\"definition\">(.+?)</div>"
pattern = re.compile(regex)
r=re.findall(pattern,source)
print r[0]
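This prints the content between the tags (the example part is not included, but it can be extracted the same way). I don't know how you want to handle the tags inside that content. If you want to keep them, you are done. If you want to remove them, you can use something like re.sub().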
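As a rough illustration of the tag-stripping step, the sketch below runs the same kind of pattern against a shortened copy of the HTML quoted in the question (so no network access is needed) and then removes the inner tags with re.sub(). The `sample` string and the `strip_tags` helper are made up for this example, and a pattern like `<[^>]+>` is only a quick approximation that can break on unusual markup.

```python
import re

# Shortened copy of the markup quoted in the question (assumption: the
# live page may be structured differently).
sample = (
    '<div class="definition">Generally can be found sporting a '
    '<a href="/define.php?term=mullet">mullet</a>. See also '
    '<a href="/define.php?term=white%20trash">white trash</a></div>'
    '<div class="example">The skid next door got drunk.</div>'
)

# Non-greedy capture of everything up to the first closing </div>.
# re.DOTALL lets "." match newlines in case a definition spans lines.
definition_re = re.compile(r'<div class="definition">(.+?)</div>', re.DOTALL)


def strip_tags(fragment):
    """Hypothetical helper: drop anything that looks like <...>."""
    return re.sub(r"<[^>]+>", "", fragment)


matches = definition_re.findall(sample)
# Guard against an empty result instead of indexing blindly.
first_definition = strip_tags(matches[0]) if matches else None
print(first_definition)
```

Checking `matches` before indexing avoids an IndexError when the page layout changes and the pattern no longer matches anything.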
Answer 1 (score: 1)
Since it was mentioned, I thought I'd provide a BeautifulSoup answer. Use whichever approach works best for you.
import bs4, urllib2
# Use urllib2 to get the html from the web
url = r"http://www.urbandictionary.com/define.php?term={term}"
request = url.format(term="skid")
raw = urllib2.urlopen(request).read()
# Convert it into a soup
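soup = bs4.BeautifulSoup(raw, "html.parser")
# Find the requested info
for word_def in soup.find_all(class_='definition'):
    # get_text() is used because .string is None when the div holds nested tags such as <a>
    print word_def.get_text()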
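One detail worth seeing in isolation: on a tag with several children (text mixed with links, as in the definition div from the question), `.string` is None, while `get_text()` joins all the nested text. The sketch below shows this on a shortened, hand-copied `sample` string rather than the live page, so the markup is an assumption.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup quoted in the question (assumption: the
# live page may be structured differently).
sample = (
    '<div class="definition">Generally can be found sporting a '
    '<a href="/define.php?term=mullet">mullet</a>. See also '
    '<a href="/define.php?term=white%20trash">white trash</a></div>'
    '<div class="example">The skid next door got drunk.</div>'
)

soup = BeautifulSoup(sample, "html.parser")

# find() returns only the first match, which is what the question asks for.
first = soup.find("div", class_="definition")

# The div has several children (text and <a> tags), so .string is None.
print(first.string)

# get_text() concatenates the text of the div and all nested tags.
text = first.get_text()
print(text)
```

Using `find()` instead of looping over every match keeps the output to the first definition only.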
Answer 2 (score: 0)
You can use lxml to parse the HTML fragment:
import lxml.html as html
import mechanize
br = mechanize.Browser()
page = br.open("http://www.urbandictionary.com/")
br.select_form(nr=0)
br["term"] = "skid"
br.submit()
fragment = html.fromstring(br.response().read())
print fragment.find_class('definition')[0].text_content()
Note, however, that this solution removes the tags inside the div and flattens the text.
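If the inner links matter, one possible alternative is to serialize the element instead of flattening it. The sketch below compares `text_content()` with `lxml.html.tostring()` on a shortened, hand-copied `sample` string (an assumption standing in for the live page); it is meant only to show the difference between the two calls.

```python
import lxml.html

# Shortened copy of the markup quoted in the question (assumption: the
# live page may be structured differently).
sample = (
    '<div class="definition">Generally can be found sporting a '
    '<a href="/define.php?term=mullet">mullet</a>. See also '
    '<a href="/define.php?term=white%20trash">white trash</a></div>'
    '<div class="example">The skid next door got drunk.</div>'
)

root = lxml.html.fromstring(sample)
definition = root.find_class("definition")[0]

# Flattened text: nested tags are dropped, their text is kept.
flat_text = definition.text_content()
print(flat_text)

# Serialized element: the <a> tags inside the div are preserved.
markup = lxml.html.tostring(definition, encoding="unicode", with_tail=False)
print(markup)
```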