Question

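I'm fairly new to regular expressions, and the one I wrote doesn't work. It is supposed to pull data out of a website's HTML.

Basically I want to extract the following from the HTML, and every occurrence of it, since there are several. I have the page URL as a string, btw.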

<a href="http://store.steampowered.com/search/?category2=2" class="name">Co-Op</a>

This is the regex I came up with:

<a\bhref="http://store.steampowered.com/search/?category2=2"\bclass="name"*>(.*?)</a>\g

Answer 1

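You should never use regular expressions to parse HTML/XML, or any other language that allows cascading (nested) structures.

One nice thing about HTML is that it can be converted to XML, and XML has a good toolkit for parsing: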

echo '<a href="http://store.steampowered.com/search/?category2=2" class="name">Co-Op</a>' | tidy -asxhtml -numeric 2> /dev/null | xmllint --html --xpath 'normalize-space(//a[@class="name" and @href="http://store.steampowered.com/search/?category2=2"])' - 2>/dev/null

This uses the query:

normalize-space(//a[@class="name" and @href="http://store.steampowered.com/search/?category2=2"])

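// matches a tag at any depth, a selects a tags, and the predicate additionally requires class=name and href=(the link). normalize-space then returns the content between such an <a> and its </a>, with surrounding whitespace trimmed.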
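The same query can be sketched in Python with the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. This is only an illustration under assumptions: the XHTML fragment and the link_texts helper are made up for the example, the input must already be well-formed XML (for instance the output of tidy), and normalize-space is imitated with plain string operations because ElementTree does not provide XPath functions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed XHTML fragment (e.g. what tidy might emit).
XHTML = """<div>
  <a href="http://store.steampowered.com/search/?category2=1" class="name">Single-player</a>
  <p><a class="name" href="http://store.steampowered.com/search/?category2=2">
      Co-Op
  </a></p>
</div>"""

TARGET = "http://store.steampowered.com/search/?category2=2"


def link_texts(xml_text, href):
    """Return the whitespace-normalized text of matching <a> elements."""
    root = ET.fromstring(xml_text)
    # .// searches at any depth; each [...] predicate filters on one attribute.
    matches = root.findall(f".//a[@class='name'][@href='{href}']")
    # " ".join(s.split()) imitates XPath's normalize-space().
    return [" ".join((a.text or "").split()) for a in matches]


print(link_texts(XHTML, TARGET))
```

Note that the attribute order inside the tag does not matter here, and the match is found even though the link is nested inside a <p>.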

In Python, you can use:

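from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

# Download the page and parse it with the parser bundled with Python.
page = urlopen("http://store.steampowered.com/app/24860/").read()
soup = BeautifulSoup(page, "html.parser")
# find_all returns every <a> tag whose class and href both match.
print(soup.find_all('a', attrs={'class': 'name', 'href': 'http://store.steampowered.com/search/?category2=2'}))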
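If installing BeautifulSoup is not an option, a similar lookup can be sketched with the standard library's html.parser. This is a rough alternative, not what the snippet above does internally: the NameLinkParser class, the find_link_texts helper and the SAMPLE markup are all hypothetical names and data invented for this example.

```python
from html.parser import HTMLParser


class NameLinkParser(HTMLParser):
    """Collects the text of <a> tags that have the given class and href."""

    def __init__(self, href, css_class="name"):
        super().__init__()
        self.href = href
        self.css_class = css_class
        self.results = []
        self._buffer = None  # list of text chunks while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # attribute order no longer matters
        if (tag == "a" and attr_map.get("class") == self.css_class
                and attr_map.get("href") == self.href):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.results.append("".join(self._buffer).strip())
            self._buffer = None


def find_link_texts(html, href):
    parser = NameLinkParser(href)
    parser.feed(html)
    parser.close()
    return parser.results


TARGET = "http://store.steampowered.com/search/?category2=2"
# Made-up sample markup with the two attributes in both possible orders.
SAMPLE = (
    '<div><a href="http://store.steampowered.com/search/?category2=2" class="name">Co-Op</a>'
    '<a class="name" href="http://store.steampowered.com/search/?category2=1">Single-player</a>'
    '<a class="name" href="http://store.steampowered.com/search/?category2=2"> Co-Op </a></div>'
)

print(find_link_texts(SAMPLE, TARGET))
```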

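As for your regex:

The problem is that it contains characters such as ? that are interpreted as regex operators rather than as literal characters. You need to escape them. It should probably be: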

<a\s+href="http://store\.steampowered\.com/search/\?category2=2"\s+class="name"\S*>(.*?)</a>\g

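I also replaced \b with \s; \s matches whitespace characters such as spaces, tabs and newlines. The regex is still very fragile, though: if someone decides to swap href and class, the program will break. Solutions exist for most of these problems, but you are better off using an XML parsing tool.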
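Both points, the escaping and the fragility, can be checked with a small Python sketch. It is a simplified variant of the pattern above rather than a literal copy: it uses \s* before the closing >, drops the trailing \g (which is not valid in Python's re syntax), and lets re.escape do the escaping. The extract helper and the sample strings are assumptions made for the illustration.

```python
import re

URL = "http://store.steampowered.com/search/?category2=2"

# Unescaped: "/?" means "an optional slash", so the literal "?" in the URL never matches.
UNESCAPED = re.compile(r'<a\s+href="http://store.steampowered.com/search/?category2=2"')

# re.escape() backslash-escapes regex metacharacters such as "." and "?".
PATTERN = re.compile(
    r'<a\s+href="' + re.escape(URL) + r'"\s+class="name"\s*>(.*?)</a>'
)


def extract(html):
    """Return the captured link texts for every match in html."""
    return PATTERN.findall(html)


ORIGINAL_ORDER = '<a href="http://store.steampowered.com/search/?category2=2" class="name">Co-Op</a>'
SWAPPED_ORDER = '<a class="name" href="http://store.steampowered.com/search/?category2=2">Co-Op</a>'

print(UNESCAPED.findall(ORIGINAL_ORDER))  # unescaped "?" prevents a match
print(extract(ORIGINAL_ORDER))            # escaped pattern finds the link text
print(extract(SWAPPED_ORDER))             # swapping href and class breaks the regex
```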
