Question

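I'm using the Python.org build of Python 2.7 (64-bit) on Windows Vista 64-bit. I'm trying to use Scrapy and a regex to parse the full contents of an item called 'DataStore.prime(\'standings\'' on the page below. If I use this code: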

regex = re.compile('DataStore\.prime\(\'standings\', { stageId: \d+ }.*', re.S)
match2 = re.search(regex, response.body).group()
match3 = str(match2)
match3 = match3.replace('<a class="w h"', '').replace('<a class="w a"', '').replace('<a class="d h"', '') \
         .replace('<a class="d a"', '').replace('<a class="l h"', '').replace('<a class="l a"', '') \
         .replace('title=', '')
print match3

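I get everything on the page after the point where the regex first matches. That's not what I want; I only want the data stored in that item. I have also tried: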

regex = re.compile(r'\[\[.*?\].*')

match2 = re.search(regex, response.body).group()
match3 = str(match2)
match3 = match3.replace('<a class="w h"', '').replace('<a class="w a"', '').replace('<a class="d h"', '') \
         .replace('<a class="d a"', '').replace('<a class="l h"', '').replace('<a class="l a"', '') \
         .replace('title=', '')
print match3

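This returns the first subsection of the 'DataStore.prime' item I'm interested in, up to the first closing ']'. However, this approach doesn't point the regex at the specific item I care about on the page. I think what I need is a mix of the two. I've tried this as a final regex: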

regex = re.compile('DataStore\.prime\(\'standings\', { stageId: \d+ } \[\[.*?\]\]\);.*', re.S)

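But now this returns a completely different part of the page. I'm almost there, but I can't get it quite right.

Can anyone help?

Thanks

EDIT

Here is a sample of the script I'm trying to scrape: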

DataStore.prime('standings', { stageId: 7794 }, [[Some sample stats here],[[Some sample stats here],[[Some sample stats here]]);

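Note that in the sample above, 'stageId: 7794' is a dynamic value that changes between the pages where this data structure appears, so it can't be hard-coded into a regex or any other parsing method.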

Answer 1

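Don't use regular expressions to parse web pages. Use an HTML parser such as Beautiful Soup.

Edit: to elaborate.

Regular expressions are meant for recognizing and manipulating regular grammars. HTML is context-free, so it can't be correctly recognized or manipulated with regular expressions. Instead, we use dedicated parsers to work with HTML. BeautifulSoup is one of the more popular Python HTML parsers.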
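As a rough sketch of the "parse first, then search" idea: the snippet below uses the standard library's html.parser instead of Beautiful Soup, purely so it runs without extra installs; Beautiful Soup would let you do the same thing by looking up the script elements. The ScriptCollector class and the sample page are made up for illustration, and the code is written for Python 3 (on Python 2.7 the module is named HTMLParser).

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self.in_script:
            self.scripts[-1] += data


# Made-up page, modelled on the sample line in the question.
html = """<html><body>
<p>DataStore.prime('standings', not the real thing)</p>
<script>var x = 1;</script>
<script>
DataStore.prime('standings', { stageId: 7794 }, [[1,'Team A'],[2,'Team B']]);
</script>
</body></html>"""

collector = ScriptCollector()
collector.feed(html)

# Only script bodies are searched, so the look-alike text in <p> is ignored.
standings_scripts = [s for s in collector.scripts
                     if "DataStore.prime('standings'" in s]
print(standings_scripts[0].strip())
```

Once the right script body is isolated, any pattern applied to it only has to cope with that one JavaScript statement rather than the whole page.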

Answer 2

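I seem to have seen dozens of questions about this in less than a month. Have you at least looked into the information that's already available? For example, there is an entire thesis about this website that details how to scrape it, how to feed the extracted information into a betting prediction algorithm, and so on.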

http://www.diva-portal.org/smash/get/diva2:655630/FULLTEXT01.pdf

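Here is an excerpt:

Using the regular expression matching technique described in Appendix C, we use the pattern "/Datastore.prime('standings', {stageId: ".✩stageID."}, [([.*\n,?)+/" and find the source code of the table. An example of what it might look like is given in Appendix C. The next step is to further extract each unique matchID from the table source code. For this, a less complicated pattern is enough, because we know that each match ID tag sits inside an HTML hyperlink, and each hyperlink uses the match ID as an attribute. For example, the following could be the hyperlink contained in the line for a fixture in which Arsenal participated: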

Answer 3

In case anyone is interested, this is what finally solved the problem:

regex = re.compile('DataStore\.prime\(\'standings\', { stageId: \d+ }, \[\[.*?\]\]?\)?;', re.S)

match2 = re.search(regex, response.body).group()
match3 = str(match2)
match3 = match3.replace('<a class="w h"', '').replace('<a class="w a"', '').replace('<a class="d h"', '') \
         .replace('<a class="d a"', '').replace('<a class="l h"', '').replace('<a class="l a"', '') \
         .replace('title=', '')
print match3

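The problem was that the regex was matching multiple instances of the final ']' brackets and the ')'. By adding '?' I now get only 0 or 1 instances of those characters, which means only the part I want is scraped.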
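For anyone who wants to experiment, here is a small, self-contained variation on the same idea. It is only a sketch: the page text is invented to resemble the sample in the question, and the pattern is a slightly stricter assumption of mine (a capture group around the array, with the lazy '.*?' ending at the first ']]);') rather than the exact regex above. The dynamic stage id is covered by '\d+', so it never needs to be hard-coded.

```python
import re

# Invented page fragment, modelled on the sample line in the question.
page = """<script>
DataStore.prime('standings', { stageId: 7794 }, [[1,'Team A',10],
[2,'Team B',8]]);
DataStore.prime('other', { x: 1 }, [[9]]);
</script>"""

pattern = re.compile(
    r"DataStore\.prime\('standings',\s*\{\s*stageId:\s*\d+\s*\},\s*"
    r"(\[\[.*?\]\])\);",  # lazy match: stop at the first "]]);"
    re.S,                  # let "." span the line breaks inside the array
)

match = pattern.search(page)
# group(1) holds only the array, without the surrounding call.
standings = match.group(1) if match else None
print(standings)
```

Because the quantifier is lazy, the match ends at the first ']]);' after the 'standings' call, so the later DataStore.prime call is left out.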

A bit of a nightmare with Regex and web pages
