Question

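I'm trying to parse some input text from a file; the text originally came from the Twitter API. The file is plain text, so in this case I'm not actually working with the JSON itself. Here is a snippet of the input text: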

.....HootSuite</a>", "text": "For independent news reports on the crisis in #Japan, 
see @DemocracyNow News Archive: http://ow.ly/4ht9Q
#nuclear #Fukushima #rdran #japon", "created_at": "Sat Mar 19.....

Basically, I need to grab this:

"text": "For independent news reports on the crisis in #Japan, see @DemocracyNow 
News Archive: http://ow.ly/4ht9Q #nuclear #Fukushima #rdran #japon"

Here are the two patterns I've been trying to get working, but I'm running into some problems:

    re.findall('"text":[^_]*',line)
    re.findall('"text":[^:}]+',line)

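The first one grabs everything up to the "created" that follows the part I want. The second one sort of works too, but when the text contains a ":" it stops before the end of the message.

Can anyone with some RegEx experience point me in the right direction?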

Answer 1

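If you're using the Twitter API, I'd imagine it returns JSON to you. JSON supports arbitrary nesting, and a regular expression will never parse it correctly in every scenario. You'd be better served by a JSON parser. Since YAML is a superset of JSON, you could also use a YAML parser. I'd take a look at PyYAML. (That's the one I know; there may well be dedicated JSON parsers too.)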

Parsing is then as simple as:

import yaml
# twitter_response is assumed to hold the JSON string returned by the API.
results = yaml.safe_load(twitter_response)
print(results["text"])  # This would contain the string you're interested in.
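
Python's standard library also ships a `json` module, so a YAML parser is not strictly needed. The following is a minimal sketch, assuming each record in the file is one complete JSON object; the `raw` string is a made-up stand-in modeled on the question's snippet, not real API output.

```python
import json

# Hypothetical record, shaped like the snippet in the question.
raw = (
    '{"source": "HootSuite", '
    '"text": "see @DemocracyNow News Archive: http://ow.ly/4ht9Q #nuclear", '
    '"created_at": "Sat Mar 19"}'
)

tweet = json.loads(raw)   # parse the whole object instead of pattern-matching it
print(tweet["text"])      # colons inside the value need no special handling
```

Because the parser understands where each string starts and ends, the ":" inside the URL does not interfere with the lookup.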

Answer 2

Use simplejson to parse the JSON.

Follow this tutorial: http://blogs.openshine.com/pvieytes/2011/05/18/parsing-twitter-user-timeline-with-python/

Answer 3

JSON is a simple format, and if you're only trying to do something trivial you don't always need a parser. Consider this example line:

>>> line = """{ "text" : "blah blah foo", "other" : "blah blah bar" }"""

Here are two ways to do what you want.

Using a regular expression:

>>> import re
>>> m = re.search(r'"text" *: *"([^"]*)', line)
>>> m.group()
'"text" : "blah blah foo'
>>> m.group(1)
'blah blah foo'
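
The pattern above stops at the first double quote, so it would cut the value short if the text contained an escaped quote (`\"`). One possible refinement, offered as a sketch rather than a full JSON parser, is to accept either an ordinary character or a backslash followed by any character. The helper name `extract_text` and the sample string are illustrative assumptions; the sample is adapted from the snippet in the question.

```python
import re

# "text", optional whitespace around the colon, then a quoted value in which
# a backslash escapes the following character (so \" does not end the match).
TEXT_RE = re.compile(r'"text"\s*:\s*"((?:[^"\\]|\\.)*)"', re.DOTALL)

def extract_text(blob):
    """Return the raw (still-escaped) value of the first "text" field, or None."""
    m = TEXT_RE.search(blob)
    return m.group(1) if m else None

sample = (
    'HootSuite</a>", "text": "For independent news reports on the crisis in #Japan,\n'
    'see @DemocracyNow News Archive: http://ow.ly/4ht9Q\n'
    '#nuclear #Fukushima #rdran #japon", "created_at": "Sat Mar 19'
)
print(extract_text(sample))
```

The match runs across the line breaks and past the ":" in the URL, and ends only at the closing quote of the value. Escape sequences are returned as written, not decoded.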

Using eval (JSON is a very Pythonic format, so this line is also a valid Python dict literal):

>>> d = eval(line)
>>> d['text']
'blah blah foo'
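
`eval` executes whatever expression the line contains, which is risky for text fetched over the network. A safer variant, not part of the original answer, is `ast.literal_eval` from the standard library, which accepts only Python literals. Like `eval`, it works here only because this particular line happens to be valid Python as well as valid JSON.

```python
import ast

line = '{ "text" : "blah blah foo", "other" : "blah blah bar" }'

d = ast.literal_eval(line)   # evaluates literals only, never function calls
print(d["text"])

# JSON-only keywords such as false are not Python literals, so this is refused.
try:
    ast.literal_eval('{ "text" : "blah", "truncated" : false }')
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

The second call shows the limit of the trick: real API responses often contain `true`, `false`, or `null`, and for those a JSON parser is the more reliable choice.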

How do I get my RegEx to capture the text on both sides of a colon?
