Question

I need to recursively walk through a JSON file (the response to an API POST request) and extract the strings that have ["text"] as their key, e.g. {"text":"this is a string"}.

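I need to start parsing from the source that has the oldest date in its metadata, extract the strings from that source, then move on to the second-oldest source, and so on. The JSON file may be heavily nested, and the level at which the strings sit may change from time to time.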

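The problem: there are many keys called ["text"] and I don't need all of them; I only need the ones whose value is a string. Better still, the "text":"string" pair I need is always in the same object {} as "type":"sentence". See the image.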

What I am asking

Modify the second piece of code below so that it walks the file recursively and extracts the ["text"] values only when they are in the same object {} as "type":"sentence".

Below is a snippet of the JSON file (in green, the text and metadata I need; in red, the parts I don't need to extract):

Link to the full JSON sample: http://pastebin.com/0NS5BiDk

What I have done so far:

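1) The simple way: convert the JSON file to a string and search for the content between double quotes (""), because in all the JSON POST responses the "strings" I need are the only thing between double quotes. However, this option prevents me from ordering the sources beforehand, so it is not good enough.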

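import re

# s (presumably a requests session), url2 and payload1 are defined earlier in the script
r1 = s.post(url2, data=payload1)
j = str(r1.json())

# grab everything that sits between a pair of double quotes
sentences_list = re.findall(r'"(.+?)"', j)

for numentries, sentence in enumerate(sentences_list, start=1):
    print(sentence)
    print(numentries)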

2) The smarter way: recursively walk through the JSON file and extract the ["text"] values

def get_all(myjson, key):
    if type(myjson) is dict:
        for jsonkey in (myjson):
            if type(myjson[jsonkey]) in (list, dict):
                get_all(myjson[jsonkey], key)
            elif jsonkey == key:
                print (myjson[jsonkey])
    elif type(myjson) is list:
        for item in myjson:
            if type(item) in (list, dict):
                get_all(item, key)

print(get_all(r1.json(), "text"))

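It extracts all the values that have ["text"] as key. Unfortunately, there are other things in the file (which I don't need) that also have ["text"] as key, so it returns text that I don't need.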

Please advise.

Update

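I have written two pieces of code to sort the list of objects by a certain key. The first sorts by the 'text' of the XML. The second sorts by the 'containing from' value.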

The first one works, but some XMLs, even though they have a higher number, actually contain documents that are older than I expected.

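For the second piece of code, the format of 'containing from' is inconsistent, and sometimes the value is not present at all. The second one also gives me an error, and I cannot figure out why - string indices must be integers.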

# 1st code (it works but not ideal)

j=r1.json()

list = []
for row in j["tree"]["children"][0]["children"]:
    list.append(row)

newlist = sorted(list, key=lambda k: k['text'][-9:])
print(newlist)

# 2nd code I need something to expect missing values and to solve the
# list index error
list = []
for row in j["tree"]["children"][0]["children"]:
    list.append(row)

def date(key):
    return dparser.parse((' '.join(key.split(' ')[-3:])),fuzzy=True)

def order(list_to_order):
    try:
        return sorted(list_to_order,
                      key=lambda k: k[date(["metadata"][0]["value"])])
    except ValueError:
        return 0

print(order(list))
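A note on the error above: the expression ["metadata"][0]["value"] builds a one-element list literal, takes its first element (the string "metadata"), and then indexes that string with "value", which is what raises "string indices must be integers". The lookup has to start from the row k itself. Below is a minimal sketch of a sort that also tolerates missing or unparseable values. It assumes each row is a dict with an optional "metadata" list whose first entry has a "value" string ending in a date. The sample rows, the DATE_FORMATS tuple and the helper names are hypothetical, and it uses only the standard library; dateutil's fuzzy parser could be swapped into parse_date instead.

```python
from datetime import datetime

# Hypothetical date formats; adjust to whatever the real metadata contains.
DATE_FORMATS = ("%d %B %Y", "%d %b %Y", "%Y-%m-%d")

def parse_date(value):
    """Return a datetime parsed from the tail of value, or None if nothing matches."""
    if not isinstance(value, str):
        return None
    words = value.split()
    # Try the last three words first (e.g. "1 January 2010"), then the last word alone.
    for candidate in (" ".join(words[-3:]), words[-1] if words else ""):
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(candidate, fmt)
            except ValueError:
                continue
    return None

def row_date(row):
    """Look up row["metadata"][0]["value"] defensively and parse it."""
    metadata = row.get("metadata") or []
    if not metadata or not isinstance(metadata[0], dict):
        return None
    return parse_date(metadata[0].get("value"))

def order(rows):
    """Sort rows oldest first; rows without a usable date go last."""
    def sort_key(row):
        date = row_date(row)
        return (date is None, date or datetime.max)
    return sorted(rows, key=sort_key)

# Hypothetical rows standing in for j["tree"]["children"][0]["children"].
rows = [
    {"text": "b.xml", "metadata": [{"value": "Period from 5 March 2012"}]},
    {"text": "c.xml", "metadata": []},
    {"text": "a.xml", "metadata": [{"value": "Period from 1 January 2010"}]},
]
print([row["text"] for row in order(rows)])
```

Returning a (missing, date) tuple as the sort key keeps undated rows at the end without raising, so no try/except is needed around sorted().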

Answer 1

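I think this will do what you want as far as selecting the right strings goes. I also changed the type checking to use isinstance(), which is considered a better approach because it supports object-oriented polymorphism.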

import json
_NUL = object()  # unique value guaranteed to never be in JSON data

def get_all(myjson, kind, key):
    """ Recursively find all the values of key in all the dictionaries in myjson
        with a "type" key equal to kind.
    """
    if isinstance(myjson, dict):
        key_value = myjson.get(key, _NUL)  # _NUL if key not present
        if key_value is not _NUL and myjson.get("type") == kind:
            yield key_value
        for jsonkey in myjson:
            jsonvalue = myjson[jsonkey]
            for v in get_all(jsonvalue, kind, key):  # recursive
                yield v
    elif isinstance(myjson, list):
        for item in myjson:
            for v in get_all(item, kind, key):  # recursive
                yield v    

with open('json_sample.txt', 'r') as f:
    data = json.load(f)

numentries = 0
for text in get_all(data, "sentence", "text"):
    print(text)
    numentries += 1

print('\nNumber of "text" entries found: {}'.format(numentries))
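To see the filtering on something small, here is a self-contained sketch of the same idea run against a hypothetical in-memory document instead of a file. The name iter_sentences and the sample data are made up for illustration. It uses yield from (Python 3.3+) in place of the inner loops, and it additionally checks that the value is a string, which is an assumption taken from the question rather than part of the code above.

```python
def iter_sentences(node, kind="sentence", key="text"):
    """Yield node[key] for every dict whose "type" equals kind, at any depth."""
    if isinstance(node, dict):
        # Only accept string values that sit next to "type": kind in the same object.
        if node.get("type") == kind and isinstance(node.get(key), str):
            yield node[key]
        for value in node.values():
            yield from iter_sentences(value, kind, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_sentences(item, kind, key)

# Hypothetical document: only the two "sentence" objects should match.
sample = {
    "text": "doc1.xml",
    "children": [
        {"type": "sentence", "text": "this is a string"},
        {"type": "heading", "text": "not wanted"},
        {"children": [{"type": "sentence", "text": "nested string"}]},
    ],
}
print(list(iter_sentences(sample)))
```

The top-level "text" and the one inside the "heading" object are skipped because neither shares an object with "type": "sentence".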

Recursively walk through a JSON file extracting SELECTED strings
