How to extract specific fields from a list in Python

Time: 2016-08-10 14:09:54

Tags: python

Below is a one-row sample from a large list. For readability, I saved 200,000 such rows to a file, one after another.

['{activities:[{activity:121,dbCount:234,totalHits:4,query:Identification', 'and', 'prioritization', 'of', 'merozoite,searchedFrom:PersistentLink,searchType:And,logTime:1469765823000},{activity:115,format:HTML,searchTerm:Identification', 'and', 'prioritization', 'of', 'merozoite,mode:View,type:Abstract,shortDbName:cmedm,pubType:Journal', 'Article,isxn:15506606,an:23776179,title:Journal', 'Of', 'Immunology', '(Baltimore,', 'Md.:', '1950),articleTitle:Identification', 'and', 'prioritization', 'of', 'merozoite', 'antigens', 'as', 'targets', 'of', 'protective', 'human', 'immunity', 'to', 'Plasmodium', 'falciparum', 'malaria', 'for', 'vaccine', 'and', 'biomarker', 'development.,logTime:1469765828000}],session:-2147364846,customerId:s2775460,groupId:main,profileId:eds}']

From the row above, I would like to extract 4 fields, namely "query", "an", "shortDbName" and "profileId".

Ideas of any kind would be greatly appreciated. Thank you very much.

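3 answers:

Answer 0: (score: 0)

Your line looks strange. However, assuming you store the line in a single string variable named 'mystring', you can do the following to parse out the value of query: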

 query = mystring[mystring.find("query:"):mystring.find("searchedFrom:")]

This yields:

query:Identification', 'and', 'prioritization', 'of', 'merozoite,
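The slice above still carries the "query:" prefix, the trailing comma, and the quote debris left over from the list. As a minimal sketch, assuming the list pieces came from splitting the original line on whitespace (so joining them with a single space restores it), a small helper can return just the value. The name value_between and the shortened sample row are hypothetical, introduced only for illustration.

```python
def value_between(text, key, next_key):
    """Return the text after 'key:' and before ',next_key:', or None if either is missing."""
    start_marker = key + ":"
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find("," + next_key + ":", start)
    if end == -1:
        return None
    return text[start:end]


# Shortened, hypothetical stand-in for one row of the list in the question.
row = ['{activities:[{activity:121,totalHits:4,query:Identification', 'and',
       'prioritization', 'of',
       'merozoite,searchedFrom:PersistentLink,searchType:And}']

# Assumption: the pieces came from a whitespace split, so a space rejoins them.
line = " ".join(row)
print(value_between(line, "query", "searchedFrom"))
# Identification and prioritization of merozoite
```

This still requires knowing which key follows the one you want, so it suits a single field better than several.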

Answer 1: (score: 0)

So, I made a few changes and used code like the following to get the desired field, query, back. But what if I want all 4 fields at once?

mystring = ['{activities:[{activity:121,dbCount:234,totalHits:4,query:Identification', 'and', 'prioritization', 'of', 'merozoite,searchedFrom:PersistentLink,searchType:And,logTime:1469765823000},{activity:115,format:HTML,searchTerm:Identification', 'and', 'prioritization', 'of', 'merozoite,mode:View,type:Abstract,shortDbName:cmedm,pubType:Journal', 'Article,isxn:15506606,an:23776179,title:Journal', 'Of', 'Immunology', '(Baltimore,', 'Md.:', '1950),articleTitle:Identification', 'and', 'prioritization', 'of', 'merozoite', 'antigens', 'as', 'targets', 'of', 'protective', 'human', 'immunity', 'to', 'Plasmodium', 'falciparum', 'malaria', 'for', 'vaccine', 'and', 'biomarker', 'development.,logTime:1469765828000}],session:-2147364846,customerId:s2775460,groupId:main,profileId:eds}']
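# Convert the list to its string form and strip any double quotes
sanitizedmystring = str(mystring).replace('"', '')
print(sanitizedmystring)
# Slice from the "query:" key up to the start of the next key
query = sanitizedmystring[sanitizedmystring.find('query:'):sanitizedmystring.find('searchedFrom:')]
print(query)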
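One possible way to get all four fields at once is sketched below. It rests on two assumptions that the question does not confirm: that the list pieces can be rejoined with single spaces, and that a value always ends at the next ",key:" pattern or at a closing brace. The names FIELD_RE and extract_row, and the shortened sample row, are hypothetical.

```python
import re

# A wanted key preceded by ',' or '{', then a lazy value that stops
# just before the next ',word:' or a closing '}'.
FIELD_RE = re.compile(r"[,{](query|an|shortDbName|profileId):(.*?)(?=,\w+:|\})")


def extract_row(row):
    """Return a dict holding the first value found for each wanted key in one row."""
    line = " ".join(row)  # assumption: the pieces came from a whitespace split
    fields = {}
    for key, value in FIELD_RE.findall(line):
        fields.setdefault(key, value)  # keep the first occurrence only
    return fields


# Shortened, hypothetical stand-in for one row of the list in the question.
row = ['{activities:[{activity:121,query:Identification', 'and',
       'prioritization', 'of',
       'merozoite,searchedFrom:PersistentLink},{activity:115,shortDbName:cmedm,pubType:Journal',
       'Article,isxn:15506606,an:23776179,title:Journal', 'Of', 'Immunology',
       '(Baltimore,', 'Md.:', '1950)}],session:-2147364846,profileId:eds}']

fields = extract_row(row)
print(fields["query"])        # Identification and prioritization of merozoite
print(fields["an"])           # 23776179
print(fields["shortDbName"])  # cmedm
print(fields["profileId"])    # eds
```

Because the data is not valid JSON (keys and values are unquoted), a pattern like this is only a heuristic. A value that itself contains text such as ",word:" would be cut short.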

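Answer 2: (score: 0)

Using the following regular expression, (query|an|dbCount|shortDbName|profileId):([A-Za-z0-9]+), we should be able to capture the key/value pairs from those fields. It matches any of the keywords you mentioned (the colon itself is not captured), followed by any run of lowercase, uppercase, or numeric characters after the colon. We then append every result found for each tag to a dictionary of the form:

key : [list of tags found]

Example code: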

import re
from collections import defaultdict

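def extract_fields(l):
    """Collect the values found for each key across all strings in l."""
    queries = []
    d = defaultdict(list)
    # Group 1 is the key; group 2 is the run of alphanumerics after the colon
    regex = r"(query|an|dbCount|shortDbName|profileId):([A-Za-z0-9]+)"

    # Gather every (key, value) pair from every string in the list
    for line in l:
        query = re.findall(regex, line)
        for match in query:
            queries.append(match)
    # Group the values by key
    for item in queries:
        d[item[0]].append(item[1])

    return d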
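To show what this approach returns, here is a small usage sketch on a shortened, hypothetical row. The function is restated compactly, with the same regular expression, so the snippet runs on its own.

```python
import re
from collections import defaultdict


def extract_fields(l):
    """Compact restatement of the function above, using the same regex."""
    d = defaultdict(list)
    regex = r"(query|an|dbCount|shortDbName|profileId):([A-Za-z0-9]+)"
    for line in l:
        for key, value in re.findall(regex, line):
            d[key].append(value)
    return d


# Shortened, hypothetical stand-in for one row of the list in the question.
row = ['{activities:[{activity:121,dbCount:234,query:Identification', 'and',
       'merozoite,searchedFrom:PersistentLink},{shortDbName:cmedm,an:23776179}],profileId:eds}']

result = dict(extract_fields(row))
print(result["an"])           # ['23776179']
print(result["shortDbName"])  # ['cmedm']
print(result["profileId"])    # ['eds']
print(result["query"])        # ['Identification']
```

Note that the query value stops at the first word. The character class [A-Za-z0-9]+ cannot cross a space, and each list element is searched separately, so multi-word values are truncated unless the pieces are joined and the pattern is widened.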