Below is one sample line from a large list. For readability, I saved 200,000 such lines one after another to a file.
['{activities:[{activity:121,dbCount:234,totalHits:4,query:Identification', 'and', 'prioritization', 'of', 'merozoite,searchedFrom:PersistentLink,searchType:And,logTime:1469765823000},{activity:115,format:HTML,searchTerm:Identification', 'and', 'prioritization', 'of', 'merozoite,mode:View,type:Abstract,shortDbName:cmedm,pubType:Journal', 'Article,isxn:15506606,an:23776179,title:Journal', 'Of', 'Immunology', '(Baltimore,', 'Md.:', '1950),articleTitle:Identification', 'and', 'prioritization', 'of', 'merozoite', 'antigens', 'as', 'targets', 'of', 'protective', 'human', 'immunity', 'to', 'Plasmodium', 'falciparum', 'malaria', 'for', 'vaccine', 'and', 'biomarker', 'development.,logTime:1469765828000}],session:-2147364846,customerId:s2775460,groupId:main,profileId:eds}']
From the line above, I would like to extract four fields: "query", "an", "shortDbName" and "profileId".
Any ideas of any kind would be greatly appreciated. Thank you very much.
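Answer 0 (score: 0)
Your line looks odd. However, assuming you have stored it in a single string variable named 'mystring', you can do the following to extract the value of query: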
query = mystring[mystring.find("query:"):mystring.find("searchedFrom:")]
This yields:
query:Identification', 'and', 'prioritization', 'of', 'merozoite,
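The slice above still carries the "query:" prefix, the trailing comma and the list's quote characters. As a rough sketch of the same find-and-slice idea, assuming the tokens have first been joined with spaces into one string and that the wanted values contain no ',' or '}', a small helper (the name value_of is made up for illustration) could look like this:

```python
def value_of(record, key):
    """Return the text that follows `key:` up to the next ',' or '}'.

    The key must come right after '{' or ',' so that a short key such as
    'an' is not found inside a longer key. Returns None if it is absent.
    """
    for opener in ("{", ","):
        marker = opener + key + ":"
        start = record.find(marker)
        if start != -1:
            start += len(marker)
            # The value ends at the first ',' or '}' after it, if any.
            candidates = (record.find(",", start), record.find("}", start))
            ends = [i for i in candidates if i != -1]
            end = min(ends) if ends else len(record)
            return record[start:end]
    return None


# Shortened stand-in for one joined record (an assumption, not the real data)
record = ("{activities:[{activity:121,query:Identification and prioritization "
          "of merozoite,searchedFrom:PersistentLink}],session:-1,profileId:eds}")
print(value_of(record, "query"))
print(value_of(record, "profileId"))
```

Calling it once per key gives all four fields, at the cost of scanning the string several times.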
Answer 1 (score: 0)
So, I made a few changes and used code like the following to get the desired query field, but what if I want all four fields at once?
mystring = ['{activities:[{activity:121,dbCount:234,totalHits:4,query:Identification', 'and', 'prioritization', 'of', 'merozoite,searchedFrom:PersistentLink,searchType:And,logTime:1469765823000},{activity:115,format:HTML,searchTerm:Identification', 'and', 'prioritization', 'of', 'merozoite,mode:View,type:Abstract,shortDbName:cmedm,pubType:Journal', 'Article,isxn:15506606,an:23776179,title:Journal', 'Of', 'Immunology', '(Baltimore,', 'Md.:', '1950),articleTitle:Identification', 'and', 'prioritization', 'of', 'merozoite', 'antigens', 'as', 'targets', 'of', 'protective', 'human', 'immunity', 'to', 'Plasmodium', 'falciparum', 'malaria', 'for', 'vaccine', 'and', 'biomarker', 'development.,logTime:1469765828000}],session:-2147364846,customerId:s2775460,groupId:main,profileId:eds}']
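# Join the tokens back into one string; str(mystring).replace('"', '') left the list's quotes in place
sanitizedmystring = ' '.join(mystring)
print(sanitizedmystring)
# Slice from 'query:' up to (not including) 'searchedFrom:'
query = sanitizedmystring[sanitizedmystring.find('query:'):sanitizedmystring.find('searchedFrom:')]
print(query)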
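One possible way to get all four fields in a single pass is sketched below. It assumes the tokens can be joined with spaces into one record, that every key directly follows '{' or ',', and that the wanted values contain no ',' or '}'. The names FIELDS, PATTERN and extract_all are made up for this example.

```python
import re

FIELDS = ("query", "an", "shortDbName", "profileId")
# A key must follow '{' or ',' so that 'an:' cannot match inside a longer key.
# The value is everything up to the next ',' or '}'.
PATTERN = re.compile(r"[{,](%s):([^,}]*)" % "|".join(FIELDS))


def extract_all(tokens):
    """Join the tokens into one record and return {field: value}.

    If a field occurs more than once, the last occurrence wins.
    """
    text = " ".join(tokens)
    return {key: value for key, value in PATTERN.findall(text)}


# Shortened stand-in for the list above (an assumption, not the real data)
sample = ['{activities:[{activity:121,query:Identification', 'and',
          'merozoite,searchedFrom:PersistentLink},{activity:115,shortDbName:cmedm,pubType:Journal',
          'Article,isxn:15506606,an:23776179,title:Journal', 'Of',
          'Immunology}],session:-2147364846,profileId:eds}']
print(extract_all(sample))
```

Fields whose values can contain commas, such as title in the sample line, would be cut short by this pattern, so it is only suitable for the four fields asked about.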
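Answer 2 (score: 0)
Using the following regular expression: (query|an|dbCount|shortDbName|profileId):([A-Za-z0-9]+)
we should be able to capture the key/value pairs from these fields. The first group matches any of the keywords you mentioned, and the second group matches the run of lowercase, uppercase or numeric characters that follows the colon. Note that this character class excludes spaces and punctuation, so a multi-word value such as query is captured only up to its first word. We then append all the results found for each key to a dictionary of the form:
key : [list of tags found]
Example code: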
import re
from collections import defaultdict

def extract_fields(l):
    queries = []
    d = defaultdict(list)
    regex = r"(query|an|dbCount|shortDbName|profileId):([A-Za-z0-9]+)"
    # Collect every (key, value) pair found in each string of the list
    for line in l:
        query = re.findall(regex, line)
        for match in query:
            queries.append(match)
    # Group the values under their key
    for item in queries:
        d[item[0]].append(item[1])
    return d
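To run something like this over the whole file, here is a minimal sketch. It assumes each saved line is the text of a Python list of strings, as in the sample line from the question, and that the wanted values contain no ',' or '}'. The name extract_from_lines and the file name records.txt are hypothetical. The tokens are joined before matching and a wider value pattern is swapped in, so multi-word values are kept whole.

```python
import ast
import re
from collections import defaultdict

# Key must follow '{' or ','; value runs up to the next ',' or '}'
FIELD_RE = re.compile(r"[{,](query|an|shortDbName|profileId):([^,}]*)")


def extract_from_lines(lines):
    """Collect the field values from an iterable of list-literal lines."""
    found = defaultdict(list)
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        tokens = ast.literal_eval(raw)  # one saved line -> list of str
        text = " ".join(tokens)
        for key, value in FIELD_RE.findall(text):
            found[key].append(value)
    return found


# Two made-up lines standing in for the file contents
demo_lines = [
    "['{a:1,query:foo', 'bar,an:12,shortDbName:cmedm}],profileId:eds}']\n",
    "\n",
    "['{query:baz,an:13,profileId:x}']",
]
fields = extract_from_lines(demo_lines)
print(dict(fields))

# With a real file (hypothetical name), the file object can be passed directly:
# with open("records.txt") as fh:
#     fields = extract_from_lines(fh)
```

Passing the file object itself means the 200,000 lines are read one at a time rather than loaded into memory at once.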