Using Python

Time: 2016-07-10 08:03:27

Tags: python pandas text-parsing string-parsing text-extraction


Sample first line of the event log file. Here I have successfully extracted everything except the last key-value pair:

{"event_type":"ActionClicked","event_timestamp":1451583172592,"arrival_timestamp":1451608731845,"event_version":"3.0",
  "application":{"app_id":"7ffa58dab3c646cea642e961ff8a8070","cognito_identity_pool_id":"us-east-1:
    4d9cf803-0487-44ec-be27-1e160d15df74","package_name":"com.think.vito","sdk":{"name":"aws-sdk-android","version":"2.2.2"}
    ,"title":"Vito","version_name":"1.0.2.1","version_code":"3"},"client":{"client_id":"438b152e-5b7c-4e99-9216-831fc15b0c07",
      "cognito_id":"us-east-1:448efb89-f382-4975-a1a1-dd8a79e1dd0c"},"device":{"locale":{"code":"en_GB","country":"GB",
        "language":"en"},"make":"samsung","model":"GT-S5312","platform":{"name":"ANDROID","version":"4.1.2"}},
  "session":{"session_id":"c15b0c07-20151231-173052586","start_timestamp":1451583052586},"attributes":{"OfferID":"20186",
    "Category":"40000","CustomerID":"304"},"metrics":{}}

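Hello everyone, I am trying to extract content from the event log file shown in the attached image. The requirement is to get the customer ID, offer ID and category; these are the important variables I need to extract from the event log file. The file is in CSV format. I tried using regular expressions, but they do not work reliably because, as you can see, each column has a different format. For example, the first row has category, customer ID and offer ID, while the second row is completely blank. With regular expressions we would have to account for every possible condition, and we have 14,000 samples in the event log file... #JSON #Parsing #Python #Pandas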

2 Answers:

Answer 0 (score: 2)

Edit

After your edit, the data now appears to be JSON. You can still use literal_eval as shown below, or you can use the json module:

import json

with open('event.log') as events:
    for line in events:
        event = json.loads(line)
        # process event dictionary

To access CustomerID, OfferID, Category and so on, you need to access the nested dictionary stored under the 'attributes' key of the event dictionary:

print(event['attributes']['CustomerID'])
print(event['attributes']['OfferID'])
print(event['attributes']['Category'])

If some keys might be missing, use dict.get() instead:

print(event['attributes'].get('CustomerID'))
print(event['attributes'].get('OfferID'))
print(event['attributes'].get('Category'))

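Now you will get None if a key is missing.

You can extend this principle to access other items in the dictionary.

If I understand your question, you also want to create a CSV file containing the extracted fields. You can pass the extracted values to csv.DictWriter like this: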
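
As a minimal sketch of how this could extend to the cases described in the question (blank rows, or records with no attributes at all), the helper below skips empty lines and falls back to an empty dictionary when 'attributes' is absent. The function name extract_attributes, the FIELDS tuple and the sample lines are made up for illustration; they are not part of the original log or of any library.

```python
import json

FIELDS = ('CustomerID', 'OfferID', 'Category')

def extract_attributes(line, fields=FIELDS):
    """Return a dict of the requested attribute values, or None for a blank line.

    Hypothetical helper written for this example.
    """
    line = line.strip()
    if not line:
        return None  # blank row: nothing to parse
    event = json.loads(line)
    # 'attributes' may be absent or null; treat both as an empty dictionary
    attributes = event.get('attributes') or {}
    return {field: attributes.get(field) for field in fields}

# Illustrative sample lines (assumed, shortened from the real log format)
sample_lines = [
    '{"event_type":"ActionClicked","attributes":{"OfferID":"20186","Category":"40000","CustomerID":"304"},"metrics":{}}',
    '',
    '{"event_type":"SessionStart","attributes":{},"metrics":{}}',
]

for sample in sample_lines:
    print(extract_attributes(sample))
```

Returning None for blank rows keeps them distinguishable from events that were parsed but simply lack the three fields.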

import csv
import json

with open('event.log') as events, open('output.csv', 'w') as csv_file:
    fields = ['CustomerID', 'OfferID', 'Category']
    writer = csv.DictWriter(csv_file, fields)
    writer.writeheader()
    for line in events:
        event = json.loads(line)
        writer.writerow(event['attributes'])

DictWriter will simply leave a field blank when the dictionary is missing that key.
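
One caveat, offered as a hedge rather than a correction: by default DictWriter raises ValueError when a dictionary contains keys that are not in the field list, which could happen if some events carry extra attributes. The sketch below assumes that may occur and passes extrasaction='ignore'. It writes to an in-memory buffer so it can run on its own; the sample records and the "Screen" key are invented for the example.

```python
import csv
import io
import json

fields = ['CustomerID', 'OfferID', 'Category']

# Illustrative records (assumed): one complete, one with an extra key, one empty
lines = [
    '{"attributes":{"OfferID":"20186","Category":"40000","CustomerID":"304"}}',
    '{"attributes":{"CustomerID":"305","Screen":"home"}}',
    '{"attributes":{}}',
]

buffer = io.StringIO()  # stands in for open('output.csv', 'w', newline='')
# restval fills missing fields; extrasaction='ignore' drops unknown keys
writer = csv.DictWriter(buffer, fields, restval='', extrasaction='ignore')
writer.writeheader()
for line in lines:
    event = json.loads(line)
    writer.writerow(event.get('attributes', {}))

csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file on Python 3, opening it with newline='' avoids doubled line endings on Windows.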

Original answer

The data is not in CSV format; it appears to contain Python dictionary strings. These can be parsed into Python dictionaries using ast.literal_eval():

from ast import literal_eval

with open('event.log') as events:
    for line in events:
        event = literal_eval(line)
        # process event dictionary
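
On choosing between the two parsers: literal_eval only understands Python literals, so it handles the sample line but would reject JSON records that contain true, false or null. A small sketch with made-up records (the keys and values are assumptions for illustration):

```python
import json
from ast import literal_eval

plain = '{"CustomerID": "304", "metrics": {}}'
with_null = '{"CustomerID": null, "active": true}'

# Both parsers agree when the record only uses strings, numbers and dicts
same_result = literal_eval(plain) == json.loads(plain)

# JSON-only tokens are not Python literals, so literal_eval rejects them
try:
    literal_eval(with_null)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False

parsed = json.loads(with_null)
print(same_result, literal_eval_ok, parsed)
```

For a log that is known to be JSON, json.loads is therefore the safer default.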

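Answer 1 (score: 1)

This may not be the most efficient way to convert nested JSON records from a text file (one record per line) into a DataFrame object, but it does work.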

import json
import pandas as pd

# Read one JSON record per line, skipping blank lines.
with open('path_to_your_text_file.txt') as f:
    records = [json.loads(line) for line in f if line.strip()]

# Flatten nested keys into dotted column names such as 'attributes.CustomerID'.
# pd.json_normalize needs pandas >= 1.0; older versions provide
# pandas.io.json.json_normalize instead.
e = pd.json_normalize(records)
print(e.head())
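
Once the records are flattened into a DataFrame, the three fields from the question can be selected by their dotted column names. This is a sketch under assumptions: it uses pd.json_normalize (pandas 1.0 or later), and the two sample records are invented and much shorter than real log lines.

```python
import json
import pandas as pd

# Illustrative records (assumed): one with all three fields, one with none
lines = [
    '{"event_type":"ActionClicked","attributes":{"OfferID":"20186","Category":"40000","CustomerID":"304"}}',
    '{"event_type":"SessionStart","attributes":{}}',
]
records = [json.loads(line) for line in lines]
frame = pd.json_normalize(records)

wanted = ['attributes.CustomerID', 'attributes.OfferID', 'attributes.Category']
# reindex keeps the requested columns and adds NaN columns for any that are missing
subset = frame.reindex(columns=wanted)
# Drop the 'attributes.' prefix from the column names
subset.columns = [name.split('.', 1)[1] for name in wanted]
print(subset)
```

Using reindex rather than plain column indexing avoids a KeyError when none of the events in a file carries one of the fields.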