How to read data from a .log file using pandas

Time: 2019-10-21 08:31:47

Tags: python pandas

I have a log file containing 100 pages of data from a web-scraping script. The .log file reads like this:

Title: Canon EF 100mm f/2.8L Macro IS USM
Price: 6�900 kr
Link: https://www.finn.no/bap/forsale/ad.html?finnkode=161065896
21-Oct-19 10:21:14 - Found:
Title: Canon EF 100mm f/2.8L Macro IS USM
Price: 7�500 kr
Link: https://www.finn.no/bap/forsale/ad.html?finnkode=155541389
21-Oct-19 10:21:14 - Found:
Title: Panasonic Lumix G 25mm F1.4 ASPH
Price: 3�200 kr
Link: https://www.finn.no/bap/forsale/ad.html?finnkode=161066674

I want to import this data and export it to Excel in this shape:

title           price      link
canon 100mm     6900kr     https

2 answers:

Answer 0 (score: 0)

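If the entries in your log file do not always appear in this order, you will need a different approach: the function below simply looks for the "Title", "Price" and "Link" text on each line and appends the value to the matching list. To convert the lists to a DataFrame, they must all have the same length. Let me know if it works.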

import pandas as pd


def log_to_frame(location="./datalake/file.log"):
    # Collect each field in its own list; rows are matched up by position
    title_list = []
    price_list = []
    link_list = []
    with open(location, mode='r', encoding='UTF-8') as f:
        for line in f:
            if line.startswith("Title"):
                title = line.split(": ", 1)[1].rstrip()
                title_list.append(title)
            elif line.startswith("Price"):
                # drop the garbled thousands separator, e.g. "6�900 kr" -> "6900 kr"
                price = line.split(": ", 1)[1].replace("�", "").rstrip()
                price_list.append(price)
            elif line.startswith("Link"):
                link = line.split(": ", 1)[1].rstrip()
                link_list.append(link)
    # pd.DataFrame raises ValueError if the lists differ in length
    main_df = pd.DataFrame({"title": title_list, "price": price_list, "link": link_list})
    return main_df


log_df = log_to_frame()
log_df.to_excel("log.xlsx", index=False)
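
If some entries might be missing a field, the three lists above would end up with different lengths. A minimal sketch of a record-based variant follows, assuming every entry starts with a "Title:" line as in the sample log. The name parse_records and the shortened sample lines are made up for illustration and are not part of the original answer.

```python
def parse_records(lines):
    """Group 'Key: value' lines into one dict per record.

    Assumption: every record starts with a 'Title:' line, as in the sample log.
    """
    fields = ("Title", "Price", "Link")
    records = []
    for line in lines:
        # split on the first ': ' only, so URLs keep their own colons
        key, sep, value = line.partition(": ")
        if not sep or key not in fields:
            continue  # skip timestamp lines such as '... - Found:'
        if key == "Title" or not records:
            # a Title line opens a new record; missing keys stay None
            records.append(dict.fromkeys(fields))
        records[-1][key] = value.strip()
    return records


# Hypothetical sample: the second entry has no Price line
sample = [
    "Title: Canon EF 100mm f/2.8L Macro IS USM\n",
    "Price: 6 900 kr\n",
    "Link: https://www.finn.no/bap/forsale/ad.html?finnkode=161065896\n",
    "21-Oct-19 10:21:14 - Found:\n",
    "Title: Panasonic Lumix G 25mm F1.4 ASPH\n",
    "Link: https://www.finn.no/bap/forsale/ad.html?finnkode=161066674\n",
]
records = parse_records(sample)
```

Passing the result to pd.DataFrame(records) should give one row per entry, with an empty cell where a field was missing, and to_excel can then be called as above.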

Answer 1 (score: 0)

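You can load the data into a DataFrame as an ordinary table and then use the DataFrame's loc and reset_index functions to merge the columns. This assumes that each line contains exactly one ":" character, separating the "key" column from the "value" column, and that every record has one line for each key.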

import pandas as pd

# skipinitialspace drops the blank that follows each ':'
p = pd.read_table("table.log", sep=':', header=None, skipinitialspace=True)
df = pd.DataFrame()
keys = p[0].unique()  # all unique keys, in order of first appearance

for key in keys:
  # get all values with the current key and re-index them from 0...n
  col_data = p.loc[p[0] == key, 1].reset_index(drop=True)
  # put this in a new column named after the key
  df[key] = col_data
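
Note that the sample log in the question does not meet the one-":"-per-line assumption: the URLs and the timestamp lines contain extra colons. A rough sketch of the same column-by-key idea that splits only on the first ":" is shown below. The function name columns_by_key and the sample lines are assumptions for illustration, not part of the original answer.

```python
from collections import defaultdict


def columns_by_key(lines, keys=("Title", "Price", "Link")):
    """Collect the values for each known key into one list per key."""
    columns = defaultdict(list)
    for line in lines:
        # partition splits at the first ':' only, so 'https://...' stays whole
        key, sep, value = line.partition(":")
        if sep and key in keys:
            columns[key].append(value.strip())
    return dict(columns)


# Hypothetical sample taken from the shape of the log in the question
sample = [
    "Title: Canon EF 100mm f/2.8L Macro IS USM\n",
    "Price: 7 500 kr\n",
    "Link: https://www.finn.no/bap/forsale/ad.html?finnkode=155541389\n",
    "21-Oct-19 10:21:14 - Found:\n",
]
cols = columns_by_key(sample)
```

The resulting dict of lists can be handed to pd.DataFrame(cols), provided every key occurs the same number of times.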