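Extract information between strings from a file and write it to CSV

Date: 2019-06-20 12:49:30

Tags: python regex string csv nested

I want to extract some information from a text file (the values between marker strings, e.g. oldtime: ... oldtime!>) and write it to a CSV file. My input text file looks like this: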

=======================
oldtime:

 hours:1:hours!>

 minutes:12:minutes!> 

oldtime!>

newtime:

 hours:15:hours!>

 minutes:17:minutes!> 

newtime!>


oldtime:

 hours:11:hours!>

 minutes:22:minutes!> 

oldtime!>  


newtime:

 hours:5:hours!>

 minutes:17:minutes!> 

newtime!>  

==========================              

I started with this, but I got stuck:

with open(inputfile, 'r') as f, open(outputfile.cvs, 'a') as f1:
    f1.write("oldtime; newtime \n")
    for row in f:
        if "oldtime:" in str(row):
            temp = re.split(r'(@oldtime[\n\r]|[\n\r]@oldtime!>)', str(row))

            ???

        if "newtime:"  in str(row):
            temp = re.split(r'(@newtime[\n\r]|[\n\r]@newtime!>)', str(row))

I would like a CSV file like this as output:

oldtime  newtime
01:12     15:17
11:22     05:17

Can you help me? Thanks.

3 answers:

Answer 0 (score: 2)

Here is one approach using regex and the csv module.

For example:

import re
import csv

# newline="" lets the csv module control line endings (avoids blank rows on Windows)
with open(filename) as infile, open(filename_1, "w", newline="") as outfile:
    data = infile.read()
    hrs = re.findall(r"hours:(\d+):hours", data)       # Get all hours
    mins = re.findall(r"minutes:(\d+):minutes", data)  # Get all minutes
    # On Python 3, zip() returns an iterator that cannot be sliced, so wrap it in list()
    data = list(zip(hrs, mins))

    writer = csv.writer(outfile)                       # Write CSV
    writer.writerow(["oldtime", "newtime"])            # Header
    for m, n in zip(data[0::2], data[1::2]):
        writer.writerow([":".join(m), ":".join(n)])    # Write old time and new time
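The code above joins the captured digits as-is, so hours such as 1 or 5 are written without a leading zero, whereas the desired output in the question shows 01:12 and 05:17. A minimal sketch of one way to pad the values, assuming the blocks strictly alternate oldtime, newtime, and using an in-memory sample instead of a file (the names SAMPLE, extract_rows and to_csv are made up for this illustration):

```python
import csv
import io
import re

# Sample text in the question's format (assumption: a real script would read this from a file)
SAMPLE = """oldtime:
 hours:1:hours!>
 minutes:12:minutes!>
oldtime!>
newtime:
 hours:15:hours!>
 minutes:17:minutes!>
newtime!>
oldtime:
 hours:11:hours!>
 minutes:22:minutes!>
oldtime!>
newtime:
 hours:5:hours!>
 minutes:17:minutes!>
newtime!>
"""


def extract_rows(text):
    """Return [oldtime, newtime] pairs as zero-padded HH:MM strings."""
    hrs = re.findall(r"hours:(\d+):hours", text)
    mins = re.findall(r"minutes:(\d+):minutes", text)
    # int() + :02d pads single-digit values with a leading zero
    times = [f"{int(h):02d}:{int(m):02d}" for h, m in zip(hrs, mins)]
    # Assumes blocks strictly alternate: oldtime, newtime, oldtime, ...
    return [list(pair) for pair in zip(times[0::2], times[1::2])]


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["oldtime", "newtime"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE)
print(to_csv(rows))
```

To write a real file, the same rows could be passed to a csv.writer opened on the output path with newline="".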

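Answer 1 (score: 1)

Another solution, similar to Rakesh's, assuming your file always has the same structure (oldtime -> hours -> minutes -> newtime -> hours -> minutes ...).

  1. Extract all the numbers from the string with a regular expression: match = re.findall(r'\d+', str_file)

  2. Convert this list by joining hours and minutes: dates = [i+ ":" + j for i, j in zip(match[::2], match[1::2])]

  3. Create a DataFrame with the pandas module

  4. Export the data

Here is the code: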

# Import modules
import re
import pandas as pd

with open("../temp.txt", 'r') as f:
    # Read file as a string
    str_file = f.read()

    # Extract all numbers
    match = re.findall(r'\d+', str_file)
    print(match)
    # ['1', '12', '15', '17', '11', '22', '5', '17']

    # create dates
    dates = [i+ ":" + j for i, j in zip(match[::2], match[1::2])]
    print(dates)
    # ['1:12', '15:17', '11:22', '5:17']

    # create dataframe
    df = pd.DataFrame({"oldtime": dates[::2],
                        "newtime": dates[1::2]})
    print(df)
    #    oldtime  newtime
    # 0    1:12   15:17
    # 1   11:22    5:17

    # Export the data
    df.to_csv("output.csv", index= False)


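Edit 1: Assuming the oldtime and newtime blocks may be swapped. Here I read the file line by line and sort the oldtime and newtime values into a dictionary. There is a lot of slicing, but it works on my test file.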

# Import module
import pandas as pd

with open("../temp.txt", 'r') as f:
    # Block headers to look for, and the times collected for each one
    list_split = ["oldtime:", "newtime:"]
    dates = {"oldtime:": [], "newtime:": []}
    line = f.readline().rstrip('\n')

    while True:
        line = line.rstrip('\n')
        print([line])
        if line in list_split:
            key = line

            hours = f.readline().rstrip('\n').split(":")[1]
            minutes = f.readline().rstrip('\n').split(":")[1]

            dates[key].append(hours+':'+minutes)

        line = f.readline()
        if not line:
            break

    print(dates)
    # {'oldtime:': ['1:12', '11:22'], 'newtime:': ['15:17', '5:17']}

    # create dataframe
    df = pd.DataFrame({"oldtime": dates["oldtime:"],
                       "newtime": dates["newtime:"]})
    print(df)
    #    oldtime  newtime
    # 0    1:12   15:17
    # 1   11:22    5:17

    # Export the data
    df.to_csv("output.csv", index=False)

Edit 2:

import pandas as pd

with open("../temp.txt", 'r') as f:
    # Read file as a string
    list_split = ["oldtime:", "newtime:"]
    dates = {"oldtime": [], "newtime": []}
    line = f.readline().rstrip('\n')

    while True:
        # Ignore blank lines
        if ("oldtime:" in line) or ("newtime:" in line):
            # Process new "oldtime" or "newtime" block

            # Class : either "oldtime" or "newtime"
            class_time = line.replace(" ", "").rstrip('\n')[:-1]

            # Default hour - minute values
            hours = "24"
            minutes = "60"

            # Read next line
            line = f.readline().rstrip('\n')

            # While block not ended 
            while class_time + "!>" not in line:
                # If hour in line: update hour
                if 'hour' in line:
                    hours = line.split(":")[1]
                # If minute in line: update minute
                elif 'minute' in line:
                    minutes = line.split(":")[1]

                # Read next line
                line = f.readline().rstrip('\n')
            # End block

            # Add block read to dictionary
            dates[class_time].append(hours+':'+minutes)

        # Read next line
        line = f.readline()
        # If end of file: exit
        if not line:
            break

    # create dataframe
    df = pd.DataFrame({"oldtime": dates["oldtime"],
                       "newtime": dates["newtime"]})

    # Export the data
    df.to_csv("output.csv", index=False)
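As an alternative to stepping through the file with readline, the same block-by-block idea can be sketched with one regular expression per block. This is only an illustration: BLOCK_RE and parse_blocks are hypothetical names, the "24" and "60" placeholders mirror the defaults in the code above, and it assumes every block is closed by its matching oldtime!> or newtime!> marker.

```python
import re

# (oldtime|newtime): ... <same name>!>  with DOTALL so the body may span several lines
BLOCK_RE = re.compile(r"(oldtime|newtime):(.*?)\1!>", re.DOTALL)


def parse_blocks(text):
    """Collect 'hours:minutes' strings per block name, whatever order the blocks come in."""
    dates = {"oldtime": [], "newtime": []}
    for name, body in BLOCK_RE.findall(text):
        hours = re.search(r"hours:(\d+):hours", body)
        minutes = re.search(r"minutes:(\d+):minutes", body)
        # Fall back to the placeholder values when a field is missing from the block
        h = hours.group(1) if hours else "24"
        m = minutes.group(1) if minutes else "60"
        dates[name].append(h + ":" + m)
    return dates


# newtime comes first here, and blank lines separate the fields
sample = (
    "newtime:\n\n hours:15:hours!>\n\n minutes:17:minutes!> \n\nnewtime!>\n\n"
    "oldtime:\n\n hours:1:hours!>\n\n minutes:12:minutes!> \n\noldtime!>  \n"
)
print(parse_blocks(sample))
```

The resulting dictionary has the same shape as dates in the code above, so it could be passed to pd.DataFrame in the same way.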

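Hope this helps!

Answer 2 (score: 0)

Great question :).

Here is a simple solution I put together: split the string on the ":" character, convert the numeric strings to integers, join them with ":", and then write them to a CSV.

Here is the code: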

import csv
with open('data.txt','r') as f:
    data = f.read()
data = data.split(sep=':')
nums = []
for i in data:
    try:
        nums.append(int(i))
    except ValueError:
        pass

times = []
for i in range(len(nums)):
    if i%2 ==0:
        times.append(str(nums[i]) + ":" + str(nums[i+1]))
num_rows = len(times)/2

with open('time_data.csv','w+',newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['oldtime','newtime'])
    for i in range(len(times)):
        if i%2==0:
            writer.writerow([times[i],times[i+1]])

After reading Rakesh's answer, I wrote this:

import re
import csv
list_i = ''
file_name = 'data.txt'
file_name1 = 'data_1.txt'
with open(file_name,'r') as f, open(file_name1,'w',newline='') as f1:
    data = f.read()
    list_1 = re.findall(r'hours:\d+:hours',data)
    list_2 = re.findall(r'minutes:\d+:minutes',data)
    for i in list_1:
        list_i += i  
    list_2_i = ''
    for i in list_2:
        list_2_i += i 
    list_1 = re.findall(r'\d+',list_i)
    list_2 = re.findall(r'\d+',list_2_i)
    data = []
    for i in range(len(list_1)):
        if i%2==0:
            data.append([str(list_1[i]) + ':' + str(list_2[i]),str(list_1[i+1]) + ':' + str(list_2[i+1])])
    writer = csv.writer(f1)
    writer.writerow(['oldtime','newtime'])
    for i in data:
        writer.writerow(i)

@Rakesh, your code also raises an error for me: TypeError: 'zip' object is not subscriptable. Is there a way to fix this? :)
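
For reference, this error comes up on Python 3 because zip() returns a lazy iterator rather than a list, and an iterator cannot be indexed or sliced. A small sketch of the failure and of one common workaround (wrapping the result in list()), using made-up sample values:

```python
hrs = ["1", "15", "11", "5"]
mins = ["12", "17", "22", "17"]

# On Python 3, slicing the zip object directly raises TypeError
pairs = zip(hrs, mins)
try:
    pairs[0::2]
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Materializing the pairs as a list makes slicing work
pairs = list(zip(hrs, mins))
old = pairs[0::2]   # every even-indexed pair: the oldtime values
new = pairs[1::2]   # every odd-indexed pair: the newtime values
print(old, new)
```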