I'm trying to evaluate some data with Python. A sample of the data looks like this:
****************************************************************
* SAMPLE DATA *
****************************************************************
* Date Times Severity Source Machine State
18-May-2019 16:28:18 I StatesLog Off-Line States: IALT-1
18-May-2019 16:28:19 I StatesLog Off-Line States: TdALclr-0
18-May-2019 16:28:19 I StatesLog Off-Line States: S722a1-0, S722a2-0, S722ascon-0
18-May-2019 16:28:19 I StatesLog Off-Line States: !S722a1-1, S722(OFF)-0, !S722a2-1
What I'm (ultimately) after is:
Time Data
18-May-2019 16:28:18 IALT-1
18-May-2019 16:28:19 TdALclr-0
18-May-2019 16:28:19 S722a1-0,
18-May-2019 16:28:19 S722a2-0,
18-May-2019 16:28:19 S722ascon-0
18-May-2019 16:28:19 !S722a1-1,
18-May-2019 16:28:19 S722(OFF)-0,
18-May-2019 16:28:19 !S722a2-1
With data this short I can adjust the required number of columns by hand, but since some of the files are over 100 MB, I have no way of knowing how many columns I need to fit into a DataFrame.
I tried the code below to strip out the big header:
import pandas as pd

with open('test.txt') as oldfile, open('newtest.txt', 'w') as newfile:
    newfile.write('Date Times Severity Source Machine State Data Data1 Data2\n')
    for line in oldfile:
        if '*' not in line:
            newfile.write(line)

df = pd.read_table('newtest.txt', sep=r'\s+', engine='python')
df[['Date', 'Times', 'Data', 'Data1', 'Data2']].to_csv('trial.csv')
This works up to a point, but after a while with the real data I get a parsing error from the read_table call: "Expected X fields in line Z, saw Y". I assume this is because the number of columns is taken from the top row?
I need a way to read the file, work out the maximum number of columns, and somehow pass that to pandas to avoid the error. The column names don't matter for now, since I can always adjust them later in the code.
Then hopefully the bottom of my code would give me the result I want:
df['Time'] = df['Date'].astype(str) + ' ' + df['Times']
a = df.set_index('Time').stack()
df = a[a != 0].reset_index(drop=True, level=1).reset_index(name='Data').to_csv('output.csv')
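The approach being asked for here (pre-scan the file for the widest row, then hand pandas explicit column names so no row overflows the header) can be sketched as follows. This is my own sketch, not code from the thread; the sample text, the `sample.txt` filename, and the `Label`/`DataN` column names are assumptions:

```python
import pandas as pd

# Tiny stand-in for the log file, in the question's whitespace-separated format
sample = """\
18-May-2019 16:28:18 I StatesLog Off-Line States: IALT-1
18-May-2019 16:28:19 I StatesLog Off-Line States: S722a1-0, S722a2-0, S722ascon-0
"""
with open('sample.txt', 'w') as f:
    f.write(sample)

# Pre-scan: the widest line decides how many columns pandas must allow for
with open('sample.txt') as f:
    n = max(len(line.split()) for line in f)

# Six fixed columns, then as many generic Data columns as the widest row needs;
# shorter rows are padded with NaN because names= fixes the column count
names = ['Date', 'Times', 'Severity', 'Source', 'Machine', 'Label'] + \
        ['Data{}'.format(i) for i in range(n - 6)]
df = pd.read_csv('sample.txt', sep=r'\s+', names=names)
print(df.shape)
```

Because `names=` fixes the width up front, the "Expected X fields, saw Y" error cannot occur no matter which line happens to be widest.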
Answer 0 (score: 0)
First remove `States:` from your column with str.replace. Then use this function to unnest the values into rows:
df['State'] = df['State'].str.replace('States:', '')
df = explode_str(df, 'State', ',').reset_index(drop=True)
Date Times Severity Source Machine State
0 18-May-2019 16:28:18 I StatesLog Off-Line IALT-1
1 18-May-2019 16:28:19 I StatesLog Off-Line TdALclr-0
2 18-May-2019 16:28:19 I StatesLog Off-Line S722a1-0
3 18-May-2019 16:28:19 I StatesLog Off-Line S722a2-0
4 18-May-2019 16:28:19 I StatesLog Off-Line S722ascon-0
5 18-May-2019 16:28:19 I StatesLog Off-Line !S722a1-1
6 18-May-2019 16:28:19 I StatesLog Off-Line S722(OFF)-0
7 18-May-2019 16:28:19 I StatesLog Off-Line !S722a2-1
If you also want to drop the other columns:
explode_str(df, 'State', ',')[['Date', 'State']].reset_index(drop=True)
Date State
0 18-May-2019 IALT-1
1 18-May-2019 TdALclr-0
2 18-May-2019 S722a1-0
3 18-May-2019 S722a2-0
4 18-May-2019 S722ascon-0
5 18-May-2019 !S722a1-1
6 18-May-2019 S722(OFF)-0
7 18-May-2019 !S722a2-1
The function used, from another answer:
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
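Since pandas 0.25 the same unnesting can be done without this helper, via `DataFrame.explode`. A sketch using the question's data (the two-row sample frame is my own):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['18-May-2019', '18-May-2019'],
    'State': ['IALT-1', 'S722a1-0, S722a2-0, S722ascon-0'],
})

# Split the comma-joined values into lists, then give each element its own row
out = (df.assign(State=df['State'].str.split(','))
         .explode('State')
         .reset_index(drop=True))
out['State'] = out['State'].str.strip()  # drop the space left after each comma
print(out)
```

`explode` repeats the other columns (here `Date`) for every element, which is exactly what `explode_str` does with `np.repeat`.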
Answer 1 (score: 0)
I managed to work out the columns by trial and error. I'm still fairly new to Python, so although this works, it's probably not the best or cleanest way of doing it. It does take quite a long time to work out which line has the large data columns.
This did stop Erfan's code from working, though.
import numpy as np
import pandas as pd
import csv
import os

with open('test.txt', 'r') as oldfile, open('newtest.txt', 'w') as newfile:
    for line in oldfile:
        newfile.write(line)  # Leave original file untouched and save a copy to modify

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()  # remove the "," from the data part and replace with ' '
    f.write(content.replace(',', ' '))  # all info now has ' ' separator

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(' ', ','))  # replace separator with ','

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(',,,,', ','))  # try to remove extra ,'s

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(',,,', ','))  # try to remove extra ,'s

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(',,', ','))  # try to remove extra ,'s

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(',,', ','))  # try to remove extra ,'s - still left one column with ,, not sure why?

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace('States:', ''))  # remove 'States:'

num_lines = sum(1 for line in open('newtest.txt'))  # find how many lines are in the data
x = num_lines - 10  # subtract 10 as we don't need the header
y = 10  # 10 lines in the header
max_col = 0
while x > 1:
    a = pd.read_csv('newtest.txt', header=None, skiprows=y, nrows=1)
    max_col_ln = a.shape[1]
    # print(x)  # --- used for testing to see how many lines are left
    if max_col_ln > max_col:  # read the entire file and find the largest column count needed,
        max_col = max_col_ln  # as it probably won't be on line 1
    x = x - 1
    y = y + 1

z = 0
with open('newtest2.txt', 'w') as tempfile:
    while max_col > 0:
        tempfile.write('Column' + str(z) + ',')  # create Column0, Column1 etc. for the maximum number of columns
        max_col = max_col - 1
        z = z + 1

with open('newtest2.txt', 'r') as temphead:
    headers = temphead.read().replace('\n', '')  # load headers as an index for columns

with open('newtest.txt', 'r') as oldfile, open('newtest3.txt', 'w') as tempdata:
    tempdata.write(headers + '\n')  # write headers at the top of the new temp file
    for line in oldfile:
        if '*' not in line:
            tempdata.write(line)  # write all the data but drop the lines containing *

newdata = pd.read_table('newtest3.txt')  # read the txt as a table
newdata.to_csv('data.csv', quoting=csv.QUOTE_NONE, escapechar='*', index=False)  # write to csv using * as escape char and no index

df = pd.read_csv('data.csv')
df['Time'] = df['Column0*'] + ' ' + df['Column1*']  # combine first 2 columns to make a "Time" column
cols = list(df.columns)
cols = [cols[-1]] + cols[:-1]
df = df[cols]  # swap "Time" and Column0 around
df = df.drop(['Column0*', 'Column1*', 'Column2*', 'Column3*', 'Column4*', 'Column5*'],
             axis=1).to_csv('data.csv', index=False)  # remove columns I don't require from the data

with open('data.csv', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace('*', ''))  # remove the * escape char from earlier and write back to the csv

os.remove('newtest.txt')
os.remove('newtest2.txt')
os.remove('newtest3.txt')  # a bit of housekeeping after all the changes
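The repeated `replace` passes above (`',,,,'` → `','`, then `',,,'` → `','`, and so on) can be collapsed into a single regular-expression substitution, which also explains the stray `,,` the passes kept missing: a run of five or more commas survives the fixed-width replacements. A sketch of the idea on one sample line (the line itself is taken from the question's data):

```python
import re

line = '18-May-2019 16:28:19 I StatesLog Off-Line States: S722a1-0, S722a2-0'

# Any run of commas and/or whitespace, of any length, becomes exactly one comma,
# so the result no longer depends on how many spaces separated the columns
cleaned = re.sub(r'[,\s]+', ',', line.replace('States:', '')).strip(',')
print(cleaned)
```

Applied line by line (or to the whole file contents at once), this replaces all six read/truncate/write passes with one.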
Answer 2 (score: 0)
I got an error with the original answer. The raw data sometimes has a line break in the middle of a record (I have no control over the data; this is what we get):
18-May-2019 15:06:11 I StatesLog On-Line States: S644(OFF)-0, !S644a1-1, S644(OFF)-1, !S644a2-0, S770(OFF)-1,
!S770a1-0
18-May-2019 15:06:11 I StatesLog On-Line States: S644(ON)-1, S644(ON)-0, S770(ON)-0
18-May-2019 15:06:12 I StatesLog On-Line States: I770DG-1, I770RGs-0
18-May-2019 15:06:11 I StatesLog On-Line States: S644(OFF)-0, !S644a1-1, S644(OFF)-1, !S644a2-0, S770(OFF)-1,
The error I get is:
Traceback (most recent call last)
File "explode.py", line 42, in <module>
explode_str(df, 'Bit', ',')[['Times', 'Bit']].reset_index(drop = True).to_csv('test.csv')
File "explode.py", line 9, in explode_str
i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
ValueError: count < 0
I had the original problem on both int32 and int64, and I've now moved to a 64-bit system to try to resolve it.
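One plausible cause of the crash (my reading, not confirmed in the thread): the wrapped continuation lines parse into rows whose exploded column is NaN, and `s.str.count(sep)` then yields NaN, which breaks the `repeat` call inside `explode_str`. Dropping those rows before exploding avoids the error. A sketch with a hypothetical three-row frame (the `Times`/`Bit` column names follow the traceback above):

```python
import numpy as np
import pandas as pd

def explode_str(df, col, sep):
    # Same helper as in Answer 0, but tolerant of missing values:
    # a NaN in `col` would make str.count() return NaN and crash repeat()
    df = df.dropna(subset=[col]).reset_index(drop=True)
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

df = pd.DataFrame({'Times': ['15:06:11', None, '15:06:12'],
                   'Bit': ['a-1,b-0', np.nan, 'c-1']})
out = explode_str(df, 'Bit', ',').reset_index(drop=True)
print(out)
```

A fuller fix would first merge each continuation line back onto the record it belongs to before parsing, so no data is dropped.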
Answer 3 (score: 0)
I'm new to Python, but you can call out to a Bash one-liner to get what you need.
import subprocess

filename = "test.txt"
# awk prints the field count (NF) of every line; sort -unk orders the unique
# counts numerically, so tail gives the maximum and head the minimum.
# Note: os.system() only returns the exit status, so subprocess.check_output()
# is needed to actually capture awk's output.
cmd = "awk '{print NF}' " + filename + " | sort -unk 1,1 | tail -n 1"
max_cols = int(subprocess.check_output(cmd, shell=True))
cmd = "awk '{print NF}' " + filename + " | sort -unk 1,1 | head -n 1"
min_cols = int(subprocess.check_output(cmd, shell=True))
print(min_cols, max_cols)
If the separator in the file isn't whitespace/TAB, you can add -F',' (for a csv file).
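The same min/max field counts can be computed without shelling out to awk, which also works on Windows. A sketch (the `fields_demo.txt` sample file is my own, standing in for the log):

```python
# Sample file standing in for the log (assumption: whitespace-delimited fields)
with open('fields_demo.txt', 'w') as f:
    f.write('a b c\n')
    f.write('a b c d e\n')

# len(line.split()) is the pure-Python equivalent of awk's NF
counts = [len(line.split()) for line in open('fields_demo.txt') if line.strip()]
print(min(counts), max(counts))
```

The resulting maximum is exactly the number that needs to be passed to pandas as the column count.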