I'm trying to evaluate some data with Python. A sample of the data looks like this:
****************************************************************
* SAMPLE DATA *
****************************************************************
* Date Times Severity Source Machine State
18-May-2019 16:28:18 I StatesLog Off-Line States: IALT-1
18-May-2019 16:28:19 I StatesLog Off-Line States: TdALclr-0
18-May-2019 16:28:19 I StatesLog Off-Line States: S722a1-0, S722a2-0, S722ascon-0
18-May-2019 16:28:19 I StatesLog Off-Line States: !S722a1-1, S722(OFF)-0, !S722a2-1
What I'm (ultimately) after is:
Time Data
18-May-2019 16:28:18 IALT-1
18-May-2019 16:28:19 TdALclr-0
18-May-2019 16:28:19 S722a1-0,
18-May-2019 16:28:19 S722a2-0,
18-May-2019 16:28:19 S722ascon-0
18-May-2019 16:28:19 !S722a1-1,
18-May-2019 16:28:19 S722(OFF)-0,
18-May-2019 16:28:19 !S722a2-1
With data this short I can adjust the required number of columns by hand, but since some of the files are over 100 MB, I have no way of knowing how many columns I need to fit into a DataFrame.
I tried the code below to strip out the big header:
import pandas as pd

with open('test.txt') as oldfile, open('newtest.txt', 'w') as newfile:
    newfile.write('Date Times Severity Source Machine State Data Data1 Data2\n')
    for line in oldfile:
        if '*' not in line:
            newfile.write(line)

df = pd.read_table('newtest.txt', sep=r'\s+', engine='python')
df[['Date', 'Times', 'Data', 'Data1', 'Data2']].to_csv('trial.csv')
This works up to a point, but after a while with the real data I get a parsing error from the read_table call: "Expected X fields in line Z, saw Y". I assume this is because the number of columns is taken from the top row?
I need a way to read the file, work out the maximum number of columns, and somehow pass that to pandas to avoid the error. The column names don't matter for now, since I can always adjust them later in the code.
Then hopefully the bottom of my code would give me the result I want:
df['Time'] = df['Date'].astype(str) + ' ' + df['Times']
a = df.set_index('Time').stack()
df = a[a != 0].reset_index(drop=True, level=1).reset_index(name='Data').to_csv('output.csv')
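The approach being asked for here (pre-scan the file for the widest row, then hand pandas explicit column names so no row overflows the header) can be sketched as follows. This is my own sketch, not code from the thread; the sample text, the `sample.txt` filename, and the `Label`/`DataN` column names are assumptions:

```python
import pandas as pd

# Tiny stand-in for the log file, in the question's whitespace-separated format
sample = """\
18-May-2019 16:28:18 I StatesLog Off-Line States: IALT-1
18-May-2019 16:28:19 I StatesLog Off-Line States: S722a1-0, S722a2-0, S722ascon-0
"""
with open('sample.txt', 'w') as f:
    f.write(sample)

# Pre-scan: the widest line decides how many columns pandas must allow for
with open('sample.txt') as f:
    n = max(len(line.split()) for line in f)

# Six fixed columns, then as many generic Data columns as the widest row needs;
# shorter rows are padded with NaN because names= fixes the column count
names = ['Date', 'Times', 'Severity', 'Source', 'Machine', 'Label'] + \
        ['Data{}'.format(i) for i in range(n - 6)]
df = pd.read_csv('sample.txt', sep=r'\s+', names=names)
print(df.shape)
```

Because `names=` fixes the width up front, the "Expected X fields, saw Y" error cannot occur no matter which line happens to be widest.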
Answer 0 (score: 0)
First remove `States:` from your column with str.replace. Then use this function to unnest the values into rows:
df['State'] = df['State'].str.replace('States:', '')
df = explode_str(df, 'State', ',').reset_index(drop=True)
Date Times Severity Source Machine State
0 18-May-2019 16:28:18 I StatesLog Off-Line IALT-1
1 18-May-2019 16:28:19 I StatesLog Off-Line TdALclr-0
2 18-May-2019 16:28:19 I StatesLog Off-Line S722a1-0
3 18-May-2019 16:28:19 I StatesLog Off-Line S722a2-0
4 18-May-2019 16:28:19 I StatesLog Off-Line S722ascon-0
5 18-May-2019 16:28:19 I StatesLog Off-Line !S722a1-1
6 18-May-2019 16:28:19 I StatesLog Off-Line S722(OFF)-0
7 18-May-2019 16:28:19 I StatesLog Off-Line !S722a2-1
If you also want to drop the other columns:
explode_str(df, 'State', ',')[['Date', 'State']].reset_index(drop=True)
Date State
0 18-May-2019 IALT-1
1 18-May-2019 TdALclr-0
2 18-May-2019 S722a1-0
3 18-May-2019 S722a2-0
4 18-May-2019 S722ascon-0
5 18-May-2019 !S722a1-1
6 18-May-2019 S722(OFF)-0
7 18-May-2019 !S722a2-1
The function used, from another answer:
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
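Since pandas 0.25 the same unnesting can be done without this helper, via `DataFrame.explode`. A sketch using the question's data (the two-row sample frame is my own):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['18-May-2019', '18-May-2019'],
    'State': ['IALT-1', 'S722a1-0, S722a2-0, S722ascon-0'],
})

# Split the comma-joined values into lists, then give each element its own row
out = (df.assign(State=df['State'].str.split(','))
         .explode('State')
         .reset_index(drop=True))
out['State'] = out['State'].str.strip()  # drop the space left after each comma
print(out)
```

`explode` repeats the other columns (here `Date`) for every element, which is exactly what `explode_str` does with `np.repeat`.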
Answer 1 (score: 0)
I managed to work out the columns by trial and error. I'm still fairly new to Python, so although this works, it's probably not the best or cleanest way of doing it. It does take quite a long time to work out which line has the large data columns.
This did stop Erfan's code from working, though.
import numpy as np
import pandas as pd
import csv
import os

with open('test.txt', 'r') as oldfile, open('newtest.txt', 'w') as newfile:
    for line in oldfile:
        newfile.write(line)  # Leave original file untouched and save a copy to modify

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()  # remove the "," from the data part and replace with ' '
    f.write(content.replace(',', ' '))  # all info now has ' ' separator

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(' ', ','))  # replace separator with ','

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(',,,,', ','))  # try to remove extra ,'s

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(',,,', ','))  # try to remove extra ,'s

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(',,', ','))  # try to remove extra ,'s

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(',,', ','))  # try to remove extra ,'s - still left one column with ,, not sure why?

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace('States:', ''))  # remove 'States:'

num_lines = sum(1 for line in open('newtest.txt'))  # find how many lines are in the data
x = num_lines - 10  # subtract 10 as we don't need the header
y = 10  # 10 lines in the header
max_col = 0
while x > 1:
    a = pd.read_csv('newtest.txt', header=None, skiprows=y, nrows=1)
    max_col_ln = a.shape[1]
    # print(x)  # --- used for testing to see how many lines are left
    if max_col_ln > max_col:  # read the entire file and find the largest column count needed,
        max_col = max_col_ln  # as it probably won't be on line 1
    x = x - 1
    y = y + 1

z = 0
with open('newtest2.txt', 'w') as tempfile:
    while max_col > 0:
        tempfile.write('Column' + str(z) + ',')  # create Column0, Column1 etc. for the maximum number of columns
        max_col = max_col - 1
        z = z + 1

with open('newtest2.txt', 'r') as temphead:
    headers = temphead.read().replace('\n', '')  # load headers as an index for columns

with open('newtest.txt', 'r') as oldfile, open('newtest3.txt', 'w') as tempdata:
    tempdata.write(headers + '\n')  # write headers at the top of the new temp file
    for line in oldfile:
        if '*' not in line:
            tempdata.write(line)  # write all the data but drop the lines containing *

newdata = pd.read_table('newtest3.txt')  # read the txt as a table
newdata.to_csv('data.csv', quoting=csv.QUOTE_NONE, escapechar='*', index=False)  # write to csv using * as escape char and no index

df = pd.read_csv('data.csv')
df['Time'] = df['Column0*'] + ' ' + df['Column1*']  # combine first 2 columns to make a "Time" column
cols = list(df.columns)
cols = [cols[-1]] + cols[:-1]
df = df[cols]  # swap "Time" and Column0 around
df = df.drop(['Column0*', 'Column1*', 'Column2*', 'Column3*', 'Column4*', 'Column5*'],
             axis=1).to_csv('data.csv', index=False)  # remove columns I don't require from the data

with open('data.csv', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace('*', ''))  # remove the * escape char from earlier and write back to the csv

os.remove('newtest.txt')
os.remove('newtest2.txt')
os.remove('newtest3.txt')  # a bit of housekeeping after all the changes
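The repeated `replace` passes above (`',,,,'` → `','`, then `',,,'` → `','`, and so on) can be collapsed into a single regular-expression substitution, which also explains the stray `,,` the passes kept missing: a run of five or more commas survives the fixed-width replacements. A sketch of the idea on one sample line (the line itself is taken from the question's data):

```python
import re

line = '18-May-2019 16:28:19 I StatesLog Off-Line States: S722a1-0, S722a2-0'

# Any run of commas and/or whitespace, of any length, becomes exactly one comma,
# so the result no longer depends on how many spaces separated the columns
cleaned = re.sub(r'[,\s]+', ',', line.replace('States:', '')).strip(',')
print(cleaned)
```

Applied line by line (or to the whole file contents at once), this replaces all six read/truncate/write passes with one.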
Answer 2 (score: 0)
I got an error with the original answer. The raw data sometimes has a line break in the middle of a record (I have no control over the data; this is what we get):
18-May-2019 15:06:11 I StatesLog On-Line States: S644(OFF)-0, !S644a1-1, S644(OFF)-1, !S644a2-0, S770(OFF)-1,
!S770a1-0
18-May-2019 15:06:11 I StatesLog On-Line States: S644(ON)-1, S644(ON)-0, S770(ON)-0
18-May-2019 15:06:12 I StatesLog On-Line States: I770DG-1, I770RGs-0
18-May-2019 15:06:11 I StatesLog On-Line States: S644(OFF)-0, !S644a1-1, S644(OFF)-1, !S644a2-0, S770(OFF)-1,
The error I get is:
Traceback (most recent call last)
File "explode.py", line 42, in <module>
explode_str(df, 'Bit', ',')[['Times', 'Bit']].reset_index(drop = True).to_csv('test.csv')
File "explode.py", line 9, in explode_str
i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
ValueError: count < 0
I had the original problem on both int32 and int64, and I've now moved to a 64-bit system to try to resolve it.
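One plausible cause of the crash (my reading, not confirmed in the thread): the wrapped continuation lines parse into rows whose exploded column is NaN, and `s.str.count(sep)` then yields NaN, which breaks the `repeat` call inside `explode_str`. Dropping those rows before exploding avoids the error. A sketch with a hypothetical three-row frame (the `Times`/`Bit` column names follow the traceback above):

```python
import numpy as np
import pandas as pd

def explode_str(df, col, sep):
    # Same helper as in Answer 0, but tolerant of missing values:
    # a NaN in `col` would make str.count() return NaN and crash repeat()
    df = df.dropna(subset=[col]).reset_index(drop=True)
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

df = pd.DataFrame({'Times': ['15:06:11', None, '15:06:12'],
                   'Bit': ['a-1,b-0', np.nan, 'c-1']})
out = explode_str(df, 'Bit', ',').reset_index(drop=True)
print(out)
```

A fuller fix would first merge each continuation line back onto the record it belongs to before parsing, so no data is dropped.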
Answer 3 (score: 0)
I'm new to Python, but you can call out to a Bash one-liner to get what you need.
import subprocess

filename = "test.txt"
# awk prints the field count (NF) of every line; sort -unk orders the unique
# counts numerically, so tail gives the maximum and head the minimum.
# Note: os.system() only returns the exit status, so subprocess.check_output()
# is needed to actually capture awk's output.
cmd = "awk '{print NF}' " + filename + " | sort -unk 1,1 | tail -n 1"
max_cols = int(subprocess.check_output(cmd, shell=True))
cmd = "awk '{print NF}' " + filename + " | sort -unk 1,1 | head -n 1"
min_cols = int(subprocess.check_output(cmd, shell=True))
print(min_cols, max_cols)
If the separator in the file isn't whitespace/TAB, you can add -F',' (for a csv file).
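The same min/max field counts can be computed without shelling out to awk, which also works on Windows. A sketch (the `fields_demo.txt` sample file is my own, standing in for the log):

```python
# Sample file standing in for the log (assumption: whitespace-delimited fields)
with open('fields_demo.txt', 'w') as f:
    f.write('a b c\n')
    f.write('a b c d e\n')

# len(line.split()) is the pure-Python equivalent of awk's NF
counts = [len(line.split()) for line in open('fields_demo.txt') if line.strip()]
print(min(counts), max(counts))
```

The resulting maximum is exactly the number that needs to be passed to pandas as the column count.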