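I am trying to read about 2000 .txt files that do not all have the same columns. I want to select only the headers common to all files and save them to a CSV file for upload to a MySQL database. I need help parsing these files so that I keep only the columns I need: code, startDate, startTime, endDate, endTime, s, number. After startDate and endDate there are time columns that have no header in the files; I have simply named them "startTime" and "endTime".
For illustration:
file1 example: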
code startDate endDate s number
-------------------------------------- ------------------- ------------------- - ----------
4000 23-04-2010 00:00:00 23-04-2010 00:14:59 E 1
4001 23-04-2010 00:00:00 23-04-2010 00:14:59 E 0
4002 23-04-2010 00:00:00 23-04-2010 00:14:59 E 0
4003 23-04-2010 00:00:00 23-04-2010 00:14:59 E 0
file2 example:
code lineNum startDate endDate s number id description
-------------------------------------- -------------------------------------- ------------------- ------------------- - ---------- ------------------ ----------------------------------------------------------------------------------------------------
3000 2111201 31-10-2010 05:45:00 31-10-2010 05:59:59 E 9 311 CAPITAL
3000 2111201 31-10-2010 05:45:00 31-10-2010 05:59:59 E 4 1411 USUARIO FRECUENTE
3000 2111201 31-10-2010 05:45:00 31-10-2010 05:59:59 E 1 7071 FUNCIONARIO
3000
import re
import pandas as pd

file_list = [file1, file2,...]
datalist = []
for file in file_list:
    with open(file, 'r') as f:
        reader = f.readlines()
        for line in reader:
            # use regex to keep only rows with text or numbers
            # (this skips blank lines and the dashed separator line)
            if re.search(r'[0-9a-zA-Z]', line):
                datalist.append(line.strip().split())
header = datalist[0]
try:
    repeatingHeaderIndx = datalist[1:].index(header) + 1
    # remove repeating header from data using index
    datalist.pop(repeatingHeaderIndx)
except ValueError:
    # the header does not appear a second time
    pass
df = pd.DataFrame(datalist[1:])
When I inspect the full DataFrame, it has more columns than I asked for, because the number of columns can differ from file to file.
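
To make the mismatch concrete, here is a small check on lines copied from the file1 and file2 examples above. It only counts whitespace-separated tokens, which is what `line.strip().split()` produces; the variable names are made up for this illustration.

```python
# Lines copied from the file1 and file2 examples in the question.
samples = {
    "file1 header": "code startDate endDate s number",
    "file1 row": "4000 23-04-2010 00:00:00 23-04-2010 00:14:59 E 1",
    "file2 header": "code lineNum startDate endDate s number id description",
    "file2 row": "3000 2111201 31-10-2010 05:45:00 31-10-2010 05:59:59 E 4 1411 USUARIO FRECUENTE",
}

# Count the tokens that a plain whitespace split yields for each line.
token_counts = {label: len(line.split()) for label, line in samples.items()}

for label, count in token_counts.items():
    print(label, count)
```

The data rows carry more tokens than their headers because each date is followed by an unlabelled time token, and a free-text field such as description can itself contain spaces. Rows of different lengths are what widen the DataFrame.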
Answer 0 (score: 0)
You can modify the regular expression so that it matches only lines that contain any of the column names:
import re

obj = re.compile(r'\b(code|startDate|startTime|endDate|endTime|s|number)\b')
datalist = []
with open('words.txt', 'r') as reader:
    for line in reader:
        # collect the wanted column names that appear on this line
        match = obj.findall(line)
        datalist.append(match)
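
As a quick sanity check of what this pattern returns, the sketch below runs it on one header line and one data row copied from the file2 example in the question. The variable names are only for illustration.

```python
import re

obj = re.compile(r'\b(code|startDate|startTime|endDate|endTime|s|number)\b')

# One header line and one data row taken from the file2 example.
header_line = "code lineNum startDate endDate s number id description"
data_line = "3000 2111201 31-10-2010 05:45:00 31-10-2010 05:59:59 E 9 311 CAPITAL"

header_match = obj.findall(header_line)
data_match = obj.findall(data_line)

print(header_match)  # ['code', 'startDate', 'endDate', 's', 'number']
print(data_match)    # []
```

On these samples the pattern picks out the wanted names from the header line and finds nothing in the data row, so it tells you which of the wanted columns a file has; the values themselves still have to be taken from the rows by position.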
So your code should look something like this:
import re
import pandas as pd

file_list = [file1, file2,...]
obj = re.compile(r'\b(code|startDate|startTime|endDate|endTime|s|number)\b')
datalist = []
for file in file_list:
    with open(file, 'r') as f:
        reader = f.readlines()
        for line in reader:
            match = obj.findall(line)
            # skip lines on which none of the column names appear
            if match:
                datalist.append(match)
header = datalist[0]
try:
    repeatingHeaderIndx = datalist[1:].index(header) + 1
    # remove repeating header from data using index
    datalist.pop(repeatingHeaderIndx)
except ValueError:
    # the header does not appear a second time
    pass
df = pd.DataFrame(datalist[1:])
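
If the goal is to keep the values of those seven columns, one possible approach is to map each header name to a token position and pick the tokens by position. The sketch below is a swapped-in technique, not the regex approach above, and uses only the standard library. It assumes that fields are separated by whitespace, that every startDate and endDate value is followed by one unlabelled time token, and that the wanted columns come before any free-text column such as description. `expand_header` and `select_columns` are hypothetical helper names, and the dashed separator in the sample is shortened.

```python
import csv
import io

WANTED = ["code", "startDate", "startTime", "endDate", "endTime", "s", "number"]

def expand_header(header_tokens):
    """Insert names for the unlabelled time columns after each date column."""
    expanded = []
    for name in header_tokens:
        expanded.append(name)
        if name == "startDate":
            expanded.append("startTime")
        elif name == "endDate":
            expanded.append("endTime")
    return expanded

def select_columns(lines, wanted=WANTED):
    """Yield one list per data row that holds only the wanted columns."""
    positions = None
    for line in lines:
        tokens = line.split()
        # Skip blank lines and the dashed separator under the header.
        if not tokens or set(line.strip()) <= set("- "):
            continue
        if tokens[0] == "code":
            # Header line; a header repeated later in the file lands here too.
            names = expand_header(tokens)
            if set(wanted) <= set(names):
                positions = [names.index(col) for col in wanted]
            else:
                positions = None  # this file lacks a wanted column
            continue
        if positions is None or len(tokens) <= max(positions):
            continue  # no usable header yet, or a truncated row
        yield [tokens[i] for i in positions]

# Sample text based on the file2 example (separator shortened).
sample = """code lineNum startDate endDate s number id description
---- ------- --------- ------- - ------ -- -----------
3000 2111201 31-10-2010 05:45:00 31-10-2010 05:59:59 E 4 1411 USUARIO FRECUENTE
3000
"""
rows = list(select_columns(sample.splitlines()))

# Write the selected columns as CSV (to a string here; a file works the same way).
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(WANTED)
writer.writerows(rows)
print(out.getvalue())
```

For real files, an open file object can be passed directly to `select_columns`, and the collected rows can go to `pd.DataFrame(rows, columns=WANTED)` if you prefer to continue in pandas. Rows with too few tokens, like the lone `3000` in the sample, are dropped rather than padded.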