TT1 4444 | Drowsy | 9 19 bit drowsy
TT2 45888 | Blurred see - hazy | 29 50 little seeing vision
TT4 45933 | Excessive upper pain | 62 78 pain problems
I would like to export part of this information to an Excel worksheet or a CSV file. The CSV file I expect looks like this:
Column 1    Column 2                Column 3
4444        Drowsy                  bit drowsy
45888       Blurred see - hazy      little seeing vision
45933       Excessive upper pain    pain problems
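As you can see, I do not need the information in the first, fourth, and fifth columns of the text file.
Update to the question: some lines have the following structure: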
TT6 112397013 | air | or 76948002|pain| 22 345 agony
The expected output is:
Column 1     Column 2    Column 3
112397013    air         agony
76948002     pain        agony
Second update to the question: the text file contains one more exception:
TT9 CONCEPT_LESS 336 344 mobility
For this line, I only want the following output:
CONCEPT_LESS mobility
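Any suggestions? Thanks!
Answer 0 (score: 1)
I assume you can read the data into a list of strings. The code below uses regular expressions (re) to parse them into the desired output, which you can then write out to a CSV file: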
import re

# Read the lines from a file with:
# lines = my_file.readlines()
lines = ["TT1 4444 | Drowsy | 9 19 bit drowsy",
         "TT2 45888 | Blurred see - hazy | 29 50 little seeing vision",
         "TT4 45933 | Excessive upper pain | 62 78 pain problems"]

# Matches "TT", skips to the next whitespace, then captures the number up to the following whitespace
tt_num_pattern = r"TT.*\s([0-9].*?)\s"
# Captures the text that starts at the first non-digit character following whitespace
describe_pattern = r"\s(\D.*)"

# Format the output lines
out_lines = []
for line in lines:
    split_line = line.split("|")
    tt_num = re.findall(tt_num_pattern, split_line[0])[0]
    state = split_line[1].strip()  # Just trim whitespace from the edges
    describe = re.findall(describe_pattern, split_line[2])[0]
    describe = describe.strip()
    out_line = tt_num + "," + state + "," + describe
    out_lines.append(out_line)

# Print them out (normally you would write them to a file after a header line)
for out_line in out_lines:
    print(out_line)
Output:
4444,Drowsy,bit drowsy
45888,Blurred see - hazy,little seeing vision
45933,Excessive upper pain,pain problems
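The snippet above only prints the rows. A minimal sketch of the remaining step, writing a header line followed by the rows, could look like the following. The helper name `write_rows` and the use of an in-memory buffer are assumptions made for illustration; the rows are the three already shown in the output.

```python
import csv
import io

def write_rows(rows, fh):
    """Write a header row plus (code, term, description) rows to an open text file."""
    writer = csv.writer(fh)
    writer.writerow(["Column 1", "Column 2", "Column 3"])
    writer.writerows(rows)

rows = [("4444", "Drowsy", "bit drowsy"),
        ("45888", "Blurred see - hazy", "little seeing vision"),
        ("45933", "Excessive upper pain", "pain problems")]

# An in-memory buffer stands in for a real file in this sketch
buf = io.StringIO()
write_rows(rows, buf)
print(buf.getvalue())
```

To produce an actual file in Python 3, the same function can be called with a handle from `open("output.csv", "w", newline="")`, where `output.csv` is a placeholder name.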
Glad this helped. Here is the update you asked for. Honestly, it is not very good (flexible) code, but it works:
import re

# Read the lines from a file with:
# lines = my_file.readlines()
lines = ["TT1 4444 | Drowsy | 9 19 bit drowsy",
         "TT2 45888 | Blurred see - hazy | 29 50 little seeing vision",
         "TT4 45933 | Excessive upper pain | 62 78 pain problems",
         "TT6 112397013 | air | or 76948002|pain| 22 345 agony"]

# Matches "TT", skips to the next whitespace, then captures the number up to the following whitespace
tt_num_pattern = r"TT.*\s([0-9].*?)\s"
# Captures the text that starts at the first non-digit character following whitespace
describe_pattern = r"\s(\D.*)"

# Format the output lines
out_lines = []
for line in lines:
    split_line = line.split("|")
    # If there is an 'or', the line splits into five fields
    if len(split_line) == 5:
        # Second code/term pair (the one after 'or')
        tt_num = split_line[2].replace("or", "").strip()
        state = split_line[3].strip()
        describe = re.findall(describe_pattern, split_line[4])[0].strip()
        out_line = tt_num + "," + state + "," + describe
        out_lines.append(out_line)
        # First code/term pair, which shares the same description
        tt_num = re.findall(tt_num_pattern, split_line[0])[0]
        state = split_line[1].strip()
        out_line = tt_num + "," + state + "," + describe
        out_lines.append(out_line)
    # If there is no 'or', the line splits into three fields
    elif len(split_line) == 3:
        tt_num = re.findall(tt_num_pattern, split_line[0])[0]
        state = split_line[1].strip()  # Just trim whitespace from the edges
        describe = re.findall(describe_pattern, split_line[2])[0]
        describe = describe.strip()
        out_line = tt_num + "," + state + "," + describe
        out_lines.append(out_line)

# Print them out (normally you would write them to a file after a header line)
for out_line in out_lines:
    print(out_line)
Updated output:
4444,Drowsy,bit drowsy
45888,Blurred see - hazy,little seeing vision
45933,Excessive upper pain,pain problems
76948002,pain,agony
112397013,air,agony
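For comparison, here is a sketch of a somewhat more flexible parser that handles all three line shapes from the question with one function. It is not the code from this answer: the function name `parse_line` is made up, and it assumes that the last field always starts with two numeric offsets, that pipe-separated fields come in (code, term) pairs, and that a pipe-less line has the layout "id label offset offset text". Rows come out in input order.

```python
import re

def parse_line(line):
    """Return a list of output rows (tuples) for one input line."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) == 1:
        # No pipes, e.g. "TT9 CONCEPT_LESS 336 344 mobility"
        tokens = fields[0].split()
        return [(tokens[1], " ".join(tokens[4:]))]
    # The last field holds two offsets followed by the free text
    text = re.sub(r"^\d+\s+\d+\s*", "", fields[-1])
    rows = []
    # The remaining fields alternate: code field, term, code field, term, ...
    for code_field, term in zip(fields[0:-1:2], fields[1:-1:2]):
        code = code_field.split()[-1]  # drops the leading "TT6" or "or"
        rows.append((code, term, text))
    return rows

sample = ["TT1 4444 | Drowsy | 9 19 bit drowsy",
          "TT6 112397013 | air | or 76948002|pain| 22 345 agony",
          "TT9 CONCEPT_LESS 336 344 mobility"]
for line in sample:
    for row in parse_line(line):
        print(",".join(row))
```

Because the pairs are taken with slices rather than fixed indexes, a line with more than one "or" alternative would be handled by the same loop, provided it follows the same layout.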
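Answer 1 (score: 1)
Because the input text file does not use a single consistent delimiter (pipe, space, or comma), we need to read the file as plain strings.
To extract the required information, use regular expressions.
The csv module is used to create and write the CSV data.
Check the Python documentation for more information about the csv module.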
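One practical difference between the csv module and joining strings by hand is quoting. The short sketch below illustrates it; the sample row, with a comma inside the second field, is made up for illustration and does not come from the question's data.

```python
import csv
import io

# Hypothetical row whose second field contains a comma
row = ["45933", "Excessive upper pain, left side", "pain problems"]

# Naive join: the comma inside the field looks like an extra column separator
manual = ",".join(row)

# csv.writer wraps the field that contains a comma in double quotes
buf = io.StringIO()
csv.writer(buf).writerow(row)
safe = buf.getvalue().strip()

print(manual)
print(safe)
```

A CSV reader would split the first printed line into four columns and the second into three.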
Contents of xyz.txt:
TT1 4444 | Drowsy | 9 19 bit drowsy
TT2 45888 | Blurred see - hazy | 29 50 little seeing vision
TT4 45933 | Excessive upper pain | 62 78 pain problems
TT6 112397013 | air | or 76948002|pain| 22 345 agony
TT9 CONCEPT_LESS 336 344 mobility
Code (comments inline):
import re
import csv

def extract_data(val):
    """Return (code, term, description) from three pipe-separated fields."""
    tmp1, tmp2, tmp3 = val[0], val[1], val[2]
    # Last word of the first field, e.g. "TT1 4444" -> "4444"
    tmp1 = re.findall(r'.*\s+(\w+)', tmp1.strip())[0]
    tmp2 = tmp2.strip()
    # Text after the numeric offsets, e.g. "9 19 bit drowsy" -> "bit drowsy"
    tmp3 = re.findall(r'\s+(\D+)', tmp3.strip())[0]
    return (tmp1, tmp2, tmp3)

# Open the CSV file for writing; newline='' is what the csv module expects in Python 3
with open("demo.csv", 'w', newline='') as csv_fh:
    writer = csv.writer(csv_fh)
    # Write the header row to the CSV file
    writer.writerow(('Column 1', 'Column 2', 'Column 3'))
    # Read the text file line by line
    with open("xyz.txt", "r") as fh:
        for line in fh:
            # Check for 'or' in the line (a plain substring test, so it
            # would also match a word such as "more")
            if "or" in line:
                val_list = line.split('|')
                val1 = val_list[:2]
                val2 = val_list[2:]
                # Both code/term pairs share the final description field
                val1.append(val2[-1])
                for v in [val1, val2]:
                    writer.writerow(extract_data(v))
            elif '|' in line:
                # Split on the pipe (|)
                val = line.split('|')
                writer.writerow(extract_data(val))
            else:
                # No pipe at all, e.g. "TT9 CONCEPT_LESS 336 344 mobility"
                val = line.split()
                data = [val[1], val[4], '']
                writer.writerow(data)
# Both files are closed automatically when the with blocks end
Contents of demo.csv:
Column 1,Column 2,Column 3
4444,Drowsy,bit drowsy
45888,Blurred see - hazy,little seeing vision
45933,Excessive upper pain,pain problems
112397013,air,agony
76948002,pain,agony
CONCEPT_LESS,mobility,