I have been trying for quite a while to format space-separated data into a CSV structure.
The initial data looks like this:
Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment
It contains a lot of whitespace and unnecessary information. The fields are roughly these:
Doctor's name | Degree | Years of experience | Specialization | Hospital name | Address | Fees | Schedule | and an unnecessary book appointment field.
I want to convert it to the following format:
Doctor's name,Specialization,Hospital name,Address,Fees,Schedule
So the data above should end up looking like this:
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM
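So far I have managed to remove the Book Appointment field.
However, I am having trouble separating out the hospital name, because the spacing around it varies a lot. Is this problem solvable?
The output of
cat -A file
is as follows: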
Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE ^I Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment $
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic ^I Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment $
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center ^I Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment
Answer 0 (score: 3)
There is no direct way to separate the specialization from the hospital name, but with a few assumptions you can do it with perl:
perl -pe 's/^(\S+\s+\S+\s+\S+).+experience\s([^\t]+?)\s+(\b[A-Z0-9]{2}[^\t]+?|(?:(?!\b[A-Z0-9]{2})[^\t])*)\s+\t\s+([^,]+,).+?(INR.+?PM)\s+.*/\1,\2,\3,\4\5/' file
This gives:
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250 MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200 MON-SUN10:00AM-8:00PM
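Since it is a perl-style regex, you can paste it into regex101 and step through it with the regex debugger to see how it works. The regex is fairly simple; it only looks intimidating because it has so many parts.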
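To make the parts easier to read, here is a minimal sketch of the same pattern ported to Python with re.VERBOSE, one commented group per field. The group names and the parse_line helper are illustrative and not part of the original one-liner; the sample line is taken from the question, with the tab that cat -A shows written as \t.

```python
import re

# Port of the perl one-liner's pattern; group names are made up for readability.
PATTERN = re.compile(r"""
    ^(?P<name>\S+\s+\S+\s+\S+)        # "Dr." plus the next two words
    .+experience\s                    # skip the degree and "N years experience"
    (?P<spec>[^\t]+?)                 # specialization, as short as possible
    \s+
    (?P<hospital>
        \b[A-Z0-9]{2}[^\t]+?          # starts with two capitals/digits, or
      | (?:(?!\b[A-Z0-9]{2})[^\t])*   # has no such word start anywhere
    )
    \s+\t\s+                          # the tab that cat -A prints as ^I
    (?P<address>[^,]+),               # locality, up to the first comma
    .+?
    (?P<fees_schedule>INR.+?PM)       # fees and schedule, up to the final PM
    \s+.*
""", re.VERBOSE)

FIELDS = ("name", "spec", "hospital", "address", "fees_schedule")


def parse_line(line):
    """Return a dict of captured fields, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None


sample = ("Dr. Arun Raykar MBBS, MS - ENT 9 years experience "
          "Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE \t "
          "Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment ")
fields = parse_line(sample)
print(",".join(fields[k] for k in FIELDS))
# Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250 MON-SAT7:00PM-9:00PM
```

Lines that do not fit the pattern come back as None, which makes them easy to collect for manual cleanup.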
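Caveat: the above separates the specialization from the hospital name based on two things, both visible in the regex: a hospital name that begins with two consecutive uppercase letters or digits (as in SHAKTHI or V2) is taken to start at that word; otherwise the specialization is assumed to be a single word, and everything after it up to the tab is the hospital name.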
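I know this may not solve the whole problem, since there will always be lines that don't follow the rules above, but it should get you started on cleaning them up. Where the split goes wrong (i.e. when the specialization consists of more than one word and the hospital name does not start with two consecutive uppercase letters or digits), you will get the first word of the specialization placed correctly and the rest in the hospital name.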
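One way to find the lines that need manual cleanup is to apply the same two heuristics word by word and record which one fired. The sketch below is a loose approximation of what the regex does, not an exact equivalent; split_spec_hospital and its third return value are hypothetical names, and the last example string is made up to show a wrong split.

```python
import re

# A word that starts with two uppercase letters or digits, e.g. SHAKTHI or V2.
STARTS_LIKE_HOSPITAL = re.compile(r"[A-Z0-9]{2}")


def split_spec_hospital(text):
    """Split 'specialization hospital' text.

    Returns (specialization, hospital, confident). confident is False when
    the single-word fallback was used, so the row deserves a manual check.
    """
    words = text.split()
    if not words:
        return "", "", False
    # Rule 1: the hospital starts at the first later word that begins with
    # two capitals/digits. The first word always belongs to the specialization.
    for i in range(1, len(words)):
        if STARTS_LIKE_HOSPITAL.match(words[i]):
            return " ".join(words[:i]), " ".join(words[i:]), True
    # Rule 2 (fallback): assume a one-word specialization.
    return words[0], " ".join(words[1:]), False


print(split_spec_hospital("Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE"))
# ('Ear-Nose-Throat (ENT) Specialist', 'SHAKTHI E.N.T CARE', True)
print(split_spec_hospital("Homeopath Sankirana Homeopathic Clinic"))
# ('Homeopath', 'Sankirana Homeopathic Clinic', False)
print(split_spec_hospital("General Physician Apollo Clinic"))  # made-up input
# ('General', 'Physician Apollo Clinic', False)
```

The second result happens to be correct and the third is not, which is why the flag only marks rows for review rather than marking them as wrong.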
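Answer 1 (score: 2)
Unfortunately, based on your input, there is no way to separate the specialization from the hospital name. The other fields can be captured, though not very elegantly, with gawk (probably >= 4.0, though I think 3.x should work):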
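$ awk -F" \t " -v OFS="," -v S=" " '
{
    sub(/\s+$/, "");                      # drop trailing whitespace
    # $2 is everything after the tab: address, city, fees, schedule
    split($2, Data, /[ ,]{2,}/);
    Address = Data[1];                    # text before the first comma-space
    split($2, Data, / +/);
    nData = length(Data);                 # the last two words are "Book Appointment"
    Schedule = Data[nData - 2];
    Fees = Data[nData - 4] S Data[nData - 3];
    # $1 is everything before the tab: name, degree, experience, specialization, hospital
    split($1, Data, / +/);
    Name = Data[1] S Data[2] S Data[3];   # assume all names are Dr. Xxx Xxx only
    match($1, /[0-9]+ years experience /);
    SpecializationHospital = substr($1, RSTART + RLENGTH);
    print Name, SpecializationHospital, Address, Fees, Schedule;
} ' data.txt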
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM
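For comparison, the same field-by-position idea can be sketched in Python. This is an illustrative port, not part of the answer: extract_fields is a hypothetical helper, and it makes the same assumptions as the awk script, namely one tab per line, names of the form Dr. Xxx Xxx, and a trailing "Book Appointment" still present.

```python
import csv
import re
import sys


def extract_fields(line):
    """Return [name, specialization+hospital, address, fees, schedule]."""
    # Everything before the tab vs. everything after it.
    left, right = line.rstrip().split("\t", 1)

    # Assume all names are "Dr. Xxx Xxx": keep the first three words.
    name = " ".join(left.split()[:3])

    # Specialization and hospital are whatever follows "N years experience".
    spec_hospital = re.split(r"\d+ years experience\s+", left, maxsplit=1)[1].strip()

    # Address is the text before the first comma on the right-hand side.
    address = right.split(",", 1)[0].strip()

    # The right side ends with: INR <amount> <schedule> Book Appointment
    tokens = right.split()
    fees = " ".join(tokens[-5:-3])
    schedule = tokens[-3]
    return [name, spec_hospital, address, fees, schedule]


sample = [
    "Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) "
    "Specialist SHAKTHI E.N.T CARE \t Malleswaram, Bangalore INR 250 "
    "MON-SAT7:00PM-9:00PM Book Appointment ",
    "Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family "
    "Dental Center \t Electronics City, Bangalore INR 200 "
    "MON-SUN10:00AM-8:00PM Book Appointment",
]
# csv.writer adds quoting automatically if a field ever contains a comma.
csv.writer(sys.stdout).writerows(extract_fields(line) for line in sample)
```

Writing through the csv module rather than joining with commas by hand keeps the output valid even if a hospital name or address contains a comma.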