My file format looks like this:
# Jon Doe
# 27212000-C
# Calorina, 06/03 1993
# South Calorina Jaka Km 1
# Num 009.006
# Calorina. 11710, Tp.108437347343
# joe.st'a gmail.com
# 20-09-2016 Akn
# 36412506/E.15262
# Jakarta, 13/10/1994
# II, Let.jend, Soeprapto Gang Siaga
# V RT 005/03
# Jakarta, 10640. Tp.
# 22-09-2016/T Info
# Jenny Doe
# 5641141 2/E.15263
# Zimbabwe, 05/06/1993
# Mujair Street Iv No.185
# Mujair, 15116. Tp.04545454
# jenny@gmail.com
# 22-09-2016/T Info
# Igor Kart
# 36412777/E,15264
# Kongo, 30/10/1994
# Kp. Pintu Air Kel. Pabuaran Kec.Boj
# onggede Kab.Bogor RT 04/09
# Bogor, 16320. Tp,107262626
# igor.@gmail.com
# 22-09-2016T Info
How can I get cleanly structured data out of this? I want a CSV result like this, Good_format.csv:
Name | Code | Bday | Address | Phone | Email | Info
Jon Doe | 27212000-C | Calorina, 06/03 1993 | South Calorina Jaka Km 1Num 009.006 Calorina. 11710 | 108437347343 | joe.st'a gmail.com | 20-09-2016 Akn
Jenny Doe | 5641141 2/E.15263 | Zimbabwe, 05/06/1993 | Mujair Street Iv No.185 Mujair, 15116. | 04545454 | jenny@gmail.com | 22-09-2016/T Info
Igor Kart | 36412777/E,15264 | Kongo, 30/10/1994 | Kp. Pintu Air Kel. Pabuaran Kec.Bojonggede Kab.Bogor RT 04/09Bogor, 16320. | 107262626 | igor.@gmail.com | 22-09-2016T Info
I also want the badly formatted records written to log.txt. I need the bad records so that I can repair them, for example:
# 36412506/E.15262
# Jakarta, 13/10/1994
# II, Let.jend,
# V RT 005/03
# Jakarta, 10640. Tp.
# 22-09-2016/T Info
Answer 0 (score: 1)
import pandas as pd
from tabulate import tabulate

filepath = "SO.txt"
colList = ['Name', 'Code', 'Bday', 'Address', 'Phone', 'Email', 'Info']
df_full = pd.DataFrame(columns=colList)

with open(filepath) as fp:
    contents = fp.read()

# Records are assumed to be separated by blank lines.
# For every line, drop the leading "#" and the surrounding whitespace.
groups = [[line.split("#")[1].strip() for line in group.split("\n") if line != ""]
          for group in contents.split("\n\n")]

for groupInd, group in enumerate(groups):
    # A valid record starts with a name, so its first line must not contain a digit.
    # Records whose first line contains a digit are skipped.
    if not any(ch.isdigit() for ch in group[0]):
        df_temp = pd.DataFrame(columns=colList, index=[groupInd])
        df_temp.Name = group[0]
        df_temp.Code = group[1]
        df_temp.Bday = group[2]
        # Join the address and phone lines into one string, then split on 'Tp'
        temp = ' '.join(group[3:-2]).split('Tp')
        df_temp.Address = temp[0]
        # Keep only the digits of the phone part (drops commas, dots, ...)
        df_temp.Phone = ''.join(filter(lambda i: i.isdigit(), temp[1]))
        df_temp.Email = group[-2]
        df_temp.Info = group[-1]
        df_full = pd.concat([df_full, df_temp], axis=0)

print(tabulate(df_full, headers='keys', tablefmt='psql'))
Output:
+----+-----------+-------------------+----------------------+------------------------------------------------------------------------------+--------------+--------------------+-------------------+
| | Name | Code | Bday | Address | Phone | Email | Info |
|----+-----------+-------------------+----------------------+------------------------------------------------------------------------------+--------------+--------------------+-------------------|
| 0 | Jon Doe | 27212000-C | Calorina, 06/03 1993 | South Calorina Jaka Km 1 Num 009.006 Calorina. 11710, | 108437347343 | joe.st'a gmail.com | 20-09-2016 Akn |
| 2 | Jenny Doe | 5641141 2/E.15263 | Zimbabwe, 05/06/1993 | Mujair Street Iv No.185 Mujair, 15116. | 04545454 | jenny@gmail.com | 22-09-2016/T Info |
| 3 | Igor Kart | 36412777/E,15264 | Kongo, 30/10/1994 | Kp. Pintu Air Kel. Pabuaran Kec.Boj onggede Kab.Bogor RT 04/09 Bogor, 16320. | 107262626 | igor.@gmail.com | 22-09-2016T Info |
+----+-----------+-------------------+----------------------+------------------------------------------------------------------------------+--------------+--------------------+-------------------+
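The code above prints the valid records but writes neither Good_format.csv nor log.txt. A minimal sketch of that missing step follows, using only the standard library. It reuses the same heuristics (a record must start with a digit-free name line, and 'Tp' marks the start of the phone number) and assumes that records are separated by blank lines. The helper names clean, parse_group, split_records and write_outputs are hypothetical and were picked for illustration only.

```python
import csv
import re

COLUMNS = ["Name", "Code", "Bday", "Address", "Phone", "Email", "Info"]


def clean(line):
    """Drop the leading '#' marker and the surrounding whitespace."""
    return line.lstrip("#").strip()


def parse_group(lines):
    """Return a row dict for a well-formed record, or None if it looks malformed."""
    # Assumed minimum: name, code, birthday, one address/phone line, email, info.
    # Same heuristic as above: the first line is a name and holds no digit.
    if len(lines) < 6 or any(ch.isdigit() for ch in lines[0]):
        return None
    # Address and phone share the middle lines; the last 'Tp' starts the phone.
    address, sep, phone = " ".join(lines[3:-2]).rpartition("Tp")
    if not sep:
        return None
    return {
        "Name": lines[0],
        "Code": lines[1],
        "Bday": lines[2],
        "Address": address.strip(),
        "Phone": "".join(ch for ch in phone if ch.isdigit()),
        "Email": lines[-2],
        "Info": lines[-1],
    }


def split_records(text):
    """Split the text into (rows, rejected); blank lines separate the records."""
    rows, rejected = [], []
    for block in re.split(r"\n\s*\n", text):
        lines = [clean(line) for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        row = parse_group(lines)
        if row is None:
            rejected.append(lines)
        else:
            rows.append(row)
    return rows, rejected


def write_outputs(rows, rejected, csv_path="Good_format.csv", log_path="log.txt"):
    """Write valid rows as CSV and rejected records, unchanged, to the log file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    with open(log_path, "w", encoding="utf-8") as f:
        for lines in rejected:
            f.write("\n".join("# " + line for line in lines) + "\n\n")
```

To use it, read the input file into a string, call split_records on it, and pass both results to write_outputs. Rejected records are logged in their original "# ..." layout, so they can be corrected by hand and fed back through the same functions.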