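I wrote a script that converts a large 4 MB text file, containing 4k+ lines of unordered data, into a formatted CSV file that is easier to work with.
Problem:
Judging by the file sizes, it looks like I am losing more than 1 MB of data (20K lines | Edit: the original file was 7 MB, so about 4 MB of data is lost), and when I search sorted_CSV.csv for specific data points from CommaOnly.txt, I cannot find them.
I find this very strange.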
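What I tried:
I searched for and replaced every Unicode character in CommaOnly.txt that might cause problems.. no luck!
Example: \u0b99 replaced with ""
Here is an example of the data that goes missing.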
A line from CommaOnly.txt:
name,SJ Photography,category,Professional Services,state,none,city,none,country,none,about,Capturing intimate & milestone moment from pregnancy and family portraits to weddings
In sorted_CSV.csv:
Not present.
What could be causing this?
Code:
import re
import csv
import time

# Final sorted order for all data:
#['name', 'data',
# 'category','data',
# 'about', 'data',
# 'country', 'data',
# 'state', 'data',
# 'city', 'data']

## Receives a string item, splits on the "," delimiter, returns the split list
def split_values(string):
    string = string.strip('\n')
    split_string = re.split(',', string)
    return split_string

## Iterates through the list, reorganizes terms in the desired order at the desired indices
## Adds the field if it is not initially present
def reformo_sort(list_to_sort):
    processed_values=[""]*12
    for i in range(11):
        try:
            ## Terrible code I know, but trying to be explicit for the question
            if(i==0):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="name"):
                        processed_values[0]=(list_to_sort[j])
                        ## append its neighbour
                        processed_values[1]=(list_to_sort[j+1])
                ## if after iterating, name does not appear, add it.
                if(processed_values[0] != "name"):
                    processed_values[0]="name"
                    processed_values[1]="None"
            elif(i==2):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="category"):
                        processed_values[2]=(list_to_sort[j])
                        processed_values[3]=(list_to_sort[j+1])
                if(processed_values[2] != "category"):
                    processed_values[2]="category"
                    processed_values[3]="None"
            elif(i==4):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="about"):
                        processed_values[4]=(list_to_sort[j])
                        processed_values[5]=(list_to_sort[j+1])
                if(processed_values[4] != "about"):
                    processed_values[4]="about"
                    processed_values[5]="None"
            elif(i==6):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="country"):
                        processed_values[6]=(list_to_sort[j])
                        processed_values[7]=(list_to_sort[j+1])
                if(processed_values[6]!= "country"):
                    processed_values[6]="country"
                    processed_values[7]="None"
            elif(i==8):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="state"):
                        processed_values[8]=(list_to_sort[j])
                        processed_values[9]=(list_to_sort[j+1])
                if(processed_values[8] != "state"):
                    processed_values[8]="state"
                    processed_values[9]="None"
            elif(i==10):
                for j in range(len(list_to_sort)):
                    if(list_to_sort[j]=="city"):
                        processed_values[10]=(list_to_sort[j])
                        processed_values[11]=(list_to_sort[j+1])
                if(processed_values[10] != "city"):
                    processed_values[10]="city"
                    processed_values[11]="None"
        except:
            print("failed to append!")
    return processed_values

# Converts desired data fields to a string, delimiting values by ','
def to_CSV(values_to_convert):
    CSV_ENTRY=str(values_to_convert[1])+','+str(values_to_convert[3])+','+str(values_to_convert[5])+','+str(values_to_convert[7])+','+str(values_to_convert[9])+','+str(values_to_convert[11])
    return CSV_ENTRY

with open("CommaOnly.txt", 'r') as c:
    print("Starting.. :)")
    for line in c:
        entry = c.readline()
        to_sort = split_values(entry)
        now_sorted = reformo_sort(to_sort)
        CSV_ROW=to_CSV(now_sorted)
        with open("sorted_CSV.csv", "a+") as file:
            file.write(str(CSV_ROW)+"\n")
print("Finished! :)")
time.sleep(60)
Answer 0 (score: 1)
I rewrote the main loop, which looked very suspicious to me, using the csv package.
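To illustrate what makes that loop suspicious, assuming the script runs under Python 3: `for line in c` already consumes one line per iteration, and the extra `c.readline()` call in the loop body consumes the next one, so only every second line reaches the output. The sketch below reproduces this with an in-memory `io.StringIO` as a stand-in for `CommaOnly.txt`; the sample rows and the helper names `read_like_question` and `read_each_line` are made up for illustration.

```python
import io

# Hypothetical stand-in for the contents of CommaOnly.txt.
SAMPLE = "name,A\nname,B\nname,C\nname,D\n"

def read_like_question(handle):
    """Mimic the question's loop: iterate over the file AND call readline()."""
    seen = []
    for line in handle:            # consumes one line ...
        entry = handle.readline()  # ... and this consumes the following one
        seen.append(entry.strip("\n"))
    return seen

def read_each_line(handle):
    """Use the loop variable directly, so no line is skipped."""
    return [line.strip("\n") for line in handle]

skipped = read_like_question(io.StringIO(SAMPLE))
complete = read_each_line(io.StringIO(SAMPLE))

print(skipped)   # ['name,B', 'name,D']
print(complete)  # ['name,A', 'name,B', 'name,C', 'name,D']
```

Losing roughly every other line would be consistent with the output file being noticeably smaller than the input.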
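Your reformo_sort routine is incomplete and syntactically incorrect, with empty elif blocks and missing handling, so I got incomplete rows, but this should work much better than your code. Note the use of the csv module, the "binary" flag, a single open in write mode instead of opening and closing the file for every line (faster), and the 1-out-of-2 filtering of the now_sorted array.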
with open("CommaOnly.txt", 'rb') as c:
    print("Starting.. :)")
    cr = csv.reader(c,delimiter=",",quotechar='"')
    with open("sorted_CSV.csv", "wb") as fw:
        cw = csv.writer(fw,delimiter=",",quotechar='"')
        for to_sort in cr:
            now_sorted = reformo_sort(to_sort)
            cw.writerow(now_sorted[1::2])
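
The loop above opens the files in binary mode, which is the Python 2 convention for the csv module. Under Python 3 the csv module expects text-mode files opened with `newline=""`, so a rough Python 3 equivalent could look like the sketch below. The names `FIELDS`, `pairs_to_row` and `convert` are assumptions introduced for illustration and are not part of the original answer; `pairs_to_row` is a compact stand-in for `reformo_sort` followed by the `[1::2]` slice, and unlike the original it silently ignores a key that appears as the very last token.

```python
import csv
import io

# Output column order, taken from the comment block in the question.
FIELDS = ("name", "category", "about", "country", "state", "city")

def pairs_to_row(tokens):
    """Return the values for FIELDS, in order, using "None" for missing keys."""
    found = {}
    # Like the question's code, treat the token right after a key as its value.
    for key, value in zip(tokens, tokens[1:]):
        if key in FIELDS:
            found[key] = value
    return [found.get(field, "None") for field in FIELDS]

def convert(src, dst):
    """Read key/value lines from src, write ordered CSV rows to dst."""
    reader = csv.reader(src, delimiter=",", quotechar='"')
    writer = csv.writer(dst, delimiter=",", quotechar='"')
    count = 0
    for tokens in reader:
        if not tokens:  # skip blank lines
            continue
        writer.writerow(pairs_to_row(tokens))
        count += 1
    return count

# In-memory demo using the sample line from the question.
sample = io.StringIO(
    "name,SJ Photography,category,Professional Services,"
    "state,none,city,none,country,none,about,"
    "Capturing intimate & milestone moment from pregnancy "
    "and family portraits to weddings\n"
)
out = io.StringIO()
rows_written = convert(sample, out)
print(out.getvalue())

# With real files (Python 3), one possible call would be:
# with open("CommaOnly.txt", newline="") as src, \
#         open("sorted_CSV.csv", "w", newline="") as dst:
#     convert(src, dst)
```

Passing file objects into `convert` keeps the single-open behaviour the answer recommends and makes the routine easy to try on a small in-memory sample before running it on the full file.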