Question

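I'm looking for a way to optimize an algorithm I've already written. As the title of my question says, I'm dealing with comma-delimited strings that sometimes contain any number of embedded commas. This is all being done in a big-data context, so speed matters. What I have does everything I need; however, I have to believe there is a faster way. If you have any suggestions, I'd love to hear them. Thanks in advance.

Code: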

import os,re


# raw strings keep \s, \w etc. as regex escapes instead of (invalid) string escapes
commaProblemA=re.compile(r'^"[\s\w\-()/*.@!#%^\'&$\{\}|<>:0-9]+$')

commaProblemB=re.compile(r'^[\s\w\-()/*.@!#%^\'&$\{\}|<>:0-9]*"$')

#example string
#these are read from a file in practice
z=',,"N/A","DWIGHT\'s BEET FARM,INC.","CAMUS,ALBERT",35.00,0.00,"NIETZSCHE,FRIEDRICH","God, I hope this works, fast.",,,35.00,,,"",,,,,,,,,,,"20,4,2,3,2,33","223,2,3,,34 00:00:00:000000",,,,,,,,,,,,0,,,,,,"ERW-400",,,,,,,,,,,,,,,1,,,,,,,"BLA",,"IGE6560",,,,'

testList=z.split(',')


for i in testList:
    if re.match(commaProblemA,i):
       startingIndex=testList.index(i)
       endingIndex=testList.index(i)
       count=0
       while True:
           endingIndex+=1
           if re.match(commaProblemB,testList[endingIndex]):
               diff=endingIndex-startingIndex
               while count<diff:             
                   testList[startingIndex]=(testList[startingIndex]+","+testList[startingIndex+1])
                   testList.pop(startingIndex+1)
                   count+=1                   
               break




print(str(testList))
print(len(testList))

Answer 1

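If you really want to do this yourself rather than use a library, here are a few hints first:

Don't use split() on CSV data. (It is also bad for performance.)
For performance: don't use regular expressions.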
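
For comparison, the library route mentioned above can be sketched with Python's standard csv module. This is a minimal sketch, not part of the original answer; the helper name parse_line and the shortened sample string are assumptions made for illustration.

```python
import csv


def parse_line(line):
    """Parse a single CSV line with the standard library (hypothetical helper)."""
    # csv.reader accepts any iterable of strings, so a one-element list works.
    return next(csv.reader([line]))


# Shortened version of the sample string from the question.
sample = ',,"N/A","DWIGHT\'s BEET FARM,INC.","CAMUS,ALBERT",35.00'
fields = parse_line(sample)
print(fields)
print(len(fields))
```

When reading from a file, passing the open file object straight to csv.reader avoids building a list per line.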

The usual way to scan the data looks like this (pseudocode, assuming single-line CSV records):

for each line
    bool insideQuotes = false;
    while not end of line {
        currentChar = next character of line;

        if currentChar == '"'
            insideQuotes = !insideQuotes; // ( ! meaning 'not')
            // this also handles the case of escaped quotes inside the field
            //    (if escaped with an extra quote): the pair toggles the flag twice

        else if currentChar == ',' and !insideQuotes
            // separator found - handle field
    }
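
A possible Python rendering of the pseudocode above, as a sketch under a few assumptions: the function name split_csv_line is made up for illustration, the enclosing quotes are stripped from each field, and a doubled quote inside a quoted field is collapsed to one literal quote (the pseudocode itself only toggles the flag).

```python
def split_csv_line(line, sep=",", quote='"'):
    """Split one CSV line in a single pass, honoring quoted fields."""
    fields = []
    current = []            # characters of the field being built
    inside_quotes = False
    i = 0
    n = len(line)
    while i < n:
        ch = line[i]
        if ch == quote:
            if inside_quotes and i + 1 < n and line[i + 1] == quote:
                # Escaped quote ("") inside a quoted field: keep one quote.
                current.append(quote)
                i += 1
            else:
                inside_quotes = not inside_quotes
        elif ch == sep and not inside_quotes:
            # Separator found outside quotes: the field is complete.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))  # last field has no trailing separator
    return fields


print(split_csv_line(',,"N/A","CAMUS,ALBERT",35.00'))
```

No regular expression and no split() is involved; every character is inspected exactly once.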

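For even better performance, you can open the file in binary mode and handle the line breaks yourself while scanning. That way you don't need to scan for a line, copy it into a buffer (for example with getline() or a similar function), and then scan that buffer again to extract the fields.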
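
The single-pass idea can be sketched in Python roughly as follows. The name iter_csv_records and the chunk size are assumptions, and the sketch simplifies in two ways: fields are returned as bytes, and doubled quotes are dropped rather than preserved. In pure Python a per-byte loop like this is likely slower than the C-implemented csv module, so treat it as an illustration of the technique (which pays off mainly in compiled languages) rather than a tuned implementation.

```python
import io


def iter_csv_records(stream, chunk_size=1 << 16):
    """Yield each record as a list of bytes fields from a binary stream."""
    QUOTE, COMMA, LF, CR = 0x22, 0x2C, 0x0A, 0x0D
    fields = []
    current = bytearray()   # bytes of the field being built
    inside_quotes = False
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for byte in chunk:  # iterating bytes yields integers
            if byte == QUOTE:
                inside_quotes = not inside_quotes
            elif inside_quotes:
                current.append(byte)      # commas and newlines are data here
            elif byte == COMMA:
                fields.append(bytes(current))
                current.clear()
            elif byte == LF:
                fields.append(bytes(current))
                current.clear()
                yield fields              # end of record
                fields = []
            elif byte != CR:              # drop the CR of CRLF line endings
                current.append(byte)
    if current or fields:                 # last record without trailing newline
        fields.append(bytes(current))
        yield fields


data = b'a,"b,c",d\r\n1,"line\nbreak",3\n'
for record in iter_csv_records(io.BytesIO(data)):
    print(record)
```

With a real file, the stream would come from open(path, "rb"); because the quote state lives outside the chunk loop, a quoted field may span chunk boundaries and embedded line breaks.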

Parsing comma-delimited strings with embedded commas using Python
