Reading a CSV file and filtering the results

Date: 2016-07-21 17:44:09

Tags: python csv parsing

I'm writing a script in which one function reads a CSV file that has URLs in one of its columns. Unfortunately, the system that creates these CSVs doesn't put double quotes around the values in the URL column, so whenever a URL contains a comma it breaks all my CSV parsing.

Here is the code I'm using:

with open(accesslog, 'r') as csvfile, open('results.csv', 'w') as enhancedcsv:
    reader = csv.DictReader(csvfile)
    for row in reader:
        self.uri = row['URL']
        self.OriCat = row['Category']
        self.query(self.uri)
        print self.URL+","+self.ServerIP+","+self.OriCat+","+self.NewCat

Here is a sample URL that breaks the parsing; this URL sits in the column named "URL". (Note the trailing comma.)

ams1-ib.adnxs.com/ww=1238&wh=705&ft=2&sv=43&tv=view5-1&ua=chrome&pl=mac&x=1468251839064740641,439999,v,mac,webkit_chrome,view5-1,0,,2,

The column right after URL always contains a numeric value between parentheses, for example (9999), so that can be used to tell where a comma-containing URL ends.
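That marker can be exploited directly: everything before the first field of the form (digits) is the URL. A hedged sketch with an invented sample line (not from the post); note it breaks if the URL itself ever contains a parenthesized number:

```python
import re

# Invented sample line: unquoted URL with embedded commas, followed by
# the "(9999)"-style numeric column, then the rest of the row.
line = 'ams1-ib.adnxs.com/ww=1238&pl=mac&x=1,439999,v,mac,(9999),allowed'

# Lazily match everything up to the first field of the form "(digits)".
m = re.match(r'(?P<url>.+?),(?P<marker>\(\d+\)),?(?P<rest>.*)$', line)
url, rest = m.group('url'), m.group('rest')
# url  -> 'ams1-ib.adnxs.com/ww=1238&pl=mac&x=1,439999,v,mac'
# rest -> 'allowed'
```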

How can I handle this situation with the csv module?

2 answers:

Answer 0 (score: 1)

You'll have to do a bit more of the work by hand. Try this:

def process(lines, delimiter=','):
    header = None
    url_index_from_start = None
    url_index_from_end = None
    for line in lines:
        if not header:
            header = [l.strip() for l in line.split(delimiter)]
            url_index_from_start = header.index('URL')
            # how far the URL column sits from the end of a well-formed row
            url_index_from_end = len(header) - url_index_from_start
        else:
            data = [l.strip() for l in line.split(delimiter)]

            url_from_start = url_index_from_start
            # extra commas inside the URL push the row's tail to the right,
            # so locate the URL's last piece by counting from the end
            url_from_end = len(data) - url_index_from_end

            # fields before the URL, fields after it, then the rejoined URL
            values = data[:url_from_start] + data[url_from_end+1:] + [delimiter.join(data[url_from_start:url_from_end+1])]
            keys = header[:url_index_from_start] + header[url_index_from_start+1:] + [header[url_index_from_start]]

            yield dict(zip(keys, values))

Usage:

lines = ['Header1, Header2, URL, Header3',
         'Content1, "Content2", abc,abc,,abc, Content3']

result = list(process(lines))

assert result[0]['Header1'] == 'Content1'
assert result[0]['Header2'] == '"Content2"'
assert result[0]['Header3'] == 'Content3'
assert result[0]['URL'] == 'abc,abc,,abc'

print(result)

Result:

>>> [{'URL': 'abc,abc,,abc', 'Header2': '"Content2"', 'Header3': 'Content3', 'Header1': 'Content1'}]
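The same rejoin-by-column-count idea can also be phrased as a small generator over csv.reader; a minimal sketch, assuming (as above) that URL is the only column that may contain commas, and using invented column names:

```python
import csv
import io

def read_loose_urls(fileobj):
    """Yield one dict per row, folding surplus fields back into URL."""
    reader = csv.reader(fileobj)
    header = next(reader)
    url_idx = header.index('URL')
    tail = len(header) - url_idx - 1          # columns to the right of URL
    for fields in reader:
        extra = len(fields) - len(header)     # commas hidden inside the URL
        url = ','.join(fields[url_idx:url_idx + extra + 1])
        yield dict(zip(header,
                       fields[:url_idx] + [url] + fields[len(fields) - tail:]))

# Invented sample data: 3 declared columns, URL containing 3 raw commas.
sample = io.StringIO('Category,URL,Code\nads,a.com/x=1,2,,3,(9999)\n')
rows = list(read_loose_urls(sample))
# rows[0]['URL'] -> 'a.com/x=1,2,,3'
```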

Answer 1 (score: 0)

Have you considered using Pandas to read the data?

Another possible solution is to preprocess the data with a regular expression...

import re

#`regex` is left undefined here; it must match your unquoted URLs
with open(filein, 'r') as f:
    filedata = f.read()

#make a list of everything you want to change
old = re.findall(regex, filedata)

#append quotes and create a new list
new = []
for url in old:
    new.append('"' + url + '"')

#combine the lists, deduplicated so no URL gets quoted twice
old_new = dict(zip(old, new))

#then use the mapping to update the file, one replacement after another
for o, n in old_new.items():
    filedata = filedata.replace(o, n)

with open(filein, 'w') as f:
    f.write(filedata)