I have a CSV file with 24 columns and roughly 30,000 rows of data. The last column is a geography column that looks like this:
Ethiopia
IL
IL
TX
TX
MD
NY
NY
Ethiopia
Ethiopia
Sweden
CA
CA
HI
Latvia
OH
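Now I want the whole CSV, with all of its columns, to contain only the rows whose location is in the US. Those locations are the 2-character state abbreviations (CA, HI, OH, etc.).
Basically, I want to remove every row that is not US-related. Or, even better if possible, put the US-based rows first and arrange everything else after them at the end of the CSV.
Here is my code so far: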
import csv

ask = "Y"
while ask != "N":
    inputfile = input("Please enter filename: ")
    filename = open(inputfile, "r")
    data = []
    with filename as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            if len(row[24]) == 3:
                data = row[24]
                datalist = row[0:23].join(data)
                output = open("Newly Created Data.csv","w")
                output.write(datalist)
    print ("Done.")
    output.close()
    ask = input("Another file, Y or N? ")
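It correctly picks out the data in the 24th column by reading only the US locations, but I don't know how to sort the rest of the file so that the other 23 columns stay matched to the US locations.
I'm using Python 3. Thanks.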
Answer 0 (score: 0)
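import csv

# Two-letter abbreviations of the 50 US states
states = {"AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"}

# newline='' lets the csv module handle line endings itself
with open('file.txt', newline='') as f, open('ofile.txt', 'w', newline='') as o:
    reader = csv.reader(f)
    writer = csv.writer(o)
    # False sorts before True, so state rows come first; sorted() is stable,
    # so the original order is kept within each group
    writer.writerows(sorted(reader, key=lambda row: row[-1] not in states))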
This will sort a file such as
A,B,C,Ethiopia
A,B,C,IL
A,B,C,IL
A,B,C,TX
A,B,C,TX
A,B,C,MD
A,B,C,NY
A,B,C,NY
A,B,C,Ethiopia
A,B,C,Ethiopia
A,B,C,Sweden
A,B,C,CA
A,B,C,CA
A,B,C,HI
A,B,C,Latvia
A,B,C,OH
into
A,B,C,IL
A,B,C,IL
A,B,C,TX
A,B,C,TX
A,B,C,MD
A,B,C,NY
A,B,C,NY
A,B,C,CA
A,B,C,CA
A,B,C,HI
A,B,C,OH
A,B,C,Ethiopia
A,B,C,Ethiopia
A,B,C,Ethiopia
A,B,C,Sweden
A,B,C,Latvia
Reading the result back with:
with open('ofile.txt', newline='') as f:
    for line in csv.reader(f):
        print(line)
produces:
>>>
['A', 'B', 'C', 'IL']
['A', 'B', 'C', 'IL']
['A', 'B', 'C', 'TX']
['A', 'B', 'C', 'TX']
['A', 'B', 'C', 'MD']
['A', 'B', 'C', 'NY']
['A', 'B', 'C', 'NY']
['A', 'B', 'C', 'CA']
['A', 'B', 'C', 'CA']
['A', 'B', 'C', 'HI']
['A', 'B', 'C', 'OH']
['A', 'B', 'C', 'Ethiopia']
['A', 'B', 'C', 'Ethiopia']
['A', 'B', 'C', 'Ethiopia']
['A', 'B', 'C', 'Sweden']
['A', 'B', 'C', 'Latvia']
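If the non-US rows should be dropped rather than moved to the end, which the question also asks about, the same membership test can act as a filter. The following is only a minimal sketch, assuming the location is always the last column. The helper name keep_us_rows, the shortened US_STATES set, and the in-memory sample data are illustrative and not part of the answer above.

```python
import csv
import io

# Shortened for the example; in practice use the full set of state abbreviations
US_STATES = {"CA", "HI", "IL", "MD", "NY", "OH", "TX"}

def keep_us_rows(rows, states=US_STATES):
    """Yield only the rows whose last column is a known state abbreviation."""
    for row in rows:
        # csv.reader returns [] for blank lines, so skip those
        if row and row[-1].strip() in states:
            yield row

# In-memory stand-ins for the input and output files
sample = io.StringIO("A,B,C,Ethiopia\nA,B,C,IL\nA,B,C,Sweden\nA,B,C,CA\n")
out = io.StringIO()

writer = csv.writer(out, lineterminator="\n")
writer.writerows(keep_us_rows(csv.reader(sample)))
print(out.getvalue())
```

With real files, the two StringIO objects would be replaced by open(..., newline='') handles, as in the code above. Because the rows are filtered one at a time, nothing needs to be held in memory.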
Answer 1 (score: 0)
For a pure standard-library solution, maybe something like
import csv

with open('location.csv', newline='') as fp_in:
    reader = csv.reader(fp_in, delimiter=',')
    data = list(reader)

data.sort(key=lambda x: (len(x[-1].strip()) != 2, x[-1].strip()))

with open("locout.csv", "w", newline='') as fp_out:
    writer = csv.writer(fp_out, delimiter=',')
    writer.writerows(data)
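The sort key function lambda x: (len(x[-1].strip()) != 2, x[-1].strip())
means that the data is sorted first by whether the last column has two characters, which puts the 2-character locations first, and then by the name itself (which effectively alphabetizes them, at least if they all start with a capital letter).
I'm assuming the file isn't too large: 30,000 rows isn't many, even with 24 columns, so we may as well work entirely in memory.
(Aside: if you do a lot of CSV manipulation, you might be interested in the pandas library, which makes many operations much simpler than they would otherwise be.)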
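To make the tuple key concrete, here is a small sketch on made-up rows. The sample data and the name sort_key are hypothetical; only the key expression comes from the answer.

```python
# Hypothetical sample rows; only the last column matters for the key
rows = [
    ["A", "Ethiopia"],
    ["B", "TX"],
    ["C", "Sweden"],
    ["D", "CA"],
    ["E", "IL "],  # the trailing space is ignored because of strip()
]

def sort_key(row):
    location = row[-1].strip()
    # Tuples compare element by element, and False sorts before True,
    # so 2-character codes come first, then ties are broken by name
    return (len(location) != 2, location)

ordered = sorted(rows, key=sort_key)
for row in ordered:
    print(row)
```

Note that a length test also treats any other two-character value as a state, whereas a lookup in an explicit set of abbreviations does not.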