I have a CSV file. In this file, a row is considered a duplicate if the values of its first, fifth, and thirteenth attributes match those of another row. In that case the duplicate row should be removed. How can I do this in Python?
I wrote some code, but it seems to loop forever:
import csv

rows = csv.reader(open("items4.csv", "r"))
newrows = []
i = 0
for row in rows:
    if i == 0:
        newrows.append(row)
        i = i + 1
        continue
    for row1 in newrows:
        if row[1] != row1[1] and row[5] != row1[5] and row[13] != row1[13]:
            newrows.append(row)
writer = csv.writer(open("items5.csv", "w"))
writer.writerows(newrows)
Answer 0 (score: 1)
I would change your logic slightly to introduce a flag, like this (the else clause below belongs to the for loop, not the if, and runs only when the loop finishes without hitting break, which plays the role of the flag):
    for row1 in newrows:
        if row[1] == row1[1] and row[5] == row1[5] and row[13] == row1[13]:
            break
    else:
        newrows.append(row)
The problem with your initial code is that whenever the row does not match an entry in newrows, the row is appended to newrows, which effectively extends the list you are iterating over without bound, because you keep adding values that satisfy row[1]!=row1[1] and row[5]!=row1[5] and row[13]!=row1[13].
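For reference, here is a minimal complete sketch of the corrected script built around that for/else. The file names are taken from the question; the indices [1], [5], [13] mirror the original code (note that with 0-based indexing, "first, fifth, thirteenth" would be [0], [4], [12]):

import csv

with open("items4.csv", "r", newline="") as infile:
    rows = list(csv.reader(infile))

newrows = [rows[0]]  # keep the header row
for row in rows[1:]:
    for row1 in newrows:
        if row[1] == row1[1] and row[5] == row1[5] and row[13] == row1[13]:
            break  # duplicate of an already-kept row; skip it
    else:  # runs only if the inner loop finished without break
        newrows.append(row)

with open("items5.csv", "w", newline="") as outfile:
    csv.writer(outfile).writerows(newrows)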
Answer 1 (score: 1)
@Clarence has already given a good answer.
As an alternative, pandas makes this kind of thing easier once it gets more complicated.
Suppose the column positions you want to consider are given in a list named col_list:
import pandas as pd

# --- About read_csv ---
# header and delimiter are two arguments worth considering for read_csv
df = pd.read_csv('path/to/your/file.csv')

# --- About drop_duplicates ---
# inplace=True changes df itself rather than returning a new DataFrame.
# subset takes the labels of the columns to consider; df.columns[col_list]
# turns the positions in col_list into the corresponding column labels.
# --- Important reminder! ---
# Python indices start at 0, not 1, so the first column is denoted 0 in col_list.
col_list = [0, 4, 12]  # first, fifth, and thirteenth columns, as in the question
df.drop_duplicates(subset=df.columns[col_list], inplace=True)

# --- Write your file back ---
df.to_csv('path/to/your/new_file.csv', index=False)  # index=False skips the row-index column
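One design note on drop_duplicates: by default it keeps the first occurrence of each duplicated row (keep='first'); passing keep='last' keeps the last occurrence instead, and keep=False drops every row that has a duplicate.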