Question

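I have a CSV dataset of 2,000 rows with one messy column that mixes first names and surnames. In this column, I need to separate the first name from the surname. To do that, I have a database of all the first names given in France over the last twenty years.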

So, the source database looks like:

"name"; "town"
"Johnny Aaaaaa"; "Bordeaux"
"Bbbb Tom";"Paris"
"Ccccc Pierre Dddd" ; "Lyon"
...

I would like to obtain something like:

"surname"; "firstname"; "town"
"Aaaaaa"; "Johnny "; "Bordeaux"
"Bbbb"; "Tom"; "Paris"
"Ccccc Dddd" ; "Pierre"; "Lyon"
...

And my first-name reference database:

"firstname"; "sex"
"Andre"; "M"
"Bob"; "M"
"Johnny"; "M"
...

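Technically, I have to compare each row of the first database with every entry of the second one, in order to identify which character string corresponds to the first name... I have no idea how to do this.

Any ideas are welcome... thanks.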

Answer 1

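It looks like you want to:

read the data from a file, input.csv
extract the name and split it into first name and surname
use the first name (for example, to look up the sex)
possibly write the data back to a new CSV file, or print it.

You can do that as follows. Regular expressions would allow more sophisticated splitting, but here is something basic using the strip method: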

# Read all lines of the semicolon-separated input file
with open('input.csv', 'r') as inFile:
    rows = inFile.readlines()

newData = []
# Skip the header row
for row in rows[1:]:
    # Remove the newline at the end of the line and split on ;
    data = row.rstrip('\n').split(';')
    if len(data) < 2:
        continue  # skip blank or malformed lines

    # Remove extra spaces and quotes, then split the name into words
    name = data[0].strip().strip('"').split()
    if len(name) < 2:
        continue  # a single word cannot be split into two parts

    # This assumes the order "Surname Firstname"; swap the indexes otherwise
    fname = name[1]
    lname = name[0]
    city = data[1].strip().strip('"')

    # Now you can get the sex info from your other database
    sex = 'M'  # replace this with your db calls
    newData.append([fname, lname, sex, city])

# Write everything to a new CSV file (fields are separated by commas)
with open('output.csv', 'w') as outFile:
    for row in newData:
        outFile.write(','.join(row) + '\n')
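
As a variant, Python's standard csv module can take care of the quotes and the ; delimiter instead of splitting each line by hand. The following is only a minimal sketch: the helper names read_records and write_records are made up for this illustration, and an in-memory io.StringIO stands in for input.csv so that the example runs on its own.

```python
import csv
import io

def read_records(stream):
    """Return (name, town) tuples from a ;-separated stream, skipping the header."""
    reader = csv.reader(stream, delimiter=';', skipinitialspace=True)
    next(reader, None)  # skip the header row
    records = []
    for fields in reader:
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        # strip() removes any space left around the quoted values
        records.append((fields[0].strip(), fields[1].strip()))
    return records

def write_records(stream, records):
    """Write rows to a ;-separated stream, quoting every field."""
    writer = csv.writer(stream, delimiter=';', quoting=csv.QUOTE_ALL,
                        lineterminator='\n')
    writer.writerows(records)

# Sample rows copied from the question; a real run would pass an open file
sample = io.StringIO(
    '"name"; "town"\n'
    '"Johnny Aaaaaa"; "Bordeaux"\n'
    '"Bbbb Tom";"Paris"\n'
    '"Ccccc Pierre Dddd" ; "Lyon"\n'
)
records = read_records(sample)
out = io.StringIO()
write_records(out, records)
print(out.getvalue())
```

With a real file, open it with newline='' as the csv documentation recommends, and pass the file object where the StringIO is used here.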

Answer 2

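OK. In the end, I went with the "brute force" approach: every term of every row is compared with the 11,000 keys of my second database (converted into a dictionary). Not smart, but effective.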

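# `input` holds the CSV rows and `dict` the first-name dictionary
# (note that both names shadow Python built-ins)
for row in input:
    # Lower-case the name column and split it into words
    splitted = row[0].lower().split()
    for s in splitted:
        # Compare each word with every key of the dictionary
        for cle, valeur in dict.items():
            if cle == s:
                print("{} >> {}".format(cle, valeur))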
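
The same comparison can probably be done without the inner loop, because testing whether a key is in a dictionary is a direct hash lookup. Here is a minimal sketch of that idea: the function name split_name and the sample dictionary are hypothetical, and it assumes the dictionary keys are stored in lower case, as the lower() call above suggests.

```python
def split_name(full_name, first_names):
    """Return (surname, firstname); words found in first_names form the first name."""
    first, last = [], []
    for word in full_name.split():
        # Membership test on a dict is a hash lookup, so no inner loop is needed
        if word.lower() in first_names:
            first.append(word)
        else:
            last.append(word)
    return ' '.join(last), ' '.join(first)

# Hypothetical sample of the reference data, keys lower-cased: firstname -> sex
first_names = {'andre': 'M', 'bob': 'M', 'johnny': 'M', 'tom': 'M', 'pierre': 'M'}

for name in ['Johnny Aaaaaa', 'Bbbb Tom', 'Ccccc Pierre Dddd']:
    print(split_name(name, first_names))
```

One limitation to keep in mind: a surname that also exists as a first name would end up in the first-name part, so such rows would still need a manual check.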

Any ideas for a nicer solution are still welcome.

Separating first name / surname by comparison
