I have two databases (txt files). One has two tab-separated columns containing a name and an ID.
name1 \t ID1
name1 \t ID2
name2 \t ID9
name2 \t ID40
name3 \t ID3
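The other database has the same IDs as the first database in its first column, while its second column lists comma-separated IDs of the same type (these are children of the first ID, because the second database is hierarchical).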
ID1 \t ID1,ID2,ID3
ID2 \t ID2, ID9
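What I want is a third database in the same format as the second one, but with the child IDs in the second column replaced by the names from the first database. For example: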
ID1 \t name1,name2,name3
ID2 \t name1,name2
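Is there a way to do this? I'm a beginner. I have to map the IDs before using a web service, but it needs this custom format for further analysis, and I don't know where to start.
Thanks in advance!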
Answer 0 (score: 0)
import csv

# Reading the first db is simple since there's only a fixed delimiter.
# Use the csv module to split the lines and build a dictionary that maps ID to name.
id_dictionary = {}
with open('db_1.txt', 'r') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for line in reader:
        # Column 0 is the name, column 1 is the ID; strip stray spaces
        id_dictionary[line[1].strip()] = line[0].strip()

# We can again split on tab, but that returns 'ID1,ID2,ID3' etc. as a single
# string that we call split() on later.
row_data = []
with open('db_2.txt', 'r') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for line in reader:
        # ID remains unchanged, so keep the first value
        row = [line[0]]
        # Split the string into individual IDs, dropping stray spaces
        # (the sample data contains 'ID2, ID9')
        id_codes = [item.strip() for item in line[1].split(',')]
        # List comprehension to look up each ID in the dictionary and return the
        # name stored against it; an ID with no entry is kept as-is instead of None
        translated = [id_dictionary.get(item, item) for item in id_codes]
        # Add translated to the list that we are using to represent a row
        row.extend(translated)
        # Append the row to our collection of rows
        row_data.append(row)

with open('db_3.txt', 'w') as outfile:
    for row in row_data:
        outfile.write(row[0])
        outfile.write('\t')
        outfile.write(','.join(map(str, row[1:])))  # Join values by a comma
        outfile.write('\n')
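
To check the mapping logic without creating any files, the same idea can be run against the sample rows from the question held in memory. This is only a sketch: the helper names `build_id_to_name` and `translate_rows` are made up for illustration, and keeping an unknown ID unchanged is an assumption about what is wanted, not something the question specifies.

```python
import csv
import io


def build_id_to_name(lines):
    """Map each ID (second column) to its name (first column)."""
    reader = csv.reader(lines, delimiter='\t')
    return {row[1].strip(): row[0].strip() for row in reader if len(row) >= 2}


def translate_rows(lines, id_to_name):
    """Yield 'parent_id<TAB>name,name,...' for each row of the hierarchy file."""
    reader = csv.reader(lines, delimiter='\t')
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        child_ids = [item.strip() for item in row[1].split(',')]
        # Assumption: an ID missing from the first database is left unchanged
        names = [id_to_name.get(child, child) for child in child_ids]
        yield row[0].strip() + '\t' + ','.join(names)


# Sample data copied from the question (real tabs instead of " \t ")
db_1 = io.StringIO("name1\tID1\nname1\tID2\nname2\tID9\nname2\tID40\nname3\tID3\n")
db_2 = io.StringIO("ID1\tID1,ID2,ID3\nID2\tID2, ID9\n")

id_to_name = build_id_to_name(db_1)
result = list(translate_rows(db_2, id_to_name))
for line in result:
    print(line)
```

With the sample rows exactly as given, ID2 belongs to name1, so the first output line is `ID1<TAB>name1,name1,name3` and the second is `ID2<TAB>name1,name2`.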
Answer 1 (score: 0)
You can try this awk one-liner:
awk -v FS="\t|," -v OFS="," 'FILENAME=="file_name.txt" {str[$2]=$1;next;} {for(i=2;i<=NF;i++) {sub($i,str[$i],$i)};a=$1;$1="";print a"\t"$0}' file_name.txt fileID.txt|sed -e 's/,//' -e 's/,$//'
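In the awk command, "file_name.txt" is the txt file whose first column holds the names (name1, name2, ...), and "fileID.txt" is the file whose first column holds the IDs (ID1, ID2, ...).
sed is used to trim the unneeded commas at the beginning and end of the list.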
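
To see where the leading comma comes from, here is a small Python sketch that mimics what awk and sed do to one line. It is an illustration, not a replacement for the one-liner, and the field values are taken from the question's sample data. Setting `$1=""` leaves an empty first field, and awk then rebuilds `$0` by joining all fields with `OFS`.

```python
import re

# Fields of one fileID.txt line after the awk loop has swapped IDs for names
# (sample values taken from the question's data).
fields = ["ID1", "name1", "name1", "name3"]

parent_id = fields[0]                    # a=$1
fields[0] = ""                           # $1="" leaves an empty first field
rebuilt = ",".join(fields)               # awk rebuilds $0 using OFS=","
awk_output = parent_id + "\t" + rebuilt  # print a"\t"$0

# sed -e 's/,//' drops the first comma on the line,
# sed -e 's/,$//' drops a comma at the very end, if there is one
cleaned = re.sub(r",$", "", awk_output.replace(",", "", 1))

print(repr(awk_output))
print(repr(cleaned))
```

The first `repr` shows the stray comma right after the tab; the second shows the line once both sed expressions have been applied.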
Answer 2 (score: 0)
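# Suppose the database files are f1.txt and f2.txt; the result goes to f3.txt.
# Use a dict to hold the ID -> name pairs.
def getArr(f):
    """Read a tab-separated file object into a list of column lists."""
    arr = []
    for line in f:
        line = line.rstrip('\n')
        if line:  # skip blank lines
            arr.append(line.split('\t'))
    return arr

if __name__ == "__main__":
    with open("f1.txt") as f1, open("f2.txt") as f2:
        arr1 = getArr(f1)
        arr2 = getArr(f2)
    # Column 0 of f1.txt is the name, column 1 is the ID
    dic = {}
    for array in arr1:
        dic[array[1]] = array[0]
    with open('f3.txt', 'w') as f3:
        for i in arr2:
            # Strip spaces so that 'ID2, ID9' yields 'ID2' and 'ID9'
            keys = [key.strip() for key in i[1].split(',')]
            print(keys)  # debug output
            # Fall back to the ID itself when it has no name in f1.txt
            names = [dic.get(key, key) for key in keys]
            f3.write(i[0] + '\t' + ','.join(names) + '\n')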