I have a file with many lines that look like this (a space-delimited file):
A1BG P04217 VAR_018369 p.His52Arg Polymorphism rs893184 -
A1BG P04217 VAR_018370 p.His395Arg Polymorphism rs2241788 -
AAAS Q9NRG9 VAR_012804 p.Gln15Lys Disease - Achalasia
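How can I build a dictionary keyed on the id in the second column that stores the number (the digits between the letters) from the fourth column?
I tried this, but it gives me an index out of range error: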
lookup = defaultdict(list)
with open('humsavar.txt', 'r') as humsavarTxt:
    for line in csv.reader(humsavarTxt):
        code = re.match('[a-z](\d+)[a-z]', line[1], re.I)
        if code:
            lookup[line[-2]].append(code.group(1))
print lookup['P04217']
Answer 0 (score: 3)
Here is a variant of the original code:
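import csv, re
from collections import defaultdict

lookup = defaultdict(list)
# 'rb' is the mode the Python 2 csv module expects
with open('humsavar.txt', 'rb') as humsavarTxt:
    # split on spaces and ignore padding right after each delimiter
    reader = csv.reader(humsavarTxt, delimiter=" ", skipinitialspace=True)
    for line in reader:
        # first run of digits in the fourth column, e.g. 52 from p.His52Arg
        code = re.search(r'(\d+)', line[3])
        lookup[line[1]].append(int(code.group(1)))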
which produces:
>>> lookup
defaultdict(<type 'list'>, {'P04217': [52, 395], 'Q9NRG9': [15]})
>>> lookup['P04217']
[52, 395]
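For context, here is a small sketch of why the attempt in the question raises an index error while the variant above does not. It is written for Python 3 and uses `io.StringIO` with two of the sample lines in place of the real `humsavar.txt`; the names `SAMPLE`, `default_rows` and `space_rows` are made up for the illustration.

```python
import csv
import io

# Two of the sample lines from the question, standing in for the real file.
SAMPLE = (
    "A1BG P04217 VAR_018369 p.His52Arg Polymorphism rs893184 -\n"
    "AAAS Q9NRG9 VAR_012804 p.Gln15Lys Disease - Achalasia\n"
)

# The default delimiter is a comma, so each line comes back as ONE field
# and indexing line[1] fails with IndexError.
default_rows = list(csv.reader(io.StringIO(SAMPLE)))

# With a space delimiter, every column becomes its own field.
space_rows = list(
    csv.reader(io.StringIO(SAMPLE), delimiter=" ", skipinitialspace=True)
)

print(len(default_rows[0]))                 # number of fields with the default delimiter
print(space_rows[0][1], space_rows[0][3])   # id and fourth-column token
```

With the default settings the row has a single element, which is why `line[1]` is out of range in the question's code.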
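Answer 1 (score: 1)
If the id and the number are always in the second and fourth columns, and the file is always space-delimited, you don't need regular expressions. You can split on spaces: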
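from collections import defaultdict
lookup = defaultdict(list)
with open('humsavar.txt', 'r') as humsavarTxt:
    for line in humsavarTxt:
        fields = line.split(' ')
        # note: this stores the whole fourth-column token (e.g. 'p.His52Arg'), not just the digits
        lookup[fields[1]].append(fields[3])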
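One caveat when splitting by hand: `split(' ')` and `split()` behave differently if columns are padded with more than one space. The sketch below (Python 3) uses a made-up line with extra padding to show the difference; whether the real file is padded this way is an assumption.

```python
# A hypothetical line where columns are padded with runs of spaces.
line = "A1BG  P04217   VAR_018369 p.His52Arg\n"

# Splitting on a single space yields empty strings between padded columns
# and keeps the trailing newline on the last field.
by_single_space = line.split(' ')

# Splitting with no argument collapses runs of whitespace and strips the newline.
by_whitespace = line.split()

print(by_single_space)
print(by_whitespace)
```

If the padding assumption holds, `split()` keeps the column positions stable, so index 1 is still the id and index 3 is still the variant token.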
Answer 2 (score: 0)
If you want a plain dictionary, this works:
import re

d = {}
with open(your_file, 'rb') as f:
    for line in f:
        l = line.split()
        # first run of digits in the fourth column
        num = int(re.search(r'(\d+)', l[3]).group(1))
        d.setdefault(l[1], []).append(num)
Prints:
{'P04217': [52, 395], 'Q9NRG9': [15]}
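Both regex answers use `re.search`, whereas the question used `re.match`. A short sketch (Python 3) of the difference on one sample token, using the question's own pattern; the variable names are made up for the illustration:

```python
import re

token = "p.His52Arg"
pattern = r'[a-z](\d+)[a-z]'

# re.match only tries the pattern at the start of the string: 'p' matches
# [a-z], but the next character is '.', not a digit, so there is no match.
anchored = re.match(pattern, token, re.I)

# re.search scans the whole string and finds 's52A'.
anywhere = re.search(pattern, token, re.I)

print(anchored)
print(anywhere.group(1))
```

So even with the correct column, the question's `re.match` call would have returned `None` for tokens of this shape.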
For a non-regex solution, you can also do this:
d = {}
with open(your_file, 'rb') as f:
    for line in f:
        els = line.split()
        num = int(''.join(c for c in els[3] if c.isdigit()))
        d.setdefault(els[1], []).append(num)
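Putting the pieces together, here is a minimal sketch for Python 3 that takes any iterable of lines (an open file works too) and skips lines that are too short or have no digits in the fourth column. The function name `build_lookup` and the skipping behaviour are assumptions for the illustration, not part of the answers above.

```python
import re
from collections import defaultdict


def build_lookup(lines):
    """Map the id in column 2 to the integers found in column 4."""
    lookup = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or short line: nothing to index
        match = re.search(r'\d+', fields[3])
        if match:  # skip tokens that contain no digits
            lookup[fields[1]].append(int(match.group()))
    return lookup


# The three sample lines from the question.
sample = [
    "A1BG P04217 VAR_018369 p.His52Arg Polymorphism rs893184 -",
    "A1BG P04217 VAR_018370 p.His395Arg Polymorphism rs2241788 -",
    "AAAS Q9NRG9 VAR_012804 p.Gln15Lys Disease - Achalasia",
]
result = build_lookup(sample)
print(dict(result))
```

To run it on the real file, something like `with open('humsavar.txt') as f: result = build_lookup(f)` should work, assuming the file is plain text in the layout shown in the question.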