I am trying to parse this file:
readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
A7XSJ:01332:11633 unclassified 0 0 0 0 137 1
A7XSJ:01333:11603 unclassified 0 0 0 0 237 1
A7XSJ:01336:11606 unclassified 0 0 0 0 26 1
A7XSJ:01338:11596 unclassified 0 0 0 0 214 1
A7XSJ:01348:11595 samp_72_20190715_11 2019071572 196 196 29 72 5
A7XSJ:01348:11595 samp_74_20190715_14 2019071574 196 196 29 72 5
A7XSJ:01348:11595 species 28901 196 196 29 72 5
A7XSJ:01350:11601 species 28901 169 169 28 276 3
A7XSJ:01351:11603 samp_72_20190715_8 2019071572 55696 55696 251 251 4
A7XSJ:01351:11603 species 28901 55696 55696 251 251 4
A7XSJ:01359:11613 unclassified 0 0 0 0 206 1
A7XSJ:01361:11598 samp_72_20190715_5 2019071572 11881 11881 124 226 3
A7XSJ:01361:11598 species 28901 11881 11881 124 226 3
A7XSJ:01361:11598 samp_74_20190715_5 2019071574 11881 11881 124 226 3
A7XSJ:01362:11618 unclassified 0 0 0 0 207 1
A7XSJ:01364:11635 unclassified 0 0 0 0 141 1
A7XSJ:01364:11637 unclassified 0 0 0 0 112 1
A7XSJ:01369:11611 unclassified 0 0 0 0 158 1
A7XSJ:01375:11615 unclassified 0 0 0 0 118 1
A7XSJ:01377:11616 unclassified 0 0 0 0 115 1
A7XSJ:01381:11632 unclassified 0 0 0 0 201 1
A7XSJ:01332:11649 species 28901 53361 53361 246 256 4
A7XSJ:01332:11649 samp_72_20190715_29 2019071572 53361 53361 246 256 4
A7XSJ:01332:11649 samp_74_20190715_30 2019071574 53361 53361 246 256 4
A7XSJ:01334:11655 genus 590 9604 0 113 264 1
A7XSJ:01335:11668 samp_72_20190715_17 2019071572 25281 25281 174 259 2
A7XSJ:01335:11668 species 28901 25281 25281 174 259 2
A7XSJ:01342:11657 unclassified 0 0 0 0 187 1
A7XSJ:01343:11650 samp_72_20190715_4 2019071572 31329 31329 192 200 2
A7XSJ:01343:11650 species 28901 31329 31329 192 200 2
A7XSJ:01345:11679 unclassified 0 0 0 0 226 1
A7XSJ:01346:11642 samp_74_20190715_6 2019071574 23104 23104 167 167 3
A7XSJ:01346:11642 species 28901 23104 23104 167 167 3
A7XSJ:01346:11642 samp_72_20190715_6 2019071572 23104 23104 167 167 3
A7XSJ:01347:11650 samp_72_20190715_18 2019071572 14161 14161 134 251 2
A7XSJ:01347:11650 species 28901 14161 14161 134 251 2
A7XSJ:01347:11656 species 28901 25281 25281 174 174 2
A7XSJ:01347:11656 samp_74_20190715_2 2019071574 25281 25281 174 174 2
A7XSJ:01347:11688 unclassified 0 0 0 0 179 1
A7XSJ:01350:11657 unclassified 0 0 0 0 146 1
A7XSJ:01351:11671 unclassified 0 0 0 0 190 1
A7XSJ:01354:11685 samp_72_20190715_24 2019071572 23716 23716 169 242 3
A7XSJ:01354:11685 species 28901 23716 23716 169 242 3
to get something like this:
Description Count Percent Percent_informative
0 Unclassified 579472.0 44.36676 0.0
-1 Trash 284016.0 21.74543 0.0
28901 bmatch 216343.27 16.56413 48.87931
2019071572 samp_72_20190715 match 86973.57 6.65905 19.65029
2019071574 samp_74_20190715 match 76994.85 5.89504 17.39576
Here is my script:
pd.set_option('expand_frame_repr', False)
pd.options.mode.chained_assignment = None # default='warn'
df = pd.read_csv(dir_taxonomy+"names.dmp", sep="|", names=["Description", "Strain", "Type", "Other"], index_col=0)
df = df.replace({' ':''}, regex=True)
df = df[(df["Type"] == "scientific name")]
df = df.drop(df.columns[[1, 2, 3]], axis=1)
df_test = pd.read_csv(file_test, header=0, sep='\t', index_col=0)
df.loc[0] = ['Unclassified']
df.loc[-1] = ['Trash']
df['Count'] = 0.0
for index, row in df_test.iterrows():
    if row['seqID'] != 'unclassified':
        if row['hitLength'] >= 30 and row['hitLength']/row['queryLength'] >= 0.7:
            df.at[row['taxID'], 'Count'] = df.at[row['taxID'], 'Count'] + (1/row['numMatches'])
        else:
            df.at[-1, 'Count'] = df.at[-1, 'Count'] + (1/row['numMatches'])
    else:
        df.at[row['taxID'], 'Count'] = df.at[row['taxID'], 'Count'] + (1/row['numMatches'])
df = df[(df["Count"] != 0.0)]
df['Percent'] = round(df['Count']*100/sum(df['Count']),5)
df['Percent_informative'] = round(df['Count']*100/sum(df['Count'][:-2]),5)
df.at[0, 'Percent_informative'] = 0
df.at[-1, 'Percent_informative'] = 0
df['Count'] = round(df['Count'],2)
df = df.sort_values(['Count'], ascending=[0])
df.to_csv(file_output, header=True, index=True, sep='\t')
I get this error:
Traceback (most recent call last):
  File "filter_test.py", line 145, in <module>
    main(sys.argv[1:])
  File "filter_test.py", line 124, in main
    df.at[row['taxID'], 'Count'] = df.at[row['taxID'], 'Count'] + (1/row['numMatches'])
  File "/home/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 2270, in __getitem__
    return self.obj._get_value(*key, takeable=self._takeable)
  File "/home/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2771, in _get_value
    return engine.get_value(series._values, index)
  File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 127, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 153, in pandas._libs.index.IndexEngine._get_loc_duplicates
  File "pandas/_libs/index_class_helper.pxi", line 122, in pandas._libs.index.Int64Engine._maybe_get_bool_indexer
KeyError: 1
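For reference, a KeyError from df.at generally means the requested row label was not found in the index. Here is a minimal sketch with made-up data that shows the same kind of failure next to a guarded update. The table contents and the add_count helper are hypothetical and are not taken from names.dmp or from my script.

```python
import pandas as pd

# Hypothetical lookup table indexed by taxID (not the real names.dmp contents).
names = pd.DataFrame(
    {"Description": ["Unclassified", "bmatch"], "Count": [0.0, 0.0]},
    index=[0, 28901],
)


def add_count(table, tax_id, weight):
    """Add weight to the Count of tax_id; return False if tax_id is not in the index."""
    if tax_id not in table.index:
        return False
    table.at[tax_id, "Count"] += weight
    return True


ok = add_count(names, 28901, 0.5)   # label exists, Count is updated
missing = add_count(names, 1, 1.0)  # label 1 is absent, nothing is updated

# Reading a label that is absent from the index raises KeyError.
try:
    names.at[1, "Count"]
    raised = False
except KeyError:
    raised = True
```

Printing the taxIDs for which add_count returns False would show which labels from the test file have no row in the table.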
I searched online for this error and tried several things, such as:
loc
df.iloc[0]
replacing tabs with commas in the script and in the input file
sed -i "" $'s/,/ /g' test.tsv
sed -i "s/\t/,/g" test.tsv
printing the different variables...
But I don't understand why this line is a problem:
df.at[row['taxID'], 'Count'] = df.at[row['taxID'], 'Count'] + (1/row['numMatches'])
or how to fix it.
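To make the intended counting rule explicit, here is a rough sketch of the same logic without the per-row df.at updates. It assumes the input is tab-separated with the column names shown above. The sample rows, the small names table and the count_by_taxid helper are made up for illustration. Because it groups and joins rather than looking up each label, a taxID with no description ends up as NaN rather than raising KeyError.

```python
import io
import pandas as pd

# Made-up sample in the same layout as the input file (tab-separated by assumption).
sample = (
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches\n"
    "r1\tunclassified\t0\t0\t0\t0\t137\t1\n"
    "r2\tspecies\t28901\t169\t169\t28\t276\t3\n"
    "r3\tsamp_72_20190715_8\t2019071572\t55696\t55696\t251\t251\t4\n"
    "r3\tspecies\t28901\t55696\t55696\t251\t251\t4\n"
)
hits = pd.read_csv(io.StringIO(sample), sep="\t")


def count_by_taxid(hits):
    """Sum 1/numMatches per taxID; classified hits failing the filter go to -1."""
    weight = 1.0 / hits["numMatches"]
    classified = hits["seqID"] != "unclassified"
    good = (hits["hitLength"] >= 30) & (
        hits["hitLength"] / hits["queryLength"] >= 0.7
    )
    # Keep the taxID for unclassified rows and for good hits, otherwise use -1 (Trash).
    key = hits["taxID"].where(~classified | good, -1)
    return weight.groupby(key).sum()


counts = count_by_taxid(hits)

# Hypothetical description table; 2019071572 is left out on purpose.
names = pd.DataFrame(
    {"Description": ["Unclassified", "Trash", "bmatch"]}, index=[0, -1, 28901]
)
# Left join on the index: missing descriptions become NaN, no KeyError.
report = counts.rename("Count").to_frame().join(names)
```

The Percent and Percent_informative columns could then be computed from report["Count"] the same way as in my script.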