Why does the df.at line raise an error in Python?

Time: 2019-07-15 13:39:14

Tags: python pandas

I am trying to parse this file:

readID  seqID   taxID   score   2ndBestScore    hitLength   queryLength numMatches
A7XSJ:01332:11633   unclassified    0   0   0   0   137 1
A7XSJ:01333:11603   unclassified    0   0   0   0   237 1
A7XSJ:01336:11606   unclassified    0   0   0   0   26  1
A7XSJ:01338:11596   unclassified    0   0   0   0   214 1
A7XSJ:01348:11595   samp_72_20190715_11 2019071572  196 196 29  72  5
A7XSJ:01348:11595   samp_74_20190715_14 2019071574  196 196 29  72  5
A7XSJ:01348:11595   species 28901   196 196 29  72  5
A7XSJ:01350:11601   species 28901   169 169 28  276 3
A7XSJ:01351:11603   samp_72_20190715_8  2019071572  55696   55696   251 251 4
A7XSJ:01351:11603   species 28901   55696   55696   251 251 4
A7XSJ:01359:11613   unclassified    0   0   0   0   206 1
A7XSJ:01361:11598   samp_72_20190715_5  2019071572  11881   11881   124 226 3
A7XSJ:01361:11598   species 28901   11881   11881   124 226 3
A7XSJ:01361:11598   samp_74_20190715_5  2019071574  11881   11881   124 226 3
A7XSJ:01362:11618   unclassified    0   0   0   0   207 1
A7XSJ:01364:11635   unclassified    0   0   0   0   141 1
A7XSJ:01364:11637   unclassified    0   0   0   0   112 1
A7XSJ:01369:11611   unclassified    0   0   0   0   158 1
A7XSJ:01375:11615   unclassified    0   0   0   0   118 1
A7XSJ:01377:11616   unclassified    0   0   0   0   115 1
A7XSJ:01381:11632   unclassified    0   0   0   0   201 1
A7XSJ:01332:11649   species 28901   53361   53361   246 256 4
A7XSJ:01332:11649   samp_72_20190715_29 2019071572  53361   53361   246 256 4
A7XSJ:01332:11649   samp_74_20190715_30 2019071574  53361   53361   246 256 4
A7XSJ:01334:11655   genus   590 9604    0   113 264 1
A7XSJ:01335:11668   samp_72_20190715_17 2019071572  25281   25281   174 259 2
A7XSJ:01335:11668   species 28901   25281   25281   174 259 2
A7XSJ:01342:11657   unclassified    0   0   0   0   187 1
A7XSJ:01343:11650   samp_72_20190715_4  2019071572  31329   31329   192 200 2
A7XSJ:01343:11650   species 28901   31329   31329   192 200 2
A7XSJ:01345:11679   unclassified    0   0   0   0   226 1
A7XSJ:01346:11642   samp_74_20190715_6  2019071574  23104   23104   167 167 3
A7XSJ:01346:11642   species 28901   23104   23104   167 167 3
A7XSJ:01346:11642   samp_72_20190715_6  2019071572  23104   23104   167 167 3
A7XSJ:01347:11650   samp_72_20190715_18 2019071572  14161   14161   134 251 2
A7XSJ:01347:11650   species 28901   14161   14161   134 251 2
A7XSJ:01347:11656   species 28901   25281   25281   174 174 2
A7XSJ:01347:11656   samp_74_20190715_2  2019071574  25281   25281   174 174 2
A7XSJ:01347:11688   unclassified    0   0   0   0   179 1
A7XSJ:01350:11657   unclassified    0   0   0   0   146 1
A7XSJ:01351:11671   unclassified    0   0   0   0   190 1
A7XSJ:01354:11685   samp_72_20190715_24 2019071572  23716   23716   169 242 3
A7XSJ:01354:11685   species 28901   23716   23716   169 242 3

to get something like this:

    Description Count   Percent Percent_informative
0   Unclassified    579472.0    44.36676    0.0
-1  Trash   284016.0    21.74543    0.0
28901   bmatch  216343.27   16.56413    48.87931
2019071572  samp_72_20190715 match  86973.57    6.65905 19.65029
2019071574  samp_74_20190715 match  76994.85    5.89504 17.39576

Here is my script:

pd.set_option('expand_frame_repr', False)
pd.options.mode.chained_assignment = None  # default='warn'

df = pd.read_csv(dir_taxonomy+"names.dmp", sep="|", names=["Description", "Strain", "Type", "Other"], index_col=0)
df = df.replace({'  ':''}, regex=True)
df = df[(df["Type"] == "scientific name")]
df = df.drop(df.columns[[1, 2, 3]], axis=1)

df_test = pd.read_csv(file_test, header=0, sep='\t', index_col=0)

df.loc[0] = ['Unclassified']
df.loc[-1] = ['Trash']

df['Count'] = 0.0

for index, row in df_test.iterrows():
    if row['seqID'] != 'unclassified':
        if row['hitLength'] >= 30 and row['hitLength']/row['queryLength'] >= 0.7:
            df.at[row['taxID'], 'Count'] = df.at[row['taxID'], 'Count'] + (1/row['numMatches'])
        else:
            df.at[-1, 'Count'] = df.at[-1, 'Count'] + (1/row['numMatches'])

    else:
        df.at[row['taxID'], 'Count'] = df.at[row['taxID'], 'Count'] + (1/row['numMatches'])

df = df[(df["Count"] != 0.0)]
df['Percent'] = round(df['Count']*100/sum(df['Count']),5)

df['Percent_informative'] = round(df['Count']*100/sum(df['Count'][:-2]),5)
df.at[0, 'Percent_informative'] = 0
df.at[-1, 'Percent_informative'] = 0

df['Count'] = round(df['Count'],2)
df = df.sort_values(['Count'], ascending=[0])
df.to_csv(file_output, header=True, index=True, sep='\t')
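
For comparison, the counting rule in the loop above (add 1/numMatches to the hit's taxID, or to -1 when a classified hit fails the length thresholds) can be sketched without row-by-row `.at` updates. This is only a sketch on made-up rows in the same tab-separated layout; `sample` and `count_by_taxid` are hypothetical names, not part of the original script.

```python
import io
import pandas as pd

# Hypothetical sample rows in the same layout as the input file.
sample = (
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches\n"
    "r1\tunclassified\t0\t0\t0\t0\t137\t1\n"
    "r2\tspecies\t28901\t169\t169\t28\t276\t3\n"
    "r3\tspecies\t28901\t55696\t55696\t251\t251\t4\n"
    "r3\tsamp_72\t2019071572\t55696\t55696\t251\t251\t4\n"
)
df_test = pd.read_csv(io.StringIO(sample), header=0, sep="\t", index_col=0)


def count_by_taxid(hits):
    """Sum 1/numMatches per taxID; classified hits below the thresholds go to -1."""
    classified = hits["seqID"] != "unclassified"
    good = (hits["hitLength"] >= 30) & (hits["hitLength"] / hits["queryLength"] >= 0.7)
    # Keep the taxID for unclassified rows and for good hits, otherwise use -1 (Trash).
    target = hits["taxID"].where(~classified | good, -1)
    weights = 1 / hits["numMatches"]
    # Group on plain arrays so that repeated readID index labels play no role.
    return pd.Series(weights.to_numpy()).groupby(target.to_numpy()).sum()


counts = count_by_taxid(df_test)
print(counts)
```

The resulting Series is indexed by taxID, so it could be joined onto the names table afterwards. A taxID that is absent from that table would then show up as an unmatched row instead of raising inside a loop.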

I get this error:

Traceback (most recent call last):
  File "filter_test.py", line 145, in <module>
    main(sys.argv[1:])
  File "filter_test.py", line 124, in main
    df.at[row['taxID'], 'Count'] = df.at[row['taxID'], 'Count'] + (1/row['numMatches'])
  File "/home/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 2270, in __getitem__
    return self.obj._get_value(*key, takeable=self._takeable)
  File "/home/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2771, in _get_value
    return engine.get_value(series._values, index)
  File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 127, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 153, in pandas._libs.index.IndexEngine._get_loc_duplicates
  File "pandas/_libs/index_class_helper.pxi", line 122, in pandas._libs.index.Int64Engine._maybe_get_bool_indexer
KeyError: 1
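
Reading the traceback: `KeyError: 1` is raised from `__getitem__`, that is, while `.at` reads the current value on the right-hand side of the assignment. pandas raises this when the requested label is not in the index. The `_get_loc_duplicates` frame further suggests that the index contains repeated labels. A minimal sketch of how both conditions could be checked, using made-up stand-ins (`names`, `tax_ids`) for the real tables:

```python
import pandas as pd

# Hypothetical stand-ins: a names table indexed by taxID, and the taxIDs seen in the hits.
names = pd.DataFrame({"Count": [0.0, 0.0, 0.0]}, index=[0, 28901, 590])
tax_ids = pd.Series([0, 28901, 1, 590, 1])

# Reading a label that is absent from the index raises KeyError.
try:
    names.at[1, "Count"]
    raised = False
except KeyError:
    raised = True

# taxIDs referenced by the hits but missing from the names table.
missing = tax_ids[~tax_ids.isin(names.index)].unique().tolist()

# Whether any index label occurs more than once.
has_duplicates = not names.index.is_unique

print(raised, missing, has_duplicates)
```

If `missing` turns out to be non-empty on the real data, one option is to decide explicitly what those rows should count towards (for example a fallback label) before incrementing. Whether that is the right handling depends on the data.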

I searched online for this error and tried several other things, such as:

Positional indexing:

df.iloc[0]

Replacing tabs with commas in the script and in the input file:

sed -i "" $'s/,/ /g' test.tsv
sed -i "s/\t/,/g" test.tsv

Printing various variables...

But I don't understand why this line fails:

df.at[row['taxID'], 'Count'] = df.at[row['taxID'], 'Count'] + (1/row['numMatches'])

or how to fix it.

0 answers