I have the following DataFrame:
chr start_position end_position gene_name
0 Chr Position Ref Gene_Name
1 chr22 24128945 G nan
2 chr19 45867080 G ERCC2
3 chr3 52436341 C BAP1
4 chr7 151875065 G KMT2C
5 chr19 1206633 CGGGT STK11
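I want to convert the whole 'end_position' column so that it holds 'start_position' + len('end_position'), i.e. the start position plus the length of the string currently stored in 'end_position'. The result should be: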
chr start_position end_position gene_name
0 Chr Position Ref Gene_Name
1 chr22 24128945 24128946 nan
2 chr19 45867080 45867081 ERCC2
3 chr3 52436341 52436342 BAP1
4 chr7 151875065 151875066 KMT2C
5 chr19 1206633 1206638 STK11
I wrote the following script:
patient_vcf_to_df.apply(pd.to_numeric, errors='ignore')
patient_vcf_to_df['end_position'] = patient_vcf_to_df['end_position'].map(lambda x: patient_vcf_to_df['start_position'] + len(x))
But I get the error: TypeError: must be str, not int
Does anyone know how to solve this?
Thanks a lot!
Answer 0 (score: 1)
First of all, I would read your CSV so that the row labeled 0 in your DataFrame (Chr, Position, Ref, Gene_Name) becomes the header (column names):
df = pd.read_csv(filename, header=1)
This gives the following DataFrame:
Chr Position Ref Gene_Name
0 chr22 24128945 G NaN
1 chr19 45867080 G ERCC2
2 chr3 52436341 C BAP1
3 chr7 151875065 G KMT2C
4 chr19 1206633 CGGGT STK11
If you want to lowercase your column names:
In [97]: df.columns = df.columns.str.lower()
In [98]: df
Out[98]:
chr position ref gene_name
0 chr22 24128945 G NaN
1 chr19 45867080 G ERCC2
2 chr3 52436341 C BAP1
3 chr7 151875065 G KMT2C
4 chr19 1206633 CGGGT STK11
Make sure the position column has a numeric dtype (a positive side effect of reading the header this way):
In [99]: df.dtypes
Out[99]:
chr object
position int64 # <--- NOTE
ref object
gene_name object
dtype: object
Then the end position can be computed as position plus the length of ref.
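A minimal sketch of that last step, assuming the DataFrame has the lowercase columns position and ref and that the result goes into a new column called end_position (a name chosen here for illustration). The sample rows are copied from the question, since the CSV file itself is not shown. The TypeError in the question most likely arises because the header row left 'start_position' as strings, and because the lambda adds len(x) to the whole column instead of to the value in the same row; a vectorized expression avoids both problems.

```python
import pandas as pd

# Sample rows copied from the question; in practice df would come from
# pd.read_csv(filename, header=1) followed by lowercasing the columns.
df = pd.DataFrame({
    "chr": ["chr22", "chr19", "chr3", "chr7", "chr19"],
    "position": [24128945, 45867080, 52436341, 151875065, 1206633],
    "ref": ["G", "G", "C", "G", "CGGGT"],
    "gene_name": [None, "ERCC2", "BAP1", "KMT2C", "STK11"],
})

# Coerce to a numeric dtype in case the column was read in as strings.
df["position"] = pd.to_numeric(df["position"])

# Vectorized, row-wise: start position plus the length of the ref string.
# "end_position" is an illustrative column name, not taken from the answer.
df["end_position"] = df["position"] + df["ref"].str.len()

print(df)
```

For the sample rows this yields the end positions requested in the question, e.g. 1206638 for the CGGGT row.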