I'm a beginner and I'm struggling with pandas data frames.
Here is my code:
import csv
import sys
import numpy as np
import pandas as pd
# read the CSV into a pandas data frame (df)
df = pd.read_csv(sys.argv[1], delimiter=',')
print(df)
# open the .txt file and read the DNA sequence as a single string
with open(sys.argv[2], 'r') as f:
    contents = f.read()
s = contents
# find the longest run of consecutive "AGATC" repeats
s_len = len(s)
found_a_len = 0  # stays 0 if "AGATC" never occurs
AGATC2 = "AGATC"
keep_going = True
while s_len > 0 and keep_going:
    AGATC = "AGATC" * s_len
    if AGATC in s:
        found_a_len = s_len
        keep_going = False
    s_len = s_len - 1
result0 = (AGATC2, found_a_len)
# find the longest run of consecutive "AATG" repeats
s_len = len(s)
found_a_len1 = 0  # stays 0 if "AATG" never occurs
AATG2 = "AATG"
keep_going = True
while s_len > 0 and keep_going:
    AATG = "AATG" * s_len
    if AATG in s:
        found_a_len1 = s_len
        keep_going = False
    s_len = s_len - 1
result1 = (AATG2, found_a_len1)
# find the longest run of consecutive "TATC" repeats
s_len = len(s)
found_a_len2 = 0  # stays 0 if "TATC" never occurs
TATC2 = "TATC"
keep_going = True
while s_len > 0 and keep_going:
    TATC = "TATC" * s_len
    if TATC in s:
        found_a_len2 = s_len
        keep_going = False
    s_len = s_len - 1
result2 = (TATC2, found_a_len2)
total_result = (result0, result1, result2)
# build a one-row data frame holding the three repeat counts
d = {AGATC2: [found_a_len], AATG2: [found_a_len1], TATC2: [found_a_len2]}
df1 = pd.DataFrame(data=d)
print(df1)
As its first argument, it takes a .csv file that looks like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
As its second argument, it takes a .txt file containing a DNA sequence (just a single line such as "AAGGT..."):
AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
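The script is run as python dna.py databases/small.csv sequences/1.txt, where small.csv is the first argument and 1.txt is the second.
What I'm doing is finding, for each substring ('AGATC', 'AATG', 'TATC'), the longest run of consecutive repeats in the sequence from the second argument, and then checking whether those repeat counts match any person in the .csv file given as the first argument.
So far I've come up with the code above (sorry, it's really messy and probably inefficient). If you run it, the output is: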
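For reference, the counting step described above could be written once as a helper instead of three copies of the same loop. This is only a minimal sketch: the function name longest_run and the short sample sequence are made up for illustration and are not part of my script.

```python
def longest_run(sequence: str, motif: str) -> int:
    """Return the largest n such that motif repeated n times occurs in sequence."""
    count = 0
    # If motif * n occurs, motif * (n - 1) occurs too, so we can grow the
    # repeat count one step at a time until the longer run is no longer found.
    while motif * (count + 1) in sequence:
        count += 1
    return count


# Hypothetical sample: 4 x AGATC, then 2 x TATC, then 1 x AATG.
sample = "AAGG" + "AGATC" * 4 + "TATC" * 2 + "AATG"
counts = {motif: longest_run(sample, motif) for motif in ("AGATC", "AATG", "TATC")}
print(counts)
```

Counting upward from zero also avoids building strings as long as "AGATC" * len(s), which is what the loops in my script do on their first iteration.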
      name  AGATC  AATG  TATC
0    Alice      2     8     3
1      Bob      4     1     5
2  Charlie      3     2     5
   AGATC  AATG  TATC
0      4     1     5
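So now I need to compare these two data frames (df and df1) and see whether there is a match. If there is, the code should print the matching person's name. In this particular example the match has to be "Bob", so the code should print "Bob". I think I need some kind of IF statement in pandas. Whenever I try to run a comparison, it returns an error. I suspect this is because the two data frames don't have the same shape or data types, and I'm really frustrated because I'm stuck on this... Thanks for your help!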
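One possible way to do the comparison, offered as a sketch rather than a definitive answer: compare only the motif columns of df against the single row of df1, and keep the rows where every column agrees. The data frames below are rebuilt by hand from the sample output so the snippet runs on its own; the names motifs, mask, and matches are illustrative.

```python
import pandas as pd

# Sample data copied from the output above; in the real script df comes from
# pd.read_csv(sys.argv[1]) and df1 from the computed repeat counts.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "AGATC": [2, 4, 3],
    "AATG": [8, 1, 2],
    "TATC": [3, 5, 5],
})
df1 = pd.DataFrame({"AGATC": [4], "AATG": [1], "TATC": [5]})

motifs = list(df1.columns)
# df1.iloc[0] is a Series indexed by motif name, so the comparison lines up
# column by column; all(axis=1) keeps rows where every motif count matches.
mask = (df[motifs] == df1.iloc[0]).all(axis=1)
matches = df.loc[mask, "name"].tolist()
print(matches[0] if matches else "No match")
```

Comparing df and df1 directly with == tends to fail because they have different columns and row counts; restricting df to the shared columns and comparing against a single row sidesteps that. An inner merge on the motif columns, df.merge(df1, on=motifs), is another option that should give the same rows.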