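Right now I take inputs such as "1-1-A-A" and split the fields on "-". However, the input can come in several forms, for example "chr1-1-C-G", "3-1-C-A", or "CHRX-34-A-T".
The first field should accept "chr1, chr2, ..., chr23, chrX, chrY", the second field should accept only positive numbers, and the third and fourth fields should each accept only a single letter from {A, C, G, T}.
So I am thinking of using `re.findall` and, for invalid input, returning a warning that says what is wrong. However, I am not sure how to produce such errors with a regular expression.
Can anyone help?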
import re


def _check_input(var_str):  # maybe better to check each input separately
    """
    Checks if the input is a valid variant string
    :param var_str: string supposed to be in the format 'chr-pos-ref-alt'
    :return: bool which tells whether the input is valid
    """
    pattern = re.compile(
        r"""([1-9]|[1][0-9]|[2][0-2]|[XY])  # the chromosome
        -(\d+)      # the position
        -[ACGT]+    # ref
        -[ACGT]+    # alt""",
        re.X,
    )
    if re.fullmatch(pattern, var_str) is None:
        return False
    else:
        return True


def string_to_dict(inp):
    """
    Converts a variant string into a dictionary
    :param inp: string which should be a valid variant
    :return: dictionary with the variants keys and values
    """
    inp_list = inp.split("-")
    inp_dict = {
        "chr": inp_list[0],
        "pos": inp_list[1],
        "ref": inp_list[2],
        "alt": inp_list[3],
    }
    return inp_dict
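Answer 0 (score: 1)
Regular expressions are good at checking whether a sequence is valid as a whole. Unfortunately, I do not see a way to get per-field error reporting out of a single regular expression.
So I think you can use a regex to check the validity of the whole input and, if it is invalid, run some additional code to tell the user what might be wrong.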
import re


def _check_input(var_str):
    """
    Checks if the input is a valid variant string
    :param var_str: string supposed to be in the format 'chr-pos-ref-alt'
    :return: a match object
    :raises: ValueError on invalid input
    """
    pattern = re.compile(
        r"(?:chr)?(?P<chr>[1-9]|[1][0-9]|[2][0-3]|[XY])"  # the chromosome
        r"-(?P<pos>\d+)"  # the position
        r"-(?P<ref>[ACGT])"  # ref
        r"-(?P<alt>[ACGT])",  # alt
        re.X | re.IGNORECASE,
    )
    # fullmatch also rejects trailing characters, e.g. 'chr1-1-A-TT'
    match = pattern.fullmatch(var_str)
    if not match:
        _input_error_suggestion(var_str)  # always raises ValueError
    # access values with match['chr'], match['pos'], match['ref'], match['alt']
    return match


def _input_error_suggestion(var_str):
    parts = var_str.split('-')
    if len(parts) != 4:
        raise ValueError('Input should have 4 parts separated by -')
    # 'chrom' avoids shadowing the built-in chr()
    chrom, pos, nucleotide1, nucleotide2 = parts
    # check part 1
    chr_pattern = re.compile(r'(?:chr)?([1-9]|[1][0-9]|[2][0-3]|[XY])', re.IGNORECASE)
    if not chr_pattern.fullmatch(chrom):
        raise ValueError('Input first part should be a chromosome chr1, chr2, ..., chr23, chrX, chrY')
    # check part 2
    try:
        p = int(pos)
    except ValueError:
        raise ValueError('Input second part should be an integer')
    if p < 0:
        raise ValueError('Input second part should be a positive integer')
    # check part 3 and 4
    for i, n in enumerate((nucleotide1, nucleotide2)):
        # require exactly one letter; a plain substring test would accept '' or 'AC'
        if len(n) != 1 or n.upper() not in 'ACGT':
            raise ValueError(f"Input part {3 + i} should be one of {{A,C,G,T}}")
    # something else
    raise ValueError("Input was malformed, it should be in the format 'chr-pos-ref-alt'")
I also improved the original regular expression.
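Because the pattern in this answer uses named groups, the match object can also stand in for the question's `string_to_dict`. The following is only a minimal sketch of that idea: `VARIANT_RE` and `variant_to_dict` are hypothetical names that do not appear in the original code, and the upper-casing of the fields, the conversion of `pos` to `int`, and the stripping of surrounding whitespace are assumptions about what is convenient downstream.

```python
import re

# Hypothetical module-level pattern; it mirrors the named groups used in the answer.
VARIANT_RE = re.compile(
    r"(?:chr)?(?P<chr>[1-9]|1[0-9]|2[0-3]|[XY])"  # chromosome, optional "chr" prefix
    r"-(?P<pos>\d+)"                               # position
    r"-(?P<ref>[ACGT])"                            # reference base
    r"-(?P<alt>[ACGT])",                           # alternate base
    re.IGNORECASE,
)


def variant_to_dict(var_str):
    """Return the variant fields as a dict, or None if var_str is invalid."""
    # Assumption: surrounding whitespace is not significant.
    match = VARIANT_RE.fullmatch(var_str.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Normalise case so "chrx-34-a-t" and "CHRX-34-A-T" give the same result.
    return {
        "chr": fields["chr"].upper(),
        "pos": int(fields["pos"]),
        "ref": fields["ref"].upper(),
        "alt": fields["alt"].upper(),
    }


print(variant_to_dict("CHRX-34-A-T"))
print(variant_to_dict("chr99-1-C-G"))
```

Returning `None` keeps this sketch short. To get the detailed messages instead, the `None` branch could call `_input_error_suggestion(var_str)` from the code above.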