Splitting lines from an infile in Python

Date: 2015-09-16 12:14:58

Tags: python split

I have a series of input files, for example:

chr1    hg19_refFlat    exon    44160380    44160565    0.000000    +   .   gene_id "KDM4A"; transcript_id "KDM4A";
chr1    hg19_refFlat    exon    19563636    19563732    0.000000    -   .   gene_id "EMC1"; transcript_id "EMC1";
chr1    hg19_refFlat    exon    52870219    52870551    0.000000    +   .   gene_id "PRPF38A"; transcript_id "PRPF38A";
chr1    hg19_refFlat    exon    53373540    53373626    0.000000    -   .   gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";
chr1    hg19_refFlat    exon    11839859    11840067    0.000000    +   .   gene_id "C1orf167"; transcript_id "C1orf167";
chr1    hg19_refFlat    exon    29037032    29037154    0.000000    +   .   gene_id "GMEB1"; transcript_id "GMEB1";
chr1    hg19_refFlat    exon    103356007   103356060   0.000000    -   .   gene_id "COL11A1"; transcript_id "COL11A1";

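In my code I am trying to capture two elements from each line: the first is the number that follows the word exon, and the second is the gene (the combination of letters and digits enclosed in double quotes, e.g. "KDM4A"). Here is my code: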

    with open(infile,'r') as r:
        start = set([line.strip().split()[3] for line in r])
        genes = set([line.split('"')[1] for line in r])
        print len(start)
        print len(genes)

For some reason, start works fine, but genes does not capture anything. This is the output:

 48050
 0

I thought this had something to do with the double quotes around the gene names, but if I type it in the terminal it works fine:

>>> x = 'A b P "G" m'
>>> x
'A b P "G" m'
>>> x.split('"')[1]
'G'
>>> 

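Any solution would be highly appreciated, even if it is a completely different way of capturing the two pieces of data from each line. Thanks.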

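5 answers:

Answer 0: (score: 8)

This happens because your file object is already exhausted after you loop over it once here: start = set([line.strip().split()[3] for line in r]). You are then trying to loop over the exhausted file object here: genes = set([line.split('"')[1] for line in r]).

Solution:

You can seek back to the beginning of the file (this is one possible solution).

Modified code: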

with open(infile,'r') as r:
    start = set([line.strip().split()[3] for line in r])
    r.seek(0, 0)
    genes = set([line.split('"')[1] for line in r])
    print len(start)
    print len(genes)
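
The exhaustion is easy to reproduce on a small in-memory file. The sketch below is written for Python 3 and uses io.StringIO with two lines copied from the question as a stand-in for infile; the variable name genes_empty is made up for illustration.

```python
import io

# Two lines from the question, standing in for the real input file.
sample = (
    'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";\n'
    'chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";\n'
)

r = io.StringIO(sample)

start = {line.split()[3] for line in r}           # first pass consumes every line
genes_empty = {line.split('"')[1] for line in r}  # second pass finds nothing left

r.seek(0)                                         # rewind to the beginning
genes = {line.split('"')[1] for line in r}        # now the lines are read again

print(len(start), len(genes_empty), len(genes))   # prints: 2 0 2
```

A real file object opened with open() behaves the same way: once iterated to the end, it yields no more lines until it is rewound or reopened.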

Answer 1: (score: 4)

You can use a regular expression.

import re

with open(infile) as f:
    start = []
    genes = []
    for line in f:
        # group 1: the number after "exon"; group 2: the quoted gene_id value
        st, gen = re.search(r'\bexon\s+(\d+)\b.*?\s+gene_id\s+"([^"]*)"', line).groups()
        start.append(st)
        genes.append(gen)
    print set(start)
    print set(genes)
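
Note that re.search returns None when a line does not match, so calling .groups() on it directly raises AttributeError for blank lines or headers. A minimal Python 3 sketch of a guarded version follows; the helper name parse_lines and the sample list are hypothetical, and only the pattern comes from the answer.

```python
import re

# Pattern from the answer: the number after "exon" and the quoted gene_id value.
PATTERN = re.compile(r'\bexon\s+(\d+)\b.*?\s+gene_id\s+"([^"]*)"')

def parse_lines(lines):
    """Return (starts, genes) as sets, skipping lines that do not match."""
    starts, genes = set(), set()
    for line in lines:
        match = PATTERN.search(line)
        if match is None:      # e.g. a blank line or a comment
            continue
        st, gen = match.groups()
        starts.add(st)
        genes.add(gen)
    return starts, genes

lines = [
    'chr1    hg19_refFlat    exon    44160380    44160565    0.000000    +   .   gene_id "KDM4A"; transcript_id "KDM4A";',
    '',
    '# comment line',
]
starts, genes = parse_lines(lines)
print(starts, genes)
```

An open file object can be passed in place of the list, since the function only iterates over its argument once.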


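Answer 2: (score: 2)

You can load all the lines into a list and then run split on each item of that list (I am not sure how efficient this is for long files).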

with open(infile) as r:
    lines = [line for line in r]
    start = set([line.strip().split()[3] for line in lines])
    genes = set([line.split('"')[1] for line in lines]) 
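
If holding every line in memory is a concern, both sets can be filled in a single pass instead. This is a sketch in Python 3 under the assumption that malformed lines should simply be skipped; the function name collect is made up for illustration.

```python
def collect(lines):
    """Build the start and gene sets in one pass over an iterable of lines."""
    starts, genes = set(), set()
    for line in lines:
        fields = line.split()      # whitespace-separated columns
        quoted = line.split('"')   # pieces between double quotes
        if len(fields) < 4 or len(quoted) < 2:
            continue               # skip blank or malformed lines
        starts.add(fields[3])
        genes.add(quoted[1])
    return starts, genes

sample = [
    'chr1 hg19_refFlat exon 52870219 52870551 0.000000 + . gene_id "PRPF38A"; transcript_id "PRPF38A";\n',
    'chr1 hg19_refFlat exon 29037032 29037154 0.000000 + . gene_id "GMEB1"; transcript_id "GMEB1";\n',
    '\n',
]
starts, genes = collect(sample)
print(len(starts), len(genes))
```

With a real file this would be called as collect(r) inside the with block, so only one line is held at a time.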

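Answer 3: (score: 2)

Use shlex, which splits the line as if it were a list of shell arguments and so neutralizes both the runs of whitespace and the quotes. I am not sure whether it is faster, but it is safer and looks nicer.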

import shlex
with open(infile, 'r') as f:
    for line in f:
        parts = shlex.split(line.replace(';', ''))
        print parts[3], parts[9]
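
To see what shlex.split produces, here is a Python 3 sketch on one line from the question. Turning the trailing key/value tokens into a dictionary is an addition for illustration, not part of the answer, and it assumes the attribute column always alternates key and quoted value.

```python
import shlex

line = 'chr1 hg19_refFlat exon 53373540 53373626 0.000000 - . gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";'

# Drop the semicolons, then split like a shell would: quotes are removed
# and any run of whitespace counts as a single separator.
parts = shlex.split(line.replace(';', ''))

# Tokens from index 8 onward alternate key, value, key, value.
attributes = dict(zip(parts[8::2], parts[9::2]))

print(parts[3], attributes['gene_id'], attributes['transcript_id'])
```

Looking values up by key avoids relying on gene_id always being at index 9. One caveat: a stray quote character inside a value would make shlex.split raise ValueError.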

Answer 4: (score: 2)

The reason genes could not be loaded is that you need to re-read the file from the beginning.