I have a series of input files that look like this:
chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";
chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";
chr1 hg19_refFlat exon 52870219 52870551 0.000000 + . gene_id "PRPF38A"; transcript_id "PRPF38A";
chr1 hg19_refFlat exon 53373540 53373626 0.000000 - . gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";
chr1 hg19_refFlat exon 11839859 11840067 0.000000 + . gene_id "C1orf167"; transcript_id "C1orf167";
chr1 hg19_refFlat exon 29037032 29037154 0.000000 + . gene_id "GMEB1"; transcript_id "GMEB1";
chr1 hg19_refFlat exon 103356007 103356060 0.000000 - . gene_id "COL11A1"; transcript_id "COL11A1";
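In my code I am trying to capture two elements from each line: the first is the number right after the word exon, and the second is the gene name (the combination of letters and digits surrounded by double quotes, e.g. "KDM4A"). Here is my code: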
with open(infile, 'r') as r:
    start = set([line.strip().split()[3] for line in r])
    genes = set([line.split('"')[1] for line in r])
print(len(start))
print(len(genes))
For some reason start works fine, but genes does not capture anything. Here is the output:
48050
0
I thought it had something to do with the double quotes around the gene names, but if I type it into the terminal it works fine:
>>> x = 'A b P "G" m'
>>> x
'A b P "G" m'
>>> x.split('"')[1]
'G'
>>>
Any solution would be highly appreciated, even if it is a completely different way of capturing the two items from each line. Thanks.
Answer 0 (score: 8)
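This happens because the file object is already exhausted after you loop over it once here:
start = set([line.strip().split()[3] for line in r])
so when you loop over it again here:
genes = set([line.split('"')[1] for line in r])
you are iterating over an exhausted file object, which yields no lines.
Solution:
You can seek back to the beginning of the file (this is one of several possible solutions).
Modified code: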
with open(infile, 'r') as r:
    start = set([line.strip().split()[3] for line in r])
    r.seek(0, 0)  # rewind to the beginning before the second pass
    genes = set([line.split('"')[1] for line in r])
print(len(start))
print(len(genes))
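The exhaustion described above can be reproduced without a real file. This is a minimal sketch that uses io.StringIO as a stand-in for the opened file; the two-line sample string is an assumption made for illustration, copied from the input shown in the question.

```python
import io

# Hypothetical two-line stand-in for the real input file.
sample = (
    'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";\n'
    'chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";\n'
)

r = io.StringIO(sample)  # iterates and seeks like an open text file

first_pass = [line.split()[3] for line in r]       # consumes every line
second_pass = [line.split('"')[1] for line in r]   # nothing left to read
r.seek(0)                                          # rewind to the beginning
third_pass = [line.split('"')[1] for line in r]    # lines are available again

print(first_pass)   # ['44160380', '19563636']
print(second_pass)  # []
print(third_pass)   # ['KDM4A', 'EMC1']
```

The empty second list is the same effect that produced the 0 in the question's output.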
Answer 1 (score: 4)
You could use a regular expression.
import re

with open(infile) as f:
    start = []
    genes = []
    for line in f:
        # group 1: the number after "exon"; group 2: the quoted gene_id value
        st, gen = re.search(r'\bexon\s+(\d+)\b.*?\s+gene_id\s+"([^"]*)"', line).groups()
        start.append(st)
        genes.append(gen)
print(set(start))
print(set(genes))
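One thing to watch with the regex approach is that re.search returns None for a line that does not match, so calling .groups() on it raises AttributeError. The sketch below wraps the same pattern in a helper that returns None instead; the name parse_line and the comment line used as a non-matching example are assumptions for illustration.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(r'\bexon\s+(\d+)\b.*?\s+gene_id\s+"([^"]*)"')


def parse_line(line):
    """Return (start, gene) for an exon line, or None if the line does not match."""
    match = PATTERN.search(line)
    return match.groups() if match else None


good = 'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";'
bad = '# a comment line'

print(parse_line(good))  # ('44160380', 'KDM4A')
print(parse_line(bad))   # None
```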
Answer 2 (score: 2)
You can load all the lines into a list and then run split on each item of that list
(I am not sure how efficient this is for long files).
with open(infile) as r:
    lines = list(r)  # read every line into memory once
start = set([line.strip().split()[3] for line in lines])
genes = set([line.split('"')[1] for line in lines])
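If memory is a concern for long files, both sets can be filled in a single pass so that neither a second read nor an in-memory list is needed. This is a sketch under the assumption that every useful line has at least four whitespace-separated fields and a quoted gene name; the function name collect_starts_and_genes and the sample data are made up for illustration.

```python
import io


def collect_starts_and_genes(lines):
    """Collect unique exon start positions and gene names in one pass."""
    starts, genes = set(), set()
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines instead of raising IndexError.
        if len(fields) < 4 or '"' not in line:
            continue
        starts.add(fields[3])          # number right after "exon"
        genes.add(line.split('"')[1])  # text inside the first pair of quotes
    return starts, genes


# Hypothetical sample with a blank line in the middle.
sample = io.StringIO(
    'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";\n'
    '\n'
    'chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";\n'
)

starts, genes = collect_starts_and_genes(sample)
print(len(starts), len(genes))  # 2 2
```

With a real file, the same function can be called as collect_starts_and_genes(r) inside the with block.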
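Answer 3 (score: 2)
Use shlex (treating the line as if it were shell arguments), which takes care of the repeated whitespace and the quotes. I am not sure whether it is faster, but it is safer and looks nicer.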
import shlex

with open(infile, 'r') as f:
    for line in f:
        # drop the semicolons, then split like a shell would (quotes are stripped)
        parts = shlex.split(line.replace(';', ''))
        print(parts[3], parts[9])
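To see why indices 3 and 9 are the right ones, it helps to look at the full token list shlex produces for one line. The sketch below does that for the first sample line from the question; the helper name start_and_gene is an assumption for illustration. Note that shlex.split raises ValueError on unbalanced quotes, so a gene name containing a stray quote character would need separate handling.

```python
import shlex


def start_and_gene(line):
    """Return (exon start, gene_id) using shell-style tokenisation."""
    parts = shlex.split(line.replace(';', ''))
    return parts[3], parts[9]


line = 'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";'

# Tokens 0-7 are the fixed columns, 8 is the literal gene_id key, 9 is its value.
print(shlex.split(line.replace(';', '')))
print(start_and_gene(line))  # ('44160380', 'KDM4A')
```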
Answer 4 (score: 2)
The reason genes is not being populated is that you need to re-read the file from the beginning.