Question

I have a large edge list (about 26 million edges). The first two columns are nodes, followed by a variable number of optional columns:

Node1    Node2    OptionalCol1    OptionalCol2   ...

Gene A    Gene D   --             --
Gene C    Gene F   --             --
Gene D    Gene C   --             --
Gene F    Gene A   --             --

I want a text file containing a non-redundant list of nodes that combines both node columns. Output:

Gene A
Gene D
Gene C
Gene F

My Python code:

file1 = open("input.txt", "r")
node_id = file1.readlines()
node_list=[]

for i in node_id:
    node_info=i.split()
    node_info[0]=node_info[0].strip()
    node_info[1]=node_info[1].strip()
    if node_info[0] not in node_list:
        node_list.append(node_info[0])
    if node_info[1] not in node_list:
        node_list.append(node_info[1])

print node_list

Is it possible to do this with awk? Thanks.

Answer 1

Assuming the separator is a tab, pass -F"\t". If the columns are separated by a run of spaces (more than one) rather than \t, use:

-F"  +"

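The output order is not guaranteed (awk's for-in loop visits array keys in an unspecified order), although it could be made deterministic if needed. The command and its output: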

$ awk -F"\t" 'NR>2{a[$1];a[$2]}END{for(i in a)print i}' file
Gene A
Gene C
Gene D
Gene F

Answer 2

You can combine awk with a unique sort:

$ awk '/Gene/ {print $1, $2; print $3, $4}' file | sort -u
Gene A
Gene C
Gene D
Gene F

Or, if your columns are tab-separated:

$ awk -F'\t' '/Gene/ {print $1; print $2}' file | sort -u
Gene A
Gene C
Gene D
Gene F

Answer 3

If the file is tab-separated, you can use the following; otherwise, change the sep argument to match your delimiter.

import pandas as pd
import numpy as np

df = pd.read_csv('input.txt', sep='\t', usecols=['Node1', 'Node2'])
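# Concatenate both columns first, then de-duplicate, so that a node
# appearing in both Node1 and Node2 is listed only once.
node_list = pd.unique(np.concatenate((df['Node1'], df['Node2'])))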

pandas is a very useful and fast tool for working with relational data, which is what your file looks like.

Answer 4

Use a set in Python, like this:

file1 = open("input.txt", 'r')
lines = file1.read().split('\n')
# you can use '\t' here if that's what separates the nodes on each line
all_nodes_as_string = ' '.join(lines)
all_nodes_with_dupes = all_nodes_as_string.split(' ')
all_unique_nodes = set(all_nodes_with_dupes)
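The set idea can also be applied one line at a time, so the 26-million-line file never has to be held in memory at once. The following is only a minimal sketch under stated assumptions: the function name unique_nodes and its parameters are hypothetical, the file is assumed to be tab-separated with a single header line, and only the first two columns are treated as nodes. It uses a dict rather than a set so that nodes come out in first-seen order, matching the output in the question.

```python
def unique_nodes(lines, sep="\t", skip_header=True):
    """Return the node names found in the first two columns, in first-seen order.

    `lines` is any iterable of text lines, for example an open file object.
    The name and parameters are illustrative, not part of any library.
    """
    seen = {}  # dict keys behave like an insertion-ordered set (Python 3.7+)
    for line_number, line in enumerate(lines):
        if skip_header and line_number == 0:
            continue  # assumed: the first line is the column header
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        for node in fields[:2]:  # ignore any optional columns
            seen.setdefault(node.strip(), None)
    return list(seen)


# Small in-memory sample mirroring the question's data (tab-separated).
sample = [
    "Node1\tNode2\tOptionalCol1\n",
    "\n",
    "Gene A\tGene D\t--\n",
    "Gene C\tGene F\t--\n",
    "Gene D\tGene C\t--\n",
    "Gene F\tGene A\t--\n",
]
print(unique_nodes(sample))  # ['Gene A', 'Gene D', 'Gene C', 'Gene F']
```

For a real file, an open file object can be passed in place of the sample list, since iterating over it yields one line at a time. Membership checks on a dict or set take constant time on average, whereas the `not in node_list` test in the question scans the whole list on every line.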

Unique list of nodes from an edge list
