Question

I have two text files.

The first is a space-separated list:

23 dog 4
24 cat 5
28 cow 7

The second is a '|'-separated list:

?dog|parallel|numbering|position
Dogsarebarking
?cat|parallel|nuucers|position
CatisBeautiful

I want to get an output file like this:

?dog|parallel|numbering|position|23
?cat|parallel|nuucers|position|24

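This is a '|'-separated list containing the values from the second file, with the value from the first column of the first file appended wherever the values in the second column of both files match.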

Answer 1

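Use a dictionary to store the rows of file1, keyed by name, and use the csv module to write the output. The second file is in a FASTA-style format, so we only pick the lines that start with ?: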

import csv

# Map the second column of file1 (the name) to its first column (the number).
with open('file1') as file1:
    file1_data = dict(line.split(None, 2)[1::-1] for line in file1 if line.strip())

with open('file2') as file2, open('output', 'w', newline='') as outputfile:
    output = csv.writer(outputfile, delimiter='|')
    for line in file2:
        # Only the header lines, which start with '?', carry fields.
        if line[:1] == '?':
            row = line.strip().split('|')
            key = row[0][1:]  # drop the leading '?'
            if key in file1_data:
                output.writerow(row + [file1_data[key]])

For your example input, this produces:

?dog|parallel|numbering|position|23
?cat|parallel|nuucers|position|24
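To try the dictionary lookup without creating any files, here is a minimal sketch that runs the same logic on in-memory strings. The helper name `append_first_column` and the use of `io.StringIO` are assumptions made for illustration and are not part of the answer above.

```python
import csv
import io

# Sample data copied from the question.
FILE1 = "23 dog 4\n24 cat 5\n28 cow 7\n"
FILE2 = (
    "?dog|parallel|numbering|position\n"
    "Dogsarebarking\n"
    "?cat|parallel|nuucers|position\n"
    "CatisBeautiful\n"
)

def append_first_column(file1, file2, outputfile):
    """Hypothetical helper: append file1's number to matching file2 headers."""
    # Map the second column of file1 (the name) to its first column (the number).
    lookup = {}
    for line in file1:
        if line.strip():
            number, name = line.split(None, 2)[:2]
            lookup[name] = number

    output = csv.writer(outputfile, delimiter='|', lineterminator='\n')
    for line in file2:
        if line.startswith('?'):      # skip the sequence lines
            row = line.strip().split('|')
            key = row[0][1:]          # drop the leading '?'
            if key in lookup:
                output.writerow(row + [lookup[key]])

result = io.StringIO()
append_first_column(io.StringIO(FILE1), io.StringIO(FILE2), result)
print(result.getvalue(), end='')
```

Because the lookup is a dictionary, each header line costs a single hash lookup, so the second file is processed in one pass regardless of how many rows the first file has.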

Answer 2

This is the kind of task the pandas library excels at:

import pandas as pd
df1 = pd.read_csv("c1.txt", sep="|", header=None).dropna()
df2 = pd.read_csv("c2.txt", sep=" ", header=None)
# .ix was removed from pandas; .iloc does the same positional slice
merged = df1.merge(df2, on=1).iloc[:, :-1]
merged.to_csv("merged.csv", sep="|", header=False, index=False)

Here is some explanation. First, we read the files into objects called DataFrames:

>>> df1 = pd.read_csv("c1.txt", sep="|", header=None).dropna()
>>> df1
               0      1          2         3
0      ?parallel    dog  numbering  position
3      ?parallel    cat    nuucers  position
6  ?non parallel  honey  numbering  position
>>> df2 = pd.read_csv("c2.txt", sep=" ", header=None)
>>> df2
    0    1  2
0  23  dog  4
1  24  cat  5
2  28  cow  7

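.dropna() drops the rows with missing data, which here are the sequence lines that have only one field. Alternatively, df1 = df1[df1[0].str.startswith("?")] would be another way.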
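As a rough check that the two filters agree, the sketch below reads a small in-memory sample and compares them. The sample text and the variable names `by_dropna` and `by_prefix` are assumptions for illustration; the sample follows the column layout shown in the DataFrame above.

```python
import io
import pandas as pd

# Assumed sample in the layout this answer uses for c1.txt.
data = io.StringIO(
    "?parallel|dog|numbering|position\n"
    "Dogsarebarking\n"
    "?parallel|cat|nuucers|position\n"
    "CatisBeautiful\n"
)
df1 = pd.read_csv(data, sep="|", header=None)

# Option 1: drop every row that has a missing value.
by_dropna = df1.dropna()

# Option 2: keep only the rows whose first field starts with '?'.
by_prefix = df1[df1[0].str.startswith("?")]

print(by_prefix)
```

The prefix filter states the intent more directly, while dropna() would also discard a header line that happened to have an empty field.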

Then we merge them on column 1:

>>> df1.merge(df2, on=1)
         0_x    1        2_x         3  0_y  2_y
0  ?parallel  dog  numbering  position   23    4
1  ?parallel  cat    nuucers  position   24    5

We don't need the last column, so we slice it off:

>>> df1.merge(df2, on=1).iloc[:, :-1]
         0_x    1        2_x         3  0_y
0  ?parallel  dog  numbering  position   23
1  ?parallel  cat    nuucers  position   24

Then we write it out with to_csv, producing:

>>> !cat merged.csv
?parallel|dog|numbering|position|23
?parallel|cat|nuucers|position|24

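Now, for many simple tasks pandas may be overkill, and it is important to learn how to use lower-level tools like the csv module as well. OTOH, when you want to get something done right now(tm), it is very, very handy.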

Answer 3

This looks exactly like a JOIN in a relational database.

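An inner join is the most common join operation used in applications and can be regarded as the default join type. An inner join creates a new result table by combining the column values of two tables (A and B) based on the join predicate. The query compares each row of A with each row of B to find all pairs of rows that satisfy the join predicate. When the join predicate is satisfied, the column values of each matched pair of rows of A and B are combined into a result row.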
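The definition above can be sketched in plain Python as a nested loop over two lists of tuples. The helper name `inner_join` and the sample rows are assumptions for illustration; the second table uses the layout in which the animal name is the second field. A real database engine would normally use an index or a hash rather than comparing every pair.

```python
def inner_join(table_a, table_b, predicate):
    """Hypothetical helper: combine every pair of rows that satisfies predicate."""
    result = []
    for row_a in table_a:          # compare each row of A ...
        for row_b in table_b:      # ... with each row of B
            if predicate(row_a, row_b):
                result.append(row_a + row_b)
    return result

table1 = [("23", "dog", "4"), ("24", "cat", "5"), ("28", "cow", "7")]
table2 = [("?parallel", "dog", "numbering", "position"),
          ("?parallel", "cat", "nuucers", "position")]

# Join predicate: the second columns are equal.
joined = inner_join(table2, table1, lambda a, b: a[1] == b[1])
for row in joined:
    print("|".join(row))
```

The row for cow has no partner in the second table, so it does not appear in the result, which is what distinguishes an inner join from an outer join.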

Take a look at this example:

import sqlite3
conn = sqlite3.connect('example.db')

# get hands on the database
c = conn.cursor()

# create and populate table1
c.execute("DROP TABLE IF EXISTS table1")
c.execute("CREATE TABLE table1 (col1 text, col2 text, col3 text)")
with open("file1") as f:
    for line in f:
        c.execute("INSERT INTO table1 VALUES (?, ?, ?)", line.strip().split())

# create table2
c.execute("DROP TABLE IF EXISTS table2")
c.execute("CREATE TABLE table2 (col1 text, col2 text, col3 text, col4 text)")
with open("file2") as f:
    for line in f:
        # only the header lines have four '|'-separated fields
        if line.startswith("?"):
            c.execute("INSERT INTO table2 VALUES (?, ?, ?, ?)",
                line.strip().split('|'))

# make changes persistent
conn.commit()

# retrieve desired data and write it to file
with open("file3", "w+") as f:
    for x in c.execute(
        """
        SELECT table2.col1
             , table2.col2
             , table2.col3
             , table2.col4
             , table1.col1 
        FROM table1 JOIN table2 ON table1.col2 = table2.col2
        """):
        f.write("%s\n" % "|".join(x))

# close connection
conn.close()

The output file looks like this:

?parallel|dog|numbering|position|23
?parallel|cat|nuucers|position|24

Matching different columns and combining them together using Python
