Question

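I have a large TSV file with many columns that I need to parse with Python. I need to take the original file, drop the columns or data I don't need, and write the remaining data to a new file. One complication: one of the columns holds comma-separated data.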

The file data looks like this:

Mark    Adams    8429    Main    TX    beef,broccoli,carrot,chicken
John    Baker    1241    Wells    TX    tortilla,grapes,corn,steak
Joe    Hills    1235    Wilcox    TX    mushrooms,bacon,chicken,butter,yogurt,eggs
White    Fang    9999    Wolf    TX    salt,pepper,lettuce,lamb
Zach    Trott    0421    Spirit    TX    peas,milk,pork,cups,chicken

example.tsv

example_output.tsv

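I need the new file to contain columns 0, 1, 2 and 3, and to take only specific values from column 5; those values are stored in the list below. It should return whichever of the listed meats appear and ignore the remaining items in column 5.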

list = ['beef', 'pork', 'chicken', 'steak', 'bacon', 'lamb']

Is this even possible? The code I am currently using:

keys = [True, True, True, True, True, []]
# Columns matching True are included, False excludes them.
# A nested list causes the tab-separated column to be split at commas and filtered equally.

from itertools import compress

with open("test.txt") as in_file, open("file2.tsv") as out_file:
for line in in_file:
    output = []
    columns = line.split("\t")
    for c, k in zip(columns, keys):
         if isinstance(k, list):
             output.append(",".join(compress(c.split(","), k)))
         elif k:
             output.append(c)
    print(*output, sep="\t", file=out_file)

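Errors so far:

1.) It only selects as many items from column 5 as I have written True entries for. But each row has a different number of items, so I need it to iterate over every comma-separated value in column 5 of each row and return the values that are in the list.

2.) When I try to run it, I also get a syntax error on the last print statement.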

Thanks!

Answer 1

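Since you have not provided any code sample showing whether you have tried to solve this yourself, I will stick to outlining one possible (very basic) algorithm you can try: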

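1. Read a line from the TSV file.

2. Split it using "TAB" as the delimiter -> this gives you a list.

3. Write the columns (elements) you need from that list to the output file (on a new line).

4. For the last column, split it using a comma as the delimiter and process the resulting list as needed (hint: the del keyword may be useful here).

5. Rinse and repeat: go back to step 1 until the whole file has been processed.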
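
The algorithm above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the names `MEATS`, `filter_line` and `filter_file` are hypothetical, the column choice (0 to 3 plus the filtered column 5) is taken from the question, and a list comprehension is used here instead of `del`.

```python
# Hypothetical helpers illustrating the algorithm; the names are not from the answer.
MEATS = {"beef", "pork", "chicken", "steak", "bacon", "lamb"}

def filter_line(line):
    """Return the reduced TSV line: columns 0-3 plus the meats found in column 5."""
    # Drop the trailing newline, then split on TAB to get a list of columns.
    columns = line.rstrip("\n").split("\t")
    kept = columns[:4]
    # Split the last column on commas and keep only the listed meats.
    meats = [item for item in columns[5].split(",") if item in MEATS]
    kept.append(",".join(meats))
    return "\t".join(kept)

def filter_file(in_path, out_path):
    """Read every line of in_path and write the reduced line to out_path."""
    with open(in_path) as in_file, open(out_path, "w") as out_file:
        for line in in_file:
            out_file.write(filter_line(line) + "\n")

print(filter_line("Mark\tAdams\t8429\tMain\tTX\tbeef,broccoli,carrot,chicken\n"))
```

Using a set for `MEATS` keeps the membership test cheap on a large file, and stripping the newline before splitting keeps the last item comparable with the list entries.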

Answer 2

You can create a key that defines which columns and subfields to keep and which to discard, and process the file like this:

keys = [True, True, True, True, False, [True, False, False, True]]
# Columns matching True are included, False excludes them.
# A nested list causes the tab-separated column to be split at commas and filtered the same way.

from itertools import compress

with open("file.tsv") as in_file, open("file2.tsv", "w") as out_file:
    for line in in_file:
        output = []
        # Strip the trailing newline so the last column does not carry it into the output.
        columns = line.rstrip("\n").split("\t")
        for c, k in zip(columns, keys):
             if isinstance(k, list):
                 output.append(",".join(compress(c.split(","), k)))
             elif k:
                 output.append(c)
        print(*output, sep="\t", file=out_file)

Given this input file (tab-separated columns):

first_name1 last_name1  house_number1   street1 state1  meat1,carrots1,soup1,3 eggs1
first_name2 last_name2  house_number2   street2 state2  meat2,carrots2,soup2,3 eggs2
first_name3 last_name3  house_number3   street3 state3  meat3,carrots3,soup3,3 eggs3
first_name4 last_name4  house_number4   street4 state4  meat4,carrots4,soup4,3 eggs4

it produces this output in the second file:

first_name1 last_name1  house_number1   street1 meat1,3 eggs1
first_name2 last_name2  house_number2   street2 meat2,3 eggs2
first_name3 last_name3  house_number3   street3 meat3,3 eggs3
first_name4 last_name4  house_number4   street4 meat4,3 eggs4

See this code running on ideone.com
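
To see what the nested mask does to a single column, here is a small standalone check of `itertools.compress`. The variable names are only illustrative. It also shows why a positional mask is a poor fit when rows have different numbers of items, which is the first problem described in the question.

```python
from itertools import compress

# One comma-separated column from the sample input above.
column = "meat1,carrots1,soup1,3 eggs1"
mask = [True, False, False, True]  # same nested list as in keys

# compress() yields the items whose mask entry is truthy and stops
# as soon as the shorter of the two inputs is exhausted.
selected = list(compress(column.split(","), mask))
print(",".join(selected))

# With more items than mask entries, the extra items are silently dropped.
longer = "a,b,c,d,e,f"
dropped_tail = list(compress(longer.split(","), mask))
print(dropped_tail)
```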

Update

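Since you clarified how the last column should be filtered, here is an updated code example that lets you set a list of words so that only the words listed there appear in the output: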

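keys = [True, True, True, True, False, ['beef', 'pork', 'chicken', 'steak', 'bacon', 'lamb']]

with open("file.tsv") as in_file, open("file2.tsv", "w") as out_file:
    for line in in_file:
        output = []
        # Strip the trailing newline so the last word can still match the list.
        columns = line.rstrip("\n").split("\t")
        for c, k in zip(columns, keys):
            if isinstance(k, list):
                # Keep only the comma-separated words that appear in the list k.
                output.append(",".join(word for word in c.split(",") if word in k))
            elif k:
                output.append(c)
        print(*output, sep="\t", file=out_file)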

Your example input:

Mark    Adams     8429    Main      TX    beef,broccoli,carrot,chicken
John    Baker     1241    Wells     TX    tortilla,grapes,corn,steak
Joe     Hills     1235    Wilcox    TX    mushrooms,bacon,chicken,butter,yogurt,eggs
White    Fang     9999    Wolf      TX    salt,pepper,lettuce,lamb
Zach     Trott    0421    Spirit    TX    peas,milk,pork,cups,chicken

Output for your example:

Mark    Adams     8429    Main      beef,chicken
John    Baker     1241    Wells     steak
Joe     Hills     1235    Wilcox    bacon,chicken
White    Fang     9999    Wolf      lamb
Zach     Trott    0421    Spirit    pork,chicken

See this code running on ideone.com
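
As an alternative to splitting lines by hand, the standard-library `csv` module can read and write tab-separated data. The sketch below is not part of the approach above; the function name `filter_rows` and its parameters `keep` and `list_col` are hypothetical, and it assumes every row has at least six columns.

```python
import csv
import io

MEATS = {"beef", "pork", "chicken", "steak", "bacon", "lamb"}  # set for fast lookups

def filter_rows(in_file, out_file, keep=(0, 1, 2, 3), list_col=5):
    """Copy the `keep` columns plus the meats found in `list_col` to out_file."""
    reader = csv.reader(in_file, delimiter="\t")
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    for row in reader:
        meats = [word for word in row[list_col].split(",") if word in MEATS]
        writer.writerow([row[i] for i in keep] + [",".join(meats)])

# In-memory demo; with real files, open them with newline="" as the csv docs recommend.
sample = "Mark\tAdams\t8429\tMain\tTX\tbeef,broccoli,carrot,chicken\n"
result = io.StringIO()
filter_rows(io.StringIO(sample), result)
print(result.getvalue(), end="")
```

The reader strips the line ending itself, so the last item of each row compares cleanly against the set.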

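Important note for Python 2:

Both scripts above are written for Python 3. In Python 2, print is a statement by default, so a print() call with sep= and file= arguments raises a syntax error. In that case you have to write to the output file in a different way.

Simply replace this line: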

        print(*output, sep="\t", file=out_file)

with this:

        out_file.write("\t".join(output) + "\n")

The rest should be compatible.
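
As a quick sanity check, the following sketch (run under Python 3, with in-memory buffers standing in for the output file) suggests that the two lines write the same text as long as every element of `output` is a string, which is the case for values produced by `split()`. The sample `output` list is made up for illustration.

```python
import io

# Illustrative row; in the scripts above, output is built from split() results.
output = ["Mark", "Adams", "8429", "Main", "beef,chicken"]

with_print = io.StringIO()
print(*output, sep="\t", file=with_print)       # Python 3 form

with_write = io.StringIO()
with_write.write("\t".join(output) + "\n")      # form suggested for Python 2

same = with_print.getvalue() == with_write.getvalue()
print(same)  # → True
```

Unlike print(), str.join() does not convert non-string elements, so the write() form would need an explicit str() conversion if `output` ever held numbers.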
