Question

I have a txt file like the one below. It contains 4 lines as an example, and the strings on each line are separated by commas.

"India1,India2,myIndia     "
"Where,Here,Here   "
"Here,Where,India,uyete"
"AFD,TTT"

The file can be downloaded here:

https://gist.github.com/anonymous/cee79db7029a7d4e46cc4a7e92c59c50

I want to extract all unique cells, i.e. output2:

   India1
   India2
   myIndia
   Where
   Here
   India
   uyete
   AFD 
   TTT

I tried reading the file line by line and printing it (I call the data `df`):

myfile = open("df.txt")
lines = myfile.readlines()
for line in lines:
   print lines

Answer 1

Option 1: .csv, .txt files

Native Python cannot read .xls files. If you convert the file to .csv or .txt, you can use the csv module from the standard library:

# `csv` module, Standard Library
import csv

filepath = "./test.csv"

with open(filepath, "r") as f:
    reader = csv.reader(f, delimiter=',')
    header = next(reader)                                  # skip 'A', 'B'
    items = set()
    for line in reader:
        line = [word.replace(" ", "") for word in line if word]
        line = filter(str.strip, line)
        items.update(line)

print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']
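
One caveat with the csv approach: if each line is wrapped in double quotes, as in the sample shown in the question, csv.reader treats the whole line as a single quoted field. The following is a minimal sketch, not part of the original answer, that splits each field again on commas. The helper name `unique_cells` and the in-memory `sample` text are assumptions made for illustration.

```python
import csv
import io

# Sample text copied from the question; io.StringIO stands in for an open file.
sample = '''"India1,India2,myIndia     "
"Where,Here,Here   "
"Here,Where,India,uyete"
"AFD,TTT"
'''

def unique_cells(text_stream):
    """Return unique, stripped cells in first-seen order (hypothetical helper)."""
    seen = {}  # dicts keep insertion order in Python 3.7+
    for row in csv.reader(text_stream):
        for field in row:
            # A fully quoted line arrives as one field, so split it again.
            for cell in field.split(","):
                cell = cell.strip()
                if cell:
                    seen.setdefault(cell, None)
    return list(seen)

cells = unique_cells(io.StringIO(sample))
print(cells)
```

With a real file, the same function could be called as unique_cells(f) inside a with open(...) block.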

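Option 2: .xls, .xlsx files

If you want to keep the original .xls format, you must install a third-party module to handle Excel files. Note that xlrd 2.0 and later reads only .xls files, not .xlsx.

Install xlrd from the command prompt: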

pip install xlrd

In Python:

# `xlrd` module, third-party
import itertools
import xlrd

filepath = "./test.xls"

with xlrd.open_workbook(filepath) as workbook:
    worksheet = workbook.sheet_by_index(0)                 # assumes first sheet
    rows = (worksheet.row_values(i) for i in range(1, worksheet.nrows))
    cells = itertools.chain.from_iterable(rows)
    items = list({val.replace(" ", "") for val in cells if val})

print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']

Option 3: DataFrames

You can use pandas DataFrames to handle csv and text files. For other formats, see the documentation.

import pandas as pd
import numpy as np

# Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
filepath = "./test2.txt"

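# Determines columns from the first line, so add commas in text file, else may throw an error
# Note: `error_bad_lines` was removed in pandas 2.0; on pandas >= 1.3 use on_bad_lines="skip" instead
df = pd.read_csv(filepath, sep=",", header=None, error_bad_lines=False)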
df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True)      # remove special chars    
stack = df.stack()
clean_df = pd.Series(stack.unique())
clean_df

DataFrame output

0     India1
1     India2
2    myIndia
3      Where
4       Here
5      India
6      uyete
7        AFD
8        TTT
dtype: object

Save as a file

# Save as .txt or .csv without index, optional

# target = "./output.csv"
target = "./output.txt"
clean_df.to_csv(target, index=False)

Note: the results of Options 1 & 2 can also be converted to an unordered pandas columnar object using pd.Series(list(items)).

Finally: as a script

Save any of the three options above as a function (named stack) in a file (named restack.py). Save this script to a directory.

# restack.py
import pandas as pd
import numpy as np

def stack(filepath, save=False, target="./output.txt"):
    # Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8

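    # Determines columns from the first line, so add commas in text file, else may throw an error
    # Note: `error_bad_lines` was removed in pandas 2.0; on pandas >= 1.3 use on_bad_lines="skip" instead
    df = pd.read_csv(filepath, sep=",", header=None, error_bad_lines=False)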
    df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True)      # remove special chars    
    stack = df.stack()
    clean_df = pd.Series(stack.unique())

    if save:
        clean_df.to_csv(target, index=False)
        print("Your results have been saved to '{}'".format(target))

    return clean_df

if __name__ == "__main__":
    # Set up input prompts
    msg1 = "Enter path to input file e.g. ./test.txt: "
    msg2 = "Save results to a file? y/[n]: "

    try:
        # Python 2
        fp = raw_input(msg1)
        result = raw_input(msg2)
    except NameError:
        # Python 3
        fp = input(msg1)
        result = input(msg2)

    if result.startswith("y"):
        save = True
    else:
        save = False

    print(stack(fp, save=save))

From the script's working directory, run it from the command line and answer the prompts:

> python restack.py 

Enter path to input file e.g. ./test.txt: ./@data/test2.txt
Save results to a file? y/[n]: y
Your results have been saved to './output.txt'

Your results should be printed in your console and, optionally, saved to the file output.txt. Adjust any parameters to suit your needs.

Answer 2

If your stack.txt file looks like this (i.e. it has been saved as a .txt file):

"India1,India2,myIndia     "
"Where,Here,Here   "
"Here,Where,India,uyete"
"AFD,TTT"

Solution:

import ast
from collections import OrderedDict

with open("stack.txt", "r") as f:
    # parse each quoted line as a string literal (safer than eval), then strip surrounding whitespace
    data = [ast.literal_eval(line.strip()).strip() for line in f.readlines()]
    # get individual words into a list
    individual_elements = [word for row in data for word in row.split(",")]
    # remove duplicates and preserve order
    uniques = OrderedDict.fromkeys(individual_elements)   
    # convert from OrderedDict object to plain list
    final = [word for word in uniques]

To get the desired columnar output:

print("\n".join(final))

Which yields:

India1
India2
myIndia
Where
Here
India
uyete
AFD
TTT
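
As a side note that is not from the original answer: in Python 3.7 and later, a plain dict preserves insertion order, so dict.fromkeys can play the same role as OrderedDict.fromkeys. The sketch below assumes the rows have already been read and unquoted (the `rows` list is illustrative data copied from the sample), and it strips every cell rather than only the whole row.

```python
# Rows as they would look after reading and unquoting the sample file.
rows = [
    "India1,India2,myIndia     ",
    "Where,Here,Here   ",
    "Here,Where,India,uyete",
    "AFD,TTT",
]

# Split each row on commas and strip whitespace from every individual cell.
words = [word.strip() for row in rows for word in row.split(",")]

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+).
ordered_uniques = list(dict.fromkeys(words))

print("\n".join(ordered_uniques))
```

Stripping per cell also handles whitespace in the middle of a row, which a single strip on the whole line would miss.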

Answer 3

I won't give you the complete code, but I will give you some ideas.

First, you need to read all the lines of the file:

lines = open("file.txt").readlines()

Then, extract the data from each line:

lines = [line.split(",") for line in lines]

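You can use itertools.combinations to generate combinations. For each line, print the combinations of the line's elements.

If you don't care about the order of the elements, you can use set to get the unique elements. Before using set, you should first flatten the list lines, possibly with itertools.chain.from_iterable.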
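
The flatten-then-set idea can be sketched as follows. This is only an illustration under stated assumptions: `raw_lines` stands in for the result of readlines() and is assumed to hold unquoted, comma-separated text, and the extra strip() calls are added here to drop trailing spaces and newlines.

```python
from itertools import chain

# Stand-in for open("file.txt").readlines(); assumes unquoted lines.
raw_lines = [
    "India1,India2,myIndia     \n",
    "Where,Here,Here   \n",
    "Here,Where,India,uyete\n",
    "AFD,TTT\n",
]

# One list of cells per line.
split_lines = [line.strip().split(",") for line in raw_lines]

# Flatten the list of lists into a single iterable of cells.
flat = chain.from_iterable(split_lines)

# A set keeps one copy of each cell; the original order is not preserved.
unique = {cell.strip() for cell in flat}

print(sorted(unique))
```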

Answer 4

Your code for reading the text file line by line is fine. So you still need to:

split each line into "cells"
remove duplicates

split

You can use line.split(',') to split each line into cells.

strip

You want to remove whitespace, so I would strip each cell:

[value.strip() for value in line.split(',')]

set

You can use set(cells) to remove duplicates.

with

Finally, I think it is better to use with (a context manager) when reading the file. Putting it all together:

with open('df.txt', 'r') as f:
    cells = []
    for line in f:
        cells += [value.strip() for value in line.split(',')]

cells = list(set(cells))

If you want it more compact, you can do all of this in a single list comprehension.
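
The single-comprehension variant is not preserved in this copy of the answer. A minimal sketch of what it might look like is given below; it is a reconstruction, not the author's code. Here `fake_file` is an in-memory stand-in for open('df.txt', 'r'), and the data is assumed to be unquoted, comma-separated text.

```python
import io

# In-memory stand-in for open('df.txt', 'r'); assumes unquoted lines.
fake_file = io.StringIO(
    "India1,India2,myIndia     \n"
    "Where,Here,Here   \n"
    "Here,Where,India,uyete\n"
    "AFD,TTT\n"
)

with fake_file as f:
    # One set comprehension: iterate lines, split on commas, strip each cell.
    compact_cells = list({value.strip() for line in f for value in line.split(",")})

print(sorted(compact_cells))
```

As with set(cells) above, the resulting list has no guaranteed order.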

How do I make a list of unique cells?

4 answers: