I have a txt file like the one below, shown here with 4 lines as an example. The strings in each line are separated by a comma (,).
"India1,India2,myIndia "
"Where,Here,Here "
"Here,Where,India,uyete"
"AFD,TTT"
The file can be downloaded from here:
https://gist.github.com/anonymous/cee79db7029a7d4e46cc4a7e92c59c50
I want to extract all unique cells, i.e. the following output:
India1
India2
myIndia
Where
Here
India
uyete
AFD
TTT
I tried reading it line by line and printing it (I call the data `df`):
myfile = open("df.txt")
lines = myfile.readlines()
for line in lines:
    print lines
Answer 0 (score: 1)
Option 1: .csv, .txt files
Native Python cannot read .xls files. If you convert the file to .csv or .txt, you can use the csv module from the standard library:
# `csv` module, Standard Library
import csv
filepath = "./test.csv"
with open(filepath, "r") as f:
    reader = csv.reader(f, delimiter=',')
    header = next(reader)                     # skip 'A', 'B'
    items = set()
    for line in reader:
        line = [word.replace(" ", "") for word in line if word]
        line = filter(str.strip, line)
        items.update(line)
print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']
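The listing above assumes a plain comma-separated file with a header row. The sample lines in the question are wrapped in double quotes, so csv.reader would return each line as a single field, and that field needs a second split. A minimal sketch of that case, assuming the four sample lines from the question (io.StringIO stands in for the real file, and the variable names are illustrative):

```python
import csv
import io

# Sample lines copied from the question; io.StringIO stands in for the real file.
sample = io.StringIO(
    '"India1,India2,myIndia "\n'
    '"Where,Here,Here "\n'
    '"Here,Where,India,uyete"\n'
    '"AFD,TTT"\n'
)

items = set()
for row in csv.reader(sample):
    # Each quoted line arrives as one field, e.g. ['India1,India2,myIndia '].
    for field in row:
        # Split the field on commas, trim whitespace, and skip empty cells.
        items.update(cell.strip() for cell in field.split(",") if cell.strip())

# A set has no defined order, so sort before printing for a stable display.
print(sorted(items))
```

To read from disk instead, replace the io.StringIO object with an open file handle.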
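Option 2: .xls, .xlsx files
If you want to keep the original .xls format, you must install a third-party module to handle Excel files. Note that xlrd 2.0 and later reads only .xls files; .xlsx files need a different library such as openpyxl.
Install xlrd from the command prompt: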
pip install xlrd
In Python:
# `xlrd` module, third-party
import itertools
import xlrd
filepath = "./test.xls"
with xlrd.open_workbook(filepath) as workbook:
    worksheet = workbook.sheet_by_index(0)    # assumes first sheet
    rows = (worksheet.row_values(i) for i in range(1, worksheet.nrows))
    cells = itertools.chain.from_iterable(rows)
    items = list({val.replace(" ", "") for val in cells if val})
print(list(items))
# ['uyete', 'NHYG', 'QHD', 'SGDH', 'AFD', 'DNGS', 'lkd', 'TTT']
Option 3: DataFrames
You can use pandas DataFrames to handle csv and text files. See the pandas documentation for other formats.
import pandas as pd
import numpy as np
# Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
filepath = "./test2.txt"
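# Determines columns from the first line, so add commas in text file, else may throw an error
# Note: `error_bad_lines` was removed in pandas 2.0; on pandas >= 1.3 use on_bad_lines="skip" instead
df = pd.read_csv(filepath, sep=",", header=None, error_bad_lines=False)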
df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True) # remove special chars
stack = df.stack()
clean_df = pd.Series(stack.unique())
clean_df
DataFrame output
0 India1
1 India2
2 myIndia
3 Where
4 Here
5 India
6 uyete
7 AFD
8 TTT
dtype: object
Save as a file
# Save as .txt or .csv without index, optional
# target = "./output.csv"
target = "./output.txt"
clean_df.to_csv(target, index=False)
Note: the results of options 1 and 2 can also be converted to an unordered pandas columnar object using pd.Series(list(items)).
Finally: as a script
Wrap any of the three options above in a function (named stack) inside a file (named restack.py). Save this script to a directory.
# restack.py
import pandas as pd
import numpy as np
def stack(filepath, save=False, target="./output.txt"):
    # Using data from gist.github.com/anonymous/a822647a00087abc12de3053c700b9a8
    # Determines columns from the first line, so add commas in text file, else may throw an error
    # Note: `error_bad_lines` was removed in pandas 2.0; on pandas >= 1.3 use on_bad_lines="skip" instead
    df = pd.read_csv(filepath, sep=",", header=None, error_bad_lines=False)
    df = df.replace(r"[^A-Za-z0-9]+", np.nan, regex=True)    # remove special chars
    stack = df.stack()
    clean_df = pd.Series(stack.unique())
    if save:
        clean_df.to_csv(target, index=False)
        print("Your results have been saved to '{}'".format(target))
    return clean_df
if __name__ == "__main__":
    # Set up input prompts
    msg1 = "Enter path to input file e.g. ./test.txt: "
    msg2 = "Save results to a file? y/[n]: "
    try:
        # Python 2
        fp = raw_input(msg1)
        result = raw_input(msg2)
    except NameError:
        # Python 3
        fp = input(msg1)
        result = input(msg2)
    if result.startswith("y"):
        save = True
    else:
        save = False
    print(stack(fp, save=save))
Run the script from the command line in its working directory and answer the prompts:
> python restack.py
Enter path to input file e.g. ./test.txt: ./@data/test2.txt
Save results to a file? y/[n]: y
Your results have been saved to './output.txt'
Your results should be printed in your console and, optionally, saved to the file output.txt. Adjust any parameters to suit your needs.
Answer 1 (score: 1)
If your stack.txt file looks like this (i.e. it has been saved as a .txt file):
"India1,India2,myIndia "
"Where,Here,Here "
"Here,Where,India,uyete"
"AFD,TTT"
Solution:
from collections import OrderedDict
with open("stack.txt", "r") as f:
    # read your data in and strip off any new-line characters
    data = [eval(line).strip() for line in f.readlines()]
# get individual words into a list
individual_elements = [word for row in data for word in row.split(",")]
# remove duplicates and preserve order
uniques = OrderedDict.fromkeys(individual_elements)
# convert from OrderedDict object to plain list
final = [word for word in uniques]
To get the desired columnar output:
print("\n".join(final))
Which yields:
India1
India2
myIndia
Where
Here
India
uyete
AFD
TTT
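The solution above calls eval on every line, which executes whatever the file contains. A hedged alternative sketch: ast.literal_eval parses only Python literals, so it can unwrap the quoted strings without running code. The helper name unique_cells is hypothetical, the sample lines are copied from the question in place of stack.txt, and the order-preserving dict.fromkeys relies on Python 3.7 or later:

```python
import ast

# Sample lines copied from the question, standing in for the contents of stack.txt.
lines = [
    '"India1,India2,myIndia "\n',
    '"Where,Here,Here "\n',
    '"Here,Where,India,uyete"\n',
    '"AFD,TTT"\n',
]


def unique_cells(lines):
    """Return the unique comma-separated cells in first-seen order (hypothetical helper)."""
    cells = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, which literal_eval cannot parse
        text = ast.literal_eval(line)  # parses the quoted string literal, nothing else
        cells.extend(word.strip() for word in text.split(","))
    # dict keys keep insertion order, so duplicates drop out without reordering.
    return list(dict.fromkeys(cells))


final = unique_cells(lines)
print("\n".join(final))
```

With a real file, pass the open file handle (or f.readlines()) to unique_cells instead of the list.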
Answer 2 (score: 0)
I won't give you the complete code, but I'll give you some ideas.
First, you need to read all the lines of the file:
lines = open("file.txt").readlines()
Then, extract the data from each line:
lines = [line.split(",") for line in lines]
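You can use itertools.combinations to generate combinations. For each line, print the combinations of the line's elements.
If you don't care about the order of the elements, you can use set to get the unique elements. Before using set, you should first flatten the list lines, perhaps with itertools.chain.from_iterable.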
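The flatten-then-deduplicate idea can be sketched as follows. This is a minimal illustration rather than the answerer's own code: the sample lines are copied from the question in place of file.txt, and stripping the double quotes along with the whitespace is an added assumption to match that sample.

```python
import itertools

# Sample lines copied from the question, standing in for open("file.txt").readlines().
lines = [
    '"India1,India2,myIndia "\n',
    '"Where,Here,Here "\n',
    '"Here,Where,India,uyete"\n',
    '"AFD,TTT"\n',
]

# Split every line into its comma-separated pieces (a list of lists).
rows = [line.split(",") for line in lines]

# Flatten the list of lists into a single stream of cells.
flat = itertools.chain.from_iterable(rows)

# Trim spaces, double quotes, and newlines, then let the set drop duplicates.
unique = {cell.strip(' "\n') for cell in flat}
unique.discard("")  # ignore cells that were empty after trimming

# Sets are unordered, so sort for a stable display.
print(sorted(unique))
```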
Answer 3 (score: 0)
Your code for reading the text file line by line is fine. So you still need to:
Split each line on the commas. You can use split:
line.split(',')
Remove the whitespace, so I would strip each cell:
[value.strip() for value in line.split(',')]
Remove the duplicates. You can use set:
set(cells)
Finally, I think it is better to use with (a context manager) when reading the file. Putting it all together:
with open('df.txt', 'r') as f:
    cells = []
    for line in f:
        cells += [value.strip() for value in line.split(',')]
cells = list(set(cells))
If you want something more compact, the same can be done in a single list comprehension.
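A compact variant of the loop could look like the sketch below. It is an illustration under stated assumptions, not the answerer's original snippet: io.StringIO stands in for df.txt with the sample lines from the question, a set comprehension replaces the list comprehension so duplicates are dropped in the same step, and the double quotes are stripped together with the whitespace.

```python
import io

# Sample lines copied from the question; io.StringIO stands in for open('df.txt', 'r').
sample = (
    '"India1,India2,myIndia "\n'
    '"Where,Here,Here "\n'
    '"Here,Where,India,uyete"\n'
    '"AFD,TTT"\n'
)

with io.StringIO(sample) as f:
    # One pass: split each line on commas, trim spaces/quotes/newlines, deduplicate.
    cells = list({value.strip(' "\n') for line in f for value in line.split(",")})

# The list order is arbitrary because it comes from a set; sort it for display.
print(sorted(cells))
```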