Question

I have a file containing two datasets that I want to read into Python as two columns.

The data is in the format:

xxx yyy    xxx yyy   xxx yyy

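and so on, so I understand that I need to split it somehow. I am new to Python (and to programming in general), so I have been struggling a bit so far. Currently I am trying to use: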

def read(file):

    column1=[]
    column2=[]
    readfile = open(file, 'r')
    a = (readfile.read())
    readfile.close()

How do I split the file I have read into column1 and column2?

Answer 1

This is quite simple with the Python module pandas. Suppose you have a data file like this:

>cat data.txt
xxx  yyy  xxx  yyy  xxx yyy
xxx yyy    xxx yyy   xxx yyy
xxx yyy  xxx yyy   xxx yyy
xxx yyy    xxx yyy  xxx yyy
xxx yyy    xxx  yyy   xxx yyy

>from pandas import DataFrame
>from pandas import read_csv
>from pandas import concat
>dfin = read_csv("data.txt", header=None, delimiter=r"\s+").add_prefix('X')  # read_csv's prefix= argument was removed in pandas 2.0
> dfin
    X0   X1   X2   X3   X4   X5
0  xxx  yyy  xxx  yyy  xxx  yyy
1  xxx  yyy  xxx  yyy  xxx  yyy
2  xxx  yyy  xxx  yyy  xxx  yyy
3  xxx  yyy  xxx  yyy  xxx  yyy
4  xxx  yyy  xxx  yyy  xxx  yyy
>dfout = DataFrame()
>dfout['X0'] = concat([dfin['X0'], dfin['X2'], dfin['X4']], axis=0, ignore_index=True)
>dfout['X1'] = concat([dfin['X1'], dfin['X3'], dfin['X5']], axis=0, ignore_index=True)
> dfout
      X0   X1
 0   xxx  yyy
 1   xxx  yyy
 2   xxx  yyy
 3   xxx  yyy
 4   xxx  yyy
 5   xxx  yyy
 6   xxx  yyy
 7   xxx  yyy
 8   xxx  yyy
 9   xxx  yyy
 10  xxx  yyy
 11  xxx  yyy
 12  xxx  yyy
 13  xxx  yyy
 14  xxx  yyy
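
For readers without pandas, the same column-pair stacking can be sketched with the standard library alone. This is only an illustration and not part of the pandas session above: the helper name `stack_pairs` is hypothetical, and it assumes every non-blank line has the same, even number of whitespace-separated fields.

```python
def stack_pairs(lines):
    """Stack alternating fields into two lists, one column pair at a time."""
    # str.split() with no argument splits on any run of whitespace,
    # so irregular spacing between the fields does not matter.
    rows = [line.split() for line in lines if line.strip()]
    column1, column2 = [], []
    if not rows:
        return column1, column2
    width = len(rows[0])  # assumption: all rows share this (even) width
    for k in range(0, width, 2):
        # Take field k (xxx) and field k + 1 (yyy) from every row
        column1.extend(row[k] for row in rows)
        column2.extend(row[k + 1] for row in rows)
    return column1, column2


if __name__ == "__main__":
    sample = ["a1 b1  a2 b2", "a3 b3   a4 b4"]
    print(stack_pairs(sample))
```

As with the concat calls above, the result lists all rows of the first pair, then all rows of the second pair, and so on.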

Hope it helps. Best.

Answer 2

Here is a simple example of getting the xxx values into column1 and the yyy values into column2.

Important! The data in your file must look like:

xxx yyy    xxx yyy    xxx yyy
with 4 spaces between groups (xxx yyy    xxx yyy) and 1 space between the two values of each pair (xxx yyy)

You can also use a different separator scheme, for example:

xxx,yyy/xxx,yyy/xxx,yyy
You only need to change data_separator=',' and column_separator='/'

or

xxx-yyy/xxx-yyy/xxx-yyy
You only need to change data_separator='-' and column_separator='/'

def read(file):
    column1 = []
    column2 = []
    data_separator = ' '  # one space separates xxx and yyy
    column_separator = '    '  # 4 spaces separate the groups: xxx yyy    xxx yyy

    with open(file, 'r') as readfile:  # the file is closed automatically
        for line in readfile:  # in case you have more than 1 line
            line = line.rstrip('\n')  # remove the trailing newline
            print(line)

            columns = line.split(column_separator)  # get the data groups
            # columns is now a list like ['xxx yyy', 'xxx yyy', 'xxx yyy']

            for column in columns:
                if not column:
                    continue  # if the group is empty, ignore it
                values = column.split(data_separator)
                column1.append(values[0])
                column2.append(values[1])
    return column1, column2

With a text file containing xxx yyy    aaa bbb    ttt hhh, the result after calling the function is:

column1 = ['xxx', 'aaa', 'ttt']
column2 = ['yyy', 'bbb', 'hhh']
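
The separator swap described above can be sketched by passing the two separators as arguments instead of editing the function body. The helper name `split_pairs` and its parameter defaults are assumptions made for illustration. It works on a single line of text and raises a ValueError if a group does not contain the data separator.

```python
def split_pairs(line, data_separator=' ', column_separator='    '):
    """Split one line into two lists using configurable separators."""
    column1, column2 = [], []
    for group in line.strip().split(column_separator):
        group = group.strip()
        if not group:
            continue  # skip empty groups, e.g. from a trailing separator
        # Split each group once into its xxx and yyy parts
        first, second = group.split(data_separator, 1)
        column1.append(first.strip())
        column2.append(second.strip())
    return column1, column2


if __name__ == "__main__":
    print(split_pairs('xxx yyy    aaa bbb    ttt hhh'))
    print(split_pairs('xxx,yyy/aaa,bbb/ttt,hhh', ',', '/'))
    print(split_pairs('xxx-yyy/aaa-bbb/ttt-hhh', '-', '/'))
```

All three calls should give the same two lists as the result shown above, since only the separators differ.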

Answer 3

In your example, the second gap between datasets is 3 spaces... so I assume the datasets are separated by at least two spaces...

# reading the file does not seem to be your problem ;)
# this also works with more than 3/4/n spaces...
data = 'xxx yyy    xxx yyy             xxx yyy'

# collapse every run of more than two spaces down to exactly two
while '   ' in data:
    data = data.replace('   ', '  ')

# split the datasets, which are now separated by two spaces
data = data.split('  ')

# split each dataset into its columns
data = [x.split(' ') for x in data]

# transpose for easier (requested?) access
column1, column2 = zip(*data)

print(column1)
print(column2)

The output is:

('xxx', 'xxx', 'xxx')
('yyy', 'yyy', 'yyy')
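
The collapse-then-split steps can also be sketched in one call with a regular expression. This is an alternative to the while loop above, not the code that produced the output shown. The function name `split_datasets` is hypothetical, and it assumes the input is non-empty, that datasets are separated by two or more spaces, and that the two values inside a dataset are separated by exactly one space.

```python
import re


def split_datasets(data):
    """Split on runs of two or more spaces, then transpose into two tuples."""
    # r' {2,}' matches any run of at least two spaces, so no loop is needed
    groups = re.split(r' {2,}', data.strip())
    pairs = [group.split(' ') for group in groups]
    # zip(*pairs) transposes [[x, y], [x, y], ...] into (x, x, ...), (y, y, ...)
    column1, column2 = zip(*pairs)
    return column1, column2


if __name__ == "__main__":
    column1, column2 = split_datasets('xxx yyy    xxx yyy             xxx yyy')
    print(column1)
    print(column2)
```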

Hope this helps you :)

How to split into columns

3 answers: