Question

I have a text file like this.


1       firm A         Manhattan (company name)     25,000
                       SK Ventures                  25,000
                       AEA investors                10,000
2       firm B         Tencent collaboration        16,000
                       id TechVentures              4,000
3       firm C         xxx                          625

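I want to build a matrix and put each item into it. For example, the first row of the matrix would be:

[[1, Firm A, Manhattan, 25,000], ['', '', SK Ventures, 25,000], ['', '', AEA investors, 10,000]]

or

[[1, '', ''], [Firm A, '', ''], [Manhattan, SK Ventures, AEA investors], [25,000, 25,000, 10,000]]

To do this, I want to parse the text on each line of the file. For example, from the first line I could create [1, Firm A, Manhattan, 25,000]. However, I can't figure out exactly how to do it. Each field starts at the same position but ends at a different position. Is there a good way to do this?

Thanks.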

Answer 1

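Based on the data you provided*, the shape of a line depends on whether it starts with a digit or with whitespace, and the data can be separated as

(number)(spaces)(letters with single spaces)(spaces)(letters with single spaces)(spaces)(digits and commas)

or

(spaces)(letters with single spaces)(spaces)(digits and commas)

That is what the two regular expressions below express. They build a dictionary keyed by the leading number; each entry holds a firm name and a list of (company, value) pairs.

I couldn't work out what your matrix arrangement is supposed to be.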

import pprint
import re

data = {}
with open('data.txt') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        if re.match(r'^\d', line):
            # New group: index, firm, company, value (x and y are throwaway inner groups)
            matches = re.findall(r'^(\d+)\s+((\S\s|\s\S|\S)+)\s\s+((\S\s|\s\S|\S)+)\s\s+([0-9,]+)', line)
            idx, firm, x, company, y, value = matches[0]
            data[idx] = {}
            data[idx]['firm'] = firm.strip()
            data[idx]['company'] = [(company.strip(), value)]
        else:
            # Continuation line: company and value only, attached to the last index seen
            matches = re.findall(r'\s+((\S\s|\s\S|\S)+)\s\s+([0-9,]+)', line)
            company, x, value = matches[0]
            data[idx]['company'].append((company.strip(), value))

pprint.pprint(data)

->

{'1': {'company': [('Manhattan (company name)', '25,000'),
                   ('SK Ventures', '25,000'),
                   ('AEA investors', '10,000')],
       'firm': 'firm A'},

 '2': {'company': [('Tencent collaboration', '16,000'),
                   ('id TechVentures', '4,000')],
       'firm': 'firm B'},

 '3': {'company': [('xxx', '625')], 
       'firm': 'firm C'}
}

*This works for your example, but it may not cope well with your real data. YMMV.
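If the list layout from the question is still wanted, the dictionary printed above can be flattened back into it. This is only a sketch: the helper name `to_rows` is hypothetical, and it assumes the `{'firm': ..., 'company': [(name, value), ...]}` shape shown in the output and a Python version (3.7+) where dictionaries keep insertion order.

```python
def to_rows(data):
    """Turn {idx: {'firm': ..., 'company': [(name, value), ...]}} into
    one group of 4-column rows per index."""
    matrix = []
    for idx, entry in data.items():
        group = []
        for i, (company, value) in enumerate(entry['company']):
            if i == 0:
                # First row of a group carries the index and the firm name.
                group.append([idx, entry['firm'], company, value])
            else:
                # Continuation rows leave those two columns empty.
                group.append(['', '', company, value])
        matrix.append(group)
    return matrix


# Sample input copied from the output above (two of the three entries).
sample = {
    '1': {'firm': 'firm A',
          'company': [('Manhattan (company name)', '25,000'),
                      ('SK Ventures', '25,000'),
                      ('AEA investors', '10,000')]},
    '3': {'firm': 'firm C', 'company': [('xxx', '625')]},
}

rows = to_rows(sample)
for group in rows:
    print(group)
```

Each group then has the `[1, Firm A, Manhattan, 25,000], ['', '', SK Ventures, 25,000], ...` shape from the question, with every value kept as a string.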

Answer 2

Well, if you know all the starting positions:

# 0123456789012345678901234567890123456789012345678901234567890
# 1       firm A         Manhattan (company name)     25,000 
#                        SK Ventures                  25,000
#                        AEA investors                10,000 
# 2       firm B         Tencent collaboration        16,000 
#                        id TechVentures              4,000 
# 3       firm C         xxx                          625 
# Field #1 is 8 wide (0 -> 7)
# Field #2 is 15 wide (8 -> 22)
# Field #3 is 29 wide (23 -> 51)
# Field #4 is arbitrarily wide (52 -> end of line)
field_lengths = [8, 15, 29]
data = []
with open('/path/to/file', 'r') as f:
    for row in f:
        # Strip only the right-hand side: leading spaces carry the column alignment.
        row = row.rstrip()
        if not row:
            continue  # skip blank lines
        pieces = []
        for x in field_lengths:
            piece = row[:x].strip()
            pieces.append(piece)
            row = row[x:]
        pieces.append(row.strip())
        data.append(pieces)
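
The same idea can be written with start offsets instead of widths, which makes the column boundaries easier to check against the ruler. This is a sketch under assumptions: the function name `split_fixed_width` and the constant `STARTS` are hypothetical, and the offsets 0, 8, 23 and 52 are my reading of the sample rows (the third field has to be wide enough for the 24-character "Manhattan (company name)").

```python
def split_fixed_width(line, starts):
    """Cut `line` at each start offset; the last field runs to end of line."""
    bounds = list(starts) + [None]  # None as a slice end means "to the end"
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]


# Assumed start columns for the four fields.
STARTS = [0, 8, 23, 52]

# Rebuild two sample rows with explicit column widths so the layout is unambiguous.
layout = "{:<8}{:<15}{:<29}{}"
sample_lines = [
    layout.format("1", "firm A", "Manhattan (company name)", "25,000"),
    layout.format("", "", "SK Ventures", "25,000"),
]

parsed = [split_fixed_width(line, STARTS) for line in sample_lines]
for fields in parsed:
    print(fields)
```

Continuation rows come out with empty strings in the first two columns, which matches the `['', '', SK Ventures, 25,000]` layout asked for in the question.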

Answer 3

If I understand correctly (though I'm not entirely sure), this produces the output I think you want.

import re

with open('data.txt', 'r') as f:
    f_txt = f.read() # Change file object to text
    f_lines = re.split(r'\n(?=\d)', f_txt)
    matrix = []
    for line in f_lines:
        inner1 = line.split('\n')
        inner2 = [re.split(r'\s{2,}', l) for l in inner1]
        matrix.append(inner2)

print(matrix)
print('')
for row in matrix:
    print(row)

Program output:

[[['1', 'firm A', 'Manhattan (company name)', '25,000'], ['', 'SK Ventures', '25,000'], ['', 'AEA investors', '10,000']], [['2', 'firm B', 'Tencent collaboration', '16,000'], ['', 'id TechVentures', '4,000']], [['3', 'firm C', 'xxx', '625']]]

[['1', 'firm A', 'Manhattan (company name)', '25,000'], ['', 'SK Ventures', '25,000'], ['', 'AEA investors', '10,000']]
[['2', 'firm B', 'Tencent collaboration', '16,000'], ['', 'id TechVentures', '4,000']]
[['3', 'firm C', 'xxx', '625']]

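My reasoning is that you want the first row of the matrix to be: [[1, Firm A, Manhattan, 25,000], ['', SK Ventures, 25,000], ['', AEA investors, 10,000]]

However, to handle more rows we end up with a list nested three levels deep. That is the output of print(matrix). It can be a bit awkward to work with, which is why TessellatingHeckler's answer uses a dictionary to store the data; I think that is a better way to get at what you need. But if the nested "matrix" list is what you're after, the code I wrote above produces it.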
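
The continuation rows in that output have three fields rather than four, and the question also mentions a column-wise layout. A possible way to get both, sketched with hypothetical helper names `normalize` and `transpose` and assuming no field inside a row is legitimately empty:

```python
def normalize(group, width=4):
    """Left-pad short rows with '' so every row has `width` columns."""
    rows = []
    for row in group:
        fields = [f for f in row if f]  # drop the empty leading field
        rows.append([''] * (width - len(fields)) + fields)
    return rows


def transpose(rows):
    """Swap rows and columns of an equal-width list of lists."""
    return [list(col) for col in zip(*rows)]


# First group from the program output above.
group = [['1', 'firm A', 'Manhattan (company name)', '25,000'],
         ['', 'SK Ventures', '25,000'],
         ['', 'AEA investors', '10,000']]

rows = normalize(group)
columns = transpose(rows)
print(rows)
print(columns)
```

`rows` corresponds to the first layout in the question and `columns` to the second one, with all values still stored as strings.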

Python: parsing text from a .txt file

3 answers: