Question

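I have old text files with lines like the following:

John Deer              Works in College
Alex H Johnson         Hobby is painting
David Martin Smith     Runs everyday to keep fit

The first column is a name and the second column is a description. In this example, the description column starts at column 23 on every line. There are many such text files, and each one has a different column number at which the description starts. There is no way to distinguish (programmatically) between the name and the description. For a particular text file, I want to find the column number at which the description starts, so that I can insert details about more people into the file while keeping the format. Is there a way to find this column number for each text file? Or is there any other way to add new name/description entries while keeping the format?

Edit: Based on the suggested answers, I implemented the following code to add a new entry to an existing text file:

with open (filename, 'r') as fr:
    descPos = []
    for line in fr:
        pos = line.rfind('    ') #4 spaces
        if pos != -1:
            pos += 4
            descPos.append(pos)
descColumn = max(descPos, key = descPos.count) #The mode of descPos values will be the column position where description starts
spacesBetweenNameAndDesc = descColumn - len(name)
newEntry = name + ' '*spacesBetweenNameAndDesc + desc
with open(file, 'w') as fw:
    fw.write(newEntry)

Here "name" and "desc" are the new name and description to be appended. Is this the best way to add a new entry while keeping the format?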

Answer 1

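I'll try to answer as best I can. I'm not sure why you need the index of the second column, but assuming you do, the code below shows how to get that index and how to get the fields of each line as a list of strings.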

import re

temp="""
John Deer              Works in College
Alex H Johnson         Hobby is painting
David Martin Smith     Runs everyday to keep fit"""

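for line in temp.split("\n"):
    # each match is a run of 2+ whitespace characters followed by a word character
    m = re.finditer(r'\s{2,}\w', line)
    for i in m:
        print(i.end() - 1)  # index where the second column starts

    # collapse each run of 2+ whitespace characters into a tab, then split into fields
    lis = re.sub(r"\s{2,}", '\t', line).split("\t")
    if lis != ['']:
        print(lis)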
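
Once the index is known, it can be used to build a new line in the same layout. The following is only a minimal sketch: the helper names `find_desc_column` and `format_entry` and the entry "Jane Roe" are made up for illustration, and it assumes that names never contain two consecutive spaces. It takes the most common start column across the lines, so a few irregular rows do not change the result.

```python
import re
from collections import Counter

# Two or more whitespace characters followed by a non-space character.
GAP = re.compile(r"\s{2,}(?=\S)")


def find_desc_column(lines):
    """Return the most common description start column, or None (hypothetical helper)."""
    columns = []
    for line in lines:
        match = GAP.search(line.rstrip())
        if match:
            columns.append(match.end())
    if not columns:
        return None
    return Counter(columns).most_common(1)[0][0]


def format_entry(name, desc, column):
    """Pad name so desc starts at column, keeping at least two spaces between them."""
    width = max(column, len(name) + 2)
    return name.ljust(width) + desc


sample = [
    "John Deer              Works in College",
    "Alex H Johnson         Hobby is painting",
    "David Martin Smith     Runs everyday to keep fit",
]
column = find_desc_column(sample)
print(column)
print(format_entry("Jane Roe", "Likes hiking", column))
```

If a name is longer than the detected column, `format_entry` falls back to two spaces after the name, which breaks the alignment for that row but keeps the two columns separable.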

For completeness, you can also use pandas together with StringIO to format the data. Below is an example that creates a DataFrame:

import sys
import re
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd
temp="""
John Deer              Works in College
Alex H Johnson         Hobby is painting
David Martin Smith     Runs everyday to keep fit"""

TESTDATA = StringIO(re.sub(r'\s{2,}', '\t', temp))

df = pd.read_csv(TESTDATA, sep="\t",names=['Names','Description'])

Answer 2

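It looks like you are trying to find the position of the first word character that follows at least two whitespace characters in a line.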

import itertools
import re

with open(filename) as fd:
    rx = re.compile(r'(?<=\s\s)\w+')
    # search the first 5 lines, skipping lines without a match
    matches = (rx.search(line) for line in itertools.islice(fd, 5))
    ix = max(m.start() for m in matches if m)
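
To try the lookbehind without touching a real file, the same idea can be run against an in-memory file. This is a sketch under assumptions: the `io.StringIO` object stands in for one of the text files, and the blank line is added only to show that lines without a match have to be skipped. Note that `\w+` only matches if the description starts with a letter, digit or underscore.

```python
import io
import itertools
import re

# Hypothetical in-memory stand-in for one of the text files.
fd = io.StringIO(
    "John Deer              Works in College\n"
    "Alex H Johnson         Hobby is painting\n"
    "\n"
    "David Martin Smith     Runs everyday to keep fit\n"
)

rx = re.compile(r"(?<=\s\s)\w+")  # a word preceded by two whitespace characters

starts = []
for line in itertools.islice(fd, 5):  # look at the first 5 lines only
    match = rx.search(line)
    if match is not None:  # blank or single-column lines yield no match
        starts.append(match.start())

ix = max(starts) if starts else None
print(ix)
```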

Answer 3

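Another approach that works, provided that (a) at least one line in the input file has at least two spaces between the columns, (b) the text within a column never contains more than a single consecutive space, and (c) the columns are aligned throughout the file: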

def get_description_position(filename):
  with open(filename) as f:
    for line in f:
      pos = line.rfind('  ')+2          #rfind returns -1 if not found, so pos is 1
      if pos > 1: return pos            #return as soon as a row matches
  raise Exception('Could not find description column')

################################################################################

filename = '56259699.txt'               #whatever your input filename

################################################################################

try: col = get_description_position(filename)
except Exception as msg: print(msg)
else:
  with open(filename) as f:
    for line in f:
      name, desc = line[:col].strip(),line[col:].strip()
      print(f'{name:20s} {desc}')
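
Assumption (c) can be checked before relying on the column returned by `get_description_position`. The sketch below is not part of the answer: `misaligned_lines` is a hypothetical helper, and the "Mary Major" row is invented to show a line that is off by two columns.

```python
def misaligned_lines(lines, col):
    """Yield (line_number, text) for non-blank rows that break the layout at col."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            continue  # blank lines carry no columns to check
        gap_ok = col >= 2 and text[col - 2:col] == "  "   # two spaces right before col
        starts_ok = len(text) > col and text[col] != " "  # description begins at col
        if not (gap_ok and starts_ok):
            yield number, text


# Sample rows built with ljust so the layout is explicit; the last one is off by two.
rows = [
    "John Deer".ljust(23) + "Works in College\n",
    "Alex H Johnson".ljust(23) + "Hobby is painting\n",
    "Mary Major".ljust(25) + "Sings in a choir\n",
]
for number, text in misaligned_lines(rows, 23):
    print(f"line {number} is not aligned at column 23: {text!r}")
```

Rows reported here would be split in the wrong place by `line[:col]` and `line[col:]`, so they are worth fixing by hand before new entries are added.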

Answer 4

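Your wording is a bit vague and you haven't included any sample code, so this may be a shot in the dark.

In any case, you can do this really easily in pandas by loading the file into a DataFrame with read_csv, read_excel, or whichever reader fits your file.

As I understand it, you want to pull two columns out of a larger dataset into a new DataFrame.

This is what I would do:

import pandas as pd

df = pd.read_excel('your_file_here.xlsx')
name_description_df = df[['Name', 'Description']]

Does that answer your question? Also, what have you tried so far?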

Answer 5

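You can split each line into 2 strings and then search for the index of the first character of the second string.

For example:

x = "John Deer              Works in College"

Using the str.split method (note the double space in the method argument):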

y = x.split("  ", maxsplit=1) #['John Deer', '            Works in College']

Then use the str.strip method to remove the leading spaces from the second string:

z = y[1].strip(' ') #'Works in College'
character = z[0]  #'W'

Now you can use the str.find method to find the index:

index = len(y[0]) + y[1].find(character) +2  #23

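The added +2 accounts for the two-space substring "  " that was removed when splitting the original string.

That said, I encourage you to use a standard format such as .csv or .json. That way, you will be able to parse the file easily with a single method in many libraries.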
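
As an illustration of that suggestion, an existing fixed-width file could be converted to CSV once the description column is known. This is a rough sketch using only the standard library; the function name `fixed_width_to_csv`, the header names and the column value 23 (taken from the example in the question) are assumptions.

```python
import csv
import io


def fixed_width_to_csv(lines, col, out):
    """Split each non-blank line at col and write (name, description) rows as CSV."""
    writer = csv.writer(out)
    writer.writerow(["name", "description"])
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerow([line[:col].strip(), line[col:].strip()])


sample = [
    "John Deer              Works in College\n",
    "Alex H Johnson         Hobby is painting\n",
    "David Martin Smith     Runs everyday to keep fit\n",
]
buffer = io.StringIO()  # stands in for a file opened with newline=""
fixed_width_to_csv(sample, 23, buffer)
print(buffer.getvalue())
```

With the data in CSV, adding an entry is a single `writer.writerow([name, desc])` call and no alignment has to be maintained.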

Get the column index of the beginning of a string in a file

5 answers: