Question

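So I have a program in which I'm supposed to take an external file, open it in Python, and then separate every word and every punctuation mark, including commas, apostrophes and full stops. Then I'm supposed to save this file as the integer position at which each word and punctuation mark occurs in the text.

For example: - I like to code, because to code is fun. A computer's skeleton.

In my program, I have to save this as: -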

1,2,3,4,5,6,3,4,7,8,9,10,11,12,13,14

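(Help for those who don't understand) 1-I, 2-like, 3-to, 4-code, 5-(,), 6-because, 7-is, 8-fun, 9-(.), 10-A, 11-computer, 12-('), 13-s, 14-skeleton

So this shows the position of each word; even if a word is repeated, it shows the position of that word's first occurrence.

Sorry for the long explanation, but here is my actual problem. So far I have done this: -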

    file = open('newfiles.txt', 'r')
    with open('newfiles.txt','r') as file:
        for line in file:
            for word in line.split():
                 print(word)

The result is as follows: -

  They
  say
  it's
  a
  dog's
  life,.....

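Unfortunately, this way of splitting the file doesn't separate the words from the punctuation, and it doesn't print horizontally. .split doesn't work on the file itself. Does anyone know a more efficient way to split the file into words and punctuation, and then store the separated words and punctuation together in a list?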

Answer 1

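The built-in string method .split only handles simple delimiters. Called without arguments, it just splits on whitespace. For more complex splitting behavior, the simplest approach is to use a regular expression: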

>>> s = "I like to code, because to code is fun. A computer's skeleton."
>>> import re
>>> delim = re.compile(r"""\s|([,.;':"])""")
>>> tokens = filter(None, delim.split(s))
>>> idx = {}
>>> result = []
>>> i = 1
>>> for token in tokens:
...     if token in idx:
...         result.append(idx[token])
...     else:
...         result.append(i)
...         idx[token] = i
...         i += 1
...
>>> result
[1, 2, 3, 4, 5, 6, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 9]

Also, based on your specification, I don't think you need to iterate over the file line by line. You should do something like this:

with open('my file.txt') as f:
    s = f.read()

This puts the whole file into s as a single string. Note that I never called open before the with statement; doing so serves no purpose.
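Putting the pieces of this answer together, a minimal sketch might look like the following. The helper name encode_positions and the sample text are illustrative assumptions, not part of the original answer; the delimiter pattern is the one shown above.

```python
import re

# Same delimiter pattern as in the answer above: split on whitespace,
# and keep the listed punctuation marks as separate tokens.
DELIM = re.compile(r"""\s|([,.;':"])""")


def encode_positions(text):
    """Return (tokens, positions), where each token is mapped to the
    1-based index at which it first appeared. Hypothetical helper name."""
    tokens = [t for t in DELIM.split(text) if t]  # drop None and ''
    first_seen = {}
    positions = []
    for token in tokens:
        # setdefault stores a new index only the first time a token is seen
        positions.append(first_seen.setdefault(token, len(first_seen) + 1))
    return tokens, positions


# The newline stands in for text read from a multi-line file with f.read().
sample = "I like to code, because to code is fun.\nA computer's skeleton."
tokens, positions = encode_positions(sample)
print(positions)
```

Because \s also matches newlines, the string returned by f.read() can be passed to the function directly, without looping over lines.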

Answer 2

Use a regular expression to capture the relevant substrings:

import re

my_string = "I like to code, because to code is fun. A computer's skeleton."
matched = re.findall(r"(\w+)([',.]?)", my_string)  # Capture each word and any punctuation right after it

Filter out the empty matches and add the rest to the result:

result = []
for word, punc in matched:
    result.append(word)
    if punc: # Check if punctuation follows the word
        result.append(punc)

Then write the result to your file:

with open("file.txt", "w") as f:
    f.writelines(piece + "\n" for piece in result)  # writelines adds no newlines itself, so append one per piece

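The regular expression works by matching a run of word characters (\w+) and then optionally capturing one punctuation mark (an apostrophe, comma or period) that immediately follows it.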
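One limitation worth noting: the pattern above captures a punctuation mark only when it directly follows a word, so a run such as "life,....." would lose everything after the comma. As an alternative sketch (not the answerer's method), re.findall with an alternation can return words and every non-space punctuation character as separate tokens in one pass. The variable name pieces is illustrative.

```python
import re

my_string = "I like to code, because to code is fun. A computer's skeleton."

# \w+      : a run of word characters (letters, digits, underscore)
# [^\w\s]  : any single character that is neither a word character nor whitespace
pieces = re.findall(r"\w+|[^\w\s]", my_string)
print(pieces)
```

With this pattern no filtering step is needed, since findall never returns empty matches here.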

Answer 3

You can solve this with a regular expression and split. Hopefully this points you in the right direction. Good luck!

import re
str1 = '''I like to code, because to code is fun. A computer's skeleton.'''

# Split the string into a list using a regex with a capturing group,
# so the words themselves are kept alongside the text between them:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", str1) if x not in ['', ' ']]
print(matches)
d = {}
i = 1
list_with_positions = []

# Now build the dictionary entries: each new token gets the next index,
# and a repeated token reuses the index of its first occurrence.
for match in matches:
    if match not in d:
        d[match] = i
        i += 1
    list_with_positions.append(d[match])

print(list_with_positions)

Here is the output. Note that the final period gets position #9:

['I', 'like', 'to', 'code', ',', 'because', 'to', 'code', 'is', 'fun', '.', 'A', 'computer', "'", 's', 'skeleton', '.']

[1, 2, 3, 4, 5, 6, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 9]
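The question also asks for the positions to be saved to a file. A minimal sketch of that last step follows, assuming a single comma-separated line is the desired format; the file name positions.txt is hypothetical.

```python
import os
import tempfile

list_with_positions = [1, 2, 3, 4, 5, 6, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 9]

# Join the integers into one comma-separated line, e.g. "1,2,3,..."
line = ",".join(str(n) for n in list_with_positions)

# Written into a fresh temporary directory so the sketch does not overwrite
# anything in the working directory; "positions.txt" is a hypothetical name.
out_path = os.path.join(tempfile.mkdtemp(), "positions.txt")
with open(out_path, "w") as f:
    f.write(line + "\n")

# Reading the file back recovers the original list of integers.
with open(out_path) as f:
    restored = [int(n) for n in f.read().strip().split(",")]
print(restored == list_with_positions)
```

In a real program the path would simply be whatever output file the assignment calls for.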

Using the split function on a file opened in Python
