Question

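I am working with recursive neural networks and need to process my input text file (which contains trees) to extract the words. The input file looks like this: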

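(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

(4 (4 (4 (2 The) (4 (3 gorgeously) (3 (2 elaborate) (2 continuation)))) (2 (2 (2 of) (2 ``)) (2 (2 The) (2 (2 (2 Lord) (2 (2 of) (2 (2 the) (2 Rings)))) (2 (2 '') (2 trilogy)))))) (2 (3 (2 (2 is) (2 (2 so) (2 huge))) (2 (2 that) (3 (2 (2 (2 a) (2 column)) (2 (2 of) (2 words))) (2 (2 (2 (2 can) (1 not)) (3 adequately)) (2 (2 describe) (2 (3 (2 (2 co-writer/director) (2 (2 Peter) (3 (2 Jackson) (2 's)))) (3 (2 expanded) (2 vision))) (2 (2 of) (2 (2 (2 J.R.R.) (2 (2 Tolkien) (2 's))) (2 Middle-earth))))))))) (2 .)))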

As output, I would like a list of the words in a new text file, like this:

The

Rock

is

destined

...

(Ignore the blank lines between the words.)

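I tried to do this in Python but could not find a solution. I also read that awk can be used for text processing, but I could not produce any working code. Any help is appreciated.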

Answer 1

You can use re.findall:

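import re
# Collect every run of letters, hyphens, dots, apostrophes and slashes;
# set() removes duplicates but does not preserve the original word order.
with open('tree_file.txt') as f, open('word_list.txt', 'a') as f1:
    f1.write('\n'.join(set(re.findall(r"[a-zA-Z\-.'/]+", f.read()))))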

Running the code above on the text gives the following output:

make
not
gorgeously
the
Conan
than
so
huge
and
co-writer/director
Peter
st
is
can
Schwarzenegger
expanded
even
trilogy
Middle-earth
Segal
continuation
column
vision
's
he
''
Damme
adequately
that
greater
Steven
Rock
Jackson
Rings
a
Tolkien
Van
be
words
going
to
new
Jean-Claud
or
elaborate
of
splash
Lord
The
Arnold
describe
destined
J.R.R.
Century
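Because set() discards both duplicates and ordering, the list above does not follow the order of the sentence. A minimal sketch of an order-preserving variant, assuming first-seen order is what is wanted; the helper name ordered_unique_words is hypothetical and not part of the answer:

```python
import re

def ordered_unique_words(text):
    """Return tokens in first-seen order, dropping later duplicates."""
    tokens = re.findall(r"[a-zA-Z\-.'/]+", text)
    # dict keys keep insertion order (Python 3.7+), so this de-duplicates
    # without shuffling the words the way set() can.
    return list(dict.fromkeys(tokens))

sample = "(3 (2 (2 The) (2 Rock)) (3 (2 is) (2 (2 the) (2 Rock))))"
print(ordered_unique_words(sample))
```

If repeated words should be kept as well, returning tokens directly instead of the de-duplicated list would do that.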

Answer 2

You can use a regular expression!

import re
my_string = "..."  # placeholder: paste the tree text from the question here
pattern = r"\(\d\s+('?\w+)"
results = re.findall(pattern, my_string)
print(results)
# ['The',
#  'Rock',
#  'is',
#  'destined',
#  'to',
#  'be',
#  'the',
# ...

Note that re.findall returns a list of matches, so if you want to print them all as a single string, you can use:

' '.join(results)

or any other separator you would like between the words instead of a space.
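Since the question asks for one word per line in a new file, the same join idea can be used with a newline as the separator. This is only a sketch: the function names extract_words and save_words are made up for illustration, and the output path is whatever the caller chooses.

```python
import re

def extract_words(tree_text):
    """Apply the pattern from this answer and return the captured words."""
    return re.findall(r"\(\d\s+('?\w+)", tree_text)

def save_words(words, path):
    """Write one word per line to the given file path."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(words) + "\n")

print(extract_words("(3 (2 (2 The) (2 Rock)) (2 's))"))
```

Calling save_words(extract_words(my_string), "word_list.txt") would then produce the requested file. Because \w does not match hyphens, slashes or dots, tokens such as Jean-Claud or co-writer/director are cut at the first such character with this pattern.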

Breaking the regex pattern down, we have:

pattern = r"""
           \(           # match opening parenthesis
             \d         # match a number. If the numbers can be >9, use \d+
               \s+      # match one or more white space characters
                  (     # begin capturing group (only return stuff inside these parentheses)
                   '?   # match zero or one apostrophes (so we don't miss possessives)
                   \w+  # match one or more text characters
                  )     # end capture group
           """

Answer 3

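For the record, we can choose what to throw away instead of what to keep. For example, we can split on parentheses, whitespace and digits. What remains consists of words and punctuation. This can be handy for non-Latin text and special characters.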

import re

# split on parens, numbers and spaces
spl = re.compile(r"\(|\s|[0-9]|\)")
# string_to_split holds the tree text; filter(None, ...) drops the empty strings
words = list(filter(None, spl.split(string_to_split)))
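To illustrate the splitting idea on non-Latin input, here is a short sketch. The function name split_tokens and the sample strings are assumptions for demonstration only; the separators are merged into one character class, which splits on the same characters as the pattern above.

```python
import re

# One or more parentheses, whitespace characters or digits act as a separator.
splitter = re.compile(r"[()\s0-9]+")

def split_tokens(text):
    """Split on separators and drop the empty strings at the edges."""
    return [tok for tok in splitter.split(text) if tok]

print(split_tokens("(3 (2 Привет) (2 мир))"))
print(split_tokens("(2 (2 21st) (2 Century))"))
```

One trade-off of splitting on digits is that tokens containing numbers lose them, so 21st comes out as st.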

Answer 4

You can use re.compile:

import re
def getWords(text):
    # '+' matches whole runs of letters rather than single characters
    return re.compile('[A-Za-z]+').findall(text)

with open('input_file.txt') as f_in:
  with open('output_file.txt', 'a') as f_out:
    f_out.write('\n'.join(getWords(f_in.read())))
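A letters-only pattern splits tokens such as Jean-Claud or 's and drops digits. As an alternative sketch (a different technique from the one in this answer), the leaves of the tree can be matched directly, assuming every leaf has the form "(label token)" with no whitespace or parentheses inside the token. The names LEAF, leaf_tokens and words_only are hypothetical.

```python
import re

# A leaf is "(", a numeric label, whitespace, a token, then ")".
LEAF = re.compile(r"\(\d+\s+([^\s()]+)\)")

def leaf_tokens(tree_text):
    """Return every leaf token in sentence order, punctuation included."""
    return LEAF.findall(tree_text)

def words_only(tokens):
    """Keep tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(c.isalnum() for c in t)]

sample = "(3 (2 (2 The) (2 Rock)) (2 (2 (2 21st) (2 (2 Century) (2 's))) (2 .)))"
tokens = leaf_tokens(sample)
print(tokens)
print(words_only(tokens))
```

This keeps the word order of the sentence and leaves it to the caller to decide whether punctuation tokens should be written out.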
