Hello, I am trying to extract data from a txt file that looks like this:
[2018-07-10 15:04:11] USER INPUT "hello"
[2018-07-10 15:04:12] SYSTEM RESPONSE: "Hello! How are you doing today"
[2018-07-10 15:04:42] USER INPUT "I am doing good thank you"
[2018-07-10 15:04:42] SYSTEM RESPONSE: "Good to know"
so that I end up with a list containing only the text inside the double quotes:
["hello","Hello! How are you doing today","I am doing good thank you","Good to know"]
This is what I am trying:
corpus_raw = ""
for log_filename in log_filenames:
    print("Reading '{0}'...".format(log_filename))
    with codecs.open(log_filename, "rb", encoding='utf-8', errors='ignore') as log_file:
        corpus_raw += log_file.read()
    corpus_raw = re.findall(r'\[(.*?)\]\s+', corpus_raw)
    print("Corpus is now {0} characters long".format(len(corpus_raw)))
    print()
But I am not getting the result I want. Any suggestions would help! Thanks.
Answer 0 (score: 1)
You can use a non-greedy .*? between the quotes:
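import re
# Collect the first double-quoted string on each line; lines without one are skipped.
with open('filename.txt') as f:
    contents = [m.group(1) for m in (re.search(r'"(.*?)"', line) for line in f) if m]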
Output:
['hello', 'Hello! How are you doing today', 'I am doing good thank you', 'Good to know']
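For what it's worth, here is a small sketch of the difference between the two patterns. The pattern in the question, \[(.*?)\]\s+, captures what sits between the square brackets (the timestamps), while "(.*?)" captures what sits between each pair of double quotes. The sample log is inlined as a string so this runs without a file; the names sample_log, timestamps and quoted are made up for the illustration.

```python
import re

# Sample log from the question, inlined so the snippet runs without a file.
sample_log = (
    '[2018-07-10 15:04:11] USER INPUT "hello"\n'
    '[2018-07-10 15:04:12] SYSTEM RESPONSE: "Hello! How are you doing today"\n'
    '[2018-07-10 15:04:42] USER INPUT "I am doing good thank you"\n'
    '[2018-07-10 15:04:42] SYSTEM RESPONSE: "Good to know"\n'
)

# The question's pattern: captures the text between [ and ], i.e. the timestamps.
timestamps = re.findall(r'\[(.*?)\]\s+', sample_log)

# The non-greedy pattern: captures the text between each pair of double quotes.
quoted = re.findall(r'"(.*?)"', sample_log)

print(timestamps)
print(quoted)
```

Since . does not match a newline by default, a quoted string is never matched across two lines here.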
Answer 1 (score: 1)
You can simply split corpus_raw on " and take every other item of the resulting list:
import codecs

corpus_raw = ""
for log_filename in log_filenames:
    print("Reading '{0}'...".format(log_filename))
    with codecs.open(log_filename, "rb", encoding='utf-8', errors='ignore') as log_file:
        corpus_raw += log_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw)))
    print()
# Odd-indexed pieces are the ones that sat between a pair of double quotes.
corpus_raw = corpus_raw.split('"')[1::2]
corpus_raw will then be (given your sample input):
['hello', 'Hello! How are you doing today', 'I am doing good thank you', 'Good to know']
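A quick sketch of why [1::2] works, and where it can go wrong. Splitting on " yields pieces that alternate between text outside the quotes and text inside them, so the odd indexes are the quoted parts. That assumes the quotes are balanced; the second string below is a made-up example with a stray quote, not something from your log.

```python
import re

# One line from the question's sample.
line = '[2018-07-10 15:04:11] USER INPUT "hello"'
pieces = line.split('"')
# pieces alternates: outside the quotes, inside, outside, ...
print(pieces)
inside = pieces[1::2]
print(inside)

# Hypothetical input with an unbalanced quote: the alternation is thrown off.
stray = 'say "hi" and 5" wide'
by_split = stray.split('"')[1::2]
by_regex = re.findall(r'"(.*?)"', stray)
print(by_split)
print(by_regex)
```

With balanced quotes both approaches agree; with a stray quote the split picks up the trailing text as if it were quoted, while the regex only returns complete pairs.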
Answer 2 (score: 0)
Using cut:
$ cut -d'"' -f2 < so.txt
hello
Hello! How are you doing today
I am doing good thank you
Good to know
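If you want the same thing from inside Python, a rough stand-in for that cut call might look like the sketch below. The function name cut_field2 and the sample list are made up for the illustration. Like cut without -s, it passes through lines that contain no delimiter at all.

```python
def cut_field2(lines, delimiter='"'):
    """Rough stand-in for cut -d'"' -f2: second delimiter-separated field per line."""
    result = []
    for line in lines:
        line = line.rstrip('\n')
        fields = line.split(delimiter)
        # Like cut without -s, pass through lines that contain no delimiter.
        result.append(fields[1] if len(fields) > 1 else line)
    return result


# Two lines from the question plus one made-up line without quotes.
sample = [
    '[2018-07-10 15:04:11] USER INPUT "hello"\n',
    '[2018-07-10 15:04:12] SYSTEM RESPONSE: "Hello! How are you doing today"\n',
    'a line without quotes\n',
]
for field in cut_field2(sample):
    print(field)
```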