I have a tab-delimited file with no file extension; its type is listed simply as FILE. When I open it in a text editor, it looks like this:
Job Wanted_VERB "_. 2000 1 1
Job Wanted_VERB "_. 2001 1 1
Job Wanted_VERB "_. 2002 5 5
Job Wanted_VERB "_. 2004 2 2
Job Wanted_VERB "_. 2005 2 2
Job Wanted_VERB "_. 2006 2 2
Job Wanted_VERB "_. 2007 1 1
Job Well Done 1917 1 1
Job Well Done 1930 3 2
Job Well Done 1937 1 1
Job Well Done 1940 5 4
Job Well Done 1941 3 3
Job Well Done 1942 1 1
Job Well Done 1943 2 2
Job Well Done 1944 1 1
Job Well Done 1945 1 1
Job Well Done 1946 3 3
Job Well Done 1948 1 1
Job Well Done 1949 4 4
Job Well Done 1950 1 1
Job Well Done 1951 3 2
Job Well Done 1952 6 4
Job Well Done 1953 9 5
Job Well Done 1954 6 4
Job Well Done 1955 5 5
....
....
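The first three columns are the words of a 3-gram, and the remaining columns hold word-frequency data.
This is a huge file, so I only want to parse the part that contains the 3-gram I am looking for. For example, from the table above, I only want to parse the Job Well Done part: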
Job Well Done 1917 1 1
Job Well Done 1930 3 2
Job Well Done 1937 1 1
Job Well Done 1940 5 4
Job Well Done 1941 3 3
Job Well Done 1942 1 1
Job Well Done 1943 2 2
Job Well Done 1944 1 1
Job Well Done 1945 1 1
Job Well Done 1946 3 3
Job Well Done 1948 1 1
Job Well Done 1949 4 4
Job Well Done 1950 1 1
Job Well Done 1951 3 2
Job Well Done 1952 6 4
Job Well Done 1953 9 5
Job Well Done 1954 6 4
Job Well Done 1955 5 5
At the moment I parse the entire file and put it into a list like this:
with open(file, 'rt', encoding='UTF8') as input:
    z = [line.strip().split('\t') for line in input]
Any help?
Answer 0 (score: 0)
Yes, add a startswith check as the if condition of the list comprehension, like this:
with open(file, 'rt', encoding='UTF8') as f:
    # Keep only lines that begin with the wanted 3-gram; the file is still read line by line.
    z = [line.strip().split("\t") for line in f if line.startswith("Job Well Done")]
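The same idea can be wrapped in a small helper, which may be handy if several 3-grams have to be looked up. This is only a sketch: the function name read_ngram_rows and its sep parameter are made up for illustration, and it assumes the three words are separated by single spaces and followed by a tab, as the sample appears to show. If the words are themselves tab-separated, passing sep="\t" should cover that case.

```python
def read_ngram_rows(path, ngram, sep=" "):
    # Hypothetical helper: return the tab-split rows whose leading words equal `ngram`.
    # `sep` is the assumed separator between the words of the n-gram inside a line.
    # The trailing tab ends the prefix at a field boundary, so "Job Well Done"
    # does not also match a longer token such as "Job Well Done_NOUN".
    prefix = sep.join(ngram.split()) + "\t"
    rows = []
    with open(path, "rt", encoding="utf-8") as f:
        for line in f:  # iterating the file object reads one line at a time
            if line.startswith(prefix):
                rows.append(line.rstrip("\n").split("\t"))
    return rows
```

It would be called as z = read_ngram_rows(file, "Job Well Done"). Only the matching rows are kept in memory, although every line is still scanned once. If the file is known to be sorted by 3-gram, which the sample suggests but does not guarantee, the loop could stop as soon as it has passed the matching block.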