I have two different text files, for example:
text1 = Movie1 interact comedi seri featur ...
Movie2 each week audienc write ...
Movie3 struggl make success relationship ....
text2 = Movie2 Action
Movie3 Drama
Movie4 Sci-fi
What I want is:
text3 = Movie2 each week audienc write ...
Movie3 struggl make success relationship ....
and text4 = Movie2 Action
Movie3 Drama
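text1 and text2 are only illustrative; the real files are much larger. text1 contains plot summaries for many movies, and text2 contains genre information for an even larger set of movies. I want to extract only 10,000 of the movies that appear in both files, matched by movie name, into text3 and text4. Given that I am a beginner, how can I do this in Python?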
Answer 0 (score: 0)
Assuming you have already opened each text file:
def process_file(f):
    # Keep only the non-blank lines of an open file
    return [line for line in f if line.strip()]

def get_word(string):
    # Return the first whitespace-separated word of the line (or None)
    parts = string.split()
    return (parts[0] if parts else None), string

def insert_line(string, d):
    # Insert the line into the dict d, with its first word as key
    word, line = get_word(string)
    if word:
        d[word] = line

lines1 = process_file(file1)  # file1 and file2 are the open file objects
lines2 = process_file(file2)

dict1 = {}
for line in lines1:
    insert_line(line, dict1)
dict2 = {}
for line in lines2:
    insert_line(line, dict2)  # build dicts

set1 = set(dict1.keys())  # build sets with keys
set2 = set(dict2.keys())
intersection = set1 & set2  # get set intersection

intersection_lines = []
for key in intersection:  # build list with intersection
    intersection_lines.append(dict1[key])
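At the end of this script, intersection_lines will contain the lines you want from file1. To do the same for file2, just swap dict1 for dict2 in the final loop. This is simple because the intersection operation is already implemented in the set class as the & operator. Note that this only works if the first word of each line is unique within its file, because a repeated key overwrites the earlier line in the dict.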
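The script above leaves out opening the files, writing text3 and text4, and the 10,000-movie limit from the question. The following is a minimal sketch of one way to put those pieces together, not a definitive solution. The function names (index_by_title, common_lines, write_common), the limit parameter, and the UTF-8 encoding are assumptions made for illustration; it also assumes Python 3.7+, where dicts keep insertion order, and that the movie name is the first word of each line.

```python
from itertools import islice

def index_by_title(lines):
    """Map the first whitespace-separated word of each non-blank line to that line."""
    index = {}
    for line in lines:
        parts = line.split()
        if parts:
            # setdefault keeps the first line seen for a title and ignores repeats
            index.setdefault(parts[0], line.rstrip("\n"))
    return index

def common_lines(lines1, lines2, limit=10000):
    """Return up to `limit` matching lines from each input, in the order of lines1."""
    index1 = index_by_title(lines1)
    index2 = index_by_title(lines2)
    # Iterating index1 follows the order in which titles appeared in lines1
    titles = list(islice((t for t in index1 if t in index2), limit))
    return [index1[t] for t in titles], [index2[t] for t in titles]

def write_common(path1, path2, out1, out2, limit=10000):
    """Read two files and write their common-title lines to two output files."""
    # All four paths are placeholders supplied by the caller
    with open(path1, encoding="utf-8") as f1, open(path2, encoding="utf-8") as f2:
        kept1, kept2 = common_lines(f1, f2, limit)
    with open(out1, "w", encoding="utf-8") as o1, open(out2, "w", encoding="utf-8") as o2:
        o1.writelines(line + "\n" for line in kept1)
        o2.writelines(line + "\n" for line in kept2)
```

With hypothetical file names, a call would look like write_common("text1.txt", "text2.txt", "text3.txt", "text4.txt"). Unlike iterating over a set, this keeps the output in the order of the first file, so line N of text3 and line N of text4 refer to the same movie.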