Intersection of two text files

Time: 2017-04-18 17:20:26

Tags: python text

I have two different text files, for example:

    text1 = Movie1 interact comedi seri featur ...
            Movie2 each week audienc write ...
            Movie3 struggl make success relationship ....

    text2 = Movie2 Action
            Movie3 Drama
            Movie4 Sci-fi

What I want is

    text3 = Movie2 each week audienc write ...
            Movie3 struggl make success relationship ....

and

    text4 = Movie2 Action
            Movie3 Drama

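text1 and text2 are only illustrative; the real files are much larger. text1 contains summaries of many movies, and text2 contains genre information for even more movies. I want to extract only the 10000 entries that appear in both files, matched by movie name, into text3 and text4. Given that I am a beginner, how can I do this in Python?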

1 answer:

Answer 0: (score: 0)

Assuming you have already opened each text file (as file1 and file2):

    def process_file(f):
        # Read every line from the open file and drop blank lines
        return [line for line in f if line.strip()]

    def get_word(string):
        # Return the first whitespace-separated word of the line (or None
        # if there is none), together with the unchanged line
        parts = string.split(None, 1)
        return (parts[0] if parts else None), string

    def insert_line(string, mapping):
        # Insert the line into the dict, with its first word as the key
        word, line = get_word(string)
        if word:
            mapping[word] = line

    lines1 = process_file(file1)
    lines2 = process_file(file2)
    dict1 = {}
    for line in lines1:
        insert_line(line, dict1)
    dict2 = {}
    for line in lines2:
        insert_line(line, dict2)  # build dicts
    set1 = set(dict1.keys())      # build sets with keys
    set2 = set(dict2.keys())
    intersection = set1 & set2    # get set intersection
    intersection_lines = []
    for key in intersection:      # build list with intersection
        intersection_lines.append(dict1[key])

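At the end of this script, intersection_lines contains the lines you want from file1. To get the matching lines from file2, use dict2[key] instead of dict1[key] in the final loop. This is simple because set intersection is already implemented in the set class as the & operator. Note that this only works if the first word of each line is unique within its file: if a movie name appears twice, the later line overwrites the earlier one in the dict. Also note that a set has no defined order, so intersection_lines will not necessarily follow the order of the original file.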
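
As a complement to the approach above, here is a minimal sketch that produces both text3 and text4 in one pass and keeps the lines in their original file order. The helper names (first_word, intersect_by_first_word, intersect_files) and the file paths passed to intersect_files are hypothetical, chosen for illustration only; the sample data is taken from the question with the trailing ellipses dropped. It assumes the movie name is the first whitespace-separated word on each line and that the files are UTF-8 text.

```python
def first_word(line):
    """Return the first whitespace-separated word, or None for a blank line."""
    parts = line.split(None, 1)
    return parts[0] if parts else None


def intersect_by_first_word(lines1, lines2):
    """Keep, from each input, the lines whose first word occurs in both inputs."""
    lines1, lines2 = list(lines1), list(lines2)
    keys1 = {first_word(line) for line in lines1}
    keys2 = {first_word(line) for line in lines2}
    common = (keys1 & keys2) - {None}  # ignore blank lines
    kept1 = [line for line in lines1 if first_word(line) in common]
    kept2 = [line for line in lines2 if first_word(line) in common]
    return kept1, kept2


def intersect_files(path1, path2, out1, out2):
    """Read two files and write their intersecting lines to out1 and out2.

    All four paths are placeholders supplied by the caller.
    """
    with open(path1, encoding="utf-8") as f1, open(path2, encoding="utf-8") as f2:
        kept1, kept2 = intersect_by_first_word(f1, f2)
    with open(out1, "w", encoding="utf-8") as o1, open(out2, "w", encoding="utf-8") as o2:
        o1.writelines(kept1)
        o2.writelines(kept2)


# Sample data modelled on the question
text1 = [
    "Movie1 interact comedi seri featur\n",
    "Movie2 each week audienc write\n",
    "Movie3 struggl make success relationship\n",
]
text2 = [
    "Movie2 Action\n",
    "Movie3 Drama\n",
    "Movie4 Sci-fi\n",
]

text3, text4 = intersect_by_first_word(text1, text2)
print("".join(text3), end="")
print("".join(text4), end="")
```

Unlike the dict-based version, this sketch keeps every line whose first word is shared, so a movie name that appears twice in one file yields two output lines rather than one. For real files, a call such as intersect_files("text1.txt", "text2.txt", "text3.txt", "text4.txt") would write the results to disk, with those file names standing in for your own.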