Question

"test_tweet1.txt" contains two lines:

@francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
@mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<

"Personal.txt" contains:

The Game (rapper)
The Notorious B.I.G.
The Undertaker
Thor
Tiësto
Timbaland
T.I.
Tom Cruise
Tony Romo
Trajan
Triple H

My code:

import re
popular_person = open('C:/Users/Personal.txt')
rpopular_person = popular_person.read()
file1 = open("C:/Users/test_tweet1.txt").readlines()
array = []
count1 = 0
for line in file1:
    array.append(line)
    count1 = count1 + 1
    print "\n",count1, line
    ltext1 = line.split(" ")
    for i,text in enumerate(ltext1):
        if text in rpopular_person:
            print text
    text2 = ' '.join(ltext1)

The code prints:

1 @francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
Tony
The
man
to
the
the

2 @mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
aga

I am trying to match the words in "test_tweet1.txt" against the names in "Personal.txt".

Expected result:

Tony
Romo

Any suggestions?

Answer 1

You need to split rpopular_person so that it matches whole words rather than substrings:

rpopular_person = open('C:/Users/Personal.txt').read().split()

This gives:

Tony
The

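Romo doesn't show up because splitting the line on spaces gives you "Romo." with the trailing period attached. Maybe you should look for each entry of rpopular_person in the line, rather than the other way round. Perhaps something like this: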

popular_person = open('C:/Users/Personal.txt').read().splitlines()
file1 = open("C:/Users/test_tweet1.txt")
for count1, line in enumerate(file1, 1):
    print "\n", count1, line
    for person in popular_person:
        # Skip blank entries: an empty string is "in" every line
        if person and person in line:
            print person
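
If word-by-word matching is still preferred, one alternative is to strip punctuation from each tweet word before the lookup, so that "Romo." becomes "Romo". The following is only a sketch under assumptions: match_words is a hypothetical helper, and the name list and tweet are shortened stand-ins for the real files. Single-argument print() is used so the snippet runs on Python 2 and 3.

```python
import string

def match_words(tweet, known_words):
    """Return tweet words (edge punctuation stripped) that are in known_words."""
    hits = []
    for token in tweet.split():
        # "Romo." -> "Romo", "QB." -> "QB"; inner characters are untouched
        word = token.strip(string.punctuation)
        if word and word in known_words:
            hits.append(word)
    return hits

# Shortened stand-in for Personal.txt, split into individual words
known_words = set("Tom Cruise\nTony Romo\nTrajan".split())
tweet = "2nd worst QB. DEFINITELY Tony Romo. The man who likes to share."

print(match_words(tweet, known_words))  # prints ['Tony', 'Romo']
```

With the full Personal.txt from the question, "The" would also be reported, because entries such as "The Undertaker" put "The" into the word set.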

Answer 2

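Your problem seems to be that rpopular_person is just one big string. So when you ask something like 'to' in rpopular_person, you get True, because the characters 't', 'o' appear next to each other somewhere in it. I assume the same goes for 'the' somewhere in Personal.txt.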
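
To make the difference concrete, here is a small sketch. The sample text is an assumption standing in for Personal.txt; the point is only that `in` means "substring" for a string and "exact element" for a list.

```python
# Sample data standing in for Personal.txt (an assumption for illustration)
names_text = "The Notorious B.I.G.\nTom Cruise\nTony Romo\n"

# On a string, `in` is a substring test: "to" occurs inside "Notorious"
substring_hit = "to" in names_text

# On a list of words, `in` only matches whole elements
words = names_text.split()
word_hit = "to" in words
tony_hit = "Tony" in words

print(substring_hit)  # True
print(word_hit)       # False
print(tony_hit)       # True
```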

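What you want to do is split Personal.txt into individual words, just as you split your tweet. You can also turn the resulting list of words into a set, since that makes lookups faster. Like this: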

people = set(popular_person.read().split())

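Note that I'm calling split() without any arguments. This splits on all whitespace: newlines, tabs, and so on. That way you get every word separately, as you wanted. Or, if you don't actually want that (because, going by your edited Personal.txt contents, it will keep giving you "The" as a result), make it: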

people = set(popular_person.read().split('\n'))

This splits on newlines, so you only look for full-name matches.
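
A quick comparison of the two splits, using a three-line sample as an assumed stand-in for Personal.txt:

```python
# Stand-in for the contents of Personal.txt (an assumption for illustration)
sample = "The Undertaker\nTom Cruise\nTony Romo"

by_whitespace = set(sample.split())   # individual words
by_line = set(sample.split("\n"))     # full names, one per line

print("The" in by_whitespace)    # True: "The" is a word on its own
print("The" in by_line)          # False: only "The Undertaker" is stored
print("Tony Romo" in by_line)    # True
```

One caveat: if the file ends with a newline, split("\n") leaves an empty string as the last element, whereas str.splitlines() does not.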

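You don't get "Romo" because that is not a word in your tweet. The word in your tweet is "Romo." with a period. This will most likely remain a problem for you, so what I would do is rearrange your logic (assuming speed is not an issue). Instead of looking at each word in the tweet, look at each name in the Personal.txt file and check whether it is in your full tweet. That way you don't have to deal with punctuation and the like. Here is how I would rewrite your function: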

with open("Personal.txt") as p:
    # Get full names rather than partial names; drop blank lines,
    # since an empty string is "in" every tweet
    people = [name for name in p.read().split('\n') if name]
with open("test_tweet1.txt") as tweets:
    for tweet in tweets:
        for person in people:
            if person in tweet:
                print person
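
One thing to watch with a plain `person in tweet` check is that it is still a substring test, so a short name such as "Thor" would match inside a longer word. A possible refinement, offered only as a sketch, is to require that the name is not directly surrounded by word characters. find_names is a hypothetical helper and the sample tweet is made up for illustration.

```python
import re

def find_names(tweet, names):
    """Return the names that occur in tweet as whole words or phrases."""
    hits = []
    for name in names:
        if not name:
            continue  # ignore blank entries
        # Lookarounds instead of \b so names ending in "." (e.g. "T.I.") work
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, tweet):
            hits.append(name)
    return hits

names = ["Thor", "Tom Cruise", "Tony Romo", "T.I."]
tweet = "DEFINITELY Tony Romo. Thoroughly the worst."

naive = [n for n in names if n in tweet]
print(naive)                     # ['Thor', 'Tony Romo']
print(find_names(tweet, names))  # ['Tony Romo']
```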

Error: matching words in files
