Question

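I'm trying to write a small crawler that scrapes several Wikipedia pages. I want to make the crawl more dynamic by building the hyperlink for each wiki page from a file that contains a list of names. For example, the first line of "deutsche_Schauspieler.txt" says "Alfred Abel", and the concatenated string would be "https://de.wikipedia.org/wiki/Alfred Abel". When I use the txt file, heading ends up being None, yet when I complete the link with a string inside the script, it works.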

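This is for Python 2.x. I have already tried switching from " to ', using + instead of %s, putting the whole URL into the txt file (so that the first line reads "http://..." instead of "Alfred Abel"), and switching from "Alfred Abel" to "Alfred_Abel".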

from bs4 import BeautifulSoup
import requests

file = open("test.txt","w")
f = open("deutsche_Schauspieler.txt","r")

content = f.readlines()

for line in content:    
    link = "https://de.wikipedia.org/wiki/%s" % (str(line))
    response = requests.get(link)
    html = response.content
    soup = BeautifulSoup(html)
    heading = soup.find(id='Vorlage_Personendaten')
    uls = heading.find_all('td')
    for item in uls:
        file.write(item.text.encode('utf-8') + "\n")

f.close()
file.close()

I expect to get the contents of the table "Vorlage_Personendaten", which actually works if I change line 10 of the script (the link assignment) to

link = "https://de.wikipedia.org/wiki/Alfred Abel"
# link = "https://de.wikipedia.org/wiki/Alfred_Abel" also works

But I want it to work with the text file.

Answer 1

Judging from your question, the text file contains "Alfred Abel" including the quotes, and that is why you are getting the following exception:

uls = heading.find_all('td')
AttributeError: 'NoneType' object has no attribute 'find_all'

Please remove the quotes around "Alfred Abel" and put plain Alfred Abel in the text file deutsche_Schauspieler.txt. It will then work as expected.
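The AttributeError quoted above is raised because soup.find returns None when the page has no element with the requested id, which is what a malformed URL leads to. Below is a minimal sketch of checking for that case explicitly instead of letting the script crash. The helper name cell_texts and the MissingPage stand-in are hypothetical and not part of the original script; MissingPage only imitates a page without the id so the sketch runs offline, and in the real script soup would be the BeautifulSoup object.

```python
# Sketch only: cell_texts and MissingPage are hypothetical names used for
# illustration; they do not appear in the original script.

def cell_texts(soup, element_id="Vorlage_Personendaten"):
    """Return the text of every <td> inside the element with element_id.

    Returns an empty list when the element is missing, instead of raising
    AttributeError on None.
    """
    heading = soup.find(id=element_id)
    if heading is None:
        # find() found nothing, e.g. because the URL pointed at the wrong page.
        return []
    return [td.text for td in heading.find_all("td")]


class MissingPage(object):
    """Stand-in for a parsed page that lacks the requested id."""

    def find(self, id=None):
        return None


print(cell_texts(MissingPage()))  # prints [] rather than raising
```

In the crawler, an empty result for a name is a hint that the link built from that line should be printed and inspected.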

Answer 2

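I found the solution myself. Although the file has no extra line, the content array is displayed as ['Alfred Abel\n'], whereas printing the first element of the array shows 'Alfred Abel'. The line is still used exactly as it is stored in the array, newline included, which produces a wrong link. So you want to remove the last (!) character from the current line. A solution could look like this: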

from bs4 import BeautifulSoup
import requests

file = open("test.txt","w")
f = open("deutsche_Schauspieler.txt","r")

content = f.readlines()
print (content)
for line in content:    
    # Drops the trailing "\n". It is a single character, even though it is
    # written with two; line.rstrip("\n") is safer if the last line has no newline.
    line = line[:-1]
    link = "https://de.wikipedia.org/wiki/%s" % str(line)
    response = requests.get(link)
    html = response.content
    soup = BeautifulSoup(html,"html.parser")
    try:
        heading = soup.find(id='Vorlage_Personendaten')
        uls = heading.find_all('td')
        for item in uls:
            file.write(item.text.encode('utf-8') + "\n")
    except AttributeError:  # soup.find returned None: id not on the page
        print ("That did not work")

f.close()
file.close()
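The newline behaviour described in this answer can be checked without any network access. The following is a minimal sketch, assuming the names in the file need nothing more than whitespace stripping; build_link is a hypothetical helper, not part of the script above, and the list literal stands in for what readlines() returns.

```python
# Sketch only: build_link is a hypothetical helper for illustration.
BASE = "https://de.wikipedia.org/wiki/%s"


def build_link(line):
    # strip() removes a trailing "\n" (and "\r" or spaces) when present and
    # leaves a line without a newline untouched, unlike line[:-1].
    name = line.strip()
    return BASE % name


# What readlines() yields for a one-line file that ends with a newline.
content = ["Alfred Abel\n"]

print(repr(content[0]))           # repr makes the hidden newline visible
print(len("\n"))                  # 1: the newline is a single character
print(build_link(content[0]))     # https://de.wikipedia.org/wiki/Alfred Abel
print(build_link("Alfred Abel"))  # same link; the final "l" is not cut off
```

Using strip() rather than slicing means the last line of the file is handled correctly whether or not it ends with a newline.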

How do I use the requests get method with a concatenated string?
