Question

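I want to scrape the sawmill owner (the value after "Owned by:") from https://www.sawmilldatabase.com/sawmill.php?id=1282 using BeautifulSoup.

I tried to adapt this very similar answer, but for reasons I don't understand, it doesn't work.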

<td>
   <a href="../company.php?id=729">AKD Softwoods </a>
</td>

The Python:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.sawmilldatabase.com/sawmill.php?id=1282')

soup = BeautifulSoup(page.text, 'html.parser')

lst = soup.find_all('TD')
for td in lst:
    if td.text == "Owned by":
        print("yes")
        print(lst[lst.index(td)+1].text)

Answer 1

I used a regular expression to help find the element you are looking for.

Code:

import requests, re
from bs4 import BeautifulSoup

page = requests.get('https://www.sawmilldatabase.com/sawmill.php?id=1282')
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.find('a', href=re.compile('company.php')).text)

Output:

AKD Softwoods
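
A slightly stricter variant of the same idea, offered as a sketch rather than a definitive fix: escape the dot so the pattern matches a literal ".", and handle the case where no matching link is found. The inline HTML fragment and the helper name company_link_text are assumptions made for illustration; the real page is not fetched here.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the snippet in the question, not the real page.
SAMPLE_HTML = """
<table>
  <tr>
    <td>Owned by: </td>
    <td><a href="../company.php?id=729">AKD Softwoods </a></td>
  </tr>
</table>
"""

def company_link_text(html):
    """Return the text of the first link to company.php?id=<digits>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # The escaped dot and question mark match literally; \d+ matches the numeric id.
    link = soup.find("a", href=re.compile(r"company\.php\?id=\d+"))
    # find() returns None when nothing matches, so guard before reading the text.
    return link.get_text(strip=True) if link else None

print(company_link_text(SAMPLE_HTML))
```

Note that this picks the first company link anywhere in the document, so it only works if the owner link is the first such link on the page.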

Answer 2

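To address the code you posted: it fails because you use if td.text == "Owned by" as the condition. Although this looks like it should work, it never matches, because on the site you are scraping the owner comes after a cell whose text is "Owned by: ", with a colon and a trailing space. (If you inspect the webpage, you will see the cell is marked up as <td>Owned by: </td>.)

Although the difference between "Owned by" and "Owned by: " seems trivial, it matters a great deal to your program, because == only succeeds on an exact match. Just change the condition to if td.text == "Owned by: ": and you will get the right answer: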

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.sawmilldatabase.com/sawmill.php?id=1282')

soup = BeautifulSoup(page.text, 'html.parser')

lst = soup.find_all('td')
for td in lst:
    if td.text == "Owned by: ":
        print("yes")
        print(lst[lst.index(td)+1].text)

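Alternatively, you can use if "Owned by" in td.text: as the condition, but this is not quite ideal if there are other <td> tags that contain that text somewhere inside them.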
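
One way to make the exact comparison less fragile, sketched under assumptions: strip surrounding whitespace before comparing, and walk the cells with enumerate instead of lst.index(td). The HTML fragment and the helper name owner_after_label are hypothetical and stand in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the markup described above, not the real page.
SAMPLE_HTML = (
    "<table><tr>"
    "<td>Owned by: </td>"
    '<td><a href="../company.php?id=729">AKD Softwoods </a></td>'
    "</tr></table>"
)

def owner_after_label(html, label="Owned by:"):
    """Return the text of the cell that follows the label cell, or None."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td")
    # Stop one short of the end so cells[i + 1] always exists.
    for i, td in enumerate(cells[:-1]):
        # strip=True removes the trailing space, so "Owned by: " equals "Owned by:".
        if td.get_text(strip=True) == label:
            return cells[i + 1].get_text(strip=True)
    return None

print(owner_after_label(SAMPLE_HTML))
```

Using the index from enumerate also avoids a subtle issue with lst.index(td): BeautifulSoup compares tags by name, attributes, and contents, so index() returns the first cell that looks the same, which may not be the cell the loop is currently on.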

Hope it helps!

Edit

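Oh, and don't capitalize TD in lst = soup.find_all('TD'). BeautifulSoup's HTML parsers convert tag names to lowercase, so searching for 'TD' finds nothing; use 'td' instead.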
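
A small sketch of why the capitalization matters, assuming the html.parser backend used in the question. The uppercase markup below is made up for the demonstration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup written in uppercase on purpose.
html = "<TABLE><TR><TD>Owned by: </TD><TD>AKD Softwoods</TD></TR></TABLE>"
soup = BeautifulSoup(html, "html.parser")

# The parser lowercases tag names, so only the lowercase search matches.
upper_matches = soup.find_all("TD")
lower_matches = soup.find_all("td")

print(len(upper_matches), len(lower_matches))
```

Because the first search returns an empty list, the for loop in the question never runs, and neither "yes" nor the owner is printed.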

Answer 3

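How about the following approach? If you stick with the pattern if sth.text=="sth else: ", the main issue is that the text inside the quotes must be identical to the text stored in the web page. If you happen to write if sth.text=="sth else:" instead, it will no longer work, because the extra space at the end has been dropped. Try this: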

import requests 
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.sawmilldatabase.com/sawmill.php?id=1282").text,"lxml")
for items in soup.select("table td"):
    if "Owned by:" in items.text:
        name = items.find_next_sibling().text
        print(name)

Output:

AKD Softwoods
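
The same sibling-based lookup can be tried offline, as a sketch: the fragment below is a hypothetical stand-in for the page, html.parser is swapped in for lxml so that no extra parser needs to be installed, and the helper name owner_via_sibling is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with whitespace between cells, as in typical page source.
SAMPLE_HTML = """
<table>
  <tr>
    <td>Owned by: </td>
    <td>
       <a href="../company.php?id=729">AKD Softwoods </a>
    </td>
  </tr>
</table>
"""

def owner_via_sibling(html, label="Owned by:"):
    """Find the cell containing the label and return its next <td> sibling's text."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.select("table td"):
        if label in cell.get_text():
            # Passing "td" skips the whitespace text nodes between the cells.
            neighbor = cell.find_next_sibling("td")
            if neighbor is not None:
                return neighbor.get_text(strip=True)
    return None

print(owner_via_sibling(SAMPLE_HTML))
```

The substring test tolerates the trailing space, but on a page with nested tables an outer cell would also contain the label, so the first match may not be the intended one.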

Scraping a timber industry database with BeautifulSoup
