I have written a Python script to scrape table content from a webpage. The first column of the main table contains names. Some names link to another page; others are plain names with no link. When a name has no link, my intention is simply to parse that row. When a name does link to another page, the script should first parse the relevant row from the main table, then follow the link and parse that name's associated information from the table under the heading Companies at the bottom of the linked page. Finally, everything should be written to a single csv file.
Here is what I have tried so far:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

link = "https://suite.endole.co.uk/insight/company/ajax_people.php?ajax_url=ajax_people&page=1&company_number=03512889"
base = "https://suite.endole.co.uk"

res = requests.get(link)
soup = BeautifulSoup(res.text, "lxml")

for item in soup.select("table tr")[1:]:
    if not item.select_one("td a[href]"):
        first_table = [i.text for i in item.select("td")]
        print(first_table)
    else:
        first_table = [i.text for i in item.select("td")]
        print(first_table)
        url = urljoin(base, item.select_one("td a[href]").get("href"))
        resp = requests.get(url)
        soup_ano = BeautifulSoup(resp.text, "lxml")
        for elems in soup_ano.select(".content:contains(Companies) table tr")[1:]:
            associated_info = [elem.text for elem in elems.select("td")]
            print(associated_info)
The script above does almost everything, but I can't come up with the logic to collect all the data in one place (instead of printing in three separate spots) so that it can be written to a single csv file.
Answer 0 (score: 1)
Put all of the scraped data into a single list, here called associated_info. Then all of the data is in one place, and you can iterate over that list to write it out as CSV...
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

link = "https://suite.endole.co.uk/insight/company/ajax_people.php?ajax_url=ajax_people&page=1&company_number=03512889"
base = "https://suite.endole.co.uk"

res = requests.get(link)
soup = BeautifulSoup(res.text, "lxml")

associated_info = []
for item in soup.select("table tr")[1:]:
    if not item.select_one("td a[href]"):
        associated_info.append([i.text for i in item.select("td")])
    else:
        associated_info.append([i.text for i in item.select("td")])
        url = urljoin(base, item.select_one("td a[href]").get("href"))
        resp = requests.get(url)
        soup_ano = BeautifulSoup(resp.text, "lxml")
        for elems in soup_ano.select(".content:contains(Companies) table tr")[1:]:
            associated_info.append([elem.text for elem in elems.select("td")])

print(associated_info)
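Once everything is in associated_info, writing the CSV is a one-liner with the standard library's csv module. A minimal sketch; the placeholder rows and the filename output.csv are my own, not from the scraped page:

```python
import csv

# associated_info, as built by the scraper above, is a list of lists:
# one inner list of cell texts per table row. Placeholder data here.
associated_info = [
    ["Name A", "Secretary", "Resigned"],
    ["Example Ltd", "00000000", "Active"],
]

# newline="" prevents blank lines between rows on Windows.
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(associated_info)  # one CSV line per scraped row
```

Because every row (whether from the main table or a followed link) was appended to the same list, a single writerows call emits the whole dataset.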