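I need help extracting (scraping) data from tables on a web page. I am using Beautiful Soup, but I cannot extract table no. 6. Any help would be greatly appreciated.
I need all the row data from table 6. The page contains several tables, but I only need the data for the compliance information, and I don't know how to get it.
The URL is the one used in the code below.
My code is as follows: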
import random
import time

import requests
import unicodecsv
from bs4 import BeautifulSoup
from lxml import html

link = ["http://ec.europa.eu/environment/ets/ohaDetails.do?returnURL=&languageCode=en&accountID=&registryCode=&buttonAction=all&action=&account.registryCode=&accountType=&identifierInReg=&accountHolder=&primaryAuthRep=&installationIdentifier=&installationName=&accountStatus=&permitIdentifier=&complianceStatus=&mainActivityType=-1&searchType=oha&resultList.currentPageNumber=1&nextList=Next%C2%A0%3E&selectedPeriods="]
start, end = 0, len(link)  # assumption: the slice bounds were not defined in the post

for pagenum, links in enumerate(link[start:end]):
    print(links)
    r = requests.get(links)
    time.sleep(random.randint(2, 5))  # pause between requests
    soup = BeautifulSoup(r.content, "lxml")
    tree = html.fromstring(str(soup))
    value = []
    # Collect every table with class "bordertb"
    data_block = soup.find_all("table", {"class": "bordertb"})
    print(data_block)
    output = []
    for item in data_block:
        # Take the table nested inside the first "tabletitle" cell
        table_data = item.find_all("td", {"class": "tabletitle"})[0].table
        value.append([table_data])
        print(value)

# Write the collected values to a tab-separated file
with open("Exhibit_2_EXP_data.tsv", "wb") as outfile:
    writer = unicodecsv.writer(outfile, delimiter="\t")
    writer.writerow(["Data_Output"])
    for item in value:
        writer.writerow(item)
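Answer 0 (score: 1)
Try this. The script below should fetch the contents of that table. To target it specifically, start from the preceding table (since it has a unique ID) and then use the appropriate method to reach the table you want. Here is what I did to achieve that: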
import requests
from bs4 import BeautifulSoup

url = "http://ec.europa.eu/environment/ets/ohaDetails.do?returnURL=&languageCode=en&accountID=&registryCode=&buttonAction=all&action=&account.registryCode=&accountType=&identifierInReg=&accountHolder=&primaryAuthRep=&installationIdentifier=&installationName=&accountStatus=&permitIdentifier=&complianceStatus=&mainActivityType=-1&searchType=oha&resultList.currentPageNumber=1&nextList=Next%C2%A0%3E&selectedPeriods="

r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

# Start from the table with the unique id and step to its next sibling element.
# The [:-5] slice drops the last five rows of that table.
for items in soup.find(id="tblInstallationContacts").find_next_sibling().find_all("tr")[:-5]:
    data = [item.get_text(strip=True) for item in items.find_all("td")]
    print(data)
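
If you also want the rows in a TSV file, as in the question, something along these lines might work. This is only a sketch run against a small made-up HTML snippet rather than the live page: `SAMPLE_HTML`, `rows_after_anchor`, `rows_to_tsv` and the `skip_last` parameter are names invented for illustration, and the real page layout may differ. It uses the standard `csv` module instead of `unicodecsv`, and `html.parser` so that `lxml` is not required.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: a table with a unique id,
# followed by an unlabelled table whose last row is a footer.
SAMPLE_HTML = """
<div>
  <table id="tblInstallationContacts"><tr><td>contact</td></tr></table>
  <table>
    <tr><td>Year</td><td>Status</td></tr>
    <tr><td>2005</td><td>A</td></tr>
    <tr><td>2006</td><td>B</td></tr>
    <tr><td>footer</td></tr>
  </table>
</div>
"""


def rows_after_anchor(html_text, anchor_id, skip_last=0):
    """Return the cell text of each row in the table that follows anchor_id."""
    soup = BeautifulSoup(html_text, "html.parser")
    anchor = soup.find(id=anchor_id)
    if anchor is None:
        return []
    # The next sibling <table> of the anchor is the one we want.
    target = anchor.find_next_sibling("table")
    if target is None:
        return []
    trs = target.find_all("tr")
    if skip_last:
        trs = trs[:-skip_last]  # drop trailing rows that are not data
    rows = []
    for tr in trs:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


def rows_to_tsv(rows):
    """Serialise the rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = rows_after_anchor(SAMPLE_HTML, "tblInstallationContacts", skip_last=1)
tsv_text = rows_to_tsv(rows)
print(tsv_text)
```

For the real page you would pass `r.text` in place of `SAMPLE_HTML`, choose `skip_last` to match the number of trailing rows you want to drop, and write `tsv_text` to a file opened in text mode.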