Question

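I'm trying to learn web scraping and Python (and programming in general), and I've found the BeautifulSoup library, which seems to offer a lot of possibilities.

I'm trying to work out how best to pull the relevant information from this page: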

http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113

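I can go into more detail on this, but basically it's the company name, its description, contact details, the various company details/statistics, etc.

At this stage I'm looking at how to cleanly isolate this data and scrape it, with a view to putting it all into a CSV or something similar later.

I'm confused about how to use BS to grab the different table data. There are lots of tr and td tags, and I'm not sure how to anchor onto anything unique.

The best I have come up with is the following code as a start: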

from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen("http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113")
soup = BeautifulSoup(html)
soupie = soup.prettify()
print soupie

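and then from there use regex etc. to pull the data out of the cleaned-up text.

But there must be a better way to do this using the BS tree? Or is this site formatted in such a way that BS won't provide much more help?

I'm not looking for a full solution, as that's a big ask and I want to learn, but any code snippets to get me on my way would be much appreciated.

Update

Thanks to @ZeroPiraeus below, I'm starting to understand how to parse the tables. Here is the output from his code: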

=== Personnel ===
bodytext    Ms Gail Morgan CEO
bodytext    Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
bodytext    Lisa Mayoh Sales Manager
bodytext    Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422 Email: bob@aerospacematerials.com.au

=== Company Details ===
bodytext    ACN: 007 350 807 ABN: 71 007 350 807 Australian Owned Annual Turnover: $5M - $10M Number of Employees: 6-10 QA: ISO9001-2008, AS9120B, Export Percentage: 5 % Industry Categories: AerospaceLand (Vehicles, etc)LogisticsMarineProcurement Company Email: lisa@aerospacematerials.com.au Company Website: http://www.aerospacematerials.com.au Office: 2/6 Ovata Drive Tullamarine VIC 3043 Post: PO Box 188 TullamarineVIC 3043 Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
paraheading ACN:
bodytext    007 350 807
paraheading ABN:
bodytext    71 007 350 807
paraheading 
bodytext    Australian Owned
paraheading Annual Turnover:
bodytext    $5M - $10M
paraheading Number of Employees:
bodytext    6-10
paraheading QA:
bodytext    ISO9001-2008, AS9120B,
paraheading Export Percentage:
bodytext    5 %
paraheading Industry Categories:
bodytext    AerospaceLand (Vehicles, etc)LogisticsMarineProcurement
paraheading Company Email:
bodytext    lisa@aerospacematerials.com.au
paraheading Company Website:
bodytext    http://www.aerospacematerials.com.au
paraheading Office:
bodytext    2/6 Ovata Drive Tullamarine VIC 3043
paraheading Post:
bodytext    PO Box 188 TullamarineVIC 3043
paraheading Phone:
bodytext    +61.3. 9464 4455
paraheading Fax:
bodytext    +61.3. 9464 4422

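My next question is: what is the best way to put this data into a CSV suitable for importing into a spreadsheet? For example, having things like 'ABN', 'ACN', 'Company Website', etc. as column headings, and then the corresponding data as the row information.

Thanks for any help.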

Answer 1

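Your code will depend on exactly what you want and how you want to store it, but this snippet should give you an idea of how to get the relevant information out of the page: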

import requests

from bs4 import BeautifulSoup

url = "http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113"
html = requests.get(url).text
soup = BeautifulSoup(html)

for feature_heading in soup.find_all("td", {"class": "Feature-Heading"}):
    print("\n=== %s ===" % feature_heading.text)
    details = feature_heading.find_next_sibling("td")
    for item in details.find_all("td", {"class": ["bodytext", "paraheading"]}):
        print("\t".join([item["class"][0], " ".join(item.text.split())]))

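I find requests a more pleasant library to work with than urllib2, but of course that's up to you.

Edit

In answer to your follow-up question, you could write a CSV file from the scraped data with something like this: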
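To experiment with the anchoring idea without hitting the live site, the same approach can be tried on an inline snippet. This is only a sketch: SAMPLE_HTML is made-up markup loosely modelled on the output shown in the question (the real page's nesting may differ), and clean and parse_sections are illustrative helper names, not part of BeautifulSoup. It pairs each paraheading cell with the bodytext cell that follows it and groups the pairs under their Feature-Heading.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the output shown in the question;
# the real page's nesting may differ.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="Feature-Heading">Company Details</td>
    <td>
      <table>
        <tr><td class="paraheading">ACN:</td><td class="bodytext">007 350 807</td></tr>
        <tr><td class="paraheading">ABN:</td><td class="bodytext">71 007 350 807</td></tr>
        <tr><td class="paraheading">Annual Turnover:</td><td class="bodytext">$5M -
            $10M</td></tr>
      </table>
    </td>
  </tr>
</table>
"""


def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def parse_sections(html):
    """Return {section heading: {label: value}} for each Feature-Heading cell."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for heading in soup.find_all("td", class_="Feature-Heading"):
        # The details live in the <td> that follows the heading cell.
        details = heading.find_next_sibling("td")
        if details is None:
            continue
        fields = {}
        for label in details.find_all("td", class_="paraheading"):
            # Each label is anchored to the bodytext cell right after it.
            value = label.find_next_sibling("td", class_="bodytext")
            if value is not None:
                key = clean(label.get_text()).rstrip(":")
                fields[key] = clean(value.get_text())
        sections[clean(heading.get_text())] = fields
    return sections


sections = parse_sections(SAMPLE_HTML)
print(sections)
```

Keeping the result as a nested dictionary, rather than printing as you go, makes it easier to decide later which fields end up as columns.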

import csv
import requests

from bs4 import BeautifulSoup

columns = ["ACN", "ABN", "Annual Turnover", "QA"]
urls = ["http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113", ] # ... etc.

with open("data.csv", "w") as csv_file:
    writer = csv.DictWriter(csv_file, columns)
    writer.writeheader()
    for url in urls:
        soup = BeautifulSoup(requests.get(url).text)
        row = {}
        for heading in soup.find_all("td", {"class": "paraheading"}):
            key = " ".join(heading.text.split()).rstrip(":")
            if key in columns:
                next_td = heading.find_next_sibling("td", {"class": "bodytext"})
                value = " ".join(next_td.text.split())
                row[key] = value
        writer.writerow(row)
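If it helps to see how csv.DictWriter behaves when pages don't all have the same fields, here is a small offline sketch. The rows list is invented sample data (the second ACN is a placeholder), and rows_to_csv is a hypothetical helper name. It assumes Python 3 and writes to an in-memory buffer so nothing touches the disk.

```python
import csv
import io

COLUMNS = ["ACN", "ABN", "Annual Turnover", "QA"]

# Hypothetical rows, as they might look after scraping two company pages.
rows = [
    {
        "ACN": "007 350 807",
        "ABN": "71 007 350 807",
        "Annual Turnover": "$5M - $10M",
        "QA": "ISO9001-2008, AS9120B,",
        "Fax": "+61.3. 9464 4422",  # extra key that is not a column
    },
    {"ACN": "000 000 000"},  # placeholder value; most fields missing
]


def rows_to_csv(rows, columns):
    """Serialise a list of dicts to CSV text with a fixed column order."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="",             # written for any column a row lacks
        extrasaction="ignore",  # silently drop keys that are not columns
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, COLUMNS)
print(csv_text)
```

When writing to a real file on Python 3, the csv module documentation recommends opening it with newline="" (for example open("data.csv", "w", newline="")), which avoids blank lines between rows on Windows.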

Answer 2

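I've been down this road before. The HTML page I was working with always had the same table format, and it was internal to the company. We made sure the customer knew that if they changed the page, it would most likely break the program. With that stipulation, it was possible to identify everything by its index position in the lists of tr and td elements. It is far from the ideal situation, since they could not or would not provide XML data, but it has been working for almost a year now. If anyone out there knows a better answer, I'd like to know it too. That was the first and only time I used Beautiful Soup; I never needed it again, but it worked very well.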
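For what it's worth, the index-based approach described above might look roughly like the following. This is a sketch under stated assumptions, not the code that was actually used: SAMPLE_HTML, FIELD_POSITIONS, EXPECTED_ROWS and extract_by_position are all hypothetical, and the company name is a placeholder. The row-count check is there so that a layout change fails loudly instead of silently returning the wrong cells.

```python
from bs4 import BeautifulSoup

# Hypothetical fixed-layout table: position, not class names, identifies each field.
SAMPLE_HTML = """
<table>
  <tr><td>Name</td><td>Example Pty Ltd</td></tr>
  <tr><td>Phone</td><td>+61.3. 9464 4455</td></tr>
  <tr><td>Fax</td><td>+61.3. 9464 4422</td></tr>
</table>
"""

# (row index, cell index) of each field; these positions are assumptions
# that only hold while the page layout stays the same.
FIELD_POSITIONS = {"name": (0, 1), "phone": (1, 1), "fax": (2, 1)}
EXPECTED_ROWS = 3


def extract_by_position(html):
    """Pull fields out of a table purely by row/cell index."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr")
    # Guard against layout changes before trusting any index.
    if len(rows) != EXPECTED_ROWS:
        raise ValueError(
            "layout changed: expected %d rows, found %d" % (EXPECTED_ROWS, len(rows))
        )
    record = {}
    for field, (row_idx, cell_idx) in FIELD_POSITIONS.items():
        cells = rows[row_idx].find_all("td")
        record[field] = " ".join(cells[cell_idx].get_text().split())
    return record


record = extract_by_position(SAMPLE_HTML)
print(record)
```

Where the page offers class names or other stable attributes, anchoring on those is usually less brittle than counting cells.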

Scraping a series of tables with BeautifulSoup
