Question

I am trying to scrape data from the price table on the following page: https://www.letsrecycle.com/prices/textiles/textile-prices-2012/

Neither read_html nor BeautifulSoup finds the table, which is strange because both find the tables on other, similar pages (for example https://www.letsrecycle.com/prices/metals/steel-cans/steel-can-prices-2018/).

I have tried different parsers, but that did not help. The relevant part of my code is:

import pandas as pd
import html5lib
import requests
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = 'http://www.letsrecycle.com/prices/textiles/textiles-prices-2012'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
dfs = pd.read_html(webpage)

I also tried various BeautifulSoup parsers, for example:

soup = BeautifulSoup(webpage, "html5lib")
table = soup.findAll("table")
table

Many thanks.

Answer 1

You need to send a User-Agent header with the request:

import requests
import pandas as pd

headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get('https://www.letsrecycle.com/prices/textiles/textile-prices-2012/', headers=headers)
dfs = pd.read_html(r.text)
print(dfs)

Output:

[                  0          1          2  ...          4          5          6
0              2012    January   February  ...      April        May       June
1     Textile banks  270 - 340  260 - 350  ...  260 - 360  260 - 360  260 - 350
2  Shop collections  490 - 550  500 - 560  ...  500 - 560  500 - 570  510 - 580
3      Charity rags  580 - 650  600 - 670  ...  610 - 700  620 - 700  620 - 720

[4 rows x 7 columns],                   0          1          2  ...          4          5          6
0              2012       July     August  ...    October   November   December
1     Textile banks  250 - 350  250 - 330  ...  260 - 340  260 - 340  250 - 340
2  Shop collections  520 - 590  530 - 590  ...  530 - 580  540 - 590  530 - 580
3      Charity rags  630 - 730  640 - 740  ...  650 - 730  650 - 730  640 - 730

[4 rows x 7 columns]]
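
In the output above, the month names sit in row 0 and the category names in column 0, and the year is split across two tables. A minimal sketch of tidying that up follows. The helper name promote_header is hypothetical, and the two small frames are hand-built from a subset of the values printed above rather than fetched from the site.

```python
import pandas as pd


def promote_header(df: pd.DataFrame) -> pd.DataFrame:
    """Use the first row as column labels and the first column as the index."""
    out = df.copy()
    out.columns = list(out.iloc[0])          # row 0 holds "2012", "January", ...
    out = out.iloc[1:].set_index(out.columns[0])
    out.index.name = None                    # drop the leftover "2012" label
    return out


# Hand-built stand-ins for dfs[0] and dfs[1] (subset of the printed values).
first_half = pd.DataFrame([
    ["2012", "January", "February"],
    ["Textile banks", "270 - 340", "260 - 350"],
    ["Shop collections", "490 - 550", "500 - 560"],
])
second_half = pd.DataFrame([
    ["2012", "July", "August"],
    ["Textile banks", "250 - 350", "250 - 330"],
    ["Shop collections", "520 - 590", "530 - 590"],
])

# Place the two halves side by side, aligned on the category names.
combined = pd.concat(
    [promote_header(first_half), promote_header(second_half)], axis=1
)
print(combined)
```

With the real result, the same call would be applied to dfs[0] and dfs[1], assuming both tables keep the layout shown above.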

Answer 2

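I found the problem: the URLs are inconsistent. For example, the URL for 2006 is: https://www.letsrecycle.com/prices/textiles/textiles-prices-2006/ ("textiles-prices", with an "s")

But for 2012 the URL is: https://www.letsrecycle.com/prices/textiles/textile-prices-2012/ ("textile-prices", without the "s")

That is why my code could not find any tables: it requested the "textiles-prices-2012" spelling, which is not the 2012 page.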
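
Because the slug spelling varies by year, one way to guard against it is to try both spellings and keep the first page that actually contains a table. The following is only a sketch under assumptions: the function names (candidate_urls, fetch_html, first_url_with_table) are hypothetical, only the two spellings seen in this thread are tried, and how the site responds to a wrong slug is not confirmed here, so the check looks for a `<table` tag instead of relying on the status code.

```python
from typing import Callable, Iterable, List, Optional
from urllib.error import HTTPError
from urllib.request import Request, urlopen

BASE = "https://www.letsrecycle.com/prices/textiles/"


def candidate_urls(year: int) -> List[str]:
    """Return both slug spellings seen in this thread for the given year."""
    return [f"{BASE}{slug}-prices-{year}/" for slug in ("textile", "textiles")]


def fetch_html(url: str) -> str:
    """Fetch a page with a browser-like User-Agent; return "" on an HTTP error."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except HTTPError:
        return ""


def first_url_with_table(
    urls: Iterable[str], fetch: Callable[[str], str] = fetch_html
) -> Optional[str]:
    """Return the first URL whose HTML contains a <table> tag, else None."""
    for url in urls:
        if "<table" in fetch(url).lower():
            return url
    return None
```

Calling first_url_with_table(candidate_urls(2012)) performs real requests; the URL it returns can then be passed to requests.get and pd.read_html as in Answer 1. The fetch parameter exists so the lookup can be exercised with a stand-in function instead of the network.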

HTML table scraping in Python: tables not found on some pages
