Basic code

import requests
from bs4 import BeautifulSoup
page = requests.get('http://182.148.109.184/enterprise-info!getCompanyInfo.action?companyid=1000356')

soup = BeautifulSoup(page.text, 'html.parser')
source_content = soup.find(class_='rightSide').find(class_='content register').find(class_='formestyle')

The information I want to collect

This figure was captured from Chrome's Inspect Element view.

Since the Chinese text may be hard to follow here, I have made up an example to illustrate the structure.

<th> the variable name </th> => For example, "company name", "company location"
<td> the target data I want to save </td>

My question

With my basic code, source_content does not contain any of this information. The output looks like this:

Comparing figures 1 and 2, we can see that the longitude and latitude information has disappeared.

How can I get this data using Python? Any suggestions would be appreciated.

Answer 1

If you supply a Referer header in the request, the server returns the missing information:

import requests
from bs4 import BeautifulSoup

url = 'http://182.148.109.184/enterprise-info!getCompanyInfo.action?companyid=1000356'
page = requests.get(url, headers={'Referer' : url})
soup = BeautifulSoup(page.text, 'html.parser')

table = soup.find(class_='formestyle')

for tr in table.find_all('tr'):
    row = [v.text for v in tr.find_all(['th', 'td'])]
    print(row)

This prints rows like the following:

['地理坐标：', '经度：104.2153 \xa0\xa0纬度：31.3631']

As you can see, the longitude and latitude are now present.
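
If you need the coordinates as numbers rather than as one text cell, a small post-processing step could look like the sketch below. It is only an illustration: the helper names parse_coordinates and rows_to_dict are hypothetical, and the regular expression assumes the cell always follows the "经度：… 纬度：…" pattern seen in the sample row above. It works on the printed row lists, so it does not touch the network.

```python
import re

# Sample row copied from the output shown above; other pages may be formatted differently.
SAMPLE_ROW = ['地理坐标：', '经度：104.2153 \xa0\xa0纬度：31.3631']

# Assumed pattern: a label, a full-width or ASCII colon, then a decimal number.
_COORD_RE = re.compile(r'(经度|纬度)\s*[：:]\s*(-?\d+(?:\.\d+)?)')


def parse_coordinates(cell_text):
    """Return (longitude, latitude) as floats, or None if either one is missing."""
    found = {label: float(value) for label, value in _COORD_RE.findall(cell_text)}
    if '经度' in found and '纬度' in found:
        return found['经度'], found['纬度']
    return None


def rows_to_dict(rows):
    """Map each header cell to its value cell, trimming whitespace and the trailing colon."""
    result = {}
    for row in rows:
        if len(row) >= 2:  # skip rows that lack a header/value pair
            key = row[0].strip().rstrip('：:')
            result[key] = row[1].replace('\xa0', ' ').strip()
    return result


if __name__ == '__main__':
    print(parse_coordinates(SAMPLE_ROW[1]))
    print(rows_to_dict([SAMPLE_ROW]))
```

In the loop above you would collect each row into a list instead of printing it, then pass that list to rows_to_dict.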

