Question

I am trying to scrape some data from a website using BeautifulSoup in Python, but it does not return the value it should. Below is my code.

import requests
from bs4 import BeautifulSoup

url = 'https://finance.naver.com/item/sise.nhn?code=005930'

# send an HTTP request to the URL of the webpage I want to access
r = requests.get(url)

data = r.text

# making the soup
soup = BeautifulSoup(data, 'html.parser')

print(soup.find('iframe', attrs={'title': '일별 시세'}))

It returns:

<iframe bottommargin="0" frameborder="0" height="360" marginheight="0" name="day" scrolling="no" src="/item/sise_day.nhn?code=005930" title="일별 시세" topmargin="0" width="100%"></iframe>

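The printed result contains no HTML tags inside the iframe. However, the browser's developer tools clearly show many tags nested inside the 'iframe' tag.

So my question is: why doesn't my code return all the tags inside the 'iframe' tag that I can see in the developer tools?

I tried looking for information, but nothing gave me a clear answer. Is it because the content is loaded by JavaScript? If so, how can I check whether the content I want to scrape is loaded by JavaScript?

Lastly, if I want to scrape data that is loaded by JavaScript, which module/library should I use?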

Answer 1

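The table is inside the iframe. An iframe embeds a separate document that the browser fetches from the iframe's src, so its contents are not part of the HTML returned for the outer page. You need to send a request to that iframe's URL. You can then use pandas read_html() to get the table.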


import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://finance.naver.com/item/sise.nhn?code=005930'

# send an HTTP request to the URL of the webpage I want to access
r = requests.get(url)

data = r.text

# making the soup
soup = BeautifulSoup(data, 'html.parser')

# the iframe's src is a relative path, so prepend the site root
iframe = soup.find('iframe', attrs={'title': '일별 시세'})
newurl = 'https://finance.naver.com' + iframe['src']

# read_html returns a list of DataFrames, one per <table> found at that URL
dfs = pd.read_html(newurl)
df = dfs[0]

# drop rows that contain any missing value
df = df.dropna(how='any', axis=0)
print(df)
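
To illustrate why the outer page yields an empty iframe, here is a minimal offline sketch, not part of the original answer. It assumes an abridged copy of the iframe tag printed in the question and uses only the standard library (html.parser and urllib.parse) instead of BeautifulSoup, so it runs without network access. The names IframeCollector and iframe_urls are hypothetical helpers introduced for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(html, page_url):
    """Return absolute URLs for all iframes found in the raw HTML."""
    collector = IframeCollector()
    collector.feed(html)
    # urljoin resolves a relative src against the page that embeds it
    return [urljoin(page_url, src) for src in collector.sources]


# Abridged copy of the iframe tag from the question; it has no child tags,
# because the table lives in the separate document named by src.
sample_html = (
    '<iframe name="day" src="/item/sise_day.nhn?code=005930" '
    'title="일별 시세" width="100%"></iframe>'
)
page_url = "https://finance.naver.com/item/sise.nhn?code=005930"

print(iframe_urls(sample_html, page_url))
```

Each URL returned this way is a separate page that has to be requested on its own. If the data you see in the browser appears neither in the outer page's raw HTML nor in any iframe document, it is probably being inserted by JavaScript after the page loads.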
