Question

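I'm trying to scrape the text of an axis of an online plot, along with some features associated with it such as the color of the text. I rarely use scraping, so I would really appreciate some help. For anyone who uses scrapers often, this is probably an easy one. Here is my code: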

from bs4 import BeautifulSoup
import requests

def get_IPF_transcriptome_groups():
    url = "https://research.cchmc.org/pbge/lunggens/lungDisease/celltype_IPF.html?cid=1"
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data)

    for d in soup.find('div', attrs={'id':'wrapper'}).find(
            'div', attrs={'class':'content'}).find(
            'div', attrs={'id':'ResPanel'}).find(
            'table', attrs={'id':'maintable'}).find(
            'tbody'):
        print(d)

I get the error:

    'tbody'):

    TypeError: 'NoneType' object is not iterable

I think the code can't get past the table body. The actual text I want to parse is nested deeper, inside several other tags including 'div', 'td', 'tr', 'g' and so on, and looks like this:

<tspan style="fill:#006600;font-size:7px;">CC002_33_N709_S503_C10</tspan>

where 'CC002_33_N709_S503_C10' is an example reference and '006600' refers to a color. There are (I think) 540 such rows. If anyone could help, that would be really great. Many thanks.

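Edit in response to Uday's answer:

Thanks for your suggestion. I've built 'findAll' into the code and used indexing to retrieve the next part. The suggestion here mentions removing the 'tbody' tag, since it may not be part of the source. Simply adding 'tspan' doesn't seem to return what I need. Here is my updated code: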

    for d in soup.find('div', attrs={'id':'wrapper'}).find(
            'div', attrs={'class':'content'}).find(
            'div', attrs={'id':'ResPanel'}).find(
            'table', attrs={'id':'maintable'}).findAll(
            'tr')[2].findAll('td')[0].find(
            'div', attrs={'id':'sigheatmapcontainer'}):
        print(d)

Any further suggestions would be very useful.

Answer 1

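The data you want is fetched by JavaScript from another URL, using a POST request. (It is not in the HTML source of this page, and it won't be rendered even with Dryscrape.) It is returned as JSON, which is nice and easy to parse. The following code fetches all of the data. How to interpret the data is another matter, but perhaps you know that better than I do.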

from bs4 import BeautifulSoup
import requests
import json

# Fetch the data from the URL that the page's JavaScript posts to.
url = "https://research.cchmc.org/pbge/lunggens/celltypeIPF"
r = requests.post(url, data={'id': '1'})
data = r.text
soup = BeautifulSoup(data, "lxml")
d = soup.find('p')
# The <p> element now holds the JSON text containing all the data.
jn = json.loads(d.text)
print(json.dumps(jn, indent=2))

This prints the raw data, pretty-printed.
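As a possible simplification, the BeautifulSoup step may not be needed if the endpoint returns the JSON as the bare response body. That is an assumption: the code above only shows that the text ends up inside a `<p>` once lxml has parsed it. The sketch below uses a hypothetical helper, `extract_json`, that first tries to parse the text directly and otherwise falls back to the outermost `{...}` or `[...]` span. It relies only on the standard library.

```python
import html
import json


def extract_json(text):
    """Parse JSON from a response body that may be wrapped in HTML tags.

    Hypothetical helper for illustration; not part of requests or bs4.
    """
    try:
        # Simplest case: the body is already plain JSON.
        return json.loads(text)
    except ValueError:
        pass

    # Fallback: take the span from the first opening brace/bracket to the
    # last closing one, and undo HTML escaping such as &quot;.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if not starts or end == -1:
        raise ValueError("no JSON object or array found in response text")
    return json.loads(html.unescape(text[min(starts):end + 1]))


# Usage with the request from the answer (needs network access):
# r = requests.post(url, data={'id': '1'})
# jn = extract_json(r.text)
```

If the body really is plain JSON, `r.json()` from requests would do the same job in one call.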

You can parse the JSON any way you like, for example with pandas if you prefer:

import pandas as pd
...
# json_normalize is a top-level pandas function since pandas 1.0
# and already returns a DataFrame.
df = pd.json_normalize(jn)
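If you do end up with the rendered SVG markup (for example, copied from the browser's element inspector after the JavaScript has run, which is an assumption about your workflow), the label and color described in the question could be pulled out of each `tspan` roughly as follows. This is only a sketch: it uses the standard library's `html.parser` rather than BeautifulSoup so that it has no dependencies, and `TspanCollector` is a hypothetical name.

```python
from html.parser import HTMLParser


class TspanCollector(HTMLParser):
    """Collect (label, fill color) pairs from <tspan> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []        # collected (label, color) tuples
        self._fill = None     # fill color of the tspan being read
        self._inside = False  # True while between <tspan> and </tspan>
        self._text = []       # text fragments of the current tspan

    def handle_starttag(self, tag, attrs):
        if tag != "tspan":
            return
        self._inside = True
        self._text = []
        self._fill = None
        style = dict(attrs).get("style") or ""
        # A style attribute looks like "fill:#006600;font-size:7px;".
        for declaration in style.split(";"):
            name, _, value = declaration.partition(":")
            if name.strip() == "fill":
                self._fill = value.strip()

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "tspan" and self._inside:
            self.rows.append(("".join(self._text).strip(), self._fill))
            self._inside = False


svg = '<tspan style="fill:#006600;font-size:7px;">CC002_33_N709_S503_C10</tspan>'
collector = TspanCollector()
collector.feed(svg)
collector.close()
print(collector.rows)  # [('CC002_33_N709_S503_C10', '#006600')]
```

A `tspan` without a `fill` declaration is recorded with a color of `None`, so missing colors stay visible instead of being silently dropped.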

Web scrape nested text features
