How to use Scrapy to scrape data between a start point and an end point on a web page

Time: 2016-09-21 00:18:33

Tags: scrapy

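I am working on a hobby project in which I use a list of drug names to look up the side effects of cancer treatment drugs. I want to store the side effects that occur in more than 30% of cases. I am a complete newcomer to Scrapy (I have been using it for about 3 days). On this page, http://www.chemocare.com/chemotherapy/drug-info/camptosar.aspx, I am trying to get the data that appears between these two sentences:

"The following side effects are common (occurring in greater than 30%) for patients taking Camptosar" and "These side effects are less common side effects (occurring in about 10-29%) of patients receiving Camptosar".

The data I am looking for does not seem to sit in a uniquely identified HTML element; it is just several plain lists, so I cannot simply say "look at this section and take the data from it". Googling has not turned up anything I found useful.

Is there a way to do this in Scrapy? Every drug whose side effects I want lists them between these two strings on its page. I am trying to scrape more pages as well, for example http://www.chemocare.com/chemotherapy/drug-info/avastin.aspx.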

1 answer:

Answer 0 (score: 0)

You can use the Kayessian method for node-set intersection to select all the elements between the two paragraphs:

from lxml import html
import requests

# expressions to find the two p tags we want.
ns1 = "//p[contains(. , '(occurring in greater than 30%)')]"
ns2 = "//p[contains(., '(occurring in about 10-29%)')]"

tree = html.fromstring(requests.get("http://www.chemocare.com/chemotherapy/drug-info/camptosar.aspx").content)

x = """ {ns1}/following-sibling::*
        [count(. | {ns2}/preceding-sibling::*)
        =
         count({ns2}/preceding-sibling::*)
        ]
        """.format(ns1=ns1, ns2=ns2)

print(tree.xpath(x))

Which will output:

[<Element ul at 0x7f0c39edc998>, <Element blockquote at 0x7f0c39edc9f0>, <Element ul at 0x7f0c39edca48>, <Element strong at 0x7f0c39edcaa0>, <Element blockquote at 0x7f0c39edcaf8>, <Element ul at 0x7f0c39edcb50>]
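
To see what the intersection does without fetching the live page, the same expression can be tried on a small inline document. This is only a sketch: the markup in SAMPLE is invented to mimic the layout described in the question, and the variable names are illustrative. It assumes lxml is installed.

```python
from lxml import html

# Invented sample markup that mimics the layout described in the question.
SAMPLE = """
<div>
  <p>Intro text</p>
  <p>Common side effects (occurring in greater than 30%):</p>
  <ul><li>Diarrhea</li><li>Nausea and vomiting</li></ul>
  <blockquote>A note between the lists</blockquote>
  <ul><li>Fever</li></ul>
  <p>Less common side effects (occurring in about 10-29%):</p>
  <ul><li>Constipation</li></ul>
</div>
"""

ns1 = "//p[contains(., '(occurring in greater than 30%)')]"
ns2 = "//p[contains(., '(occurring in about 10-29%)')]"

# Set A: every sibling after the first marker paragraph.
after_start = "{ns1}/following-sibling::*".format(ns1=ns1)
# Set B: every sibling before the second marker paragraph.
before_end = "{ns2}/preceding-sibling::*".format(ns2=ns2)
# A node of A is kept only if adding it to B does not grow B,
# which means it is already a member of B.
between = "{a}[count(. | {b}) = count({b})]".format(a=after_start, b=before_end)

tree = html.fromstring(SAMPLE)
tags_after = [el.tag for el in tree.xpath(after_start)]
tags_between = [el.tag for el in tree.xpath(between)]
items = [li.text for el in tree.xpath(between) for li in el.findall("li")]

print(tags_after)
print(tags_between)
print(items)
```

In this sample the second marker paragraph and the "Constipation" list appear in the first set but drop out of the intersection, which is the same filtering the expression performs on the real page.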

If you only want the ul elements, replace each * with ul:

x = """ {ns1}/following-sibling::ul
        [count(. | {ns2}/preceding-sibling::ul)
        =
         count({ns2}/preceding-sibling::ul)
        ]
        """.format(ns1=ns1, ns2=ns2)

Which then gives you:

[<Element ul at 0x7fe3e8ac0aa0>, <Element ul at 0x7fe3e8ac0af8>, <Element ul at 0x7fe3e8ac0b50>]
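
The variants of the expression differ only in the step that follows each sibling axis, so one way to avoid repeating the template is a small builder function. The name between_xpath and its parameters are hypothetical and not part of lxml or Scrapy; the function only assembles a string.

```python
def between_xpath(start, end, step="*"):
    """Build an XPath for `step` nodes that follow `start` and precede `end`."""
    following = "{0}/following-sibling::{1}".format(start, step)
    preceding = "{0}/preceding-sibling::{1}".format(end, step)
    # Kayessian intersection: keep a node from `following` only when the
    # union with `preceding` has the same size as `preceding` itself.
    return "{f}[count(. | {p}) = count({p})]".format(f=following, p=preceding)


ns1 = "//p[contains(., '(occurring in greater than 30%)')]"
ns2 = "//p[contains(., '(occurring in about 10-29%)')]"

all_between = between_xpath(ns1, ns2)        # any element
uls_between = between_xpath(ns1, ns2, "ul")  # only the lists
print(uls_between)
```

The resulting strings can be passed to tree.xpath() in the same way as the hand-written versions.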

To get the links and the side effects, we can use ul/li/a:

In [12]: x = """ {ns1}/following-sibling::ul/li/a
   ....:         [count(. | {ns2}/preceding-sibling::ul/li/a)
   ....:         =
   ....:          count({ns2}/preceding-sibling::ul/li/a)
   ....:         ]
   ....:         """.format(ns1=ns1, ns2=ns2)

In [13]: for a in tree.xpath(x):
   ....:         print(a.text, a.xpath("@href"))
   ....:     
('Diarrhea', ['../side-effects/diarrhea-and-chemotherapy.aspx'])
('Nausea and vomiting', ['../side-effects/nausea-vomiting-chemotherapy.aspx'])
('Weakness', ['../side-effects/weakness.aspx'])
('Low white blood\r\n            cell count', ['../side-effects/low-blood-counts.aspx#LowWhite'])
('Low red blood cell\r\n            count', ['../side-effects/low-blood-counts.aspx#LowRed'])
('Hair loss', ['../side-effects/hair-loss-and-chemotherapy.aspx'])
('Poor appetite', ['../side-effects/cancer-and-chemobased-lack-of.aspx'])
('Fever', ['../side-effects/fever-neutropenic-fever-and-their-relationship-to-chemotherapy.aspx'])
('Weight loss', ['../side-effects/weight-changes.aspx'])

As you can see, the result matches what is listed on the page exactly.
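
Some of the link texts in that output still contain raw line breaks, and the hrefs are relative to the page. A minimal sketch of tidying both with the standard library, assuming Python 3; clean_link and PAGE_URL are illustrative names, not something from the answer:

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.chemocare.com/chemotherapy/drug-info/camptosar.aspx"


def clean_link(text, href, base=PAGE_URL):
    """Collapse runs of whitespace in the link text and make the href absolute."""
    # `text` can be None when the <a> element has no direct text.
    name = " ".join((text or "").split())
    return name, urljoin(base, href)


name, url = clean_link(
    "Low white blood\r\n            cell count",
    "../side-effects/low-blood-counts.aspx#LowWhite",
)
print(name)
print(url)
```

With lxml, a.get("href") returns the attribute as a plain string, which can be passed directly as the href argument instead of the one-element list that a.xpath("@href") produces.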

An example of a working spider:

import scrapy


class DrugsSpider(scrapy.Spider):
    name = "drug_spider"
    start_urls = ["http://www.chemocare.com/chemotherapy/drug-info/camptosar.aspx",
                  "http://www.chemocare.com/chemotherapy/drug-info/Eloxatin.aspx"]

    ns1 = "//p[contains(. , '(occurring in greater than 30%)')]"
    ns2 = "//p[contains(., '10-29%)')]"

    x = """ {ns1}/following-sibling::ul/li
            [count(. | {ns2}/preceding-sibling::ul/li)
            =
             count({ns2}/preceding-sibling::ul/li)
            ]
            """.format(ns1=ns1, ns2=ns2)

    def parse(self, response):
        for li in response.xpath(self.x):
            print(li.xpath("normalize-space(.)").extract())


from scrapy.crawler import CrawlerProcess

p = CrawlerProcess()
p.crawl(DrugsSpider)
p.start()

Output:

2016-09-22 22:43:27 [scrapy] DEBUG: Crawled (200) <GET http://www.chemocare.com/chemotherapy/drug-info/Eloxatin.aspx> (referer: None)
[u'Numbness and tingling (peripheral neuropathy) and cramping of the hands or feet often triggered by cold.\xa0\xa0 This symptom will generally lessen or go away between treatments, however as the number of treatments increase the numbness and tingling will take longer to lessen or go away.\xa0 Your health care professional will monitor this symptom with you and adjust your dose accordingly.']
[u'Nausea and vomiting']
[u'Diarrhea']
[u'Mouth sores\xa0']
[u'Low blood counts.\xa0 Your white and red blood cells and platelets may temporarily decrease.\xa0 This can put you at increased risk for infection, anemia and/or bleeding.']
[u'Fatigue']
[u'Loss of appetite']
2016-09-22 22:43:28 [scrapy] DEBUG: Crawled (200) <GET http://www.chemocare.com/chemotherapy/drug-info/camptosar.aspx> (referer: None)
[u'Diarrhea - two types:']
[u'Nausea and vomiting']
[u'Weakness']
[u'Low white blood cell count (this can put you at increased risk for infection)']
[u'Low red blood cell count (anemia)']
[u'Hair loss']
[u'Poor appetite']
[u'Fever']
[u'Weight loss']
2016-09-22 22:43:28 [scrapy] INFO: Closing spider (finished)
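
The spider prints its results, while the question asks to store them per drug. One possible way to shape each page's output into a record is sketched below, assuming Python 3; make_record is a hypothetical helper, and taking the drug name from the last part of the URL is an assumption based on the URLs shown above. It also removes the non-breaking spaces (\xa0) visible in the output, which XPath's normalize-space() does not treat as whitespace.

```python
import json
import posixpath
from urllib.parse import urlparse


def make_record(url, raw_items):
    """Turn a page URL and its extracted strings into one record per drug."""
    # Assumption: the drug name is the file name of the page, e.g. "Eloxatin".
    path = urlparse(url).path
    drug = posixpath.splitext(posixpath.basename(path))[0]
    # str.split() also splits on non-breaking spaces, so this trims "\xa0".
    effects = [" ".join(item.split()) for item in raw_items]
    return {"drug": drug, "side_effects": [e for e in effects if e]}


record = make_record(
    "http://www.chemocare.com/chemotherapy/drug-info/Eloxatin.aspx",
    ["Nausea and vomiting", "Mouth sores\xa0", "Fatigue"],
)
line = json.dumps(record, sort_keys=True)
print(line)
```

Inside parse, the record could be yielded instead of printed, for example yield make_record(response.url, response.xpath(self.x).xpath("normalize-space(.)").extract()), so that Scrapy's feed exports can write it to a file.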