How to get plain text from between multiple HTML tags using Scrapy

Date: 2016-06-14 14:49:39

Tags: html web-scraping scrapy scrapy-spider

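I am trying to use Scrapy to fetch all the text inside multiple tags from a given URL. I am new to Scrapy and do not know how to achieve this; I am learning from examples and from other people's experience on Stack Overflow. Below is the markup I am targeting.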

<div class="TabsMenu fl coloropa2 fontreg"><p>root div<p>
<a class="sub_h" id="mtongue" href="#">Mother tongue</a>
<a class="sub_h" id="caste" href="#">Caste</a>

<a class="sub_h" id="scases" href="#">My name is nand </a> </div>
<div class="BrowseContent fl">
<figure style="display: block;" class="mtongue_h">
<figcaption>
<div class="fullwidth clearfix pl10">Div string for test</div>
<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>
<div>

<select>
  <option value="volvo">Volvo</option>
  <option value="saab">Saab</option>

</select>

</div>
<li><a title="Hindi UP Matrimony" href="/hindi-up-matrimony-matrimonials"> Hindi-UP </a></li>

The expected result would be:

root div
Mother tongue
Caste
My name is nand
Div string for test
Coffee
Tea
Milk
Volvo
Saab
Hindi-UP

I tried to do this with XPath. Here is a snippet of the spider code:

def parse(self, response):
    for sel in response.xpath('//body'):

        lit = sel.xpath('//*[@id="tab_description"]/ul/li[descendant-or-self::text()]').extract()
        print lit
        string1 = ''.join(lit).encode('utf-8').strip('\r\t\n')
        print string1
        para=sel.xpath('//p/text()').extract()
        span=sel.xpath('//span/text()').extract()
        div=sel.xpath('//div/text()').extract()
        strong=sel.xpath('//span/strong/text()').extract()
        link=sel.xpath('//a/text()').extract()
        string2 = ''.join(para).encode('utf-8').strip('\r\t\n')
        string3 = ''.join(span).encode('utf-8').strip('\r\t\n')
        string4 = ''.join(div).encode('utf-8').strip('\r\t\n')
        string5 = ''.join(strong).encode('utf-8').strip('\r\t\n')
        string6 = ''.join(link).encode('utf-8').strip('\r\t\n')
        string=string6+string5+string4+string3+string2
        print string

Here is the item code:

import scrapy


class DmozItem(scrapy.Item):
    # One field per kind of text the spider collects.
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    para = scrapy.Field()
    strong = scrapy.Field()
    span = scrapy.Field()
    div = scrapy.Field()

This is the output:

BROWSE PROFILES BYMother tongueCasteReligionCityOccupationStateNRISpecial Cases Hindi-Delhi  Marathi  Hindi-UP  Punjabi  Telugu  Bengali  Tamil  Gujarati  Malayalam  Kannada  Hindi-MP  Bihari RajasthaniOriyaKonkaniHimachaliHaryanviAssameseKashmiriSikkim/NepaliHindi Brahmin  Sunni  Kayastha  Rajput  Maratha  Khatri  Aggarwal  Arora  Kshatriya  Shwetamber  Yadav  Sindhi  Bania Scheduled CasteNairLingayatJatCatholic - RomanPatelDigamberSikh-JatGuptaCatholicTeliVishwakarmaBrahmin IyerVaishnavJaiswalGujjarSyrianAdi DravidaArya VysyaBalija NaiduBhandariBillavaAnavilGoswamiBrahmin HavyakaKumaoniMadhwaNagarSmarthaVaidikiViswaBuntChambharChaurasiaChettiarDevangaDhangarEzhavasGoudGowda Brahmin IyengarMarwariJatavKammaKapuKhandayatKoliKoshtiKunbiKurubaKushwahaLeva PatidarLohanaMaheshwariMahisyaMaliMauryaMenonMudaliarMudaliar ArcotMogaveeraNadarNaiduNambiarNepaliPadmashaliPatilPillaiPrajapatiReddySadgopeShimpiSomvanshiSonarSutarSwarnkarThevarThiyyaVaishVaishyaVanniyarVarshneyVeerashaivaVellalarVysyaGursikhRamgarhiaSainiMallahShahDhobi-KalarKambojKashmiri PanditRigvediVokkaligaBhavasar KshatriyaAgnikula Audichya Baidya Baishya Bhumihar Bohra Chamar Chasa Chaudhary Chhetri Dhiman Garhwali Gudia Havyaka Kammavar Karana Khandelwal Knanaya Kumbhar Mahajan Mukkulathor Pareek Sourashtra Tanti Thakur Vanjari Vokkaliga Daivadnya Kashyap Kutchi OBC Hindu  Muslim  Christian  Sikh  Jain  Buddhist  Parsi  Jewish  New Delhi  Mumbai  Bangalore  Pune  Hyderabad  Kolkata  Chennai  Lucknow  Ahmedabad  Chandigarh  Nagpur JaipurGurgaonBhopalNoidaIndorePatnaBhubaneshwarGhaziabadKanpurFaridabadLudhianaThaneAlabamaArizonaArkansasCaliforniaColoradoConnecticutDelawareDistrict ColumbiaFloridaIndianaIowaKansasKentuckyMassachusettsMichiganMinnesotaMississippiNew JerseyNew YorkNorth CarolinaNorth DakotaOhioOklahomaOregonPennsylvaniaSouth CarolinaTennesseeTexasVirginiaWashingtonMangalorean  IT Software  Teacher  CA/Accountant  Businessman  Doctors/Nurse  Govt. 
Services  Lawyers  Defence  IAS  Maharashtra  Uttar Pradesh 

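This code returns all the text strings, but they are run together with no spaces between them. Is it possible to get each phrase on its own line, with spaces between the words? Is there an efficient way to do this with Scrapy? Later I want to save the text to a file. Could someone guide me with a code snippet?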

1 answer:

Answer 0 (score: 1)

@paultrmbrth suggested this solution to me, and it worked for me:

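def parse_item(self, response):
    # `text` is assumed to hold the output file path; it is not defined in this snippet.
    with open(text, 'wb') as f:
        # Take every text node under <body>, skipping the contents of <script> and <style>.
        f.write("".join(response.xpath('//body//*[not(self::script or self::style)]/text()').extract()).encode('utf-8'))

    item = DmozItem()
    yield item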
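
Joining the extracted nodes with "" keeps only whatever whitespace the page itself contains, so neighbouring phrases can still run together. One possible way to get one phrase per line, as asked in the question, is to clean each text node before joining. The following is a minimal sketch, not part of Scrapy: the helper name `clean_text_nodes` and the sample list are made up for illustration, and the list stands in for what `.extract()` would return from the XPath above.

```python
import re


def clean_text_nodes(nodes):
    """Return one phrase per line from a list of raw text-node strings."""
    phrases = []
    for node in nodes:
        # Collapse runs of whitespace (including \r, \t, \n) into single spaces.
        phrase = re.sub(r"\s+", " ", node).strip()
        if phrase:  # skip nodes that contained only whitespace
            phrases.append(phrase)
    return "\n".join(phrases)


# Hypothetical stand-in for:
# response.xpath('//body//*[not(self::script or self::style)]/text()').extract()
sample_nodes = ["root div", "\n", "Mother tongue", "\n\n", "My name is nand ", " Hindi-UP "]
print(clean_text_nodes(sample_nodes))
```

In a spider, the returned string could be encoded as UTF-8 and written to a file in the same way as in the snippet above.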