Python: extracting the content of a div by class

Time: 2018-04-10 16:14:35

Tags: python string beautifulsoup

I am trying to extract the text between the strings "<div><p class='entete_propriete'>DNA sequence </p>" and "</div>":

    handle = open(i, 'r')
    name = i.split('=')[1]
    print name
    soup = BeautifulSoup(handle,"lxml")
    for item in soup:
        seq = soup.findAll(seq) 
        print seq

    <section>
    <div><p class='entete_propriete' align='center'>Ends</p>
<br><span class='entete_propriete'>IR Length : </span>44/49<br><br><span class='entete_propriete_bis'>IRL : </span><span class='seq'>GAGGGTCGGCAGGGATTCGTGTAAAACACAGCCAAAAGTGAGCTAACTCC</span><br><span class='entete_propriete_bis'>IRR : </span><span class='seq'>GAGGGTCGACAGGGATTTGTGTAAAAAACAGCCAAAATTGAGCTAAATCT</span><br>   </div>
    <div><p class='entete_propriete' align='center'>Insertion site</p><br>
<table><tr><th>Left flank</th><th>Direct repeat</th><th>Right flank</th><th>DR Length</th></tr><tr> <td class='seq' align='right'>TCCACTACCT</td><td class='seq' align='center'></td><td class='seq' align='left'>TCGTTGAGCA</td><td class='seq' align='center'>0</td></tr></table> </div>
    <div class="piedSection"></div>       
    </section>
        <section>
    <div id=seq_ident><p>IS1007</p><ul><li><span class='entete_propriete'>Family </span>IS6</li><li><span class='entete_propriete'>Group </span></li></ul></div><span class='entete_propriete'> MGE type </span>IS<span class='entete_propriete_decal'>Related element(s) : </span><br><span class='entete_propriete'>Isoform </span><span class='entete_propriete_decal'>Synonym(s) </span>    <div class="piedSection"></div>
        </section>
            <div><p class='entete_propriete'>DNA sequence </p>
    <div class='seq'>GGCACTGTTGCAAATAGGCTGACATGATAAGCTAAATATCTTATTTATTTCGAGATACAGCAGATGAATCCCTTCCACGGTCGGCACTTTCAAGGTGAAA<br />
GAGAAGTTTGGCTAGTAAATAGAGTTTTCGGTCTCTAAGCTTTTTTGAAGGGAAAATCATTGACTCAGAT<br />
CCCTATTTGCAACAGTGCC </div> 

The output looks like this:

IS1007
[]
[]

Once I can get at the text, I can remove the <br /> tags myself.

TATCTTATTTATTTCGAGATACAGCAGATGAATCCCTTCCACGGTCGGCACTTTCAAGGTGAAA<br />
GAGAAGTTTGGCTAGTAAATAGAGTTTTCGGTCTCTAAGCTTTTTTGAAGGGAAAATCACTCAG<br />
ATCCCTATTTGCAACAGTGCC </div> 

Any suggestions for extracting the sequence from the following?

    <div><p class='entete_propriete'>DNA sequence </p>
    <div class='seq'>

    </div>

1 answer:

Answer 0 (score: 0)

You can use the get_text() method or the text attribute to get the text out of a tag:

    # find_all() returns a list of tags, so call get_text() on each one
    for seq in soup.find_all('div', class_='seq'):
        print(seq.get_text())

You can also use seq.text.
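
The page has several elements with class 'seq', so it may help to anchor on the "DNA sequence" heading first and then take the div that follows it. Below is a minimal sketch of that idea, not a definitive solution. The helper name `extract_dna_sequence` is hypothetical, the HTML is a trimmed copy of the question's markup with the sequences shortened, and it uses the built-in "html.parser" backend rather than the "lxml" one from the question so that it runs without extra dependencies beyond bs4.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (sequences shortened for brevity).
html = """
<section>
<div><p class='entete_propriete' align='center'>Ends</p>
<span class='seq'>GAGGGTCGGC</span></div>
</section>
<div><p class='entete_propriete'>DNA sequence </p>
<div class='seq'>GGCACTGTTG<br />
GAGAAGTTTG<br />
CCCTATTTGC </div>
</div>
"""


def extract_dna_sequence(markup):
    """Return the sequence under the 'DNA sequence' heading, or None."""
    # "html.parser" ships with Python; "lxml" (as in the question) also works.
    soup = BeautifulSoup(markup, "html.parser")
    for heading in soup.find_all("p", class_="entete_propriete"):
        if heading.get_text(strip=True) == "DNA sequence":
            # The sequence lives in the next <div class='seq'> after the heading.
            seq_div = heading.find_next("div", class_="seq")
            if seq_div is None:
                return None
            # get_text() drops the <br /> tags; split() + join() removes
            # the newlines and spaces left between the lines.
            return "".join(seq_div.get_text().split())
    return None


print(extract_dna_sequence(html))
```

With the real file you would pass the open file handle (or its contents) instead of the `html` string. Joining on whitespace assumes the sequence itself never contains meaningful spaces, which seems reasonable for DNA but is an assumption.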