Extracting information from an HTML web page with Python lxml

Date: 2015-07-16 20:06:43

Tags: python html beautifulsoup lxml python-requests

I'm trying to write a Python script that scrapes specific information from a web page, but my limited knowledge doesn't seem to be enough. I need to extract 7-8 pieces of information; the tag I'm working with is shown below.


If I knew how to extract information from an href tag like this, I would be able to do the rest myself.

I would be very grateful if someone could help me write code that adds this information to a CSV file.

This is the tag I've started with:

<a class="ui-magnifier-glass" href="here goes the link that i want to extract" data-spm-anchor-id="0.0.0.0" style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"></a>

2 Answers:

Answer 0: (score: 1)

You can do what you want with the lxml and csv modules. lxml supports XPath expressions for selecting the elements you need.

from lxml import etree
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
from csv import DictWriter

f= StringIO('''
    <html><body>
    <a class="ui-magnifier-glass" 
       href="here goes the link that i want to extract" 
       data-spm-anchor-id="0.0.0.0" 
       style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"
    ></a>
    <a href="link to extract"
       title="title to extract" 
       rel="category tag" 
       data-spm-anchor-id="0.0.0.0"
    >or maybe this word instead of title</a>
    </body></html>
''')
doc = etree.parse(f)

data = []
# Get all links with data-spm-anchor-id="0.0.0.0"
r = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]')

# Iterate through each matched <a> element
for elem in r:
    # Attributes are read with get()
    link = elem.get('href')
    title = elem.get('title')
    # The text inside the tag is accessible via .text
    text = elem.text

    data.append({
        'link': link,
        'title': title,
        'text': text
    })

with open('file.csv', 'w', newline='') as csvfile:  # newline='' avoids blank rows on Windows (Python 3)
    fieldnames=['link', 'title', 'text']
    writer = DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for row in data:
        writer.writerow(row)
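As a side note (not part of the original answer), lxml's XPath can also return attribute values directly, which shortens the extraction when you only need the href. A minimal sketch, using a hypothetical snippet standing in for the page in the question:

```python
from lxml import html

# Hypothetical snippet standing in for the page in the question
snippet = '<a data-spm-anchor-id="0.0.0.0" href="http://example.com/item">text</a>'
doc = html.fromstring(snippet)

# Ending the expression with /@href returns the attribute strings themselves
hrefs = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]/@href')
```

Here `hrefs` is a plain list of strings, so it can be written to the CSV without calling `get()` on each element.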

Answer 1: (score: 0)

Here's how you can grab an element by id using lxml, piping the page in from the shell:

cat some.html | python extract.py  # or: curl <url> | python extract.py for a live page

extract.py:

from lxml import etree
import sys
# grab all elements with id == 'postingbody'
pb = etree.HTML(sys.stdin.read()).xpath("//*[@id='postingbody']")
print(pb)

some.html:

<html>
    <body>
        <div id="nope">nope</div>
        <div id="postingbody">yep</div>
    </body>
</html>
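One note on the snippet above: `xpath` returns a list of Element objects, so `print(pb)` shows the elements rather than their contents. A small sketch of pulling out the text, with `some.html` inlined to keep the example self-contained:

```python
from lxml import etree

# Inline copy of some.html so the example runs without a file
page = "<html><body><div id='nope'>nope</div><div id='postingbody'>yep</div></body></html>"
pb = etree.HTML(page).xpath("//*[@id='postingbody']")

# pb is a list of Element objects; .text gives the text content
body_text = pb[0].text
```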
