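I am trying to write a Python script that scrapes specific information from a web page, but my limited knowledge does not seem to be enough. I need to extract 7-8 pieces of information. The tags are as follows: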
1
boolean _Is_Comming_From_Notification = intent.getBooleanExtra("is_Comming_Form_Notification", false);
2
@Override
public void onBackPressed() {
if (_Is_Comming_From_Notification ) {
Intent intent = new Intent(this, App_Home_Page.class);
startActivity(intent);
}
super.onBackPressed();
}
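If I knew how to extract information from an href attribute like this one, I could do the rest myself.
I would also be very grateful if someone could help me write the code that adds this information to a CSV file.
I have started with this code: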
<a class="ui-magnifier-glass" href="here goes the link that i want to extract" data-spm-anchor-id="0.0.0.0" style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"></a>
Answer 0 (score: 1)
You can do what you want with the lxml and csv modules. lxml supports XPath expressions for selecting the elements you need.
from lxml import etree
from io import StringIO  # Python 3; on Python 2 use "from StringIO import StringIO"
from csv import DictWriter

f = StringIO('''
<html><body>
<a class="ui-magnifier-glass"
   href="here goes the link that i want to extract"
   data-spm-anchor-id="0.0.0.0"
   style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"
></a>
<a href="link to extract"
   title="title to extract"
   rel="category tag"
   data-spm-anchor-id="0.0.0.0"
>or maybe this word instead of title</a>
</body></html>
''')
doc = etree.parse(f)
data = []
# Get all links with data-spm-anchor-id="0.0.0.0"
r = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]')
# Iterate through each matching <a> element
for elem in r:
    # You can access the attributes with get()
    link = elem.get('href')
    title = elem.get('title')
    # and the text inside the tag is accessible with .text
    text = elem.text
    data.append({
        'link': link,
        'title': title,
        'text': text
    })
# newline='' keeps the csv module from writing blank rows on Windows (Python 3)
with open('file.csv', 'w', newline='') as csvfile:
    fieldnames = ['link', 'title', 'text']
    writer = DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in data:
        writer.writerow(row)
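If lxml is not installed, or the page is not well-formed enough for etree.parse, roughly the same extraction can be sketched with the standard library alone. Note that this swaps in a different technique (an event-based html.parser subclass instead of XPath), so it is not the method shown above. The names LinkCollector and links_to_csv are hypothetical, the sample markup is a shortened stand-in for the real page, and nested <a> tags are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href, title and inner text of <a> tags that carry a given attribute value."""

    def __init__(self, attr_name, attr_value):
        super().__init__()
        self.attr_name = attr_name
        self.attr_value = attr_value
        self.rows = []
        self._current = None  # row being filled while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get(self.attr_name) == self.attr_value:
            self._current = {
                "link": attrs.get("href"),
                "title": attrs.get("title"),
                "text": "",
            }

    def handle_data(self, data):
        # Accumulate text only while inside a matching <a>
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.rows.append(self._current)
            self._current = None


def links_to_csv(html_text, out):
    """Parse html_text and write one CSV row per matching link to the file-like object out."""
    parser = LinkCollector("data-spm-anchor-id", "0.0.0.0")
    parser.feed(html_text)
    parser.close()
    writer = csv.DictWriter(out, fieldnames=["link", "title", "text"])
    writer.writeheader()
    writer.writerows(parser.rows)
    return parser.rows


# Made-up sample markup, assumed to resemble the page in the question
SAMPLE = '''
<a class="ui-magnifier-glass" href="first-link" data-spm-anchor-id="0.0.0.0"></a>
<a href="second-link" title="a title" data-spm-anchor-id="0.0.0.0">some text</a>
<a href="ignored" data-spm-anchor-id="1.1.1.1">skipped</a>
'''

buffer = io.StringIO()
rows = links_to_csv(SAMPLE, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with open('file.csv', 'w', newline='') as the second argument.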
Answer 1 (score: 0)
Here is how to scrape something by ID using lxml and curl:
curl some.html | python extract.py
extract.py:
from lxml import etree
import sys
# grab all elements with id == 'postingbody'
pb = etree.HTML(sys.stdin.read()).xpath("//*[@id='postingbody']")
print(pb)
some.html:
<html>
<body>
<div id="nope">nope</div>
<div id="postingbody">yep</div>
</body>
</html>
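Note that print(pb) shows a list of element objects rather than their content. Below is a minimal sketch of reading the tag, attributes and text out of the matches. To stay dependency-free it uses the standard library's xml.etree.ElementTree, which is my substitution and only works here because some.html happens to be well-formed; lxml elements offer the same .get() and .text accessors, as the first answer shows.

```python
import xml.etree.ElementTree as ET

# Same content as some.html, inlined so the sketch runs on its own
SOME_HTML = """<html>
<body>
<div id="nope">nope</div>
<div id="postingbody">yep</div>
</body>
</html>"""

root = ET.fromstring(SOME_HTML)

# ElementTree supports a limited XPath subset; this matches any
# descendant element whose id attribute equals 'postingbody'
matches = root.findall(".//*[@id='postingbody']")

for elem in matches:
    # .tag is the element name, .get() reads an attribute, .text is the inner text
    print(elem.tag, elem.get("id"), elem.text)
```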