I have the following HTML:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scrapy</title>
</head>
<body>
<table style="border: #ffffff 0px solid" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td align="center">
<div style="margin-top:7px;margin-bottom:7px;font-size:16px;font-weight:bold;font-color:white" width="100%">
Scrapy Rocks
</div>
</td>
</tr>
</table>
<table cellpadding="0" cellspacing="0" width="100%" style="margin-top:25px">
<tr>
<td align="left" valign="top"></td>
<td valign="top">
<font size="-1">
<div style="margin-right:10; margin-top:5; text-align: right">
<a href="/aaa.html" target="_top">AAA</a> |
<a href="/bbb.html" target="_top">BBB</a> |
<a href="/ccc.html" target="_top">CCC</a>
</div>
</font>
</td>
</tr>
<tr>
<td align="left" valign="top">
<div>
<a href="http://example.com" target="_blank">
<img src="/images/a.jpg" border="0" vspace="0" width="100" height="100" valign="middle"/>
</a>
<a href="/index.html">
<img src="/images/aaa.gif" border="0" vspace="0" width="100" height="100" valign="middle"/>
</a>
</div>
</td>
<td valign="top">
<div style="margin-right:10; margin-top:5; text-align: right"></div>
</td>
</tr>
</table>
<hr size=1>
<h2 style="margin-top: 36px; margin-bottom: 24px">
Abcd efgh for 2017
</h2>
Part 1 |
Part 2 |
Part 3 |
Part 4 |
<a href="#">A very bold title</a>
<hr size="1" style="margin-top: 36px; margin-bottom: 24px">
<a name="part1"></a>
<h3>Part 1</h3>
<ul>
</ul>
<a name="part2"></a>
<h3>Part 2</h3>
<ul>
</ul>
<a name="part3"></a>
<h3>Part 3</h3>
<ul>
</ul>
<a name="part4"></a>
<h3>Part 4</h3>
<ul>
</ul>
<div style="margin-top: 36px; margin-bottom: 24px">
<a name="non_rep"></a>
<h3>Abcd efgh</h3>
</div>
<b>January 2017</b>
<ul>
<li>
<b>Part1 1</b>
</li>
<ul>
<li>
<a href="/cgi-bin/o.pl?file=/a/1.htm">Title 1</a>
</li>
<br>
<li>
<a href="/cgi-bin/o.pl?file=/a/11.htm">Title 2</a>
</li>
<br>
</ul>
<li>
<b>Part1 2</b>
</li>
<ul>
<li>
<a href="/cgi-bin/o.pl?file=/a/2.htm">Title A</a>
</li>
<br>
<li>
<a href="/cgi-bin/o.pl?file=/a/22.htm">Title B</a>
</li>
<br>
</ul>
<li>
<b>Part1 3</b>
</li>
<ul>
<li>
<a href="/cgi-bin/o.pl?file=/a/3.htm">Some text 1</a>
</li>
<br>
<li>
<a href="/cgi-bin/o.pl?file=/a/33.htm">Some Text 2</a>
</li>
</ul>
</ul>
<b>February 2017</b>
<ul>
<li>
<b>Part1 1</b>
</li>
<ul>
<li>
<a href="/cgi-bin/o.pl?file=/b/1.htm">Title 1</a>
</li>
<br>
<li>
<a href="/cgi-bin/o.pl?file=/b/11.htm">Title 2</a>
</li>
<br>
</ul>
<li>
<b>Part1 2</b>
</li>
<ul>
<li>
<a href="/cgi-bin/o.pl?file=/b/2.htm">Title A</a>
</li>
<br>
<li>
<a href="/cgi-bin/o.pl?file=/b/22.htm">Title B</a>
</li>
<br>
</ul>
<li>
<b>Part1 3</b>
</li>
<ul>
<li>
<a href="/cgi-bin/o.pl?file=/b/3.htm">Some text 1</a>
</li>
<br>
<li>
<a href="/cgi-bin/o.pl?file=/b/33.htm">Some Text 2</a>
</li>
</ul>
</ul>
<b>March 2017</b>
<ul>
<li>
<b>Part1 1</b>
</li>
<ul>
<li>
<a href="/cgi-bin/o.pl?file=/c/1.htm">Title 1</a>
</li>
<br>
<li>
<a href="/cgi-bin/o.pl?file=/c/11.htm">Title 2</a>
</li>
<br>
</ul>
<li>
<b>Part1 2</b>
</li>
<ul>
<li>
<a href="/cgi-bin/o.pl?file=/c/2.htm">Title A</a>
</li>
<br>
<li>
<a href="/cgi-bin/o.pl?file=/c/22.htm">Title B</a>
</li>
<br>
</ul>
<li>
<b>Part1 3</b>
</li>
<ul>
<li>
<a href="/cgi-bin/o.pl?file=/c/3.htm">Some text 1</a>
</li>
<br>
<li>
<a href="/cgi-bin/o.pl?file=/c/33.htm">Some Text 2</a>
</li>
</ul>
</ul>
</body>
</html>
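What I need is to extract the text between the body tags (using Scrapy XPath), but I don't want the table text at all.
This is what I tried in order to get all the text: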
def parse(self, response):
"""
-*-
"""
item = DummyItem()
title = response.xpath('//title/text()').extract()
body = "\n ".join(
response.xpath(
'//body//*[not(self::script or self::style)]/text()'
).extract()
)
item['title'] = title
item['body'] = body
yield item
With the snippet above, I managed to extract all the text, including the table text that I don't want. I then replaced "body" with:
body = "\n ".join(
response.xpath(
'//body//*[not(self::table or self::script or self::style)]/text()'
).extract()
)
It did not do the job; the table text is still extracted.
Any ideas on how to fix this?
Answer 0 (score: 2)
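You want "all text nodes that are not inside a <table>", or, in XPath terms, "all text nodes that have no <table> ancestor". Your predicate only skips the <table> element itself; its descendants (<td>, <div>, <a>) still match *, so their text is returned.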
/html/body//text()[not(ancestor::table)]
Now you only need to strip whitespace from the resulting items and remove the empty strings from the list.
text_nodes = response.xpath("/html/body//text()[not(ancestor::table)]").extract()
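The cleanup step could look something like the sketch below. It works on a plain list of strings, such as the one `.extract()` returns; the helper name `clean_text_nodes` and the `sample` list are made up for illustration and are not part of Scrapy.

```python
def clean_text_nodes(text_nodes):
    """Strip surrounding whitespace from each string and drop the ones left empty."""
    stripped = (node.strip() for node in text_nodes)
    return [node for node in stripped if node]


# Hypothetical sample resembling what .extract() might return for the page above
sample = ["\n", "\nAbcd efgh for 2017\n", "\nPart 1 |\nPart 2 |\n", "A very bold title", "  "]
cleaned = clean_text_nodes(sample)
print(cleaned)
```

In the spider you could then build the field with something like `item['body'] = "\n".join(clean_text_nodes(text_nodes))`. Note that `strip()` only trims the ends of each text node, so line breaks inside a node are kept.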