I'm new to scrapy and trying to use it to extract "name", "address", "state", and "postal_code" from the following sample HTML:
<div id="superheroes">
<table width="100%" border="0">
<tr>
<td valign="top">
<h2>Superheroes in New York</h2>
<hr/>
</td>
</tr>
<tr valign="top">
<td width="75%">
<h2>Peter Parker</h2>
<hr />
<table width="100%">
<tr valign="top">
<td width="13%" height="70" valign="top"><img src="/img/spidey.jpg"/></td>
<td width="87%" valign="top"><strong>Address:</strong> New York City<br/>
<strong>State:</strong>New York<br/>
<strong>Postal Code:</strong>12345<br/>
<strong>Telephone:</strong> 555-123-4567</td>
</tr>
<tr>
<td height="18" valign="top"> </td>
<td align="right" valign="top"><a href="spiderman"><strong>Read More</strong></a></td>
</tr>
</table>
<h2>Tony Stark</h2>
<hr />
<table width="100%" border="0" cellpadding="2" cellspacing="2" valign="top">
<tr valign="top">
<td width="13%" height="70" valign="top"><img src="/img/ironman.jpg"/></td>
<td width="87%" valign="top"><strong>Address:</strong> New York City<br/>
<strong>State:</strong> New York<br/>
<strong>Postal Code:</strong> 54321<br/>
<strong>Telephone:</strong> 555-987-6543</td>
</tr>
<tr>
<td height="18" valign="top"> </td>
<td align="right" valign="top"><a href="iron_man"><strong>Read More</strong></a></td>
</tr>
</table>
</td>
<td width="25%">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
</td>
</tr>
</table>
</div>
My superheroes.py contains the following code:
from scrapy.spider import CrawlSpider, Rule
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from superheroes.items import Superheroes

items = []

class MySpider(CrawlSpider):
    name = "superheroes"
    allowed_domains = ["www.somedomain.com"]
    start_urls = ["http://www.somedomain.com/ny"]
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

def parse_item(self, response):
    sel = Selector(response)
    tables = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]')
    for table in tables:
        item = Superheroes()
        item['name'] = table.xpath('h2/text()').extract()
        item['address'] = table.xpath('/tr[1]/td[2]/strong[1]/text()').extract()
        item['state'] = table.xpath('/tr[1]/td[2]/strong[2]/text()').extract()
        item['postal_code'] = table.xpath('/tr[1]/td[2]/strong[3]/text()').extract()
        items.append(item)
    return items
My items.py contains:
import scrapy

class Superheroes(scrapy.Item):
    name = scrapy.Field()
    address = scrapy.Field()
    state = scrapy.Field()
    postal_code = scrapy.Field()
When I run "scrapy runspider superheroes.py -o super_db -t csv", the output file is empty.
Can someone help me find what is wrong in the code above?
Thanks a lot for your help!
Answer 0 (score: 1)
You should fix the xpath expressions and yield each item inside the for loop, instead of returning the items array.
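A minimal sketch of that change, using plain dicts as stand-ins for the Superheroes item and no Scrapy machinery, just to contrast returning an accumulated list with yielding each item:

```python
# Original pattern: append to a module-level list and return it once.
# Because the list lives at module level, items pile up across calls.
def parse_item_with_return(rows):
    items = []
    for row in rows:
        items.append({'name': row})
    return items

# Suggested pattern: yield each item as it is built. Scrapy treats a
# callback that yields as a generator and consumes every item produced.
def parse_item_with_yield(rows):
    for row in rows:
        yield {'name': row}

print(list(parse_item_with_yield(['Peter Parker', 'Tony Stark'])))
```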
Answer 1 (score: 1)
There are two problems with your code. First, your parse_item method does not appear to be indented (at least, that's how it looks in your question), so it is not part of the MySpider class. Every line of superheroes.py from def parse_item(self, response): onward needs two more spaces in front of it.
The second problem is that the rules declaration makes the spider call parse_item for every link that the SgmlLinkExtractor finds in the page. You can see in the output that it tries to fetch /iron_man and /spiderman; the responses from those pages are what get passed to parse_item.
To process the start_urls page with parse_item, you need to rename it to parse_start_url. Since you only have one page to process, you can even get rid of the rules entirely! (See the documentation on parse_start_url.)
Your updated class would look like this (note that I also moved items into the method; there is no need to declare it as a global):
class MySpider(CrawlSpider):
    name = "superheroes"
    allowed_domains = ["localhost"]
    start_urls = ["http://localhost:8000/page.html"]

    # indentation!
    def parse_start_url(self, response):
        sel = Selector(response)
        headers = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]/h2')
        for header in headers:
            item = Superheroes()
            item['name'] = header.xpath('text()')[0].extract()
            table = header.xpath('following-sibling::table')
            item['address'] = table.xpath('tr[1]/td[2]/strong[1]/following-sibling::text()')[0].extract().strip()
            item['state'] = table.xpath('tr[1]/td[2]/strong[2]/following-sibling::text()')[0].extract().strip()
            item['postal_code'] = table.xpath('tr[1]/td[2]/strong[3]/following-sibling::text()')[0].extract().strip()
            yield item
EDIT: Thanks to @Daniil Mashkin for pointing out that the original xpath expressions did not work. I have corrected them in the code above. Cheers!
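If you want to verify the corrected expressions without running the whole spider, here is a sketch that checks the same following-sibling XPath logic against a trimmed copy of the question's markup. It uses lxml purely for the check (an assumption of this sketch; the expressions themselves are the ones a Scrapy Selector would receive):

```python
from lxml import html

# Trimmed copy of the sample markup from the question.
doc = html.fromstring("""
<div id="superheroes"><table>
<tr><td><h2>Superheroes in New York</h2></td></tr>
<tr><td>
  <h2>Peter Parker</h2>
  <table><tr><td><img src="/img/spidey.jpg"/></td>
    <td><strong>Address:</strong> New York City<br/>
        <strong>State:</strong>New York<br/>
        <strong>Postal Code:</strong>12345<br/></td></tr></table>
</td></tr></table></div>
""")

# Same traversal as the answer: find each hero's <h2>, then walk to the
# <table> that follows it and pull the text node after each <strong>.
for header in doc.xpath('//div[@id="superheroes"]/table/tr[2]/td[1]/h2'):
    name = header.xpath('text()')[0]
    table = header.xpath('following-sibling::table')[0]
    address = table.xpath('tr[1]/td[2]/strong[1]/following-sibling::text()')[0].strip()
    state = table.xpath('tr[1]/td[2]/strong[2]/following-sibling::text()')[0].strip()
    postal = table.xpath('tr[1]/td[2]/strong[3]/following-sibling::text()')[0].strip()
    print(name, '|', address, '|', state, '|', postal)
```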