Question

I am using Scrapy to scrape data from a web page, and I have run into the following problem.

<li>
<a href="NEW-IMAGE?type=GENE&amp;object=EG10567">
<b>
man
</b>
X -
<i>
Escherichia coli
</i>
</a>
<br>
</li>

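The snippet above is how a record's name appears in the page's HTML.

I want to get the text content inside the <a> tag (for example: man X - Escherichia coli) without the markup of the nested tags. This is my code: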

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//ul/li/a[contains(@href,"NEW-IMAGE")]')
    base_url = "http://www.metacyc.org/META"
    for site in sites:
        item = MetaCyc()
        name_tmp = map(unicode.strip, site.xpath('text()').extract())
        item['Name'] = unicode(name_tmp).encode('utf-8')
        item['Link'] = map(unicode.strip, site.xpath('@href').extract())
        yield item

I tried converting the unicode to UTF-8, but the result still looks like this:

{"Link": ["NEW-IMAGE?type=GENE&object=EG10567"], "Name": "[u'X -']"}

Sometimes characters are also missing from a record. So I would like to know how to get the complete, correctly formatted data from the HTML.

Answer 1

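I suggest using XPath's normalize-space():

The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space. Whitespace characters are the same as those allowed by the S production in XML. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node.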
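To make that definition concrete, here is a rough plain-Python equivalent of what normalize-space() does to a string. This is only a sketch: normalize_space is a hypothetical helper name (not a Scrapy or lxml API), Python 3 is assumed, and the sample string imitates the string-value of the <a> element from the question.

```python
import re

def normalize_space(value):
    """Approximate XPath normalize-space() for a plain string.

    XML whitespace (the S production) is space, tab, CR and LF only,
    so other characters such as U+00A0 are left untouched.
    """
    # Collapse every run of XML whitespace into one space...
    collapsed = re.sub(r"[ \t\r\n]+", " ", value)
    # ...then drop the single space that may remain at either end.
    return collapsed.strip(" ")

# The string-value of the <a> element is the concatenation of all its
# descendant text nodes, newlines included.
link_string_value = "\n\nman\n\nX -\n\nEscherichia coli\n\n"
print(normalize_space(link_string_value))  # man X - Escherichia coli
```

In a spider there is no need for such a helper; the XPath function does the same work inside the query itself.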

>>> html = """<li>
... <a href="NEW-IMAGE?type=GENE&amp;object=EG10567">
... <b>
... man
... </b>
... X -
... <i>
... Escherichia coli
... </i>
... </a>
... <br>
... </li>"""
>>> import scrapy
>>> selector = scrapy.Selector(text=html)

>>>
>>> links = selector.xpath('//li/a[contains(@href,"NEW-IMAGE")]')
>>> for link in links:
...     item = {}
...     item['Name'] = link.xpath('normalize-space(.)').extract_first()
...     item['Link'] = link.xpath('@href').extract_first()
...     print(item)
... 
{'Link': u'NEW-IMAGE?type=GENE&object=EG10567', 'Name': u'man X - Escherichia coli'}
>>>

Answer 2

If you want the text of the a tag and of its descendants, you need to use //text() instead of text(). When the query is run relative to the current selector, write it as .//text(), because a leading // searches from the document root.

Try this:

name_tmp = map(unicode.strip, site.xpath('.//text()').extract())

You can also use another module, such as html2text, to get only the text of a specific tag.

Answer 3

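I want to get the text content inside the <a> tag (for example: man X - Escherichia coli) without the markup of the nested tags.

Part of the problem is that not all of the text sits directly inside the <a> tag. Some of it is nested in tags such as <b> and <i> beneath the <a> tag. To get the complete link text as a single string: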

item_name = " ".join([word.strip() for word in sel.xpath('//li/a[contains(@href,"NEW-IMAGE")]//text()').extract() if len(word.strip())]) # => item_name = 'man X - Escherichia coli'

//a//text() means: recursively grab all the text under every <a> tag in the document and under its children. Your sel.xpath('//ul/li/a[contains(@href,"NEW-IMAGE")]/text()').extract() would get "Some text" from

<a href="../">Some text</a>

but would omit "And some more here", which is inside the <b> tag:

<a href="../">Some text<b>And some more here</b></a>
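The same distinction can be reproduced without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in (a swapped-in tool, not what the spider uses), with two hypothetical helpers: direct_text mimics text() by reading only the text nodes that are direct children, and full_text mimics .//text() by walking all descendants. Python 3 is assumed.

```python
import xml.etree.ElementTree as ET

plain = ET.fromstring('<a href="../">Some text</a>')
nested = ET.fromstring('<a href="../">Some text<b>And some more here</b></a>')

def direct_text(element):
    """Text nodes that are direct children of `element` (like text())."""
    parts = [element.text or ""]
    # Text that follows a child element still belongs to the parent.
    parts.extend(child.tail or "" for child in element)
    return "".join(parts)

def full_text(element):
    """All text below `element`, nested tags included (like .//text())."""
    return "".join(element.itertext())

print(direct_text(plain))   # Some text
print(direct_text(nested))  # Some text
print(full_text(nested))    # Some textAnd some more here
```

Note that the pieces are concatenated with no separator here, which is why the join-with-a-space step in the one-liner above matters for readable output.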

How to get the full link text using Scrapy
