I am using Scrapy to extract data from the web. I am trying to extract the text of the anchor tags under a span tag, as shown below:
<span>.....</span>
<span id = "size_selection_list">
<a>....</a>
<a>....</a>
.
.
.
<a>
</span>
I am using the following xpath logic:
t = sel.xpath('//div[starts-with(@id,"size_selection_container")]/span[2]')
for x in t.xpath('.//a'):
    ....
The problem is that the span element is reached, but the <a> tags are not iterated. What is the mistake here? Also, the <a> has an href that contains JavaScript. Is this the reason for the problem?
Answer 0 (score: 0)
If it were me, I would use requests and BeautifulSoup4.
Note that this code is untested, but it should work.
import requests
from bs4 import BeautifulSoup

url = 'yoururlhere'  # placeholder: replace with the URL of the page to scrape
r = requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')  # 'lxml' or another parser also works; the standard parser is used for compatibility
span = soup.find('div', {'class': 'theclass'})  # 'theclass' is a placeholder for the container's class
tags = span.find_all('a', href=True)  # only anchors that have an href attribute
for i in tags:
    print(i.get_text())  # the link text
    print(i['href'])  # the link target
I hope this works; please let me know.
Answer 1 (score: 0)
This is all I can do, since your HTML code is incomplete.
import lxml.html
string = '''<span>.....</span>
<span id = "size_selection_list">
<a>....</a>
<a>....</a>
.
.
.
<a>....</a>
</span>'''
html = lxml.html.fromstring(string)
for a in html.xpath('//span[@id="size_selection_list"]//a'):
    print(a.tag)
Output:
a
a
a
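Since the question asks for the anchor text rather than the tag name, here is a minimal sketch of the same id-based XPath that also reads the text and the href. The sample markup, the sizes S/M/L, the `javascript:void(0)` values and the helper name `anchor_texts` are all made up for illustration; the real page may look different. An href that holds a `javascript:` URL is just an attribute value to the parser, so on its own it should not stop the `<a>` elements from matching.

```python
import lxml.html

# Hypothetical sample markup modelled on the question; the real page may differ.
SAMPLE = '''<div id="size_selection_container_1">
<span>label</span>
<span id="size_selection_list">
<a href="javascript:void(0)">S</a>
<a href="javascript:void(0)">M</a>
<a href="javascript:void(0)">L</a>
</span>
</div>'''


def anchor_texts(markup, span_id="size_selection_list"):
    """Return (text, href) pairs for every <a> under the span with the given id."""
    root = lxml.html.fromstring(markup)
    # Select the span by its id instead of by position (span[2]),
    # so the match does not depend on how many sibling spans exist.
    anchors = root.xpath('//span[@id=$sid]//a', sid=span_id)
    return [(a.text_content().strip(), a.get("href")) for a in anchors]


pairs = anchor_texts(SAMPLE)
for text, href in pairs:
    print(text, href)
```

If the same XPath returns nothing inside Scrapy, one possibility worth checking is that the anchors are inserted by JavaScript after the page loads, in which case they are not present in the HTML that Scrapy downloads.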