Since XPath does not support extracting text from multiple nodes at once, I decided to write a for loop to get 30 items.
for i in range(1, 31):
    content = "string(//div[@id='article']/p[" + (print(i)) + "]/.)"
    print(content)
I expected it to print:
"string(//div[@id='article']/p[1]/.)"
"string(//div[@id='article']/p[2]/.)"
"string(//div[@id='article']/p[3]/.)"
....
"string(//div[@id='article']/p[30]/.)"
However, it does not work as I expected. I got the following error message:
TypeError: Can't convert 'NoneType' object to str implicitly
What should I do? Is there a more elegant approach to this problem?
Answer 0 (score: 1)
In Python3, print is a function which prints to the screen and returns None, so concatenating its result with a string raises the TypeError you saw. (In Python2, print is a statement, and the code would have raised a SyntaxError since you can't put a statement in the middle of an expression.) Instead, to build a string use the format method:
content = "string(//div[@id='article']/p[{}]/.)".format(i)
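To make the fix concrete, here is a minimal sketch that builds all 30 expressions from the question with format. The names TEMPLATE, xpath_exprs and same_as_last are only illustrative, and the f-string variant assumes Python 3.6 or later.

```python
# Build the 30 XPath expressions from the question without calling print()
# inside the string expression. TEMPLATE and xpath_exprs are illustrative names.
TEMPLATE = "string(//div[@id='article']/p[{}]/.)"

# XPath positions are 1-based, so the range runs from 1 to 30 inclusive.
xpath_exprs = [TEMPLATE.format(i) for i in range(1, 31)]

# An f-string (Python 3.6+) produces the same text.
i = 30
same_as_last = f"string(//div[@id='article']/p[{i}]/.)"

print(xpath_exprs[0])
print(xpath_exprs[-1])
print(same_as_last == xpath_exprs[-1])
```

Printing is kept separate from string building here, which is exactly what the original loop mixed up.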
And by the way, you should be able to use position() just fine with lxml. For instance,
import lxml.html as LH
content = '''\
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
<book>
<title lang="eng">Things Fall Apart</title>
<price>19.99</price>
</book>
<book>
<title lang="eng">Blood Meridian</title>
<price>9.99</price>
</book>
</bookstore>'''
root = LH.fromstring(content)
# Compare with https://stackoverflow.com/a/39242701/190597
print(root.xpath('//book[position()>=1 and position()<=last()]/title/text()'))
# ['Harry Potter', 'Learning XML', 'Things Fall Apart', 'Blood Meridian']
# But note that it is equivalent to
print(root.xpath('//book/title/text()'))
# ['Harry Potter', 'Learning XML', 'Things Fall Apart', 'Blood Meridian']
Similarly,
print(root.xpath('//book[position()<3]/title/text()'))
prints
['Harry Potter', 'Learning XML']
which shows that you can select the first N books without having to loop.
As Tomalak mentions, when the XPath string function is given several nodes, it only returns the string value of the first one. For example,
print(root.xpath('string(//book[position()<3]/title/text())'))
only prints
Harry Potter
If you want a list of strings, then don't use string.
If, as Daniel Haley points out, the desired text is in a mixture of nested nodes and child elements, e.g. <title lang="eng">Harry <b>Potter</b></title>, then you can extract the desired text using the text_content method:
[title.text_content() for title in root.xpath('//book[position()<3]/title')]
Answer 1 (score: 1)
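The trailing /. in your XPath is unnecessary: it only selects the same node again, so you can drop it. The actual error comes from print(i), which returns None; convert the number with str(i) instead.
Try:
content = "string(//div[@id='article']/p[" + str(i) + "])"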
Full example:
import lxml.html
html = """<tag1>
<tag2>
<div id="article">
<p> stuff1 </p>
<p> stuff2 </p>
<p> stuff30 <b>more stuff</b></p>
</div>
</tag2>
</tag1>"""
root = lxml.html.fromstring(html)
for i in range(1, 4):
    content = root.xpath("string(//div[@id='article']/p[" + str(i) + "])")
    print(content)
#stuff1
#stuff2
#stuff30 more stuff
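As a possible alternative to indexing each paragraph by number, the sketch below selects every p element in one query and reads its text with text_content, which is the same lxml.html method mentioned in the other answer. The trimmed-down html string and the names paragraphs and texts are only illustrative, and the strip() call is an assumption that you do not want the surrounding whitespace.

```python
import lxml.html

# A shortened version of the sample document, for illustration only.
html = """<div id="article">
<p> stuff1 </p>
<p> stuff2 </p>
<p> stuff30 <b>more stuff</b></p>
</div>"""

root = lxml.html.fromstring(html)

# One query returns all matching <p> elements, however many there are.
paragraphs = root.xpath("//div[@id='article']/p")

# text_content() also collects text from nested tags such as <b>.
texts = [p.text_content().strip() for p in paragraphs]
print(texts)
# ['stuff1', 'stuff2', 'stuff30 more stuff']
```

This avoids hard-coding the number of paragraphs, so the same code works whether the article has 3 or 30 of them.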