Question

I wrote a script that extracts the href link and the text from inside an anchor tag.

Here is my Python code:

import re
import cssselect
from lxml import html

mainTree = html.fromstring('<a href="https://www.example.com/laptops/" title="Laptops"><div class="subCategoryItem">Laptops <span class="cnv-items">(229)</span></div></a>')

for links in mainTree.cssselect('a'):
    urls = [links.get('href')]
    texts = re.findall(re.compile(u'[A-z- &]+'), links.text_content())

    for text in texts:
        print (text)

    for url in urls:
        print (url)

Output:

Laptops 
https://www.example.com/laptops/

Can I do something like this instead of using two for loops?

for text, url in texts, urls:
    print (text)
    print (url)

Answer 1

You can use the zip function:

for text, url in zip(texts, urls):
    print (text)
    print (url)

zip pairs up the elements of two or more iterables. The iterables do not have to be the same length: zip stops at the shortest one.

>>> l1 = range(5)
>>> l2 = range(6)
>>> list(zip(l1,l2)) #produces
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
>>>
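
As a small sketch of the length behaviour described above (the lists names and counts are made-up sample data, not from the question): zip silently drops the unmatched tail, while itertools.zip_longest from the standard library pads the shorter input instead.

```python
from itertools import zip_longest

names = ["laptops", "phones", "tablets"]  # hypothetical sample data
counts = [229, 87]                         # deliberately one item shorter

# zip stops as soon as the shortest iterable is exhausted
pairs = list(zip(names, counts))

# zip_longest keeps going and fills the gaps with fillvalue
padded = list(zip_longest(names, counts, fillvalue=None))

print(pairs)
print(padded)
```

If losing the extra items would be a bug in your program, zip_longest (or an explicit length check) makes the mismatch visible.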

Answer 2

Let's look at what you are trying to do here:

for text, url in texts, urls:
    print (text)
    print (url)

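The text, url part after for means "unpack each item produced by the expression after in into two parts". Here that expression is texts, urls, which is a tuple of the two lists, so each list is unpacked in turn. If an item does not have exactly two parts, you get a ValueError.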
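
To make the failure concrete, here is a minimal sketch using the values the question's script produces for its sample HTML (one text and one URL). The names items and error are introduced only for illustration.

```python
texts = ["Laptops "]
urls = ["https://www.example.com/laptops/"]

# `texts, urls` is the tuple (texts, urls): the loop visits each list in turn
items = list((texts, urls))

error = None
try:
    for text, url in texts, urls:
        print(text, url)
except ValueError as exc:
    # each list holds a single element, so it cannot be unpacked into two names
    error = str(exc)

print(items)
print(error)
```

If each list happened to hold exactly two elements, the loop would run without an error, but it would pair the two texts with each other and the two URLs with each other, which is still not the intended result.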

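Both lists you are iterating over contain a single value, so simply putting a comma between them does not do what you are looking for. As the other answer suggests, you can use zip to combine them into a single sequence of pairs: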

for text, url in zip(texts, urls):
    print (text)
    print (url)

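zip produces tuples, each made up of one element from each of the supplied lists (in Python 2 it returns them as a list; in Python 3 it returns a lazy iterator). This works, but it does not solve the problem of looping over the data twice: the lists are still built first and then walked again when they are zipped and unpacked. Your deeper problem is how you are obtaining the values in the first place.

You appear to be stepping through each link, and for each link you take the URL and the text and put them into lists. Then you print everything in those lists. Will those lists ever be longer than one element?

The get function returns only a single value: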

urls = [links.get('href')]  # gets one value and puts it in a list of length one

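There is no point in putting it into a list. As for your regex search, it could in theory return multiple values, but if you use re.search() you get only the first match and do not have to worry about the rest. This is what you are currently doing: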

for each link in the document
  put the url into a list
  put all the matching text into a list
  for each url in the list print it
  for each text in the list print it

when it could be simplified to:

for each link in the document
  print the url
  find the first text and print it

Then you do not have to worry about the extra for loops or zipping. Refactored, this becomes:

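for links in mainTree.cssselect('a'):
    print(links.get('href'))
    match = re.search(re.compile(u'[A-z- &]+'), links.text_content())
    if match:  # re.search returns None when nothing matches
        print(match.group())  # .group() gives the matched text, not the Match object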
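
The difference between re.findall and re.search can be sketched with the standard library alone. The string content below is what text_content() would be expected to return for the question's sample anchor; it is written out by hand here as an assumption rather than produced by lxml.

```python
import re

content = "Laptops (229)"            # assumed text_content() of the sample link
pattern = re.compile(r"[A-z- &]+")   # pattern kept exactly as in the question

all_matches = pattern.findall(content)  # list containing every match
first = pattern.search(content)         # Match object for the first match, or None

# .group() extracts the matched string from the Match object
first_text = first.group() if first else None

print(all_matches)
print(first_text)
```

Note that the range A-z also covers the ASCII punctuation characters that sit between Z and a ([, \, ], ^, _ and `). If only letters are wanted, A-Za-z is the usual spelling.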
