Question

Hello developers,

I'm new to Python and I need to write a web scraper to capture information from Google Scholar.

I ended up writing this function to get values using XPath:

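def clothoSpins(exp, atr=False):  # signature inferred from the call shown further down
    # find every element matching the XPath expression
    thread = browser.find_elements(By.XPATH, exp)
    xArray = []

    for t in thread:
        if not atr:
            xThread = t.text
        else:
            xThread = t.get_attribute('href')

        xArray.append(xThread)

    # return only after the loop has visited every element
    return xArray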

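I don't know whether this is a good or a bad solution, so I humbly accept any suggestions for making it work better.

Anyway, my actual problem is that I get the names of all the authors on the page in one flat list, and what I really need is the names grouped by result. When I print the result, I would like to get something like this: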

[[author1, author2,author 3],[author 4,author 5,author6]]

What I get now is:

[author1,author3,author4,author5,author6]

The structure is as follows:

<div class="gs_a">
    LR Hisch,
<a href="/citations?user=xuBuLKYAAAAJ&amp;hl=es&amp;oi=sra">AM Gobin</a>
    ,AR Lowery,
<a href="/citations?user=ziumTX0AAAAJ&amp;hl=es&amp;oi=sra">F Tam</a>
 ... -Annals of biomedical ...,2006 - Springer
</div>

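The same structure is repeated across the whole page for different documents and authors.

This is the call to the function I described earlier: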

authors = (clothoSpins(".//*[@class='gs_a']//a"))

This gives me the complete list of authors.

Answer 1

Here is the logic (the code below uses Selenium; adapt it as needed).

Logic:

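from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumption: any Selenium WebDriver can be used here
url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&q=python&btnG="
driver.get(url)
# get the authors of each result and add them to the list as a separate sub-list
listBooks = []
books = driver.find_elements(By.XPATH, "//div[@class='gs_a']")
for bookNum in range(len(books)):
    auths = []
    # linked authors of this result, or the div itself when it contains no links
    authors = driver.find_elements(By.XPATH, "(//div[@class='gs_a'])[%s]/a|(//div[@class='gs_a'])[%s]/self::*[not(a)]" % (bookNum + 1, bookNum + 1))
    for author in authors:
        auths.append(author.text)
    listBooks.append(auths)
print(listBooks)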

Output:

[['F Pedregosa', 'G Varoquaux', 'A Gramfort'], ['PD Adams', 'PV Afonine'], ['TE Oliphant'], ['JW Peirce'], ['S Anders', 'PT Pyl', 'W Huber'], ['MF Sanner'], ['S Bird', 'E Klein'], ['M Lutz - 2001 - books.google.com'], ['G Rossum - 1995 - dl.acm.org'], ['W McKinney - … of the 9th Python in Science Conference, 2010 - pdfs.semanticscholar.org']]
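The last three entries in the output are whole bylines rather than names, because those results contain no author links and the XPath falls back to the text of the div. A minimal sketch of one way to reduce such a byline to author names follows. It assumes the authors come before the first " - " separator and are comma-separated, as in the output above; real pages may use other dashes or non-breaking spaces. The function name `authors_from_byline` is hypothetical.

```python
def authors_from_byline(byline):
    """Return the author names found before the first ' - ' in a byline."""
    # Assumption: everything before the first " - " is the author list.
    author_part = byline.split(" - ", 1)[0]
    names = []
    for chunk in author_part.split(","):
        # Drop surrounding whitespace and a trailing ellipsis left by truncation.
        name = chunk.strip().rstrip("…").strip()
        if name:
            names.append(name)
    return names


# Bylines copied from the output above.
bylines = [
    "M Lutz - 2001 - books.google.com",
    "G Rossum - 1995 - dl.acm.org",
    "W McKinney - … of the 9th Python in Science Conference, 2010 - pdfs.semanticscholar.org",
]
grouped = [authors_from_byline(b) for b in bylines]
print(grouped)  # [['M Lutz'], ['G Rossum'], ['W McKinney']]
```

Applying this to the text of every `gs_a` div, instead of only to its links, would probably also pick up authors that have no profile link.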

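The grouping idea, querying each `gs_a` container for its own links instead of querying the whole page at once, can be tried offline. The sketch below uses the standard-library `xml.etree.ElementTree` on the snippet from the question. The `<root>` wrapper, the second div with its placeholder authors, and the function name `group_linked_authors` are assumptions added for illustration only.

```python
import xml.etree.ElementTree as ET

# First div: the snippet from the question. Second div: made-up placeholder data.
SAMPLE = """
<root>
  <div class="gs_a">
    LR Hisch,
    <a href="/citations?user=xuBuLKYAAAAJ&amp;hl=es&amp;oi=sra">AM Gobin</a>
    ,AR Lowery,
    <a href="/citations?user=ziumTX0AAAAJ&amp;hl=es&amp;oi=sra">F Tam</a>
    ... -Annals of biomedical ...,2006 - Springer
  </div>
  <div class="gs_a">
    <a href="#">Author Five</a>, <a href="#">Author Six</a> - placeholder
  </div>
</root>
"""


def group_linked_authors(markup):
    """Return one list of linked author names per gs_a container."""
    root = ET.fromstring(markup.strip())
    groups = []
    for div in root.findall(".//div[@class='gs_a']"):
        # Search only the direct <a> children of this container.
        groups.append([a.text for a in div.findall("a")])
    return groups


print(group_linked_authors(SAMPLE))
# [['AM Gobin', 'F Tam'], ['Author Five', 'Author Six']]
```

With Selenium, the analogous step would be to call `find_elements(By.XPATH, "./a")` on each container element rather than on the driver, which avoids rebuilding an indexed XPath for every result.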