Question

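I am trying to do the following:

1. Go to a web page and enter a search term.
2. Get some data from it.
3. The results in turn contain multiple URLs. I need to parse each of them to get some data from it.

I can do 1 and 2. What I don't understand is how to visit all of the URLs and get data from them (the data is similar across the URLs, but not identical).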

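Edit: more information. I read the search terms from a CSV file and get some IDs (with URLs) from each page. I want to visit all of those URLs to get more IDs from the pages that follow, and I want to write all of this to a CSV file. Basically, I want my output to look like this: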

Level1 ID1   Level2 ID1   Level3 ID
             Level2 ID2   Level3 ID
             .
             .
             .
             Level2 IDN   Level3 ID
Level1 ID2   Level2 ID1   Level3 ID
             Level2 ID2   Level3 ID
             .
             .
             .
             Level2 IDN   Level3 ID

Each Level1 ID can have multiple Level2 IDs, but each Level2 ID has exactly one corresponding Level3 ID.

The code I have written so far:

import re  # needed for re.compile below
import pandas as pd
from bs4 import BeautifulSoup
from urllib import urlopen

colnames = ['A','B','C','D']
data = pd.read_csv('file.csv', names=colnames)
listofdata= list(data.A)
id = '\n'.join(listofdata[1:]) #to skip header


def download_gsm_number(gse_id):
    url = "http://www.example.com" + id
    readurl = urlopen(url)
    soup = BeautifulSoup(readurl)
    soup1 = str(soup)
    gsm_data = readurl.read()
    #url_file_handle.close()
    pattern=re.compile(r'''some(.*?)pattern''')  
    data = pattern.findall(soup1)
    col_width = max(len(word) for row in data for word in row)
    for row in data:
        lines = "".join(row.ljust(col_width))
        sequence = ''.join([c for c in lines])
        print sequence

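But this passes all of the IDs into the URL at once. As I mentioned earlier, I need to get the level2 IDs using the level1 IDs as input, and then get the level3 IDs from the level2 IDs. Basically, if I can get one part working (fetching either the level2 or the level3 IDs), I can figure out the rest.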

Answer 1

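I believe the answer is urllib.

It is really as easy as:

web_page = urllib.urlopen(url_string)  # Python 2; in Python 3 this is urllib.request.urlopen

You can then use the usual file-like operations on it, for example: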

read()
readline()
readlines()
fileno()
close()
info()
getcode()
geturl()
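
As a minimal sketch of the call above in Python 3 (an assumption; the code in the question is Python 2), the response can be read and decoded in one small helper. The name `fetch_text` and the UTF-8 default are illustrative choices, not part of urllib.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Open `url` and return the response body as text.

    `fetch_text` is a hypothetical helper name; the encoding default
    is an assumption and may need adjusting for the target site.
    """
    # The response object is file-like, so it works as a context manager
    # and is closed automatically when the block exits.
    with urlopen(url) as response:
        raw_bytes = response.read()
    return raw_bytes.decode(encoding, errors="replace")
```

Calling `fetch_text("http://www.example.com/")` would return the page source as a string, ready to hand to a parser.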

From there I suggest using BeautifulSoup to parse it, which is as simple as:

soup = BeautifulSoup(web_page.read(), "html.parser")

Then you can run all of the usual BeautifulSoup operations on it.
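
To show what "getting the URLs out of a page" can look like, here is a rough sketch that collects every link from an HTML string. It deliberately uses the standard library's `html.parser` as a stand-in so that it runs without third-party packages; with BeautifulSoup the same step would typically be a loop over `soup.find_all("a", href=True)`. The names `LinkCollector` and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all link targets found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Each URL returned this way can then be fetched and parsed in turn, one page at a time.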

I think Scrapy is overkill here and comes with more overhead. BeautifulSoup has excellent documentation and examples, and it is simple to use.
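
For the three-level case in the question, one possible shape is a pair of nested loops that fetch one URL per ID instead of joining all IDs into a single URL. The sketch below is Python 3 and rests on several assumptions: each page URL is `base_url + id`, the regular expressions (each with exactly one capture group) are placeholders for whatever the real pages contain, and `crawl`, `write_rows` and `fetch` are hypothetical names.

```python
import csv
import re
from urllib.request import urlopen

BASE_URL = "http://www.example.com/"  # placeholder taken from the question


def fetch(url):
    """Download one page and return it as text (UTF-8 is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(level1_ids, level2_pattern, level3_pattern,
          fetch=fetch, base_url=BASE_URL):
    """Yield (level1, level2, level3) tuples, visiting one page per ID.

    Both patterns must contain exactly one capture group so that
    findall() returns plain strings.
    """
    for level1 in level1_ids:
        page1 = fetch(base_url + level1)
        for level2 in level2_pattern.findall(page1):
            page2 = fetch(base_url + level2)
            # Each level2 page is expected to hold a single level3 ID.
            match = level3_pattern.search(page2)
            yield level1, level2, match.group(1) if match else ""


def write_rows(rows, out_file):
    """Write rows as CSV, leaving Level1 blank when it repeats."""
    writer = csv.writer(out_file)
    writer.writerow(["Level1 ID", "Level2 ID", "Level3 ID"])
    previous = None
    for level1, level2, level3 in rows:
        writer.writerow(["" if level1 == previous else level1,
                         level2, level3])
        previous = level1
```

A typical call would open the output with `open("out.csv", "w", newline="")` and pass `crawl(ids, pattern2, pattern3)` to `write_rows`. Passing `fetch` as a parameter keeps the loop logic testable without network access.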

Fetching data from multiple URLs with Python