Python 3 - Calling a class method from outside the class

Asked: 2018-04-26 10:46:34

Tags: python function class oop web-scraping

I have searched for a long time for an answer and found a few results, but none of them seem to apply to my use case. I still cannot see where the problem in my code is.

I have a fully working scraper that I would like to rewrite using OOP, but I am having trouble calling a class method from outside the class (all inside a for loop). Any help would be greatly appreciated.

My code:

import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

class IndeedScraper(object):
    def __init__(self, role, max_pages):
        self.role = role
        self.max_pages = max_pages
        self.url = "https://ie.indeed.com/jobs?as_and={}&radius=25&l=Dublin&fromage=3&limit=50&sort=date".format(role)

    # Finds number of pages resulting from search term provided
    def find_pages(self):

        return pages   # Returns a List of URLs

    # Parses relevant information from each page    
    def find_info(self):

        return l    # Returns a List of Dictionaries with the parsed information

if __name__ == '__main__':

    role = str(input("Enter role to search: "))
    max_pages = int(input('Enter number of pages to scrape: '))

    scraper = IndeedScraper(role, max_pages)

    l_main = []
    pages = scraper.find_pages()

    for i in pages[:max_pages]:
        html_page = urllib.request.urlopen(i)
        source = BeautifulSoup(html_page, "html5lib")
        print("Scraping Page number: " + i)
        results = scraper.find_info(source)  # THIS IS WHERE I DON'T KNOW HOW TO CALL THE 'find_info' function to make it work
        l_main.extend(results)

    # Put all results into a DataFrame
    df = pd.DataFrame(l_main)
    df = df[['Date', 'Company', 'Role', 'URL']]
    df=df.dropna()
    df.sort_values(by=['Date'], inplace=True, ascending=False)
    df.to_csv("csv_files/pandas_data.csv", mode='a', header=True, index=False)
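
As an aside on the pattern the marked line is reaching for: calling an instance method from outside the class works once the method declares the extra parameter it is being called with. A minimal, hypothetical sketch (DemoScraper and its return value are illustrative, not the original scraper):

# Minimal sketch of calling an instance method from outside the class.
class DemoScraper(object):
    def find_info(self, source):
        # `self` is supplied automatically on an instance call;
        # `source` is the explicit argument passed by the caller.
        return [{'Role': 'example', 'SourceType': type(source).__name__}]

demo = DemoScraper()
print(demo.find_info("parsed page would go here"))  # -> [{'Role': 'example', 'SourceType': 'str'}]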

The error shown:

Traceback (most recent call last):
  File "class_indeed_TEST.py", line 99, in <module>
    df = df[['Date', 'Company', 'Role', 'URL']]
  File "/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py", line 2133, in __getitem__
    return self._getitem_array(key)
  File "/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py", line 2177, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "/usr/local/lib/python3.4/dist-packages/pandas/core/indexing.py", line 1269, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "['Date' 'Company' 'Role' 'URL'] not in index"
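
For reference, this KeyError is what pandas raises when the requested columns are missing, which is exactly what happens if the DataFrame is built from an empty list. A small sketch that reproduces it (the exact message wording depends on the pandas version):

import pandas as pd

# A DataFrame built from an empty list has no columns at all,
# so selecting named columns raises a KeyError like the one above.
df = pd.DataFrame([])
df = df[['Date', 'Company', 'Role', 'URL']]   # KeyError: columns not in index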

1 Answer:

Answer 0 (score: 1)

The problem is in the find_pages function. When you call the script with the role "qa", the site returns all of the results on the same page, so that page has no div with class=pagination. That means this loop never runs:

    for a in source.find_all('div', class_='pagination'):
        for link in a.find_all('a', href=True):

...so pages stays empty. find_pages therefore returns an empty list, pandas ends up building an empty DataFrame, and selecting the columns fails with the KeyError above.

To fix it, add a condition that checks whether any pagination divs were found, like this:

    # Finds number of pages resulting from search term provided
    def find_pages(self):
        pages = []
        html_page = urllib.request.urlopen(self.url)
        source = BeautifulSoup(html_page, "html5lib")
        base_url = 'https://ie.indeed.com'

        # <edited code>
        pagination_divs = source.find_all('div', class_='pagination')
        if not pagination_divs:
            return [base_url + '/jobs?q={}&l=Dublin&sort=date&limit=50&radius=25&start=0'.format(self.role)]
        for a in pagination_divs:
            for link in a.find_all('a', href=True):
                pages.append(base_url + link['href'])
        # </edited code>

        pages.insert(0, base_url + '/jobs?q=test&l=Dublin&sort=date&limit=50&radius=25&start=0')
        pages.pop()
        return pages

Note: depending on what you are trying to achieve, you may want to edit the code to do something else when the div does not exist.
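
For instance, one hypothetical variation (not part of the original answer) is a drop-in for the edited branch inside find_pages that logs a warning and falls back to scraping only the single results page:

        # Hypothetical alternative to the edited branch above:
        # warn when pagination is missing instead of silently falling back.
        import logging

        pagination_divs = source.find_all('div', class_='pagination')
        if not pagination_divs:
            logging.warning("No pagination divs found for role %r; scraping one page only", self.role)
            return [self.url]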