I've been searching for an answer for a long time and found some results, but none of them seem to apply to my use case, and I still can't see what's wrong in my code.
I have a fully working scraper that I want to rewrite using OOP, but I'm having trouble calling a class method from outside the class (inside a for loop). Any help would be appreciated.
My code:
import urllib.request

from bs4 import BeautifulSoup
import pandas as pd


class IndeedScraper(object):
    def __init__(self, role, max_pages):
        self.role = role
        self.max_pages = max_pages
        self.url = "https://ie.indeed.com/jobs?as_and={}&radius=25&l=Dublin&fromage=3&limit=50&sort=date".format(role)

    # Finds number of pages resulting from search term provided
    def find_pages(self):
        return pages  # Returns a List of URLs

    # Parses relevant information from each page
    def find_info(self):
        return l  # Returns a List of Dictionaries with the parsed information


if __name__ == '__main__':
    role = str(input("Enter role to search: "))
    max_pages = int(input('Enter number of pages to scrape: '))
    scraper = IndeedScraper(role, max_pages)
    l_main = []
    pages = scraper.find_pages()
    for i in pages[:max_pages]:
        html_page = urllib.request.urlopen(i)
        source = BeautifulSoup(html_page, "html5lib")
        print("Scraping Page number: " + i)
        results = scraper.find_info(source)  # THIS IS WHERE I DON'T KNOW HOW TO CALL THE 'find_info' function to make it work
        l_main.extend(results)

    # Put all results into a DataFrame
    df = pd.DataFrame(l_main)
    df = df[['Date', 'Company', 'Role', 'URL']]
    df = df.dropna()
    df.sort_values(by=['Date'], inplace=True, ascending=False)
    df.to_csv("csv_files/pandas_data.csv", mode='a', header=True, index=False)
The error shown:

Traceback (most recent call last):
  File "class_indeed_TEST.py", line 99, in <module>
    df = df[['Date', 'Company', 'Role', 'URL']]
  File "/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py", line 2133, in __getitem__
    return self._getitem_array(key)
  File "/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py", line 2177, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "/usr/local/lib/python3.4/dist-packages/pandas/core/indexing.py", line 1269, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "['Date' 'Company' 'Role' 'URL'] not in index"
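Stripped down, the pattern I'm trying to use is just passing a value from the calling code into a method (the names here are made up):

```python
# Made-up names: a method that should receive a value from the caller
# must declare that parameter after `self`.
class Shouter:
    def shout(self, word):
        return word.upper()

s = Shouter()
print(s.shout("hello"))  # prints HELLO
```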
Answer 0 (score: 1)
The problem is in the find_pages function you posted. When you call the script with the role "qa", the site returns all of its results on a single page, so that page contains no div with class=pagination. This line therefore iterates over an empty list:

for a in source.find_all('div', class_='pagination'):

...which means this inner loop never runs either:

for link in a.find_all('a', href=True):

So find_pages returns an empty list, and pandas ends up building an empty DataFrame, which is why selecting the 'Date', 'Company', 'Role' and 'URL' columns raises the KeyError. To fix this, add a condition that checks whether the pagination divs are empty, like this:

# Finds number of pages resulting from search term provided
def find_pages(self):
    pages = []
    html_page = urllib.request.urlopen(self.url)
    source = BeautifulSoup(html_page, "html5lib")
    base_url = 'https://ie.indeed.com'
    # <edited code>
    pagination_divs = source.find_all('div', class_='pagination')
    if not pagination_divs:
        return [base_url + '/jobs?q={}&l=Dublin&sort=date&limit=50&radius=25&start=0'.format(self.role)]
    for a in pagination_divs:
        for link in a.find_all('a', href=True):
            pages.append(base_url + link['href'])
    # </edited code>
    pages.insert(0, base_url + '/jobs?q=test&l=Dublin&sort=date&limit=50&radius=25&start=0')
    pages.pop()
    return pages
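You can sanity-check the guard in isolation with an inline HTML string (a minimal sketch, assuming bs4 is installed; the fallback URL here is just a placeholder):

```python
from bs4 import BeautifulSoup

# A results page with no pagination div, like the single-page "qa" search
html = "<html><body><div class='results'>one page of jobs</div></body></html>"
source = BeautifulSoup(html, "html.parser")

pagination_divs = source.find_all('div', class_='pagination')
print(pagination_divs)  # find_all returns an empty list, never None

if not pagination_divs:
    pages = ["fallback-url"]  # placeholder for the single results page
else:
    pages = [link['href']
             for div in pagination_divs
             for link in div.find_all('a', href=True)]
print(pages)
```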
Note: depending on what you are trying to achieve, you may want to edit the code to do something different when the div does not exist.
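One such alternative, as a hypothetical sketch (the lookup is pulled out into a plain function here, and find_pages_strict is a made-up name), is to raise instead of silently scraping a single page:

```python
# Hypothetical variant: fail loudly when the pagination div is missing,
# so a layout change or single-page result cannot slip through unnoticed.
def find_pages_strict(source, base_url):
    pagination_divs = source.find_all('div', class_='pagination')
    if not pagination_divs:
        raise ValueError("no pagination div found on the results page")
    return [base_url + link['href']
            for div in pagination_divs
            for link in div.find_all('a', href=True)]
```

Failing fast like this is handy while debugging; the fallback-URL approach above is friendlier for unattended runs.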