Question

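I am trying to scrape a dynamic website using BeautifulSoup and Selenium. The attributes I want to filter and write to a CSV file are contained in a <script> tag, and I want to extract them.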

Script:

window.IS24 = window.IS24 || {};
IS24.ssoAppName = "search";
IS24.applicationContext = "/Suche/error-reporter";
IS24.ab = {};
IS24.feature = {"SEARCH_BY_TELEKOM_SPEED_ENABLED": true,
IS24.resultList = {
angularDebugInfoEnabled: false,
navigationBarUrl: "/Suche/ST/Haus-Kauf",
nextPage: "/Suche/ST/P-2/Haus-Kauf?pagerReporting=true",
searchUrl: "/Haus-Kauf",
isMobile: false,
isTablet: false,
query:
{"realEstateType":"HOUSE_BUY","otpEnabled":true,"sortingCode":0,"location":
{"isGeoHierarchySearch":true,
Schulze","referrer":["RESULT_LIST_GROUPED"],"**attributes":[
{"title":"Kaufpreis","value":"249.012,75€"},
{"title":"Wohnfläche","value":"129,87m²"},{"title":"Zimmer","value":"4"},
{"title":"Grundstück","value":"400m²"},"checkedAttributes":["Gäste-**

I don't know how to extract the attributes and get them into a CSV file in the end. Could you help me with the code and explain it?


Answer 1

Here is how to extract attribute values from a tag using BeautifulSoup.

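from urllib.request import Request, urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

# Fetch the raw HTML of the page
req = Request('http://website_to_grab_things_from.com')
response = urlopen(req)
html = response.read()

# Parse the HTML and collect all of its text
soup = BeautifulSoup(html, "html.parser")
alltext = soup.get_text()

# General pattern: soup.find_all('TAGNAME', {'ATTR_NAME': 'ATTR_VALUE'})
# Here: every <div> whose class is "teaser-text"
result = soup.find_all('div', {'class': 'teaser-text'})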
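The snippet above selects ordinary tags, but the values in the question live inside the JavaScript text of a <script> tag, so they need one more step. The following is only a rough sketch of one way to do that, not tested against the real site. It assumes you already have the script text as a string (for example from the .string of a script tag found with BeautifulSoup, or from Selenium's page source), and that each "attributes" array is a flat JSON list of title/value objects as in the excerpt. The names extract_listings, listings_to_csv and sample_script are made up for illustration, and the sample is a shortened, cleaned-up stand-in for the real script.

```python
import csv
import io
import json
import re

# Matches each "attributes":[ ... ] array. Assumes the array holds flat
# {"title": ..., "value": ...} objects, so the first "]" closes the array.
# The quoted, case-sensitive key does not match "checkedAttributes".
ATTRIBUTES_RE = re.compile(r'"attributes"\s*:\s*(\[.*?\])', re.DOTALL)


def extract_listings(script_text):
    """Return one {title: value} dict per "attributes" array found."""
    listings = []
    for match in ATTRIBUTES_RE.finditer(script_text):
        pairs = json.loads(match.group(1))  # the array itself is valid JSON
        listings.append({pair["title"]: pair["value"] for pair in pairs})
    return listings


def listings_to_csv(listings, out):
    """Write the listings to the file-like object `out` as CSV."""
    # Collect column names in first-seen order across all listings.
    fieldnames = []
    for listing in listings:
        for title in listing:
            if title not in fieldnames:
                fieldnames.append(title)
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(listings)  # missing columns are left empty


# Hypothetical sample modeled on the script excerpt in the question.
sample_script = '''
IS24.resultList = {
  query: {"realEstateType":"HOUSE_BUY"},
  "referrer":["RESULT_LIST_GROUPED"],"attributes":[
  {"title":"Kaufpreis","value":"249.012,75€"},
  {"title":"Wohnfläche","value":"129,87m²"},{"title":"Zimmer","value":"4"},
  {"title":"Grundstück","value":"400m²"}],"checkedAttributes":["x"]
};
'''

listings = extract_listings(sample_script)
buffer = io.StringIO()
listings_to_csv(listings, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a file opened with open("listings.csv", "w", newline="", encoding="utf-8") (the file name is just an example). If the real arrays contain nested brackets, the regular expression will cut them short and a proper JSON extraction of the surrounding object would be the safer route.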
