Extract all text from possibly nested <span>s on a web page

Time: 2016-12-09 18:48:19

Tags: javascript jquery python web-scraping beautifulsoup

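I have a web page containing various text snippets wrapped in <span class="x"></span> tags. I want to generate an ordered list of every such snippet. Straightforward.

The wrinkle: the outer <span class="x"> tags often contain further nested tags of the same kind, which I don't care about. Basically, I want a list of every string that sits inside at least one <span class="x"> tag, but any additional nested tags of this kind should be ignored and discarded.

Here is some sample HTML: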

<p>
  Outer text. <span class="x">Inside a single span.</span> Back to outer text once more. <span class="x"><span class="x">Inside two spans</span> or just one</span>. Perhaps a <span class="x">single span contains <span class="x">several</span> 
  <span class="x">nests</span>  <span class="x">within <span class="x">it</span>
  </span>!</span>
</p>
<span class="x">Maybe there's a span out here.</span><span class="x">(Or two.)</span>
<p>
  <table>
    <tr>
      <td>
        <span class="x">Or <span class="x">in</span><span class="x">here</span></span>.
      </td>
    </tr>
  </table>
</p>
<p>
  <span>No.</span>  <span>Still no, but<span class="x">yes</span>.</span>
</p>

And my desired output:

[ "Inside a single span.",
  "Inside two spans or just one",
  "single span contains several nests within it!",
  "Maybe there's a span out here.",
  "(Or two.)",
  "Or inhere",
  "yes" ]

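Specific features of this example that I want to call attention to:

  • The outermost spans can appear at any depth within the larger HTML document.
  • Spans can be nested arbitrarily. (In practice I have not yet found any instance more than 3 or 4 levels deep.)
  • There may or may not be whitespace between adjacent outer spans; I want their contents parsed as separate strings.
  • Span tags without the "x" class are not wanted.
  • There may or may not be whitespace between adjacent inner tags; I want it preserved as is.
  • I don't expect any <span class="x"> tag to contain any HTML tags other than further nested <span class="x"> tags.

I'd be happy with a JavaScript + jQuery solution or a Python 3 + BeautifulSoup solution, or with something else if it is better suited to the task at hand than either of those.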

7 answers:

Answer 0: (score: 1)

Try:

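// Log the text of the first child node of each span.x
$('span.x').each(function(index, el) {
  console.log(el.childNodes[0].textContent);
});

// Log the full text of each span.x, nested spans included
$('span.x').each(function(index, el) {
  console.log($(el).text());
});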

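This is, of course, a jQuery example. It lists the text of every span in the console.

Just build your ordered list from this snippet.

Answer 1: (score: 1)

You can get the full list of texts in JavaScript with a simple jQuery statement: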

$("span.x").map(function() { return $(this).text() || null; }) // returning null drops empty entries
  .get(); // convert the jQuery wrapper into a plain array

How you use it is up to you.

Answer 2: (score: 1)

JS solution:


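This function loops over all elements of the HTML tree. If an element has class x, it concatenates all the inner results and adds the direct text nodes.

Note: this uses ES6. If you don't know what that is, leave a comment and I'll explain.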
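
The traversal described above can be sketched in Python. This is not the answer's ES6 code, only a minimal stand-in under stated assumptions: the markup is well-formed enough for the standard library's xml.etree.ElementTree (real-world HTML often is not), and SAMPLE, is_x_span and collect_outer_text are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, wrapped in a single root so it parses as XML.
SAMPLE = """<root>
<p>Outer text. <span class="x">Inside a single span.</span> More outer text.
<span class="x"><span class="x">Inside two spans</span> or just one</span>.</p>
<span class="x">Out here.</span><span class="x">(Or two.)</span>
<p><span>No.</span> <span>Still no, but<span class="x">yes</span>.</span></p>
</root>"""


def is_x_span(element):
    """True for <span> elements whose class list contains "x"."""
    return element.tag == "span" and "x" in element.get("class", "").split()


def collect_outer_text(element, results):
    """Depth-first walk that stops descending at the first span.x on a branch."""
    if is_x_span(element):
        # itertext() yields this element's text nodes and those of all its
        # descendants in document order, so nested span.x text is merged in.
        results.append("".join(element.itertext()))
        return
    for child in element:
        collect_outer_text(child, results)


texts = []
collect_outer_text(ET.fromstring(SAMPLE), texts)
print(texts)
```

Because the recursion returns as soon as it meets a span.x, nested spans never produce entries of their own; their text only appears as part of the enclosing span's string.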

Answer 3: (score: 1)

Stripping the inner span tags should do the job:

var st = [];
$("span.x").each(function() {
    // Regexes with the g flag strip every nested tag, not just the first match
    st.push($(this).html().replace(/<span class="x">/g, '').replace(/<\/span>/g, ''));
});

console.log(st);

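It's a bit dirty, but you get the idea.

Answer 4: (score: 1)

First get the topmost spans with class x, that is, those that have no ancestor with class x. Then take the innerText of each.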

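// Keep only span.x elements that have no ancestor with class x
var topMost = $('span.x').filter(function() {
  return !$(this).parents('.x').length;
});

var texts = topMost.map(function() {
  return this.innerText;
}).get(); // plain array instead of a jQuery object

console.log(texts);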
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>

<p>
  Outer text. <span class="x">Inside a single span.</span> Back to outer text once more. <span class="x"><span class="x">Inside two spans</span> or just one</span>. Perhaps a <span class="x">single span contains <span class="x">several</span> 
  <span class="x">nests</span>  <span class="x">within <span class="x">it</span>
  </span>!</span>
</p>
<span class="x">Maybe there's a span out here.</span><span class="x">(Or two.)</span>
<p>
  <table>
    <tr>
      <td>
        <span class="x">Or <span class="x">in</span><span class="x">here</span></span>.
      </td>
    </tr>
  </table>
</p>
<p>
  <span>No.</span> <span>Still no, but<span class="x">yes</span>.</span>
</p>
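
For comparison, the same "no ancestor with class x" filter can be sketched in Python with only the standard library. It assumes markup that parses as XML, which real pages often do not, and SAMPLE, has_class_x and outermost_x_texts are hypothetical names. ElementTree elements keep no parent pointer, so the sketch builds a child-to-parent map first.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, wrapped in a single root so it parses as XML.
SAMPLE = """<root>
<p>Outer text. <span class="x">Inside a single span.</span> More outer text.
<span class="x"><span class="x">Inside two spans</span> or just one</span>.</p>
<span class="x">Out here.</span><span class="x">(Or two.)</span>
<p><span>No.</span> <span>Still no, but<span class="x">yes</span>.</span></p>
</root>"""


def has_class_x(element):
    """True if the element's class attribute lists "x"."""
    return "x" in element.get("class", "").split()


def outermost_x_texts(root):
    """Text of every span.x that has no ancestor carrying class x."""
    # Map each element to its parent so ancestors can be walked upwards.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    def has_x_ancestor(element):
        node = parent_of.get(element)
        while node is not None:
            if has_class_x(node):
                return True
            node = parent_of.get(node)
        return False

    return [
        "".join(span.itertext())
        for span in root.iter("span")  # document order
        if has_class_x(span) and not has_x_ancestor(span)
    ]


texts = outermost_x_texts(ET.fromstring(SAMPLE))
print(texts)
```

As with the jQuery version, the filter runs once per span.x and walks up the tree each time, which is fine for the shallow nesting described in the question.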

Answer 5: (score: 1)

Not as elegant as the other solutions...

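from bs4 import BeautifulSoup

# `html` is assumed to hold the page source as a string
soup = BeautifulSoup(html, 'html.parser')

spans = soup.find_all('span', {'class': 'x'})

# Collect every tag nested inside any span.x
children = []
for span in spans:
    children.extend(span.findChildren())

children = [child.text for child in children]

# Keep only the spans whose text never occurs as the text of a nested tag
results = [span.text for span in spans if span.text not in children]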

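Answer 6: (score: 0)

Inspired by the many responses, I wrote a BeautifulSoup solution of my own. It works by repeatedly finding the next <span class="x"> in the HTML and stripping all tags from inside it before looking for the next one.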

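from bs4 import BeautifulSoup


def outer_span_texts(html):
    soup = BeautifulSoup(html, "html.parser")
    # Start from the document itself; soup.head is None when the markup has no <head>
    current_span = soup
    while True:
        current_span = current_span.find_next("span", class_="x")
        if current_span is None:
            break
        # Assigning .string replaces all children with a single text node,
        # so nested spans disappear before the next search
        current_span.string = "".join(current_span.strings)
    return [span.string for span in soup.find_all("span", class_="x")]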
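
If installing BeautifulSoup is not an option, a similar result can be approximated with html.parser from the Python standard library. The sketch below counts how many span.x elements are currently open and buffers text while that count is above zero. OuterSpanText and SAMPLE are hypothetical names, and the sketch assumes every <span> is explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical sample fragment modelled on the question's HTML.
SAMPLE = """
<p>Outer text. <span class="x">Inside a single span.</span> More outer text.
<span class="x"><span class="x">Inside two spans</span> or just one</span>.</p>
<span class="x">Out here.</span><span class="x">(Or two.)</span>
<p><span>No.</span> <span>Still no, but<span class="x">yes</span>.</span></p>
"""


class OuterSpanText(HTMLParser):
    """Collects one string per outermost <span class="x">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.results = []
        self._open_spans = []  # one flag per open <span>: does it have class x?
        self._x_depth = 0      # number of currently open span.x elements
        self._buffer = []      # text pieces of the current outermost span.x

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        is_x = "x" in (dict(attrs).get("class") or "").split()
        self._open_spans.append(is_x)
        if is_x:
            self._x_depth += 1

    def handle_endtag(self, tag):
        if tag != "span" or not self._open_spans:
            return
        if self._open_spans.pop():
            self._x_depth -= 1
            if self._x_depth == 0:
                # The outermost span.x just closed: emit its collected text.
                self.results.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._x_depth > 0:
            self._buffer.append(data)


parser = OuterSpanText()
parser.feed(SAMPLE)
parser.close()
print(parser.results)
```

The stack of flags is what lets a closing </span> be matched to the right kind of opening tag, so a span.x nested inside an unclassed span (the "yes" case) is still reported on its own.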