Sorry for posting this again. Running the strip_tags() code below gives me the error UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 45: ordinal not in range(128). The code:
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
The error occurs on this string:
"<p>We’re implementing the PayPal MECL library in a client’s app but we’re experiencing some poor user experience that we don’t seem to be able to change. \nWhen the PayPal experience is complete, PayPal show a “Please wait while we transfer you to the business site...” message. Obviously this is an iOS app not a “business site”...</p>\n\n<p>The flow functions by dismissing the web view on completion of the PayPal experience by listening for new URL requests within the UIWebViewDelegate method:</p>\n\n<pre><code>- (BOOL)webView:(UIWebView *)webView shouldStartLoadWithRequest:(NSURLRequest *)request navigationType:(UIWebViewNavigationType)navigationType\n</code></pre>\n\n<p>This issue seems to be that PayPal update their web view with the message via editing the DOM (JS or some such) which does not create a new web request and therefor no shouldStartLoadWithRequest fired. Note: A new request is made after a second or so when redirected but that’s too late, the inappropriate copy has been presented to the user.</p>\n\n<p>Has anyone working with MECL on iOS or Android managed to alter this copy/experience either via the <a href=\"https://cms.paypal.com/uk/cgi-bin/?cmd=_render-content&content_ID=developer/e_howto_api_nvp_r_SetExpressCheckout\" rel=\"nofollow\">SetExpressCheckout</a> server call or configuration of the <a href=\"https://cms.paypal.com/uk/cgi-bin/?cmd=_render-content&content_ID=developer/e_howto_api_WPECOnMobileDevices\" rel=\"nofollow\">MECL URL get params</a>?I ’ve been unable to find a resolution on this so far but will post a solution if we find one. Any help would be greatly appreciated as we don’t seem to be able to find a solution in PayPals documentation...</p>\n\n<p><strong>NOTE:</strong> Also we have a similar UX issue when pressing the cancel button on the PayPal web view that causes a redirect, but with a similar bad piece of copy presented before hand “Cancel this purchase and return to the seller’s website?”. 
This is worded as a confirmation dialogue but there are no buttons presented and it redirects anyway. Mad UX. Again if anyone knows a solution to either if these please post.</p>\n\n<p><img src=\"http://i.stack.imgur.com/gc4zq.png\" alt=\""Please wait while we transfer you to the business site..." image\"></p>\n\n<p><img src=\"http://i.stack.imgur.com/cztum.png\" alt=\""Cancel this purchase and return to the seller’s website?" image\"></p>\n"
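I am processing 6 million documents, and so far (10% of the time) I have hit the error message above. If I call a.decode("utf-8") before calling the strip_tags function, I can fix the problem for the text above, but my code needs to keep working.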
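For reference, here is a minimal sketch of the decode-first workaround described above. It makes several assumptions: that the documents are UTF-8 byte strings (0xe2 is the first byte of the UTF-8 encoding of the ’ character), that undecodable bytes may be replaced rather than raising, and that it runs on Python 3's html.parser instead of the Python 2 HTMLParser module used above. The helper name to_text is hypothetical.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collects only the text content seen between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def to_text(raw, encoding='utf-8'):
    # Hypothetical helper: decode byte strings once, up front.
    # Assumes UTF-8 input; bad bytes become U+FFFD instead of raising.
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors='replace')
    return raw


def strip_tags(html):
    s = MLStripper()
    s.feed(to_text(html))
    s.close()  # flush any text still buffered by the parser
    return s.get_data()
```

Decoding in one place means the parser only ever sees text, so byte strings and already-decoded strings both go through the same path.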
Any ideas on what I can do? I am tempted to strip the HTML tags with a regex (I know it is wrong).
Thanks.