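\n in Beautiful Soup text

Time: 2016-09-04 22:47:13

Tags: python beautifulsoup

I am trying to get the transcript of a YouTube video for some NLP work. I think I have it mostly working, but there are a couple of problems. For example: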

from xml.etree import cElementTree as ET
from bs4 import BeautifulSoup as bs
from urllib2 import urlopen

URL = 'http://video.google.com/timedtext?lang=en&v=KDHuWxy53uM'
def make_soup(url):
    html = urlopen(url).read()
    return bs(html, "lxml")

soup = make_soup(URL)
takeaways = soup.findAll('text')

All_text = []
for i in takeaways:
    root = ET.fromstring(str(i))
    reslist = list(root.iter())
    try:
        result = ' '.join([element.text for element in reslist])
    except:
        pass
    All_text.append(result)

Example result for one line:

'Let&#39;s learn a little bit\nabout the dot product.'

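This seems to get the transcript, but I also get \n, which is the newline character in the XML, and a strange character code in place of the apostrophe, which I assume is an encoding issue.

Does anyone know how to clean up both of these?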

2 answers:

Answer 0: (score: 5)

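\n is a newline character; if you don't want newlines, you have to replace them yourself. The HTML entities can be unescaped with HTMLParser on Python 2, or with html.unescape on Python 3 as mentioned in the other answer.

Also, since you are parsing XML and have lxml installed, your code can be simplified to: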

import lxml.etree as et
from HTMLParser import HTMLParser
unescape = HTMLParser().unescape

URL = 'http://video.google.com/timedtext?lang=en&v=KDHuWxy53uM'
tree = et.parse(URL)
print([unescape(t.replace("\n", " ")) for t in tree.xpath('//text/text()')])

Which will give you:

[u"Let's learn a little bit about the dot product.", 'The dot product, frankly, out of the two ways of multiplying', 'vectors, I think is the easier one.', 'So what does the dot product do?', u"Why don't I give you the definition, and then I'll give", 'you an intuition.', u"So if I have two vectors; vector a dot vector b-- that's", 'how I draw my arrows.', 'I can draw my arrows like that.', 'That is equal to the magnitude of vector a times the', 'magnitude of vector b times cosine of the', 'angle between them.', 'Now where does this come from?', 'This might seem a little arbitrary, but I think with a', 'visual explanation, it will make a little bit more sense.', 'So let me draw, arbitrarily, these two vectors.', 'So that is my vector a-- nice big and fat vector.', u"It's good for showing the point.", 'And let me draw vector b like that.', 'Vector b.', 'And then let me draw the cosine, or let me, at least,', 'draw the angle between them.', 'This is theta.', u"So there's two ways of view this.", 'Let me label them.', 'This is vector a.', u"I'm trying to be color consistent.", 'This is vector b.', u"So there's two ways of viewing this product.", 'You could view it as vector a-- because multiplication is', 'associative, you could switch the order.', 'So this could also be written as, the magnitude of vector a', u"times cosine of theta, times-- and I'll do it in color", 'appropriate-- vector b.', 'And this times, this is the dot product.', u"I almost don't have to write it.", 'This is just regular multiplication, because these', 'are all scalar quantities.', u"When you see the dot between vectors, you're talking about", 'the vector dot product.', 'So if we were to just rearrange this expression this', 'way, what does it mean?', 'What is a cosine of theta?', 'Let me ask you a question.', 'If I were to drop a right angle, right here,', u"perpendicular to b-- so let's just drop a right angle", 'there-- cosine of theta soh-coh-toa so, cah cosine--', 'is equal to adjacent of 
a hypotenuse, right?', u"Well, what's the adjacent?", u"It's equal to this.", 'And the hypotenuse is equal to the magnitude of a, right?', 'Let me re-write that.', 'So cosine of theta-- and this applies to the a vector.', 'Cosine of theta of this angle is equal to ajacent, which', u"is-- I don't know what you could call this-- let's call", 'this the projection of a onto b.', u"It's like if you were to shine a light perpendicular to b--", 'if there was a light source here and the light was', 'straight down, it would be the shadow of a onto b.', 'Or you could almost think of it as the part of a that goes', 'in the same direction of b.', 'So this projection, they call it-- at least the way I get', 'the intuition of what a projection is, I kind of view', 'it as a shadow.', 'If you had a light source that came up perpendicular, what', 'would be the shadow of that vector on to this one?', 'So if you think about it, this shadow right here-- you could', 'call that, the projection of a onto b.', u"Or, I don't know.", u"Let's just call it, a sub b.", u"And it's the magnitude of it, right?", u"It's how much of vector a goes on vector b over-- that's the", 'adjacent side-- over the hypotenuse.', 'The hypotenuse is just the magnitude of vector a.', u"It's just our basic calculus.", 'Or another way you could view it, just multiply both sides', 'by the magnitude of vector a.', 'You get the projection of a onto b, which is just a fancy', 'way of saying, this side; the part of a that goes in the', 'same direction as b-- is another way to say it-- is', 'equal to just multiplying both sides times the magnitude of a', 'is equal to the magnitude of a, cosine of theta.', 'Which is exactly what we have up here.', 'And the definition of the dot product.', 'So another way of visualizing the dot product is, you could', 'replace this term with the magnitude of the projection of', 'a onto b-- which is just this-- times the', 'magnitude of b.', u"That's interesting.", u"All the dot product of 
two vectors is-- let's just take", 'one vector.', u"Let's figure out how much of that vector-- what component", u"of it's magnitude-- goes in the same direction as the", u"other vector, and let's just multiply them.", 'And where is that useful?', 'Well, think about it.', 'What about work?', 'When we learned work in physics?', 'Work is force times distance.', u"But it's not just the total force", 'times the total distance.', u"It's the force going in the same", 'direction as the distance.', u"You should review the physics playlist if you're watching", u"this within the calculus playlist. Let's say I have a", '10 newton object.', u"It's sitting on ice, so there's no friction.", u"We don't want to worry about fiction right now.", u"And let's say I pull on it.", u"Let's say my force vector-- This is my force vector.", u"Let's say my force vector is 100 newtons.", u"I'm making the numbers up.", '100 newtons.', u"And Let's say I slide it to the right, so my distance", 'vector is 10 meters parallel to the ground.', 'And the angle between them is equal to 60 degrees, which is', 'the same thing is pi over 3.', u"We'll stick to degrees.", u"It's a little bit more intuitive.", u"It's 60 degrees.", 'This distance right here is 10 meters.', 'So my question is, by pulling on this rope, or whatever, at', 'the 60 degree angle, with a force of 100 newtons, and', 'pulling this block to the right for 10 meters, how much', 'work am I doing?', 'Well, work is force times the distance, but not just the', 'total force.', 'The magnitude of the force in the direction of the distance.', u"So what's the magnitude of the force in the", 'direction of the distance?', 'It would be the horizontal component of this force', 'vector, right?', 'So it would be 100 newtons times the', 'cosine of 60 degrees.', 'It will tell you how much of that 100', 'newtons goes to the right.', 'Or another way you could view it if this', 'is the force vector.', 'And this down here is the distance vector.', 'You could 
say that the total work you performed is equal to', 'the force vector dot the distance vector, using the dot', 'product-- taking the dot product, to the force and the', 'distance factor.', 'And we know that the definition is the magnitude of', 'the force vector, which is 100 newtons, times the magnitude', 'of the distance vector, which is 10 meters, times the cosine', 'of the angle between them.', 'Cosine of the angle is 60 degrees.', u"So that's equal to 1,000 newton meters", 'times cosine of 60.', 'Cosine of 60 is what?', u"It's square root of 3 over 2.", 'Square root of 3 over 2, if I remember correctly.', 'So times the square root of 3 over 2.', 'So the 2 becomes 500.', 'So it becomes 500 square roots of 3 joules, whatever that is.', u"I don't know 700 something, I'm guessing.", u"Maybe it's 800 something.", u"I'm not quite sure.", 'But the important thing to realize is that the dot', 'product is useful.', 'It applies to work.', 'It actually calculates what component of what vector goes', 'in the other direction.', 'Now you could interpret it the other way.', 'You could say this is the magnitude of a', 'times b cosine of theta.', u"And that's completely valid.", u"And what's b cosine of theta?", 'Well, if you took b cosine of theta, and you could work this', u"out as an exercise for yourself, that's the amount of", u"the magnitude of the b vector that's", 'going in the a direction.', u"So it doesn't matter what order you go.", 'So when you take the cross product, it matters whether', 'you do a cross b, or b cross a.', u"But when you're doing the dot product, it doesn't matter", 'what order.', 'So b cosine theta would be the magnitude of vector b that', 'goes in the direction of a.', 'So if you were to draw a perpendicular line here, b', 'cosine theta would be this vector.', 'That would be b cosine theta.', 'The magnitude of b cosine theta.', 'So you could say how much of vector b goes in the same', 'direction as a?', 'Then multiply the two magnitudes.', 'Or you 
could say how much of vector a goes in the same', 'direction is vector b?', 'And then multiply the two magnitudes.', 'And now, this is, I think, a good time to just make sure', 'you understand the difference between the dot product and', 'the cross product.', 'The dot product ends up with just a number.', 'You multiply two vectors and all you have is a number.', 'You end up with just a scalar quantity.', 'And why is that interesting?', 'Well, it tells you how much do these-- you could almost say--', 'these vectors reinforce each other.', u"Because you're taking the parts of their magnitudes that", 'go in the same direction and multiplying them.', 'The cross product is actually almost the opposite.', u"You're taking their orthogonal components, right?", 'The difference was, this was a a sine of theta.', u"I don't want to mess you up this picture too much.", 'But you should review the cross product videos.', u"And I'll do another video where I actually compare and", 'contrast them.', u"But the cross product is, you're saying, let's multiply", 'the magnitudes of the vectors that are perpendicular to each', u"other, that aren't going in the same direction, that are", 'actually orthogonal to each other.', u"And then, you have to pick a direction since you're not", 'saying, well, the same direction that', u"they're both going in.", u"So you're picking the direction that's orthogonal to", 'both vectors.', u"And then, that's why the orientation matters and you", u"have to take the right hand rule, because there's actually", 'two vectors that are perpendicular to any other two', 'vectors in three dimensions.', u"Anyway, I'm all out of time.", u"I'll continue this, hopefully not too confusing, discussion", 'in the next video.', u"I'll compare and contrast the cross", 'product and the dot product.', 'See you in the next video.']

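On a side note, your code will raise an error if result is not defined when the exception occurs on the first iteration, and if it occurs on any later iteration you will append the previous result a second time. You need a continue where you have pass. You also should not use a blanket except: catch only the exception you expect, and print or log the error.

Answer 1: (score: 1)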
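The loop fix described above could be sketched roughly as follows. This is a minimal sketch, assuming Python 3 and the standard-library xml.etree.ElementTree rather than the original Python 2 setup; the helper name collect_text and the sample fragments are hypothetical and only there for illustration. It also uses itertext() instead of joining element.text values, which is a small substitution to avoid None entries.

```python
import logging
import xml.etree.ElementTree as ET


def collect_text(fragments):
    """Join the text of each XML fragment, skipping fragments that fail to parse.

    collect_text is a hypothetical helper, not part of the original code.
    """
    all_text = []
    for fragment in fragments:
        try:
            root = ET.fromstring(fragment)
        except ET.ParseError as exc:
            # Catch only the parse failure, log it, and skip this fragment
            # so that a stale result from an earlier iteration is never appended.
            logging.warning("skipping unparsable fragment: %s", exc)
            continue
        # itertext() yields only real strings, so join() never sees None.
        all_text.append(" ".join(root.itertext()))
    return all_text


# Hypothetical sample input: one well-formed fragment, one malformed fragment.
samples = ["<text>Vector b.</text>", "<text>broken"]
print(collect_text(samples))
```

With continue in the except branch, a malformed fragment is simply left out of the list instead of crashing the first iteration or duplicating the previous line.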

&#39; is an HTML character reference (the apostrophe) that has not been unescaped. On Python 3.4 and above, use:

import html
html.unescape(<your_string>)
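As a rough illustration of both clean-up steps together, here is a minimal sketch assuming Python 3.4 or later. The helper name clean_caption is hypothetical, and the sample string is an assumption modelled on the line described in the question: an escaped apostrophe plus an embedded newline.

```python
import html


def clean_caption(raw):
    """Unescape HTML character references and turn newlines into spaces.

    clean_caption is a hypothetical helper name used for illustration.
    """
    return html.unescape(raw).replace("\n", " ")


# Sample string with an escaped apostrophe and an embedded newline.
raw = "Let&#39;s learn a little bit\nabout the dot product."
print(clean_caption(raw))
# prints: Let's learn a little bit about the dot product.
```

Unescaping before replacing newlines is a small design choice: it means a newline written as a character reference (such as &#10;) would be replaced as well.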