Question

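Hi, I'm using BeautifulSoup to parse a website and get names as output. But after running the script, I get [u'word1', u'word2', u'word3'] as output. What I'm looking for is 'word1 word2 word3'. How do I get rid of the u' and make the result a single string?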

from bs4 import BeautifulSoup
import urllib2
import re

myfile = open("base/dogs.txt","w+")
myfile.close()

url="http://trackinfo.com/entries-race.jsp?raceid=GBR$20140302A01"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
names=soup.findAll('a',{'href':re.compile("dog")})
myfile = open("base/dogs.txt","w+")
for eachname in names:
    d = (str(eachname.string.split()))+"\n"
    print [x.encode('ascii') for x in d]
    myfile.write(d)

myfile.close()

Answer 1

BeautifulSoup and Unicode, Dammit!

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("Sacr&eacute; bleu!")
<html><body><p>Sacré bleu!</p></body></html>

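Isn't that nice? When the soup is made, the document is converted to Unicode, and HTML entities are converted to Unicode characters! So you get Unicode objects as results. That is by design, and it is correct.

So your question is really about Unicode. Unicode is explained in this video. Don't like videos? Read Introduction to Unicode.

The u is short for "the following string is Unicode". You can now use all Unicode characters instead of only the 128 ASCII characters; at the moment there are more than 110,000 of them. The u is not saved to a file or database. It is visual feedback, so you can see that you are dealing with a Unicode string. Use it as if it were a normal string, because it is a normal string.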

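The moral of the story:

When you see u'…', treat it as an ordinary string; the u is only part of the printed representation.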
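A small sketch of the point above, using only the standard library: the u marks the type in the printed representation and is not one of the string's characters. The variable names are illustrative, and note that under Python 3 every str is already Unicode, so the prefix no longer appears even in the representation.

```python
# -*- coding: utf-8 -*-
word = u'word1'            # the u prefix is literal syntax, not data
accented = u'Sacr\xe9'     # "Sacré", the kind of text BeautifulSoup returns

# The prefix is not counted among the characters of the string.
word_length = len(word)
first_char = word[0]

# Unicode strings concatenate and compare like ordinary strings.
greeting = accented + u' bleu!'

# Encoding is a separate, explicit step that turns text into bytes.
utf8_bytes = accented.encode('utf-8')
```

Here the accented word has five characters but six UTF-8 bytes, which is why the text is best kept as Unicode until it is actually written out.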

Answer 2

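The answers that use .encode() do what you asked for, but probably not what you need. You can keep the strings as Unicode and simply avoid displaying them in a form that shows their encoding or type. That way they are still [u'word1', u'word2', u'word3'], which avoids breaking support for languages that cannot be represented in ASCII, but they print as:

word1 word2 word3

Just print the strings themselves rather than the list that contains them.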
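As a hedged sketch of that idea (not necessarily the snippet this answer originally showed), the words can be joined into one Unicode string and printed. The list name words is made up for illustration.

```python
words = [u'word1', u'word2', u'word3']

# Printing the list shows each element's repr, which is where u'...'
# comes from on Python 2. Joining gives a single string, and printing
# a string shows only its characters.
line = u' '.join(words)
print(line)
```

Nothing is encoded here, so names containing non-ASCII characters would survive the join unchanged.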

Answer 3

BeautifulSoup is a really great HTML parser, so use it to its full potential when parsing HTML. Just modify your code as follows:

names=[texts.text for texts in soup.findAll('a',{'href':re.compile("dog")})]

This takes the text between the anchor tags, so you no longer need d = (str(eachname.string.split()))+"\n"
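For context, a sketch of why that expression produced the bracketed output in the first place: split() returns a list, and str() of a list is the list's representation, brackets and quotes included. The sample string below is made up.

```python
raw = u'  word1 word2\tword3 \n'   # hypothetical link text with stray whitespace

parts = raw.split()                # a list of three words
as_list_text = str(parts)          # representation of the list, brackets included
as_plain_text = u' '.join(parts)   # one string, words separated by single spaces
```

On Python 2 the representation also carries the u prefix on each element, which matches the output shown in the question.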

So the final code will be

from bs4 import BeautifulSoup
import urllib2
import re
import codecs

url = "http://trackinfo.com/entries-race.jsp?raceid=GBR$20140302A01"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

# Collect the text of every <a> tag whose href contains "dog".
names = [texts.text for texts in soup.findAll('a', {'href': re.compile("dog")})]

# codecs.open() encodes the unicode strings as UTF-8 when writing.
myfile = codecs.open("base/dogs.txt", "wb", encoding="Utf-8")
for eachname in names:
    # Drop tabs and newlines inside the link text.
    eachname = re.sub(r"[\t\n]", "", eachname)
    myfile.write(eachname + "\n")
myfile.close()

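If you need the text in the file without the u, use codecs.open() or io.open() to open a text file with an appropriate text encoding (i.e. encoding="..."), rather than using open() to open a byte file.

That would be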

myfile = codecs.open("base/dogs.txt","w+",encoding="Utf-8")

in your case.

The output in the file will then be

BARTSSHESWAYCOOL                            
DK'S SEND ALL                            
SHAKIN THINGS UP                            
FROSTED COOKIE                            
JD EMBELLISH                            
WW CASH N CARRY                            
FREEDOM ROCK                            
HVAC BUTCHIE
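The advice above about codecs.open() or io.open() can be sketched with io.open(), which behaves the same on Python 2 and Python 3. The temporary directory and the two sample names (taken from the output listed above) are stand-ins for the real scrape and for base/dogs.txt.

```python
import io
import os
import tempfile

names = [u'FROSTED COOKIE', u'JD EMBELLISH']   # stand-ins for scraped names

# Hypothetical output location; the original script writes to base/dogs.txt.
path = os.path.join(tempfile.mkdtemp(), 'dogs.txt')

# Text mode with an explicit encoding accepts unicode strings directly.
with io.open(path, 'w', encoding='utf-8') as myfile:
    for name in names:
        myfile.write(name + u'\n')

# Reading the file back shows only the characters, with no u prefix.
with io.open(path, 'r', encoding='utf-8') as myfile:
    saved = myfile.read()
```

Using a with block also closes the file even if a write fails, which the explicit close() call does not guarantee.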

See also this problem, where I asked almost the same question.
