Why can't I split on spaces?

Time: 2012-01-27 20:01:35

Tags: python

Here is the string:

u'\u041a\u0440\u0430\u0441\u0438\u043b\u044c\u043d\u0438\u043a\u043e\u0432 \u0421\u0435\u0440\u0433\u0435\u0439 \u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u0438\u0447'

If I call .split() on it, it doesn't work: it returns only one part. What could be wrong?

UPD. Full code:

page = urllib.urlopen('http://www.rea.ru/Main.aspx?page=Krasil_nikov_Sergejj_Aleksandrovich')
soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
full_name = soup.find('div', {'class': 'flagPageTitle'}).text.strip().split()
self.response.out.write(str(full_name) + '<br>')

4 answers:

Answer 0 (score: 7)

Ah. Note that the key was in the information you didn't post until asked. Your string isn't what it appears to be:

[u'\u041a\u0440\u0430\u0441\u0438\u043b\u044c\u043d\u0438\u043a\u043e\u0432&nbsp;\u0421\u0435\u0440\u0433\u0435\u0439&nbsp;\u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u0438\u0447']

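Instead of spaces, it contains "&nbsp;", the HTML entity for the non-breaking space character. There are several Stack Overflow questions about the best way to remove these; I don't know which one is best.

[IOW, search for "BeautifulSoup nbsp".]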
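
One way to handle this can be sketched with the standard library alone. The sketch assumes Python 3 (the question's code is Python 2), and split_name is a made-up helper name. It decodes the entity with html.unescape, which turns &nbsp; into the character U+00A0, and then relies on str.split() with no separator treating U+00A0 as whitespace.

```python
from html import unescape


def split_name(raw):
    """Decode HTML entities, then split on any Unicode whitespace."""
    # unescape() turns the literal text "&nbsp;" into the character U+00A0.
    decoded = unescape(raw)
    # With no separator, str.split() splits on Unicode whitespace,
    # which includes U+00A0 (NO-BREAK SPACE).
    return decoded.split()


raw = "Красильников&nbsp;Сергей&nbsp;Александрович"
parts = split_name(raw)
print(len(parts))
```

This does not depend on how the parser exposed the entity: an undecoded &nbsp; and an already decoded U+00A0 both end up as separators.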

Answer 1 (score: 2)

I ran your code and got:

>>> from BeautifulSoup import BeautifulSoup
>>> import urllib
>>> page = urllib.urlopen('http://www.rea.ru/Main.aspx?page=Krasil_nikov_Sergejj_Aleksandrovich')
>>> soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
>>> print soup.find('div', {'class': 'flagPageTitle'}).text
Красильников&nbsp;Сергей&nbsp;Александрович

As you can see, the words are separated not by regular spaces but by HTML spaces (&nbsp;, the non-breaking space). Using .split('&nbsp;') solves your problem:

>>> full_name = soup.find('div', {'class': 'flagPageTitle'}).text.strip().split('&nbsp;')
>>> len(full_name)
3
>>> for s in full_name: print s
... 
Красильников
Сергей
Александрович
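
If the text might mix the literal entity with ordinary whitespace, splitting on a single fixed string is brittle. A hedged alternative, written for Python 3 with the illustrative names SEPARATOR and split_words (neither appears in the answer above), is to split on a regular expression that accepts both:

```python
import re

# Matches a run of literal "&nbsp;" entities and/or whitespace characters.
# For str patterns, \s also matches U+00A0.
SEPARATOR = re.compile(r"(?:&nbsp;|\s)+")


def split_words(text):
    # Drop empty strings produced by leading or trailing separators.
    return [word for word in SEPARATOR.split(text) if word]


words = split_words(" Красильников&nbsp;Сергей Александрович&nbsp;")
print(words)
```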

Answer 2 (score: 0)

Because your string is separated by &nbsp; rather than by spaces.

>>> full_name = soup.find('div', {'class': 'flagPageTitle'}).text.strip()
>>> full_name
u'\u041a\u0440\u0430\u0441\u0438\u043b\u044c\u043d\u0438\u043a\u043e\u0432&nbsp;\u0421\u0435\u0440\u0433\u0435\u0439&nbsp;\u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u0438\u0447'

>>> full_name.split("&nbsp;")
[u'\u041a\u0440\u0430\u0441\u0438\u043b\u044c\u043d\u0438\u043a\u043e\u0432', u'\u0421\u0435\u0440\u0433\u0435\u0439', u'\u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u0438\u0447']
>>> len(full_name.split("&nbsp;"))
3

Answer 3 (score: 0)

In Python 3, remove the &nbsp; before splitting.
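
A minimal sketch of that idea, assuming the text may contain either the undecoded entity or the decoded U+00A0 character. strip_nbsp is an illustrative name and this is not code from the original answer.

```python
def strip_nbsp(text):
    """Replace non-breaking spaces, in either form, with ordinary spaces."""
    # Handle both the undecoded entity and the decoded U+00A0 character.
    return text.replace("&nbsp;", " ").replace("\u00a0", " ")


cleaned = strip_nbsp("Красильников&nbsp;Сергей\u00a0Александрович")
print(cleaned.split(" "))
```

Normalizing first keeps the cleaned string usable for display as well as for splitting.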
