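I am using Python and Beautiful Soup, with the help of cookies, to extract Hindi, Tamil, and Punjabi (Indian language) posts from a social networking site. I am able to extract the posts, but the extracted text is not in the language itself; it comes out in some encoded form. I want it in the original language. For example, Hindi posts should be extracted as Hindi text.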
import mechanize
import cookielib
from bs4 import BeautifulSoup
import urllib2
import csv
from html2text import html2text
import re
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follow refresh 0, but do not hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
urls = []
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'),('Connection','keep-alive'),('Accept','application/json, text/javascript, */*; q=0.01'),('Accept-Encoding','gzip, deflate, sdch'),('Host','link'),('Referer','https://link/'),('X-Requested-With','XMLHttpRequest'),('Accept-Language','en-US,en;q=0.8')]
br.open('https://link')
br._factory.is_html = True
# Select the first (index zero) form
#br.select_form(predicate=lambda f: f.attrs.get('id', None) == 'login_form')
br.select_form(nr=0)
# User credentials
br.form['USER'] = 'username'
br.form['PASSWORD'] = 'password'
# Login
br.submit()
soup = BeautifulSoup(br.response().read())
for tag in soup.find_all("div", re.compile("classname")):
    #print tag
    for tag1 in tag.find_all(re.compile("^p")):
        print tag1
Sample output:
\u0baa\u0b9f\u0bbf\u0ba4\u0bcd\u0ba4\u0ba4\u0bbf\u0bb2\u0bcd\u0baa\u0bbf\u0b9f\u0bbf\u0ba4\u0bcd\u0ba4\u0ba4\u0bc1\u263a
Expected output: the text written in that particular language (here, Tamil)
Answer 0: (score: 0)
unicode-escape worked for me.
.decode('unicode-escape')
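
To illustrate the one-liner above, here is a minimal sketch that should run on Python 3 as well as on Python 2, which the question uses. It assumes the scraped text arrives as a byte string containing literal backslash-u sequences, as in the sample output. The name `decode_escapes` is made up for this example. One caveat: the `unicode-escape` codec reads any non-escaped bytes as Latin-1, so it is only safe when the input is plain ASCII plus escapes.

```python
def decode_escapes(raw):
    """Decode literal \\uXXXX sequences in a byte string into text.

    Assumption: `raw` holds only ASCII bytes plus backslash escapes.
    """
    return raw.decode('unicode-escape')


# First word of the sample output from the question, as literal escapes.
sample = b'\\u0baa\\u0b9f\\u0bbf\\u0ba4\\u0bcd\\u0ba4\\u0ba4\\u0bbf\\u0bb2\\u0bcd'
decoded = decode_escapes(sample)

# Each six-character escape sequence becomes a single code point.
print(len(decoded))
```

Printing `decoded` itself should show the Tamil text, provided the terminal can display UTF-8.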