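I'm a beginner. I'm fetching text data from a web page and trying to split it into words on any non-alphanumeric character. I don't understand at all why this error occurs. I have also come across these posts: python re.search error TypeError: expected string or buffer, pattern matching in malayalam makes TypeError: expected string or buffer, and many others, but I still can't get rid of the error.
My function: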
    def separatewords(self, text):
        splitter = re.compile('\\W*')
        return [s.lower() for s in splitter.split(text) if len(s)>2 and len(s)<20]
Error:
TypeError: expected string or buffer
Update
This is the output of the text from the link that I am passing in:
t [u'html', u'[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9"> <![endif]', u'[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9"> <![endif]', u'[if IE 8]> <html class="no-js ie8 lt-ie9"> <![endif]', u'[if gt IE 8]><!', u' ', u'<![endif]', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'[if (lte IE 8)&(!IEMobile)]>\n <link href="/static/stylesheets/no-mq.css" rel="stylesheet" type="text/css" media="screen" />\n \n \n <![endif]', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u' white shape ', u'\n', u' python blue ', u'\n', u'\n', u'Welcome to Python.org', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'Notice:', u' While Javascript is not essential for this website, your interaction with the content will be limited. Please turn Javascript on for the full experience. ', u'\n', u'\n', u'[if lt IE 8]>\n <div id="oldie-warning" class="do-not-print">\n <p><strong>Notice:</strong> Your browser is <em>ancient</em> and <a href="http://www.ie6countdown.com/">Microsoft agrees</a>. 
<a href="http://browsehappy.com/">Upgrade to a different browser</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to experience a better web.</p>\n </div>\n <![endif]', u'\n', u' Sister Site Links ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u' Header elements ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'Search This Site', u'\n', u'\n', u'\n GO\n ', u'\n', u'[if IE]><input type="text" style="display: none;" disabled="disabled" size="1" tabindex="4"><![endif]', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u' end options-bar ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u' ', u' for optional "do-not-print" class ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'# Python 3: Fibonacci series up to n', u"\r\n>>> def fib(n):\r\n>>> a, b = 0, 1\r\n>>> while a < n:\r\n>>> print(a, end=' ')\r\n>>> a, b = b, a+b\r\n>>> print()\r\n>>> fib(1000)\r\n", u'0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987', u'\n', u'Functions Defined', u'\n', u'The core of extensible programming is 
defining functions. Python allows mandatory and optional arguments, keyword arguments, and even arbitrary argument lists. ', u'\n', u'\n', u'\n', u'# Python 3: List comprehensions', u"\r\n>>> fruits = ['Banana', 'Apple', 'Lime']\r\n>>> loud_fruits = [fruit.upper() for fruit in fruits]\r\n>>> print(loud_fruits)\r\n", u"['BANANA', 'APPLE', 'LIME']", u'\r\n\r\n', u'# List and the enumerate function', u'\r\n>>> list(enumerate(fruits))\r\n', u"[(0, 'Banana'), (1, 'Apple'), (2, 'Lime')]", u'\n', u'Compound Data Types', u'\n', u'Lists (known as arrays in other languages) are one of the compound data types that Python understands. Lists can be indexed, sliced and manipulated with other built-in functions. ', u'\n', u'\n', u'\n', u'# Python 3: Simple arithmetic', u'\r\n>>> 1 / 2\r\n', u'0.5', u'\r\n>>> 2 ** 3\r\n', u'8', u'\r\n>>> 17 / 3 ', u'# classic division returns a float', u'\r\n', u'5.666666666666667', u'\r\n>>> 17 // 3 ', u'# floor division', u'\r\n', u'5', u'\n', u'Intuitive Interpretation', u'\n', u'Calculations are simple with Python, and expression syntax is straightforward: the operators ', u'+', u', ', u'-', u', ', u'*', u' and ', u'/', u' work as expected; parentheses ', u'()', u' can be used for grouping. ', u'.', u'\n', u'\n', u'\n', u'# Python 3: Simple output (with Unicode)', u'\r\n>>> print("Hello, I\'m Python!")\r\n', u"Hello, I'm Python!", u'\r\n\r\n', u'# Input, assignment', u"\r\n>>> name = input('What is your name?\\n')\r\n>>> print('Hi, %s.' % name)\r\n", u'What is your name?\r\nPython\r\nHi, Python.', u'\n', u'Quick & Easy to Learn', u'\n', u'Experienced programmers in any other language can pick up Python very quickly, and beginners find the clean syntax and indentation structure easy to learn. ', u' with our Python\xa03 overview.', u'\n', u'\n', u'\n', u'\n', u'# For loop on a list', u"\r\n>>> numbers = [2, 4, 6, 8]\r\n>>> product = 1\r\n>>> for number in numbers:\r\n... product = product * number\r\n... 
\r\n>>> print('The product is:', product)\r\n", u'The product is: 384', u'\n', u'All the Flow You\u2019d Expect', u'\n', u'Python knows the usual control flow statements that other languages speak \u2014 ', u'if', u', ', u'for', u', ', u'while', u' and ', u'range', u' \u2014 with some of its own twists, of course. ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'Python is a programming language that lets you work quickly ', u'and integrate systems more effectively. ', u'\n', u'\n', u' end .container ', u'\n', u'\n', u'\n', u' Main Content Column ', u'\n', u'\n', u'\n', u'\n', u'\n', u'Get Started', u'\n', u"Whether you're new to programming or an experienced developer, it's easy to learn and use Python.", u'\n', u'\n', u'\n', u'\n', u'Download', u'\n', u'Python source code and installers are available for download for all versions! Not sure which version to use? ', u'.', u'\n', u'Latest: ', u' - ', u'\n', u'\n', u'\n', u'Docs', u'\n', u"Documentation for Python's standard library, along with tutorials and guides, are available online.", u'\n', u'\n', u'\n', u'\n', u'Jobs', u'\n', u"Looking for work or have a Python related position that you're trying to hire for? 
Our ", u'relaunched community-run job board', u' is the place to go.', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'Latest News', u'\n', u'\n', u'\n', u'\n', u'2015-', u'12-07', u'\n', u'\n', u'\n', u'2015-', u'12-05', u'\n', u'\n', u'\n', u'2015-', u'11-22', u'\n', u'\n', u'\n', u'2015-', u'09-13', u'\n', u'\n', u'\n', u'2015-', u'09-09', u'\n', u'\n', u'\n', u' end .shrubbery ', u'\n', u'\n', u'\n', u'\n', u'Upcoming Events', u'\n', u'\n', u'\n', u'\n', u'2016-', u'03-05', u'\n', u'\n', u'\n', u'2016-', u'03-11', u'\n', u'\n', u'\n', u'2016-', u'03-12', u'\n', u'\n', u'\n', u'2016-', u'04-01', u'\n', u'\n', u'\n', u'2016-', u'04-02', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'Success Stories', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u' ', u'by Tim Fortenberry', u'\n', u'\n', u'\n', u'\n', u'\n', u' end .shrubbery ', u'\n', u'\n', u'\n', u'\n', u'Use Python for\u2026', u'\n', u'\n', u'\n', u'Web Programming', u':\r\n ', u', ', u', ', u', ', u', ', u', ', u'\n', u'GUI Development', u':\r\n ', u', ', u', ', u', ', u', ', u'\n', u'Scientific and Numeric', u':\r\n ', u'\n', u', ', u', ', u'\n', u'Software Development', u':\r\n ', u', ', u', ', u'\n', u'System Administration', u':\r\n ', u', ', u', ', u'\n', u'\n', u' end .shrubbery ', u'\n', u'\n', u'\n', u'\n', u'\n', u'>>>', u' ', u': The future of Python', u' is discussed here.', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'>>>', u' ', u'\n', u'\n', u'The mission of the Python Software Foundation is to promote, protect, and advance the Python programming language, and to support and facilitate the growth of a diverse and international community of Python programmers. 
', u' ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u' end .container ', u'\n', u' end #content .content-wrapper ', u'\n', u' Footer and social media list ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u' end .container ', u'\n', u' ', u' end .main-footer-links ', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'\n', u'<li class="tier-1 element-3"><a href="#"><span class="say-no-more">Website</span> Colophon</a></li>', u'\n', u'\n', u'\n', u'\n', u'Copyright \xa92001-2016.', u'\n \xa0', u'\n \xa0', u'\n \xa0', u'\n', u'\n', u'\n', u' end .container ', u'\n', u' end .site-base ', u'\n', u'\n', u' end #touchnav-wrapper ', u'\n', u'\n', u'\n', u'\n', u'\n', u'[if lte IE 7]>\n <script type="text/javascript" src="/static/js/plugins/IE8-min.js" charset="utf-8"></script>\n \n \n <![endif]', u'\n', u'[if lte IE 8]>\n <script type="text/javascript" src="/static/js/plugins/getComputedStyle-min.js" charset="utf-8"></script>\n \n \n <![endif]', u'\n', u'\n', u'\n']
Update 2
My function for extracting the text:
    def getTtextonly(self, soup):
        url = soup
        #url = "http://www.cplusplus.com/doc/tutorial/program_structure/"
        html = urllib.urlopen(url).read()
        soup = BeautifulSoup(html)
        # kill all script and style elements
        for script in soup(["script", "style","a","<div id=\"bottom\" >"]):
            script.extract() # rip it out
        text = soup.findAll(text=True)
        return text
What am I doing wrong?
Answer 0 (score: 2)
Judging by the sample text you posted, you are passing a list of strings rather than a single string, so here is a fix for your code:
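    def separatewords(self, text):
        # text is a list of strings, so split each element in turn.
        # \W+ instead of \W*: since Python 3.7, split() also splits on empty matches.
        splitter = re.compile(r'\W+')
        return [s.lower() for t in text for s in splitter.split(t) if 2 < len(s) < 20]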
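To see the nested comprehension at work outside the class, here is a minimal standalone sketch. It assumes Python 3, and the names `SPLITTER`, `separate_words` and `sample` are made up for illustration; `sample` is a shortened stand-in for the list shown in the question, not the real scraped output.

```python
import re

# Compile once; \W+ matches one or more non-word characters.
SPLITTER = re.compile(r'\W+')


def separate_words(chunks):
    """Split every string in `chunks` and keep words of 3 to 19 characters."""
    return [
        word.lower()
        for chunk in chunks                 # outer loop: each text fragment
        for word in SPLITTER.split(chunk)   # inner loop: each piece of it
        if 2 < len(word) < 20
    ]


# Hypothetical, shortened stand-in for the list in the question.
sample = ['Welcome to Python.org', '\n', ' python blue ', 'Notice:']
print(separate_words(sample))
# -> ['welcome', 'python', 'org', 'python', 'blue', 'notice']
```

Whitespace-only fragments such as '\n' split into empty strings, which the length filter drops, so they need no special handling.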
Answer 1 (score: 0)
The value you are passing as text is not a string or buffer.
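A quick way to confirm this is to reproduce the error on a small input. The sketch below assumes Python 3, where the message wording differs from the Python 2 "expected string or buffer" but the exception type is the same. `fragments` is a hypothetical stand-in for what `findAll(text=True)` returns, and joining the list is just one possible way to obtain a single string.

```python
import re

splitter = re.compile(r'\W+')

# Hypothetical stand-in for the list returned by findAll(text=True).
fragments = ['Welcome to Python.org', '\n']

raised = False
try:
    splitter.split(fragments)  # a list, not a string
except TypeError as exc:
    raised = True
    print('TypeError:', exc)

# Joining the fragments first gives split() the single string it expects.
text = ' '.join(fragments)
words = [w for w in splitter.split(text) if w]  # drop empty pieces
print(words)
# -> ['Welcome', 'to', 'Python', 'org']
```

Checking `type(text)` right before the call to `split` shows the same thing without triggering the exception.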