Question

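I'm trying to parse and print some data from Twitter. I have code that prints tweets, but when I apply the same code to usernames it doesn't seem to work. I want to do this without using the Twitter API.

Here is the code that prints the tweets: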

def main():
    try:
        sourceCode = opener.open('https://twitter.com/search?f=realtime&q='\
                                 +keyWord+'&src=hash').read()
        splitSource = re.findall(r'<p class="js-tweet-text tweet-text">(.*?)</p>', sourceCode)
        print len(splitSource)
        print splitSource
        for item in splitSource:
            print '\n _____________________\n'
            print re.sub(r'<.*?>','',item)



    except Exception, e:
        print str(e)
        print 'error in main try'
        time.sleep(555)

main()

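To print the username information, I changed "opener" to "browser". It still finds and opens the page, so that isn't the problem. At least I don't think it is.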

def main():
    try:
        pageSource = browser.open('https://twitter.com/search?q='\
                                 +firstName+'%20'+lastName+'&src=hash&mode=users').read()
        splitSource = re.findall(r'<p class="bio ">(.*?)</p>', pageSource)
        for item in splitSource:
            print '\n'
            print re.sub(r'<.*?>','',item)
    except Exception, e:
        print str(e)
        print 'error in main try'


main()

It does print the page source. The problem seems to be this line:

splitSource = re.findall(r'<p class="bio ">(.*?)</p>', pageSource)

It doesn't seem to find anything at all. Here is a copy of the source I'm trying to extract the information from:

  <div class="content">
    <div class="stream-item-header">
      <a class="account-group js-user-profile-link" href="/BarackObama" >
        <img class="avatar js-action-profile-avatar " src="https://pbs.twimg.com/profile_images/451007105391022080/iu1f7brY_normal.png" alt="" data-user-id="813286"/>
        <strong class="fullname js-action-profile-name">Barack Obama</strong><span class="Icon Icon--verified Icon--small"><span class="u-isHiddenVisually">Verified account</span></span>
          <span class="username js-action-profile-name">@BarackObama</span>


      </a>
    </div>
      <p class="bio ">
          This account is run by Organizing for Action staff. Tweets from the President are signed -bo.
      </p>







  </div>

I feel like something in this source is preventing me from getting the bio information. The spacing, maybe? I dunno.

Answer 1

As usual, don't use regex to parse HTML.

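In practice, there is a newline between '<p class="bio ">' and the text that '(.*?)' is supposed to capture, which means you need to match with re.DOTALL so that . also matches newlines. Alternatively, you can use '<p class="bio ">\s*(.*?)\s*</p>', since \s matches the newlines (if present). This also gives cleaner output, because the surrounding whitespace stays outside the captured group.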

import re

pat = re.compile(r'<p class="bio ">\s*(.*?)\s*</p>')
pat.findall(src) # src is your last codeblock from above
## OUTPUT:
['This account is run by Organizing for Action staff. Tweets from the President are signed -bo.']
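
To compare the two fixes side by side, here is a small sketch that runs the original pattern, the re.DOTALL version, and the \s* version against a trimmed copy of the markup from the question. The name `src` and the shortened HTML are assumptions made for illustration, not the real page.

```python
import re

# Trimmed stand-in for the markup in the question (assumed, not the live page).
src = '''<div class="content">
      <p class="bio ">
          This account is run by Organizing for Action staff. Tweets from the President are signed -bo.
      </p>
</div>'''

# Original pattern: "." cannot cross the newline after the opening tag, so nothing matches.
plain = re.findall(r'<p class="bio ">(.*?)</p>', src)

# re.DOTALL lets "." match newlines; the capture keeps the surrounding whitespace.
dotall = re.findall(r'<p class="bio ">(.*?)</p>', src, flags=re.DOTALL)

# \s* consumes the newlines and indentation outside the capture group.
trimmed = re.findall(r'<p class="bio ">\s*(.*?)\s*</p>', src)

print(plain)                        # no matches
print([s.strip() for s in dotall])  # bio text once the whitespace is stripped
print(trimmed)                      # bio text, already clean
```

The re.DOTALL result still needs a .strip() call, which is why the \s* pattern tends to be the more convenient of the two here.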

If you want to go the BeautifulSoup route instead, the Python 3 code looks like this:

from bs4 import BeautifulSoup

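soup = BeautifulSoup(src, 'html.parser') # src is your last codeblock from your question
# get_text() returns the tag's text; class_ matches on the class name itself, without the trailing space
[p_bio.get_text().strip() for p_bio in soup('p', class_='bio')]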
## OUTPUT:
['This account is run by Organizing for Action staff. Tweets from the President are signed -bo.']

Answer 2

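Using regex to parse arbitrary HTML is very difficult, and only really works when you are 100% sure what the output will be. That said, as stated in the other really sane (but not as funny) answer:

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.

This means checking the page source, not the DOM shown in the browser's developer tools.

As for your current example, as the previous answer points out, you aren't capturing newlines, so you need to add the appropriate regex flags: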

splitSource = re.findall(r'<p class="bio ">(.*?)</p>', pageSource, flags=re.DOTALL)

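Better solutions involve HTML-specific tools like BeautifulSoup, because they can handle HTML's quirks. For example, you are doing a regex match on class names, and classes are unordered.

So these are equivalent HTML declarations: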

<div class="foo bar">
<div class="bar foo">

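But both regex engines and XML parsers will have trouble finding both of these with the same query, whereas HTML-specific tools can use CSS selectors to find them.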
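
As a rough illustration of the class-ordering point, the sketch below compares a literal regex with an order-insensitive class check. It uses only the standard library's html.parser rather than BeautifulSoup, and the `ClassFinder` helper and the sample markup are hypothetical names made up for this example. With BeautifulSoup, a CSS selector such as div.foo.bar passed to soup.select() would be the more usual way to do the same thing.

```python
import re
from html.parser import HTMLParser

# Two equivalent declarations with the class names in different order (sample data).
html = '<div class="foo bar">first</div><div class="bar foo">second</div>'

# A literal regex on the attribute string only matches one ordering.
regex_hits = re.findall(r'<div class="foo bar">', html)


class ClassFinder(HTMLParser):
    """Hypothetical helper: counts tags that carry all of the wanted classes."""

    def __init__(self, tag, wanted):
        super().__init__()
        self.tag = tag
        self.wanted = set(wanted)
        self.hits = 0

    def handle_starttag(self, tag, attrs):
        # Treat the class attribute as a set, so order does not matter.
        classes = set((dict(attrs).get('class') or '').split())
        if tag == self.tag and self.wanted <= classes:
            self.hits += 1


finder = ClassFinder('div', ['foo', 'bar'])
finder.feed(html)

print(len(regex_hits))  # 1
print(finder.hits)      # 2
```

Treating the class attribute as a set of tokens is essentially what a CSS class selector does, which is why the ordering stops mattering.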

Why doesn't this regex find any data?
