Question

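I know I'm doing something stupid, but I can't figure it out. Could someone smarter than me please tell me what's wrong? Thanks. This script is supposed to open a URL, get the HTML, apply a regular expression to pull out the content of interest, store that content in a file, and then repeat for the next URL.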

from selenium import selenium
import unittest, time, re, csv, string, logging, codecs

class Untitled(unittest.TestCase):
    def setUp(self):
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*firefox", "http://www.baseurl.com")
        self.selenium.start()
        self.selenium.set_timeout("60000")

    def test_untitled(self):
        sel = self.selenium
        spamReader = csv.reader(open('urlExtentions.csv', 'rb'))
        for row in spamReader:
            try:
                sel.open(row[0])
            except Exception, e:
                ofile = open('outputTest.csv', 'ab')
                ofile.write("error on %s: %s" % (row[0],e))
            else:
                time.sleep(5)
                htmlSource = sel.get_html_source()
                htmlSource2 = htmlSource.encode('utf-8')

    ##Next line throws "TypeError: 'int' object is not callable"

                bodyText = re.DOTALL('<h3>.*?<footer>', htmlSource2)

                ofile = open('output.txt', 'ab')
                ofile.write(bodyText.encode('utf-8') + '\n')
            ofile.close()

    def tearDown(self):
        self.selenium.stop()
        self.assertEqual([], self.verificationErrors)

if __name__ == "__main__":
     unittest.main()

Answer 1

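re.DOTALL is a constant in the re module. It is not a function, so you cannot call it. It is meant to be passed as a flag in the flags argument of the re module's functions.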
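
To see why the call fails, here is a minimal sketch that reproduces the mistake. It assumes Python 3, where the error message names 'RegexFlag' rather than 'int'; the helper name call_flag_like_a_function is made up for illustration.

```python
import re

# re.DOTALL is a flag value (an integer-like constant), not a function.
print(isinstance(re.DOTALL, int))   # True
print(callable(re.DOTALL))          # False


def call_flag_like_a_function(html):
    """Repeat the mistake from the question and return the error message."""
    try:
        re.DOTALL('<h3>.*?<footer>', html)
    except TypeError as exc:
        return str(exc)
    return None


# Hypothetical sample input, only here to trigger the error.
message = call_flag_like_a_function('<h3>title</h3>\n<footer>')
print(message)
```

The message ends in "object is not callable", which is the same failure the question hits on Python 2.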

If you want to search with a regular expression, use:

bodyText = re.search('<h3>.*?<footer>', htmlSource2, flags=re.DOTALL)

re.search() returns a MatchObject (or None if nothing matches), so you probably want the matched text:

bodyText = bodyText.group()
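
Putting those two steps together, a sketch might look like the following. The function name extract_body and the sample HTML are assumptions for illustration, and the None check is an addition so a page without a match does not raise AttributeError.

```python
import re


def extract_body(html):
    """Return the text from the first <h3> up to the next <footer>, or None."""
    # flags=re.DOTALL lets '.' match newlines, so the match can span lines.
    match = re.search('<h3>.*?<footer>', html, flags=re.DOTALL)
    if match is None:
        return None
    return match.group()


# Hypothetical sample page; the real input would come from get_html_source().
sample = '<p>intro</p>\n<h3>Title</h3>\n<p>body</p>\n<footer>end</footer>'
print(extract_body(sample))
print(extract_body('<p>no heading here</p>'))
```

Without re.DOTALL the same pattern finds nothing in the sample, because '.' stops at each newline.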

Note that you have already encoded the HTML to UTF-8:

htmlSource2 = htmlSource.encode('utf-8')

so you do not want to encode it again:

ofile.write(bodyText.encode('utf-8') + '\n')

Remove the .encode() call there.
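
As a rough sketch of "encode exactly once", assuming Python 3 and a made-up helper named append_utf8_line: the text is encoded a single time, right before it is written to a file opened in binary append mode. The temporary directory is only there so the example does not touch a real output.txt.

```python
import os
import tempfile


def append_utf8_line(path, text):
    """Encode text once as UTF-8 and append it, plus a newline, to path."""
    data = text.encode('utf-8')      # the only encode step
    with open(path, 'ab') as ofile:  # binary append, as in the question
        ofile.write(data + b'\n')


# Hypothetical output location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')
append_utf8_line(path, 'caf\u00e9')

with open(path, 'rb') as infile:
    written = infile.read()
print(written)
```

The with block also closes the file on every path, which the original loop does not do for the error branch.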

Note that you should really use a proper HTML parser here rather than a regular expression. BeautifulSoup, for example, would be a good choice.
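
To give a feel for the parser approach without installing anything, here is a sketch that swaps in the standard library's html.parser in place of BeautifulSoup. It assumes Python 3; the class name H3ToFooterText and the function extract_text are made up for illustration. Unlike the regular expression, it collects the text between the first <h3> and the first <footer> rather than the raw markup.

```python
from html.parser import HTMLParser


class H3ToFooterText(HTMLParser):
    """Collect text from the first <h3> up to the first <footer>."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == 'h3':
            self.capturing = True       # start collecting at the heading
        elif tag == 'footer' and self.capturing:
            self.capturing = False      # stop at the footer
            self.done = True

    def handle_data(self, data):
        # Keep only non-blank text seen while capturing.
        if self.capturing and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the list of text chunks between <h3> and <footer>."""
    parser = H3ToFooterText()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample page for illustration.
sample = '<p>intro</p><h3>Title</h3><p>body</p><footer>end</footer>'
print(extract_text(sample))
```

A parser-based version keeps working when attributes or whitespace inside the tags change, which is where the regular expression tends to break.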

Simple: why am I getting "TypeError: 'int' object is not callable"? (see code line 24)
