BeautifulSoup - scraping a forum page

Time: 2014-02-23 18:38:52

Tags: python beautifulsoup screen-scraping

I am trying to scrape a forum discussion and export it as a csv file, with rows such as "thread title", "user", and "post", where the last one is each person's actual forum post.

I'm a beginner with Python and BeautifulSoup, so I'm having a really hard time with this!

My current problem is that all the text gets split one character per row in the csv file. Is there anyone out there who can help? It would be great if someone could give me a hand!

Here is the code I have been using:

from bs4 import BeautifulSoup
import csv
import urllib2

f = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")

soup = BeautifulSoup(f)

b = soup.get_text().encode("utf-8").strip() #the posts contain non-ascii words, so I had to do this

writer = csv.writer(open('silkroad.csv', 'w'))
writer.writerows(b)

1 Answer:

Answer 0 (score: 12)

Ok, here we go. Not quite sure what I'm helping you do here, but hopefully you have a good reason to be analyzing Silk Road posts.

There are a few problems here, the biggest being that you aren't parsing the data at all. What you're doing with .get_text() is going to the page, highlighting the whole thing, and copying and pasting it all into a csv file. On top of that, csv.writer.writerows() expects an iterable of rows, and iterating over a string yields one character at a time, which is why every character ends up on its own line.
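
Here is a minimal sketch of that behavior, using a throwaway string and file name purely for illustration:

    import csv

    writer = csv.writer(open('demo.csv', 'wb'))

    # writerows() iterates its argument to get rows, then iterates each
    # row to get fields -- a plain string yields one character per row
    writer.writerows("abc")          # writes three rows: a / b / c

    # a single row should be passed as a list of fields instead
    writer.writerow(["abc", "def"])  # writes one row: abc,def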

So here's what you should be trying to do:

  1. Read the page source
  2. Use the soup to break it into the sections you want
  3. Save the sections into parallel arrays of authors, dates, times, posts, etc.
  4. Write the data to the csv file row by row
  5. I wrote some code to show you what that looks like; it should do the job:

    from bs4 import BeautifulSoup
    import csv
    import urllib2
    
    # get page source and create a BeautifulSoup object based on it
    print "Reading page..."
    page = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")
    soup = BeautifulSoup(page)
    
    # if you look at the HTML all the titles, dates, 
    # and authors are stored inside of <dt ...> tags
    metaData = soup.find_all("dt")
    
    # likewise the post data is stored
    # under <dd ...>
    postData = soup.find_all("dd")
    
    # define where we will store info
    titles = []
    authors = []
    times = []
    posts = []
    
    # now we iterate through the metaData and parse it
    # into titles, authors, and dates
    print "Parsing data..."
    for html in metaData:
        text = html.get_text().encode("utf-8").replace("\n", "") # convert the tag to text (no need to re-parse it)
        titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
        authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
        times.append(text.split(" on ")[1].strip()) # get date
    
    # now we go through the actual post data and extract it
    for post in postData:
        posts.append(post.get_text().encode("utf-8").strip())
    
    # now we write data to csv file
    # ***csv files MUST be opened with the 'b' flag***
    csvfile = open('silkroad.csv', 'wb')
    writer = csv.writer(csvfile)
    
    # create template
    writer.writerow(["Time", "Author", "Title", "Post"])
    
    # iterate through and write all the data
    for time, author, title, post in zip(times, authors, titles, posts):
        writer.writerow([time, author, title, post])
    
    
    # close file
    csvfile.close()
    
    # done
    print "Operation completed successfully."
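
For reference, here is roughly what those split() calls are pulling apart. The sample string below is made up (the exact wording on the page is an assumption), but it shows the shape of the flattened <dt> text the parsing loop relies on:

    # hypothetical flattened text of one <dt> tag
    text = "Title: Some thread title Post by: someUser on February 20, 2014, 10:15:30 AM"

    title  = text.split("Title:")[1].split("Post by:")[0].strip() # "Some thread title"
    author = text.split("Post by:")[1].split(" on ")[0].strip()   # "someUser"
    time   = text.split(" on ")[1].strip()                        # "February 20, 2014, 10:15:30 AM"

Note that splitting on " on " is fragile: a title or username that itself contains " on " will throw the date split off.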
    

EDIT: included a solution that can read files from a directory and use the data from them

Ok, so you have your HTML files in a directory. You need to get a list of the files in the directory, iterate over them, and append to the csv file for each file in the directory.

This is the basic logic of our new program.

If we had a function called processData() that takes a file path as an argument and appends the data from that file to our csv file, this is what it would look like:

    import csv
    import os  # needed for the directory listing below

    # the directory where we have all our HTML files
    dir = "myDir"
    
    # our csv file
    csvFile = "silkroad.csv"
    
    # insert the column titles to csv
    csvfile = open(csvFile, 'wb')
    writer = csv.writer(csvfile)
    writer.writerow(["Time", "Author", "Title", "Post"])
    csvfile.close()
    
    # get a list of files in the directory
    fileList = os.listdir(dir)
    
    # define variables we need for status text
    totalLen = len(fileList)
    count = 1
    
    # iterate through files and read all of them into the csv file
    for htmlFile in fileList:
        path = os.path.join(dir, htmlFile) # get the file path
        processData(path) # process the data in the file
        print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
        count = count + 1 # increment counter
    

As it happens, our processData() function is more or less what we did before, with a few changes.

So this is very similar to our last program, with a few small changes:

  1. We write the column headers first thing
  2. After that we open the csv with the 'ab' flag in order to append
  3. We import os to get a list of the files
  4. Here's what that looks like:

      from bs4 import BeautifulSoup
      import csv
      import urllib2
      import os # added this import to process files/dirs
      
      # ** define our data processing function
      def processData(pageFile):
          ''' take the data from an html file and append to our csv file '''
          f = open(pageFile, "r")
          page = f.read()
          f.close()
          soup = BeautifulSoup(page)
      
          # if you look at the HTML all the titles, dates, 
          # and authors are stored inside of <dt ...> tags
          metaData = soup.find_all("dt")
      
          # likewise the post data is stored
          # under <dd ...>
          postData = soup.find_all("dd")
      
          # define where we will store info
          titles = []
          authors = []
          times = []
          posts = []
      
          # now we iterate through the metaData and parse it
          # into titles, authors, and dates
          for html in metaData:
              text = html.get_text().encode("utf-8").replace("\n", "") # convert the tag to text (no need to re-parse it)
              titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
              authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
              times.append(text.split(" on ")[1].strip()) # get date
      
          # now we go through the actual post data and extract it
          for post in postData:
              posts.append(post.get_text().encode("utf-8").strip())
      
          # now we write data to csv file
          # ***csv files MUST be opened with the 'b' flag***
          csvfile = open('silkroad.csv', 'ab')
          writer = csv.writer(csvfile)
      
          # iterate through and write all the data
          for time, author, title, post in zip(times, authors, titles, posts):
              writer.writerow([time, author, title, post])
      
          # close file
          csvfile.close()
      # ** start our process of going through files
      
      # the directory where we have all our HTML files
      dir = "myDir"
      
      # our csv file
      csvFile = "silkroad.csv"
      
      # insert the column titles to csv
      csvfile = open(csvFile, 'wb')
      writer = csv.writer(csvfile)
      writer.writerow(["Time", "Author", "Title", "Post"])
      csvfile.close()
      
      # get a list of files in the directory
      fileList = os.listdir(dir)
      
      # define variables we need for status text
      totalLen = len(fileList)
      count = 1
      
      # iterate through files and read all of them into the csv file
      for htmlFile in fileList:
          path = os.path.join(dir, htmlFile) # get the file path
          processData(path) # process the data in the file
          print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
          count = count + 1 # increment counter
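
Two closing notes on this sketch: it assumes every saved page follows the same printpage layout as the live thread, and zip() stops at the shortest list, so if a file yields a different number of <dt> headers than <dd> posts the extra entries are silently dropped rather than raising an error. To try it out, save each thread's printpage HTML into a folder named myDir next to the script; the rows from every file get appended to silkroad.csv.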