Python web scraper using Beautiful Soup 4

Time: 2016-04-27 14:54:55

Tags: python-2.7 web-scraping beautifulsoup

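I want to create a database of common words. Right now the script runs fine, but my biggest problem is that I need all of the words in a single column. What I have feels more like a hack than a real fix. With Beautiful Soup, can you print everything in one column without the extra blank lines?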

import requests
import re
from bs4 import BeautifulSoup

# Website you want to scrape info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")

# Creating the CSV file
commonFile = open('common_words.csv', 'wb')

# Grabbing the lines you want
for node in soup.findAll("tr"):
    # Getting just the text and removing the html
    words = ''.join(node.findAll(text=True))
    # Removing the extra lines
    ID = re.sub(r'[\t\r\n]', '', words)
    # Needed to add a break in the line to make the rows
    update = ''.join(ID)+'\n'
    # Now we add this to the file
    commonFile.write(update)
commonFile.close()

1 answer:

Answer 0 (score: 1)

How about this?

import requests
import csv
from bs4 import BeautifulSoup

# Website you want to scrape info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")

# Each table row inside the file view holds one word
words = soup.select('div[class=file] tr')

# The with block closes the file once every row is written
with open("common_words.csv", "w") as out:
    f = csv.writer(out)
    f.writerow(["common_words"])
    for row in words:
        # Drop the newlines that surround each word
        f.writerow([row.text.replace('\n', '')])
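
To see the one-column, no-blank-lines part on its own, here is a minimal sketch that skips the network request and works on plain strings. It assumes Python 3. The names `clean_rows` and `write_single_column` and the sample row texts are made up for illustration and do not come from the page. It strips whitespace from each row's text, skips rows that end up empty, and writes one word per line.

```python
import csv
import io


def clean_rows(row_texts):
    """Yield one stripped word per row, skipping rows that are empty."""
    for text in row_texts:
        word = text.strip()
        if word:
            yield word


def write_single_column(words, out, header="common_words"):
    """Write a header plus one word per line to an open text file."""
    # lineterminator="\n" keeps the writer from emitting "\r\n" endings
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([header])
    for word in words:
        writer.writerow([word])


# Made-up sample of what the text of a few table rows might look like
sample = ["\n\nthe\n", "\n\nof\n", "\n\n", "\n\nand\n"]
buffer = io.StringIO()
write_single_column(clean_rows(sample), buffer)
result = buffer.getvalue()
print(result)
```

With the real page you could pass `(row.text for row in words)` in place of `sample`, and a file opened with `open("common_words.csv", "w", newline="")` in place of `buffer`.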