Question

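My overall goal is to isolate the tags whose text contains a certain word, and print only those tags to a text file.

So far I have been able to extract a specific tag, in this case <title>, and print those to a text file.

My question is: once I have extracted all the text, what do I do with it? I can't find a way to isolate a particular word and trim the text down to just what I need.

Here is what I have so far: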

import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://www.website.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

page = soup.findAll('title')

for element in page:    
    print element

file_name = raw_input("What do you want to name the file?> ")
text_file = open("/Users/user1/Projects/%s.txt" % file_name, "w")
text_file.write("%s" % page)
text_file.close()

What I get back is:

<title>food</title>
<title>ball</title>
<title>car</title>
<title>desk</title>
<title>blue food</title>
<title>green food</title>
<title>red ball</title>

How can I print only the results that contain "food"?

Answer 1

You can use .string to get the contents of an element. If you only want the results that contain food, add the following:

for element in page:
    # .string is None when a tag has nested children, so check it first
    if element.string and 'food' in element.string:
        print element.string
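
To get from that loop to the text file the question asks for, here is a minimal Python 3 sketch. It assumes the strings have already been pulled out of the tags with .string, so a plain list stands in for them. The helper names contains_word and write_matches and the output path are made up for illustration and are not part of BeautifulSoup. It also matches whole words with a regular expression rather than a plain substring test, so "seafood" would not count as "food"; drop that if substring matching is what you want.

```python
import os
import re
import tempfile

# Sample strings standing in for the element.string values of the
# question's <title> tags (assumed to be extracted already).
titles = ["food", "ball", "car", "desk", "blue food", "green food", "red ball"]


def contains_word(text, word):
    """Return True if `word` occurs in `text` as a whole word, ignoring case."""
    if text is None:  # .string can be None when a tag has nested children
        return False
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def write_matches(texts, word, path):
    """Write every text containing `word` to `path`, one per line."""
    matches = [t for t in texts if contains_word(t, word)]
    with open(path, "w", encoding="utf-8") as out:
        for text in matches:
            out.write(text + "\n")
    return matches


# Hypothetical output location; the question wrote under /Users/user1/Projects.
output_path = os.path.join(tempfile.gettempdir(), "food_titles.txt")
matched = write_matches(titles, "food", output_path)
print(matched)
```

Writing inside the same function that filters keeps the file contents and the returned list in sync, which avoids writing the whole result list to the file in one go as the question's code does.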

Answer 2

For example, if you want to extract data from the page and put it into a CSV file, you can do something like this:

import urllib2
from BeautifulSoup import BeautifulSoup
import csv

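file_name = raw_input("What do you want to name the file?> ")
# Open the CSV file in append mode (binary, as the Python 2 csv module expects)
csv_file = open("%s.csv" % file_name, "ab")
c = csv.writer(csv_file, delimiter=";")

url = 'http://www.website.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

page = soup.findAll('title')

for element in page:
    # Encode to UTF-8 because the Python 2 csv module does not handle unicode
    text = element.text.encode('utf-8')
    c.writerow([text])

# Close the file so all rows are flushed to disk
csv_file.close()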

You can then open the CSV file in Excel and/or a text editor, which can be handy. My code is far from perfect, but it should work anyway :)
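
For anyone on Python 3, the same idea might look roughly like the sketch below. It is only an illustration under a few assumptions: the list titles stands in for the element.text values (the sample strings are made up), the helper names append_rows and read_rows are hypothetical, and the file goes into a temporary directory instead of a name typed by the user. In Python 3 the csv module works with text, so the manual .encode('utf-8') step is replaced by passing encoding to open().

```python
import csv
import os
import tempfile

# Made-up sample strings standing in for element.text of each <title> tag.
titles = ["food", "ball", "blue food", "crème brûlée"]


def append_rows(path, values, delimiter=";"):
    """Append each value as a one-column row to the CSV file at `path`."""
    # newline="" lets the csv module control line endings;
    # encoding="utf-8" takes care of non-ASCII text.
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        for value in values:
            writer.writerow([value])


def read_rows(path, delimiter=";"):
    """Read the CSV file back as a list of rows, to check what was written."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter))


# Hypothetical file location in a fresh temporary directory.
csv_path = os.path.join(tempfile.mkdtemp(), "titles.csv")
append_rows(csv_path, titles)
rows = read_rows(csv_path)
print(rows)
```

The with block closes the file automatically, so nothing is left unflushed if the loop raises partway through.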
