Extracting tags from a single column of a CSV file with Python

Time: 2014-07-28 12:07:02

Tags: python csv beautifulsoup

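I am trying to extract tagged entities from a CSV file using Python. The file contains tagged entities in several of its columns, but I only want Python to process one specific column. Can anyone tell me how to do this?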

Here is my code:

from bs4 import BeautifulSoup
import csv

input_name =  "file.csv"      # File names for input and output
output_name = "entities.csv"

def incrementEntity(entity_string, dictionary):

    try:
        dictionary[entity_string] += 1
    except KeyError:
        dictionary[entity_string] = 1

def outputResults(dictionary, entity_type, f):

    for i in sorted(dictionary, key=dictionary.get, reverse=True):
        print i, '\t', entity_type, '\t', dictionary[i]
        f.writerow([i, entity_type, dictionary[i]])

try:
    f = open(input_name, 'r')
    soup = BeautifulSoup(f)
    f.close()
except IOError, message:
    print message
    raise ValueError("Input file could not be opened")

locations = {}  
people    = {}  
orgs      = {}

for i in soup.find_all():
    entity_name = i.get_text()
    entity_type = i.name

    if (entity_type == 'i-loc' or entity_type == 'b-loc'):
        incrementEntity(entity_name, locations)
    elif (entity_type == 'b-org' or entity_type == 'i-org'):
        incrementEntity(entity_name, orgs)
    elif (entity_type == 'b-per' or entity_type == 'i-per'):
        incrementEntity(entity_name, people)
    else:
        continue

output_file = open(output_name, 'w')
f = csv.writer(output_file)
print "Entity\t\tType\t\tCount"
print "------\t\t----\t\t-----"
f.writerow(["Entity", "Type", "Count"])

outputResults(locations, 'location', f)
outputResults(people, 'person', f)
outputResults(orgs, 'organization', f)

output_file.close()

1 Answer:

Answer 0 (score: 1)

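By definition, a CSV is a file whose data is separated by commas. So, for simple data, all you need is the .split() method of the string you are processing. For example: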

csvline = 'Joe,25,M'
age = csvline.split(',')[1]
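
One caveat: .split(',') miscounts columns as soon as a field contains a comma of its own. The following is only a sketch, using a made-up line with a quoted name field, that compares naive splitting with the standard csv module:

```python
import csv

# Hypothetical line: the first field is quoted and contains a comma.
csvline = '"Smith, Joe",25,M'

# Naive splitting cuts the quoted field in two, shifting every index.
naive_fields = csvline.split(',')

# csv.reader accepts any iterable of lines and respects the quoting.
parsed_fields = next(csv.reader([csvline]))

print(naive_fields[1])   # a fragment of the name, not the age
print(parsed_fields[1])  # the age column
```

If your data never contains quoted fields, the plain .split() approach is sufficient.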

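I don't know exactly what kind of data you are processing, but since you are trying to use BeautifulSoup, I will assume that some columns of your CSV file contain plain HTML-like data, and that you want to join the data from one such column across all rows so that BeautifulSoup can process it. In that case you could try something like: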

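# Take the second column (index 1) of every line and join the values
# into a single string for BeautifulSoup to parse.
with open(input_name, 'r') as f:
    htmlstring = '\n'.join([line.split(',')[1] for line in f])
soup = BeautifulSoup(htmlstring)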
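
If the tagged text itself contains commas, the same idea can be written with csv.reader instead of .split(). This is a rough sketch rather than a drop-in replacement: the helper name column_text, its skip_header parameter, and the sample rows are all invented for illustration, and the sample assumes the file has a header row.

```python
import csv
import io

def column_text(lines, column_index, skip_header=True):
    """Join the values of one CSV column, one value per line."""
    reader = csv.reader(lines)
    if skip_header:
        next(reader, None)  # discard the header row, if there is one
    values = []
    for row in reader:
        # Ignore blank or short rows that lack the requested column.
        if len(row) > column_index:
            values.append(row[column_index])
    return '\n'.join(values)

# Made-up sample data standing in for the real file.csv.
sample = io.StringIO(
    'id,text,source\n'
    '1,"<b-per>Joe</b-per> lives in <b-loc>Paris</b-loc>, France",web\n'
    '2,<b-org>Acme</b-org> hired him,news\n'
)
htmlstring = column_text(sample, 1)
print(htmlstring)
```

With a real file, pass the open file object instead of the StringIO object, then hand the resulting string to BeautifulSoup as in the snippet above.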