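I am trying to use Python to extract tagged entities from a CSV file. The file contains tagged entities in several of its columns. I only want Python to process one specific column. Can anyone tell me how to do this?
Here is my code: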
from bs4 import BeautifulSoup
import csv

input_name = "file.csv"  # File names for input and output
output_name = "entities.csv"


def incrementEntity(entity_string, dictionary):
    # Add one to the count for this entity, creating the entry if needed
    try:
        dictionary[entity_string] += 1
    except KeyError:
        dictionary[entity_string] = 1


def outputResults(dictionary, entity_type, f):
    # Print and write the entities, most frequent first
    for i in sorted(dictionary, key=dictionary.get, reverse=True):
        print(i, '\t', entity_type, '\t', dictionary[i])
        f.writerow([i, entity_type, dictionary[i]])


try:
    f = open(input_name, 'r')
    soup = BeautifulSoup(f, 'html.parser')
    f.close()
except IOError as message:
    print(message)
    raise ValueError("Input file could not be opened")

locations = {}
people = {}
orgs = {}

# Count every tagged entity by its tag name
for i in soup.find_all():
    entity_name = i.get_text()
    entity_type = i.name
    if entity_type == 'i-loc' or entity_type == 'b-loc':
        incrementEntity(entity_name, locations)
    elif entity_type == 'b-org' or entity_type == 'i-org':
        incrementEntity(entity_name, orgs)
    elif entity_type == 'b-per' or entity_type == 'i-per':
        incrementEntity(entity_name, people)
    else:
        continue

output_file = open(output_name, 'w', newline='')
f = csv.writer(output_file)
print("Entity\t\tType\t\tCount")
print("------\t\t----\t\t-----")
f.writerow(["Entity", "Type", "Count"])
outputResults(locations, 'location', f)
outputResults(people, 'person', f)
outputResults(orgs, 'organization', f)
output_file.close()
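Answer 0 (score: 1)
By definition, a CSV is a file whose data is separated by commas. So all you have to do is call the .split() method on the string you are processing. For example: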
csvline = 'Joe,25,M'
age = csvline.split(',')[1]
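One caveat with a plain split: it also cuts at commas inside quoted fields, which is likely when a column holds HTML-like text. Here is a minimal sketch of the difference, using the standard csv module. The sample line is made up for illustration.

```python
import csv

# Hypothetical sample line: the second field contains a comma inside quotes.
csvline = 'Joe,"<b-loc>Paris, France</b-loc>",M'

# A plain split cuts the quoted field in two, so the columns shift.
naive_fields = csvline.split(',')

# csv.reader accepts any iterable of lines and respects the quotes.
parsed_fields = next(csv.reader([csvline]))

print(len(naive_fields), naive_fields[1])
print(len(parsed_fields), parsed_fields[1])
```

If your fields never contain commas, the two approaches give the same result and split is enough.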
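I don't know exactly what kind of data you are processing, but since you are trying to use BeautifulSoup, I will assume that your CSV file contains plain HTML-like data in some of its columns and that you want to join the data from one of those columns, across all lines, so BeautifulSoup can process it. In that case you could try something like this: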
# Join index 1 (the second column) of every line into one HTML-like string
with open(input_name, 'r') as f:
    htmlstring = '\n'.join([line.split(',')[1] for line in f])
soup = BeautifulSoup(htmlstring, 'html.parser')
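The same column-joining idea can be sketched with csv.reader, so that quoted commas inside a field do not shift the columns. The helper name column_text and the sample rows are assumptions for illustration, not part of the original code.

```python
import csv
import io


def column_text(lines, column_index):
    """Join one column of CSV input into a single string.

    `lines` is any iterable of text lines, for example an open file.
    Rows too short to contain the column are skipped.
    """
    reader = csv.reader(lines)
    return '\n'.join(row[column_index] for row in reader
                     if len(row) > column_index)


# Made-up sample standing in for open(input_name, newline='')
sample = io.StringIO(
    '1,"<b-per>Joe</b-per> lives in <b-loc>Paris, France</b-loc>",x\n'
    '2\n'
    '3,<b-org>ACME</b-org>,y\n'
)
htmlstring = column_text(sample, 1)
print(htmlstring)
```

The resulting htmlstring can then be handed to BeautifulSoup in the same way as in the snippet above. If your file has a header row, it will be included in the joined text unless you skip the first line before calling the helper.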