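I successfully wrote a script in Python 3 with bs4 to get the strings from a Wikipedia page without duplicates. For that, I used the following
Algorithm:
1) Write a CSV file with duplicates.
Using the above file,
2) Write a CSV file without duplicates.
Script: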
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import csv
url = 'https://ta.wikisource.org/w/index.php?title=அட்டவணை:அ. மருதகாசி-பாடல்கள்.pdf&action=history'
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')
#getting the uncleaned contributors
userBdi = soup.findAll('bdi')
#list 2 string
uncleanedContributors =''.join(str(userBdi)[1:-1]).replace('</','<').replace('<bdi>','').replace(',','\n').replace(' ','').replace('பக்கம்','அட்டவணை_பேச்சு').replace('Bot','').replace('BOT','')
print()
print('The output of uncleaned contributors')
print('--------------------------------------')
print(uncleanedContributors)
with open('uncleaned-contributors.csv','a') as csvwrite:
    csvwriter = csvwrite.write(uncleanedContributors+'\n')
content = open('uncleaned-contributors.csv','r').readlines()
content4set = set(content)
cleanedcontent = open('cleaned-contributors.csv','w')
print()
print('The output of cleaned contributors')
print('--------------------------------------')
for i, line in enumerate(content4set,0):
    cleanedcontent.write("{}.{}".format(str(i+1),line.replace('பக்கம்','அட்டவணை_பேச்சு')))
    line=line.strip()
    print(i, line)
cleanedcontent.close()
How can I write the CSV file without duplicates directly? Is there a way?
Answer 0 (score: 1)
Here is one way to solve the problem:
from bs4 import BeautifulSoup
import requests
import csv
url = 'https://ta.wikisource.org/w/index.php?title=அட்டவணை:அ. மருதகாசி-பாடல்கள்.pdf&action=history'
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')
#getting the uncleaned contributors
userBdi = soup.findAll('bdi')
#list 2 string
uncleanedContributors =''.join(str(userBdi)[1:-1]).replace('</','<').replace('<bdi>','').replace(',','\n').replace(' ','').replace('பக்கம்','அட்டவணை_பேச்சு').replace('Bot','').replace('BOT','')
cleanedcontent = open('cleaned-contributors.csv','w')
print()
print('The output of cleaned contributors')
print('--------------------------------------')
def unique_list(l):
    # keep only the first occurrence of each item, preserving order
    ulist = []
    for x in l:
        if x not in ulist:
            ulist.append(x)
    return ulist
a = ' '.join(unique_list(uncleanedContributors.split()))
for i, j in enumerate(a.split(' ')):
    cleanedcontent.write("{}.{}".format(str(i+1),j.replace('பக்கம்','அட்டவணை_பேச்சு')))
    cleanedcontent.write('\n')
    print(i+1, j)
cleanedcontent.close()
When executed,
[1]:
The output of cleaned contributors
--------------------------------------
1 Balajijagadesh
2 Info-farmer
3 Tshrinivasan
The solution code above gives the exact output you asked for in the question, and it writes the CSV file directly without any duplicates.
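As a possible variation on the same idea, the deduplication step can also be done with `dict.fromkeys`, which keeps the first occurrence of each name in order (guaranteed from Python 3.7 on) and avoids the repeated `x not in ulist` scan. This is only a minimal sketch: the helper names `unique_in_order` and `write_unique_csv` and the sample list are made up for illustration, and it uses `csv.writer`, so the rows come out as `1,Balajijagadesh` rather than the `1.Balajijagadesh` format written above.

```python
import csv


def unique_in_order(names):
    """Return the names with duplicates removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(names))


def write_unique_csv(names, path):
    """Write one numbered row per distinct name and return the rows written."""
    rows = [(i, name) for i, name in enumerate(unique_in_order(names), start=1)]
    # newline='' is the setting the csv module docs recommend for csv.writer
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(rows)
    return rows


if __name__ == '__main__':
    # Hypothetical sample data. In the script above, the names could
    # presumably be taken from the parsed page instead, for example:
    #   names = [tag.get_text(strip=True) for tag in soup.find_all('bdi')]
    sample = ['Balajijagadesh', 'Info-farmer', 'Balajijagadesh',
              'Tshrinivasan', 'Info-farmer']
    for number, name in write_unique_csv(sample, 'cleaned-contributors.csv'):
        print(number, name)
```

Since the list is deduplicated in memory before anything is written, the intermediate `uncleaned-contributors.csv` file is not needed with this approach.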