我的清单:
[' \ n \ r \ n \ t这篇文章是关于甜香蕉。对于哪个属 香蕉植物属于,见Musa(属)。\ n \ r \ n \ t对于含淀粉的香蕉 用于烹饪,请参阅烹饪香蕉。其他用途,请参阅香蕉 (消歧)\ n \ r \ n \ tMusa种是原生于热带的Indomalaya 和澳大利亚,很可能首先在巴布亚驯化 新几内亚。\ n \ r \ n \ t他们长在135 国家。\ n \ n \ n \ n \ n \ n \ t \全世界范围内,没有明显的区别 "香蕉"和#34;大蕉"。\ n \ n描述\ n \ r \ n \ t香蕉植物是 最大的草本开花植物。\ n \ r \ n \ t所有的地上 香蕉植物的一部分从通常称为a的结构生长 " corm"。\ n \ nEtymology \ n \ r \ n \ t这个词香蕉被认为属于西方 非洲血统,可能来自Wolof一词banaana,并传入 英语通过西班牙语或葡萄牙语。\ n']
示例代码:
<script src="https://cdn.jsdelivr.net/npm/vue"></script>
<div id="app">
<button @click="trigger">upload image</button>
<input type="file" id="fileLoader" v-show="false" @change="fetchData">
</div>
在我抓取网页后,我需要进行数据清理。我想将所有import requests
from bs4 import BeautifulSoup
import re
re=requests.get('http://www.abcde.com/banana')
soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
title_tag = soup.select_one('.page_article_title')
print(title_tag.text)
list=[]
for tag in soup.select('.page_article_content'):
list.append(tag.text)
#list=([c.replace('\n', '') for c in list])
#list=([c.replace('\r', '') for c in list])
#list=([c.replace('\t', '') for c in list])
print(list)
,"\r"
,"\n"
替换为"\t"
,但我发现我有副标题,如果我这样做,字幕和句子将混合在一起。
每个字幕始终以""
开头,以\n\n
结尾,是否有可能我可以在此列表中区分它们,例如\n\r\n\t
。如果我将\aEtymology\a
和\n\n
分别替换为\n\r\n\t
首先导致其他部分可能具有与此\a
相同的元素,那么它将无效比如\n\n\r
。提前谢谢!
答案 0 :(得分:1)
方法
<subtitles>
\n
,\r
,\t
等代码
l=['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']
import re
regex=re.findall("\n\n.*.\n\r\n\t",l[0])
print(regex)
for x in regex:
l = [r.replace(x,"<subtitles>") for r in l]
rep = ['\n','\t','\r']
for y in rep:
l = [r.replace(y, '') for r in l]
for x in regex:
l = [r.replace('<subtitles>', x, 1) for r in l]
print(l)
输出
['\n\nDescription\n\r\n\t', '\n\nEtymology\n\r\n\t']
['This article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).For starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.They are grown in 135 countries.Worldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.All the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.']
答案 1 :(得分:0)
import re
print([re.sub(r'[\n\r\t]', '', c) for c in list])
我认为你可以使用正则表达式
答案 2 :(得分:0)
您可以使用正则表达式执行此操作:
import re
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
new_list = [subtitle.sub(r"\a\g<1>\a", l) for l in li]
\g<1>
是对第一个正则表达式中(\ w +)的反向引用。它可以让你重复使用那里的东西。