Python:替换列表中的\ n \ r \ t,不包括那些以\ n \ n开头并以\ n \ r \ n \ t结尾的列表

时间:2017-10-16 08:37:47

标签: python web-scraping data-cleansing

我的清单:

  

[' \ n \ r \ n \ t这篇文章是关于甜香蕉。对于哪个属   香蕉植物属于,见Musa(属)。\ n \ r \ n \ t对于含淀粉的香蕉   用于烹饪,请参阅烹饪香蕉。其他用途,请参阅香蕉   (消歧)\ n \ r \ n \ tMusa种是原生于热带的Indomalaya   和澳大利亚,很可能首先在巴布亚驯化   新几内亚。\ n \ r \ n \ t他们长在135   国家。\ n \ n \ n \ n \ n \ n \ t \全世界范围内,没有明显的区别   "香蕉"和#34;大蕉"。\ n \ n描述\ n \ r \ n \ t香蕉植物是   最大的草本开花植物。\ n \ r \ n \ t所有的地上   香蕉植物的一部分从通常称为a的结构生长   " corm"。\ n \ nEtymology \ n \ r \ n \ t这个词香蕉被认为属于西方   非洲血统,可能来自Wolof一词banaana,并传入   英语通过西班牙语或葡萄牙语。\ n']

示例代码:

<script src="https://cdn.jsdelivr.net/npm/vue"></script>

<div id="app">
  <button @click="trigger">upload image</button>
  <input type="file" id="fileLoader"  v-show="false" @change="fetchData">
</div>

在我抓取网页后,我需要进行数据清理。我想将所有import requests from bs4 import BeautifulSoup import re re=requests.get('http://www.abcde.com/banana') soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser") title_tag = soup.select_one('.page_article_title') print(title_tag.text) list=[] for tag in soup.select('.page_article_content'): list.append(tag.text) #list=([c.replace('\n', '') for c in list]) #list=([c.replace('\r', '') for c in list]) #list=([c.replace('\t', '') for c in list]) print(list) "\r""\n"替换为"\t",但我发现我有副标题,如果我这样做,字幕和句子将混合在一起

每个字幕始终以""开头,以\n\n结尾,是否有可能我可以在此列表中区分它们,例如\n\r\n\t。如果我将\aEtymology\a\n\n分别替换为\n\r\n\t首先导致其他部分可能具有与此\a相同的元素,那么它将无效比如\n\n\r。提前谢谢!

3 个答案:

答案 0 :(得分:1)

方法

  1. 将字幕替换为自定义字符串,列表中的<subtitles>
  2. 替换列表中的\n\r\t
  3. 将自定义字符串替换为实际字幕
  4. 代码

    l=['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']
    
    import re
    regex=re.findall("\n\n.*.\n\r\n\t",l[0])
    print(regex)
    
    for x in regex:
        l = [r.replace(x,"<subtitles>") for r in l]
    
    rep = ['\n','\t','\r']
    for y in rep:
        l = [r.replace(y, '') for r in l]
    
    for x in regex:
        l = [r.replace('<subtitles>', x, 1) for r in l]
    print(l)
    

    输出

    ['\n\nDescription\n\r\n\t', '\n\nEtymology\n\r\n\t']
    
    ['This article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).For starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.They are grown in 135 countries.Worldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.All the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.']
    

答案 1 :(得分:0)

import re    

print([re.sub(r'[\n\r\t]', '', c) for c in list])

我认为你可以使用正则表达式

答案 2 :(得分:0)

您可以使用正则表达式执行此操作:

import re
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
new_list = [subtitle.sub(r"\a\g<1>\a", l) for l in li]

\g<1>是对第一个正则表达式中(\ w +)的反向引用。它可以让你重复使用那里的东西。