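Removing \n and \t from data scraped from a website

Time: 2017-04-24 05:34:30

Tags: python web-scraping beautifulsoup

I am trying to remove the \n and \t characters that show up in data scraped from a web page.

I used the strip() function, but for some reason it does not seem to work. My output still shows all the \n and \t characters.

Here is my code: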

import urllib.request
from bs4 import BeautifulSoup
import sys

all_comments = [] 
max_comments = 10
base_url = 'https://www.mygov.in/'
next_page = base_url + '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/'

while next_page and len(all_comments) < max_comments : 
    response = urllib.request.urlopen(next_page)
    srcode = response.read()
    soup = BeautifulSoup(srcode, "html.parser")

    all_comments_div=soup.find_all('div', class_="comment_body"); 
    for div in all_comments_div:
        data = div.find('p').text
        data = data.strip(' \t\n')#actual comment content
        data=''.join([ i for i in data if ord(i) < 128 ])
        all_comments.append(data)

    #getting the link of the stream for more comments
    next_page = soup.find('li', class_='pager-next first last')
    if next_page : 
        next_page = base_url + next_page.find('a').get('href')
    print('comments: {}'.format(len(all_comments)))

print(all_comments)

This is the output I get:

comments: 10

["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. 
In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners? ']

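4 Answers:

Answer 0 (score: 1)

strip() only removes whitespace (or the characters you pass to it) from the two ends of a string. To remove characters inside the string, you need to use replace or re.sub.

So change: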

data = data.strip(' \t\n')

to:

import re

data = re.sub(r'[\t\n ]+', ' ', data).strip()

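This replaces every run of \t, \n and space characters with a single space and then trims the ends, which removes the \t and \n characters.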
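The difference between the two calls can be seen on a short string. This is only an illustrative sketch: the `sample` text is made up for the example and is not taken from the scraped page.

```python
import re

# Hypothetical sample text with whitespace at the ends and in the middle.
sample = "\n\t Hello\n Sir,\n\tplease   read this \n"

# strip() only trims the listed characters from both ends of the string.
stripped = sample.strip(" \t\n")

# re.sub() replaces every run of tab/newline/space, including interior runs.
collapsed = re.sub(r"[\t\n ]+", " ", sample).strip()

print(repr(stripped))   # interior \n and \t are still present
print(repr(collapsed))  # interior runs are now single spaces
```

Printing with repr() makes the remaining escape characters visible, which is easier to read than the raw output.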

Answer 1 (score: 0)

Use replace instead of strip:

div = "/n blablabla /t blablabla"
div = div.replace('/n', '')
div = div.replace('/t','')
print(div)

Result:

 blablabla  blablabla

Some explanation: strip does not work in your case because it only removes the specified characters from the beginning or end of the string; it cannot remove them from the middle. (The examples here use the literal text /n and /t as stand-ins; with the real escape characters the calls are replace('\n', '') and replace('\t', '').)

Some examples:

div = "/nblablabla blablabla"
div = div.strip('/n')
#div = div.replace('/t','')
print(div)

Result:

blablabla blablabla

It will not remove any characters between the start and the end:

div = "blablabla /n blablabla"
div = div.strip('/n')
#div = div.replace('/t','')
print(div)

Result:

blablabla /n blablabla
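The same behaviour can be checked with the real escape characters. The sketch below uses a made-up `text` value; it shows that the argument to strip() is treated as a set of characters to trim from the ends, while replace() acts on every occurrence.

```python
# Hypothetical sample with real newline and tab characters.
text = "\n\tfirst line\n\tsecond line\t\n"

# The argument to strip() is a set of characters, not a substring:
# any mix of "\n" and "\t" is trimmed from both ends, nothing from the middle.
ends_trimmed = text.strip("\n\t")

# replace() works on every occurrence, wherever it appears in the string.
all_removed = text.replace("\n", "").replace("\t", "")

print(repr(ends_trimmed))
print(repr(all_removed))
```

Note that replacing with an empty string glues neighbouring words together when the newline was the only separator, which is why the other answers substitute a single space instead.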

Answer 2 (score: 0)

You can split the text (which removes all whitespace, including tabs) and join the pieces again, using a single space as the "glue":

data = " ".join(data.split())
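As a rough comparison with the regex approach from the other answers: split() with no arguments splits on any run of whitespace, so it also catches characters such as \r that the character class [\t\n ] does not list. The `raw` string below is invented for illustration.

```python
import re

# Hypothetical sample containing a carriage return as well as tabs and newlines.
raw = "  Hello\r\n Sir,\t\tplease \n read\tthis  "

# split() with no arguments splits on any whitespace run and drops empty pieces.
joined = " ".join(raw.split())

# The narrow character class leaves the "\r" in place.
narrow = re.sub(r"[\t\n ]+", " ", raw).strip()

# "\s" matches any whitespace character, so this agrees with split()/join().
broad = re.sub(r"\s+", " ", raw).strip()

print(repr(joined))
print(repr(narrow))
print(repr(broad))
```

For scraped text, where Windows-style line endings are common, the split/join form or the \s+ pattern may therefore be the safer choice.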

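Answer 3 (score: -1)

As others have mentioned, strip only removes whitespace from the start and end of a string. To remove specific characters such as \t and \n anywhere in the string:

With regular expressions (the re module) this is easy to do. Specify a pattern that matches the characters to be replaced. The method you need is sub (substitute):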

import re

data = re.sub(r'[\t\n ]+', ' ', data)

re.sub(<pattern to replace>, <replacement>, <string>): above, the pattern is [\t\n ]+, where [ ] defines a character class and + means one or more. To do the sub and the strip in a single statement:

data = re.sub(r'[\t\n ]+', ' ', data).strip()
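To see what the + contributes, the pattern can be tried with and without it on a tiny string. The string `s` is a made-up example.

```python
import re

# Hypothetical sample: newline, tab and space in a row between two letters.
s = "a\n\t b"

# Without "+": each whitespace character is replaced on its own,
# so three characters become three spaces.
without_plus = re.sub(r"[\t\n ]", " ", s)

# With "+": the whole run is matched once and replaced by one space.
with_plus = re.sub(r"[\t\n ]+", " ", s)

print(repr(without_plus))
print(repr(with_plus))
```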

Data containing \t and \n (the same list shown in the question's output; it is assigned to data in the test run below):

Test run

import re

data = ["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir....  Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee  (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. 
In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY  GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', '                                   ?                                                    ', '        9     Central Government and Central Autonomous Bodies   pensioners/ family pensioners  1  2016     ,          1.1    .2017                                   ?', ' 9                            /                                  ', ' 9            Central Government and Central Autonomous Bodies   pensioners/ family pensioners  1  2016     ,          01.1 .2017                          DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641        ', '  ,\n \n,  ,\n\t                       Central Government and Central Autonomous Bodies   pensioners/ family pensioners ? ']



data_out = []

for s in data:
    # Collapse each run of tab/newline/space to one space, then trim the ends.
    data_out.append(re.sub(r'[\t\n ]+', ' ', s).strip())

print(data_out)

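Output

["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE 1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE... TAHNKYOU SIR..', 'Respect sir, I am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not . Sir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir suggestions AADHAR BASE SYSTEM 1.Cash Less Education PAN India Centralised System 2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system 3.All Private & Govt Hospitals must be CASH LESS 4.All Toll Booth,Parking Etc CASHLESS Compulsory 5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL 6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned 7.Municipal Corporations/ZP must be CASH Less System Affordable Min Tax Housing Cancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR. JAI HIND', '?', '9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', '9 /', '9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641', ', , , Central Government and Central Autonomous Bodies pensioners/ family pensioners ?']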
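Several entries in the output are almost empty (for example '?' and '9 /') because the question's loop drops every non-ASCII character before the whitespace is cleaned. One possible way to fold both steps into the scraping loop is a small helper; `clean_comment` and `raw_comments` are hypothetical names introduced only for this sketch, and the sample strings are shortened or invented.

```python
import re

def clean_comment(text):
    """Drop non-ASCII characters, then collapse whitespace runs to one space."""
    # Same ASCII filter as in the question's loop.
    ascii_only = "".join(ch for ch in text if ord(ch) < 128)
    # Same pattern as in the answers above, followed by trimming the ends.
    return re.sub(r"[\t\n ]+", " ", ascii_only).strip()

# Hypothetical stand-ins for the text of each scraped comment.
raw_comments = [
    "Respect sir,\nI am Hindi teacher",
    "  ,\n \n,  ,\n\t   Central Government ? ",
    "\u0928\u092e\u0938\u094d\u0924\u0947 \n\t",  # a comment written only in Devanagari
]

cleaned = [clean_comment(c) for c in raw_comments]

# Optionally skip comments that are empty once cleaned.
non_empty = [c for c in cleaned if c]

print(cleaned)
print(non_empty)
```

In the question's loop this would amount to calling the helper on div.find('p').text before appending to all_comments. Whether empty results should be skipped is a design choice that depends on how the comments are used afterwards.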