I want to remove values that are not URLs from a CSV file. Our df['url'] contains values such as 'https://stackoverflow.com/questions/ask', 'https://www.linkedin.com/feed/' and '345'; I want to remove 345.
import re

import pandas as pd


def Find_url(string):
    # Return every http(s) URL found in the string as a list
    url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
    return url


if __name__ == "__main__":
    file = pd.read_csv('url_file.csv')
    df = pd.DataFrame(file)
    for i in range(len(df)):
        url = Find_url(df.loc[i]['url'])
        df.loc[i]['url'] = url
    df.to_csv('clean_url.csv')
Sample input:
'https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560'
'http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0
1
304
365'
'https://en.wikipedia.org/wiki/Railway_Board'
'https://en.wikipedia.org/wiki/Railway_Board#History'
I would like output like this. Sample output:
'https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560'
'http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0'
'https://en.wikipedia.org/wiki/Railway_Board'
'https://en.wikipedia.org/wiki/Railway_Board#History'
Answer 0 (score: 0)
You can use urllib.parse from the standard library to try to parse each string as a URL and check that it has the necessary attributes.
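To see why this check separates URLs from values like 345, it may help to look at what urlparse returns for each kind of input. The snippet below is a minimal sketch; the helper name url_parts is hypothetical and only used for illustration.

```python
from urllib.parse import urlparse


def url_parts(value):
    # Hypothetical helper: return the three attributes the check relies on
    parts = urlparse(value)
    return parts.scheme, parts.netloc, parts.path


for value in ("https://en.wikipedia.org/wiki/Railway_Board", "345"):
    print(repr(value), "->", url_parts(value))
# 'https://en.wikipedia.org/wiki/Railway_Board' -> ('https', 'en.wikipedia.org', '/wiki/Railway_Board')
# '345' -> ('', '', '345')
```

A bare number parses without error, but its scheme and netloc are empty, so a test that requires those attributes to be non-empty rejects it.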
from io import StringIO
from urllib.parse import urlparse

import pandas as pd


def url_validator(x):
    try:
        result = urlparse(x)
        # check non-empty attributes
        return all((result.scheme, result.netloc, result.path))
    except AttributeError:
        return False


mystr = StringIO("""https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560
http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0
1
304
365
https://en.wikipedia.org/wiki/Railway_Board
https://en.wikipedia.org/wiki/Railway_Board#History""")

# replace mystr with 'file.csv'
df = pd.read_csv(mystr, header=None, names=['values'])

# apply filter based on checker function
df = df[df['values'].apply(url_validator)]

print(df)
Output:

                                              values
0  https://www.zaubacorp.com/company/HINDUSTAN-CA...
1  http://www.indianrailways.gov.in/railwayboard/...
5        https://en.wikipedia.org/wiki/Railway_Board
6  https://en.wikipedia.org/wiki/Railway_Board#Hi...
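The same idea can be adapted to the layout in the question, where the file has a header row and a column named url. The following is a rough sketch rather than a definitive solution. The names is_url and clean_urls are hypothetical, and the check is a deliberate variation on url_validator: it accepts only http and https schemes and does not require a path, because url_validator as written would reject a bare host such as https://example.com (its path is empty). An in-memory sample stands in for url_file.csv so the snippet runs on its own.

```python
from io import StringIO
from urllib.parse import urlparse

import pandas as pd


def is_url(value):
    """Return True if value parses as an http(s) URL with a host."""
    if not isinstance(value, str):
        return False  # NaN or numeric cells are never URLs
    try:
        parts = urlparse(value.strip())
    except ValueError:
        return False  # urlparse rejects some malformed inputs
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def clean_urls(source, column="url"):
    # Read the column as text so numeric-looking cells stay strings
    df = pd.read_csv(source, dtype={column: str})
    # Keep only the rows whose value passes the URL check
    return df[df[column].apply(is_url)].reset_index(drop=True)


# Assumption: stand-in for 'url_file.csv' from the question
sample = StringIO(
    "url\n"
    "https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560\n"
    "http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0\n"
    "1\n"
    "304\n"
    "365\n"
    "https://en.wikipedia.org/wiki/Railway_Board\n"
    "https://en.wikipedia.org/wiki/Railway_Board#History\n"
)

cleaned = clean_urls(sample)
print(cleaned)

# To write the result back out, as in the question:
# cleaned.to_csv("clean_url.csv", index=False)
```

Passing index=False to to_csv keeps the row index out of the output file, so the cleaned CSV has the same single url column as the input.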