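I want to scrape meaningful words from any given website.
Meaningful = English dictionary words.
For example, I am trying to get the text from this website.
Code used: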
import urllib2
from bs4 import BeautifulSoup
import sys
from memory_profiler import profile

# Redirect everything printed below into a file.
sys.stdout = open("test_data.txt", "w")

url2 = "http://apk-mania.com"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A'}
req = urllib2.Request(url2, None, headers)
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
html = urllib2.urlopen(req, timeout=60).read()
soup = BeautifulSoup(html)

# Page title, whitespace-normalized and lowercased.
list1 = soup.title.string
lines = (line.strip() for line in list1.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = ' '.join(chunk for chunk in chunks if chunk)
text1 = text.lower()

# Meta description; the attribute value may or may not be capitalized.
desc = soup.find(attrs={'name': 'Description'})
if desc is None:
    desc = soup.find(attrs={'name': 'description'})
try:
    list2 = desc['content']
    c = 2  # description found
except Exception as e:
    c = 1  # no usable description
if c == 2:
    lines = (line.strip() for line in list2.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    text = ' '.join(chunk for chunk in chunks if chunk)
    text2 = text.lower()

# Body text: remove <script> and <style> elements before extracting text.
for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = ' '.join(chunk for chunk in chunks if chunk)
text3 = text.lower()

# Combine title, optional description, and body text.
corpus = text1 + text3
if c == 2:
    corpus = text1 + text2 + text3
print corpus.encode('utf-8')
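The text I get is not meaningful at all. It consists mostly of junk text that is unsuitable for machine learning. See here.
I know that I have to clean the data after fetching the text. I have done that too, but most of the data is still junk. See here.
My question is: is there anything I can do at the scraping level so that the retrieved text is more meaningful?
Or is more data cleaning the only option? If so, what else should I do in the data-cleaning step?
It is mainly this kind of text that bothers me: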
dmfyifwegqzdkwyjcedvgxhgzmfxnzhcedmxhgzmfxnjfcedyxiiwixhgmxnzrcedcxhgrvxmfcedmxxhgznxmzvcedmxhgzmfxmzbcedmxxhgzofxmzjcedmxhgzqvxmzrcedmxhgzmyisilxntjceduxhgmxntbcedyxhgnvxnzjcedqzxhgrlxnkvcedzfxhgnvxnjncedcxhgovxnkzcedzfiiwixhgrfxnkzceddbxhgmlxntrcedqzxhgmfxnjvcedyxhgmlxndncedzgxhgrvxnkvcedyxhgmxnzrcedyxhgrlxnkuilcjcedcxhgnvxnjjcedzcxhgovxnzrceduyxhgnfxndnceduwxhgnvxnjvcedcyxhgmxnkzcedzfxhgrvxnjvcedyzxhgnfxnjlcedzgxhgrsisilxnkzcedzfxhgovxnjncedyxhgmxnjfcedzfxhgnfxnjlcedyxhgmvxnzrcedyiiwixhgmxnjfcedzfxhgnfxnjlcedyxhgmvxnzrcedyiiwixhgnvxnzhcedyxhgmyisilxnkrcedyxxhgnfxnjncedyiiwixhgmxnjncedcyxhgovxnzbcedciiwixhgmxnzjcedyxhgmvxnzrcedyxhgnvxnkncedyxhgrfxnjvcedzfxhgncisilxnjrcedyxxhgnfxnjfcedjexhgmxnjzcedyxxhgmxnzlcedzfxhgmyisilxnjncedcyxhgnvxnjfcedcxhgnvxndfcedcxhgnfxnzjcedyxhgmlxnzvcedcxhgnsisilxnzzcedyxxhgqxnzvcedyiiwixhgmxnjvcedcxhgmvxnzrcedcxhgmlxnjlcedyyxhgnvxnzrcedyxhgrvxnkzcedyxhgnsisilxmkyilcjcedczxhgmlxnjmilcjcedyxhgnfxnzrcedcwxhgzqvxmkzcedjgiiwixhgnvxnkvcedyzxhgrlxnjrcedyiiwixhgrlxnkvcedzdxhgrlxnjfcedyiiwixhgmxnkzcedzfxhgnlxnjlcedyiiwixhgrlxnkvcedyzxhgqxnjlcedyzxhgqiisilxnjeilcjcedyxhgmlxnjvcedyiiwixhgnvxnzjcedzdiiwixhgnfxnjfcedcyxhgnxnjvcedciiwixhgrlxmziilcjcedyxxhgmfxnzbcedyxhgrvxnjrcedqzxhgofxnjlcedzdxhgncisilxnjjcedzgxhgnfxnzkilcjcedyzxhgqxnjlcedyzxhgqiisilxnjrcedyxhgmxnzbcedyxxhgnfxnjncedyxhgnvxnzzcedyxhgrvxnzqilcjcedcyxhgnvxnkrcedzgxhgnlxnjvcedqzxhgofxnjlcedzdxhgncisilxnzbcedyxxhgmlxnjvcedzfxhgnfxnevcedzgxhgnfxnjuilcjcedyxxhgmxnzlcedzfxhgmyisilxnzvcedzfxhgnfxnjvcedyxhgovxnkvcedyxhgncisilxnzncedyzxhgmlxnjlcedcwxhgnfxnzmilcjcedyxhgnvxnzrcedqxhgqxnjvcedzexhgnvxnkvcedcxhgmxndjcedcxhgnfxnjfcedyxhgrvxnjfcedzexhgnsisilxnjlcedzfxhgmxnjvcedcyxhgnfxndjcedyxhgnlxnkzcedcyxhgnsisiiisilxnjncedcyxhgnvxnjfcedcxhgnvxndrcedyxxhgnfxnjfcedqzxhgofxnjfcedzfxhgrvxnjvcedzdiiwixhgmxnjvcedcxhgqxnkzcedyzxhgmvxnkncedqxhgnvxnzncedyzxhgmlxnjlcedcwxhgnfxnjlcedzgxhgrsisilxnjncedcyxhgnvxnjfcedcxhgnvxnezcedyxhgnlxnjvcedcyiiwixhgqxnjvcedzfxhgnxnzrcedyiiwixhgmxnzvcedyyxhgmxnzrcedcyiiwixhgzm
fxmzbcedmwiiwixhgzmfxmzbcedmxiiwixhgnlxnzjcedzgxhgrfxndncedyxhgmvxnzjcedqzxhgrlxnjrcedyiiwixhgmxnkzcedzfxhgmxnjfcedciiwixhgmlxnjfcedzfxhgnfxnkzcedzeiiwixhgnlxnkncedzgxhgrlxnziilcjcedcyxhgnvxnzbcedzdxhgmvxnjncedyiiwixhgmvxndjcedqzxhgnfxndvcedqxhgnxndhcedqxhgqvxnejcedrdxhgrfxnevcedrgxhgmfxntfceduyxhgmxntrceduxhgnlxntdceduxhgovxnufcedyxxhgmlxnjncedyxhgnvxnjzcedyxhgofxnjlcedzbxhgqlxnkncedzexhgrvxnkzcedcwxhgmvxnzjcedczxhgnfxnzvcedcxhgnxnzhcedcxhgqvxmzbcedmxxhgzmlxmzncedmxhgznvxmzzcedmxhgzofxmzlcedjcxhgyrlxmqilcjcedyzxhgofxnjfcedcyxhgmxnkzcedyxhgnvxndfcedciiwixhgmxnjhcedyxxhgmlxndfcedciiwixhgrlxmzaixtsoznvuyrpbokxtpzihblkkdapptawkbdbxzbzdlkovswxvpezhcibfmhgntqedeznvuyrpboxzbotunxgkxtyxigxzbotunxhhpsbuzxcgkhdpbmrvdtfmhhkowqwzjdxxxdluzgwwegqzdlbmdfhxawkbdbxzbzdlkovsxvpkhtpyvtzxjzxjzoltdxjsolwegqzdlbmvxxsewdglvbmfsoltunrwrgfyunoywuzwxzoiewfvkttfmhgntqegfbxzbzdlkovsxvznvuyrpboxzbotunxhikxtpzihfmhgntqegjbxzbzdlkovsxvmjihfmhgntqegilyhbmcxxsxldnkfwuwzatovmswzfslmwetzjatovmswfsgwetzjatovmswfslnpltfmhhkowqwzddxshfmhgntqegjbxzbzdlkovsxvbxzbzdlkovsxvpwzfdkslawyoivwedkndvnyymivwedkndvyltfmhhkowqwzhdxsgvxigxotjcljeofwufdeovwumjuxcmtbclnwxnzjcligxwzytovmlxkfdnbmdfdkskvksymivwedkndvyltfmhhkowqwzhdxsgvxlthlwywltldezesnhoolthlwywltldezesnhpezdjcpkxtfmhgntqedcicewonjcmlwdevszwlbnqzgjdwlbnrbxzbzdlkovsxmfdkfwegqzdlbovponjcmlwdengqvnbmmzgjdwlbnrbxzbzdlkovsxmldkfwegqzdlbmtfdkttzyjpchrdrkftewjwwegqzdlbmtndxtgitecnyaxbrwxlbwvudftfmhhkowqwzexvocnyaxbqzbuluyykytfmhgntqedqrxzbzdlkovsxnvrkfwedkndvostfmhgntqedqpkwegqzdlbmtvdkwedkndvnttzyjpchrfbgvtzwwwegqzdlbmtzdxtfmhhkowqwzexstfmhgntqegirxzbzdlkovsxnvrxzbotunxgwwegqzdlbmthdxshhkttzyjpchrfbgvtzwwwegqzdlbmtldxtmdwjdglvbigpelmkhzvawqomckpxdpbmrvdtfmhhkowqwziwxvpelbhnlihtmbiodmfyifwedkndvyybpbibawkbdbxzbzdlkovsymfdkxtfmhgntqedzbxzbotunxgykfwedkndvyyldpvwedkndvmihjbmawdbxzbotunxhjxslorvyvtzwwwegqzdlbmjfdxtmdwjdglvbigpelmkcffmhgntqedmpewedkndvmzxozhcibfmhgntqegqzgjdwlbnrbxzbzdlkovsxmfdkfwegqzdlbmjjdkttfmhgntqegrbxzbzdlkovsymdpvwedkndvnltfmhhkowqwzixvxzbotu
nxhkwwegqzdlbmjvdxtfmhhkowqwzixttkbnbwvudftfmhhkowqwzixvbxzbzdlkovsyndkfwedkndvzckdmfyifwedkndvztgbmvievdxnlrxzlbnqoxzbzdlkovsyovseyjcedcxhgovxnjvcedcijpawkbcsilxnjjcedcxhgmlxnjjcedzdxhgnvxnzmionrydwusilxnjncedyxxhgrvxnjncedyxhgqxnjfcedyyxhgqxnjuiomzhbhnlfskxzbotunxhkwwegqzdlbmzbdxshfmhgntqegupowedkndvzftfmhhkowqwzmyxvbxzbzdlkovszmvdkfwedkndvzclfxonjcmlwdevszwlbnrbxzbzdlkovszmdpsahmttyxigxzbotunxhmowegqzdlbmzrditihrcgvvzibkbnbwvudftfmhhkowqwzmxvmjihfmhgntqegyzgjdwlbnrbxzbzdlkovsznvdwzbdk
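To make "meaningful = English dictionary words" concrete, this is roughly the kind of cleaning step I have in mind: keep only tokens that appear in an English word list. It is only a sketch under assumptions. `ENGLISH_WORDS` is a tiny hypothetical stand-in for a real word list, and `keep_dictionary_words` and `max_len` are names made up for illustration. It is written for Python 3 and uses only the standard library.

```python
import re

# Hypothetical stand-in for a real English word list
# (for example, a file with one word per line loaded into a set).
ENGLISH_WORDS = {"download", "free", "android", "apps", "and", "games"}

# Tokens are runs of lowercase ASCII letters; digits and punctuation split tokens.
TOKEN_RE = re.compile(r"[a-z]+")


def keep_dictionary_words(text, vocabulary, max_len=20):
    """Return the tokens of `text` that are in `vocabulary`.

    `max_len` is an assumed cutoff that rejects very long letter runs
    before the (more expensive, for large inputs) vocabulary lookup.
    """
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if len(t) <= max_len and t in vocabulary]


# The long token is the start of the junk text shown above.
sample = "Download free Android apps dmfyifwegqzdkwyjcedvgxhgzmfxnzhced and games"
print(" ".join(keep_dictionary_words(sample, ENGLISH_WORDS)))
# prints: download free android apps and games
```

A filter like this would drop the junk tokens, but it would also drop names, abbreviations, and anything else missing from the word list, so I am not sure it is the right approach on its own.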