Scraping only the text from a website

Time: 2016-10-26 07:31:10

Tags: python web-scraping beautifulsoup data-cleaning

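I want to scrape the meaningful text from any given website.

Meaningful = English dictionary words.

For example, I am trying to get the text from this website.

The code I am using: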

import urllib2
from bs4 import BeautifulSoup
import sys

# Everything printed below goes to this file instead of the console.
sys.stdout = open("test_data.txt", "w")
url2 = "http://apk-mania.com"

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A'}
req = urllib2.Request(url2, None, headers)
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
html = urllib2.urlopen(req, timeout=60).read()
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")


def normalize(raw):
    # Strip each line, split on double spaces, drop empty chunks, lowercase.
    lines = (line.strip() for line in raw.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return ' '.join(chunk for chunk in chunks if chunk).lower()


# 1. Page title (guarded, because a page may have no usable <title>).
text1 = u''
if soup.title is not None and soup.title.string:
    text1 = normalize(soup.title.string)

# 2. Meta description, trying both spellings of the attribute value.
desc = soup.find(attrs={'name': 'Description'})
if desc is None:
    desc = soup.find(attrs={'name': 'description'})
text2 = u''
if desc is not None and desc.get('content'):
    text2 = normalize(desc['content'])

# 3. Body text, after removing <script> and <style> elements.
for script in soup(["script", "style"]):
    script.extract()
text3 = normalize(soup.get_text())

# Join with spaces so the last word of one part does not fuse with the first word of the next.
corpus = ' '.join(part for part in (text1, text2, text3) if part)

print corpus.encode('utf-8')

The text I get back is not meaningful at all. It consists mostly of junk that is unsuitable for machine learning. See here

I know that I have to clean the data after fetching the text. I have done that as well, but most of the data is still junk. See here

My question is: can I do something at the scraping level so that the retrieved text is more meaningful?
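To make "at the scraping level" concrete, here is a minimal sketch of one thing that could be tried: collecting only the text nodes that sit outside elements a browser never renders. It is written for Python 3 and uses the standard library's `html.parser` rather than BeautifulSoup, purely so that it runs without extra packages. The names `SKIPPED_TAGS`, `VisibleTextParser` and `visible_text` are made up for this illustration, the tag list is an assumption, and whether this removes the junk on the page in question is untested.

```python
from html.parser import HTMLParser

# Tags whose contents are not rendered as page text (assumed, non-exhaustive list).
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not nested inside any of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while we are inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # HTML comments never reach this method, so they are dropped as well.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the text outside skipped elements, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Compared with the snippet above, the only extra elements skipped here are `noscript` and `template`; text hidden through CSS or inline attributes would still come through and would need separate handling.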

Or is doing more data cleaning the only option? If so, what else should I do in the data-cleaning step?
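Since "meaningful" is defined above as English dictionary words, one cleaning step that follows directly from that definition is a vocabulary filter. The sketch below (Python 3) is only an illustration: `load_words` and `keep_dictionary_words` are hypothetical helper names, and the word list is assumed to be a plain file with one word per line (many Unix systems ship one at `/usr/share/dict/words`).

```python
import re


def load_words(path):
    """Read a one-word-per-line file into a lowercase set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def keep_dictionary_words(text, vocabulary):
    """Keep only the alphabetic tokens of `text` that appear in `vocabulary`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(token for token in tokens if token in vocabulary)


if __name__ == "__main__":
    # Tiny stand-in vocabulary; a real run would use load_words(...) instead.
    vocabulary = {"free", "android", "games", "download"}
    print(keep_dictionary_words("Download FREE android games dmfyifwegqzdkwyj", vocabulary))
```

A strict filter like this also discards names, brand terms and misspellings that are not in the word list, so it trades recall for cleanliness.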

It is mainly this kind of text that troubles me:

dmfyifwegqzdkwyjcedvgxhgzmfxnzhcedmxhgzmfxnjfcedyxiiwixhgmxnzrcedcxhgrvxmfcedmxxhgznxmzvcedmxhgzmfxmzbcedmxxhgzofxmzjcedmxhgzqvxmzrcedmxhgzmyisilxntjceduxhgmxntbcedyxhgnvxnzjcedqzxhgrlxnkvcedzfxhgnvxnjncedcxhgovxnkzcedzfiiwixhgrfxnkzceddbxhgmlxntrcedqzxhgmfxnjvcedyxhgmlxndncedzgxhgrvxnkvcedyxhgmxnzrcedyxhgrlxnkuilcjcedcxhgnvxnjjcedzcxhgovxnzrceduyxhgnfxndnceduwxhgnvxnjvcedcyxhgmxnkzcedzfxhgrvxnjvcedyzxhgnfxnjlcedzgxhgrsisilxnkzcedzfxhgovxnjncedyxhgmxnjfcedzfxhgnfxnjlcedyxhgmvxnzrcedyiiwixhgmxnjfcedzfxhgnfxnjlcedyxhgmvxnzrcedyiiwixhgnvxnzhcedyxhgmyisilxnkrcedyxxhgnfxnjncedyiiwixhgmxnjncedcyxhgovxnzbcedciiwixhgmxnzjcedyxhgmvxnzrcedyxhgnvxnkncedyxhgrfxnjvcedzfxhgncisilxnjrcedyxxhgnfxnjfcedjexhgmxnjzcedyxxhgmxnzlcedzfxhgmyisilxnjncedcyxhgnvxnjfcedcxhgnvxndfcedcxhgnfxnzjcedyxhgmlxnzvcedcxhgnsisilxnzzcedyxxhgqxnzvcedyiiwixhgmxnjvcedcxhgmvxnzrcedcxhgmlxnjlcedyyxhgnvxnzrcedyxhgrvxnkzcedyxhgnsisilxmkyilcjcedczxhgmlxnjmilcjcedyxhgnfxnzrcedcwxhgzqvxmkzcedjgiiwixhgnvxnkvcedyzxhgrlxnjrcedyiiwixhgrlxnkvcedzdxhgrlxnjfcedyiiwixhgmxnkzcedzfxhgnlxnjlcedyiiwixhgrlxnkvcedyzxhgqxnjlcedyzxhgqiisilxnjeilcjcedyxhgmlxnjvcedyiiwixhgnvxnzjcedzdiiwixhgnfxnjfcedcyxhgnxnjvcedciiwixhgrlxmziilcjcedyxxhgmfxnzbcedyxhgrvxnjrcedqzxhgofxnjlcedzdxhgncisilxnjjcedzgxhgnfxnzkilcjcedyzxhgqxnjlcedyzxhgqiisilxnjrcedyxhgmxnzbcedyxxhgnfxnjncedyxhgnvxnzzcedyxhgrvxnzqilcjcedcyxhgnvxnkrcedzgxhgnlxnjvcedqzxhgofxnjlcedzdxhgncisilxnzbcedyxxhgmlxnjvcedzfxhgnfxnevcedzgxhgnfxnjuilcjcedyxxhgmxnzlcedzfxhgmyisilxnzvcedzfxhgnfxnjvcedyxhgovxnkvcedyxhgncisilxnzncedyzxhgmlxnjlcedcwxhgnfxnzmilcjcedyxhgnvxnzrcedqxhgqxnjvcedzexhgnvxnkvcedcxhgmxndjcedcxhgnfxnjfcedyxhgrvxnjfcedzexhgnsisilxnjlcedzfxhgmxnjvcedcyxhgnfxndjcedyxhgnlxnkzcedcyxhgnsisiiisilxnjncedcyxhgnvxnjfcedcxhgnvxndrcedyxxhgnfxnjfcedqzxhgofxnjfcedzfxhgrvxnjvcedzdiiwixhgmxnjvcedcxhgqxnkzcedyzxhgmvxnkncedqxhgnvxnzncedyzxhgmlxnjlcedcwxhgnfxnjlcedzgxhgrsisilxnjncedcyxhgnvxnjfcedcxhgnvxnezcedyxhgnlxnjvcedcyiiwixhgqxnjvcedzfxhgnxnzrcedyiiwixhgmxnzvcedyyxhgmxnzrcedcyiiwixhgzm
fxmzbcedmwiiwixhgzmfxmzbcedmxiiwixhgnlxnzjcedzgxhgrfxndncedyxhgmvxnzjcedqzxhgrlxnjrcedyiiwixhgmxnkzcedzfxhgmxnjfcedciiwixhgmlxnjfcedzfxhgnfxnkzcedzeiiwixhgnlxnkncedzgxhgrlxnziilcjcedcyxhgnvxnzbcedzdxhgmvxnjncedyiiwixhgmvxndjcedqzxhgnfxndvcedqxhgnxndhcedqxhgqvxnejcedrdxhgrfxnevcedrgxhgmfxntfceduyxhgmxntrceduxhgnlxntdceduxhgovxnufcedyxxhgmlxnjncedyxhgnvxnjzcedyxhgofxnjlcedzbxhgqlxnkncedzexhgrvxnkzcedcwxhgmvxnzjcedczxhgnfxnzvcedcxhgnxnzhcedcxhgqvxmzbcedmxxhgzmlxmzncedmxhgznvxmzzcedmxhgzofxmzlcedjcxhgyrlxmqilcjcedyzxhgofxnjfcedcyxhgmxnkzcedyxhgnvxndfcedciiwixhgmxnjhcedyxxhgmlxndfcedciiwixhgrlxmzaixtsoznvuyrpbokxtpzihblkkdapptawkbdbxzbzdlkovswxvpezhcibfmhgntqedeznvuyrpboxzbotunxgkxtyxigxzbotunxhhpsbuzxcgkhdpbmrvdtfmhhkowqwzjdxxxdluzgwwegqzdlbmdfhxawkbdbxzbzdlkovsxvpkhtpyvtzxjzxjzoltdxjsolwegqzdlbmvxxsewdglvbmfsoltunrwrgfyunoywuzwxzoiewfvkttfmhgntqegfbxzbzdlkovsxvznvuyrpboxzbotunxhikxtpzihfmhgntqegjbxzbzdlkovsxvmjihfmhgntqegilyhbmcxxsxldnkfwuwzatovmswzfslmwetzjatovmswfsgwetzjatovmswfslnpltfmhhkowqwzddxshfmhgntqegjbxzbzdlkovsxvbxzbzdlkovsxvpwzfdkslawyoivwedkndvnyymivwedkndvyltfmhhkowqwzhdxsgvxigxotjcljeofwufdeovwumjuxcmtbclnwxnzjcligxwzytovmlxkfdnbmdfdkskvksymivwedkndvyltfmhhkowqwzhdxsgvxlthlwywltldezesnhoolthlwywltldezesnhpezdjcpkxtfmhgntqedcicewonjcmlwdevszwlbnqzgjdwlbnrbxzbzdlkovsxmfdkfwegqzdlbovponjcmlwdengqvnbmmzgjdwlbnrbxzbzdlkovsxmldkfwegqzdlbmtfdkttzyjpchrdrkftewjwwegqzdlbmtndxtgitecnyaxbrwxlbwvudftfmhhkowqwzexvocnyaxbqzbuluyykytfmhgntqedqrxzbzdlkovsxnvrkfwedkndvostfmhgntqedqpkwegqzdlbmtvdkwedkndvnttzyjpchrfbgvtzwwwegqzdlbmtzdxtfmhhkowqwzexstfmhgntqegirxzbzdlkovsxnvrxzbotunxgwwegqzdlbmthdxshhkttzyjpchrfbgvtzwwwegqzdlbmtldxtmdwjdglvbigpelmkhzvawqomckpxdpbmrvdtfmhhkowqwziwxvpelbhnlihtmbiodmfyifwedkndvyybpbibawkbdbxzbzdlkovsymfdkxtfmhgntqedzbxzbotunxgykfwedkndvyyldpvwedkndvmihjbmawdbxzbotunxhjxslorvyvtzwwwegqzdlbmjfdxtmdwjdglvbigpelmkcffmhgntqedmpewedkndvmzxozhcibfmhgntqegqzgjdwlbnrbxzbzdlkovsxmfdkfwegqzdlbmjjdkttfmhgntqegrbxzbzdlkovsymdpvwedkndvnltfmhhkowqwzixvxzbotu
nxhkwwegqzdlbmjvdxtfmhhkowqwzixttkbnbwvudftfmhhkowqwzixvbxzbzdlkovsyndkfwedkndvzckdmfyifwedkndvztgbmvievdxnlrxzlbnqoxzbzdlkovsyovseyjcedcxhgovxnjvcedcijpawkbcsilxnjjcedcxhgmlxnjjcedzdxhgnvxnzmionrydwusilxnjncedyxxhgrvxnjncedyxhgqxnjfcedyyxhgqxnjuiomzhbhnlfskxzbotunxhkwwegqzdlbmzbdxshfmhgntqegupowedkndvzftfmhhkowqwzmyxvbxzbzdlkovszmvdkfwedkndvzclfxonjcmlwdevszwlbnrbxzbzdlkovszmdpsahmttyxigxzbotunxhmowegqzdlbmzrditihrcgvvzibkbnbwvudftfmhhkowqwzmxvmjihfmhgntqegyzgjdwlbnrbxzbzdlkovsznvdwzbdk
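Strings like the one above are single tokens that run to hundreds or thousands of characters, so a simple heuristic can flag them without a dictionary. The following is a rough Python 3 sketch under stated assumptions: the cutoffs `MAX_TOKEN_LENGTH` and `MIN_VOWEL_RATIO` are arbitrary illustrative values, and `looks_like_junk` and `drop_junk_tokens` are hypothetical names, not part of any library.

```python
MAX_TOKEN_LENGTH = 25   # assumed cutoff; ordinary English words are shorter
MIN_VOWEL_RATIO = 0.2   # assumed cutoff for tokens of five or more letters


def looks_like_junk(token):
    """Heuristically decide whether a whitespace-delimited token is noise."""
    letters = [ch for ch in token.lower() if ch.isalpha()]
    if not letters:
        return True                      # no letters at all, e.g. "12345"
    if len(token) > MAX_TOKEN_LENGTH:
        return True                      # far longer than a normal word
    vowels = sum(ch in "aeiou" for ch in letters)
    return len(letters) >= 5 and vowels / len(letters) < MIN_VOWEL_RATIO


def drop_junk_tokens(text):
    """Remove tokens flagged by looks_like_junk and rejoin the rest."""
    return " ".join(tok for tok in text.split() if not looks_like_junk(tok))
```

The heuristic can misfire in both directions: a real word with no a/e/i/o/u, such as "rhythms", is dropped, while a short random string with enough vowels is kept.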

0 answers:

No answers