Question

I have a list of URLs that I need to scrape with Python. I am using the code below to extract the text:

import urllib.request
from bs4 import BeautifulSoup

def extract_url_data1(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Drop <script> and <style> elements so only visible text remains
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    # Strip each line, split on double spaces, and rejoin the non-empty pieces
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = " ".join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))

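I store the returned data in a text file. The problem is that some URLs return data of the form "xbd5\xef\xbf\xbdFDK\xef\xbf\xbdCP\xef\xbf\xbdHP\xef\xbf\xbd\xef\xbf\xbd6N". I only want to store proper English words in the text file. Please tell me how to achieve this. I have already tried some regular expressions, such as the one below: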

re.sub(r'[^\x00-\x7f]',r' ',text)

Answer 1

If you want to remove the non-English characters, here you go:

In [1]: import re

In [2]: s = "xbd5\xef\xbf\xbdFDK\xef\xbf\xbdCP\xef\xbf\xbdHP\xef\xbf\xbd\xef\xbf\xbd6N"

In [3]: ' '.join(re.findall(r'\w+', s))
Out[3]: 'xbd5 FDK CP HP 6N'
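
The session above gives that output when `s` is a byte string (Python 2). On Python 3, which the question's `urllib.request` call suggests, `str` is Unicode and `\w` also matches characters such as `ï` and `½`, so the result would differ. A minimal sketch of the same idea for Python 3, assuming you only want ASCII word characters; the function name `ascii_words` is made up for illustration:

```python
import re

def ascii_words(text):
    """Return the runs of ASCII word characters in text, joined by single spaces."""
    # re.ASCII restricts \w to [a-zA-Z0-9_], so non-ASCII letters act as separators.
    return " ".join(re.findall(r"\w+", text, flags=re.ASCII))

s = "xbd5\xef\xbf\xbdFDK\xef\xbf\xbdCP\xef\xbf\xbdHP\xef\xbf\xbd\xef\xbf\xbd6N"
print(ascii_words(s))  # xbd5 FDK CP HP 6N
```

Note also that on Python 3, `str(text.encode('utf-8'))` returns the printable representation of a bytes object (`"b'...'"`), which is where the literal `\xef\xbf\xbd` sequences in the file come from; those three bytes are the UTF-8 encoding of U+FFFD, the replacement character. Returning `text` itself and opening the output file with `encoding='utf-8'` would likely avoid them.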

However, if you want to keep only valid English words, you need to validate them. The question "How to check if a word is an English word with Python?" will help you with that.
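
As a rough sketch of what that validation could look like, assuming you already have a set of known English words from some dictionary source: the name `keep_known_words` and the tiny `vocabulary` set below are placeholders for illustration, not part of any library.

```python
import re

def keep_known_words(text, vocabulary):
    """Keep only the alphabetic tokens whose lowercase form is in vocabulary."""
    # Split the text into runs of ASCII letters; everything else is a separator.
    tokens = re.findall(r"[A-Za-z]+", text)
    return " ".join(token for token in tokens if token.lower() in vocabulary)

# Placeholder vocabulary; in practice this would be loaded from a real word list.
vocabulary = {"hello", "world"}
print(keep_known_words("Hello \ufffd xbd FDK world", vocabulary))  # Hello world
```

A set gives constant-time lookups, so this stays fast even with a large word list.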
