Question

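I'm still very new to Python, and I'm trying to implement a function that outputs the number of unique words found in paragraph tags, after processing the text in several ways. First: retrieve all the text contained in the paragraph tags and convert it to lowercase. Second: remove punctuation, for which I'm using str.translate(str.maketrans('','',string.punctuation)). Third: tokenize the text into words by splitting on whitespace. Fourth: output the number of unique words.

Here is my code: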

import urllib
def getLength(url):
    r=urllib.request.urlopen(url).read()
    soup = BeautifulSoup(r, 'html.parser')
    links = soup.find_all('p')
    k=[]
    for p in links:
        if not p.find('a'):
            pText = p.get_text()
            k=k.append(pText)
        k=k.lower()
        translator=str.translate(str.maketrans('','',string.punctuation))
        k=k.translate(translator)
    #missing code
getLength("https://en.wikipedia.org/wiki/Google")

I tried printing the values and found that my logic is incorrect. I don't know how to fix it and move forward. Please help.

Edit:

import urllib
def getLength(url):
    r=urllib.request.urlopen(url).read()
    soup = BeautifulSoup(r, 'html.parser')
    links = soup.find_all('p')
    for p in links:
        pText = p.get_text()
        pText=pText.lower()
        transpText=pText.translate(pText.maketrans('','',string.punctuation))
        print(transpText)
        newdata=transpText.split()
        length=len(newdata)
        return length
getLength("https://en.wikipedia.org/wiki/Google")

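I got this far, but I don't understand the tokenization part. For some reason I get a length of 0. What am I doing wrong, or what should I do instead?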

Answer 1

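import string
import urllib.request

import numpy as np
from bs4 import BeautifulSoup

def getLength(url):
    # Download the page and parse the HTML.
    r = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(r, 'html.parser')
    links = soup.find_all('p')
    k = []  # every word from every paragraph
    for p in links:
        pText = p.get_text()
        pText = pText.lower()
        # Delete punctuation, then split on whitespace.
        transpText = pText.translate(str.maketrans('', '', string.punctuation))
        newdata = transpText.split()
        k += newdata
    # Count distinct words only after all paragraphs have been collected.
    n = np.unique(k)
    return len(n)

getLength("https://en.wikipedia.org/wiki/Google")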

After many attempts, this is the code I ended up with; it seems to work correctly across a variety of test cases.
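
One likely reason the edited attempt in the question returned 0 is that its return statement sits inside the for loop, so the function exits after the first paragraph; if that paragraph happens to be empty, the count is 0. The sketch below isolates the text-processing steps (lowercase, strip punctuation, split on whitespace, count unique words) and uses a built-in set instead of numpy. The function name count_unique_words and the sample strings are hypothetical, chosen only for illustration; the strings stand in for the results of p.get_text().

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def count_unique_words(paragraph_texts):
    """Return the number of distinct words across all paragraph strings."""
    unique_words = set()
    for text in paragraph_texts:
        # Lowercase, drop punctuation, then split on whitespace.
        cleaned = text.lower().translate(_PUNCT_TABLE)
        unique_words.update(cleaned.split())
    # Return only after every paragraph has been processed.
    return len(unique_words)

# Hypothetical sample input; the empty first string mimics an empty <p> tag.
sample = ["", "Hello, world!", "hello again; World."]
print(count_unique_words(sample))  # prints 3
```

With BeautifulSoup, this could be fed with something like count_unique_words(p.get_text() for p in soup.find_all('p')). Note that string.punctuation covers ASCII punctuation only, so characters such as curly quotes would remain attached to words.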

Finding the number of unique words in paragraph tags using BeautifulSoup
