Question

I'm trying to isolate one specific image link from a web page, but I can't quite get it to work. The HTML looks like:

<head>
   <img alt="Generic title" src="https://genericURL/photo/picture.jpg/"> 
   <img src="https://genericurl/.../">
   <img src="https://genericurl/.../">
   ....

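I can get many links back, but the one I specifically want is the top link shown above, which is the only one containing /photo/picture.jpg. I've tried the answers from "Find specific link text with bs4" and other variations, but haven't figured it out yet. Could someone take a look?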

My code:

links = soup.findAll('img', {'src': re.compile('^http://image\d+')})
for link in links:
     print(link.text)

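Edit: Following the suggestions, I realized that the link format changes depending on the filter I use. For example, when I print the whole page, I see the links as http://image.... But when I use findAll('img', {'src' ..., the links come out as https://img, so I was compiling a regex for the wrong pattern.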

Answer 1

soup.find_all("img", alt="Generic title")

You should use alt as the filter.
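
As a rough sketch of how that could look end to end, the snippet below filters on alt and, as an alternative, on the /photo/picture.jpg fragment of src. The sample HTML is modeled on the question; the two extra image URLs are made-up placeholders, not the real ones. Since an img tag has no text content, the URL is read from the src attribute rather than from .text.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modeled on the question.
# The second and third URLs are hypothetical placeholders.
html = """
<img alt="Generic title" src="https://genericURL/photo/picture.jpg/">
<img src="https://genericurl/other/one.jpg/">
<img src="https://genericurl/other/two.jpg/">
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: filter on the alt attribute, as suggested in this answer.
by_alt = soup.find("img", alt="Generic title")

# Option 2: match any src containing the path fragment, regardless of
# whether the URL starts with http:// or https://.
by_src = soup.find("img", src=re.compile(r"/photo/picture\.jpg"))

# find() returns None when nothing matches, so guard before indexing.
alt_url = by_alt["src"] if by_alt else None
src_url = by_src["src"] if by_src else None

print(alt_url)
print(src_url)
```

Matching on the path fragment instead of anchoring the pattern with ^http:// may also sidestep the http/https mismatch described in the question's edit, assuming the fragment is unique on the page.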

Answer 2

var React = require('react');
var NumberList = require('./NumberList');

class App extends React.Component {
  constructor(props) {
    super(props);
  }

  render() {
    return <div>
      <h1>Hello {this.props.name} from React!</h1>
      <NumberList numbers={[1,2,3,4,5]} />
    </div>;
  }
}

module.exports = App;

Python web scraping: finding a specific link

2 answers: