Answer 0: (score: 1)
Your problem comes from the way the text file is read. In the for link in url_list loop, the first value of link is http://www.scientific.net/MSF\n; the trailing \n is what causes the Bad Request error. Strip the \n from each line as you read it and your code will work. Note that your last line appears to have no trailing \n, so if you simply use url_list.append(line[:-1]), it will fail on the last line.
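A minimal sketch of that fix, assuming the URLs live in a file named urls.txt (the filename is illustrative):

import requests

url_list = []
with open("urls.txt") as f:
    for line in f:
        # rstrip("\n") drops a trailing newline if present and is a no-op
        # on a final line without one, so it is safer than line[:-1]
        url_list.append(line.rstrip("\n"))

for link in url_list:
    r = requests.get(link)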
Answer 1: (score: 0)
private static void testHadoop(Pipeline pipeline) {
    // InputFormat: read <LongWritable, Text> records from HDFS
    @SuppressWarnings("unchecked")
    Class<? extends FileInputFormat<LongWritable, Text>> inputFormatClass =
            (Class<? extends FileInputFormat<LongWritable, Text>>)
                    (Class<?>) TextInputFormat.class;
    HadoopIO.Read.Bound<LongWritable, Text> readPTransform =
            HadoopIO.Read.from("hdfs://localhost:9000/tmp/kinglear.txt",
                    inputFormatClass,
                    LongWritable.class,
                    Text.class);
    PCollection<KV<LongWritable, Text>> textInput = pipeline.apply(readPTransform)
            .setCoder(KvCoder.of(WritableCoder.of(LongWritable.class),
                    WritableCoder.of(Text.class)));

    // OutputFormat: write the transformed records back to HDFS
    @SuppressWarnings("unchecked")
    Class<? extends FileOutputFormat<LongWritable, Text>> outputFormatClass =
            (Class<? extends FileOutputFormat<LongWritable, Text>>)
                    (Class<?>) TemplatedTextOutputFormat.class;
    HadoopIO.Write.Bound<LongWritable, Text> writePTransform =
            HadoopIO.Write.to("hdfs://localhost:9000/tmp/output",
                    outputFormatClass, LongWritable.class, Text.class);

    textInput.apply(ParDo.of(new ParDoFn())).apply(writePTransform.withoutSharding());
    pipeline.run().waitUntilFinish();
}
import requests
from bs4 import BeautifulSoup

r = requests.get(link)  # link comes from the url_list loop above
soup = BeautifulSoup(r.content, "html.parser")
# title = soup.title  -- this only finds the first <title>
titles = soup.find_all('title')
for title in titles:
    title.string = title.get_text(strip=True)
    print(str(title))
soup.title is a shortcut for .find('title'), which returns only the first match; you should use .find_all(), which returns all matches.
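A quick illustration of the difference, using a made-up HTML snippet:

from bs4 import BeautifulSoup

html = "<title>First</title><title>Second</title>"  # hypothetical snippet
soup = BeautifulSoup(html, "html.parser")

print(soup.title)              # <title>First</title> -- first match only
print(soup.find('title'))      # same as soup.title
print(soup.find_all('title'))  # [<title>First</title>, <title>Second</title>]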