Answer 0: (score: 1)
Your problem comes from the way the text file is read. In the for link in url_list loop, the first value of link is http://www.scientific.net/MSF\n; the trailing \n is what causes the Bad Request error. Strip the \n from each line as you read it and your code will work. Note that your last line appears to have no trailing \n, so if you simply use url_list.append(line[:-1]), it will fail on the last line.
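A minimal sketch of that fix, assuming the URLs live in a file named urls.txt (the filename is illustrative):

import requests

url_list = []
with open("urls.txt") as f:
    for line in f:
        # rstrip("\n") drops a trailing newline if present and is a no-op
        # on a final line without one, so it is safer than line[:-1]
        url_list.append(line.rstrip("\n"))

for link in url_list:
    r = requests.get(link)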
Answer 1: (score: 0)
private static void testHadoop(Pipeline pipeline) {
    // InputFormat: read <LongWritable, Text> records from HDFS
    @SuppressWarnings("unchecked")
    Class<? extends FileInputFormat<LongWritable, Text>> inputFormatClass =
            (Class<? extends FileInputFormat<LongWritable, Text>>)
                    (Class<?>) TextInputFormat.class;
    HadoopIO.Read.Bound<LongWritable, Text> readPTransform =
            HadoopIO.Read.from("hdfs://localhost:9000/tmp/kinglear.txt",
                    inputFormatClass,
                    LongWritable.class,
                    Text.class);
    PCollection<KV<LongWritable, Text>> textInput = pipeline.apply(readPTransform)
            .setCoder(KvCoder.of(WritableCoder.of(LongWritable.class),
                    WritableCoder.of(Text.class)));

    // OutputFormat: write the transformed records back to HDFS
    @SuppressWarnings("unchecked")
    Class<? extends FileOutputFormat<LongWritable, Text>> outputFormatClass =
            (Class<? extends FileOutputFormat<LongWritable, Text>>)
                    (Class<?>) TemplatedTextOutputFormat.class;
    HadoopIO.Write.Bound<LongWritable, Text> writePTransform =
            HadoopIO.Write.to("hdfs://localhost:9000/tmp/output",
                    outputFormatClass, LongWritable.class, Text.class);

    textInput.apply(ParDo.of(new ParDoFn())).apply(writePTransform.withoutSharding());
    pipeline.run().waitUntilFinish();
}
import requests
from bs4 import BeautifulSoup

r = requests.get(link)  # link comes from the url_list loop above
soup = BeautifulSoup(r.content, "html.parser")
# title = soup.title  -- this only finds the first <title>
titles = soup.find_all('title')
for title in titles:
    title.string = title.get_text(strip=True)
    print(str(title))
soup.title is a shortcut for .find('title'), which returns only the first match; you should use .find_all(), which returns all matches.
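A quick illustration of the difference, using a made-up HTML snippet:

from bs4 import BeautifulSoup

html = "<title>First</title><title>Second</title>"  # hypothetical snippet
soup = BeautifulSoup(html, "html.parser")

print(soup.title)              # <title>First</title> -- first match only
print(soup.find('title'))      # same as soup.title
print(soup.find_all('title'))  # [<title>First</title>, <title>Second</title>]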