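How would I access a txt file and split up the links

Time: 2018-10-16 00:35:35

Tags: c# parsing

OK, I have a program that grabs links from a website and puts them into a txt file, but the links are not separated onto their own lines. I need some way to do that without going through the file by hand. Below is the code that fetches the links from the website, writes them to a text file, and then reads that txt file back in.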

    private void linkLabel1_LinkClicked(object sender, LinkLabelLinkClickedEventArgs e)
    {
        var client = new WebClient();

        string text = client.DownloadString("https://currentlinks.com");

        File.WriteAllText("C:/ProgramData/oof.txt", text);

        string searchKeyword = "https://foobar.to/showthread.php";
        string fileName = "C:/ProgramData/oof.txt";
        string[] textLines = File.ReadAllLines(fileName);
        List<string> results = new List<string>();

        foreach (string line in textLines)
        {
            if (line.Contains(searchKeyword))
            {
                results.Add(line);
            }
            var sb = new StringBuilder();
            foreach (var item in results)
            {
                sb.Append(item);
            }

            textBox1.Text = sb.ToString();

            var parsed = textBox1;

            TextWriter tw = new StreamWriter("C:/ProgramData/parsed.txt");

            // write lines of text to the file
            tw.WriteLine(parsed);

            // close the stream
            tw.Close();
        }
    }

2 Answers:

Answer 0 (score: 0)

The .Split approach

Could you use yourString.Split("https://");?

Example:

//This simple example assumes that all links are https (not http)
string contents = "https://www.example.com/dogs/poodles/poodle1.htmlhttps://www.example.com/dogs/poodles/poodle2.html";

const string Prefix = "https://";
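//The string[] overload is used here because Split(string, StringSplitOptions) is not available on .NET Framework
var linksWithoutPrefix = contents.Split(new[] { Prefix }, StringSplitOptions.RemoveEmptyEntries);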

//using System.Linq
var linksWithPrefix = linksWithoutPrefix.Select(l => Prefix + l);
foreach (var match in linksWithPrefix)
{
    Console.WriteLine(match);
}
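
Since the goal in the question is a file with one link per line, the split result can be handed to File.WriteAllLines, which writes each element on its own line. The following is only a minimal sketch under the same assumption as the example above (every link starts with https://). The class name LinkFileWriter, the method SplitHttpsLinks, and the temp-file path are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;
using System.Linq;

public static class LinkFileWriter
{
    private const string Prefix = "https://";

    // Splits a run of concatenated links and restores the prefix on each one.
    // Assumption: the text starts with "https://" and every link uses that scheme.
    public static string[] SplitHttpsLinks(string contents)
    {
        return contents
            .Split(new[] { Prefix }, StringSplitOptions.RemoveEmptyEntries)
            .Select(rest => Prefix + rest)
            .ToArray();
    }

    public static void Main()
    {
        string contents = "https://www.example.com/dogs/poodles/poodle1.html"
                        + "https://www.example.com/dogs/poodles/poodle2.html";

        // Hypothetical output path; the question writes to C:/ProgramData/parsed.txt.
        string path = Path.Combine(Path.GetTempPath(), "parsed.txt");

        // WriteAllLines writes each array element followed by a line terminator.
        File.WriteAllLines(path, SplitHttpsLinks(contents));

        // Read the file back to show that each link landed on its own line.
        foreach (string line in File.ReadAllLines(path))
        {
            Console.WriteLine(line);
        }
    }
}
```

If the downloaded text contains anything before the first https://, that leading fragment would also get the prefix added, so real page content would need an extra filtering step.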

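The regex approach

Another option is to use a regular expression.

Failed: I couldn't find or write the right regex. This is where the attempt stands for now.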

string contents = "http://www.example.com/dogs/poodles/poodle1.htmlhttp://www.example.com/dogs/poodles/poodle2.html";

//From https://regexr.com/
var rgx = new Regex(@"(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*");
var matches = rgx.Matches(contents);

foreach(var match in matches )
{
    Console.WriteLine(match);
}

//This finds 'http://www.example.com/dogs/poodles/poodle1.htmlhttp' (note the 'htmlhttp' at the end)
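
The attempt above over-matches because the path character class includes \w, which also matches the letters of the next "http". One possible way around this, offered as a sketch rather than a general-purpose URL parser, is a lazy match with a lookahead that stops just before the next scheme, before whitespace, or at the end of the input. The pattern, the class name UrlRegexSplitter, and the method ExtractUrls are assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlRegexSplitter
{
    // \S+? is lazy: it grows one character at a time and stops as soon as
    // the lookahead sees the next "http://" or "https://", whitespace, or the end.
    private static readonly Regex UrlPattern =
        new Regex(@"https?://\S+?(?=https?://|\s|$)");

    public static List<string> ExtractUrls(string contents)
    {
        var urls = new List<string>();
        foreach (Match match in UrlPattern.Matches(contents))
        {
            urls.Add(match.Value);
        }
        return urls;
    }

    public static void Main()
    {
        string contents = "http://www.example.com/dogs/poodles/poodle1.html"
                        + "http://www.example.com/dogs/poodles/poodle2.html";

        foreach (string url in ExtractUrls(contents))
        {
            Console.WriteLine(url);
        }
    }
}
```

Like the .Split approach, this assumes no URL legitimately contains another http:// or https:// inside it, for example in a redirect parameter.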

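Answer 1 (score: 0)

You have all of the links (URLs) in a single string. Without making some assumptions, there is no straightforward way to extract every URL.

For the sample data you shared, I am assuming that the URLs in the string follow a simple URL format with nothing fancy in them: each one starts with http, and no URL contains another http inside it.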
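
To make the second assumption concrete, here is a small sketch of what it rules out when a string is split on the literal text http. The input URL and the names HttpSplitAssumptionDemo and SplitOnHttp are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Linq;

public static class HttpSplitAssumptionDemo
{
    // Splits on the literal "http" and then puts the separator back on each piece.
    public static string[] SplitOnHttp(string data)
    {
        return data
            .Split(new[] { "http" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(piece => "http" + piece)
            .ToArray();
    }

    public static void Main()
    {
        // Hypothetical input: a single URL whose query string happens to contain "http".
        string data = "https://forum.to/showthread.php?tid=1&via=httpclient";

        foreach (string piece in SplitOnHttp(data))
        {
            Console.WriteLine(piece);
        }
    }
}
```

Here the single URL comes back as two pieces, so input of this kind would need a stricter separator such as https://.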

Based on these assumptions, I suggest the following code.

// Requires: using System; using System.IO; using System.Linq;

// Sample data as shared by the OP
string data = "https://forum.to/showthread.php?tid=22305https://forum.to/showthread.php?tid=22405https://forum.to/showthread.php?tid=22318";

// Split the string on the literal text "http"
var items = data.Split(new [] {"http"}, StringSplitOptions.RemoveEmptyEntries).ToList();

// At this point every string in the items collection is missing the "http" at its start,
// so they look like this:
// s://forum.to/showthread.php?tid=22305
// s://forum.to/showthread.php?tid=22405
// s://forum.to/showthread.php?tid=22318

// So we add "http" back to the start of each item.
items = items.Select(i => "http" + i).ToList();

// After this they look like this:
// https://forum.to/showthread.php?tid=22305
// https://forum.to/showthread.php?tid=22405
// https://forum.to/showthread.php?tid=22318

// Now join the items into a single string with a newline between them,
// so that each item ends up on its own line.
var text = String.Join("\r\n", items);

// Then write the text to the file.
File.WriteAllText("C:/ProgramData/oof.txt", text);

This should help you solve your problem.
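
The question also filters for links that contain a search keyword. As a rough sketch of how the split could be combined with that filter, under the same assumptions stated in this answer, the helper below keeps only matching links and joins them one per line. The names ThreadLinkFilter and OnePerLine, and the member.php link in the sample data, are hypothetical.

```csharp
using System;
using System.Linq;

public static class ThreadLinkFilter
{
    // Splits concatenated URLs on "http", restores the prefix,
    // keeps only links containing the keyword, and joins them one per line.
    public static string OnePerLine(string data, string keyword)
    {
        var links = data
            .Split(new[] { "http" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(piece => "http" + piece)
            .Where(link => link.Contains(keyword));

        return string.Join(Environment.NewLine, links);
    }

    public static void Main()
    {
        // Sample data in the style of this answer, plus one hypothetical non-thread link.
        string data = "https://forum.to/showthread.php?tid=22305"
                    + "https://forum.to/member.php?uid=7"
                    + "https://forum.to/showthread.php?tid=22318";

        Console.WriteLine(OnePerLine(data, "https://forum.to/showthread.php"));
    }
}
```

The returned string can be assigned to textBox1.Text or passed to File.WriteAllText, which avoids writing the TextBox object itself to the file.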