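I have a folder containing 25,000 text files. I want to read these files and put their words into a table. The files are named 1.txt, 2.txt, and so on up to 25000.txt. Each text file contains words in the following form.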
sample contents of my file
apple
cat
rat
shoe
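These words can also be repeated in other text files. I would like C# code that reads the text files, identifies which words are repeated and which are not, and then inserts them into a SQL Server database in the form below.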
keyword document name
cat 1.txt,2.txt,3.txt
rat 4.txt,1.txt
fish 5.txt
`
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.IO;
using System.Data.SqlClient;
namespace RAMESH
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }
        private void textBox1_TextChanged(object sender, EventArgs e)
        {
        }
        private void button2_Click(object sender, EventArgs e)
        {
            string[] files = Directory.GetFiles(textBox1.Text, "*.txt");
            int i;
            string sqlstmt,str;
            SqlConnection con = new SqlConnection("data source=dell-pc\\sql1; initial catalog=db; user id=sa; password=a;");
            SqlCommand cmd;
            sqlstmt = "delete from Items";
            cmd = new SqlCommand(sqlstmt, con);
            con.Open();
            cmd.ExecuteNonQuery();
            for (i = 0; i < files.Length; i++)
            {
                StreamReader sr = new StreamReader(files[i]);
                FileInfo f = new FileInfo(files[i]);
                string fname;
                fname = f.Name;
                fname = fname.Substring(0, fname.LastIndexOf('.'));
                //MessageBox.Show(fname);
                while ((str = sr.ReadLine()) != null)
                {
                    int nstr=1;
                    //int x,y;
                    //for (x = 0; x < str.Length; x++)
                    //{
                    // y = Convert.ToInt32(str.Substring(x,1));
                    // if ((y < 48 && y > 75) || (y < 65 && y > 97) || (y < 97 && y > 122)) ;
                    //}
                    sqlstmt = "insert into Items values('" + str + "','" + fname + "')";
                    cmd = new SqlCommand(sqlstmt, con);
                    try
                    {
                        cmd.ExecuteNonQuery();
                    }
                    catch (Exception ex)
                    {
                        sqlstmt = "update Items set docname=docname + '," + fname + "' where itemname='" + str + "'";
                        cmd = new SqlCommand(sqlstmt, con);
                        cmd.ExecuteNonQuery();
                    }
                }
                sr.Close();
            }
            MessageBox.Show("keywords added successfully");
            con.Close();
        }
    }
} `
Answer 0 (score: 1)
First, I would add a stored procedure to your database to isolate the insert-or-update logic.
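CREATE PROCEDURE UpsertWords
    @word nvarchar(MAX), @file nvarchar(256)
AS
    DECLARE @cnt integer
    -- Check whether the word is already stored
    SELECT @cnt = COUNT(*) FROM Items WHERE ItemName = @word
    IF @cnt = 0
        -- New word: create its row with the first file name
        INSERT INTO Items (ItemName, docname) VALUES (@word, @file)
    ELSE
        -- Known word: append the file name to the existing list
        UPDATE Items SET docname = docname + ',' + @file WHERE ItemName = @word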
Now we can simplify your code considerably.
.....
// Build the command just one time, outside the loop,
// make it point to the stored procedure above
cmd = new SqlCommand("UpsertWords", con);
cmd.CommandType = CommandType.StoredProcedure;
// Create dummy parameters, the actual value is supplied inside the loop
cmd.Parameters.AddWithValue("@word", string.Empty);
cmd.Parameters.AddWithValue("@file", string.Empty);
// Now loop on every file
for (i = 0; i < files.Length; i++)
{
    // Open and read all the lines in the current file
    string[] lines = File.ReadAllLines(files[i]);
    // Get only the filename part without the extension
    string fname = Path.GetFileNameWithoutExtension(files[i]);
    // In case of just one line per file, this loop will execute just one time
    // however we also could handle more than one line per file
    foreach (string line in lines)
    {
        // Set the actual value of the parameters created outside the loop
        cmd.Parameters["@word"].Value = line;
        cmd.Parameters["@file"].Value = fname;
        // Run the insert or update (the logic is inside the stored procedure)
        cmd.ExecuteNonQuery();
    }
}
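At this point it is not clear whether each line consists of a single word or of several words separated by a delimiter (tab, comma, semicolon). If lines can hold several words, you need to split the string and add another loop.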
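The extra split step could look roughly like the sketch below. The class name LineSplitter, the method SplitWords, and the delimiter set are all assumptions made for illustration; adjust the delimiters to whatever the real files use.

```csharp
using System;
using System.Collections.Generic;

static class LineSplitter
{
    // Assumed delimiters (space, tab, comma, semicolon); not confirmed by the question.
    static readonly char[] Delimiters = { ' ', '\t', ',', ';' };

    // Returns the non-empty, trimmed words found on a single line.
    public static IEnumerable<string> SplitWords(string line)
    {
        // RemoveEmptyEntries drops the empty strings produced by adjacent delimiters.
        foreach (string part in line.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries))
        {
            string word = part.Trim();
            if (word.Length > 0)
                yield return word;
        }
    }
}
```

Inside the per-line loop you would then iterate over LineSplitter.SplitWords(line) and assign each word to cmd.Parameters["@word"].Value before calling ExecuteNonQuery.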
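However, I think your database schema is wrong. It would be better to add one row for each word together with the file in which it appears. That way a simple query like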
SELECT docname from Items where itemname = @word
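would yield all the files without any significant performance problem, and you would have a database that is much easier to search. Or, if you need to count the occurrences of each word: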
SELECT ItemName, COUNT(ItemName) as WordCount
FROM Items
GROUP BY ItemName
ORDER BY Count(ItemName) ASC
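If you go with one row per word and file, it may help to collect the distinct pairs in memory before touching the database. The following is only a sketch under assumptions: the WordIndex class is hypothetical, the lines of each file are assumed to have been read already (for example with File.ReadAllLines), and words are compared case-insensitively to mimic a case-insensitive SQL Server collation.

```csharp
using System;
using System.Collections.Generic;

static class WordIndex
{
    // Builds word -> set of file names. The HashSet silently drops a word
    // that is repeated inside the same file, so each pair appears once.
    public static Dictionary<string, HashSet<string>> Build(
        IEnumerable<KeyValuePair<string, string[]>> files)
    {
        // Case-insensitive keys are an assumption, not something the question states.
        var index = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);
        foreach (KeyValuePair<string, string[]> file in files)
        {
            foreach (string raw in file.Value)
            {
                string word = raw.Trim();
                if (word.Length == 0) continue; // skip blank lines
                HashSet<string> docs;
                if (!index.TryGetValue(word, out docs))
                {
                    docs = new HashSet<string>();
                    index[word] = docs;
                }
                docs.Add(file.Key);
            }
        }
        return index;
    }
}
```

Each (word, file name) pair in the result would then become one inserted row in Items, and the number of entries in a word's set tells you whether the word is repeated across files.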
Answer 1 (score: 0)
Try this approach:
Start with your files: loop over them and, for each one, build a simple XML document from its keywords.
// Needs: using System.Linq; using System.Xml.Linq;
//        using System.Data; using System.Data.SqlClient;
// Example values; in the real loop these come from the current file
var fname = "File12.txt";
var keywords = new List<string>(new[]{ "dog", "cat", "moose" });
// Build <root><item key="..."/>...</root>
var miXML = new XDocument(new XDeclaration("1.0", "utf-8", "yes"), new XElement("root"));
foreach (var el in keywords.Select(i => new XElement("item", new XAttribute("key", i))))
{
    miXML.Root.Add(el);
}
using (var con = new SqlConnection("Server=localhost;Database=HT;Trusted_Connection=True;"))
{
    con.Open();
    using (var cmd = new SqlCommand("uspUpsert", con) {CommandType = CommandType.StoredProcedure})
    {
        // Pass the whole keyword list in one round trip
        cmd.Parameters.AddWithValue("@X", miXML.ToString());
        cmd.Parameters.AddWithValue("@fileName", fname);
        cmd.ExecuteNonQuery();
    }
}
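The snippet above hard-codes one file name and three keywords. To run it over every file, the XML-building part could be pulled into a small helper along these lines. This is a sketch only: KeywordXml and Build are hypothetical names, and trimming and de-duplicating the keywords is an added assumption rather than part of the original answer.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class KeywordXml
{
    // Produces <root><item key="..."/>...</root>, the same shape as the snippet above.
    public static XElement Build(IEnumerable<string> keywords)
    {
        return new XElement("root",
            keywords
                .Select(k => k.Trim())
                .Where(k => k.Length > 0)   // ignore blank lines
                .Distinct()                 // one item per keyword per file
                .Select(k => new XElement("item", new XAttribute("key", k))));
    }
}
```

For each path you would then call KeywordXml.Build(File.ReadAllLines(path)).ToString() and pass the result as the @X parameter, with Path.GetFileName(path) as @fileName.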
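Then, on the database side, you call this stored procedure. It turns the XML into a table and inserts the keywords and the file name into the database.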
CREATE PROCEDURE uspUpsert
    @X xml,
    @FileName varchar(100)
AS
BEGIN
    SET NOCOUNT ON;
    -- Shred the XML into one row per keyword
    WITH KV as (
        select
            x.v.value('@key', 'varchar(20)') as Keyword
            ,@FileName as FileName
        FROM @X.nodes('/root/item') x(v)
    )
    -- Insert only the keyword/file pairs that are not stored yet
    insert into Items (Keyword, FileName)
    select KV.Keyword, KV.FileName
    from KV
    left outer join Items I on I.Keyword=KV.Keyword and I.FileName=KV.FileName
    where I.id is null
END
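Since you probably don't want to search a 'file1.txt file2.txt file3.txt' style list to find duplicates, you would use this query to find every file in which a word appears: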
select * from items where keyword='dog'
You can also now run counts and any other aggregation on this table.