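I have a space-delimited file. It is about 1 GB in size, and I want to read the numbers from it. I decided to use a memory-mapped file for fast reading, but I don't understand how to do it. I tried the following: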
var mmf = MemoryMappedFile.CreateFromFile("test", FileMode.Open, "myFile");
var mmfa = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);
var nums = new int[6];
var a = mmfa.ReadArray<int>(0, nums, 0, 6);
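But if "test" contains only "01", I get 12337 in nums[0]. 12337 = 48 * 256 + 49. I searched the internet but found nothing about my problem, only material about byte arrays or inter-process communication. Can you tell me how to get 1 in nums[0]?
Answer 0 (score: 3)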
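The following example reads ASCII integers from a memory-mapped file as fast as possible, without creating any strings. The solution provided by MiMo is much slower: it runs at about 5 MB/s, which will not help you much. Its biggest problem is that it calls a method (Read) for every single character, which costs a large factor in performance. I wonder why you accepted his solution if your original problem was performance. Even a dumb string reader that parses strings into integers gets 20 MB/s. Fetching every byte through a method call destroys the read performance you could otherwise achieve.
The code below maps the file in 200 MB chunks to avoid filling up the 32-bit address space. It then scans the buffer with a byte pointer, which is very fast. Integer parsing is easy if you do not have to consider localization. Interestingly, when I create a view of the mapping, the only way to get a pointer to the view buffer does not let me start at the beginning of the mapped region.
I think this is a bug in the .NET Framework, and it is still not fixed in .NET 4.5. The SafeMemoryMappedViewHandle buffer is allocated with the allocation granularity of the OS. If you advance to some offset, you get back a pointer that still points to the start of the buffer. This is really unfortunate, because it makes the difference between 5 MB/s and 77 MB/s in parsing performance.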
Did read 258.888.890 bytes with 77 MB/s
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

unsafe class Program
{
    static void Main(string[] args)
    {
        new Program().Start();
    }

    private void Start()
    {
        var sw = Stopwatch.StartNew();
        string fileName = @"C:\Source\BigFile.txt"; //@"C:\Source\Numbers.txt";
        var file = MemoryMappedFile.CreateFromFile(fileName);
        var fileSize = new FileInfo(fileName).Length;
        int viewSize = 200 * 1000 * 1000; // 200 MB per view
        long offset = 0;
        for (; offset < fileSize - viewSize; offset += viewSize) // create 200 MB views
        {
            using (var accessor = file.CreateViewAccessor(offset, viewSize))
            {
                // Step back over a number that was cut off at the end of this view,
                // so that the next view reads it again in full.
                int unReadBytes = ReadData(accessor, offset);
                offset -= unReadBytes;
            }
        }

        // Map whatever is left after the last full-size view.
        using (var rest = file.CreateViewAccessor(offset, fileSize - offset))
        {
            ReadData(rest, offset);
        }
        sw.Stop();
        Console.WriteLine("Did read {0:N0} bytes with {1:F0} MB/s", fileSize, (fileSize / (1024 * 1024)) / sw.Elapsed.TotalSeconds);
    }

    List<int> Data = new List<int>();

    // Parses all complete integers in the view and returns the number of trailing
    // bytes that belong to a number cut off at the end of the view.
    private int ReadData(MemoryMappedViewAccessor accessor, long offset)
    {
        using (var safeViewHandle = accessor.SafeMemoryMappedViewHandle)
        {
            byte* pStart = null;
            safeViewHandle.AcquirePointer(ref pStart);
            ulong correction = 0;
            // needed to correct offset because the view handle does not start at the offset specified in the CreateAccessor call
            // This makes AcquirePointer nearly useless.
            // http://connect.microsoft.com/VisualStudio/feedback/details/537635/no-way-to-determine-internal-offset-used-by-memorymappedviewaccessor-makes-safememorymappedviewhandle-property-unusable
            pStart = Helper.Pointer(pStart, offset, out correction);
            var len = safeViewHandle.ByteLength - correction;
            bool digitFound = false;
            int curInt = 0;
            byte current = 0;
            for (ulong i = 0; i < len; i++)
            {
                current = *(pStart + i);
                if (current == (byte)' ')
                {
                    // A space ends the current number; repeated spaces add nothing.
                    if (digitFound)
                    {
                        Data.Add(curInt);
                        // Console.WriteLine("Add {0}", curInt);
                        digitFound = false;
                        curInt = 0;
                    }
                }
                else if (current >= (byte)'0' && current <= (byte)'9')
                {
                    curInt = curInt * 10 + (current - '0');
                    digitFound = true;
                }
                // Any other byte is ignored: the file is assumed to contain only digits and spaces.
            }

            // scan backwards to find partial read number
            // (the caller ignores this value for the final view, so the file is
            // assumed to end with a space, as GenerateNumbers writes it)
            int unread = 0;
            if (digitFound)
            {
                byte* pEnd = pStart + len;
                while (true)
                {
                    pEnd--;
                    if (*pEnd == (byte)' ' || pEnd == pStart)
                    {
                        break;
                    }
                    unread++;
                }
            }
            safeViewHandle.ReleasePointer();
            return unread;
        }
    }

    public unsafe static class Helper
    {
        static SYSTEM_INFO info;

        static Helper()
        {
            GetSystemInfo(ref info);
        }

        // Advances the pointer by the distance between the requested view offset
        // and the allocation-granularity boundary the view really starts at.
        public static byte* Pointer(byte* pByte, long offset, out ulong diff)
        {
            var num = offset % info.dwAllocationGranularity;
            diff = (ulong)num; // return difference
            byte* tmp_ptr = pByte;
            tmp_ptr += num;
            return tmp_ptr;
        }

        [DllImport("kernel32.dll", SetLastError = true)]
        internal static extern void GetSystemInfo(ref SYSTEM_INFO lpSystemInfo);

        internal struct SYSTEM_INFO
        {
            internal int dwOemId;
            internal int dwPageSize;
            internal IntPtr lpMinimumApplicationAddress;
            internal IntPtr lpMaximumApplicationAddress;
            internal IntPtr dwActiveProcessorMask;
            internal int dwNumberOfProcessors;
            internal int dwProcessorType;
            internal int dwAllocationGranularity;
            internal short wProcessorLevel;
            internal short wProcessorRevision;
        }
    }

    // Writes the test file: 30 million space-separated integers.
    void GenerateNumbers()
    {
        using (var file = File.CreateText(@"C:\Source\BigFile.txt"))
        {
            for (int i = 0; i < 30 * 1000 * 1000; i++)
            {
                file.Write(i.ToString() + " ");
            }
        }
    }
}
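The offset correction that Helper.Pointer performs can be checked on its own, without mapping a file. The following is a minimal sketch, not part of the original answer: the class name ViewOffsetSketch and the 64 KB granularity are assumptions for illustration only (the answer's code obtains the real value from GetSystemInfo).

```csharp
using System;

static class ViewOffsetSketch
{
    // Assumption: 64 KB allocation granularity. The real value must be
    // queried from the OS, as Helper does with GetSystemInfo.
    const long AssumedGranularity = 64 * 1024;

    // Number of bytes between the granularity-aligned start of the view
    // and the offset that was actually requested.
    public static long Correction(long requestedOffset, long granularity)
    {
        return requestedOffset % granularity;
    }

    static void Main()
    {
        long offset = 200L * 1000 * 1000; // start of the second 200 MB view
        long correction = Correction(offset, AssumedGranularity);
        Console.WriteLine("aligned start: {0}, bytes to skip: {1}",
            offset - correction, correction);
    }
}
```

An offset that is already a multiple of the granularity needs no correction, which is why the problem only shows up once the views start at arbitrary positions.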
Answer 1 (score: 1)
You need to parse the file contents, converting the characters to numbers, something like this:
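List<int> nums = new List<int>();
long curPos = 0;
int curV = 0;          // value of the number currently being read
bool hasCurV = false;  // true once at least one digit of it has been seen
while (curPos < mmfa.Capacity) {
    byte c;
    mmfa.Read(curPos++, out c);
    if (c == 0) {
        // zero byte: past the end of the file contents
        break;
    }
    if (c == 32) {
        // space: the current number, if any, is complete
        if (hasCurV) {
            nums.Add(curV);
            curV = 0;
        }
        hasCurV = false;
    } else {
        // ASCII digit: 48 is '0'
        curV = checked(curV*10 + (int)(c-48));
        hasCurV = true;
    }
}
// the last number may not be followed by a space
if (hasCurV) {
    nums.Add(curV);
}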
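This assumes that mmfa.Capacity is the total number of characters to read and that the file contains only space-separated numbers (i.e. no line endings or other whitespace). Note that Capacity can be larger than the file, because a view is rounded up to the page size; that is presumably why the loop stops at the first zero byte.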
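The same state machine can also run over a byte array instead of calling Read once per byte. The sketch below is an illustration under assumptions, not the answerer's code: AsciiIntParser and Feed are hypothetical names, and in a real program each buffer would be filled in bulk, for example with mmfa.ReadArray<byte>(position, buffer, 0, buffer.Length). Unlike the loop above, it treats any non-digit byte as a separator.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: keeps its state between buffers, so a number may
// span two consecutive chunks.
class AsciiIntParser
{
    public readonly List<int> Numbers = new List<int>();
    private int current;
    private bool hasCurrent;

    // Feeds `count` bytes of ASCII text. Returns false once a zero byte
    // (padding after the end of the file) is seen.
    public bool Feed(byte[] buffer, int count)
    {
        for (int i = 0; i < count; i++)
        {
            byte c = buffer[i];
            if (c == 0) return false;
            if (c >= (byte)'0' && c <= (byte)'9')
            {
                current = checked(current * 10 + (c - '0'));
                hasCurrent = true;
            }
            else
            {
                Flush(); // any non-digit ends the current number
            }
        }
        return true;
    }

    // Emits the pending number, if any; call once after the last chunk.
    public void Flush()
    {
        if (hasCurrent) Numbers.Add(current);
        current = 0;
        hasCurrent = false;
    }
}

static class ParserDemo
{
    static void Main()
    {
        var parser = new AsciiIntParser();
        // Two chunks that split "23" across the boundary.
        byte[] first = Encoding.ASCII.GetBytes("01 2");
        byte[] second = Encoding.ASCII.GetBytes("3 456");
        parser.Feed(first, first.Length);
        parser.Feed(second, second.Length);
        parser.Flush();
        Console.WriteLine(string.Join(",", parser.Numbers)); // 1,23,456
    }
}
```

Keeping the state in the parser rather than in local variables is what lets the chunk size be chosen freely; how much faster this is than per-byte Read calls would have to be measured.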
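Answer 2 (score: 0)
48 = 0x30 = '0', 49 = 0x31 = '1'
So you really did get your characters; they are just ASCII-encoded.
The string "01" takes 2 bytes, which fit into one int, so you get both of them in a single int. If you want them separately, you need to ask for an array of bytes.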
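To see those byte values directly, here is a small sketch (the class and method names are made up for illustration). It encodes "01" as ASCII and prints each byte next to the digit it stands for.

```csharp
using System;
using System.Text;

static class AsciiDemo
{
    // Maps each ASCII digit byte to its numeric value by subtracting '0' (48).
    public static int[] DigitValues(byte[] raw)
    {
        var digits = new int[raw.Length];
        for (int i = 0; i < raw.Length; i++)
        {
            digits[i] = raw[i] - '0';
        }
        return digits;
    }

    static void Main()
    {
        byte[] raw = Encoding.ASCII.GetBytes("01");
        int[] digits = DigitValues(raw);
        for (int i = 0; i < raw.Length; i++)
        {
            // prints "48 -> 0" and then "49 -> 1"
            Console.WriteLine("{0} -> {1}", raw[i], digits[i]);
        }
    }
}
```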
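Edit: If you need to parse "01" into the constant 1, i.e. convert from the ASCII representation to binary, you need to go about it differently. I suggest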