Scenario: I need to do mathematical processing on 1.5+ GB of text and CSV files. I tried SQL Server Express, but loading the data, even with BULK import, takes a very long time, and ideally I need the entire dataset in memory to reduce hard-disk IO.
There are over 120,000,000 records, but even when I try to filter the data down to just one column (in memory), my C# console application consumes ~3.5 GB of memory to process just 125 MB (700 MB actually read in) of text.
It seems the GC is not collecting the references to the strings and string arrays, even after setting all references to null and wrapping IDisposables with the using keyword.
I believe the culprit is the String.Split() method, which creates a new string for every comma-separated value.
You might suggest that I shouldn't even read the unneeded* columns into a string array, but that misses the point: how can I put this entire dataset in memory so that I can process it in parallel in C#?
I could optimize the statistical algorithms and coordinate tasks with a sophisticated scheduling algorithm, but that is something I was hoping to do before I ran into memory problems, not because of them.
I have included a complete console application that simulates my environment and should help reproduce the problem.
Any help is appreciated. Thanks in advance.
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

namespace InMemProcessingLeak
{
    class Program
    {
        static void Main(string[] args)
        {
            //Setup Test Environment. Uncomment Once
            //15000-20000 files would be more realistic
            //InMemoryProcessingLeak.GenerateTestDirectoryFilesAndColumns(3000, 3);
            //GC
            GC.Collect();
            //Demonstrate Large Object Memory Allocation Problem (LOMAP)
            InMemoryProcessingLeak.SelectColumnFromAllFiles(3000, 2);
        }
    }

    class InMemoryProcessingLeak
    {
        public static List<string> SelectColumnFromAllFiles(int filesToSelect, int column)
        {
            List<string> allItems = new List<string>();
            int fileCount = filesToSelect;
            long fileSize, totalReadSize = 0;

            for (int i = 1; i <= fileCount; i++)
            {
                allItems.AddRange(SelectColumn(i, column, out fileSize));
                totalReadSize += fileSize;
                Console.Clear();
                Console.Out.WriteLine("Reading file {0:00000} of {1}", i, fileCount);
                Console.Out.WriteLine("Memory = {0}MB", GC.GetTotalMemory(false) / 1048576);
                Console.Out.WriteLine("Total Read = {0}MB", totalReadSize / 1048576);
            }
            Console.ReadLine();
            return allItems;
        }

        //reads a csv file and returns the values for a selected column
        private static List<string> SelectColumn(int fileNumber, int column, out long fileSize)
        {
            string fileIn;
            FileInfo file = new FileInfo(string.Format(@"MemLeakTestFiles/File{0:00000}.txt", fileNumber));
            fileSize = file.Length;
            using (System.IO.FileStream fs = file.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                using (System.IO.StreamReader sr = new System.IO.StreamReader(fs))
                {
                    fileIn = sr.ReadToEnd();
                }
            }

            string[] lineDelimiter = { "\n" };
            string[] allLines = fileIn.Split(lineDelimiter, StringSplitOptions.None);

            List<string> processedColumn = new List<string>();

            string current;
            for (int i = 0; i < allLines.Length - 1; i++)
            {
                current = GetColumnFromProcessedRow(allLines[i], column);
                processedColumn.Add(current);
            }

            for (int i = 0; i < lineDelimiter.Length; i++) //GC
            {
                lineDelimiter[i] = null;
            }
            lineDelimiter = null;

            for (int i = 0; i < allLines.Length; i++) //GC
            {
                allLines[i] = null;
            }
            allLines = null;
            current = null;

            return processedColumn;
        }

        //returns a row value from the selected comma separated string and column position
        private static string GetColumnFromProcessedRow(string line, int columnPosition)
        {
            string[] entireRow = line.Split(",".ToCharArray());
            string currentColumn = entireRow[columnPosition];
            //GC
            for (int i = 0; i < entireRow.Length; i++)
            {
                entireRow[i] = null;
            }
            entireRow = null;
            return currentColumn;
        }

        #region Generators

        public static void GenerateTestDirectoryFilesAndColumns(int filesToGenerate, int columnsToGenerate)
        {
            DirectoryInfo dirInfo = new DirectoryInfo("MemLeakTestFiles");
            if (!dirInfo.Exists)
            {
                dirInfo.Create();
            }
            Random seed = new Random();
            string[] columns = new string[columnsToGenerate];
            StringBuilder sb = new StringBuilder();
            for (int i = 1; i <= filesToGenerate; i++)
            {
                int rows = seed.Next(10, 8000);
                for (int j = 0; j < rows; j++)
                {
                    sb.Append(GenerateRow(seed, columnsToGenerate));
                }
                using (TextWriter tw = new StreamWriter(String.Format(@"{0}/File{1:00000}.txt", dirInfo, i)))
                {
                    tw.Write(sb.ToString());
                    tw.Flush();
                }
                sb.Remove(0, sb.Length);
                Console.Clear();
                Console.Out.WriteLine("Generating file {0:00000} of {1}", i, filesToGenerate);
            }
        }

        private static string GenerateString(Random seed)
        {
            StringBuilder sb = new StringBuilder();
            int characters = seed.Next(4, 12);
            for (int i = 0; i < characters; i++)
            {
                sb.Append(Convert.ToChar(Convert.ToInt32(Math.Floor(26 * seed.NextDouble() + 65))));
            }
            return sb.ToString();
        }

        private static string GenerateRow(Random seed, int columnsToGenerate)
        {
            StringBuilder sb = new StringBuilder();
            sb.Append(seed.Next());
            for (int i = 0; i < columnsToGenerate - 1; i++)
            {
                sb.Append(",");
                sb.Append(GenerateString(seed));
            }
            sb.Append("\n");
            return sb.ToString();
        }

        #endregion
    }
}
*Those other columns will be accessed both sequentially and randomly throughout the lifetime of the program, so re-reading them from disk every time would be an extremely costly overhead.
**Environment notes: 4GB DDR2 SDRAM 800, Core 2 Duo 2.5GHz, .NET Runtime 3.5 SP1, Vista 64.
Yes, String.Split creates a new String object for each "piece" - that's what it's meant to do.
Now, bear in mind that strings in .NET are Unicode (UTF-16, really), and with the object overhead, the cost of a string in bytes is roughly 20 + 2*n, where n is the number of characters.
That means if you have lots of small strings, they will take up a lot of memory compared to the size of the text data involved. For example, an 80-character line split into 10 strings of 8 characters each takes 80 bytes in the file, but 10 * (20 + 2*8) = 360 bytes in memory - a 4.5x blow-up!
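You can see that blow-up for yourself with a minimal sketch like the one below (not from the question; the exact figure depends on CLR version and bitness):

using System;

class StringOverheadDemo
{
    static void Main()
    {
        const int count = 1000000;
        string[] pieces = new string[count];

        // Force a collection before and after so the delta is mostly our strings.
        long before = GC.GetTotalMemory(true);
        for (int i = 0; i < count; i++)
        {
            pieces[i] = new string('x', 8); // a fresh 8-character string each time
        }
        long after = GC.GetTotalMemory(true);

        // On 32-bit .NET this typically prints a figure in the mid-30s,
        // in line with the 20 + 2*8 estimate above.
        Console.WriteLine("Bytes per 8-char string: ~{0}", (after - before) / count);
        GC.KeepAlive(pieces);
    }
}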
I doubt this is a GC problem - and I'd advise you to remove the extra statements setting variables to null where it's not necessary - it's just a problem of having too much data.
What I would suggest is that you read the file line by line (using TextReader.ReadLine() instead of TextReader.ReadToEnd()). Clearly, having the whole file in memory when you don't need to is wasteful.
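Here is a minimal sketch of that approach (the file path and column index are assumptions carried over from the question's repro code): it streams each line with ReadLine() and pulls out just the one column with IndexOf, so the only new string allocated per line is the value you actually keep - no Split, no manual nulling.

using System;
using System.Collections.Generic;
using System.IO;

class StreamingColumnReader
{
    // Reads one file line by line and returns only the requested column.
    // Unlike String.Split, this allocates a single substring per line.
    public static List<string> SelectColumn(string path, int column)
    {
        List<string> values = new List<string>();
        using (StreamReader sr = new StreamReader(path))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                values.Add(GetField(line, column));
            }
        }
        return values;
    }

    // Walks past the commas with IndexOf instead of splitting the whole row.
    // Assumes the row actually has enough columns (this is a sketch, not
    // hardened CSV parsing - no quoting or escaping support).
    private static string GetField(string line, int column)
    {
        int start = 0;
        for (int i = 0; i < column; i++)
        {
            start = line.IndexOf(',', start) + 1; // skip past the i-th comma
        }
        int end = line.IndexOf(',', start);
        return end < 0 ? line.Substring(start) : line.Substring(start, end - start);
    }

    static void Main()
    {
        // Example: column 2 of the first generated test file (path assumed
        // from the question's GenerateTestDirectoryFilesAndColumns output).
        List<string> col = SelectColumn(@"MemLeakTestFiles/File00001.txt", 2);
        Console.WriteLine("Read {0} values", col.Count);
    }
}

Even with this, the values you keep still pay the per-string overhead described above, but you avoid holding the whole file plus every discarded column in memory at once.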