I have a large text file (~400MB) that I need to read and extract some specific information out of. Whenever the file gets in the neighborhood of 200MB the job just runs & runs and never seems to complete. It slows down the server to the point where we have to kill the job and then end up manually splitting the file into smaller pieces. The server that this is running is a P4 2.7GHz and has 2GB RAM. When the job runs there is nothing really running except for the usual web server & sql server stuff to support light Internet traffic.
I would initially try to read all files into memory but then I would get the OutOfMemoryException if the file is larger than 200MB. So I try and read the file line by line, parse it, then I have an ArrayList of items....pretty simple stuff here.
Below is my code snippet..the GetList method being called parses the string with a Regex and then returns an ArrayList of matches. The ArrayList could have upwards of 15,000-20,000 items but I won't know ahead of time. I thought of using a string[] array but I didn't think it would make that much of a difference to refactor 3 different pieces of code to test it plus I don't know the size ahead of time.
What am I missing or not seeing in the loop for large files? Is it the ArrayList or how I'm reading the file?
string input = null;
//200,000,000 (200MB) is OK.
if (theFile.Length <= 200000000)
{
using (StreamReader sr = new StreamReader(theFile.FullName))
{
input = sr.ReadToEnd();
list = GetList(input);
}
}
else
{
using (StreamReader sr = new StreamReader(theFile.FullName))
{
ArrayList tmpList = new ArrayList();
while (sr.Peek() >= 0)
{
input = sr.ReadLine();
tmpList = GetList(input);
if (tmpList.Count != 0)
{
list.Add(tmpList);
}
list = GetList(input);
}
}
}
}
return list;