CLONE - File-Crawler scheduled task eats a lot of CPU
This case is related to CLOUD-775. In that case I was trying to figure out why the File Crawler scheduled task uses so much CPU: while the task runs, the CPU stays at 100% for about half an hour. For context, this customer has around 76,000 files and around 74,000 folders.
I’ve managed to reproduce the issue on my laptop, and I think there’s a bug in the DotNetNuke.Services.FileSystem.FolderManager class. I’ll try to explain what I’ve found:
I’ve started with a profiling session in Visual Studio, in which we can see an interesting result:
“The GetFolders() method is a CPU intensive task”. Now, the question is, why? Let’s take a look at how the GetFiles() method calls the GetFolders() method. This is the stack trace:
The first GetFiles() function is just a wrapper which calls the following overloaded version of the same method:
As we can see, that method has a foreach statement to get all the subfolders of that folder, in order to retrieve all the files in a recursive way.
This is the GetFolders() method:
The bug here is that GetFolders() calls an overloaded version of the same method that takes only the PortalId as a parameter, and then applies the LINQ "Where" expression in memory to keep only the subfolders. In our case (ArtfulColor), GetFolders(portalId) returns a list of 74,505 folders, and the "Where" expression then has to iterate through all 74,505 of them just to return the folders that are direct subfolders of parentFolder. That is one iteration through 74,505 elements.
Let’s see GetFolders(portalId):
This method gets the list of folders from the database (or from the cache) and then iterates through all of them to build the list that is returned. That is another iteration through 74,505 elements.
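The call pattern described above can be modelled outside DNN to see how the work adds up. The following is only a rough Python sketch, not the FolderManager code: the class and method names (NaiveFolderManager, get_folders_by_portal, get_subfolders, walk, make_binary_tree) are hypothetical, and the counter simply tallies how many list elements each call touches.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Folder:
    folder_id: int
    parent_id: Optional[int]


class NaiveFolderManager:
    """Hypothetical model of the pattern described in this ticket."""

    def __init__(self, rows: List[Folder]) -> None:
        self._rows = list(rows)      # stands in for the cached folder rows
        self.elements_touched = 0    # how many list elements were visited

    def get_folders_by_portal(self) -> List[Folder]:
        # First pass: rebuild the full folder list on every call.
        result = []
        for row in self._rows:
            self.elements_touched += 1
            result.append(row)
        return result

    def get_subfolders(self, parent: Folder) -> List[Folder]:
        # Second pass: in-memory filter over the full list.
        matches = []
        for folder in self.get_folders_by_portal():
            self.elements_touched += 1
            if folder.parent_id == parent.folder_id:
                matches.append(folder)
        return matches

    def walk(self, folder: Folder) -> int:
        # Recursive crawl: one get_subfolders() call per visited folder.
        visited = 1
        for child in self.get_subfolders(folder):
            visited += self.walk(child)
        return visited


def make_binary_tree(n: int) -> List[Folder]:
    # Test data only: folder i has parent (i - 1) // 2, folder 0 is the root.
    return [Folder(0, None)] + [Folder(i, (i - 1) // 2) for i in range(1, n)]


if __name__ == "__main__":
    rows = make_binary_tree(1000)
    manager = NaiveFolderManager(rows)
    print(manager.walk(rows[0]), manager.elements_touched)
```

In this model every visited folder costs two full passes over the list, so the total grows with the square of the folder count.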
So, for each folder the crawler needs to explore, it has to iterate (in this particular case) through 74,505 + 74,505 = 149,010 elements. That’s why I think it’s taking so much time. I guess the solution is just to replace the call to GetFoldersByPortal with something like GetFoldersByPath.
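To put a number on this: if the crawler ends up exploring every one of the 74,505 folders, the per-folder cost of 149,010 elements multiplies out to roughly 11 billion element visits per run. The sketch below checks that arithmetic and shows one possible way to avoid the repeated scans by grouping folders by parent once. It is an in-memory alternative for illustration, not the GetFoldersByPath change suggested above, and all names in it (naive_cost, build_children_index, walk) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Folder:
    folder_id: int
    parent_id: Optional[int]


def naive_cost(folder_count: int) -> int:
    """Elements visited if every folder triggers two full passes over the list."""
    return folder_count * (folder_count + folder_count)


def build_children_index(rows: List[Folder]) -> Dict[Optional[int], List[Folder]]:
    """Single pass over all folders, grouping them by parent id."""
    index: Dict[Optional[int], List[Folder]] = defaultdict(list)
    for folder in rows:
        index[folder.parent_id].append(folder)
    return index


def walk(index: Dict[Optional[int], List[Folder]], root: Folder) -> List[int]:
    """Visit every folder under root; each child lookup is a dictionary access."""
    visited = []
    stack = [root]
    while stack:
        current = stack.pop()
        visited.append(current.folder_id)
        # .get() avoids inserting empty entries into the defaultdict.
        stack.extend(index.get(current.folder_id, ()))
    return visited


if __name__ == "__main__":
    print(naive_cost(74505))
```

With the index built once, the whole crawl touches each folder a constant number of times instead of rescanning the full list for every folder.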
Another thing I’ve found on this site (I’m not an expert on this matter, so I don’t know whether it is expected): if you execute these two queries against the customer’s database, you will see a lot of folders with ParentId = NULL and two folders with an empty FolderPath. Is that normal?
QA Test Plan
Tested in Platform with 700,000 files in 70,000 folders
Closing as fixed 7.3.3 build 112
DNN-5862: Overload the GetFolders method so that it returns all subfolders in a single call when the second parameter is set to true.
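The actual change is not shown in this ticket, so the following is only a guess at the general shape of such an overload, written as a Python sketch. It assumes folder paths are relative, end in "/", and are empty for the portal root; get_folders and its recursive flag are hypothetical stand-ins for the real method.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Folder:
    folder_id: int
    folder_path: str  # assumed convention: "Images/Icons/", "" for the root


def get_folders(all_folders: List[Folder], parent: Folder,
                recursive: bool = False) -> List[Folder]:
    """Return subfolders of parent in one pass over all_folders.

    recursive=True returns every descendant; otherwise only direct children.
    """
    prefix = parent.folder_path
    result = []
    for folder in all_folders:
        if folder.folder_id == parent.folder_id:
            continue  # the parent is not its own subfolder
        if not folder.folder_path.startswith(prefix):
            continue
        remainder = folder.folder_path[len(prefix):]
        # A direct child has exactly one path segment left, e.g. "Icons/".
        if recursive or remainder.count("/") == 1:
            result.append(folder)
    return result


if __name__ == "__main__":
    folders = [
        Folder(1, ""),
        Folder(2, "Images/"),
        Folder(3, "Images/Icons/"),
        Folder(4, "Docs/"),
    ]
    print([f.folder_path for f in get_folders(folders, folders[0], recursive=True)])
```

With the flag set, a caller such as the crawler could fetch the whole subtree once instead of calling the method again for every subfolder.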