CLONE - File-Crawler scheduled task eats a lot of CPU

Description

This case is related to CLOUD-775. In that case I was trying to figure out why the File Crawler scheduled task uses so much CPU: while the task runs, the CPU stays at 100% for about half an hour. Note that this customer has around 76,000 files and around 74,000 folders.

I’ve managed to reproduce the issue on my laptop and I think there’s a bug in the DotNetNuke.Services.FileSystem.FolderManager class. I’ll try to explain what I’ve found.
I started with a profiling session in Visual Studio, which shows an interesting result:

“The GetFolders() method is a CPU intensive task”. Now, the question is, why? Let’s take a look at how the GetFiles() method calls the GetFolders() method. This is the stack trace:

The first GetFiles() function is just a wrapper which calls the following overloaded version of the same method:

As we can see, that method has a foreach statement to get all the subfolders of that folder, in order to retrieve all the files in a recursive way.
This is the GetFolders() method:

The bug here is that GetFolders() calls an overloaded version of the same method that takes only the PortalId as a parameter, and then applies a LINQ "Where" expression in memory to keep only the subfolders. In our case (ArtfulColor), GetFolders(portalId) returns a list of 74,505 folders, and the "Where" expression then has to iterate through all 74,505 of them to return only the subfolders of parentFolder. That is one iteration through 74,505 elements.
Let’s look at GetFolders(portalId):

This method gets the list of folders from the database (or from the cache) and then iterates through all of them to build the list that it returns. That is another iteration through 74,505 elements.

So, for each folder the crawler needs to explore, it has to iterate (in this particular case) through 74,505 + 74,505 = 149,010 elements. That’s why I think it’s taking so much time. I guess the solution is to replace the call to GetFoldersByPortal with something like GetFoldersByPath.
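
To put a number on it: if the crawler explores every folder, two full passes per folder means roughly 2 × 74,505² ≈ 11.1 billion element visits for this portal. The sketch below is a language-neutral model in Python, not DNN code. The folder layout, the dictionary fields and the function names are all assumptions made up for illustration. It counts element visits for the pattern described above and for an alternative that groups folders by parent once.

```python
from collections import defaultdict


def make_folders(count):
    """Build a hypothetical flat folder list: folder 0 is the root and
    every other folder i has parent (i - 1) // 2."""
    return [{"id": i, "parent_id": None if i == 0 else (i - 1) // 2}
            for i in range(count)]


def crawl_with_full_scans(folders):
    """Explore every folder, finding children with two passes over the
    whole list each time (the pattern described in this ticket)."""
    visits = 0
    explored = 0
    pending = [folders[0]]
    while pending:
        parent = pending.pop()
        explored += 1
        # Pass 1: rebuild the portal-wide list.
        all_folders = []
        for folder in folders:
            all_folders.append(folder)
            visits += 1
        # Pass 2: filter that list down to the direct subfolders.
        for folder in all_folders:
            visits += 1
            if folder["parent_id"] == parent["id"]:
                pending.append(folder)
    return explored, visits


def crawl_with_index(folders):
    """Explore every folder after grouping the list by parent once."""
    visits = 0
    children = defaultdict(list)
    for folder in folders:
        children[folder["parent_id"]].append(folder)
        visits += 1
    explored = 0
    pending = [folders[0]]
    while pending:
        parent = pending.pop()
        explored += 1
        for child in children[parent["id"]]:
            visits += 1
            pending.append(child)
    return explored, visits


if __name__ == "__main__":
    folders = make_folders(1000)
    print(crawl_with_full_scans(folders))
    print(crawl_with_index(folders))
```

In this model the first approach grows with the square of the folder count, while the second grows linearly. A path-based lookup such as the GetFoldersByPath idea above would presumably have a similar effect, though that depends on how the lookup itself is implemented.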

Another thing I’ve found on this site (I’m not an expert on this matter, so I don’t know whether it is normal): if you execute these two queries in the customer’s database, there are a lot of folders with ParentId = NULL and two folders with an empty FolderPath. Is that normal?

QA Test Plan

None

Activity

Ben Zhong
September 15, 2014, 11:12 PM
Edited

DNN-5862: added an overload of the GetFolders method that returns all subfolders in a single call when the second parameter is set to true.
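
As a rough illustration of what "all subfolders in a single call" can mean, here is a minimal Python sketch. It is not the DNN-5862 change itself: the function name `get_folders`, the `recursive` flag and the dictionary fields are hypothetical stand-ins. With the flag set, the whole subtree is collected from one pass over the flat folder list, so the caller does not have to call back once per subfolder.

```python
def get_folders(all_folders, parent_id, recursive=False):
    """Return the direct subfolders of parent_id, or every descendant
    when recursive is True. all_folders is a flat list of dicts with
    hypothetical "id" and "parent_id" keys."""
    # Group the flat list by parent in a single pass.
    children = {}
    for folder in all_folders:
        children.setdefault(folder["parent_id"], []).append(folder)

    result = []
    pending = list(children.get(parent_id, []))
    while pending:
        folder = pending.pop()
        result.append(folder)
        if recursive:
            # Queue this folder's own subfolders as well.
            pending.extend(children.get(folder["id"], []))
    return result


# Hypothetical layout: 1 is the root, 2 and 3 are under 1, 4 is under 2.
sample = [
    {"id": 1, "parent_id": None},
    {"id": 2, "parent_id": 1},
    {"id": 3, "parent_id": 1},
    {"id": 4, "parent_id": 2},
]
direct_ids = sorted(f["id"] for f in get_folders(sample, 1))
subtree_ids = sorted(f["id"] for f in get_folders(sample, 1, recursive=True))
```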

Ken Grierson
September 28, 2014, 9:01 PM

Tested in Platform with 700,000 files in 70,000 folders
Closing as fixed 7.3.3 build 112

Assignee

Unassigned

Reporter

Ben Zhong

Story Size

Unknown

Severity

Major

Triage

New

Reported in Build #

None

Fixed in Build

Dev Owner

None

Includes Code Fix

No

Documentation Required

No

Trouble Ticket

None

Requires More Info

None

QA Story Points

None

QA Owner

None

Injected

None

Automation Required

None

Code Review Owner

None

Components

Sprint

None

Fix versions

Priority

High