Runaway threads causing high CPU usage of DNN w3wp process on all cores
Activity

Bing Wu, July 15, 2014 at 12:28 AM
Thanks Cathal. I have closed this issue based on my conversation with Cathal.
The issue was identified from Marco's comments and dump file: developers found a thread-safety problem in the code and fixed it. There is currently no way to reproduce the problem, and it could happen on any page; some testing will be covered in the regression cycle anyway.

cathal connolly, June 18, 2014 at 1:35 AM
Bing, the issue manifests as the website effectively hanging: CPU usage climbs to a very high level (80-100% of capacity) as multiple threads try to load a list of users and get in each other's way. It's very hard to recreate because it comes down to thread timing and access to shared data (the users table). Anecdotal reports indicate it predominantly happens on sites where most users are logged in, e.g. intranet or extranet sites, rather than internet sites (where most users start out anonymous and only some log in). This makes sense: a few authenticated users hit a page, and then competing threads attempt to return user-specific data and cause the high CPU usage.
There is no real change in functionality to test, as the change is simply the type of object that stores the data, i.e. the data doesn't change, but the object that stores it was designed to handle multiple threads accessing the same data, and as such ensures that they don't cause deadlocks but are dealt with safely and without impacting performance. To my knowledge we've never seen this issue happen in our test lab. We could try to force it by throwing lots of capacity testing at sites, but threading/synchronisation issues are by nature hard to pin down and hard to recreate. As it's been reviewed by both Charles and me, and at least 2 users on the dnnsoftware.com thread have indicated it fixed their problem, it might be an idea just to pass it, as the level of effort to recreate it is substantial.
As to the functionality, it runs on each page, as it simply returns the user's userinfo, i.e. it checks whether it can get it from the cache or must fetch it from the database. In fact, on most pages it runs multiple times, as user data is checked for a number of items (permissions, mail and notification counts etc.).

Bing Wu, June 16, 2014 at 8:28 PM
Hi Cathal, agreed, it is not easy to test, but I'd like more information so that we can run some regression tests and make sure that at least no functionality is broken.

Charles Nurse, June 16, 2014 at 3:44 PM
Code reviewed

Robert Cui, June 9, 2014 at 7:54 PM
Please do a Code Review and put a comment after the review. Thanks.

Description
(This issue/fix has also been reported by me on the DNN forum: http://www.dnnsoftware.com/forums/forumid/198/threadid/432769/scope/posts/threadpage/4, post of 12/23/2013 9:46 AM by Marco Kijlstra)
I ran into a high CPU problem on our production server last week (DNN version 6.2.5). The problem did not manifest itself on our single-core test environment, nor in our development environments. There also seemed to be no relation between any particular request and the probability of the problem occurring.
I made dumps using DebugDiag and the (32-bit version of) Task Manager. Analysis of the dumps with WinDbg showed all of the runaway threads busy in an Insert call on a generic dictionary in the UserController.GetCachedUser method.
This method indeed gets a dictionary from the cache and sets a value on it. The dictionary is an ordinary generic dictionary, which is not thread-safe. However, the object is shared among threads through the cache, so a thread-safe dictionary should be used.
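To illustrate the general point (this is not the DNN code and not the fix described in this issue), here is a minimal sketch of a lookup shared between threads. The class `UserLookupSketch` and its members are hypothetical names made up for this example; it assumes .NET 4.0 or later, where `ConcurrentDictionary` is available. With a plain `Dictionary<int, string>` in its place, simultaneous inserts can corrupt the dictionary's internal state and leave threads spinning, which would be consistent with the dumps described above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical stand-in for a userId -> username lookup that is shared
// between request threads through a cache. Not DNN's actual code.
public static class UserLookupSketch
{
    // ConcurrentDictionary supports concurrent readers and writers.
    // A plain Dictionary<int, string> is not safe for concurrent writes.
    private static readonly ConcurrentDictionary<int, string> Lookup =
        new ConcurrentDictionary<int, string>();

    public static int Count
    {
        get { return Lookup.Count; }
    }

    // Records the username for a user id; safe to call from any thread.
    public static void Remember(int userId, string userName)
    {
        Lookup[userId] = userName;
    }

    // Returns true and sets userName if the id is already in the lookup.
    public static bool TryGetUserName(int userId, out string userName)
    {
        return Lookup.TryGetValue(userId, out userName);
    }

    public static void Main()
    {
        // Simulate many request threads writing to the shared lookup at once.
        Parallel.For(0, 10000, i => Remember(i, "user" + i));
        Console.WriteLine(Count); // prints 10000
    }
}
```

The design choice in this sketch is to make the shared object itself thread-safe rather than to wrap every access in a lock, so callers do not need to coordinate with each other.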
I have fixed this by changing line 242 in method UserController.GetUserLookupDictionary so that the CBO.GetCachedObject overload with the saveInDictionary parameter is used, having set the saveInDictionary parameter to true.
I have attached the fixed version of UserController.cs. See lines 242-251.
This bug probably also affects all 7+ versions and earlier 6.x versions, because the offending methods have not been changed.
So far, the high CPU problem has not returned in the past 24 hours. Before the fix, the problem would manifest itself one or more times per hour. The fix does not seem to have had any adverse effects on performance or otherwise.