Re: deferred rss update instead of sloppy rss

From: Andrew Morton <akpm_at_osdl.org>
Date: 2004-11-23 09:11:48
Christoph Lameter <clameter@sgi.com> wrote:
>
> One way to solve the rss issues is--as discussed--to put rss into the
> task structure and then have the page fault increment that rss.
> 
> The problem is then that the proc filesystem must do an extensive scan
> over all threads to find users of a certain mm_struct.
> 
> The following patch does put the rss into task_struct. The page fault
> handler is then incrementing current->rss if the page_table_lock is not
> held.
> 
> The timer interrupt checks if task->rss is non zero (when doing
> stime/utime updates. rss is defined near those so its hopefully on the
> same cacheline and has a minimal impact).
> 
> If rss is non zero and the page_table_lock and the mmap_sem can be taken
> then the mm->rss will be updated by the value of the task->rss and
> task->rss will be zeroed. This avoids all proc issues. The only
> disadvantage is that rss may be inaccurate for a couple of clock ticks.
> 
> This also adds some performance (sorry only a 4p system):
> 
> sloppy rss:
> 
> Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
>   4  10    1    0.593s     29.897s  30.050s 85973.585  85948.565
>   4  10    2    0.616s     42.184s  22.045s 61247.450 116719.558
>   4  10    4    0.559s     44.918s  14.076s 57641.255 177553.945
> 
> deferred rss:
>  Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
>   4  10    1    0.565s     29.429s  30.000s 87395.518  87366.580
>   4  10    2    0.500s     33.514s  18.002s 77067.935 145426.659
>   4  10    4    0.533s     44.455s  14.085s 58269.368 176413.196

hrm.  I cannot see anywhere in this patch where you update task_struct.rss.

> Index: linux-2.6.9/kernel/exit.c
> ===================================================================
> --- linux-2.6.9.orig/kernel/exit.c	2004-11-22 09:51:58.000000000 -0800
> +++ linux-2.6.9/kernel/exit.c	2004-11-22 11:16:02.000000000 -0800
> @@ -501,6 +501,9 @@
>  	/* more a memory barrier than a real lock */
>  	task_lock(tsk);
>  	tsk->mm = NULL;
> +	/* only holding mmap_sem here maybe get page_table_lock too? */
> +	mm->rss += tsk->rss;
> +	tsk->rss = 0;
>  	up_read(&mm->mmap_sem);

mmap_sem needs to be held for writing, surely?

> +	/* Update mm->rss if necessary */
> +	if (p->rss && p->mm && down_write_trylock(&p->mm->mmap_sem)) {
> +		if (spin_trylock(&p->mm->page_table_lock)) {
> +			p->mm->rss += p->rss;
> +			p->rss = 0;
> +			spin_unlock(&p->mm->page_table_lock);
> +		}
> +		up_write(&p->mm->mmap_sem);
> +	}
>  }

I'd also suggest that you do:

	tsk->rss++;
	if (tsk->rss > 16) {
		spin_lock(&mm->page_table_lock);
		mm->rss += tsk->rss;
		spin_unlock(&mm->page_table_lock);
		tsk->rss = 0;
	}

just to prevent transient gross inaccuracies.  For some value of "16".
-
To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Received on Mon Nov 22 22:26:41 2004

This archive was generated by hypermail 2.1.8 : 2005-08-02 09:20:32 EST