Re: [PATCH] ia64 cpuset + build_sched_domains() mangles structures

From: Dinakar Guniguntala <dino@in.ibm.com>
Date: 2005-09-03 00:47:15
Andrew,

Please include the patch below in -mm. I had reported a problem
with this patch earlier on 2.6.13-rc6, but I am just not able to
reproduce the problem on newer kernels (2.6.13 and 2.6.13-mm1).

I have tested this extensively on a Power5 box, and I believe
that John Hawkes has tested this on ia64 as well.

The patch is here:

http://marc.theaimsgroup.com/?l=linux-ia64&m=112474434128996&w=2


Regards,

Dinakar



On Mon, Aug 22, 2005 at 06:07:19PM +0200, Ingo Molnar wrote:
> 
> * Dinakar Guniguntala <dino@in.ibm.com> wrote:
> 
> > On Mon, Aug 22, 2005 at 09:08:34AM +0200, Ingo Molnar wrote:
> > > 
> > > in terms of 2.6.14, the replacement patch below also does what i always 
> > > wanted to do: to merge the ia64-specific build_sched_domains() code back 
> > > into kernel/sched.c. I've done this by taking your improved dynamic 
> > > build-domains code and putting it into kernel/sched.c.
> > > 
> > 
> > Ingo, with one change to your patch, the exclusive cpuset
> > functionality seems to work fine on a NUMA ppc64 box.
> > I am still running some of my dynamic sched domain tests; so far
> > it seems to be holding up ok.
> 
> great! Andrew, i'd suggest we try the merged patch attached below in 
> -mm.
> 
> > Any idea why the ia64 stuff was forked in the first place?
> 
> most of the NUMA domain-trees stuff happened in the ia64 space so there 
> was a natural desire to keep it more hackable there. But now i think 
> it's getting counterproductive.
> 
> 	Ingo
> 
> -----
> I've already sent this to the maintainers, and this is now being sent to a
> larger community audience.  I have fixed a problem with the ia64 version of
> build_sched_domains(), but a similar fix still needs to be made to the
> generic build_sched_domains() in kernel/sched.c.
> 
> The "dynamic sched domains" functionality has recently been merged into
> 2.6.13-rcN that sees the dynamic declaration of a cpu-exclusive (a.k.a.
> "isolated") cpuset and rebuilds the CPU Scheduler sched domains and sched
> groups to separate away the CPUs in this cpu-exclusive cpuset from the
> remainder of the non-isolated CPUs.  This allows the non-isolated CPUs to
> completely ignore the isolated CPUs when doing load-balancing.
> 
> Unfortunately, build_sched_domains() expects that a sched domain will
> include all the CPUs of each node in the domain, i.e., that no node will
> belong in both an isolated cpuset and a non-isolated cpuset.  Declaring
> a cpuset that violates this presumption will produce flawed data
> structures and will oops the kernel.
> 
> To trigger the problem (on a NUMA system with more than one CPU per node):
>    cd /dev/cpuset
>    mkdir newcpuset
>    cd newcpuset
>    echo 0 >cpus
>    echo 0 >mems
>    echo 1 >cpu_exclusive
> 
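> The reason this mangles the scheduler's data structures: the old code
> keeps one global sched_group per node, shared by every invocation of
> build_sched_domains().  In outline (a simplified sketch of the
> pre-patch generic code, not of the patch itself):
> 
>    /* one static group per node, shared by ALL cpuset partitions */
>    static struct sched_group sched_group_nodes[MAX_NUMNODES];
>    ...
>    /* each partition rebuilds the groups of every node it touches,
>     * rewriting ->cpumask and the circular ->next links in place,
>     * while the other partition's domains still point at them */
>    init_sched_build_groups(sched_group_nodes, *cpu_map,
>                            &cpu_to_node_group);
> 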
> I have fixed this shortcoming for ia64 NUMA (with multiple CPUs per node).
> A similar shortcoming exists in the generic build_sched_domains() (in
> kernel/sched.c) for NUMA, and that needs to be fixed also.  The fix involves
> dynamically allocating sched_group_nodes[] and sched_group_allnodes[] for
> each invocation of build_sched_domains(), rather than using global arrays
> for these structures.  Care must be taken to remember kmalloc() addresses
> so that arch_destroy_sched_domains() can properly kfree() the new dynamic
> structures.
> 
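> In outline, the allocation pattern used in the patch below
> (simplified; error handling and the allnodes groups elided):
> 
>    /* per-node group lists are now kmalloc()ed per invocation and
>     * remembered here so arch_destroy_sched_domains() can free them */
>    static struct sched_group *sched_group_nodes[MAX_NUMNODES];
> 
>    /* build_sched_domains(): allocate a circular group list for each
>     * node that has CPUs in *cpu_map */
>    sg = kmalloc(sizeof(struct sched_group), GFP_KERNEL);
>    sched_group_nodes[i] = sg;
>    ...
>    prev->next = sched_group_nodes[i];      /* close the circle */
> 
>    /* arch_destroy_sched_domains(): walk each circular list, kfree()
>     * every element (the head is freed when the walk wraps around),
>     * then clear the slot for the next rebuild */
>    sg = sched_group_nodes[i]->next;
>    ...
>    sched_group_nodes[i] = NULL;
> 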
> This is a patch against 2.6.13-rc6.
> 
> Signed-off-by: John Hawkes <hawkes@sgi.com>
> 
> Reworked the patch to also move the ia64 domain setup code to the
> generic code.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> 
> ppc64 fix
> 
> From: Dinakar Guniguntala <dino@in.ibm.com>
> 
>  arch/ia64/kernel/domain.c    |  400 -------------------------------------------
>  arch/ia64/kernel/Makefile    |    2 
>  include/asm-ia64/processor.h |    3 
>  include/asm-ia64/topology.h  |   22 --
>  include/linux/sched.h        |    9 
>  include/linux/topology.h     |   22 ++
>  kernel/sched.c               |  290 +++++++++++++++++++++++++------
>  7 files changed, 259 insertions(+), 489 deletions(-)
> 
> Index: linux-sched-curr/arch/ia64/kernel/Makefile
> ===================================================================
> --- linux-sched-curr.orig/arch/ia64/kernel/Makefile
> +++ linux-sched-curr/arch/ia64/kernel/Makefile
> @@ -16,7 +16,7 @@ obj-$(CONFIG_IA64_HP_ZX1_SWIOTLB) += acp
>  obj-$(CONFIG_IA64_PALINFO)	+= palinfo.o
>  obj-$(CONFIG_IOSAPIC)		+= iosapic.o
>  obj-$(CONFIG_MODULES)		+= module.o
> -obj-$(CONFIG_SMP)		+= smp.o smpboot.o domain.o
> +obj-$(CONFIG_SMP)		+= smp.o smpboot.o
>  obj-$(CONFIG_NUMA)		+= numa.o
>  obj-$(CONFIG_PERFMON)		+= perfmon_default_smpl.o
>  obj-$(CONFIG_IA64_CYCLONE)	+= cyclone.o
> Index: linux-sched-curr/arch/ia64/kernel/domain.c
> ===================================================================
> --- linux-sched-curr.orig/arch/ia64/kernel/domain.c
> +++ /dev/null
> @@ -1,400 +0,0 @@
> -/*
> - * arch/ia64/kernel/domain.c
> - * Architecture specific sched-domains builder.
> - *
> - * Copyright (C) 2004 Jesse Barnes
> - * Copyright (C) 2004 Silicon Graphics, Inc.
> - */
> -
> -#include <linux/sched.h>
> -#include <linux/percpu.h>
> -#include <linux/slab.h>
> -#include <linux/cpumask.h>
> -#include <linux/init.h>
> -#include <linux/topology.h>
> -#include <linux/nodemask.h>
> -
> -#define SD_NODES_PER_DOMAIN 16
> -
> -#ifdef CONFIG_NUMA
> -/**
> - * find_next_best_node - find the next node to include in a sched_domain
> - * @node: node whose sched_domain we're building
> - * @used_nodes: nodes already in the sched_domain
> - *
> - * Find the next node to include in a given scheduling domain.  Simply
> - * finds the closest node not already in the @used_nodes map.
> - *
> - * Should use nodemask_t.
> - */
> -static int find_next_best_node(int node, unsigned long *used_nodes)
> -{
> -	int i, n, val, min_val, best_node = 0;
> -
> -	min_val = INT_MAX;
> -
> -	for (i = 0; i < MAX_NUMNODES; i++) {
> -		/* Start at @node */
> -		n = (node + i) % MAX_NUMNODES;
> -
> -		if (!nr_cpus_node(n))
> -			continue;
> -
> -		/* Skip already used nodes */
> -		if (test_bit(n, used_nodes))
> -			continue;
> -
> -		/* Simple min distance search */
> -		val = node_distance(node, n);
> -
> -		if (val < min_val) {
> -			min_val = val;
> -			best_node = n;
> -		}
> -	}
> -
> -	set_bit(best_node, used_nodes);
> -	return best_node;
> -}
> -
> -/**
> - * sched_domain_node_span - get a cpumask for a node's sched_domain
> - * @node: node whose cpumask we're constructing
> - * @size: number of nodes to include in this span
> - *
> - * Given a node, construct a good cpumask for its sched_domain to span.  It
> - * should be one that prevents unnecessary balancing, but also spreads tasks
> - * out optimally.
> - */
> -static cpumask_t sched_domain_node_span(int node)
> -{
> -	int i;
> -	cpumask_t span, nodemask;
> -	DECLARE_BITMAP(used_nodes, MAX_NUMNODES);
> -
> -	cpus_clear(span);
> -	bitmap_zero(used_nodes, MAX_NUMNODES);
> -
> -	nodemask = node_to_cpumask(node);
> -	cpus_or(span, span, nodemask);
> -	set_bit(node, used_nodes);
> -
> -	for (i = 1; i < SD_NODES_PER_DOMAIN; i++) {
> -		int next_node = find_next_best_node(node, used_nodes);
> -		nodemask = node_to_cpumask(next_node);
> -		cpus_or(span, span, nodemask);
> -	}
> -
> -	return span;
> -}
> -#endif
> -
> -/*
> - * At the moment, CONFIG_SCHED_SMT is never defined, but leave it in so we
> - * can switch it on easily if needed.
> - */
> -#ifdef CONFIG_SCHED_SMT
> -static DEFINE_PER_CPU(struct sched_domain, cpu_domains);
> -static struct sched_group sched_group_cpus[NR_CPUS];
> -static int cpu_to_cpu_group(int cpu)
> -{
> -	return cpu;
> -}
> -#endif
> -
> -static DEFINE_PER_CPU(struct sched_domain, phys_domains);
> -static struct sched_group sched_group_phys[NR_CPUS];
> -static int cpu_to_phys_group(int cpu)
> -{
> -#ifdef CONFIG_SCHED_SMT
> -	return first_cpu(cpu_sibling_map[cpu]);
> -#else
> -	return cpu;
> -#endif
> -}
> -
> -#ifdef CONFIG_NUMA
> -/*
> - * The init_sched_build_groups can't handle what we want to do with node
> - * groups, so roll our own. Now each node has its own list of groups which
> - * gets dynamically allocated.
> - */
> -static DEFINE_PER_CPU(struct sched_domain, node_domains);
> -static struct sched_group *sched_group_nodes[MAX_NUMNODES];
> -
> -static DEFINE_PER_CPU(struct sched_domain, allnodes_domains);
> -static struct sched_group sched_group_allnodes[MAX_NUMNODES];
> -
> -static int cpu_to_allnodes_group(int cpu)
> -{
> -	return cpu_to_node(cpu);
> -}
> -#endif
> -
> -/*
> - * Build sched domains for a given set of cpus and attach the sched domains
> - * to the individual cpus
> - */
> -void build_sched_domains(const cpumask_t *cpu_map)
> -{
> -	int i;
> -
> -	/*
> -	 * Set up domains for cpus specified by the cpu_map.
> -	 */
> -	for_each_cpu_mask(i, *cpu_map) {
> -		int group;
> -		struct sched_domain *sd = NULL, *p;
> -		cpumask_t nodemask = node_to_cpumask(cpu_to_node(i));
> -
> -		cpus_and(nodemask, nodemask, *cpu_map);
> -
> -#ifdef CONFIG_NUMA
> -		if (num_online_cpus()
> -				> SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
> -			sd = &per_cpu(allnodes_domains, i);
> -			*sd = SD_ALLNODES_INIT;
> -			sd->span = *cpu_map;
> -			group = cpu_to_allnodes_group(i);
> -			sd->groups = &sched_group_allnodes[group];
> -			p = sd;
> -		} else
> -			p = NULL;
> -
> -		sd = &per_cpu(node_domains, i);
> -		*sd = SD_NODE_INIT;
> -		sd->span = sched_domain_node_span(cpu_to_node(i));
> -		sd->parent = p;
> -		cpus_and(sd->span, sd->span, *cpu_map);
> -#endif
> -
> -		p = sd;
> -		sd = &per_cpu(phys_domains, i);
> -		group = cpu_to_phys_group(i);
> -		*sd = SD_CPU_INIT;
> -		sd->span = nodemask;
> -		sd->parent = p;
> -		sd->groups = &sched_group_phys[group];
> -
> -#ifdef CONFIG_SCHED_SMT
> -		p = sd;
> -		sd = &per_cpu(cpu_domains, i);
> -		group = cpu_to_cpu_group(i);
> -		*sd = SD_SIBLING_INIT;
> -		sd->span = cpu_sibling_map[i];
> -		cpus_and(sd->span, sd->span, *cpu_map);
> -		sd->parent = p;
> -		sd->groups = &sched_group_cpus[group];
> -#endif
> -	}
> -
> -#ifdef CONFIG_SCHED_SMT
> -	/* Set up CPU (sibling) groups */
> -	for_each_cpu_mask(i, *cpu_map) {
> -		cpumask_t this_sibling_map = cpu_sibling_map[i];
> -		cpus_and(this_sibling_map, this_sibling_map, *cpu_map);
> -		if (i != first_cpu(this_sibling_map))
> -			continue;
> -
> -		init_sched_build_groups(sched_group_cpus, this_sibling_map,
> -						&cpu_to_cpu_group);
> -	}
> -#endif
> -
> -	/* Set up physical groups */
> -	for (i = 0; i < MAX_NUMNODES; i++) {
> -		cpumask_t nodemask = node_to_cpumask(i);
> -
> -		cpus_and(nodemask, nodemask, *cpu_map);
> -		if (cpus_empty(nodemask))
> -			continue;
> -
> -		init_sched_build_groups(sched_group_phys, nodemask,
> -						&cpu_to_phys_group);
> -	}
> -
> -#ifdef CONFIG_NUMA
> -	init_sched_build_groups(sched_group_allnodes, *cpu_map,
> -				&cpu_to_allnodes_group);
> -
> -	for (i = 0; i < MAX_NUMNODES; i++) {
> -		/* Set up node groups */
> -		struct sched_group *sg, *prev;
> -		cpumask_t nodemask = node_to_cpumask(i);
> -		cpumask_t domainspan;
> -		cpumask_t covered = CPU_MASK_NONE;
> -		int j;
> -
> -		cpus_and(nodemask, nodemask, *cpu_map);
> -		if (cpus_empty(nodemask))
> -			continue;
> -
> -		domainspan = sched_domain_node_span(i);
> -		cpus_and(domainspan, domainspan, *cpu_map);
> -
> -		sg = kmalloc(sizeof(struct sched_group), GFP_KERNEL);
> -		sched_group_nodes[i] = sg;
> -		for_each_cpu_mask(j, nodemask) {
> -			struct sched_domain *sd;
> -			sd = &per_cpu(node_domains, j);
> -			sd->groups = sg;
> -			if (sd->groups == NULL) {
> -				/* Turn off balancing if we have no groups */
> -				sd->flags = 0;
> -			}
> -		}
> -		if (!sg) {
> -			printk(KERN_WARNING
> -			"Can not alloc domain group for node %d\n", i);
> -			continue;
> -		}
> -		sg->cpu_power = 0;
> -		sg->cpumask = nodemask;
> -		cpus_or(covered, covered, nodemask);
> -		prev = sg;
> -
> -		for (j = 0; j < MAX_NUMNODES; j++) {
> -			cpumask_t tmp, notcovered;
> -			int n = (i + j) % MAX_NUMNODES;
> -
> -			cpus_complement(notcovered, covered);
> -			cpus_and(tmp, notcovered, *cpu_map);
> -			cpus_and(tmp, tmp, domainspan);
> -			if (cpus_empty(tmp))
> -				break;
> -
> -			nodemask = node_to_cpumask(n);
> -			cpus_and(tmp, tmp, nodemask);
> -			if (cpus_empty(tmp))
> -				continue;
> -
> -			sg = kmalloc(sizeof(struct sched_group), GFP_KERNEL);
> -			if (!sg) {
> -				printk(KERN_WARNING
> -				"Can not alloc domain group for node %d\n", j);
> -				break;
> -			}
> -			sg->cpu_power = 0;
> -			sg->cpumask = tmp;
> -			cpus_or(covered, covered, tmp);
> -			prev->next = sg;
> -			prev = sg;
> -		}
> -		prev->next = sched_group_nodes[i];
> -	}
> -#endif
> -
> -	/* Calculate CPU power for physical packages and nodes */
> -	for_each_cpu_mask(i, *cpu_map) {
> -		int power;
> -		struct sched_domain *sd;
> -#ifdef CONFIG_SCHED_SMT
> -		sd = &per_cpu(cpu_domains, i);
> -		power = SCHED_LOAD_SCALE;
> -		sd->groups->cpu_power = power;
> -#endif
> -
> -		sd = &per_cpu(phys_domains, i);
> -		power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
> -				(cpus_weight(sd->groups->cpumask)-1) / 10;
> -		sd->groups->cpu_power = power;
> -
> -#ifdef CONFIG_NUMA
> -		sd = &per_cpu(allnodes_domains, i);
> -		if (sd->groups) {
> -			power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
> -				(cpus_weight(sd->groups->cpumask)-1) / 10;
> -			sd->groups->cpu_power = power;
> -		}
> -#endif
> -	}
> -
> -#ifdef CONFIG_NUMA
> -	for (i = 0; i < MAX_NUMNODES; i++) {
> -		struct sched_group *sg = sched_group_nodes[i];
> -		int j;
> -
> -		if (sg == NULL)
> -			continue;
> -next_sg:
> -		for_each_cpu_mask(j, sg->cpumask) {
> -			struct sched_domain *sd;
> -			int power;
> -
> -			sd = &per_cpu(phys_domains, j);
> -			if (j != first_cpu(sd->groups->cpumask)) {
> -				/*
> -				 * Only add "power" once for each
> -				 * physical package.
> -				 */
> -				continue;
> -			}
> -			power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
> -				(cpus_weight(sd->groups->cpumask)-1) / 10;
> -
> -			sg->cpu_power += power;
> -		}
> -		sg = sg->next;
> -		if (sg != sched_group_nodes[i])
> -			goto next_sg;
> -	}
> -#endif
> -
> -	/* Attach the domains */
> -	for_each_online_cpu(i) {
> -		struct sched_domain *sd;
> -#ifdef CONFIG_SCHED_SMT
> -		sd = &per_cpu(cpu_domains, i);
> -#else
> -		sd = &per_cpu(phys_domains, i);
> -#endif
> -		cpu_attach_domain(sd, i);
> -	}
> -	/*
> -	 * Tune cache-hot values:
> -	 */
> -	calibrate_migration_costs();
> -}
> -/*
> - * Set up scheduler domains and groups.  Callers must hold the hotplug lock.
> - */
> -void arch_init_sched_domains(const cpumask_t *cpu_map)
> -{
> -	cpumask_t cpu_default_map;
> -
> -	/*
> -	 * Setup mask for cpus without special case scheduling requirements.
> -	 * For now this just excludes isolated cpus, but could be used to
> -	 * exclude other special cases in the future.
> -	 */
> -	cpus_andnot(cpu_default_map, *cpu_map, cpu_isolated_map);
> -
> -	build_sched_domains(&cpu_default_map);
> -}
> -
> -void arch_destroy_sched_domains(const cpumask_t *cpu_map)
> -{
> -#ifdef CONFIG_NUMA
> -	int i;
> -	for (i = 0; i < MAX_NUMNODES; i++) {
> -		cpumask_t nodemask = node_to_cpumask(i);
> -		struct sched_group *oldsg, *sg = sched_group_nodes[i];
> -
> -		cpus_and(nodemask, nodemask, *cpu_map);
> -		if (cpus_empty(nodemask))
> -			continue;
> -
> -		if (sg == NULL)
> -			continue;
> -		sg = sg->next;
> -next_sg:
> -		oldsg = sg;
> -		sg = sg->next;
> -		kfree(oldsg);
> -		if (oldsg != sched_group_nodes[i])
> -			goto next_sg;
> -		sched_group_nodes[i] = NULL;
> -	}
> -#endif
> -}
> -
> Index: linux-sched-curr/include/asm-ia64/processor.h
> ===================================================================
> --- linux-sched-curr.orig/include/asm-ia64/processor.h
> +++ linux-sched-curr/include/asm-ia64/processor.h
> @@ -20,9 +20,6 @@
>  #include <asm/ptrace.h>
>  #include <asm/ustack.h>
>  
> -/* Our arch specific arch_init_sched_domain is in arch/ia64/kernel/domain.c */
> -#define ARCH_HAS_SCHED_DOMAIN
> -
>  #define IA64_NUM_DBG_REGS	8
>  /*
>   * Limits for PMC and PMD are set to less than maximum architected values
> Index: linux-sched-curr/include/asm-ia64/topology.h
> ===================================================================
> --- linux-sched-curr.orig/include/asm-ia64/topology.h
> +++ linux-sched-curr/include/asm-ia64/topology.h
> @@ -96,28 +96,6 @@ void build_cpu_to_node_map(void);
>  	.nr_balance_failed	= 0,			\
>  }
>  
> -/* sched_domains SD_ALLNODES_INIT for IA64 NUMA machines */
> -#define SD_ALLNODES_INIT (struct sched_domain) {	\
> -	.span			= CPU_MASK_NONE,	\
> -	.parent			= NULL,			\
> -	.groups			= NULL,			\
> -	.min_interval		= 64,			\
> -	.max_interval		= 64*num_online_cpus(),	\
> -	.busy_factor		= 128,			\
> -	.imbalance_pct		= 133,			\
> -	.cache_nice_tries	= 1,			\
> -	.busy_idx		= 3,			\
> -	.idle_idx		= 3,			\
> -	.newidle_idx		= 0, /* unused */	\
> -	.wake_idx		= 0, /* unused */	\
> -	.forkexec_idx		= 0, /* unused */	\
> -	.per_cpu_gain		= 100,			\
> -	.flags			= SD_LOAD_BALANCE,	\
> -	.last_balance		= jiffies,		\
> -	.balance_interval	= 64,			\
> -	.nr_balance_failed	= 0,			\
> -}
> -
>  #endif /* CONFIG_NUMA */
>  
>  #include <asm-generic/topology.h>
> Index: linux-sched-curr/include/linux/sched.h
> ===================================================================
> --- linux-sched-curr.orig/include/linux/sched.h
> +++ linux-sched-curr/include/linux/sched.h
> @@ -546,15 +546,6 @@ struct sched_domain {
>  
>  extern void partition_sched_domains(cpumask_t *partition1,
>  				    cpumask_t *partition2);
> -#ifdef ARCH_HAS_SCHED_DOMAIN
> -/* Useful helpers that arch setup code may use. Defined in kernel/sched.c */
> -extern cpumask_t cpu_isolated_map;
> -extern void init_sched_build_groups(struct sched_group groups[],
> -	                        cpumask_t span, int (*group_fn)(int cpu));
> -extern void cpu_attach_domain(struct sched_domain *sd, int cpu);
> -
> -#endif /* ARCH_HAS_SCHED_DOMAIN */
> -
>  /*
>   * Maximum cache size the migration-costs auto-tuning code will
>   * search from:
> Index: linux-sched-curr/include/linux/topology.h
> ===================================================================
> --- linux-sched-curr.orig/include/linux/topology.h
> +++ linux-sched-curr/include/linux/topology.h
> @@ -133,6 +133,28 @@
>  }
>  #endif
>  
> +/* sched_domains SD_ALLNODES_INIT for NUMA machines */
> +#define SD_ALLNODES_INIT (struct sched_domain) {	\
> +	.span			= CPU_MASK_NONE,	\
> +	.parent			= NULL,			\
> +	.groups			= NULL,			\
> +	.min_interval		= 64,			\
> +	.max_interval		= 64*num_online_cpus(),	\
> +	.busy_factor		= 128,			\
> +	.imbalance_pct		= 133,			\
> +	.cache_nice_tries	= 1,			\
> +	.busy_idx		= 3,			\
> +	.idle_idx		= 3,			\
> +	.newidle_idx		= 0, /* unused */	\
> +	.wake_idx		= 0, /* unused */	\
> +	.forkexec_idx		= 0, /* unused */	\
> +	.per_cpu_gain		= 100,			\
> +	.flags			= SD_LOAD_BALANCE,	\
> +	.last_balance		= jiffies,		\
> +	.balance_interval	= 64,			\
> +	.nr_balance_failed	= 0,			\
> +}
> +
>  #ifdef CONFIG_NUMA
>  #ifndef SD_NODE_INIT
>  #error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
> Index: linux-sched-curr/kernel/sched.c
> ===================================================================
> --- linux-sched-curr.orig/kernel/sched.c
> +++ linux-sched-curr/kernel/sched.c
> @@ -4947,7 +4947,7 @@ static int sd_parent_degenerate(struct s
>   * Attach the domain 'sd' to 'cpu' as its base domain.  Callers must
>   * hold the hotplug lock.
>   */
> -void cpu_attach_domain(struct sched_domain *sd, int cpu)
> +static void cpu_attach_domain(struct sched_domain *sd, int cpu)
>  {
>  	runqueue_t *rq = cpu_rq(cpu);
>  	struct sched_domain *tmp;
> @@ -4970,7 +4970,7 @@ void cpu_attach_domain(struct sched_doma
>  }
>  
>  /* cpus with isolated domains */
> -cpumask_t __devinitdata cpu_isolated_map = CPU_MASK_NONE;
> +static cpumask_t __devinitdata cpu_isolated_map = CPU_MASK_NONE;
>  
>  /* Setup the mask of cpus configured for isolated domains */
>  static int __init isolated_cpu_setup(char *str)
> @@ -4998,8 +4998,8 @@ __setup ("isolcpus=", isolated_cpu_setup
>   * covered by the given span, and will set each group's ->cpumask correctly,
>   * and ->cpu_power to 0.
>   */
> -void init_sched_build_groups(struct sched_group groups[],
> -			cpumask_t span, int (*group_fn)(int cpu))
> +static void init_sched_build_groups(struct sched_group groups[], cpumask_t span,
> +				    int (*group_fn)(int cpu))
>  {
>  	struct sched_group *first = NULL, *last = NULL;
>  	cpumask_t covered = CPU_MASK_NONE;
> @@ -5513,12 +5513,85 @@ void __devinit calibrate_migration_costs
>  	local_irq_restore(flags);
>  }
>  
> +#define SD_NODES_PER_DOMAIN 16
>  
> -#ifdef ARCH_HAS_SCHED_DOMAIN
> -extern void build_sched_domains(const cpumask_t *cpu_map);
> -extern void arch_init_sched_domains(const cpumask_t *cpu_map);
> -extern void arch_destroy_sched_domains(const cpumask_t *cpu_map);
> -#else
> +#ifdef CONFIG_NUMA
> +/**
> + * find_next_best_node - find the next node to include in a sched_domain
> + * @node: node whose sched_domain we're building
> + * @used_nodes: nodes already in the sched_domain
> + *
> + * Find the next node to include in a given scheduling domain.  Simply
> + * finds the closest node not already in the @used_nodes map.
> + *
> + * Should use nodemask_t.
> + */
> +static int find_next_best_node(int node, unsigned long *used_nodes)
> +{
> +	int i, n, val, min_val, best_node = 0;
> +
> +	min_val = INT_MAX;
> +
> +	for (i = 0; i < MAX_NUMNODES; i++) {
> +		/* Start at @node */
> +		n = (node + i) % MAX_NUMNODES;
> +
> +		if (!nr_cpus_node(n))
> +			continue;
> +
> +		/* Skip already used nodes */
> +		if (test_bit(n, used_nodes))
> +			continue;
> +
> +		/* Simple min distance search */
> +		val = node_distance(node, n);
> +
> +		if (val < min_val) {
> +			min_val = val;
> +			best_node = n;
> +		}
> +	}
> +
> +	set_bit(best_node, used_nodes);
> +	return best_node;
> +}
> +
> +/**
> + * sched_domain_node_span - get a cpumask for a node's sched_domain
> + * @node: node whose cpumask we're constructing
> + * @size: number of nodes to include in this span
> + *
> + * Given a node, construct a good cpumask for its sched_domain to span.  It
> + * should be one that prevents unnecessary balancing, but also spreads tasks
> + * out optimally.
> + */
> +static cpumask_t sched_domain_node_span(int node)
> +{
> +	int i;
> +	cpumask_t span, nodemask;
> +	DECLARE_BITMAP(used_nodes, MAX_NUMNODES);
> +
> +	cpus_clear(span);
> +	bitmap_zero(used_nodes, MAX_NUMNODES);
> +
> +	nodemask = node_to_cpumask(node);
> +	cpus_or(span, span, nodemask);
> +	set_bit(node, used_nodes);
> +
> +	for (i = 1; i < SD_NODES_PER_DOMAIN; i++) {
> +		int next_node = find_next_best_node(node, used_nodes);
> +		nodemask = node_to_cpumask(next_node);
> +		cpus_or(span, span, nodemask);
> +	}
> +
> +	return span;
> +}
> +#endif
> +
> +/*
> + * At the moment, CONFIG_SCHED_SMT is never defined, but leave it in so we
> + * can switch it on easily if needed.
> + */
>  #ifdef CONFIG_SCHED_SMT
>  static DEFINE_PER_CPU(struct sched_domain, cpu_domains);
>  static struct sched_group sched_group_cpus[NR_CPUS];
> @@ -5540,44 +5613,28 @@ static int cpu_to_phys_group(int cpu)
>  }
>  
>  #ifdef CONFIG_NUMA
> -
> +/*
> + * The init_sched_build_groups can't handle what we want to do with node
> + * groups, so roll our own. Now each node has its own list of groups which
> + * gets dynamically allocated.
> + */
>  static DEFINE_PER_CPU(struct sched_domain, node_domains);
> -static struct sched_group sched_group_nodes[MAX_NUMNODES];
> -static int cpu_to_node_group(int cpu)
> +static struct sched_group *sched_group_nodes[MAX_NUMNODES];
> +
> +static DEFINE_PER_CPU(struct sched_domain, allnodes_domains);
> +static struct sched_group sched_group_allnodes[MAX_NUMNODES];
> +
> +static int cpu_to_allnodes_group(int cpu)
>  {
>  	return cpu_to_node(cpu);
>  }
>  #endif
>  
> -#if defined(CONFIG_SCHED_SMT) && defined(CONFIG_NUMA)
> -/*
> - * The domains setup code relies on siblings not spanning
> - * multiple nodes. Make sure the architecture has a proper
> - * siblings map:
> - */
> -static void check_sibling_maps(void)
> -{
> -	int i, j;
> -
> -	for_each_online_cpu(i) {
> -		for_each_cpu_mask(j, cpu_sibling_map[i]) {
> -			if (cpu_to_node(i) != cpu_to_node(j)) {
> -				printk(KERN_INFO "warning: CPU %d siblings map "
> -					"to different node - isolating "
> -					"them.\n", i);
> -				cpu_sibling_map[i] = cpumask_of_cpu(i);
> -				break;
> -			}
> -		}
> -	}
> -}
> -#endif
> -
>  /*
>   * Build sched domains for a given set of cpus and attach the sched domains
>   * to the individual cpus
>   */
> -static void build_sched_domains(const cpumask_t *cpu_map)
> +void build_sched_domains(const cpumask_t *cpu_map)
>  {
>  	int i;
>  
> @@ -5592,11 +5649,22 @@ static void build_sched_domains(const cp
>  		cpus_and(nodemask, nodemask, *cpu_map);
>  
>  #ifdef CONFIG_NUMA
> +		if (num_online_cpus()
> +				> SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
> +			sd = &per_cpu(allnodes_domains, i);
> +			*sd = SD_ALLNODES_INIT;
> +			sd->span = *cpu_map;
> +			group = cpu_to_allnodes_group(i);
> +			sd->groups = &sched_group_allnodes[group];
> +			p = sd;
> +		} else
> +			p = NULL;
> +
>  		sd = &per_cpu(node_domains, i);
> -		group = cpu_to_node_group(i);
>  		*sd = SD_NODE_INIT;
> -		sd->span = *cpu_map;
> -		sd->groups = &sched_group_nodes[group];
> +		sd->span = sched_domain_node_span(cpu_to_node(i));
> +		sd->parent = p;
> +		cpus_and(sd->span, sd->span, *cpu_map);
>  #endif
>  
>  		p = sd;
> @@ -5621,7 +5689,7 @@ static void build_sched_domains(const cp
>  
>  #ifdef CONFIG_SCHED_SMT
>  	/* Set up CPU (sibling) groups */
> -	for_each_online_cpu(i) {
> +	for_each_cpu_mask(i, *cpu_map) {
>  		cpumask_t this_sibling_map = cpu_sibling_map[i];
>  		cpus_and(this_sibling_map, this_sibling_map, *cpu_map);
>  		if (i != first_cpu(this_sibling_map))
> @@ -5646,8 +5714,74 @@ static void build_sched_domains(const cp
>  
>  #ifdef CONFIG_NUMA
>  	/* Set up node groups */
> -	init_sched_build_groups(sched_group_nodes, *cpu_map,
> -					&cpu_to_node_group);
> +	init_sched_build_groups(sched_group_allnodes, *cpu_map,
> +				&cpu_to_allnodes_group);
> +
> +	for (i = 0; i < MAX_NUMNODES; i++) {
> +		/* Set up node groups */
> +		struct sched_group *sg, *prev;
> +		cpumask_t nodemask = node_to_cpumask(i);
> +		cpumask_t domainspan;
> +		cpumask_t covered = CPU_MASK_NONE;
> +		int j;
> +
> +		cpus_and(nodemask, nodemask, *cpu_map);
> +		if (cpus_empty(nodemask))
> +			continue;
> +
> +		domainspan = sched_domain_node_span(i);
> +		cpus_and(domainspan, domainspan, *cpu_map);
> +
> +		sg = kmalloc(sizeof(struct sched_group), GFP_KERNEL);
> +		sched_group_nodes[i] = sg;
> +		for_each_cpu_mask(j, nodemask) {
> +			struct sched_domain *sd;
> +			sd = &per_cpu(node_domains, j);
> +			sd->groups = sg;
> +			if (sd->groups == NULL) {
> +				/* Turn off balancing if we have no groups */
> +				sd->flags = 0;
> +			}
> +		}
> +		if (!sg) {
> +			printk(KERN_WARNING
> +			"Can not alloc domain group for node %d\n", i);
> +			continue;
> +		}
> +		sg->cpu_power = 0;
> +		sg->cpumask = nodemask;
> +		cpus_or(covered, covered, nodemask);
> +		prev = sg;
> +
> +		for (j = 0; j < MAX_NUMNODES; j++) {
> +			cpumask_t tmp, notcovered;
> +			int n = (i + j) % MAX_NUMNODES;
> +
> +			cpus_complement(notcovered, covered);
> +			cpus_and(tmp, notcovered, *cpu_map);
> +			cpus_and(tmp, tmp, domainspan);
> +			if (cpus_empty(tmp))
> +				break;
> +
> +			nodemask = node_to_cpumask(n);
> +			cpus_and(tmp, tmp, nodemask);
> +			if (cpus_empty(tmp))
> +				continue;
> +
> +			sg = kmalloc(sizeof(struct sched_group), GFP_KERNEL);
> +			if (!sg) {
> +				printk(KERN_WARNING
> +				"Can not alloc domain group for node %d\n", j);
> +				break;
> +			}
> +			sg->cpu_power = 0;
> +			sg->cpumask = tmp;
> +			cpus_or(covered, covered, tmp);
> +			prev->next = sg;
> +			prev = sg;
> +		}
> +		prev->next = sched_group_nodes[i];
> +	}
>  #endif
>  
>  	/* Calculate CPU power for physical packages and nodes */
> @@ -5666,14 +5800,46 @@ static void build_sched_domains(const cp
>  		sd->groups->cpu_power = power;
>  
>  #ifdef CONFIG_NUMA
> -		if (i == first_cpu(sd->groups->cpumask)) {
> -			/* Only add "power" once for each physical package. */
> -			sd = &per_cpu(node_domains, i);
> -			sd->groups->cpu_power += power;
> +		sd = &per_cpu(allnodes_domains, i);
> +		if (sd->groups) {
> +			power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
> +				(cpus_weight(sd->groups->cpumask)-1) / 10;
> +			sd->groups->cpu_power = power;
>  		}
>  #endif
>  	}
>  
> +#ifdef CONFIG_NUMA
> +	for (i = 0; i < MAX_NUMNODES; i++) {
> +		struct sched_group *sg = sched_group_nodes[i];
> +		int j;
> +
> +		if (sg == NULL)
> +			continue;
> +next_sg:
> +		for_each_cpu_mask(j, sg->cpumask) {
> +			struct sched_domain *sd;
> +			int power;
> +
> +			sd = &per_cpu(phys_domains, j);
> +			if (j != first_cpu(sd->groups->cpumask)) {
> +				/*
> +				 * Only add "power" once for each
> +				 * physical package.
> +				 */
> +				continue;
> +			}
> +			power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
> +				(cpus_weight(sd->groups->cpumask)-1) / 10;
> +
> +			sg->cpu_power += power;
> +		}
> +		sg = sg->next;
> +		if (sg != sched_group_nodes[i])
> +			goto next_sg;
> +	}
> +#endif
> +
>  	/* Attach the domains */
>  	for_each_cpu_mask(i, *cpu_map) {
>  		struct sched_domain *sd;
> @@ -5692,13 +5858,10 @@ static void build_sched_domains(const cp
>  /*
>   * Set up scheduler domains and groups.  Callers must hold the hotplug lock.
>   */
> -static void arch_init_sched_domains(cpumask_t *cpu_map)
> +static void arch_init_sched_domains(const cpumask_t *cpu_map)
>  {
>  	cpumask_t cpu_default_map;
>  
> -#if defined(CONFIG_SCHED_SMT) && defined(CONFIG_NUMA)
> -	check_sibling_maps();
> -#endif
>  	/*
>  	 * Setup mask for cpus without special case scheduling requirements.
>  	 * For now this just excludes isolated cpus, but could be used to
> @@ -5711,10 +5874,29 @@ static void arch_init_sched_domains(cpum
>  
>  static void arch_destroy_sched_domains(const cpumask_t *cpu_map)
>  {
> -	/* Do nothing: everything is statically allocated. */
> -}
> +#ifdef CONFIG_NUMA
> +	int i;
> +	for (i = 0; i < MAX_NUMNODES; i++) {
> +		cpumask_t nodemask = node_to_cpumask(i);
> +		struct sched_group *oldsg, *sg = sched_group_nodes[i];
> +
> +		cpus_and(nodemask, nodemask, *cpu_map);
> +		if (cpus_empty(nodemask))
> +			continue;
>  
> -#endif /* ARCH_HAS_SCHED_DOMAIN */
> +		if (sg == NULL)
> +			continue;
> +		sg = sg->next;
> +next_sg:
> +		oldsg = sg;
> +		sg = sg->next;
> +		kfree(oldsg);
> +		if (oldsg != sched_group_nodes[i])
> +			goto next_sg;
> +		sched_group_nodes[i] = NULL;
> +	}
> +#endif
> +}
>  
>  /*
>   * Detach sched domains from a group of cpus specified in cpu_map