This is the Completely Fair Scheduler (CFS) v9 patch for Linux 2.6.20.10, rediffed cleanly against 2.6.20.11.

   http://people.redhat.com/mingo/cfs-scheduler/

Index: linux-cfs-2.6.20.8.q/Documentation/kernel-parameters.txt
===================================================================
--- linux-cfs-2.6.20.8.q.orig/Documentation/kernel-parameters.txt
+++ linux-cfs-2.6.20.8.q/Documentation/kernel-parameters.txt
@@ -914,49 +914,6 @@ and is between 256 and 4096 characters.
 
 	mga=		[HW,DRM]
 
-	migration_cost=
-			[KNL,SMP] debug: override scheduler migration costs
-			Format: <level-1-usecs>,<level-2-usecs>,...
-			This debugging option can be used to override the
-			default scheduler migration cost matrix. The numbers
-			are indexed by 'CPU domain distance'.
-			E.g. migration_cost=1000,2000,3000 on an SMT NUMA
-			box will set up an intra-core migration cost of
-			1 msec, an inter-core migration cost of 2 msecs,
-			and an inter-node migration cost of 3 msecs.
-
-			WARNING: using the wrong values here can break
-			scheduler performance, so it's only for scheduler
-			development purposes, not production environments.
-
-	migration_debug=
-			[KNL,SMP] migration cost auto-detect verbosity
-			Format=<0|1|2>
-			If a system's migration matrix reported at bootup
-			seems erroneous then this option can be used to
-			increase verbosity of the detection process.
-			We default to 0 (no extra messages), 1 will print
-			some more information, and 2 will be really
-			verbose (probably only useful if you also have a
-			serial console attached to the system).
-
-	migration_factor=
-			[KNL,SMP] multiply/divide migration costs by a factor
-			Format=<percent>
-			This debug option can be used to proportionally
-			increase or decrease the auto-detected migration
-			costs for all entries of the migration matrix.
-			E.g. migration_factor=150 will increase migration
-			costs by 50%. (and thus the scheduler will be less
-			eager migrating cache-hot tasks)
-			migration_factor=80 will decrease migration costs
-			by 20%. (thus the scheduler will be more eager to
-			migrate tasks)
-
-			WARNING: using the wrong values here can break
-			scheduler performance, so it's only for scheduler
-			development purposes, not production environments.
-
 	mousedev.tap_time=
 			[MOUSE] Maximum time between finger touching and
 			leaving touchpad surface for touch to be considered
Index: linux-cfs-2.6.20.8.q/Documentation/sched-design-CFS.txt
===================================================================
--- /dev/null
+++ linux-cfs-2.6.20.8.q/Documentation/sched-design-CFS.txt
@@ -0,0 +1,142 @@
+[announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
+
+i'm pleased to announce the first release of the "Modular Scheduler Core
+and Completely Fair Scheduler [CFS]" patchset:
+
+   http://redhat.com/~mingo/cfs-scheduler/
+
+This project is a complete rewrite of the Linux task scheduler. My goal
+is to address various feature requests and to fix deficiencies in the
+vanilla scheduler that were suggested/found in the past few years, both
+for desktop scheduling and for server scheduling workloads.
+
+[ QuickStart: apply the patch, recompile, reboot. The new scheduler
+  will be active by default and all tasks will default to the
+  SCHED_NORMAL interactive scheduling class. ]
+
+Highlights are:
+
+ - the introduction of Scheduling Classes: an extensible hierarchy of
+   scheduler modules. These modules encapsulate scheduling policy
+   details and are handled by the scheduler core without the core
+   code assuming too much about them.
+
+ - sched_fair.c implements the 'CFS desktop scheduler': it is a
+   replacement for the vanilla scheduler's SCHED_OTHER interactivity
+   code.
+
+   i'd like to give credit to Con Kolivas for the general approach here:
+   he has proven via RSDL/SD that 'fair scheduling' is possible and that
+   it results in better desktop scheduling. Kudos Con!
+
+   The CFS patch uses a completely different approach and implementation
+   from RSDL/SD. My goal was to make CFS's interactivity quality exceed
+   that of RSDL/SD, which is a high standard to meet :-) Testing
+   feedback is welcome to decide this one way or another. [ and, in any
+   case, all of SD's logic could be added via a kernel/sched_sd.c module
+   as well, if Con is interested in such an approach. ]
+
+   CFS's design is quite radical: it does not use runqueues, it uses a
+   time-ordered rbtree to build a 'timeline' of future task execution,
+   and thus has no 'array switch' artifacts (by which both the vanilla
+   scheduler and RSDL/SD are affected).
+
+   CFS uses nanosecond granularity accounting and does not rely on any
+   jiffies or other HZ detail. Thus the CFS scheduler has no notion of
+   'timeslices' and has no heuristics whatsoever. There is only one
+   central tunable:
+
+         /proc/sys/kernel/sched_granularity_ns
+
+   which can be used to tune the scheduler from 'desktop' (low
+   latencies) to 'server' (good batching) workloads. It defaults to a
+   setting suitable for desktop workloads. SCHED_BATCH is handled by the
+   CFS scheduler module too.
+
+   due to its design, the CFS scheduler is not prone to any of the
+   'attacks' that exist today against the heuristics of the stock
+   scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
+   work fine and do not impact interactivity and produce the expected
+   behavior.
+
+   the CFS scheduler has a much stronger handling of nice levels and
+   SCHED_BATCH: both types of workloads should be isolated much more
+   aggressively than under the vanilla scheduler.
+
+   ( another detail: due to nanosec accounting and timeline sorting,
+     sched_yield() support is very simple under CFS, and in fact under
+     CFS sched_yield() behaves much better than under any other
+     scheduler i have tested so far. )
+
+ - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
+   way than the vanilla scheduler does. It uses 100 runqueues (for all
+   100 RT priority levels, instead of 140 in the vanilla scheduler)
+   and it needs no expired array.
+
+ - reworked/sanitized SMP load-balancing: the runqueue-walking
+   assumptions are gone from the load-balancing code now, and
+   iterators of the scheduling modules are used. The balancing code got
+   quite a bit simpler as a result.
+
+the core scheduler got smaller by more than 700 lines:
+
+ kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------
+ 1 file changed, 372 insertions(+), 1082 deletions(-)
+
+and even adding all the scheduling modules, the total size impact is
+relatively small:
+
+ 18 files changed, 1454 insertions(+), 1133 deletions(-)
+
+most of the increase is due to extensive comments. The kernel size
+impact is in fact a small negative:
+
+   text    data     bss     dec     hex filename
+  23366    4001      24   27391    6aff kernel/sched.o.vanilla
+  24159    2705      56   26920    6928 kernel/sched.o.CFS
+
+(this is mainly due to the benefit of getting rid of the expired array
+and its data structure overhead.)
+
+thanks go to Thomas Gleixner and Arjan van de Ven for review of this
+patchset.
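+
+( a small usage illustration, not part of the patchset itself: the
+  central tunable above and the per-task /proc/<pid>/sched statistics
+  file added by this patch are both plain proc files. A minimal
+  sketch in C, assuming only those two interfaces: )
+
+	#include <stdio.h>
+
+	/* dump a small /proc file to stdout */
+	static void dump(const char *path)
+	{
+		char buf[256];
+		FILE *f = fopen(path, "r");
+
+		if (!f) {
+			perror(path);
+			return;
+		}
+		printf("==> %s\n", path);
+		while (fgets(buf, sizeof(buf), f))
+			fputs(buf, stdout);
+		fclose(f);
+	}
+
+	int main(void)
+	{
+		/* per-task CFS statistics, printed by the kernel */
+		dump("/proc/self/sched");
+		/* the central tunable, in nanoseconds */
+		dump("/proc/sys/kernel/sched_granularity_ns");
+		return 0;
+	}
+
+( lowering sched_granularity_ns tunes towards 'desktop' latencies,
+  raising it tunes towards 'server'-style batching, as noted above. )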
+
+as usual, any sort of feedback, bug reports, fixes and suggestions are
+more than welcome,
+
+	Ingo
Index: linux-cfs-2.6.20.8.q/Makefile
===================================================================
--- linux-cfs-2.6.20.8.q.orig/Makefile
+++ linux-cfs-2.6.20.8.q/Makefile
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 20
-EXTRAVERSION = .11
+EXTRAVERSION = .11-cfs-v9
 NAME = Homicidal Dwarf Hamster
 
 # *DOCUMENTATION*
Index: linux-cfs-2.6.20.8.q/arch/i386/kernel/smpboot.c
===================================================================
--- linux-cfs-2.6.20.8.q.orig/arch/i386/kernel/smpboot.c
+++ linux-cfs-2.6.20.8.q/arch/i386/kernel/smpboot.c
@@ -1132,18 +1132,6 @@ exit:
 }
 #endif
 
-static void smp_tune_scheduling(void)
-{
-	unsigned long cachesize;       /* kB */
-
-	if (cpu_khz) {
-		cachesize = boot_cpu_data.x86_cache_size;
-
-		if (cachesize > 0)
-			max_cache_size = cachesize * 1024;
-	}
-}
-
 /*
  * Cycle through the processors sending APIC IPIs to boot each.
  */
@@ -1172,7 +1160,6 @@ static void __init smp_boot_cpus(unsigne
 	x86_cpu_to_apicid[0] = boot_cpu_physical_apicid;
 
 	current_thread_info()->cpu = 0;
-	smp_tune_scheduling();
 
 	set_cpu_sibling_map(0);
 
Index: linux-cfs-2.6.20.8.q/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-cfs-2.6.20.8.q.orig/arch/i386/kernel/syscall_table.S
+++ linux-cfs-2.6.20.8.q/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_sched_yield_to	/* 320 */
Index: linux-cfs-2.6.20.8.q/arch/i386/kernel/tsc.c
===================================================================
--- linux-cfs-2.6.20.8.q.orig/arch/i386/kernel/tsc.c
+++ linux-cfs-2.6.20.8.q/arch/i386/kernel/tsc.c
@@ -61,6 +61,8 @@ static inline int check_tsc_unstable(voi
 
 void mark_tsc_unstable(void)
 {
+	sched_clock_unstable_event();
+
 	tsc_unstable = 1;
 }
 EXPORT_SYMBOL_GPL(mark_tsc_unstable);
@@ -107,13 +109,7 @@ unsigned long long sched_clock(void)
 {
 	unsigned long long this_offset;
 
-	/*
-	 * in the NUMA case we dont use the TSC as they are not
-	 * synchronized across all CPUs.
- */ -#ifndef CONFIG_NUMA - if (!cpu_khz || check_tsc_unstable()) -#endif + if (!cpu_khz || !cpu_has_tsc) /* no locking but a rare wrong value is not a big deal */ return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ); Index: linux-cfs-2.6.20.8.q/arch/ia64/kernel/setup.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/arch/ia64/kernel/setup.c +++ linux-cfs-2.6.20.8.q/arch/ia64/kernel/setup.c @@ -773,7 +773,6 @@ static void __cpuinit get_max_cacheline_size (void) { unsigned long line_size, max = 1; - unsigned int cache_size = 0; u64 l, levels, unique_caches; pal_cache_config_info_t cci; s64 status; @@ -803,8 +802,6 @@ get_max_cacheline_size (void) line_size = 1 << cci.pcci_line_size; if (line_size > max) max = line_size; - if (cache_size < cci.pcci_cache_size) - cache_size = cci.pcci_cache_size; if (!cci.pcci_unified) { status = ia64_pal_cache_config_info(l, /* cache_type (instruction)= */ 1, @@ -821,9 +818,6 @@ get_max_cacheline_size (void) ia64_i_cache_stride_shift = cci.pcci_stride; } out: -#ifdef CONFIG_SMP - max_cache_size = max(max_cache_size, cache_size); -#endif if (max > ia64_max_cacheline_size) ia64_max_cacheline_size = max; } Index: linux-cfs-2.6.20.8.q/arch/mips/kernel/smp.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/arch/mips/kernel/smp.c +++ linux-cfs-2.6.20.8.q/arch/mips/kernel/smp.c @@ -245,7 +245,6 @@ void __init smp_prepare_cpus(unsigned in { init_new_context(current, &init_mm); current_thread_info()->cpu = 0; - smp_tune_scheduling(); plat_prepare_cpus(max_cpus); #ifndef CONFIG_HOTPLUG_CPU cpu_present_map = cpu_possible_map; Index: linux-cfs-2.6.20.8.q/arch/sparc/kernel/smp.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/arch/sparc/kernel/smp.c +++ linux-cfs-2.6.20.8.q/arch/sparc/kernel/smp.c @@ -69,16 +69,6 @@ void __cpuinit smp_store_cpu_info(int id cpu_data(id).prom_node = cpu_node; cpu_data(id).mid = cpu_get_hwmid(cpu_node); - /* this is required to tune the scheduler correctly */ - /* is it possible to have CPUs with different cache sizes? */ - if (id == boot_cpu_id) { - int cache_line,cache_nlines; - cache_line = 0x20; - cache_line = prom_getintdefault(cpu_node, "ecache-line-size", cache_line); - cache_nlines = 0x8000; - cache_nlines = prom_getintdefault(cpu_node, "ecache-nlines", cache_nlines); - max_cache_size = cache_line * cache_nlines; - } if (cpu_data(id).mid < 0) panic("No MID found for CPU%d at node 0x%08d", id, cpu_node); } Index: linux-cfs-2.6.20.8.q/arch/sparc64/kernel/smp.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/arch/sparc64/kernel/smp.c +++ linux-cfs-2.6.20.8.q/arch/sparc64/kernel/smp.c @@ -1293,41 +1293,6 @@ int setup_profiling_timer(unsigned int m return 0; } -static void __init smp_tune_scheduling(void) -{ - struct device_node *dp; - int instance; - unsigned int def, smallest = ~0U; - - def = ((tlb_type == hypervisor) ? - (3 * 1024 * 1024) : - (4 * 1024 * 1024)); - - instance = 0; - while (!cpu_find_by_instance(instance, &dp, NULL)) { - unsigned int val; - - val = of_getintprop_default(dp, "ecache-size", def); - if (val < smallest) - smallest = val; - - instance++; - } - - /* Any value less than 256K is nonsense. 
*/ - if (smallest < (256U * 1024U)) - smallest = 256 * 1024; - - max_cache_size = smallest; - - if (smallest < 1U * 1024U * 1024U) - printk(KERN_INFO "Using max_cache_size of %uKB\n", - smallest / 1024U); - else - printk(KERN_INFO "Using max_cache_size of %uMB\n", - smallest / 1024U / 1024U); -} - /* Constrain the number of cpus to max_cpus. */ void __init smp_prepare_cpus(unsigned int max_cpus) { @@ -1363,7 +1328,6 @@ void __init smp_prepare_cpus(unsigned in } smp_store_cpu_info(boot_cpu_id); - smp_tune_scheduling(); } /* Set this up early so that things like the scheduler can init Index: linux-cfs-2.6.20.8.q/fs/proc/array.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/fs/proc/array.c +++ linux-cfs-2.6.20.8.q/fs/proc/array.c @@ -165,7 +165,6 @@ static inline char * task_state(struct t rcu_read_lock(); buffer += sprintf(buffer, "State:\t%s\n" - "SleepAVG:\t%lu%%\n" "Tgid:\t%d\n" "Pid:\t%d\n" "PPid:\t%d\n" @@ -173,9 +172,8 @@ static inline char * task_state(struct t "Uid:\t%d\t%d\t%d\t%d\n" "Gid:\t%d\t%d\t%d\t%d\n", get_task_state(p), - (p->sleep_avg/1024)*100/(1020000000/1024), - p->tgid, p->pid, - pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0, + p->tgid, p->pid, + pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0, pid_alive(p) && p->ptrace ? rcu_dereference(p->parent)->pid : 0, p->uid, p->euid, p->suid, p->fsuid, p->gid, p->egid, p->sgid, p->fsgid); @@ -312,6 +310,11 @@ int proc_pid_status(struct task_struct * return buffer - orig; } +int proc_pid_sched(struct task_struct *task, char *buffer) +{ + return sched_print_task_state(task, buffer) - buffer; +} + static int do_task_stat(struct task_struct *task, char * buffer, int whole) { unsigned long vsize, eip, esp, wchan = ~0UL; Index: linux-cfs-2.6.20.8.q/fs/proc/base.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/fs/proc/base.c +++ linux-cfs-2.6.20.8.q/fs/proc/base.c @@ -1839,6 +1839,7 @@ static struct pid_entry tgid_base_stuff[ INF("environ", S_IRUSR, pid_environ), INF("auxv", S_IRUSR, pid_auxv), INF("status", S_IRUGO, pid_status), + INF("sched", S_IRUGO, pid_sched), INF("cmdline", S_IRUGO, pid_cmdline), INF("stat", S_IRUGO, tgid_stat), INF("statm", S_IRUGO, pid_statm), @@ -2121,6 +2122,7 @@ static struct pid_entry tid_base_stuff[] INF("environ", S_IRUSR, pid_environ), INF("auxv", S_IRUSR, pid_auxv), INF("status", S_IRUGO, pid_status), + INF("sched", S_IRUGO, pid_sched), INF("cmdline", S_IRUGO, pid_cmdline), INF("stat", S_IRUGO, tid_stat), INF("statm", S_IRUGO, pid_statm), Index: linux-cfs-2.6.20.8.q/fs/proc/internal.h =================================================================== --- linux-cfs-2.6.20.8.q.orig/fs/proc/internal.h +++ linux-cfs-2.6.20.8.q/fs/proc/internal.h @@ -36,6 +36,7 @@ extern int proc_exe_link(struct inode *, extern int proc_tid_stat(struct task_struct *, char *); extern int proc_tgid_stat(struct task_struct *, char *); extern int proc_pid_status(struct task_struct *, char *); +extern int proc_pid_sched(struct task_struct *, char *); extern int proc_pid_statm(struct task_struct *, char *); extern struct file_operations proc_maps_operations; Index: linux-cfs-2.6.20.8.q/include/asm-generic/bitops/sched.h =================================================================== --- linux-cfs-2.6.20.8.q.orig/include/asm-generic/bitops/sched.h +++ linux-cfs-2.6.20.8.q/include/asm-generic/bitops/sched.h @@ -6,28 +6,23 @@ /* * Every architecture must define this function. 
It's the fastest - * way of searching a 140-bit bitmap where the first 100 bits are - * unlikely to be set. It's guaranteed that at least one of the 140 - * bits is cleared. + * way of searching a 100-bit bitmap. It's guaranteed that at least + * one of the 100 bits is cleared. */ static inline int sched_find_first_bit(const unsigned long *b) { #if BITS_PER_LONG == 64 - if (unlikely(b[0])) + if (b[0]) return __ffs(b[0]); - if (likely(b[1])) - return __ffs(b[1]) + 64; - return __ffs(b[2]) + 128; + return __ffs(b[1]) + 64; #elif BITS_PER_LONG == 32 - if (unlikely(b[0])) + if (b[0]) return __ffs(b[0]); - if (unlikely(b[1])) + if (b[1]) return __ffs(b[1]) + 32; - if (unlikely(b[2])) + if (b[2]) return __ffs(b[2]) + 64; - if (b[3]) - return __ffs(b[3]) + 96; - return __ffs(b[4]) + 128; + return __ffs(b[3]) + 96; #else #error BITS_PER_LONG not defined #endif Index: linux-cfs-2.6.20.8.q/include/asm-i386/topology.h =================================================================== --- linux-cfs-2.6.20.8.q.orig/include/asm-i386/topology.h +++ linux-cfs-2.6.20.8.q/include/asm-i386/topology.h @@ -85,7 +85,6 @@ static inline int node_to_first_cpu(int .idle_idx = 1, \ .newidle_idx = 2, \ .wake_idx = 1, \ - .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ Index: linux-cfs-2.6.20.8.q/include/asm-i386/unistd.h =================================================================== --- linux-cfs-2.6.20.8.q.orig/include/asm-i386/unistd.h +++ linux-cfs-2.6.20.8.q/include/asm-i386/unistd.h @@ -325,10 +325,11 @@ #define __NR_move_pages 317 #define __NR_getcpu 318 #define __NR_epoll_pwait 319 +#define __NR_sched_yield_to 320 #ifdef __KERNEL__ -#define NR_syscalls 320 +#define NR_syscalls 321 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-cfs-2.6.20.8.q/include/asm-ia64/topology.h =================================================================== --- linux-cfs-2.6.20.8.q.orig/include/asm-ia64/topology.h +++ linux-cfs-2.6.20.8.q/include/asm-ia64/topology.h @@ -65,7 +65,6 @@ void build_cpu_to_node_map(void); .max_interval = 4, \ .busy_factor = 64, \ .imbalance_pct = 125, \ - .per_cpu_gain = 100, \ .cache_nice_tries = 2, \ .busy_idx = 2, \ .idle_idx = 1, \ @@ -97,7 +96,6 @@ void build_cpu_to_node_map(void); .newidle_idx = 0, /* unused */ \ .wake_idx = 1, \ .forkexec_idx = 1, \ - .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ Index: linux-cfs-2.6.20.8.q/include/asm-mips/mach-ip27/topology.h =================================================================== --- linux-cfs-2.6.20.8.q.orig/include/asm-mips/mach-ip27/topology.h +++ linux-cfs-2.6.20.8.q/include/asm-mips/mach-ip27/topology.h @@ -28,7 +28,6 @@ extern unsigned char __node_distances[MA .busy_factor = 32, \ .imbalance_pct = 125, \ .cache_nice_tries = 1, \ - .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_EXEC \ | SD_WAKE_BALANCE, \ Index: linux-cfs-2.6.20.8.q/include/asm-powerpc/topology.h =================================================================== --- linux-cfs-2.6.20.8.q.orig/include/asm-powerpc/topology.h +++ linux-cfs-2.6.20.8.q/include/asm-powerpc/topology.h @@ -57,7 +57,6 @@ static inline int pcibus_to_node(struct .busy_factor = 32, \ .imbalance_pct = 125, \ .cache_nice_tries = 1, \ - .per_cpu_gain = 100, \ .busy_idx = 3, \ .idle_idx = 1, \ .newidle_idx = 2, \ Index: linux-cfs-2.6.20.8.q/include/asm-x86_64/topology.h =================================================================== --- 
linux-cfs-2.6.20.8.q.orig/include/asm-x86_64/topology.h
+++ linux-cfs-2.6.20.8.q/include/asm-x86_64/topology.h
@@ -43,7 +43,6 @@ extern int __node_distance(int, int);
 	.newidle_idx		= 0, \
 	.wake_idx		= 1, \
 	.forkexec_idx		= 1, \
-	.per_cpu_gain		= 100, \
 	.flags			= SD_LOAD_BALANCE \
 				| SD_BALANCE_FORK \
 				| SD_BALANCE_EXEC \
Index: linux-cfs-2.6.20.8.q/include/asm-x86_64/unistd.h
===================================================================
--- linux-cfs-2.6.20.8.q.orig/include/asm-x86_64/unistd.h
+++ linux-cfs-2.6.20.8.q/include/asm-x86_64/unistd.h
@@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages	279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_sched_yield_to	280
+__SYSCALL(__NR_sched_yield_to, sys_sched_yield_to)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_sched_yield_to
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-cfs-2.6.20.8.q/include/linux/hardirq.h
===================================================================
--- linux-cfs-2.6.20.8.q.orig/include/linux/hardirq.h
+++ linux-cfs-2.6.20.8.q/include/linux/hardirq.h
@@ -79,6 +79,19 @@
 #endif
 
 #ifdef CONFIG_PREEMPT
+# define PREEMPT_CHECK_OFFSET 1
+#else
+# define PREEMPT_CHECK_OFFSET 0
+#endif
+
+/*
+ * Check whether we were atomic before we did preempt_disable():
+ * (used by the scheduler)
+ */
+#define in_atomic_preempt_off() \
+	((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)
+
+#ifdef CONFIG_PREEMPT
 # define preemptible()	(preempt_count() == 0 && !irqs_disabled())
 # define IRQ_EXIT_OFFSET (HARDIRQ_OFFSET-1)
 #else
Index: linux-cfs-2.6.20.8.q/include/linux/ktime.h
===================================================================
--- linux-cfs-2.6.20.8.q.orig/include/linux/ktime.h
+++ linux-cfs-2.6.20.8.q/include/linux/ktime.h
@@ -274,4 +274,6 @@ extern void ktime_get_ts(struct timespec
 /* Get the real (wall-) time in timespec format: */
 #define ktime_get_real_ts(ts)	getnstimeofday(ts)
 
+extern ktime_t ktime_get(void);
+
 #endif
Index: linux-cfs-2.6.20.8.q/include/linux/sched.h
===================================================================
--- linux-cfs-2.6.20.8.q.orig/include/linux/sched.h
+++ linux-cfs-2.6.20.8.q/include/linux/sched.h
@@ -2,7 +2,6 @@
 #define _LINUX_SCHED_H
 
 #include <linux/auxvec.h>	/* For AT_VECTOR_SIZE */
-
 /*
  * cloning flags:
  */
@@ -37,6 +36,8 @@
 
 #ifdef __KERNEL__
 
+#include <linux/rbtree.h>	/* For run_node */
+
 struct sched_param {
 	int sched_priority;
 };
@@ -196,13 +197,13 @@ extern void init_idle(struct task_struct
 extern cpumask_t nohz_cpu_mask;
 
 /*
- * Only dump TASK_* tasks. (-1 for all tasks)
+ * Only dump TASK_* tasks. (0 for all tasks)
 */
 extern void show_state_filter(unsigned long state_filter);
 
 static inline void show_state(void)
 {
-	show_state_filter(-1);
+	show_state_filter(0);
 }
 
 extern void show_regs(struct pt_regs *);
@@ -464,7 +465,7 @@ struct signal_struct {
 	 * from jiffies_to_ns(utime + stime) if sched_clock uses something
	 * other than jiffies.)
*/ - unsigned long long sched_time; + unsigned long long sum_sched_runtime; /* * We don't bother to synchronize most readers of this at all, @@ -524,6 +525,7 @@ struct signal_struct { #define MAX_RT_PRIO MAX_USER_RT_PRIO #define MAX_PRIO (MAX_RT_PRIO + 40) +#define DEFAULT_PRIO (MAX_RT_PRIO + 20) #define rt_prio(prio) unlikely((prio) < MAX_RT_PRIO) #define rt_task(p) rt_prio((p)->prio) @@ -635,7 +637,14 @@ enum idle_type /* * sched-domains (multiprocessor balancing) declarations: */ -#define SCHED_LOAD_SCALE 128UL /* increase resolution of load */ + +/* + * Increase resolution of nice-level calculations: + */ +#define SCHED_LOAD_SHIFT 10 +#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT) + +#define SCHED_LOAD_SCALE_FUZZ (SCHED_LOAD_SCALE >> 5) #ifdef CONFIG_SMP #define SD_LOAD_BALANCE 1 /* Do load balancing on this domain. */ @@ -684,7 +693,6 @@ struct sched_domain { unsigned int imbalance_pct; /* No balance until over watermark */ unsigned long long cache_hot_time; /* Task considered cache hot (ns) */ unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ - unsigned int per_cpu_gain; /* CPU % gained by adding domain cpus */ unsigned int busy_idx; unsigned int idle_idx; unsigned int newidle_idx; @@ -733,12 +741,6 @@ struct sched_domain { extern int partition_sched_domains(cpumask_t *partition1, cpumask_t *partition2); -/* - * Maximum cache size the migration-costs auto-tuning code will - * search from: - */ -extern unsigned int max_cache_size; - #endif /* CONFIG_SMP */ @@ -789,14 +791,28 @@ struct mempolicy; struct pipe_inode_info; struct uts_namespace; -enum sleep_type { - SLEEP_NORMAL, - SLEEP_NONINTERACTIVE, - SLEEP_INTERACTIVE, - SLEEP_INTERRUPTED, -}; +struct rq; -struct prio_array; +struct sched_class { + struct sched_class *next; + + void (*enqueue_task) (struct rq *rq, struct task_struct *p, + int wakeup, u64 now); + void (*dequeue_task) (struct rq *rq, struct task_struct *p, + int sleep, u64 now); + void (*yield_task) (struct rq *rq, struct task_struct *p, + struct task_struct *p_to); + + void (*check_preempt_curr) (struct rq *rq, struct task_struct *p); + + struct task_struct * (*pick_next_task) (struct rq *rq, u64 now); + void (*put_prev_task) (struct rq *rq, struct task_struct *p, u64 now); + + struct task_struct * (*load_balance_start) (struct rq *rq); + struct task_struct * (*load_balance_next) (struct rq *rq); + void (*task_tick) (struct rq *rq, struct task_struct *p); + void (*task_new) (struct rq *rq, struct task_struct *p); +}; struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ @@ -813,26 +829,45 @@ struct task_struct { #endif #endif int load_weight; /* for niceness load balancing purposes */ + int load_shift; + int prio, static_prio, normal_prio; + int on_rq; struct list_head run_list; - struct prio_array *array; + struct rb_node run_node; unsigned short ioprio; #ifdef CONFIG_BLK_DEV_IO_TRACE unsigned int btrace_seq; #endif - unsigned long sleep_avg; - unsigned long long timestamp, last_ran; - unsigned long long sched_time; /* sched_clock time spent running */ - enum sleep_type sleep_type; + /* CFS scheduling class statistics fields: */ + u64 wait_start_fair; + u64 wait_start; + u64 exec_start; + u64 sleep_start; + u64 block_start; + u64 sleep_max; + u64 block_max; + u64 exec_max; + u64 wait_max; + u64 last_ran; + + s64 wait_runtime; + u64 sum_exec_runtime; + s64 fair_key; + s64 sum_wait_runtime; unsigned long policy; cpumask_t cpus_allowed; - unsigned int time_slice, first_time_slice; + unsigned int time_slice; + 
struct sched_class *sched_class; + + s64 min_wait_runtime; #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) struct sched_info sched_info; #endif + u64 nr_switches; struct list_head tasks; /* @@ -1195,8 +1230,9 @@ static inline int set_cpus_allowed(struc #endif extern unsigned long long sched_clock(void); +extern void sched_clock_unstable_event(void); extern unsigned long long -current_sched_time(const struct task_struct *current_task); +current_sched_runtime(const struct task_struct *current_task); /* sched_exec is called by processes performing an exec */ #ifdef CONFIG_SMP @@ -1212,6 +1248,13 @@ static inline void idle_task_exit(void) #endif extern void sched_idle_next(void); +extern char * sched_print_task_state(struct task_struct *p, char *buffer); + +extern unsigned int sysctl_sched_granularity; +extern unsigned int sysctl_sched_wakeup_granularity; +extern unsigned int sysctl_sched_sleep_history_max; +extern unsigned int sysctl_sched_child_runs_first; +extern unsigned int sysctl_sched_load_smoothing; #ifdef CONFIG_RT_MUTEXES extern int rt_mutex_getprio(struct task_struct *p); @@ -1290,8 +1333,7 @@ extern void FASTCALL(wake_up_new_task(st #else static inline void kick_process(struct task_struct *tsk) { } #endif -extern void FASTCALL(sched_fork(struct task_struct * p, int clone_flags)); -extern void FASTCALL(sched_exit(struct task_struct * p)); +extern void sched_fork(struct task_struct * p, int clone_flags); extern int in_group_p(gid_t); extern int in_egroup_p(gid_t); Index: linux-cfs-2.6.20.8.q/include/linux/topology.h =================================================================== --- linux-cfs-2.6.20.8.q.orig/include/linux/topology.h +++ linux-cfs-2.6.20.8.q/include/linux/topology.h @@ -96,7 +96,6 @@ .busy_factor = 64, \ .imbalance_pct = 110, \ .cache_nice_tries = 0, \ - .per_cpu_gain = 25, \ .busy_idx = 0, \ .idle_idx = 0, \ .newidle_idx = 1, \ @@ -128,7 +127,6 @@ .busy_factor = 64, \ .imbalance_pct = 125, \ .cache_nice_tries = 1, \ - .per_cpu_gain = 100, \ .busy_idx = 2, \ .idle_idx = 1, \ .newidle_idx = 2, \ @@ -159,7 +157,6 @@ .busy_factor = 64, \ .imbalance_pct = 125, \ .cache_nice_tries = 1, \ - .per_cpu_gain = 100, \ .busy_idx = 2, \ .idle_idx = 1, \ .newidle_idx = 2, \ @@ -193,7 +190,6 @@ .newidle_idx = 0, /* unused */ \ .wake_idx = 0, /* unused */ \ .forkexec_idx = 0, /* unused */ \ - .per_cpu_gain = 100, \ .flags = SD_LOAD_BALANCE \ | SD_SERIALIZE, \ .last_balance = jiffies, \ Index: linux-cfs-2.6.20.8.q/init/main.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/init/main.c +++ linux-cfs-2.6.20.8.q/init/main.c @@ -422,7 +422,7 @@ static void noinline rest_init(void) /* * The boot idle thread must execute schedule() - * at least one to get things moving: + * at least once to get things moving: */ preempt_enable_no_resched(); schedule(); Index: linux-cfs-2.6.20.8.q/kernel/exit.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/kernel/exit.c +++ linux-cfs-2.6.20.8.q/kernel/exit.c @@ -112,7 +112,7 @@ static void __exit_signal(struct task_st sig->maj_flt += tsk->maj_flt; sig->nvcsw += tsk->nvcsw; sig->nivcsw += tsk->nivcsw; - sig->sched_time += tsk->sched_time; + sig->sum_sched_runtime += tsk->sum_exec_runtime; sig = NULL; /* Marker for below. 
*/ } @@ -170,7 +170,6 @@ repeat: zap_leader = (leader->exit_signal == -1); } - sched_exit(p); write_unlock_irq(&tasklist_lock); proc_flush_task(p); release_thread(p); Index: linux-cfs-2.6.20.8.q/kernel/fork.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/kernel/fork.c +++ linux-cfs-2.6.20.8.q/kernel/fork.c @@ -874,7 +874,7 @@ static inline int copy_signal(unsigned l sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero; sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0; sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0; - sig->sched_time = 0; + sig->sum_sched_runtime = 0; INIT_LIST_HEAD(&sig->cpu_timers[0]); INIT_LIST_HEAD(&sig->cpu_timers[1]); INIT_LIST_HEAD(&sig->cpu_timers[2]); @@ -1037,7 +1037,7 @@ static struct task_struct *copy_process( p->utime = cputime_zero; p->stime = cputime_zero; - p->sched_time = 0; + p->rchar = 0; /* I/O counter: bytes read */ p->wchar = 0; /* I/O counter: bytes written */ p->syscr = 0; /* I/O counter: read syscalls */ Index: linux-cfs-2.6.20.8.q/kernel/hrtimer.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/kernel/hrtimer.c +++ linux-cfs-2.6.20.8.q/kernel/hrtimer.c @@ -45,7 +45,7 @@ * * returns the time in ktime_t format */ -static ktime_t ktime_get(void) +ktime_t ktime_get(void) { struct timespec now; Index: linux-cfs-2.6.20.8.q/kernel/posix-cpu-timers.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/kernel/posix-cpu-timers.c +++ linux-cfs-2.6.20.8.q/kernel/posix-cpu-timers.c @@ -161,7 +161,7 @@ static inline cputime_t virt_ticks(struc } static inline unsigned long long sched_ns(struct task_struct *p) { - return (p == current) ? current_sched_time(p) : p->sched_time; + return (p == current) ? current_sched_runtime(p) : p->sum_exec_runtime; } int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp) @@ -246,10 +246,10 @@ static int cpu_clock_sample_group_locked } while (t != p); break; case CPUCLOCK_SCHED: - cpu->sched = p->signal->sched_time; + cpu->sched = p->signal->sum_sched_runtime; /* Add in each other live thread. 
*/ while ((t = next_thread(t)) != p) { - cpu->sched += t->sched_time; + cpu->sched += t->sum_exec_runtime; } cpu->sched += sched_ns(p); break; @@ -417,7 +417,7 @@ int posix_cpu_timer_del(struct k_itimer */ static void cleanup_timers(struct list_head *head, cputime_t utime, cputime_t stime, - unsigned long long sched_time) + unsigned long long sum_exec_runtime) { struct cpu_timer_list *timer, *next; cputime_t ptime = cputime_add(utime, stime); @@ -446,10 +446,10 @@ static void cleanup_timers(struct list_h ++head; list_for_each_entry_safe(timer, next, head, entry) { list_del_init(&timer->entry); - if (timer->expires.sched < sched_time) { + if (timer->expires.sched < sum_exec_runtime) { timer->expires.sched = 0; } else { - timer->expires.sched -= sched_time; + timer->expires.sched -= sum_exec_runtime; } } } @@ -462,7 +462,7 @@ static void cleanup_timers(struct list_h void posix_cpu_timers_exit(struct task_struct *tsk) { cleanup_timers(tsk->cpu_timers, - tsk->utime, tsk->stime, tsk->sched_time); + tsk->utime, tsk->stime, tsk->sum_exec_runtime); } void posix_cpu_timers_exit_group(struct task_struct *tsk) @@ -470,7 +470,7 @@ void posix_cpu_timers_exit_group(struct cleanup_timers(tsk->signal->cpu_timers, cputime_add(tsk->utime, tsk->signal->utime), cputime_add(tsk->stime, tsk->signal->stime), - tsk->sched_time + tsk->signal->sched_time); + tsk->sum_exec_runtime + tsk->signal->sum_sched_runtime); } @@ -531,7 +531,7 @@ static void process_timer_rebalance(stru nsleft = max_t(unsigned long long, nsleft, 1); do { if (likely(!(t->flags & PF_EXITING))) { - ns = t->sched_time + nsleft; + ns = t->sum_exec_runtime + nsleft; if (t->it_sched_expires == 0 || t->it_sched_expires > ns) { t->it_sched_expires = ns; @@ -999,7 +999,7 @@ static void check_thread_timers(struct t struct cpu_timer_list *t = list_entry(timers->next, struct cpu_timer_list, entry); - if (!--maxfire || tsk->sched_time < t->expires.sched) { + if (!--maxfire || tsk->sum_exec_runtime < t->expires.sched) { tsk->it_sched_expires = t->expires.sched; break; } @@ -1019,7 +1019,7 @@ static void check_process_timers(struct int maxfire; struct signal_struct *const sig = tsk->signal; cputime_t utime, stime, ptime, virt_expires, prof_expires; - unsigned long long sched_time, sched_expires; + unsigned long long sum_sched_runtime, sched_expires; struct task_struct *t; struct list_head *timers = sig->cpu_timers; @@ -1039,12 +1039,12 @@ static void check_process_timers(struct */ utime = sig->utime; stime = sig->stime; - sched_time = sig->sched_time; + sum_sched_runtime = sig->sum_sched_runtime; t = tsk; do { utime = cputime_add(utime, t->utime); stime = cputime_add(stime, t->stime); - sched_time += t->sched_time; + sum_sched_runtime += t->sum_exec_runtime; t = next_thread(t); } while (t != tsk); ptime = cputime_add(utime, stime); @@ -1085,7 +1085,7 @@ static void check_process_timers(struct struct cpu_timer_list *t = list_entry(timers->next, struct cpu_timer_list, entry); - if (!--maxfire || sched_time < t->expires.sched) { + if (!--maxfire || sum_sched_runtime < t->expires.sched) { sched_expires = t->expires.sched; break; } @@ -1177,7 +1177,7 @@ static void check_process_timers(struct virt_left = cputime_sub(virt_expires, utime); virt_left = cputime_div_non_zero(virt_left, nthreads); if (sched_expires) { - sched_left = sched_expires - sched_time; + sched_left = sched_expires - sum_sched_runtime; do_div(sched_left, nthreads); sched_left = max_t(unsigned long long, sched_left, 1); } else { @@ -1203,7 +1203,7 @@ static void check_process_timers(struct 
t->it_virt_expires = ticks; } - sched = t->sched_time + sched_left; + sched = t->sum_exec_runtime + sched_left; if (sched_expires && (t->it_sched_expires == 0 || t->it_sched_expires > sched)) { t->it_sched_expires = sched; @@ -1295,7 +1295,7 @@ void run_posix_cpu_timers(struct task_st if (UNEXPIRED(prof) && UNEXPIRED(virt) && (tsk->it_sched_expires == 0 || - tsk->sched_time < tsk->it_sched_expires)) + tsk->sum_exec_runtime < tsk->it_sched_expires)) return; #undef UNEXPIRED Index: linux-cfs-2.6.20.8.q/kernel/sched.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/kernel/sched.c +++ linux-cfs-2.6.20.8.q/kernel/sched.c @@ -89,110 +89,13 @@ */ #define MIN_TIMESLICE max(5 * HZ / 1000, 1) #define DEF_TIMESLICE (100 * HZ / 1000) -#define ON_RUNQUEUE_WEIGHT 30 -#define CHILD_PENALTY 95 -#define PARENT_PENALTY 100 -#define EXIT_WEIGHT 3 -#define PRIO_BONUS_RATIO 25 -#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100) -#define INTERACTIVE_DELTA 2 -#define MAX_SLEEP_AVG (DEF_TIMESLICE * MAX_BONUS) -#define STARVATION_LIMIT (MAX_SLEEP_AVG) -#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG)) - -/* - * If a task is 'interactive' then we reinsert it in the active - * array after it has expired its current timeslice. (it will not - * continue to run immediately, it will still roundrobin with - * other interactive tasks.) - * - * This part scales the interactivity limit depending on niceness. - * - * We scale it linearly, offset by the INTERACTIVE_DELTA delta. - * Here are a few examples of different nice levels: - * - * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0] - * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0] - * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0] - * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0] - * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0] - * - * (the X axis represents the possible -5 ... 0 ... +5 dynamic - * priority range a task can explore, a value of '1' means the - * task is rated interactive.) - * - * Ie. nice +19 tasks can never get 'interactive' enough to be - * reinserted into the active array. And only heavily CPU-hog nice -20 - * tasks will be expired. Default nice 0 tasks are somewhere between, - * it takes some effort for them to get interactive, but it's not - * too hard. - */ - -#define CURRENT_BONUS(p) \ - (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \ - MAX_SLEEP_AVG) - -#define GRANULARITY (10 * HZ / 1000 ? : 1) - -#ifdef CONFIG_SMP -#define TIMESLICE_GRANULARITY(p) (GRANULARITY * \ - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \ - num_online_cpus()) -#else -#define TIMESLICE_GRANULARITY(p) (GRANULARITY * \ - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1))) -#endif - -#define SCALE(v1,v1_max,v2_max) \ - (v1) * (v2_max) / (v1_max) - -#define DELTA(p) \ - (SCALE(TASK_NICE(p) + 20, 40, MAX_BONUS) - 20 * MAX_BONUS / 40 + \ - INTERACTIVE_DELTA) - -#define TASK_INTERACTIVE(p) \ - ((p)->prio <= (p)->static_prio - DELTA(p)) - -#define INTERACTIVE_SLEEP(p) \ - (JIFFIES_TO_NS(MAX_SLEEP_AVG * \ - (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1)) - -#define TASK_PREEMPTS_CURR(p, rq) \ - ((p)->prio < (rq)->curr->prio) - -#define SCALE_PRIO(x, prio) \ - max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE) - -static unsigned int static_prio_timeslice(int static_prio) -{ - if (static_prio < NICE_TO_PRIO(0)) - return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio); - else - return SCALE_PRIO(DEF_TIMESLICE, static_prio); -} - -/* - * task_timeslice() scales user-nice values [ -20 ... 0 ... 
19 ] - * to time slice values: [800ms ... 100ms ... 5ms] - * - * The higher a thread's priority, the bigger timeslices - * it gets during one round of execution. But even the lowest - * priority thread gets MIN_TIMESLICE worth of execution time. - */ - -static inline unsigned int task_timeslice(struct task_struct *p) -{ - return static_prio_timeslice(p->static_prio); -} /* - * These are the runqueue data structures: + * This is the priority-queue data structure of the RT scheduling class: */ - struct prio_array { - unsigned int nr_active; - DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */ - struct list_head queue[MAX_PRIO]; + DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */ + struct list_head queue[MAX_RT_PRIO]; }; /* @@ -209,12 +112,13 @@ struct rq { * nr_running and cpu_load should be in the same cacheline because * remote CPUs use both these fields when doing load calculation. */ - unsigned long nr_running; + long nr_running; unsigned long raw_weighted_load; -#ifdef CONFIG_SMP - unsigned long cpu_load[3]; -#endif - unsigned long long nr_switches; + #define CPU_LOAD_IDX_MAX 5 + unsigned long cpu_load[CPU_LOAD_IDX_MAX]; + + u64 nr_switches; + unsigned long nr_load_updates; /* * This is part of a global counter where only the total sum @@ -224,14 +128,29 @@ struct rq { */ unsigned long nr_uninterruptible; - unsigned long expired_timestamp; - /* Cached timestamp set by update_cpu_clock() */ - unsigned long long most_recent_timestamp; struct task_struct *curr, *idle; unsigned long next_balance; struct mm_struct *prev_mm; - struct prio_array *active, *expired, arrays[2]; - int best_expired_prio; + + u64 clock, prev_clock_raw; + s64 clock_max_delta; + u64 fair_clock, prev_fair_clock; + u64 exec_clock, prev_exec_clock; + u64 wait_runtime; + + unsigned int clock_warps; + unsigned int clock_unstable_events; + + struct sched_class *load_balance_class; + + struct prio_array active; + int rt_load_balance_idx; + struct list_head *rt_load_balance_head, *rt_load_balance_curr; + + struct rb_root tasks_timeline; + struct rb_node *rb_leftmost; + struct rb_node *rb_load_balance_curr; + atomic_t nr_iowait; #ifdef CONFIG_SMP @@ -268,7 +187,107 @@ struct rq { struct lock_class_key rq_lock_key; }; -static DEFINE_PER_CPU(struct rq, runqueues); +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp; + +static inline void check_preempt_curr(struct rq *rq, struct task_struct *p) +{ + rq->curr->sched_class->check_preempt_curr(rq, p); +} + +#define SCALE_PRIO(x, prio) \ + max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE) + +/* + * static_prio_timeslice() scales user-nice values [ -20 ... 0 ... 19 ] + * to time slice values: [800ms ... 100ms ... 
5ms] + */ +static unsigned int static_prio_timeslice(int static_prio) +{ + if (static_prio == NICE_TO_PRIO(19)) + return 1; + + if (static_prio < NICE_TO_PRIO(0)) + return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio); + else + return SCALE_PRIO(DEF_TIMESLICE, static_prio); +} + +/* + * Print out various scheduling related per-task fields: + */ +char * sched_print_task_state(struct task_struct *p, char *buffer) +{ + struct rq *this_rq = &per_cpu(runqueues, raw_smp_processor_id()); + unsigned long long t0, t1; + +#define P(F) \ + buffer += sprintf(buffer, "%-25s:%20Ld\n", #F, (long long)p->F) + + P(wait_start); + P(wait_start_fair); + P(exec_start); + P(sleep_start); + P(block_start); + P(sleep_max); + P(block_max); + P(exec_max); + P(wait_max); + P(min_wait_runtime); + P(last_ran); + P(wait_runtime); + P(sum_exec_runtime); +#undef P + + t0 = sched_clock(); + t1 = sched_clock(); + buffer += sprintf(buffer, "%-25s:%20Ld\n", "clock-delta", + (long long)t1-t0); + buffer += sprintf(buffer, "%-25s:%20Ld\n", "rq-wait_runtime", + (long long)this_rq->wait_runtime); + buffer += sprintf(buffer, "%-25s:%20Ld\n", "rq-exec_clock", + (long long)this_rq->exec_clock); + buffer += sprintf(buffer, "%-25s:%20Ld\n", "rq-fair_clock", + (long long)this_rq->fair_clock); + buffer += sprintf(buffer, "%-25s:%20Ld\n", "rq-clock", + (long long)this_rq->clock); + buffer += sprintf(buffer, "%-25s:%20Ld\n", "rq-prev_clock_raw", + (long long)this_rq->prev_clock_raw); + buffer += sprintf(buffer, "%-25s:%20Ld\n", "rq-clock_max_delta", + (long long)this_rq->clock_max_delta); + buffer += sprintf(buffer, "%-25s:%20u\n", "rq-clock_warps", + this_rq->clock_warps); + buffer += sprintf(buffer, "%-25s:%20u\n", "rq-clock_unstable_events", + this_rq->clock_unstable_events); + return buffer; +} + +/* + * Per-runqueue clock, as finegrained as the platform can give us: + */ +static inline unsigned long long __rq_clock(struct rq *rq) +{ + u64 now = sched_clock(); + u64 clock = rq->clock; + u64 prev_raw = rq->prev_clock_raw; + s64 delta = now - prev_raw; + + /* + * Protect against sched_clock() occasionally going backwards: + */ + if (unlikely(delta < 0)) { + clock++; + rq->clock_warps++; + } else { + if (unlikely(delta > rq->clock_max_delta)) + rq->clock_max_delta = delta; + clock += delta; + } + + rq->prev_clock_raw = now; + rq->clock = clock; + + return clock; +} static inline int cpu_of(struct rq *rq) { @@ -279,6 +298,16 @@ static inline int cpu_of(struct rq *rq) #endif } +static inline unsigned long long rq_clock(struct rq *rq) +{ + int this_cpu = smp_processor_id(); + + if (this_cpu == cpu_of(rq)) + return __rq_clock(rq); + + return rq->clock; +} + /* * The domain tree (rq->sd) is protected by RCU's quiescent state transition. * See detach_destroy_domains: synchronize_sched for details. 
@@ -423,134 +452,6 @@ static inline void task_rq_unlock(struct spin_unlock_irqrestore(&rq->lock, *flags); } -#ifdef CONFIG_SCHEDSTATS -/* - * bump this up when changing the output format or the meaning of an existing - * format, so that tools can adapt (or abort) - */ -#define SCHEDSTAT_VERSION 14 - -static int show_schedstat(struct seq_file *seq, void *v) -{ - int cpu; - - seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION); - seq_printf(seq, "timestamp %lu\n", jiffies); - for_each_online_cpu(cpu) { - struct rq *rq = cpu_rq(cpu); -#ifdef CONFIG_SMP - struct sched_domain *sd; - int dcnt = 0; -#endif - - /* runqueue-specific stats */ - seq_printf(seq, - "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu", - cpu, rq->yld_both_empty, - rq->yld_act_empty, rq->yld_exp_empty, rq->yld_cnt, - rq->sched_switch, rq->sched_cnt, rq->sched_goidle, - rq->ttwu_cnt, rq->ttwu_local, - rq->rq_sched_info.cpu_time, - rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt); - - seq_printf(seq, "\n"); - -#ifdef CONFIG_SMP - /* domain-specific stats */ - preempt_disable(); - for_each_domain(cpu, sd) { - enum idle_type itype; - char mask_str[NR_CPUS]; - - cpumask_scnprintf(mask_str, NR_CPUS, sd->span); - seq_printf(seq, "domain%d %s", dcnt++, mask_str); - for (itype = SCHED_IDLE; itype < MAX_IDLE_TYPES; - itype++) { - seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu " - "%lu", - sd->lb_cnt[itype], - sd->lb_balanced[itype], - sd->lb_failed[itype], - sd->lb_imbalance[itype], - sd->lb_gained[itype], - sd->lb_hot_gained[itype], - sd->lb_nobusyq[itype], - sd->lb_nobusyg[itype]); - } - seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu" - " %lu %lu %lu\n", - sd->alb_cnt, sd->alb_failed, sd->alb_pushed, - sd->sbe_cnt, sd->sbe_balanced, sd->sbe_pushed, - sd->sbf_cnt, sd->sbf_balanced, sd->sbf_pushed, - sd->ttwu_wake_remote, sd->ttwu_move_affine, - sd->ttwu_move_balance); - } - preempt_enable(); -#endif - } - return 0; -} - -static int schedstat_open(struct inode *inode, struct file *file) -{ - unsigned int size = PAGE_SIZE * (1 + num_online_cpus() / 32); - char *buf = kmalloc(size, GFP_KERNEL); - struct seq_file *m; - int res; - - if (!buf) - return -ENOMEM; - res = single_open(file, show_schedstat, NULL); - if (!res) { - m = file->private_data; - m->buf = buf; - m->size = size; - } else - kfree(buf); - return res; -} - -const struct file_operations proc_schedstat_operations = { - .open = schedstat_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; - -/* - * Expects runqueue lock to be held for atomicity of update - */ -static inline void -rq_sched_info_arrive(struct rq *rq, unsigned long delta_jiffies) -{ - if (rq) { - rq->rq_sched_info.run_delay += delta_jiffies; - rq->rq_sched_info.pcnt++; - } -} - -/* - * Expects runqueue lock to be held for atomicity of update - */ -static inline void -rq_sched_info_depart(struct rq *rq, unsigned long delta_jiffies) -{ - if (rq) - rq->rq_sched_info.cpu_time += delta_jiffies; -} -# define schedstat_inc(rq, field) do { (rq)->field++; } while (0) -# define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0) -#else /* !CONFIG_SCHEDSTATS */ -static inline void -rq_sched_info_arrive(struct rq *rq, unsigned long delta_jiffies) -{} -static inline void -rq_sched_info_depart(struct rq *rq, unsigned long delta_jiffies) -{} -# define schedstat_inc(rq, field) do { } while (0) -# define schedstat_add(rq, field, amt) do { } while (0) -#endif - /* * this_rq_lock - lock this runqueue and disable interrupts. 
*/ @@ -566,178 +467,60 @@ static inline struct rq *this_rq_lock(vo return rq; } -#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) -/* - * Called when a process is dequeued from the active array and given - * the cpu. We should note that with the exception of interactive - * tasks, the expired queue will become the active queue after the active - * queue is empty, without explicitly dequeuing and requeuing tasks in the - * expired queue. (Interactive tasks may be requeued directly to the - * active queue, thus delaying tasks in the expired queue from running; - * see scheduler_tick()). - * - * This function is only called from sched_info_arrive(), rather than - * dequeue_task(). Even though a task may be queued and dequeued multiple - * times as it is shuffled about, we're really interested in knowing how - * long it was from the *first* time it was queued to the time that it - * finally hit a cpu. - */ -static inline void sched_info_dequeued(struct task_struct *t) -{ - t->sched_info.last_queued = 0; -} - /* - * Called when a task finally hits the cpu. We can now calculate how - * long it was waiting to run. We also note when it began so that we - * can keep stats on how long its timeslice is. + * CPU frequency is/was unstable - start new by setting prev_clock_raw: */ -static void sched_info_arrive(struct task_struct *t) +void sched_clock_unstable_event(void) { - unsigned long now = jiffies, delta_jiffies = 0; - - if (t->sched_info.last_queued) - delta_jiffies = now - t->sched_info.last_queued; - sched_info_dequeued(t); - t->sched_info.run_delay += delta_jiffies; - t->sched_info.last_arrival = now; - t->sched_info.pcnt++; + unsigned long flags; + struct rq *rq; - rq_sched_info_arrive(task_rq(t), delta_jiffies); + rq = task_rq_lock(current, &flags); + rq->prev_clock_raw = sched_clock(); + rq->clock_unstable_events++; + task_rq_unlock(rq, &flags); } /* - * Called when a process is queued into either the active or expired - * array. The time is noted and later used to determine how long we - * had to wait for us to reach the cpu. Since the expired queue will - * become the active queue after active queue is empty, without dequeuing - * and requeuing any tasks, we are interested in queuing to either. It - * is unusual but not impossible for tasks to be dequeued and immediately - * requeued in the same or another array: this can happen in sched_yield(), - * set_user_nice(), and even load_balance() as it moves tasks from runqueue - * to runqueue. + * resched_task - mark a task 'to be rescheduled now'. * - * This function is only called from enqueue_task(), but also only updates - * the timestamp if it is already not set. It's assumed that - * sched_info_dequeued() will clear that stamp when appropriate. - */ -static inline void sched_info_queued(struct task_struct *t) -{ - if (unlikely(sched_info_on())) - if (!t->sched_info.last_queued) - t->sched_info.last_queued = jiffies; -} - -/* - * Called when a process ceases being the active-running process, either - * voluntarily or involuntarily. Now we can calculate how long we ran. + * On UP this means the setting of the need_resched flag, on SMP it + * might also involve a cross-CPU call to trigger the scheduler on + * the target CPU. 
*/ -static inline void sched_info_depart(struct task_struct *t) -{ - unsigned long delta_jiffies = jiffies - t->sched_info.last_arrival; +#ifdef CONFIG_SMP - t->sched_info.cpu_time += delta_jiffies; - rq_sched_info_depart(task_rq(t), delta_jiffies); -} +#ifndef tsk_is_polling +#define tsk_is_polling(t) test_tsk_thread_flag(t, TIF_POLLING_NRFLAG) +#endif -/* - * Called when tasks are switched involuntarily due, typically, to expiring - * their time slice. (This may also be called when switching to or from - * the idle task.) We are only called when prev != next. - */ -static inline void -__sched_info_switch(struct task_struct *prev, struct task_struct *next) +static void resched_task(struct task_struct *p) { - struct rq *rq = task_rq(prev); - - /* - * prev now departs the cpu. It's not interesting to record - * stats about how efficient we were at scheduling the idle - * process, however. - */ - if (prev != rq->idle) - sched_info_depart(prev); + int cpu; - if (next != rq->idle) - sched_info_arrive(next); -} -static inline void -sched_info_switch(struct task_struct *prev, struct task_struct *next) -{ - if (unlikely(sched_info_on())) - __sched_info_switch(prev, next); -} -#else -#define sched_info_queued(t) do { } while (0) -#define sched_info_switch(t, next) do { } while (0) -#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */ + assert_spin_locked(&task_rq(p)->lock); -/* - * Adding/removing a task to/from a priority array: - */ -static void dequeue_task(struct task_struct *p, struct prio_array *array) -{ - array->nr_active--; - list_del(&p->run_list); - if (list_empty(array->queue + p->prio)) - __clear_bit(p->prio, array->bitmap); -} + if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED))) + return; -static void enqueue_task(struct task_struct *p, struct prio_array *array) -{ - sched_info_queued(p); - list_add_tail(&p->run_list, array->queue + p->prio); - __set_bit(p->prio, array->bitmap); - array->nr_active++; - p->array = array; -} + set_tsk_thread_flag(p, TIF_NEED_RESCHED); -/* - * Put task to the end of the run list without the overhead of dequeue - * followed by enqueue. - */ -static void requeue_task(struct task_struct *p, struct prio_array *array) -{ - list_move_tail(&p->run_list, array->queue + p->prio); -} + cpu = task_cpu(p); + if (cpu == smp_processor_id()) + return; -static inline void -enqueue_task_head(struct task_struct *p, struct prio_array *array) -{ - list_add(&p->run_list, array->queue + p->prio); - __set_bit(p->prio, array->bitmap); - array->nr_active++; - p->array = array; + /* NEED_RESCHED must be visible before we test polling */ + smp_mb(); + if (!tsk_is_polling(p)) + smp_send_reschedule(cpu); } - -/* - * __normal_prio - return the priority that is based on the static - * priority but is modified by bonuses/penalties. - * - * We scale the actual sleep average [0 .... MAX_SLEEP_AVG] - * into the -5 ... 0 ... +5 bonus/penalty range. - * - * We use 25% of the full 0...39 priority range so that: - * - * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs. - * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks. - * - * Both properties are important to certain workloads. 
- */
-
-static inline int __normal_prio(struct task_struct *p)
+#else
+static inline void resched_task(struct task_struct *p)
 {
-	int bonus, prio;
-
-	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
-
-	prio = p->static_prio - bonus;
-	if (prio < MAX_RT_PRIO)
-		prio = MAX_RT_PRIO;
-	if (prio > MAX_PRIO-1)
-		prio = MAX_PRIO-1;
-	return prio;
+	assert_spin_locked(&task_rq(p)->lock);
+	set_tsk_need_resched(p);
 }
+#endif
 
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution
@@ -761,22 +544,33 @@ static inline int __normal_prio(struct t
 #define RTPRIO_TO_LOAD_WEIGHT(rp) \
 	(PRIO_TO_LOAD_WEIGHT(MAX_RT_PRIO) + LOAD_WEIGHT(rp))
 
+/*
+ * Nice levels are logarithmic. These are the load shifts assigned
+ * to nice levels, where a step of 2 nice levels means a
+ * multiplier of 2:
+ */
+const int prio_to_load_shift[40] = {
+/* -20 */ 20, 19, 19, 18, 18, 17, 17, 16, 16, 15,
+/* -10 */ 15, 14, 14, 13, 13, 12, 12, 11, 11, 10,
+/*   0 */ 10,  9,  9,  8,  8,  7,  7,  6,  6,  5,
+/*  10 */  5,  4,  4,  3,  3,  2,  2,  1,  1,  0
+};
+
+static int get_load_shift(struct task_struct *p)
+{
+	int prio = p->static_prio;
+
+	if (rt_prio(prio) || p->policy == SCHED_BATCH)
+		return 0;
+
+	return prio_to_load_shift[prio - MAX_RT_PRIO];
+}
+
 static void set_load_weight(struct task_struct *p)
 {
-	if (has_rt_policy(p)) {
-#ifdef CONFIG_SMP
-		if (p == task_rq(p)->migration_thread)
-			/*
-			 * The migration thread does the actual balancing.
-			 * Giving its load any weight will skew balancing
-			 * adversely.
-			 */
-			p->load_weight = 0;
-		else
-#endif
-			p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
-	} else
-		p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
+	p->load_shift = get_load_shift(p);
+	p->load_weight = 1 << p->load_shift;
+	p->wait_runtime = 0;
 }
 
 static inline void
@@ -803,6 +597,40 @@ static inline void dec_nr_running(struct
 	dec_raw_weighted_load(rq, p);
 }
 
+static void activate_task(struct rq *rq, struct task_struct *p, int wakeup);
+
+#include "sched_stats.h"
+#include "sched_rt.c"
+#include "sched_fair.c"
+#include "sched_debug.c"
+
+#define sched_class_highest (&rt_sched_class)
+
+static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
+{
+	u64 now = rq_clock(rq);
+
+	sched_info_queued(p);
+	p->sched_class->enqueue_task(rq, p, wakeup, now);
+	p->on_rq = 1;
+}
+
+static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
+{
+	u64 now = rq_clock(rq);
+
+	p->sched_class->dequeue_task(rq, p, sleep, now);
+	p->on_rq = 0;
+}
+
+/*
+ * __normal_prio - return the priority that is based on the static prio
+ */
+static inline int __normal_prio(struct task_struct *p)
+{
+	return p->static_prio;
+}
+
 /*
  * Calculate the expected normal priority: i.e. priority
  * without taking RT-inheritance into account. Might be
@@ -842,210 +670,31 @@ static int effective_prio(struct task_st
 }
 
 /*
- * __activate_task - move a task to the runqueue.
+ * activate_task - move a task to the runqueue.
  */
-static void __activate_task(struct task_struct *p, struct rq *rq)
+static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
 {
-	struct prio_array *target = rq->active;
-
-	if (batch_task(p))
-		target = rq->expired;
-	enqueue_task(p, target);
+	enqueue_task(rq, p, wakeup);
 	inc_nr_running(p, rq);
 }
 
 /*
- * __activate_idle_task - move idle task to the _front_ of runqueue.
+ * activate_idle_task - move idle task to the _front_ of runqueue.
*/ -static inline void __activate_idle_task(struct task_struct *p, struct rq *rq) +static inline void activate_idle_task(struct task_struct *p, struct rq *rq) { - enqueue_task_head(p, rq->active); + enqueue_task(rq, p, 0); inc_nr_running(p, rq); } /* - * Recalculate p->normal_prio and p->prio after having slept, - * updating the sleep-average too: - */ -static int recalc_task_prio(struct task_struct *p, unsigned long long now) -{ - /* Caller must always ensure 'now >= p->timestamp' */ - unsigned long sleep_time = now - p->timestamp; - - if (batch_task(p)) - sleep_time = 0; - - if (likely(sleep_time > 0)) { - /* - * This ceiling is set to the lowest priority that would allow - * a task to be reinserted into the active array on timeslice - * completion. - */ - unsigned long ceiling = INTERACTIVE_SLEEP(p); - - if (p->mm && sleep_time > ceiling && p->sleep_avg < ceiling) { - /* - * Prevents user tasks from achieving best priority - * with one single large enough sleep. - */ - p->sleep_avg = ceiling; - /* - * Using INTERACTIVE_SLEEP() as a ceiling places a - * nice(0) task 1ms sleep away from promotion, and - * gives it 700ms to round-robin with no chance of - * being demoted. This is more than generous, so - * mark this sleep as non-interactive to prevent the - * on-runqueue bonus logic from intervening should - * this task not receive cpu immediately. - */ - p->sleep_type = SLEEP_NONINTERACTIVE; - } else { - /* - * Tasks waking from uninterruptible sleep are - * limited in their sleep_avg rise as they - * are likely to be waiting on I/O - */ - if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) { - if (p->sleep_avg >= ceiling) - sleep_time = 0; - else if (p->sleep_avg + sleep_time >= - ceiling) { - p->sleep_avg = ceiling; - sleep_time = 0; - } - } - - /* - * This code gives a bonus to interactive tasks. - * - * The boost works by updating the 'average sleep time' - * value here, based on ->timestamp. The more time a - * task spends sleeping, the higher the average gets - - * and the higher the priority boost gets as well. - */ - p->sleep_avg += sleep_time; - - } - if (p->sleep_avg > NS_MAX_SLEEP_AVG) - p->sleep_avg = NS_MAX_SLEEP_AVG; - } - - return effective_prio(p); -} - -/* - * activate_task - move a task to the runqueue and do priority recalculation - * - * Update all the scheduling statistics stuff. (sleep average - * calculation, priority modifiers, etc.) - */ -static void activate_task(struct task_struct *p, struct rq *rq, int local) -{ - unsigned long long now; - - if (rt_task(p)) - goto out; - - now = sched_clock(); -#ifdef CONFIG_SMP - if (!local) { - /* Compensate for drifting sched_clock */ - struct rq *this_rq = this_rq(); - now = (now - this_rq->most_recent_timestamp) - + rq->most_recent_timestamp; - } -#endif - - /* - * Sleep time is in units of nanosecs, so shift by 20 to get a - * milliseconds-range estimation of the amount of time that the task - * spent sleeping: - */ - if (unlikely(prof_on == SLEEP_PROFILING)) { - if (p->state == TASK_UNINTERRUPTIBLE) - profile_hits(SLEEP_PROFILING, (void *)get_wchan(p), - (now - p->timestamp) >> 20); - } - - p->prio = recalc_task_prio(p, now); - - /* - * This checks to make sure it's not an uninterruptible task - * that is now waking up. - */ - if (p->sleep_type == SLEEP_NORMAL) { - /* - * Tasks which were woken up by interrupts (ie. hw events) - * are most likely of interactive nature. 
So we give them - * the credit of extending their sleep time to the period - * of time they spend on the runqueue, waiting for execution - * on a CPU, first time around: - */ - if (in_interrupt()) - p->sleep_type = SLEEP_INTERRUPTED; - else { - /* - * Normal first-time wakeups get a credit too for - * on-runqueue time, but it will be weighted down: - */ - p->sleep_type = SLEEP_INTERACTIVE; - } - } - p->timestamp = now; -out: - __activate_task(p, rq); -} - -/* * deactivate_task - remove a task from the runqueue. */ -static void deactivate_task(struct task_struct *p, struct rq *rq) +static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep) { + dequeue_task(rq, p, sleep); dec_nr_running(p, rq); - dequeue_task(p, p->array); - p->array = NULL; -} - -/* - * resched_task - mark a task 'to be rescheduled now'. - * - * On UP this means the setting of the need_resched flag, on SMP it - * might also involve a cross-CPU call to trigger the scheduler on - * the target CPU. - */ -#ifdef CONFIG_SMP - -#ifndef tsk_is_polling -#define tsk_is_polling(t) test_tsk_thread_flag(t, TIF_POLLING_NRFLAG) -#endif - -static void resched_task(struct task_struct *p) -{ - int cpu; - - assert_spin_locked(&task_rq(p)->lock); - - if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED))) - return; - - set_tsk_thread_flag(p, TIF_NEED_RESCHED); - - cpu = task_cpu(p); - if (cpu == smp_processor_id()) - return; - - /* NEED_RESCHED must be visible before we test polling */ - smp_mb(); - if (!tsk_is_polling(p)) - smp_send_reschedule(cpu); -} -#else -static inline void resched_task(struct task_struct *p) -{ - assert_spin_locked(&task_rq(p)->lock); - set_tsk_need_resched(p); } -#endif /** * task_curr - is this task currently executing on a CPU? @@ -1085,7 +734,7 @@ migrate_task(struct task_struct *p, int * If the task is not on a runqueue (and not running), then * it is sufficient to simply update the task's cpu field. */ - if (!p->array && !task_running(rq, p)) { + if (!p->on_rq && !task_running(rq, p)) { set_task_cpu(p, dest_cpu); return 0; } @@ -1116,7 +765,7 @@ void wait_task_inactive(struct task_stru repeat: rq = task_rq_lock(p, &flags); /* Must be off runqueue entirely, not preempted. */ - if (unlikely(p->array || task_running(rq, p))) { + if (unlikely(p->on_rq || task_running(rq, p))) { /* If it's preempted, we yield. It could be a while. */ preempted = !task_running(rq, p); task_rq_unlock(rq, &flags); @@ -1292,9 +941,9 @@ static int sched_balance_self(int cpu, i struct sched_domain *tmp, *sd = NULL; for_each_domain(cpu, tmp) { - /* - * If power savings logic is enabled for a domain, stop there. - */ + /* + * If power savings logic is enabled for a domain, stop there. + */ if (tmp->flags & SD_POWERSAVINGS_BALANCE) break; if (tmp->flags & flag) @@ -1412,7 +1061,7 @@ static int try_to_wake_up(struct task_st if (!(old_state & state)) goto out; - if (p->array) + if (p->on_rq) goto out_running; cpu = task_cpu(p); @@ -1505,7 +1154,7 @@ out_set_cpu: old_state = p->state; if (!(old_state & state)) goto out; - if (p->array) + if (p->on_rq) goto out_running; this_cpu = smp_processor_id(); @@ -1514,25 +1163,10 @@ out_set_cpu: out_activate: #endif /* CONFIG_SMP */ - if (old_state == TASK_UNINTERRUPTIBLE) { + if (old_state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible--; - /* - * Tasks on involuntary sleep don't earn - * sleep_avg beyond just interactive state. 
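[ editor's note: throughout this file the old p->array pointer tests are
  being replaced by the new p->on_rq flag. The toy below is illustrative
  only; it mirrors the migrate_task() fast path above -- a task that is
  neither queued nor running can be "migrated" by simply rewriting its cpu
  field. All names are invented for the sketch. ]

  #include <stdbool.h>
  #include <stdio.h>

  struct toy_task {
  	int cpu;
  	bool on_rq;	/* queued on a runqueue? */
  	bool running;	/* currently on a CPU? */
  };

  static bool migrate_cheaply(struct toy_task *p, int dest_cpu)
  {
  	if (!p->on_rq && !p->running) {
  		p->cpu = dest_cpu;	/* no runqueue surgery needed */
  		return true;
  	}
  	return false;			/* must go through the migration thread */
  }

  int main(void)
  {
  	struct toy_task sleeper = { .cpu = 0, .on_rq = false, .running = false };
  	struct toy_task runner  = { .cpu = 0, .on_rq = true,  .running = true  };

  	printf("sleeper: %s\n", migrate_cheaply(&sleeper, 2) ? "cheap move" : "queued");
  	printf("runner:  %s\n", migrate_cheaply(&runner,  2) ? "cheap move" : "queued");
  	return 0;
  }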
- */ - p->sleep_type = SLEEP_NONINTERACTIVE; - } else - /* - * Tasks that have marked their sleep as noninteractive get - * woken up with their sleep average not weighted in an - * interactive way. - */ - if (old_state & TASK_NONINTERACTIVE) - p->sleep_type = SLEEP_NONINTERACTIVE; - - - activate_task(p, rq, cpu == this_cpu); + activate_task(rq, p, 1); /* * Sync wakeups (i.e. those types of wakeups where the waker * has indicated that it will leave the CPU in short order) @@ -1541,10 +1175,8 @@ out_activate: * the waker guarantees that the freshly woken up task is going * to be considered on this CPU.) */ - if (!sync || cpu != this_cpu) { - if (TASK_PREEMPTS_CURR(p, rq)) - resched_task(rq->curr); - } + if (!sync || cpu != this_cpu) + check_preempt_curr(rq, p); success = 1; out_running: @@ -1567,19 +1199,35 @@ int fastcall wake_up_state(struct task_s return try_to_wake_up(p, state, 0); } -static void task_running_tick(struct rq *rq, struct task_struct *p); +/* + * The task was running during this tick - call the class tick + * (to update the time slice counter and other statistics, etc.): + */ +static void task_running_tick(struct rq *rq, struct task_struct *p) +{ + spin_lock(&rq->lock); + p->sched_class->task_tick(rq, p); + spin_unlock(&rq->lock); +} + /* * Perform scheduler related setup for a newly forked process p. * p is forked by current. + * + * __sched_fork() is basic setup used by init_idle() too: */ -void fastcall sched_fork(struct task_struct *p, int clone_flags) +static void __sched_fork(struct task_struct *p) { - int cpu = get_cpu(); + p->wait_start_fair = p->wait_start = p->exec_start = p->last_ran = 0; + p->sum_exec_runtime = p->wait_runtime = 0; + p->sum_wait_runtime = 0; + p->sleep_start = p->block_start = 0; + p->sleep_max = p->block_max = p->exec_max = p->wait_max = 0; -#ifdef CONFIG_SMP - cpu = sched_balance_self(cpu, SD_BALANCE_FORK); -#endif - set_task_cpu(p, cpu); + INIT_LIST_HEAD(&p->run_list); + p->on_rq = 0; + p->nr_switches = 0; + p->min_wait_runtime = 0; /* * We mark the process as running here, but have not actually @@ -1588,16 +1236,29 @@ void fastcall sched_fork(struct task_str * event cannot wake it up and insert it on the runqueue either. */ p->state = TASK_RUNNING; +} + +/* + * fork()/clone()-time setup: + */ +void sched_fork(struct task_struct *p, int clone_flags) +{ + int cpu = get_cpu(); + + __sched_fork(p); + +#ifdef CONFIG_SMP + cpu = sched_balance_self(cpu, SD_BALANCE_FORK); +#endif + set_task_cpu(p, cpu); /* * Make sure we do not leak PI boosting priority to the child: */ p->prio = current->normal_prio; - INIT_LIST_HEAD(&p->run_list); - p->array = NULL; #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) - if (unlikely(sched_info_on())) + if (likely(sched_info_on())) memset(&p->sched_info, 0, sizeof(p->sched_info)); #endif #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW) @@ -1607,34 +1268,16 @@ void fastcall sched_fork(struct task_str /* Want to start with kernel preemption disabled. */ task_thread_info(p)->preempt_count = 1; #endif - /* - * Share the timeslice between parent and child, thus the - * total amount of pending timeslices in the system doesn't change, - * resulting in more scheduling fairness. - */ - local_irq_disable(); - p->time_slice = (current->time_slice + 1) >> 1; - /* - * The remainder of the first timeslice might be recovered by - * the parent if the child exits early enough. 
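[ editor's note: __sched_fork() above zeroes a new set of nanosecond-scale
  counters (sum_exec_runtime, wait_runtime, the per-task max statistics),
  and task_running_tick() now just dispatches to the class tick under the
  runqueue lock. The sketch below is illustrative only -- the real updates
  live in the scheduling-class code, which this hunk does not show; it
  demonstrates the "bank elapsed nanoseconds since exec_start" pattern with
  an invented toy_se struct and now_ns() helper. ]

  #include <stdio.h>
  #include <time.h>

  struct toy_se {
  	unsigned long long exec_start;		/* last accounting point */
  	unsigned long long sum_exec_runtime;	/* banked CPU time, in ns */
  };

  static unsigned long long now_ns(void)
  {
  	struct timespec ts;

  	clock_gettime(CLOCK_MONOTONIC, &ts);
  	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
  }

  static void account_tick(struct toy_se *se)
  {
  	unsigned long long now = now_ns();

  	se->sum_exec_runtime += now - se->exec_start;
  	se->exec_start = now;
  }

  int main(void)
  {
  	struct toy_se se = { now_ns(), 0 };
  	int i;

  	for (i = 0; i < 3; i++)
  		account_tick(&se);

  	printf("banked %llu ns\n", se.sum_exec_runtime);
  	return 0;
  }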
- */ - p->first_time_slice = 1; - current->time_slice >>= 1; - p->timestamp = sched_clock(); - if (unlikely(!current->time_slice)) { - /* - * This case is rare, it happens when the parent has only - * a single jiffy left from its timeslice. Taking the - * runqueue lock is not a problem. - */ - current->time_slice = 1; - task_running_tick(cpu_rq(cpu), current); - } - local_irq_enable(); put_cpu(); } /* + * After fork, child runs first. (default) If set to 0 then + * parent will (try to) run first. + */ +unsigned int __read_mostly sysctl_sched_child_runs_first = 1; + +/* * wake_up_new_task - wake up a newly created task for the first time. * * This function will do some initial scheduler statistics housekeeping @@ -1643,107 +1286,27 @@ void fastcall sched_fork(struct task_str */ void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags) { - struct rq *rq, *this_rq; unsigned long flags; - int this_cpu, cpu; + struct rq *rq; + int this_cpu; rq = task_rq_lock(p, &flags); BUG_ON(p->state != TASK_RUNNING); - this_cpu = smp_processor_id(); - cpu = task_cpu(p); - - /* - * We decrease the sleep average of forking parents - * and children as well, to keep max-interactive tasks - * from forking tasks that are max-interactive. The parent - * (current) is done further down, under its lock. - */ - p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) * - CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); + this_cpu = smp_processor_id(); /* parent's CPU */ p->prio = effective_prio(p); - if (likely(cpu == this_cpu)) { - if (!(clone_flags & CLONE_VM)) { - /* - * The VM isn't cloned, so we're in a good position to - * do child-runs-first in anticipation of an exec. This - * usually avoids a lot of COW overhead. - */ - if (unlikely(!current->array)) - __activate_task(p, rq); - else { - p->prio = current->prio; - p->normal_prio = current->normal_prio; - list_add_tail(&p->run_list, ¤t->run_list); - p->array = current->array; - p->array->nr_active++; - inc_nr_running(p, rq); - } - set_need_resched(); - } else - /* Run child last */ - __activate_task(p, rq); - /* - * We skip the following code due to cpu == this_cpu - * - * task_rq_unlock(rq, &flags); - * this_rq = task_rq_lock(current, &flags); - */ - this_rq = rq; + if (!sysctl_sched_child_runs_first || (clone_flags & CLONE_VM) || + task_cpu(p) != this_cpu || !current->on_rq) { + activate_task(rq, p, 0); } else { - this_rq = cpu_rq(this_cpu); - - /* - * Not the local CPU - must adjust timestamp. This should - * get optimised away in the !CONFIG_SMP case. - */ - p->timestamp = (p->timestamp - this_rq->most_recent_timestamp) - + rq->most_recent_timestamp; - __activate_task(p, rq); - if (TASK_PREEMPTS_CURR(p, rq)) - resched_task(rq->curr); - /* - * Parent and child are on different CPUs, now get the - * parent runqueue to update the parent's ->sleep_avg: + * Let the scheduling class do new task startup + * management (if any): */ - task_rq_unlock(rq, &flags); - this_rq = task_rq_lock(current, &flags); + p->sched_class->task_new(rq, p); } - current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) * - PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); - task_rq_unlock(this_rq, &flags); -} - -/* - * Potentially available exiting-child timeslices are - * retrieved here - this way the parent does not get - * penalized for creating too many threads. - * - * (this cannot be used to 'generate' timeslices - * artificially, because any timeslice recovered here - * was given away by the parent in the first place.) 
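[ editor's note: wake_up_new_task() now keys off the new
  sysctl_sched_child_runs_first knob. The predicate below is illustrative
  only and simply restates the condition from the hunk above with invented
  names: a CLONE_VM clone, a cross-CPU fork or a dequeued parent takes the
  plain activate path, and only a same-CPU fork without CLONE_VM (the
  fork+exec case, where running the child first avoids COW overhead) is
  handed to the class's task_new() hook. ]

  #include <stdbool.h>
  #include <stdio.h>

  static bool plain_activate(bool child_runs_first, bool clone_vm,
  			   bool same_cpu, bool parent_on_rq)
  {
  	return !child_runs_first || clone_vm || !same_cpu || !parent_on_rq;
  }

  int main(void)
  {
  	printf("fork+exec (no CLONE_VM): %s\n",
  	       plain_activate(true, false, true, true) ?
  	       "activate_task" : "task_new, child runs first");
  	printf("thread (CLONE_VM):       %s\n",
  	       plain_activate(true, true, true, true) ?
  	       "activate_task" : "task_new, child runs first");
  	return 0;
  }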
- */ -void fastcall sched_exit(struct task_struct *p) -{ - unsigned long flags; - struct rq *rq; - - /* - * If the child was a (relative-) CPU hog then decrease - * the sleep_avg of the parent as well. - */ - rq = task_rq_lock(p->parent, &flags); - if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) { - p->parent->time_slice += p->time_slice; - if (unlikely(p->parent->time_slice > task_timeslice(p))) - p->parent->time_slice = task_timeslice(p); - } - if (p->sleep_avg < p->parent->sleep_avg) - p->parent->sleep_avg = p->parent->sleep_avg / - (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg / - (EXIT_WEIGHT + 1); + check_preempt_curr(rq, p); task_rq_unlock(rq, &flags); } @@ -1941,17 +1504,56 @@ unsigned long nr_active(void) return running + uninterruptible; } -#ifdef CONFIG_SMP - -/* - * Is this task likely cache-hot: - */ -static inline int -task_hot(struct task_struct *p, unsigned long long now, struct sched_domain *sd) +static void update_load_fair(struct rq *this_rq) { - return (long long)(now - p->last_ran) < (long long)sd->cache_hot_time; + unsigned long this_load, fair_delta, exec_delta, idle_delta; + unsigned int i, scale; + s64 fair_delta64, exec_delta64; + unsigned long tmp; + u64 tmp64; + + this_rq->nr_load_updates++; + + fair_delta64 = this_rq->fair_clock - this_rq->prev_fair_clock + 1; + this_rq->prev_fair_clock = this_rq->fair_clock; + WARN_ON_ONCE(fair_delta64 <= 0); + + exec_delta64 = this_rq->exec_clock - this_rq->prev_exec_clock + 1; + this_rq->prev_exec_clock = this_rq->exec_clock; + WARN_ON_ONCE(exec_delta64 <= 0); + + if (fair_delta64 > (s64)LONG_MAX) + fair_delta64 = (s64)LONG_MAX; + fair_delta = (unsigned long)fair_delta64; + + if (exec_delta64 > (s64)LONG_MAX) + exec_delta64 = (s64)LONG_MAX; + exec_delta = (unsigned long)exec_delta64; + if (exec_delta > TICK_NSEC) + exec_delta = TICK_NSEC; + + idle_delta = TICK_NSEC - exec_delta; + + tmp = (SCHED_LOAD_SCALE * exec_delta) / fair_delta; + tmp64 = (u64)tmp * (u64)exec_delta; + do_div(tmp64, TICK_NSEC); + this_load = (unsigned long)tmp64; + + /* Update our load: */ + for (i = 0, scale = 1; i < CPU_LOAD_IDX_MAX; i++, scale += scale) { + unsigned long old_load, new_load; + + /* scale is effectively 1 << i now, and >> i divides by scale */ + + old_load = this_rq->cpu_load[i]; + new_load = this_load; + + this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) >> i; + } } +#ifdef CONFIG_SMP + /* * double_rq_lock - safely lock two runqueues * @@ -2068,23 +1670,17 @@ void sched_exec(void) * pull_task - move a task from a remote runqueue to the local runqueue. * Both runqueues must be locked. */ -static void pull_task(struct rq *src_rq, struct prio_array *src_array, - struct task_struct *p, struct rq *this_rq, - struct prio_array *this_array, int this_cpu) +static void pull_task(struct rq *src_rq, struct task_struct *p, + struct rq *this_rq, int this_cpu) { - dequeue_task(p, src_array); - dec_nr_running(p, src_rq); + deactivate_task(src_rq, p, 0); set_task_cpu(p, this_cpu); - inc_nr_running(p, this_rq); - enqueue_task(p, this_array); - p->timestamp = (p->timestamp - src_rq->most_recent_timestamp) - + this_rq->most_recent_timestamp; + activate_task(this_rq, p, 0); /* * Note that idle threads have a prio of MAX_PRIO, for this test * to be always true for them. 
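[ editor's note: the cpu_load[] update in update_load_fair() above is an
  exponentially-weighted average with per-index decay: index i blends the
  old value and the new sample as (old*(2^i - 1) + new) >> i, so higher
  indices react more slowly. Illustrative only, not part of the patch; the
  toy below feeds a constant synthetic load of 1024 into an invented
  5-entry array and prints how quickly each index converges. ]

  #include <stdio.h>

  #define TOY_LOAD_IDX 5	/* stands in for CPU_LOAD_IDX_MAX */

  int main(void)
  {
  	unsigned long cpu_load[TOY_LOAD_IDX] = { 0 };
  	unsigned long this_load = 1024;
  	int tick, i;

  	for (tick = 0; tick < 4; tick++) {
  		unsigned long scale = 1;

  		for (i = 0; i < TOY_LOAD_IDX; i++, scale += scale)
  			cpu_load[i] = (cpu_load[i] * (scale - 1) + this_load) >> i;

  		printf("tick %d:", tick);
  		for (i = 0; i < TOY_LOAD_IDX; i++)
  			printf(" %4lu", cpu_load[i]);
  		printf("\n");
  	}
  	return 0;
  }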
 	 */
-	if (TASK_PREEMPTS_CURR(p, this_rq))
-		resched_task(this_rq->curr);
+	check_preempt_curr(this_rq, p);
 }

 /*
@@ -2109,25 +1705,59 @@ int can_migrate_task(struct task_struct
 		return 0;

 	/*
-	 * Aggressive migration if:
-	 * 1) task is cache cold, or
-	 * 2) too many balance attempts have failed.
+	 * Aggressive migration if too many balance attempts have failed:
 	 */
-
-	if (sd->nr_balance_failed > sd->cache_nice_tries) {
-#ifdef CONFIG_SCHEDSTATS
-		if (task_hot(p, rq->most_recent_timestamp, sd))
-			schedstat_inc(sd, lb_hot_gained[idle]);
-#endif
+	if (sd->nr_balance_failed > sd->cache_nice_tries)
 		return 1;
-	}

-	if (task_hot(p, rq->most_recent_timestamp, sd))
-		return 0;
 	return 1;
 }

-#define rq_best_prio(rq)	min((rq)->curr->prio, (rq)->best_expired_prio)
+/*
+ * Load-balancing iterator: iterate through the hierarchy of scheduling
+ * classes, starting with the highest-prio one:
+ */
+
+struct task_struct * load_balance_start(struct rq *rq)
+{
+	struct sched_class *class = sched_class_highest;
+	struct task_struct *p;
+
+	do {
+		p = class->load_balance_start(rq);
+		if (p) {
+			rq->load_balance_class = class;
+			return p;
+		}
+		class = class->next;
+	} while (class);
+
+	return NULL;
+}
+
+struct task_struct * load_balance_next(struct rq *rq)
+{
+	struct sched_class *class = rq->load_balance_class;
+	struct task_struct *p;
+
+	p = class->load_balance_next(rq);
+	if (p)
+		return p;
+	/*
+	 * Pick up the next class (if any) and attempt to start
+	 * the iterator there:
+	 */
+	while ((class = class->next)) {
+		p = class->load_balance_start(rq);
+		if (p) {
+			rq->load_balance_class = class;
+			return p;
+		}
+	}
+	return NULL;
+}
+
+#define rq_best_prio(rq)	(rq)->curr->prio

 /*
  * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
@@ -2141,11 +1771,9 @@ static int move_tasks(struct rq *this_rq
 		      struct sched_domain *sd, enum idle_type idle,
 		      int *all_pinned)
 {
-	int idx, pulled = 0, pinned = 0, this_best_prio, best_prio,
+	int pulled = 0, pinned = 0, this_best_prio, best_prio,
 	    best_prio_seen, skip_for_load;
-	struct prio_array *array, *dst_array;
-	struct list_head *head, *curr;
-	struct task_struct *tmp;
+	struct task_struct *p;
 	long rem_load_move;

 	if (max_nr_move == 0 || max_load_move == 0)
@@ -2165,76 +1793,41 @@ static int move_tasks(struct rq *this_rq
 	best_prio_seen = best_prio == busiest->curr->prio;

 	/*
-	 * We first consider expired tasks. Those will likely not be
-	 * executed in the near future, and they are most likely to
-	 * be cache-cold, thus switching CPUs has the least effect
-	 * on them.
-	 */
-	if (busiest->expired->nr_active) {
-		array = busiest->expired;
-		dst_array = this_rq->expired;
-	} else {
-		array = busiest->active;
-		dst_array = this_rq->active;
-	}
-
-new_array:
-	/* Start searching at priority 0: */
-	idx = 0;
-skip_bitmap:
-	if (!idx)
-		idx = sched_find_first_bit(array->bitmap);
-	else
-		idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
-	if (idx >= MAX_PRIO) {
-		if (array == busiest->expired && busiest->active->nr_active) {
-			array = busiest->active;
-			dst_array = this_rq->active;
-			goto new_array;
-		}
+	 * Start the load-balancing iterator:
+	 */
+	p = load_balance_start(busiest);
+next:
+	if (!p)
 		goto out;
-	}
-
-	head = array->queue + idx;
-	curr = head->prev;
-skip_queue:
-	tmp = list_entry(curr, struct task_struct, run_list);
-
-	curr = curr->prev;
-
 	/*
 	 * To help distribute high priority tasks accross CPUs we don't
 	 * skip a task if it will be the highest priority task (i.e.
smallest * prio value) on its new queue regardless of its load weight */ - skip_for_load = tmp->load_weight > rem_load_move; - if (skip_for_load && idx < this_best_prio) - skip_for_load = !best_prio_seen && idx == best_prio; + skip_for_load = p->load_weight > rem_load_move; + if (skip_for_load && p->prio < this_best_prio) + skip_for_load = !best_prio_seen && p->prio == best_prio; if (skip_for_load || - !can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) { + !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) { - best_prio_seen |= idx == best_prio; - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; + best_prio_seen |= p->prio == best_prio; + p = load_balance_next(busiest); + goto next; } - pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu); + pull_task(busiest, p, this_rq, this_cpu); pulled++; - rem_load_move -= tmp->load_weight; + rem_load_move -= p->load_weight; /* * We only want to steal up to the prescribed number of tasks * and the prescribed amount of weighted load. */ if (pulled < max_nr_move && rem_load_move > 0) { - if (idx < this_best_prio) - this_best_prio = idx; - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; + if (p->prio < this_best_prio) + this_best_prio = p->prio; + p = load_balance_next(busiest); + goto next; } out: /* @@ -2360,8 +1953,8 @@ find_busiest_group(struct sched_domain * * Busy processors will not participate in power savings * balance. */ - if (idle == NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE)) - goto group_next; + if (idle == NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE)) + goto group_next; /* * If the local group is idle or completely loaded @@ -2371,42 +1964,42 @@ find_busiest_group(struct sched_domain * !this_nr_running)) power_savings_balance = 0; - /* + /* * If a group is already running at full capacity or idle, * don't include that group in power savings calculations - */ - if (!power_savings_balance || sum_nr_running >= group_capacity + */ + if (!power_savings_balance || sum_nr_running >= group_capacity || !sum_nr_running) - goto group_next; + goto group_next; - /* + /* * Calculate the group which has the least non-idle load. 
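[ editor's note: load_balance_start()/load_balance_next() above give
  move_tasks() a single flat stream of candidate tasks that transparently
  crosses class boundaries, which is what lets the weighted-budget loop
  stay so simple. Illustrative only, not part of the patch: the sketch
  below rebuilds the same two-level iteration over NULL-terminated arrays,
  with toy_class/toy_iter invented for the example -- exhaust one class,
  then fall through to the next. ]

  #include <stdio.h>
  #include <stddef.h>

  struct toy_class {
  	const struct toy_class *next;
  	const char **tasks;		/* NULL-terminated list */
  };

  struct toy_iter {
  	const struct toy_class *class;
  	int pos;
  };

  static const char *iter_start(struct toy_iter *it, const struct toy_class *c)
  {
  	for (; c; c = c->next) {
  		if (c->tasks[0]) {
  			it->class = c;
  			it->pos = 0;
  			return c->tasks[0];
  		}
  	}
  	it->class = NULL;
  	return NULL;
  }

  static const char *iter_next(struct toy_iter *it)
  {
  	const char *p = it->class->tasks[++it->pos];

  	if (p)
  		return p;
  	/* current class exhausted: try the remaining ones */
  	return iter_start(it, it->class->next);
  }

  static const char *rt_tasks[]   = { "irq/9-net", NULL };
  static const char *fair_tasks[] = { "bash", "gcc", NULL };

  static const struct toy_class fair = { NULL,  fair_tasks };
  static const struct toy_class rt   = { &fair, rt_tasks };

  int main(void)
  {
  	struct toy_iter it;
  	const char *p;

  	for (p = iter_start(&it, &rt); p; p = iter_next(&it))
  		printf("candidate: %s\n", p);
  	return 0;
  }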
- * This is the group from where we need to pick up the load - * for saving power - */ - if ((sum_nr_running < min_nr_running) || - (sum_nr_running == min_nr_running && + * This is the group from where we need to pick up the load + * for saving power + */ + if ((sum_nr_running < min_nr_running) || + (sum_nr_running == min_nr_running && first_cpu(group->cpumask) < first_cpu(group_min->cpumask))) { - group_min = group; - min_nr_running = sum_nr_running; + group_min = group; + min_nr_running = sum_nr_running; min_load_per_task = sum_weighted_load / sum_nr_running; - } + } - /* + /* * Calculate the group which is almost near its - * capacity but still has some space to pick up some load - * from other group and save more power - */ - if (sum_nr_running <= group_capacity - 1) { - if (sum_nr_running > leader_nr_running || - (sum_nr_running == leader_nr_running && - first_cpu(group->cpumask) > - first_cpu(group_leader->cpumask))) { - group_leader = group; - leader_nr_running = sum_nr_running; - } + * capacity but still has some space to pick up some load + * from other group and save more power + */ + if (sum_nr_running <= group_capacity - 1) { + if (sum_nr_running > leader_nr_running || + (sum_nr_running == leader_nr_running && + first_cpu(group->cpumask) > + first_cpu(group_leader->cpumask))) { + group_leader = group; + leader_nr_running = sum_nr_running; + } } group_next: #endif @@ -2461,7 +2054,7 @@ group_next: * a think about bumping its value to force at least one task to be * moved */ - if (*imbalance < busiest_load_per_task) { + if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task) { unsigned long tmp, pwr_now, pwr_move; unsigned int imbn; @@ -2475,7 +2068,8 @@ small_imbalance: } else this_load_per_task = SCHED_LOAD_SCALE; - if (max_load - this_load >= busiest_load_per_task * imbn) { + if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >= + busiest_load_per_task * imbn) { *imbalance = busiest_load_per_task; return busiest; } @@ -2884,30 +2478,6 @@ static void active_load_balance(struct r spin_unlock(&target_rq->lock); } -static void update_load(struct rq *this_rq) -{ - unsigned long this_load; - int i, scale; - - this_load = this_rq->raw_weighted_load; - - /* Update our load: */ - for (i = 0, scale = 1; i < 3; i++, scale <<= 1) { - unsigned long old_load, new_load; - - old_load = this_rq->cpu_load[i]; - new_load = this_load; - /* - * Round up the averaging division if load is increasing. This - * prevents us from getting stuck on 9 if the load is 10, for - * example. - */ - if (new_load > old_load) - new_load += scale-1; - this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) / scale; - } -} - /* * run_rebalance_domains is triggered when needed from the scheduler tick. * @@ -2987,76 +2557,27 @@ static inline void idle_balance(int cpu, } #endif -static inline void wake_priority_sleeper(struct rq *rq) -{ -#ifdef CONFIG_SCHED_SMT - if (!rq->nr_running) - return; - - spin_lock(&rq->lock); - /* - * If an SMT sibling task has been put to sleep for priority - * reasons reschedule the idle task to see if it can now run. - */ - if (rq->nr_running) - resched_task(rq->idle); - spin_unlock(&rq->lock); -#endif -} - DEFINE_PER_CPU(struct kernel_stat, kstat); EXPORT_PER_CPU_SYMBOL(kstat); /* - * This is called on clock ticks and on context switches. - * Bank in p->sched_time the ns elapsed since the last tick or switch. 
- */ -static inline void -update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now) -{ - p->sched_time += now - p->last_ran; - p->last_ran = rq->most_recent_timestamp = now; -} - -/* - * Return current->sched_time plus any more ns on the sched_clock + * Return current->sum_exec_runtime plus any more ns on the sched_clock * that have not yet been banked. */ -unsigned long long current_sched_time(const struct task_struct *p) +unsigned long long current_sched_runtime(const struct task_struct *p) { unsigned long long ns; unsigned long flags; local_irq_save(flags); - ns = p->sched_time + sched_clock() - p->last_ran; + ns = p->sum_exec_runtime + sched_clock() - p->last_ran; local_irq_restore(flags); return ns; } /* - * We place interactive tasks back into the active array, if possible. - * - * To guarantee that this does not starve expired tasks we ignore the - * interactivity of a task if the first expired task had to wait more - * than a 'reasonable' amount of time. This deadline timeout is - * load-dependent, as the frequency of array switched decreases with - * increasing number of running tasks. We also ignore the interactivity - * if a better static_prio task has expired: - */ -static inline int expired_starving(struct rq *rq) -{ - if (rq->curr->static_prio > rq->best_expired_prio) - return 1; - if (!STARVATION_LIMIT || !rq->expired_timestamp) - return 0; - if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * rq->nr_running) - return 1; - return 0; -} - -/* * Account user cpu time to a process. * @p: the process that the cpu time gets accounted to * @hardirq_offset: the offset to subtract from hardirq_count() @@ -3129,81 +2650,6 @@ void account_steal_time(struct task_stru cpustat->steal = cputime64_add(cpustat->steal, tmp); } -static void task_running_tick(struct rq *rq, struct task_struct *p) -{ - if (p->array != rq->active) { - /* Task has expired but was not scheduled yet */ - set_tsk_need_resched(p); - return; - } - spin_lock(&rq->lock); - /* - * The task was running during this tick - update the - * time slice counter. Note: we do not update a thread's - * priority until it either goes to sleep or uses up its - * timeslice. This makes it possible for interactive tasks - * to use up their timeslices at their highest priority levels. - */ - if (rt_task(p)) { - /* - * RR tasks need a special form of timeslice management. - * FIFO tasks have no timeslices. - */ - if ((p->policy == SCHED_RR) && !--p->time_slice) { - p->time_slice = task_timeslice(p); - p->first_time_slice = 0; - set_tsk_need_resched(p); - - /* put it at the end of the queue: */ - requeue_task(p, rq->active); - } - goto out_unlock; - } - if (!--p->time_slice) { - dequeue_task(p, rq->active); - set_tsk_need_resched(p); - p->prio = effective_prio(p); - p->time_slice = task_timeslice(p); - p->first_time_slice = 0; - - if (!rq->expired_timestamp) - rq->expired_timestamp = jiffies; - if (!TASK_INTERACTIVE(p) || expired_starving(rq)) { - enqueue_task(p, rq->expired); - if (p->static_prio < rq->best_expired_prio) - rq->best_expired_prio = p->static_prio; - } else - enqueue_task(p, rq->active); - } else { - /* - * Prevent a too long timeslice allowing a task to monopolize - * the CPU. We do this by splitting up the timeslice into - * smaller pieces. - * - * Note: this does not mean the task's timeslices expire or - * get lost in any way, they just might be preempted by - * another task of equal priority. (one with higher - * priority would have preempted this task already.) 
We - * requeue this task to the end of the list on this priority - * level, which is in essence a round-robin of tasks with - * equal priority. - * - * This only applies to tasks in the interactive - * delta range with at least TIMESLICE_GRANULARITY to requeue. - */ - if (TASK_INTERACTIVE(p) && !((task_timeslice(p) - - p->time_slice) % TIMESLICE_GRANULARITY(p)) && - (p->time_slice >= TIMESLICE_GRANULARITY(p)) && - (p->array == rq->active)) { - - requeue_task(p, rq->active); - set_tsk_need_resched(p); - } - } -out_unlock: - spin_unlock(&rq->lock); -} - /* * This function gets called by the timer code, with HZ frequency. * We call it with interrupts disabled. @@ -3213,155 +2659,19 @@ out_unlock: */ void scheduler_tick(void) { - unsigned long long now = sched_clock(); struct task_struct *p = current; int cpu = smp_processor_id(); struct rq *rq = cpu_rq(cpu); - update_cpu_clock(p, rq, now); - - if (p == rq->idle) - /* Task on the idle queue */ - wake_priority_sleeper(rq); - else + if (p != rq->idle) task_running_tick(rq, p); + update_load_fair(rq); #ifdef CONFIG_SMP - update_load(rq); if (time_after_eq(jiffies, rq->next_balance)) raise_softirq(SCHED_SOFTIRQ); #endif } -#ifdef CONFIG_SCHED_SMT -static inline void wakeup_busy_runqueue(struct rq *rq) -{ - /* If an SMT runqueue is sleeping due to priority reasons wake it up */ - if (rq->curr == rq->idle && rq->nr_running) - resched_task(rq->idle); -} - -/* - * Called with interrupt disabled and this_rq's runqueue locked. - */ -static void wake_sleeping_dependent(int this_cpu) -{ - struct sched_domain *tmp, *sd = NULL; - int i; - - for_each_domain(this_cpu, tmp) { - if (tmp->flags & SD_SHARE_CPUPOWER) { - sd = tmp; - break; - } - } - - if (!sd) - return; - - for_each_cpu_mask(i, sd->span) { - struct rq *smt_rq = cpu_rq(i); - - if (i == this_cpu) - continue; - if (unlikely(!spin_trylock(&smt_rq->lock))) - continue; - - wakeup_busy_runqueue(smt_rq); - spin_unlock(&smt_rq->lock); - } -} - -/* - * number of 'lost' timeslices this task wont be able to fully - * utilize, if another task runs on a sibling. This models the - * slowdown effect of other tasks running on siblings: - */ -static inline unsigned long -smt_slice(struct task_struct *p, struct sched_domain *sd) -{ - return p->time_slice * (100 - sd->per_cpu_gain) / 100; -} - -/* - * To minimise lock contention and not have to drop this_rq's runlock we only - * trylock the sibling runqueues and bypass those runqueues if we fail to - * acquire their lock. As we only trylock the normal locking order does not - * need to be obeyed. - */ -static int -dependent_sleeper(int this_cpu, struct rq *this_rq, struct task_struct *p) -{ - struct sched_domain *tmp, *sd = NULL; - int ret = 0, i; - - /* kernel/rt threads do not participate in dependent sleeping */ - if (!p->mm || rt_task(p)) - return 0; - - for_each_domain(this_cpu, tmp) { - if (tmp->flags & SD_SHARE_CPUPOWER) { - sd = tmp; - break; - } - } - - if (!sd) - return 0; - - for_each_cpu_mask(i, sd->span) { - struct task_struct *smt_curr; - struct rq *smt_rq; - - if (i == this_cpu) - continue; - - smt_rq = cpu_rq(i); - if (unlikely(!spin_trylock(&smt_rq->lock))) - continue; - - smt_curr = smt_rq->curr; - - if (!smt_curr->mm) - goto unlock; - - /* - * If a user task with lower static priority than the - * running task on the SMT sibling is trying to schedule, - * delay it till there is proportionately less timeslice - * left of the sibling task to prevent a lower priority - * task from using an unfair proportion of the - * physical cpu's resources. 
-ck - */ - if (rt_task(smt_curr)) { - /* - * With real time tasks we run non-rt tasks only - * per_cpu_gain% of the time. - */ - if ((jiffies % DEF_TIMESLICE) > - (sd->per_cpu_gain * DEF_TIMESLICE / 100)) - ret = 1; - } else { - if (smt_curr->static_prio < p->static_prio && - !TASK_PREEMPTS_CURR(p, smt_rq) && - smt_slice(smt_curr, sd) > task_timeslice(p)) - ret = 1; - } -unlock: - spin_unlock(&smt_rq->lock); - } - return ret; -} -#else -static inline void wake_sleeping_dependent(int this_cpu) -{ -} -static inline int -dependent_sleeper(int this_cpu, struct rq *this_rq, struct task_struct *p) -{ - return 0; -} -#endif - #if defined(CONFIG_PREEMPT) && defined(CONFIG_DEBUG_PREEMPT) void fastcall add_preempt_count(int val) @@ -3400,49 +2710,27 @@ EXPORT_SYMBOL(sub_preempt_count); #endif -static inline int interactive_sleep(enum sleep_type sleep_type) -{ - return (sleep_type == SLEEP_INTERACTIVE || - sleep_type == SLEEP_INTERRUPTED); -} - /* - * schedule() is the main scheduler function. + * Various schedule()-time debugging checks and statistics: */ -asmlinkage void __sched schedule(void) +static inline void schedule_debug(struct rq *rq, struct task_struct *prev) { - struct task_struct *prev, *next; - struct prio_array *array; - struct list_head *queue; - unsigned long long now; - unsigned long run_time; - int cpu, idx, new_prio; - long *switch_count; - struct rq *rq; - /* * Test if we are atomic. Since do_exit() needs to call into * schedule() atomically, we ignore that path for now. * Otherwise, whine if we are scheduling when we should not be. */ - if (unlikely(in_atomic() && !current->exit_state)) { + if (unlikely(in_atomic_preempt_off() && !prev->exit_state)) { printk(KERN_ERR "BUG: scheduling while atomic: " "%s/0x%08x/%d\n", - current->comm, preempt_count(), current->pid); - debug_show_held_locks(current); + prev->comm, preempt_count(), prev->pid); + debug_show_held_locks(prev); if (irqs_disabled()) - print_irqtrace_events(current); + print_irqtrace_events(prev); dump_stack(); } profile_hit(SCHED_PROFILING, __builtin_return_address(0)); -need_resched: - preempt_disable(); - prev = current; - release_kernel_lock(prev); -need_resched_nonpreemptible: - rq = this_rq(); - /* * The idle thread is not allowed to schedule! * Remove this check after it has been exercised a bit. @@ -3453,19 +2741,45 @@ need_resched_nonpreemptible: } schedstat_inc(rq, sched_cnt); - now = sched_clock(); - if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) { - run_time = now - prev->timestamp; - if (unlikely((long long)(now - prev->timestamp) < 0)) - run_time = 0; - } else - run_time = NS_MAX_SLEEP_AVG; +} - /* - * Tasks charged proportionately less run_time at high sleep_avg to - * delay them losing their interactive status - */ - run_time /= (CURRENT_BONUS(prev) ? : 1); +static inline struct task_struct * +pick_next_task(struct rq *rq, struct task_struct *prev) +{ + struct sched_class *class = sched_class_highest; + u64 now = __rq_clock(rq); + struct task_struct *p; + + prev->sched_class->put_prev_task(rq, prev, now); + + do { + p = class->pick_next_task(rq, now); + if (p) + return p; + class = class->next; + } while (class); + + return NULL; +} + +/* + * schedule() is the main scheduler function. 
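[ editor's note: pick_next_task() above is the heart of the modular design:
  put the previous task back via its own class, then ask each class in
  priority order for a runnable task and take the first hit. Illustrative
  only, not part of the patch -- the toy classes and pick() callbacks below
  are invented, but the fall-through loop has the same shape as the hunk
  above. ]

  #include <stdio.h>
  #include <stddef.h>

  struct toy_class {
  	const struct toy_class *next;
  	const char *(*pick)(void);
  };

  static const char *pick_rt(void)   { return NULL; }	/* no RT task runnable */
  static const char *pick_fair(void) { return "bash"; }

  static const struct toy_class fair_class = { NULL, pick_fair };
  static const struct toy_class rt_class   = { &fair_class, pick_rt };

  int main(void)
  {
  	const struct toy_class *class = &rt_class;	/* sched_class_highest */
  	const char *next = NULL;

  	do {
  		next = class->pick();
  		if (next)
  			break;
  		class = class->next;
  	} while (class);

  	printf("next task: %s\n", next ? next : "idle");
  	return 0;
  }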
+ */ +asmlinkage void __sched schedule(void) +{ + struct task_struct *prev, *next; + long *switch_count; + struct rq *rq; + int cpu; + +need_resched: + preempt_disable(); + prev = current; + release_kernel_lock(prev); +need_resched_nonpreemptible: + rq = this_rq(); + + schedule_debug(rq, prev); spin_lock_irq(&rq->lock); @@ -3478,7 +2792,7 @@ need_resched_nonpreemptible: else { if (prev->state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; - deactivate_task(prev, rq); + deactivate_task(rq, prev, 1); } } @@ -3486,68 +2800,25 @@ need_resched_nonpreemptible: if (unlikely(!rq->nr_running)) { idle_balance(cpu, rq); if (!rq->nr_running) { + prev->sched_class->put_prev_task(rq, prev, + __rq_clock(rq)); next = rq->idle; - rq->expired_timestamp = 0; - wake_sleeping_dependent(cpu); + schedstat_inc(rq, sched_goidle); goto switch_tasks; } } - array = rq->active; - if (unlikely(!array->nr_active)) { - /* - * Switch the active and expired arrays. - */ - schedstat_inc(rq, sched_switch); - rq->active = rq->expired; - rq->expired = array; - array = rq->active; - rq->expired_timestamp = 0; - rq->best_expired_prio = MAX_PRIO; - } - - idx = sched_find_first_bit(array->bitmap); - queue = array->queue + idx; - next = list_entry(queue->next, struct task_struct, run_list); - - if (!rt_task(next) && interactive_sleep(next->sleep_type)) { - unsigned long long delta = now - next->timestamp; - if (unlikely((long long)(now - next->timestamp) < 0)) - delta = 0; - - if (next->sleep_type == SLEEP_INTERACTIVE) - delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; - - array = next->array; - new_prio = recalc_task_prio(next, next->timestamp + delta); - - if (unlikely(next->prio != new_prio)) { - dequeue_task(next, array); - next->prio = new_prio; - enqueue_task(next, array); - } - } - next->sleep_type = SLEEP_NORMAL; - if (rq->nr_running == 1 && dependent_sleeper(cpu, rq, next)) - next = rq->idle; + next = pick_next_task(rq, prev); + next->nr_switches++; + switch_tasks: - if (next == rq->idle) - schedstat_inc(rq, sched_goidle); prefetch(next); prefetch_stack(next); clear_tsk_need_resched(prev); rcu_qsctr_inc(task_cpu(prev)); - update_cpu_clock(prev, rq, now); - - prev->sleep_avg -= run_time; - if ((long)prev->sleep_avg <= 0) - prev->sleep_avg = 0; - prev->timestamp = prev->last_ran = now; - sched_info_switch(prev, next); if (likely(prev != next)) { - next->timestamp = next->last_ran = now; rq->nr_switches++; rq->curr = next; ++*switch_count; @@ -3978,29 +3249,28 @@ EXPORT_SYMBOL(sleep_on_timeout); */ void rt_mutex_setprio(struct task_struct *p, int prio) { - struct prio_array *array; unsigned long flags; + int oldprio, on_rq; struct rq *rq; - int oldprio; BUG_ON(prio < 0 || prio > MAX_PRIO); rq = task_rq_lock(p, &flags); oldprio = p->prio; - array = p->array; - if (array) - dequeue_task(p, array); + on_rq = p->on_rq; + if (on_rq) + dequeue_task(rq, p, 0); + + if (rt_prio(prio)) + p->sched_class = &rt_sched_class; + else + p->sched_class = &fair_sched_class; + p->prio = prio; - if (array) { - /* - * If changing to an RT priority then queue it - * in the active array! 
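[ editor's note: rt_mutex_setprio() above shows the discipline that every
  priority or class change now follows: if the task is queued, dequeue it,
  modify prio and sched_class, then re-enqueue and test for preemption.
  Changing the sort key of a queued element in place would corrupt the
  queue. Illustrative only, not part of the patch; the sketch replays the
  dequeue/change/re-enqueue dance on an invented sorted linked list. ]

  #include <stdio.h>
  #include <stddef.h>

  struct node { int prio; struct node *next; };

  static struct node *list_del(struct node *head, struct node *n)
  {
  	struct node **pp = &head;

  	while (*pp && *pp != n)
  		pp = &(*pp)->next;
  	if (*pp)
  		*pp = n->next;
  	return head;
  }

  static struct node *list_add_sorted(struct node *head, struct node *n)
  {
  	struct node **pp = &head;

  	while (*pp && (*pp)->prio <= n->prio)
  		pp = &(*pp)->next;
  	n->next = *pp;
  	*pp = n;
  	return head;
  }

  int main(void)
  {
  	struct node a = { 10, NULL }, b = { 20, NULL }, c = { 30, NULL };
  	struct node *q = NULL, *p;

  	q = list_add_sorted(q, &a);
  	q = list_add_sorted(q, &b);
  	q = list_add_sorted(q, &c);

  	/* boost b from prio 20 to prio 5: dequeue, change, re-enqueue */
  	q = list_del(q, &b);
  	b.prio = 5;
  	q = list_add_sorted(q, &b);

  	for (p = q; p; p = p->next)
  		printf("%d ", p->prio);	/* 5 10 30 */
  	printf("\n");
  	return 0;
  }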
- */ - if (rt_task(p)) - array = rq->active; - enqueue_task(p, array); + if (on_rq) { + enqueue_task(rq, p, 0); /* * Reschedule if we are currently running on this runqueue and * our priority decreased, or if we are not currently running on @@ -4009,8 +3279,9 @@ void rt_mutex_setprio(struct task_struct if (task_running(rq, p)) { if (p->prio > oldprio) resched_task(rq->curr); - } else if (TASK_PREEMPTS_CURR(p, rq)) - resched_task(rq->curr); + } else { + check_preempt_curr(rq, p); + } } task_rq_unlock(rq, &flags); } @@ -4019,8 +3290,7 @@ void rt_mutex_setprio(struct task_struct void set_user_nice(struct task_struct *p, long nice) { - struct prio_array *array; - int old_prio, delta; + int old_prio, delta, on_rq; unsigned long flags; struct rq *rq; @@ -4041,9 +3311,9 @@ void set_user_nice(struct task_struct *p p->static_prio = NICE_TO_PRIO(nice); goto out_unlock; } - array = p->array; - if (array) { - dequeue_task(p, array); + on_rq = p->on_rq; + if (on_rq) { + dequeue_task(rq, p, 0); dec_raw_weighted_load(rq, p); } @@ -4053,8 +3323,8 @@ void set_user_nice(struct task_struct *p p->prio = effective_prio(p); delta = p->prio - old_prio; - if (array) { - enqueue_task(p, array); + if (on_rq) { + enqueue_task(rq, p, 0); inc_raw_weighted_load(rq, p); /* * If the task increased its priority or is running and @@ -4175,20 +3445,27 @@ static inline struct task_struct *find_p } /* Actually do priority change: must hold rq lock. */ -static void __setscheduler(struct task_struct *p, int policy, int prio) +static void +__setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio) { - BUG_ON(p->array); + BUG_ON(p->on_rq); p->policy = policy; + switch (p->policy) { + case SCHED_NORMAL: + case SCHED_BATCH: + p->sched_class = &fair_sched_class; + break; + case SCHED_FIFO: + case SCHED_RR: + p->sched_class = &rt_sched_class; + break; + } + p->rt_priority = prio; p->normal_prio = normal_prio(p); /* we are holding p->pi_lock already */ p->prio = rt_mutex_getprio(p); - /* - * SCHED_BATCH tasks are treated as perpetual CPU hogs: - */ - if (policy == SCHED_BATCH) - p->sleep_avg = 0; set_load_weight(p); } @@ -4204,8 +3481,7 @@ static void __setscheduler(struct task_s int sched_setscheduler(struct task_struct *p, int policy, struct sched_param *param) { - int retval, oldprio, oldpolicy = -1; - struct prio_array *array; + int retval, oldprio, oldpolicy = -1, on_rq; unsigned long flags; struct rq *rq; @@ -4279,13 +3555,13 @@ recheck: spin_unlock_irqrestore(&p->pi_lock, flags); goto recheck; } - array = p->array; - if (array) - deactivate_task(p, rq); + on_rq = p->on_rq; + if (on_rq) + deactivate_task(rq, p, 0); oldprio = p->prio; - __setscheduler(p, policy, param->sched_priority); - if (array) { - __activate_task(p, rq); + __setscheduler(rq, p, policy, param->sched_priority); + if (on_rq) { + activate_task(rq, p, 0); /* * Reschedule if we are currently running on this runqueue and * our priority decreased, or if we are not currently running on @@ -4294,8 +3570,9 @@ recheck: if (task_running(rq, p)) { if (p->prio > oldprio) resched_task(rq->curr); - } else if (TASK_PREEMPTS_CURR(p, rq)) - resched_task(rq->curr); + } else { + check_preempt_curr(rq, p); + } } __task_rq_unlock(rq); spin_unlock_irqrestore(&p->pi_lock, flags); @@ -4558,50 +3835,66 @@ asmlinkage long sys_sched_getaffinity(pi if (ret < 0) return ret; - if (copy_to_user(user_mask_ptr, &mask, sizeof(cpumask_t))) - return -EFAULT; + if (copy_to_user(user_mask_ptr, &mask, sizeof(cpumask_t))) + return -EFAULT; + + return sizeof(cpumask_t); +} + +/** + * 
sys_sched_yield - yield the current processor to other threads. + * + * This function yields the current CPU to other tasks. If there are no + * other threads running on this CPU then this function will return. + */ +asmlinkage long sys_sched_yield(void) +{ + struct rq *rq = this_rq_lock(); + + schedstat_inc(rq, yld_cnt); + if (rq->nr_running == 1) + schedstat_inc(rq, yld_act_empty); + else + current->sched_class->yield_task(rq, current, NULL); + + /* + * Since we are going to call schedule() anyway, there's + * no need to preempt or enable interrupts: + */ + __release(rq->lock); + spin_release(&rq->lock.dep_map, 1, _THIS_IP_); + _raw_spin_unlock(&rq->lock); + preempt_enable_no_resched(); + + schedule(); - return sizeof(cpumask_t); + return 0; } /** - * sys_sched_yield - yield the current processor to other threads. + * sys_sched_yield_to - yield the current processor to another thread * - * this function yields the current CPU by moving the calling thread + * This function yields the current CPU by moving the calling thread * to the expired array. If there are no other threads running on this * CPU then this function will return. */ -asmlinkage long sys_sched_yield(void) +asmlinkage long sys_sched_yield_to(pid_t pid) { - struct rq *rq = this_rq_lock(); - struct prio_array *array = current->array, *target = rq->expired; + struct task_struct *p_to; + struct rq *rq; - schedstat_inc(rq, yld_cnt); - /* - * We implement yielding by moving the task into the expired - * queue. - * - * (special rule: RT tasks will just roundrobin in the active - * array.) - */ - if (rt_task(current)) - target = rq->active; + rcu_read_lock(); + p_to = find_task_by_pid(pid); + if (!p_to) + goto out_unlock; - if (array->nr_active == 1) { + rq = this_rq_lock(); + + schedstat_inc(rq, yld_cnt); + if (rq->nr_running == 1) schedstat_inc(rq, yld_act_empty); - if (!rq->expired->nr_active) - schedstat_inc(rq, yld_both_empty); - } else if (!rq->expired->nr_active) - schedstat_inc(rq, yld_exp_empty); - - if (array != target) { - dequeue_task(current, array); - enqueue_task(current, target); - } else - /* - * requeue_task is cheaper so perform that if possible. - */ - requeue_task(current, array); + else + current->sched_class->yield_task(rq, current, p_to); /* * Since we are going to call schedule() anyway, there's @@ -4610,13 +3903,19 @@ asmlinkage long sys_sched_yield(void) __release(rq->lock); spin_release(&rq->lock.dep_map, 1, _THIS_IP_); _raw_spin_unlock(&rq->lock); + rcu_read_unlock(); preempt_enable_no_resched(); schedule(); return 0; + +out_unlock: + rcu_read_unlock(); + return -ESRCH; } + static void __cond_resched(void) { #ifdef CONFIG_DEBUG_SPINLOCK_SLEEP @@ -4812,7 +4111,7 @@ long sys_sched_rr_get_interval(pid_t pid goto out_unlock; jiffies_to_timespec(p->policy == SCHED_FIFO ? - 0 : task_timeslice(p), &t); + 0 : static_prio_timeslice(p->static_prio), &t); read_unlock(&tasklist_lock); retval = copy_to_user(interval, &t, sizeof(t)) ? 
-EFAULT : 0; out_nounlock: @@ -4915,7 +4214,7 @@ void show_state_filter(unsigned long sta * console might take alot of time: */ touch_nmi_watchdog(); - if (p->state & state_filter) + if (!state_filter || (p->state & state_filter)) show_task(p); } while_each_thread(g, p); @@ -4925,6 +4224,7 @@ void show_state_filter(unsigned long sta */ if (state_filter == -1) debug_show_all_locks(); + sysrq_sched_debug_show(); } /** @@ -4940,11 +4240,10 @@ void __cpuinit init_idle(struct task_str struct rq *rq = cpu_rq(cpu); unsigned long flags; - idle->timestamp = sched_clock(); - idle->sleep_avg = 0; - idle->array = NULL; + __sched_fork(idle); + idle->exec_start = sched_clock(); + idle->prio = idle->normal_prio = MAX_PRIO; - idle->state = TASK_RUNNING; idle->cpus_allowed = cpumask_of_cpu(cpu); set_task_cpu(idle, cpu); @@ -5062,19 +4361,10 @@ static int __migrate_task(struct task_st goto out; set_task_cpu(p, dest_cpu); - if (p->array) { - /* - * Sync timestamp with rq_dest's before activating. - * The same thing could be achieved by doing this step - * afterwards, and pretending it was a local activate. - * This way is cleaner and logically correct. - */ - p->timestamp = p->timestamp - rq_src->most_recent_timestamp - + rq_dest->most_recent_timestamp; - deactivate_task(p, rq_src); - __activate_task(p, rq_dest); - if (TASK_PREEMPTS_CURR(p, rq_dest)) - resched_task(rq_dest->curr); + if (p->on_rq) { + deactivate_task(rq_src, p, 0); + activate_task(rq_dest, p, 0); + check_preempt_curr(rq_dest, p); } ret = 1; out: @@ -5246,10 +4536,10 @@ void sched_idle_next(void) */ spin_lock_irqsave(&rq->lock, flags); - __setscheduler(p, SCHED_FIFO, MAX_RT_PRIO-1); + __setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1); /* Add idle task to the _front_ of its priority queue: */ - __activate_idle_task(p, rq); + activate_idle_task(p, rq); spin_unlock_irqrestore(&rq->lock, flags); } @@ -5299,16 +4589,15 @@ static void migrate_dead(unsigned int de static void migrate_dead_tasks(unsigned int dead_cpu) { struct rq *rq = cpu_rq(dead_cpu); - unsigned int arr, i; + struct task_struct *next; - for (arr = 0; arr < 2; arr++) { - for (i = 0; i < MAX_PRIO; i++) { - struct list_head *list = &rq->arrays[arr].queue[i]; - - while (!list_empty(list)) - migrate_dead(dead_cpu, list_entry(list->next, - struct task_struct, run_list)); - } + for (;;) { + if (!rq->nr_running) + break; + next = pick_next_task(rq, rq->curr); + if (!next) + break; + migrate_dead(dead_cpu, next); } } #endif /* CONFIG_HOTPLUG_CPU */ @@ -5334,7 +4623,7 @@ migration_call(struct notifier_block *nf kthread_bind(p, cpu); /* Must be high prio: stop_machine expects to yield to it. */ rq = task_rq_lock(p, &flags); - __setscheduler(p, SCHED_FIFO, MAX_RT_PRIO-1); + __setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1); task_rq_unlock(rq, &flags); cpu_rq(cpu)->migration_thread = p; break; @@ -5362,9 +4651,9 @@ migration_call(struct notifier_block *nf rq->migration_thread = NULL; /* Idle task back to normal (off runqueue, low prio) */ rq = task_rq_lock(rq->idle, &flags); - deactivate_task(rq->idle, rq); + deactivate_task(rq, rq->idle, 0); rq->idle->static_prio = MAX_PRIO; - __setscheduler(rq->idle, SCHED_NORMAL, 0); + __setscheduler(rq, rq->idle, SCHED_NORMAL, 0); migrate_dead_tasks(cpu); task_rq_unlock(rq, &flags); migrate_nr_uninterruptible(rq); @@ -5665,483 +4954,6 @@ init_sched_build_groups(cpumask_t span, #define SD_NODES_PER_DOMAIN 16 -/* - * Self-tuning task migration cost measurement between source and target CPUs. 
- * - * This is done by measuring the cost of manipulating buffers of varying - * sizes. For a given buffer-size here are the steps that are taken: - * - * 1) the source CPU reads+dirties a shared buffer - * 2) the target CPU reads+dirties the same shared buffer - * - * We measure how long they take, in the following 4 scenarios: - * - * - source: CPU1, target: CPU2 | cost1 - * - source: CPU2, target: CPU1 | cost2 - * - source: CPU1, target: CPU1 | cost3 - * - source: CPU2, target: CPU2 | cost4 - * - * We then calculate the cost3+cost4-cost1-cost2 difference - this is - * the cost of migration. - * - * We then start off from a small buffer-size and iterate up to larger - * buffer sizes, in 5% steps - measuring each buffer-size separately, and - * doing a maximum search for the cost. (The maximum cost for a migration - * normally occurs when the working set size is around the effective cache - * size.) - */ -#define SEARCH_SCOPE 2 -#define MIN_CACHE_SIZE (64*1024U) -#define DEFAULT_CACHE_SIZE (5*1024*1024U) -#define ITERATIONS 1 -#define SIZE_THRESH 130 -#define COST_THRESH 130 - -/* - * The migration cost is a function of 'domain distance'. Domain - * distance is the number of steps a CPU has to iterate down its - * domain tree to share a domain with the other CPU. The farther - * two CPUs are from each other, the larger the distance gets. - * - * Note that we use the distance only to cache measurement results, - * the distance value is not used numerically otherwise. When two - * CPUs have the same distance it is assumed that the migration - * cost is the same. (this is a simplification but quite practical) - */ -#define MAX_DOMAIN_DISTANCE 32 - -static unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] = - { [ 0 ... MAX_DOMAIN_DISTANCE-1 ] = -/* - * Architectures may override the migration cost and thus avoid - * boot-time calibration. Unit is nanoseconds. Mostly useful for - * virtualized hardware: - */ -#ifdef CONFIG_DEFAULT_MIGRATION_COST - CONFIG_DEFAULT_MIGRATION_COST -#else - -1LL -#endif -}; - -/* - * Allow override of migration cost - in units of microseconds. - * E.g. migration_cost=1000,2000,3000 will set up a level-1 cost - * of 1 msec, level-2 cost of 2 msecs and level3 cost of 3 msecs: - */ -static int __init migration_cost_setup(char *str) -{ - int ints[MAX_DOMAIN_DISTANCE+1], i; - - str = get_options(str, ARRAY_SIZE(ints), ints); - - printk("#ints: %d\n", ints[0]); - for (i = 1; i <= ints[0]; i++) { - migration_cost[i-1] = (unsigned long long)ints[i]*1000; - printk("migration_cost[%d]: %Ld\n", i-1, migration_cost[i-1]); - } - return 1; -} - -__setup ("migration_cost=", migration_cost_setup); - -/* - * Global multiplier (divisor) for migration-cutoff values, - * in percentiles. E.g. use a value of 150 to get 1.5 times - * longer cache-hot cutoff times. - * - * (We scale it from 100 to 128 to long long handling easier.) 
- */ - -#define MIGRATION_FACTOR_SCALE 128 - -static unsigned int migration_factor = MIGRATION_FACTOR_SCALE; - -static int __init setup_migration_factor(char *str) -{ - get_option(&str, &migration_factor); - migration_factor = migration_factor * MIGRATION_FACTOR_SCALE / 100; - return 1; -} - -__setup("migration_factor=", setup_migration_factor); - -/* - * Estimated distance of two CPUs, measured via the number of domains - * we have to pass for the two CPUs to be in the same span: - */ -static unsigned long domain_distance(int cpu1, int cpu2) -{ - unsigned long distance = 0; - struct sched_domain *sd; - - for_each_domain(cpu1, sd) { - WARN_ON(!cpu_isset(cpu1, sd->span)); - if (cpu_isset(cpu2, sd->span)) - return distance; - distance++; - } - if (distance >= MAX_DOMAIN_DISTANCE) { - WARN_ON(1); - distance = MAX_DOMAIN_DISTANCE-1; - } - - return distance; -} - -static unsigned int migration_debug; - -static int __init setup_migration_debug(char *str) -{ - get_option(&str, &migration_debug); - return 1; -} - -__setup("migration_debug=", setup_migration_debug); - -/* - * Maximum cache-size that the scheduler should try to measure. - * Architectures with larger caches should tune this up during - * bootup. Gets used in the domain-setup code (i.e. during SMP - * bootup). - */ -unsigned int max_cache_size; - -static int __init setup_max_cache_size(char *str) -{ - get_option(&str, &max_cache_size); - return 1; -} - -__setup("max_cache_size=", setup_max_cache_size); - -/* - * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This - * is the operation that is timed, so we try to generate unpredictable - * cachemisses that still end up filling the L2 cache: - */ -static void touch_cache(void *__cache, unsigned long __size) -{ - unsigned long size = __size / sizeof(long); - unsigned long chunk1 = size / 3; - unsigned long chunk2 = 2 * size / 3; - unsigned long *cache = __cache; - int i; - - for (i = 0; i < size/6; i += 8) { - switch (i % 6) { - case 0: cache[i]++; - case 1: cache[size-1-i]++; - case 2: cache[chunk1-i]++; - case 3: cache[chunk1+i]++; - case 4: cache[chunk2-i]++; - case 5: cache[chunk2+i]++; - } - } -} - -/* - * Measure the cache-cost of one task migration. Returns in units of nsec. - */ -static unsigned long long -measure_one(void *cache, unsigned long size, int source, int target) -{ - cpumask_t mask, saved_mask; - unsigned long long t0, t1, t2, t3, cost; - - saved_mask = current->cpus_allowed; - - /* - * Flush source caches to RAM and invalidate them: - */ - sched_cacheflush(); - - /* - * Migrate to the source CPU: - */ - mask = cpumask_of_cpu(source); - set_cpus_allowed(current, mask); - WARN_ON(smp_processor_id() != source); - - /* - * Dirty the working set: - */ - t0 = sched_clock(); - touch_cache(cache, size); - t1 = sched_clock(); - - /* - * Migrate to the target CPU, dirty the L2 cache and access - * the shared buffer. (which represents the working set - * of a migrated task.) - */ - mask = cpumask_of_cpu(target); - set_cpus_allowed(current, mask); - WARN_ON(smp_processor_id() != target); - - t2 = sched_clock(); - touch_cache(cache, size); - t3 = sched_clock(); - - cost = t1-t0 + t3-t2; - - if (migration_debug >= 2) - printk("[%d->%d]: %8Ld %8Ld %8Ld => %10Ld.\n", - source, target, t1-t0, t1-t0, t3-t2, cost); - /* - * Flush target caches to RAM and invalidate them: - */ - sched_cacheflush(); - - set_cpus_allowed(current, saved_mask); - - return cost; -} - -/* - * Measure a series of task migrations and return the average - * result. 
Since this code runs early during bootup the system - * is 'undisturbed' and the average latency makes sense. - * - * The algorithm in essence auto-detects the relevant cache-size, - * so it will properly detect different cachesizes for different - * cache-hierarchies, depending on how the CPUs are connected. - * - * Architectures can prime the upper limit of the search range via - * max_cache_size, otherwise the search range defaults to 20MB...64K. - */ -static unsigned long long -measure_cost(int cpu1, int cpu2, void *cache, unsigned int size) -{ - unsigned long long cost1, cost2; - int i; - - /* - * Measure the migration cost of 'size' bytes, over an - * average of 10 runs: - * - * (We perturb the cache size by a small (0..4k) - * value to compensate size/alignment related artifacts. - * We also subtract the cost of the operation done on - * the same CPU.) - */ - cost1 = 0; - - /* - * dry run, to make sure we start off cache-cold on cpu1, - * and to get any vmalloc pagefaults in advance: - */ - measure_one(cache, size, cpu1, cpu2); - for (i = 0; i < ITERATIONS; i++) - cost1 += measure_one(cache, size - i * 1024, cpu1, cpu2); - - measure_one(cache, size, cpu2, cpu1); - for (i = 0; i < ITERATIONS; i++) - cost1 += measure_one(cache, size - i * 1024, cpu2, cpu1); - - /* - * (We measure the non-migrating [cached] cost on both - * cpu1 and cpu2, to handle CPUs with different speeds) - */ - cost2 = 0; - - measure_one(cache, size, cpu1, cpu1); - for (i = 0; i < ITERATIONS; i++) - cost2 += measure_one(cache, size - i * 1024, cpu1, cpu1); - - measure_one(cache, size, cpu2, cpu2); - for (i = 0; i < ITERATIONS; i++) - cost2 += measure_one(cache, size - i * 1024, cpu2, cpu2); - - /* - * Get the per-iteration migration cost: - */ - do_div(cost1, 2 * ITERATIONS); - do_div(cost2, 2 * ITERATIONS); - - return cost1 - cost2; -} - -static unsigned long long measure_migration_cost(int cpu1, int cpu2) -{ - unsigned long long max_cost = 0, fluct = 0, avg_fluct = 0; - unsigned int max_size, size, size_found = 0; - long long cost = 0, prev_cost; - void *cache; - - /* - * Search from max_cache_size*5 down to 64K - the real relevant - * cachesize has to lie somewhere inbetween. 
- */ - if (max_cache_size) { - max_size = max(max_cache_size * SEARCH_SCOPE, MIN_CACHE_SIZE); - size = max(max_cache_size / SEARCH_SCOPE, MIN_CACHE_SIZE); - } else { - /* - * Since we have no estimation about the relevant - * search range - */ - max_size = DEFAULT_CACHE_SIZE * SEARCH_SCOPE; - size = MIN_CACHE_SIZE; - } - - if (!cpu_online(cpu1) || !cpu_online(cpu2)) { - printk("cpu %d and %d not both online!\n", cpu1, cpu2); - return 0; - } - - /* - * Allocate the working set: - */ - cache = vmalloc(max_size); - if (!cache) { - printk("could not vmalloc %d bytes for cache!\n", 2 * max_size); - return 1000000; /* return 1 msec on very small boxen */ - } - - while (size <= max_size) { - prev_cost = cost; - cost = measure_cost(cpu1, cpu2, cache, size); - - /* - * Update the max: - */ - if (cost > 0) { - if (max_cost < cost) { - max_cost = cost; - size_found = size; - } - } - /* - * Calculate average fluctuation, we use this to prevent - * noise from triggering an early break out of the loop: - */ - fluct = abs(cost - prev_cost); - avg_fluct = (avg_fluct + fluct)/2; - - if (migration_debug) - printk("-> [%d][%d][%7d] %3ld.%ld [%3ld.%ld] (%ld): " - "(%8Ld %8Ld)\n", - cpu1, cpu2, size, - (long)cost / 1000000, - ((long)cost / 100000) % 10, - (long)max_cost / 1000000, - ((long)max_cost / 100000) % 10, - domain_distance(cpu1, cpu2), - cost, avg_fluct); - - /* - * If we iterated at least 20% past the previous maximum, - * and the cost has dropped by more than 20% already, - * (taking fluctuations into account) then we assume to - * have found the maximum and break out of the loop early: - */ - if (size_found && (size*100 > size_found*SIZE_THRESH)) - if (cost+avg_fluct <= 0 || - max_cost*100 > (cost+avg_fluct)*COST_THRESH) { - - if (migration_debug) - printk("-> found max.\n"); - break; - } - /* - * Increase the cachesize in 10% steps: - */ - size = size * 10 / 9; - } - - if (migration_debug) - printk("[%d][%d] working set size found: %d, cost: %Ld\n", - cpu1, cpu2, size_found, max_cost); - - vfree(cache); - - /* - * A task is considered 'cache cold' if at least 2 times - * the worst-case cost of migration has passed. - * - * (this limit is only listened to if the load-balancing - * situation is 'nice' - if there is a large imbalance we - * ignore it for the sake of CPU utilization and - * processing fairness.) - */ - return 2 * max_cost * migration_factor / MIGRATION_FACTOR_SCALE; -} - -static void calibrate_migration_costs(const cpumask_t *cpu_map) -{ - int cpu1 = -1, cpu2 = -1, cpu, orig_cpu = raw_smp_processor_id(); - unsigned long j0, j1, distance, max_distance = 0; - struct sched_domain *sd; - - j0 = jiffies; - - /* - * First pass - calculate the cacheflush times: - */ - for_each_cpu_mask(cpu1, *cpu_map) { - for_each_cpu_mask(cpu2, *cpu_map) { - if (cpu1 == cpu2) - continue; - distance = domain_distance(cpu1, cpu2); - max_distance = max(max_distance, distance); - /* - * No result cached yet? 
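
The search loop above can be exercised against a synthetic cost curve. This toy uses thresholds of 120, matching the "20%" wording of the comment above, since SIZE_THRESH and COST_THRESH are defined outside this hunk; the stand-in cost function peaks near 4 MB:

    #include <stdio.h>
    #include <stdlib.h>

    #define SIZE_THRESH 120 /* assumed: iterate 20% past the maximum */
    #define COST_THRESH 120 /* assumed: cost dropped ~20% below the maximum */

    int main(void)
    {
            long long cost, prev_cost = 0, max_cost = 0, fluct, avg_fluct = 0;
            unsigned int size, size_found = 0;

            for (size = 64 * 1024; size <= 100 * 1024 * 1024; size = size * 10 / 9) {
                    /* stand-in for measure_cost(): rises, then falls past 4MB */
                    cost = size < 4u * 1024 * 1024 ? size : 8LL * 1024 * 1024 - size;

                    if (cost > max_cost) {
                            max_cost = cost;
                            size_found = size;
                    }
                    fluct = llabs(cost - prev_cost);
                    avg_fluct = (avg_fluct + fluct) / 2;
                    prev_cost = cost;

                    /* same shape as the break-out test above: */
                    if (size_found &&
                        size * 100ULL > size_found * (unsigned long long)SIZE_THRESH &&
                        (cost + avg_fluct <= 0 ||
                         max_cost * 100 > (cost + avg_fluct) * COST_THRESH))
                            break;
            }
            printf("knee of the curve near %u bytes\n", size_found);
            return 0;
    }
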
- */ - if (migration_cost[distance] == -1LL) - migration_cost[distance] = - measure_migration_cost(cpu1, cpu2); - } - } - /* - * Second pass - update the sched domain hierarchy with - * the new cache-hot-time estimations: - */ - for_each_cpu_mask(cpu, *cpu_map) { - distance = 0; - for_each_domain(cpu, sd) { - sd->cache_hot_time = migration_cost[distance]; - distance++; - } - } - /* - * Print the matrix: - */ - if (migration_debug) - printk("migration: max_cache_size: %d, cpu: %d MHz:\n", - max_cache_size, -#ifdef CONFIG_X86 - cpu_khz/1000 -#else - -1 -#endif - ); - if (system_state == SYSTEM_BOOTING && num_online_cpus() > 1) { - printk("migration_cost="); - for (distance = 0; distance <= max_distance; distance++) { - if (distance) - printk(","); - printk("%ld", (long)migration_cost[distance] / 1000); - } - printk("\n"); - } - j1 = jiffies; - if (migration_debug) - printk("migration: %ld seconds\n", (j1-j0) / HZ); - - /* - * Move back to the original CPU. NUMA-Q gets confused - * if we migrate to another quad during bootup. - */ - if (raw_smp_processor_id() != orig_cpu) { - cpumask_t mask = cpumask_of_cpu(orig_cpu), - saved_mask = current->cpus_allowed; - - set_cpus_allowed(current, mask); - set_cpus_allowed(current, saved_mask); - } -} - #ifdef CONFIG_NUMA /** @@ -6671,10 +5483,6 @@ static int build_sched_domains(const cpu #endif cpu_attach_domain(sd, i); } - /* - * Tune cache-hot values: - */ - calibrate_migration_costs(cpu_map); return 0; @@ -6875,6 +5683,16 @@ void __init sched_init_smp(void) /* Move init over to a non-isolated CPU */ if (set_cpus_allowed(current, non_isolated_cpus) < 0) BUG(); + /* + * Increase the granularity value when there are more CPUs, + * because with more CPUs the 'effective latency' as visible + * to users decreases. But the relationship is not linear, + * so pick a second-best guess by going with the log2 of the + * number of CPUs. 
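
The "1 + ilog2(num_online_cpus())" scaling applied just below works out as follows; a standalone sketch with a hand-rolled integer log2 (the kernel provides ilog2() itself):

    #include <stdio.h>

    static unsigned int ilog2_u(unsigned int n)     /* floor(log2(n)) */
    {
            unsigned int r = 0;
            while (n >>= 1)
                    r++;
            return r;
    }

    int main(void)
    {
            unsigned int base = 2000000;    /* the 2 msec default, in ns */

            for (unsigned int cpus = 1; cpus <= 16; cpus *= 2)
                    printf("%2u CPUs -> %u ns\n",
                           cpus, base * (1 + ilog2_u(cpus)));
            return 0;
            /* prints 2ms, 4ms, 6ms, 8ms, 10ms for 1..16 CPUs */
    }
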
+ * + * This idea comes from the SD scheduler of Con Kolivas: + */ + sysctl_sched_granularity *= 1 + ilog2(num_online_cpus()); } #else void __init sched_init_smp(void) @@ -6894,7 +5712,14 @@ int in_sched_functions(unsigned long add void __init sched_init(void) { - int i, j, k; + int i, j; + + current->sched_class = &fair_sched_class; + /* + * Link up the scheduling class hierarchy: + */ + rt_sched_class.next = &fair_sched_class; + fair_sched_class.next = NULL; for_each_possible_cpu(i) { struct prio_array *array; @@ -6904,14 +5729,13 @@ void __init sched_init(void) spin_lock_init(&rq->lock); lockdep_set_class(&rq->lock, &rq->rq_lock_key); rq->nr_running = 0; - rq->active = rq->arrays; - rq->expired = rq->arrays + 1; - rq->best_expired_prio = MAX_PRIO; + rq->tasks_timeline = RB_ROOT; + rq->clock = rq->fair_clock = 1; + for (j = 0; j < CPU_LOAD_IDX_MAX; j++) + rq->cpu_load[j] = 0; #ifdef CONFIG_SMP rq->sd = NULL; - for (j = 1; j < 3; j++) - rq->cpu_load[j] = 0; rq->active_balance = 0; rq->push_cpu = 0; rq->cpu = i; @@ -6920,15 +5744,13 @@ void __init sched_init(void) #endif atomic_set(&rq->nr_iowait, 0); - for (j = 0; j < 2; j++) { - array = rq->arrays + j; - for (k = 0; k < MAX_PRIO; k++) { - INIT_LIST_HEAD(array->queue + k); - __clear_bit(k, array->bitmap); - } - // delimiter for bitsearch - __set_bit(MAX_PRIO, array->bitmap); + array = &rq->active; + for (j = 0; j < MAX_RT_PRIO; j++) { + INIT_LIST_HEAD(array->queue + j); + __clear_bit(j, array->bitmap); } + /* delimiter for bitsearch: */ + __set_bit(MAX_RT_PRIO, array->bitmap); } set_load_weight(&init_task); @@ -6984,28 +5806,54 @@ EXPORT_SYMBOL(__might_sleep); #ifdef CONFIG_MAGIC_SYSRQ void normalize_rt_tasks(void) { - struct prio_array *array; struct task_struct *p; unsigned long flags; struct rq *rq; + int on_rq; read_lock_irq(&tasklist_lock); for_each_process(p) { - if (!rt_task(p)) + p->fair_key = 0; + p->wait_runtime = 0; + p->wait_start_fair = 0; + p->wait_start = 0; + p->exec_start = 0; + p->sleep_start = 0; + p->block_start = 0; + task_rq(p)->fair_clock = 0; + task_rq(p)->clock = 0; + + if (!rt_task(p)) { + /* + * Renice negative nice level userspace + * tasks back to 0: + */ + if (TASK_NICE(p) < 0 && p->mm) + set_user_nice(p, 0); continue; + } spin_lock_irqsave(&p->pi_lock, flags); rq = __task_rq_lock(p); +#ifdef CONFIG_SMP + /* + * Do not touch the migration thread: + */ + if (p == rq->migration_thread) + goto out_unlock; +#endif - array = p->array; - if (array) - deactivate_task(p, task_rq(p)); - __setscheduler(p, SCHED_NORMAL, 0); - if (array) { - __activate_task(p, task_rq(p)); + on_rq = p->on_rq; + if (on_rq) + deactivate_task(task_rq(p), p, 0); + __setscheduler(rq, p, SCHED_NORMAL, 0); + if (on_rq) { + activate_task(task_rq(p), p, 0); resched_task(rq->curr); } - +#ifdef CONFIG_SMP + out_unlock: +#endif __task_rq_unlock(rq); spin_unlock_irqrestore(&p->pi_lock, flags); } Index: linux-cfs-2.6.20.8.q/kernel/sched_debug.c =================================================================== --- /dev/null +++ linux-cfs-2.6.20.8.q/kernel/sched_debug.c @@ -0,0 +1,161 @@ +/* + * kernel/time/sched_debug.c + * + * Print the CFS rbtree + * + * Copyright(C) 2007, Red Hat, Inc., Ingo Molnar + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
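
The two assignments in sched_init() above turn the scheduling classes into a singly linked list, rt first, fair second. The core's dispatch loop is not part of this hunk, but given that linking it plausibly walks the list in priority order; the function name pick_next() and the NULL fallback below are illustrative assumptions, not the patch's code:

    typedef unsigned long long u64;

    struct rq;
    struct task_struct;

    struct sched_class {
            struct sched_class *next;
            struct task_struct *(*pick_next_task)(struct rq *rq, u64 now);
    };

    extern struct sched_class rt_sched_class;       /* head of the list */

    /* first class with a runnable task wins: */
    struct task_struct *pick_next(struct rq *rq, u64 now)
    {
            struct sched_class *class;
            struct task_struct *p;

            for (class = &rt_sched_class; class; class = class->next) {
                    p = class->pick_next_task(rq, now);
                    if (p)
                            return p;
            }
            return 0;       /* caller falls back to the idle task */
    }
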
+ */ + +#include +#include +#include +#include +#include +#include +#include + +#include + +typedef void (*print_fn_t)(struct seq_file *m, unsigned int *classes); + +/* + * This allows printing both to /proc/sched_debug and + * to the console + */ +#define SEQ_printf(m, x...) \ + do { \ + if (m) \ + seq_printf(m, x); \ + else \ + printk(x); \ + } while (0) + +static void +print_task(struct seq_file *m, struct rq *rq, struct task_struct *p, u64 now) +{ + if (rq->curr == p) + SEQ_printf(m, "R"); + else + SEQ_printf(m, " "); + + SEQ_printf(m, "%14s %5d %15Ld %13Ld %13Ld %9Ld %5d " + "%15Ld %15Ld %15Ld\n", + p->comm, p->pid, + (long long)p->fair_key, (long long)p->fair_key - rq->fair_clock, + (long long)p->wait_runtime, + (long long)p->nr_switches, + p->prio, + (long long)p->wait_start_fair - rq->fair_clock, + (long long)p->sum_exec_runtime, + (long long)p->sum_wait_runtime); +} + +static void print_rq(struct seq_file *m, struct rq *rq, u64 now) +{ + struct task_struct *p; + struct rb_node *curr; + + SEQ_printf(m, + "\nrunnable tasks:\n" + " task PID tree-key delta waiting" + " switches prio wstart-fair" + " sum-exec sum-wait\n" + "-----------------------------------------------------------------" + "--------------------------------" + "--------------------------------\n"); + + curr = first_fair(rq); + while (curr) { + p = rb_entry(curr, struct task_struct, run_node); + print_task(m, rq, p, now); + + curr = rb_next(curr); + } +} + +static void print_cpu(struct seq_file *m, int cpu, u64 now) +{ + struct rq *rq = &per_cpu(runqueues, cpu); + + SEQ_printf(m, "\ncpu: %d\n", cpu); +#define P(x) \ + SEQ_printf(m, " .%-22s: %Lu\n", #x, (unsigned long long)(rq->x)) + + P(nr_running); + P(raw_weighted_load); + P(nr_switches); + P(nr_load_updates); + P(nr_uninterruptible); + P(next_balance); + P(curr->pid); + P(clock); + P(prev_clock_raw); + P(clock_warps); + P(clock_unstable_events); + P(clock_max_delta); + rq->clock_max_delta = 0; + P(fair_clock); + P(prev_fair_clock); + P(exec_clock); + P(prev_exec_clock); + P(wait_runtime); + P(cpu_load[0]); + P(cpu_load[1]); + P(cpu_load[2]); + P(cpu_load[3]); + P(cpu_load[4]); +#undef P + + print_rq(m, rq, now); +} + +static int sched_debug_show(struct seq_file *m, void *v) +{ + u64 now = ktime_to_ns(ktime_get()); + int cpu; + + SEQ_printf(m, "Sched Debug Version: v0.02\n"); + SEQ_printf(m, "now at %Lu nsecs\n", (unsigned long long)now); + + for_each_online_cpu(cpu) + print_cpu(m, cpu, now); + + SEQ_printf(m, "\n"); + + return 0; +} + +void sysrq_sched_debug_show(void) +{ + sched_debug_show(NULL, NULL); +} + +static int sched_debug_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, sched_debug_show, NULL); +} + +static struct file_operations sched_debug_fops = { + .open = sched_debug_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + +static int __init init_sched_debug_procfs(void) +{ + struct proc_dir_entry *pe; + + pe = create_proc_entry("sched_debug", 0644, NULL); + if (!pe) + return -ENOMEM; + + pe->proc_fops = &sched_debug_fops; + + return 0; +} +__initcall(init_sched_debug_procfs); Index: linux-cfs-2.6.20.8.q/kernel/sched_fair.c =================================================================== --- /dev/null +++ linux-cfs-2.6.20.8.q/kernel/sched_fair.c @@ -0,0 +1,618 @@ +/* + * Completely Fair Scheduling (CFS) Class (SCHED_NORMAL/SCHED_BATCH) + */ + +/* + * Preemption granularity: + * (default: 2 msec, units: nanoseconds) + * + * NOTE: this granularity value is not the same as the concept of + * 
'timeslice length' - timeslices in CFS will typically be somewhat + * larger than this value. (to see the precise effective timeslice + * length of your workload, run vmstat and monitor the context-switches + * field) + * + * On SMP systems the value of this is multiplied by the log2 of the + * number of CPUs. (i.e. factor 2x on 2-way systems, 3x on 4-way + * systems, 4x on 8-way systems, 5x on 16-way systems, etc.) + */ +unsigned int sysctl_sched_granularity __read_mostly = 2000000; + +unsigned int sysctl_sched_sleep_history_max __read_mostly = 2000000000; + +unsigned int sysctl_sched_load_smoothing = 2; + +/* + * Wake-up granularity. + * (default: 1 msec, units: nanoseconds) + * + * This option delays the preemption effects of decoupled workloads + * and reduces their over-scheduling. Synchronous workloads will still + * have immediate wakeup/sleep latencies. + */ +unsigned int sysctl_sched_wakeup_granularity __read_mostly = 0; + + +extern struct sched_class fair_sched_class; + +/**************************************************************/ +/* Scheduling class tree data structure manipulation methods: + */ + +/* + * Enqueue a task into the rb-tree: + */ +static inline void __enqueue_task_fair(struct rq *rq, struct task_struct *p) +{ + struct rb_node **link = &rq->tasks_timeline.rb_node; + struct rb_node *parent = NULL; + struct task_struct *entry; + s64 key = p->fair_key; + int leftmost = 1; + + /* + * Find the right place in the rbtree: + */ + while (*link) { + parent = *link; + entry = rb_entry(parent, struct task_struct, run_node); + /* + * We dont care about collisions. Nodes with + * the same key stay together. + */ + if (key < entry->fair_key) { + link = &parent->rb_left; + } else { + link = &parent->rb_right; + leftmost = 0; + } + } + + /* + * Maintain a cache of leftmost tree entries (it is frequently + * used): + */ + if (leftmost) + rq->rb_leftmost = &p->run_node; + + rb_link_node(&p->run_node, parent, link); + rb_insert_color(&p->run_node, &rq->tasks_timeline); +} + +static inline void __dequeue_task_fair(struct rq *rq, struct task_struct *p) +{ + if (rq->rb_leftmost == &p->run_node) + rq->rb_leftmost = NULL; + rb_erase(&p->run_node, &rq->tasks_timeline); +} + +static inline struct rb_node * first_fair(struct rq *rq) +{ + if (rq->rb_leftmost) + return rq->rb_leftmost; + /* Cache the value returned by rb_first() */ + rq->rb_leftmost = rb_first(&rq->tasks_timeline); + return rq->rb_leftmost; +} + +static struct task_struct * __pick_next_task_fair(struct rq *rq) +{ + return rb_entry(first_fair(rq), struct task_struct, run_node); +} + +/**************************************************************/ +/* Scheduling class statistics methods: + */ + +static inline u64 +rescale_load(struct task_struct *p, u64 value) +{ + int load_shift = p->load_shift; + + if (load_shift == SCHED_LOAD_SHIFT) + return value; + + return (value << load_shift) >> SCHED_LOAD_SHIFT; +} + +static u64 +niced_granularity(struct rq *rq, struct task_struct *curr, + unsigned long granularity) +{ + return rescale_load(curr, granularity); +} + +/* + * Update the current task's runtime statistics. Skip current tasks that + * are not in our scheduling class. 
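
The timeline idea above in miniature: an ordered tree keyed by fair_key with the leftmost (earliest) entry cached, so picking the next task is O(1). A plain unbalanced binary search tree stands in for the kernel's red-black tree here, purely for brevity:

    #include <stdio.h>
    #include <stdlib.h>

    struct node {
            long long key;                  /* stand-in for p->fair_key */
            struct node *left, *right;
    };

    static struct node *root, *leftmost;

    static void enqueue(struct node *n)
    {
            struct node **link = &root;
            int is_leftmost = 1;

            while (*link) {
                    if (n->key < (*link)->key) {
                            link = &(*link)->left;
                    } else {
                            /* equal keys stay together, as above */
                            link = &(*link)->right;
                            is_leftmost = 0;
                    }
            }
            n->left = n->right = 0;
            *link = n;
            if (is_leftmost)
                    leftmost = n;   /* O(1) pick-next instead of a tree walk */
    }

    int main(void)
    {
            long long keys[] = { 30, 10, 20 };

            for (int i = 0; i < 3; i++) {
                    struct node *n = malloc(sizeof(*n));
                    n->key = keys[i];
                    enqueue(n);
            }
            printf("next task to run has key %lld\n", leftmost->key); /* 10 */
            return 0;
    }
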
+ */ +static inline void update_curr(struct rq *rq, u64 now) +{ + u64 delta_exec, delta_fair, delta_mine; + struct task_struct *curr = rq->curr; + unsigned long load; + + if (curr->sched_class != &fair_sched_class || curr == rq->idle + || !curr->on_rq) + return; + /* + * Get the amount of time the current task was running + * since the last time we changed raw_weighted_load: + */ + delta_exec = now - curr->exec_start; + if (unlikely(delta_exec > curr->exec_max)) + curr->exec_max = delta_exec; + + if (sysctl_sched_load_smoothing) { + delta_fair = delta_exec << SCHED_LOAD_SHIFT; + do_div(delta_fair, rq->raw_weighted_load); + + load = rq->cpu_load[CPU_LOAD_IDX_MAX-1] + 1; + if (sysctl_sched_load_smoothing & 2) + load = max(load, rq->raw_weighted_load); + + delta_mine = delta_exec << curr->load_shift; + do_div(delta_mine, load); + } else { + delta_fair = delta_exec << SCHED_LOAD_SHIFT; + do_div(delta_fair, rq->raw_weighted_load); + + delta_mine = delta_exec << curr->load_shift; + do_div(delta_mine, rq->raw_weighted_load); + } + + curr->sum_exec_runtime += delta_exec; + curr->exec_start = now; + + rq->fair_clock += delta_fair; + rq->exec_clock += delta_exec; + + /* + * We executed delta_exec amount of time on the CPU, + * but we were only entitled to delta_mine amount of + * time during that period (if nr_running == 1 then + * the two values are equal): + */ + + /* + * Task already marked for preemption, do not burden + * it with the cost of not having left the CPU yet. + */ + if (unlikely(test_tsk_thread_flag(curr, TIF_NEED_RESCHED))) + goto out_nowait; + + curr->wait_runtime -= delta_exec - delta_mine; + if (unlikely(curr->wait_runtime < curr->min_wait_runtime)) + curr->min_wait_runtime = curr->wait_runtime; + + rq->wait_runtime -= delta_exec - delta_mine; +out_nowait: + ; +} + +static inline void +update_stats_wait_start(struct rq *rq, struct task_struct *p, u64 now) +{ + p->wait_start_fair = rq->fair_clock; + p->wait_start = now; +} + +/* + * Task is being enqueued - update stats: + */ +static inline void +update_stats_enqueue(struct rq *rq, struct task_struct *p, u64 now) +{ + s64 key; + + /* + * Update the fair clock. + */ + update_curr(rq, now); + + /* + * Are we enqueueing a waiting task? (for current tasks + * a dequeue/enqueue event is a NOP) + */ + if (p != rq->curr) + update_stats_wait_start(rq, p, now); + /* + * Update the key: + */ + key = rq->fair_clock; + + /* + * Optimize the common nice 0 case: + */ + if (likely(p->load_shift == SCHED_LOAD_SHIFT)) { + key -= p->wait_runtime; + } else { + unsigned int delta_bits; + + if (p->load_shift < SCHED_LOAD_SHIFT) { + /* plus-reniced tasks get helped: */ + delta_bits = SCHED_LOAD_SHIFT - p->load_shift; + key -= p->wait_runtime << delta_bits; + } else { + /* negative-reniced tasks get hurt: */ + delta_bits = p->load_shift - SCHED_LOAD_SHIFT; + key -= p->wait_runtime >> delta_bits; + } + } + + p->fair_key = key; +} + +/* + * Note: must be called with a freshly updated rq->fair_clock. 
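
A worked example of the key computation above, assuming SCHED_LOAD_SHIFT is 10 (so nice-0 tasks take the fast path): the task that is owed the most wait_runtime ends up with the smallest key and therefore sorts leftmost in the timeline.

    #include <stdio.h>

    #define SCHED_LOAD_SHIFT 10     /* assumed value, for illustration */

    long long fair_key(long long fair_clock, long long wait_runtime,
                       int load_shift)
    {
            if (load_shift == SCHED_LOAD_SHIFT)
                    return fair_clock - wait_runtime;
            if (load_shift < SCHED_LOAD_SHIFT)      /* plus-reniced: helped */
                    return fair_clock -
                           (wait_runtime << (SCHED_LOAD_SHIFT - load_shift));
            /* negative-reniced: hurt */
            return fair_clock - (wait_runtime >> (load_shift - SCHED_LOAD_SHIFT));
    }

    int main(void)
    {
            printf("nice 0: %lld\n", fair_key(1000, 300, 10));  /* 700 */
            printf("nice +: %lld\n", fair_key(1000, 300, 9));   /* 400 */
            printf("nice -: %lld\n", fair_key(1000, 300, 11));  /* 850 */
            return 0;
    }
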
+ */ +static inline void +update_stats_wait_end(struct rq *rq, struct task_struct *p, u64 now) +{ + u64 delta, fair_delta, delta_wait; + + delta_wait = now - p->wait_start; + if (unlikely(delta_wait > p->wait_max)) + p->wait_max = delta_wait; + + delta = rq->fair_clock - p->wait_start_fair; + fair_delta = rescale_load(p, delta); + + p->sum_wait_runtime += fair_delta; + rq->wait_runtime += fair_delta; + p->wait_runtime += fair_delta; + + p->wait_start_fair = 0; + p->wait_start = 0; +} + +static inline void +update_stats_dequeue(struct rq *rq, struct task_struct *p, u64 now) +{ + update_curr(rq, now); + /* + * Mark the end of the wait period if dequeueing a + * waiting task: + */ + if (p != rq->curr) + update_stats_wait_end(rq, p, now); +} + +/* + * We are picking a new current task - update its stats: + */ +static inline void +update_stats_curr_start(struct rq *rq, struct task_struct *p, u64 now) +{ + /* + * We are starting a new run period: + */ + p->exec_start = now; +} + +/* + * We are descheduling a task - update its stats: + */ +static inline void +update_stats_curr_end(struct rq *rq, struct task_struct *p, u64 now) +{ + update_curr(rq, now); + + p->exec_start = 0; +} + +/**************************************************************/ +/* Scheduling class queueing methods: + */ + +/* + * The enqueue_task method is called before nr_running is + * increased. Here we update the fair scheduling stats and + * then put the task into the rbtree: + */ +static void +enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup, u64 now) +{ + unsigned long max_delta = sysctl_sched_sleep_history_max, factor; + u64 delta = 0; + + if (wakeup) { + if (p->sleep_start) { + delta = now - p->sleep_start; + if ((s64)delta < 0) + delta = 0; + + if (unlikely(delta > p->sleep_max)) + p->sleep_max = delta; + + p->sleep_start = 0; + } + if (p->block_start) { + delta = now - p->block_start; + if ((s64)delta < 0) + delta = 0; + + if (unlikely(delta > p->block_max)) + p->block_max = delta; + + p->block_start = 0; + } + + /* + * We are after a wait period, decay the + * wait_runtime value: + */ + if (max_delta != -1 && max_delta != -2) { + if (delta < max_delta) { + factor = 1024 * (max_delta - + (unsigned long)delta) / max_delta; + p->wait_runtime *= (int)factor; + p->wait_runtime /= 1024; + } else { + p->wait_runtime = 0; + } + } + } + update_stats_enqueue(rq, p, now); + if (wakeup && max_delta == -2) + p->wait_runtime = 0; + __enqueue_task_fair(rq, p); +} + +/* + * The dequeue_task method is called before nr_running is + * decreased. We remove the task from the rbtree and + * update the fair scheduling stats: + */ +static void +dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep, u64 now) +{ + update_stats_dequeue(rq, p, now); + if (sleep) { + if (p->state & TASK_INTERRUPTIBLE) + p->sleep_start = now; + if (p->state & TASK_UNINTERRUPTIBLE) + p->block_start = now; + } + __dequeue_task_fair(rq, p); +} + +/* + * sched_yield() support is very simple via the rbtree: we just + * dequeue the task and move it after the next task, which + * causes tasks to roundrobin. 
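
The wakeup decay above, in isolation: wait_runtime credit fades linearly over the sleep-history window (2 seconds by default, per sysctl_sched_sleep_history_max; the special -1/-2 settings above disable the decay or zero the credit outright). A standalone sketch of the arithmetic:

    #include <stdio.h>

    long long decay_wait_runtime(long long wait_runtime,
                                 unsigned long long delta,     /* ns slept */
                                 unsigned long long max_delta) /* window, ns */
    {
            unsigned long long factor;

            if (delta >= max_delta)
                    return 0;       /* slept past the whole history window */

            factor = 1024ULL * (max_delta - delta) / max_delta;
            return wait_runtime * (long long)factor / 1024;
    }

    int main(void)
    {
            /* half the window slept away -> roughly half the credit left: */
            printf("%lld\n",
                   decay_wait_runtime(1000000, 1000000000ULL, 2000000000ULL));
            return 0;       /* prints 500000 */
    }
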
+ */
+static void
+yield_task_fair(struct rq *rq, struct task_struct *p, struct task_struct *p_to)
+{
+	struct rb_node *curr, *next, *first;
+	struct task_struct *p_next;
+	s64 yield_key;
+	u64 now;
+
+	/*
+	 * yield-to support: if we are on the same runqueue then
+	 * give half of our wait_runtime (if it's positive) to the other task:
+	 */
+	if (p_to && p->wait_runtime > 0) {
+		p_to->wait_runtime += p->wait_runtime >> 1;
+		p->wait_runtime >>= 1;
+	}
+	curr = &p->run_node;
+	first = first_fair(rq);
+	/*
+	 * Move this task to the second place in the tree:
+	 */
+	if (unlikely(curr != first)) {
+		next = first;
+	} else {
+		next = rb_next(curr);
+		/*
+		 * We were the last one already - nothing to do, return
+		 * and reschedule:
+		 */
+		if (unlikely(!next))
+			return;
+	}
+
+	p_next = rb_entry(next, struct task_struct, run_node);
+	/*
+	 * Minimal key value needed to be second in the tree:
+	 */
+	yield_key = p_next->fair_key + 1;
+
+	now = __rq_clock(rq);
+	dequeue_task_fair(rq, p, 0, now);
+	p->on_rq = 0;
+
+	/*
+	 * Only update the key if we need to move further back
+	 * than the minimal position required to be second:
+	 */
+	if (p->fair_key < yield_key)
+		p->fair_key = yield_key;
+
+	__enqueue_task_fair(rq, p);
+	p->on_rq = 1;
+}
+
+/*
+ * Preempt the current task with a newly woken task if needed:
+ */
+static inline void
+__check_preempt_curr_fair(struct rq *rq, struct task_struct *p,
+			  struct task_struct *curr, unsigned long granularity)
+{
+	s64 __delta = curr->fair_key - p->fair_key;
+
+	/*
+	 * Take scheduling granularity into account - do not
+	 * preempt the current task unless the best task has
+	 * a larger than sched_granularity fairness advantage:
+	 */
+	if (__delta > niced_granularity(rq, curr, granularity))
+		resched_task(curr);
+}
+
+/*
+ * Preempt the current task with a newly woken task if needed:
+ */
+static void check_preempt_curr_fair(struct rq *rq, struct task_struct *p)
+{
+	struct task_struct *curr = rq->curr;
+
+	if ((curr == rq->idle) || rt_prio(p->prio)) {
+		resched_task(curr);
+	} else {
+		__check_preempt_curr_fair(rq, p, curr,
+					  sysctl_sched_granularity);
+	}
+}
+
+static struct task_struct * pick_next_task_fair(struct rq *rq, u64 now)
+{
+	struct task_struct *p = __pick_next_task_fair(rq);
+
+	/*
+	 * Any task has to be enqueued before it gets to execute on
+	 * a CPU. So account for the time it spent waiting on the
+	 * runqueue. (note, here we rely on pick_next_task() having
+	 * done a put_prev_task_fair() shortly before this, which
+	 * updated rq->fair_clock - used by update_stats_wait_end())
+	 */
+	update_stats_wait_end(rq, p, now);
+	update_stats_curr_start(rq, p, now);
+
+	return p;
+}
+
+/*
+ * Account for a descheduled task:
+ */
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, u64 now)
+{
+	if (prev == rq->idle)
+		return;
+
+	update_stats_curr_end(rq, prev, now);
+	/*
+	 * If the task is still waiting for the CPU (it just got
+	 * preempted), start the wait period:
+	 */
+	if (prev->on_rq)
+		update_stats_wait_start(rq, prev, now);
+}
+
+/**************************************************************/
+/* Fair scheduling class load-balancing methods:
+ */
+
+/*
+ * Load-balancing iterator. Note: while the runqueue stays locked
+ * during the whole iteration, the current task might be
+ * dequeued so the iterator has to be dequeue-safe.
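
Reduced to its arithmetic, the granularity test in __check_preempt_curr_fair() above is a single comparison: a freshly woken task must lead the current one by more than the (nice-scaled) granularity before it forces a preemption. A sketch:

    #include <stdio.h>

    int should_preempt(long long curr_key, long long woken_key,
                       unsigned long granularity)
    {
            return (curr_key - woken_key) > (long long)granularity;
    }

    int main(void)
    {
            /* within the 2ms granularity: keep running the current task */
            printf("%d\n", should_preempt(5000000, 4000000, 2000000)); /* 0 */
            /* well past it: reschedule */
            printf("%d\n", should_preempt(8000000, 4000000, 2000000)); /* 1 */
            return 0;
    }
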
Here we + * achieve that by always pre-iterating before returning + * the current task: + */ +static struct task_struct * load_balance_start_fair(struct rq *rq) +{ + struct rb_node *first = first_fair(rq); + struct task_struct *p; + + if (!first) + return NULL; + + p = rb_entry(first, struct task_struct, run_node); + + rq->rb_load_balance_curr = rb_next(first); + + return p; +} + +static struct task_struct * load_balance_next_fair(struct rq *rq) +{ + struct rb_node *curr = rq->rb_load_balance_curr; + struct task_struct *p; + + if (!curr) + return NULL; + + p = rb_entry(curr, struct task_struct, run_node); + rq->rb_load_balance_curr = rb_next(curr); + + return p; +} + +/* + * scheduler tick hitting a task of our scheduling class: + */ +static void task_tick_fair(struct rq *rq, struct task_struct *curr) +{ + struct task_struct *next; + u64 now = __rq_clock(rq); + + /* + * Dequeue and enqueue the task to update its + * position within the tree: + */ + dequeue_task_fair(rq, curr, 0, now); + curr->on_rq = 0; + enqueue_task_fair(rq, curr, 0, now); + curr->on_rq = 1; + + /* + * Reschedule if another task tops the current one. + */ + next = __pick_next_task_fair(rq); + if (next == curr) + return; + + if ((curr == rq->idle) || (rt_prio(next->prio) && + (next->prio < curr->prio))) + resched_task(curr); + else + __check_preempt_curr_fair(rq, next, curr, + sysctl_sched_granularity); +} + +/* + * Share the fairness runtime between parent and child, thus the + * total amount of pressure for CPU stays equal - new tasks + * get a chance to run but frequent forkers are not allowed to + * monopolize the CPU. Note: the parent runqueue is locked, + * the child is not running yet. + */ +static void task_new_fair(struct rq *rq, struct task_struct *p) +{ + sched_info_queued(p); + update_stats_enqueue(rq, p, rq_clock(rq)); + /* + * Child runs first: we let it run before the parent + * until it reschedules once. 
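
The pre-iterate trick above, shown on a singly linked list: the cursor advances before the current element is returned, so the caller may remove the element it was handed without breaking the iteration. The static cursor stands in for rq->rb_load_balance_curr:

    #include <stddef.h>

    struct item {
            struct item *next;
    };

    static struct item *cursor;

    struct item *iter_start(struct item *head)
    {
            struct item *p = head;

            cursor = p ? p->next : NULL;    /* pre-iterate */
            return p;
    }

    struct item *iter_next(void)
    {
            struct item *p = cursor;

            cursor = p ? p->next : NULL;    /* pre-iterate */
            return p;
    }
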
We set up the key so that + * it will preempt the parent: + */ + p->fair_key = current->fair_key - niced_granularity(rq, rq->curr, + sysctl_sched_granularity) - 1; + __enqueue_task_fair(rq, p); + p->on_rq = 1; + inc_nr_running(p, rq); +} + +/* + * All the scheduling class methods: + */ +struct sched_class fair_sched_class __read_mostly = { + .enqueue_task = enqueue_task_fair, + .dequeue_task = dequeue_task_fair, + .yield_task = yield_task_fair, + + .check_preempt_curr = check_preempt_curr_fair, + + .pick_next_task = pick_next_task_fair, + .put_prev_task = put_prev_task_fair, + + .load_balance_start = load_balance_start_fair, + .load_balance_next = load_balance_next_fair, + .task_tick = task_tick_fair, + .task_new = task_new_fair, +}; Index: linux-cfs-2.6.20.8.q/kernel/sched_rt.c =================================================================== --- /dev/null +++ linux-cfs-2.6.20.8.q/kernel/sched_rt.c @@ -0,0 +1,184 @@ +/* + * Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR + * policies) + */ + +static void +enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup, u64 now) +{ + struct prio_array *array = &rq->active; + + list_add_tail(&p->run_list, array->queue + p->prio); + __set_bit(p->prio, array->bitmap); +} + +/* + * Adding/removing a task to/from a priority array: + */ +static void +dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep, u64 now) +{ + struct prio_array *array = &rq->active; + + list_del(&p->run_list); + if (list_empty(array->queue + p->prio)) + __clear_bit(p->prio, array->bitmap); +} + +/* + * Put task to the end of the run list without the overhead of dequeue + * followed by enqueue. + */ +static void requeue_task_rt(struct rq *rq, struct task_struct *p) +{ + struct prio_array *array = &rq->active; + + list_move_tail(&p->run_list, array->queue + p->prio); +} + +static void +yield_task_rt(struct rq *rq, struct task_struct *p, struct task_struct *p_to) +{ + requeue_task_rt(rq, p); +} + +/* + * Preempt the current task with a newly woken task if needed: + */ +static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p) +{ + if (p->prio < rq->curr->prio) + resched_task(rq->curr); +} + +static struct task_struct * pick_next_task_rt(struct rq *rq, u64 now) +{ + struct prio_array *array = &rq->active; + struct list_head *queue; + int idx; + + idx = sched_find_first_bit(array->bitmap); + if (idx >= MAX_RT_PRIO) + return NULL; + + queue = array->queue + idx; + return list_entry(queue->next, struct task_struct, run_list); +} + +/* + * No accounting done when RT tasks are descheduled: + */ +static void put_prev_task_rt(struct rq *rq, struct task_struct *p, u64 now) +{ +} + +/* + * Load-balancing iterator. Note: while the runqueue stays locked + * during the whole iteration, the current task might be + * dequeued so the iterator has to be dequeue-safe. 
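
The RT runqueue structure above in miniature: one list per priority plus a bitmap, so picking the next task is a find-first-bit. This sketch uses 64 priorities to fit a single-word bitmap and a naive bit scan, where the kernel uses MAX_RT_PRIO (100) and sched_find_first_bit():

    #include <stdio.h>
    #include <stdint.h>

    #define NR_PRIO 64                      /* kernel: MAX_RT_PRIO == 100 */

    static uint64_t bitmap;                 /* bit p set <=> queue[p] non-empty */

    static void mark_queued(int prio) { bitmap |=  (1ULL << prio); }
    static void mark_empty(int prio)  { bitmap &= ~(1ULL << prio); }

    static int pick_next_prio(void)         /* lowest set bit wins */
    {
            for (int p = 0; p < NR_PRIO; p++)
                    if (bitmap & (1ULL << p))
                            return p;
            return -1;                      /* no runnable RT task */
    }

    int main(void)
    {
            mark_queued(50);
            mark_queued(10);
            printf("next prio: %d\n", pick_next_prio());    /* 10 */
            mark_empty(10);
            printf("next prio: %d\n", pick_next_prio());    /* 50 */
            return 0;
    }
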
Here we + * achieve that by always pre-iterating before returning + * the current task: + */ +static struct task_struct * load_balance_start_rt(struct rq *rq) +{ + struct prio_array *array = &rq->active; + struct list_head *head, *curr; + struct task_struct *p; + int idx; + + idx = sched_find_first_bit(array->bitmap); + if (idx >= MAX_RT_PRIO) + return NULL; + + head = array->queue + idx; + curr = head->prev; + + p = list_entry(curr, struct task_struct, run_list); + + curr = curr->prev; + + rq->rt_load_balance_idx = idx; + rq->rt_load_balance_head = head; + rq->rt_load_balance_curr = curr; + + return p; +} + +static struct task_struct * load_balance_next_rt(struct rq *rq) +{ + struct prio_array *array = &rq->active; + struct list_head *head, *curr; + struct task_struct *p; + int idx; + + idx = rq->rt_load_balance_idx; + head = rq->rt_load_balance_head; + curr = rq->rt_load_balance_curr; + + /* + * If we arrived back to the head again then + * iterate to the next queue (if any): + */ + if (unlikely(head == curr)) { + int next_idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1); + + if (next_idx >= MAX_RT_PRIO) + return NULL; + + idx = next_idx; + head = array->queue + idx; + curr = head->prev; + + rq->rt_load_balance_idx = idx; + rq->rt_load_balance_head = head; + } + + p = list_entry(curr, struct task_struct, run_list); + + curr = curr->prev; + + rq->rt_load_balance_curr = curr; + + return p; +} + +static void task_tick_rt(struct rq *rq, struct task_struct *p) +{ + /* + * RR tasks need a special form of timeslice management. + * FIFO tasks have no timeslices. + */ + if ((p->policy == SCHED_RR) && !--p->time_slice) { + p->time_slice = static_prio_timeslice(p->static_prio); + set_tsk_need_resched(p); + + /* put it at the end of the queue: */ + requeue_task_rt(rq, p); + } +} + +/* + * No parent/child timeslice management necessary for RT tasks, + * just activate them: + */ +static void task_new_rt(struct rq *rq, struct task_struct *p) +{ + activate_task(rq, p, 1); +} + +static struct sched_class rt_sched_class __read_mostly = { + .enqueue_task = enqueue_task_rt, + .dequeue_task = dequeue_task_rt, + .yield_task = yield_task_rt, + + .check_preempt_curr = check_preempt_curr_rt, + + .pick_next_task = pick_next_task_rt, + .put_prev_task = put_prev_task_rt, + + .load_balance_start = load_balance_start_rt, + .load_balance_next = load_balance_next_rt, + + .task_tick = task_tick_rt, + .task_new = task_new_rt, +}; Index: linux-cfs-2.6.20.8.q/kernel/sched_stats.h =================================================================== --- /dev/null +++ linux-cfs-2.6.20.8.q/kernel/sched_stats.h @@ -0,0 +1,235 @@ + +#ifdef CONFIG_SCHEDSTATS +/* + * bump this up when changing the output format or the meaning of an existing + * format, so that tools can adapt (or abort) + */ +#define SCHEDSTAT_VERSION 14 + +static int show_schedstat(struct seq_file *seq, void *v) +{ + int cpu; + + seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION); + seq_printf(seq, "timestamp %lu\n", jiffies); + for_each_online_cpu(cpu) { + struct rq *rq = cpu_rq(cpu); +#ifdef CONFIG_SMP + struct sched_domain *sd; + int dcnt = 0; +#endif + + /* runqueue-specific stats */ + seq_printf(seq, + "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu", + cpu, rq->yld_both_empty, + rq->yld_act_empty, rq->yld_exp_empty, rq->yld_cnt, + rq->sched_switch, rq->sched_cnt, rq->sched_goidle, + rq->ttwu_cnt, rq->ttwu_local, + rq->rq_sched_info.cpu_time, + rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt); + + seq_printf(seq, "\n"); + +#ifdef CONFIG_SMP + 
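
task_tick_rt() above boils down to one rule, restated standalone below; the policy constant and the default_slice parameter are placeholders, not the kernel's actual values:

    struct rr_task {
            int policy;                     /* placeholder policy field */
            unsigned int time_slice;
    };

    #define MY_SCHED_RR 1                   /* stand-in for SCHED_RR */

    /* returns 1 when the caller must requeue the task at the tail
     * of its priority list and set the need-resched flag: */
    int rr_tick(struct rr_task *p, unsigned int default_slice)
    {
            if (p->policy != MY_SCHED_RR)
                    return 0;       /* SCHED_FIFO: no timeslice at all */
            if (--p->time_slice)
                    return 0;       /* slice not yet used up */
            p->time_slice = default_slice;
            return 1;
    }
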
/* domain-specific stats */ + preempt_disable(); + for_each_domain(cpu, sd) { + enum idle_type itype; + char mask_str[NR_CPUS]; + + cpumask_scnprintf(mask_str, NR_CPUS, sd->span); + seq_printf(seq, "domain%d %s", dcnt++, mask_str); + for (itype = SCHED_IDLE; itype < MAX_IDLE_TYPES; + itype++) { + seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu " + "%lu", + sd->lb_cnt[itype], + sd->lb_balanced[itype], + sd->lb_failed[itype], + sd->lb_imbalance[itype], + sd->lb_gained[itype], + sd->lb_hot_gained[itype], + sd->lb_nobusyq[itype], + sd->lb_nobusyg[itype]); + } + seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu" + " %lu %lu %lu\n", + sd->alb_cnt, sd->alb_failed, sd->alb_pushed, + sd->sbe_cnt, sd->sbe_balanced, sd->sbe_pushed, + sd->sbf_cnt, sd->sbf_balanced, sd->sbf_pushed, + sd->ttwu_wake_remote, sd->ttwu_move_affine, + sd->ttwu_move_balance); + } + preempt_enable(); +#endif + } + return 0; +} + +static int schedstat_open(struct inode *inode, struct file *file) +{ + unsigned int size = PAGE_SIZE * (1 + num_online_cpus() / 32); + char *buf = kmalloc(size, GFP_KERNEL); + struct seq_file *m; + int res; + + if (!buf) + return -ENOMEM; + res = single_open(file, show_schedstat, NULL); + if (!res) { + m = file->private_data; + m->buf = buf; + m->size = size; + } else + kfree(buf); + return res; +} + +const struct file_operations proc_schedstat_operations = { + .open = schedstat_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +/* + * Expects runqueue lock to be held for atomicity of update + */ +static inline void +rq_sched_info_arrive(struct rq *rq, unsigned long delta_jiffies) +{ + if (rq) { + rq->rq_sched_info.run_delay += delta_jiffies; + rq->rq_sched_info.pcnt++; + } +} + +/* + * Expects runqueue lock to be held for atomicity of update + */ +static inline void +rq_sched_info_depart(struct rq *rq, unsigned long delta_jiffies) +{ + if (rq) + rq->rq_sched_info.cpu_time += delta_jiffies; +} +# define schedstat_inc(rq, field) do { (rq)->field++; } while (0) +# define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0) +#else /* !CONFIG_SCHEDSTATS */ +static inline void +rq_sched_info_arrive(struct rq *rq, unsigned long delta_jiffies) +{} +static inline void +rq_sched_info_depart(struct rq *rq, unsigned long delta_jiffies) +{} +# define schedstat_inc(rq, field) do { } while (0) +# define schedstat_add(rq, field, amt) do { } while (0) +#endif + +#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) +/* + * Called when a process is dequeued from the active array and given + * the cpu. We should note that with the exception of interactive + * tasks, the expired queue will become the active queue after the active + * queue is empty, without explicitly dequeuing and requeuing tasks in the + * expired queue. (Interactive tasks may be requeued directly to the + * active queue, thus delaying tasks in the expired queue from running; + * see scheduler_tick()). + * + * This function is only called from sched_info_arrive(), rather than + * dequeue_task(). Even though a task may be queued and dequeued multiple + * times as it is shuffled about, we're really interested in knowing how + * long it was from the *first* time it was queued to the time that it + * finally hit a cpu. + */ +static inline void sched_info_dequeued(struct task_struct *t) +{ + t->sched_info.last_queued = 0; +} + +/* + * Called when a task finally hits the cpu. We can now calculate how + * long it was waiting to run. 
We also note when it began so that we
+ * can keep stats on how long its timeslice is.
+ */
+static void sched_info_arrive(struct task_struct *t)
+{
+	unsigned long now = jiffies, delta_jiffies = 0;
+
+	if (t->sched_info.last_queued)
+		delta_jiffies = now - t->sched_info.last_queued;
+	sched_info_dequeued(t);
+	t->sched_info.run_delay += delta_jiffies;
+	t->sched_info.last_arrival = now;
+	t->sched_info.pcnt++;
+
+	rq_sched_info_arrive(task_rq(t), delta_jiffies);
+}
+
+/*
+ * Called when a process is queued into either the active or expired
+ * array.  The time is noted and later used to determine how long we
+ * had to wait to reach the cpu.  Since the expired queue will become
+ * the active queue after the active queue is empty, without explicitly
+ * dequeuing and requeuing any tasks, we are interested in queuing to
+ * either.  It is unusual but not impossible for tasks to be dequeued
+ * and immediately requeued in the same or another array: this can
+ * happen in sched_yield(), set_user_nice(), and even load_balance()
+ * as it moves tasks from runqueue to runqueue.
+ *
+ * This function is only called from enqueue_task(), but also only updates
+ * the timestamp if it is not already set.  It's assumed that
+ * sched_info_dequeued() will clear that stamp when appropriate.
+ */
+static inline void sched_info_queued(struct task_struct *t)
+{
+	if (unlikely(sched_info_on()))
+		if (!t->sched_info.last_queued)
+			t->sched_info.last_queued = jiffies;
+}
+
+/*
+ * Called when a process ceases being the active-running process, either
+ * voluntarily or involuntarily.  Now we can calculate how long we ran.
+ */
+static inline void sched_info_depart(struct task_struct *t)
+{
+	unsigned long delta_jiffies = jiffies - t->sched_info.last_arrival;
+
+	t->sched_info.cpu_time += delta_jiffies;
+	rq_sched_info_depart(task_rq(t), delta_jiffies);
+}
+
+/*
+ * Called when tasks are switched involuntarily due, typically, to expiring
+ * their time slice.  (This may also be called when switching to or from
+ * the idle task.)  We are only called when prev != next.
+ */
+static inline void
+__sched_info_switch(struct task_struct *prev, struct task_struct *next)
+{
+	struct rq *rq = task_rq(prev);
+
+	/*
+	 * prev now departs the cpu.  It's not interesting to record
+	 * stats about how efficient we were at scheduling the idle
+	 * process, however.
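
The delay-accounting bookkeeping above, in plain form: time from last_queued to arrival accumulates as run_delay (waiting), arrival to departure as cpu_time (running), jiffies-based like the kernel code:

    struct delay_stats {
            unsigned long last_queued;      /* set when enqueued */
            unsigned long last_arrival;     /* set when put on the CPU */
            unsigned long run_delay;        /* total time spent waiting */
            unsigned long cpu_time;         /* total time spent running */
    };

    void arrive(struct delay_stats *s, unsigned long now)
    {
            if (s->last_queued)
                    s->run_delay += now - s->last_queued;
            s->last_queued = 0;
            s->last_arrival = now;
    }

    void depart(struct delay_stats *s, unsigned long now)
    {
            s->cpu_time += now - s->last_arrival;
    }
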
+ */ + if (prev != rq->idle) + sched_info_depart(prev); + + if (next != rq->idle) + sched_info_arrive(next); +} +static inline void +sched_info_switch(struct task_struct *prev, struct task_struct *next) +{ + if (unlikely(sched_info_on())) + __sched_info_switch(prev, next); +} +#else +#define sched_info_queued(t) do { } while (0) +#define sched_info_switch(t, next) do { } while (0) +#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */ + Index: linux-cfs-2.6.20.8.q/kernel/sysctl.c =================================================================== --- linux-cfs-2.6.20.8.q.orig/kernel/sysctl.c +++ linux-cfs-2.6.20.8.q/kernel/sysctl.c @@ -320,6 +320,46 @@ static ctl_table kern_table[] = { .strategy = &sysctl_uts_string, }, { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_granularity_ns", + .data = &sysctl_sched_granularity, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_wakeup_granularity_ns", + .data = &sysctl_sched_wakeup_granularity, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_sleep_history_max_ns", + .data = &sysctl_sched_sleep_history_max, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_child_runs_first", + .data = &sysctl_sched_child_runs_first, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_load_smoothing", + .data = &sysctl_sched_load_smoothing, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { .ctl_name = KERN_PANIC, .procname = "panic", .data = &panic_timeout,