Discussion:
another approach to rss : sloppy rss
Christoph Lameter
2004-11-18 19:34:21 UTC
But I don't know what the appropriate solution is. My priorities
may be wrong, but I dislike the thought of a struct mm dominated
by a huge percpu array of rss longs (or cachelines?), even if the
machines on which it would be huge are ones which could well afford
the waste of memory. It just offends my sense of proportion, when
the exact rss is of no importance. I'm more attracted to just
leaving it unatomic, and living with the fact that it's racy
and approximate (but have /proc report negatives as 0).
Here is a patch that enables handling of rss outside of the page table
lock by simply ignoring the errors introduced by not locking. The observed
loss of rss was always less than 1%.

The patch ensures that negative rss values are not displayed and removes three
checks in mm/rmap.c that used rss (unnecessarily, AFAIK).

Some numbers:

4 Gigabyte concurrent allocation from 4 cpus:

rss protected by page_table_lock:

margin:~/clameter # ./pftn -g4 -r3 -f4
Size=262415 RSS=262233
Size=262479 RSS=262234
Size=262415 RSS=262233
4 3 4 0.180s 16.271s 5.010s 47801.151 154059.862
margin:~/clameter # ./pftn -g4 -r3 -f4
Size=262415 RSS=262233
Size=262415 RSS=262233
Size=262415 RSS=262233
4 3 4 0.155s 14.616s 4.081s 53239.852 163270.962
margin:~/clameter # ./pftn -g4 -r3 -f4
Size=262415 RSS=262233
Size=262479 RSS=262234
Size=262415 RSS=262233
4 3 4 0.172s 16.192s 5.018s 48055.018 151621.738

with sloppy rss:

margin2:~/clameter # ./pftn -g4 -r3 -f4
Size=262415 RSS=261120
Size=262415 RSS=261074
Size=262415 RSS=261215
4 3 4 0.161s 13.058s 4.060s 59489.254 170939.864
margin2:~/clameter # ./pftn -g4 -r3 -f4
Size=262415 RSS=260900
Size=262543 RSS=261001
Size=262415 RSS=261053
4 3 4 0.152s 13.565s 4.031s 57329.397 182103.081
margin2:~/clameter # ./pftn -g4 -r3 -f4
Size=262415 RSS=260988
Size=262479 RSS=261112
Size=262479 RSS=261343
4 3 4 0.143s 12.994s 4.060s 59860.702 170770.399

32 GB allocation with 32 cpus.

with page_table_lock:

Size=2099307 RSS=2097270
Size=2099371 RSS=2097271
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
32 10 32 18.105s 5466.913s 202.027s 3823.418 103676.172

sloppy rss:

Size=2099307 RSS=2094018
Size=2099307 RSS=2093738
Size=2099307 RSS=2093907
Size=2099307 RSS=2093634
Size=2099307 RSS=2093731
Size=2099307 RSS=2094343
Size=2099307 RSS=2094072
Size=2099307 RSS=2094185
Size=2099307 RSS=2093845
Size=2099307 RSS=2093396
32 10 32 14.872s 1036.711s 55.023s 19942.800 379701.332



Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-11-15 11:13:39.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-11-17 06:58:51.000000000 -0800
@@ -216,7 +216,7 @@
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
- spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
+ spinlock_t page_table_lock; /* Protects page tables */

struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
@@ -252,6 +252,19 @@
struct kioctx default_kioctx;
};

+/*
+ * rss and anon_rss are incremented and decremented in some locations without
+ * proper locking. This function insures that these values do not become negative
+ * and is called before reporting rss based statistics
+ */
+static void inline rss_fixup(struct mm_struct *mm)
+{
+ if ((long)mm->rss < 0)
+ mm->rss = 0;
+ if ((long)mm->anon_rss < 0)
+ mm->anon_rss = 0;
+}
+
struct sighand_struct {
atomic_t count;
struct k_sigaction action[_NSIG];
Index: linux-2.6.9/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9.orig/fs/proc/task_mmu.c 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/fs/proc/task_mmu.c 2004-11-17 06:58:51.000000000 -0800
@@ -11,6 +11,7 @@
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
+ rss_fixup(mm);
buffer += sprintf(buffer,
"VmSize:\t%8lu kB\n"
"VmLck:\t%8lu kB\n"
@@ -37,6 +38,7 @@
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
+ rss_fixup(mm);
*shared = mm->rss - mm->anon_rss;
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
Index: linux-2.6.9/fs/proc/array.c
===================================================================
--- linux-2.6.9.orig/fs/proc/array.c 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/fs/proc/array.c 2004-11-17 06:58:51.000000000 -0800
@@ -325,6 +325,7 @@
vsize = task_vsize(mm);
eip = KSTK_EIP(task);
esp = KSTK_ESP(task);
+ rss_fixup(mm);
}

get_task_comm(tcomm, task);
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-15 11:13:40.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-17 07:07:00.000000000 -0800
@@ -263,8 +263,6 @@
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -504,8 +502,6 @@
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -788,8 +784,7 @@
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
- cursor < max_nl_cursor &&
+ while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;
Christoph Lameter
2004-11-19 01:40:42 UTC
This patch conflicts with the page fault scalability patch but I could not
leave this stone unturned. There were no significant performance increases, so
this is just for the record in case someone else gets the same wild idea.

The patch implements a fastpath in which the page_table_lock is not dropped
in do_anonymous_page. The fastpath steals a page from the per-cpu hot or cold
lists to get a page quickly.

Results (4 GB and 32 GB allocations on up to 32 processors, gradually
increasing the number of processors):

with patch:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.524s 24.524s 25.005s104653.150 104642.920
4 10 2 0.456s 29.458s 15.082s 87629.462 165633.410
4 10 4 0.453s 37.064s 11.002s 69872.279 237796.809
4 10 8 0.574s 99.258s 15.003s 26258.236 174308.765
4 10 16 2.171s 279.211s 21.001s 9316.271 124721.683
4 10 32 2.544s 741.273s 27.093s 3524.299 93827.660

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 4.124s 358.469s 362.061s 57837.481 57834.144
32 10 2 4.217s 440.333s 235.043s 47174.609 89076.709
32 10 4 3.778s 321.754s 100.069s 64422.222 208270.694
32 10 8 3.830s 789.580s 117.067s 26432.116 178211.592
32 10 16 3.921s 2360.026s 170.021s 8871.395 123203.040
32 10 32 9.140s 6213.944s 224.068s 3369.955 93338.297

w/o patch:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.449s 24.992s 25.044s103038.282 103022.448
4 10 2 0.448s 30.290s 16.027s 85282.541 161110.770
4 10 4 0.420s 38.700s 11.061s 67008.319 225702.353
4 10 8 0.612s 93.862s 14.059s 27747.547 179564.131
4 10 16 1.554s 265.199s 20.016s 9827.180 129994.843
4 10 32 8.088s 657.280s 25.074s 3939.826 101822.835

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 3.966s 366.840s 370.082s 56556.456 56553.456
32 10 2 3.604s 319.004s 172.058s 65006.086 121511.453
32 10 4 3.705s 341.550s 106.007s 60741.936 197704.486
32 10 8 3.597s 809.711s 119.021s 25785.427 175917.674
32 10 16 5.886s 2238.122s 163.084s 9345.560 127998.973
32 10 32 21.748s 5458.983s 201.062s 3826.409 104011.521

Only a minimal increase, if any. At the high end the patch leads to
even more contention.

Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-11-18 12:25:49.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-11-18 16:53:01.000000000 -0800
@@ -1436,28 +1436,56 @@

/* Read-only mapping of ZERO_PAGE. */
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
-
/* ..except if it's a write access */
if (write_access) {
+ struct per_cpu_pageset *pageset;
+ unsigned long flags;
+ int temperature;
+
/* Allocate our own private page. */
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
-
- if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
- if (!page)
- goto no_mem;
- clear_user_highpage(page, addr);
-
- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);

- if (!pte_none(*page_table)) {
- pte_unmap(page_table);
- page_cache_release(page);
+ /* This is not numa compatible yet! */
+ pageset = NODE_DATA(numa_node_id())->node_zonelists[GFP_HIGHUSER & GFP_ZONEMASK].zones[0]->pageset+smp_processor_id();
+
+ /* Fastpath for the case that the anonvma is already setup and there are
+ * pages available in the per_cpu_pageset for this node. If so steal
+ * pages from the pageset and avoid dropping the page_table_lock.
+ */
+ local_irq_save(flags);
+ temperature=1;
+ if (vma->anon_vma && (pageset->pcp[temperature].count || pageset->pcp[--temperature].count)) {
+ /* Fastpath for hot/cold pages */
+ page = list_entry(pageset->pcp[temperature].list.next, struct page, lru);
+ list_del(&page->lru);
+ pageset->pcp[temperature].count--;
+ local_irq_restore(flags);
+ page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+ 1 << PG_referenced | 1 << PG_arch_1 |
+ 1 << PG_checked | 1 << PG_mappedtodisk);
+ page->private = 0;
+ set_page_count(page, 1);
+ /* We skipped updating the zone statistics !*/
+ } else {
+ /* Slow path */
+ local_irq_restore(flags);
spin_unlock(&mm->page_table_lock);
- goto out;
+
+ if (unlikely(anon_vma_prepare(vma)))
+ goto no_mem;
+ page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ if (!page)
+ goto no_mem;
+
+ spin_lock(&mm->page_table_lock);
+ page_table = pte_offset_map(pmd, addr);
+
+ if (!pte_none(*page_table)) {
+ pte_unmap(page_table);
+ page_cache_release(page);
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
}
mm->rss++;
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
@@ -1473,7 +1501,10 @@

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
+
spin_unlock(&mm->page_table_lock);
+ if (write_access)
+ clear_user_highpage(page, addr);
out:
return VM_FAULT_MINOR;
no_mem:
Nick Piggin
2004-11-19 02:19:11 UTC
Post by Christoph Lameter
This patch conflicts with the page fault scalability patch but I could not
leave this stone unturned. No significant performance increases so
this is just for the record in case someone else gets the same wild idea.
I had a similar wild idea. Mine was to just make sure we have a spare
per-CPU page ready before taking any locks.

Ahh, you're doing clear_user_highpage after the pte is already set up?
Won't that be racy? I guess that would be an advantage of my approach,
the clear_user_highpage can be done first (although that is more likely
to be wasteful of cache).
Christoph Lameter
2004-11-19 02:38:47 UTC
Post by Nick Piggin
Ahh, you're doing clear_user_highpage after the pte is already set up?
The huge page code also has that optimization. Clearing of pages
may take some time, which is one reason the kernel drops the page table
lock for anonymous page allocation and then reacquires it. The patch does
not relinquish the lock on the fast path, hence the clearing was moved
outside of the lock.
Post by Nick Piggin
Won't that be racy? I guess that would be an advantage of my approach,
the clear_user_highpage can be done first (although that is more likely
to be wasteful of cache).
If you do the clearing with the page table lock held then performance will
suffer.

Nick Piggin
2004-11-19 02:44:25 UTC
Post by Christoph Lameter
Post by Nick Piggin
Ahh, you're doing clear_user_highpage after the pte is already set up?
The huge page code also has that optimization. Clearing of pages
may take some time which is one reason the kernel drops the page table
lock for anonymous page allocation and then reacquires it. The patch does
not relinquish the lock on the fast path thus the move outside of the
lock.
But you're doing it after you've set up a pte for that page you are
clearing... I think? What's to stop another thread trying to read or
write to it concurrently?
Post by Christoph Lameter
Post by Nick Piggin
Won't that be racy? I guess that would be an advantage of my approach,
the clear_user_highpage can be done first (although that is more likely
to be wasteful of cache).
If you do the clearing with the page table lock held then performance will
suffer.
Yeah very much, but if you allocate and clear a "just in case" page
_before_ taking any locks for the fault then you'd be able to go
straight through do_anonymous_page.

But yeah that has other issues like having a spare page per CPU (maybe
not so great a loss), and having anonymous faults much more likely to
get pages which are cache cold.

Anyway, glad to see your patches didn't improve things: now we don't
have to think about making *more* tradeoffs :)
Christoph Lameter
2004-11-19 03:28:41 UTC
Post by Nick Piggin
But you're doing it after you've set up a pte for that page you are
clearing... I think? What's to stop another thread trying to read or
write to it concurrently?
Nothing. If this approach had led to anything then we would have needed to
address that issue. The clearing had to be outside of the lock in order not to
impact the performance tests negatively.
Post by Nick Piggin
Post by Christoph Lameter
If you do the clearing with the page table lock held then performance will
suffer.
Yeah very much, but if you allocate and clear a "just in case" page
_before_ taking any locks for the fault then you'd be able to go
straight through do_anonymous_page.
But yeah that has other issues like having a spare page per CPU (maybe
not so great a loss), and having anonymous faults much more likely to
get pages which are cache cold.
You may be able to implement that using the hot and cold lists. Have
something that runs over the lists and prezeros and preformats these pages
(an idle thread?).

Set some flag to indicate that a page has been prepared and then just zing
it in if do_anonymous_page finds that flag set.

But I think this may introduce way too much complexity
into the page fault handler.
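
For concreteness, the kind of fastpath being imagined might look like the
sketch below. This is illustration only, not a posted patch: PG_zeroed /
TestClearPageZeroed and grab_prezeroed_page() are invented names standing in
for the "prepared" flag and the helper, and an idle or background thread would
be the one pre-zeroing pages on the per-cpu lists and setting the flag.

	/*
	 * Illustrative sketch only: take an already-zeroed page off a per-cpu
	 * list so do_anonymous_page can skip clear_user_highpage() and keep
	 * holding the page_table_lock.
	 */
	static struct page *grab_prezeroed_page(struct per_cpu_pages *pcp)
	{
		struct page *page = NULL;
		unsigned long flags;

		local_irq_save(flags);
		if (pcp->count) {
			page = list_entry(pcp->list.next, struct page, lru);
			if (TestClearPageZeroed(page)) {	/* invented flag */
				list_del(&page->lru);
				pcp->count--;
				set_page_count(page, 1);
			} else
				page = NULL;	/* not prepared: use the slow path */
		}
		local_irq_restore(flags);
		return page;		/* already zeroed if non-NULL */
	}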
Benjamin Herrenschmidt
2004-11-19 07:07:48 UTC
Post by Christoph Lameter
Post by Nick Piggin
But you're doing it after you've set up a pte for that page you are
clearing... I think? What's to stop another thread trying to read or
write to it concurrently?
Nothing. If this had led to anything then we would have needed to address
this issue. The clearing had to be outside of the lock in order not to
impact the performance tests negatively.
No, it's clearly a bug. We even had a very hard to track down bug
recently on ppc64 which was caused by the fact that set_pte didn't
contain a barrier, thus the stores done by the _previous_
clear_user_highpage() could be re-ordered with the store to the PTE.
That could cause another process to "see" the PTE before the writes of 0
to the page, and thus start writing to the page before all zeros went
in, thus ending up with corrupted data. We had a real-life testcase of
this one. That test case would blow up right away with your code, I
think.
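
For illustration only (this is not code from the thread), the ordering
requirement described above comes down to something like the following; the
explicit smp_wmb() stands in for the barrier that set_pte itself must provide
on such architectures:

	/*
	 * Illustration: on a weakly ordered CPU the stores that zero the page
	 * must be ordered before the store that publishes the pte, otherwise a
	 * second thread that sees the new pte can still read stale data.
	 */
	static void publish_zeroed_page(struct vm_area_struct *vma, unsigned long addr,
					pte_t *page_table, struct page *page)
	{
		pte_t entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));

		clear_user_highpage(page, addr);	/* zero the page...            */
		smp_wmb();				/* ...and order the zeroing... */
		set_pte(page_table, entry);		/* ...before mapping the page  */
	}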

Ben.


Christoph Lameter
2004-11-19 19:42:39 UTC
Signed-off-by: Christoph Lameter <***@sgi.com>

Changes from V10->V11 of this patch:
- cmpxchg_i386: Optimize code generated after feedback from Linus. Various
fixes.
- drop make_rss_atomic in favor of rss_sloppy
- generic: adapt to new changes in Linus tree, some fixes to fallback
functions. Add generic ptep_xchg_flush based on xchg.
- S390: remove use of page_table_lock from ptep_xchg_flush (deadlock)
- x86_64: remove ptep_xchg
- i386: integrated Nick Piggin's changes for PAE mode. Create ptep_xchg_flush and
various fixes.
- ia64: if necessary flush icache before ptep_cmpxchg. Remove ptep_xchg

This is a series of patches that increases the scalability of
the page fault handler for SMP. Here are some performance results
on a machine with 32 processors allocating 32 GB with an increasing
number of cpus.

Without the patches:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 3.966s 366.840s 370.082s 56556.456 56553.456
32 10 2 3.604s 319.004s 172.058s 65006.086 121511.453
32 10 4 3.705s 341.550s 106.007s 60741.936 197704.486
32 10 8 3.597s 809.711s 119.021s 25785.427 175917.674
32 10 16 5.886s 2238.122s 163.084s 9345.560 127998.973
32 10 32 21.748s 5458.983s 201.062s 3826.409 104011.521

With the patches:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 3.772s 330.629s 334.042s 62713.587 62708.706
32 10 2 3.767s 352.252s 185.077s 58905.502 112886.222
32 10 4 3.549s 255.683s 77.000s 80898.177 272326.496
32 10 8 3.522s 263.879s 52.030s 78427.083 400965.857
32 10 16 5.193s 384.813s 42.076s 53772.158 490378.852
32 10 32 15.806s 996.890s 54.077s 20708.587 382879.208

With a high number of CPUs the page fault rate improves more than
twofold and may reach 500000 faults/sec between 16 and 512 cpus. The
fault rate drops if a process is running on all processors, as is also
the case here for the 32 cpu run.

Note that the measurements were done on a NUMA system and this
test uses off-node memory. Variations may exist due to allocations in
memory areas at diverse distances from the local cpu. The slight drop
for 2 cpus is probably due to that effect.

The performance increase is accomplished by avoiding the use of the
page_table_lock spinlock (but not mm->mmap_sem!) through new atomic
operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd and pgd's
(pgd_test_and_populate, pmd_test_and_populate).

The page table lock can be avoided in the following situations:

1. An empty pte or pmd entry is populated

This is safe since the swapper may only depopulate them and the
swapper code has been changed to never set a pte to be empty until the
page has been evicted. The population of an empty pte is frequent
if a process touches newly allocated memory.

2. Modifications of flags in a pte entry (write/accessed).

These modifications are done by the CPU or by low level handlers
on various platforms also bypassing the page_table_lock. So this
seems to be safe too.

One essential change in the VM is the use of pte_cmpxchg (or its generic
emulation) on page table entries before doing an update_mmu_cache without holding
the page table lock. However, we do similar things now with other atomic pte operations
such as ptep_get_and_clear and ptep_test_and_clear_dirty. These operations clear
a pte *after* doing an operation on it. The ptep_cmpxchg as used in this patch
operates on a *cleared* pte and replaces it with a pte pointing to valid memory.
The effect of this change on various architectures has to be thought through. Local
definitions of ptep_cmpxchg and ptep_xchg may be necessary.

For IA64 an icache coherency issue may arise that potentially requires the
flushing of the icache (as done via update_mmu_cache on IA64) prior
to the use of ptep_cmpxchg. Similar issues may arise on other platforms.
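
The generic emulation mentioned above can be pictured roughly as follows. This
is a sketch under assumptions, not the code from the patch: the signature
mirrors the ptep_cmpxchg(vma, addr, ptep, oldval, newval) call used later in
the series, and TLB flushing and dirty/accessed handling are omitted.

	/*
	 * Sketch: emulate ptep_cmpxchg on an architecture without
	 * __HAVE_ARCH_ATOMIC_TABLE_OPS by taking the page_table_lock for a
	 * very short time.
	 */
	static inline int ptep_cmpxchg_emulated(struct vm_area_struct *vma,
						unsigned long addr, pte_t *ptep,
						pte_t oldval, pte_t newval)
	{
		struct mm_struct *mm = vma->vm_mm;
		int same;

		spin_lock(&mm->page_table_lock);
		same = pte_same(*ptep, oldval);
		if (same)
			set_pte(ptep, newval);
		spin_unlock(&mm->page_table_lock);
		return same;
	}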

The patch uses sloppy rss handling. mm->rss is incremented without
proper locking because locking would introduce too much overhead. Rss
is not essential for vm operations (3 uses of rss in rmap.c were not necessary and
were removed). The difference in rss values has been found to be less than 1% in
our tests (see also the separate email to linux-mm and linux-ia64 on the subject
of "sloppy rss"). The move away from using atomic operations for rss in earlier versions
of this patch also increases the performance of the page fault handler in the single
thread case over an unpatched kernel.

Note that I have posted two other approaches to dealing with the rss problem:

A. make_rss_atomic. The earlier releases contained that patch but then another
variable (such as anon_rss) was introduced that would have required additional
atomic operations. Atomic rss operations are also causing slowdowns on
machines with a high number of cpus due to memory contention.

B. remove_rss. Replace rss with a periodic scan over the vm to determine
rss and additional numbers. This was also discussed on linux-mm and linux-ia64.
The scans while displaying /proc data were undesirable.

The patchset is composed of 7 patches:

1/7: Sloppy rss

Removes mm->rss usage from mm/rmap.c and ensures that negative rss values
are not displayed.

2/7: Avoid page_table_lock in handle_mm_fault

This patch defers the acquisition of the page_table_lock as much as
possible and uses atomic operations for allocating anonymous memory.
These atomic operations are simulated by acquiring the page_table_lock
for very small time frames if an architecture does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes the swapper so that a
pte will not be set to empty if a page is in transition to swap.
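
The swapper change can be pictured roughly like this. Sketch only: the real
hunk lives in the swap-out path, and the ptep_xchg_flush() signature shown here
is inferred from the changelog rather than quoted from the patch.

	/*
	 * Sketch: exchange the pte directly for the swap entry instead of
	 * clearing it first, so a concurrent lockless fault never sees an
	 * empty pte for a page that is merely in transition to swap.
	 */
	static void swap_out_pte_sketch(struct vm_area_struct *vma, unsigned long addr,
					pte_t *ptep, swp_entry_t entry)
	{
		pte_t old;

		/* assumed helper from this series: atomically replace *ptep and flush */
		old = ptep_xchg_flush(vma, addr, ptep, swp_entry_to_pte(entry));

		if (pte_dirty(old))
			set_page_dirty(pte_page(old));
	}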

If only the first two patches are applied then the time that the page_table_lock
is held is simply reduced. The lock may then be acquired multiple
times during a page fault.

The remaining patches introduce the necessary atomic pte operations to avoid
the page_table_lock.

3/7: Atomic pte operations for ia64

4/7: Make cmpxchg generally available on i386

The atomic operations on the page table rely heavily on cmpxchg instructions.
This patch adds emulations for cmpxchg and cmpxchg8b for old 80386 and 80486
cpus. The emulations are only included if a kernel is built for these old
cpus, and they are skipped in favor of the real cmpxchg instructions if a
kernel built for a 386 or 486 is then run on a more recent cpu.

This patch may be used independently of the other patches.
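
The emulation idea can be sketched as follows (not the patch itself): a kernel
built for 386/486 only runs such cpus in UP configurations, so disabling
interrupts is enough to make the read-compare-write sequence appear atomic.

	/*
	 * Sketch of a cmpxchg emulation for cpus without the instruction.
	 * The 8-byte (cmpxchg8b) case works the same way on two words.
	 */
	static inline unsigned long cmpxchg_emu_u32(volatile void *ptr,
						    unsigned long old, unsigned long new)
	{
		unsigned long flags, prev;

		local_irq_save(flags);
		prev = *(volatile unsigned long *)ptr;
		if (prev == old)
			*(volatile unsigned long *)ptr = new;
		local_irq_restore(flags);
		return prev;
	}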

5/7: Atomic pte operations for i386

A generally available cmpxchg (last patch) must be available for this patch to
preserve the ability to build kernels for 386 and 486.

6/7: Atomic pte operation for x86_64

7/7: Atomic pte operations for s390
Christoph Lameter
2004-11-19 19:43:30 UTC
Changelog
* Enable sloppy updates of mm->rss and mm->anon_rss without atomic operations or locking
* Ensure that negative rss values are not given out by the /proc filesystem
* remove 3 checks of rss in mm/rmap.c
* Prerequisite for page table scalability patch

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-11-15 11:13:39.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-11-18 13:04:30.000000000 -0800
@@ -216,7 +216,7 @@
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
- spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
+ spinlock_t page_table_lock; /* Protects page tables */

struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
@@ -252,6 +252,21 @@
struct kioctx default_kioctx;
};

+/*
+ * rss and anon_rss are incremented and decremented in some locations without
+ * proper locking. This function insures that these values do not become negative.
+ */
+static long inline get_rss(struct mm_struct *mm)
+{
+ long rss = mm->rss;
+
+ if (rss < 0)
+ mm->rss = rss = 0;
+ if ((long)mm->anon_rss < 0)
+ mm->anon_rss = 0;
+ return rss;
+}
+
struct sighand_struct {
atomic_t count;
struct k_sigaction action[_NSIG];
Index: linux-2.6.9/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9.orig/fs/proc/task_mmu.c 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/fs/proc/task_mmu.c 2004-11-18 12:56:26.000000000 -0800
@@ -22,7 +22,7 @@
"VmPTE:\t%8lu kB\n",
(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- mm->rss << (PAGE_SHIFT-10),
+ get_rss(mm) << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
@@ -37,7 +37,9 @@
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = mm->rss - mm->anon_rss;
+ *shared = get_rss(mm) - mm->anon_rss;
+ if (*shared <0)
+ *shared = 0;
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
Index: linux-2.6.9/fs/proc/array.c
===================================================================
--- linux-2.6.9.orig/fs/proc/array.c 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/fs/proc/array.c 2004-11-18 12:53:16.000000000 -0800
@@ -420,7 +420,7 @@
jiffies_to_clock_t(task->it_real_value),
start_time,
vsize,
- mm ? mm->rss : 0, /* you might want to shift this left 3 */
+ mm ? get_rss(mm) : 0, /* you might want to shift this left 3 */
rsslim,
mm ? mm->start_code : 0,
mm ? mm->end_code : 0,
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-15 11:13:40.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-18 12:26:45.000000000 -0800
@@ -263,8 +263,6 @@
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -504,8 +502,6 @@
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -788,8 +784,7 @@
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
- cursor < max_nl_cursor &&
+ while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;

Hugh Dickins
2004-11-19 20:50:59 UTC
Sorry, against what tree do these patches apply?
Apparently not linux-2.6.9, nor latest -bk, nor -mm?

Hugh

Christoph Lameter
2004-11-20 01:29:48 UTC
2.6.10-rc2-bk3
Post by Hugh Dickins
Sorry, against what tree do these patches apply?
Apparently not linux-2.6.9, nor latest -bk, nor -mm?
Hugh Dickins
2004-11-22 15:00:37 UTC
Post by Christoph Lameter
Post by Hugh Dickins
Sorry, against what tree do these patches apply?
Apparently not linux-2.6.9, nor latest -bk, nor -mm?
2.6.10-rc2-bk3
Ah, thanks - got it patched now, but your mailer (or something else)
is eating trailing spaces. Better than adding them, but we have to
apply this patch before your set:

--- 2.6.10-rc2-bk3/include/asm-i386/system.h 2004-11-15 16:21:12.000000000 +0000
+++ linux/include/asm-i386/system.h 2004-11-22 14:44:30.761904592 +0000
@@ -273,9 +273,9 @@ static inline unsigned long __cmpxchg(vo
#define cmpxchg(ptr,o,n)\
((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
(unsigned long)(n),sizeof(*(ptr))))
-
+
#ifdef __KERNEL__
-struct alt_instr {
+struct alt_instr {
__u8 *instr; /* original instruction */
__u8 *replacement;
__u8 cpuid; /* cpuid bit set for replacement */
--- 2.6.10-rc2-bk3/include/asm-s390/pgalloc.h 2004-05-10 03:33:39.000000000 +0100
+++ linux/include/asm-s390/pgalloc.h 2004-11-22 14:54:43.704723120 +0000
@@ -99,7 +99,7 @@ static inline void pgd_populate(struct m

#endif /* __s390x__ */

-static inline void
+static inline void
pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
{
#ifndef __s390x__
--- 2.6.10-rc2-bk3/mm/memory.c 2004-11-18 17:56:11.000000000 +0000
+++ linux/mm/memory.c 2004-11-22 14:39:33.924030808 +0000
@@ -1424,7 +1424,7 @@ out:
/*
* We are called with the MM semaphore and page_table_lock
* spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * multithreaded programs.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1615,7 +1615,7 @@ static int do_file_page(struct mm_struct
* Fall back to the linear mapping if the fs does not support
* ->populate:
*/
- if (!vma->vm_ops || !vma->vm_ops->populate ||
+ if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
return do_no_page(mm, vma, address, write_access, pte, pmd);

Christoph Lameter
2004-11-22 21:50:55 UTC
One way to solve the rss issues is--as discussed--to put rss into the
task structure and then have the page fault increment that rss.

The problem is then that the proc filesystem must do an extensive scan
over all threads to find users of a certain mm_struct.

The following patch does put the rss into task_struct. The page fault
handler is then incrementing current->rss if the page_table_lock is not
held.

The timer interrupt checks if task->rss is non-zero when doing
stime/utime updates (rss is defined near those fields, so it is hopefully on
the same cacheline and has a minimal impact).

If rss is non-zero and the page_table_lock and the mmap_sem can be taken,
then mm->rss will be updated by the value of task->rss and
task->rss will be zeroed. This avoids all proc issues. The only
disadvantage is that rss may be inaccurate for a couple of clock ticks.

This also improves performance a bit (sorry, only a 4p system):

sloppy rss:

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.593s 29.897s 30.050s 85973.585 85948.565
4 10 2 0.616s 42.184s 22.045s 61247.450 116719.558
4 10 4 0.559s 44.918s 14.076s 57641.255 177553.945

deferred rss:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.565s 29.429s 30.000s 87395.518 87366.580
4 10 2 0.500s 33.514s 18.002s 77067.935 145426.659
4 10 4 0.533s 44.455s 14.085s 58269.368 176413.196

Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-11-15 11:13:39.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-11-22 13:18:58.000000000 -0800
@@ -584,6 +584,10 @@
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
unsigned long utime, stime;
+ long rss; /* rss counter when mm->rss is not usable. mm->page_table_lock
+ * not held but mm->mmap_sem must be held for sync with
+ * the timer interrupt which clears rss and updates mm->rss.
+ */
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time;
/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-22 11:16:02.000000000 -0800
@@ -263,8 +263,6 @@
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -507,8 +505,6 @@
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -791,8 +787,7 @@
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
- cursor < max_nl_cursor &&
+ while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;
Index: linux-2.6.9/kernel/fork.c
===================================================================
--- linux-2.6.9.orig/kernel/fork.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/kernel/fork.c 2004-11-22 11:16:02.000000000 -0800
@@ -876,6 +876,7 @@
p->io_context = NULL;
p->io_wait = NULL;
p->audit_context = NULL;
+ p->rss = 0;
#ifdef CONFIG_NUMA
p->mempolicy = mpol_copy(p->mempolicy);
if (IS_ERR(p->mempolicy)) {
Index: linux-2.6.9/kernel/exit.c
===================================================================
--- linux-2.6.9.orig/kernel/exit.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/kernel/exit.c 2004-11-22 11:16:02.000000000 -0800
@@ -501,6 +501,9 @@
/* more a memory barrier than a real lock */
task_lock(tsk);
tsk->mm = NULL;
+ /* only holding mmap_sem here maybe get page_table_lock too? */
+ mm->rss += tsk->rss;
+ tsk->rss = 0;
up_read(&mm->mmap_sem);
enter_lazy_tlb(mm, current);
task_unlock(tsk);
Index: linux-2.6.9/kernel/timer.c
===================================================================
--- linux-2.6.9.orig/kernel/timer.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/kernel/timer.c 2004-11-22 11:42:12.000000000 -0800
@@ -815,6 +815,15 @@
if (psecs / HZ >= p->signal->rlim[RLIMIT_CPU].rlim_max)
send_sig(SIGKILL, p, 1);
}
+ /* Update mm->rss if necessary */
+ if (p->rss && p->mm && down_write_trylock(&p->mm->mmap_sem)) {
+ if (spin_trylock(&p->mm->page_table_lock)) {
+ p->mm->rss += p->rss;
+ p->rss = 0;
+ spin_unlock(&p->mm->page_table_lock);
+ }
+ up_write(&p->mm->mmap_sem);
+ }
}

static inline void do_it_virt(struct task_struct * p, unsigned long ticks)

Andrew Morton
2004-11-22 22:11:48 UTC
Post by Christoph Lameter
One way to solve the rss issues is--as discussed--to put rss into the
task structure and then have the page fault increment that rss.
The problem is then that the proc filesystem must do an extensive scan
over all threads to find users of a certain mm_struct.
The following patch does put the rss into task_struct. The page fault
handler is then incrementing current->rss if the page_table_lock is not
held.
The timer interrupt checks if task->rss is non zero (when doing
stime/utime updates. rss is defined near those so its hopefully on the
same cacheline and has a minimal impact).
If rss is non zero and the page_table_lock and the mmap_sem can be taken
then the mm->rss will be updated by the value of the task->rss and
task->rss will be zeroed. This avoids all proc issues. The only
disadvantage is that rss may be inaccurate for a couple of clock ticks.
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.593s 29.897s 30.050s 85973.585 85948.565
4 10 2 0.616s 42.184s 22.045s 61247.450 116719.558
4 10 4 0.559s 44.918s 14.076s 57641.255 177553.945
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.565s 29.429s 30.000s 87395.518 87366.580
4 10 2 0.500s 33.514s 18.002s 77067.935 145426.659
4 10 4 0.533s 44.455s 14.085s 58269.368 176413.196
hrm. I cannot see anywhere in this patch where you update task_struct.rss.
Post by Christoph Lameter
Index: linux-2.6.9/kernel/exit.c
===================================================================
--- linux-2.6.9.orig/kernel/exit.c 2004-11-22 09:51:58.000000000 -0800
+++ linux-2.6.9/kernel/exit.c 2004-11-22 11:16:02.000000000 -0800
@@ -501,6 +501,9 @@
/* more a memory barrier than a real lock */
task_lock(tsk);
tsk->mm = NULL;
+ /* only holding mmap_sem here maybe get page_table_lock too? */
+ mm->rss += tsk->rss;
+ tsk->rss = 0;
up_read(&mm->mmap_sem);
mmap_sem needs to be held for writing, surely?
Post by Christoph Lameter
+ /* Update mm->rss if necessary */
+ if (p->rss && p->mm && down_write_trylock(&p->mm->mmap_sem)) {
+ if (spin_trylock(&p->mm->page_table_lock)) {
+ p->mm->rss += p->rss;
+ p->rss = 0;
+ spin_unlock(&p->mm->page_table_lock);
+ }
+ up_write(&p->mm->mmap_sem);
+ }
}
I'd also suggest that you do:

tsk->rss++;
if (tsk->rss > 16) {
spin_lock(&mm->page_table_lock);
mm->rss += tsk->rss;
spin_unlock(&mm->page_table_lock);
tsk->rss = 0;
}

just to prevent transient gross inaccuracies. For some value of "16".
Christoph Lameter
2004-11-22 22:13:06 UTC
Post by Andrew Morton
hrm. I cannot see anywhere in this patch where you update task_struct.rss.
This is just the piece around it dealing with rss. The updating of rss
happens in the generic code. The change to that is trivial. I can repost
the whole shebang if you want.
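
For reference, the generic change being referred to is presumably of this
shape (a sketch, not the actual hunk, which is not included in this mail):

	/*
	 * Sketch: where the lockless fault path used to do mm->rss++, account
	 * against the task instead; the timer tick later folds task->rss into
	 * mm->rss under the proper locks.
	 */
	static inline void fault_account_rss(struct task_struct *tsk,
					     struct mm_struct *mm,
					     int page_table_lock_held)
	{
		if (page_table_lock_held)
			mm->rss++;	/* serialized by the lock */
		else
			tsk->rss++;	/* spilled into mm->rss by the timer tick */
	}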
Post by Andrew Morton
Post by Christoph Lameter
+ /* only holding mmap_sem here maybe get page_table_lock too? */
+ mm->rss += tsk->rss;
+ tsk->rss = 0;
up_read(&mm->mmap_sem);
mmap_sem needs to be held for writing, surely?
If there are no page faults occurring anymore then we would not need to
get the lock. Q: Is it safe to assume that no faults occur
anymore at this point?
Post by Andrew Morton
just to prevent transient gross inaccuracies. For some value of "16".
The page fault code only increments rss. For larger transactions that
increase / decrease rss significantly the page_table_lock is taken and
mm->rss is updated directly. So no
gross inaccuracies can result.
Benjamin Herrenschmidt
2004-11-22 22:17:24 UTC
Post by Christoph Lameter
Post by Andrew Morton
hrm. I cannot see anywhere in this patch where you update task_struct.rss.
This is just the piece around it dealing with rss. The updating of rss
happens in the generic code. The change to that is trivial. I can repost
the whole shebang if you want.
Post by Andrew Morton
Post by Christoph Lameter
+ /* only holding mmap_sem here maybe get page_table_lock too? */
+ mm->rss += tsk->rss;
+ tsk->rss = 0;
up_read(&mm->mmap_sem);
mmap_sem needs to be held for writing, surely?
If there are no page faults occurring anymore then we would not need to
get the lock. Q: Is it safe to assume that no faults occur
anymore at this point?
Why wouldn't the mm take faults on other CPUs? (other threads)
Post by Christoph Lameter
Post by Andrew Morton
just to prevent transient gross inaccuracies. For some value of "16".
The page fault code only increments rss. For larger transactions that
increase / decrease rss significantly the page_table_lock is taken and
mm->rss is updated directly. So no
gross inaccuracies can result.
--
Benjamin Herrenschmidt <***@kernel.crashing.org>

Andrew Morton
2004-11-22 22:45:07 UTC
Post by Christoph Lameter
Post by Andrew Morton
just to prevent transient gross inaccuracies. For some value of "16".
The page fault code only increments rss. For larger transactions that
increase / decrease rss significantly the page_table_lock is taken and
mm->rss is updated directly. So no
gross inaccuracies can result.
Sure. Take a million successive pagefaults and mm->rss is grossly
inaccurate. Hence my suggestion that task->rss be spilled into mm->rss
periodically.

Christoph Lameter
2004-11-22 22:48:22 UTC
Post by Andrew Morton
Post by Christoph Lameter
The page fault code only increments rss. For larger transactions that
increase / decrease rss significantly the page_table_lock is taken and
mm->rss is updated directly. So no
gross inaccuracies can result.
Sure. Take a million successive pagefaults and mm->rss is grossly
inaccurate. Hence my suggestion that it be spilled into mm->rss
periodically.
It is spilled into mm->rss periodically. That is the whole point of the
patch.

The timer tick occurs every 1 ms. The maximum pagefault frequency that I
have seen is 500000 faults/second. The max deviation is therefore
less than 500 (it could be greater if the page_table_lock / mmap_sem is
always held when the tick occurs).
Andrew Morton
2004-11-22 23:16:28 UTC
Post by Christoph Lameter
Post by Andrew Morton
Post by Christoph Lameter
The page fault code only increments rss. For larger transactions that
increase / decrease rss significantly the page_table_lock is taken and
mm->rss is updated directly. So no
gross inaccuracies can result.
Sure. Take a million successive pagefaults and mm->rss is grossly
inaccurate. Hence my suggestion that it be spilled into mm->rss
periodically.
It is spilled into mm->rss periodically. That is the whole point of the
patch.
The timer tick occurs every 1 ms.
That only works if the task happens to have the CPU when the timer tick
occurs. There remains no theoretical upper bound to the error in mm->rss,
and that's very easy to fix.
Christoph Lameter
2004-11-22 23:19:36 UTC
Post by Andrew Morton
Post by Christoph Lameter
The timer tick occurs every 1 ms.
That only works if the task happens to have the CPU when the timer tick
occurs. There remains no theoretical upper bound to the error in mm->rss,
and that's very easy to fix.
Page fault intensive programs typically hog the cpu, but you are in
principle right.

The "easy fix" that you propose adds additional logic to some very
hot code paths. In that case we are probably better off with another approach.

Christoph Lameter
2004-11-22 23:13:25 UTC
Post by Christoph Lameter
The timer tick occurs every 1 ms. The maximum pagefault frequency that I
have seen is 500000 faults /second. The max deviation is therefore
less than 500 (could be greater if page table lock / mmap_sem always held
when the tick occurs).
Post by Nick Piggin
I think that by the time you get the spilling code in, the mm-list method
will be looking positively elegant!
I do not care what gets in as long as something goes in to address the
performance issues. So far everyone seems to have their pet ideas. By all
means do the mm-list method and post it. But we have already seen
objections by others against loops in /proc. So that will also cause
additional controversy.
Nick Piggin
2004-11-22 23:09:34 UTC
Post by Christoph Lameter
Post by Andrew Morton
Post by Christoph Lameter
The page fault code only increments rss. For larger transactions that
increase / decrease rss significantly the page_table_lock is taken and
mm->rss is updated directly. So no
gross inaccuracies can result.
Sure. Take a million successive pagefaults and mm->rss is grossly
inaccurate. Hence my suggestion that it be spilled into mm->rss
periodically.
It is spilled into mm->rss periodically. That is the whole point of the
patch.
The timer tick occurs every 1 ms. The maximum pagefault frequency that I
have seen is 500000 faults /second. The max deviation is therefore
less than 500 (could be greater if page table lock / mmap_sem always held
when the tick occurs).
You could imagine a situation where something pagefaults and sleeps in
lock-step with the timer though. Theoretical problem only?

I think that by the time you get the spilling code in, the mm-list method
will be looking positively elegant!

Linus Torvalds
2004-11-22 22:22:30 UTC
Post by Christoph Lameter
The problem is then that the proc filesystem must do an extensive scan
over all threads to find users of a certain mm_struct.
The alternative is to just add a simple list into the task_struct and the
head of it into mm_struct. Then, at fork, you just finish the fork() with

list_add(p->mm_list, p->mm->thread_list);

and do the proper list_del() in exit_mm() or wherever.

You'll still loop in /proc, but you'll do the minimal loop necessary.
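
A minimal sketch of that bookkeeping (field and helper names are illustrative,
not from a posted patch):

	/*
	 * Sketch: keep a per-mm list of the tasks sharing the mm.
	 * mm->task_list and task->mm_tasks are invented names.
	 */
	static inline void mm_add_task(struct mm_struct *mm, struct task_struct *p)
	{
		spin_lock(&mm->page_table_lock);	/* or a dedicated lock, or RCU */
		list_add(&p->mm_tasks, &mm->task_list);
		spin_unlock(&mm->page_table_lock);
	}

	static inline void mm_del_task(struct mm_struct *mm, struct task_struct *p)
	{
		spin_lock(&mm->page_table_lock);
		list_del(&p->mm_tasks);
		spin_unlock(&mm->page_table_lock);
	}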

Linus
Christoph Lameter
2004-11-22 22:27:07 UTC
Post by Linus Torvalds
The alternative is to just add a simple list into the task_struct and the
head of it into mm_struct. Then, at fork, you just finish the fork() with
list_add(p->mm_list, p->mm->thread_list);
and do the proper list_del() in exit_mm() or wherever.
You'll still loop in /proc, but you'll do the minimal loop necessary.
I think the approach that I posted is simpler, unless there are other
benefits to be gained from making it easy to figure out which tasks use an
mm.

Linus Torvalds
2004-11-22 22:40:39 UTC
Post by Christoph Lameter
I think the approach that I posted is simpler unless there are other
benefits to be gained if it would be easy to figure out which tasks use an
mm.
I'm just worried that your timer tick thing won't catch things in a timely
manner. That said, maybe that isn't an issue, and people don't have problems
with it. On the other hand, if /proc literally is the only real user, then
I guess it really can't matter.

Linus
Christoph Lameter
2004-12-01 23:41:06 UTC
Changes from V11->V12 of this patch:
- dump sloppy_rss in favor of list_rss (Linus' proposal)
- keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)

This is a series of patches that increases the scalability of
the page fault handler for SMP. Here are some performance results
on a machine with 512 processors allocating 32 GB with an increasing
number of threads (that are assigned a processor each).

Without the patches:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
32 3 2 1.397s 148.523s 78.044s 41965.149 80201.646
32 3 4 1.390s 152.618s 44.044s 40851.258 141545.239
32 3 8 1.500s 374.008s 53.001s 16754.519 118671.950
32 3 16 1.415s 1051.759s 73.094s 5973.803 85087.358
32 3 32 1.867s 3400.417s 117.003s 1849.186 53754.928
32 3 64 5.361s 11633.040s 197.034s 540.577 31881.112
32 3 128 23.387s 39386.390s 332.055s 159.642 18918.599
32 3 256 15.409s 20031.450s 168.095s 313.837 37237.918
32 3 512 18.720s 10338.511s 86.047s 607.446 72752.686

With the patches:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.451s 140.151s 141.060s 44430.367 44428.115
32 3 2 1.399s 136.349s 73.041s 45673.303 85699.793
32 3 4 1.321s 129.760s 39.027s 47996.303 160197.217
32 3 8 1.279s 100.648s 20.039s 61724.641 308454.557
32 3 16 1.414s 153.975s 15.090s 40488.236 395681.716
32 3 32 2.534s 337.021s 17.016s 18528.487 366445.400
32 3 64 4.271s 709.872s 18.057s 8809.787 338656.440
32 3 128 18.734s 1805.094s 21.084s 3449.586 288005.644
32 3 256 14.698s 963.787s 11.078s 6429.787 534077.540
32 3 512 15.299s 453.990s 5.098s 13406.321 1050416.414

For more than 8 cpus the page fault rate increases by orders
of magnitude. For more than 64 cpus the improvement in performance
is tenfold.

The performance increase is accomplished by avoiding the use of the
page_table_lock spinlock (but not mm->mmap_sem!) through new atomic
operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd and pgd's
(pgd_test_and_populate, pmd_test_and_populate).

The page table lock can be avoided in the following situations:

1. An empty pte or pmd entry is populated

This is safe since the swapper may only depopulate them and the
swapper code has been changed to never set a pte to be empty until the
page has been evicted. The population of an empty pte is frequent
if a process touches newly allocated memory.

2. Modifications of flags in a pte entry (write/accessed).

These modifications are done by the CPU or by low level handlers
on various platforms also bypassing the page_table_lock. So this
seems to be safe too.

One essential change in the VM is the use of pte_cmpxchg (or its
generic emulation) on page table entries before doing an
update_mmu_cache without holding the page table lock. However, we do
similar things now with other atomic pte operations such as
ptep_get_and_clear and ptep_test_and_clear_dirty. These operations
clear a pte *after* doing an operation on it. The ptep_cmpxchg as used
in this patch operates on a *cleared* pte and replaces it with a pte
pointing to valid memory. The effect of this change on various
architectures has to be thought through. Local definitions of
ptep_cmpxchg and ptep_xchg may be necessary.

For IA64 an icache coherency issue may arise that potentially requires
the flushing of the icache (as done via update_mmu_cache on IA64) prior
to the use of ptep_cmpxchg. Similar issues may arise on other platforms.

The patch introduces a split counter for rss handling to avoid atomic
operations and locks currently necessary for rss modifications. In
addition to mm->rss, tsk->rss is introduced. tsk->rss is defined to be
in the same cache line as tsk->mm (which is already used by the fault
handler) and thus tsk->rss can be incremented without locks
in a fast way. The cache line does not need to be shared between
processors in the page table handler.

A tasklist is generated for each mm (rcu based). Values in that list
are added up to calculate rss or anon_rss values.
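
A sketch of how the readers might then compute rss (illustrative only; list
and field names are assumptions, and the posted 7/7 patch may differ):

	/*
	 * Sketch: sum the per-task counters of all tasks attached to the mm.
	 * mm->task_list, task->mm_tasks and task->rss are assumed names.
	 */
	static unsigned long mm_total_rss(struct mm_struct *mm)
	{
		struct task_struct *tsk;
		long rss = mm->rss;

		rcu_read_lock();
		list_for_each_entry_rcu(tsk, &mm->task_list, mm_tasks)
			rss += tsk->rss;
		rcu_read_unlock();

		return rss < 0 ? 0 : rss;	/* clamp transient negative values */
	}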

The patchset is composed of 7 patches:

1/7: Avoid page_table_lock in handle_mm_fault

This patch defers the acquisition of the page_table_lock as much as
possible and uses atomic operations for allocating anonymous memory.
These atomic operations are simulated by acquiring the page_table_lock
for very small time frames if an architecture does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes the swapper so that a
pte will not be set to empty if a page is in transition to swap.

If only the first two patches are applied then the time that the
page_table_lock is held is simply reduced. The lock may then be
acquired multiple times during a page fault.

2/7: Atomic pte operations for ia64

3/7: Make cmpxchg generally available on i386

The atomic operations on the page table rely heavily on cmpxchg
instructions. This patch adds emulations for cmpxchg and cmpxchg8b
for old 80386 and 80486 cpus. The emulations are only included if a
kernel is build for these old cpus and are skipped for the real
cmpxchg instructions if the kernel that is build for 386 or 486 is
then run on a more recent cpu.

This patch may be used independently of the other patches.

4/7: Atomic pte operations for i386

A generally available cmpxchg (last patch) must be available for
this patch to preserve the ability to build kernels for 386 and 486.

5/7: Atomic pte operation for x86_64

6/7: Atomic pte operations for s390

7/7: Split counter implementation for rss
Add tsk->rss and tsk->anon_rss. Add tasklist. Add logic
to calculate rss from tasklist.

There are some additional outstanding performance enhancements that are
not available yet but which require this patch. Those modifications
push the maximum page fault rate from ~1 million faults per second as
shown above to above 3 million faults per second.

The last editions of the sloppy rss, atomic rss and deferred rss patches
will be posted to linux-ia64 for archival purposes.

Signed-off-by: Christoph Lameter <***@sgi.com>

Christoph Lameter
2004-12-01 23:42:05 UTC
Changelog
* Increase parallelism in SMP configurations by deferring
the acquisition of page_table_lock in handle_mm_fault
* Anonymous memory page faults bypass the page_table_lock
through the use of atomic page table operations
* Swapper does not set pte to empty in transition to swap
* Simulate atomic page table operations using the
page_table_lock if an arch does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides
a performance benefit since the page_table_lock
is held for shorter periods of time.

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-11-23 10:06:03.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-11-23 10:07:55.000000000 -0800
@@ -1330,8 +1330,7 @@
}

/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
*/
static int do_swap_page(struct mm_struct * mm,
struct vm_area_struct * vma, unsigned long address,
@@ -1343,15 +1342,13 @@
int ret = VM_FAULT_MINOR;

pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
/*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1374,8 +1371,7 @@
lock_page(page);

/*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1422,14 +1418,12 @@
}

/*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+ unsigned long addr, pte_t orig_entry)
{
pte_t entry;
struct page * page = ZERO_PAGE(addr);
@@ -1441,7 +1435,6 @@
if (write_access) {
/* Allocate our own private page. */
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
@@ -1450,30 +1443,37 @@
goto no_mem;
clear_user_highpage(page, addr);

- spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);

- if (!pte_none(*page_table)) {
- pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
- }
- mm->rss++;
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
vma->vm_page_prot)),
vma);
- lru_cache_add_active(page);
mark_page_accessed(page);
- page_add_anon_rmap(page, vma, addr);
}

- set_pte(page_table, entry);
+ /* update the entry */
+ if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
+ if (write_access) {
+ pte_unmap(page_table);
+ page_cache_release(page);
+ }
+ goto out;
+ }
+ if (write_access) {
+ /*
+ * These two functions must come after the cmpxchg
+ * because if the page is on the LRU then try_to_unmap may come
+ * in and unmap the pte.
+ */
+ lru_cache_add_active(page);
+ page_add_anon_rmap(page, vma, addr);
+ mm->rss++;
+
+ }
pte_unmap(page_table);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
- spin_unlock(&mm->page_table_lock);
out:
return VM_FAULT_MINOR;
no_mem:
@@ -1489,12 +1489,12 @@
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
*/
static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *page_table,
+ pmd_t *pmd, pte_t orig_entry)
{
struct page * new_page;
struct address_space *mapping = NULL;
@@ -1505,9 +1505,8 @@

if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
+ pmd, write_access, address, orig_entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1605,7 +1604,7 @@
* nonlinear vmas.
*/
static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
- unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
{
unsigned long pgoff;
int err;
@@ -1618,13 +1617,12 @@
if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
}

pgoff = pte_to_pgoff(*pte);

pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);

err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -1643,49 +1641,40 @@
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to insure to handle that case properly.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct * vma, unsigned long address,
int write_access, pte_t *pte, pmd_t *pmd)
{
pte_t entry;
+ pte_t new_entry;

entry = *pte;
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
if (pte_file(entry))
- return do_file_page(mm, vma, address, write_access, pte, pmd);
+ return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}

+ /*
+ * This is the case in which we only update some bits in the pte.
+ */
+ new_entry = pte_mkyoung(entry);
if (write_access) {
- if (!pte_write(entry))
+ if (!pte_write(entry)) {
+ /* do_wp_page expects us to hold the page_table_lock */
+ spin_lock(&mm->page_table_lock);
return do_wp_page(mm, vma, address, pte, pmd, entry);
-
- entry = pte_mkdirty(entry);
+ }
+ new_entry = pte_mkdirty(new_entry);
}
- entry = pte_mkyoung(entry);
- ptep_set_access_flags(vma, address, pte, entry, write_access);
- update_mmu_cache(vma, address, entry);
+ if (ptep_cmpxchg(vma, address, pte, entry, new_entry))
+ update_mmu_cache(vma, address, new_entry);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
}

@@ -1703,22 +1692,45 @@

inc_page_state(pgfault);

- if (is_vm_hugetlb_page(vma))
+ if (unlikely(is_vm_hugetlb_page(vma)))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */

/*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
+ * We rely on the mmap_sem and the SMP-safe atomic PTE updates.
+ * to synchronize with kswapd
*/
- spin_lock(&mm->page_table_lock);
- pmd = pmd_alloc(mm, pgd, address);
+ if (unlikely(pgd_none(*pgd))) {
+ pmd_t *new = pmd_alloc_one(mm, address);
+ if (!new)
+ return VM_FAULT_OOM;
+
+ /* Insure that the update is done in an atomic way */
+ if (!pgd_test_and_populate(mm, pgd, new))
+ pmd_free(new);
+ }
+
+ pmd = pmd_offset(pgd, address);
+
+ if (likely(pmd)) {
+ pte_t *pte;
+
+ if (!pmd_present(*pmd)) {
+ struct page *new;

- if (pmd) {
- pte_t * pte = pte_alloc_map(mm, pmd, address);
- if (pte)
+ new = pte_alloc_one(mm, address);
+ if (!new)
+ return VM_FAULT_OOM;
+
+ if (!pmd_test_and_populate(mm, pmd, new))
+ pte_free(new);
+ else
+ inc_page_state(nr_page_table_pages);
+ }
+
+ pte = pte_offset_map(pmd, address);
+ if (likely(pte))
return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
}
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_OOM;
}

Index: linux-2.6.9/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-generic/pgtable.h 2004-10-18 14:53:46.000000000 -0700
+++ linux-2.6.9/include/asm-generic/pgtable.h 2004-11-23 10:06:12.000000000 -0800
@@ -134,4 +134,60 @@
#define pgd_offset_gate(mm, addr) pgd_offset(mm, addr)
#endif

+#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
+/*
+ * If atomic page table operations are not available then use
+ * the page_table_lock to insure some form of locking.
+ * Note thought that low level operations as well as the
+ * page_table_handling of the cpu may bypass all locking.
+ */
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval) \
+({ \
+ int __rc; \
+ spin_lock(&__vma->vm_mm->page_table_lock); \
+ __rc = pte_same(*(__ptep), __oldval); \
+ if (__rc) set_pte(__ptep, __newval); \
+ spin_unlock(&__vma->vm_mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PGP_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pmd) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pgd_present(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pmd); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __p = __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)));\
+ flush_tlb_page(__vma, __address); \
+ __p; \
+})
+
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-23 10:06:03.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-23 10:06:12.000000000 -0800
@@ -424,7 +424,10 @@
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
*
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
*/
void page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
@@ -568,11 +571,6 @@

/* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
-
- /* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
- set_page_dirty(page);

if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
@@ -587,11 +585,15 @@
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- set_pte(pte, swp_entry_to_pte(entry));
+ pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
mm->anon_rss--;
- }
+ } else
+ pteval = ptep_clear_flush(vma, address, pte);

+ /* Move the dirty bit to the physical page now the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
mm->rss--;
page_remove_rmap(page);
page_cache_release(page);
@@ -678,15 +680,21 @@
if (ptep_clear_flush_young(vma, address, pte))
continue;

- /* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
+ /*
+ * There would be a race here with handle_mm_fault and do_anonymous_page
+ * which bypasses the page_table_lock if we would zap the pte before
+ * putting something into it. On the other hand we need to
+ * have the dirty flag setting at the time we replaced the value.
+ */

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
- set_pte(pte, pgoff_to_pte(page->index));
+ pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+ else
+ pteval = ptep_get_and_clear(pte);

- /* Move the dirty bit to the physical page now the pte is gone. */
+ /* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);


Christoph Lameter
2004-12-01 23:42:51 UTC
Changelog
* Provide atomic pte operations for ia64
* Enhanced parallelism in page fault handler if applied together
with the generic patch

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgalloc.h 2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgalloc.h 2004-11-19 07:54:19.000000000 -0800
@@ -34,6 +34,10 @@
#define pmd_quicklist (local_cpu_data->pmd_quick)
#define pgtable_cache_size (local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE 0
+#define PGD_NONE 0
+
static inline pgd_t*
pgd_alloc_one_fast (struct mm_struct *mm)
{
@@ -78,12 +82,19 @@
preempt_enable();
}

+
static inline void
pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
{
pgd_val(*pgd_entry) = __pa(pmd);
}

+/* Atomic populate */
+static inline int
+pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
+{
+ return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PGD_NONE) == PGD_NONE;
+}

static inline pmd_t*
pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
@@ -132,6 +143,13 @@
pmd_val(*pmd_entry) = page_to_phys(pte);
}

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+ return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
static inline void
pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
{
Index: linux-2.6.9/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgtable.h 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/include/asm-ia64/pgtable.h 2004-11-19 07:55:35.000000000 -0800
@@ -30,6 +30,8 @@
#define _PAGE_P_BIT 0
#define _PAGE_A_BIT 5
#define _PAGE_D_BIT 6
+#define _PAGE_IG_BITS 53
+#define _PAGE_LOCK_BIT (_PAGE_IG_BITS+3) /* bit 56. Aligned to 8 bits */

#define _PAGE_P (1 << _PAGE_P_BIT) /* page present bit */
#define _PAGE_MA_WB (0x0 << 2) /* write back memory attribute */
@@ -58,6 +60,7 @@
#define _PAGE_PPN_MASK (((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL)
#define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */
#define _PAGE_PROTNONE (__IA64_UL(1) << 63)
+#define _PAGE_LOCK (__IA64_UL(1) << _PAGE_LOCK_BIT)

/* Valid only for a PTE with the present bit cleared: */
#define _PAGE_FILE (1 << 1) /* see swap & file pte remarks below */
@@ -270,6 +273,8 @@
#define pte_dirty(pte) ((pte_val(pte) & _PAGE_D) != 0)
#define pte_young(pte) ((pte_val(pte) & _PAGE_A) != 0)
#define pte_file(pte) ((pte_val(pte) & _PAGE_FILE) != 0)
+#define pte_locked(pte) ((pte_val(pte) & _PAGE_LOCK)!=0)
+
/*
* Note: we convert AR_RWX to AR_RX and AR_RW to AR_R by clearing the 2nd bit in the
* access rights:
@@ -281,8 +286,15 @@
#define pte_mkyoung(pte) (__pte(pte_val(pte) | _PAGE_A))
#define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D))
#define pte_mkdirty(pte) (__pte(pte_val(pte) | _PAGE_D))
+#define pte_mkunlocked(pte) (__pte(pte_val(pte) & ~_PAGE_LOCK))

/*
+ * Lock functions for pte's
+ */
+#define ptep_lock(ptep) test_and_set_bit(_PAGE_LOCK_BIT, ptep)
+#define ptep_unlock(ptep) { clear_bit(_PAGE_LOCK_BIT,ptep); smp_mb__after_clear_bit(); }
+#define ptep_unlock_set(ptep, val) set_pte(ptep, pte_mkunlocked(val))
+/*
* Macro to a page protection value as "uncacheable". Note that "protection" is really a
* misnomer here as the protection value contains the memory attribute bits, dirty bits,
* and various other bits as well.
@@ -342,7 +354,6 @@
#define pte_unmap_nested(pte) do { } while (0)

/* atomic versions of the some PTE manipulations: */
-
static inline int
ptep_test_and_clear_young (pte_t *ptep)
{
@@ -414,6 +425,26 @@
#endif
}

+/*
+ * IA-64 doesn't have any external MMU info: the page tables contain all the necessary
+ * information. However, we use this routine to take care of any (delayed) i-cache
+ * flushing that may be necessary.
+ */
+extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
+
+static inline int
+ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ /*
+ * IA64 defers icache flushes. If the new pte is executable we may
+ * have to flush the icache to insure cache coherency immediately
+ * after the cmpxchg.
+ */
+ if (pte_exec(newval))
+ update_mmu_cache(vma, addr, newval);
+ return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+
static inline int
pte_same (pte_t a, pte_t b)
{
@@ -476,13 +507,6 @@
struct vm_area_struct * prev, unsigned long start, unsigned long end);
#endif

-/*
- * IA-64 doesn't have any external MMU info: the page tables contain all the necessary
- * information. However, we use this routine to take care of any (delayed) i-cache
- * flushing that may be necessary.
- */
-extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
-
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
* Update PTEP with ENTRY, which is guaranteed to be a less
@@ -560,6 +584,8 @@
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+#define __HAVE_ARCH_LOCK_TABLE_OPS
#include <asm-generic/pgtable.h>

#endif /* _ASM_IA64_PGTABLE_H */

Christoph Lameter
2004-12-01 23:43:20 UTC
Changelog
* Make cmpxchg and cmpxchg8b generally available on the i386
platform.
* Provide emulation of cmpxchg suitable for uniprocessor systems if
built and run on a 386.
* Provide emulation of cmpxchg8b suitable for uniprocessor
systems if built and run on a 386 or 486.
* Provide an inline function to atomically get a 64 bit value
via cmpxchg8b in an SMP system (courtesy of Nick Piggin)
(important for i386 PAE mode and other places where atomic
64 bit operations are useful)

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/arch/i386/Kconfig
===================================================================
--- linux-2.6.9.orig/arch/i386/Kconfig 2004-11-15 11:13:34.000000000 -0800
+++ linux-2.6.9/arch/i386/Kconfig 2004-11-19 10:02:54.000000000 -0800
@@ -351,6 +351,11 @@
depends on !M386
default y

+config X86_CMPXCHG8B
+ bool
+ depends on !M386 && !M486
+ default y
+
config X86_XADD
bool
depends on !M386
Index: linux-2.6.9/arch/i386/kernel/cpu/intel.c
===================================================================
--- linux-2.6.9.orig/arch/i386/kernel/cpu/intel.c 2004-11-15 11:13:34.000000000 -0800
+++ linux-2.6.9/arch/i386/kernel/cpu/intel.c 2004-11-19 10:38:26.000000000 -0800
@@ -6,6 +6,7 @@
#include <linux/bitops.h>
#include <linux/smp.h>
#include <linux/thread_info.h>
+#include <linux/module.h>

#include <asm/processor.h>
#include <asm/msr.h>
@@ -287,5 +288,103 @@
return 0;
}

+#ifndef CONFIG_X86_CMPXCHG
+unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new)
+{
+ u8 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u8));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u8 *)ptr;
+ if (prev == old)
+ *(u8 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u8);
+
+unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new)
+{
+ u16 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u16));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u16 *)ptr;
+ if (prev == old)
+ *(u16 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u16);
+
+unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new)
+{
+ u32 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u32));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u32 *)ptr;
+ if (prev == old)
+ *(u32 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u32);
+#endif
+
+#ifndef CONFIG_X86_CMPXCHG8B
+unsigned long long cmpxchg8b_486(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ unsigned long flags;
+
+ /*
+ * Check if the kernel was compiled for an old cpu but
+ * we are running really on a cpu capable of cmpxchg8b
+ */
+
+ if (cpu_has(cpu_data, X86_FEATURE_CX8))
+ return __cmpxchg8b(ptr, old, newv);
+
+ /* Poor mans cmpxchg8b for 386 and 486. Not suitable for SMP */
+ local_irq_save(flags);
+ prev = *ptr;
+ if (prev == old)
+ *ptr = newv;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg8b_486);
+#endif
+
// arch_initcall(intel_cpu_init);

Index: linux-2.6.9/include/asm-i386/system.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/system.h 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/include/asm-i386/system.h 2004-11-19 10:49:46.000000000 -0800
@@ -149,6 +149,9 @@
#define __xg(x) ((struct __xchg_dummy *)(x))


+#define ll_low(x) *(((unsigned int*)&(x))+0)
+#define ll_high(x) *(((unsigned int*)&(x))+1)
+
/*
* The semantics of XCHGCMP8B are a bit strange, this is why
* there is a loop and the loading of %%eax and %%edx has to
@@ -184,8 +187,6 @@
{
__set_64bit(ptr,(unsigned int)(value), (unsigned int)((value)>>32ULL));
}
-#define ll_low(x) *(((unsigned int*)&(x))+0)
-#define ll_high(x) *(((unsigned int*)&(x))+1)

static inline void __set_64bit_var (unsigned long long *ptr,
unsigned long long value)
@@ -203,6 +204,26 @@
__set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
__set_64bit(ptr, ll_low(value), ll_high(value)) )

+static inline unsigned long long __get_64bit(unsigned long long * ptr)
+{
+ unsigned long long ret;
+ __asm__ __volatile__ (
+ "\n1:\t"
+ "movl (%1), %%eax\n\t"
+ "movl 4(%1), %%edx\n\t"
+ "movl %%eax, %%ebx\n\t"
+ "movl %%edx, %%ecx\n\t"
+ LOCK_PREFIX "cmpxchg8b (%1)\n\t"
+ "jnz 1b"
+ : "=A"(ret)
+ : "D"(ptr)
+ : "ebx", "ecx", "memory");
+ return ret;
+}
+
+#define get_64bit(ptr) __get_64bit(ptr)
+
+
/*
* Note: no "lock" prefix even on SMP: xchg always implies lock anyway
* Note 2: xchg has side effect, so that attribute volatile is necessary,
@@ -240,7 +261,41 @@
*/

#ifdef CONFIG_X86_CMPXCHG
+
#define __HAVE_ARCH_CMPXCHG 1
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
+
+#else
+
+/*
+ * Building a kernel capable running on 80386. It may be necessary to
+ * simulate the cmpxchg on the 80386 CPU. For that purpose we define
+ * a function for each of the sizes we support.
+ */
+
+extern unsigned long cmpxchg_386_u8(volatile void *, u8, u8);
+extern unsigned long cmpxchg_386_u16(volatile void *, u16, u16);
+extern unsigned long cmpxchg_386_u32(volatile void *, u32, u32);
+
+static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,
+ unsigned long new, int size)
+{
+ switch (size) {
+ case 1:
+ return cmpxchg_386_u8(ptr, old, new);
+ case 2:
+ return cmpxchg_386_u16(ptr, old, new);
+ case 4:
+ return cmpxchg_386_u32(ptr, old, new);
+ }
+ return old;
+}
+
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))cmpxchg_386((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
#endif

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
@@ -270,10 +325,32 @@
return old;
}

-#define cmpxchg(ptr,o,n)\
- ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
- (unsigned long)(n),sizeof(*(ptr))))
-
+static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ __asm__ __volatile__(
+ LOCK_PREFIX "cmpxchg8b (%4)"
+ : "=A" (prev)
+ : "0" (old), "c" ((unsigned long)(newv >> 32)),
+ "b" ((unsigned long)(newv & 0xffffffffULL)), "D" (ptr)
+ : "memory");
+ return prev;
+}
+
+#ifdef CONFIG_X86_CMPXCHG8B
+#define cmpxchg8b __cmpxchg8b
+#else
+/*
+ * Building a kernel capable of running on 80486 and 80386. Both
+ * do not support cmpxchg8b. Call a function that emulates the
+ * instruction if necessary.
+ */
+extern unsigned long long cmpxchg8b_486(volatile unsigned long long *,
+ unsigned long long, unsigned long long);
+#define cmpxchg8b cmpxchg8b_486
+#endif
+
#ifdef __KERNEL__
struct alt_instr {
__u8 *instr; /* original instruction */

Christoph Lameter
2004-12-01 23:44:26 UTC
Changelog
* Provide atomic pte operations for x86_64

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/pgalloc.h 2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-x86_64/pgalloc.h 2004-11-23 10:59:01.000000000 -0800
@@ -7,16 +7,26 @@
#include <linux/threads.h>
#include <linux/mm.h>

+#define PMD_NONE 0
+#define PGD_NONE 0
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
#define pgd_populate(mm, pgd, pmd) \
set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd)))
+#define pgd_test_and_populate(mm, pgd, pmd) \
+ (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pmd)) == PGD_NONE)

static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
{
set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
}

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+ return cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+}
+
extern __inline__ pmd_t *get_pmd(void)
{
return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linux-2.6.9/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/pgtable.h 2004-11-22 15:08:43.000000000 -0800
+++ linux-2.6.9/include/asm-x86_64/pgtable.h 2004-11-23 10:59:01.000000000 -0800
@@ -437,6 +437,10 @@
#define kc_offset_to_vaddr(o) \
(((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR


Christoph Lameter
2004-12-01 23:45:04 UTC
Changelog
* Provide atomic pte operations for s390

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-s390/pgtable.h 2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/include/asm-s390/pgtable.h 2004-11-19 11:35:08.000000000 -0800
@@ -567,6 +567,15 @@
return pte;
}

+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ struct mm_struct *__mm = __vma->vm_mm; \
+ pte_t __pte; \
+ __pte = ptep_clear_flush(__vma, __address, __ptep); \
+ set_pte(__ptep, __pteval); \
+ __pte; \
+})
+
static inline void ptep_set_wrprotect(pte_t *ptep)
{
pte_t old_pte = *ptep;
@@ -778,6 +787,14 @@

#define kern_addr_valid(addr) (1)

+/* Atomic PTE operations */
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
+static inline int ptep_cmpxchg (struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
/*
* No page table caches to initialise
*/
@@ -791,6 +808,7 @@
#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define __HAVE_ARCH_PTEP_CLEAR_FLUSH
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
Index: linux-2.6.9/include/asm-s390/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-s390/pgalloc.h 2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/include/asm-s390/pgalloc.h 2004-11-19 11:33:25.000000000 -0800
@@ -97,6 +97,10 @@
pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd);
}

+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
+{
+ return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV;
+}
#endif /* __s390x__ */

static inline void
@@ -119,6 +123,18 @@
pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT));
}

+static inline int
+pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page)
+{
+ int rc;
+ spin_lock(&mm->page_table_lock);
+
+ rc=pte_same(*pmd, _PAGE_INVALID_EMPTY);
+ if (rc) pmd_populate(mm, pmd, page);
+ spin_unlock(&mm->page_table_lock);
+ return rc;
+}
+
/*
* page table entry allocation/free routines.
*/

Christoph Lameter
2004-12-01 23:45:51 UTC
Changelog
* Split rss counter into the task structure
* Remove 3 checks of rss in mm/rmap.c
* Prerequisite for page table scalability patch

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-11-30 20:33:31.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-11-30 20:33:50.000000000 -0800
@@ -30,6 +30,7 @@
#include <linux/pid.h>
#include <linux/percpu.h>
#include <linux/topology.h>
+#include <linux/rcupdate.h>

struct exec_domain;

@@ -217,6 +218,7 @@
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
+ long rss, anon_rss;

struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
@@ -226,7 +228,7 @@
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
- unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
+ unsigned long total_vm, locked_vm, shared_vm;
unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;

unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
@@ -236,6 +238,8 @@

/* Architecture-specific MM context */
mm_context_t context;
+ struct list_head task_list; /* Tasks using this mm */
+ struct rcu_head rcu_head; /* For freeing mm via rcu */

/* Token based thrashing protection. */
unsigned long swap_token_time;
@@ -545,6 +549,9 @@
struct list_head ptrace_list;

struct mm_struct *mm, *active_mm;
+ /* Split counters from mm */
+ long rss;
+ long anon_rss;

/* task state */
struct linux_binfmt *binfmt;
@@ -578,6 +585,9 @@
struct completion *vfork_done; /* for vfork() */
int __user *set_child_tid; /* CLONE_CHILD_SETTID */
int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */
+
+ /* List of other tasks using the same mm */
+ struct list_head mm_tasks;

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
@@ -1111,6 +1121,14 @@

#endif

+unsigned long get_rss(struct mm_struct *mm);
+unsigned long get_anon_rss(struct mm_struct *mm);
+unsigned long get_shared(struct mm_struct *mm);
+
+void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk);
+void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk);
+
#endif /* __KERNEL__ */

#endif
+
Index: linux-2.6.9/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9.orig/fs/proc/task_mmu.c 2004-11-30 20:33:26.000000000 -0800
+++ linux-2.6.9/fs/proc/task_mmu.c 2004-11-30 20:33:50.000000000 -0800
@@ -22,7 +22,7 @@
"VmPTE:\t%8lu kB\n",
(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- mm->rss << (PAGE_SHIFT-10),
+ get_rss(mm) << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
@@ -37,7 +37,7 @@
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = mm->rss - mm->anon_rss;
+ *shared = get_shared(mm);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
Index: linux-2.6.9/fs/proc/array.c
===================================================================
--- linux-2.6.9.orig/fs/proc/array.c 2004-11-30 20:33:26.000000000 -0800
+++ linux-2.6.9/fs/proc/array.c 2004-11-30 20:33:50.000000000 -0800
@@ -420,7 +420,7 @@
jiffies_to_clock_t(task->it_real_value),
start_time,
vsize,
- mm ? mm->rss : 0, /* you might want to shift this left 3 */
+ mm ? get_rss(mm) : 0, /* you might want to shift this left 3 */
rsslim,
mm ? mm->start_code : 0,
mm ? mm->end_code : 0,
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-30 20:33:46.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-30 20:33:50.000000000 -0800
@@ -263,8 +263,6 @@
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -438,7 +436,7 @@
BUG_ON(PageReserved(page));
BUG_ON(!anon_vma);

- vma->vm_mm->anon_rss++;
+ current->anon_rss++;

anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
index = (address - vma->vm_start) >> PAGE_SHIFT;
@@ -510,8 +508,6 @@
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -799,8 +795,7 @@
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
- cursor < max_nl_cursor &&
+ while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;
Index: linux-2.6.9/kernel/fork.c
===================================================================
--- linux-2.6.9.orig/kernel/fork.c 2004-11-30 20:33:42.000000000 -0800
+++ linux-2.6.9/kernel/fork.c 2004-11-30 20:33:50.000000000 -0800
@@ -151,6 +151,7 @@
*tsk = *orig;
tsk->thread_info = ti;
ti->task = tsk;
+ tsk->rss = 0;

/* One for us, one for whoever does the "release_task()" (usually parent) */
atomic_set(&tsk->usage,2);
@@ -292,6 +293,7 @@
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
INIT_LIST_HEAD(&mm->mmlist);
+ INIT_LIST_HEAD(&mm->task_list);
mm->core_waiters = 0;
mm->nr_ptes = 0;
spin_lock_init(&mm->page_table_lock);
@@ -323,6 +325,13 @@
return mm;
}

+static void rcu_free_mm(struct rcu_head *head)
+{
+ struct mm_struct *mm = container_of(head ,struct mm_struct, rcu_head);
+
+ free_mm(mm);
+}
+
/*
* Called when the last reference to the mm
* is dropped: either by a lazy thread or by
@@ -333,7 +342,7 @@
BUG_ON(mm == &init_mm);
mm_free_pgd(mm);
destroy_context(mm);
- free_mm(mm);
+ call_rcu(&mm->rcu_head, rcu_free_mm);
}

/*
@@ -400,6 +409,8 @@

/* Get rid of any cached register state */
deactivate_mm(tsk, mm);
+ if (mm)
+ mm_remove_thread(mm, tsk);

/* notify parent sleeping on vfork() */
if (vfork_done) {
@@ -447,8 +458,8 @@
* new threads start up in user mode using an mm, which
* allows optimizing out ipis; the tlb_gather_mmu code
* is an example.
+ * (mm_add_thread does use the ptl .... )
*/
- spin_unlock_wait(&oldmm->page_table_lock);
goto good_mm;
}

@@ -470,6 +481,7 @@
goto free_pt;

good_mm:
+ mm_add_thread(mm, tsk);
tsk->mm = mm;
tsk->active_mm = mm;
return 0;
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-11-30 20:33:46.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-11-30 20:33:50.000000000 -0800
@@ -1467,7 +1467,7 @@
*/
lru_cache_add_active(page);
page_add_anon_rmap(page, vma, addr);
- mm->rss++;
+ current->rss++;

}
pte_unmap(page_table);
@@ -1859,3 +1859,87 @@
}

#endif
+
+unsigned long get_rss(struct mm_struct *mm)
+{
+ struct list_head *y;
+ struct task_struct *t;
+ long rss;
+
+ if (!mm)
+ return 0;
+
+ rcu_read_lock();
+ rss = mm->rss;
+ list_for_each_rcu(y, &mm->task_list) {
+ t = list_entry(y, struct task_struct, mm_tasks);
+ rss += t->rss;
+ }
+ if (rss < 0)
+ rss = 0;
+ rcu_read_unlock();
+ return rss;
+}
+
+unsigned long get_anon_rss(struct mm_struct *mm)
+{
+ struct list_head *y;
+ struct task_struct *t;
+ long rss;
+
+ if (!mm)
+ return 0;
+
+ rcu_read_lock();
+ rss = mm->anon_rss;
+ list_for_each_rcu(y, &mm->task_list) {
+ t = list_entry(y, struct task_struct, mm_tasks);
+ rss += t->anon_rss;
+ }
+ if (rss < 0)
+ rss = 0;
+ rcu_read_unlock();
+ return rss;
+}
+
+unsigned long get_shared(struct mm_struct *mm)
+{
+ struct list_head *y;
+ struct task_struct *t;
+ long rss;
+
+ if (!mm)
+ return 0;
+
+ rcu_read_lock();
+ rss = mm->rss - mm->anon_rss;
+ list_for_each_rcu(y, &mm->task_list) {
+ t = list_entry(y, struct task_struct, mm_tasks);
+ rss += t->rss - t->anon_rss;
+ }
+ if (rss < 0)
+ rss = 0;
+ rcu_read_unlock();
+ return rss;
+}
+
+void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk)
+{
+ if (!mm)
+ return;
+
+ spin_lock(&mm->page_table_lock);
+ mm->rss += tsk->rss;
+ mm->anon_rss += tsk->anon_rss;
+ list_del_rcu(&tsk->mm_tasks);
+ spin_unlock(&mm->page_table_lock);
+}
+
+void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk)
+{
+ spin_lock(&mm->page_table_lock);
+ list_add_rcu(&tsk->mm_tasks, &mm->task_list);
+ spin_unlock(&mm->page_table_lock);
+}
+
+
Index: linux-2.6.9/include/linux/init_task.h
===================================================================
--- linux-2.6.9.orig/include/linux/init_task.h 2004-11-30 20:33:30.000000000 -0800
+++ linux-2.6.9/include/linux/init_task.h 2004-11-30 20:33:50.000000000 -0800
@@ -42,6 +42,7 @@
.mmlist = LIST_HEAD_INIT(name.mmlist), \
.cpu_vm_mask = CPU_MASK_ALL, \
.default_kioctx = INIT_KIOCTX(name.default_kioctx, name), \
+ .task_list = LIST_HEAD_INIT(name.task_list), \
}

#define INIT_SIGNALS(sig) { \
@@ -112,6 +113,7 @@
.proc_lock = SPIN_LOCK_UNLOCKED, \
.switch_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
+ .mm_tasks = LIST_HEAD_INIT(tsk.mm_tasks), \
}


Index: linux-2.6.9/fs/exec.c
===================================================================
--- linux-2.6.9.orig/fs/exec.c 2004-11-30 20:33:41.000000000 -0800
+++ linux-2.6.9/fs/exec.c 2004-11-30 20:33:50.000000000 -0800
@@ -543,6 +543,7 @@
active_mm = tsk->active_mm;
tsk->mm = mm;
tsk->active_mm = mm;
+ mm_add_thread(mm, current);
activate_mm(active_mm, mm);
task_unlock(tsk);
arch_pick_mmap_layout(mm);
Christoph Lameter
2004-12-01 23:43:53 UTC
Changelog
* Atomic pte operations for i386 in regular and PAE modes

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable.h 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/include/asm-i386/pgtable.h 2004-11-19 10:05:27.000000000 -0800
@@ -413,6 +413,7 @@
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
#include <asm-generic/pgtable.h>

#endif /* _I386_PGTABLE_H */
Index: linux-2.6.9/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable-3level.h 2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgtable-3level.h 2004-11-19 10:10:06.000000000 -0800
@@ -6,7 +6,8 @@
* tables on PPro+ CPUs.
*
* Copyright (C) 1999 Ingo Molnar <***@redhat.com>
- */
+ * August 26, 2004 added ptep_cmpxchg <***@lameter.com>
+*/

#define pte_ERROR(e) \
printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
@@ -42,26 +43,15 @@
return pte_x(pte);
}

-/* Rules for using set_pte: the pte being assigned *must* be
- * either not present or in a state where the hardware will
- * not attempt to update the pte. In places where this is
- * not possible, use pte_get_and_clear to obtain the old pte
- * value and then use set_pte to update it. -ben
- */
-static inline void set_pte(pte_t *ptep, pte_t pte)
-{
- ptep->pte_high = pte.pte_high;
- smp_wmb();
- ptep->pte_low = pte.pte_low;
-}
-#define __HAVE_ARCH_SET_PTE_ATOMIC
-#define set_pte_atomic(pteptr,pteval) \
+#define set_pte(pteptr,pteval) \
set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
#define set_pmd(pmdptr,pmdval) \
set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
#define set_pgd(pgdptr,pgdval) \
set_64bit((unsigned long long *)(pgdptr),pgd_val(pgdval))

+#define set_pte_atomic set_pte
+
/*
* Pentium-II erratum A13: in PAE mode we explicitly have to flush
* the TLB via cr3 if the top-level pgd is changed...
@@ -142,4 +132,23 @@
#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high })
#define __swp_entry_to_pte(x) ((pte_t){ 0, (x).val })

+/* Atomic PTE operations */
+#define ptep_xchg_flush(__vma, __addr, __ptep, __newval) \
+({ pte_t __r; \
+ /* xchg acts as a barrier before the setting of the high bits. */\
+ __r.pte_low = xchg(&(__ptep)->pte_low, (__newval).pte_low); \
+ __r.pte_high = (__ptep)->pte_high; \
+ (__ptep)->pte_high = (__newval).pte_high; \
+ flush_tlb_page(__vma, __addr); \
+ (__r); \
+})
+
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
+
+static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ return cmpxchg((unsigned int *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
+
#endif /* _I386_PGTABLE_3LEVEL_H */
Index: linux-2.6.9/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable-2level.h 2004-10-18 14:54:31.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgtable-2level.h 2004-11-19 10:05:27.000000000 -0800
@@ -82,4 +82,7 @@
#define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })

+/* Atomic PTE operations */
+#define ptep_cmpxchg(__vma,__a,__xp,__oldpte,__newpte) (cmpxchg(&(__xp)->pte_low, (__oldpte).pte_low, (__newpte).pte_low)==(__oldpte).pte_low)
+
#endif /* _I386_PGTABLE_2LEVEL_H */
Index: linux-2.6.9/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgalloc.h 2004-10-18 14:53:10.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgalloc.h 2004-11-19 10:10:40.000000000 -0800
@@ -4,9 +4,12 @@
#include <linux/config.h>
#include <asm/processor.h>
#include <asm/fixmap.h>
+#include <asm/system.h>
#include <linux/threads.h>
#include <linux/mm.h> /* for struct page */

+#define PMD_NONE 0L
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -16,6 +19,19 @@
((unsigned long long)page_to_pfn(pte) <<
(unsigned long long) PAGE_SHIFT)));
}
+
+/* Atomic version */
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+#ifdef CONFIG_X86_PAE
+ return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE +
+ ((unsigned long long)page_to_pfn(pte) <<
+ (unsigned long long) PAGE_SHIFT) ) == PMD_NONE;
+#else
+ return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+#endif
+}
+
/*
* Allocate and free page tables.
*/
@@ -49,6 +65,7 @@
#define pmd_free(x) do { } while (0)
#define __pmd_free_tlb(tlb,x) do { } while (0)
#define pgd_populate(mm, pmd, pte) BUG()
+#define pgd_test_and_populate(mm, pmd, pte) ({ BUG(); 1; })

#define check_pgt_cache() do { } while (0)


Linus Torvalds
2004-12-02 00:10:18 UTC
Post by Christoph Lameter
- dump sloppy_rss in favor of list_rss (Linus' proposal)
- keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
This is a series of patches that increases the scalability of
the page fault handler for SMP. Here are some performance results
on a machine with 512 processors allocating 32 GB with an increasing
number of threads (that are assigned a processor each).
Ok, consider me convinced. I don't want to apply this before I get 2.6.10
out the door, but I'm happy with it. I assume Andrew has already picked up
the previous version.

Linus
Andrew Morton
2004-12-02 00:55:38 UTC
Post by Linus Torvalds
Post by Christoph Lameter
- dump sloppy_rss in favor of list_rss (Linus' proposal)
- keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
This is a series of patches that increases the scalability of
the page fault handler for SMP. Here are some performance results
on a machine with 512 processors allocating 32 GB with an increasing
number of threads (that are assigned a processor each).
Ok, consider me convinced. I don't want to apply this before I get 2.6.10
out the door, but I'm happy with it.
There were concerns about some architectures relying upon page_table_lock
for exclusivity within their own pte handling functions. Have they all
been resolved?
Post by Linus Torvalds
I assume Andrew has already picked up the previous version.
Nope. It has major clashes with the 4-level-pagetable work.
Christoph Lameter
2004-12-02 01:46:04 UTC
Post by Andrew Morton
Post by Linus Torvalds
Ok, consider me convinced. I don't want to apply this before I get 2.6.10
out the door, but I'm happy with it.
There were concerns about some architectures relying upon page_table_lock
for exclusivity within their own pte handling functions. Have they all
been resolved?
The patch will fall back on the page_table_lock if an architecture cannot
provide atomic pte operations.

Jeff Garzik
2004-12-02 06:21:01 UTC
Post by Linus Torvalds
Ok, consider me convinced. I don't want to apply this before I get 2.6.10
out the door, but I'm happy with it. I assume Andrew has already picked up
the previous version.
Does that mean that 2.6.10 is actually close to the door?

/me runs...

Andrew Morton
2004-12-02 06:34:41 UTC
Post by Jeff Garzik
Post by Linus Torvalds
Ok, consider me convinced. I don't want to apply this before I get 2.6.10
out the door, but I'm happy with it. I assume Andrew has already picked up
the previous version.
Does that mean that 2.6.10 is actually close to the door?
We need an -rc3 yet. And I need to do another pass through the
regressions-since-2.6.9 list. We've made pretty good progress there
recently. Mid to late December is looking like the 2.6.10 date.

We need to be achieving higher-quality major releases than we did in
2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer
stabilisation periods.

Of course, nobody will test -rc3 and a zillion people will test final
2.6.10, which is when we get lots of useful bug reports. If this keeps on
happening then we'll need to get more serious about the 2.6.10.n process.

Or start alternating between stable and flakey releases, so 2.6.11 will be
a feature release with a 2-month development period and 2.6.12 will be a
bugfix-only release, with perhaps a 2-week development period, so people
know that the even-numbered releases are better stabilised.

We'll see. It all depends on how many bugs you can fix in the next two
weeks ;)

Jeff Garzik
2004-12-02 06:48:25 UTC
Post by Andrew Morton
We need to be achieving higher-quality major releases than we did in
2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer
stabilisation periods.
I'm still hoping that distros (like my employer) and orgs like OSDL will
step up, and hook 2.6.x BK snapshots into daily test harnesses.

Something like John Cherry's reports to lkml on warnings and errors
would be darned useful. His reports are IMO an ideal model: show
day-to-day _changes_ in test results. Don't just dump a huge list of
testsuite results, results which are often clogged with expected
failures and testsuite bug noise.

Jeff


Andrew Morton
2004-12-02 07:02:17 UTC
Post by Jeff Garzik
Post by Andrew Morton
We need to be achieving higher-quality major releases than we did in
2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer
stabilisation periods.
I'm still hoping that distros (like my employer) and orgs like OSDL will
step up, and hook 2.6.x BK snapshots into daily test harnesses.
I believe that both IBM and OSDL are doing this, or are getting geared up
to do this. With both Linus bk and -mm.

However I have my doubts about how useful it will end up being. These test
suites don't seem to pick up many regressions. I've challenged Gerrit to
go back through a release cycle's bugfixes and work out how many of those
bugs would have been detected by the test suite.

My suspicion is that the answer will be "a very small proportion", and that
really is the bottom line.

We simply get far better coverage testing by releasing code, because of all
the wild, whacky and weird things which people do with their computers.
Bless them.
Post by Jeff Garzik
Something like John Cherry's reports to lkml on warnings and errors
would be darned useful. His reports are IMO an ideal model: show
day-to-day _changes_ in test results. Don't just dump a huge list of
testsuite results, results which are often clogged with expected
failures and testsuite bug noise.
Yes, we need humans between the tests and the developers. Someone who has
good experience with the tests and who can say "hey, something changed
when I do X". If nothing changed, we don't hear anything.

It's a developer role, not a testing role. All testing is, really.
Martin J. Bligh
2004-12-02 07:26:59 UTC
Post by Andrew Morton
Post by Jeff Garzik
Post by Andrew Morton
We need to be be achieving higher-quality major releases than we did in
2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer
stabilisation periods.
I'm still hoping that distros (like my employer) and orgs like OSDL will
step up, and hook 2.6.x BK snapshots into daily test harnesses.
I believe that both IBM and OSDL are doing this, or are getting geared up
to do this. With both Linus bk and -mm.
I already run a bunch of tests on a variety of machines for every new
kernel ... but don't have an automated way to compare the results as yet,
so don't actually look at them much ;-(. Sometime soon (quite possibly over
Christmas) things will calm down enough I'll get a couple of days to write
the appropriate perl script, and start publishing stuff.
Post by Andrew Morton
However I have my doubts about how useful it will end up being. These test
suites don't seem to pick up many regressions. I've challenged Gerrit to
go back through a release cycle's bugfixes and work out how many of those
bugs would have been detected by the test suite.
My suspicion is that the answer will be "a very small proportion", and that
really is the bottom line.
Yeah, probably. Though the stress tests catch a lot more than the
functionality ones. The big pain in the ass is drivers, because I don't
have a hope in hell of testing more than 1% of them.

M.
Jeff Garzik
2004-12-02 07:31:35 UTC
Post by Martin J. Bligh
Yeah, probably. Though the stress tests catch a lot more than the
functionality ones. The big pain in the ass is drivers, because I don't
have a hope in hell of testing more than 1% of them.
My dream is that hardware vendors rotate their current machines through
a test shop :) It would be nice to make sure that the popular drivers
get daily test coverage.

Jeff, dreaming on


cliff white
2004-12-02 18:10:29 UTC
On Thu, 02 Dec 2004 02:31:35 -0500
Post by Jeff Garzik
Post by Martin J. Bligh
Yeah, probably. Though the stress tests catch a lot more than the
functionality ones. The big pain in the ass is drivers, because I don't
have a hope in hell of testing more than 1% of them.
My dream is that hardware vendors rotate their current machines through
a test shop :) It would be nice to make sure that the popular drivers
get daily test coverage.
Jeff, dreaming on
OSDL has recently re-done the donation policy, and we're much better positioned
to support that sort of thing now - Contact Tom Hanrahan at OSDL if you
are a vendor, or know a vendor. ( Or you can become a vendor )

cliffw
--
The church is near, but the road is icy.
The bar is far, but i will walk carefully. - Russian proverb
Gerrit Huizenga
2004-12-02 18:17:55 UTC
Post by cliff white
On Thu, 02 Dec 2004 02:31:35 -0500
Post by Jeff Garzik
Post by Martin J. Bligh
Yeah, probably. Though the stress tests catch a lot more than the
functionality ones. The big pain in the ass is drivers, because I don't
have a hope in hell of testing more than 1% of them.
My dream is that hardware vendors rotate their current machines through
a test shop :) It would be nice to make sure that the popular drivers
get daily test coverage.
Jeff, dreaming on
OSDL has recently re-done the donation policy, and we're much better positioned
to support that sort of thing now - Contact Tom Hanrahan at OSDL if you
are a vendor, or know a vendor. ( Or you can become a vendor )
Specifically Tom Hanrahan == ***@osdl.org

gerrit
linux-os
2004-12-02 20:25:07 UTC
Post by cliff white
On Thu, 02 Dec 2004 02:31:35 -0500
Post by Jeff Garzik
Post by Martin J. Bligh
Yeah, probably. Though the stress tests catch a lot more than the
functionality ones. The big pain in the ass is drivers, because I don't
have a hope in hell of testing more than 1% of them.
My dream is that hardware vendors rotate their current machines through
a test shop :) It would be nice to make sure that the popular drivers
get daily test coverage.
Jeff, dreaming on
It isn't going to happen until the time when the vendors
call somebody a liar, try to get them fired, and then
that somebody takes them to court and they lose 100
million dollars or so.

Until that happens, vendors will continue to make junk
and they will continue to lie about the performance of
that junk. It doesn't help that Software Engineering has
become a "hardware junk fixing" job.

Basically many vendors in the PC and PC peripheral
business are, for lack of a better word, liars who
are in the business of perpetrating fraud upon the
unsuspecting PC user.

We have vendors who convincingly change mega-bits
to mega-bytes, improving performance 8-fold without
any expense at all. We have vendors reducing the
size of a kilobyte and a megabyte, then getting
the new lies entered into dictionaries, etc. The
scheme goes on.

In the meantime, if you try to perform DMA
across a PCI/Bus at or near the specified rates,
you will learn that the specifications are
for "this chip" or "that chip", and have nothing
to do with the performance when these chips
get connected together. You will find that real
performance is about 20 percent of the specification.

Occasionally you find a vendor that doesn't lie and
the same chip-set magically performs close to
the published specifications. This is becoming
rare because it costs money to build motherboards
that work. This might require two or more
prototypes to get the timing just right so the
artificial delays and re-clocking, used to make
junk work, isn't required.

Once the PC (and not just the desk-top PC) became
a commodity, everything points to the bottom-line.
You get into the business by making something that
looks and smells new. Then you sell it by writing
specifications that are better than the most
expensive on the market. Your sales-price is
set below average market so you can unload this
junk as rapidly as possible.

Then, you do this over again, claiming that your
equipment is "state-of-the-art"! And if anybody
ever tests the junk and claims that it doesn't
work as specified, you contact the president of
his company and try to kill the messenger.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.
Christoph Lameter
2004-12-08 17:24:06 UTC
Permalink
The page fault handler for anonymous pages can generate significant overhead
apart from its essential function, which is to clear and set up a new page
table entry for a never-accessed memory location. This overhead increases
significantly in an SMP environment.

In the page table scalability patches, we addressed the issue by changing
the locking scheme so that multiple fault handlers can be processed
concurrently on multiple cpus. This patch attempts to aggregate multiple
page faults into a single one. It does that by noticing when an application
generates anonymous page faults in sequence.

If a fault occurs for page x and is then followed by a fault for page x+1, it may
be reasonable to expect another page fault at x+2 in the future. If the page
table entries for x+1 and x+2 are prepared in the fault handling for
page x+1, then the overhead of taking a fault for x+2 is avoided. However,
page x+2 may never be used, and thus we may have increased the rss
of an application unnecessarily. The swapper will take care of removing
that page if memory should get tight.

The following patch makes the anonymous fault handler anticipate future
faults. For each fault a prediction is made of where the next fault will occur
(assuming linear access by the application). If the prediction turns out to
be right (the next fault is where expected), then a number of pages is
preallocated in order to avoid a series of future faults. The order of the
preallocation increases by a power of two for each success in sequence.

The first successful prediction leads to one additional page being allocated,
the second successful prediction to 2 additional pages, the third to 4 pages,
and so on. The max order is 3 by default. In a large
continuous allocation the number of faults is reduced by a factor of 8.
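
As a rough illustration of the prediction bookkeeping described above (purely a sketch: the struct, the fixed MAX_ORDER cap and the userspace harness stand in for the per-task fields and the PAGEVEC_SIZE/PMD clipping in the actual patch):

/* Simplified userspace model of the anticipatory fault logic. The real
 * patch keeps anon_fault_next_addr/anon_fault_order in the task_struct
 * and clips the range to PAGEVEC_SIZE, vma->vm_end and the current PMD;
 * here a plain cap of order 3 is assumed for illustration. */
#include <stdio.h>

#define PAGE_SIZE 4096UL
#define MAX_ORDER 3                 /* assumed cap: at most 8 pages at once */

struct predictor {
	unsigned long next_addr;    /* where the next anonymous fault is expected */
	int order;                  /* current preallocation order */
};

/* Returns how many pages to set up for a fault at addr (always >= 1). */
static unsigned long pages_to_prefault(struct predictor *p, unsigned long addr)
{
	unsigned long n = 1;

	if (addr == p->next_addr) {
		/* Prediction hit: escalate the preallocation order. */
		if (p->order < MAX_ORDER)
			p->order++;
		n = 1UL << p->order;
	} else {
		/* Miss: restart the sequence with a single page. */
		p->order = 0;
	}
	p->next_addr = addr + n * PAGE_SIZE;    /* predict the next fault here */
	return n;
}

int main(void)
{
	struct predictor p = { 0, 0 };
	unsigned long addr = 0x10000000UL;

	/* Simulate a linear walk: the handler only runs at not-yet-mapped spots. */
	for (int i = 0; i < 6; i++) {
		unsigned long n = pages_to_prefault(&p, addr);
		printf("fault at %#lx -> map %lu page(s)\n", addr, n);
		addr += n * PAGE_SIZE;
	}
	return 0;
}

In the patch itself the escalation is the order++ near the top of do_anonymous_page, and end_addr is additionally clipped to vma->vm_end and to the current PMD before the pages are allocated.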

The patch may be combined with the page fault scalability patch (a revised
edition of this patch will be needed, forthcoming after the
page fault scalability patch has been included). The combined patches
triple the possible page fault rate from ~1 million faults/sec to ~3 million
faults/sec.

Standard kernel on a 512-CPU machine allocating 32 GB with an increasing
number of threads (and thus increasing parallelism of page faults):

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
32 3 2 1.397s 148.523s 78.044s 41965.149 80201.646
32 3 4 1.390s 152.618s 44.044s 40851.258 141545.239
32 3 8 1.500s 374.008s 53.001s 16754.519 118671.950
32 3 16 1.415s 1051.759s 73.094s 5973.803 85087.358
32 3 32 1.867s 3400.417s 117.003s 1849.186 53754.928
32 3 64 5.361s 11633.040s 197.034s 540.577 31881.112
32 3 128 23.387s 39386.390s 332.055s 159.642 18918.599
32 3 256 15.409s 20031.450s 168.095s 313.837 37237.918
32 3 512 18.720s 10338.511s 86.047s 607.446 72752.686

Patched kernel:

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
32 3 2 1.022s 127.770s 67.086s 48849.350 92707.085
32 3 4 0.995s 119.666s 37.045s 52141.503 167955.292
32 3 8 0.928s 87.400s 18.034s 71227.407 342934.242
32 3 16 1.067s 72.943s 11.035s 85007.293 553989.377
32 3 32 1.248s 133.753s 10.038s 46602.680 606062.151
32 3 64 5.557s 438.634s 13.093s 14163.802 451418.617
32 3 128 17.860s 1496.797s 19.048s 4153.714 322808.509
32 3 256 13.382s 766.063s 10.016s 8071.695 618816.838
32 3 512 17.067s 369.106s 5.041s 16291.764 1161285.521

These numbers are roughly equal to what can be accomplished with the
page fault scalability patches.

Kernel with both the page fault scalability patches and
prefaulting applied:

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
32 10 2 4.005s 415.119s 221.095s 50036.407 94484.174
32 10 4 3.855s 371.317s 111.076s 55898.259 187635.724
32 10 8 3.902s 308.673s 67.094s 67092.476 308634.397
32 10 16 4.011s 224.213s 37.016s 91889.781 564241.062
32 10 32 5.483s 209.391s 27.046s 97598.647 763495.417
32 10 64 19.166s 219.925s 26.030s 87713.212 797286.395
32 10 128 53.482s 342.342s 27.024s 52981.744 769687.791
32 10 256 67.334s 180.321s 15.036s 84679.911 1364614.334
32 10 512 66.516s 93.098s 9.015s 131387.893 2291548.865

The fault rate doubles when both patches are applied.

And on the high end (512 processors allocating 256 GB). No numbers
for regular kernels because they are extremely slow; also no
numbers for low thread counts, which are likewise very slow.

With prefaulting:

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
256 3 4 8.241s 1414.348s 449.016s 35380.301 112056.239
256 3 8 8.306s 1300.982s 247.025s 38441.977 203559.271
256 3 16 8.368s 1223.853s 154.089s 40846.272 324940.924
256 3 32 8.536s 1284.041s 110.097s 38938.970 453556.624
256 3 64 13.025s 3587.203s 110.010s 13980.123 457131.492
256 3 128 25.025s 11460.700s 145.071s 4382.104 345404.909
256 3 256 26.150s 6061.649s 75.086s 8267.625 663414.482
256 3 512 20.637s 3037.097s 38.062s 16460.435 1302993.019

Page fault scalability patch and prefaulting. Max prefault order
increased to 5 (max preallocation of 32 pages):

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
256 10 8 33.571s 4516.293s 863.021s 36874.099 194356.930
256 10 16 33.103s 3737.688s 461.028s 44492.553 363704.484
256 10 32 35.094s 3436.561s 321.080s 48326.262 521352.840
256 10 64 46.675s 2899.997s 245.020s 56936.124 684214.256
256 10 128 85.493s 2890.198s 203.008s 56380.890 826122.524
256 10 256 74.299s 1374.973s 99.088s 115762.963 1679630.272
256 10 512 62.760s 706.559s 53.027s 218078.311 3149273.714

We get almost linear scalability at the high end with
both patches and end up with a fault rate > 3 million faults per second.

The one thing that still takes up a lot of time is the zeroing
of pages in the page fault handler. There is another
set of patches that I am working on which will prezero pages
and leads to another increase in performance by a factor of 2-4
(if prezeroed pages are available, which may not always be the case).
Maybe we can reach 10 million faults/sec that way.

Patch against 2.6.10-rc3-bk3:

Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-12-01 10:37:31.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-12-01 10:38:15.000000000 -0800
@@ -537,6 +537,8 @@
#endif

struct list_head tasks;
+ unsigned long anon_fault_next_addr; /* Predicted sequential fault address */
+ int anon_fault_order; /* Last order of allocation on fault */
/*
* ptrace_list/ptrace_children forms the list of my children
* that were stolen by a ptracer.
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-12-01 10:38:11.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-12-01 10:45:01.000000000 -0800
@@ -55,6 +55,7 @@

#include <linux/swapops.h>
#include <linux/elf.h>
+#include <linux/pagevec.h>

#ifndef CONFIG_DISCONTIGMEM
/* use the per-pgdat data instead for discontigmem - mbligh */
@@ -1432,8 +1433,106 @@
unsigned long addr)
{
pte_t entry;
- struct page * page = ZERO_PAGE(addr);
+ struct page * page;
+
+ addr &= PAGE_MASK;
+
+ if (current->anon_fault_next_addr == addr) {
+ unsigned long end_addr;
+ int order = current->anon_fault_order;
+
+ /* Sequence of page faults detected. Perform preallocation of pages */

+ /* The order of preallocations increases with each successful prediction */
+ order++;
+
+ if ((1 << order) < PAGEVEC_SIZE)
+ end_addr = addr + (1 << (order + PAGE_SHIFT));
+ else
+ end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
+
+ if (end_addr > vma->vm_end)
+ end_addr = vma->vm_end;
+ if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
+ end_addr &= PMD_MASK;
+
+ current->anon_fault_next_addr = end_addr;
+ current->anon_fault_order = order;
+
+ if (write_access) {
+
+ struct pagevec pv;
+ unsigned long a;
+ struct page **p;
+
+ pte_unmap(page_table);
+ spin_unlock(&mm->page_table_lock);
+
+ pagevec_init(&pv, 0);
+
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
+ /* Allocate the necessary pages */
+ for(a = addr;a < end_addr ; a += PAGE_SIZE) {
+ struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
+
+ if (p) {
+ clear_user_highpage(p, a);
+ pagevec_add(&pv,p);
+ } else
+ break;
+ }
+ end_addr = a;
+
+ spin_lock(&mm->page_table_lock);
+
+ for(p = pv.pages; addr < end_addr; addr += PAGE_SIZE, p++) {
+
+ page_table = pte_offset_map(pmd, addr);
+ if (!pte_none(*page_table)) {
+ /* Someone else got there first */
+ page_cache_release(*p);
+ pte_unmap(page_table);
+ continue;
+ }
+
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p,
+ vma->vm_page_prot)),
+ vma);
+
+ mm->rss++;
+ lru_cache_add_active(*p);
+ mark_page_accessed(*p);
+ page_add_anon_rmap(*p, vma, addr);
+
+ set_pte(page_table, entry);
+ pte_unmap(page_table);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, addr, entry);
+ }
+ } else {
+ /* Read */
+ for(;addr < end_addr; addr += PAGE_SIZE) {
+ page_table = pte_offset_map(pmd, addr);
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+ set_pte(page_table, entry);
+ pte_unmap(page_table);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, addr, entry);
+
+ };
+ }
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_MINOR;
+ }
+
+ current->anon_fault_next_addr = addr + PAGE_SIZE;
+ current->anon_fault_order = 0;
+
+ page = ZERO_PAGE(addr);
/* Read-only mapping of ZERO_PAGE. */
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
Jesse Barnes
2004-12-08 17:33:13 UTC
Permalink
Post by Christoph Lameter
Page fault scalability patch and prefaulting. Max prefault order
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
256 10 8 33.571s 4516.293s 863.021s 36874.099 194356.930
256 10 16 33.103s 3737.688s 461.028s 44492.553 363704.484
256 10 32 35.094s 3436.561s 321.080s 48326.262 521352.840
256 10 64 46.675s 2899.997s 245.020s 56936.124 684214.256
256 10 128 85.493s 2890.198s 203.008s 56380.890 826122.524
256 10 256 74.299s 1374.973s 99.088s115762.963 1679630.272
256 10 512 62.760s 706.559s 53.027s218078.311 3149273.714
We are getting into an almost linear scalability in the high end with
both patches and end up with a fault rate > 3 mio faults per second.
Nice results! Any idea how many applications benefit from this sort of
anticipatory faulting? It has implications for NUMA allocation. Imagine an
app that allocates a large virtual address space and then tries to fault in
pages near each CPU in turn. With this patch applied, CPU 2 would be
referencing pages near CPU 1, and CPU 3 would then fault in 4 pages, which
would then be used by CPUs 4-6. Unless I'm missing something...

And again, I'm not sure how important that is, maybe this approach will work
well in the majority of cases (obviously it's a big win in faults/sec for
your benchmark, but I wonder about subsequent references from other CPUs to
those pages). You can look at /sys/devices/platform/nodeN/meminfo to see
where the pages are coming from.

Jesse
Christoph Lameter
2004-12-08 17:56:00 UTC
Permalink
Post by Jesse Barnes
Nice results! Any idea how many applications benefit from this sort of
anticipatory faulting? It has implications for NUMA allocation. Imagine an
app that allocates a large virtual address space and then tries to fault in
pages near each CPU in turn. With this patch applied, CPU 2 would be
referencing pages near CPU 1, and CPU 3 would then fault in 4 pages, which
would then be used by CPUs 4-6. Unless I'm missing something...
Faults are predicted separately for each thread executing on its own processor.
So each processor makes its own predictions, which will not generate
preallocations on a different processor (unless the thread is migrated to
another processor, but that is a very special situation).
Post by Jesse Barnes
And again, I'm not sure how important that is, maybe this approach will work
well in the majority of cases (obviously it's a big win in faults/sec for
your benchmark, but I wonder about subsequent references from other CPUs to
those pages). You can look at /sys/devices/platform/nodeN/meminfo to see
where the pages are coming from.
The origin of the pages has not changed and the existing locality
constraints are observed.

A patch like this is important for applications that allocate and preset
large amounts of memory on startup. It will drastically reduce the startup
times.
Jesse Barnes
2004-12-08 18:33:24 UTC
Permalink
Post by Christoph Lameter
Post by Jesse Barnes
And again, I'm not sure how important that is, maybe this approach will
work well in the majority of cases (obviously it's a big win in
faults/sec for your benchmark, but I wonder about subsequent references
from other CPUs to those pages). You can look at
/sys/devices/platform/nodeN/meminfo to see where the pages are coming
from.
The origin of the pages has not changed and the existing locality
constraints are observed.
A patch like this is important for applications that allocate and preset
large amounts of memory on startup. It will drastically reduce the startup
times.
Ok, that sounds good. My case was probably a bit contrived, but I'm glad to
see that you had already thought of it anyway.

Jesse
David S. Miller
2004-12-08 21:26:27 UTC
Permalink
On Wed, 8 Dec 2004 09:56:00 -0800 (PST)
Post by Christoph Lameter
A patch like this is important for applications that allocate and preset
large amounts of memory on startup. It will drastically reduce the startup
times.
I see. Yet I noticed that while the patch makes system time decrease,
for some reason the wall time is increasing with the patch applied.
Why is that, or am I misreading your tables?
Linus Torvalds
2004-12-08 21:42:46 UTC
Permalink
Post by David S. Miller
I see. Yet I noticed that while the patch makes system time decrease,
for some reason the wall time is increasing with the patch applied.
Why is that, or am I misreading your tables?
I assume that you're looking at the final "both patches applied" case.

It has ten repetitions, while the other two tables only have three. That
would explain the discrepancy.

Linus
Dave Hansen
2004-12-08 17:55:26 UTC
Permalink
Post by Christoph Lameter
The page fault handler for anonymous pages can generate significant overhead
apart from its essential function which is to clear and setup a new page
table entry for a never accessed memory location. This overhead increases
significantly in an SMP environment.
do_anonymous_page() is a relatively compact function at this point.
This would probably be a lot more readable if it was broken out into at
least another function or two that do_anonymous_page() calls into. That
way, you also get a much cleaner separation if anyone needs to turn it
off in the future.

Speaking of that, have you seen this impair performance on any other
workloads?

-- Dave

Martin J. Bligh
2004-12-08 19:07:27 UTC
Permalink
Post by Christoph Lameter
The page fault handler for anonymous pages can generate significant overhead
apart from its essential function which is to clear and setup a new page
table entry for a never accessed memory location. This overhead increases
significantly in an SMP environment.
In the page table scalability patches, we addressed the issue by changing
the locking scheme so that multiple fault handlers are able to be processed
concurrently on multiple cpus. This patch attempts to aggregate multiple
page faults into a single one. It does that by noting
anonymous page faults generated in sequence by an application.
If a fault occurred for page x and is then followed by page x+1 then it may
be reasonable to expect another page fault at x+2 in the future. If page
table entries for x+1 and x+2 would be prepared in the fault handling for
page x+1 then the overhead of taking a fault for x+2 is avoided. However
page x+2 may never be used and thus we may have increased the rss
of an application unnecessarily. The swapper will take care of removing
that page if memory should get tight.
The following patch makes the anonymous fault handler anticipate future
faults. For each fault a prediction is made where the fault would occur
(assuming linear acccess by the application). If the prediction turns out to
be right (next fault is where expected) then a number of pages is
preallocated in order to avoid a series of future faults. The order of the
preallocation increases by the power of two for each success in sequence.
The first successful prediction leads to an additional page being allocated.
Second successful prediction leads to 2 additional pages being allocated.
Third to 4 pages and so on. The max order is 3 by default. In a large
continous allocation the number of faults is reduced by a factor of 8.
The patch may be combined with the page fault scalability patch (another
edition of the patch is needed which will be forthcoming after the
page fault scalability patch has been included). The combined patches
will triple the possible page fault rate from ~1 mio faults sec to 3 mio
faults sec.
Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
Mmmm ... we tried doing this before for filebacked pages by sniffing the
pagecache, but it crippled forky workloads (like kernel compile) with the
extra cost in zap_pte_range, etc.

Perhaps the locality is better for the anon stuff, but the cost is also
higher. Exactly what benchmark were you running on this? If you just run
a microbenchmark that allocates memory, then it will definitely be faster.
On other things, I suspect not ...

M.



Martin J. Bligh
2004-12-08 22:50:07 UTC
Permalink
Post by Christoph Lameter
The page fault handler for anonymous pages can generate significant overhead
apart from its essential function which is to clear and setup a new page
table entry for a never accessed memory location. This overhead increases
significantly in an SMP environment.
In the page table scalability patches, we addressed the issue by changing
the locking scheme so that multiple fault handlers are able to be processed
concurrently on multiple cpus. This patch attempts to aggregate multiple
page faults into a single one. It does that by noting
anonymous page faults generated in sequence by an application.
If a fault occurred for page x and is then followed by page x+1 then it may
be reasonable to expect another page fault at x+2 in the future. If page
table entries for x+1 and x+2 would be prepared in the fault handling for
page x+1 then the overhead of taking a fault for x+2 is avoided. However
page x+2 may never be used and thus we may have increased the rss
of an application unnecessarily. The swapper will take care of removing
that page if memory should get tight.
I tried benchmarking it ... but processes just segfault all the time.
Any chance you could try it out on an SMP ia32 system?

M.

Christoph Lameter
2004-12-09 19:32:07 UTC
Permalink
Post by Martin J. Bligh
I tried benchmarking it ... but processes just segfault all the time.
Any chance you could try it out on SMP ia32 system?
I tried it on my i386 system and it works fine. Sorry about the puny
memory sizes (the system is a PIII-450 with 384 MB of memory).

***@schroedinger:~/pfault/code$ ./pft -t -b256000 -r3 -f1
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 3 1 0.000s 0.004s 0.000s 37407.481 29200.500
0 3 2 0.002s 0.002s 0.000s 31177.059 27227.723

***@schroedinger:~/pfault/code$ uname -a
Linux schroedinger 2.6.10-rc3-bk3-prezero #8 SMP Wed Dec 8 15:22:28 PST
2004 i686 GNU/Linux

Could you send me your .config?


Pavel Machek
2004-12-09 10:57:53 UTC
Permalink
Hi!
Post by Christoph Lameter
Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
...
Post by Christoph Lameter
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
...
Post by Christoph Lameter
These number are roughly equal to what can be accomplished with the
page fault scalability patches.
Kernel patches with both the page fault scalability patches and
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
...
Post by Christoph Lameter
The fault rate doubles when both patches are applied.
...
Post by Christoph Lameter
We are getting into an almost linear scalability in the high end with
both patches and end up with a fault rate > 3 mio faults per second.
Well, with both patches you also slow the single-threaded case down by more
than a factor of two. What are the effects of this patch on a UP system?
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
Nick Piggin
2004-12-09 11:32:38 UTC
Permalink
Post by Pavel Machek
Hi!
Post by Christoph Lameter
Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
...
Post by Christoph Lameter
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
...
Post by Christoph Lameter
These number are roughly equal to what can be accomplished with the
page fault scalability patches.
Kernel patches with both the page fault scalability patches and
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
...
Post by Christoph Lameter
The fault rate doubles when both patches are applied.
...
Post by Christoph Lameter
We are getting into an almost linear scalability in the high end with
both patches and end up with a fault rate > 3 mio faults per second.
Well, with both patches you also slow single-threaded case more than
twice. What are the effects of this patch on UP system?
fault/wsec is the important number.

Christoph Lameter
2004-12-09 17:05:25 UTC
Permalink
Post by Pavel Machek
Hi!
Post by Christoph Lameter
Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
...
Post by Christoph Lameter
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
...
Post by Christoph Lameter
These number are roughly equal to what can be accomplished with the
page fault scalability patches.
Kernel patches with both the page fault scalability patches and
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
...
Post by Christoph Lameter
The fault rate doubles when both patches are applied.
...
Post by Christoph Lameter
We are getting into an almost linear scalability in the high end with
both patches and end up with a fault rate > 3 mio faults per second.
Well, with both patches you also slow single-threaded case more than
twice. What are the effects of this patch on UP system?
The faults per second are slightly increased, so it's faster. The last
numbers are for 10 repetitions, not 3. Do not look at the wall time.
cliff white
2004-12-02 18:43:30 UTC
Permalink
On Wed, 01 Dec 2004 23:26:59 -0800
Post by Martin J. Bligh
Post by Andrew Morton
Post by Jeff Garzik
Post by Andrew Morton
We need to be be achieving higher-quality major releases than we did in
2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer
stabilisation periods.
I'm still hoping that distros (like my employer) and orgs like OSDL will
step up, and hook 2.6.x BK snapshots into daily test harnesses.
I believe that both IBM and OSDL are doing this, or are getting geared up
to do this. With both Linus bk and -mm.
I already run a bunch of tests on a variety of machines for every new
kernel ... but don't have an automated way to compare the results as yet,
so don't actually look at them much ;-(. Sometime soon (quite possibly over
Christmas) things will calm down enough I'll get a couple of days to write
the appropriate perl script, and start publishing stuff.
We've had the most success when one person has an itch to scratch, and works
with us to scratch it. We (OSDL) worked with Sebastien at Bull, and we're very
glad he had the time to do such excellent work. We worked with Con Kolivas, likewise.

We've built tools to automate LTP and reaim comparisons (***@osdl.org has posted results),
we've been able to post some regressions to lkml, and we've tied in with developers
to get bugs fixed. But OSDL has been limited by manpower.

One of the issues with the performance tests is the amount of data produced -
for example, the deep IO tests produce tons of numbers, but the developer community wants
a single "+/- 5%" type response - we need some opinions and help on how to do the
necessary data reduction.
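
Purely as an illustration of the kind of single-number reduction being asked for (a sketch only: the geometric mean and the 5% threshold are assumptions, not an existing OSDL tool or policy; link with -lm):

/* Reduce many per-subtest throughput ratios (new / baseline) to a single
 * verdict: the geometric-mean change, flagged only when it exceeds 5%. */
#include <math.h>
#include <stdio.h>

int main(void)
{
	/* made-up ratios of new-kernel result over baseline result */
	double ratio[] = { 1.02, 0.97, 1.01, 0.93, 1.04 };
	int n = sizeof(ratio) / sizeof(ratio[0]);
	double logsum = 0.0;

	for (int i = 0; i < n; i++)
		logsum += log(ratio[i]);

	double change = (exp(logsum / n) - 1.0) * 100.0;    /* percent */

	if (change > 5.0 || change < -5.0)
		printf("significant change: %+.1f%%\n", change);
	else
		printf("no significant change (%+.1f%%)\n", change);
	return 0;
}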

What would be really kewl is some test/analysis code that could be re-used, so the Martins of the future
have a good starting place.
cliffw
OSDL
Post by Martin J. Bligh
Post by Andrew Morton
However I have my doubts about how useful it will end up being. These test
suites don't seem to pick up many regressions. I've challenged Gerrit to
go back through a release cycle's bugfixes and work out how many of those
bugs would have been detected by the test suite.
My suspicion is that the answer will be "a very small proportion", and that
really is the bottom line.
Yeah, probably. Though the stress tests catch a lot more than the
functionality ones. The big pain in the ass is drivers, because I don't
have a hope in hell of testing more than 1% of them.
M.
--
The church is near, but the road is icy.
The bar is far, but i will walk carefully. - Russian proverb
Marcelo Tosatti
2004-12-06 19:33:27 UTC
Permalink
Post by cliff white
On Wed, 01 Dec 2004 23:26:59 -0800
Post by Martin J. Bligh
Post by Andrew Morton
Post by Jeff Garzik
Post by Andrew Morton
We need to be be achieving higher-quality major releases than we did in
2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer
stabilisation periods.
I'm still hoping that distros (like my employer) and orgs like OSDL will
step up, and hook 2.6.x BK snapshots into daily test harnesses.
I believe that both IBM and OSDL are doing this, or are getting geared up
to do this. With both Linus bk and -mm.
I already run a bunch of tests on a variety of machines for every new
kernel ... but don't have an automated way to compare the results as yet,
so don't actually look at them much ;-(. Sometime soon (quite possibly over
Christmas) things will calm down enough I'll get a couple of days to write
the appropriate perl script, and start publishing stuff.
We've had the most success when one person has an itch to scratch, and works
with us to scratch it. We (OSDL) worked with Sebastien at Bull, and we're very
glad he had the time to do such excellent work. We worked with Con Kolivas, likewise.
and reaim, we've been able to post some regression to lkml, and tied in with developers
to get bugs fixed. But OSDL has been limited by manpower.
One of the issues with the performance tests is the amount of data produced -
for example, the deep IO tests produce ton's o' numbers, but the developer community wants
a single "+/- 5%" type response- we need some opinions and help on how to do the data reduction
necessary.
Yep, reaim produces a single "global throughput" result in MB/s, which is wonderful
for readability.

Now iozone on the other extreme produces output for each kind of operation
(read, write, rw, sync version of those) for each client IIRC. tiobench also
has detailed output for each operation.

We ought to reduce all benchmark results to "read", "write" and "global" (read+write/2)
numbers.

I'm willing to work on the data reduction and graphic generation scripts
for STP results. I think I can do that.
Post by cliff white
What would be really kewl is some test/analysis code that could be re-used, so the Martin's of the future
have a good starting place.
Gerrit Huizenga
2004-12-02 16:24:04 UTC
Permalink
Post by Andrew Morton
Post by Jeff Garzik
Post by Andrew Morton
We need to be be achieving higher-quality major releases than we did in
2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer
stabilisation periods.
I'm still hoping that distros (like my employer) and orgs like OSDL will
step up, and hook 2.6.x BK snapshots into daily test harnesses.
I believe that both IBM and OSDL are doing this, or are getting geared up
to do this. With both Linus bk and -mm.
However I have my doubts about how useful it will end up being. These test
suites don't seem to pick up many regressions. I've challenged Gerrit to
go back through a release cycle's bugfixes and work out how many of those
bugs would have been detected by the test suite.
My suspicion is that the answer will be "a very small proportion", and that
really is the bottom line.
Yeah, sort of what Martin said. LTP, for instance, doesn't find a lot
of what is in our internal bugzilla or the bugme database. Automated
testing tends not to cover the range of desktop peripherals and
drivers that make up a large portion of the code but get very little
test coverage. Our stress testing is extensive and was finding 3-year-old
problems when we first ran it, but it is pretty expensive to run those
types of tests (machines, people, data analysis), so we typically run
those tests on distros rather than mainline to help validate distro
quality.

However, that said, the LTP stuff is still *necessary* - it would
catch quite a number of regressions if we were to regress. The good
thing is that most changes today haven't been leading to regressions.
That could change at any time, and one of the keys is to make sure that
when we do find regressions we get a test into LTP to make sure that
that particular regression never happens again.

I haven't looked at the code coverage for LTP in a while but it is
actually a high line count coverage test for core kernel. I don't remember
if it was over 80% or not, but usually 85-88% is the point of diminishing
returns for a regression suite. I think a more important proactive
step here is to understand what regressions we *do* have and whether
or not we can construct a test that in the future will catch that
regression (or better, a class of regressions).

And maybe we need some kind of filter person or group for lkml who
can see what the key regressions are (e.g. akpm, if you know of a set
of regressions that you are working on, maybe periodically send those
to the ltp mailing list) so we could focus on creating tests for those
regressions.

We are also working to set up large ISV applications in a couple of
spots - both inside IBM and there is a similar effort underway at OSDL.
Those ISV applications will catch a class of real world usage models
and also check for regressions. I don't know if it is possible to set
up a better testing environment for the wild, whacky and weird things
that people do but, yes, Bless them. ;-)
Post by Andrew Morton
We simply get far better coverage testing by releasing code, because of all
the wild, whacky and weird things which people do with their computers.
Bless them.
Post by Jeff Garzik
Something like John Cherry's reports to lkml on warnings and errors
would be darned useful. His reports are IMO an ideal model: show
day-to-day _changes_ in test results. Don't just dump a huge list of
testsuite results, results which are often clogged with expected
failures and testsuite bug noise.
Yes, we need humans between the tests and the developers. Someone who has
good experience with the tests and who can say "hey, something changed
when I do X". If nothing changed, we don't hear anything.
It's a developer role, not a testing role. All testing is, really.
Yep. However, smart developers continue to write scripts to automate
the rote and mundane tasks that they hate doing. Towards that end, there
was a recent effort at Bull on the NPTL work which serves as a very
interesting model:

http://nptl.bullopensource.org/Tests/results/run-browse.php

Basically, you can compare results from any test run with any other
and get a summary of differences. That helps give a quick status
check and helps you focus on the correct issues when tracking down
defects.

gerrit
cliff white
2004-12-02 17:34:01 UTC
Permalink
On Wed, 1 Dec 2004 23:02:17 -0800
Post by Andrew Morton
Post by Jeff Garzik
Post by Andrew Morton
We need to be be achieving higher-quality major releases than we did in
2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer
stabilisation periods.
I'm still hoping that distros (like my employer) and orgs like OSDL will
step up, and hook 2.6.x BK snapshots into daily test harnesses.
I believe that both IBM and OSDL are doing this, or are getting geared up
to do this. With both Linus bk and -mm.
Gee, OSDL has been doing this sort of testing for more than a year now. Getting
bandwidth to look at the results has been a problem. We need more eyeballs
and community support badly; I'm very glad Marcelo has shown recent interest.
Post by Andrew Morton
However I have my doubts about how useful it will end up being. These test
suites don't seem to pick up many regressions. I've challenged Gerrit to
go back through a release cycle's bugfixes and work out how many of those
bugs would have been detected by the test suite.
My suspicion is that the answer will be "a very small proportion", and that
really is the bottom line.
We simply get far better coverage testing by releasing code, because of all
the wild, whacky and weird things which people do with their computers.
Bless them.
Post by Jeff Garzik
Something like John Cherry's reports to lkml on warnings and errors
would be darned useful. His reports are IMO an ideal model: show
day-to-day _changes_ in test results. Don't just dump a huge list of
testsuite results, results which are often clogged with expected
failures and testsuite bug noise.
Yes, we need humans between the tests and the developers. Someone who has
good experience with the tests and who can say "hey, something changed
when I do X". If nothing changed, we don't hear anything.
I would agree, and would do almost anything to help/assist/enable any humans
interested. We need some expertise on when to run certain tests, to avoid
data overload.
I've noticed that when developers submit test results with a patch, it sometimes
helps in the decision on patch acceptance. Is there a way to promote this sort of
behaviour?
cliffw
OSDL
Post by Andrew Morton
It's a developer role, not a testing role. All testing is, really.
--
The church is near, but the road is icy.
The bar is far, but i will walk carefully. - Russian proverb
Grant Grundler
2004-12-02 18:27:16 UTC
Permalink
Post by Andrew Morton
Of course, nobody will test -rc3 and a zillion people will test final
2.6.10, which is when we get lots of useful bug reports. If this keeps on
happening then we'll need to get more serious about the 2.6.10.n process.
Or start alternating between stable and flakey releases, so 2.6.11 will be
a feature release with a 2-month development period and 2.6.12 will be a
bugfix-only release, with perhaps a 2-week development period, so people
know that the even-numbered releases are better stabilised.
No matter what scheme you adopt, I (and others) will adapt as well.
When working on a new feature or bug fix, I don't chase -bk releases
since I don't want to find new, unrelated issues that interfere with
the issue I was originally chasing. I roll to a new release when
the issue I care about is "cooked". Anything that takes longer than
a month or so is just hopeless since I fall too far behind.

(e.g. IRQ handling in parisc-linux needs to be completely rewritten
to pick up irq_affinity support - I just don't have enough time to get
it done in < 2 months. We started on this last year and gave up.)

I see "2.6.10.n process" as the right way to handle bug fix only releases.
I'm happy to work on 2.6.10.0 and understand the initial release was a
"best effort".

The 2.6.odd/.even release scheme described above is a variant of 2.6.10.n releases
where n = {0, 1}. The question is how many parallel releases people
(you and Linus) want us to keep "alive" at the same time:
odd/even implies only one, vs. several if the 2.6.X.n scheme is continued
beyond 2.6.8.1.

We also need to think about how well any scheme aligns with what distros
need in order to support releases. Like the "Adopt-a-Highway" program in
California to pick up trash along highways, I'm wondering if distros
would be willing/interested in adopting a particular release
and maintaining it in bk. E.g. SuSE clearly has an interest in some sort
of 2.6.5.n series for SLES9; ditto for RHEL4 (but for 2.6.9.n).
The question of *who* (at the respective distro) would be the release
maintainer is a titanic-sized rathole. But there is a release manager
today at each distro, and perhaps it's easier if s/he remains invisible
to us.

hth,
grant
Andrew Morton
2004-12-02 18:33:47 UTC
Permalink
Post by Grant Grundler
2.6.odd/.even release described above is a variant of 2.6.10.n releases
where n = {0, 1}. The question is how many parallel releases do people
(you and linus) want us keep "alive" at the same time?
2.6.odd/.even is actually a significantly different process. a) because
there's only one tree, linearly growing. That's considerably simpler than
maintaining a branch. And b) because everyone knows that there won't be a
new development tree opened until we've all knuckled down and fixed the
bugs which we put into the previous one, dammit.

Christoph Hellwig
2004-12-02 18:36:29 UTC
Permalink
Post by Grant Grundler
Also need to think about how well any scheme align's with what distro's
need to support releases. Like the "Adopt-a-Highway" program in
California to pickup trash along highways, I'm wondering if distros
would be willing/interested in adopting a particular release
and maintain it in bk. e.g. SuSE clearly has interest in some sort
of 2.6.5.n series for SLES9. ditto for RHEL4 (but for 2.6.9.n).
Unfortunately the SLES9 kernels don't really look anything like 2.6.5
except for the version number. There's far too much trash from Business
Partners in there.

Pavel Machek
2004-12-07 10:51:26 UTC
Permalink
Hi!
Post by Andrew Morton
Or start alternating between stable and flakey releases, so 2.6.11 will be
a feature release with a 2-month development period and 2.6.12 will be a
bugfix-only release, with perhaps a 2-week development period, so people
know that the even-numbered releases are better stabilised.
If you expect "feature 2.6.11", you might as well call it 2.7.0,
followed by 2.8.0.

Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
Nick Piggin
2004-12-09 08:00:10 UTC
Permalink
Post by Christoph Lameter
- dump sloppy_rss in favor of list_rss (Linus' proposal)
- keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
[snip]
Post by Christoph Lameter
For more than 8 cpus the page fault rate increases by orders
of magnitude. For more than 64 cpus the improvement in performace
is 10 times better.
Those numbers are pretty impressive. I thought you'd said with earlier
patches that performance was about doubled from 8 to 512 CPUS. Did I
remember correctly? If so, where is the improvement coming from? The
per-thread RSS I guess?


On another note, these patches are basically only helpful to new
anonymous page faults. I guess this is the main thing you are concerned
about at the moment, but I wonder if you would see improvements with
my patch to remove the ptl from the other types of faults as well?

The downside of my patch - well the main downsides - compared to yours
are its intrusiveness, and the extra cost involved in copy_page_range
which yours appears not to require.

As I've said earlier though, I wouldn't mind your patches going in. At
least they should probably get into -mm soon, when Andrew has time (and
after the 4level patches are sorted out). That wouldn't stop my patch
(possibly) being merged some time after that if and when it was found
worthy...

Christoph Lameter
2004-12-09 17:03:53 UTC
Permalink
Post by Nick Piggin
Post by Christoph Lameter
For more than 8 cpus the page fault rate increases by orders
of magnitude. For more than 64 cpus the improvement in performace
is 10 times better.
Those numbers are pretty impressive. I thought you'd said with earlier
patches that performance was about doubled from 8 to 512 CPUS. Did I
remember correctly? If so, where is the improvement coming from? The
per-thread RSS I guess?
Right. The per-thread RSS seems to have made a big difference for high CPU
counts. Also, I was conservative in the estimates in the earlier post since I
did not have the numbers for the very high cpu counts.
Post by Nick Piggin
On another note, these patches are basically only helpful to new
anonymous page faults. I guess this is the main thing you are concerned
about at the moment, but I wonder if you would see improvements with
my patch to remove the ptl from the other types of faults as well?
I can try that but I am frankly a bit sceptical since the ptl protects
many other variables. It may be more efficient to have the ptl in these
cases than doing the atomic ops all over the place. Do you have any numbers
you could post? I believe I sent you a copy of the code that I use for
performance tests a week or so ago.
Post by Nick Piggin
The downside of my patch - well the main downsides - compared to yours
are its intrusiveness, and the extra cost involved in copy_page_range
which yours appears not to require.
Is the patch known to be okay for ia64? I can try to see how it
does.
Post by Nick Piggin
As I've said earlier though, I wouldn't mind your patches going in. At
least they should probably get into -mm soon, when Andrew has time (and
after the 4level patches are sorted out). That wouldn't stop my patch
(possibly) being merged some time after that if and when it was found
worthy...
I'd certainly be willing to poke around and see how beneficial this is. If
it turns out to accelerate other functionality of the vm then you
have my full support.
Nick Piggin
2004-12-10 04:30:26 UTC
Permalink
Post by Christoph Lameter
Post by Nick Piggin
Post by Christoph Lameter
For more than 8 cpus the page fault rate increases by orders
of magnitude. For more than 64 cpus the improvement in performace
is 10 times better.
Those numbers are pretty impressive. I thought you'd said with earlier
patches that performance was about doubled from 8 to 512 CPUS. Did I
remember correctly? If so, where is the improvement coming from? The
per-thread RSS I guess?
Right. The per-thread RSS seems to have made a big difference for high CPU
counts. Also I was conservative in the estimates in earlier post since I
did not have the numbers for the very high cpu counts.
Ah OK.
Post by Christoph Lameter
Post by Nick Piggin
On another note, these patches are basically only helpful to new
anonymous page faults. I guess this is the main thing you are concerned
about at the moment, but I wonder if you would see improvements with
my patch to remove the ptl from the other types of faults as well?
I can try that but I am frankly a bit sceptical since the ptl protects
many other variables. It may be more efficient to have the ptl in these
cases than doing the atomic ops all over the place. Do you have any number
you could post? I believe I send you a copy of the code that I use for
performance tests last week or so,
Yep I have your test program. No real numbers because the biggest thing
I have to test on is a 4-way - there is improvement, but it is not so
impressive as your 512 way tests! :)
Post by Christoph Lameter
Post by Nick Piggin
The downside of my patch - well the main downsides - compared to yours
are its intrusiveness, and the extra cost involved in copy_page_range
which yours appears not to require.
Is the patch known to be okay for ia64? I can try to see how it
does.
I think it just needs one small fix to the swapping code, and it should
be pretty stable. So in fact it would probably work for you as is (if you
don't swap), but I'd rather have something more stable before I ask you
to test. I'll try to find time to do that in the next few days.
Post by Christoph Lameter
Post by Nick Piggin
As I've said earlier though, I wouldn't mind your patches going in. At
least they should probably get into -mm soon, when Andrew has time (and
after the 4level patches are sorted out). That wouldn't stop my patch
(possibly) being merged some time after that if and when it was found
worthy...
I'd certainly be willing to poke around and see how beneficial this is. If
it turns out to accellerate other functionality of the vm then you
have my full support.
Great, thanks.
Hugh Dickins
2004-12-09 18:37:40 UTC
Permalink
Post by Christoph Lameter
- dump sloppy_rss in favor of list_rss (Linus' proposal)
- keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
This is a series of patches that increases the scalability of
the page fault handler for SMP. Here are some performance results
on a machine with 512 processors allocating 32 GB with an increasing
number of threads (that are assigned a processor each).
Your V12 patches would apply well to 2.6.10-rc3, except that (as noted
before) your mailer or whatever is eating trailing whitespace: trivial
patch attached to apply before yours, removing that whitespace so yours
apply. But what your patches need to apply to would be 2.6.10-mm.

Your i386 HIGHMEM64G 3level ptep_cmpxchg forgets to use cmpxchg8b, would
have tested out okay up to 4GB but not above: trivial patch attached.

Your scalability figures show a superb improvement. But they are (I
presume) for the best case: intense initial faulting of distinct areas
of anonymous memory by parallel cpus running a multithreaded process.
This is not a common case: how much do what real-world apps benefit?

Since you also avoid taking the page_table_lock in handle_pte_fault,
there should be some scalability benefit to all kinds of page fault:
do you have any results to show how much (perhaps hard to quantify,
since even tmpfs file faults introduce other scalability issues)?

How do the scalability figures compare if you omit patch 7/7 i.e. revert
the per-task rss complications you added in for Linus? I remain a fan
of sloppy rss, which you earlier showed to be accurate enough (I'd say),
though I guess it should be checked on architectures other than your ia64.
I can't see the point of all that added ugliness for numbers which don't
need to be precise - but perhaps there's no way of rearranging fields,
and the point at which mm->(anon_)rss is updated (near up of mmap_sem?),
to avoid destructive cacheline bounce. What I'm asking is, do you have
numbers to support 7/7? Perhaps it's the fact you showed up to 512 cpus
this time, but only up to 32 with sloppy rss? The ratios do look better
with the latest, but the numbers are altogether lower so we don't know.

The split rss patch, if it stays, needs some work. For example,
task_statm uses "get_shared" to total up rss-anon_rss from the tasks,
but assumes mm->rss is already accurate. Scrap the separate get_rss,
get_anon_rss, get_shared functions: just one get_rss to make a single
pass through the tasks adding up both rss and anon_rss at the same time.
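
A toy userspace model of the single-pass summation being suggested (the structures, field names and lack of locking are illustrative assumptions, not the kernel's actual task list walk):

/* Several "tasks" share one "mm"; a single walk over them yields both
 * rss and anon_rss, instead of separate get_rss/get_anon_rss/get_shared
 * passes. Shared then falls out as rss - anon_rss from the same pass. */
#include <stdio.h>

struct task {
	long rss;       /* pages this thread accounted */
	long anon_rss;  /* anonymous subset of the above */
};

struct mm {
	long rss;       /* base values kept in the mm itself */
	long anon_rss;
	struct task *tasks;
	int ntasks;
};

static void get_rss(const struct mm *mm, long *rss, long *anon_rss)
{
	long r = mm->rss, a = mm->anon_rss;

	for (int i = 0; i < mm->ntasks; i++) {
		r += mm->tasks[i].rss;
		a += mm->tasks[i].anon_rss;
	}
	*rss = r;
	*anon_rss = a;
}

int main(void)
{
	struct task t[3] = { { 100, 80 }, { 40, 40 }, { 10, 0 } };
	struct mm mm = { 5, 2, t, 3 };
	long rss, anon;

	get_rss(&mm, &rss, &anon);
	printf("rss=%ld anon=%ld shared=%ld\n", rss, anon, rss - anon);
	return 0;
}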

I am bothered that every read of /proc/<pid>/status or /proc/<pid>/statm
is going to reread through all of that task_list each time; yet in that
massively parallel case that concerns you, there should be little change
to rss after startup. Perhaps a later optimization would be to avoid
task_list completely for singly threaded processes. I'd like get_rss to
update mm->rss and mm->anon_rss and flag it uptodate to avoid subsequent
task_list iterations, but the locking might defeat your whole purpose.

Updating current->rss in do_anonymous_page, current->anon_rss in
page_add_anon_rmap, is not always correct: ptrace's access_process_vm
uses get_user_pages on another task. You need check that current->mm ==
mm (or vma->vm_mm) before incrementing current->rss or current->anon_rss,
fall back to mm (or vma->vm_mm) in rare case not (taking page_table_lock
for that). You'll also need to check !(current->flags & PF_BORROWED_MM),
to guard against use_mm. Or... just go back to sloppy rss.

Moving to the main patch, 1/7, the major issue I see there is the way
do_anonymous_page does update_mmu_cache after setting the pte, without
any page_table_lock to bracket them together. Obviously no problem on
architectures where update_mmu_cache is a no-op! But although there's
been plenty of discussion, particularly with Ben and Nick, I've not
noticed anything to guarantee that as safe on all architectures. I do
think it's fine for you to post your patches before completing hooks in
all the arches, but isn't this a significant issue which needs to be
sorted before your patches go into -mm? You hazily refer to such issues
in 0/7, but now you need to work with arch maintainers to settle them
and show the patches.

A lesser issue with the reordering in do_anonymous_page: don't you need
to move the lru_cache_add_active after the page_add_anon_rmap, to avoid
the very slight chance that vmscan will pick the page off the LRU and
unmap it before you've counted it in, hitting page_remove_rmap's
BUG_ON(page_mapcount(page) < 0)?

(I do wonder why do_anonymous_page calls mark_page_accessed as well as
lru_cache_add_active. The other instances of lru_cache_add_active for
an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
why here? But that's nothing new with your patch, and although you've
reordered the calls, the final page state is the same as before.)

Where handle_pte_fault does "entry = *pte" without page_table_lock:
you're quite right to passing down precisely that entry to the fault
handlers below, but there's still a problem on the 32bit architectures
supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints
of entry may be out of synch. Not a problem for do_anonymous_page, or
anything else relying on ptep_cmpxchg to check; but a problem for
do_wp_page (which could find !pfn_valid and kill the process) and
probably others (harder to think through). Your 4/7 patch for i386 has
an unused atomic get_64bit function from Nick, I think you'll have to
define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.
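Something like this in pgtable-3level.h, say (untested sketch; I'm
assuming the get_64bit from 4/7 takes and returns unsigned long long):

static inline pte_t get_pte_atomic(pte_t *ptep)
{
        pte_t pte;

        /* read both 32bit halves of the pte in one cmpxchg8b-based access */
        *(unsigned long long *)&pte = get_64bit((unsigned long long *)ptep);
        return pte;
}

with the 2level and true 64bit configurations just doing
#define get_pte_atomic(ptep) (*(ptep)).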

Hmm, that will only work if you're using atomic set_64bit rather than
relying on page_table_lock in the complementary places which matter.
Which I believe you are indeed doing in your 3level set_pte. Shouldn't
__set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock?

But by making every set_pte use set_64bit, you are significantly slowing
down many operations which do not need that atomicity. This is quite
visible in the fork/exec/shell results from lmbench on i386 PAE (and is
the only interesting difference, for good or bad, that I noticed with
your patches in lmbench on 2*HT*P4), which run 5-20% slower. There are
no faults on dst mm (nor on src mm) while copy_page_range is copying,
so its set_ptes don't need to be atomic; likewise during zap_pte_range
(either mmap_sem is held exclusively, or it's in the final exit_mmap).
Probably revert set_pte and set_pte_atomic to what they were, and use
set_pte_atomic where it's needed.
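i.e. back to roughly the original 3level definitions (from memory, so
treat this as a sketch):

static inline void set_pte(pte_t *ptep, pte_t pte)
{
        ptep->pte_high = pte.pte_high;
        smp_wmb();
        ptep->pte_low = pte.pte_low;    /* the present bit lives in the low word */
}

#define set_pte_atomic(pteptr, pteval) \
        set_64bit((unsigned long long *)(pteptr), pte_val(pteval))

so copy_page_range and zap_pte_range keep the cheap two-word store, and
only the paths which really race with lockless faults pay for the locked
64bit store.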

Hugh
Christoph Lameter
2004-12-09 22:02:37 UTC
Permalink
Post by Hugh Dickins
How do the scalability figures compare if you omit patch 7/7 i.e. revert
the per-task rss complications you added in for Linus? I remain a fan
of sloppy rss, which you earlier showed to be accurate enough (I'd say),
though I guess should be checked on other architectures than your ia64.
I can't see the point of all that added ugliness for numbers which don't
need to be precise - but perhaps there's no way of rearranging fields,
and the point at which mm->(anon_)rss is updated (near up of mmap_sem?),
to avoid destructive cacheline bounce. What I'm asking is, do you have
numbers to support 7/7? Perhaps it's the fact you showed up to 512 cpus
this time, but only up to 32 with sloppy rss? The ratios do look better
with the latest, but the numbers are altogether lower so we don't know.
Here is a full set of numbers for sloppy and tasklist. The sloppy version
is 2.6.9-rc2-bk14 with the prefault patch also applied and the tasklist
version is 2.6.9-rc2-bk12 w/o prefault (you can get the numbers of
2.6.9-rc2-bk12 w prefault in the post titled "anticipatory prefaulting
in the page fault handler"). Even with this handicap
tasklist is still slightly better! I would expect tasklist to increase in
importance for combination patches which increase the fault rate even
more. The tasklist approach is likely to be unavoidable once I get the prezeroing
patch debugged and integrated, which should at least give us a peak pulse
performance for page faults of > 5 mio faults/sec.

I was also not able to get the high numbers of > 3 mio faults with atomic
rss + prefaulting, but was able to get them with tasklist + prefault. The
atomic version shares the locality problems with the sloppy approach.

sloppy (2.6.10-bk14-rss-sloppy-prefault):
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
1 10 1 0.040s 6.505s 6.054s 100117.616 100072.760
1 10 2 0.041s 7.394s 4.005s 88138.739 161535.358
1 10 4 0.049s 7.863s 2.049s 82819.743 262839.190
1 10 8 0.093s 8.657s 1.077s 74889.898 369606.184
1 10 16 0.621s 13.278s 1.076s 47150.165 371506.561
1 10 32 3.154s 35.337s 2.029s 17025.784 285469.956
1 10 64 11.602s 77.548s 2.086s 7351.089 228908.831
1 10 128 41.999s 217.106s 4.030s 2529.316 152087.458
1 10 256 40.482s 106.627s 3.022s 4454.885 203363.548
1 10 512 63.673s 61.361s 3.040s 5241.403 192528.941
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.176s 41.276s 41.045s 63238.628 63237.008
4 10 2 0.154s 31.074s 16.095s 83943.753 154606.489
4 10 4 0.193s 31.886s 9.096s 81715.471 263190.941
4 10 8 0.210s 33.577s 6.061s 77584.707 396402.083
4 10 16 0.473s 52.997s 6.036s 49025.701 411640.587
4 10 32 3.331s 142.296s 7.093s 18000.934 330197.326
4 10 64 10.820s 318.485s 8.088s 7960.503 295042.520
4 10 128 56.012s 928.004s 12.037s 2664.019 211812.600
4 10 256 46.197s 464.579s 7.026s 5132.263 360940.189
4 10 512 57.396s 225.876s 4.081s 9254.125 544185.485
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
16 10 1 0.948s 221.167s 222.009s 47208.624 47212.786
16 10 2 0.824s 205.021s 110.022s 50939.876 95134.456
16 10 4 0.689s 168.670s 53.055s 61914.226 195802.740
16 10 8 0.683s 137.278s 27.034s 76004.706 383471.968
16 10 16 0.969s 216.288s 24.031s 48264.109 431329.422
16 10 32 3.932s 587.987s 30.002s 17714.820 349219.905
16 10 64 13.542s 1253.834s 32.051s 8273.588 322528.516
16 10 128 54.197s 3161.896s 38.064s 3260.403 271357.849
16 10 256 57.610s 1668.913s 21.038s 6073.335 490410.386
16 10 512 36.721s 833.691s 11.069s 12046.872 896970.623
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 2.080s 470.722s 472.075s 44355.728 44360.409
32 10 2 1.836s 456.343s 242.088s 45771.267 86344.100
32 10 4 1.671s 432.569s 131.065s 48294.609 159291.360
32 10 8 1.457s 354.825s 71.027s 58862.070 294242.410
32 10 16 1.660s 431.057s 48.038s 48464.636 433466.055
32 10 32 3.639s 1190.388s 59.040s 17563.676 353012.708
32 10 64 14.623s 2490.393s 63.040s 8371.808 330750.309
32 10 128 68.481s 6415.265s 76.053s 3234.476 274023.655
32 10 256 63.428s 3216.337s 39.044s 6394.212 531665.931
32 10 512 50.693s 1644.307s 21.035s 12372.572 982183.559
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
64 10 1 4.457s 1021.948s 1026.030s 40863.994 40868.119
64 10 2 3.929s 994.825s 525.030s 41995.308 79844.658
64 10 4 3.661s 931.523s 269.014s 44849.990 155838.443
64 10 8 3.355s 858.565s 153.098s 48662.260 272381.402
64 10 16 3.130s 904.485s 101.090s 46212.285 411581.778
64 10 32 5.007s 2366.494s 116.079s 17686.275 359107.203
64 10 64 17.472s 5195.222s 126.012s 8046.325 332545.646
64 10 128 65.249s 12515.845s 147.053s 3333.815 284290.928
64 10 256 61.328s 6706.566s 78.061s 6197.354 533523.711
64 10 512 60.656s 3201.068s 39.095s 12859.162 1049637.054
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
128 10 8 7.481s 1875.297s 318.049s 44554.389 263386.340
128 10 16 7.128s 2048.919s 230.060s 40799.672 363757.736
128 10 32 9.584s 4758.868s 241.094s 17591.883 346711.571
128 10 64 17.955s 10135.674s 249.025s 8261.684 336547.279
128 10 128 66.939s 25006.914s 287.019s 3345.560 292086.404
128 10 256 62.454s 12892.242s 149.035s 6475.341 561653.696
128 10 512 59.082s 6456.965s 77.002s 12873.768 1089026.647
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
256 10 8 17.201s 4672.781s 860.094s 35772.446 194870.225
256 10 16 16.641s 5071.433s 588.076s 32973.603 284954.772
256 10 32 17.745s 9193.335s 478.005s 18214.166 350950.045
256 10 64 25.474s 20440.137s 510.037s 8197.759 328725.189
256 10 128 65.451s 50015.195s 572.044s 3350.040 293079.914
256 10 256 61.296s 25191.675s 290.084s 6643.660 576852.282
256 10 512 58.911s 12589.530s 149.012s 13264.255 1125015.367

tasklist (2.6.10-rc2-bk12-rss-tasklist):

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
1 3 1 0.045s 2.042s 2.009s 94121.837 94039.902
1 3 2 0.053s 2.217s 1.022s 86554.869 160093.661
1 3 4 0.036s 2.325s 0.074s 83261.622 265213.249
1 3 8 0.065s 2.507s 0.053s 76404.784 370587.422
1 3 16 0.168s 4.727s 0.057s 40152.877 341385.368
1 3 32 0.829s 11.408s 0.070s 16066.277 280690.973
1 3 64 4.324s 25.591s 0.093s 6571.995 209956.473
1 3 128 19.370s 81.568s 1.055s 1947.799 126774.712
1 3 256 13.042s 46.608s 1.009s 3295.950 179708.774
1 3 512 19.410s 28.085s 0.092s 4139.454 211823.959
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 3 1 0.161s 12.698s 12.086s 61156.292 61149.853
4 3 2 0.152s 10.469s 5.073s 74037.518 137039.041
4 3 4 0.179s 9.401s 2.098s 82081.949 263750.289
4 3 8 0.156s 10.194s 1.098s 75979.430 395361.526
4 3 16 0.407s 18.084s 2.010s 42527.778 373673.111
4 3 32 0.824s 44.316s 2.031s 17421.815 339975.566
4 3 64 4.706s 96.587s 2.066s 7763.856 295588.217
4 3 128 17.453s 259.672s 3.053s 2837.813 222395.530
4 3 256 17.090s 136.816s 2.017s 5109.777 361440.098
4 3 512 13.466s 78.242s 1.043s 8575.295 548859.306
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
16 3 1 0.678s 61.548s 62.023s 50551.998 50544.748
16 3 2 0.691s 63.381s 34.027s 49095.790 91791.474
16 3 4 0.663s 52.083s 16.086s 59639.041 186542.124
16 3 8 0.585s 43.339s 9.031s 71614.583 337721.897
16 3 16 0.744s 75.174s 8.003s 41435.328 391278.035
16 3 32 1.713s 171.942s 8.086s 18114.674 354760.887
16 3 64 4.720s 366.803s 9.055s 8467.079 329273.168
16 3 128 22.637s 849.059s 10.093s 3608.741 287572.764
16 3 256 15.849s 472.565s 6.009s 6440.683 515916.601
16 3 512 15.479s 245.305s 3.046s 12062.521 909147.611
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 3 1 1.451s 140.151s 141.060s 44430.367 44428.115
32 3 2 1.399s 136.349s 73.041s 45673.303 85699.793
32 3 4 1.321s 129.760s 39.027s 47996.303 160197.217
32 3 8 1.279s 100.648s 20.039s 61724.641 308454.557
32 3 16 1.414s 153.975s 15.090s 40488.236 395681.716
32 3 32 2.534s 337.021s 17.016s 18528.487 366445.400
32 3 64 4.271s 709.872s 18.057s 8809.787 338656.440
32 3 128 18.734s 1805.094s 21.084s 3449.586 288005.644
32 3 256 14.698s 963.787s 11.078s 6429.787 534077.540
32 3 512 15.299s 453.990s 5.098s 13406.321 1050416.414
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
64 3 1 3.018s 301.014s 304.004s 41386.617 41384.901
64 3 2 2.941s 296.780s 157.005s 41981.967 80116.179
64 3 4 2.810s 280.803s 82.047s 44366.266 152575.551
64 3 8 2.763s 268.745s 48.099s 46344.377 256813.576
64 3 16 2.764s 332.029s 34.030s 37584.030 366744.317
64 3 32 3.337s 704.321s 34.074s 17781.025 362195.710
64 3 64 7.395s 1475.497s 36.078s 8485.379 342026.888
64 3 128 22.227s 3188.934s 40.044s 3918.492 311115.971
64 3 256 18.004s 1834.246s 21.093s 6793.308 573753.797
64 3 512 19.367s 861.324s 10.099s 14287.531 1144168.224

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
128 3 4 5.857s 626.055s 189.010s 39824.798 133076.331
128 3 8 5.837s 592.587s 107.080s 42053.423 233443.791
128 3 16 5.852s 666.252s 71.008s 37443.301 354011.649
128 3 32 6.305s 1365.184s 69.075s 18349.259 360755.364
128 3 64 8.450s 2914.730s 72.046s 8609.057 347288.474
128 3 128 21.188s 6719.590s 79.078s 3733.370 315402.750
128 3 256 18.263s 3672.379s 43.049s 6818.817 578587.427
128 3 512 17.625s 1901.969s 22.082s 13109.967 1102629.479

128 3 256 24.035s 3392.117s 40.074s 7366.714 617628.607
128 3 512 17.000s 1820.242s 21.072s 13697.601 1158632.106

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
256 3 4 11.976s 1660.924s 514.023s 30086.443 97877.018
256 3 8 11.618s 1301.448s 223.063s 38331.361 225057.902
256 3 16 11.696s 1409.158s 148.074s 35423.488 338379.838
256 3 32 12.678s 2668.417s 140.042s 18772.788 358421.926
256 3 64 15.933s 5833.804s 145.068s 8604.085 345487.685
256 3 128 32.640s 13437.080s 159.079s 3736.651 314981.569
256 3 256 23.875s 6835.241s 81.007s 7337.919 620777.397
256 3 512 17.566s 3392.148s 41.003s 14761.249 1226507.319

256 3 256 21.314s 6648.629s 79.085s 7546.038 630270.726
256 3 512 15.994s 3400.378s 40.087s 14732.481 1231399.906
Andrew Morton
2004-12-09 22:52:37 UTC
Permalink
Post by Christoph Lameter
Post by Hugh Dickins
How do the scalability figures compare if you omit patch 7/7 i.e. revert
the per-task rss complications you added in for Linus? I remain a fan
of sloppy rss, which you earlier showed to be accurate enough (I'd say),
though I guess should be checked on other architectures than your ia64.
I can't see the point of all that added ugliness for numbers which don't
need to be precise - but perhaps there's no way of rearranging fields,
and the point at which mm->(anon_)rss is updated (near up of mmap_sem?),
to avoid destructive cacheline bounce. What I'm asking is, do you have
numbers to support 7/7? Perhaps it's the fact you showed up to 512 cpus
this time, but only up to 32 with sloppy rss? The ratios do look better
with the latest, but the numbers are altogether lower so we don't know.
Here is a full set of numbers for sloppy and tasklist.
Yes, but that only tests the thing-which-you're-trying-to-improve. We also
need to work out the impact of that tasklist walk on other people's worst
cases.
It would be helpful if you could generate a brief summary of benchmarking
results as well as dumping the raw numbers, please.
William Lee Irwin III
2004-12-09 22:52:59 UTC
Permalink
Post by Christoph Lameter
I was also not able to get the high numbers of > 3 mio faults with atomic
rss + prefaulting, but was able to get them with tasklist + prefault. The
atomic version shares the locality problems with the sloppy approach.
The implementation of the atomic version at least improperly places
the counter's cacheline, so the results for that are gibberish.

Unless the algorithms being compared are properly implemented, they're
straw men, not valid comparisons.


-- wli
Christoph Lameter
2004-12-09 23:07:13 UTC
Permalink
Post by William Lee Irwin III
Unless the algorithms being compared are properly implemented, they're
straw men, not valid comparisons.
Sloppy rss left the rss in the section of mm that contained the counters.
So that has a separate cacheline. The idea of putting the atomic ops in a
group was to only have one exclusive cacheline for mmap_sem and the rss.
Which could lead to more bouncing of a single cache line rather than
bouncing multiple cache lines less. But it seems to me that the problem
essentially remains the same if the rss counter is not split.
William Lee Irwin III
2004-12-09 23:29:45 UTC
Permalink
Post by Christoph Lameter
Sloppy rss left the rss in the section of mm that contained the counters.
So that has a separate cacheline. The idea of putting the atomic ops in a
group was to only have one exclusive cacheline for mmap_sem and the rss.
Which could lead to more bouncing of a single cache line rather than
bouncing multiple cache lines less. But it seems to me that the problem
essentially remains the same if the rss counter is not split.
The prior results Robin Holt cited were that the counter needed to be
in a different cacheline from the ->mmap_sem and ->page_table_lock.
We shouldn't need to evaluate splitting for the atomic RSS algorithm.

A faithful implementation would just move the atomic counters away from
the ->mmap_sem and ->page_table_lock (just shuffle some mm fields).
Obviously a complete set of results won't be needed unless it's very
surprisingly competitive with the stronger algorithms. Things should be
fine just making sure that it behaves similarly to the one with the shared
cacheline with ->mmap_sem in the sense of having a curve of similar shape
on smaller systems. The absolute difference probably doesn't matter,
but there is something to prove, and the largest risk of not doing so
is exaggerating the low-end performance benefits of stronger algorithms.

-- wli
Christoph Lameter
2004-12-09 23:49:53 UTC
Permalink
Post by William Lee Irwin III
Post by Christoph Lameter
Sloppy rss left the rss in the section of mm that contained the counters.
So that has a separate cacheline. The idea of putting the atomic ops in a
group was to only have one exclusive cacheline for mmap_sem and the rss.
Which could lead to more bouncing of a single cache line rather than
bouncing multiple cache lines less. But it seems to me that the problem
essentially remains the same if the rss counter is not split.
The prior results Robin Holt cited were that the counter needed to be
in a different cacheline from the ->mmap_sem and ->page_table_lock.
We shouldn't need to evaluate splitting for the atomic RSS algorithm.
Ok. Then we would need rss and anon_rss on two additional cache lines?
Both rss and anon_rss on one line? mmap_sem and the page_table_lock also
each on different cache lines?
Post by William Lee Irwin III
A faithful implementation would just move the atomic counters away from
the ->mmap_sem and ->page_table_lock (just shuffle some mm fields).
Obviously a complete set of results won't be needed unless it's very
surprisingly competitive with the stronger algorithms. Things should be
fine just making sure that behaves similarly to the one with the shared
cacheline with ->mmap_sem in the sense of having a curve of similar shape
on smaller systems. The absolute difference probably doesn't matter,
but there is something to prove, and the largest risk of not doing so
is exaggerating the low-end performance benefits of stronger algorithms.
The advantage of the split rss solution is that the counters can be placed
on the same cacheline as other task_struct fields that are already needed,
so there is minimal overhead involved. But I can certainly give it a spin
and see what the results are.
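Presumably the shuffling would look something like this (sketch of the
atomic variant, most fields omitted):

struct mm_struct {
        struct rw_semaphore     mmap_sem;
        spinlock_t              page_table_lock;        /* protects page tables */

        /* ... */

        /* fault counters on their own cacheline, away from the locks */
        atomic_t                rss ____cacheline_aligned_in_smp;
        atomic_t                anon_rss;

        /* ... */
};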

Nick Piggin
2004-12-10 04:26:15 UTC
Permalink
Post by Hugh Dickins
Post by Christoph Lameter
- dump sloppy_rss in favor of list_rss (Linus' proposal)
- keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
This is a series of patches that increases the scalability of
the page fault handler for SMP. Here are some performance results
on a machine with 512 processors allocating 32 GB with an increasing
number of threads (that are assigned a processor each).
Your V12 patches would apply well to 2.6.10-rc3, except that (as noted
before) your mailer or whatever is eating trailing whitespace: trivial
patch attached to apply before yours, removing that whitespace so yours
apply. But what your patches need to apply to would be 2.6.10-mm.
Your i386 HIGHMEM64G 3level ptep_cmpxchg forgets to use cmpxchg8b, would
have tested out okay up to 4GB but not above: trivial patch attached.
That looks obviously correct. Probably the reason why Martin was
getting crashes.

[snip]
Post by Hugh Dickins
Moving to the main patch, 1/7, the major issue I see there is the way
do_anonymous_page does update_mmu_cache after setting the pte, without
any page_table_lock to bracket them together. Obviously no problem on
architectures where update_mmu_cache is a no-op! But although there's
been plenty of discussion, particularly with Ben and Nick, I've not
noticed anything to guarantee that as safe on all architectures. I do
think it's fine for you to post your patches before completing hooks in
all the arches, but isn't this a significant issue which needs to be
sorted before your patches go into -mm? You hazily refer to such issues
in 0/7, but now you need to work with arch maintainers to settle them
and show the patches.
Yep, the update_mmu_cache issue is real. There is a parallel problem,
which is that update_mmu_cache can be called on a pte whose page has since
been evicted and reused. Again, that looks safe on IA64, but maybe
not on other architectures.

It can be solved by moving lru_cache_add to after update_mmu_cache in
all cases but the "update accessed bit" type fault. I solved that by
simply defining that out for architectures that don't need it - a raced
fault will simply get repeated if need be.
Post by Hugh Dickins
A lesser issue with the reordering in do_anonymous_page: don't you need
to move the lru_cache_add_active after the page_add_anon_rmap, to avoid
the very slight chance that vmscan will pick the page off the LRU and
unmap it before you've counted it in, hitting page_remove_rmap's
BUG_ON(page_mapcount(page) < 0)?
That's what I had been doing too. Seems to be the right way to go.
Post by Hugh Dickins
(I do wonder why do_anonymous_page calls mark_page_accessed as well as
lru_cache_add_active. The other instances of lru_cache_add_active for
an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
why here? But that's nothing new with your patch, and although you've
reordered the calls, the final page state is the same as before.)
you're quite right to pass down precisely that entry to the fault
handlers below, but there's still a problem on the 32bit architectures
supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints
of entry may be out of synch. Not a problem for do_anonymous_page, or
anything else relying on ptep_cmpxchg to check; but a problem for
do_wp_page (which could find !pfn_valid and kill the process) and
probably others (harder to think through). Your 4/7 patch for i386 has
an unused atomic get_64bit function from Nick, I think you'll have to
define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.
Indeed. This was a real problem for my patch, definitely.
Post by Hugh Dickins
Hmm, that will only work if you're using atomic set_64bit rather than
relying on page_table_lock in the complementary places which matter.
Which I believe you are indeed doing in your 3level set_pte. Shouldn't
__set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock?
That's what I was wondering. It could be that the actual 64-bit store is
still atomic without the lock prefix (just not the entire rmw), which I
think would be sufficient.

In that case, get_64bit may be able to drop the lock prefix as well.
Nick Piggin
2004-12-10 04:54:41 UTC
Permalink
Post by Nick Piggin
Yep, the update_mmu_cache issue is real. There is a parallel problem,
which is that update_mmu_cache can be called on a pte whose page has since
been evicted and reused. Again, that looks safe on IA64, but maybe
not on other architectures.
It can be solved by moving lru_cache_add to after update_mmu_cache in
all cases but the "update accessed bit" type fault. I solved that by
simply defining that out for architectures that don't need it - a raced
fault will simply get repeated if need be.
The page-freed-before-update_mmu_cache issue can be solved in that way,
but not the issue you raised of set_pte and update_mmu_cache not being
performed under the same ptl section.
Benjamin Herrenschmidt
2004-12-10 05:06:16 UTC
Permalink
Post by Nick Piggin
Post by Nick Piggin
Yep, the update_mmu_cache issue is real. There is a parallel problem,
which is that update_mmu_cache can be called on a pte whose page has since
been evicted and reused. Again, that looks safe on IA64, but maybe
not on other architectures.
It can be solved by moving lru_cache_add to after update_mmu_cache in
all cases but the "update accessed bit" type fault. I solved that by
simply defining that out for architectures that don't need it - a raced
fault will simply get repeated if need be.
The page-freed-before-update_mmu_cache issue can be solved in that way,
not the set_pte and update_mmu_cache not performed under the same ptl
section issue that you raised.
What is the problem with update_mmu_cache ? It doesn't need to be done
in the same lock section since it's approx. equivalent to a HW fault,
which doesn't take the ptl...

Ben.


Nick Piggin
2004-12-10 05:19:56 UTC
Permalink
Post by Benjamin Herrenschmidt
Post by Nick Piggin
The page-freed-before-update_mmu_cache issue can be solved in that way,
but not the issue you raised of set_pte and update_mmu_cache not being
performed under the same ptl section.
What is the problem with update_mmu_cache ? It doesn't need to be done
in the same lock section since it's approx. equivalent to a HW fault,
which doesn't take the ptl...
I don't think a problem has been observed, I think Hugh was just raising
it as a general issue.
Hugh Dickins
2004-12-10 12:30:39 UTC
Permalink
Post by Nick Piggin
Post by Benjamin Herrenschmidt
Post by Nick Piggin
The page-freed-before-update_mmu_cache issue can be solved in that way,
but not the issue you raised of set_pte and update_mmu_cache not being
performed under the same ptl section.
What is the problem with update_mmu_cache ? It doesn't need to be done
in the same lock section since it's approx. equivalent to a HW fault,
which doesn't take the ptl...
I don't think a problem has been observed, I think Hugh was just raising
it as a general issue.
That's right, I know little of the arches on which update_mmu_cache does
something, so cannot say that separation is a problem. And I did see mail
from Ben a month ago in which he arrived at the conclusion that it's not a
problem - but assumed he was speaking for ppc and ppc64. (He was also
writing in the context of your patches rather than Christoph's.)

Perhaps Ben has in mind a logical argument that if update_mmu_cache does
just what its name implies, then doing it under a separate acquisition
of page_table_lock cannot introduce incorrectness on any architecture.
Maybe, but I'd still rather we heard that from an expert in each of the
affected architectures.

As it stands in Christoph's patches, update_mmu_cache is sometimes
called inside page_table_lock and sometimes outside: I'd be surprised
if that doesn't require adjustment for some architecture.

Your idea to raise do_anonymous_page's update_mmu_cache before the
lru_cache_add_active sounds just right; perhaps it should then even be
subsumed into the architectural ptep_cmpxchg. But once we get this far,
I do wonder again whether it's right to be changing the rules in
do_anonymous_page alone (Christoph's patches) rather than all the
other faults together (your patches).

But there's no doubt that the do_anonymous_page case is easier,
or more obviously easy, to deal with - it helps a lot to know
that the page cannot yet be exposed to vmscan.c and rmap.c.

Hugh

Christoph Lameter
2004-12-10 18:43:31 UTC
Permalink
Thank you for the thorough review of my patches. Comments below
Post by Hugh Dickins
Your V12 patches would apply well to 2.6.10-rc3, except that (as noted
before) your mailer or whatever is eating trailing whitespace: trivial
patch attached to apply before yours, removing that whitespace so yours
apply. But what your patches need to apply to would be 2.6.10-mm.
I am still mystified as to why this is an issue at all. The patches apply
just fine to the kernel sources as is. I have patched kernels numerous
times with this patchset and never ran into any issue. quilt removes trailing
whitespace from patches when they are generated as far as I can tell.

Patches will be made against mm after Nick's modifications to the 4 level
patches are in.
Post by Hugh Dickins
Your i386 HIGHMEM64G 3level ptep_cmpxchg forgets to use cmpxchg8b, would
have tested out okay up to 4GB but not above: trivial patch attached.
Thanks for the patch.
Post by Hugh Dickins
Your scalability figures show a superb improvement. But they are (I
presume) for the best case: intense initial faulting of distinct areas
of anonymous memory by parallel cpus running a multithreaded process.
This is not a common case: how much do what real-world apps benefit?
This is common during the startup of distributed applications on our large
machines. They seem to freeze for minutes on bootup. I am not sure how
much real-world apps benefit. The numbers show that the benefit would
mostly be for SMP applications. UP has only very minor improvements.
Post by Hugh Dickins
Since you also avoid taking the page_table_lock in handle_pte_fault,
do you have any results to show how much (perhaps hard to quantify,
since even tmpfs file faults introduce other scalability issues)?
I have not done such tests (yet).
Post by Hugh Dickins
The split rss patch, if it stays, needs some work. For example,
task_statm uses "get_shared" to total up rss-anon_rss from the tasks,
but assumes mm->rss is already accurate. Scrap the separate get_rss,
get_anon_rss, get_shared functions: just one get_rss to make a single
pass through the tasks adding up both rss and anon_rss at the same time.
Next rev will have that.
Post by Hugh Dickins
Updating current->rss in do_anonymous_page, current->anon_rss in
page_add_anon_rmap, is not always correct: ptrace's access_process_vm
uses get_user_pages on another task. You need to check that current->mm ==
mm (or vma->vm_mm) before incrementing current->rss or current->anon_rss,
and fall back to mm (or vma->vm_mm) in the rare case it is not (taking page_table_lock
for that). You'll also need to check !(current->flags & PF_BORROWED_MM),
to guard against use_mm. Or... just go back to sloppy rss.
I will look into this issue.
Post by Hugh Dickins
Moving to the main patch, 1/7, the major issue I see there is the way
do_anonymous_page does update_mmu_cache after setting the pte, without
any page_table_lock to bracket them together. Obviously no problem on
architectures where update_mmu_cache is a no-op! But although there's
been plenty of discussion, particularly with Ben and Nick, I've not
noticed anything to guarantee that as safe on all architectures. I do
think it's fine for you to post your patches before completing hooks in
all the arches, but isn't this a significant issue which needs to be
sorted before your patches go into -mm? You hazily refer to such issues
in 0/7, but now you need to work with arch maintainers to settle them
and show the patches.
I have worked with a couple of arches and received feedback that was
integrated. I certainly welcome more feedback. A vague idea if there is
more trouble on that front: one could take the ptl in the cmpxchg
emulation and then unlock it in update_mmu_cache.
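Roughly: emulate the exchange under the ptl and keep the lock held until
the MMU cache update is done, e.g. folded into one helper (untested
sketch, name invented, not part of the posted patches):

static inline int ptep_cmpxchg_emulated(struct vm_area_struct *vma,
                unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
{
        struct mm_struct *mm = vma->vm_mm;
        int ret = 0;

        spin_lock(&mm->page_table_lock);
        if (pte_same(*ptep, oldval)) {
                set_pte(ptep, newval);
                update_mmu_cache(vma, address, newval);
                ret = 1;
        }
        spin_unlock(&mm->page_table_lock);
        return ret;
}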
Post by Hugh Dickins
A lesser issue with the reordering in do_anonymous_page: don't you need
to move the lru_cache_add_active after the page_add_anon_rmap, to avoid
the very slight chance that vmscan will pick the page off the LRU and
unmap it before you've counted it in, hitting page_remove_rmap's
BUG_ON(page_mapcount(page) < 0)?
Changed.
Post by Hugh Dickins
(I do wonder why do_anonymous_page calls mark_page_accessed as well as
lru_cache_add_active. The other instances of lru_cache_add_active for
an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
why here? But that's nothing new with your patch, and although you've
reordered the calls, the final page state is the same as before.)
The mark_page_accessed is likely there to avoid a future fault just to set
the accessed bit.
Post by Hugh Dickins
you're quite right to pass down precisely that entry to the fault
handlers below, but there's still a problem on the 32bit architectures
supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints
of entry may be out of synch. Not a problem for do_anonymous_page, or
anything else relying on ptep_cmpxchg to check; but a problem for
do_wp_page (which could find !pfn_valid and kill the process) and
probably others (harder to think through). Your 4/7 patch for i386 has
an unused atomic get_64bit function from Nick, I think you'll have to
define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.
That would be a performance issue.
Post by Hugh Dickins
Hmm, that will only work if you're using atomic set_64bit rather than
relying on page_table_lock in the complementary places which matter.
Which I believe you are indeed doing in your 3level set_pte. Shouldn't
__set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock?
But by making every set_pte use set_64bit, you are significantly slowing
down many operations which do not need that atomicity. This is quite
visible in the fork/exec/shell results from lmbench on i386 PAE (and is
the only interesting difference, for good or bad, that I noticed with
your patches in lmbench on 2*HT*P4), which run 5-20% slower. There are
no faults on dst mm (nor on src mm) while copy_page_range is copying,
so its set_ptes don't need to be atomic; likewise during zap_pte_range
(either mmap_sem is held exclusively, or it's in the final exit_mmap).
Probably revert set_pte and set_pte_atomic to what they were, and use
set_pte_atomic where it's needed.
Good suggestions. Will see what I can do, but I will need some assistance;
my main platform is ia64 and the hardware and opportunities for testing on
i386 are limited.

Again thanks for the detailed review.

Hugh Dickins
2004-12-10 21:43:59 UTC
Permalink
Post by Christoph Lameter
Post by Hugh Dickins
Your V12 patches would apply well to 2.6.10-rc3, except that (as noted
before) your mailer or whatever is eating trailing whitespace: trivial
patch attached to apply before yours, removing that whitespace so yours
apply. But what your patches need to apply to would be 2.6.10-mm.
I am still mystified as to why this is an issue at all. The patches apply
just fine to the kernel sources as is. I have patched kernels numerous
times with this patchset and never ran into any issue. quilt removes trailing
whitespace from patches when they are generated as far as I can tell.
Perhaps you've only tried applying your original patches, not the ones
as received through the mail. It discourages people from trying them
when "patch -p1" fails with rejects, however trivial. Or am I alone
in seeing this? I've never had such a problem with other patches before.
Post by Christoph Lameter
Post by Hugh Dickins
Your scalability figures show a superb improvement. But they are (I
presume) for the best case: intense initial faulting of distinct areas
of anonymous memory by parallel cpus running a multithreaded process.
This is not a common case: how much do what real-world apps benefit?
This is common during the startup of distributed applications on our large
machines. They seem to freeze for minutes on bootup. I am not sure how
much real-world apps benefit. The numbers show that the benefit would
mostly be for SMP applications. UP has only very minor improvements.
How much do your patches speed the startup of these applications?
Can you name them?
Post by Christoph Lameter
I have worked with a couple of arches and received feedback that was
integrated. I certainly welcome more feedback. A vague idea if there is
more trouble on that front: One could take the ptl in the cmpxchg
emulation and then unlock on update_mmu cache.
Or move the update_mmu_cache into the ptep_cmpxchg emulation perhaps.
Post by Christoph Lameter
Post by Hugh Dickins
(I do wonder why do_anonymous_page calls mark_page_accessed as well as
lru_cache_add_active. The other instances of lru_cache_add_active for
an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
why here? But that's nothing new with your patch, and although you've
reordered the calls, the final page state is the same as before.)
The mark_page_accessed is likely there to avoid a future fault just to set
the accessed bit.
No, mark_page_accessed is an operation on the struct page
(and the accessed bit of the pte is preset too anyway).
Post by Christoph Lameter
Post by Hugh Dickins
you're quite right to pass down precisely that entry to the fault
handlers below, but there's still a problem on the 32bit architectures
supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints
of entry may be out of synch. Not a problem for do_anonymous_page, or
anything else relying on ptep_cmpxchg to check; but a problem for
do_wp_page (which could find !pfn_valid and kill the process) and
probably others (harder to think through). Your 4/7 patch for i386 has
an unused atomic get_64bit function from Nick, I think you'll have to
define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.
That would be a performance issue.
Sadly, yes, but correctness must take precedence over performance.
It may be possible to avoid it in most cases, doing the atomic
later when in doubt: but would need careful thought.
Post by Christoph Lameter
Good suggestions. Will see what I can do, but I will need some assistance;
my main platform is ia64 and the hardware and opportunities for testing on
i386 are limited.
There are plenty of us who can be trying i386. It's the other arches worrying me.

Hugh

Andrew Morton
2004-12-10 22:12:58 UTC
Permalink
Post by Hugh Dickins
Post by Christoph Lameter
Post by Hugh Dickins
(I do wonder why do_anonymous_page calls mark_page_accessed as well as
lru_cache_add_active. The other instances of lru_cache_add_active for
an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
why here? But that's nothing new with your patch, and although you've
reordered the calls, the final page state is the same as before.)
The mark_page_accessed is likely there to avoid a future fault just to set
the accessed bit.
No, mark_page_accessed is an operation on the struct page
(and the accessed bit of the pte is preset too anyway).
The point is a good one - I guess that code is a holdover from earlier
implementations.

This is equivalent, no?

--- 25/mm/memory.c~do_anonymous_page-use-setpagereferenced Fri Dec 10 14:11:32 2004
+++ 25-akpm/mm/memory.c Fri Dec 10 14:11:42 2004
@@ -1464,7 +1464,7 @@ do_anonymous_page(struct mm_struct *mm,
vma->vm_page_prot)),
vma);
lru_cache_add_active(page);
- mark_page_accessed(page);
+ SetPageReferenced(page);
page_add_anon_rmap(page, vma, addr);
}

_

Hugh Dickins
2004-12-10 23:52:30 UTC
Permalink
Post by Andrew Morton
Post by Hugh Dickins
(I do wonder why do_anonymous_page calls mark_page_accessed as well as
lru_cache_add_active. The other instances of lru_cache_add_active for
an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
why here? But that's nothing new with your patch, and although you've
reordered the calls, the final page state is the same as before.)
The point is a good one - I guess that code is a holdover from earlier
implementations.
This is equivalent, no?
Yes, it is equivalent to use SetPageReferenced(page) there instead.
But why is do_anonymous_page adding anything to lru_cache_add_active,
when its other callers leave it at that? What's special about the
do_anonymous_page case?

Hugh

Andrew Morton
2004-12-11 00:18:35 UTC
Permalink
Post by Hugh Dickins
Post by Andrew Morton
Post by Hugh Dickins
(I do wonder why do_anonymous_page calls mark_page_accessed as well as
lru_cache_add_active. The other instances of lru_cache_add_active for
an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
why here? But that's nothing new with your patch, and although you've
reordered the calls, the final page state is the same as before.)
The point is a good one - I guess that code is a holdover from earlier
implementations.
This is equivalent, no?
Yes, it is equivalent to use SetPageReferenced(page) there instead.
But why is do_anonymous_page adding anything to lru_cache_add_active,
when its other callers leave it at that? What's special about the
do_anonymous_page case?
do_swap_page() is effectively doing the same as do_anonymous_page().
do_wp_page() and do_no_page() appear to be errant.
Christoph Lameter
2004-12-10 20:03:32 UTC
Permalink
Post by Hugh Dickins
Updating current->rss in do_anonymous_page, current->anon_rss in
page_add_anon_rmap, is not always correct: ptrace's access_process_vm
uses get_user_pages on another task. You need to check that current->mm ==
mm (or vma->vm_mm) before incrementing current->rss or current->anon_rss,
and fall back to mm (or vma->vm_mm) in the rare case it is not (taking page_table_lock
for that). You'll also need to check !(current->flags & PF_BORROWED_MM),
to guard against use_mm. Or... just go back to sloppy rss.
Use_mm can simply attach the kernel thread to the mm via mm_add_thread
and will then update mm->rss when being detached again.

The issue with ptrace and get_user_pages is a bit thorny. I did the check
for mm == current->mm in the following patch. If mm != current->mm then
do the sloppy thing and increment mm->rss without the page table lock.
This should be a very special rare case.

One could also set current to the target task in get_user_pages but then
faults for the actual current task may increment the wrong counters. Could
we live with that?

Or simply leave as is. The pages are after all allocated by the ptrace
process and it should be held responsible for it.

My favorite rss solution is still just getting rid of rss and
anon_rss and doing the long loops in procfs. Whichever process wants to
know had better be willing to pay the price in cpu time, and the code for
incrementing rss can be removed from the page fault handler.

We have no real way of establishing the ownership of shared pages
anyways. It's counted when allocated. But the page may live on afterwards
in another process and then not be accounted for although its only user is
the new process. IMHO vm scans may be the only way of really getting an
accurate count.

But here is the improved list_rss patch:

Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-12-06 17:23:55.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-12-10 11:39:00.000000000 -0800
@@ -30,6 +30,7 @@
#include <linux/pid.h>
#include <linux/percpu.h>
#include <linux/topology.h>
+#include <linux/rcupdate.h>

struct exec_domain;

@@ -217,6 +218,7 @@
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
+ long rss, anon_rss;

struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
@@ -226,7 +228,7 @@
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
- unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
+ unsigned long total_vm, locked_vm, shared_vm;
unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;

unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
@@ -236,6 +238,8 @@

/* Architecture-specific MM context */
mm_context_t context;
+ struct list_head task_list; /* Tasks using this mm */
+ struct rcu_head rcu_head; /* For freeing mm via rcu */

/* Token based thrashing protection. */
unsigned long swap_token_time;
@@ -545,6 +549,9 @@
struct list_head ptrace_list;

struct mm_struct *mm, *active_mm;
+ /* Split counters from mm */
+ long rss;
+ long anon_rss;

/* task state */
struct linux_binfmt *binfmt;
@@ -578,6 +585,9 @@
struct completion *vfork_done; /* for vfork() */
int __user *set_child_tid; /* CLONE_CHILD_SETTID */
int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */
+
+ /* List of other tasks using the same mm */
+ struct list_head mm_tasks;

unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
@@ -1124,6 +1134,12 @@

#endif

+void get_rss(struct mm_struct *mm, unsigned long *rss, unsigned long *anon_rss);
+
+void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk);
+void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk);
+
#endif /* __KERNEL__ */

#endif
+
Index: linux-2.6.9/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9.orig/fs/proc/task_mmu.c 2004-12-06 17:23:54.000000000 -0800
+++ linux-2.6.9/fs/proc/task_mmu.c 2004-12-10 11:39:00.000000000 -0800
@@ -6,8 +6,9 @@

char *task_mem(struct mm_struct *mm, char *buffer)
{
- unsigned long data, text, lib;
+ unsigned long data, text, lib, rss, anon_rss;

+ get_rss(mm, &rss, &anon_rss);
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
@@ -22,7 +23,7 @@
"VmPTE:\t%8lu kB\n",
(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- mm->rss << (PAGE_SHIFT-10),
+ rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
@@ -37,11 +38,14 @@
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = mm->rss - mm->anon_rss;
+ unsigned long rss, anon_rss;
+
+ get_rss(mm, &rss, &anon_rss);
+ *shared = rss - anon_rss;
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
- *resident = mm->rss;
+ *resident = rss;
return mm->total_vm;
}

Index: linux-2.6.9/fs/proc/array.c
===================================================================
--- linux-2.6.9.orig/fs/proc/array.c 2004-12-06 17:23:54.000000000 -0800
+++ linux-2.6.9/fs/proc/array.c 2004-12-10 11:39:00.000000000 -0800
@@ -302,7 +302,7 @@

static int do_task_stat(struct task_struct *task, char * buffer, int whole)
{
- unsigned long vsize, eip, esp, wchan = ~0UL;
+ unsigned long rss, anon_rss, vsize, eip, esp, wchan = ~0UL;
long priority, nice;
int tty_pgrp = -1, tty_nr = 0;
sigset_t sigign, sigcatch;
@@ -325,6 +325,7 @@
vsize = task_vsize(mm);
eip = KSTK_EIP(task);
esp = KSTK_ESP(task);
+ get_rss(mm, &rss, &anon_rss);
}

get_task_comm(tcomm, task);
@@ -420,7 +421,7 @@
jiffies_to_clock_t(task->it_real_value),
start_time,
vsize,
- mm ? mm->rss : 0, /* you might want to shift this left 3 */
+ mm ? rss : 0, /* you might want to shift this left 3 */
rsslim,
mm ? mm->start_code : 0,
mm ? mm->end_code : 0,
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-12-10 11:11:26.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-12-10 11:46:07.000000000 -0800
@@ -263,8 +263,6 @@
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -438,7 +436,10 @@
BUG_ON(PageReserved(page));
BUG_ON(!anon_vma);

- vma->vm_mm->anon_rss++;
+ if (current->mm == vma->vm_mm)
+ current->anon_rss++;
+ else
+ vma->vm_mm->anon_rss++;

anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
index = (address - vma->vm_start) >> PAGE_SHIFT;
@@ -510,8 +511,6 @@
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -799,8 +798,7 @@
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
- cursor < max_nl_cursor &&
+ while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;
Index: linux-2.6.9/kernel/fork.c
===================================================================
--- linux-2.6.9.orig/kernel/fork.c 2004-12-06 17:23:55.000000000 -0800
+++ linux-2.6.9/kernel/fork.c 2004-12-10 11:39:00.000000000 -0800
@@ -151,6 +151,7 @@
*tsk = *orig;
tsk->thread_info = ti;
ti->task = tsk;
+ tsk->rss = 0;

/* One for us, one for whoever does the "release_task()" (usually parent) */
atomic_set(&tsk->usage,2);
@@ -292,6 +293,7 @@
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
INIT_LIST_HEAD(&mm->mmlist);
+ INIT_LIST_HEAD(&mm->task_list);
mm->core_waiters = 0;
mm->nr_ptes = 0;
spin_lock_init(&mm->page_table_lock);
@@ -323,6 +325,13 @@
return mm;
}

+static void rcu_free_mm(struct rcu_head *head)
+{
+ struct mm_struct *mm = container_of(head ,struct mm_struct, rcu_head);
+
+ free_mm(mm);
+}
+
/*
* Called when the last reference to the mm
* is dropped: either by a lazy thread or by
@@ -333,7 +342,7 @@
BUG_ON(mm == &init_mm);
mm_free_pgd(mm);
destroy_context(mm);
- free_mm(mm);
+ call_rcu(&mm->rcu_head, rcu_free_mm);
}

/*
@@ -400,6 +409,8 @@

/* Get rid of any cached register state */
deactivate_mm(tsk, mm);
+ if (mm)
+ mm_remove_thread(mm, tsk);

/* notify parent sleeping on vfork() */
if (vfork_done) {
@@ -447,8 +458,8 @@
* new threads start up in user mode using an mm, which
* allows optimizing out ipis; the tlb_gather_mmu code
* is an example.
+ * (mm_add_thread does use the ptl .... )
*/
- spin_unlock_wait(&oldmm->page_table_lock);
goto good_mm;
}

@@ -470,6 +481,7 @@
goto free_pt;

good_mm:
+ mm_add_thread(mm, tsk);
tsk->mm = mm;
tsk->active_mm = mm;
return 0;
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-12-10 11:12:44.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-12-10 11:45:00.000000000 -0800
@@ -1467,8 +1467,10 @@
*/
page_add_anon_rmap(page, vma, addr);
lru_cache_add_active(page);
- mm->rss++;
-
+ if (current->mm == mm)
+ current->rss++;
+ else
+ mm->rss++;
}
pte_unmap(page_table);

@@ -1859,3 +1861,49 @@
}

#endif
+
+void get_rss(struct mm_struct *mm, unsigned long *rss, unsigned long *anon_rss)
+{
+ struct list_head *y;
+ struct task_struct *t;
+ long rss_sum, anon_rss_sum;
+
+ rcu_read_lock();
+ rss_sum = mm->rss;
+ anon_rss_sum = mm->anon_rss;
+ list_for_each_rcu(y, &mm->task_list) {
+ t = list_entry(y, struct task_struct, mm_tasks);
+ rss_sum += t->rss;
+ anon_rss_sum += t->anon_rss;
+ }
+ if (rss_sum < 0)
+ rss_sum = 0;
+ if (anon_rss_sum < 0)
+ anon_rss_sum = 0;
+ rcu_read_unlock();
+ *rss = rss_sum;
+ *anon_rss = anon_rss_sum;
+}
+
+void mm_remove_thread(struct mm_struct *mm, struct task_struct *tsk)
+{
+ if (!mm)
+ return;
+
+ spin_lock(&mm->page_table_lock);
+ mm->rss += tsk->rss;
+ mm->anon_rss += tsk->anon_rss;
+ list_del_rcu(&tsk->mm_tasks);
+ spin_unlock(&mm->page_table_lock);
+}
+
+void mm_add_thread(struct mm_struct *mm, struct task_struct *tsk)
+{
+ spin_lock(&mm->page_table_lock);
+ tsk->rss = 0;
+ tsk->anon_rss = 0;
+ list_add_rcu(&tsk->mm_tasks, &mm->task_list);
+ spin_unlock(&mm->page_table_lock);
+}
+
+
Index: linux-2.6.9/include/linux/init_task.h
===================================================================
--- linux-2.6.9.orig/include/linux/init_task.h 2004-12-06 17:23:55.000000000 -0800
+++ linux-2.6.9/include/linux/init_task.h 2004-12-10 11:39:00.000000000 -0800
@@ -42,6 +42,7 @@
.mmlist = LIST_HEAD_INIT(name.mmlist), \
.cpu_vm_mask = CPU_MASK_ALL, \
.default_kioctx = INIT_KIOCTX(name.default_kioctx, name), \
+ .task_list = LIST_HEAD_INIT(name.task_list), \
}

#define INIT_SIGNALS(sig) { \
@@ -112,6 +113,7 @@
.proc_lock = SPIN_LOCK_UNLOCKED, \
.switch_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
+ .mm_tasks = LIST_HEAD_INIT(tsk.mm_tasks), \
}


Index: linux-2.6.9/fs/exec.c
===================================================================
--- linux-2.6.9.orig/fs/exec.c 2004-12-06 17:23:54.000000000 -0800
+++ linux-2.6.9/fs/exec.c 2004-12-10 11:39:00.000000000 -0800
@@ -543,6 +543,7 @@
active_mm = tsk->active_mm;
tsk->mm = mm;
tsk->active_mm = mm;
+ mm_add_thread(mm, current);
activate_mm(active_mm, mm);
task_unlock(tsk);
arch_pick_mmap_layout(mm);
Index: linux-2.6.9/fs/aio.c
===================================================================
--- linux-2.6.9.orig/fs/aio.c 2004-12-06 17:23:54.000000000 -0800
+++ linux-2.6.9/fs/aio.c 2004-12-10 11:39:00.000000000 -0800
@@ -575,6 +575,7 @@
atomic_inc(&mm->mm_count);
tsk->mm = mm;
tsk->active_mm = mm;
+ mm_add_thread(mm, tsk);
activate_mm(active_mm, mm);
task_unlock(tsk);

@@ -597,6 +598,7 @@
struct task_struct *tsk = current;

task_lock(tsk);
+ mm_remove_thread(mm,tsk);
tsk->flags &= ~PF_BORROWED_MM;
tsk->mm = NULL;
/* active_mm is still 'mm' */
Hugh Dickins
2004-12-10 21:24:50 UTC
Permalink
Post by Christoph Lameter
Post by Hugh Dickins
Updating current->rss in do_anonymous_page, current->anon_rss in
page_add_anon_rmap, is not always correct: ptrace's access_process_vm
uses get_user_pages on another task. You need to check that current->mm ==
mm (or vma->vm_mm) before incrementing current->rss or current->anon_rss,
and fall back to mm (or vma->vm_mm) in the rare case it is not (taking page_table_lock
for that). You'll also need to check !(current->flags & PF_BORROWED_MM),
to guard against use_mm. Or... just go back to sloppy rss.
Use_mm can simply attach the kernel thread to the mm via mm_add_thread
and will then update mm->rss when being detached again.
True. But please add and remove mm outside of the task_lock,
there's no need to nest page_table_lock within it, is there?
Post by Christoph Lameter
The issue with ptrace and get_user_pages is a bit thorny. I did the check
for mm = current->mm in the following patch. If mm != current->mm then
do the sloppy thing and increment mm->rss without the page table lock.
This should be a very special rare case.
I don't understand why you want to avoid taking mm->page_table_lock
in that special rare case. I do prefer the sloppy rss approach, but if
you're trying to be exact then it's regrettable to leave sloppy corners.

Oh, is it because page_add_anon_rmap is usually called with page_table_lock,
but without in your do_anonymous_page case? You'll have to move the
anon_rss incrementation out of page_add_anon_rmap to its callsites
(I was being a little bit lazy when I sited it in that one place,
it's probably better to do it near mm->rss anyway.)
Post by Christoph Lameter
One could also set current to the target task in get_user_pages but then
faults for the actual current task may increment the wrong counters. Could
we live with that?
No, "current" is not nearly so easy to play with as that.
See i386. Even if it were, you might get burnt for heresy.
Post by Christoph Lameter
Or simply leave as is. The pages are after all allocated by the ptrace
process and it should be held responsible for it.
No.
Post by Christoph Lameter
My favorite rss solution is still just getting rid of rss and
anon_rss and doing the long loops in procfs. Whichever process wants to
know had better be willing to pay the price in cpu time, and the code for
incrementing rss can be removed from the page fault handler.
We all seem to have different favourites. Your favourite makes
quite a few people very angry. We've been there, we've done that,
we've no wish to return. It'd be fine if just the process which
wants to know paid the price; but it's every other that has to pay.
Post by Christoph Lameter
We have no real way of establishing the ownership of shared pages
anyways. It's counted when allocated. But the page may live on afterwards
in another process and then not be accounted for although its only user is
the new process.
I didn't understand that bit.
Post by Christoph Lameter
IMHO vm scans may be the only way of really getting an accurate count.
Not studied in depth, but... am I going mad, or is your impressive
RCUing the wrong way round? While we're scanning the list of tasks
sharing the mm, there's no danger of the mm vanishing, but there is
a danger of the task vanishing. Isn't it therefore the task which
needs to be freed via RCU, not the mm?

Hugh

Andrew Morton
2004-12-10 21:38:59 UTC
Permalink
Post by Hugh Dickins
Post by Christoph Lameter
We have no real way of establishing the ownership of shared pages
anyway. It's counted when allocated. But the page may live on afterwards
in another process and then not be accounted for although its only user is
the new process.
I didn't understand that bit.
We did lose some accounting accuracy when the pagetable walk and the big
tasklist walks were removed. Bill would probably have more details. Given
that the code as it stood was a complete showstopper, the tradeoff seemed
reasonable.
Nick Piggin
2004-11-22 22:32:48 UTC
Permalink
Post by Linus Torvalds
Post by Christoph Lameter
The problem is then that the proc filesystem must do an extensive scan
over all threads to find users of a certain mm_struct.
The alternative is to just add a simple list into the task_struct and the
head of it into mm_struct. Then, at fork, you just finish the fork() with
list_add(p->mm_list, p->mm->thread_list);
and do the proper list_del() in exit_mm() or wherever.
You'll still loop in /proc, but you'll do the minimal loop necessary.
Yes, that was what I was thinking we'd have to resort to. Not a bad idea.

It would be nice if you could have it integrated with the locking that
is already there - for example mmap_sem, although that might mean you'd
have to take mmap_sem for writing which may limit scalability of thread
creation / destruction... maybe a separate lock / semaphore for that list
itself would be OK.

Deferred rss might be a practical solution, but I'd prefer this if it can
be made workable.
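To make the shape of that concrete, a minimal sketch of the list-based
scheme (all field, lock and function names here are illustrative, not from
any posted patch):

	/* illustrative new fields: a list head in mm_struct, plus a list
	 * node and a signed per-thread counter in task_struct */

	/* at the end of fork(), under whatever lock protects the list: */
	spin_lock(&mm->task_list_lock);
	list_add(&p->mm_tasks, &mm->task_list);
	spin_unlock(&mm->task_list_lock);

	/* in exit_mm(): */
	spin_lock(&mm->task_list_lock);
	list_del(&tsk->mm_tasks);
	spin_unlock(&mm->task_list_lock);

	/* the /proc reader does the minimal loop over the sharers: */
	static long mm_total_rss(struct mm_struct *mm)
	{
		struct task_struct *t;
		long total = 0;

		spin_lock(&mm->task_list_lock);
		list_for_each_entry(t, &mm->task_list, mm_tasks)
			total += t->rss;
		spin_unlock(&mm->task_list_lock);
		return total;
	}

Whether that lock ends up being a new spinlock, mmap_sem held for writing,
or something else is exactly the open question above.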

Christoph Lameter
2004-11-22 22:39:19 UTC
Permalink
Post by Nick Piggin
Deferred rss might be a practical solution, but I'd prefer this if it can
be made workable.
Both result in an additional field in task_struct that is going to be
incremented when the page_table_lock is not held. It would be possible
to switch to looping in procfs later. The main question with this patchset
is:

How and when can we get this into the kernel?

Nick Piggin
2004-11-22 23:14:02 UTC
Permalink
Post by Christoph Lameter
Post by Nick Piggin
Deferred rss might be a practical solution, but I'd prefer this if it can
be made workable.
Both result in an additional field in task_struct that is going to be
incremented when the page_table_lock is not held. It would be possible
to switch to looping in procfs later. The main question with this patchset
Sure.
Post by Christoph Lameter
How and when can we get this into the kernel?
Well it is a good starting platform for the various PTL reduction patches
floating around.

I'd say Andrew could be convinced to stick it in -mm after 2.6.10, but we'd
probably need a clear path to one of the PTL patches before anything would
move into 2.6.
Christoph Lameter
2004-11-19 19:44:47 UTC
Permalink
Changelog
* Provide atomic pte operations for ia64
* Enhanced parallelism in page fault handler if applied together
with the generic patch

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgalloc.h 2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgalloc.h 2004-11-19 07:54:19.000000000 -0800
@@ -34,6 +34,10 @@
#define pmd_quicklist (local_cpu_data->pmd_quick)
#define pgtable_cache_size (local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE 0
+#define PGD_NONE 0
+
static inline pgd_t*
pgd_alloc_one_fast (struct mm_struct *mm)
{
@@ -78,12 +82,19 @@
preempt_enable();
}

+
static inline void
pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
{
pgd_val(*pgd_entry) = __pa(pmd);
}

+/* Atomic populate */
+static inline int
+pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
+{
+ return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PGD_NONE) == PGD_NONE;
+}

static inline pmd_t*
pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
@@ -132,6 +143,13 @@
pmd_val(*pmd_entry) = page_to_phys(pte);
}

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+ return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
static inline void
pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
{
Index: linux-2.6.9/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgtable.h 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/include/asm-ia64/pgtable.h 2004-11-19 07:55:35.000000000 -0800
@@ -414,6 +425,26 @@
#endif
}

+/*
+ * IA-64 doesn't have any external MMU info: the page tables contain all the necessary
+ * information. However, we use this routine to take care of any (delayed) i-cache
+ * flushing that may be necessary.
+ */
+extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
+
+static inline int
+ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ /*
+ * IA64 defers icache flushes. If the new pte is executable we may
+ * have to flush the icache to insure cache coherency immediately
+ * after the cmpxchg.
+ */
+ if (pte_exec(newval))
+ update_mmu_cache(vma, addr, newval);
+ return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+
static inline int
pte_same (pte_t a, pte_t b)
{
@@ -476,13 +507,6 @@
struct vm_area_struct * prev, unsigned long start, unsigned long end);
#endif

-/*
- * IA-64 doesn't have any external MMU info: the page tables contain all the necessary
- * information. However, we use this routine to take care of any (delayed) i-cache
- * flushing that may be necessary.
- */
-extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
-
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
* Update PTEP with ENTRY, which is guaranteed to be a less
@@ -560,6 +584,8 @@
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+#define __HAVE_ARCH_LOCK_TABLE_OPS
#include <asm-generic/pgtable.h>

#endif /* _ASM_IA64_PGTABLE_H */

Christoph Lameter
2004-11-19 19:44:15 UTC
Permalink
Changelog
* Increase parallelism in SMP configurations by deferring
the acquisition of page_table_lock in handle_mm_fault
* Anonymous memory page faults bypass the page_table_lock
through the use of atomic page table operations
* Swapper does not set pte to empty in transition to swap
* Simulate atomic page table operations using the
page_table_lock if an arch does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides
a performance benefit since the page_table_lock
is held for shorter periods of time.

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-11-18 12:25:49.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-11-19 06:38:53.000000000 -0800
@@ -1330,8 +1330,7 @@
}

/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
*/
static int do_swap_page(struct mm_struct * mm,
struct vm_area_struct * vma, unsigned long address,
@@ -1343,15 +1342,13 @@
int ret = VM_FAULT_MINOR;

pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
/*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1374,8 +1371,7 @@
lock_page(page);

/*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1422,14 +1418,12 @@
}

/*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+ unsigned long addr, pte_t orig_entry)
{
pte_t entry;
struct page * page = ZERO_PAGE(addr);
@@ -1441,7 +1435,6 @@
if (write_access) {
/* Allocate our own private page. */
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
@@ -1450,30 +1443,37 @@
goto no_mem;
clear_user_highpage(page, addr);

- spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);

- if (!pte_none(*page_table)) {
- pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
- }
- mm->rss++;
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
vma->vm_page_prot)),
vma);
- lru_cache_add_active(page);
mark_page_accessed(page);
- page_add_anon_rmap(page, vma, addr);
}

- set_pte(page_table, entry);
+ /* update the entry */
+ if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
+ if (write_access) {
+ pte_unmap(page_table);
+ page_cache_release(page);
+ }
+ goto out;
+ }
+ if (write_access) {
+ /*
+ * These two functions must come after the cmpxchg
+ * because if the page is on the LRU then try_to_unmap may come
+ * in and unmap the pte.
+ */
+ lru_cache_add_active(page);
+ page_add_anon_rmap(page, vma, addr);
+ mm->rss++;
+
+ }
pte_unmap(page_table);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
- spin_unlock(&mm->page_table_lock);
out:
return VM_FAULT_MINOR;
no_mem:
@@ -1489,12 +1489,12 @@
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
*/
static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *page_table,
+ pmd_t *pmd, pte_t orig_entry)
{
struct page * new_page;
struct address_space *mapping = NULL;
@@ -1505,9 +1505,8 @@

if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
+ pmd, write_access, address, orig_entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1605,7 +1604,7 @@
* nonlinear vmas.
*/
static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
- unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
{
unsigned long pgoff;
int err;
@@ -1618,13 +1617,12 @@
if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
}

pgoff = pte_to_pgoff(*pte);

pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);

err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -1643,49 +1641,40 @@
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to insure to handle that case properly.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct * vma, unsigned long address,
int write_access, pte_t *pte, pmd_t *pmd)
{
pte_t entry;
+ pte_t new_entry;

entry = *pte;
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
if (pte_file(entry))
- return do_file_page(mm, vma, address, write_access, pte, pmd);
+ return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}

+ /*
+ * This is the case in which we only update some bits in the pte.
+ */
+ new_entry = pte_mkyoung(entry);
if (write_access) {
- if (!pte_write(entry))
+ if (!pte_write(entry)) {
+ /* do_wp_page expects us to hold the page_table_lock */
+ spin_lock(&mm->page_table_lock);
return do_wp_page(mm, vma, address, pte, pmd, entry);
-
- entry = pte_mkdirty(entry);
+ }
+ new_entry = pte_mkdirty(new_entry);
}
- entry = pte_mkyoung(entry);
- ptep_set_access_flags(vma, address, pte, entry, write_access);
- update_mmu_cache(vma, address, entry);
+ if (ptep_cmpxchg(vma, address, pte, entry, new_entry))
+ update_mmu_cache(vma, address, new_entry);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
}

@@ -1703,22 +1692,45 @@

inc_page_state(pgfault);

- if (is_vm_hugetlb_page(vma))
+ if (unlikely(is_vm_hugetlb_page(vma)))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */

/*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
+ * We rely on the mmap_sem and the SMP-safe atomic PTE updates.
+ * to synchronize with kswapd
*/
- spin_lock(&mm->page_table_lock);
- pmd = pmd_alloc(mm, pgd, address);
+ if (unlikely(pgd_none(*pgd))) {
+ pmd_t *new = pmd_alloc_one(mm, address);
+ if (!new)
+ return VM_FAULT_OOM;
+
+ /* Insure that the update is done in an atomic way */
+ if (!pgd_test_and_populate(mm, pgd, new))
+ pmd_free(new);
+ }
+
+ pmd = pmd_offset(pgd, address);
+
+ if (likely(pmd)) {
+ pte_t *pte;
+
+ if (!pmd_present(*pmd)) {
+ struct page *new;

- if (pmd) {
- pte_t * pte = pte_alloc_map(mm, pmd, address);
- if (pte)
+ new = pte_alloc_one(mm, address);
+ if (!new)
+ return VM_FAULT_OOM;
+
+ if (!pmd_test_and_populate(mm, pmd, new))
+ pte_free(new);
+ else
+ inc_page_state(nr_page_table_pages);
+ }
+
+ pte = pte_offset_map(pmd, address);
+ if (likely(pte))
return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
}
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_OOM;
}

Index: linux-2.6.9/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-generic/pgtable.h 2004-10-18 14:53:46.000000000 -0700
+++ linux-2.6.9/include/asm-generic/pgtable.h 2004-11-19 07:54:05.000000000 -0800
@@ -134,4 +134,60 @@
#define pgd_offset_gate(mm, addr) pgd_offset(mm, addr)
#endif

+#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
+/*
+ * If atomic page table operations are not available then use
+ * the page_table_lock to insure some form of locking.
+ * Note thought that low level operations as well as the
+ * page_table_handling of the cpu may bypass all locking.
+ */
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval) \
+({ \
+ int __rc; \
+ spin_lock(&__vma->vm_mm->page_table_lock); \
+ __rc = pte_same(*(__ptep), __oldval); \
+ if (__rc) set_pte(__ptep, __newval); \
+ spin_unlock(&__vma->vm_mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PGP_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pmd) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pgd_present(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pmd); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __p = __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)));\
+ flush_tlb_page(__vma, __address); \
+ __p; \
+})
+
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-19 06:38:51.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-19 06:38:53.000000000 -0800
@@ -419,7 +419,10 @@
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
*
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
*/
void page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
@@ -561,11 +564,6 @@

/* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
-
- /* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
- set_page_dirty(page);

if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
@@ -580,11 +578,15 @@
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- set_pte(pte, swp_entry_to_pte(entry));
+ pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
mm->anon_rss--;
- }
+ } else
+ pteval = ptep_clear_flush(vma, address, pte);

+ /* Move the dirty bit to the physical page now the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
mm->rss--;
page_remove_rmap(page);
page_cache_release(page);
@@ -671,15 +673,21 @@
if (ptep_clear_flush_young(vma, address, pte))
continue;

- /* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
+ /*
+ * There would be a race here with handle_mm_fault and do_anonymous_page
+ * which bypasses the page_table_lock if we would zap the pte before
+ * putting something into it. On the other hand we need to
+ * have the dirty flag setting at the time we replaced the value.
+ */

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
- set_pte(pte, pgoff_to_pte(page->index));
+ pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+ else
+ pteval = ptep_get_and_clear(pte);

- /* Move the dirty bit to the physical page now the pte is gone. */
+ /* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);


Christoph Lameter
2004-11-19 19:45:28 UTC
Permalink
Changelog
* Make cmpxchg and cmpxchg8b generally available on the i386
platform.
* Provide emulation of cmpxchg suitable for uniprocessor if
built and run on 386.
* Provide emulation of cmpxchg8b suitable for uniprocessor systems
if built and run on 386 or 486.
* Provide an inline function to atomically get a 64 bit value via
cmpxchg8b in an SMP system (courtesy of Nick Piggin)
(important for i386 PAE mode and other places where atomic 64 bit
operations are useful)

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/arch/i386/Kconfig
===================================================================
--- linux-2.6.9.orig/arch/i386/Kconfig 2004-11-15 11:13:34.000000000 -0800
+++ linux-2.6.9/arch/i386/Kconfig 2004-11-19 10:02:54.000000000 -0800
@@ -351,6 +351,11 @@
depends on !M386
default y

+config X86_CMPXCHG8B
+ bool
+ depends on !M386 && !M486
+ default y
+
config X86_XADD
bool
depends on !M386
Index: linux-2.6.9/arch/i386/kernel/cpu/intel.c
===================================================================
--- linux-2.6.9.orig/arch/i386/kernel/cpu/intel.c 2004-11-15 11:13:34.000000000 -0800
+++ linux-2.6.9/arch/i386/kernel/cpu/intel.c 2004-11-19 10:38:26.000000000 -0800
@@ -6,6 +6,7 @@
#include <linux/bitops.h>
#include <linux/smp.h>
#include <linux/thread_info.h>
+#include <linux/module.h>

#include <asm/processor.h>
#include <asm/msr.h>
@@ -287,5 +288,103 @@
return 0;
}

+#ifndef CONFIG_X86_CMPXCHG
+unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new)
+{
+ u8 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u8));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u8 *)ptr;
+ if (prev == old)
+ *(u8 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u8);
+
+unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new)
+{
+ u16 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u16));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u16 *)ptr;
+ if (prev == old)
+ *(u16 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u16);
+
+unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new)
+{
+ u32 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u32));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u32 *)ptr;
+ if (prev == old)
+ *(u32 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u32);
+#endif
+
+#ifndef CONFIG_X86_CMPXCHG8B
+unsigned long long cmpxchg8b_486(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ unsigned long flags;
+
+ /*
+ * Check if the kernel was compiled for an old cpu but
+ * we are running really on a cpu capable of cmpxchg8b
+ */
+
+ if (cpu_has(cpu_data, X86_FEATURE_CX8))
+ return __cmpxchg8b(ptr, old, newv);
+
+ /* Poor mans cmpxchg8b for 386 and 486. Not suitable for SMP */
+ local_irq_save(flags);
+ prev = *ptr;
+ if (prev == old)
+ *ptr = newv;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg8b_486);
+#endif
+
// arch_initcall(intel_cpu_init);

Index: linux-2.6.9/include/asm-i386/system.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/system.h 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/include/asm-i386/system.h 2004-11-19 10:49:46.000000000 -0800
@@ -149,6 +149,9 @@
#define __xg(x) ((struct __xchg_dummy *)(x))


+#define ll_low(x) *(((unsigned int*)&(x))+0)
+#define ll_high(x) *(((unsigned int*)&(x))+1)
+
/*
* The semantics of XCHGCMP8B are a bit strange, this is why
* there is a loop and the loading of %%eax and %%edx has to
@@ -184,8 +187,6 @@
{
__set_64bit(ptr,(unsigned int)(value), (unsigned int)((value)>>32ULL));
}
-#define ll_low(x) *(((unsigned int*)&(x))+0)
-#define ll_high(x) *(((unsigned int*)&(x))+1)

static inline void __set_64bit_var (unsigned long long *ptr,
unsigned long long value)
@@ -203,6 +204,26 @@
__set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
__set_64bit(ptr, ll_low(value), ll_high(value)) )

+static inline unsigned long long __get_64bit(unsigned long long * ptr)
+{
+ unsigned long long ret;
+ __asm__ __volatile__ (
+ "\n1:\t"
+ "movl (%1), %%eax\n\t"
+ "movl 4(%1), %%edx\n\t"
+ "movl %%eax, %%ebx\n\t"
+ "movl %%edx, %%ecx\n\t"
+ LOCK_PREFIX "cmpxchg8b (%1)\n\t"
+ "jnz 1b"
+ : "=A"(ret)
+ : "D"(ptr)
+ : "ebx", "ecx", "memory");
+ return ret;
+}
+
+#define get_64bit(ptr) __get_64bit(ptr)
+
+
/*
* Note: no "lock" prefix even on SMP: xchg always implies lock anyway
* Note 2: xchg has side effect, so that attribute volatile is necessary,
@@ -240,7 +261,41 @@
*/

#ifdef CONFIG_X86_CMPXCHG
+
#define __HAVE_ARCH_CMPXCHG 1
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
+
+#else
+
+/*
+ * Building a kernel capable running on 80386. It may be necessary to
+ * simulate the cmpxchg on the 80386 CPU. For that purpose we define
+ * a function for each of the sizes we support.
+ */
+
+extern unsigned long cmpxchg_386_u8(volatile void *, u8, u8);
+extern unsigned long cmpxchg_386_u16(volatile void *, u16, u16);
+extern unsigned long cmpxchg_386_u32(volatile void *, u32, u32);
+
+static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,
+ unsigned long new, int size)
+{
+ switch (size) {
+ case 1:
+ return cmpxchg_386_u8(ptr, old, new);
+ case 2:
+ return cmpxchg_386_u16(ptr, old, new);
+ case 4:
+ return cmpxchg_386_u32(ptr, old, new);
+ }
+ return old;
+}
+
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))cmpxchg_386((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
#endif

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
@@ -270,10 +325,32 @@
return old;
}

-#define cmpxchg(ptr,o,n)\
- ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
- (unsigned long)(n),sizeof(*(ptr))))
-
+static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ __asm__ __volatile__(
+ LOCK_PREFIX "cmpxchg8b (%4)"
+ : "=A" (prev)
+ : "0" (old), "c" ((unsigned long)(newv >> 32)),
+ "b" ((unsigned long)(newv & 0xffffffffULL)), "D" (ptr)
+ : "memory");
+ return prev;
+}
+
+#ifdef CONFIG_X86_CMPXCHG8B
+#define cmpxchg8b __cmpxchg8b
+#else
+/*
+ * Building a kernel capable of running on 80486 and 80386. Both
+ * do not support cmpxchg8b. Call a function that emulates the
+ * instruction if necessary.
+ */
+extern unsigned long long cmpxchg8b_486(volatile unsigned long long *,
+ unsigned long long, unsigned long long);
+#define cmpxchg8b cmpxchg8b_486
+#endif
+
#ifdef __KERNEL__
struct alt_instr {
__u8 *instr; /* original instruction */

Christoph Lameter
2004-11-19 19:46:45 UTC
Permalink
Changelog
* Provide atomic pte operations for x86_64

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/pgalloc.h 2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-x86_64/pgalloc.h 2004-11-19 08:17:55.000000000 -0800
@@ -7,16 +7,26 @@
#include <linux/threads.h>
#include <linux/mm.h>

+#define PMD_NONE 0
+#define PGD_NONE 0
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
#define pgd_populate(mm, pgd, pmd) \
set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd)))
+#define pgd_test_and_populate(mm, pgd, pmd) \
+ (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pmd)) == PGD_NONE)

static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
{
set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
}

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+ return cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+}
+
extern __inline__ pmd_t *get_pmd(void)
{
return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linux-2.6.9/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/pgtable.h 2004-11-15 11:13:39.000000000 -0800
+++ linux-2.6.9/include/asm-x86_64/pgtable.h 2004-11-19 08:18:52.000000000 -0800
@@ -437,6 +437,10 @@
#define kc_offset_to_vaddr(o) \
(((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR

Christoph Lameter
2004-11-19 19:46:06 UTC
Permalink
Changelog
* Atomic pte operations for i386 in regular and PAE modes

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable.h 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/include/asm-i386/pgtable.h 2004-11-19 10:05:27.000000000 -0800
@@ -413,6 +413,7 @@
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
#include <asm-generic/pgtable.h>

#endif /* _I386_PGTABLE_H */
Index: linux-2.6.9/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable-3level.h 2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgtable-3level.h 2004-11-19 10:10:06.000000000 -0800
@@ -6,7 +6,8 @@
* tables on PPro+ CPUs.
*
* Copyright (C) 1999 Ingo Molnar <***@redhat.com>
- */
+ * August 26, 2004 added ptep_cmpxchg <***@lameter.com>
+*/

#define pte_ERROR(e) \
printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
@@ -42,26 +43,15 @@
return pte_x(pte);
}

-/* Rules for using set_pte: the pte being assigned *must* be
- * either not present or in a state where the hardware will
- * not attempt to update the pte. In places where this is
- * not possible, use pte_get_and_clear to obtain the old pte
- * value and then use set_pte to update it. -ben
- */
-static inline void set_pte(pte_t *ptep, pte_t pte)
-{
- ptep->pte_high = pte.pte_high;
- smp_wmb();
- ptep->pte_low = pte.pte_low;
-}
-#define __HAVE_ARCH_SET_PTE_ATOMIC
-#define set_pte_atomic(pteptr,pteval) \
+#define set_pte(pteptr,pteval) \
set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
#define set_pmd(pmdptr,pmdval) \
set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
#define set_pgd(pgdptr,pgdval) \
set_64bit((unsigned long long *)(pgdptr),pgd_val(pgdval))

+#define set_pte_atomic set_pte
+
/*
* Pentium-II erratum A13: in PAE mode we explicitly have to flush
* the TLB via cr3 if the top-level pgd is changed...
@@ -142,4 +132,23 @@
#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high })
#define __swp_entry_to_pte(x) ((pte_t){ 0, (x).val })

+/* Atomic PTE operations */
+#define ptep_xchg_flush(__vma, __addr, __ptep, __newval) \
+({ pte_t __r; \
+ /* xchg acts as a barrier before the setting of the high bits. */\
+ __r.pte_low = xchg(&(__ptep)->pte_low, (__newval).pte_low); \
+ __r.pte_high = (__ptep)->pte_high; \
+ (__ptep)->pte_high = (__newval).pte_high; \
+ flush_tlb_page(__vma, __addr); \
+ (__r); \
+})
+
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
+
+static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ return cmpxchg((unsigned int *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
+
#endif /* _I386_PGTABLE_3LEVEL_H */
Index: linux-2.6.9/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable-2level.h 2004-10-18 14:54:31.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgtable-2level.h 2004-11-19 10:05:27.000000000 -0800
@@ -82,4 +82,7 @@
#define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })

+/* Atomic PTE operations */
+#define ptep_cmpxchg(__vma,__a,__xp,__oldpte,__newpte) (cmpxchg(&(__xp)->pte_low, (__oldpte).pte_low, (__newpte).pte_low)==(__oldpte).pte_low)
+
#endif /* _I386_PGTABLE_2LEVEL_H */
Index: linux-2.6.9/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgalloc.h 2004-10-18 14:53:10.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgalloc.h 2004-11-19 10:10:40.000000000 -0800
@@ -4,9 +4,12 @@
#include <linux/config.h>
#include <asm/processor.h>
#include <asm/fixmap.h>
+#include <asm/system.h>
#include <linux/threads.h>
#include <linux/mm.h> /* for struct page */

+#define PMD_NONE 0L
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -16,6 +19,19 @@
((unsigned long long)page_to_pfn(pte) <<
(unsigned long long) PAGE_SHIFT)));
}
+
+/* Atomic version */
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+#ifdef CONFIG_X86_PAE
+ return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE +
+ ((unsigned long long)page_to_pfn(pte) <<
+ (unsigned long long) PAGE_SHIFT) ) == PMD_NONE;
+#else
+ return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+#endif
+}
+
/*
* Allocate and free page tables.
*/
@@ -49,6 +65,7 @@
#define pmd_free(x) do { } while (0)
#define __pmd_free_tlb(tlb,x) do { } while (0)
#define pgd_populate(mm, pmd, pte) BUG()
+#define pgd_test_and_populate(mm, pmd, pte) ({ BUG(); 1; })

#define check_pgt_cache() do { } while (0)


Christoph Lameter
2004-11-19 19:47:14 UTC
Permalink
Changelog
* Provide atomic pte operations for s390

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-s390/pgtable.h 2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/include/asm-s390/pgtable.h 2004-11-19 11:35:08.000000000 -0800
@@ -567,6 +567,15 @@
return pte;
}

+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ struct mm_struct *__mm = __vma->vm_mm; \
+ pte_t __pte; \
+ __pte = ptep_clear_flush(__vma, __address, __ptep); \
+ set_pte(__ptep, __pteval); \
+ __pte; \
+})
+
static inline void ptep_set_wrprotect(pte_t *ptep)
{
pte_t old_pte = *ptep;
@@ -778,6 +787,14 @@

#define kern_addr_valid(addr) (1)

+/* Atomic PTE operations */
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
+static inline int ptep_cmpxchg (struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
/*
* No page table caches to initialise
*/
@@ -791,6 +808,7 @@
#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define __HAVE_ARCH_PTEP_CLEAR_FLUSH
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
Index: linux-2.6.9/include/asm-s390/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-s390/pgalloc.h 2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/include/asm-s390/pgalloc.h 2004-11-19 11:33:25.000000000 -0800
@@ -97,6 +97,10 @@
pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd);
}

+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
+{
+ return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV;
+}
#endif /* __s390x__ */

static inline void
@@ -119,6 +123,18 @@
pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT));
}

+static inline int
+pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page)
+{
+ int rc;
+ spin_lock(&mm->page_table_lock);
+
+ rc=pte_same(*pmd, _PAGE_INVALID_EMPTY);
+ if (rc) pmd_populate(mm, pmd, page);
+ spin_unlock(&mm->page_table_lock);
+ return rc;
+}
+
/*
* page table entry allocation/free routines.
*/

Linus Torvalds
2004-11-19 19:59:03 UTC
Permalink
You could also make "rss" be a _signed_ integer per-thread.

When unmapping a page, you decrement one of the threads that shares the mm
(doesn't matter which - which is why the per-thread rss may go negative),
and when mapping a page you increment it.

Then, anybody who actually wants a global rss can just iterate over
threads and add it all up. If you do it under the mmap_sem, it's stable,
and if you do it outside the mmap_sem it's imprecise but stable in the
long term (ie errors never _accumulate_, like the non-atomic case will
do).

Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of
a lot better than the periodic scan.

Linus
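
As a sketch (names illustrative), that scheme would look roughly like this:

	/* fault path: the faulting thread charges itself, no lock, no atomic */
	current->rss++;

	/* unmap path (may run in kswapd): charge any one thread sharing the
	 * mm (say, the first on the mm's thread list), so that thread's
	 * counter can legitimately go negative */
	some_task_sharing(mm)->rss--;

	/* a reader sums the per-thread counters: exact under mmap_sem held
	 * for writing, approximate but with non-accumulating error without it */

Here some_task_sharing() is just shorthand for picking any task attached to
the mm; it is not a real helper.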
Nick Piggin
2004-11-20 01:07:51 UTC
Permalink
Post by Linus Torvalds
You could also make "rss" be a _signed_ integer per-thread.
When unmapping a page, you decrement one of the threads that shares the mm
(doesn't matter which - which is why the per-thread rss may go negative),
and when mapping a page you increment it.
Then, anybody who actually wants a global rss can just iterate over
threads and add it all up. If you do it under the mmap_sem, it's stable,
and if you do it outside the mmap_sem it's imprecise but stable in the
long term (ie errors never _accumulate_, like the non-atomic case will
do).
Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of
a lot better than the periodic scan.
I think this sounds like it might be a good idea. I prefer it to having
the unbounded error of sloppy rss (as improbable as it may be in practice).

The per thread rss may wrap (maybe not 64-bit counters), but even so,
the summation over all threads should still end up being correct I
think.
Christoph Lameter
2004-11-20 01:29:06 UTC
Permalink
Post by Nick Piggin
I think this sounds like it might be a good idea. I prefer it to having
the unbounded error of sloppy rss (as improbable as it may be in practice).
It may also be faster since the processors can have exclusive cache lines.

This means we need to move rss into the task struct. But how does one get
from mm struct to task struct? current is likely available most of
the time. Is that always the case?
Post by Nick Piggin
The per thread rss may wrap (maybe not 64-bit counters), but even so,
the summation over all threads should still end up being correct I
think.
Note though that the mmap_sem is no protection. It is a read lock and may
be held by multiple processes while incrementing and decrementing rss.
This likely reduces the number of collisions significantly, but it won't
be a guarantee like locking or atomic ops.
Nick Piggin
2004-11-20 01:45:59 UTC
Permalink
Post by Christoph Lameter
Post by Nick Piggin
I think this sounds like it might be a good idea. I prefer it to having
the unbounded error of sloppy rss (as improbable as it may be in practice).
It may also be faster since the processors can have exclusive cache lines.
Yep.
Post by Christoph Lameter
This means we need to move rss into the task struct. But how does one get
from mm struct to task struct? current is likely available most of
the time. Is that always the case?
It is available everywhere that mm_struct is, I guess. So yes, I
think `current` should be OK.
Post by Christoph Lameter
Post by Nick Piggin
The per thread rss may wrap (maybe not 64-bit counters), but even so,
the summation over all threads should still end up being correct I
think.
Note though that the mmap_sem is no protection. It is a read lock and may
be held by multiple processes while incrementing and decrementing rss.
This likely reduces the number of collisions significantly, but it won't
be a guarantee like locking or atomic ops.
Yeah the read lock won't do anything to serialise it. I think what Linus
is saying is that we _don't care_ most of the time (because the error will
be bounded). But if it happened that we really do care anywhere, then the
write lock should be sufficient.
Linus Torvalds
2004-11-20 01:58:25 UTC
Permalink
Post by Christoph Lameter
Note though that the mmap_sem is no protection. It is a read lock and may
be held by multiple processes while incrementing and decrementing rss.
This is likely reducing the number of collisions significantly but it wont
be a guarantee like locking or atomic ops.
It is, though, if you hold it for a write.

The point being that you _can_ get an exact rss value if you want to.

Not that I really see any overwhelming evidence of anybody ever really
caring, but it's nice to know that you have the option.

Linus
Linus Torvalds
2004-11-20 02:06:52 UTC
Permalink
Post by Linus Torvalds
Not that I really see any overwhelming evidence of anybody ever really
caring, but it's nice to know that you have the option.
Btw, if you are going to look at doing this rss thing, you need to make
sure that thread exit ends up adding its rss to _some_ remaining sibling.

I guess that was obvious, but it's worth pointing out. That may actually
be the only case where we do _not_ have a nice SMP-safe access: we do have
a stable sibling (tsk->thread_leader), but we don't have any good
serialization _except_ for taking mmap_sem for writing. Which we currently
don't do: we take it for reading (and then we possibly upgrade it to a
write lock if we notice that there is a core-dump starting).

We can avoid this too by having a per-mm atomic rss "spill" counter. So
exit_mm() would basically do:

...
tsk->mm = NULL;
atomic_add(tsk->rss, &mm->rss_spill);
...

and then the algorithm for getting rss would be:

rss = atomic_read(mm->rss_spill);
for_each_thread(..)
rss += tsk->rss;

Or does anybody see any better approaches?

Linus
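
Fleshing that out slightly (illustrative only; rss_spill and the iteration
helper are not from any real patch):

	/* exit_mm(): fold the exiting thread's delta into the mm */
	tsk->mm = NULL;
	atomic_add(tsk->rss, &mm->rss_spill);

	/* reader: spilled counts plus the threads still attached */
	static long mm_total_rss(struct mm_struct *mm)
	{
		struct task_struct *t;
		long rss = atomic_read(&mm->rss_spill);

		for_each_thread_sharing(t, mm)	/* stand-in for whatever loop is used */
			rss += t->rss;
		return rss;
	}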
Linus Torvalds
2004-11-20 01:56:43 UTC
Permalink
Post by Nick Piggin
The per thread rss may wrap (maybe not 64-bit counters), but even so,
the summation over all threads should still end up being correct I
think.
Yes. As long as the total rss fits in an int, it doesn't matter if any of
them wrap. Addition is still associative in twos-complement arithmetic
even in the presense of overflows.

If you actually want to make it proper standard C, I guess you'd have to
make the thing unsigned, which gives you the mod-2**n guarantees even if
somebody were to ever make a non-twos-complement machine.

Linus
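
A trivial userspace illustration of that point:

	#include <stdio.h>

	int main(void)
	{
		/* two "threads": one only maps pages, one only unmaps them */
		unsigned int a = 0, b = 0;
		int i;

		for (i = 0; i < 100000; i++)
			a++;		/* 100000 pages mapped */
		for (i = 0; i < 70000; i++)
			b--;		/* 70000 pages unmapped; b wraps */

		/* b on its own is a huge number, but mod 2^32 the sum is exact */
		printf("rss = %u\n", a + b);	/* prints 30000 */
		return 0;
	}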
Bill Davidsen
2004-11-22 18:06:03 UTC
Permalink
Post by Linus Torvalds
Post by Nick Piggin
The per thread rss may wrap (maybe not 64-bit counters), but even so,
the summation over all threads should still end up being correct I
think.
Yes. As long as the total rss fits in an int, it doesn't matter if any of
them wrap. Addition is still associative in twos-complement arithmetic
even in the presense of overflows.
If you actually want to make it proper standard C, I guess you'd have to
make the thing unsigned, which gives you the mod-2**n guarantees even if
somebody were to ever make a non-twos-complement machine.
I think other stuff breaks as well; I think I saw you post some example
code using something like (a & -a) or similar within the last few
months. Fortunately neither 1's complement nor BCD is likely to return in
hardware. Big-endian vs. little-endian is still an issue, though.
--
-bill davidsen (***@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

William Lee Irwin III
2004-11-20 02:03:06 UTC
Permalink
Post by Linus Torvalds
You could also make "rss" be a _signed_ integer per-thread.
When unmapping a page, you decrement one of the threads that shares the mm
(doesn't matter which - which is why the per-thread rss may go negative),
and when mapping a page you increment it.
Then, anybody who actually wants a global rss can just iterate over
threads and add it all up. If you do it under the mmap_sem, it's stable,
and if you do it outside the mmap_sem it's imprecise but stable in the
long term (ie errors never _accumulate_, like the non-atomic case will
do).
Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of
a lot better than the periodic scan.
Unprivileged triggers for full-tasklist scans are NMI oops material.


-- wli
Nick Piggin
2004-11-20 02:25:37 UTC
Permalink
Post by William Lee Irwin III
Post by Linus Torvalds
You could also make "rss" be a _signed_ integer per-thread.
When unmapping a page, you decrement one of the threads that shares the mm
(doesn't matter which - which is why the per-thread rss may go negative),
and when mapping a page you increment it.
Then, anybody who actually wants a global rss can just iterate over
threads and add it all up. If you do it under the mmap_sem, it's stable,
and if you do it outside the mmap_sem it's imprecise but stable in the
long term (ie errors never _accumulate_, like the non-atomic case will
do).
Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of
a lot better than the periodic scan.
Unprivileged triggers for full-tasklist scans are NMI oops material.
What about pushing the per-thread rss delta back into the global atomic
rss counter in each schedule()?

Pros:
This would take the task exiting problem into its stride as a matter of
course.

Single atomic read to get rss.

Cons:
would just be moving the atomic op somewhere else if we don't get
many page faults per schedule.

Not really nice dependancies.

Assumes schedule (not context switch) must occur somewhat regularly.
At present this is not true for SCHED_FIFO tasks.


Too nasty?
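
As a sketch, the flush would be something like this (rss_delta and an atomic
mm->rss are assumptions of this particular variant, not existing fields):

	/* called from schedule(), before switching away */
	static inline void flush_rss_delta(struct task_struct *tsk)
	{
		if (tsk->mm && tsk->rss_delta) {
			atomic_add(tsk->rss_delta, &tsk->mm->rss);
			tsk->rss_delta = 0;
		}
	}

The fault path would then only do current->rss_delta++ with no atomic op,
and a reader gets the total from a single atomic_read(&mm->rss), modulo
whatever has not been flushed yet.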
William Lee Irwin III
2004-11-20 02:41:04 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
Unprivileged triggers for full-tasklist scans are NMI oops material.
What about pushing the per-thread rss delta back into the global atomic
rss counter in each schedule()?
This would take the task exiting problem into its stride as a matter of
course.
Single atomic read to get rss.
would just be moving the atomic op somewhere else if we don't get
many page faults per schedule.
Not really nice dependancies.
Assumes schedule (not context switch) must occur somewhat regularly.
At present this is not true for SCHED_FIFO tasks.
Too nasty?
This doesn't sound too hot. There's enough accounting that can't be
done anywhere but schedule(), and this can be done elsewhere. Plus,
you're moving an already too-frequent operation to a more frequent
callsite.


-- wli
Nick Piggin
2004-11-20 02:46:11 UTC
Permalink
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
Unprivileged triggers for full-tasklist scans are NMI oops material.
What about pushing the per-thread rss delta back into the global atomic
rss counter in each schedule()?
This would take the task exiting problem into its stride as a matter of
course.
Single atomic read to get rss.
would just be moving the atomic op somewhere else if we don't get
many page faults per schedule.
Not really nice dependancies.
Assumes schedule (not context switch) must occur somewhat regularly.
At present this is not true for SCHED_FIFO tasks.
Too nasty?
This doesn't sound too hot. There's enough accounting that can't be
done anywhere but schedule(), and this can be done elsewhere. Plus,
you're moving an already too-frequent operation to a more frequent
callsite.
No, it won't somehow increase the number of atomic rss operations
just because schedule is called more often. The number of ops will
be at _most_ the number of page faults.

But I agree with your overall evaluation of its 'hotness'. Just
another idea. Give this monkey another thousand years at the keys
and he'll come up with the perfect solution :P
Nick Piggin
2004-11-20 03:37:04 UTC
Permalink
Post by William Lee Irwin III
Post by Linus Torvalds
You could also make "rss" be a _signed_ integer per-thread.
When unmapping a page, you decrement one of the threads that shares the mm
(doesn't matter which - which is why the per-thread rss may go negative),
and when mapping a page you increment it.
Then, anybody who actually wants a global rss can just iterate over
threads and add it all up. If you do it under the mmap_sem, it's stable,
and if you do it outside the mmap_sem it's imprecise but stable in the
long term (ie errors never _accumulate_, like the non-atomic case will
do).
Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of
a lot better than the periodic scan.
Unprivileged triggers for full-tasklist scans are NMI oops material.
Hang on, let's come back to this...

We already have unprivileged do-for-each-thread triggers in the proc
code. It's in do_task_stat, even. Rss reporting would basically just
involve one extra addition within that loop.

So... hmm, I can't see a problem with it.
William Lee Irwin III
2004-11-20 03:55:10 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
Unprivileged triggers for full-tasklist scans are NMI oops material.
Hang on, let's come back to this...
We already have unprivileged do-for-each-thread triggers in the proc
code. It's in do_task_stat, even. Rss reporting would basically just
involve one extra addition within that loop.
So... hmm, I can't see a problem with it.
/proc/ triggering NMI oopses was a persistent problem even before that
code was merged. I've not bothered testing it as it at best aggravates it.

And thread groups can share mm's. do_for_each_thread() won't suffice.


-- wli
Nick Piggin
2004-11-20 04:03:17 UTC
Permalink
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
Unprivileged triggers for full-tasklist scans are NMI oops material.
Hang on, let's come back to this...
We already have unprivileged do-for-each-thread triggers in the proc
code. It's in do_task_stat, even. Rss reporting would basically just
involve one extra addition within that loop.
So... hmm, I can't see a problem with it.
/proc/ triggering NMI oopses was a persistent problem even before that
code was merged. I've not bothered testing it as it at best aggravates it.
It isn't a problem. If it ever became a problem then we can just
touch the nmi oopser in the loop.
Post by William Lee Irwin III
And thread groups can share mm's. do_for_each_thread() won't suffice.
I think it will be just fine.
Nick Piggin
2004-11-20 04:06:04 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
And thread groups can share mm's. do_for_each_thread() won't suffice.
I think it will be just fine.
Sorry, I misread. I think having per-thread rss counters will be
fine (regardless of whether or not do_for_each_thread itself will
suffice).
William Lee Irwin III
2004-11-20 04:23:40 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
/proc/ triggering NMI oopses was a persistent problem even before that
code was merged. I've not bothered testing it as it at best aggravates it.
It isn't a problem. If it ever became a problem then we can just
touch the nmi oopser in the loop.
Very, very wrong. The tasklist scans hold the read side of the lock
and aren't even what's running with interrupts off. The contenders
on the write side are what the NMI oopser oopses.

And supposing the arch reenables interrupts in the write side's
spinloop, you just get a box that silently goes out of service for
extended periods of time, breaking cluster membership and more. The
NMI oopser is just the report of the problem, not the problem itself.
It's not a false report. The box is dead for > 5s at a time.
Post by Nick Piggin
Post by William Lee Irwin III
And thread groups can share mm's. do_for_each_thread() won't suffice.
I think it will be just fine.
And that makes it wrong on both counts. The above fails any time
LD_ASSUME_KERNEL=2.4 is used, as well as when actual Linux features
are used directly.


-- wli
Nick Piggin
2004-11-20 04:29:29 UTC
Permalink
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
/proc/ triggering NMI oopses was a persistent problem even before that
code was merged. I've not bothered testing it as it at best aggravates it.
It isn't a problem. If it ever became a problem then we can just
touch the nmi oopser in the loop.
Very, very wrong. The tasklist scans hold the read side of the lock
and aren't even what's running with interrupts off. The contenders
on the write side are what the NMI oopser oopses.
*blinks*

So explain how this is "very very wrong", then?
Post by William Lee Irwin III
And supposing the arch reenables interrupts in the write side's
spinloop, you just get a box that silently goes out of service for
extended periods of time, breaking cluster membership and more. The
NMI oopser is just the report of the problem, not the problem itself.
It's not a false report. The box is dead for > 5s at a time.
The point is, adding a for-each-thread loop or two in /proc isn't
going to cause a problem that isn't already there.

If you had zero for-each-thread loops then you might have a valid
complaint. Seeing as you have more than zero, with slim chances of
reducing that number, then there is no valid complaint.
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
And thread groups can share mm's. do_for_each_thread() won't suffice.
I think it will be just fine.
And that makes it wrong on both counts. The above fails any time
LD_ASSUME_KERNEL=2.4 is used, as well as when actual Linux features
are used directly.
See my followup.
William Lee Irwin III
2004-11-20 05:38:02 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
Very, very wrong. The tasklist scans hold the read side of the lock
and aren't even what's running with interrupts off. The contenders
on the write side are what the NMI oopser oopses.
*blinks*
So explain how this is "very very wrong", then?
There isn't anything left to explain. So if there's a question, be
specific about it.
Post by Nick Piggin
Post by William Lee Irwin III
And supposing the arch reenables interrupts in the write side's
spinloop, you just get a box that silently goes out of service for
extended periods of time, breaking cluster membership and more. The
NMI oopser is just the report of the problem, not the problem itself.
It's not a false report. The box is dead for > 5s at a time.
The point is, adding a for-each-thread loop or two in /proc isn't
going to cause a problem that isn't already there.
If you had zero for-each-thread loops then you might have a valid
complaint. Seeing as you have more than zero, with slim chances of
reducing that number, then there is no valid complaint.
This entire line of argument is bogus. A preexisting bug of a similar
nature is not grounds for deliberately introducing any bug.


-- wli
Nick Piggin
2004-11-20 05:50:25 UTC
Permalink
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
Very, very wrong. The tasklist scans hold the read side of the lock
and aren't even what's running with interrupts off. The contenders
on the write side are what the NMI oopser oopses.
*blinks*
So explain how this is "very very wrong", then?
There isn't anything left to explain. So if there's a question, be
specific about it.
Why am I very very wrong? Why won't touch_nmi_watchdog work from
the read loop?

And let's just be nice and try not to jump at the chance to point
out when people are very very wrong, and keep count of the times
they have been very very wrong. I'm trying to be constructive.
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
And supposing the arch reenables interrupts in the write side's
spinloop, you just get a box that silently goes out of service for
extended periods of time, breaking cluster membership and more. The
NMI oopser is just the report of the problem, not the problem itself.
It's not a false report. The box is dead for > 5s at a time.
The point is, adding a for-each-thread loop or two in /proc isn't
going to cause a problem that isn't already there.
If you had zero for-each-thread loops then you might have a valid
complaint. Seeing as you have more than zero, with slim chances of
reducing that number, then there is no valid complaint.
This entire line of argument is bogus. A preexisting bug of a similar
nature is not grounds for deliberately introducing any bug.
Sure, if that is a bug and someone is just about to fix it then
yes you're right, we shouldn't introduce this. I didn't realise
it was a bug. Sounds like it would be causing you lots of problems
though - have you looked at how to fix it?
William Lee Irwin III
2004-11-20 06:23:41 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
There isn't anything left to explain. So if there's a question, be
specific about it.
Why am I very very wrong? Why won't touch_nmi_watchdog work from
the read loop?
And let's just be nice and try not to jump at the chance to point
out when people are very very wrong, and keep count of the times
they have been very very wrong. I'm trying to be constructive.
touch_nmi_watchdog() is only "protection" against local interrupt
disablement triggering the NMI oopser because alert_counter[]
increments are not atomic. Yet even supposing they were made so, the
net effect of "covering up" this gross deficiency is making the
user-observable problems it causes undiagnosable, as noted before.
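
For readers without the i386 NMI watchdog code in their head, the pattern
being criticized looks roughly like the self-contained toy model below; the
real alert_counter handling differs in detail, and the names and the
threshold here are made up.

/* Toy model of the race: an NMI tick does a plain read-modify-write on a
 * per-cpu counter, while a touch_nmi_watchdog()-style reset clears all
 * counters from normal context on some other cpu.  Because neither side
 * is atomic, a reset can be lost to a concurrent increment. */
#define MODEL_NR_CPUS 4
#define ALERT_LIMIT   5             /* made-up threshold */

static int alert_counter[MODEL_NR_CPUS];

/* modelled NMI context on 'cpu' */
static int nmi_tick(int cpu, int saw_progress)
{
    if (saw_progress) {
        alert_counter[cpu] = 0;
        return 0;
    }
    alert_counter[cpu]++;           /* non-atomic RMW ...              */
    return alert_counter[cpu] > ALERT_LIMIT;
}

/* modelled normal context, possibly running on another cpu */
static void touch_watchdog(void)
{
    int i;

    for (i = 0; i < MODEL_NR_CPUS; i++)
        alert_counter[i] = 0;       /* ... which this can lose against */
}

int main(void)
{
    nmi_tick(1, 0);
    touch_watchdog();
    return alert_counter[1];        /* 0 here; the race needs real concurrency */
}
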
Post by Nick Piggin
Post by William Lee Irwin III
This entire line of argument is bogus. A preexisting bug of a similar
nature is not grounds for deliberately introducing any bug.
Sure, if that is a bug and someone is just about to fix it then
yes you're right, we shouldn't introduce this. I didn't realise
it was a bug. Sounds like it would be causing you lots of problems
though - have you looked at how to fix it?
Kevin Marin was the first to report this issue to lkml. I had seen
instances of it in internal corporate bugreports and it was one of
the motivators for the work I did on pidhashing (one of the causes
of the timeouts was worst cases in pid allocation). Manfred Spraul
and myself wrote patches attempting to reduce read-side hold time
in /proc/ algorithms, Ingo Molnar wrote patches to hierarchically
subdivide the /proc/ iterations, and Dipankar Sarma and Maneesh
Soni wrote patches to carry out the long iterations in /proc/ locklessly.

The last several of these affecting /proc/ have not gained acceptance,
though the work has not been halted in any sense, as this problem
recurs quite regularly. A considerable amount of sustained effort has
gone toward mitigating and resolving rwlock starvation.

Aggravating the rwlock starvation destabilizes, not pessimizes,
and performance is secondary to stability.


-- wli
Nick Piggin
2004-11-20 06:49:53 UTC
Permalink
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
There isn't anything left to explain. So if there's a question, be
specific about it.
Why am I very very wrong? Why won't touch_nmi_watchdog work from
the read loop?
And let's just be nice and try not to jump at the chance to point
out when people are very very wrong, and keep count of the times
they have been very very wrong. I'm trying to be constructive.
touch_nmi_watchdog() is only "protection" against local interrupt
disablement triggering the NMI oopser because alert_counter[]
increments are not atomic. Yet even supposing they were made so, the
That would be a bug in touch_nmi_watchdog then, because you're
racy against your own NMI too.

So I'm actually not very very wrong at all. I'm technically wrong
because touch_nmi_watchdog has a theoretical 'bug'. In practice,
multiple races with the non atomic increments to the same counter,
and in an unbroken sequence would be about as likely as hardware
failure.

Anyway, this touch nmi thing is going off topic, sorry list.
Post by William Lee Irwin III
net effect of "covering up" this gross deficiency is making the
user-observable problems it causes undiagnosable, as noted before.
Well the loops that are in there now aren't covered up, and they
don't seem to be causing problems. Ergo there is no problem (we're
being _practical_ here, right?)
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
This entire line of argument is bogus. A preexisting bug of a similar
nature is not grounds for deliberately introducing any bug.
Sure, if that is a bug and someone is just about to fix it then
yes you're right, we shouldn't introduce this. I didn't realise
it was a bug. Sounds like it would be causing you lots of problems
though - have you looked at how to fix it?
Kevin Marin was the first to report this issue to lkml. I had seen
instances of it in internal corporate bugreports and it was one of
the motivators for the work I did on pidhashing (one of the causes
of the timeouts was worst cases in pid allocation). Manfred Spraul
and myself wrote patches attempting to reduce read-side hold time
in /proc/ algorithms, Ingo Molnar wrote patches to hierarchically
subdivide the /proc/ iterations, and Dipankar Sarma and Maneesh
Soni wrote patches to carry out the long iterations in /proc/ locklessly.
The last several of these affecting /proc/ have not gained acceptance,
though the work has not been halted in any sense, as this problem
recurs quite regularly. A considerable amount of sustained effort has
gone toward mitigating and resolving rwlock starvation.
That's very nice. But there is no problem _now_, is there?
Post by William Lee Irwin III
Aggravating the rwlock starvation destabilizes, not pessimizes,
and performance is secondary to stability.
Well, luckily we're not going to be aggravating the rwlock starvation.

If you found a problem with, and fixed do_task_stat: ?time, ???_flt,
et al, then you would apply the same solution to per thread rss to
fix it in the same way.
Andrew Morton
2004-11-20 06:57:01 UTC
Permalink
Post by Nick Piggin
per thread rss
Given that we have contention problems updating a single mm-wide rss and
given that the way to fix that up is to spread things out a bit, it seems
wildly arbitrary to me that the way in which we choose to spread the
counter out is to stick a bit of it into each task_struct.

I'd expect that just shoving a pointer into mm_struct which points at a
dynamically allocated array[NR_CPUS] of longs would suffice. We probably
don't even need to spread them out on cachelines - having four or eight
cpus sharing the same cacheline probably isn't going to hurt much.

At least, that'd be my first attempt. If it's still not good enough, try
something else.
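
In other words, something along these lines, as a userspace sketch with
invented names (mm_model and friends), not actual kernel code: the mm
carries a pointer to an allocated array of longs, one slot per cpu, the
fault path bumps the local slot, and readers sum the array.

#include <stdlib.h>

struct mm_model {
    long *rss;          /* array with one long per cpu, allocated with the mm */
    int nr_cpus;
};

static int mm_model_init(struct mm_model *mm, int nr_cpus)
{
    mm->rss = calloc(nr_cpus, sizeof(*mm->rss));
    mm->nr_cpus = nr_cpus;
    return mm->rss ? 0 : -1;
}

/* fault path, running on 'cpu': no mm-wide lock, no atomic op */
static void mm_model_add_rss(struct mm_model *mm, int cpu, long pages)
{
    mm->rss[cpu] += pages;
}

/* read side (e.g. /proc): sum the slots; the result is approximate */
static long mm_model_read_rss(const struct mm_model *mm)
{
    long total = 0;
    int i;

    for (i = 0; i < mm->nr_cpus; i++)
        total += mm->rss[i];
    return total;
}

int main(void)
{
    struct mm_model mm;

    if (mm_model_init(&mm, 8))
        return 1;
    mm_model_add_rss(&mm, 0, 3);
    mm_model_add_rss(&mm, 5, 2);
    return mm_model_read_rss(&mm) != 5;
}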

Andrew Morton
2004-11-20 07:04:18 UTC
Permalink
Post by Andrew Morton
I'd expect that just shoving a pointer into mm_struct which points at a
dynamically allocated array[NR_CPUS] of longs would suffice.
One might even be able to use percpu_counter.h, although that might end up
hurting many-cpu fork times, due to all that work in __alloc_percpu().
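
The pattern percpu_counter implements is, roughly, batching: each cpu keeps
a small local delta and only folds it into the shared count once the delta
crosses a threshold, so the shared cacheline is written rarely. A hedged
userspace sketch of that idea follows; the names and the batch size are
invented, and the real <linux/percpu_counter.h> interface and locking are
not reproduced here.

#define SKETCH_NR_CPUS 8
#define SKETCH_BATCH   32           /* made-up threshold */

struct batched_counter {
    long count;                     /* shared, approximately up to date */
    long local[SKETCH_NR_CPUS];     /* small per-cpu deltas */
};

static void counter_add(struct batched_counter *c, int cpu, long n)
{
    long v = c->local[cpu] + n;

    if (v >= SKETCH_BATCH || v <= -SKETCH_BATCH) {
        c->count += v;              /* the kernel version serializes this fold */
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = v;
    }
}

static long counter_read(const struct batched_counter *c)
{
    return c->count;                /* cheap, but stale by the unfolded deltas */
}

int main(void)
{
    struct batched_counter c = { 0, { 0 } };
    int i;

    for (i = 0; i < 1000; i++)
        counter_add(&c, i % SKETCH_NR_CPUS, 1);
    return counter_read(&c) > 1000; /* never overshoots in this toy */
}
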
Nick Piggin
2004-11-20 07:13:03 UTC
Permalink
Post by Andrew Morton
Post by Nick Piggin
per thread rss
Given that we have contention problems updating a single mm-wide rss and
given that the way to fix that up is to spread things out a bit, it seems
wildly arbitrary to me that the way in which we choose to spread the
counter out is to stick a bit of it into each task_struct.
I'd expect that just shoving a pointer into mm_struct which points at a
dynamically allocated array[NR_CPUS] of longs would suffice. We probably
don't even need to spread them out on cachelines - having four or eight
cpus sharing the same cacheline probably isn't going to hurt much.
At least, that'd be my first attempt. If it's still not good enough, try
something else.
That is what Bill thought too. I guess per-cpu and per-thread rss are
the leading candidates.

Per thread rss has the benefits of cacheline exclusivity, and not
causing task bloat in the common case.

Per CPU array has better worst case /proc properties, but shares
cachelines (or not, if using percpu_counter as you suggested).


I think I'd better leave it to others to finish off the arguments ;)
William Lee Irwin III
2004-11-20 08:00:49 UTC
Permalink
Post by Nick Piggin
Post by Andrew Morton
Given that we have contention problems updating a single mm-wide rss and
given that the way to fix that up is to spread things out a bit, it seems
wildly arbitrary to me that the way in which we choose to spread the
counter out is to stick a bit of it into each task_struct.
I'd expect that just shoving a pointer into mm_struct which points at a
dynamically allocated array[NR_CPUS] of longs would suffice. We probably
don't even need to spread them out on cachelines - having four or eight
cpus sharing the same cacheline probably isn't going to hurt much.
At least, that'd be my first attempt. If it's still not good enough, try
something else.
That is what Bill thought too. I guess per-cpu and per-thread rss are
the leading candidates.
Per thread rss has the benefits of cacheline exclusivity, and not
causing task bloat in the common case.
Per CPU array has better worst case /proc properties, but shares
cachelines (or not, if using percpu_counter as you suggested).
I think I'd better leave it to others to finish off the arguments ;)
(1) The "task bloat" is more than tolerable on the systems capable
of having enough cpus to see significant per-process
memory footprint, where "significant" is smaller than a
pagetable page even for systems twice as large as now shipped.
(2) The cacheline exclusivity is not entirely gone in dense per-cpu
arrays, it's merely "approximated" by sharing amongst small
groups of adjacent cpus. This is fine for e.g. NUMA because
those small groups of adjacent cpus will typically be on nearby
nodes.
(3) The price paid to get "perfect exclusivity" instead of "approximate
exclusivity" is unbounded tasklist_lock hold time, which takes
boxen down outright in every known instance.

The properties are not for /proc/, they are for tasklist_lock. Every
read stops all other writes. When you hold tasklist_lock for an
extended period of time for read or write, (e.g. exhaustive tasklist
search) you stop all fork()'s and exit()'s and execve()'s on a running
system. The "worst case" analysis has nothing to do with speed. It has
everything to do with taking a box down outright, much like unplugging
power cables or dereferencing NULL. Unbounded tasklist_lock hold time
kills running boxen dead.

Read sides of rwlocks are not licenses to spin for aeons with locks held.

And the "question" of sufficiency has in fact already been answered.
SGI's own testing during the 2.4 out-of-tree patching cycle determined
that an mm-global atomic counter was already sufficient so long as the
cacheline was not shared with ->mmap_sem and the like. The "simplest"
optimization of moving the field out of the way of ->mmap_sem already
worked. The grander ones, if and *ONLY* if they don't have showstoppers
like unbounded tasklist_lock hold time or castrating workload monitoring
to unusability, will merely be more robust for future systems.
Reiterating, this is all just fine so long as they don't cause any
showstopping problems, like castrating the ability to monitor
processes, or introducing more tasklist_lock starvation.


-- wli
Martin J. Bligh
2004-11-20 16:59:36 UTC
Permalink
Post by Nick Piggin
Post by Andrew Morton
Given that we have contention problems updating a single mm-wide rss and
given that the way to fix that up is to spread things out a bit, it seems
wildly arbitrary to me that the way in which we choose to spread the
counter out is to stick a bit of it into each task_struct.
I'd expect that just shoving a pointer into mm_struct which points at a
dynamically allocated array[NR_CPUS] of longs would suffice. We probably
don't even need to spread them out on cachelines - having four or eight
cpus sharing the same cacheline probably isn't going to hurt much.
At least, that'd be my first attempt. If it's still not good enough, try
something else.
That is what Bill thought too. I guess per-cpu and per-thread rss are
the leading candidates.
Per thread rss has the benefits of cacheline exclusivity, and not
causing task bloat in the common case.
Per CPU array has better worst case /proc properties, but shares
cachelines (or not, if using percpu_counter as you suggested).
Per thread seems much nicer to me - mainly because it degrades cleanly to
a single counter for 99% of processes, which are single threaded.

M.

Linus Torvalds
2004-11-20 17:14:11 UTC
Permalink
Post by Martin J. Bligh
Per thread seems much nicer to me - mainly because it degrades cleanly to
a single counter for 99% of processes, which are single threaded.
I will pretty much guarantee that if you put the per-thread patches next
to some abomination with per-cpu allocation for each mm, the choice will
be clear. Especially if the per-cpu/per-mm thing tries to avoid false
cacheline sharing, which sounds really "interesting" in itself.

And without the cacheline sharing avoidance, what's the point of this
again? It sure wasn't to make the code simpler. It was about performance
and scalability.

Linus
William Lee Irwin III
2004-11-20 19:08:18 UTC
Permalink
Post by Linus Torvalds
I will pretty much guarantee that if you put the per-thread patches next
to some abomination with per-cpu allocation for each mm, the choice will
be clear. Especially if the per-cpu/per-mm thing tries to avoid false
cacheline sharing, which sounds really "interesting" in itself.
And without the cacheline sharing avoidance, what's the point of this
again? It sure wasn't to make the code simpler. It was about performance
and scalability.
"The perfect is the enemy of the good."

The "perfect" cacheline separation achieved that way is at the cost of
destabilizing the kernel. The dense per-cpu business is only really a
concession to the notion that the counter needs to be split up at all,
which has never been demonstrated with performance measurements. In fact,
Robin Holt has performance measurements demonstrating the opposite.

The "good" alternatives are negligibly different wrt. performance, and
don't carry the high cost of rwlock starvation that breaks boxen.


-- wli
Linus Torvalds
2004-11-20 19:16:12 UTC
Permalink
Post by William Lee Irwin III
"The perfect is the enemy of the good."
Yes. But in this case, my suggestion _is_ the good. You seem to be pushing
for a really horrid thing which allocates a per-cpu array for each
mm_struct.

What is it that you have against the per-thread rss? We already have
several places that do the thread-looping, so it's not like "you can't do
that" is a valid argument.

Linus
William Lee Irwin III
2004-11-20 19:33:25 UTC
Permalink
Post by Linus Torvalds
Post by William Lee Irwin III
"The perfect is the enemy of the good."
Yes. But in this case, my suggestion _is_ the good. You seem to be pushing
for a really horrid thing which allocates a per-cpu array for each
mm_struct.
What is it that you have against the per-thread rss? We already have
several places that do the thread-looping, so it's not like "you can't do
that" is a valid argument.
Okay, first thread groups can share mm's, so it's worse than iterating
over a thread group. Second, the long loops under tasklist_lock didn't
stop causing rwlock starvation because what patches there were to do
something about them didn't get merged.

I'm not particularly "stuck on" the per-cpu business, it was merely the
most obvious method of splitting the RSS counter without catastrophes
elsewhere. Robin Holt's 2.4 performance studies actually show that
splitting the counter is not even essential.


-- wli
Christoph Lameter
2004-11-22 17:44:02 UTC
Permalink
Post by William Lee Irwin III
I'm not particularly "stuck on" the per-cpu business, it was merely the
most obvious method of splitting the RSS counter without catastrophes
elsewhere. Robin Holt's 2.4 performance studies actually show that
splitting the counter is not even essential.
There is no problem moving back to the atomic approach, that is, if it is
okay to also make anon_rss atomic. But it's a pretty significant
performance hit (comparison with some old data from V4 of patch which
makes this data a bit suspect since the test environment is likely
slightly different. I should really test this again. Note that the old
performance test was only run 3 times instead of 10):

atomic vs. sloppy rss performance 64G allocation:

sloppy rss:

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
16 10 1 1.818s 131.556s 133.038s 78618.592 78615.672
16 10 2 1.736s 121.167s 65.026s 85317.098 160656.362
16 10 4 1.835s 120.444s 36.002s 85751.810 291074.998
16 10 8 1.820s 131.068s 25.049s 78906.310 411304.895
16 10 16 3.275s 194.971s 22.019s 52892.356 472497.962
16 10 32 13.006s 496.628s 27.044s 20575.038 381999.865

atomic:

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
16 3 1 0.610s 61.557s 62.016s 50600.438 50599.822
16 3 2 0.640s 83.116s 43.016s 37557.847 72869.978
16 3 4 0.621s 73.897s 26.023s 42214.002 119908.246
16 3 8 0.596s 86.587s 14.098s 36081.229 209962.059
16 3 16 0.646s 69.601s 7.000s 44780.269 448823.690
16 3 32 0.903s 185.609s 8.085s 16866.018 355301.694

Let's go for the approach of moving rss into the thread structure but
keeping the rss in the mm structure as is (need to take page_table_lock
for update) to consolidate the values. This allows us to keep most
of the code as is, and the rss in the task struct is only used if
we are not holding page_table_lock.

Maybe we can then find some way to regularly update the rss in the mm
structure to avoid the loop over the tasklist in proc.
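
Schematically, that hybrid might look like the following userspace sketch
(all names invented; this is not the actual patch): the task accumulates a
private delta with no lock, and the delta gets folded into the mm-wide
count whenever the page_table_lock is taken anyway, so mm->rss only drifts
by whatever is still unfolded.

struct mm_model {
    long rss;           /* in the real thing: protected by page_table_lock */
};

struct task_model {
    struct mm_model *mm;
    long rss_delta;     /* private to the task, updated locklessly */
};

/* fault path, page_table_lock not held */
static void task_account_fault(struct task_model *t, long pages)
{
    t->rss_delta += pages;
}

/* any path that already holds page_table_lock */
static void task_fold_rss(struct task_model *t)
{
    t->mm->rss += t->rss_delta;
    t->rss_delta = 0;
}

int main(void)
{
    struct mm_model mm = { 0 };
    struct task_model t = { &mm, 0 };

    task_account_fault(&t, 4);      /* mm.rss still stale here */
    task_fold_rss(&t);              /* consolidated on the next locked path */
    return mm.rss != 4;
}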

William Lee Irwin III
2004-11-22 22:43:33 UTC
Permalink
Post by Christoph Lameter
Post by William Lee Irwin III
I'm not particularly "stuck on" the per-cpu business, it was merely the
most obvious method of splitting the RSS counter without catastrophes
elsewhere. Robin Holt's 2.4 performance studies actually show that
splitting the counter is not even essential.
There is no problem moving back to the atomic approach, that is, if it is
okay to also make anon_rss atomic. But it's a pretty significant
performance hit (comparison with some old data from V4 of patch which
makes this data a bit suspect since the test environment is likely
slightly different. I should really test this again. Note that the old
The specific patches you compared matter a great deal as there are
implementation blunders (e.g. poor placement of counters relative to
->mmap_sem) that can ruin the results. URL's to the specific patches
would rule out that source of error.


-- wli
Christoph Lameter
2004-11-22 22:51:22 UTC
Permalink
Post by William Lee Irwin III
The specific patches you compared matter a great deal as there are
implementation blunders (e.g. poor placement of counters relative to
->mmap_sem) that can ruin the results. URL's to the specific patches
would rule out that source of error.
I mentioned V4 of this patch which was posted to lkml. A simple search
should get you there.
William Lee Irwin III
2004-11-23 02:25:15 UTC
Permalink
Post by Christoph Lameter
Post by William Lee Irwin III
The specific patches you compared matter a great deal as there are
implementation blunders (e.g. poor placement of counters relative to
->mmap_sem) that can ruin the results. URL's to the specific patches
would rule out that source of error.
I mentioned V4 of this patch which was posted to lkml. A simple search
should get you there.
The counter's placement was poor in that version of the patch. The
results are very suspect and likely invalid. It would have been more
helpful if you provided some kind of unique identifier when requests
for complete disambiguation are made. For instance, the version tags of
your patches are not visible in Subject: lines.

There are, of course, other issues, e.g. where the arch sweeps went.
This discussion has degenerated into non-cooperation, making it beyond
my power to help, and I'm in the midst of several rather urgent
bughunts, of which there are apparently more to come.


-- wli
William Lee Irwin III
2004-11-20 07:15:14 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
touch_nmi_watchdog() is only "protection" against local interrupt
disablement triggering the NMI oopser because alert_counter[]
increments are not atomic. Yet even supposing they were made so, the
That would be a bug in touch_nmi_watchdog then, because you're
racy against your own NMI too.
So I'm actually not very very wrong at all. I'm technically wrong
because touch_nmi_watchdog has a theoretical 'bug'. In practice,
multiple races with the non atomic increments to the same counter,
and in an unbroken sequence would be about as likely as hardware
failure.
Anyway, this touch nmi thing is going off topic, sorry list.
No, it's on-topic.
(1) The issue is not theoretical. e.g. sysrq t does trigger NMI oopses,
merely not every time, and not on every system. It is not
associated with hardware failure. It is, however, tolerable
because sysrq's require privilege to trigger and are primarily
used when the box is dying anyway.
(2) NMI's don't nest. There is no possibility of NMI's racing against
themselves while the data is per-cpu.
Post by Nick Piggin
Post by William Lee Irwin III
net effect of "covering up" this gross deficiency is making the
user-observable problems it causes undiagnosable, as noted before.
Well the loops that are in there now aren't covered up, and they
don't seem to be causing problems. Ergo there is no problem (we're
being _practical_ here, right?)
They are causing problems. They never stopped causing problems. None
of the above attempts to reduce rwlock starvation has been successful
in reducing it to untriggerable-in-the-field levels, and empirical
demonstrations of starvation recurring after those available at the
time of testing were put into place did in fact happen. Reduction of
frequency and making starvation more difficult to trigger are all that
they've achieved thus far.
Post by Nick Piggin
Post by William Lee Irwin III
Kevin Marin was the first to report this issue to lkml. I had seen
instances of it in internal corporate bugreports and it was one of
the motivators for the work I did on pidhashing (one of the causes
of the timeouts was worst cases in pid allocation). Manfred Spraul
and myself wrote patches attempting to reduce read-side hold time
in /proc/ algorithms, Ingo Molnar wrote patches to hierarchically
subdivide the /proc/ iterations, and Dipankar Sarma and Maneesh
Soni wrote patches to carry out the long iterations in /proc/ locklessly.
The last several of these affecting /proc/ have not gained acceptance,
though the work has not been halted in any sense, as this problem
recurs quite regularly. A considerable amount of sustained effort has
gone toward mitigating and resolving rwlock starvation.
That's very nice. But there is no problem _now_, is there?
There is and has always been. All of the above merely mitigate the
issue, with the possible exception of the tasklist RCU patch, for
which I know of no testing results. Also note that almost none of
the work on /proc/ has been merged.
Post by Nick Piggin
Post by William Lee Irwin III
Aggravating the rwlock starvation destabilizes, not pessimizes,
and performance is secondary to stability.
Well, luckily we're not going to be aggravating the rwlock starvation.
If you found a problem with, and fixed do_task_stat: ?time, ???_flt,
et al, then you would apply the same solution to per thread rss to
fix it in the same way.
You are aggravating the rwlock starvation by introducing gratuitous
full tasklist iterations. There is no solution to do_task_stat()
because it was recently introduced. There will be one as part of a port
of the usual mitigation patches when the perennial problem is reported
against a sufficiently recent kernel version, as usual. The already-
demonstrated problematic iterations have not been removed.


-- wli
Nick Piggin
2004-11-20 07:29:27 UTC
Permalink
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
touch_nmi_watchdog() is only "protection" against local interrupt
disablement triggering the NMI oopser because alert_counter[]
increments are not atomic. Yet even supposing they were made so, the
That would be a bug in touch_nmi_watchdog then, because you're
racy against your own NMI too.
So I'm actually not very very wrong at all. I'm technically wrong
because touch_nmi_watchdog has a theoretical 'bug'. In practice,
multiple races with the non atomic increments to the same counter,
and in an unbroken sequence would be about as likely as hardware
failure.
Anyway, this touch nmi thing is going off topic, sorry list.
No, it's on-topic.
(1) The issue is not theoretical. e.g. sysrq t does trigger NMI oopses,
merely not every time, and not on every system. It is not
associated with hardware failure. It is, however, tolerable
because sysrq's require privilege to trigger and are primarily
used when the box is dying anyway.
OK then put a touch_nmi_watchdog in there if you must.
Post by William Lee Irwin III
(2) NMI's don't nest. There is no possibility of NMI's racing against
themselves while the data is per-cpu.
Your point was that touch_nmi_watchdog() which resets alert_counter,
is racy when resetting the counter of other CPUs. Yes it is racy.
It is also racy against the NMI on the _current_ CPU.

This has nothing whatsoever to do with NMIs racing against themselves,
I don't know how you got that idea when you were the one to bring up
this race anyway.

[ snip back-and-forth that is going nowhere ]

I'll bow out of the argument here. I grant you raise valid concerns
WRT the /proc issues, of course.
Nick Piggin
2004-11-20 07:45:18 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
(2) NMI's don't nest. There is no possibility of NMI's racing against
themselves while the data is per-cpu.
Your point was that touch_nmi_watchdog() which resets alert_counter,
is racy when resetting the counter of other CPUs. Yes it is racy.
It is also racy against the NMI on the _current_ CPU.
Hmm no I think you're right in that it is only a problem WRT the remote
CPUs. However that would still be a problem, as the comment in i386
touch_nmi_watchdog attests.
Nick Piggin
2004-11-20 07:57:38 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
No, it's on-topic.
(1) The issue is not theoretical. e.g. sysrq t does trigger NMI oopses,
merely not every time, and not on every system. It is not
associated with hardware failure. It is, however, tolerable
because sysrq's require privilege to trigger and are primarily
used when the box is dying anyway.
OK then put a touch_nmi_watchdog in there if you must.
Duh, there is one in there :\

Still, that doesn't really say much about a normal tasklist traversal
because this thing will spend ages writing stuff to serial console.

Now I know going over the whole tasklist is crap. Anything O(n) for
things like this is crap. I happen to just get frustrated to see
concessions being made to support more efficient /proc access. I know
you are one of the ones who has to deal with the practical realities
of that though. Sigh. Well try to bear with me... :|
William Lee Irwin III
2004-11-20 08:25:36 UTC
Permalink
Post by Nick Piggin
Post by Nick Piggin
OK then put a touch_nmi_watchdog in there if you must.
Duh, there is one in there :\
Still, that doesn't really say much about a normal tasklist traversal
because this thing will spend ages writing stuff to serial console.
Now I know going over the whole tasklist is crap. Anything O(n) for
things like this is crap. I happen to just get frustrated to see
concessions being made to support more efficient /proc access. I know
you are one of the ones who has to deal with the practical realities
of that though. Sigh. Well try to bear with me... :|
I sure as Hell don't have any interest in /proc/ in and of itself,
but this stuff does really bite people, and hard, too.


-- wli
William Lee Irwin III
2004-11-20 02:04:01 UTC
Permalink
Post by Christoph Lameter
A. make_rss_atomic. The earlier releases contained that patch but
then another variable (such as anon_rss) was introduced that would
have required additional atomic operations. Atomic rss operations
are also causing slowdowns on machines with a high number of cpus
due to memory contention.
B. remove_rss. Replace rss with a periodic scan over the vm to
determine rss and additional numbers. This was also discussed on
linux-mm and linux-ia64. The scans while displaying /proc data
were undesirable.
Split counters easily resolve the issues with both these approaches
(and apparently your co-workers are suggesting it too, and have
performance results backing it).


-- wli
Nick Piggin
2004-11-20 02:18:22 UTC
Permalink
Post by William Lee Irwin III
Post by Christoph Lameter
A. make_rss_atomic. The earlier releases contained that patch but
then another variable (such as anon_rss) was introduced that would
have required additional atomic operations. Atomic rss operations
are also causing slowdowns on machines with a high number of cpus
due to memory contention.
B. remove_rss. Replace rss with a periodic scan over the vm to
determine rss and additional numbers. This was also discussed on
linux-mm and linux-ia64. The scans while displaying /proc data
were undesirable.
Split counters easily resolve the issues with both these approaches
(and apparently your co-workers are suggesting it too, and have
performance results backing it).
Split counters still require atomic operations though. This is what
Christoph's latest effort is directed at removing. And they'll still
bounce cachelines around. (I assume we've reached the conclusion
that per-cpu split counters per-mm won't fly?).
William Lee Irwin III
2004-11-20 02:34:43 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
Split counters easily resolve the issues with both these approaches
(and apparently your co-workers are suggesting it too, and have
performance results backing it).
Split counters still require atomic operations though. This is what
Christoph's latest effort is directed at removing. And they'll still
bounce cachelines around. (I assume we've reached the conclusion
that per-cpu split counters per-mm won't fly?).
Split != per-cpu, though it may be. Counterexamples are
as simple as atomic_inc(&mm->rss[smp_processor_id()>>RSS_IDX_SHIFT]);
Furthermore, see Robin Holt's results regarding the performance of the
atomic operations and their relation to cacheline sharing.
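
As a concrete toy version of that counterexample (names and sizes invented,
using C11 atomics in place of the kernel's atomic_t): cpus are grouped
1 << RSS_IDX_SHIFT to a slot, so only a handful of cpus contend on each
counter and the read side sums just a few slots.

#include <stdatomic.h>

#define SKETCH_NR_CPUS 16
#define RSS_IDX_SHIFT  2                        /* 4 cpus share one slot */
#define RSS_SLOTS      (SKETCH_NR_CPUS >> RSS_IDX_SHIFT)

static atomic_long rss[RSS_SLOTS];

/* fault path on 'cpu': contention is confined to the cpus sharing the slot */
static void rss_inc(int cpu)
{
    atomic_fetch_add(&rss[cpu >> RSS_IDX_SHIFT], 1);
}

/* read side: sums RSS_SLOTS entries rather than one per cpu */
static long rss_read(void)
{
    long total = 0;
    int i;

    for (i = 0; i < RSS_SLOTS; i++)
        total += atomic_load(&rss[i]);
    return total;
}

int main(void)
{
    rss_inc(0);
    rss_inc(7);
    rss_inc(15);
    return rss_read() != 3;
}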

And frankly, the argument that the space overhead of per-cpu counters
is problematic is not compelling. Even at 1024 cpus it's smaller than
an ia64 pagetable page, of which there are numerous instances attached
to each mm.


-- wli
Nick Piggin
2004-11-20 02:40:40 UTC
Permalink
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
Split counters easily resolve the issues with both these approaches
(and apparently your co-workers are suggesting it too, and have
performance results backing it).
Split counters still require atomic operations though. This is what
Christoph's latest effort is directed at removing. And they'll still
bounce cachelines around. (I assume we've reached the conclusion
that per-cpu split counters per-mm won't fly?).
Split != per-cpu, though it may be. Counterexamples are
as simple as atomic_inc(&mm->rss[smp_processor_id()>>RSS_IDX_SHIFT]);
Oh yes, I just meant that the only way split counters will relieve
the atomic ops and bouncing is by having them per-cpu. But you knew
that :)
Post by William Lee Irwin III
Furthermore, see Robin Holt's results regarding the performance of the
atomic operations and their relation to cacheline sharing.
Well yeah, but a. their patch isn't in 2.6 (or 2.4), and b. anon_rss
means another atomic op. While this doesn't immediately make it a
showstopper, it is gradually slowing down the single threaded page
fault path too, which is bad.
Post by William Lee Irwin III
And frankly, the argument that the space overhead of per-cpu counters
is problematic is not compelling. Even at 1024 cpus it's smaller than
an ia64 pagetable page, of which there are numerous instances attached
to each mm.
1024 CPUs * 64 byte cachelines == 64K, no? Well I'm sure they probably
don't even care about 64K on their large machines, but...

On i386 this would be maybe 32 * 128 byte == 4K per task for distro
kernels. Not so good.
William Lee Irwin III
2004-11-20 03:04:25 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
Furthermore, see Robin Holt's results regarding the performance of the
atomic operations and their relation to cacheline sharing.
Well yeah, but a. their patch isn't in 2.6 (or 2.4), and b. anon_rss
Irrelevant. Unshare cachelines with hot mm-global ones, and the
"problem" goes away.

This stuff is going on and on about some purist "no atomic operations
anywhere" weirdness even though killing the last atomic operation
creates problems and doesn't improve performance.
Post by Nick Piggin
means another atomic op. While this doesn't immediately make it a
showstopper, it is gradually slowing down the single threaded page
fault path too, which is bad.
Post by William Lee Irwin III
And frankly, the argument that the space overhead of per-cpu counters
is problematic is not compelling. Even at 1024 cpus it's smaller than
an ia64 pagetable page, of which there are numerous instances attached
to each mm.
1024 CPUs * 64 byte cachelines == 64K, no? Well I'm sure they probably
don't even care about 64K on their large machines, but...
On i386 this would be maybe 32 * 128 byte == 4K per task for distro
kernels. Not so good.
Why the Hell would you bother giving each cpu a separate cacheline?
The odds of bouncing significantly merely amongst the counters are not
particularly high.


-- wli
Nick Piggin
2004-11-20 03:14:33 UTC
Permalink
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
Furthermore, see Robin Holt's results regarding the performance of the
atomic operations and their relation to cacheline sharing.
Well yeah, but a. their patch isn't in 2.6 (or 2.4), and b. anon_rss
Irrelevant. Unshare cachelines with hot mm-global ones, and the
"problem" goes away.
That's the idea.
Post by William Lee Irwin III
This stuff is going on and on about some purist "no atomic operations
anywhere" weirdness even though killing the last atomic operation
creates problems and doesn't improve performance.
Huh? How is not wanting to impact single threaded performance being
"purist weirdness"? Practical, I'd call it.
Post by William Lee Irwin III
Post by Nick Piggin
means another atomic op. While this doesn't immediately make it a
showstopper, it is gradually slowing down the single threaded page
fault path too, which is bad.
Post by William Lee Irwin III
And frankly, the argument that the space overhead of per-cpu counters
is problematic is not compelling. Even at 1024 cpus it's smaller than
an ia64 pagetable page, of which there are numerous instances attached
to each mm.
1024 CPUs * 64 byte cachelines == 64K, no? Well I'm sure they probably
don't even care about 64K on their large machines, but...
On i386 this would be maybe 32 * 128 byte == 4K per task for distro
kernels. Not so good.
Why the Hell would you bother giving each cpu a separate cacheline?
The odds of bouncing significantly merely amongst the counters are not
particularly high.
Hmm yeah I guess wouldn't put them all on different cachelines.
As you can see though, Christoph ran into a wall at 8 CPUs, so
having them densely packed still might not be enough.
William Lee Irwin III
2004-11-20 03:43:49 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
Irrelevant. Unshare cachelines with hot mm-global ones, and the
"problem" goes away.
That's the idea.
Post by William Lee Irwin III
This stuff is going on and on about some purist "no atomic operations
anywhere" weirdness even though killing the last atomic operation
creates problems and doesn't improve performance.
Huh? How is not wanting to impact single threaded performance being
"purist weirdness"? Practical, I'd call it.
Empirically demonstrate the impact on single-threaded performance.
Post by Nick Piggin
Post by William Lee Irwin III
Why the Hell would you bother giving each cpu a separate cacheline?
The odds of bouncing significantly merely amongst the counters are not
particularly high.
Hmm yeah I guess wouldn't put them all on different cachelines.
As you can see though, Christoph ran into a wall at 8 CPUs, so
having them densely packed still might not be enough.
Please be more specific about the result, and cite the Message-Id.


-- wli
Nick Piggin
2004-11-20 03:58:36 UTC
Permalink
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
Irrelevant. Unshare cachelines with hot mm-global ones, and the
"problem" goes away.
That's the idea.
Post by William Lee Irwin III
This stuff is going on and on about some purist "no atomic operations
anywhere" weirdness even though killing the last atomic operation
creates problems and doesn't improve performance.
Huh? How is not wanting to impact single threaded performance being
"purist weirdness"? Practical, I'd call it.
Empirically demonstrate the impact on single-threaded performance.
I can tell you it's worse. I don't have to demonstrate anything; more
atomic RMW ops in the page fault path are going to have an impact.

I'm not saying we must not compromise *anywhere*, but it would
just be nice to try to avoid making the path heavier, that's all.
I'm not being purist when I say I'd first rather explore all other
options before adding atomics.

But nevermind arguing, it appears Linus' suggested method will
be fine and *does* mean we don't have to compromise.
Post by William Lee Irwin III
Post by Nick Piggin
Post by William Lee Irwin III
Why the Hell would you bother giving each cpu a separate cacheline?
The odds of bouncing significantly merely amongst the counters are not
particularly high.
Hmm yeah I guess wouldn't put them all on different cachelines.
As you can see though, Christoph ran into a wall at 8 CPUs, so
having them densely packed still might not be enough.
Please be more specific about the result, and cite the Message-Id.
Start of this thread.
William Lee Irwin III
2004-11-20 04:01:10 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
Please be more specific about the result, and cite the Message-Id.
Start of this thread.
Those do not have testing results of different RSS counter
implementations in isolation.


-- wli
Robin Holt
2004-11-20 04:34:17 UTC
Permalink
Post by Nick Piggin
Post by William Lee Irwin III
Please be more specific about the result, and cite the Message-Id.
Start of this thread.
Part of the impact was having the page table lock, the mmap_sem, and
these two atomic counters in the same cacheline. What about separating
the counters from the locks?
Robin Holt
2004-11-20 03:33:12 UTC
Permalink
Post by William Lee Irwin III
Why the Hell would you bother giving each cpu a separate cacheline?
The odds of bouncing significantly merely amongst the counters are not
particularly high.
Agree, we are currently using atomic ops on a global rss on our 2.4
kernel with 512cpu systems and not seeing much cacheline contention.
I don't remember how little it ended up being, but it was very little.
We had gone to dropping the page_table_lock and only reacquiring it if
the pte was non-null when we went to insert our new one. I think that
was how we had it working. I would have to wake up and actually look
at that code as it was many months ago that Ray Bryant did that work.
We did make rss atomic. Most of the contention is sorted out by the
mmap_sem. Processes acquiring themselves off of mmap_sem were found
to have spaced themselves out enough that they were all approximately
equally far in time from doing their atomic_add and therefore had very little
contention for the cacheline. At least it was not enough that we could
measure it as significant.
William Lee Irwin III
2004-11-20 04:24:27 UTC
Permalink
Post by Robin Holt
Agree, we are currently using atomic ops on a global rss on our 2.4
kernel with 512cpu systems and not seeing much cacheline contention.
I don't remember how little it ended up being, but it was very little.
We had gone to dropping the page_table_lock and only reacquiring it if
the pte was non-null when we went to insert our new one. I think that
was how we had it working. I would have to wake up and actually look
at that code as it was many months ago that Ray Bryant did that work.
We did make rss atomic. Most of the contention is sorted out by the
mmap_sem. Processes acquiring themselves off of mmap_sem were found
to have spaced themselves out enough that they were all approximately
equally far in time from doing their atomic_add and therefore had very little
contention for the cacheline. At least it was not enough that we could
measure it as significant.
Also, the densely-packed split counter can only get 4-16 cpus to a
cacheline with cachelines <= 128B, so there are definite limitations to
the amount of cacheline contention in such schemes.


-- wli
Robin Holt
2004-11-20 02:06:15 UTC
Permalink
Post by Christoph Lameter
A. make_rss_atomic. The earlier releases contained that patch but then another
variable (such as anon_rss) was introduced that would have required additional
atomic operations. Atomic rss operations are also causing slowdowns on
machines with a high number of cpus due to memory contention.
B. remove_rss. Replace rss with a periodic scan over the vm to determine
rss and additional numbers. This was also discussed on linux-mm and linux-ia64.
The scans while displaying /proc data were undesirable.
Can you run a comparison benchmark between atomic rss and anon_rss and
the sloppy rss with the rss and anon_rss in separate cachelines? I am not
sure that it is important to separate the two into separate cachelines, just
rss and anon_rss from the lock and sema.

If I have the time over the weekend, I might try this myself. If not, can
you give it a try.

Thanks,
Robin
Benjamin Herrenschmidt
2004-11-19 07:05:20 UTC
Permalink
Post by Nick Piggin
Post by Christoph Lameter
This patch conflicts with the page fault scalability patch but I could not
leave this stone unturned. No significant performance increases so
this is just for the record in case someone else gets the same wild idea.
I had a similar wild idea. Mine was to just make sure we have a spare
per-CPU page ready before taking any locks.
Ahh, you're doing clear_user_highpage after the pte is already set up?
Won't that be racy? I guess that would be an advantage of my approach,
the clear_user_highpage can be done first (although that is more likely
to be wasteful of cache).
Yah, doing clear_user_highpage() after setting the PTE is unfortunately
unacceptable. It show interesting bugs... As soon as the PTE is setup,
another thread on another CPU can hit the page, you'll then clear what
it's writing...

Take for example 2 threads writing to different structures in the same
page of anonymous memory. The first one triggers the allocation, the
second writes right away, "sees" the new PTE, and writes just before the
first one does clear_user_highpage...

Ben.


Christoph Lameter
2004-11-19 19:21:38 UTC
Permalink
Just coming back to your sloppy rss patch - this thing will of course allow
unbounded error to build up. Well, it *will* be bounded by the actual RSS if
we assume the races can only cause rss to be underestimated. However, such an
assumption (I think it is a safe one?) also means that rss won't hover around
the correct value, but tend to go increasingly downward.
On your HPC codes that never reclaim memory, and don't do a lot of mapping /
unmapping I guess this wouldn't matter... But a long running database or
something?
Databases preallocate memory on startup and then manage memory themselves.
One reason for this patch is that these applications cause anonymous page
fault storms on startup given lots of memory which will make
the system seem to freeze for a while.

It is rare for a program to actually free up memory.

Where this approach could be problematic is when the system is under
heavy swap load. Pages of an application will be repeatedly paged in and
out and therefore rss will be incremented and decremented. But in those
cases these incs and decs are not deliberately done in parallel the way
they are in my test programs. So I would expect rss to be more
accurate than in my tests.

I think the sloppy rss approach is the right way to go.
Robin Holt
2004-11-19 19:57:21 UTC
Permalink
Post by Christoph Lameter
Just coming back to your sloppy rss patch - this thing will of course allow
unbounded error to build up. Well, it *will* be bounded by the actual RSS if
we assume the races can only cause rss to be underestimated. However, such an
assumption (I think it is a safe one?) also means that rss won't hover around
the correct value, but tend to go increasingly downward.
On your HPC codes that never reclaim memory, and don't do a lot of mapping /
unmapping I guess this wouldn't matter... But a long running database or
something?
Databases preallocate memory on startup and then manage memory themselves.
One reason for this patch is that these applications cause anonymous page
fault storms on startup when given lots of memory, which makes
the system seem to freeze for a while.
It is rare for a program to actually free up memory.
Where this approach could be problematic is when the system is under
heavy swap load. Pages of an application will be repeatedly paged in and
out, and therefore rss will be incremented and decremented. But in those
cases the increments and decrements are not deliberately done in parallel
the way they are in my test programs. So I would expect rss to be more
accurate than in my tests.
I think the sloppy rss approach is the right way to go.
Is this really that much of a problem? Why not leave rss as an _ACCURATE_
count of pages? That way stuff like limits based upon rss and accounting
of memory usage stay accurate.

Have we tried splitting into separate cache lines? How about grouped counters
for every 16 cpus instead of a per-cpu counter, as proposed by someone else
earlier?
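
Something along these lines, purely as a sketch -- the names (RSS_GROUP_SIZE,
mm_rss_inc and so on) are invented here, not from any existing patch:

        #define RSS_GROUP_SIZE  16
        #define NR_RSS_GROUPS   ((NR_CPUS + RSS_GROUP_SIZE - 1) / RSS_GROUP_SIZE)

        struct rss_group {
                atomic_t count;
        } ____cacheline_aligned_in_smp;         /* one cacheline per group of 16 cpus */

        static inline void mm_rss_inc(struct rss_group *rss, int cpu)
        {
                atomic_inc(&rss[cpu / RSS_GROUP_SIZE].count);
        }

        static inline long mm_rss_read(struct rss_group *rss)
        {
                long total = 0;
                int i;

                for (i = 0; i < NR_RSS_GROUPS; i++)
                        total += atomic_read(&rss[i].count);
                return total;
        }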

IMHO, keeping rss as an accurate count is much more important than having
a nearly correct value. If this turns into more of a scaling issue later on,
your patch means the problem will have to be caught by someone accidentally
noticing that the rss value is _WAY_ off, as opposed to our normal methods
for detecting cacheline contention.

Just my opinion,
Robin Holt
Nick Piggin
2004-11-20 01:24:17 UTC
Permalink
Post by Robin Holt
Post by Christoph Lameter
I think the sloppy rss approach is the right way to go.
Is this really that much of a problem? Why not leave rss as an _ACCURATE_
count of pages. That way stuff like limits based upon rss and accounting
of memory usage are accurate.
I think I agree. (But Christoph is right that in practice probably nobody
or very few will ever notice).
Post by Robin Holt
Have we tried splitting into separate cache lines? How about grouped counters
for every 16 cpus instead of a per-cpu counter, as proposed by someone else
earlier?
Well, you still need to put those counters on separate cachelines, so you
still need to pad them out quite a lot. Then, as they are shared, you _still_
need to make them atomic, and they'll still be bouncing around too.

Linus' idea of a per-thread 'pages_in - pages_out' counter may prove to be
just the right solution though.
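
As I understand that idea, each task keeps an unlocked delta of its own and
the deltas only get folded into the mm when an accurate figure is needed;
something like this, with field and helper names invented for illustration:

        struct task_rss {
                long pages_in;          /* pages this thread faulted in, no locking */
                long pages_out;         /* pages this thread unmapped */
        };

        /* Fold a thread's private delta into the shared mm counter, e.g. at
         * thread exit or when /proc wants an accurate value. */
        static inline void fold_rss(long *mm_rss, struct task_rss *t)
        {
                *mm_rss += t->pages_in - t->pages_out;
                t->pages_in = t->pages_out = 0;
        }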

Nick
Robin Holt
2004-11-20 01:46:10 UTC
Permalink
Post by Nick Piggin
Well, you still need to put those counters on separate cachelines, so you
still need to pad them out quite a lot. Then, as they are shared, you _still_
need to make them atomic, and they'll still be bouncing around too.
Linus' idea of a per-thread 'pages_in - pages_out' counter may prove to be
just the right solution though.
I can go with either solution. Not sure how many cpus we can group together
before the cacheline becomes so hot that we need to fan them out. I have
a gut feeling it is a lot.

On the 2.4 kernel which SGI put together, we just changed rss to an
atomic and ensured it was in a separate cacheline from the locks, and
performance was more than adequate. I realize a lot has changed since
2.4, but the concepts are similar.

Just my 2 cents,
Robin
Sebastien Decugis
2004-12-03 14:49:51 UTC
Permalink
[Gerrit Huizenga, 2004-12-02 16:24:04]
Post by Gerrit Huizenga
Towards that end, there
was a recent effort at Bull on the NPTL work which serves as a very good example:
http://nptl.bullopensource.org/Tests/results/run-browse.php
Basically, you can compare results from any test run with any other
and get a summary of differences. That helps give a quick status
check and helps you focus on the correct issues when tracking down
defects.
Thanks Gerrit for mentioning this :)

Just one additional piece of information -- the tool used to produce this
reporting system is OSS and can be found here:
http://tslogparser.sourceforge.net

This tool is not mature yet, but it gives an overview of how useful a
test suite can be, when the results are easy to analyse...

It currently supports only the Open POSIX Test Suite, but I'd be happy
to work on enlarging the scope of this tool.

Regards,
Seb.

PS: please include me in reply as I'm not subscribed to the list...
-------------------------------
Sebastien DECUGIS
NPTL Test & Trace Project
http://nptl.bullopensource.org/

"You may fail if you try.
You -will- fail if you don't."

Luck, Tony
2004-12-08 17:44:09 UTC
Permalink
Post by Christoph Lameter
If a fault occurred for page x and is then followed by page
x+1 then it may be reasonable to expect another page fault
at x+2 in the future.
What if the application had used "madvise(start, len, MADV_RANDOM)"
to tell the kernel that this isn't "reasonable"?

-Tony
Christoph Lameter
2004-12-08 17:57:09 UTC
Permalink
Post by Luck, Tony
Post by Christoph Lameter
If a fault occurred for page x and is then followed by page
x+1 then it may be reasonable to expect another page fault
at x+2 in the future.
What if the application had used "madvise(start, len, MADV_RANDOM)"
to tell the kernel that this isn't "reasonable"?
We could use that as a way to switch off the preallocation. How expensive
is that check?
Luck, Tony
2004-12-08 18:31:20 UTC
Permalink
Post by Christoph Lameter
We could use that as a way to switch of the preallocation. How
expensive is that check?
If you already looked up the vma, then it is very cheap. Just
check for VM_RAND_READ in vma->vm_flags.
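
That would make the check roughly the following (illustrative only, assuming
the fault handler already has the vma in hand; "prealloc" is a made-up local):

        if (vma->vm_flags & VM_RAND_READ)
                prealloc = 0;   /* MADV_RANDOM: skip the preallocation */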

-Tony