Discussion:
another approach to rss : sloppy rss
Christoph Lameter
2004-11-18 19:34:21 UTC
But I don't know what the appropriate solution is. My priorities
may be wrong, but I dislike the thought of a struct mm dominated
by a huge percpu array of rss longs (or cachelines?), even if the
machines on which it would be huge are ones which could well afford
the waste of memory. It just offends my sense of proportion, when
the exact rss is of no importance. I'm more attracted to just
leaving it unatomic, and living with the fact that it's racy
and approximate (but have /proc report negatives as 0).
Here is a patch that enables handling of rss outside of the page table
lock by simply ignoring the errors introduced by not locking. The resulting
loss of rss accuracy was always less than 1%.

The patch ensures that negative rss values are not displayed and removes 3
checks in mm/rmap.c that used rss (unnecessarily, AFAIK).
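
To illustrate the idea, here is a small user-space sketch (not kernel code):
several threads do paired, unlocked increments and decrements of a shared
counter, and the reader clamps negative results to zero, just as the
rss_fixup() below does.

/*
 * Sketch: unlocked "sloppy" counter updates from several threads.
 * Build with: gcc -O2 -pthread sloppy.c
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    1000000

static volatile long rss;		/* deliberately not atomic, like mm->rss */

static void *worker(void *arg)
{
	int i;

	for (i = 0; i < NITER; i++) {
		rss++;			/* "fault a page in" */
		rss--;			/* "unmap it again"  */
	}
	return NULL;
}

static long rss_read(void)
{
	long v = rss;

	return v < 0 ? 0 : v;		/* never report a negative value */
}

int main(void)
{
	pthread_t t[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);

	printf("raw counter %ld, reported %ld (exact value would be 0)\n",
	       rss, rss_read());
	return 0;
}

The raw counter usually ends up a little off zero in either direction. In the
kernel the updates are spread across page faults rather than a tight loop,
which is presumably why the observed error stays under 1%.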

Some numbers:

4 Gigabyte concurrent allocation from 4 cpus:

rss protected by page_table_lock:

margin:~/clameter # ./pftn -g4 -r3 -f4
Size=262415 RSS=262233
Size=262479 RSS=262234
Size=262415 RSS=262233
4 3 4 0.180s 16.271s 5.010s 47801.151 154059.862
margin:~/clameter # ./pftn -g4 -r3 -f4
Size=262415 RSS=262233
Size=262415 RSS=262233
Size=262415 RSS=262233
4 3 4 0.155s 14.616s 4.081s 53239.852 163270.962
margin:~/clameter # ./pftn -g4 -r3 -f4
Size=262415 RSS=262233
Size=262479 RSS=262234
Size=262415 RSS=262233
4 3 4 0.172s 16.192s 5.018s 48055.018 151621.738

with sloppy rss:

margin2:~/clameter # ./pftn -g4 -r3 -f4
Size=262415 RSS=261120
Size=262415 RSS=261074
Size=262415 RSS=261215
4 3 4 0.161s 13.058s 4.060s 59489.254 170939.864
margin2:~/clameter # ./pftn -g4 -r3 -f4
Size=262415 RSS=260900
Size=262543 RSS=261001
Size=262415 RSS=261053
4 3 4 0.152s 13.565s 4.031s 57329.397 182103.081
margin2:~/clameter # ./pftn -g4 -r3 -f4
Size=262415 RSS=260988
Size=262479 RSS=261112
Size=262479 RSS=261343
4 3 4 0.143s 12.994s 4.060s 59860.702 170770.399

32 GB allocation with 32 cpus.

with page_table_lock:

Size=2099307 RSS=2097270
Size=2099371 RSS=2097271
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
Size=2099307 RSS=2097270
32 10 32 18.105s 5466.913s 202.027s 3823.418 103676.172

sloppy rss:

Size=2099307 RSS=2094018
Size=2099307 RSS=2093738
Size=2099307 RSS=2093907
Size=2099307 RSS=2093634
Size=2099307 RSS=2093731
Size=2099307 RSS=2094343
Size=2099307 RSS=2094072
Size=2099307 RSS=2094185
Size=2099307 RSS=2093845
Size=2099307 RSS=2093396
32 10 32 14.872s 1036.711s 55.023s 19942.800 379701.332



Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-11-15 11:13:39.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-11-17 06:58:51.000000000 -0800
@@ -216,7 +216,7 @@
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
- spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
+ spinlock_t page_table_lock; /* Protects page tables */

struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
@@ -252,6 +252,19 @@
struct kioctx default_kioctx;
};

+/*
+ * rss and anon_rss are incremented and decremented in some locations without
+ * proper locking. This function insures that these values do not become negative
+ * and is called before reporting rss based statistics
+ */
+static void inline rss_fixup(struct mm_struct *mm)
+{
+ if ((long)mm->rss < 0)
+ mm->rss = 0;
+ if ((long)mm->anon_rss < 0)
+ mm->anon_rss = 0;
+}
+
struct sighand_struct {
atomic_t count;
struct k_sigaction action[_NSIG];
Index: linux-2.6.9/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9.orig/fs/proc/task_mmu.c 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/fs/proc/task_mmu.c 2004-11-17 06:58:51.000000000 -0800
@@ -11,6 +11,7 @@
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
+ rss_fixup(mm);
buffer += sprintf(buffer,
"VmSize:\t%8lu kB\n"
"VmLck:\t%8lu kB\n"
@@ -37,6 +38,7 @@
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
+ rss_fixup(mm);
*shared = mm->rss - mm->anon_rss;
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
Index: linux-2.6.9/fs/proc/array.c
===================================================================
--- linux-2.6.9.orig/fs/proc/array.c 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/fs/proc/array.c 2004-11-17 06:58:51.000000000 -0800
@@ -325,6 +325,7 @@
vsize = task_vsize(mm);
eip = KSTK_EIP(task);
esp = KSTK_ESP(task);
+ rss_fixup(mm);
}

get_task_comm(tcomm, task);
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-15 11:13:40.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-17 07:07:00.000000000 -0800
@@ -263,8 +263,6 @@
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -504,8 +502,6 @@
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -788,8 +784,7 @@
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
- cursor < max_nl_cursor &&
+ while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;
Christoph Lameter
2004-11-19 01:40:42 UTC
This patch conflicts with the page fault scalability patch but I could not
leave this stone unturned. No significant performance increases so
this is just for the record in case someone else gets the same wild idea.

The patch implements a fastpath where the page_table_lock is not dropped
in do_anonymous_page. The fastpath steals a page from the hot or cold
lists to get a page quickly.

Results (4 GB and 32 GB allocations, gradually increasing the number of
processors up to 32):

with patch:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.524s 24.524s 25.005s 104653.150 104642.920
4 10 2 0.456s 29.458s 15.082s 87629.462 165633.410
4 10 4 0.453s 37.064s 11.002s 69872.279 237796.809
4 10 8 0.574s 99.258s 15.003s 26258.236 174308.765
4 10 16 2.171s 279.211s 21.001s 9316.271 124721.683
4 10 32 2.544s 741.273s 27.093s 3524.299 93827.660

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 4.124s 358.469s 362.061s 57837.481 57834.144
32 10 2 4.217s 440.333s 235.043s 47174.609 89076.709
32 10 4 3.778s 321.754s 100.069s 64422.222 208270.694
32 10 8 3.830s 789.580s 117.067s 26432.116 178211.592
32 10 16 3.921s 2360.026s 170.021s 8871.395 123203.040
32 10 32 9.140s 6213.944s 224.068s 3369.955 93338.297

w/o patch:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 10 1 0.449s 24.992s 25.044s 103038.282 103022.448
4 10 2 0.448s 30.290s 16.027s 85282.541 161110.770
4 10 4 0.420s 38.700s 11.061s 67008.319 225702.353
4 10 8 0.612s 93.862s 14.059s 27747.547 179564.131
4 10 16 1.554s 265.199s 20.016s 9827.180 129994.843
4 10 32 8.088s 657.280s 25.074s 3939.826 101822.835

Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 3.966s 366.840s 370.082s 56556.456 56553.456
32 10 2 3.604s 319.004s 172.058s 65006.086 121511.453
32 10 4 3.705s 341.550s 106.007s 60741.936 197704.486
32 10 8 3.597s 809.711s 119.021s 25785.427 175917.674
32 10 16 5.886s 2238.122s 163.084s 9345.560 127998.973
32 10 32 21.748s 5458.983s 201.062s 3826.409 104011.521

Only a minimal increase, if any. At the high end the patch leads to
even more contention.

Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-11-18 12:25:49.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-11-18 16:53:01.000000000 -0800
@@ -1436,28 +1436,56 @@

/* Read-only mapping of ZERO_PAGE. */
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
-
/* ..except if it's a write access */
if (write_access) {
+ struct per_cpu_pageset *pageset;
+ unsigned long flags;
+ int temperature;
+
/* Allocate our own private page. */
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
-
- if (unlikely(anon_vma_prepare(vma)))
- goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
- if (!page)
- goto no_mem;
- clear_user_highpage(page, addr);
-
- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, addr);

- if (!pte_none(*page_table)) {
- pte_unmap(page_table);
- page_cache_release(page);
+ /* This is not numa compatible yet! */
+ pageset = NODE_DATA(numa_node_id())->node_zonelists[GFP_HIGHUSER & GFP_ZONEMASK].zones[0]->pageset+smp_processor_id();
+
+ /* Fastpath for the case that the anonvma is already setup and there are
+ * pages available in the per_cpu_pageset for this node. If so steal
+ * pages from the pageset and avoid dropping the page_table_lock.
+ */
+ local_irq_save(flags);
+ temperature=1;
+ if (vma->anon_vma && (pageset->pcp[temperature].count || pageset->pcp[--temperature].count)) {
+ /* Fastpath for hot/cold pages */
+ page = list_entry(pageset->pcp[temperature].list.next, struct page, lru);
+ list_del(&page->lru);
+ pageset->pcp[temperature].count--;
+ local_irq_restore(flags);
+ page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+ 1 << PG_referenced | 1 << PG_arch_1 |
+ 1 << PG_checked | 1 << PG_mappedtodisk);
+ page->private = 0;
+ set_page_count(page, 1);
+ /* We skipped updating the zone statistics !*/
+ } else {
+ /* Slow path */
+ local_irq_restore(flags);
spin_unlock(&mm->page_table_lock);
- goto out;
+
+ if (unlikely(anon_vma_prepare(vma)))
+ goto no_mem;
+ page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ if (!page)
+ goto no_mem;
+
+ spin_lock(&mm->page_table_lock);
+ page_table = pte_offset_map(pmd, addr);
+
+ if (!pte_none(*page_table)) {
+ pte_unmap(page_table);
+ page_cache_release(page);
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
}
mm->rss++;
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
@@ -1473,7 +1501,10 @@

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
+
spin_unlock(&mm->page_table_lock);
+ if (write_access)
+ clear_user_highpage(page, addr);
out:
return VM_FAULT_MINOR;
no_mem:
Nick Piggin
2004-11-19 02:19:11 UTC
Post by Christoph Lameter
This patch conflicts with the page fault scalability patch but I could not
leave this stone unturned. No significant performance increases so
this is just for the record in case someone else gets the same wild idea.
I had a similar wild idea. Mine was to just make sure we have a spare
per-CPU page ready before taking any locks.

Ahh, you're doing clear_user_highpage after the pte is already set up?
Won't that be racy? I guess that would be an advantage of my approach,
the clear_user_highpage can be done first (although that is more likely
to be wasteful of cache).
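
A rough sketch of that idea (nothing like this was posted; the helpers and
the per-CPU stash are hypothetical): the page is allocated and cleared
before any lock is taken, so the zeroing can never race with another thread
that sees the new pte.

/* Hypothetical per-CPU stash holding one pre-cleared page. */
static DEFINE_PER_CPU(struct page *, prezeroed_page);

/*
 * Would run near the top of handle_mm_fault(), before the
 * page_table_lock is taken for this fault.
 */
static void prepare_fault_page(struct vm_area_struct *vma, unsigned long addr)
{
	struct page *page;

	if (get_cpu_var(prezeroed_page) != NULL) {
		put_cpu_var(prezeroed_page);	/* stash already filled */
		return;
	}
	put_cpu_var(prezeroed_page);

	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
	if (!page)
		return;
	clear_user_highpage(page, addr);	/* no locks held: cannot be racy */

	if (get_cpu_var(prezeroed_page) == NULL)
		__get_cpu_var(prezeroed_page) = page;
	else
		__free_page(page);		/* preempted and lost the race */
	put_cpu_var(prezeroed_page);
}

/*
 * Would be called from do_anonymous_page() with the page_table_lock held
 * (preemption off), in place of alloc_page_vma() + clear_user_highpage().
 */
static struct page *take_fault_page(void)
{
	struct page *page = __get_cpu_var(prezeroed_page);

	__get_cpu_var(prezeroed_page) = NULL;
	return page;	/* may be NULL: caller falls back to the slow path */
}

The cache-coldness mentioned above is inherent: the stashed page was cleared
some time before it is finally mapped.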
Christoph Lameter
2004-11-19 02:38:47 UTC
Post by Nick Piggin
Ahh, you're doing clear_user_highpage after the pte is already set up?
The huge page code also has that optimization. Clearing of pages
may take some time which is one reason the kernel drops the page table
lock for anonymous page allocation and then reacquires it. The patch does
not relinquish the lock on the fast path thus the move outside of the
lock.
Post by Nick Piggin
Won't that be racy? I guess that would be an advantage of my approach,
the clear_user_highpage can be done first (although that is more likely
to be wasteful of cache).
If you do the clearing with the page table lock held then performance will
suffer.

Nick Piggin
2004-11-19 02:44:25 UTC
Post by Christoph Lameter
Post by Nick Piggin
Ahh, you're doing clear_user_highpage after the pte is already set up?
The huge page code also has that optimization. Clearing of pages
may take some time which is one reason the kernel drops the page table
lock for anonymous page allocation and then reacquires it. The patch does
not relinquish the lock on the fast path thus the move outside of the
lock.
But you're doing it after you've set up a pte for that page you are
clearing... I think? What's to stop another thread trying to read or
write to it concurrently?
Post by Christoph Lameter
Post by Nick Piggin
Won't that be racy? I guess that would be an advantage of my approach,
the clear_user_highpage can be done first (although that is more likely
to be wasteful of cache).
If you do the clearing with the page table lock held then performance will
suffer.
Yeah very much, but if you allocate and clear a "just in case" page
_before_ taking any locks for the fault then you'd be able to go
straight through do_anonymous_page.

But yeah that has other issues like having a spare page per CPU (maybe
not so great a loss), and having anonymous faults much more likely to
get pages which are cache cold.

Anyway, glad to see your patches didn't improve things: now we don't
have to think about making *more* tradeoffs :)
Christoph Lameter
2004-11-19 03:28:41 UTC
Post by Nick Piggin
But you're doing it after you've set up a pte for that page you are
clearing... I think? What's to stop another thread trying to read or
write to it concurrently?
Nothing. If this had led to anything then we would have needed to address
this issue. The clearing had to be outside of the lock in order not to
impact the performance tests negatively.
Post by Nick Piggin
Post by Christoph Lameter
If you do the clearing with the page table lock held then performance will
suffer.
Yeah very much, but if you allocate and clear a "just in case" page
_before_ taking any locks for the fault then you'd be able to go
straight through do_anonymous_page.
But yeah that has other issues like having a spare page per CPU (maybe
not so great a loss), and having anonymous faults much more likely to
get pages which are cache cold.
You may be able to implement that using the hot and cold lists. Have
something that runs over the lists and prezeros and preformats these pages
(idle thread?).

Set some flag to indicate that a page has been prepared and then just zing
it in if do_anonymous_page finds that flag set.

But I think this may introduce way too much complexity
into the page fault handler.
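
For the record, a very rough sketch of what that might look like; the
grab_cold_percpu_page()/return_cold_percpu_page() helpers and the
PG_prezeroed flag are all hypothetical:

static int kprezerod(void *unused)
{
	struct page *page;

	while (!kthread_should_stop()) {
		page = grab_cold_percpu_page();		/* hypothetical */
		if (page) {
			if (!test_bit(PG_prezeroed, &page->flags)) {
				clear_highpage(page);
				set_bit(PG_prezeroed, &page->flags);
			}
			return_cold_percpu_page(page);	/* hypothetical */
		} else {
			set_current_state(TASK_INTERRUPTIBLE);
			schedule_timeout(HZ);
		}
	}
	return 0;
}

A fastpath like the one in the earlier patch could then skip its
clear_user_highpage() whenever it finds PG_prezeroed already set (and clear
the flag when handing the page out).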
Benjamin Herrenschmidt
2004-11-19 07:07:48 UTC
Post by Christoph Lameter
Post by Nick Piggin
But you're doing it after you've set up a pte for that page you are
clearing... I think? What's to stop another thread trying to read or
write to it concurrently?
Nothing. If this had led to anything then we would have needed to address
this issue. The clearing had to be outside of the lock in order not to
impact the performance tests negatively.
No, it's clearly a bug. We even had a very hard-to-track-down bug
recently on ppc64 which was caused by the fact that set_pte didn't
contain a barrier, so the stores done by the _previous_
clear_user_highpage() could be reordered with the store to the PTE.
That could cause another process to "see" the PTE before the writes of 0
to the page, and thus start writing to the page before all the zeros went
in, ending up with corrupted data. We had a real-life testcase of
this one. That test case would blow up right away with your code, I
think.
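
In other words (sketch only; as described above, on ppc64 the barrier
belongs inside set_pte itself), the required ordering is:

static void install_cleared_page(struct vm_area_struct *vma,
				 unsigned long addr, pte_t *ptep,
				 struct page *page, pte_t entry)
{
	clear_user_highpage(page, addr);   /* stores of zero into the new page  */
	smp_wmb();                         /* order them before the pte store   */
	set_pte(ptep, entry);              /* only now may other CPUs see a pte */
	update_mmu_cache(vma, addr, entry);
}

Doing the clear after set_pte(), as in the fastpath above, leaves a window
in which another thread can read or write the page before the zeros land.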

Ben.


Christoph Lameter
2004-11-19 19:42:39 UTC
Signed-off-by: Christoph Lameter <***@sgi.com>

Changes from V10->V11 of this patch:
- cmpxchg_i386: Optimize code generated after feedback from Linus. Various
fixes.
- drop make_rss_atomic in favor of rss_sloppy
- generic: adapt to new changes in Linus tree, some fixes to fallback
functions. Add generic ptep_xchg_flush based on xchg.
- S390: remove use of page_table_lock from ptep_xchg_flush (deadlock)
- x86_64: remove ptep_xchg
- i386: integrated Nick Piggin's changes for PAE mode. Create ptep_xchg_flush and
various fixes.
- ia64: if necessary flush icache before ptep_cmpxchg. Remove ptep_xchg

This is a series of patches that increases the scalability of
the page fault handler for SMP. Here are some performance results
on a machine with 32 processors allocating 32 GB with an increasing
number of cpus.

Without the patches:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 3.966s 366.840s 370.082s 56556.456 56553.456
32 10 2 3.604s 319.004s 172.058s 65006.086 121511.453
32 10 4 3.705s 341.550s 106.007s 60741.936 197704.486
32 10 8 3.597s 809.711s 119.021s 25785.427 175917.674
32 10 16 5.886s 2238.122s 163.084s 9345.560 127998.973
32 10 32 21.748s 5458.983s 201.062s 3826.409 104011.521

With the patches:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
32 10 1 3.772s 330.629s 334.042s 62713.587 62708.706
32 10 2 3.767s 352.252s 185.077s 58905.502 112886.222
32 10 4 3.549s 255.683s 77.000s 80898.177 272326.496
32 10 8 3.522s 263.879s 52.030s 78427.083 400965.857
32 10 16 5.193s 384.813s 42.076s 53772.158 490378.852
32 10 32 15.806s 996.890s 54.077s 20708.587 382879.208

With a high number of CPUs the page fault rate improves more than
twofold and may reach 500000 faults/sec between 16 and 512 cpus. The
fault rate drops if a process is running on all processors, as seen
here for the 32 cpu case.

Note that the measurements were done on a NUMA system and this
test uses off-node memory. Variations may exist due to allocations from
memory areas at varying distances from the local cpu. The slight drop
for 2 cpus is probably due to that effect.

The performance increase is accomplished by avoiding the use of the
page_table_lock spinlock (but not mm->mmap_sem!) through new atomic
operations on pte's (ptep_xchg, ptep_cmpxchg) and on pmd and pgd's
(pgd_test_and_populate, pmd_test_and_populate).

The page table lock can be avoided in the following situations:

1. An empty pte or pmd entry is populated

This is safe since the swapper may only depopulate them and the
swapper code has been changed to never set a pte to be empty until the
page has been evicted. The population of an empty pte is frequent
if a process touches newly allocated memory (a sketch of this case follows the list).

2. Modifications of flags in a pte entry (write/accessed).

These modifications are done by the CPU or by low level handlers
on various platforms also bypassing the page_table_lock. So this
seems to be safe too.
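
A minimal sketch of situation 1, using the ptep_cmpxchg() defined by the
later patches in this series: the new pte is only installed if the entry is
still in its original (empty) state; on failure the caller backs out and
releases its freshly allocated page.

static int install_new_pte(struct vm_area_struct *vma, unsigned long addr,
			   pte_t *ptep, pte_t orig, pte_t new)
{
	if (!ptep_cmpxchg(vma, addr, ptep, orig, new))
		return 0;	/* someone else populated the pte first */
	/* The entry was not present before: no TLB flush is needed. */
	update_mmu_cache(vma, addr, new);
	return 1;
}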

One essential change in the VM is the use of ptep_cmpxchg (or its generic
emulation) on page table entries before doing an update_mmu_cache without holding
the page table lock. However, we already do similar things with other atomic pte operations
such as ptep_get_and_clear and ptep_test_and_clear_dirty. Those operations clear
a pte *after* doing an operation on it. The ptep_cmpxchg as used in this patch
operates on a *cleared* pte and replaces it with a pte pointing to valid memory.
The effect of this change on various architectures has to be thought through. Local
definitions of ptep_cmpxchg and ptep_xchg may be necessary.

For IA64 an icache coherency issue may arise that potentially requires the
flushing of the icache (as done via update_mmu_cache on IA64) prior
to the use of ptep_cmpxchg. Similar issues may arise on other platforms.

The patch uses sloppy rss handling. mm->rss is incremented without
proper locking because locking would introduce too much overhead. Rss
is not essential for vm operations (3 uses of rss in rmap.c were not necessary and
were removed). The difference in rss values has been found to be less than 1% in
our tests (see also the separate email to linux-mm and linux-ia64 on the subject
of "sloppy rss"). The move away from using atomic operations for rss in earlier versions
of this patch also increases the performance of the page fault handler in the single
thread case over an unpatched kernel.

Note that I have posted two other approaches to dealing with the rss problem:

A. make_rss_atomic. The earlier releases contained that patch but then another
variable (such as anon_rss) was introduced that would have required additional
atomic operations. Atomic rss operations are also causing slowdowns on
machines with a high number of cpus due to memory contention.

B. remove_rss. Replace rss with a periodic scan over the vm to determine
rss and additional numbers. This was also discussed on linux-mm and linux-ia64.
The scans required while displaying /proc data were undesirable.

The patchset is composed of 7 patches:

1/7: Sloppy rss

Removes mm->rss usage from mm/rmap.c and ensures that negative rss values
are not displayed.

2/7: Avoid page_table_lock in handle_mm_fault

This patch defers the acquisition of the page_table_lock as much as
possible and uses atomic operations for allocating anonymous memory.
These atomic operations are simulated by acquiring the page_table_lock
for very small time frames if an architecture does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. It also changes the swapper so that a
pte will not be set to empty if a page is in transition to swap.

If only the first two patches are applied then the time that the page_table_lock
is held is simply reduced. The lock may then be acquired multiple
times during a page fault.

The remaining patches introduce the necessary atomic pte operations to avoid
the page_table_lock.

3/7: Atomic pte operations for ia64

4/7: Make cmpxchg generally available on i386

The atomic operations on the page table rely heavily on cmpxchg instructions.
This patch adds emulations for cmpxchg and cmpxchg8b for old 80386 and 80486
cpus. The emulations are only included if a kernel is built for these old
cpus; they are bypassed in favor of the real cmpxchg instructions if a kernel
built for a 386 or 486 is then run on a more recent cpu.

This patch may be used independently of the other patches.

5/7: Atomic pte operations for i386

The generally available cmpxchg (see the previous patch) is needed by this patch
to preserve the ability to build kernels for 386 and 486.

6/7: Atomic pte operation for x86_64

7/7: Atomic pte operations for s390
Christoph Lameter
2004-11-19 19:43:30 UTC
Changelog
* Enable the sloppy use of mm->rss and mm->anon_rss without atomic operations or locking
* Ensure that negative rss values are not given out by the /proc filesystem
* Remove 3 checks of rss in mm/rmap.c
* Prerequisite for page table scalability patch

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-11-15 11:13:39.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-11-18 13:04:30.000000000 -0800
@@ -216,7 +216,7 @@
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
- spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
+ spinlock_t page_table_lock; /* Protects page tables */

struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
@@ -252,6 +252,21 @@
struct kioctx default_kioctx;
};

+/*
+ * rss and anon_rss are incremented and decremented in some locations without
+ * proper locking. This function insures that these values do not become negative.
+ */
+static long inline get_rss(struct mm_struct *mm)
+{
+ long rss = mm->rss;
+
+ if (rss < 0)
+ mm->rss = rss = 0;
+ if ((long)mm->anon_rss < 0)
+ mm->anon_rss = 0;
+ return rss;
+}
+
struct sighand_struct {
atomic_t count;
struct k_sigaction action[_NSIG];
Index: linux-2.6.9/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.9.orig/fs/proc/task_mmu.c 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/fs/proc/task_mmu.c 2004-11-18 12:56:26.000000000 -0800
@@ -22,7 +22,7 @@
"VmPTE:\t%8lu kB\n",
(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- mm->rss << (PAGE_SHIFT-10),
+ get_rss(mm) << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
@@ -37,7 +37,9 @@
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = mm->rss - mm->anon_rss;
+ *shared = get_rss(mm) - mm->anon_rss;
+ if (*shared <0)
+ *shared = 0;
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
Index: linux-2.6.9/fs/proc/array.c
===================================================================
--- linux-2.6.9.orig/fs/proc/array.c 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/fs/proc/array.c 2004-11-18 12:53:16.000000000 -0800
@@ -420,7 +420,7 @@
jiffies_to_clock_t(task->it_real_value),
start_time,
vsize,
- mm ? mm->rss : 0, /* you might want to shift this left 3 */
+ mm ? get_rss(mm) : 0, /* you might want to shift this left 3 */
rsslim,
mm ? mm->start_code : 0,
mm ? mm->end_code : 0,
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-15 11:13:40.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-18 12:26:45.000000000 -0800
@@ -263,8 +263,6 @@
pte_t *pte;
int referenced = 0;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -504,8 +502,6 @@
pte_t pteval;
int ret = SWAP_AGAIN;

- if (!mm->rss)
- goto out;
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
@@ -788,8 +784,7 @@
if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
continue;
cursor = (unsigned long) vma->vm_private_data;
- while (vma->vm_mm->rss &&
- cursor < max_nl_cursor &&
+ while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
try_to_unmap_cluster(cursor, &mapcount, vma);
cursor += CLUSTER_SIZE;

Hugh Dickins
2004-11-19 20:50:59 UTC
Sorry, against what tree do these patches apply?
Apparently not linux-2.6.9, nor latest -bk, nor -mm?

Hugh

Christoph Lameter
2004-11-19 19:44:47 UTC
Changelog
* Provide atomic pte operations for ia64
* Enhanced parallelism in page fault handler if applied together
with the generic patch

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgalloc.h 2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgalloc.h 2004-11-19 07:54:19.000000000 -0800
@@ -34,6 +34,10 @@
#define pmd_quicklist (local_cpu_data->pmd_quick)
#define pgtable_cache_size (local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE 0
+#define PGD_NONE 0
+
static inline pgd_t*
pgd_alloc_one_fast (struct mm_struct *mm)
{
@@ -78,12 +82,19 @@
preempt_enable();
}

+
static inline void
pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
{
pgd_val(*pgd_entry) = __pa(pmd);
}

+/* Atomic populate */
+static inline int
+pgd_test_and_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
+{
+ return ia64_cmpxchg8_acq(pgd_entry,__pa(pmd), PGD_NONE) == PGD_NONE;
+}

static inline pmd_t*
pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
@@ -132,6 +143,13 @@
pmd_val(*pmd_entry) = page_to_phys(pte);
}

+/* Atomic populate */
+static inline int
+pmd_test_and_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
+{
+ return ia64_cmpxchg8_acq(pmd_entry, page_to_phys(pte), PMD_NONE) == PMD_NONE;
+}
+
static inline void
pmd_populate_kernel (struct mm_struct *mm, pmd_t *pmd_entry, pte_t *pte)
{
Index: linux-2.6.9/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgtable.h 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/include/asm-ia64/pgtable.h 2004-11-19 07:55:35.000000000 -0800
@@ -414,6 +425,26 @@
#endif
}

+/*
+ * IA-64 doesn't have any external MMU info: the page tables contain all the necessary
+ * information. However, we use this routine to take care of any (delayed) i-cache
+ * flushing that may be necessary.
+ */
+extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
+
+static inline int
+ptep_cmpxchg (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ /*
+ * IA64 defers icache flushes. If the new pte is executable we may
+ * have to flush the icache to insure cache coherency immediately
+ * after the cmpxchg.
+ */
+ if (pte_exec(newval))
+ update_mmu_cache(vma, addr, newval);
+ return ia64_cmpxchg8_acq(&ptep->pte, newval.pte, oldval.pte) == oldval.pte;
+}
+
static inline int
pte_same (pte_t a, pte_t b)
{
@@ -476,13 +507,6 @@
struct vm_area_struct * prev, unsigned long start, unsigned long end);
#endif

-/*
- * IA-64 doesn't have any external MMU info: the page tables contain all the necessary
- * information. However, we use this routine to take care of any (delayed) i-cache
- * flushing that may be necessary.
- */
-extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
-
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
* Update PTEP with ENTRY, which is guaranteed to be a less
@@ -560,6 +584,8 @@
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+#define __HAVE_ARCH_LOCK_TABLE_OPS
#include <asm-generic/pgtable.h>

#endif /* _ASM_IA64_PGTABLE_H */

Christoph Lameter
2004-11-19 19:44:15 UTC
Changelog
* Increase parallelism in SMP configurations by deferring
the acquisition of page_table_lock in handle_mm_fault
* Anonymous memory page faults bypass the page_table_lock
through the use of atomic page table operations
* Swapper does not set pte to empty in transition to swap
* Simulate atomic page table operations using the
page_table_lock if an arch does not define
__HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides
a performance benefit since the page_table_lock
is held for shorter periods of time.

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-11-18 12:25:49.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-11-19 06:38:53.000000000 -0800
@@ -1330,8 +1330,7 @@
}

/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
*/
static int do_swap_page(struct mm_struct * mm,
struct vm_area_struct * vma, unsigned long address,
@@ -1343,15 +1342,13 @@
int ret = VM_FAULT_MINOR;

pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
/*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1374,8 +1371,7 @@
lock_page(page);

/*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
+ * Back out if somebody else faulted in this pte
*/
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
@@ -1422,14 +1418,12 @@
}

/*
- * We are called with the MM semaphore and page_table_lock
- * spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * We are called with the MM semaphore held.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *page_table, pmd_t *pmd, int write_access,
- unsigned long addr)
+ unsigned long addr, pte_t orig_entry)
{
pte_t entry;
struct page * page = ZERO_PAGE(addr);
@@ -1441,7 +1435,6 @@
if (write_access) {
/* Allocate our own private page. */
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
@@ -1450,30 +1443,37 @@
goto no_mem;
clear_user_highpage(page, addr);

- spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);

- if (!pte_none(*page_table)) {
- pte_unmap(page_table);
- page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
- goto out;
- }
- mm->rss++;
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
vma->vm_page_prot)),
vma);
- lru_cache_add_active(page);
mark_page_accessed(page);
- page_add_anon_rmap(page, vma, addr);
}

- set_pte(page_table, entry);
+ /* update the entry */
+ if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
+ if (write_access) {
+ pte_unmap(page_table);
+ page_cache_release(page);
+ }
+ goto out;
+ }
+ if (write_access) {
+ /*
+ * These two functions must come after the cmpxchg
+ * because if the page is on the LRU then try_to_unmap may come
+ * in and unmap the pte.
+ */
+ lru_cache_add_active(page);
+ page_add_anon_rmap(page, vma, addr);
+ mm->rss++;
+
+ }
pte_unmap(page_table);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
- spin_unlock(&mm->page_table_lock);
out:
return VM_FAULT_MINOR;
no_mem:
@@ -1489,12 +1489,12 @@
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held.
*/
static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access, pte_t *page_table, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *page_table,
+ pmd_t *pmd, pte_t orig_entry)
{
struct page * new_page;
struct address_space *mapping = NULL;
@@ -1505,9 +1505,8 @@

if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
- pmd, write_access, address);
+ pmd, write_access, address, orig_entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1605,7 +1604,7 @@
* nonlinear vmas.
*/
static int do_file_page(struct mm_struct * mm, struct vm_area_struct * vma,
- unsigned long address, int write_access, pte_t *pte, pmd_t *pmd)
+ unsigned long address, int write_access, pte_t *pte, pmd_t *pmd, pte_t entry)
{
unsigned long pgoff;
int err;
@@ -1618,13 +1617,12 @@
if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
}

pgoff = pte_to_pgoff(*pte);

pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);

err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -1643,49 +1641,40 @@
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * Note the "page_table_lock". It is to protect against kswapd removing
- * pages from under us. Note that kswapd only ever _removes_ pages, never
- * adds them. As such, once we have noticed that the page is not present,
- * we can drop the lock early.
- *
- * The adding of pages is protected by the MM semaphore (which we hold),
- * so we don't need to worry about a page being suddenly been added into
- * our VM.
- *
- * We enter with the pagetable spinlock held, we are supposed to
- * release it when done.
+ * Note that kswapd only ever _removes_ pages, never adds them.
+ * We need to insure to handle that case properly.
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct * vma, unsigned long address,
int write_access, pte_t *pte, pmd_t *pmd)
{
pte_t entry;
+ pte_t new_entry;

entry = *pte;
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
- return do_no_page(mm, vma, address, write_access, pte, pmd);
+ return do_no_page(mm, vma, address, write_access, pte, pmd, entry);
if (pte_file(entry))
- return do_file_page(mm, vma, address, write_access, pte, pmd);
+ return do_file_page(mm, vma, address, write_access, pte, pmd, entry);
return do_swap_page(mm, vma, address, pte, pmd, entry, write_access);
}

+ /*
+ * This is the case in which we only update some bits in the pte.
+ */
+ new_entry = pte_mkyoung(entry);
if (write_access) {
- if (!pte_write(entry))
+ if (!pte_write(entry)) {
+ /* do_wp_page expects us to hold the page_table_lock */
+ spin_lock(&mm->page_table_lock);
return do_wp_page(mm, vma, address, pte, pmd, entry);
-
- entry = pte_mkdirty(entry);
+ }
+ new_entry = pte_mkdirty(new_entry);
}
- entry = pte_mkyoung(entry);
- ptep_set_access_flags(vma, address, pte, entry, write_access);
- update_mmu_cache(vma, address, entry);
+ if (ptep_cmpxchg(vma, address, pte, entry, new_entry))
+ update_mmu_cache(vma, address, new_entry);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
}

@@ -1703,22 +1692,45 @@

inc_page_state(pgfault);

- if (is_vm_hugetlb_page(vma))
+ if (unlikely(is_vm_hugetlb_page(vma)))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */

/*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
+ * We rely on the mmap_sem and the SMP-safe atomic PTE updates.
+ * to synchronize with kswapd
*/
- spin_lock(&mm->page_table_lock);
- pmd = pmd_alloc(mm, pgd, address);
+ if (unlikely(pgd_none(*pgd))) {
+ pmd_t *new = pmd_alloc_one(mm, address);
+ if (!new)
+ return VM_FAULT_OOM;
+
+ /* Insure that the update is done in an atomic way */
+ if (!pgd_test_and_populate(mm, pgd, new))
+ pmd_free(new);
+ }
+
+ pmd = pmd_offset(pgd, address);
+
+ if (likely(pmd)) {
+ pte_t *pte;
+
+ if (!pmd_present(*pmd)) {
+ struct page *new;

- if (pmd) {
- pte_t * pte = pte_alloc_map(mm, pmd, address);
- if (pte)
+ new = pte_alloc_one(mm, address);
+ if (!new)
+ return VM_FAULT_OOM;
+
+ if (!pmd_test_and_populate(mm, pmd, new))
+ pte_free(new);
+ else
+ inc_page_state(nr_page_table_pages);
+ }
+
+ pte = pte_offset_map(pmd, address);
+ if (likely(pte))
return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
}
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_OOM;
}

Index: linux-2.6.9/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-generic/pgtable.h 2004-10-18 14:53:46.000000000 -0700
+++ linux-2.6.9/include/asm-generic/pgtable.h 2004-11-19 07:54:05.000000000 -0800
@@ -134,4 +134,60 @@
#define pgd_offset_gate(mm, addr) pgd_offset(mm, addr)
#endif

+#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
+/*
+ * If atomic page table operations are not available then use
+ * the page_table_lock to insure some form of locking.
+ * Note thought that low level operations as well as the
+ * page_table_handling of the cpu may bypass all locking.
+ */
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__vma, __addr, __ptep, __oldval, __newval) \
+({ \
+ int __rc; \
+ spin_lock(&__vma->vm_mm->page_table_lock); \
+ __rc = pte_same(*(__ptep), __oldval); \
+ if (__rc) set_pte(__ptep, __newval); \
+ spin_unlock(&__vma->vm_mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PGP_TEST_AND_POPULATE
+#define pgd_test_and_populate(__mm, __pgd, __pmd) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pgd_present(*(__pgd)); \
+ if (__rc) pgd_populate(__mm, __pgd, __pmd); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#ifndef __HAVE_PMD_TEST_AND_POPULATE
+#define pmd_test_and_populate(__mm, __pmd, __page) \
+({ \
+ int __rc; \
+ spin_lock(&__mm->page_table_lock); \
+ __rc = !pmd_present(*(__pmd)); \
+ if (__rc) pmd_populate(__mm, __pmd, __page); \
+ spin_unlock(&__mm->page_table_lock); \
+ __rc; \
+})
+#endif
+
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __p = __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)));\
+ flush_tlb_page(__vma, __address); \
+ __p; \
+})
+
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.9/mm/rmap.c
===================================================================
--- linux-2.6.9.orig/mm/rmap.c 2004-11-19 06:38:51.000000000 -0800
+++ linux-2.6.9/mm/rmap.c 2004-11-19 06:38:53.000000000 -0800
@@ -419,7 +419,10 @@
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
*
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
*/
void page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
@@ -561,11 +564,6 @@

/* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
-
- /* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
- set_page_dirty(page);

if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
@@ -580,11 +578,15 @@
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- set_pte(pte, swp_entry_to_pte(entry));
+ pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
mm->anon_rss--;
- }
+ } else
+ pteval = ptep_clear_flush(vma, address, pte);

+ /* Move the dirty bit to the physical page now the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
mm->rss--;
page_remove_rmap(page);
page_cache_release(page);
@@ -671,15 +673,21 @@
if (ptep_clear_flush_young(vma, address, pte))
continue;

- /* Nuke the page table entry. */
flush_cache_page(vma, address);
- pteval = ptep_clear_flush(vma, address, pte);
+ /*
+ * There would be a race here with handle_mm_fault and do_anonymous_page
+ * which bypasses the page_table_lock if we would zap the pte before
+ * putting something into it. On the other hand we need to
+ * have the dirty flag setting at the time we replaced the value.
+ */

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
- set_pte(pte, pgoff_to_pte(page->index));
+ pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+ else
+ pteval = ptep_get_and_clear(pte);

- /* Move the dirty bit to the physical page now the pte is gone. */
+ /* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);


Christoph Lameter
2004-11-19 19:45:28 UTC
Changelog
* Make cmpxchg and cmpxchg8b generally available on the i386
platform.
* Provide emulation of cmpxchg suitable for uniprocessor systems if
built and run on a 386.
* Provide emulation of cmpxchg8b suitable for uniprocessor systems
if built and run on a 386 or 486.
* Provide an inline function to atomically get a 64 bit value via
cmpxchg8b in an SMP system (courtesy of Nick Piggin)
(important for i386 PAE mode and other places where atomic 64 bit
operations are useful)

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/arch/i386/Kconfig
===================================================================
--- linux-2.6.9.orig/arch/i386/Kconfig 2004-11-15 11:13:34.000000000 -0800
+++ linux-2.6.9/arch/i386/Kconfig 2004-11-19 10:02:54.000000000 -0800
@@ -351,6 +351,11 @@
depends on !M386
default y

+config X86_CMPXCHG8B
+ bool
+ depends on !M386 && !M486
+ default y
+
config X86_XADD
bool
depends on !M386
Index: linux-2.6.9/arch/i386/kernel/cpu/intel.c
===================================================================
--- linux-2.6.9.orig/arch/i386/kernel/cpu/intel.c 2004-11-15 11:13:34.000000000 -0800
+++ linux-2.6.9/arch/i386/kernel/cpu/intel.c 2004-11-19 10:38:26.000000000 -0800
@@ -6,6 +6,7 @@
#include <linux/bitops.h>
#include <linux/smp.h>
#include <linux/thread_info.h>
+#include <linux/module.h>

#include <asm/processor.h>
#include <asm/msr.h>
@@ -287,5 +288,103 @@
return 0;
}

+#ifndef CONFIG_X86_CMPXCHG
+unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new)
+{
+ u8 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u8));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u8 *)ptr;
+ if (prev == old)
+ *(u8 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u8);
+
+unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new)
+{
+ u16 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u16));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u16 *)ptr;
+ if (prev == old)
+ *(u16 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u16);
+
+unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new)
+{
+ u32 prev;
+ unsigned long flags;
+ /*
+ * Check if the kernel was compiled for an old cpu but the
+ * currently running cpu can do cmpxchg after all
+ * All CPUs except 386 support CMPXCHG
+ */
+ if (cpu_data->x86 > 3)
+ return __cmpxchg(ptr, old, new, sizeof(u32));
+
+ /* Poor man's cmpxchg for 386. Unsuitable for SMP */
+ local_irq_save(flags);
+ prev = *(u32 *)ptr;
+ if (prev == old)
+ *(u32 *)ptr = new;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg_386_u32);
+#endif
+
+#ifndef CONFIG_X86_CMPXCHG8B
+unsigned long long cmpxchg8b_486(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ unsigned long flags;
+
+ /*
+ * Check if the kernel was compiled for an old cpu but
+ * we are running really on a cpu capable of cmpxchg8b
+ */
+
+ if (cpu_has(cpu_data, X86_FEATURE_CX8))
+ return __cmpxchg8b(ptr, old, newv);
+
+ /* Poor mans cmpxchg8b for 386 and 486. Not suitable for SMP */
+ local_irq_save(flags);
+ prev = *ptr;
+ if (prev == old)
+ *ptr = newv;
+ local_irq_restore(flags);
+ return prev;
+}
+
+EXPORT_SYMBOL(cmpxchg8b_486);
+#endif
+
// arch_initcall(intel_cpu_init);

Index: linux-2.6.9/include/asm-i386/system.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/system.h 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/include/asm-i386/system.h 2004-11-19 10:49:46.000000000 -0800
@@ -149,6 +149,9 @@
#define __xg(x) ((struct __xchg_dummy *)(x))


+#define ll_low(x) *(((unsigned int*)&(x))+0)
+#define ll_high(x) *(((unsigned int*)&(x))+1)
+
/*
* The semantics of XCHGCMP8B are a bit strange, this is why
* there is a loop and the loading of %%eax and %%edx has to
@@ -184,8 +187,6 @@
{
__set_64bit(ptr,(unsigned int)(value), (unsigned int)((value)>>32ULL));
}
-#define ll_low(x) *(((unsigned int*)&(x))+0)
-#define ll_high(x) *(((unsigned int*)&(x))+1)

static inline void __set_64bit_var (unsigned long long *ptr,
unsigned long long value)
@@ -203,6 +204,26 @@
__set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
__set_64bit(ptr, ll_low(value), ll_high(value)) )

+static inline unsigned long long __get_64bit(unsigned long long * ptr)
+{
+ unsigned long long ret;
+ __asm__ __volatile__ (
+ "\n1:\t"
+ "movl (%1), %%eax\n\t"
+ "movl 4(%1), %%edx\n\t"
+ "movl %%eax, %%ebx\n\t"
+ "movl %%edx, %%ecx\n\t"
+ LOCK_PREFIX "cmpxchg8b (%1)\n\t"
+ "jnz 1b"
+ : "=A"(ret)
+ : "D"(ptr)
+ : "ebx", "ecx", "memory");
+ return ret;
+}
+
+#define get_64bit(ptr) __get_64bit(ptr)
+
+
/*
* Note: no "lock" prefix even on SMP: xchg always implies lock anyway
* Note 2: xchg has side effect, so that attribute volatile is necessary,
@@ -240,7 +261,41 @@
*/

#ifdef CONFIG_X86_CMPXCHG
+
#define __HAVE_ARCH_CMPXCHG 1
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
+
+#else
+
+/*
+ * Building a kernel capable running on 80386. It may be necessary to
+ * simulate the cmpxchg on the 80386 CPU. For that purpose we define
+ * a function for each of the sizes we support.
+ */
+
+extern unsigned long cmpxchg_386_u8(volatile void *, u8, u8);
+extern unsigned long cmpxchg_386_u16(volatile void *, u16, u16);
+extern unsigned long cmpxchg_386_u32(volatile void *, u32, u32);
+
+static inline unsigned long cmpxchg_386(volatile void *ptr, unsigned long old,
+ unsigned long new, int size)
+{
+ switch (size) {
+ case 1:
+ return cmpxchg_386_u8(ptr, old, new);
+ case 2:
+ return cmpxchg_386_u16(ptr, old, new);
+ case 4:
+ return cmpxchg_386_u32(ptr, old, new);
+ }
+ return old;
+}
+
+#define cmpxchg(ptr,o,n)\
+ ((__typeof__(*(ptr)))cmpxchg_386((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))))
#endif

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
@@ -270,10 +325,32 @@
return old;
}

-#define cmpxchg(ptr,o,n)\
- ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
- (unsigned long)(n),sizeof(*(ptr))))
-
+static inline unsigned long long __cmpxchg8b(volatile unsigned long long *ptr,
+ unsigned long long old, unsigned long long newv)
+{
+ unsigned long long prev;
+ __asm__ __volatile__(
+ LOCK_PREFIX "cmpxchg8b (%4)"
+ : "=A" (prev)
+ : "0" (old), "c" ((unsigned long)(newv >> 32)),
+ "b" ((unsigned long)(newv & 0xffffffffULL)), "D" (ptr)
+ : "memory");
+ return prev;
+}
+
+#ifdef CONFIG_X86_CMPXCHG8B
+#define cmpxchg8b __cmpxchg8b
+#else
+/*
+ * Building a kernel capable of running on 80486 and 80386. Both
+ * do not support cmpxchg8b. Call a function that emulates the
+ * instruction if necessary.
+ */
+extern unsigned long long cmpxchg8b_486(volatile unsigned long long *,
+ unsigned long long, unsigned long long);
+#define cmpxchg8b cmpxchg8b_486
+#endif
+
#ifdef __KERNEL__
struct alt_instr {
__u8 *instr; /* original instruction */

Christoph Lameter
2004-11-19 19:46:45 UTC
Changelog
* Provide atomic pte operations for x86_64

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/pgalloc.h 2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-x86_64/pgalloc.h 2004-11-19 08:17:55.000000000 -0800
@@ -7,16 +7,26 @@
#include <linux/threads.h>
#include <linux/mm.h>

+#define PMD_NONE 0
+#define PGD_NONE 0
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
#define pgd_populate(mm, pgd, pmd) \
set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd)))
+#define pgd_test_and_populate(mm, pgd, pmd) \
+ (cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pmd)) == PGD_NONE)

static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
{
set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
}

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+ return cmpxchg((int *)pmd, PMD_NONE, _PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+}
+
extern __inline__ pmd_t *get_pmd(void)
{
return (pmd_t *)get_zeroed_page(GFP_KERNEL);
Index: linux-2.6.9/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/pgtable.h 2004-11-15 11:13:39.000000000 -0800
+++ linux-2.6.9/include/asm-x86_64/pgtable.h 2004-11-19 08:18:52.000000000 -0800
@@ -437,6 +437,10 @@
#define kc_offset_to_vaddr(o) \
(((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))

+
+#define ptep_cmpxchg(__vma,__addr,__xp,__oldval,__newval) (cmpxchg(&(__xp)->pte, pte_val(__oldval), pte_val(__newval)) == pte_val(__oldval))
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR

Christoph Lameter
2004-11-19 19:46:06 UTC
Changelog
* Atomic pte operations for i386 in regular and PAE modes

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable.h 2004-11-15 11:13:38.000000000 -0800
+++ linux-2.6.9/include/asm-i386/pgtable.h 2004-11-19 10:05:27.000000000 -0800
@@ -413,6 +413,7 @@
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
#include <asm-generic/pgtable.h>

#endif /* _I386_PGTABLE_H */
Index: linux-2.6.9/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable-3level.h 2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgtable-3level.h 2004-11-19 10:10:06.000000000 -0800
@@ -6,7 +6,8 @@
* tables on PPro+ CPUs.
*
* Copyright (C) 1999 Ingo Molnar <***@redhat.com>
- */
+ * August 26, 2004 added ptep_cmpxchg <***@lameter.com>
+*/

#define pte_ERROR(e) \
printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
@@ -42,26 +43,15 @@
return pte_x(pte);
}

-/* Rules for using set_pte: the pte being assigned *must* be
- * either not present or in a state where the hardware will
- * not attempt to update the pte. In places where this is
- * not possible, use pte_get_and_clear to obtain the old pte
- * value and then use set_pte to update it. -ben
- */
-static inline void set_pte(pte_t *ptep, pte_t pte)
-{
- ptep->pte_high = pte.pte_high;
- smp_wmb();
- ptep->pte_low = pte.pte_low;
-}
-#define __HAVE_ARCH_SET_PTE_ATOMIC
-#define set_pte_atomic(pteptr,pteval) \
+#define set_pte(pteptr,pteval) \
set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
#define set_pmd(pmdptr,pmdval) \
set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
#define set_pgd(pgdptr,pgdval) \
set_64bit((unsigned long long *)(pgdptr),pgd_val(pgdval))

+#define set_pte_atomic set_pte
+
/*
* Pentium-II erratum A13: in PAE mode we explicitly have to flush
* the TLB via cr3 if the top-level pgd is changed...
@@ -142,4 +132,23 @@
#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high })
#define __swp_entry_to_pte(x) ((pte_t){ 0, (x).val })

+/* Atomic PTE operations */
+#define ptep_xchg_flush(__vma, __addr, __ptep, __newval) \
+({ pte_t __r; \
+ /* xchg acts as a barrier before the setting of the high bits. */\
+ __r.pte_low = xchg(&(__ptep)->pte_low, (__newval).pte_low); \
+ __r.pte_high = (__ptep)->pte_high; \
+ (__ptep)->pte_high = (__newval).pte_high; \
+ flush_tlb_page(__vma, __addr); \
+ (__r); \
+})
+
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
+
+static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ return cmpxchg((unsigned int *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
+
#endif /* _I386_PGTABLE_3LEVEL_H */
Index: linux-2.6.9/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable-2level.h 2004-10-18 14:54:31.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgtable-2level.h 2004-11-19 10:05:27.000000000 -0800
@@ -82,4 +82,7 @@
#define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })

+/* Atomic PTE operations */
+#define ptep_cmpxchg(__vma,__a,__xp,__oldpte,__newpte) (cmpxchg(&(__xp)->pte_low, (__oldpte).pte_low, (__newpte).pte_low)==(__oldpte).pte_low)
+
#endif /* _I386_PGTABLE_2LEVEL_H */
Index: linux-2.6.9/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgalloc.h 2004-10-18 14:53:10.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgalloc.h 2004-11-19 10:10:40.000000000 -0800
@@ -4,9 +4,12 @@
#include <linux/config.h>
#include <asm/processor.h>
#include <asm/fixmap.h>
+#include <asm/system.h>
#include <linux/threads.h>
#include <linux/mm.h> /* for struct page */

+#define PMD_NONE 0L
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte)))

@@ -16,6 +19,19 @@
((unsigned long long)page_to_pfn(pte) <<
(unsigned long long) PAGE_SHIFT)));
}
+
+/* Atomic version */
+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+#ifdef CONFIG_X86_PAE
+ return cmpxchg8b( ((unsigned long long *)pmd), PMD_NONE, _PAGE_TABLE +
+ ((unsigned long long)page_to_pfn(pte) <<
+ (unsigned long long) PAGE_SHIFT) ) == PMD_NONE;
+#else
+ return cmpxchg( (unsigned long *)pmd, PMD_NONE, _PAGE_TABLE + (page_to_pfn(pte) << PAGE_SHIFT)) == PMD_NONE;
+#endif
+}
+
/*
* Allocate and free page tables.
*/
@@ -49,6 +65,7 @@
#define pmd_free(x) do { } while (0)
#define __pmd_free_tlb(tlb,x) do { } while (0)
#define pgd_populate(mm, pmd, pte) BUG()
+#define pgd_test_and_populate(mm, pmd, pte) ({ BUG(); 1; })

#define check_pgt_cache() do { } while (0)
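
As an illustration of how the new helper might be used (the function name below is made up and error handling is simplified): a pte page allocation that does not take the page_table_lock would try to populate the pmd atomically and discard its freshly allocated page if another cpu won the race.

/* Illustrative sketch only -- not from the posted patches. */
static int pte_alloc_atomic(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
        struct page *new;

        if (!pmd_none(*pmd))
                return 0;                  /* already populated      */

        new = pte_alloc_one(mm, addr);
        if (!new)
                return -ENOMEM;

        if (!pmd_test_and_populate(mm, pmd, new))
                pte_free(new);             /* another cpu beat us    */

        return 0;
}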


Christoph Lameter
2004-11-19 19:47:14 UTC
Permalink
Changelog
* Provide atomic pte operations for s390

Signed-off-by: Christoph Lameter <***@sgi.com>

Index: linux-2.6.9/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-s390/pgtable.h 2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/include/asm-s390/pgtable.h 2004-11-19 11:35:08.000000000 -0800
@@ -567,6 +567,15 @@
return pte;
}

+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ struct mm_struct *__mm = __vma->vm_mm; \
+ pte_t __pte; \
+ __pte = ptep_clear_flush(__vma, __address, __ptep); \
+ set_pte(__ptep, __pteval); \
+ __pte; \
+})
+
static inline void ptep_set_wrprotect(pte_t *ptep)
{
pte_t old_pte = *ptep;
@@ -778,6 +787,14 @@

#define kern_addr_valid(addr) (1)

+/* Atomic PTE operations */
+#define __HAVE_ARCH_ATOMIC_TABLE_OPS
+
+static inline int ptep_cmpxchg (struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
+{
+ return cmpxchg(ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+}
+
/*
* No page table caches to initialise
*/
@@ -791,6 +808,7 @@
#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define __HAVE_ARCH_PTEP_CLEAR_FLUSH
+#define __HAVE_ARCH_PTEP_XCHG_FLUSH
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
Index: linux-2.6.9/include/asm-s390/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-s390/pgalloc.h 2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/include/asm-s390/pgalloc.h 2004-11-19 11:33:25.000000000 -0800
@@ -97,6 +97,10 @@
pgd_val(*pgd) = _PGD_ENTRY | __pa(pmd);
}

+static inline int pgd_test_and_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
+{
+ return cmpxchg(pgd, _PAGE_TABLE_INV, _PGD_ENTRY | __pa(pmd)) == _PAGE_TABLE_INV;
+}
#endif /* __s390x__ */

static inline void
@@ -119,6 +123,18 @@
pmd_populate_kernel(mm, pmd, (pte_t *)((page-mem_map) << PAGE_SHIFT));
}

+static inline int
+pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *page)
+{
+ int rc;
+ spin_lock(&mm->page_table_lock);
+
+ rc=pte_same(*pmd, _PAGE_INVALID_EMPTY);
+ if (rc) pmd_populate(mm, pmd, page);
+ spin_unlock(&mm->page_table_lock);
+ return rc;
+}
+
/*
* page table entry allocation/free routines.
*/

Linus Torvalds
2004-11-19 19:59:03 UTC
Permalink
You could also make "rss" be a _signed_ integer per-thread.

When unmapping a page, you decrement the counter of one of the threads that
share the mm (it doesn't matter which one - which is why the per-thread rss
may go negative), and when mapping a page you increment it.

Then, anybody who actually wants a global rss can just iterate over
threads and add it all up. If you do it under the mmap_sem, it's stable,
and if you do it outside the mmap_sem it's imprecise but stable in the
long term (ie errors never _accumulate_, like the non-atomic case will
do).

Does anybody care enough? Maybe, maybe not. It certainly sounds a hell of
a lot better than the periodic scan.
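
A minimal sketch of that bookkeeping, with every name below assumed for illustration (this is not a posted patch, and locking for the thread-list walk is omitted):

/* Assumes a new signed per-thread field 'rss_delta' in struct task_struct.
 * Mapping a page charges the thread that maps it; unmapping debits
 * whichever thread happens to do the unmap. */
static inline void rss_add(struct task_struct *tsk, long pages)
{
        tsk->rss_delta += pages;        /* plain store, no lock, no atomics */
}

/* Sum the deltas of all threads sharing the mm.  Under mmap_sem the
 * result is exact; without it it is imprecise, but the error does not
 * accumulate over time. */
static long approx_rss(struct task_struct *tsk)
{
        struct task_struct *t = tsk;
        long rss = 0;

        do {
                rss += t->rss_delta;
                t = next_thread(t);
        } while (t != tsk);

        return rss < 0 ? 0 : rss;       /* never report a negative total */
}

Because every map and unmap is still counted exactly once somewhere, the per-thread deltas can drift individually but the sum cannot.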

Linus
Benjamin Herrenschmidt
2004-11-19 07:05:20 UTC
Permalink
Post by Nick Piggin
Post by Christoph Lameter
This patch conflicts with the page fault scalability patch, but I could not
leave this stone unturned. No significant performance increases, so
this is just for the record in case someone else gets the same wild idea.
I had a similar wild idea. Mine was to just make sure we have a spare
per-CPU page ready before taking any locks.
Ahh, you're doing clear_user_highpage after the pte is already set up?
Won't that be racy? I guess that would be an advantage of my approach:
the clear_user_highpage can be done first (although that is more likely
to be wasteful of cache).
Yah, doing clear_user_highpage() after setting the PTE is unfortunately
unacceptable. It shows interesting bugs... As soon as the PTE is set up,
another thread on another CPU can hit the page, and you'll then clear what
it's writing...

Take, for example, two threads writing to different structures in the same
page of anonymous memory. The first one triggers the allocation; the
second writes right away, "sees" the new PTE, and writes just before the
first one does clear_user_highpage...
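
To make the ordering constraint concrete, a purely illustrative fragment (names as used elsewhere in the thread):

        /* Wrong: set_pte(ptep, entry); clear_user_highpage(page, addr);
         * -- another cpu can write through the new pte and then have its
         * store wiped out by the clear. */

        /* Safe: initialize the page first, publish the pte last. */
        clear_user_highpage(page, addr);
        entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)), vma);
        set_pte(ptep, entry);           /* the page becomes reachable only here */
        update_mmu_cache(vma, addr, entry);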

Ben.


Christoph Lameter
2004-11-19 19:21:38 UTC
Permalink
Just coming back to your sloppy rss patch - this thing will of course allow
unbounded error to build up. Well, it *will* be bounded by the actual RSS if
we assume the races can only cause rss to be underestimated. However, such an
assumption (I think it is a safe one?) also means that rss won't hover around
the correct value, but tend to go increasingly downward.
On your HPC codes that never reclaim memory, and don't do a lot of mapping /
unmapping I guess this wouldn't matter... But a long running database or
something?
Databases preallocate memory on startup and then manage memory themselves.
One reason for this patch is that, given lots of memory, these applications
cause anonymous page fault storms on startup, which can make the system
seem to freeze for a while.

It is rare for a program to actually free up memory.

Where this approach could be problematic is when the system is under
heavy swap load. Pages of an application will be repeatedly paged in and
out, and therefore rss will be incremented and decremented. But in those
cases the increments and decrements are not deliberately parallel the way
they are in my test programs, so I would expect rss to be more accurate
than in my tests.

I think the sloppy rss approach is the right way to go.
Robin Holt
2004-11-19 19:57:21 UTC
Permalink
Post by Christoph Lameter
Just coming back to your sloppy rss patch - this thing will of course allow
unbounded error to build up. Well, it *will* be bounded by the actual RSS if
we assume the races can only cause rss to be underestimated. However, such an
assumption (I think it is a safe one?) also means that rss won't hover around
the correct value, but tend to go increasingly downward.
On your HPC codes that never reclaim memory, and don't do a lot of mapping /
unmapping I guess this wouldn't matter... But a long running database or
something?
Databases preallocate memory on startup and then manage memory themselves.
One reason for this patch is that, given lots of memory, these applications
cause anonymous page fault storms on startup, which can make the system
seem to freeze for a while.
It is rare for a program to actually free up memory.
Where this approach could be problematic is when the system is under
heavy swap load. Pages of an application will be repeatedly paged in and
out, and therefore rss will be incremented and decremented. But in those
cases the increments and decrements are not deliberately parallel the way
they are in my test programs, so I would expect rss to be more accurate
than in my tests.
I think the sloppy rss approach is the right way to go.
Is this really that much of a problem? Why not leave rss as an _ACCURATE_
count of pages? That way things like limits based upon rss and accounting
of memory usage stay accurate.

Have we tried splitting the counter into separate cache lines? How about
grouped counters, one for every 16 cpus, instead of the per-cpu counter
proposed by someone else earlier?
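
A rough sketch of such grouped counters (all names below are assumed; this is not a posted patch):

#define RSS_CPUS_PER_GROUP      16
#define RSS_NR_GROUPS           ((NR_CPUS + RSS_CPUS_PER_GROUP - 1) / RSS_CPUS_PER_GROUP)

/* One cacheline per group of 16 cpus instead of one hot counter. */
struct rss_group {
        atomic_t count;
} ____cacheline_aligned_in_smp;

struct mm_rss {
        struct rss_group group[RSS_NR_GROUPS];
};

static inline void mm_rss_inc(struct mm_rss *rss)
{
        int cpu = get_cpu();            /* disable preemption while we pick a slot */

        atomic_inc(&rss->group[cpu / RSS_CPUS_PER_GROUP].count);
        put_cpu();
}

static inline long mm_rss_read(struct mm_rss *rss)
{
        long total = 0;
        int i;

        for (i = 0; i < RSS_NR_GROUPS; i++)
                total += atomic_read(&rss->group[i].count);

        return total;
}

Every update is still atomic, so the total stays exact; only the contention is spread across cachelines.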

IMHO, keeping rss as an accurate count is much more important than having
a nearly correct value. If this turns into more of a scaling issue later on,
a problem with your patch will only be caught by someone accidentally noticing
that the rss value is _WAY_ off, as opposed to being caught by our normal
methods for detecting cacheline contention.

Just my opinion,
Robin Holt