arch/x86/kernel/acpi/boot.c: In function ‘dmi_ignore_irq0_timer_override’:
arch/x86/kernel/acpi/boot.c:1443: error: implicit declaration of function ‘force_mask_ioapic_irq_2’
The problems are that, with the ACPI vs timer overring issue _fixed_,
after using the box for some time (between several seconds and 1 hour, at
random) processes get very high CPU loads (once I've got X using 107% of
the CPU, for example) and the system becomes unresponsive, as though there
were interrupts lost or something similar.
Andreas Herrman reproduced similar problems:
> Ok, now I've reproduced the stability problem.
> - Using tip/master,
> - reverting e38502eb8aa82314d5ab0eba45f50e6790dadd88 and
> - applying your patch from this posting
> http://marc.info/?l=linux-kernel&m=121539354224562&w=4
>
> Starting X, firefox, gimp, tuxpaint and doing some drawing in tuxpaint
> results in a slow system. Drawing is almost not possible anymore --
> Selections of new colors, cursors etc. is performed with huge delay
> if it's performed at all.
>
> BTW, the code sets up timer IRQ as Virtual Wire IRQ:
>
> Jul 8 14:57:58 kodscha IO-APIC (apicid-pin) 2-22, 2-23 not connected.
> Jul 8 14:57:58 kodscha ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> Jul 8 14:57:58 kodscha ...trying to set up timer as Virtual Wire IRQ... works.
>
> and both INT0 and INT2 of IOAPIC are masked:
>
> Jul 8 14:57:58 kodscha NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect:
> Jul 8 14:57:58 kodscha 00 000 1 0 0 0 0 0 0 00
> Jul 8 14:57:58 kodscha 01 003 0 0 0 0 0 1 1 31
> Jul 8 14:57:58 kodscha 02 003 1 0 0 0 0 0 0 30
>
> I've also seen strange CPU utilization -- with syslog-ng:
>
> top - 15:33:06 up 35 min, 4 users, load average: 1.70, 0.68, 0.37
> Tasks: 64 total, 4 running, 60 sleeping, 0 stopped, 0 zombie
> Cpu0 : 0.0%us,100.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu1 : 6.4%us, 87.2%sy, 0.0%ni, 5.8%id, 0.0%wa, 0.6%hi, 0.0%si, 0.0%st
> Mem: 895384k total, 283568k used, 611816k free, 35492k buffers
> Swap: 1959920k total, 0k used, 1959920k free, 163044k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4632 root 20 0 17216 800 580 S 104 0.1 0:34.22 syslog-ng
> 28505 root 20 0 205m 11m 4024 S 6 1.3 0:21.16 X
> 28518 root 20 0 56292 5652 4492 S 1 0.6 0:01.80 fluxbox
> 1 root 20 0 3724 608 508 S 0 0.1 0:00.36 init
>
> So far I have no clue why C1E-idle in conjunction with virtual wire
> mode causes this strange behaviour.
>
> ... and I start to think about the root cause of all this.
>
> I've performed similar tests under X with the IRQ0/INT0 configuration and
> I did not see above symptoms.
So lets fall back to the IRQ0/INT0 configuration on this box.
This basically restores the dont-use-the-lapic-timer exception mechanism
that was unconditional on this box prior commit 8750bf5 ("x86: add C1E
aware idle function").
x86, uv: build fix #2 for "x86, uv: update x86 mmr list for SGI uv"
fix:
In file included from arch/x86/kernel/tlb_uv.c:14:
include/asm/uv/uv_mmrs.h:986: error: redefinition of ‘union uvh_rh_gam_cfg_overlay_config_mmr_u’
include/asm/uv/uv_mmrs.h:988: error: redefinition of ‘struct uvh_rh_gam_cfg_overlay_config_mmr_s’
include/asm/uv/uv_mmrs.h:1064: error: redefinition of ‘union uvh_rh_gam_mmioh_overlay_config_mmr_u’
include/asm/uv/uv_mmrs.h:1066: error: redefinition of ‘struct uvh_rh_gam_mmioh_overlay_config_mmr_s’
caused by another duplicate section (cut & paste error) in commit 5d061e397db1 "x86, uv: update x86 mmr list for SGI uv".
x86, uv: build fix for "x86, uv: update x86 mmr list for SGI uv"
fix:
In file included from arch/x86/kernel/genx2apic_uv_x.c:25:
include/asm/uv/uv_mmrs.h:986: error: redefinition of ‘union uvh_rh_gam_cfg_overlay_config_mmr_u’
include/asm/uv/uv_mmrs.h:988: error: redefinition of ‘struct uvh_rh_gam_cfg_overlay_config_mmr_s’
include/asm/uv/uv_mmrs.h:1064: error: redefinition of ‘union uvh_rh_gam_mmioh_overlay_config_mmr_u’
include/asm/uv/uv_mmrs.h:1066: error: redefinition of ‘struct uvh_rh_gam_mmioh_overlay_config_mmr_s’
caused by duplicate section (cut & paste error) in commit 5d061e397db1 "x86, uv: update x86 mmr list for SGI uv".
- local caching of smp_processor_id() in default_do_nmi()
- v2: do not split default_do_nmi over two lines
On Wed, Jul 02, 2008 at 08:12:20PM +0400, Cyrill Gorcunov wrote:
> | -static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
> | +static notrace __kprobes void
> | +default_do_nmi(struct pt_regs *regs)
> | [ ... ]
> | -asmlinkage notrace __kprobes void default_do_nmi(struct pt_regs *regs)
> | +asmlinkage notrace __kprobes void
> | +default_do_nmi(struct pt_regs *regs)
>
> Hi Alexander, good done, thanks! But why did you split default_do_nmi
> definition by two lines? I think it would be better to keep them as it
> was before, ie by a single line
>
> static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
Thanks! Here is the replacement patch with default_do_nmi left on
a single line.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm> Acked-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reorder headers and collect globals in traps_32.c and traps_64.c
Code size and data size are unaffected by the changes. Code
itself is changed due to different ordering of data and bss.
The bss segment changed size due to a change in the packing
of the variables.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm> Acked-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge the tsc calibration code for the 32bit and 64bit kernel.
The paravirtualized calculate_cpu_khz for 64bit now points to the correct
tsc_calibrate code as in 32bit.
Original native_calculate_cpu_khz for 64 bit is now called as calibrate_cpu.
Also moved the recalibrate_cpu_khz function in the common file.
Note that this function is called only from powernow K7 cpu freq driver.
Signed-off-by: Alok N Kataria <akataria@vmware.com> Signed-off-by: Dan Hecht <dhecht@vmware.com> Cc: Dan Hecht <dhecht@vmware.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the basic global variable definitions and sched_clock handling in the
common "tsc.c" file.
- Unify notsc kernel command line handling for 32 bit and 64bit.
- Functional changes for 64bit.
- "tsc_disabled" is updated if "notsc" is passed at boottime.
- Fallback to jiffies for sched_clock, incase notsc is passed on
commandline.
Signed-off-by: Alok N Kataria <akataria@vmware.com> Signed-off-by: Dan Hecht <dhecht@vmware.com> Cc: Dan Hecht <dhecht@vmware.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Bernhard Walle [Fri, 27 Jun 2008 11:12:55 +0000 (13:12 +0200)]
x86: use FIRMWARE_MEMMAP on x86/E820
This patch uses the /sys/firmware/memmap interface provided in the last patch
on the x86 architecture when E820 is used. The patch copies the E820
memory map very early, and registers the E820 map afterwards via
firmware_map_add_early().
Bernhard Walle [Fri, 27 Jun 2008 11:12:54 +0000 (13:12 +0200)]
sysfs: add /sys/firmware/memmap
This patch adds /sys/firmware/memmap interface that represents the BIOS
(or Firmware) provided memory map. The tree looks like:
/sys/firmware/memmap/0/start (hex number)
end (hex number)
type (string)
... /1/start
end
type
With the following shell snippet one can print the memory map in the same form
the kernel prints itself when booting on x86 (the E820 map).
--------- 8< --------------------------
#!/bin/sh
cd /sys/firmware/memmap
for dir in * ; do
start=$(cat $dir/start)
end=$(cat $dir/end)
type=$(cat $dir/type)
printf "%016x-%016x (%s)\n" $start $[ $end +1] "$type"
done
--------- >8 --------------------------
That patch only provides the needed interface:
1. The sysfs interface.
2. The structure and enumeration definition.
3. The function firmware_map_add() and firmware_map_add_early()
that should be called from architecture code (E820/EFI, for
example) to add the contents to the interface.
If the kernel is compiled without CONFIG_FIRMWARE_MEMMAP, the interface does
nothing without cluttering the architecture-specific code with #ifdef's.
The purpose of the new interface is kexec: While /proc/iomem represents
the *used* memory map (e.g. modified via kernel parameters like 'memmap'
and 'mem'), the /sys/firmware/memmap tree represents the unmodified memory
map provided via the firmware. So kexec can:
- use the original memory map for rebooting,
- use the /proc/iomem for setting up the ELF core headers for kdump
case that should only represent the memory of the system.
Rather than using _PAGE_GLOBAL - which not all CPUs support - to test
CPA, use one of the reserved-for-software-use PTE flags instead. This
allows CPA testing to work on CPUs which don't support PGD.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86_32: remove __PAGE_KERNEL(_EXEC)
From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Older x86-32 processors do not support global mappings (PGD), so must
only use it if the processor supports it.
The _PAGE_KERNEL* flags always have _PAGE_KERNEL set, since logically
we always want it set.
This is OK even on processors which do not support PGD, since all
_PAGE flags are masked with __supported_pte_mask before being turned
into a real in-pagetable pte. On 32-bit systems, __supported_pte_mask
is initialized to not contain _PAGE_GLOBAL, and it is then added if
the CPU is found to support it.
The x86-32 code used to use __PAGE_KERNEL/__PAGE_KERNEL_EXEC for this
purpose, but they're now redundant and can be removed.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86: always set _PAGE_GLOBAL in _PAGE_KERNEL* flags
Consistently set _PAGE_GLOBAL in _PAGE_KERNEL flags. This makes 32-
and 64-bit code consistent, and removes some special cases where
__PAGE_KERNEL* did not have _PAGE_GLOBAL set, causing confusion as a
result of the inconsistencies.
This patch only affects x86-64, which generally always supports PGD.
The x86-32 patch is next.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yinghai Lu [Thu, 3 Jul 2008 01:53:44 +0000 (18:53 -0700)]
x86: move init_cpu_to_node after get_smp_config
when acpi=off, cpu_to_apicid is ready after get_smp_config
so need to move init_cpu_to_node after it.
otherwise, we will get wrong cpu->node mapping, and it will rely on
amd_detect_cmp() to correct it - but that is too late as
setup_per_cpu_data is already called before that so we will get
per_cpu_data on the wrong node.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Bernhard Walle [Thu, 26 Jun 2008 19:54:08 +0000 (21:54 +0200)]
x86: find offset for crashkernel reservation automatically
This patch removes the need of the crashkernel=...@offset parameter to define
a fixed offset for crashkernel reservation. That feature can be used together
with a relocatable kernel where the kexec-tools relocate the kernel and
get the actual offset from /proc/iomem.
The use case is a kernel where the .text+.data+.bss is after 16M physical
memory (debug kernel with lockdep on x86_64 can cause that) which caused a
major pain in autoconfiguration in our distribution.
Also, that patch unifies crashdump architectures a bit since IA64 has
that semantics from the very beginning of the kdump port.
Signed-off-by: Bernhard Walle <bwalle@suse.de> Cc: vgoyal@redhat.com Cc: Bernhard Walle <bwalle@suse.de> Cc: kexec@lists.infradead.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Fri, 27 Jun 2008 17:10:13 +0000 (10:10 -0700)]
x86: add check for node passed to node_to_cpumask, v3
* When CONFIG_DEBUG_PER_CPU_MAPS is set, the node passed to
node_to_cpumask and node_to_cpumask_ptr should be validated.
If invalid, then a dump_stack is performed and a zero cpumask
is returned.
v2: Slightly different version to remove a compiler warning.
v3: Redone to reflect moving setup.c -> setup_percpu.c
Signed-off-by: Mike Travis <travis@sgi.com> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: "akpm@linux-foundation.org" <akpm@linux-foundation.org> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86: fix CPA self-test for "x86/paravirt: groundwork for 64-bit Xen support"
Ingo Molnar wrote:
> -tip auto-testing found pagetable corruption (CPA self-test failure):
>
> [ 32.956015] CPA self-test:
> [ 32.958822] 4k 2048 large 508 gb 0 x 2556[ffff880000000000-ffff88003fe00000] miss 0
> [ 32.964000] CPA ffff88001d54e000: bad pte 1d4000e3
> [ 32.968000] CPA ffff88001d54e000: unexpected level 2
> [ 32.972000] CPA ffff880022c5d000: bad pte 22c000e3
> [ 32.976000] CPA ffff880022c5d000: unexpected level 2
> [ 32.980000] CPA ffff8800200ce000: bad pte 200000e3
> [ 32.984000] CPA ffff8800200ce000: unexpected level 2
> [ 32.988000] CPA ffff8800210f0000: bad pte 210000e3
>
> config and full log can be found at:
>
> http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_11_51_CEST_2008.bad
> http://redhat.com/~mingo/misc/log-Mon_Jun_30_11_11_51_CEST_2008.bad
Phew. OK, I've worked this out. Short version is that's it's a false
alarm, and there was no real failure here. Long version:
* I changed the code to create the physical mapping pagetables to
reuse any existing mapping rather than replace it. Specifically,
reusing an pud pointed to by the pgd caused this symptom to appear.
* The specific PUD being reused is the one created statically in
head_64.S, which creates an initial 1GB mapping.
* That mapping doesn't have _PAGE_GLOBAL set on it, due to the
inconsistency between __PAGE_* and PAGE_*.
* The CPA test attempts to clear _PAGE_GLOBAL, and then checks to
see that the resulting range is 1) shattered into 4k pages, and 2)
has no _PAGE_GLOBAL.
* However, since it didn't have _PAGE_GLOBAL on that range to start
with, change_page_attr_clear() had nothing to do, and didn't
bother shattering the range,
* resulting in the reported messages
The simple fix is to set _PAGE_GLOBAL in level2_ident_pgt.
An additional fix to make CPA testing more robust by using some other
pagetable bit (one of the unused available-to-software ones). This
would solve spurious CPA test warnings under Xen which uses _PAGE_GLOBAL
for its own purposes (ie, not under guest control).
Also, we should revisit the use of _PAGE_GLOBAL in asm-x86/pgtable.h,
and use it consistently, and drop MAKE_GLOBAL. The first time I
proposed it it caused breakages in the very early CPA code; with luck
that's all fixed now.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mark McLoughlin <markmc@redhat.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yinghai Lu [Mon, 30 Jun 2008 23:20:54 +0000 (16:20 -0700)]
x86: move reserve_setup_data to setup.c
Ying Huang would like setup_data to be reserved, but not included in the
no save range.
Here we try to modify the e820 table to reserve that range early.
also add that in early_res in case bootloader messes up with the ramdisk.
other solution would be
1. add early_res_to_highmem...
2. early_res_to_e820...
but they could reserve another type memory wrongly, if early_res has some
resource reserved early, and not needed later, but it is not removed from
early_res in time. Like the RAMDISK (already handled).
Yinghai Lu [Sun, 29 Jun 2008 00:49:59 +0000 (17:49 -0700)]
x86: fix warning in e820_reserve_resources with 32bit
when 64bit resource is not enabled, we get:
arch/x86/kernel/e820.c: In function ‘e820_reserve_resources’:
arch/x86/kernel/e820.c:1217: warning: comparison is always false due to limited range of data type
because res->start/end is resource_t aka u32. it will overflow.
fix it with temp end of u64
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make sure SWAPGS and PARAVIRT_ADJUST_EXCEPTION_FRAME are properly
defined when CONFIG_PARAVIRT is off.
Fixes Ingo's build failure:
arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:1201: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1205: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1209: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1213: Error: invalid character '_' in mnemonic
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mark McLoughlin <markmc@redhat.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Stephen Tweedie <sct@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86/paravirt: groundwork for 64-bit Xen support, fix
Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>
>>> It quickly broke the build in testing:
>>>
>>> include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
>>> include/asm/pgalloc.h:14: error: parameter name omitted
>>> arch/x86/kernel/entry_64.S: In file included from
>>> arch/x86/kernel/traps_64.c:51:include/asm/pgalloc.h: In function
>>> ‘paravirt_pgd_free':
>>> include/asm/pgalloc.h:14: error: parameter name omitted
>>>
>>>
>> No, looks like my fault. The non-PARAVIRT version of
>> paravirt_pgd_free() is:
>>
>> static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *) {}
>>
>> but C doesn't like missing parameter names, even if unused.
>>
>> This should fix it:
>>
>
> that fixed the build but now we've got a boot crash with this config:
>
> time.c: Detected 2010.304 MHz processor.
> spurious 8259A interrupt: IRQ7.
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<0000000000000000>]
> PGD 0
> Thread overran stack, or stack corrupted
> Oops: 0010 [1] SMP
> CPU 0
>
> with:
>
> http://redhat.com/~mingo/misc/config-Thu_Jun_26_12_46_46_CEST_2008.bad
>
Use SWAPGS_UNSAFE_STACK in ia32entry.S in the places where the active
stack is the usermode stack.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Bernhard Walle [Wed, 25 Jun 2008 19:39:16 +0000 (21:39 +0200)]
x86: limit E820 map when a user-defined memory map is specified
This patch brings back limiting of the E820 map when a user-defined
E820 map is specified. While the behaviour of i386 (32 bit) was to limit
the E820 map (and /proc/iomem), the behaviour of x86-64 (64 bit) was not to
limit.
That patch limits the E820 map again for both x86 architectures.
Code was tested for compilation and booting on a 32 bit and 64 bit system.
Signed-off-by: Bernhard Walle <bwalle@suse.de> Acked-by: Yinghai Lu <yhlu.kernel@gmail.com> Cc: kexec@lists.infradead.org Cc: vgoyal@redhat.com Cc: Bernhard Walle <bwalle@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86, 64-bit: swapgs pvop with a user-stack can never be called
It's never safe to call a swapgs pvop when the user stack is current -
it must be inline replaced. Rather than making a call, the
SWAPGS_UNSAFE_STACK pvop always just puts "swapgs" as a placeholder,
which must either be replaced inline or trap'n'emulated (somehow).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86/paravirt, 64-bit: don't restore user rsp within sysret
There's no need to combine restoring the user rsp within the sysret
pvop, so split it out. This makes the pvop's semantics closer to the
machine instruction.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citirx.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Don't conflate sysret and sysexit; they're different instructions with
different semantics, and may be in use at the same time (at least
within the same kernel, depending on whether its an Intel or AMD
system).
sysexit - just return to userspace, does no register restoration of
any kind; must explicitly atomically enable interrupts.
sysret - reloads flags from r11, so no need to explicitly enable
interrupts on 64-bit, responsible for restoring usermode %gs
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citirx.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86: use __KERNEL_DS as SS when returning to a kernel thread
This is needed when the kernel is running on RING3, such as under Xen.
x86_64 has a weird feature that makes it #GP on iret when SS is a null
descriptor.
This need to be tested on bare metal to make sure it doesn't cause any
problems. AMD specs say SS is always ignored (except on iret?).
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Because Xen doesn't support PSE mappings in guests, all code which
assumed the presence of PSE has been changed to fall back to smaller
mappings if necessary. As a result, PSE is optional rather than
required (though still used whereever possible).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86, 64-bit: adjust mapping of physical pagetables to work with Xen
This makes a few of changes to the construction of the initial
pagetables to work better with paravirt_ops/Xen. The main areas
are:
1. Support non-PSE mapping of memory, since Xen doesn't currently
allow 2M pages to be mapped in guests.
2. Make sure that the ioremap alias of all pages are dropped before
attaching the new page to the pagetable. This avoids having
writable aliases of pagetable pages.
3. Preserve existing pagetable entries, rather than overwriting. Its
possible that a fair amount of pagetable has already been constructed,
so reuse what's already in place rather than ignoring and overwriting it.
The algorithm relies on the invariant that any page which is part of
the kernel pagetable is itself mapped in the linear memory area. This
way, it can avoid using ioremap on a pagetable page.
The invariant holds because it maps memory from low to high addresses,
and also allocates memory from low to high. Each allocated page can
map at least 2M of address space, so the mapped area will always
progress much faster than the allocated area. It relies on the early
boot code mapping enough pages to get started.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
The first essentially cleans up after head_64.S. It clears the
bss, zaps low identity mappings, sets up some early exception
handlers.
The second part preserves the boot data, reserves the kernel's
text/data/bss, pagetables and ramdisk, and then starts the kernel
proper.
This split is so that Xen can call the second part to do the set up it
needs done. It doesn't need any of the first part setups, because it
doesn't boot via head_64.S, and its redundant or actively damaging.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Eduardo Habkost [Wed, 25 Jun 2008 04:19:16 +0000 (00:19 -0400)]
paravirt/x86, 64-bit: move __PAGE_OFFSET to leave a space for hypervisor
Set __PAGE_OFFSET to the most negative possible address +
16*PGDIR_SIZE. The gap is to allow a space for a hypervisor to fit.
The gap is more or less arbitrary, but it's what Xen needs.
When booting native, kernel/head_64.S has a set of compile-time
generated pagetables used at boot time. This patch removes their
absolutely hard-coded layout, and makes it parameterised on
__PAGE_OFFSET (and __START_KERNEL_map).
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86/paravirt: define PARA_INDIRECT for indirect asm calls
On 32-bit it's best to use a %cs: prefix to access memory where the
other segments may not bet set up properly yet. On 64-bit it's best
to use a rip-relative addressing mode. Define PARA_INDIRECT() to
abstract this and generate the proper addressing mode in each case.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jan Beulich points out that vmalloc_sync_all() assumes that the
kernel's pmd is always expected to be present in the pgd. The current
pgd construction code will add the pgd to the pgd_list before its pmds
have been pre-populated, thereby making it visible to
vmalloc_sync_all().
However, because pgd_prepopulate_pmd also does the allocation, it may
block and cannot be done under spinlock.
The solution is to preallocate the pmds out of the spinlock, then
populate them while holding the pgd_list lock.
This patch also pulls the pmd preallocation and mop-up functions out
to be common, assuming that the compiler will generate no code for
them when PREALLOCTED_PMDS is 0. Also, there's no need for pgd_ctor
to clear the pgd again, since it's allocated as a zeroed page.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Jan Beulich <jbeulich@novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add hooks which are called at pgd_alloc/free time. The pgd_alloc hook
may return an error code, which if non-zero, causes the pgd allocation
to be failed. The hooks may be used to allocate/free auxillary
per-pgd information.
also fix:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
> include/asm/pgalloc.h:14: error: parameter name omitted
> arch/x86/kernel/entry_64.S: In file included from
> arch/x86/kernel/traps_64.c:51:include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
> include/asm/pgalloc.h:14: error: parameter name omitted
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
vmalloc_sync_all() is only called from register_die_notifier and
alloc_vm_area. Neither is on any performance-critical paths, so
vmalloc_sync_all() itself is not on any hot paths.
Given that the optimisations in vmalloc_sync_all add a fair amount of
code and complexity, and are fairly hard to evaluate for correctness,
it's better to just remove them to simplify the code rather than worry
about its absolute performance.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Stephen Tweedie <sct@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Thu, 26 Jun 2008 10:40:35 +0000 (12:40 +0200)]
x86: build fix
fix:
In file included from arch/x86/kernel/setup.c:118:
include/asm/highmem.h:64: error: expected identifier or ‘(' before ‘do'
include/asm/highmem.h:64: error: expected identifier or ‘(' before ‘while'
include/asm/highmem.h:67: error: expected identifier or ‘(' before ‘do'
include/asm/highmem.h:67: error: expected identifier or ‘(' before ‘while'