When PF is set to multi-TCs and configured mapping relationship between
priorities and TCs, the hardware will active these settings for this PF
and its VFs.
In this case when VF just uses one TC and its rx packets contain priority,
and if the priority is not mapped to TC0, as other TCs of VF is not valid,
hardware always put this kind of packets to the queue 0. It cause this kind
of packets of VF can not be used RSS function.
To fix this problem, set tc mode of all unused TCs of VF to the setting of
TC0, then rx packet with priority which map to unused TC will be direct to
TC0.
When applications call shutdown() with SHUT_RDWR in userspace,
smc_close_active() calls kernel_sock_shutdown(), and it is called
twice in smc_shutdown().
This fixes this by checking sk_state before do clcsock shutdown, and
avoids missing the application's call of smc_shutdown().
Inject error before dev_hold(real_dev) in register_vlan_dev(),
and execute the following testcase:
ip link add dev dummy1 type dummy
ip link add name dummy1.100 link dummy1 type vlan id 100
ip link del dev dummy1
When the dummy netdevice is removed, we will get a WARNING as following:
=======================================================================
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 2 PID: 0 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0
and an endless loop of:
=======================================================================
unregister_netdevice: waiting for dummy1 to become free. Usage count = -1073741824
That is because dev_put(real_dev) in vlan_dev_free() be called without
dev_hold(real_dev) in register_vlan_dev(). It makes the refcnt of real_dev
underflow.
Move the dev_hold(real_dev) to vlan_dev_init() which is the call-back of
ndo_init(). That makes dev_hold() and dev_put() for vlan's real_dev
symmetrical.
Fixes: 563bcbae3ba2 ("net: vlan: fix a UAF in vlan_dev_real_dev()") Reported-by: Petr Machata <petrm@nvidia.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Link: https://lore.kernel.org/r/20211126015942.2918542-1-william.xuanziyang@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
It hangup when booting Loongson 3A1000 with BOTH
CONFIG_PAGE_SIZE_64KB and CONFIG_MIPS_VA_BITS_48, that it turn
out to use 2-level pgtable instead of 3-level. 64KB page size
with 2-level pgtable only cover 42 bits VA, use 3-level pgtable
to cover all 48 bits VA(55 bits)
Fixes: 1e321fa917fb ("MIPS64: Support of at least 48 bits of SEGBITS) Signed-off-by: Huang Pei <huangpei@loongson.cn> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Oleksandr brought a bug report where netpoll causes trace
messages in the log on igb.
Danielle brought this back up as still occurring, so we'll try
again.
[22038.710800] ------------[ cut here ]------------
[22038.710801] igb_poll+0x0/0x1440 [igb] exceeded budget in poll
[22038.710802] WARNING: CPU: 12 PID: 40362 at net/core/netpoll.c:155 netpoll_poll_dev+0x18a/0x1a0
As Alex suggested, change the driver to return work_done at the
exit of napi_poll, which should be safe to do in this driver
because it is not polling multiple queues in this single napi
context (multiple queues attached to one MSI-X vector). Several
other drivers contain the same simple sequence, so I hope
this will not create new problems.
Fixes: 16eb8815c235 ("igb: Refactor clean_rx_irq to reduce overhead and improve performance") Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Danielle Ratson <danieller@nvidia.com> Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Danielle Ratson <danieller@nvidia.com> Link: https://lore.kernel.org/r/20211123204000.1597971-1-jesse.brandeburg@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
While testing BIG TCP patch series, I was expecting that TCP_RR workloads
with 80KB requests/answers would send one 80KB TSO packet,
then being received as a single GRO packet.
It turns out this was not happening, and the root cause was that
cubic Hystart ACK train was triggering after a few (2 or 3) rounds of RPC.
Hystart was wrongly setting CWND/SSTHRESH to 30, while my RPC
needed a budget of ~20 segments.
Ideally these TCP_RR flows should not exit slow start.
Cubic Hystart should reset itself at each round, instead of assuming
every TCP flow is a bulk one.
Note that even after this patch, Hystart can still trigger, depending
on scheduling artifacts, but at a higher CWND/SSTHRESH threshold,
keeping optimal TSO packet sizes.
Tested:
ip link set dev eth0 gro_ipv6_max_size 131072 gso_ipv6_max_size 131072
nstat -n; netperf -H ... -t TCP_RR -l 5 -- -r 80000,80000 -K cubic; nstat|egrep "Ip6InReceives|Hystart|Ip6OutRequests"
Commit 39fbef4b0f77 ("PM: hibernate: Get block device exclusively in
swsusp_check()") changed the opening mode of the block device to
(FMODE_READ | FMODE_EXCL).
In the corresponding calls to swsusp_close(), the mode is still just
FMODE_READ which triggers the warning in blkdev_flush_mapping() on
resume from hibernate.
So, use the mode (FMODE_READ | FMODE_EXCL) also when closing the
device.
Fixes: 39fbef4b0f77 ("PM: hibernate: Get block device exclusively in swsusp_check()") Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Update NC-SI command handler (both standard and OEM) to take into
account of payload paddings in allocating skb (in case of payload
size is not 32-bit aligned).
The checksum field follows payload field, without taking payload
padding into account can cause checksum being truncated, leading to
dropped packets.
Fixes: fb4ee67529ff ("net/ncsi: Add NCSI OEM command support") Signed-off-by: Kumar Thangavel <thangavel.k@hcl.com> Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Current nvmet_try_send_ddgst() code does not check whether
all data digest bytes are transmitted, fix this by returning
-EAGAIN if all data digest bytes are not transmitted.
Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver") Signed-off-by: Varun Prakash <varun@chelsio.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The side that actively closed socket, it's clcsock doesn't enter
TIME_WAIT state, but the passive side does it. It should show the same
behavior as TCP sockets.
Consider this, when client actively closes the socket, the clcsock in
server enters TIME_WAIT state, which means the address is occupied and
won't be reused before TIME_WAIT dismissing. If we restarted server, the
service would be unavailable for a long time.
To solve this issue, shutdown the clcsock in [A], perform the TCP active
close progress first, before the passive closed side closing it. So that
the actively closed side enters TIME_WAIT, not the passive one.
Client | Server
close() // client actively close |
smc_release() |
smc_close_active() // PEERCLOSEWAIT1 |
smc_close_final() // abort or closed = 1|
smc_cdc_get_slot_and_msg_send() |
[A] |
|smc_cdc_msg_recv_action() // ACTIVE
| queue_work(smc_close_wq, &conn->close_work)
| smc_close_passive_work() // PROCESSABORT or APPCLOSEWAIT1
| smc_close_passive_abort_received() // only in abort
|
|close() // server recv zero, close
| smc_release() // PROCESSABORT or APPCLOSEWAIT1
| smc_close_active()
| smc_close_abort() or smc_close_final() // CLOSED
| smc_cdc_get_slot_and_msg_send() // abort or closed = 1
smc_cdc_msg_recv_action() | smc_clcsock_release()
queue_work(smc_close_wq, &conn->close_work) | sock_release(tcp) // actively close clc, enter TIME_WAIT
smc_close_passive_work() // PEERCLOSEWAIT1 | smc_conn_free()
smc_close_passive_abort_received() // CLOSED|
smc_conn_free() |
smc_clcsock_release() |
sock_release(tcp) // passive close clc |
Link: https://www.spinics.net/lists/netdev/msg780407.html Fixes: b38d732477e4 ("smc: socket closing and linkgroup cleanup") Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
This fixes an issue added in commit 4edd8cd4e86d ("scsi: core: sysfs: Fix
hang when device state is set via sysfs") where if userspace is requesting
to set the device state to SDEV_RUNNING when the state is already
SDEV_RUNNING, we return -EINVAL instead of count. The commmit above set ret
to count for this case, when it should have set it to 0.
Link: https://lore.kernel.org/r/20211120164917.4924-1-michael.christie@oracle.com Fixes: 4edd8cd4e86d ("scsi: core: sysfs: Fix hang when device state is set via sysfs") Reviewed-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
the removed nexthop entries after an RCU grace period because they
contain references to the nexthop's net device and to the fib6 info.
With specific series of events[1] we can reach net device refcount
imbalance which is unrecoverable. IPv4 is not affected because dsts
don't take a refcount on the route.
[1]
$ ip nexthop list
id 200 via 2002:db8::2 dev bridge.10 scope link onlink
id 201 via 2002:db8::3 dev bridge scope link onlink
id 203 group 201/200
$ ip -6 route
2001:db8::10 nhid 203 metric 1024 pref medium
nexthop via 2002:db8::3 dev bridge weight 1 onlink
nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
Create rt6_info through one of the multipath legs, e.g.:
$ taskset -a -c 1 ./pkt_inj 24 bridge.10 2001:db8::10
(pkt_inj is just a custom packet generator, nothing special)
Then remove that leg from the group by replace (let's assume it is id
200 in this case):
$ ip nexthop replace id 203 group 201
Now remove the IPv6 route:
$ ip -6 route del 2001:db8::10/128
The route won't be really deleted due to the stale rt6_info holding 1
refcnt in nexthop id 200.
At this point we have the following reference count dependency:
(deleted) IPv6 route holds 1 reference over nhid 203
nh 203 holds 1 ref over id 201
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info
Now to create circular dependency between nh 200 and the IPv6 route, and
also to get a reference over nh 200, restore nhid 200 in the group:
$ ip nexthop replace id 203 group 201/200
And now we have a permanent circular dependncy because nhid 203 holds a
reference over nh 200 and 201, but the route holds a ref over nh 203 and
is deleted.
To trigger the bug just delete the group (nhid 203):
$ ip nexthop del id 203
It won't really be deleted due to the IPv6 route dependency, and now we
have 2 unlinked and deleted objects that reference each other: the group
and the IPv6 route. Since the group drops the reference it holds over its
entries at free time (i.e. its own refcount needs to drop to 0) that will
never happen and we get a permanent ref on them, since one of the entries
holds a reference over the IPv6 route it will also never be released.
At this point the dependencies are:
(deleted, only unlinked) IPv6 route holds reference over group nh 203
(deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info
This is the last point where it can be fixed by running traffic through
nh 200, and specifically through the same CPU so the rt6_info (dst) will
get released due to the IPv6 genid, that in turn will free the IPv6
route, which in turn will free the ref count over the group nh 203.
If nh 200 is deleted at this point, it will never be released due to the
ref from the unlinked group 203, it will only be unlinked:
$ ip nexthop del id 200
$ ip nexthop
$
Now we can never release that stale rt6_info, we have IPv6 route with ref
over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
rt6_info (dst) with ref over the net device and the IPv6 route. All of
these objects are only unlinked, and cannot be released, thus they can't
release their ref counts.
Message from syslogd@dev at Nov 19 14:04:10 ...
kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
Message from syslogd@dev at Nov 19 14:04:20 ...
kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
Fixes: 7bf4796dd099 ("nexthops: add support for replace") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
We need a way to release a fib6_nh's per-cpu dsts when replacing
nexthops otherwise we can end up with stale per-cpu dsts which hold net
device references, so add a new IPv6 stub called fib6_nh_release_dsts.
It must be used after an RCU grace period, so no new dsts can be created
through a group's nexthop entry.
Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed
so it doesn't need a dummy stub when IPv6 is not enabled.
Fixes: 7bf4796dd099 ("nexthops: add support for replace") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Use nn->tlv_caps.me_freq_mhz instead of nn->me_freq_mhz to check whether
rx-usecs/tx-usecs is valid.
This is because nn->tlv_caps.me_freq_mhz represents the clock_freq (MHz) of
the flow processing cores (FPC) on the NIC. While nn->me_freq_mhz is not
be set.
Fixes: ce991ab6662a ("nfp: read ME frequency from vNIC ctrl memory") Signed-off-by: Diana Wang <na.wang@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Resolve being able to change static values on VF when adaptive interrupt
moderation is enabled.
This problem is fixed by checking the interrupt settings is not
a combination of change of static value while adaptive interrupt
moderation is turned on.
Without this fix, the user would be able to change static values
on VF with adaptive moderation enabled.
Fixes: 65e87c0398f5 ("i40evf: support queue-specific settings for interrupt moderation") Signed-off-by: Nitesh B Venkatesh <nitesh.b.venkatesh@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The ->gem_create_object() functions are supposed to return NULL if there
is an error. None of the callers expect error pointers so returing one
will lead to an Oops. See drm_gem_vram_create(), for example.
Fixes: c826a6e10644 ("drm/vc4: Add a BO cache.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Maxime Ripard <maxime@cerno.tech> Link: https://patchwork.freedesktop.org/patch/msgid/20211118111416.GC1147@kili Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
While looping over shost's sdev list it is possible that one
of the drives is getting removed and its sas_target object is
freed but its sdev object remains intact.
Consequently, a kernel panic can occur while the driver is trying to access
the sas_address field of sas_target object without also checking the
sas_target object for NULL.
Link: https://lore.kernel.org/r/20211117104909.2069-1-sreekanth.reddy@broadcom.com Fixes: f92363d12359 ("[SCSI] mpt3sas: add new driver supporting 12GB SAS") Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
When CONFIG_FORTIFY_SOURCE is set, memcpy() checks the potential
buffer overflow and panics. The code in sofcpga bootstrapping
contains the memcpy() calls are mistakenly translated as the shorter
size, hence it triggers a panic as if it were overflowing.
This patch changes the secondary_trampoline and *_end definitions
to arrays for avoiding the false-positive crash above.
The failure to retrieve post-op attributes has no bearing on whether or
not the clone operation itself was successful. We must therefore ignore
the return value of decode_getfattr() when looking at the success or
failure of nfs4_xdr_dec_clone().
This patch fixes an issue that an u32 netlink value is handled as a
signed enum value which doesn't fit into the range of u32 netlink type.
If it's handled as -1 value some BIT() evaluation ends in a
shift-out-of-bounds issue. To solve the issue we set the to u32 max which
is s32 "-1" value to keep backwards compatibility and let the followed enum
values start counting at 0. This brings the compiler to never handle the
enum as signed and a check if the value is above NL802154_IFTYPE_MAX should
filter -1 out.
Fixes: f3ea5e44231a ("ieee802154: add new interface command") Signed-off-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20211112030916.685793-1-aahringo@redhat.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
snd_ctl_remove() has to be called with card->controls_rwsem held (when
called after the card instantiation). This patch add the missing
rwsem calls around it.
Fixes: 8a9782346dcc ("ASoC: topology: Add topology core") Signed-off-by: Takashi Iwai <tiwai@suse.de> Link: https://lore.kernel.org/r/20211116071812.18109-1-tiwai@suse.de Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
would set the MultiMedia1 steam for SLIMBUS_0_RX, however doing below
command will reset previously setup MultiMedia1 stream, because both of them
are using MultiMedia1 PCM stream.
The GPIO controller is also an interrupt controller provider and is
currently missing the appropriate 'interrupt-controller' and
'#interrupt-cells' properties to denote that.
Fixes: fb026d3de33b ("ARM: BCM5301X: Add Broadcom's bus-axi to the DTS file") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The I2C interrupt controller line is off by 32 because the datasheet
describes interrupt inputs into the GIC which are for Shared Peripheral
Interrupts and are starting at offset 32. The ARM GIC binding expects
the SPI interrupts to be numbered from 0 relative to the SPI base.
Fixes: bb097e3e0045 ("ARM: dts: BCM5301X: Add I2C support to the DT") Tested-by: Christian Lamparter <chunkeey@gmail.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
We are changing expire_nodest_conn to work even for reused connections when
conn_reuse_mode=0, just as what was done with commit dc7b3eb900aa ("ipvs:
Fix reuse connection if real server is dead").
For controlled and persistent connections, the new connection will get the
needed real server depending on the rules in ip_vs_check_template().
Fixes: d752c3645717 ("ipvs: allow rescheduling of new connections when port reuse is detected") Co-developed-by: Chuanqi Liu <legend050709@qq.com> Signed-off-by: Chuanqi Liu <legend050709@qq.com> Signed-off-by: yangxingwu <xingwu.yang@gmail.com> Acked-by: Simon Horman <horms@verge.net.au> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
To clear a user buffer we cannot simply use memset, we have to use
clear_user(). With a virtio-mem device that registers a vmcore_cb and
has some logically unplugged memory inside an added Linux memory block,
I can easily trigger a BUG by copying the vmcore via "cp":
Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
Prevention (SMAP)", which is used to detect wrong access from the kernel
to user buffers like this: SMAP triggers a permissions violation on
wrong access. In the x86-64 variant of clear_user(), SMAP is properly
handled via clac()+stac().
To fix, properly use clear_user() when we're dealing with a user buffer.
Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Philipp Rudo <prudo@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
We found out that we are unable to control the PERST# signal via the
default pin dedicated to be PERST# pin (GPIO2[3] pin) on A3700 SOC when
this pin is in EP_PCIE1_Resetn mode. There is a register in the PCIe
register space called PERSTN_GPIO_EN (D0088004[3]), but changing the
value of this register does not change the pin output when measuring
with voltmeter.
We do not know if this is a bug in the SOC, or if it works only when
PCIe controller is in a certain state.
Commit f4c7d053d7f7 ("PCI: aardvark: Wait for endpoint to be ready
before training link") says that when this pin changes pinctrl mode
from EP_PCIE1_Resetn to GPIO, the PERST# signal is asserted for a brief
moment.
So currently the situation is that on A3700 boards the PERST# signal is
asserted in U-Boot (because the code in U-Boot issues reset via this pin
via GPIO mode), and then in Linux by the obscure and undocumented
mechanism described by the above mentioned commit.
We want to issue PERST# signal in a known way, therefore this patch
changes the pcie_reset_pin function from "pcie" to "gpio" and adds the
reset-gpios property to the PCIe node in device tree files of
EspressoBin and Armada 3720 Dev Board (Turris Mox device tree already
has this property and uDPU does not have a PCIe port).
Signed-off-by: Marek Behún <marek.behun@nic.cz> Cc: Remi Pommarel <repk@triplefau.lt> Tested-by: Tomasz Maciej Nowak <tmn505@gmail.com> Acked-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The PWM pins on North Bridge on Armada 37xx can be configured into PWM
or GPIO functions. When in PWM function, each pin can also be configured
to drive low on 0 and tri-state on 1 (LED mode).
The current definitions handle this by declaring two pin groups for each
pin:
- group "pwmN" with functions "pwm" and "gpio"
- group "ledN_od" ("od" for open drain) with functions "led" and "gpio"
This is semantically incorrect. The correct definition for each pin
should be one group with three functions: "pwm", "led" and "gpio".
Change the "pwmN" groups to support "led" function.
Remove "ledN_od" groups. This cannot break backwards compatibility with
older device trees: no device tree uses it since there is no PWM driver
for this SOC yet. Also "ledN_od" groups are not even documented.
Fixes: b835d6953009 ("pinctrl: armada-37xx: swap polarity on LED group") Signed-off-by: Marek Behún <kabel@kernel.org> Acked-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20210719112938.27594-1-kabel@kernel.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Aardvark supports PCIe Hot Reset via PCIE_CORE_CTRL1_REG.
Use it for implementing PCI_BRIDGE_CTL_BUS_RESET bit of PCI_BRIDGE_CONTROL
register on emulated bridge.
With this, the function pci_reset_secondary_bus() starts working and can
reset connected PCIe card. Custom userspace script [1] which uses setpci
can trigger PCIe Hot Reset and reset the card manually.
Aardvark controller has something like config space of a Root Port
available at offset 0x0 of internal registers - these registers are used
for implementation of the emulated bridge.
The default value of Class Code of this bridge corresponds to a RAID Mass
storage controller, though. (This is probably intended for when the
controller is used as Endpoint.)
Change the Class Code to correspond to a PCI Bridge.
Add comment explaining this change.
Link: https://lore.kernel.org/r/20211028185659.20329-6-kabel@kernel.org Fixes: 8a3ebd8de328 ("PCI: aardvark: Implement emulated root PCI bridge config space") Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: stable@vger.kernel.org Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
>From very vague, ambiguous and incomplete information from Marvell we
deduced that the 32-bit Aardvark register at address 0x4
(PCIE_CORE_CMD_STATUS_REG), which is not documented for Root Complex mode
in the Functional Specification (only for Endpoint mode), controls two
16-bit PCIe registers: Command Register and Status Registers of PCIe Root
Port.
This means that bit 2 controls bus mastering and forwarding of memory and
I/O requests in the upstream direction. According to PCI specifications
bits [0:2] of Command Register, this should be by default disabled on
reset. So explicitly disable these bits at early setup of the Aardvark
driver.
Remove code which unconditionally enables all 3 bits and let kernel code
(via pci_set_master() function) to handle bus mastering of Root PCIe
Bridge via emulated PCI_COMMAND on emulated bridge.
Link: https://lore.kernel.org/r/20211028185659.20329-5-kabel@kernel.org Fixes: 8a3ebd8de328 ("PCI: aardvark: Implement emulated root PCI bridge config space") Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: stable@vger.kernel.org # b2a56469d550 ("PCI: aardvark: Add FIXME comment for PCIE_CORE_CMD_STATUS_REG access") Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Fix multiple link training issues in aardvark driver. The main reason of
these issues was misunderstanding of what certain registers do, since their
names and comments were misleading: before commit 96be36dbffac ("PCI:
aardvark: Replace custom macros by standard linux/pci_regs.h macros"), the
pci-aardvark.c driver used custom macros for accessing standard PCIe Root
Bridge registers, and misleading comments did not help to understand what
the code was really doing.
After doing more tests and experiments I've come to the conclusion that the
SPEED_GEN register in aardvark sets the PCIe revision / generation
compliance and forces maximal link speed. Both GEN3 and GEN2 values set the
read-only PCI_EXP_FLAGS_VERS bits (PCIe capabilities version of Root
Bridge) to value 2, while GEN1 value sets PCI_EXP_FLAGS_VERS to 1, which
matches with PCI Express specifications revisions 3, 2 and 1 respectively.
Changing SPEED_GEN also sets the read-only bits PCI_EXP_LNKCAP_SLS and
PCI_EXP_LNKCAP2_SLS to corresponding speed.
(Note that PCI Express rev 1 specification does not define PCI_EXP_LNKCAP2
and PCI_EXP_LNKCTL2 registers and when SPEED_GEN is set to GEN1 (which
also sets PCI_EXP_FLAGS_VERS set to 1), lspci cannot access
PCI_EXP_LNKCAP2 and PCI_EXP_LNKCTL2 registers.)
Changing PCIe link speed can be done via PCI_EXP_LNKCTL2_TLS bits of
PCI_EXP_LNKCTL2 register. Armada 3700 Functional Specifications says that
the default value of PCI_EXP_LNKCTL2_TLS is based on SPEED_GEN value, but
tests showed that the default value is always 8.0 GT/s, independently of
speed set by SPEED_GEN. So after setting SPEED_GEN, we must also set value
in PCI_EXP_LNKCTL2 register via PCI_EXP_LNKCTL2_TLS bits.
Triggering PCI_EXP_LNKCTL_RL bit immediately after setting LINK_TRAINING_EN
bit actually doesn't do anything. Tests have shown that a delay is needed
after enabling LINK_TRAINING_EN bit. As triggering PCI_EXP_LNKCTL_RL
currently does nothing, remove it.
Commit 43fc679ced18 ("PCI: aardvark: Improve link training") introduced
code which sets SPEED_GEN register based on negotiated link speed from
PCI_EXP_LNKSTA_CLS bits of PCI_EXP_LNKSTA register. This code was added to
fix detection of Compex WLE900VX (Atheros QCA9880) WiFi GEN1 PCIe cards, as
otherwise these cards were "invisible" on PCIe bus (probably because they
crashed). But apparently more people reported the same issues with these
cards also with other PCIe controllers [1] and I was able to reproduce this
issue also with other "noname" WiFi cards based on Atheros QCA9890 chip
(with the same PCI vendor/device ids as Atheros QCA9880). So this is not an
issue in aardvark but rather an issue in Atheros QCA98xx chips. Also, this
issue only exists if the kernel is compiled with PCIe ASPM support, and a
generic workaround for this is to change PCIe Bridge to 2.5 GT/s link speed
via PCI_EXP_LNKCTL2_TLS_2_5GT bits in PCI_EXP_LNKCTL2 register [2], before
triggering PCI_EXP_LNKCTL_RL bit. This workaround also works when SPEED_GEN
is set to value GEN2 (5 GT/s). So remove this hack completely in the
aardvark driver and always set SPEED_GEN to value from 'max-link-speed' DT
property. Fix for Atheros QCA98xx chips is handled separately by patch [2].
These two things (code for triggering PCI_EXP_LNKCTL_RL bit and changing
SPEED_GEN value) also explain why commit 6964494582f5 ("PCI: aardvark:
Train link immediately after enabling training") somehow fixed detection of
those problematic Compex cards with Atheros chips: if triggering link
retraining (via PCI_EXP_LNKCTL_RL bit) was done immediately after enabling
link training (via LINK_TRAINING_EN), it did nothing. If there was a
specific delay, aardvark HW already initialized PCIe link and therefore
triggering link retraining caused the above issue. Compex cards triggered
link down event and disappeared from the PCIe bus.
Commit f4c7d053d7f7 ("PCI: aardvark: Wait for endpoint to be ready before
training link") added 100ms sleep before calling 'Start link training'
command and explained that it is a requirement of PCI Express
specification. But the code after this 100ms sleep was not doing 'Start
link training', rather it triggered PCI_EXP_LNKCTL_RL bit via PCIe Root
Bridge to put link into Recovery state.
The required delay after fundamental reset is already done in function
advk_pcie_wait_for_link() which also checks whether PCIe link is up.
So after removing the code which triggers PCI_EXP_LNKCTL_RL bit on PCIe
Root Bridge, there is no need to wait 100ms again. Remove the extra
msleep() call and update comment about the delay required by the PCI
Express specification.
According to Marvell Armada 3700 Functional Specifications, Link training
should be enabled via aardvark register LINK_TRAINING_EN after selecting
PCIe generation and x1 lane. There is no need to disable it prior resetting
card via PERST# signal. This disabling code was introduced in commit 5169a9851daa ("PCI: aardvark: Issue PERST via GPIO") as a workaround for
some Atheros cards. It turns out that this also is Atheros specific issue
and affects any PCIe controller, not only aardvark. Moreover this Atheros
issue was triggered by juggling with PCI_EXP_LNKCTL_RL, LINK_TRAINING_EN
and SPEED_GEN bits interleaved with sleeps. Now, after removing triggering
PCI_EXP_LNKCTL_RL, there is no need to explicitly disable LINK_TRAINING_EN
bit. So remove this code too. The problematic Compex cards described in
previous git commits are correctly detected in advk_pcie_train_link()
function even after applying all these changes.
Note that with this patch, and also prior this patch, some NVMe disks which
support PCIe GEN3 with 8 GT/s speed are negotiated only at the lowest link
speed 2.5 GT/s, independently of SPEED_GEN value. After manually triggering
PCI_EXP_LNKCTL_RL bit (e.g. from userspace via setpci), these NVMe disks
change link speed to 5 GT/s when SPEED_GEN was configured to GEN2. This
issue first needs to be properly investigated. I will send a fix in the
future.
On the other hand, some other GEN2 PCIe cards with 5 GT/s speed are
autonomously by HW autonegotiated at full 5 GT/s speed without need of any
software interaction.
Armada 3700 Functional Specifications describes the following steps for
link training: set SPEED_GEN to GEN2, enable LINK_TRAINING_EN, poll until
link training is complete, trigger PCI_EXP_LNKCTL_RL, poll until signal
rate is 5 GT/s, poll until link training is complete, enable ASPM L0s.
The requirement for triggering PCI_EXP_LNKCTL_RL can be explained by the
need to achieve 5 GT/s speed (as changing link speed is done by throw to
recovery state entered by PCI_EXP_LNKCTL_RL) or maybe as a part of enabling
ASPM L0s (but in this case ASPM L0s should have been enabled prior
PCI_EXP_LNKCTL_RL).
It is unknown why the original pci-aardvark.c driver was triggering
PCI_EXP_LNKCTL_RL bit before waiting for the link to be up. This does not
align with neither PCIe base specifications nor with Armada 3700 Functional
Specification. (Note that in older versions of aardvark, this bit was
called incorrectly PCIE_CORE_LINK_TRAINING, so this may be the reason.)
It is also unknown why Armada 3700 Functional Specification says that it is
needed to trigger PCI_EXP_LNKCTL_RL for GEN2 mode, as according to PCIe
base specification 5 GT/s speed negotiation is supposed to be entirely
autonomous, even if initial speed is 2.5 GT/s.
Commit 43f5c77bcbd2 ("PCI: aardvark: Fix reporting CRS value") fixed
handling of CRS response and when CRSSVE flag was not enabled it marked CRS
response as failed transaction (due to simplicity).
But pci-aardvark.c driver is already waiting up to the PIO_RETRY_CNT count
for PIO config response and so we can with a small change implement
re-issuing of config requests as described in PCIe base specification.
This change implements re-issuing of config requests when response is CRS.
Set upper bound of wait cycles to around PIO_RETRY_CNT, afterwards the
transaction is marked as failed and an all-ones value is returned as
before.
We do this by returning appropriate error codes from function
advk_pcie_check_pio_status(). On CRS we return -EAGAIN and caller then
reissues transaction.
Link: https://lore.kernel.org/r/20211005180952.6812-10-kabel@kernel.org Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Marek Behún <kabel@kernel.org> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Change PCIe Max Payload Size setting in PCIe Device Control register to 512
bytes to align with PCIe Link Initialization sequence as defined in Marvell
Armada 3700 Functional Specification. According to the specification,
maximal Max Payload Size supported by this device is 512 bytes.
Without this kernel prints suspicious line:
pci 0000:01:00.0: Upstream bridge's Max Payload Size set to 256 (was 16384, max 512)
With this change it changes to:
pci 0000:01:00.0: Upstream bridge's Max Payload Size set to 256 (was 512, max 512)
Link: https://lore.kernel.org/r/20211005180952.6812-3-kabel@kernel.org Fixes: 8c39d710363c ("PCI: aardvark: Add Aardvark PCI host controller driver") Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Marek Behún <kabel@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
In commit 6df6ba974a55 ("PCI: aardvark: Remove PCIe outbound window
configuration") was removed aardvark PCIe outbound window configuration and
commit description said that was recommended solution by HW designers.
But that commit completely removed support for configuring PCIe IO
resources without removing PCIe IO 'ranges' from DTS files. After that
commit PCIe IO space started to be treated as PCIe MEM space and accessing
it just caused kernel crash.
Moreover implementation of PCIe outbound windows prior that commit was
incorrect. It completely ignored offset between CPU address and PCIe bus
address and expected that in DTS is CPU address always same as PCIe bus
address without doing any checks. Also it completely ignored size of every
PCIe resource specified in 'ranges' DTS property and expected that every
PCIe resource has size 128 MB (also for PCIe IO range). Again without any
check. Apparently none of PCIe resource has in DTS specified size of 128
MB. So it was completely broken and thanks to how aardvark mask works,
configuration was completely ignored.
This patch reverts back support for PCIe outbound window configuration but
implementation is a new without issues mentioned above. PCIe outbound
window is required when DTS specify in 'ranges' property non-zero offset
between CPU and PCIe address space. To address recommendation by HW
designers as specified in commit description of 6df6ba974a55, set default
outbound parameters as PCIe MEM access without translation and therefore
for this PCIe 'ranges' it is not needed to configure PCIe outbound window.
For PCIe IO space is needed to configure aardvark PCIe outbound window.
This patch fixes kernel crash when trying to access PCIe IO space.
Link: https://lore.kernel.org/r/20210624215546.4015-2-pali@kernel.org Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: stable@vger.kernel.org # 6df6ba974a55 ("PCI: aardvark: Remove PCIe outbound window configuration") Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
We allow up to PCI_EXP_SLTSTA2 registers to be accessed, but the
pcie_cap_regs_behavior[] array only covers up to PCI_EXP_RTSTA. Expand
this array to avoid walking off the end of it.
Do the same for pci_regs_behavior for consistency[], and add a
BUILD_BUG_ON() to also check the bridge->conf structure size.
Fixes: 23a5fba4d941 ("PCI: Introduce PCI bridge emulated config space common logic") Link: https://lore.kernel.org/r/E1l6z9W-0006Re-MQ@rmk-PC.armlinux.org.uk Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Pali Rohár <pali@kernel.org> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
According to PCI Express Base Specifications (rev 4.0, 6.6.1
"Conventional reset"), after fundamental reset a 100ms delay is needed
prior to enabling link training.
Update comment in code to reflect this requirement.
Link: https://lore.kernel.org/r/20201202184659.3795-1-pali@kernel.org Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Move code which belongs to link training (delays and resets) into
advk_pcie_train_link() function, so everything related to link training,
including timings is at one place.
After experiments it can be observed that link training in aardvark
hardware is very sensitive to timings and delays, so it is a good idea to
have this code at the same place as link training calls.
This patch does not change behavior of aardvark initialization.
Link: https://lore.kernel.org/r/20200907111038.5811-6-pali@kernel.org Tested-by: Marek Behún <marek.behun@nic.cz> Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Include linux/gpio/consumer.h instead of linux/gpio.h, as is said in the
latter file.
This was reported by kernel test bot when compiling for s390.
drivers/pci/controller/pci-aardvark.c:350:2: error: implicit declaration of function 'gpiod_set_value_cansleep' [-Werror,-Wimplicit-function-declaration]
drivers/pci/controller/pci-aardvark.c:1074:21: error: implicit declaration of function 'devm_gpiod_get_from_of_node' [-Werror,-Wimplicit-function-declaration]
drivers/pci/controller/pci-aardvark.c:1076:14: error: use of undeclared identifier 'GPIOD_OUT_LOW'
Link: https://lore.kernel.org/r/202006211118.LxtENQfl%25lkp@intel.com Link: https://lore.kernel.org/r/20200907111038.5811-2-pali@kernel.org Fixes: 5169a9851daa ("PCI: aardvark: Issue PERST via GPIO") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Marek Behún <marek.behun@nic.cz> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
When there is no PCIe card connected and advk_pcie_rd_conf() or
advk_pcie_wr_conf() is called for PCI bus which doesn't belong to emulated
root bridge, the aardvark driver throws the following error message:
advk-pcie d0070000.pcie: config read/write timed out
Obviously accessing PCIe registers of disconnected card is not possible.
Extend check in advk_pcie_valid_device() function for validating
availability of PCIe bus. If PCIe link is down, then the device is marked
as Not Found and the driver does not try to access these registers.
This is just an optimization to prevent accessing PCIe registers when card
is disconnected. Trying to access PCIe registers of disconnected card does
not cause any crash, kernel just needs to wait for a timeout. So if card
disappear immediately after checking for PCIe link (before accessing PCIe
registers), it does not cause any problems.
Link: https://lore.kernel.org/r/20200702083036.12230-1-pali@kernel.org Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
PCI-E capability macros are already defined in linux/pci_regs.h.
Remove their reimplementation in pcie-aardvark.
Link: https://lore.kernel.org/r/20200430080625.26070-9-pali@kernel.org Tested-by: Tomasz Maciej Nowak <tmn505@gmail.com> Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Rob Herring <robh@kernel.org> Acked-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Add support for issuing PERST via GPIO specified in 'reset-gpios'
property (as described in PCI device tree bindings).
Some buggy cards (e.g. Compex WLE900VX or WLE1216) are not detected
after reboot when PERST is not issued during driver initialization.
If bootloader already enabled link training then issuing PERST has no
effect for some buggy cards (e.g. Compex WLE900VX) and these cards are
not detected. We therefore clear the LINK_TRAINING_EN register before.
It was observed that Compex WLE900VX card needs to be in PERST reset
for at least 10ms if bootloader enabled link training.
Tested on Turris MOX.
Link: https://lore.kernel.org/r/20200430080625.26070-6-pali@kernel.org Tested-by: Tomasz Maciej Nowak <tmn505@gmail.com> Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Currently the aardvark driver trains link in PCIe gen2 mode. This may
cause some buggy gen1 cards (such as Compex WLE900VX) to be unstable or
even not detected. Moreover when ASPM code tries to retrain link second
time, these cards may stop responding and link goes down. If gen1 is
used this does not happen.
Unconditionally forcing gen1 is not a good solution since it may have
performance impact on gen2 cards.
To overcome this, read 'max-link-speed' property (as defined in PCI
device tree bindings) and use this as max gen mode. Then iteratively try
link training at this mode or lower until successful. After successful
link training choose final controller gen based on Negotiated Link Speed
from Link Status register, which should match card speed.
Link: https://lore.kernel.org/r/20200430080625.26070-5-pali@kernel.org Tested-by: Tomasz Maciej Nowak <tmn505@gmail.com> Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Marek Behún <marek.behun@nic.cz> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Rob Herring <robh@kernel.org> Acked-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Adding even 100ms (PCI_PM_D3COLD_WAIT) delay between enabling link
training and starting link training causes detection issues with some
buggy cards (such as Compex WLE900VX).
Move the code which enables link training immediately before the one
which starts link traning.
This fixes detection issues of Compex WLE900VX card on Turris MOX after
cold boot.
Link: https://lore.kernel.org/r/20200430080625.26070-2-pali@kernel.org Fixes: f4c7d053d7f7 ("PCI: aardvark: Wait for endpoint to be ready...") Tested-by: Tomasz Maciej Nowak <tmn505@gmail.com> Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Rob Herring <robh@kernel.org> Acked-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Initialise every multiple-byte field of emulated PCI bridge config
space with proper cpu_to_le* macro. This is required since the structure
describing config space of emulated bridge assumes little-endian
convention.
Signed-off-by: Grzegorz Jaszczyk <jaz@semihalf.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
When configuring pcie reset pin from gpio (e.g. initially set by
u-boot) to pcie function this pin goes low for a brief moment
asserting the PERST# signal. Thus connected device enters fundamental
reset process and link configuration can only begin after a minimal
100ms delay (see [1]).
Because the pin configuration comes from the "default" pinctrl it is
implicitly configured before the probe callback is called:
driver_probe_device()
really_probe()
...
pinctrl_bind_pins() /* Here pin goes from gpio to PCIE reset
function and PERST# is asserted */
...
drv->probe()
[1] "PCI Express Base Specification", REV. 4.0
PCI Express, February 19 2014, 6.6.1 Conventional Reset
Signed-off-by: Remi Pommarel <repk@triplefau.lt> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The issue happened randomly in runtime. The message "Link is Down" is
popped but soon it recovered to "Link is Up".
The "Link is Down" results from the incorrect read data for reading the
PHY register via MDIO bus. The correct sequence for reading the data
shall be:
1. fire the command
2. wait for command done (this step was missing)
3. wait for data idle
4. read data from data register
Cc: stable@vger.kernel.org Fixes: f160e99462c6 ("net: phy: Add mdio-aspeed") Reviewed-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Dylan Hung <dylan_hung@aspeedtech.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20211125024432.15809-1-dylan_hung@aspeedtech.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The block layer forces a minimum segment size of PAGE_SIZE, so a segment
can be too big for the ADMA table, if PAGE_SIZE >= 64KiB. Fix by writing
multiple descriptors, noting that the ADMA table is sized for 4KiB chunks
anyway, so it will be big enough.
Reported-and-tested-by: Bough Chen <haibo.chen@nxp.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211115082345.802238-1-adrian.hunter@intel.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
If a event is filtered by pid and a trigger that requires processing of
the event to happen is a attached to the event, the discard portion does
not take the pid filtering into account, and the event will then be
recorded when it should not have been.
Cc: stable@vger.kernel.org Fixes: 3fdaf80f4a836 ("tracing: Implement event pid filtering") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The POWER9 ERAT flush instruction is a SLBIA with IH=7, which is a
reserved value on POWER7/8. On POWER8 this invalidates the SLB entries
above index 0, similarly to SLBIA IH=0.
If the SLB entries are invalidated, and then the guest is bypassed, the
host SLB does not get re-loaded, so the bolted entries above 0 will be
lost. This can result in kernel stack access causing a SLB fault.
Kernel stack access causing a SLB fault was responsible for the infamous
mega bug (search "Fix SLB reload bug"). Although since commit 48e7b7695745 ("powerpc/64s/hash: Convert SLB miss handlers to C") that
starts using the kernel stack in the SLB miss handler, it might only
result in an infinite loop of SLB faults. In any case it's a bug.
Fix this by only executing the instruction on >= POWER9 where IH=7 is
defined not to invalidate the SLB. POWER7/8 don't require this ERAT
flush.
Fixes: 500871125920 ("KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211119031627.577853-1-npiggin@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
If the xenstore page hasn't been allocated properly, reading the value
of the related hvm_param (HVM_PARAM_STORE_PFN) won't actually return
error. Instead, it will succeed and return zero. Instead of attempting
to xen_remap a bad guest physical address, detect this condition and
return early.
Note that although a guest physical address of zero for
HVM_PARAM_STORE_PFN is theoretically possible, it is not a good choice
and zero has never been validly used in that capacity.
Also recognize all bits set as an invalid value.
For 32-bit Linux, any pfn above ULONG_MAX would get truncated. Pfns
above ULONG_MAX should never be passed by the Xen tools to HVM guests
anyway, so check for this condition and return early.
Cc: stable@vger.kernel.org Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Link: https://lore.kernel.org/r/20211123210748.1910236-1-sstabellini@kernel.org Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
In case of errors in xenbus_init (e.g. missing xen_store_gfn parameter),
we goto out_error but we forget to reset xen_store_domain_type to
XS_UNKNOWN. As a consequence xenbus_probe_initcall and other initcalls
will still try to initialize xenstore resulting into a crash at boot.
The HID descriptor of many of Wacom's touch input devices include a
"Confidence" usage that signals if a particular touch collection contains
useful data. The driver does not look at this flag, however, which causes
even invalid contacts to be reported to userspace. A lucky combination of
kernel event filtering and device behavior (specifically: contact ID 0 ==
invalid, contact ID >0 == valid; and order all data so that all valid
contacts are reported before any invalid contacts) spare most devices from
any visibly-bad behavior.
The DTH-2452 is one example of an unlucky device that misbehaves. It uses
ID 0 for both the first valid contact and all invalid contacts. Because
we report both the valid and invalid contacts, the kernel reports that
contact 0 first goes down (valid) and then goes up (invalid) in every
report. This causes ~100 clicks per second simply by touching the screen.
This patch inroduces new `confidence` flag in our `hid_data` structure.
The value is initially set to `true` at the start of a report and can be
set to `false` if an invalid touch usage is seen.
Link: https://github.com/linuxwacom/input-wacom/issues/270 Fixes: f8b6a74719b5 ("HID: wacom: generic: Support multiple tools per report") Signed-off-by: Jason Gerecke <jason.gerecke@wacom.com> Tested-by: Joshua Dickens <joshua.dickens@wacom.com> Cc: <stable@vger.kernel.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
When the reply for a non-blocking transmit arrives, the sequence
field for that reply was never filled in, so userspace would have no
way of associating the reply to the original transmit.
Copy the sequence field to ensure that this is now possible.
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Fixes: 0dbacebede1e ([media] cec: move the CEC framework out of staging and to media) Cc: <stable@vger.kernel.org> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The master and next_conj of rcs_ops are used for iterating the
resource list entries, and currently those are supposed to return the
current value. The problem is that next_conf may go over the last
entry before the loop abort condition is evaluated, and it may return
the "current" value that is beyond the array size. It was caught
recently as a GPF, for example.
Those return values are, however, never actually evaluated, hence
basically we don't have to consider the current value as the return at
all. By dropping those return values, the potential out-of-range
access above is also fixed automatically.
This patch changes the return type of master and next_conj callbacks
to void and drop the superfluous code accordingly.
This is a partial revert of commit 29bc22ac5e5b ("binder: use euid from cred instead of using task").
Setting sender_euid using proc->cred caused some Android system test
regressions that need further investigation. It is a partial
reversion because subsequent patches rely on proc->cred.
Fixes: 29bc22ac5e5b ("binder: use euid from cred instead of using task") Cc: stable@vger.kernel.org # 4.4+ Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Todd Kjos <tkjos@google.com>
Change-Id: I9b1769a3510fed250bb21859ef8beebabe034c66 Link: https://lore.kernel.org/r/20211112180720.2858135-1-tkjos@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Fix the circular lock dependency and unbalanced unlock of addess0_mutex
introduced when fixing an address0_mutex enumeration retry race in commit ae6dc22d2d1 ("usb: hub: Fix usb enumeration issue due to address0 race")
Make sure locking order between port_dev->status_lock and address0_mutex
is correct, and that address0_mutex is not unlocked in hub_port_connect
"done:" codepath which may be reached without locking address0_mutex
Fixes: 6ae6dc22d2d1 ("usb: hub: Fix usb enumeration issue due to address0 race") Cc: <stable@vger.kernel.org> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Hans de Goede <hdegoede@redhat.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20211123101656.1113518-1-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
xHC hardware can only have one slot in default state with address 0
waiting for a unique address at a time, otherwise "undefined behavior
may occur" according to xhci spec 5.4.3.4
The address0_mutex exists to prevent this across both xhci roothubs.
If hub_port_init() fails, it may unlock the mutex and exit with a xhci
slot in default state. If the other xhci roothub calls hub_port_init()
at this point we end up with two slots in default state.
Make sure the address0_mutex protects the slot default state across
hub_port_init() retries, until slot is addressed or disabled.
Note, one known minor case is not fixed by this patch.
If device needs to be reset during resume, but fails all hub_port_init()
retries in usb_reset_and_verify_device(), then it's possible the slot is
still left in default state when address0_mutex is unlocked.
Cc: <stable@vger.kernel.org> Fixes: 638139eb95d2 ("usb: hub: allow to process more usb hub events in parallel") Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20211115221630.871204-1-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The code that enables either BC_LVL or COMP_CHNG interrupt in tcpm_set_cc
wrongly assumes that the interrupt is unmasked by writing 1 to the apropriate
bit in the mask register. In fact, interrupts are enabled when the mask
is 0, so the tcpm_set_cc enables interrupt for COMP_CHNG when it expects
BC_LVL interrupt to be enabled.
This causes inability of the driver to recognize cable unplug events
in host mode (unplug is recognized only via a COMP_CHNG interrupt).
In device mode this bug was masked by simultaneous triggering of the VBUS
change interrupt, because of loss of VBUS when the port peer is providing
power.
Fixes: 48242e30532b ("usb: typec: fusb302: Revert "Resolve fixed power role contract setup"") Cc: stable <stable@vger.kernel.org> Cc: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Ondrej Jirman <megous@megous.com> Link: https://lore.kernel.org/r/20211108102833.2793803-1-megous@megous.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
When we try to add an IPv6 nexthop and IPv6 is not enabled
(!CONFIG_IPV6) we'll hit a NULL pointer dereference[1] in the error path
of nh_create_ipv6() due to calling ipv6_stub->fib6_nh_release. The bug
has been present since the beginning of IPv6 nexthop gateway support.
Commit 1aefd3de7bc6 ("ipv6: Add fib6_nh_init and release to stubs") tells
us that only fib6_nh_init has a dummy stub because fib6_nh_release should
not be called if fib6_nh_init returns an error, but the commit below added
a call to ipv6_stub->fib6_nh_release in its error path. To fix it return
the dummy stub's -EAFNOSUPPORT error directly without calling
ipv6_stub->fib6_nh_release in nh_create_ipv6()'s error path.
Cc: stable@vger.kernel.org Fixes: 53010f991a9f ("nexthop: Add support for IPv6 gateways") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
A new commit in LLVM causes an error on the use of 'long double' when
'-mno-x87' is used, which the kernel does through an alias,
'-mno-80387' (see the LLVM commit below for more details around why it
does this).
drivers/usb/dwc2/hcd_queue.c:1744:25: error: expression requires 'long double' type support, but target 'x86_64-unknown-linux-gnu' does not support it
delay = ktime_set(0, DWC2_RETRY_WAIT_DELAY);
^
drivers/usb/dwc2/hcd_queue.c:62:34: note: expanded from macro 'DWC2_RETRY_WAIT_DELAY'
#define DWC2_RETRY_WAIT_DELAY (1 * 1E6L)
^
1 error generated.
This happens due to the use of a 'long double' literal. The 'E6' part of
'1E6L' causes the literal to be a 'double' then the 'L' suffix promotes
it to 'long double'.
There is no visible reason for a floating point value in this driver, as
the value is only used as a parameter to a function that expects an
integer type. Use NSEC_PER_MSEC, which is the same integer value as
'1E6L', to avoid changing functionality but fix the error.
Added updating of request frame number for elapsed frames,
otherwise frame number will remain as previous use of request.
This will allow function driver to correctly track frames in
case of Missed ISOC occurs.
Added setting request actual length to 0 for elapsed frames.
In Slave mode when pushing data to RxFIFO by dwords, request
actual length incrementing accordingly. But before whole packet
will be pushed into RxFIFO and send to host can occurs Missed
ISOC and data will not send to host. So, in this case request
actual length should be reset to 0.
Fixes: 91bb163e1e4f ("usb: dwc2: gadget: Fix ISOC flow for BDMA and Slave") Cc: stable <stable@vger.kernel.org> Reviewed-by: John Keeping <john@metanate.com> Signed-off-by: Minas Harutyunyan <Minas.Harutyunyan@synopsys.com> Link: https://lore.kernel.org/r/c356baade6e9716d312d43df08d53ae557cb8037.1636011277.git.Minas.Harutyunyan@synopsys.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Update the USB serial option driver support for the Fibocom
FM101-GL Cat.6
LTE modules as there are actually several different variants.
- VID:PID 2cb7:01a2, FM101-GL are laptop M.2 cards (with
MBIM interfaces for /Linux/Chrome OS)
- VID:PID 2cb7:01a4, FM101-GL for laptop debug M.2 cards(with adb
interface for /Linux/Chrome OS)
Maor Dickman [Tue, 25 Jan 2022 08:54:24 +0000 (16:54 +0800)]
net/mlx5e: Unblock setting vid 0 for VF in case PF isn't eswitch manager
BugLink: https://bugs.launchpad.net/bugs/1957753
When using libvirt to passthrough VF to VM it will always set the VF vlan
to 0 even if user didn’t request it, this will cause libvirt to fail to
boot in case the PF isn't eswitch owner.
Example of such case is the DPU host PF which isn't eswitch manager, so
any attempt to passthrough VF of it using libvirt will fail.
Fix it by not returning error in case set VF vlan is called with vid 0.
Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
(backported from commit 7846665d3504812acaebf920d1141851379a7f37)
[PHLin: apply change in eswitch.c accordingly before the split b55b3538] Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Linus Torvalds [Fri, 21 Jan 2022 14:26:15 +0000 (11:26 -0300)]
fget: check that the fd still exists after getting a ref to it
Jann Horn points out that there is another possible race wrt Unix domain
socket garbage collection, somewhat reminiscent of the one fixed in
commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK").
See the extended comment about the garbage collection requirements added
to unix_peek_fds() by that commit for details.
The race comes from how we can locklessly look up a file descriptor just
as it is in the process of being closed, and with the right artificial
timing (Jann added a few strategic 'mdelay(500)' calls to do that), the
Unix domain socket garbage collector could see the reference count
decrement of the close() happen before fget() took its reference to the
file and the file was attached onto a new file descriptor.
This is all (intentionally) correct on the 'struct file *' side, with
RCU lookups and lockless reference counting very much part of the
design. Getting that reference count out of order isn't a problem per
se.
But the garbage collector can get confused by seeing this situation of
having seen a file not having any remaining external references and then
seeing it being attached to an fd.
In commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK") the
fix was to serialize the file descriptor install with the garbage
collector by taking and releasing the unix_gc_lock.
That's not really an option here, but since this all happens when we are
in the process of looking up a file descriptor, we can instead simply
just re-check that the file hasn't been closed in the meantime, and just
re-do the lookup if we raced with a concurrent close() of the same file
descriptor.
Reported-and-tested-by: Jann Horn <jannh@google.com> Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(backported from commit 054aa8d439b9185d4f5eb9a90282d1ce74772969)
[cascardo: __fcheck_files was renamed to files_lookup_fd_raw at bebf684bf330 ("file: Rename __fcheck_files to files_lookup_fd_raw")]
CVE-2021-4083 Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Acked-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Darrick J. Wong [Thu, 20 Jan 2022 14:31:29 +0000 (11:31 -0300)]
xfs: map unwritten blocks in XFS_IOC_{ALLOC, FREE}SP just like fallocate
The old ALLOCSP/FREESP ioctls in XFS can be used to preallocate space at
the end of files, just like fallocate and RESVSP. Make the behavior
consistent with the other ioctls.
Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
(backported from commit 983d8e60f50806f90534cc5373d0ce867e5aaf79)
[cascardo: always use XFS_BMAPI_PREALLOC on xfs_alloc_file_space]
CVE-2021-4155 Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Po-Hsu Lin [Tue, 28 Dec 2021 09:43:00 +0000 (10:43 +0100)]
selftests: icmp_redirect: pass xfail=0 to log_test()
BugLink: https://bugs.launchpad.net/bugs/1938964
If any sub-test in this icmp_redirect.sh is failing but not expected
to fail. The script will complain:
./icmp_redirect.sh: line 72: [: 1: unary operator expected
This is because when the sub-test is not expected to fail, we won't
pass any value for the xfail local variable in log_test() and thus
it's empty. Fix this by passing 0 as the 4th variable to log_test()
for non-xfail cases.
v2: added fixes tag
Fixes: 0a36a75c6818 ("selftests: icmp_redirect: support expected failures") Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 3748939bce3fc7a15ef07161826507fbe410bb7a) Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Acked-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
We generally rely on a bunch of factors to differentiate between servers.
For example, IP address, port etc.
For certain server types (like Azure), it is important to make sure
that the server hostname matches too, even if the both hostnames currently
resolve to the same IP address.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
[rtg - backported by the Microsoft team. They dropped changes to
fs/cifs/fs_connect.[ch], added a structure tag to fs/cifs/cifsglob.h: struct smb_vol,
misc changes to fs/cifs/connect.c to reflect the intent of the original upstream patch] Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kelsey Skunberg <kelsey.skunberg@canonical.com> Signed-off-by: Kelsey Skunberg <kelsey.skunberg@canonical.com>
Paulo Alcantara [Wed, 15 Dec 2021 14:35:26 +0000 (07:35 -0700)]
cifs: set a minimum of 120s for next dns resolution
BugLink: https://bugs.launchpad.net/bugs/1954926
With commit 506c1da44fee ("cifs: use the expiry output of dns_query to
schedule next resolution") and after triggering the first reconnect,
the next async dns resolution of tcp server's hostname would be
scheduled based on dns_resolver's key expiry default, which happens to
default to 5s on most systems that use key.dns_resolver for upcall.
As per key.dns_resolver.conf(5):
default_ttl=<number>
The number of seconds to set as the expiration on a cached
record. This will be overridden if the program manages to re-
trieve TTL information along with the addresses (if, for exam-
ple, it accesses the DNS directly). The default is 5 seconds.
The value must be in the range 1 to INT_MAX.
Make the next async dns resolution no shorter than 120s as we do not
want to be upcalling too often.
Cc: stable@vger.kernel.org Fixes: 506c1da44fee ("cifs: use the expiry output of dns_query to schedule next resolution") Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
(cherry picked from commit 4ac0536f8874a903a72bddc57eb88db774261e3a) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kelsey Skunberg <kelsey.skunberg@canonical.com> Signed-off-by: Kelsey Skunberg <kelsey.skunberg@canonical.com>
Shyam Prasad N [Wed, 15 Dec 2021 14:35:25 +0000 (07:35 -0700)]
cifs: use the expiry output of dns_query to schedule next resolution
BugLink: https://bugs.launchpad.net/bugs/1954926
We recently fixed DNS resolution of the server hostname during reconnect.
However, server IP address may change, even when the old one continues
to server (although sub-optimally).
We should schedule the next DNS resolution based on the TTL of
the DNS record used for the last resolution. This way, we resolve the
server hostname again when a DNS record expires.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: <stable@vger.kernel.org> # v5.11+ Signed-off-by: Steve French <stfrench@microsoft.com>
(backported from commit 506c1da44fee32ba1d3a70413289ad58c772bba6)
[rtg - dropped changes to fs/cifs/misc.c] Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kelsey Skunberg <kelsey.skunberg@canonical.com> Signed-off-by: Kelsey Skunberg <kelsey.skunberg@canonical.com>
UBUNTU: SAUCE: selftests/seccomp: Fix s390x regs not defined issue
BugLink: https://bugs.launchpad.net/bugs/1896420
Fix build issue on s390x caused by previous patch
'UBUNTU: SAUCE: selftests/seccomp: fix "storage size of 'md' isn't
known" build issue' which caused an include to be ommited that was needed
on s390x.
To fix it we just include the needed asm/ptrace.h include directly.
Fixes: cf091fb15f93 ("UBUNTU: SAUCE: selftests/seccomp: fix "storage
size of 'md' isn't known" build issue") Signed-off-by: Luke Nowakowski-Krijger <luke.nowakowskikrijger@canonical.com> Acked-by: Kamal Mostafa <kamal@canonical.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Kelsey Skunberg <kelsey.skunberg@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1950666
Currently when removing an ipmi_user the removal is deferred as a work on
the system's workqueue. Although this guarantees the free operation will
occur in non atomic context, it can race with the ipmi_msghandler module
removal (see [1]) . In case a remove_user work is scheduled for removal
and shortly after ipmi_msghandler module is removed we can end up in a
situation where the module is removed fist and when the work is executed
the system crashes with :
BUG: unable to handle page fault for address: ffffffffc05c3450
PF: supervisor instruction fetch in kernel mode
PF: error_code(0x0010) - not-present page
because the pages of the module are gone. In cleanup_ipmi() there is no
easy way to detect if there are any pending works to flush them before
removing the module. This patch creates a separate workqueue and schedules
the remove_work works on it. When removing the module the workqueue is
drained when destroyed to avoid the race.
BugLink: https://bugs.launchpad.net/bugs/1954680
This patch provides support for new dynamic AP bus message limit
with the existing zcrypt device driver and AP bus core code.
There is support for a new field 'ml' from TAPQ query. The field
gives if != 0 the AP bus limit for this card in 4k chunk units.
The actual message size limit per card is shown as a new read-only
sysfs attribute. The sysfs attribute
/sys/devices/ap/cardxx/max_msg_size
shows the upper limit in bytes used by the AP bus and zcrypt device
driver for requests and replies send to and received from this card.
Currently up to CEX7 support only max 12kB msg size and thus the field
shows 12288 meaning the upper limit of a valid msg for this card is
12kB. Please note that the usable payload is somewhat lower and
depends on the msg type and thus the header struct which is to be
prepended by the zcrypt dd.
The dispatcher responsible for choosing the right card and queue is
aware of the individual card AP bus message limit. So a request is
only assigned to a queue of a card which is able to handle the size of
the request (e.g. a 14kB request will never go to a max 12kB card).
If no such card is found the ioctl will fail with ENODEV.
The reply buffer held by the device driver is determined by the ml
field of the TAPQ for this card. If a response from the card exceeds
this limit however, the response is not truncated but the ioctl for
this request will fail with errno EMSGSIZE to indicate that the device
driver has dropped the response because it would overflow the buffer
limit.
If the request size does not indicate to the dispatcher that an
adapter with extended limit is to be used, a random card will be
chosen when no specific card is addressed (ANY addressing). This may
result in an ioctl failure when the reply size needs an adapter with
extended limit but the randomly chosen one is not capable of handling
the broader reply size. The user space application needs to use
dedicated addressing to forward such a request only to suitable cards
to get requests like this processed properly.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Ingo Tuchscherer <ingo.tuchscherer@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
(cherry picked from commit bd39654a2282c1a51c044575a6bc00d641d5dfd1) Signed-off-by: Frank Heimes <frank.heimes@canonical.com> Acked-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Collin Walling [Tue, 7 Dec 2021 20:38:00 +0000 (21:38 +0100)]
KVM: s390: add debug statement for diag 318 CPNC data
BugLink: https://bugs.launchpad.net/bugs/1953334
The diag 318 data contains values that denote information regarding the
guest's environment. Currently, it is unecessarily difficult to observe
this value (either manually-inserted debug statements, gdb stepping, mem
dumping etc). It's useful to observe this information to obtain an
at-a-glance view of the guest's environment, so lets add a simple VCPU
event that prints the CPNC to the s390dbf logs.
Signed-off-by: Collin Walling <walling@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Link: https://lore.kernel.org/r/20211027025451.290124-1-walling@linux.ibm.com
[borntraeger@de.ibm.com]: change debug level to 3 Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
(backported from commit 3fd8417f2c728d810a3b26d7e2008012ffb7fd01) Signed-off-by: Frank Heimes <frank.heimes@canonical.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Acked-by: Philip Cox <philip.cox@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Collin Walling [Tue, 7 Dec 2021 20:38:00 +0000 (21:38 +0100)]
KVM: s390: remove diag318 reset code
BugLink: https://bugs.launchpad.net/bugs/1953334
The diag318 data must be set to 0 by VM-wide reset events
triggered by diag308. As such, KVM should not handle
resetting this data via the VCPU ioctls.
Fixes: 23a60f834406 ("s390/kvm: diagnose 0x318 sync and reset") Signed-off-by: Collin Walling <walling@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Link: https://lore.kernel.org/r/20201104181032.109800-1-walling@linux.ibm.com
(backported from commit 6cbf1e960fa52e4c63a6dfa4cda8736375b34ccc) Signed-off-by: Frank Heimes <frank.heimes@canonical.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Acked-by: Philip Cox <philip.cox@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Collin Walling [Tue, 7 Dec 2021 20:38:00 +0000 (21:38 +0100)]
s390/kvm: diagnose 0x318 sync and reset
BugLink: https://bugs.launchpad.net/bugs/1953334
DIAGNOSE 0x318 (diag318) sets information regarding the environment
the VM is running in (Linux, z/VM, etc) and is observed via
firmware/service events.
This is a privileged s390x instruction that must be intercepted by
SIE. Userspace handles the instruction as well as migration. Data
is communicated via VCPU register synchronization.
The Control Program Name Code (CPNC) is stored in the SIE block. The
CPNC along with the Control Program Version Code (CPVC) are stored
in the kvm_vcpu_arch struct.
This data is reset on load normal and clear resets.
Signed-off-by: Collin Walling <walling@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200622154636.5499-3-walling@linux.ibm.com
[borntraeger@de.ibm.com: fix sync_reg position] Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
(backported from commit 23a60f834406c8e3805328b630d09d5546b460c1) Signed-off-by: Frank Heimes <frank.heimes@canonical.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Acked-by: Philip Cox <philip.cox@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Collin Walling [Tue, 7 Dec 2021 20:38:00 +0000 (21:38 +0100)]
s390/setup: diag 318: refactor struct
BugLink: https://bugs.launchpad.net/bugs/1953334
The diag 318 struct introduced in include/asm/diag.h can be
reused in KVM, so let's condense the version code fields in the
diag318_info struct for easier usage and simplify it until we
can determine how the data should be formatted.
Signed-off-by: Collin Walling <walling@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20200622154636.5499-2-walling@linux.ibm.com Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
(backported from commit a23816f3cdcbffe5dc6e8c331914b3f51b87c2f3) Signed-off-by: Frank Heimes <frank.heimes@canonical.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Acked-by: Philip Cox <philip.cox@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1896420
There is a build issue on Bionic/5.4 kernels due to
PTRACE_SECCOMP_GET_METADATA being defined in glibc header sys/ptrace.h,
which then stops struct seccomp_metadata from being defined leading to:
seccomp_bpf.c:3028:26: error: storage size of 'md' isn't known
The solution here is to unconditonally define the seccomp_metadata
definition that we need, and remove the linux/ptrace.h header where a
definition of seccomp_metadata exists in Focal.
Signed-off-by: Luke Nowakowski-Krijger <luke.nowakowskikrijger@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Jack Morgenstein [Thu, 14 Oct 2021 23:04:00 +0000 (01:04 +0200)]
UBUNTU: SAUCE: RDMA/core: Updated ib_peer_memory
BugLink: https://launchpad.net/bugs/1947206
- Allow clients to opt out of unmap during invalidation
- Fix race condition with clients when deregistering
a peer mr.
- Enable ATS for peer memory
To fix the race condition, we disable the client invalidation
notifier before calling destroy_mkey().
The race condition fix depends on being able to call procedure
mlx5_mr_cache_invalidate() twice on the same mr, and not have
the second call fail if the first call succeeds.
Based on upstream kernel patch from Jason Gunthorpe,
adapted for mlnx_ofed 5.2-2 and Ubuntu 20.04.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Amir Tzin <amirtz@nvidia.com>
(provided by Nvidia via private email) Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1923104
The peer_memory_client scheme allows a driver to register with the ib_umem
system that it has the ability to understand user virtual address ranges
that are not compatible with get_user_pages(). For instance VMAs created
with io_remap_pfn_range(), or other driver special VMA.
For ranges the interface understands it can provide a DMA mapped sg_table
for use by the ib_umem, allowing user virtual ranges that cannot be
supported by get_user_pages() to be used as umems for RDMA.
This is designed to preserve the kABI, no functions or structures are
changed, only new symbols are added:
This interface is compatible with the two out of tree GPU drivers:
https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/master/drivers/gpu/drm/amd/amdkfd/kfd_peerdirect.c
https://github.com/Mellanox/nv_peer_memory/blob/master/nv_peer_mem.c
NOTES (remote before sending):
- The exact locking semantics from the GPU side during invalidation
are confusing. I've made it sane but perhaps this will hit locking
problems. Test with lockdep and test invalidation.
The main difference here is that get_pages and dma_map are called
from a context that will block progress of invalidation.
The old design blocked progress of invalidation using a completion for
unmap and unpin, so those should be proven safe now.
Since the old design used a completion it doesn't work with lockdep,
even though it has basically the same blocking semantics.
- The API exported to the GPU side is crufty and makes very little
sense. Functionally it should be the same still, but many useless
things were dropped off
- I rewrote all the comments please check spelling/grammar
- Compile tested only
Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
(backported from commit a42989294cf39d6e829424734ab0e7ec48bebcef
git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git)
[ dannf: Backport provided to Ubuntu's 5.4 by Feras; confirmed
to pass Nvidia's internal testing. ] Signed-off-by: dann frazier <dann.frazier@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>