16 Commits

Author SHA1 Message Date
David Laight
d10bb374c4 lib: mul_u64_u64_div_u64(): optimise the divide code
Replace the bit by bit algorithm with one that generates 16 bits per
iteration on 32bit architectures and 32 bits on 64bit ones.
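
For reference, a minimal userspace sketch (illustrative names, not the
kernel code) of the bit-by-bit long division being replaced - one
quotient bit per iteration, assuming n_hi < d so the quotient fits in
64 bits, and d != 0:

    #include <stdint.h>

    static uint64_t div128_by_64_bitwise(uint64_t n_hi, uint64_t n_lo, uint64_t d)
    {
        uint64_t quot = 0, rem = n_hi;
        int i;

        for (i = 63; i >= 0; i--) {
            int carry = rem >> 63;           /* bit shifted out of rem */

            rem = (rem << 1) | (n_lo >> 63); /* shift in next dividend bit */
            n_lo <<= 1;
            if (carry || rem >= d) {         /* carry means rem >= 2^64 > d */
                rem -= d;                    /* wraps to the true remainder */
                quot |= 1ULL << i;
            }
        }
        return quot;
    }

The new code replaces this loop with one producing a 16bit or 32bit
quotient digit per iteration, which is where the constant-factor win
comes from.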

On my zen 5 this reduces the time for the tests (using the generic code)
from ~3350ns to ~1000ns.

Running the 32bit algorithm on 64bit x86 takes ~1500ns.  It'll be slightly
slower on a real 32bit system, mostly due to register pressure.

The savings for 32bit x86 are much higher (tested in userspace).  The
worst case (lots of bits in the quotient) drops from ~900 clocks to ~130
(pretty much independent of the arguments).  Other 32bit architectures may
see better savings.

It is possible to optimise for divisors that span less than
__LONG_WIDTH__/2 bits.  However, I suspect they don't happen that often,
and it doesn't remove any of the slow cpu divide instructions, which
dominate the result.

Typical improvements for 64bit random divides:
               old     new
sandy bridge:  470     150
haswell:       400     144
piledriver:    960     467   I think rdpmc is very slow.
zen5:          244      80
(Timing is 'rdpmc; mul_div(); rdpmc' with the multiply depending on the
first rdpmc and the second rdpmc depending on the quotient.)

Object code (64bit x86 test program): old 0x173 new 0x141.

Link: https://lkml.kernel.org/r/20251105201035.64043-9-david.laight.linux@gmail.com
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
Cc: Biju Das <biju.das.jz@bp.renesas.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20 14:03:42 -08:00
David Laight
630f96a687 lib: mul_u64_u64_div_u64(): optimise multiply on 32bit x86
gcc generates horrid code for both ((u64)u32_a * u32_b) and (u64_a +
u32_b).  As well as the extra instructions, it can generate a lot of spills
to stack (including spills of constant zeros and even multiplies by
constant zero).

mul_u32_u32() already exists to optimise the multiply.  Add a similar
add_u64_u32() for the addition.  Disable both for clang - it generates
better code without them.
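
mul_u32_u32() is the long-standing x86 helper; a plausible sketch of it
and of an add_u64_u32() of the kind described (kernel u64/u32 types
assumed; the exact asm in the patch may differ):

    static inline u64 mul_u32_u32(u32 a, u32 b)
    {
        u32 high, low;

        asm ("mull %[b]" : "=a" (low), "=d" (high)
             : [a] "a" (a), [b] "rm" (b));
        return low | (u64)high << 32;
    }

    static inline u64 add_u64_u32(u64 a, u32 b)
    {
        u32 lo = a, hi = a >> 32;

        /* 32bit add + add-with-carry, instead of gcc's 64bit arithmetic */
        asm ("addl %[b], %[lo]\n\t"
             "adcl $0, %[hi]"
             : [lo] "+r" (lo), [hi] "+r" (hi)
             : [b] "rm" (b));
        return lo | (u64)hi << 32;
    }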

Move the 64x64 => 128 multiply into a static inline helper function for
code clarity.  No need for the a/b_hi/lo variables; the implicit casts on
the function calls do the work for us.  This should have minimal effect on
the generated code.

Use mul_u32_u32() and add_u64_u32() in the 64x64 => 128 multiply in
mul_u64_add_u64_div_u64().
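
A sketch of the shape of that helper, built on the two functions above
(assumed form, not the patch verbatim) - note how passing the u64
arguments straight to mul_u32_u32() and add_u64_u32() supplies the
implicit truncating casts the message mentions:

    static inline void mul_u64_u64(u64 a, u64 b, u64 *hi, u64 *lo)
    {
        u64 ll = mul_u32_u32(a, b);                          /* a_lo * b_lo */
        u64 t  = add_u64_u32(mul_u32_u32(a >> 32, b), ll >> 32);
        u64 u  = add_u64_u32(mul_u32_u32(a, b >> 32), t);    /* adds t_lo */

        *lo = u << 32 | (u32)ll;
        *hi = mul_u32_u32(a >> 32, b >> 32) + (t >> 32) + (u >> 32);
    }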

Link: https://lkml.kernel.org/r/20251105201035.64043-8-david.laight.linux@gmail.com
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
Cc: Biju Das <biju.das.jz@bp.renesas.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20 14:03:42 -08:00
David Laight
f0bff2eb04 lib: test_mul_u64_u64_div_u64(): test both generic and arch versions
Change the #if in div64.c so that test_mul_u64_u64_div_u64.c can compile
and test the generic version (including the 'long multiply') on
architectures (e.g. amd64) that define their own copy.

Test the kernel version and the locally compiled version on all
architectures.
Output the time taken (in ns) on the 'test completed' trace.

For reference, on my zen 5, the optimised version takes ~220ns and the
generic version ~3350ns.  Using the native multiply saves ~200ns and
adding back the ilog2() 'optimisation' test adds ~50ns.

Link: https://lkml.kernel.org/r/20251105201035.64043-7-david.laight.linux@gmail.com
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
Cc: Biju Das <biju.das.jz@bp.renesas.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20 14:03:42 -08:00
David Laight
6480241f31 lib: add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
The existing mul_u64_u64_div_u64() rounds down; a 'rounding up' variant
needs 'divisor - 1' added between the multiply and the divide, so it
cannot easily be done by a caller.

Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d and
implement the 'round down' and 'round up' using it.
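
As a sketch, the two wrappers are then simply (kernel u64 type assumed):

    /* round down: (a * b) / d */
    static inline u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
    {
        return mul_u64_add_u64_div_u64(a, b, 0, d);
    }

    /* round up: (a * b + d - 1) / d */
    static inline u64 mul_u64_u64_div_u64_roundup(u64 a, u64 b, u64 d)
    {
        return mul_u64_add_u64_div_u64(a, b, d - 1, d);
    }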

Update the x86-64 asm to optimise for 'c' being a constant zero.

Add kernel-doc definitions for all three functions.

Link: https://lkml.kernel.org/r/20251105201035.64043-5-david.laight.linux@gmail.com
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
Cc: Biju Das <biju.das.jz@bp.renesas.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20 14:03:41 -08:00
David Laight
d91f891d58 lib: mul_u64_u64_div_u64(): simplify check for a 64bit product
If the product is only 64bits, div64_u64() can be used for the divide.
Replace the pre-multiply check (ilog2(a) + ilog2(b) <= 62) with a simple
post-multiply check that the high 64bits are zero.

This has the advantage of being simpler and more accurate, with less
code.  It will always be faster when the product is larger than 64bits.
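
The shape of the change, as a sketch (n_hi:n_lo being the 128bit
product; names illustrative):

    /* old: if (ilog2(a) + ilog2(b) <= 62) return div64_u64(a * b, d); */
    if (n_hi == 0)                  /* product fits in 64 bits */
        return div64_u64(n_lo, d);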

Most 64bit cpus have a native 64x64 => 128 bit multiply; this is needed
(for the low 64bits) even when div64_u64() is called - so the early check
gains nothing and is just extra code.

A 32bit cpu will need a compare (etc) to generate the 64bit ilog2() from
two 32bit bit scans - so that is non-trivial.  (Never mind the mess of
x86's 'bsr' and any oddball cpu without fast bit-scan instructions.)  By
contrast, the additional instructions for the 128bit multiply result are
pretty much one multiply and two adds (typically the 'adc $0,%reg' can be
run in parallel with the instruction that follows).

The only outliers are 64bit systems without a 128bit multiply and simple
in-order 32bit ones with fast bit scan but needing extra instructions to
get the high bits of the multiply result.  I doubt it makes much
difference to either; the latter is definitely not mainstream.

If anyone is worried about the analysis, they can look at the generated
code for x86 (especially when cmov isn't used).

Link: https://lkml.kernel.org/r/20251105201035.64043-4-david.laight.linux@gmail.com
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
Cc: Biju Das <biju.das.jz@bp.renesas.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20 14:03:41 -08:00
David Laight
08092babd3 lib: mul_u64_u64_div_u64(): combine overflow and divide by zero checks
Since the overflow check always triggers when the divisor is zero, move
the check for divide by zero inside the overflow check.  This means there
is only one test in the normal path.
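
A sketch of the combined test, assuming the usual 'the quotient fits
iff n_hi < d' overflow check (n_hi being the high 64 bits of the
product; not the kernel code verbatim):

    if (unlikely(n_hi >= d)) {      /* always true when d == 0 */
        if (d == 0)
            return div64_u64(n_lo, 0);  /* divide-by-zero path */
        /* genuine overflow handling goes here */
    }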

Link: https://lkml.kernel.org/r/20251105201035.64043-3-david.laight.linux@gmail.com
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
Cc: Biju Das <biju.das.jz@bp.renesas.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20 14:03:41 -08:00
David Laight
5944f875ac lib: mul_u64_u64_div_u64(): rename parameter 'c' to 'd'
Patch series "Implement mul_u64_u64_div_u64_roundup()", v5.

The pwm-stm32.c code wants a 'rounding up' version of
mul_u64_u64_div_u64().  This can be done simply by adding 'divisor - 1' to
the 128bit product.  Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a *
b + c)/d based on the existing code.  Define mul_u64_u64_div_u64(a, b, d)
as mul_u64_add_u64_div_u64(a, b, 0, d) and mul_u64_u64_div_u64_roundup(a,
b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).

Only x86-64 has an optimised (asm) version of the function.  That is
optimised to avoid the 'add c' when c is known to be zero.  In all other
cases the extra code will be noise compared to the software divide code.

The test module has been updated to test mul_u64_u64_div_u64_roundup() and
also enhanced to verify the C division code on x86-64 and the 32bit
division code on 64bit systems.


This patch (of 9):

Change the prototype from mul_u64_u64_div_u64(u64 a, u64 b, u64 c) to
mul_u64_u64_div_u64(u64 a, u64 b, u64 d).  Using 'd' for 'divisor' makes
more sense.

An upcoming change adds a 'c' parameter to calculate (a * b + c)/d.
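
The visible change is just the parameter name:

    -u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c);
    +u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d);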

Link: https://lkml.kernel.org/r/20251105201035.64043-1-david.laight.linux@gmail.com
Link: https://lkml.kernel.org/r/20251105201035.64043-2-david.laight.linux@gmail.com
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
Cc: Biju Das <biju.das.jz@bp.renesas.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20 14:03:41 -08:00
Nicolas Pitre
e795000e75 mul_u64_u64_div_u64: fix the division-by-zero behavior
The current implementation forces a compile-time 1/0 division, which
generates an undefined instruction (ud2 on x86) rather than a proper
runtime division-by-zero exception.

Change to trigger an actual div-by-0 exception at runtime, consistent with
other division operations.  Use a non-1 dividend to prevent the compiler
from optimizing the division into a comparison.
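
A minimal sketch of the trick (illustrative, not the fix verbatim),
using the kernel's OPTIMIZER_HIDE_VAR() to keep the zero opaque and a
non-1 dividend so 1/x cannot become (x == 1):

    u64 n = ~(u64)0, zero = 0;

    OPTIMIZER_HIDE_VAR(zero);   /* hide the 0 from the optimizer */
    return div64_u64(n, zero);  /* real division -> runtime #DE */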

Link: https://lkml.kernel.org/r/q246p466-1453-qon9-29so-37105116009q@onlyvoer.pbz
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Cc: Biju Das <biju.das.jz@bp.renesas.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Cc: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:57:53 -07:00
Nicolas Pitre
1635e62e75 mul_u64_u64_div_u64: basic sanity test
Verify that edge cases, and some other inputs, produce proper results.

[npitre@baylibre.com: avoid undefined shift value]
  Link: https://lkml.kernel.org/r/7rrs9pn1-n266-3013-9q6n-1osp8r8s0rrn@syhkavp.arg
Link: https://lkml.kernel.org/r/20240707190648.1982714-3-nico@fluxnic.net
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Reviewed-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Cc: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:43:22 -07:00
Nicolas Pitre
b29a62d87c mul_u64_u64_div_u64: make it precise always
Patch series "mul_u64_u64_div_u64: new implementation", v3.

This provides an implementation for mul_u64_u64_div_u64() that always
produces exact results.


This patch (of 2):

Library facilities must always return exact results.  If the caller is
content with approximations then it should do the approximation on its
own.

In this particular case the comment in the code says "the algorithm
... below might lose some precision". Well, if you try it with e.g.:

	a = 18446462598732840960
	b = 18446462598732840960
	c = 18446462598732840961

then the produced answer is 0 whereas the exact answer should be
18446462598732840959.  This is _some_ precision lost indeed!
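
The claim is easy to check in userspace with a 128bit type (assuming
GCC/Clang __int128): a*a = (a+1)*(a-1) + 1, so the exact quotient is
a - 1.

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t a = 18446462598732840960ULL;   /* 2^64 - 2^48 */
        uint64_t c = a + 1;
        unsigned __int128 p = (unsigned __int128)a * a;

        assert((uint64_t)(p / c) == a - 1);     /* 18446462598732840959 */
        return 0;
    }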

Let's reimplement this function so it always produces the exact result
regardless of its inputs while preserving existing fast paths when
possible.

Uwe said:

: My personal interest is to get the calculations in pwm drivers right. 
: This function is used in several drivers below drivers/pwm/ .  With the
: errors in mul_u64_u64_div_u64(), pwm consumers might not get the
: settings they request.  Although I have to admit that I'm not aware it
: breaks real use cases (because typically the periods used are too short
: to make the involved multiplications overflow), but I am pretty sure I
: am not aware of all usages and it breaks testing.
: 
: Another justification is commits like
: https://git.kernel.org/tip/77baa5bafcbe1b2a15ef9c37232c21279c95481c,
: where people start to work around the precision shortcomings of
: mul_u64_u64_div_u64().

Link: https://lkml.kernel.org/r/20240707190648.1982714-1-nico@fluxnic.net
Link: https://lkml.kernel.org/r/20240707190648.1982714-2-nico@fluxnic.net
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Tested-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Reviewed-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:43:22 -07:00
Uwe Kleine-König
8c86fb68ff mul_u64_u64_div_u64: increase precision by conditionally swapping a and b
As indicated in the added comment, the algorithm works better if b is big.
As multiplication is commutative, a and b can be swapped.  Do this if a
is bigger than b.
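
The change itself is tiny - as a sketch:

    /* the algorithm below works better when b is big */
    if (a > b)
        swap(a, b);     /* a * b == b * a, so this is safe */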

Link: https://lkml.kernel.org/r/20240303092408.662449-2-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-12 13:09:22 -07:00
Liam Beguin
d28a1de5d1 math64: favor kernel-doc from header files
Fix the kernel-doc markings for div64 functions to point to the header
file instead of the lib/ directory.  This avoids having implementation
specific comments in generic documentation.  Furthermore, given that
some kernel-doc comments are identical, drop them from lib/math64 and
only keep the comments that add implementation details.

Signed-off-by: Liam Beguin <liambeguin@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20221118182309.3824530-1-liambeguin@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-11-21 14:30:53 -07:00
David S. Miller
bf45947864 math: Export mul_u64_u64_div_u64
Fixes: f51d7bf1db ("ptp_qoriq: fix overflow in ptp_qoriq_adjfine() u64 calcalation")
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 16:42:54 -07:00
Andy Shevchenko
aa6159ab99 kernel.h: split out mathematical helpers
kernel.h has been used as a dump for all kinds of stuff for a long time.
Here is an attempt to start cleaning it up by splitting out the
mathematical helpers.

At the same time, convert users in the header and lib folders to use the
new header.  Though for the time being, include the new header back into
kernel.h to avoid twisted indirect includes for existing users.

[sfr@canb.auug.org.au: fix powerpc build]
  Link: https://lkml.kernel.org/r/20201029150809.13059608@canb.auug.org.au

Link: https://lkml.kernel.org/r/20201028173212.41768-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 22:46:15 -08:00
Oleg Nesterov
3dc167ba57 sched/cputime: Improve cputime_adjust()
People report that utime and stime from /proc/<pid>/stat become very
wrong when the numbers are big enough, especially if you watch these
counters incrementally.

Specifically, the current implementation of stime*rtime/total
results in a saw-tooth function on top of the desired line, where the
teeth grow in size the larger the values become. IOW, it has a
relative error.

The result is that, when watching incrementally as time progresses
(for large values), we'll see periods of pure stime or utime increase,
irrespective of the actual ratio we're striving for.

Replace scale_stime() with a math64.h helper: mul_u64_u64_div_u64()
that is far more accurate. This also allows architectures to override
the implementation -- for instance they can opt for the old algorithm
if this new one turns out to be too expensive for them.
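
The replacement boils down to a single call (sketch, with total being
stime + utime as in cputime_adjust()):

    /* exact over the full u64 range, unlike the old scale_stime() */
    stime = mul_u64_u64_div_u64(stime, rtime, total);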

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200519172506.GA317395@hirez.programming.kicks-ass.net
2020-06-15 14:10:00 +02:00
Andy Shevchenko
2c64e9cb0b lib: Move mathematic helpers to separate folder
For better maintenance and expansion, move the mathematical helpers to a
separate folder.

No functional change intended.

Note: int_sqrt() is not used as part of lib, so it is moved to the
regular obj list.

Link: http://lkml.kernel.org/r/20190323172531.80025-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Ray Jui <rjui@broadcom.com>
[mchehab+samsung@kernel.org: fix broken doc references for div64.c and gcd.c]
  Link: http://lkml.kernel.org/r/734f49bae5d4052b3c25691dfefad59bea2e5843.1555580999.git.mchehab+samsung@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:49 -07:00