[kwlug disc.] any folks using tricks for better linux
performance?
Andrew Kohlsmith
akohlsmith-kwlug at benshaw.com
Sat Aug 25 21:06:34 EDT 2007
On Saturday 25 August 2007 4:06:53 am Robert P. J. Day wrote:
> i'm perusing some online docs related to speeding up (primarily
> embedded) linux systems, and i'm curious how many people on this list
> are using any of the following in a practical way to either increase
> speed or save space:
>
> 1) running the kernel XIP (execute in place)
> 2) booting without sysfs
> 3) using "kexec" for reboots
I've just finished up the implementation of an industrial embedded Linux
system. It's a ColdFire MCF5282 (no MMU) with 32MB SDRAM and 32MB Flash,
CAN, Ethernet, RS485 and Bluetooth. I've given the details of the system to
underscore the variety of subsystems and places where optimization was
evaluated. It's taken about two years of full-time development to take it
from concept, through several false starts, field testing prototypes, HALT
and EMI/RFI testing, debugging, protocol conformance and physical case design
to finished product, and in that time and through all of those processes I
have come to one conclusion:
There is no such thing as generic optimization.
XIP is, generally speaking, not an optimization in my opinion, neither for
space nor for speed. You can't have a compressed filesystem for XIP, so your
Flash requirements are larger. Flash is also notoriously slow, often so slow
that executing from it is unacceptable. People claim XIP is a space/speed
improvement because your typical implementation takes a compressed romfs
image, uncompresses it to RAM and copies applications from that RAM image to
a working copy. XIP eliminates that last copy, saving RAM and copy time.
Now, there is an interesting variant on XIP which I haven't yet had time to
play with: you leave the romfs image itself uncompressed, but compress each
file in it individually. Use a common compression symbol table across files
to improve the compression ratio if you want. Now, whenever you want to
access a file you uncompress it on the fly to RAM and execute it there.
Multiple running instances of the same file use a common .text section, which
is of course your XIP. This eliminates storing the entire filesystem in RAM,
and the copy time is only paid on the first access to each file.
I'm running 2.4.31-uc0 which of course does not have sysfs, but honestly in an
embedded system even /proc is not really required. Unless you absolutely
require features in 2.6.x, a big space optimization involves sticking to
2.4.x.
Since I am a militant anti-rebooter, I would take any improvement to reboot
time as a false optimization. Fix the problem that requires the
reboots. :-)
On x86, I generally compile my kernels for i686. Any important part of the
kernel which requires speed (e.g. RAID) seems to already try several
variations on loop execution at runtime to determine the best to use, so I
don't see the real point in selecting absolute bleeding-edge kernel
compilation flags. I have yet to see repeatable, concrete and SIGNIFICANT
results from the ricer crowd who tend to flock to Gentoo, claiming that they
can actually see fractional percentage speed optimizations. I also like the
fact that with a more generic kernel, I can rip out the hard drive and throw
it in another physical system in an emergency and not run into kernel
panics due to overzealous optimization or processor-specific instructions
that shave clock cycles for no real net gain.
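
To illustrate the runtime selection mentioned above: the sketch below times
two variants of the same XOR loop on a scratch buffer and keeps whichever ran
faster. It is not the kernel's RAID code. The names (xor_bytewise,
xor_unrolled4, pick_fastest_xor), the buffer size and the repetition count are
all made up for the example, and clock() is a coarse stand-in for a proper
cycle counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical example types and names; not taken from the kernel. */
typedef void (*xor_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Variant 1: plain byte-at-a-time loop. */
void xor_bytewise(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

/* Variant 2: same result, loop body unrolled four times. */
void xor_unrolled4(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     ^= src[i];
        dst[i + 1] ^= src[i + 1];
        dst[i + 2] ^= src[i + 2];
        dst[i + 3] ^= src[i + 3];
    }
    for (; i < n; i++)      /* leftover bytes */
        dst[i] ^= src[i];
}

/* Time each candidate on the machine we are actually running on and
 * return the fastest. Ties go to the earlier entry. */
xor_fn pick_fastest_xor(void)
{
    static uint8_t dst[4096], src[4096], ref[4096];
    const xor_fn candidates[] = { xor_bytewise, xor_unrolled4 };
    const size_t count = sizeof candidates / sizeof candidates[0];
    xor_fn best = candidates[0];
    clock_t best_ticks = 0;

    memset(src, 0xA5, sizeof src);
    for (size_t c = 0; c < count; c++) {
        /* Sanity check first: a fast but wrong variant is useless. */
        memset(dst, 0x3C, sizeof dst);
        memset(ref, 0x3C, sizeof ref);
        candidates[c](dst, src, sizeof dst);
        xor_bytewise(ref, src, sizeof ref);
        assert(memcmp(dst, ref, sizeof dst) == 0);

        clock_t start = clock();
        for (int rep = 0; rep < 2000; rep++)
            candidates[c](dst, src, sizeof dst);
        clock_t ticks = clock() - start;

        if (c == 0 || ticks < best_ticks) {
            best = candidates[c];
            best_ticks = ticks;
        }
    }
    return best;
}
```

A caller would run pick_fastest_xor() once at startup and use the returned
pointer from then on, so the choice is made on the CPU that is actually
present rather than at compile time.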
I mention this because it's been my experience that achieving significant
speedups or space savings means approaching the problem in a way specific to
the task at hand. You can't just play with a few gcc flags and get a better
system. In my particular embedded system, for example, I achieved a
*significant* speed improvement through the following:
1) multithreading - I broke the main loop up into a CAN thread, a UART thread,
and a timing thread. The main loop does next to nothing, and the processor
is now optimally loaded.
2) optimizing the UART driver - I'm using RS485, which needs to control a bus
driver in addition to just shifting bits out serially, as all devices use a
common bus to communicate. By using the processor's built-in ability to
de-assert RTS after the last byte has been shifted out, I was able to
eliminate not only the need to poll the UART to see if the transmitter shift
and holding registers were empty, but also the need for another UART
interrupt. My RS485 communications were measurably faster, which allowed
more time for everything else.
3) optimizing the CAN driver - This didn't give significant improvements, but
optimizing the interrupt routine allowed me to reduce the queue buffer, which
gave better latency and took a little less memory.
Other minor optimizations:
4) Really looking at the SDRAM timings and tightening them up to the point
where I started getting RAM errors, then backing them off a little. The
faster you can access your system RAM, obviously, the better. This had to be
done once again at the lowest possible temperature, as the SDRAM timings
moved enough there to cause an issue.
5) Optimizing the Flash access time for the same reasons; I started out by
using 15 wait states to ensure there were no issues, but accessing large
tables from Flash was noticeably slow. At 80 MHz, I only needed 9 wait
states to ensure proper timing, and the accesses sped up a little (but it is
still damned slow). Reflashing sped up a little here too, since the
verification could be done more quickly. :-)
6) Ethernet - When the Ethernet's not plugged in, the default retries for
kernel autoconfiguration made the bootup unreasonably slow.
I reduced the DHCP retries from 10 to 2, and the DHCP timeouts from something
like 10s to 1s. This helped the overall user experience, because everyone
wanted to see what this thing did and the long bootups were hard to
explain. :-)
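
The numbers in items 5 and 6 are easy to check with a little arithmetic. The
sketch below assumes that one wait state costs one clock period at the quoted
80 MHz (the post implies this but does not say so), and that a DHCP attempt
simply waits a fixed timeout before the next retry, with no backoff; the real
kernel code may grow the timeout between attempts, which this ignores. Both
helper names are made up for the example, and the 10 s figure is the
"something like 10s" from the text, not a measured value.

```c
#include <assert.h>

/* Time spent in wait states for one Flash access, in nanoseconds.
 * Assumption: one wait state == one period of clock_hz. */
double wait_state_ns(unsigned wait_states, double clock_hz)
{
    assert(clock_hz > 0.0);
    return wait_states * 1e9 / clock_hz;
}

/* Worst-case time spent waiting for DHCP with no cable attached.
 * Assumption: fixed timeout per attempt, no backoff between retries. */
double dhcp_worst_case_s(unsigned retries, double timeout_s)
{
    return retries * timeout_s;
}
```

Under these assumptions, wait_state_ns(15, 80e6) is 187.5 ns and
wait_state_ns(9, 80e6) is 112.5 ns, so each Flash access gets 75 ns shorter.
dhcp_worst_case_s(10, 10.0) is 100 s against 2 s for
dhcp_worst_case_s(2, 1.0).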
There are some areas I could still focus on. There is a hardware EMAC on this
chip that is completely unused at this point, but may come in handy for CRC
calculation. The UARTs can also use DMA, although it will take a significant
rewrite to make use of it, for what I think will be marginal gain. I'd also
like to take a look at the 2.6 kernel to see if m68k is any faster due to all
the general speed improvements the kernel's gone through since 2.4.31. I
could also do some space/speed tradeoffs by utilizing lookup tables, and
maybe doing some loop unrolling, but by and large the device meets its
requirements and anything beyond this is more for self-education.
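
As an illustration of the lookup-table tradeoff mentioned above, here are two
ways of computing the same CRC. The post does not say which CRC the device
uses; the common reflected CRC-32 (polynomial 0xEDB88320) is assumed here, and
the function names are invented for the example. The bitwise version needs no
table but loops eight times per byte; the table version spends 1 KB of RAM
(256 entries of 4 bytes) to do one lookup per byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY 0xEDB88320u  /* reflected CRC-32 polynomial (assumed) */

/* Small and slow: no table, eight shift/XOR steps per input byte. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (CRC32_POLY & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* 256 x 4 bytes = 1 KB traded for speed. */
static uint32_t crc32_lut[256];

/* Fill the table once; each entry is the CRC of a single byte value. */
void crc32_lut_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int bit = 0; bit < 8; bit++)
            c = (c >> 1) ^ (CRC32_POLY & (0u - (c & 1u)));
        crc32_lut[n] = c;
    }
}

/* Larger and faster: one table lookup per input byte. */
uint32_t crc32_table(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    assert(crc32_lut[1] != 0);  /* crc32_lut_init() must have been called */
    for (size_t i = 0; i < len; i++)
        crc = (crc >> 8) ^ crc32_lut[(crc ^ data[i]) & 0xFFu];
    return crc ^ 0xFFFFFFFFu;
}
```

Call crc32_lut_init() once at startup. Both functions return 0xCBF43926 for
the nine ASCII bytes "123456789", the usual check value for this CRC. Whether
the table version is actually faster on a given target is exactly the kind of
thing to measure rather than assume.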
I guess after all of this (very real) development and measurement, the only
real conclusion I can give to you is that optimization needs to be taken on a
case-by-case basis, and profiling is the only real way to tell where you
should be spending your time. Worrying about optimization so early in the
game (your request seems to be a very "high level" question to me) is pretty
futile; if there is a specific area to optimize, it will become glaringly
obvious.
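
A minimal sketch of what "profiling" can mean on a small system: wrap each
subsystem's work in a pair of timestamps, accumulate the time per subsystem,
and look at which one dominates. The section names mirror the threads
described earlier; the struct and function names are hypothetical, and
clock() stands in for whatever timer the target really offers.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical sections, named after the threads described in the post. */
enum section { SEC_CAN, SEC_UART, SEC_TIMING, SEC_COUNT };

struct profile {
    clock_t       ticks[SEC_COUNT];  /* accumulated time per section */
    unsigned long calls[SEC_COUNT];  /* number of measured intervals */
};

/* Record one measured interval for a section. */
void profile_add(struct profile *p, enum section s, clock_t start, clock_t end)
{
    assert(s < SEC_COUNT);
    p->ticks[s] += end - start;
    p->calls[s]++;
}

/* Return the section that has consumed the most time so far. */
enum section profile_hottest(const struct profile *p)
{
    enum section hottest = SEC_CAN;
    for (int s = 1; s < SEC_COUNT; s++)
        if (p->ticks[s] > p->ticks[hottest])
            hottest = (enum section)s;
    return hottest;
}
```

Typical use, with a zero-initialised struct profile, is to take
t0 = clock() before the CAN handling, then call
profile_add(&prof, SEC_CAN, t0, clock()) after it, and likewise for the other
sections. After a representative run, profile_hottest() points at the place
worth spending time on.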
-A.