[kwlug disc.] any folks using tricks for better linux performance?

Andrew Kohlsmith akohlsmith-kwlug at benshaw.com
Sat Aug 25 21:06:34 EDT 2007


On Saturday 25 August 2007 4:06:53 am Robert P. J. Day wrote:
>   i'm perusing some online docs related to speeding up (primarily
> embedded) linux systems, and i'm curious how many people on this list
> are using any of the following in a practical way to either increase
> speed or save space:
>
> 1) running the kernel XIP (execute in place)
> 2) booting without sysfs
> 3) using "kexec" for reboots

I've just finished up the implementation of an industrial embedded Linux 
system.  It's a ColdFire MCF5282 (no MMU) with 32MB SDRAM and 32MB Flash, 
CAN, Ethernet, RS485 and Bluetooth. I've given the details of the system to 
underscore the variety of subsystems and places where optimization was 
evaluated. It's taken about two years of full-time development to take it 
from concept, through several false starts, field testing prototypes, HALT 
and EMI/RFI testing, debugging, protocol conformance and physical case design 
to finished product, and in that time and through all of those processes I 
have come to one conclusion:

There is no such thing as generic optimization.

XIP is, generally speaking, not an optimization in my opinion, neither for 
space nor for speed.  You can't have a compressed filesystem for XIP, so your 
Flash requirements are larger.  Also, Flash is often notoriously slow, 
sometimes so slow that it's unacceptable.  People claim XIP is a space/speed 
improvement because your typical implementation takes a compressed romfs 
image, uncompresses it to RAM and copies applications from that RAM image to 
a working copy.  XIP eliminates that last copy, saving RAM and copy time.

Now, there is an interesting variant on XIP which I haven't yet had time to 
play with: you leave the romfs image as a whole uncompressed, but compress 
each file individually.  Use a common compression symbol table to improve 
compression if you want.  Now, whenever you want to access a file you 
uncompress it on the fly to RAM and execute it there.  Multiple running 
instances of the same file share a common .text section, which is of course 
your XIP.  This eliminates storing the entire filesystem in RAM, and the 
decompress-and-copy cost is only paid on the first access to a file.
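
A minimal sketch of that per-file scheme, in Python purely for illustration 
(nothing here is romfs or kernel code).  It assumes zlib's preset-dictionary 
feature stands in for the "common compression symbol table"; build_image and 
LazyStore are hypothetical names.

```python
import zlib


def build_image(files, shared_dict):
    """Compress each file on its own, priming zlib with a shared dictionary."""
    image = {}
    for name, data in files.items():
        comp = zlib.compressobj(level=9, zdict=shared_dict)
        image[name] = comp.compress(data) + comp.flush()
    return image


class LazyStore:
    """Decompress a file into RAM on first access, then reuse that copy."""

    def __init__(self, image, shared_dict):
        self._image = image          # stands in for the Flash contents
        self._dict = shared_dict
        self._ram = {}               # name -> decompressed bytes
        self.decompressions = 0      # how often the cost was actually paid

    def open(self, name):
        if name not in self._ram:
            dec = zlib.decompressobj(zdict=self._dict)
            self._ram[name] = dec.decompress(self._image[name]) + dec.flush()
            self.decompressions += 1
        return self._ram[name]
```

Opening the same name twice returns the same in-RAM object, which is the 
shared-copy behaviour described above; only files that are actually used 
ever occupy RAM.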

I'm running 2.4.31-uc0 which of course does not have sysfs, but honestly in an 
embedded system even /proc is not really required.  Unless you absolutely 
require features in 2.6.x, a big space optimization involves sticking to 
2.4.x.

Since I am a militant anti-rebooter, I would take any improvement to reboot 
time as a false optimization.  Fix the problem that requires the 
reboots.  :-)

On x86, I generally compile my kernels for i686.  Any important part of the 
kernel which requires speed (e.g. RAID) seems to already try several 
variations on loop execution at runtime to determine the best to use, so I 
don't see the real point in selecting absolute bleeding-edge kernel 
compilation flags.  I have yet to see repeatable, concrete and SIGNIFICANT 
results from the ricer crowd who tend to flock to Gentoo, claiming that they 
can actually see fractional percentage speed optimizations.  I also like the 
fact that with a more generic kernel, I can rip out the hard drive and throw 
it in another physical system in an emergency and not run into kernel 
panics due to overzealous optimization or processor-specific instructions 
that shave clock cycles for no real net gain.
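
The "try several variants at runtime and keep the fastest" idea can be 
sketched as follows.  This is an illustration in Python, not the kernel's 
code; the three checksum functions and pick_fastest are made-up names, and 
which variant wins depends entirely on the machine it runs on.

```python
import time


def checksum_loop(data):
    total = 0
    for b in data:
        total += b
    return total & 0xFFFFFFFF


def checksum_builtin(data):
    return sum(data) & 0xFFFFFFFF


def checksum_chunked(data, chunk=4096):
    total = 0
    for i in range(0, len(data), chunk):
        total += sum(data[i:i + chunk])
    return total & 0xFFFFFFFF


def pick_fastest(candidates, sample, rounds=3):
    """Time every candidate on the sample; return (winner, timings)."""
    timings = {}
    for fn in candidates:
        best = float("inf")
        for _ in range(rounds):
            start = time.perf_counter()
            fn(sample)
            best = min(best, time.perf_counter() - start)
        timings[fn.__name__] = best   # keep the best of several runs
    winner = min(candidates, key=lambda fn: timings[fn.__name__])
    return winner, timings


if __name__ == "__main__":
    sample = bytes(range(256)) * 64
    winner, timings = pick_fastest(
        [checksum_loop, checksum_builtin, checksum_chunked], sample)
    print("using", winner.__name__, timings)
```

The selection happens once at startup on the hardware actually present, 
which is why a generic build loses little in the paths that matter.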

I mention this because it's been my experience that achieving significant 
speedups or space savings means approaching the problem in a way specific to 
the task at hand.  You can't just play with a few gcc flags and get a better 
system.  In my particular embedded system, for example, I achieved a 
*significant* speed improvement through the following:

1) multithreading - I broke the main loop up into a CAN thread, a UART thread, 
and a timing thread.  The main loop does next to nothing, and the processor 
is now optimally loaded.

2) optimizing the UART driver - I'm using RS485, which needs to control a bus 
driver in addition to just shifting bits out serially, as all devices use a 
common bus to communicate.  By using the processor's built-in ability to 
de-assert RTS after the last byte has been shifted out, I was able to 
eliminate not only the need to poll the UART to see if the transmitter shift 
and holding registers were empty, but also the need for another UART 
interrupt.  My RS485 communications were measurably faster, which allowed 
more time for everything else.

3) optimizing the CAN driver - This didn't give significant improvements, but 
optimizing the interrupt routine allowed me to reduce the queue buffer, which 
gave better latency and took a little less memory.
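
The thread split from item 1 can be sketched roughly like this.  It is a 
Python illustration of the structure only, not the actual firmware: the 
handlers are placeholders, and worker/run are hypothetical names.  Each 
worker blocks on its own queue, so an idle subsystem costs no CPU, and the 
main loop just hands out work.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to exit


def worker(inbox, handler, results):
    """Block on the queue, handle items in order, exit on STOP."""
    while True:
        item = inbox.get()
        if item is STOP:
            break
        results.append(handler(item))


def run(can_frames, uart_bytes, ticks):
    # Placeholder handlers; real ones would talk to the drivers.
    specs = {
        "can": (can_frames, lambda frame: ("can", frame)),
        "uart": (uart_bytes, lambda data: ("uart", data)),
        "timer": (ticks, lambda tick: ("tick", tick)),
    }
    inboxes, results, threads = {}, {}, []
    for name, (_, handler) in specs.items():
        inboxes[name] = queue.Queue()
        results[name] = []
        t = threading.Thread(
            target=worker, args=(inboxes[name], handler, results[name]))
        t.start()
        threads.append(t)
    # The "main loop" does next to nothing: distribute work, then wait.
    for name, (items, _) in specs.items():
        for item in items:
            inboxes[name].put(item)
        inboxes[name].put(STOP)
    for t in threads:
        t.join()
    return results
```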

Other minor optimizations:

4) Really looking at the SDRAM timings and tightening them up to the point 
where I started getting RAM errors, then backing them off a little.  The 
faster you can access your system RAM, obviously, the better.  This had to be 
done once again at the lowest possible temperature, as the SDRAM timings 
moved enough with temperature to cause an issue.

5) Optimizing the Flash access time for the same reasons; I started out by 
using 15 wait states to ensure there were no issues, but accessing large 
tables from Flash was noticeably slow.  At 80MHz, I only needed 9 wait 
states to ensure proper timing, and the accesses sped up a little (but it is 
still damned slow).  Reflashing sped up a little here too, since the 
verification could be done more quickly.  :-)

6) Ethernet - When the Ethernet's not plugged in, the default retries for 
kernel autoconfiguration caused the bootup time to be unrealistically slow.  
I reduced the DHCP retries from 10 to 2, and the DHCP timeouts from something 
like 10s to 1s.  This helped the overall user experience, because everyone 
wanted to see what this thing did and the long bootups were hard to 
explain.  :-)
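
Rough arithmetic behind items 5 and 6, as a small Python sketch.  Two 
assumptions, both simplifications: the worst-case DHCP wait is taken as 
retries times timeout with no backoff (using the "something like 10s" figure 
from above), and each wait state is taken to cost one period of an 80MHz 
clock.  The real chip-select timing is not modelled.

```python
def worst_case_dhcp_wait(retries, timeout_s):
    """Upper bound on seconds spent waiting when nothing ever answers."""
    return retries * timeout_s


def wait_state_time_ns(wait_states, clock_hz):
    """Nanoseconds spent in wait states per access (one period each)."""
    period_ns = 1e9 / clock_hz
    return wait_states * period_ns


if __name__ == "__main__":
    print(worst_case_dhcp_wait(10, 10), "s before,",
          worst_case_dhcp_wait(2, 1), "s after")
    print(wait_state_time_ns(15, 80_000_000), "ns before,",
          wait_state_time_ns(9, 80_000_000), "ns after")
```

Under those assumptions the unplugged-cable boot delay drops from about 100s 
to 2s, while the Flash change saves 75ns of wait time per access, which fits 
with the first being obvious to users and the second only "a little" faster.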

There are some areas I could still focus on.  There is a hardware EMAC on this 
chip that is completely unused at this point, but may come in handy for CRC 
calculation.  The UARTs can also use DMA, although it will take a significant 
rewrite to make use of it, for what I think will be marginal gain.  I'd also 
like to take a look at the 2.6 kernel to see if m68k is any faster due to all 
the general speed improvements the kernel's gone through since 2.4.31.  I 
could also do some space/speed tradeoffs by utilizing lookup tables, and 
maybe doing some loop unrolling, but by and large the device meets its 
requirements and anything beyond this is more for self-education.
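
As an illustration of the lookup-table tradeoff, here is a bit-at-a-time CRC 
next to a table-driven one.  CRC-32 with the reflected polynomial 0xEDB88320 
is an assumption chosen because it can be checked against zlib; which CRC 
the device actually uses isn't stated.  Python is used only to show the 
idea.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial, the one zlib.crc32 uses


def crc32_bitwise(data):
    """No table: eight shift/xor steps per input byte."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def make_table():
    """Precompute the eight steps for every possible byte value."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
        table.append(crc)
    return table


TABLE = make_table()  # 256 entries, i.e. 1KB as 32-bit words


def crc32_table(data):
    """One table lookup per input byte instead of eight bit steps."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    msg = b"123456789"
    assert crc32_bitwise(msg) == crc32_table(msg) == zlib.crc32(msg)
```

The table version trades 1KB of memory for roughly an eighth of the inner 
loop work; whether that is worth it depends on where the table has to live.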

I guess after all of this (very real) development and measurement, the only 
real conclusion I can give to you is that optimization needs to be taken on a 
case-by-case basis, and profiling is the only real way to tell where you 
should be spending your time.  Worrying about optimization so early in the 
game (your request seems to be a very "high level" question to me) is pretty 
futile; if there is a specific area to optimize, it will become glaringly 
obvious.
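
A minimal sketch of "measure first", assuming nothing about the tools 
actually used on the project: wrap the suspected sections in timers, 
accumulate, and rank.  SectionTimer is a hypothetical helper, written in 
Python for brevity.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulate wall-clock time per named section of code."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def ranked(self):
        """Sections sorted by accumulated time, biggest first."""
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    timer = SectionTimer()
    for _ in range(100):
        with timer.section("parse"):
            sum(range(1000))
        with timer.section("send"):
            sum(range(10))
    for name, seconds in timer.ranked():
        print(name, seconds)
```

Whatever ends up at the top of that list is the only place worth spending 
optimization time on.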

-A.

