Discussion:
Build time data
Chris Tapp
2012-04-11 20:42:13 UTC
Permalink
Is there a page somewhere that gives a rough idea of how quickly a full build runs on various systems?

I need a faster build platform, but want to get a reasonable price / performance balance ;-)

I'm looking at something like an i7-2700K but am not yet tied...

Chris Tapp

opensource-UUnwy/L99c5Wk0Htik3J/***@public.gmane.org
www.keylevel.com
Autif Khan
2012-04-11 21:19:29 UTC
Permalink
Post by Chris Tapp
Is there a page somewhere that gives a rough idea of how quickly a full build runs on various systems?
I don't think there is a page anywhere.

This is as rough as it gets for the two machines that I have, not
including the time it takes to download the source files.

HP 2.8 GHz Core i7 dual-core hyper-threaded machine with a 5400 rpm disk:

BB_NUMBER_THREADS = "8"
PARALLEL_MAKE = "8"

core-image-minimal: 2.5 hours
core-image-sato: 5 hours
core-image-sdk: 8-10 hours

A build machine: Core i7-3960X 3.3 GHz (6 cores, hyper-threaded), with
an SSD for build output and a 7200 rpm HDD (claiming 6.0 Gbps) for
downloads, poky, and whatever else.

BB_NUMBER_THREADS = "24"
PARALLEL_MAKE = "24"

core-image-minimal: 27 minutes
core-image-sato: 58-62 minutes
core-image-sdk: 110-120 minutes

The OS was always Ubuntu 11.10: Xubuntu on the HP laptop, the server edition on the build machine.
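For reference, both settings live in conf/local.conf. A minimal sketch of how they are typically set, sizing them from the host's core count (this assumes a BitBake recent enough to provide bb.utils.cpu_count(); the values are illustrative, not a recommendation):

# conf/local.conf
# Number of BitBake tasks to run in parallel.
BB_NUMBER_THREADS = "${@bb.utils.cpu_count()}"
# Options passed to make; note that PARALLEL_MAKE takes the full "-j N" form.
PARALLEL_MAKE = "-j ${@bb.utils.cpu_count()}"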
Post by Chris Tapp
I need a faster build platform, but want to get a reasonable price / performance balance ;-)
I'm looking at something like an i7-2700K but am not yet tied...
The build machine cost about $3500 from Tiger Direct / Newegg.

It was well worth it - instead of doing nightly builds, I can now do a
clean build in under one hour.
Bob Cochran
2012-04-11 21:38:00 UTC
Permalink
Post by Chris Tapp
Is there a page somewhere that gives a rough idea of how quickly a full build runs on various systems?
I need a faster build platform, but want to get a reasonable price / performance balance ;-)
I'm looking at something like an i7-2700K but am not yet tied...
Chris Tapp
www.keylevel.com
I haven't seen one, but it would be great to have this on the wiki where
everyone could post what they're seeing & using.

Maybe the autobuilder has some useful statistics
(http://autobuilder.yoctoproject.org:8010/)? Of course, you'll have to
be careful to determine whether anything else was running at the time of
the build.

On a related note, I have been wondering whether I would get the bang
for the buck with an SSD for my build machines. I would guess that
building embedded Linux images isn't a typical use pattern for an SSD. I
wonder if the long write & erase durations for FLASH technology would
show its ugly face during a poky build. I would think that the embedded
micro inside the SSD managing the writes might get taxed to the limit
trying to slice the data. I would appreciate anyone's experience with
SSDs on build machines.
Darren Hart
2012-04-12 00:30:47 UTC
Permalink
Post by Chris Tapp
Is there a page somewhere that gives a rough idea of how quickly a full build runs on various systems?
I need a faster build platform, but want to get a reasonable price / performance balance ;-)
I'm looking at something like an i7-2700K but am not yet tied...
We really do need to get some pages up on this as it comes up a lot.

Currently, Yocto Project builds scale well up to about 12 cores, so the first
step is to get as many cores as you can. Sacrifice some speed for cores
if you have to. If you can do dual-socket, do it. If not, try for a
six-core.

Next up is storage. We read and write a LOT of data. SSDs are one way to
go, but we've been known to chew through them and they aren't priced as
consumables. You can get about 66% of the performance of a single SSD
with a pair of good quality SATA2 or better drives configured in RAID0
(no redundancy). Ideally, you would have your OS and sources on an SSD
and use a RAID0 array to build on. This data is all recreatable, so it's
"OK" if you lose a disk and therefore ALL of your build data.

Now RAM, you will want about 2 GB of RAM per core, with a minimum of 4GB.

Finally, software. Be sure to run a "server" kernel which is optimized
for throughput as opposed to interactivity (like Desktop kernels). This
implies CONFIG_PREEMPT_NONE=y. You'll want a 64-bit kernel to avoid the
performance penalty inherent in 32-bit PAE kernels - and you will want
lots of memory. You can save some IO by mounting your
its-ok-if-i-lose-all-my-data build partition as follows:

/dev/md0    /build    ext4    noauto,noatime,nodiratime,commit=6000

You can also drop the journal when you format the filesystem. Just don't power
off your machine without shutting down properly!
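A minimal sketch of the format-and-mount steps described above, assuming the RAID0 array is /dev/md0 and the mount point is /build (adjust to taste; remember that everything on this filesystem is disposable):

# Create the filesystem without a journal - all data here is recreatable.
sudo mkfs.ext4 -O ^has_journal -L build /dev/md0
sudo mkdir -p /build
# Mount with relaxed write-back so the build largely stays in the page cache.
sudo mount -o noatime,commit=6000 /dev/md0 /build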

That should get you some pretty good build times.

I run on a beast with 12 cores, 48GB of RAM, OS and sources on a G2
Intel SSD, with two Seagate Barracudas in a RAID0 array for my /build
partition. I run a headless Ubuntu 11.10 (x86_64) installation running
the 3.0.0-16-server kernel. I can build core-image-minimal in < 30
minutes and core-image-sato in < 50 minutes from scratch.

Hopefully that gives you some ideas to get started.
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Osier-mixon, Jeffrey
2012-04-12 00:43:32 UTC
Permalink
Excellent topic for a wiki page.
Post by Darren Hart
Post by Chris Tapp
Is there a page somewhere that gives a rough idea of how quickly a full build runs on various systems?
I need a faster build platform, but want to get a reasonable price / performance balance ;-)
I'm looking at something like an i7-2700K but am not yet tied...
We really do need to get some pages up on this as it comes up a lot.
Currently Yocto Project builds scale well up to about 12 Cores, so first
step is to get as many cores as you can. Sacrifice some speed for cores
if you have to. If you can do dual-socket, do it. If not, try for a six
core.
Next up is storage. We read and write a LOT of data. SSDs are one way to
go, but we've been known to chew through them and they aren't priced as
consumables. You can get about 66% of the performance of a single SSD
with a pair of good quality SATA2 or better drives configured in RAID0
(no redundancy). Ideally, you would have your OS and sources on an SSD
and use a RAID0 array to build on. This data is all recreatable, so it's
"OK" if you lose a disk and therefor ALL of your build data.
Now RAM, you will want about 2 GB of RAM per core, with a minimum of 4GB.
Finally, software. Be sure to run a "server" kernel which is optimized
for throughput as opposed to interactivity (like Desktop kernels). This
implies CONFIG_PREEMPT_NONE=y. You'll want a 64-bit kernel to avoid the
performance penalty inherent with 32bit PAE kernels - and you will want
lots of memory. You can save some IO by mounting your
/dev/md0        /build          ext4
noauto,noatime,nodiratime,commit=6000
As well as drop the journal from it when you format it. Just don't power
off your machine without properly shutting down!
That should get you some pretty good build times.
I run on a beast with 12 cores, 48GB of RAM, OS and sources on a G2
Intel SSD, with two Seagate Barracudas in a RAID0 array for my /build
partition. I run a headless Ubuntu 11.10 (x86_64) installation running
the 3.0.0-16-server kernel. I can build core-image-minimal in < 30
minutes and core-image-sato in < 50 minutes from scratch.
Hopefully that gives you some ideas to get started.
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
--
Jeff Osier-Mixon http://jefro.net/blog
Yocto Project Community Manager @Intel http://yoctoproject.org
Bob Cochran
2012-04-12 04:39:36 UTC
Permalink
Post by Darren Hart
SSDs are one way to
go, but we've been known to chew through them and they aren't priced as
consumables.
Hi Darren,

Could you please elaborate on "been known to chew through them"?

Are you running into an upper limit on write / erase cycles? Are you
encountering hard (or soft) failures?

Thanks,

Bob
Darren Hart
2012-04-12 07:10:35 UTC
Permalink
Post by Bob Cochran
Post by Darren Hart
SSDs are one way to
go, but we've been known to chew through them and they aren't priced as
consumables.
Hi Darren,
Could you please elaborate on "been known to chew through them"?
Are you running into an upper limit on write / erase cycles? Are you
encountering hard (or soft) failures?
Some have reported early physical disk failure. Due to the cost of SSDs,
not a lot of people seem to be trying it out. I *believe* the current
generation of SSDs would perform admirably, but I haven't tested that. I
know Deny builds with SSDs, perhaps he would care to comment?
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Joshua Immanuel
2012-04-12 07:35:00 UTC
Permalink
Darren,
Post by Darren Hart
I run on a beast with 12 cores, 48GB of RAM, OS and sources on a G2
Intel SSD, with two Seagate Barracudas in a RAID0 array for my /build
partition. I run a headless Ubuntu 11.10 (x86_64) installation running
the 3.0.0-16-server kernel. I can build core-image-minimal in < 30
minutes and core-image-sato in < 50 minutes from scratch.
wow. Can I get a shell? :D
--
Joshua Immanuel
HiPro IT Solutions Private Limited
http://hipro.co.in
Martin Jansa
2012-04-12 08:00:22 UTC
Permalink
Post by Bob Cochran
Darren,
Post by Darren Hart
I run on a beast with 12 cores, 48GB of RAM, OS and sources on a G2
Intel SSD, with two Seagate Barracudas in a RAID0 array for my /build
partition. I run a headless Ubuntu 11.10 (x86_64) installation running
the 3.0.0-16-server kernel. I can build core-image-minimal in < 30
minutes and core-image-sato in < 50 minutes from scratch.
Why not use some of that RAM for WORKDIR in tmpfs? I bought 16GB just to be
able to do my builds in tmpfs and keep only the more permanent data on RAID.
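A minimal sketch of that setup, assuming a machine with RAM to spare; the size and paths are illustrative:

# Mount a tmpfs to hold the build output (size it well below physical RAM;
# the contents disappear on reboot).
sudo mkdir -p /mnt/oe-tmpfs
sudo mount -t tmpfs -o size=12G tmpfs /mnt/oe-tmpfs

# conf/local.conf - point the build output at the tmpfs.
TMPDIR = "/mnt/oe-tmpfs/tmp"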

Cheers,
--
Martin 'JaMa' Jansa jabber: Martin.Jansa-***@public.gmane.org
Joshua Immanuel
2012-04-12 09:36:25 UTC
Permalink
Post by Martin Jansa
Post by Darren Hart
I run on a beast with 12 cores, 48GB of RAM, OS and sources on
a G2 Intel SSD, with two Seagate Barracudas in a RAID0 array for
my /build partition. I run a headless Ubuntu 11.10 (x86_64)
installation running the 3.0.0-16-server kernel. I can build
core-image-minimal in < 30 minutes and core-image-sato in < 50
minutes from scratch.
why not use so much RAM for WORKDIR in tmpfs? I bought 16GB just to be
able to do my builds in tmpfs and keep only more permanent data on RAID.
+1

I tried using tmpfs for WORKDIR on my T420, which has 8GB of RAM. (In
India, the largest single-slot DDR3 module we can get is 4GB.) Obviously,
this is not sufficient :( Maybe I shouldn't use the laptop for build
purposes.

Moreover, every time I build an image with Yocto, the temperature peaks at
87 degrees Celsius. I'm hoping my HDD doesn't die.
--
Joshua Immanuel
HiPro IT Solutions Private Limited
http://hipro.co.in
Darren Hart
2012-04-12 14:12:56 UTC
Permalink
Post by Martin Jansa
Post by Bob Cochran
Darren,
Post by Darren Hart
I run on a beast with 12 cores, 48GB of RAM, OS and sources on
a G2 Intel SSD, with two Seagate Barracudas in a RAID0 array
for my /build partition. I run a headless Ubuntu 11.10 (x86_64)
installation running the 3.0.0-16-server kernel. I can build
core-image-minimal in < 30 minutes and core-image-sato in < 50
minutes from scratch.
why not use so much RAM for WORKDIR in tmpfs? I bought 16GB just to
be able to do my builds in tmpfs and keep only more permanent data
on RAID.
We've done some experiments with tmpfs, adding Beth on CC. If I recall
correctly, my RAID0 array with the mount options I specified
accomplishes much of what tmpfs does for me without the added setup.
With a higher commit interval, the kernel doesn't try to sync the
dcache with the disks as frequently (e.g. not even once during a build),
so it's effectively writing to memory (although there is still plenty
of IO occurring).

The other reason is that while 48GB is plenty for a single build, I
often run many builds in parallel, sometimes in virtual machines when
I need to reproduce or test something on different hosts.

For example:
https://picasaweb.google.com/lh/photo/7PCrqXQqxL98SAY1ecNzDdMTjNZETYmyPJy0liipFm0?feat=directlink


--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Flanagan, Elizabeth
2012-04-12 23:37:00 UTC
Permalink
Post by Martin Jansa
Post by Bob Cochran
Darren,
Post by Darren Hart
I run on a beast with 12 cores, 48GB of RAM, OS and sources on
a G2 Intel SSD, with two Seagate Barracudas in a RAID0 array
for my /build partition. I run a headless Ubuntu 11.10 (x86_64)
installation running the 3.0.0-16-server kernel. I can build
core-image-minimal in < 30 minutes and core-image-sato in < 50
minutes from scratch.
why not use so much RAM for WORKDIR in tmpfs? I bought 16GB just to
be able to do my builds in tmpfs and keep only more permanent data
on RAID.
We've done some experiments with tmpfs, adding Beth on CC. If I recall
correctly, my RAID0 array with the mount options I specified
accomplishes much of what tmpfs does for me without the added setup.
This should be the case in general. For the most part, if you have a decent
RAID setup (we're using RAID10 on the autobuilder) with fast disks, you should
be able to hit tmpfs speed (or close to it). I've done some experiments with
this, and what I found was maybe a 5-minute difference, sometimes, on a
clean build between tmpfs and RAID10.

I discussed this during Yocto Developer Day. Let me boil it down a bit to
explain some of what I did on the autobuilders.

One caveat first, though: I would avoid treating autobuilder times as
representative of optimal Yocto build times. The autobuilder hosts a lot of
different services that sometimes impact build time, and this can vary
depending on what else is going on on the machine.

There are four places, in general, where you want to look at optimizing,
outside of dependency issues: CPU, disk, memory, and the build process. What I
found was that the most useful of these in getting the autobuilder time
down were disk and the build process.

With disk, spreading it across the RAID saved us not only a bit of time,
but also helped us avoid trashed disks. More disk thrash == higher failure
rate. So far this year we've seen two disk failures that have resulted in
almost zero autobuilder downtime.

The real time saver, however, ended up being maintaining sstate across build
runs. Even with our sstate on NFS, we're still seeing a dramatic decrease
in build time.
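For anyone wanting to reuse sstate the same way, a minimal local.conf sketch; the shared path is illustrative and the mirror line follows the usual SSTATE_MIRRORS convention:

# conf/local.conf
# Write shared-state objects to a shared location...
SSTATE_DIR = "/nfs/yocto/sstate-cache"
# ...and/or pull prebuilt objects from one or more mirrors.
SSTATE_MIRRORS = "file://.* file:///nfs/yocto/sstate-cache/PATH"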

I would be interested in seeing what times you get with tmpfs. I've done
tmpfs builds before and have seen good results, but the best bang for the
buck ended up being a RAID array.


With a higher commit interval, the kernel doesn't try to sync the
dcache with the disks as frequently (eg not even once during a build),
so it's effectively writing to memory (although there is still plenty
of IO occurring).
The other reason is that while 48GB is plenty for a single build, I
often run many builds in parallel, sometimes in virtual machines when
I need to reproduce or test something on different hosts.
https://picasaweb.google.com/lh/photo/7PCrqXQqxL98SAY1ecNzDdMTjNZETYmyPJy0liipFm0?feat=directlink
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
--
Elizabeth Flanagan
Yocto Project
Build and Release
Martin Jansa
2012-04-13 05:51:51 UTC
Permalink
Post by Flanagan, Elizabeth
Post by Martin Jansa
Post by Bob Cochran
Darren,
Post by Darren Hart
I run on a beast with 12 cores, 48GB of RAM, OS and sources on
a G2 Intel SSD, with two Seagate Barracudas in a RAID0 array
for my /build partition. I run a headless Ubuntu 11.10 (x86_64)
installation running the 3.0.0-16-server kernel. I can build
core-image-minimal in < 30 minutes and core-image-sato in < 50
minutes from scratch.
why not use so much RAM for WORKDIR in tmpfs? I bought 16GB just to
be able to do my builds in tmpfs and keep only more permanent data
on RAID.
We've done some experiments with tmpfs, adding Beth on CC. If I recall
correctly, my RAID0 array with the mount options I specified
accomplishes much of what tmpfs does for me without the added setup.
This should be the case in general. For the most part, if you have a decent
RAID setup (We're using RAID10 on the ab) with fast disks you should be
able to hit tmpfs speed (or close to it). I've done some experiments with
this and what I found was maybe a 5 minute difference, sometimes, from a
clean build between tmpfs and RAID10.
5 minutes on a very small image like core-image-minimal (30 min) is 1/6 of
that time :)..

I have much bigger images and an even bigger ipk feed, so rebuilding from
scratch takes about 24 hours for one architecture..

And my system is very slow compared to yours; I found my measurement
of core-image-minimal-with-mtdutils was around 95 mins
http://patchwork.openembedded.org/patch/17039/
but that was with a Phenom II X4 965, 4GB RAM, RAID0 (3 SATA2 disks) for
WORKDIR, and RAID5 (the same 3 SATA2 disks) for BUILDDIR (both as mdraid).
Now I have a Bulldozer AMD FX(tm)-8120 and 16GB RAM, still the same RAID0
but a different motherboard..

The problem with tmpfs is that no amount of RAM is big enough to build the
whole feed in one go, so I have to build in steps (e.g. bitbake gcc for all
machines with the same architecture, then clean up WORKDIR and switch to
another arch, then bitbake small-image, bigger-image, qt4-x11-free, ...).
qt4-x11-free is able to eat 15GB of tmpfs almost completely.
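A rough sketch of that stepped approach as a small wrapper script; the machine and image names are placeholders, and it assumes MACHINE is whitelisted from the environment, as the standard oe-init-build-env setup allows:

#!/bin/sh
# Build per-architecture toolchains first, then images, so the tmpfs-backed
# WORKDIR never has to hold everything at once.
for m in machine-a machine-b; do
    MACHINE=$m bitbake gcc
done
for m in machine-a machine-b; do
    MACHINE=$m bitbake small-image bigger-image
done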
Post by Flanagan, Elizabeth
I discussed this during Yocto Developer Day. Let me boil it down a bit to
explain some of what I did on the autobuilders.
Caveat first though. I would avoid using autobuilder time as representative
of prime yocto build time. The autobuilder hosts a lot of different
services that sometimes impact build time and this can vary depending on
what else is going on on the machine.
There are four places, in general, where you want to look at optimizing
outside of dependency issues. CPU, disk, memory, build process. What I
found was that the most useful of these in getting the autobuilder time
down was disk and build process.
With disk, spreading it across the RAID saved us not only a bit of time,
but also helped us avoid trashed disks. More disk thrash == higher failure
rate. So far this year we've seen two disk failures that have resulted in
almost zero autobuilder downtime.
True for RAID10, but for WORKDIR itself RAID0 is cheaper, and even with a
higher failure rate that's not a big issue for WORKDIR.. you just have to
cleansstate the tasks that were hit in the middle of a build..
Post by Flanagan, Elizabeth
The real time saver however ended up being maintaining sstate across build
runs. Even with our sstate on nfs, we're still seeing a dramatic decrease
in build time.
I would be interested in seeing what times you get with tmpfs. I've done
tmpfs builds before and have seen good results, but bang for the buck did
end up being a RAID array.
I'll check if core-image-minimal can be built with just 15GB of tmpfs;
otherwise I would have to build it in 2 steps and the time won't be
precise.
Post by Flanagan, Elizabeth
With a higher commit interval, the kernel doesn't try to sync the
dcache with the disks as frequently (eg not even once during a build),
so it's effectively writing to memory (although there is still plenty
of IO occurring).
The other reason is that while 48GB is plenty for a single build, I
often run many builds in parallel, sometimes in virtual machines when
I need to reproduce or test something on different hosts.
https://picasaweb.google.com/lh/photo/7PCrqXQqxL98SAY1ecNzDdMTjNZETYmyPJy0liipFm0?feat=directlink
--
Martin 'JaMa' Jansa jabber: Martin.Jansa-***@public.gmane.org
Darren Hart
2012-04-13 06:08:19 UTC
Permalink
Post by Martin Jansa
And my system is very slow compared to yours, I've found my
measurement of core-image-minimal-with-mtdutils around 95 mins
http://patchwork.openembedded.org/patch/17039/ but this was with
Phenom II X4 965, 4GB RAM, RAID0 (3 SATA2 disks) for WORKDIR, RAID5
(the same 3 SATA2 disks) BUILDDIR (raid as mdraid), now I have
Bulldozer AMD FX(tm)-8120, 16GB RAM, still the same RAID0 but
different motherboard..
Why RAID5 for BUILDDIR? The write overhead of RAID5 is very high. The
capacity savings RAID5 affords you is more significant with more disks, but
with 3 disks it's only 1 disk better than RAID10, with a lot more overhead.

I spent some time outlining all this a while back:
http://www.dvhart.com/2011/03/qnap_ts419p_configuration_raid_levels_and_throughput/

Here's the relevant bit:

"RAID 5 distributes parity across all the drives in the array, this
parity calculation is both compute intensive and IO intensive. Every
write requires the parity calculation, and data must be written to
every drive."



--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Martin Jansa
2012-04-13 06:38:02 UTC
Permalink
Post by Martin Jansa
And my system is very slow compared to yours, I've found my
measurement of core-image-minimal-with-mtdutils around 95 mins
http://patchwork.openembedded.org/patch/17039/ but this was with
Phenom II X4 965, 4GB RAM, RAID0 (3 SATA2 disks) for WORKDIR, RAID5
(the same 3 SATA2 disks) BUILDDIR (raid as mdraid), now I have
Bulldozer AMD FX(tm)-8120, 16GB RAM, still the same RAID0 but
different motherboard..
Why RAID5 for BUILDDIR? The write overhead of RAID5 is very high. The
savings RAID5 alots you is more significant with more disks, but with
3 disks it's only 1 disk better than RAID10, with a lot more overhead.
Because RAID10 needs at least 4 drives, all my SATA ports are
already used, and it's on my /home partition.. please note that this
is not some company build server, just my desktop on which I happen to do
a lot of builds for a community distribution for smartphones:
http://shr-project.org

The server we have available for builds is _much_ slower than this,
especially in IO (it's a virtualized host on a busy server), but it has much
better network bandwidth.. :).

Cheers,
http://www.dvhart.com/2011/03/qnap_ts419p_configuration_raid_levels_and_throughput/
"RAID 5 distributes parity across all the drives in the array, this
parity calculation is both compute intensive and IO intensive. Every
write requires the parity calculation, and data must be written to
every drive."
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
--
Martin 'JaMa' Jansa jabber: Martin.Jansa-***@public.gmane.org
Wolfgang Denk
2012-04-13 07:24:33 UTC
Permalink
Dear Darren Hart,
Post by Darren Hart
Post by Martin Jansa
Phenom II X4 965, 4GB RAM, RAID0 (3 SATA2 disks) for WORKDIR, RAID5
(the same 3 SATA2 disks) BUILDDIR (raid as mdraid), now I have
Bulldozer AMD FX(tm)-8120, 16GB RAM, still the same RAID0 but
different motherboard..
Why RAID5 for BUILDDIR? The write overhead of RAID5 is very high. The
savings RAID5 alots you is more significant with more disks, but with
3 disks it's only 1 disk better than RAID10, with a lot more overhead.
Indeed, RAID5 with just 3 devices makes little sense - especially
when running on the same drives as the RAID0 workdir.
Post by Darren Hart
http://www.dvhart.com/2011/03/qnap_ts419p_configuration_raid_levels_and_throughput/
Well, such data from a 4-spindle array doesn't tell much. When you
are asking for I/O performance on RAID arrays, you want to distribute the
load over _many_ spindles. Do your comparisons on an 8 or 16 (or more)
spindle setup, and the results will be much different. Also, your
test of copying huge files is just one usage mode: strictly
sequential access. What we see with OE / Yocto builds is
completely different: here you will see a huge number of small and
even tiny data transfers.

"Classical" recommendations for performance optimization od RAID
arrays (which are usually tuning for such big, sequentuial accesses
only) like using big stripe sizes and huge read-ahead etc. turn out
to be counter-productive here. But it makes no sense to have for
example a stripe size of 256 kB or more when 95% or more of your disk
accesses write less than 4 kB only.
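The stripe (chunk) size is chosen when the array is created; a sketch of picking a smaller chunk for this kind of many-tiny-writes workload (the 64 KiB value and device names are illustrative, not a tuned recommendation):

# Smaller chunks tend to suit workloads dominated by small writes.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=3 --chunk=64 /dev/sdb /dev/sdc /dev/sdd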
Post by Darren Hart
"RAID 5 distributes parity across all the drives in the array, this
parity calculation is both compute intensive and IO intensive. Every
write requires the parity calculation, and data must be written to
every drive."
But did you look at a real system? I never found the CPU load of the
parity calculations to be a bottleneck. I'd rather have the CPU spend
cycles on computing parity instead of running with all cores idle
because it's waiting for I/O to complete. I found that for the
workloads we have (software builds like Yocto etc.) a multi-spindle
software RAID array outperforms all other solutions (and especially
the h/w RAID controllers I have had access to so far - these don't even
come close to the same number of IOPS).

Oh, and BTW: if you care about reliability, then don't use RAID5.
Go for RAID6. Yes, it's more expensive, but it's also much less
painful when you have to rebuild the array after a disk failure.
I've seen too many cases where a second disk failed during the
rebuild to ever go with RAID5 for big systems again - restoring
several TB of data from tape is no fun.

See also the RAID wiki for specific performance optimizations for such
RAID arrays.

Best regards,

Wolfgang Denk
--
DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd-***@public.gmane.org
Never put off until tomorrow what you can put off indefinitely.
Martin Jansa
2012-04-17 15:29:23 UTC
Permalink
Post by Martin Jansa
Post by Flanagan, Elizabeth
Post by Martin Jansa
Post by Bob Cochran
Darren,
Post by Darren Hart
I run on a beast with 12 cores, 48GB of RAM, OS and sources on
a G2 Intel SSD, with two Seagate Barracudas in a RAID0 array
for my /build partition. I run a headless Ubuntu 11.10 (x86_64)
installation running the 3.0.0-16-server kernel. I can build
core-image-minimal in < 30 minutes and core-image-sato in < 50
minutes from scratch.
why not use so much RAM for WORKDIR in tmpfs? I bought 16GB just to
be able to do my builds in tmpfs and keep only more permanent data
on RAID.
We've done some experiments with tmpfs, adding Beth on CC. If I recall
correctly, my RAID0 array with the mount options I specified
accomplishes much of what tmpfs does for me without the added setup.
This should be the case in general. For the most part, if you have a decent
RAID setup (We're using RAID10 on the ab) with fast disks you should be
able to hit tmpfs speed (or close to it). I've done some experiments with
this and what I found was maybe a 5 minute difference, sometimes, from a
clean build between tmpfs and RAID10.
5 minutes on very small image like core-image-minimal (30 min) is 1/6 of
that time :)..
I have much bigger images and even bigger ipk feed, so to rebuild from
scratch takes about 24 hours for one architecture..
And my system is very slow compared to yours, I've found my measurement
of core-image-minimal-with-mtdutils around 95 mins
http://patchwork.openembedded.org/patch/17039/
but this was with Phenom II X4 965, 4GB RAM, RAID0 (3 SATA2 disks) for
WORKDIR, RAID5 (the same 3 SATA2 disks) BUILDDIR (raid as mdraid), now
I have Bulldozer AMD FX(tm)-8120, 16GB RAM, still the same RAID0 but
different motherboard..
Problem with tmpfs is that no RAM is big enough to build whole feed in
one go, so I have to build in steps (e.g. bitbake gcc for all machines
with the same architecture, then cleanup WORKDIR and switch to another
arch, then bitbake small-image, bigger-image, qt4-x11-free, ...).
qt4-x11-free is able to eat 15GB tmpfs almost completely.
Post by Flanagan, Elizabeth
I discussed this during Yocto Developer Day. Let me boil it down a bit to
explain some of what I did on the autobuilders.
Caveat first though. I would avoid using autobuilder time as representative
of prime yocto build time. The autobuilder hosts a lot of different
services that sometimes impact build time and this can vary depending on
what else is going on on the machine.
There are four places, in general, where you want to look at optimizing
outside of dependency issues. CPU, disk, memory, build process. What I
found was that the most useful of these in getting the autobuilder time
down was disk and build process.
With disk, spreading it across the RAID saved us not only a bit of time,
but also helped us avoid trashed disks. More disk thrash == higher failure
rate. So far this year we've seen two disk failures that have resulted in
almost zero autobuilder downtime.
True for RAID10, but for WORKDIR itself RAID0 is cheeper and even higher
failure rate it's not big issue for WORKDIR.. just have to cleansstate
tasks which were in hit in the middle of build..
Post by Flanagan, Elizabeth
The real time saver however ended up being maintaining sstate across build
runs. Even with our sstate on nfs, we're still seeing a dramatic decrease
in build time.
I would be interested in seeing what times you get with tmpfs. I've done
tmpfs builds before and have seen good results, but bang for the buck did
end up being a RAID array.
I'll check if core-image-minimal can be built with just 15GB tmpfs,
otherwise I would have to build it in 2 steps and the time wont be
precise.
It was enough with rm_work, so here are my results:

The difference is much smaller than I expected, but again these are
very small images (next time I'll try to do just qt4 builds).

Fastest is TMPDIR on tmpfs (BUILDDIR is not important - the times are the
same with BUILDDIR in tmpfs and on a SATA2 disk).

raid0 is only about 4% slower.

A single SATA2 disk is slowest, but only a bit slower than raid5; that
could be caused by bug #2314, as I had to run the build twice..

And all the times are just from the first successful build; it could be
different with an average over 10 builds..
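For reference, rm_work (which deletes each recipe's work directory once the recipe has been built, keeping TMPDIR small enough to fit in tmpfs) is enabled with a single line in local.conf:

# conf/local.conf
INHERIT += "rm_work"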

And all builds were on:
AMD FX(tm)-8120 Eight-Core Processor
16G DDR3-1600 RAM
standalone SATA2 disk ST31500341AS
mdraid on 3 older SATA2 disks HDS728080PLA380

bitbake:
commit 4219e2ea033232d95117211947b751bdb5efafd4
Author: Saul Wold <sgw-VuQAYsv1563Yd54FQh9/***@public.gmane.org>
Date: Tue Apr 10 17:57:15 2012 -0700

openembedded-core:
commit 4396db54dba4afdb9f1099f4e386dc25c76f49fb
Author: Richard Purdie <richard.purdie-hQyY1W1yCW8ekmWlsbkhG0B+***@public.gmane.org>
Date: Sat Apr 14 23:42:16 2012 +0100
+ fix for opkg-utils, so that package-index doesn't take ages to complete

BUILDDIR = 1 SATA2 disk
TMPDIR = tmpfs

real 84m32.995s
user 263m46.316s
sys 48m26.376s

BUILDDIR = tmpfs
TMPDIR = tmpfs

real 84m10.528s
user 264m16.144s
sys 50m21.853s

BUILDDIR = raid5
TMPDIR = raid5

real 91m20.470s
user 263m47.156s
sys 52m23.400s

BUILDDIR = raid0
TMPDIR = raid0

real 87m29.526s
user 263m0.799s
sys 51m37.242s

BUILDDIR = 1 SATA2 disk
TMPDIR = the same SATA2 disk

Summary: 1 task failed:
/OE/oe-core/openembedded-core/meta/recipes-core/eglibc/eglibc_2.15.bb,
do_compile
Summary: There was 1 ERROR message shown, returning a non-zero exit
code.
see https://bugzilla.yoctoproject.org/show_bug.cgi?id=2314

real 48m23.412s
user 163m55.082s
sys 23m26.990s
+
touch
oe-core/tmp-eglibc/work/x86_64-oe-linux/eglibc-2.15-r6+svnr17386/eglibc-2_15/libc/Makerules
+
Summary: There were 6 WARNING messages shown.

real 44m13.401s
user 92m44.427s
sys 27m38.347s

=

real 92m36.813s
user 256m39.509s
sys 51m05.337s
--
Martin 'JaMa' Jansa jabber: Martin.Jansa-***@public.gmane.org
Björn Stenberg
2012-04-12 14:08:02 UTC
Permalink
Post by Darren Hart
/dev/md0 /build ext4
noauto,noatime,nodiratime,commit=6000
A minor detail: 'nodiratime' is a subset of 'noatime', so there is no need to specify both.
Post by Darren Hart
I run on a beast with 12 cores, 48GB of RAM, OS and sources on a G2
Intel SSD, with two Seagate Barracudas in a RAID0 array for my /build
partition. I run a headless Ubuntu 11.10 (x86_64) installation running
the 3.0.0-16-server kernel. I can build core-image-minimal in < 30
minutes and core-image-sato in < 50 minutes from scratch.
I'm guessing those are rather fast cores? I build on a different type of beast: 64 cores at 2.1GHz and 128 GB of RAM. The OS is on a single SSD and the build dir (and sources) is on a RAID0 array of Intel 520 SSDs. The kernel is the same Ubuntu 3.0.0-16-server as yours.

Yet for all the combined horsepower, I am unable to match your time of 30 minutes for core-image-minimal. I clock in at around 37 minutes for a qemux86-64 build with ipk output:

------
NOTE: Tasks Summary: Attempted 1363 tasks of which 290 didn't need to be rerun and all succeeded.

real 36m32.118s
user 214m39.697s
sys 108m49.152s
------

These numbers also show that my build is running at less than 9x realtime, indicating that 80% of my cores sit idle most of the time. This confirms what "ps xf" says during the builds: only rarely is bitbake running more than a handful of tasks at once, even with BB_NUMBER_THREADS at 64. And many of these tasks are in turn running sequential loops on a single core.

I'm hoping to find time soon to look deeper into this issue and suggest remedies. It is my distinct feeling that we should be able to build significantly faster on powerful machines.
--
Björn
Darren Hart
2012-04-12 14:34:08 UTC
Permalink
Post by Björn Stenberg
Post by Darren Hart
/dev/md0 /build ext4
noauto,noatime,nodiratime,commit=6000
A minor detail: 'nodiratime' is a subset of 'noatime', so there is no need to specify both.
Excellent, thanks for the tip.
Post by Björn Stenberg
Post by Darren Hart
I run on a beast with 12 cores, 48GB of RAM, OS and sources on a
G2 Intel SSD, with two Seagate Barracudas in a RAID0 array for my
/build partition. I run a headless Ubuntu 11.10 (x86_64)
installation running the 3.0.0-16-server kernel. I can build
core-image-minimal in < 30 minutes and core-image-sato in < 50
minutes from scratch.
I'm guessing those are rather fast cores?
I build on a different type
of beast: 64 cores at 2.1GHz and 128 GB ram. The OS is on a single
SSD and the build dir (and sources) is on a RAID0 array of Intel 520
SSDs. Kernel is the same ubuntu 3.0.0-16-server as yours.
Now that I think about it, my downloads are on the RAID0 array too.

One thing that comes to mind is the parallel settings, BB_NUMBER_THREADS
and PARALLEL_MAKE. I noticed a negative impact if I increased these
beyond 12 and 14 respectively. I tested this with bb-matrix
(scripts/contrib/bb-perf/bb-matrix.sh). The script is a bit fickle, but
can provide useful results and killer 3D surface plots of build time
with BB and PM on the axes. I can't seem to find a plot image at the
moment for some reason...
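If bb-matrix proves too fickle, the same kind of sweep can be hand-rolled; a rough sketch (the image name and value ranges are illustrative, auto.conf is read alongside local.conf if present, and each data point is a full from-scratch build, so wiping sstate-cache makes this very slow):

#!/bin/sh
# Time clean builds across a grid of BB_NUMBER_THREADS / PARALLEL_MAKE values.
IMAGE=core-image-minimal
for bb in 8 12 16; do
    for pm in 8 12 16; do
        rm -rf tmp sstate-cache                     # force a from-scratch build
        echo "BB_NUMBER_THREADS = \"$bb\"" >  conf/auto.conf
        echo "PARALLEL_MAKE = \"-j $pm\""  >> conf/auto.conf
        /usr/bin/time -f "$bb $pm %e" -a -o bb-pm-times.txt bitbake $IMAGE
    done
done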
Post by Björn Stenberg
Yet for all the combined horsepower, I am unable to match your time
of 30 minutes for core-image-minimal. I clock in at around 37 minutes
------ NOTE: Tasks Summary: Attempted 1363 tasks of which 290 didn't
need to be rerun and all succeeded.
real 36m32.118s user 214m39.697s sys 108m49.152s ------
These numbers also show that my build is running less than 9x
realtime, indicating that 80% of my cores sit idle most of the time.
Yup, that sounds about right. The build has a linear component to it,
and anything above about 12 just doesn't help. In fact the added
scheduling overhead seems to hurt.
Post by Björn Stenberg
This confirms what "ps xf" says during the builds: Only rarely is
bitbake running more than a handful tasks at once, even with
BB_NUMBER_THREADS at 64. And many of these tasks are in turn running
sequential loops on a single core.
I'm hoping to find time soon to look deeper into this issue and
suggest remedies. It my distinct feeling that we should be able to
build significantly faster on powerful machines.
Reducing the dependency chains that result in the linear component of
the build (forcing serialized execution) is one place we've focused, and
could probably still use some attention. CC'ing RP as he's done a lot there.
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Chris Tapp
2012-04-12 22:43:12 UTC
Permalink
Post by Darren Hart
Post by Björn Stenberg
Post by Darren Hart
/dev/md0 /build ext4
noauto,noatime,nodiratime,commit=6000
A minor detail: 'nodiratime' is a subset of 'noatime', so there is no
need to specify both.
Excellent, thanks for the tip.
Post by Björn Stenberg
Post by Darren Hart
I run on a beast with 12 cores, 48GB of RAM, OS and sources on a
G2 Intel SSD, with two Seagate Barracudas in a RAID0 array for my
/build partition. I run a headless Ubuntu 11.10 (x86_64)
installation running the 3.0.0-16-server kernel. I can build
core-image-minimal in < 30 minutes and core-image-sato in < 50
minutes from scratch.
I'm guessing those are rather fast cores?
Nice, but well out of my budget - I've got to make do with what one of your CPUs costs for the whole system ;-)
Post by Darren Hart
Post by Björn Stenberg
I build on a different type
of beast: 64 cores at 2.1GHz and 128 GB ram. The OS is on a single
SSD and the build dir (and sources) is on a RAID0 array of Intel 520
SSDs. Kernel is the same ubuntu 3.0.0-16-server as yours.
Now that I think about it, my downloads are on the RAID0 array too.
One thing that comes to mind is the parallel settings, BB_NUMBER_THREADS
and PARALLEL_MAKE. I noticed a negative impact if I increased these
beyond 12 and 14 respectively. I tested this with bb-matrix
(scripts/contrib/bb-perf/bb-matrix.sh). The script is a bit fickle, but
can provide useful results and killer 3D surface plots of build time
with BB and PM on the axis. Can't seem to find a plot image at the
moment for some reason...
Post by Björn Stenberg
Yet for all the combined horsepower, I am unable to match your time
of 30 minutes for core-image-minimal. I clock in at around 37 minutes
------ NOTE: Tasks Summary: Attempted 1363 tasks of which 290 didn't
need to be rerun and all succeeded.
real 36m32.118s user 214m39.697s sys 108m49.152s ------
These numbers also show that my build is running less than 9x
realtime, indicating that 80% of my cores sit idle most of the time.
Yup, that sounds about right. The build has a linear component to it,
and anything above about 12 just doesn't help. In fact the added
scheduling overhead seems to hurt.
Post by Björn Stenberg
This confirms what "ps xf" says during the builds: Only rarely is
bitbake running more than a handful tasks at once, even with
BB_NUMBER_THREADS at 64. And many of these tasks are in turn running
sequential loops on a single core.
I'm hoping to find time soon to look deeper into this issue and
suggest remedies. It my distinct feeling that we should be able to
build significantly faster on powerful machines.
Reducing the dependency chains that result in the linear component of
the build (forcing serialized execution) is one place we've focused, and
could probably still use some attention. CC'ing RP as he's done a lot there.
Current plan for a 'budget' system is:

DX79TO motherboard, i7 3820, 16GB RAM, a pair of 60GB OCZ Vertex III's in RAID-0 for downloads / build, SATA HD for OS (Ubuntu 11.10 x86_64).

That'll give me a 2.7x boost just on CPU and the SSDs (and maybe some over-clocking) will give some more.

Not sure if SSDs in RAID-0 will give any boost, so I'll run some tests.

Thanks to all for the comments in this thread.

Chris Tapp

opensource-UUnwy/L99c5Wk0Htik3J/***@public.gmane.org
www.keylevel.com
Darren Hart
2012-04-12 22:56:24 UTC
Permalink
Post by Chris Tapp
Post by Darren Hart
Post by Björn Stenberg
Post by Darren Hart
/dev/md0 /build ext4
noauto,noatime,nodiratime,commit=6000
A minor detail: 'nodiratime' is a subset of 'noatime', so there is no
need to specify both.
Excellent, thanks for the tip.
Post by Björn Stenberg
Post by Darren Hart
I run on a beast with 12 cores, 48GB of RAM, OS and sources on a
G2 Intel SSD, with two Seagate Barracudas in a RAID0 array for my
/build partition. I run a headless Ubuntu 11.10 (x86_64)
installation running the 3.0.0-16-server kernel. I can build
core-image-minimal in < 30 minutes and core-image-sato in < 50
minutes from scratch.
I'm guessing those are rather fast cores?
Nice, but well out of my budget - I've got to make do with what one of your CPUs costs for the whole system ;-)
Post by Darren Hart
Post by Björn Stenberg
I build on a different type
of beast: 64 cores at 2.1GHz and 128 GB ram. The OS is on a single
SSD and the build dir (and sources) is on a RAID0 array of Intel 520
SSDs. Kernel is the same ubuntu 3.0.0-16-server as yours.
Now that I think about it, my downloads are on the RAID0 array too.
One thing that comes to mind is the parallel settings, BB_NUMBER_THREADS
and PARALLEL_MAKE. I noticed a negative impact if I increased these
beyond 12 and 14 respectively. I tested this with bb-matrix
(scripts/contrib/bb-perf/bb-matrix.sh). The script is a bit fickle, but
can provide useful results and killer 3D surface plots of build time
with BB and PM on the axis. Can't seem to find a plot image at the
moment for some reason...
Post by Björn Stenberg
Yet for all the combined horsepower, I am unable to match your time
of 30 minutes for core-image-minimal. I clock in at around 37 minutes
------ NOTE: Tasks Summary: Attempted 1363 tasks of which 290 didn't
need to be rerun and all succeeded.
real 36m32.118s user 214m39.697s sys 108m49.152s ------
These numbers also show that my build is running less than 9x
realtime, indicating that 80% of my cores sit idle most of the time.
Yup, that sounds about right. The build has a linear component to it,
and anything above about 12 just doesn't help. In fact the added
scheduling overhead seems to hurt.
Post by Björn Stenberg
This confirms what "ps xf" says during the builds: Only rarely is
bitbake running more than a handful tasks at once, even with
BB_NUMBER_THREADS at 64. And many of these tasks are in turn running
sequential loops on a single core.
I'm hoping to find time soon to look deeper into this issue and
suggest remedies. It my distinct feeling that we should be able to
build significantly faster on powerful machines.
Reducing the dependency chains that result in the linear component of
the build (forcing serialized execution) is one place we've focused, and
could probably still use some attention. CC'ing RP as he's done a lot there.
DX79TO motherboard, i7 3820, 16GB RAM, a pair of 60GB OCZ Vertex III's in RAID-0 for downloads / build, SATA HD for OS (Ubuntu 11.10 x86_64).
That'll give me a 2.7x boost just on CPU and the SSDs (and maybe some over-clocking) will give some more.
Not sure if SSDs in RAID-0 will give any boost, so I'll run some tests.
Thanks to all for the comments in this thread.
Get back to us with times, and we'll build up a wiki page.
Post by Chris Tapp
Chris Tapp
www.keylevel.com
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Chris Tapp
2012-04-18 19:41:17 UTC
Permalink
Post by Darren Hart
Get back to us with times, and we'll build up a wiki page.
Some initial results / comments:

I'm running on:
- i7 3820 (quad-core, hyper-threading, 3.6GHz)
- 16GB RAM (1600MHz XMP profile)
- Asus P9X79 Pro motherboard
- Ubuntu 11.10 x86_64 server installed on a 60GB OCZ Vertex 3 SSD on a 3Gb/s interface
- Two 60GB OCZ Vertex 3s as RAID-0 on 6Gb/s interfaces.

The following results use a DL_DIR on the OS SSD (pre-populated) - I'm not interested in the speed of the internet, especially as I've only got a relatively slow connection ;-)

Poky-6.0.1 is also installed on the OS SSD.

I've done a few builds of core-image-minimal:

1) Build dir on the OS SSD
2) Build dir on the SSD RAID + various bits of tuning.

The results are basically the same, so it seems as if the SSD RAID makes no difference. Benchmarking it does show twice the read/write performance of the OS SSD, as expected. Disabling journalling and increasing the commit time to 6000 also made no significant difference to the build times, which were (to the nearest minute):

Real : 42m
User : 133m
System : 19m

These times were starting from nothing, and seem to fit with your 30 minutes with 3 times as many cores! BTW, BB_NUMBER_THREADS was set to 16 and PARALLEL_MAKE to 12.

I also tried rebuilding the kernel:
bitbake -c clean linux-yocto
rm -rf the sstate bits for the above
bitbake linux-yocto

and got the following times:

Real : 39m
User : 105m
System : 16m

Which kind of fits with an observation. The minimal build had something like 1530 stages to complete. The first 750 to 800 of these flew past with all 8 'cores' running at just about 100% all the time. Load average (short term) was about 19, so plenty was ready to run. However, round about the time python-native, the kernel, libxslt and gettext kicked in, the CPU usage dropped right off - to the point that the short-term load average dropped below 3. It did pick up again later on (after the kernel was completed) before slowing down again towards the end (when it would seem reasonable to expect that less can run in parallel).

It seems as if some of these bits (or others around this time) aren't making use of parallel make or there is a queue of dependent tasks that needs to be serialized.

The kernel build is a much bigger part of the build than I was expecting, but this is only a small image. However, it looks as if the main compilation phase completes very early on and a lot of time is then spent building the modules (in a single thread, it seems) and in packaging - which leads me to ask if RPM is the best option (speed wise)? I don't use the packages myself (though understand they are needed internally), so I can use the fastest (if there is one).

Is there anything else I should be considering to improve build times? As I said above, this is just a rough-cut at some benchmarking and I plan to do some more, especially if there are other things to try and/or any other information that would be useful.

Still, it's looking much, much faster than my old build system :-)

Chris Tapp

opensource-UUnwy/L99c5Wk0Htik3J/***@public.gmane.org
www.keylevel.com
Chris Tapp
2012-04-18 20:27:30 UTC
Permalink
Post by Darren Hart
Get back to us with times, and we'll build up a wiki page.
<snip>
bitbake -c clean linux-yocto
rm -rf the sstate bits for the above
bitbake linux-yocto
Real : 11m
User : 15m
System : 2m
The comments about low load averages during kernel build still stand.

Chris Tapp

opensource-UUnwy/L99c5Wk0Htik3J/***@public.gmane.org
www.keylevel.com
Darren Hart
2012-04-18 20:55:38 UTC
Permalink
Post by Chris Tapp
Post by Darren Hart
Get back to us with times, and we'll build up a wiki page.
- i7 3820 (quad core, hyper-treading, 3.6GHz)
- 16GB RAM (1600MHz XMP profile)
- Asus P9X79 Pro motherboard
- Ubuntu 11.10 x86_64 server installed on a 60GB OCZ Vertex 3 SSD on a 3Gb/s interface
- Two 60GB OCZ Vertex 3s as RAID-0 on 6Gb/s interfaces.
The following results use a DL_DIR on the OS SSD (pre-populated) -
I'm
not interested in the speed of the internet, especially as I've only got
a relatively slow connection ;-)
Post by Chris Tapp
Poky-6.0.1 is also installed on the OS SSD.
1) Build dir on the OS SSD
2) Build dir on the SSD RAID + various bits of tuning.
The results are basically the same, so it seems as if the SSD RAID
makes no difference. Benchmarking it does show twice the read/write
performance of the OS SSD, as expected. Disabling journalling and
increasing the commit time to 6000 also made no significant difference
to the build times, which were (to the nearest minute):


That is not surprising. With 4 cores and a very serialized build target,
I would not expect your SSD to be the bottleneck.
Post by Chris Tapp
Real : 42m
User : 133m
System : 19m
These time were starting from nothing, and seem to fit with your 30
minutes with 3 times as many cores! BTW, BB_NUMBER_THREADS was set to 16
and PARALLEL_MAKE to 12.

A couple of things to keep in mind here. The minimal build is very
serialized in comparison to something like a sato build. If you want to
optimize your build times, look at the bbmatrix* scripts shipped with
poky to find the sweet spot for your target image and your build system.
I suspect you will find your BB_NUMBER_THREADS and PARALLEL_MAKE
settings are too high for your system. I'd start with them at 8 and 8,
or 8 and 6 respectively.
Post by Chris Tapp
bitbake -c clean linux-yocto
rm -rf the sstate bits for the above
bitbake linux-yocto
Real : 39m
User : 105m
System : 16m
Which kind of fits with an observation. The minimal build had
something like 1530 stages to complete. The first 750 to 800 of these
flew past with all 8 'cores' running at just about 100% all the time.
Load average (short term) was about 19, so plenty ready to run. However,
round about the time python-native, the kernel, libxslt, gettext kicked
in the cpu usage dropped right off - to the point that the short term
load average dropped below 3. It did pick up again later on (after the
kernel was completed) before slowing down again towards the end (when it
would seem reasonable to expect that less can run in parallel).
Post by Chris Tapp
It seems as if some of these bits (or others around this time)
aren't
making use of parallel make or there is a queue of dependent tasks that
needs to be serialized.
Post by Chris Tapp
The kernel build is a much bigger part of the build than I was
expecting, but this is only a small image. However, it looks as if the
main compilation phase completes very early on and a lot of time is then
spent building the modules (in a single thread, it seems) and in
packaging - which leads me to ask if RPM is the best option (speed
wise)? I don't use the packages myself (though understand they are
needed internally), so I can use the fastest (if there is one).

IPK is faster than RPM. This is what I use on most of my builds.
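Switching the packaging backend is a one-line change in local.conf; the first class listed is the one used when the root filesystem is generated:

# conf/local.conf
PACKAGE_CLASSES = "package_ipk"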
Post by Chris Tapp
Is there anything else I should be considering to improve build
times?
Run the Ubuntu server kernel to eliminate some scheduling overhead.
Reducing the parallel settings mentioned above should help here too.

Welcome to Ubuntu 11.10 (GNU/Linux 3.0.0-16-server x86_64)

***@rage:~
$ uname -r
3.0.0-16-server


As I said above, this is just a rough-cut at some benchmarking and I
plan to do some more, especially if there are other things to try and/or
any other information that would be useful.
Post by Chris Tapp
Still, it's looking much, much faster than my old build system :-)
Chris Tapp
www.keylevel.com
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Chris Tapp
2012-04-19 22:39:26 UTC
Permalink
On 18 Apr 2012, at 21:55, Darren Hart wrote:

<snip>
Post by Darren Hart
A couple of things to keep in mind here. The minimal build is very
serialized in comparison to something like a sato build. If you want to
optimize your build times, look at the bbmatrix* scripts shipped with
poky to find the sweet spot for your target image and your build system.
I suspect you will find your BB_NUMBER_THREADS and PARALLEL_MAKE
settings are two high for your system. I'd start with them at 8 and 8,
or 8 and 6 respectively.
I've run a few of the matrix variants (it's going to take a few days to get a full set). 8 and 16 threads are giving the same results (within a few seconds) for parallel make values in the range 6 to 12.

I tried a core-image-sato build and it completed in 61m/244m/40m, which is much closer to your <50m than I thought I would get.

One thing I noticed during the build was that gettext-native seemed slow. Doing a 'clean' on it and re-baking shows that it takes over 4 minutes to build with most of the time (2m38) being spent in 'do_configure'. It also seems as if this is on the critical path as nothing else was getting scheduled while it was building. There seems to be a lot of 'nothing' going on during the do_configure phase (i.e. very little CPU use). Or, to put it another way, 2.5% of the build time is taken up configuring this package!
Post by Darren Hart
IPK is faster than RPM. This is what I use on most of my builds.
Makes no noticeable difference in my testing so far, but I'll stick with IPK from now on.

<snip>
Post by Darren Hart
Run the ubuntu server kernel to eliminate some scheduling overhead.
Reducing the parallel settings mentioned above should help here too.
I'm running 11.x server as you mentioned this before ;-)

Chris Tapp

opensource-UUnwy/L99c5Wk0Htik3J/***@public.gmane.org
www.keylevel.com

Richard Purdie
2012-04-13 08:45:23 UTC
Permalink
Post by Darren Hart
Post by Björn Stenberg
Post by Darren Hart
/dev/md0 /build ext4
noauto,noatime,nodiratime,commit=6000
A minor detail: 'nodiratime' is a subset of 'noatime', so there is no
need to specify both.
Excellent, thanks for the tip.
Note the key here is that for a system with large amounts of memory, you
can effectively keep the build in memory due to the long commit time.

All the tests I've done show we are not IO bound anyway.
Post by Darren Hart
Post by Björn Stenberg
Yet for all the combined horsepower, I am unable to match your time
of 30 minutes for core-image-minimal. I clock in at around 37 minutes
------ NOTE: Tasks Summary: Attempted 1363 tasks of which 290 didn't
need to be rerun and all succeeded.
real 36m32.118s user 214m39.697s sys 108m49.152s ------
These numbers also show that my build is running less than 9x
realtime, indicating that 80% of my cores sit idle most of the time.
Yup, that sounds about right. The build has a linear component to it,
and anything above about 12 just doesn't help. In fact the added
scheduling overhead seems to hurt.
Post by Björn Stenberg
This confirms what "ps xf" says during the builds: Only rarely is
bitbake running more than a handful tasks at once, even with
BB_NUMBER_THREADS at 64. And many of these tasks are in turn running
sequential loops on a single core.
I'm hoping to find time soon to look deeper into this issue and
suggest remedies. It my distinct feeling that we should be able to
build significantly faster on powerful machines.
Reducing the dependency chains that result in the linear component of
the build (forcing serialized execution) is one place we've focused, and
could probably still use some attention. CC'ing RP as he's done a lot there.
The minimal build is about our worst-case single-threaded build, as it is
highly dependency-ordered. We've already done a lot of work looking at
the "single thread" of core dependencies, and this is, for example, why we
have gettext-minimal-native, which unlocked some of the core path
dependencies. When you look at what we build, there is a reason for most
of it, unfortunately. There are emails from me on the mailing list about
what I looked at and found; I tried to keep a record of it somewhere at
least. You can get some wins with things like ASSUME_PROVIDED +=
"git-native".

For something like a sato build you should see more parallelism.

I do also have some small gains in some pending patches:

http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/t2&id=2023801e25d81e8cffb643eac259c18b9fecda0b
http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/t2&id=ecf5f5de8368fdcf90c3d38eafc689d6d265514b
http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/t2&id=2190a51ffac71c9d19305601f8a3a46e467b745a

which look at speeding up do_package, do_package_write_rpm and do_rootfs
(with rpm). These were developed too late for 1.2 and are in some cases
only partially complete, but they show some ways we can squeeze some
extra performance out of the system.

There are undoubtedly ways we can improve performance but I think we've
done the low hanging fruit and we need some fresh ideas.

Cheers,

Richard
Koen Kooi
2012-04-19 10:00:54 UTC
Permalink
Post by Richard Purdie
Post by Darren Hart
Post by Björn Stenberg
Post by Darren Hart
/dev/md0 /build ext4
noauto,noatime,nodiratime,commit=6000
A minor detail: 'nodiratime' is a subset of 'noatime', so there is no
need to specify both.
Excellent, thanks for the tip.
Note the key here is that for a system with large amounts of memory, you
can effectively keep the build in memory due to the long commit time.
All the tests I've done show we are not IO bound anyway.
Consider this scenario:

OS disk on spinning rust (sda1, /)
BUILDDIR on spinning rust (sdb1, /OE)
WORKDIR on SSD (sdc1, /OE/build/tmp/work)
SD card in USB reader (sde1)

When I do the following during a build all CPUs will enter IO wait and the build grinds to a halt:

cd /media ; xz -d -c foo.img.xz | pv -s 3488M > /dev/sde

That only touches the OS disk and the SD card, but for some reason the 3.2.8 kernel stops IO to the OE disks as well. do_patch for my kernel recipe has been taking more than an hour now; it usually completes in less than 5 minutes (a few hundred patches applied with a custom patcher, using git-am).

regards,

Koen
Joshua Immanuel
2012-04-19 12:48:12 UTC
Permalink
Hello,
Post by Richard Purdie
There are undoubtedly ways we can improve performance but I think
we've done the low hanging fruit and we need some fresh ideas.
Is there a way to integrate distcc into Yocto so that we could distribute
the build across machines?
--
Joshua Immanuel
HiPro IT Solutions Private Limited
http://hipro.co.in
Richard Purdie
2012-04-19 12:52:31 UTC
Permalink
Post by Joshua Immanuel
Hello,
Post by Richard Purdie
There are undoubtedly ways we can improve performance but I think
we've done the low hanging fruit and we need some fresh ideas.
Is there a way to integrate distcc into Yocto so that we could distribute
the build across machines?
See icecream.bbclass, but compiling is not the bottleneck; it's configure,
install and packaging...
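For anyone who wants to experiment with it anyway, enabling the class is
essentially a local.conf change along these lines (a sketch only; icecc
itself needs to be installed on the machines involved, and the ICECC_*
tuning variables are left out here):

INHERIT += "icecream"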

Cheers,

Richard
Samuel Stirtzel
2012-04-19 13:47:08 UTC
Permalink
Post by Richard Purdie
Post by Joshua Immanuel
Hello,
Post by Richard Purdie
There are undoubtedly ways we can improve performance but I think
we've done the low hanging fruit and we need some fresh ideas.
Is there a way to integrate distcc into Yocto so that we could distribute
the build across machines?
See icecream.bbclass, but compiling is not the bottleneck; it's configure,
install and packaging...
Multi-threaded package managers come to mind, as does multi-threaded
bzip2 (see [1]).
Maybe multi-threaded autotools / cmake, but that will be future talk
(and a headache for the developers).
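As a quick illustration of the bzip2 case, pbzip2 is close to a drop-in
replacement (the file name and core count below are just placeholders):

# compress on 8 cores instead of one
pbzip2 -p8 core-image-sato.tar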
Post by Richard Purdie
Cheers,
Richard
[1] http://compression.ca/pbzip2/
--
Regards
Samuel
Björn Stenberg
2012-04-13 08:47:02 UTC
Permalink
Post by Darren Hart
One thing that comes to mind is the parallel settings, BB_NUMBER_THREADS
and PARALLEL_MAKE. I noticed a negative impact if I increased these
beyond 12 and 14 respectively. I tested this with bb-matrix
(scripts/contrib/bb-perf/bb-matrix.sh). The script is a bit fickle, but
can provide useful results and killer 3D surface plots of build time
with BB and PM on the axis.
Very nice! I ran a batch overnight with permutations of 8,12,16,24,64 cores:

BB PM %e %S %U %P %c %w %R %F %M %x
8 8 2288.96 2611.37 10773.53 584% 810299 18460161 690464859 0 1715456 0
8 12 2198.40 2648.57 10846.28 613% 839750 18559413 690563187 0 1982864 0
8 16 2157.26 2672.79 10943.59 631% 898599 18487946 690761197 0 1715440 0
8 24 2125.15 2916.33 11199.27 664% 800009 18412764 690856116 0 1715440 0
8 64 2189.14 7084.14 12906.95 913% 1491503 18646891 699897733 0 1715440 0
12 8 2277.66 2625.82 10805.21 589% 691752 18596208 690998433 0 1715440 0
12 12 2194.04 2664.01 10934.65 619% 714997 18717017 691199925 0 1715440 0
12 16 2183.95 2736.33 11162.30 636% 1090270 18359128 690559327 0 1715440 0
12 24 2120.46 2907.63 11229.50 666% 829783 18644293 690729638 0 1715312 0
12 64 2171.58 6767.09 12822.86 902% 1524683 18634668 690904549 0 1867456 0
16 8 2294.59 2691.74 10813.69 588% 771621 18637582 686712129 0 1715344 0
16 12 2201.51 2704.54 11017.23 623% 753662 18590533 699231236 0 1715424 0
16 16 2154.54 2692.31 11023.28 636% 809586 18557781 691014487 0 1715440 0
16 24 2130.33 2932.18 11259.09 666% 905669 18531776 691082307 0 2030992 0
16 64 2184.01 6954.71 12922.39 910% 1467774 18800203 701770099 0 1715440 0
24 8 2284.88 2645.88 10854.89 590% 833061 18523938 691067170 0 1715328 0
24 12 2203.72 2696.96 11033.10 623% 931443 18457749 691187723 0 2016368 0
24 16 2176.02 2727.94 11113.33 636% 940044 18420200 690959670 0 1715440 0
24 24 2170.38 2938.80 11643.10 671% 1023328 18641215 686665448 15 1715440 0
24 64 2200.02 7188.60 12902.42 913% 1509158 18924772 690615091 66 1715440 0
64 8 2309.40 2702.33 10952.18 591% 753168 18687309 690927732 10 1867440 0
64 12 2230.80 2765.98 11131.22 622% 875495 18744802 691213524 28 1715216 0
64 16 2182.22 2786.22 11180.86 640% 881328 18724987 691020084 109 1768576 0
64 24 2136.20 3001.36 11238.81 666% 898320 18646384 691239254 46 1715312 0
64 64 2189.73 7154.10 12846.99 913% 1416830 18781801 690890798 41 1715424 0

What it shows is that BB_NUMBER_THREADS makes no difference at all in this range. As for PARALLEL_MAKE, it shows 24 is better than 16 but 64 is too high, incurring a massive scheduling penalty. I wonder if newer kernel versions have become more efficient. In hindsight, I should have included 32 and 48 cores in the test.
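If I wanted to pin the best combination from this run in local.conf, it would be roughly the following (a sketch: 16 is an arbitrary pick since BB made no measurable difference here, and the "-j" prefix matches what the sample local.conf uses; adjust if your setup expects a bare number):

BB_NUMBER_THREADS = "16"
PARALLEL_MAKE = "-j 24"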

Unfortunately I was unable to produce plots with bb-matrix-plot.sh. It gave me pretty png files, but missing any plotted data:

# ../../poky/scripts/contrib/bb-perf/bb-matrix-plot.sh
line 0: Number of grid points must be in [2:1000] - not changed!

Warning: Single isoline (scan) is not enough for a pm3d plot.
Hint: Missing blank lines in the data file? See 'help pm3d' and FAQ.
Warning: Single isoline (scan) is not enough for a pm3d plot.
Hint: Missing blank lines in the data file? See 'help pm3d' and FAQ.
Warning: Single isoline (scan) is not enough for a pm3d plot.
Hint: Missing blank lines in the data file? See 'help pm3d' and FAQ.
Warning: Single isoline (scan) is not enough for a pm3d plot.
Hint: Missing blank lines in the data file? See 'help pm3d' and FAQ.

Result: http://imgur.com/mfgWb
--
Björn
Darren Hart
2012-04-13 14:41:17 UTC
Permalink
Post by Björn Stenberg
Post by Darren Hart
One thing that comes to mind is the parallel settings, BB_NUMBER_THREADS
and PARALLEL_MAKE. I noticed a negative impact if I increased these
beyond 12 and 14 respectively. I tested this with bb-matrix
(scripts/contrib/bb-perf/bb-matrix.sh). The script is a bit fickle, but
can provide useful results and killer 3D surface plots of build time
with BB and PM on the axis.
BB PM %e %S %U %P %c %w %R %F %M %x
8 8 2288.96 2611.37 10773.53 584% 810299 18460161 690464859 0 1715456 0
8 12 2198.40 2648.57 10846.28 613% 839750 18559413 690563187 0 1982864 0
8 16 2157.26 2672.79 10943.59 631% 898599 18487946 690761197 0 1715440 0
8 24 2125.15 2916.33 11199.27 664% 800009 18412764 690856116 0 1715440 0
8 64 2189.14 7084.14 12906.95 913% 1491503 18646891 699897733 0 1715440 0
12 8 2277.66 2625.82 10805.21 589% 691752 18596208 690998433 0 1715440 0
12 12 2194.04 2664.01 10934.65 619% 714997 18717017 691199925 0 1715440 0
12 16 2183.95 2736.33 11162.30 636% 1090270 18359128 690559327 0 1715440 0
12 24 2120.46 2907.63 11229.50 666% 829783 18644293 690729638 0 1715312 0
12 64 2171.58 6767.09 12822.86 902% 1524683 18634668 690904549 0 1867456 0
16 8 2294.59 2691.74 10813.69 588% 771621 18637582 686712129 0 1715344 0
16 12 2201.51 2704.54 11017.23 623% 753662 18590533 699231236 0 1715424 0
16 16 2154.54 2692.31 11023.28 636% 809586 18557781 691014487 0 1715440 0
16 24 2130.33 2932.18 11259.09 666% 905669 18531776 691082307 0 2030992 0
16 64 2184.01 6954.71 12922.39 910% 1467774 18800203 701770099 0 1715440 0
24 8 2284.88 2645.88 10854.89 590% 833061 18523938 691067170 0 1715328 0
24 12 2203.72 2696.96 11033.10 623% 931443 18457749 691187723 0 2016368 0
24 16 2176.02 2727.94 11113.33 636% 940044 18420200 690959670 0 1715440 0
24 24 2170.38 2938.80 11643.10 671% 1023328 18641215 686665448 15 1715440 0
24 64 2200.02 7188.60 12902.42 913% 1509158 18924772 690615091 66 1715440 0
64 8 2309.40 2702.33 10952.18 591% 753168 18687309 690927732 10 1867440 0
64 12 2230.80 2765.98 11131.22 622% 875495 18744802 691213524 28 1715216 0
64 16 2182.22 2786.22 11180.86 640% 881328 18724987 691020084 109 1768576 0
64 24 2136.20 3001.36 11238.81 666% 898320 18646384 691239254 46 1715312 0
64 64 2189.73 7154.10 12846.99 913% 1416830 18781801 690890798 41 1715424 0
What it shows is that BB_NUMBER_THREADS makes no difference at all in this range. As for PARALLEL_MAKE, it shows 24 is better than 16 but 64 is too high, incurring a massive scheduling penalty. I wonder if newer kernel versions have become more efficient. In hindsight, I should have included 32 and 48 cores in the test.
Right, gnuplot likes evenly spaced values of BB and PM. So you could
have done: 8,12,16,24,28,32 (anything above that is going to go down
anyway). Unfortunately, the gaps force the plot to generate spikes at
the interpolated points. I'm open to ideas on how to make it compatible
with arbitrary gaps and avoid the spikes.


Perhaps I should rewrite this with python matplotlib and scipy and use
the interpolate module. This is non-trivial, so not something I'll get
to quickly.
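A rough sketch of the direction I have in mind (assuming the bb-matrix
output can be read as columns of BB, PM and elapsed seconds after a
one-line header - that layout is an assumption, not something the script
guarantees):

import numpy as np
from scipy.interpolate import griddata
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: registers the "3d" projection

# load BB, PM and elapsed-time columns, skipping the header line
bb, pm, elapsed = np.loadtxt("bb-matrix-results.dat", skiprows=1,
                             usecols=(0, 1, 2), unpack=True)

# resample the unevenly spaced measurements onto a regular 50x50 grid
bb_grid, pm_grid = np.mgrid[bb.min():bb.max():50j, pm.min():pm.max():50j]
elapsed_grid = griddata((bb, pm), elapsed, (bb_grid, pm_grid), method="cubic")

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(bb_grid, pm_grid, elapsed_grid)
ax.set_xlabel("BB_NUMBER_THREADS")
ax.set_ylabel("PARALLEL_MAKE")
ax.set_zlabel("elapsed build time (s)")
plt.savefig("bb-matrix.png")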
Post by Björn Stenberg
# ../../poky/scripts/contrib/bb-perf/bb-matrix-plot.sh
line 0: Number of grid points must be in [2:1000] - not changed!
Warning: Single isoline (scan) is not enough for a pm3d plot.
Hint: Missing blank lines in the data file? See 'help pm3d' and FAQ.
Warning: Single isoline (scan) is not enough for a pm3d plot.
Hint: Missing blank lines in the data file? See 'help pm3d' and FAQ.
Warning: Single isoline (scan) is not enough for a pm3d plot.
Hint: Missing blank lines in the data file? See 'help pm3d' and FAQ.
Warning: Single isoline (scan) is not enough for a pm3d plot.
Hint: Missing blank lines in the data file? See 'help pm3d' and FAQ.
Result: http://imgur.com/mfgWb
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Björn Stenberg
2012-04-19 07:24:37 UTC
Permalink
Post by Darren Hart
Right, gnuplot likes evenly spaced values of BB and PM. So you could
have done: 8,12,16,24,28,32
I did that, and uploaded it to the wiki:
https://wiki.yoctoproject.org/wiki/Build_Performance#parallelism

Looks like 24/32 is the sweet spot for this system, for this build.
--
Björn
Darren Hart
2012-04-19 14:11:50 UTC
Permalink
Post by Björn Stenberg
Post by Darren Hart
Right, gnuplot likes evenly spaced values of BB and PM. So you could
have done: 8,12,16,24,28,32
https://wiki.yoctoproject.org/wiki/Build_Performance#parallelism
Looks like 24/32 is the sweet spot for this system, for this build.
Fantastic! I'm glad to see a sweet spot above 12x12. I'll have to rerun
on my system to see if things have improved for me as well. Thanks for
taking the time to summarize the discussion and get it on the wiki!
--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel
Tomas Frydrych
2012-04-13 09:56:47 UTC
Permalink
Post by Darren Hart
Next up is storage.
Indeed. In my experience by far the biggest limiting factor in the
builds is getting io bound. If you are not running a dedicated build
machine, it is well worth using a dedicated disk for the poky tmp dir;
assuming you have cpu time left, this leaves the machine completely
usable for other things.
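One easy way to do that is to point TMPDIR at the dedicated disk in
local.conf (the mount point here is just a placeholder):

TMPDIR = "/mnt/build-disk/poky-tmp"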
Post by Darren Hart
Now RAM, you will want about 2 GB of RAM per core, with a minimum of 4GB.
My experience does not bear this out at all; building Yocto on a 6-core
hyper-threaded desktop machine, I have never seen system memory use get
significantly over the 2GB mark (out of 8GB available) when doing a
Yocto build using 10 cores/threads.


On a custom desktop machine with an i7-990X at 3.47GHz, 8GB RAM and quiet
conventional hard disks, letting poky use 10 cores/threads (so I can get
my work done while it does its own thing in the background), a fresh
build of core-image-minimal for beagleboard, with debug & profile tools
and test apps, takes 77 minutes.

Obviously, not anywhere near as fast as the Intel OTC Xeon beast, but
much cheaper HW, and for my purposes the build speed is well within a
region where it is no longer a productivity issue.

Tomas
Koen Kooi
2012-04-13 10:23:18 UTC
Permalink
Post by Tomas Frydrych
Post by Darren Hart
Next up is storage.
Indeed. In my experience by far the biggest limiting factor in the
builds is getting io bound. If you are not running a dedicated build
machine, it is well worth using a dedicated disk for the poky tmp dir;
assuming you have cpu time left, this leaves the machine completely
usable for other things.
Post by Darren Hart
Now RAM, you will want about 2 GB of RAM per core, with a minimum of 4GB.
My experience does not bear this out at all; building Yocto on a 6-core
hyper-threaded desktop machine, I have never seen system memory use get
significantly over the 2GB mark (out of 8GB available) when doing a
Yocto build using 10 cores/threads.
Try building webkit or asio: the linker will use ~1.5GB per object, so for asio you need PARALLEL_MAKE * 1.5 GB of RAM to avoid swapping to disk.
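To put a number on that: with PARALLEL_MAKE set to 16, that worst case is roughly 16 * 1.5 GB = 24 GB of RAM during the link-heavy steps, which is in line with the "2 GB of RAM per core" guidance quoted above.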