[OpenRISC] A few observations about NetBSD porting to OpenRISC

Discussion:

Matt Thomas

2014-08-20 17:30:27 UTC

Recently, I started making NetBSD support OpenRISC.

I'm using binutils from the top of the tree and GCC 4.9 for my toolchain.

I looked at using llvm-openrisc, but NetBSD's LLVM is 3.6 while llvm-openrisc
is 3.1. Since my expertise with toolchains is more GCC-centric, I went that
way.

So I'm wondering which ISA features I can count on. Are ORBIS32 II
instructions widely implemented? Floating point?

I was deciding whether to support only the no-delay version of the ISA.
I found that PIC code and -mno-delay seem to be incompatible at the moment.

The problem is that the code computing the GOT pointer doesn't take -mno-delay
or -mcompat-delay into account. It's always emitted as:

l.jal 8
l.movhi r16,gotpchi(_GLOBAL_OFFSET_TABLE_-4)
l.ori r16,r16,gotpclo(_GLOBAL_OFFSET_TABLE_+0)
l.add r16,r16,r9

The problem is that for no-delay the l.jal should have an argument of 4;
otherwise the l.movhi is never executed because it is branched over. I think
for -mno-delay or -mcompat-delay it should be:

l.jal 4
l.movhi r16,gotpchi(_GLOBAL_OFFSET_TABLE_+0)
l.ori r16,r16,gotpclo(_GLOBAL_OFFSET_TABLE_+4)
l.add r16,r16,r9
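
The arithmetic behind the two sequences can be checked with a rough sketch. It rests on assumptions that are not stated in the thread: that gotpchi()/gotpclo() evaluate to the high and low halves of (symbol + addend - address of the instruction carrying the relocation), and that l.jal leaves the address after the delay slot in r9 on a delay-slot core and the address of the next instruction on a no-delay core. The function names and the addresses used are purely illustrative.

```python
MASK32 = 0xFFFFFFFF


def gotpc(symbol, addend, place):
    """Assumed relocation value: symbol + addend - address of the instruction."""
    return (symbol + addend - place) & MASK32


def got_pointer(base, got, delay_slot):
    """Evaluate the four-instruction sequence placed at address `base`."""
    movhi_addr = base + 4
    ori_addr = base + 8
    if delay_slot:
        # l.jal 8: l.movhi runs in the delay slot, r9 = address after the slot
        r9 = base + 8
        hi_addend, lo_addend = -4, 0
    else:
        # l.jal 4: no delay slot, r9 = address of the next instruction
        r9 = base + 4
        hi_addend, lo_addend = 0, 4
    hi = gotpc(got, hi_addend, movhi_addr) >> 16    # gotpchi(...)
    lo = gotpc(got, lo_addend, ori_addr) & 0xFFFF   # gotpclo(...)
    r16 = (hi << 16) | lo                           # l.movhi then l.ori
    return (r16 + r9) & MASK32                      # l.add r16,r16,r9
```

Under these assumptions both variants return the GOT address: the addends -4/+0 make each displacement relative to base + 8, while +0/+4 make each relative to base + 4, matching the value of r9 in each mode.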

I notice that r16 is being used as the GOT pointer and r10 as the thread pointer,
though they aren't documented as such in the OpenRISC 1.1 Architecture.

I was surprised to see that patterns for ffssi2, ctzsi2, and clzsi2 aren't
present for gcc given the l.ff1 and l.fl1 instructions.
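
As a sketch of why these patterns look like a natural fit: assuming l.ff1 returns the 1-based position of the lowest set bit and l.fl1 the 1-based position of the highest set bit (both 0 for a zero input), the three operations reduce to one instruction plus at most one adjustment. The Python functions below only model that arithmetic on 32-bit values; they are not taken from gcc.

```python
def ff1(x):
    """Model of l.ff1: 1-based position of the lowest set bit, 0 if x == 0."""
    x &= 0xFFFFFFFF
    return (x & -x).bit_length()


def fl1(x):
    """Model of l.fl1: 1-based position of the highest set bit, 0 if x == 0."""
    return (x & 0xFFFFFFFF).bit_length()


def ffs(x):
    """ffs: identical to the assumed l.ff1 result, including ffs(0) == 0."""
    return ff1(x)


def ctz(x):
    """Count trailing zeros; only meaningful for x != 0."""
    return ff1(x) - 1


def clz(x):
    """Count leading zeros of a 32-bit value; only meaningful for x != 0."""
    return 32 - fl1(x)
```

For example, ffs(8) gives 4, ctz(8) gives 3 and clz(1) gives 31. The zero-input results of ctz and clz are left unspecified here, as they are for the corresponding GCC builtins.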

Looking at the code emitted by gcc, I see:

l.addi r1,r1,16
l.lwz r9,-4(r1) # SI load
l.lwz r1,-16(r1) # SI load

The load of r1 after the l.addi serves no useful purpose.

One nice thing I have noticed is that it is rather easy to convert
PowerPC assembly to OpenRISC.

Stefan Kristiansson

2014-08-21 03:11:04 UTC

Post by Matt Thomas
Recently, I started making NetBSD support OpenRISC.

Great!

Post by Matt Thomas
I'm using binutils from the top of the tree and GCC 4.9 for my toolchain.
I looked at using llvm-openrisc, but NetBSD's LLVM is 3.6 while llvm-openrisc
is 3.1. Since my expertise with toolchains is more GCC-centric, I went that
way.

https://github.com/openrisc/llvm-or1k is actually at 3.5, and these guys have
even more recent versions:
http://compilergroup-srv.elet.polimi.it/pulp/git/pulp-public
(Their git server seems a bit unreliable, though; I have never been able to
pull anything from it.)

Post by Matt Thomas
So I'm wondering which ISA features I can count on. Are ORBIS32 II
instructions widely implemented? Floating point?

Strictly, you can't count on any of the ORBIS32 II instructions being
implemented, but in practice 'all' implementations support mul, div and
ff1/fl1.
FPU is not that commonly implemented/used, but I doubt there is code in
the kernel that depends on FPU?
I also doubt there is much user space FPU code written in asm?

Post by Matt Thomas
I was deciding whether to support only the no-delay version of the ISA.
I found that PIC code and -mno-delay seem to be incompatible at the moment.

Delay-slot implementations are still dominant, especially if you want an
MMU.
I have a long-term plan to do a delay-slot-less version of
mor1kx-cappuccino (https://github.com/openrisc/mor1kx),
and there's this https://github.com/pgavin/carpe
I would suggest working on making the code delay-slot agnostic instead of
choosing one path.

Post by Matt Thomas
The problem is that the code computing the GOT pointer doesn't take -mno-delay
or -mcompat-delay into account. It's always emitted as:
l.jal 8
l.movhi r16,gotpchi(_GLOBAL_OFFSET_TABLE_-4)
l.ori r16,r16,gotpclo(_GLOBAL_OFFSET_TABLE_+0)
l.add r16,r16,r9
The problem is that for no-delay the l.jal should have an argument of 4;
otherwise the l.movhi is never executed because it is branched over. I think
for -mno-delay or -mcompat-delay it should be:
l.jal 4
l.movhi r16,gotpchi(_GLOBAL_OFFSET_TABLE_+0)
l.ori r16,r16,gotpclo(_GLOBAL_OFFSET_TABLE_+4)
l.add r16,r16,r9

Yes, this is a deficiency in the gcc implementation (llvm actually
handles this correctly).
That said, you will never be able to make the GOT pointer acquisition
work with -mcompat-delay, since the
l.jal 4
will be wrong for implementations that feature a delay slot.
IOW, PIC code will always need to choose to be either delay or no-delay.
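
To illustrate why neither sequence can serve both kinds of core, here is a small hypothetical model that steps through the four instructions. It assumes the relocations evaluate to (symbol + addend - address of the instruction), that a delay-slot core executes the instruction after l.jal before jumping and sets r9 to the address after the slot, and that a no-delay core sets r9 to the address of the next instruction. The function name, the default addresses and the encoding of the result are made up for this sketch.

```python
MASK32 = 0xFFFFFFFF


def run_got_sequence(jal_offset, addends, core_has_delay_slot,
                     base=0x1000, got=0x20000):
    """Step through l.jal / l.movhi / l.ori / l.add placed at `base`.

    Returns the final r16, or None if l.movhi was never executed.
    """
    movhi_addr, ori_addr, add_addr = base + 4, base + 8, base + 12
    hi_addend, lo_addend = addends
    # Assumed relocation values: symbol + addend - address of the instruction.
    hi = ((got + hi_addend - movhi_addr) & MASK32) >> 16
    lo = (got + lo_addend - ori_addr) & 0xFFFF

    target = base + jal_offset
    if core_has_delay_slot:
        r9 = base + 8             # link register: address after the delay slot
        trace = [movhi_addr]      # delay-slot instruction runs before the jump
    else:
        r9 = base + 4             # link register: address of next instruction
        trace = []
    trace += list(range(target, add_addr + 4, 4))   # straight line from target

    r16 = None
    for pc in trace:
        if pc == movhi_addr:
            r16 = hi << 16
        elif pc == ori_addr:
            if r16 is None:
                return None       # upper half of r16 was never loaded
            r16 |= lo
        elif pc == add_addr:
            r16 = (r16 + r9) & MASK32
    return r16
```

With the defaults, the "l.jal 8" sequence yields the GOT address on a delay-slot core and skips l.movhi on a no-delay core, while the "l.jal 4" sequence yields the GOT address on a no-delay core but GOT + 4 on a delay-slot core, because r9 is 4 bytes further along than the addends expect.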

Post by Matt Thomas
I notice that r16 is being used as the GOT pointer and r10 as the thread pointer,
though they aren't documented as such in the OpenRISC 1.1 Architecture.

Yes, we should update the arch spec ABI section with this...

Post by Matt Thomas
I was surprised to see that patterns for ffssi2, ctzsi2, and clzsi2 aren't
present for gcc given the l.ff1 and l.fl1 instructions.

True, l.ff1 and l.fl1 are optional, so I guess that's why those patterns
haven't been implemented.
I might take a look at adding support for them at some point, if someone
doesn't beat me to it.
llvm can make use of these instructions, though.

Post by Matt Thomas
Looking at the code emitted by gcc, I see:
l.addi r1,r1,16
l.lwz r9,-4(r1) # SI load
l.lwz r1,-16(r1) # SI load
The load of r1 after the l.addi serves no useful purpose.

It's a known issue, and it's in my todo pipeline to fix as soon as
I'm done with what I'm currently working on.
The code is emitted to work around a dwarf2 issue that
Christian Svensson noticed.
If you want to take a look at it yourself, this is the code that makes it
happen:
https://github.com/openrisc/or1k-gcc/blob/or1k/gcc/config/or1k/or1k.c#L132-L136

Stefan

Peter Gavin

2014-08-21 03:22:22 UTC

On Wed, Aug 20, 2014 at 11:11 PM, Stefan Kristiansson wrote:

Post by Stefan Kristiansson
there's this https://github.com/pgavin/carpe

It's still very very much a work in progress. I've held off announcing it
here because it still needs a lot of work. I'm redoing the data cache at
the moment, and there's no MMU, and I haven't ported it to any FPGAs (it
synthesizes, though). But it works pretty well for simple stuff. Newlib
compiled stuff should work, but I have my own boot code I'm using for now,
and there's no I/O. It supports l.mul, l.div, l.f[fl]1, l.cmov, l.ror,
etc. Basically, the entire integer ISA should work. There are a few tiny
deviations from other implementations that I need to fix, but the code
emitted by GCC isn't affected. And there's no FPU support at all :)

-Pete

Geert Uytterhoeven

2014-08-21 07:33:23 UTC

Hi Stefan,

On Thu, Aug 21, 2014 at 5:11 AM, Stefan Kristiansson wrote:

Post by Stefan Kristiansson
FPU is not that commonly implemented/used, but I doubt there is code in the
kernel that depends on FPU?

Isn't the FPU context saved/restored when switching tasks?

Gr{oetje,eeting}s,

Geert


Stefan Kristiansson

2014-08-21 08:08:44 UTC

Post by Geert Uytterhoeven
On Thu, Aug 21, 2014 at 5:11 AM, Stefan Kristiansson wrote:

Post by Stefan Kristiansson
FPU is not that commonly implemented/used, but I doubt there is code in the
kernel that depends on FPU?

Post by Geert Uytterhoeven
Isn't the FPU context saved/restored when switching tasks?

Good point, but or1k doesn't have any special FPU registers (the GPRs
are used),
so AFAICT the only real context saving that would be needed is to save the
FPU status register.

Stefan