Discussion:
[OpenRISC] Porting FPU from OpenRISC-1200 to mor1kx-cappuccino pipeline
BAndViG
2014-08-25 18:09:43 UTC
Hello all!

I'm working on porting the FPU from OpenRISC-1200 to the mor1kx-cappuccino pipeline.
For testing purposes I ported the "testfloat" program from ORPSoC v2 to the or1k
newlib tool chain.
The initial and buggy :) Verilog is finished. The successfully tested
features are: "int32 to float32 conversion", "addition", "subtraction",
exception handling and FPCSR reading/writing.
Known bugs:
The "float32 to int32 conversion" fails in the "round toward +inf" mode when
converting 1.0f: the result is 2 (must be 1). There may be other bugs,
but I modified the testing routine to stop at the first
error.
The multiplier and divider also generate erroneous results (not always, but
for some particular inputs).
And any comparison test hangs "testfloat". I tried to simulate
the execution of a floating point comparison on RTL with a simple program placed
into ROM. That test passed successfully (no pipeline hang). Does anybody
have an idea how the bug could be found?
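
For reference, rounding toward +inf only moves a value up when a fraction is discarded, so an exact integer such as 1.0f has to convert to 1. Below is a minimal C sketch of that expected behaviour. It assumes the input fits in int32, the helper name is hypothetical, and it is neither the testfloat code nor the FPU's algorithm.

```c
#include <stdint.h>

/* Hypothetical reference: float32 -> int32 with rounding toward +inf.
 * Assumes x is finite and within the int32 range. */
static int32_t f32_to_i32_round_up(float x)
{
    int32_t t = (int32_t)x;   /* a C cast truncates toward zero */
    if ((float)t < x)         /* a positive fraction was discarded */
        t += 1;               /* so step up to the next integer */
    return t;
}
```

With this model, 1.0f maps to 1, 1.25f maps to 2, and -1.5f maps to -1.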

If somebody wants to participate or just review the sources, the
Verilog can be found at https://github.com/bandvig/mor1kx/tree/withfpu

The source code of my testfloat port for newlib is not in a public
version control system. So, if you need it, I'll send it as a zip archive
(~83KB) to an e-mail address of your choice.

WBR
Andrey
Stefan Kristiansson
2014-08-26 03:25:38 UTC
Post by BAndViG
Hello all!
I'm working on porting the FPU from OpenRISC-1200 to the mor1kx-cappuccino pipeline.
For testing purposes I ported the "testfloat" program from ORPSoC v2 to the or1k
newlib tool chain.
The initial and buggy :) Verilog is finished. The successfully tested
features are: "int32 to float32 conversion", "addition", "subtraction",
exception handling and FPCSR reading/writing.
Known bugs:
The "float32 to int32 conversion" fails in the "round toward +inf" mode when
converting 1.0f: the result is 2 (must be 1). There may be other bugs,
but I modified the testing routine to stop at the first
error.
The multiplier and divider also generate erroneous results (not always,
but for some particular inputs).
And any comparison test hangs "testfloat". I tried to
simulate the execution of a floating point comparison on RTL with a simple
program placed into ROM. That test passed successfully (no pipeline
hang). Does anybody have an idea how the bug could be found?
I didn't completely understand this: where does the test fail if you can't
reproduce it in simulation, on real hardware?
If so, what happens if you run the exact same test in simulation?
Post by BAndViG
If somebody wants to participate or just review the sources,
the Verilog can be found at https://github.com/bandvig/mor1kx/tree/withfpu
The source code of my testfloat port for newlib is not in a public
version control system. So, if you need it, I'll send it as a zip archive
(~83KB) to an e-mail address of your choice.
Nice work so far! I'll definitely take a closer look at it.

Stefan
BAndViG
2014-08-26 18:07:32 UTC
Regarding the hang on floating point comparison.

I run the “testfloat” tool on hardware (Atlys board, BOOTROM_SPI_FLASH + U-Boot). I put two printfs in the function test_ab_float32_z_flag(...) around the place where the comparison is called:

// call software implemented comparison
trueZ = trueFunction( testCases_a_float32, testCases_b_float32 );
trueFlags = *trueFlagsPtr;
// call hardware implemented comparison
printf(" a: %08X b: %08X\r\n",testCases_a_float32,testCases_b_float32); // #1
or1k_reset_fpcsr(); // restore FPEE, clear fpu exception flags, keep rounding mode
(void) testFlagsFunctionPtr();
testZ = testFunction( testCases_a_float32, testCases_b_float32 ); // calls syst_float32_eq/le/lt
testFlags = testFlagsFunctionPtr();
printf(" tZ: %d tF: %d\r\n",testZ,testFlags); // #2

Only the first printf appears. Every kind of comparison (eq/lt/le) hangs. So I decided to run an RTL simulation. I prepared a very simple test bench which includes the mor1kx core and the Wishbone interconnect from the atlys project, but all peripherals (including DDR) except the ROM are replaced with dummies. In place of BOOTROM_SPI_FLASH I put my simple comparison test. The test’s source code is:

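        l.movhi r0, 0                   /* clear r0 */
        l.movhi r3, hi(0x3d820800)      /* r3 = operand a */
        l.ori   r3, r3, lo(0x3d820800)
        l.movhi r4, hi(0x5e93fffe)      /* r4 = operand b */
        l.ori   r4, r4, lo(0x5e93fffe)
        l.jal   syst_float32_eq
endless_cycle:
        l.nop   0x1                     /* first pass: delay slot of l.jal */
        l.j     endless_cycle
        l.nop
syst_float32_eq:
        lf.sfeq.s r3, r4                /* set flag if r3 == r4 */
        l.bnf   f32eq_exit              /* flag clear: skip the "r11 = 1" below */
        l.addi  r11, r0, 0              /* delay slot: r11 = 0 (always executed) */
        l.addi  r11, r0, 1              /* r11 = 1 (only when the flag is set) */
f32eq_exit:
        l.jr    r9
        l.nop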

It operates correctly (executes l.addi r11, r0, 0, doesn’t execute l.addi r11, r0, 1 and goes to endless_cycle).
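
As a cross-check of the expected outcome, the same two operands can be compared on a host machine. This is only a sketch under the assumption that the host uses IEEE 754 single precision; the helper names are hypothetical and are not part of testfloat.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as an IEEE 754 single. */
static float bits_to_f32(uint32_t b)
{
    float f;
    memcpy(&f, &b, sizeof f);
    return f;
}

/* Returns 1 when a == b under IEEE 754 rules
 * (a NaN is never equal to anything, +0 equals -0). */
static int f32_eq_bits(uint32_t a, uint32_t b)
{
    return bits_to_f32(a) == bits_to_f32(b);
}
```

For 0x3d820800 and 0x5e93fffe the result is 0, which matches r11 = 0 in the simulation above.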

As I’m not very familiar with Verilog simulation tools yet, it is quite difficult for me to run an RTL simulation with a real program. So I’m planning to get a model of a simple SRAM with a Wishbone interface from somewhere (perhaps from MinSoC) and install it into my test bench in place of the DDR. It should be good experience for a Verilog newbie like me, but it will take some time.

Andrey


Matt Thomas
2014-08-27 01:00:51 UTC
I noticed that the OpenRISC V1.1 Specification doesn't list anything analogous
to the PowerPC SPRG registers. These are really useful for storing temporaries
on exceptions or storing pointers to system structures (especially per-cpu ones
on MP systems).

There are the ISR0-7 registers but those are defined to be read-only so they aren't
very useful.

Now if the fast context switching stuff is widely implemented, I probably don't
need them. But that I'm not sure of (the openrisc 1200 doesn't implement them).

I'd be happy that if fast context switching isn't implemented that an extra
partial set (4 is enough, 8 is better) of GPR SPRs would be made available.
Peter Gavin
2014-08-27 01:39:41 UTC
Post by Matt Thomas
I'd be happy that if fast context switching isn't implemented that an extra
partial set (4 is enough, 8 is better) of GPR SPRs would be made available.
Finishing the fast context switching/shadow register stuff is probably the
best way to go with this, IMO.

The problem with using SPRs for temporary data is that accessing them is
slow, because the SPR number must be calculated and decoded in order to
determine which SPR is being accessed. My implementation actually flushes
the pipeline on mtspr instructions, and my guess is that other pipelines
are similar, because the mtspr can change important pipeline state in a way
that is not predictable. (I could have avoided the flush in some cases,
but I didn't think the extra logic was worth it.) And the register being
written to by mtspr is probably determined too late to easily bypass the
result to earlier pipe stages (unless you have a really deep pipeline).
So, depending on pipeline depth and architecture, an mtspr instruction can
effectively take several cycles to execute. The result is that it probably
won't be any faster than just going through main memory, assuming a
reasonable cache hit rate.

BTW I think I remember finding some problems with fast context switching as
defined in the spec. It's been a while since I looked at that so I can't
recall what those problems are now. And IIRC, there are *no*
implementations that support it, not even or1ksim.

-Pete
Matt Thomas
2014-08-27 01:56:40 UTC
Post by Matt Thomas
I'd be happy that if fast context switching isn't implemented that an extra
partial set (4 is enough, 8 is better) of GPR SPRs would be made available.
Finishing the fast context switching/shadow register stuff is probably the best way to go with this, IMO.
I agree, mostly. But it seems incomplete.
Post by Peter Gavin
The problem with using SPRs for temporary data is that accessing them is slow, because the SPR number must be calculated and decoded in order to determine which SPR is being accessed. My implementation actually flushes the pipeline on mtspr instructions, and my guess is that other pipelines are similar, because the mtspr can change important pipeline state in a way that is not predictable. (I could have avoided the flush in some cases, but I didn't think the extra logic was worth it.) And the register being written to by mtspr is probably determined too late to easily bypass the result to earlier pipe stages (unless you have a really deep pipeline). So, depending on pipeline depth and architecture, an mtspr instruction can effectively take several cycles to execute. The result is that it probably won't be any faster than just going through main memory, assuming a reasonable cache hit rate.
Without SPRGs, not trashing registers on exception becomes more difficult. Because r0 is not fixed as 0, the very first thing the exception handler must do is l.xor r0, r0, r0 to make sure r0 is zero. Or I suppose you can use r0 as an early temporary and then set it to 0 later. Or you can reserve two registers for exclusive kernel use like MIPS does. It just gets nasty.
Post by Peter Gavin
BTW I think I remember finding some problems with fast context switching as defined in the spec. It's been a while since I looked at that so I can't recall what those problems are now. And IIRC, there are *no* implementations that support it, not even or1ksim.
Stefan Kristiansson
2014-08-27 03:24:11 UTC
Post by Matt Thomas
Post by Matt Thomas
I'd be happy that if fast context switching isn't implemented that an extra
partial set (4 is enough, 8 is better) of GPR SPRs would be made available.
Finishing the fast context switching/shadow register stuff is probably the best way to go with this, IMO.
I agree, mostly. But it seems incomplete.
This has been discussed in detail earlier, and the conclusion was then
as Peter suggested, to use the context switching/shadow reg stuff.
http://lists.openrisc.net/pipermail/openrisc/2014-May/002159.html

To sum it up, you can exploit the shadowed gprs without using the full
set of fast context switch features.
Reading and writing the shadowed GPRs is already supported in or1ksim,
and mor1kx has support for this as well.
By exploiting the shadowed GPRs like this, you get the exact
functionality you are asking for (with the added bonus that you have
32 'SPRG' registers instead of 4-8).

Relying on this feature of course limits you to a smaller set of
implementations, but that would be even more true for 'SPRG'
registers.
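
A sketch of how the SPR number of a shadowed GPR could be computed. It assumes the layout given in the architecture manual (system group 0, GPRs mapped into SPR space from index 1024, 32 registers per context); the helper name is made up for illustration.

```c
#include <stdint.h>

/* Assumed SPR layout: 5-bit group in bits 15..11, 11-bit index below it. */
enum {
    SPR_GROUP_SYS      = 0,     /* system group */
    SPR_GPR_BASE_INDEX = 1024,  /* first GPR mapped into SPR space */
    GPRS_PER_CONTEXT   = 32
};

/* Hypothetical helper: SPR number of register `reg` in context `set`.
 * Set 0 is the current register file, set 1 the first shadow file. */
static inline uint16_t spr_shadow_gpr(unsigned set, unsigned reg)
{
    return (uint16_t)((SPR_GROUP_SYS << 11) |
                      (SPR_GPR_BASE_INDEX + set * GPRS_PER_CONTEXT + reg));
}
```

Under these assumptions, spr_shadow_gpr(1, 3) gives 1059, the number an exception handler would pass to l.mtspr to park a value in r3 of the first shadow set.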
Post by Matt Thomas
Post by Peter Gavin
The problem with using SPRs for temporary data is that accessing them is slow, because the SPR number must be calculated and decoded in order to determine which SPR is being accessed. My implementation actually flushes the pipeline on mtspr instructions, and my guess is that other pipelines are similar, because the mtspr can change important pipeline state in a way that is not predictable. (I could have avoided the flush in some cases, but I didn't think the extra logic was worth it.) And the register being written to by mtspr is probably determined too late to easily bypass the result to earlier pipe stages (unless you have a really deep pipeline). So, depending on pipeline depth and architecture, an mtspr instruction can effectively take several cycles to execute. The result is that it probably won't be any faster than just going through main memory, assuming a reasonable cache hit rate.
I'm not sure that it *has* to be slow, mor1kx executes all of them in
1-2 cycles.
I'm also not sure if any of the SPRs that would change pipeline state
in an unpredictable way is actually allowed to be written to?
PC and the *current* register file for instance is stated as undefined
behavior if read/written to from a program.
Post by Matt Thomas
Without SPRGs, not trashing registers on exception becomes more difficult. Because r0 is not fixed as 0, the very first thing the exception handler must do is l.xor r0, r0, r0 to make sure r0 is zero. Or I suppose you can use r0 as an early temporary and then set it to 0 later. Or you can reserve two registers for exclusive kernel use like MIPS does. It just gets nasty.
r0 *may* not be fixed to zero, you can't rely on it not being fixed,
so you can't use it as a temporary.
Our Linux port resorts to a hack where it uses the memory area
between 0x0-0x100 as a temporary storage.
This of course only works for uniprocessor systems, and because of
that we had the previous discussion.

Side note, to be more friendly to RTL simulations, it's better to use
l.movhi r0,0 or l.andi r0, r0, 0 to make sure r0 is cleared.
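
The message does not spell out why. A common reason, assumed here, is that registers start out as unknown (X) in a 4-state simulator, and X xor X stays X, while an AND with a known 0 (or a plain constant load) gives a known 0. A small C model of that three-valued logic for one bit, with made-up names:

```c
/* Three-valued logic for a single bit: 0, 1, or unknown (X). */
typedef enum { BIT_0, BIT_1, BIT_X } bit3;

static bit3 bit3_xor(bit3 a, bit3 b)
{
    if (a == BIT_X || b == BIT_X)
        return BIT_X;           /* X xor anything is X, even X xor X */
    return (a != b) ? BIT_1 : BIT_0;
}

static bit3 bit3_and(bit3 a, bit3 b)
{
    if (a == BIT_0 || b == BIT_0)
        return BIT_0;           /* a known 0 forces the result to 0 */
    if (a == BIT_X || b == BIT_X)
        return BIT_X;
    return BIT_1;
}
```

In this model bit3_xor(BIT_X, BIT_X) is still BIT_X, whereas bit3_and(BIT_X, BIT_0) is BIT_0.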
Post by Matt Thomas
Post by Peter Gavin
BTW I think I remember finding some problems with fast context switching as defined in the spec. It's been a while since I looked at that so I can't recall what those problems are now. And IIRC, there are *no* implementations that support it, not even or1ksim.
As noted above, even though I think it's correct that no
implementation implements the full fast context switching, there are
implementations that implement a useful subset of it.

Stefan
Peter Gavin
2014-08-27 05:03:58 UTC
Post by Stefan Kristiansson
I'm not sure that it *has* to be slow, mor1kx executes all of them in
1-2 cycles.
I'm also not sure if any of the SPRs that would change pipeline state
in an unpredictable way is actually allowed to be written to?
PC and the *current* register file for instance is stated as undefined
behavior if read/written to from a program.
I don't believe the manual states that writing the GPR or NPC with mtspr is
undefined behavior, however, I may have missed it buried in there
somewhere. Other potentially state-changing mtsprs would be SR, of course,
and the TLB/ATB stuff. I found it simplest and safest just to not worry
about it and always flush on mtspr, since the manual isn't 100% clear about
what should happen. The impact is extremely minor IMO, since m[ft]spr
should really only be used in OS code, and I don't think they should be
considered fast instructions.

The kernel handles this by inserting lots of nops after mtsprs that change
TLB state or enable the IC/DC, for example:

enable_mmu:
/*
* enable dmmu & immu
* SR[5] = 0, SR[6] = 0, 6th and 7th bit of SR set to 0
*/
l.mfspr r30,r0,SPR_SR
l.movhi r28,hi(SPR_SR_DME | SPR_SR_IME)
l.ori r28,r28,lo(SPR_SR_DME | SPR_SR_IME)
l.or r30,r30,r28
l.mtspr r0,r30,SPR_SR
l.nop
l.nop
l.nop
/* lots more nops follow */

I think this is a mistake, and that it would be better to spell out exactly
when the state change occurs. These nops are functionally equivalent to
flushing, on short, in-order pipelines. But simply inserting nops like
this will be insufficient if someone decides to build an aggressive,
speculative, dynamically scheduled pipeline based on OpenRISC. Such a
pipeline could elide all those nops away, and then begin executing the next
load instruction before the TLB is actually enabled. But if the pipeline
flushes itself, those nops aren't even necessary.
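
For reference, the l.movhi/l.ori/l.or sequence in the snippet above amounts to OR-ing two mask bits into SR. A C sketch follows, assuming the bit positions from the architecture manual (DME is bit 5, IME is bit 6). The function name is hypothetical. Note that the comment in the quoted snippet says the bits are "set to 0", while the l.or actually sets them to 1.

```c
#include <stdint.h>

#define SPR_SR_DME 0x00000020u  /* bit 5: data MMU enable (assumed position) */
#define SPR_SR_IME 0x00000040u  /* bit 6: instruction MMU enable (assumed position) */

/* Hypothetical helper: SR value with both MMUs enabled, other bits untouched. */
static inline uint32_t sr_with_mmu_enabled(uint32_t sr)
{
    return sr | SPR_SR_DME | SPR_SR_IME;
}
```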

-Pete
Stefan Kristiansson
2014-08-27 05:27:26 UTC
Post by Peter Gavin
On Tue, Aug 26, 2014 at 11:24 PM, Stefan Kristiansson
Post by Stefan Kristiansson
I'm not sure that it *has* to be slow, mor1kx executes all of them in
1-2 cycles.
I'm also not sure if any of the SPRs that would change pipeline state
in an unpredictable way is actually allowed to be written to?
PC and the *current* register file for instance is stated as undefined
behavior if read/written to from a program.
I don't believe the manual states that writing the GPR or NPC with mtspr is
undefined behavior, however, I may have missed it buried in there somewhere.
It's on page 32:

"4.10 Next and Previous Program Counter (NPC and PPC)
The Program Counter registers represent the address just executed and
the address instruction just to be executed.
These and the GPR registers mapped into SPR space should only be used
for debugging purposes by an external debugger. Applications should
use the l.jal instruction to obtain the current program counter and
arithmethic instructions to obtain GPR register values."

So it's not explicitly stating it's undefined behavior, just that
you're not allowed to use them.
Post by Peter Gavin
Other potentially state-changing mtsprs would be SR, of course, and the
TLB/ATB stuff. I found it simplest and safest just to not worry about it
and always flush on mtspr, since the manual isn't 100% clear about what
should happen. The impact is extremely minor IMO, since m[ft]spr should
really only be used in OS code, and I don't think they should be considered
fast instructions.
The kernel handles this by inserting lots of nops after mtsprs that change
TLB state or enable the IC/DC, for example:
enable_mmu:
/*
* enable dmmu & immu
* SR[5] = 0, SR[6] = 0, 6th and 7th bit of SR set to 0
*/
l.mfspr r30,r0,SPR_SR
l.movhi r28,hi(SPR_SR_DME | SPR_SR_IME)
l.ori r28,r28,lo(SPR_SR_DME | SPR_SR_IME)
l.or r30,r30,r28
l.mtspr r0,r30,SPR_SR
l.nop
l.nop
l.nop
/* lots more nops follow */
I think this is a mistake, and that it would be better to spell out exactly
when the state change occurs. These nops are functionally equivalent to
flushing, on short, in-order pipelines. But simply inserting nops like this
will be insufficient if someone decides to build an aggressive, speculative,
dynamically scheduled pipeline based on OpenRISC. Such a pipeline could
elide all those nops away, and then begin executing the next load
instruction before the TLB is actually enabled. But if the pipeline flushes
itself, those nops aren't even necessary.
Yes, that's not the cleanest way to enable the MMU.
The 'correct' way is to set up ESR and then issue an l.rfe instruction
(and the kernel uses this approach in other places).
Regardless, I agree, those nops serve no purpose; implementations
should be able to function properly without them (and whether they do that
by special-casing SR accesses or flushing the pipeline on each mtspr
instruction is beside the point), and I doubt they are actually even
needed.

But we digress, I agree on your main point, the SPR accesses might be
slow. It's just an implementation detail.
However, it's not possible (at least not in the multicore case) to use
the main memory as a scratch area, so the motivation to have something
else to save state to is not really about speed.

Stefan
Matt Thomas
2014-08-27 08:16:21 UTC
Post by Stefan Kristiansson
But we digress, I agree on your main point, the SPR accesses might be
slow. It's just an implementation detail.
However, it's not possible (at least not in the multicore case) to use
the main memory as a scratch area, so the motivation to have something
else to save state to is not really about speed.
It's nice to have the same method available for uniprocessors too.

I'm going by my experience writing the exception handling code
for NetBSD on PowerPC. The SPRGs are required for MP support.

BTW, the CXR SPR is described but its SPR number was never given.
Didn't find it in or1ksim either. Does it have a number?
Since it's undefined it should return 0, right?
Stefan Kristiansson
2014-08-28 06:14:17 UTC
Post by Matt Thomas
Post by Stefan Kristiansson
But we digress, I agree on your main point, the SPR accesses might be
slow. It's just an implementation detail.
However, it's not possible (at least not in the multicore case) to use
the main memory as a scratch area, so the motivation to have something
else to save state to is not really about speed.
It's nice to have the same method available for uniprocessors too.
I'm going by my experience writing the exception handling code
for NetBSD on PowerPC. The SPRGs are required for MP support.
BTW, the CXR SPR is described but its SPR number was never given.
Didn't find it in or1ksim either. Does it have a number?
Since it's undefined it should return 0, right?
Yes, that looks like one of these problems Peter spoke about.
Also, I don't really see the purpose of the CXR SPR, AFAICT the same
info/functionality can be obtained from the CID field in the SR SPR.
However, you don't need either of them to manually access the shadow GPR files.

Stefan
Sébastien Bourdeauducq
2014-08-27 01:28:28 UTC
Post by BAndViG
I'm working to port FPU from OpenRISC-1200 to mor1kx-cappuccino pipeline.
You may want to have a look at the floating point pipelines I have
developed for Milkymist SoC:
https://github.com/m-labs/milkymist/tree/master/cores/pfpu

I have not implemented all the pesky details of IEEE754 (which I didn't
need), but unlike the OR1200 excuse for a FPU it shows you can do FP in
a simple, fast and resource-efficient way.

Sébastien
BAndViG
2014-08-30 18:57:05 UTC
Thank you, Sébastien.
I downloaded your sources and of course I'll have a look at your
implementation, especially since the OpenRISC-1200 implementation looks buggy
(namely for multiplication, division and f2i conversions) despite the report
that millions of tests were performed :).

Andrey

BAndViG
2014-08-31 15:37:35 UTC
Status update.
f2i conversion is fixed. Now all of testfloat's f2i and i2f tests pass.
The Verilog can be found at
https://github.com/bandvig/mor1kx/tree/withfpu

WBR
Andrey


Sébastien Bourdeauducq
2014-08-31 15:38:52 UTC
Post by BAndViG
Status update.
f2i conversion is fixed. Now all of testfloat's f2i and i2f tests pass.
The Verilog can be found at
https://github.com/bandvig/mor1kx/tree/withfpu
Have you tried synthesizing this thing? Last time I did, that FPU was
extremely slow and extremely bloated even by OR1200 standards, making it
completely unusable and good for a one-way trip to the depths of the
trashbin. I recommend doing some reality-checking before touching this
code...

Sébastien
BAndViG
2014-08-31 17:04:33 UTC
Yes, I have. As I wrote previously, I ran "testfloat" on an SoC generated for
the Atlys board (based on a Xilinx Spartan-6 FPGA). In the SoC the core runs
at 50 MHz. I'm not sure I understand what you mean by "extremely slow". Do
you mean too many cycles per operation? As far as I understand, the
multiplier and divider implement serial algorithms.
Regarding the code: as I see, you distribute your code under GPL v3. I prefer
BSD-like or LGPL-like licenses, to be able to link with proprietary modules.
So I'm going to use your implementation for reference only. By the way,
another source of reference I have is the GPLed OpenSPARC T1/T2.

Generally speaking, I am personally interested in a BSD-like (LGPL-like)
licensed core, written in Verilog, with performance at the ARM Cortex-A8/A9
level (not in "yet another super tiny controller"), with a fairly powerful
32/64-bit FPU, but without fancy DSP-like/vector instructions (I prefer to
design DSP functionality in hardware). I haven't found a core that
meets all of these conditions. So I've decided to take part in the OR1K
"upgrade". Perhaps this means that the OR1200 FPU has to be redesigned ...
almost completely.

Andrey


Sébastien Bourdeauducq
2014-09-01 01:40:18 UTC
Post by BAndViG
Yes, I have. As I wrote previously, I ran "testfloat" on an SoC generated
for the Atlys board (based on a Xilinx Spartan-6 FPGA). In the SoC the core
runs at 50 MHz.
That's still slow, though I remembered it to be worse than that. A
reasonable frequency on Spartan-6 is 83MHz.
Post by BAndViG
Regarding the code. As I see you distribute your code under GPL v3. I
prefer BSD-like or LGPL-like to be able to link with proprietary
modules. So, I'm going to use your implementation for consultation only.
Well, the code I've released more recently is BSD, and I wouldn't mind
re-licensing that under BSD too...

Sébastien
BAndViG
2014-09-02 17:55:09 UTC
By the way, the longest post-PAR path reported by ISE is not related to the FPU:

Source: mor1kx0/mor1kx_cpu0/cappuccino.mor1kx_cpu/mor1kx_execute_ctrl_cappuccino/ctrl_alu_result_o_14 (FF)
Destination: mor1kx0/mor1kx_cpu0/cappuccino.mor1kx_cpu/mor1kx_fetch_cappuccino/icache_gen.mor1kx_icache/way_memories[0].way_data_ram/Mram_mem1 (RAM)
Requirement: 20.000ns
Data Path Delay: 19.281ns (Levels of Logic = 10)

The same goes for the other long paths.
So the 50 MHz clock isn't limited by the FPU logic.
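
The relation between these figures is simple arithmetic: the 20.000 ns requirement is the 50 MHz clock period, and the reported path delay bounds the achievable clock. A rough C sketch that ignores clock skew and uncertainty (function names are made up):

```c
/* Upper bound on the clock frequency in MHz for a critical path delay in ns. */
static double fmax_mhz(double path_delay_ns)
{
    return 1000.0 / path_delay_ns;
}

/* Timing slack in ns: positive means the constraint is met. */
static double slack_ns(double requirement_ns, double path_delay_ns)
{
    return requirement_ns - path_delay_ns;
}
```

For the path above this gives roughly 51.9 MHz with about 0.72 ns of slack, so the design sits close to the 50 MHz constraint.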

Andrey


Stefan Kristiansson
2014-09-03 02:56:18 UTC
Post by BAndViG
Source: mor1kx0/mor1kx_cpu0/cappuccino.mor1kx_cpu/mor1kx_execute_ctrl_cappuccino/ctrl_alu_result_o_14 (FF)
Destination: mor1kx0/mor1kx_cpu0/cappuccino.mor1kx_cpu/mor1kx_fetch_cappuccino/icache_gen.mor1kx_icache/way_memories[0].way_data_ram/Mram_mem1 (RAM)
Requirement: 20.000ns
Data Path Delay: 19.281ns (Levels of Logic = 10)
The same goes for the other long paths.
So the 50 MHz clock isn't limited by the FPU logic.
Well, that path is probably caused by the FPU logic, which goes
through ctrl_alu_result.

Stefan
BAndViG
2014-09-02 18:12:22 UTC
Status update.

After the f2i conversion fix, the comparisons (eq/le/lt) became operational
(indirectly fixed :)). No further source update was required.

The remaining failing tests are multiplication and division.

I have a question about the FPU functionality implemented in OR-1200 (I
haven't found the information in the available documentation).
Should the OR-1200 FPU support denormalized numbers?
I found that these numbers were processed incorrectly in f2i (I fixed it,
because it wasn't difficult). Now I see that the first error in the
multiplier is again due to a denormalized input:

test #584
operand_a: 0x00000001 operand_b: 0x41FFFE20
expected: value: 0x00000020 flags: ...ux
output: value: 0x0007FFF1 flags: ....
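
The expected value can be reproduced on a host FPU. Here 0x00000001 is the smallest denormal (2^-149) and 0x41FFFE20 is about 31.999, so the product rounds to 32 * 2^-149 = 0x00000020 and is both tiny and inexact, which matches the "ux" flags. This is a sketch that assumes an IEEE 754 host in round-to-nearest mode without flush-to-zero. The helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as an IEEE 754 single and back. */
static float bits_to_f32(uint32_t b)
{
    float f;
    memcpy(&f, &b, sizeof f);
    return f;
}

static uint32_t f32_to_bits(float f)
{
    uint32_t b;
    memcpy(&b, &f, sizeof b);
    return b;
}

/* Hypothetical helper: multiply two float32 bit patterns on the host FPU.
 * volatile keeps the compiler from folding or widening the multiplication. */
static uint32_t f32_mul_bits(uint32_t a, uint32_t b)
{
    volatile float x = bits_to_f32(a);
    volatile float y = bits_to_f32(b);
    volatile float z = x * y;
    return f32_to_bits(z);
}
```

Calling f32_mul_bits(0x00000001, 0x41FFFE20) on such a host yields 0x00000020, the value testfloat expects.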

Andrey



BAndViG
2014-09-14 10:35:14 UTC
Almost all bugs are fixed. The only remaining difference between 'testfloat'
and the hardware is the generation of the underflow flag.
The latest upstream (openrisc/mor1kx) commit (4f9d080 of 14-sep-2014) is
merged into my FPU branch (https://github.com/bandvig/mor1kx/tree/withfpu).
The resulting state (commit 1315933) has been tagged 'fpu32_v1.0'.

WBR
Andrey

Sébastien Bourdeauducq
2014-09-14 14:57:42 UTC
Cool! What are the area and frequency? How many cycles do the floating
point operations take?

Sébastien
Post by BAndViG
Almost all bugs are fixed. The only remaining difference between
'testfloat' and the hardware is the generation of the underflow flag.
The latest upstream (openrisc/mor1kx) commit (4f9d080 of 14-sep-2014)
is merged into my FPU branch
(https://github.com/bandvig/mor1kx/tree/withfpu). The resulting state
(commit 1315933) has been tagged 'fpu32_v1.0'.
WBR
Andrey
BAndViG
2014-09-14 18:41:16 UTC
I didn't synthesize the FPU separately. I think that figures for a stand-alone
design are useful only if we are going to build a macro-cell. Otherwise, most
of the data delay is introduced not by logic but by routing. For example, the
longest FPU-related path reported by post-PAR static timing is:

Source: clkgen0/wb_rst_shr_15 (FF)
Destination: mor1kx0/mor1kx_cpu/cappuccino.mor1kx_cpu/
mor1kx_execute_alu/fpu_enabled_in_execute_alu.fpu_arith/
fpu_post_norm_mul/s_frac2a_34 (FF)
Requirement: 20.000ns
Data Path Delay: 18.718ns (Levels of Logic = 1)
...
Total 18.718ns (1.126ns logic, 17.592ns route)
(6.0% logic, 94.0% route)

94% is routing! But the constraint (50 MHz) is satisfied, so the router
doesn't have to do anything more.
By the way, the 50 MHz wasn't set by me. It is the default value for the
Atlys SoC.
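
The numbers in the report can be cross-checked with a few lines of arithmetic. This is only a sketch using the figures quoted above; the frequency it prints is what this single path would allow and ignores clock skew, uncertainty and every other path in the design.

```python
# Figures copied from the post-PAR timing report quoted above.
logic_ns = 1.126
route_ns = 17.592
requirement_ns = 20.000  # period of the 50 MHz constraint

total_ns = logic_ns + route_ns
route_share = route_ns / total_ns
slack_ns = requirement_ns - total_ns
# Frequency this one path would permit if nothing else limited the design.
path_fmax_mhz = 1000.0 / total_ns

print(f"total path delay:  {total_ns:.3f} ns")
print(f"routing share:     {route_share:.1%}")
print(f"slack:             {slack_ns:.3f} ns")
print(f"path-limited fmax: {path_fmax_mhz:.1f} MHz")
```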

About cycle counts. As the FPU is just a VHDL->Verilog conversion of FPU100
(http://opencores.org/project,fpu100), the cycle counts equal those of the
original design:
Add/Sub: 7
Mul with serial implementation: 35 (now implemented for OR1200 and mor1kx)
Mul with 'parallel' implementation: 12 (implemented for original project
only)
Div: 35 (serial implementation only)
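
To put these counts in terms of time, here is a rough sketch that converts them to latency at the 50 MHz Atlys clock. It assumes the figures are end-to-end latencies of a single operation; the dictionary keys and the `latency_ns` helper are names made up for this illustration.

```python
CLOCK_MHZ = 50.0  # default Atlys SoC clock mentioned in this thread

# Cycle counts as listed above.
cycles = {
    "add/sub": 7,
    "mul (serial)": 35,
    "mul (parallel)": 12,
    "div (serial)": 35,
}

def latency_ns(n_cycles, clock_mhz=CLOCK_MHZ):
    """Latency in nanoseconds of an operation taking n_cycles clock cycles."""
    return n_cycles * 1000.0 / clock_mhz

for op, n in cycles.items():
    print(f"{op:15s} {n:3d} cycles  {latency_ns(n):6.1f} ns")
```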

By the way, the original design was able to run at 100 MHz when
synthesized alone for a Cyclone I EP1C6Q240C.
Altera Quartus II v.5 reported the following numbers of logic elements:
Addition unit: 684
Multiplication unit: 1530 (!!! parallel !!! The serial one implemented for
OR1K should be smaller.)
Division unit: 928
Square-root unit: 919 (not ported to OR1K)
Top unit: 326
_______________________________
Total: 4387
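
As a quick sanity check on the table, the sketch below sums the per-unit figures and also shows the total without the square-root unit, which was not ported. Note the assumption: these are the original FPU100 numbers with the parallel multiplier, so they are not a measurement of the OR1K port.

```python
# Logic-element counts from the table above (original FPU100, Cyclone I).
logic_elements = {
    "add": 684,
    "mul_parallel": 1530,
    "div": 928,
    "sqrt": 919,   # not ported to OR1K
    "top": 326,
}

total = sum(logic_elements.values())
without_sqrt = total - logic_elements["sqrt"]
mul_share = logic_elements["mul_parallel"] / total

print(f"total:               {total}")
print(f"total without sqrt:  {without_sqrt}")
print(f"multiplier share:    {mul_share:.1%}")
```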

I don't see any reason why the converted FPU itself should be slower.

Andrey


Sébastien Bourdeauducq
2014-09-15 01:56:48 UTC
Permalink
Post by BAndViG
I didn't synthesize the FPU separately. I think standalone figures
are useful only if we are going to build a macro-cell.
They are useful to estimate the impact that design options have on
performance and resource utilisation, e.g. to choose whether to use your
FPU or software floating point. The increase in resource usage of mor1kx
after the FPU is enabled is definitely a relevant number.
Post by BAndViG
94% is routing! But the constraint (50 MHz) is satisfied, so the router
doesn't have to do anything more. By the way, the 50 MHz wasn't set by me.
It is the default value for the Atlys SoC.
Ok, that's rather slow... have you tried pushing it?

If you believe the Opencores description, you should be surprised that
the Cyclone-I, a low-end FPGA from the early 2000s, would be twice as
fast as the Spartan-6, a low-end FPGA from the late 2000s.
Post by BAndViG
About cycle counts. As the FPU is just a VHDL->Verilog conversion of FPU100
(http://opencores.org/project,fpu100), the cycle counts equal those of the
original design:
Add/Sub: 7
Mul with serial implementation: 35 (now implemented for OR1200 and mor1kx)
Serial multipliers often do not make sense anymore as modern FPGAs have
dedicated "DSP" blocks that can do multiplications efficiently with only
a few cycles of latency.
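
To illustrate why a serial multiplier is slow, here is a generic shift-and-add sketch, not the FPU100 code. It handles one multiplier bit per loop iteration, which in hardware corresponds to one clock cycle. A single-precision significand is 24 bits wide including the hidden bit, so the multiply alone needs 24 iterations; how the remaining cycles of the quoted 35 are spent is not stated in this thread.

```python
def serial_multiply(a, b, width=24):
    """Shift-and-add multiply of two unsigned width-bit integers.

    Returns (product, iterations). Each iteration stands in for one
    clock cycle of a bit-serial hardware multiplier.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    iterations = 0
    for bit in range(width):
        if (b >> bit) & 1:       # examine one multiplier bit per cycle
            product += a << bit  # add the shifted multiplicand
        iterations += 1
    return product, iterations

# Significands of 1.0 and 1.5 with the hidden bit made explicit.
product, n = serial_multiply(0x800000, 0xC00000)
print(hex(product), n)
```

A DSP block computes the same product combinationally or with a short pipeline, which is where the latency difference comes from.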
Post by BAndViG
Mul with 'parallel' implementation: 12 (implemented for original
project only)
Div: 35 (serial implementation only)
By the way, the original design was able to run at 100 MHz when
synthesized alone for a Cyclone I EP1C6Q240C.
That's what the Opencores description says, but most of what is claimed
on Opencores is very optimistic, to say the least.
Post by BAndViG
Altera Quartus II v.5 reported the following numbers of logic
elements:
Addition unit: 684
Multiplication unit: 1530 (!!! parallel !!! The serial one implemented for
OR1K should be smaller.)
On an FPGA with hard multipliers, both should have approximately the same
size. However, the serial one will have those ludicrous 35 cycles of
delay...
Post by BAndViG
Division unit: 928
Square-root unit: 919 (not ported to OR1K)
Top unit: 326
_______________________________
Total: 4387
Bloated. That's more than mor1kx itself, which already isn't quite
resource-efficient. That, combined with low performance, is one of the
typical plagues that frustratingly make most Opencores projects useless...

Sébastien
Stefan Kristiansson
2014-09-15 05:21:07 UTC
Permalink
Post by Sébastien Bourdeauducq
Post by BAndViG
Altera Quartus II v.5 reported the following numbers of logic
elements:
Addition unit: 684
Multiplication unit: 1530 (!!! parallel !!! The serial one implemented for
OR1K should be smaller.)
On an FPGA with hard multipliers, both should have approximately the same
size. However, the serial one will have those ludicrous 35 cycles of
delay...
I think that an implementation with a hard multiplier will be smaller
than a serial one.
Post by Sébastien Bourdeauducq
Post by BAndViG
Division unit: 928
Square-root unit: 919 (not ported to OR1K)
Top unit: 326
_______________________________
Total: 4387
Bloated. That's more than mor1kx itself, which already isn't quite
resource-efficient. That, combined with low performance, is one of the
typical plagues that frustratingly make most Opencores projects useless...
Sure, but in my world the design cycles are:
make it work, make it fast, then make it small.
mor1kx is, on an overall level, at the 'make it small' stage, and
there's still a lot to do in that area.
As you know, we've made some good progress there lately (about a
33% size decrease in a minimal mor1kx cappuccino setup).
I think it's a mistake to have a mindset that everything that is in
any of the earlier design stages should automatically be discarded as
"useless" and that it's then better to start off from scratch with
something of your own (regardless of how tempting that might be ;)).
That said, there are of course cases where you come to a point where
you realize that things are beyond repair (or the effort to repair
becomes too large); that was one of the reasons for mor1kx (another was
the license of or1200).

Stefan
