[OpenRISC] mor1kx and counting clock cycles?

Discussion:

Julius Baxter

2014-09-07 23:06:45 UTC

From: "Julius Baxter" <***@gmail.com>
Date: Sep 8, 2014 12:04 AM
Subject: Re: mor1kx and counting clock cycles?

Hi Julius,
I discovered the amazing OpenRISC, and more specifically mor1kx, when
browsing the internet for open source cores.

I managed to use 'fusesoc' to simulate, using verilator, a SoC with a
mor1kx processor.

I have a set of C DSP kernels that I compiled to the OR1000 ISA using
the or1k-gnu toolchain.

I run the compiled kernels on the simulated mor1kx using the following
command:
fusesoc sim --sim verilator mor1kx-generic --elf-load /home/ricardo/mintsoc/tests/vltsim-test/dsp_mul32

Success! Got NOP_EXIT. Exiting (84104)
Simulation ended at PC = 000031b4 (84105)
At this point I'm so impressed that there is such a thing as an open
source SoC that works, that I want to include metrics taken on the 'mor1kx'
cores in a report that I'm doing as part of my PhD work.

I just need to know how I can find out the cycles taken by the 'mor1kx'
core for executing each of my compiled kernels.

Is there an OpenRISC instruction (that accesses some special register?) I
can use to know how many clock cycles were used until the end of the
execution of each test kernel?

Or is there any other way to get this information? (Maybe verilator can
report cycle counts to me... I'm not an expert in verilator.)

Hi Ricardo,
That's a very good question. Basically there isn't an official
instruction, but I think it would be a very good exercise to add the
instrumentation to take cycle counts. It could be done via the
l.nop (no-op) instructions, which allow an arbitrary immediate to be passed
in their encoding, which the simulated model and its infrastructure can
then do something based upon.

I believe or1ksim might already have support for some cycle counter l.nop
instructions, and the l.nop immediate is likely already defined (check an
OR1k spr-defs.h file). So you could certainly add such a feature to the
testbench in fusesoc to enable cycle counters.

I'm CCing the OpenRISC lists as it might be the case that some verilog
testbench infrastructure, like you want to add, might already exist.

Best regards and good luck with this very fine project!

Thanks for the kind words!
Cheers
Julius

Stefan Kristiansson

2014-09-08 06:49:14 UTC

I just need to know how I can find out the cycles taken by the 'mor1kx' core
for executing each of my compiled kernels.
Is there an OpenRISC instruction (that accesses some special register?) I
can use to know how many clock cycles were used until the end of the
execution of each test kernel?
Or is there any other way to get this information? (Maybe verilator can
report cycle counts to me... I'm not an expert in verilator.)

If the main interest is to get the cycle count in verilator
simulations, then it's easy to extract from this loop:
https://github.com/openrisc/orpsoc-cores/blob/master/systems/mor1kx-generic/bench/verilator/tb.cpp#L91-L109
Each iteration is a clock edge, so to get the total number of cycles that a
full simulation takes, just divide the number of iterations by 2.
If you need more fine-grained measurements, then l.nop hacks,
as Julius mentioned, are probably the way to go.
It should be straightforward to use the NOP_EXIT logic in that file
as a template for that.

(HINT: tbUtils->getTime() actually returns the number of iterations of
that loop and not any actual "time", so you can use that as the counter)

Stefan

Stefan Kristiansson

2014-09-08 18:43:06 UTC

Thanks! :-)
Success! Got NOP_EXIT. Exiting (608812)
Simulation ended at PC = 000031dc (608813)
So basically I just have to divide 608813 by 2 to get the clock cycles?
Does this provide a cycle accurate result?

Yes, this is the accurate number of cycles (the divided by 2 result)
of the entire simulation run.
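
As a small illustration of the conversion discussed here: the testbench
counter advances once per clock edge, so the cycle count is the counter
value divided by two. The helper name below is made up for this sketch and
is not part of tb.cpp; it assumes the value passed in is the counter that
the testbench prints in parentheses.

```c
#include <stdint.h>

/* Hypothetical helper, not part of tb.cpp: converts the testbench
 * iteration count (one iteration per clock edge) into whole clock cycles. */
uint64_t edges_to_cycles(uint64_t edges)
{
    /* Two clock edges per cycle; a trailing half cycle is dropped. */
    return edges / 2u;
}
```

For the run quoted above, edges_to_cycles(608813) gives 304406 cycles.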

What is the difference between doing this and "more fine-grained
measurements" using the l.nop hacks?
If the other method is already cycle accurate then what would I gain by
going with the second method?

I just meant that if you for instance wanted to measure the number of
clock cycles spent in some particular area of your code instead of the
total sum of cycles, then you would have to add some extra
functionality.
I.e. let's say you wanted to measure the number of cycles spent in foo(),
then you could do:
asm("l.nop 5");
foo();
asm("l.nop 6");
under the assumption that you add the needed functionality for
NOP_CNT_RESET and NOP_GET_TICKS in tb.cpp
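
The testbench side of this idea could look roughly like the sketch below.
It is written as plain C so it can be tried on a host machine, and it is
not taken from tb.cpp: the struct, the function name and the way the l.nop
immediate reaches the handler are all hypothetical. The values 5 and 6 are
assumed to match NOP_CNT_RESET and NOP_GET_TICKS, as the snippet above
implies.

```c
#include <stdint.h>

/* l.nop immediates as used in the snippet above (assumed to match
 * NOP_CNT_RESET and NOP_GET_TICKS in an OR1K spr-defs.h). */
enum { NOP_CNT_RESET = 5, NOP_GET_TICKS = 6 };

/* Hypothetical state kept by the testbench between the two l.nops. */
struct cycle_probe {
    uint64_t start_edges; /* edge counter at the last NOP_CNT_RESET */
    uint64_t last_cycles; /* result of the last NOP_GET_TICKS */
};

/* Call whenever the testbench observes an l.nop with immediate `imm`.
 * `now_edges` is the loop iteration counter (one iteration per clock edge).
 * Returns 1 if a new measurement was stored in p->last_cycles, else 0. */
int cycle_probe_on_nop(struct cycle_probe *p, uint32_t imm, uint64_t now_edges)
{
    switch (imm) {
    case NOP_CNT_RESET:
        p->start_edges = now_edges;
        return 0;
    case NOP_GET_TICKS:
        /* Two edges per clock cycle. */
        p->last_cycles = (now_edges - p->start_edges) / 2u;
        return 1;
    default:
        return 0; /* other l.nop immediates are not handled here */
    }
}
```

A reset seen at edge 1000 followed by a get-ticks at edge 3000 would store
1000 cycles in last_cycles. The measured region includes whatever the
pipeline does around the two l.nop instructions themselves.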

If the total sum of cycles is what you need, then the number above is
sufficient.

Stefan

Stefan Kristiansson

2014-09-09 03:11:56 UTC

Another question that maybe is not that smart... but I'm a little new to
this hardware simulation stuff.
Can I also use fusesoc (with some adaptation) to work with MiSoC?
https://github.com/m-labs/misoc

migen/misoc and fusesoc/orpsoc-cores do in some areas try to solve
the same problem, so using them together is perhaps not the right way
forward.
Adapting the verilator testbench from the orpsoc-cores testbenches to
misoc is probably a more reasonable thing to do.

I would also like to be able to use verilator to simulate the LatticeMico32
softcore.

Sebastien might have some insight on how to best get a misoc based
verilator simulation running.

I also think Jose and his associates did some tests and comparisons on
both OpenRISC (mor1kx/or1200) and LatticeMico32.
I know that he ran verilator on the openrisc side at least, and I
wouldn't be surprised if they did the same with LM32.

I added Sebastien and Jose to CC to make them (more) aware of this conversation.

Stefan

Sébastien Bourdeauducq

2014-09-09 03:28:05 UTC

I would love to be able to benchmark the LatticeMico32.

Don't forget to include the LUT count, something that your simulations
will not show.

Stefan Kristiansson

2014-09-10 09:34:12 UTC

Fortunately I bought a DE0-NANO some time ago, which to my luck is
compatible with MiSoC.
I will test the LM32 there ;-)
I'm a little bit confused about some stuff.
What is the number of pipeline stages of the ESPRESSO, PRONTO ESPRESSO and
CAPPUCCINO?

cappuccino has a 6 stage pipeline (addr, fetch, decode, execute, mem, wb)
I'm pretty sure (Julius should be certain) that (pronto)espresso has a
2 stage pipeline (fetch, execute)

In some online posts people say the ESPRESSO and PRONTO ESPRESSO have a
2-stage pipeline, and in other posts they say they have a 3-stage pipeline.
How many stages do these mor1kx cores have in the implementations from the
'orpsoc-cores' github repository?
Where can I see that information?

BAndViG

2014-09-14 18:49:41 UTC

By the way, I ran coremark today on the atlys board SoC (latest mor1kx
cappuccino pipeline) and measured 99.800399 Iterations/sec, which means
99.800399 / 50 MHz = 1.996 CoreMark/MHz. What exactly is the value for LM32?

Andrey

-----Original Message-----
From: Sébastien Bourdeauducq
Sent: Tuesday, September 09, 2014 7:28 AM
To: Ricardo Nobre ; Stefan Kristiansson
Cc: openrisc ; ***@lists.opencores.org ; Julius Baxter ; Jose de Sousa
Subject: Re: [OpenRISC] mor1kx and counting clock cycles?

Sébastien Bourdeauducq

2014-09-15 02:16:41 UTC

Post by BAndViG
By the way, I ran coremark today on the atlys board SoC (latest mor1kx
cappuccino pipeline) and measured 99.800399 Iterations/sec, which means
99.800399 / 50 MHz = 1.996 CoreMark/MHz. What exactly is the value for LM32?

I've done tests in May/June with LM32 and mor1kx at 83 1/3MHz, and I
measured 133 iterations/s in both cases, so the score I got was 1.6.
This is my mor1kx configuration:
https://github.com/m-labs/misoc/blob/master/misoclib/mor1kx/__init__.py

This is slower than yours - are you using a faster configuration, or did
mor1kx improve in the meantime?
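
For reference, the scores compared in this exchange are iterations per
second divided by the clock frequency in MHz. The function below is only a
sketch of that arithmetic; its name is made up here and it is not part of
any CoreMark tooling.

```c
/* CoreMark/MHz as used in this thread: iterations per second divided by
 * the clock frequency in MHz. Hypothetical helper for illustration only. */
double coremark_per_mhz(double iterations_per_sec, double clock_mhz)
{
    return iterations_per_sec / clock_mhz;
}
```

With the numbers from this thread, 99.800399 iterations/s at 50 MHz gives
about 1.996, and 133 iterations/s at 83 1/3 MHz gives about 1.596, which
rounds to the quoted 1.6.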

Sébastien

Stefan Kristiansson

2014-09-15 06:19:22 UTC

I've done tests in May/June with LM32 and mor1kx at 83 1/3MHz, and I
measured 133 iterations/s in both cases, so the score I got was 1.6.
https://github.com/m-labs/misoc/blob/master/misoclib/mor1kx/__init__.py
This is slower than yours - are you using a faster configuration, or did
mor1kx improve in the meantime?

Compiler flags might make a difference; IIRC I got 142
iterations/sec with your setup/SoC and -O2 -mhard-mul -mhard-div.
Also, from prior tests I have done, the memory footprint of coremark
is somewhere between 4K and 8K, so you will notice increases in the
coremark score by increasing the cache size up to 8K.
So to get fair comparison results, you'd need to increase the cache
size from 4K -> 8K; unfortunately it seems that LM32 can't meet
timing in your (ppro) SoC when doing that (mor1kx does).

coremark numbers aside, there are a couple of features that mor1kx
has and lm32 lacks that could be an advantage performance-wise in
other situations.
1) Support for wrapping burst cache refills. I.e. it will stall until
the whole cache line is filled, but rather proceed when the requested
address is fetched from memory.
2) Store buffer. Both LM32 and mor1kx have write-through caches, but
mor1kx will not stall until the memory access has finished if the
store buffer is enabled.
3) Multiway (>2) caches with LRU replacement strategy.

Stefan

Stefan Kristiansson

2014-09-15 06:23:20 UTC

On Mon, Sep 15, 2014 at 9:19 AM, Stefan Kristiansson

Post by Stefan Kristiansson
1) Support for wrapping burst cache refills. I.e. it will stall until
the whole cache line is filled, but rather proceed when the requested
address is fetched from memory.

That should of course have been: "it (mor1kx) will *not* stall"
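
To make the wrapping burst idea concrete, here is a sketch of the order in
which such a refill could fetch the words of a cache line: it starts at the
requested word and wraps around at the line boundary, so the word the CPU
is waiting for arrives first. The function name and parameters are
illustrative only and are not taken from the mor1kx sources; line and word
sizes are assumed to be powers of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration of wrapping-burst ordering.
 * Writes the word addresses of one line refill into out[], starting at the
 * word that contains req_addr and wrapping at the line boundary.
 * Returns the number of addresses written, or 0 if out[] is too small. */
size_t wrap_burst_order(uint32_t req_addr, uint32_t line_bytes,
                        uint32_t word_bytes, uint32_t *out, size_t out_len)
{
    uint32_t words = line_bytes / word_bytes;
    uint32_t base  = req_addr & ~(line_bytes - 1u);               /* line start */
    uint32_t first = (req_addr & (line_bytes - 1u)) / word_bytes; /* word index */

    if (out_len < words)
        return 0;

    for (uint32_t i = 0; i < words; i++)
        out[i] = base + ((first + i) % words) * word_bytes;

    return words;
}
```

For a 32-byte line, 4-byte words and a request for address 0x1014, this
yields 0x1014, 0x1018, 0x101c, 0x1000, 0x1004, 0x1008, 0x100c, 0x1010.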

Stefan

BAndViG

2014-09-16 19:09:40 UTC

My configuration actually wasn't created by me. It is the default configuration
for the Atlys SoC (I haven't changed it). The configuration is:

.FEATURE_DEBUGUNIT("ENABLED"),
.FEATURE_CMOV("ENABLED"),
.FEATURE_INSTRUCTIONCACHE("ENABLED"),
.OPTION_ICACHE_BLOCK_WIDTH(5),
.OPTION_ICACHE_SET_WIDTH(8),
.OPTION_ICACHE_WAYS(4),
.OPTION_ICACHE_LIMIT_WIDTH(32),
.FEATURE_IMMU("ENABLED"),
.OPTION_IMMU_SET_WIDTH(7),
.FEATURE_DATACACHE("ENABLED"),
.OPTION_DCACHE_BLOCK_WIDTH(5),
.OPTION_DCACHE_SET_WIDTH(8),
.OPTION_DCACHE_WAYS(4),
.OPTION_DCACHE_LIMIT_WIDTH(31),
.FEATURE_DMMU("ENABLED"),
.OPTION_DMMU_SET_WIDTH(7),
.OPTION_PIC_TRIGGER("LATCHED_LEVEL"),
.IBUS_WB_TYPE("B3_REGISTERED_FEEDBACK"),
.DBUS_WB_TYPE("B3_REGISTERED_FEEDBACK"),
.OPTION_CPU0("CAPPUCCINO"),
.OPTION_RESET_PC(32'hf0000100),
.FEATURE_MULTIPLIER("THREESTAGE")

FEATURE_DIVIDER = "SERIAL", // default setting for mor1kx

Compiler flags: -O2 -funroll-loops -fgcse-sm -mboard=atlys -mhard-div -mhard-mul -DPERFORMANCE_RUN=1

At least the cache configuration is more powerful (I don't know about
the multiplier & divider implementation in LM32). In addition to Stefan's
list of mor1kx features, mor1kx includes a branch prediction engine as a
default, non-configurable part. Could LM32 be configured with such a
feature?
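
As a rough check of the cache configuration listed above, the sketch below
computes a cache size from the three parameters. It assumes that
BLOCK_WIDTH is log2 of the block size in bytes and SET_WIDTH is log2 of the
number of sets; that reading of the parameters is an assumption here, and
the function name is made up for the example.

```c
#include <stdint.h>

/* Hypothetical helper. Assumes block_width = log2(bytes per block) and
 * set_width = log2(number of sets); `ways` is the associativity. */
uint32_t cache_size_bytes(unsigned block_width, unsigned set_width,
                          unsigned ways)
{
    return ((uint32_t)1 << (block_width + set_width)) * ways;
}
```

Under that assumption, BLOCK_WIDTH 5, SET_WIDTH 8 and 4 ways give 32768
bytes (32 KiB) for each of the two caches.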

However, I don't think that it is unfair. If somebody wants to achieve
a better score, he/she has to pay for it. Personally, I would be happy with
a coremark score in the region of 2.4-2.7 (more is better for me). I foresee
that a pipeline with such performance could be even more bloated :).

P.S. Talking about the 50 MHz constraint: it looks like nobody has tried to
set it faster and play with the router's settings to find the corner value.

Andrey

-----Original Message-----
From: Sébastien Bourdeauducq
Sent: Monday, September 15, 2014 6:16 AM
To: BAndViG ; Ricardo Nobre ; Stefan Kristiansson ; List on OpenRISC.net ;
***@lists.opencores.org ; Julius Baxter ; Jose de Sousa
Subject: Re: [OpenRISC] mor1kx and counting clock cycles?