Discussion:
[OpenRISC] [RFC] or1k atomic operation instructions
Stefan Kristiansson
2014-04-10 20:09:46 UTC
So, I brought up the issue of the OpenRISC 1000 ISA missing support for atomic
operations for the v1.0 architecture proposals and we had a quick discussion
about it at orconf2012. The conclusion then was that we needed something
more concrete than just "we should add it" and I promised to look into it.
As I always try to keep my promises, albeit sometimes slowly, here we are.
(In all honesty the real reason why I started to look into this is that I've
started a musl port, and to forge on with the 'atomic syscall' path with that
just felt wrong.)

Anyways, I've done some research since, and I discovered that at one point
in history (up to around 2005), atomic operation instructions were
described (although rather vaguely) in the arch specification.
I found the old arch specs here:
http://opencores.org/websvn,listing?repname=or1k&path=%2For1k%2Ftrunk%2Fdocs%2F#path_or1k_trunk_docs_

To be more precise, this was what was said about atomicity:
"7.3 Atomicity
A memory access is atomic if it is always performed in its entirety with no
visible fragmentation. Atomic memory accesses are specifically required to
implement software semaphores and other shared structures in systems where two
different processes on the same processor, or two different processors in a
multiprocessor environment, access the same memory location with intent to
modify it.
The OpenRISC 1000 architecture provides two dedicated instructions that together
perform an atomic read-modify-write operation.
l.lwa rD, I(rA)
l.swa I(rA), rB
Instruction l.lwa loads single word from memory, creating a reservation for a
subsequent conditional store operation. A special register, invisible to the
programmer, is used to hold the address of the memory location, which is used in
the atomic read-modify-write operation.
The reservation for a subsequent l.swa is cancelled if another master reads the
same memory location (snoop hit), another l.lwa is executed or if the software
explicitly clears the reservation register.
If a reservation is still valid when the corresponding l.swa is executed, l.swa
stores general-purpose register rB into the memory.
If reservation was cancelled, l.swa is executed as no operation."

There are a couple of things that are left undefined in the text above,
but it gave me a base to start off from (if nothing else, the names of the
instructions).
What I am proposing is to revise the text above and bring it back into the arch
specification.
And I am proposing that the revised text should contain the following bullet
points (with remarks and opening for discussions):

- A load link can be broken by either:
1) another l.lwa instruction
2) another l.swa instruction
3) another store to the linked address
4) a context switch (exception)

- The granularity of the link is a word.
(I'm certainly open for discussions on this one, e.g. a cacheline could make
sense too)

- The result (1 for success and 0 for fail) of the store conditional is stored
in the source register of the l.swa instruction.
I.e. 'rB' in 'l.swa I(rA), rB'.
(I was in two minds about choosing the flag bit, the carry bit or
the l.swa source register. The reason I chose the register is that the
flag is easily a critical path in the RTL implementations, and the carry bit
requires l.addc, which isn't always included (despite being a mandatory
instruction).)
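
For illustration, the link-breaking rules in the bullet points above can be
sketched as a small single-core Python model. This is only a sketch: the
class and method names are made up here and are not taken from or1ksim or
binutils, memory is a plain dict, and the return value of swa() stands in
for the 1/0 result that the proposal writes back to rB.

```python
WORD_MASK = ~0x3  # word granularity: ignore the two low address bits


class LinkModel:
    """Toy single-core model of the proposed link rules (hypothetical names)."""

    def __init__(self):
        self.mem = {}
        self.link = None  # linked word address, or None when no link is held

    def lwa(self, addr):
        # Rule 1: a new l.lwa replaces (breaks) any earlier link.
        self.link = addr & WORD_MASK
        return self.mem.get(addr & WORD_MASK, 0)

    def store(self, addr, value):
        # Rule 3: an ordinary store to the linked word breaks the link.
        if self.link == (addr & WORD_MASK):
            self.link = None
        self.mem[addr & WORD_MASK] = value

    def exception(self):
        # Rule 4: a context switch (exception) breaks the link.
        self.link = None

    def swa(self, addr, value):
        # Succeeds only if the link is intact and matches the address.
        ok = self.link is not None and self.link == (addr & WORD_MASK)
        if ok:
            self.mem[addr & WORD_MASK] = value
        # Rule 2: any l.swa consumes the link, whether or not it succeeded.
        self.link = None
        return 1 if ok else 0
```

With this model, an lwa() followed directly by swa() to the same word
returns 1, while an intervening exception() or store() to that word makes
the swa() return 0 and leave memory untouched.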

As a proof of concept, I've implemented the behaviour described above in
binutils and or1ksim and wrote a set of tests for or1ksim to verify its
behaviour.
The 6-bit opcode I used for l.lwa is 0x1b and the opcode for l.swa is 0x33.
I'm not going to post the patches for binutils and or1ksim just yet,
because I wanted to let people raise their voices before doing that.
But if there are no objections to what I propose here, I'll probably move
forward with what I have.
However, if anyone is interested in looking at the patches anyway, I've
made them public here:
https://github.com/skristiansson/or1k-src/commit/e7d1ef5f9c2f698f4e41cd6e3e739df1201fe18c
(The binutils patch still needs some work, the cgen simulator will not grok
the atomicity property of them like that)
and here:
https://github.com/skristiansson/or1ksim/commit/3afc310f4e6c7aa50dd823ab36e2fb365a1a0de7

Stefan
Peter Gavin
2014-04-10 20:47:55 UTC
Hi Stefan,

I'd say this looks good.

On Thu, Apr 10, 2014 at 4:09 PM, Stefan Kristiansson
Post by Stefan Kristiansson
- The granularity of the link is a word.
(I'm certainly open for discussions on this one, e.g. a cacheline could make
sense too)
I think a single word is fine. Doing the whole cacheline would be more
complex, wouldn't it? Plus if the whole cache line was linked it would
mean code that uses atomic instructions needs to know the cache line size,
which might be annoying.
Post by Stefan Kristiansson
- The result (1 for success and 0 for fail) of the store conditional is stored
in the source register of the l.swa instruction.
I.e. 'rB' in 'l.swa I(rA), rB'.
(I was in a split mind between choosing the flag bit, the carry bit or
the l.swa source register. The reason I choose the register is because the
flag is easily a critical path in the rtl implementations, the carry bit
requires l.addc which isn't always included (despite being a mandatory
instruction))
The only nit I really have is that writing the result to a register doesn't
fit the rest of the ISA. Is the F flag timing path on the mor1kx tighter
than the forwarding path needed to pass the result of l.swa to earlier
pipeline stages would be? Because if the result can't be forwarded, there
would have to be a pipeline bubble. If either solution causes a pipeline
bubble, I'd prefer just to put the result in the F flag.

-Pete
Stefan Kristiansson
2014-04-11 02:33:17 UTC
Post by Peter Gavin
On Thu, Apr 10, 2014 at 4:09 PM, Stefan Kristiansson
Post by Stefan Kristiansson
- The result (1 for success and 0 for fail) of the store conditional is stored
in the source register of the l.swa instruction.
I.e. 'rB' in 'l.swa I(rA), rB'.
(I was in a split mind between choosing the flag bit, the carry bit or
the l.swa source register. The reason I choose the register is because the
flag is easily a critical path in the rtl implementations, the carry bit
requires l.addc which isn't always included (despite being a mandatory
instruction))
The only nit I really have is that writing the result to a register doesn't
fit the rest of the ISA. Is the F flag timing path on the mor1kx tighter
than the forwarding path needed to pass the result of l.swa to earlier
pipeline stages would be? Because if the result can't be forwarded, there
would have to be a pipeline bubble. If either solution causes a pipeline
bubble, I'd prefer just to put the result in the F flag.
Hmm, yeah, good point. I guess the only real way to find out is to
try, I'll experiment with both solutions.
I agree that the F flag would be more elegant from an ISA point of
view, and it also has the nice property that a rmw loop would be
tighter:
1:
l.lwa r3, 0(r4)
l.addi r3,r3,1
l.swa 0(r4), r3
l.bf 1b
l.nop
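
As a rough illustration of what such a loop does, here is a Python sketch of
the lwa/add/swa retry pattern. It assumes SR[F] is set when the conditional
store succeeds, so the loop repeats only while the flag is clear. The Core
class, its method names and the `interrupts` knob are invented for this
sketch and do not correspond to any real simulator API.

```python
class Core:
    """Minimal stand-in for a CPU with l.lwa/l.swa (hypothetical names)."""

    def __init__(self):
        self.mem = {}
        self.reserved = None  # address currently reserved, if any
        self.flag = False     # models SR[F]

    def lwa(self, addr):
        self.reserved = addr
        return self.mem.get(addr, 0)

    def swa(self, addr, value):
        # The store happens, and the flag is set, only with a valid reservation.
        self.flag = self.reserved == addr
        if self.flag:
            self.mem[addr] = value
        self.reserved = None

    def interrupt(self):
        # An exception between l.lwa and l.swa cancels the reservation.
        self.reserved = None


def atomic_increment(core, addr, interrupts=0):
    """Retry lwa/add/swa until the conditional store succeeds.

    `interrupts` injects that many exceptions, one per attempt, to force
    retries. Returns the number of attempts taken.
    """
    attempts = 0
    while True:
        attempts += 1
        value = core.lwa(addr)             # l.lwa  r3, 0(r4)
        value = (value + 1) & 0xFFFFFFFF   # l.addi r3, r3, 1
        if interrupts > 0:
            interrupts -= 1
            core.interrupt()
        core.swa(addr, value)              # l.swa  0(r4), r3
        if core.flag:                      # branch back only while SR[F] is clear
            return attempts
```

Calling atomic_increment(core, addr, interrupts=2) takes three attempts in
this model: the first two conditional stores fail because the reservation
was cancelled, and the third one commits the incremented value.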

Stefan
Peter Gavin
2014-04-11 02:40:29 UTC
On Thu, Apr 10, 2014 at 10:33 PM, Stefan Kristiansson
Post by Stefan Kristiansson
I agree that the F flag would be more elegant from an ISA point of
view, and it also have the nice property that a rmw loop would be
l.lwa r3, 0(r4)
l.addi r3,r3,1
l.swa 0(r4), r3
l.bf 1b
l.nop
Yes, that was my thought exactly. :)

-Pete
Stefan Kristiansson
2014-04-11 12:16:27 UTC
On Fri, Apr 11, 2014 at 5:33 AM, Stefan Kristiansson
Post by Stefan Kristiansson
Post by Peter Gavin
The only nit I really have is that writing the result to a register doesn't
fit the rest of the ISA. Is the F flag timing path on the mor1kx tighter
than the forwarding path needed to pass the result of l.swa to earlier
pipeline stages would be? Because if the result can't be forwarded, there
would have to be a pipeline bubble. If either solution causes a pipeline
bubble, I'd prefer just to put the result in the F flag.
Hmm, yeah, good point. I guess the only real way to find out is to
try, I'll experiment with both solutions.
Ok, I did some experimenting and using the F flag wasn't as bad as I
had expected; it actually turned out to be more straightforward than
using the source register.

So, F flag it is then.

Stefan
Stefan Kristiansson
2014-04-11 02:38:23 UTC
On Fri, Apr 11, 2014 at 5:33 AM, Stefan Kristiansson
Post by Stefan Kristiansson
Post by Peter Gavin
On Thu, Apr 10, 2014 at 4:09 PM, Stefan Kristiansson
Post by Stefan Kristiansson
- The result (1 for success and 0 for fail) of the store conditional is stored
in the source register of the l.swa instruction.
I.e. 'rB' in 'l.swa I(rA), rB'.
(I was in a split mind between choosing the flag bit, the carry bit or
the l.swa source register. The reason I choose the register is because the
flag is easily a critical path in the rtl implementations, the carry bit
requires l.addc which isn't always included (despite being a mandatory
instruction))
The only nit I really have is that writing the result to a register doesn't
fit the rest of the ISA. Is the F flag timing path on the mor1kx tighter
than the forwarding path needed to pass the result of l.swa to earlier
pipeline stages would be? Because if the result can't be forwarded, there
would have to be a pipeline bubble. If either solution causes a pipeline
bubble, I'd prefer just to put the result in the F flag.
Hmm, yeah, good point. I guess the only real way to find out is to
try, I'll experiment with both solutions.
I agree that the F flag would be more elegant from an ISA point of
view, and it also have the nice property that a rmw loop would be
l.lwa r3, 0(r4)
l.addi r3,r3,1
l.swa 0(r4), r3
l.bf 1b
l.nop
err, I of course meant l.bnf here ;)

Stefan
Geert Uytterhoeven
2014-04-11 09:26:39 UTC
Hi Peter,
Post by Peter Gavin
On Thu, Apr 10, 2014 at 4:09 PM, Stefan Kristiansson
Post by Stefan Kristiansson
- The granularity of the link is a word.
(I'm certainly open for discussions on this one, e.g. a cacheline could make
sense too)
I think a single word is fine. Doing the whole cacheline would be more
complex, wouldn't it? Plus if the whole cache line was linked it would mean
Why would it be more complex? Snooping the bus is done at the cacheline
level, right?
Post by Peter Gavin
code that uses atomic instructions needs to know the cache line size, which
might be annoying.
IIRC, on PPC it applies to the whole cacheline.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ***@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Stefan Kristiansson
2014-04-11 10:01:32 UTC
On Fri, Apr 11, 2014 at 12:26 PM, Geert Uytterhoeven
Post by Geert Uytterhoeven
Hi Peter,
Post by Peter Gavin
On Thu, Apr 10, 2014 at 4:09 PM, Stefan Kristiansson
Post by Stefan Kristiansson
- The granularity of the link is a word.
(I'm certainly open for discussions on this one, e.g. a cacheline could make
sense too)
I think a single word is fine. Doing the whole cacheline would be more
complex, wouldn't it? Plus if the whole cache line was linked it would mean
Why would it be more complex? Snooping the bus is done at the cacheline
level, right?
I don't think it would be any more complex, it's just a matter of
masking out a couple of bits in the internal link address.
Post by Geert Uytterhoeven
Post by Peter Gavin
code that uses atomic instructions needs to know the cache line size, which
might be annoying.
IIRC, on PPC it applies to the whole cacheline.
What Peter is speaking about is the fact that the cache line size is
configurable on OR1K, but perhaps it would make sense to just set it
to the largest possible cache line (32 bytes).
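
To make the "masking out a couple of bits" idea concrete, here is a small
Python sketch. The function names are made up for illustration; a granule of
4 models word granularity and 32 models the largest cache line size
mentioned above.

```python
def link_tag(addr, granule=32):
    """Return the address bits compared for a link of the given granule size.

    `granule` must be a power of two (4 for a word, 32 for a 32-byte line).
    """
    if granule <= 0 or granule & (granule - 1):
        raise ValueError("granule must be a power of two")
    return addr & ~(granule - 1)


def same_link(a, b, granule=32):
    """True if an access to `b` falls inside the granule linked at `a`."""
    return link_tag(a, granule) == link_tag(b, granule)
```

With a 32-byte granule, a store anywhere in 0x1220..0x123f would hit a link
taken at 0x1234, whereas with word granularity only 0x1234..0x1237 would.
Fixing the granule at 32 bytes means software never has to know the
configured cache line size.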

Stefan
Peter Gavin
2014-04-11 18:20:32 UTC
Post by Geert Uytterhoeven
Why would it be more complex? Snooping the bus is done at the cacheline
level, right?
Post by Peter Gavin
code that uses atomic instructions needs to know the cache line size, which
might be annoying.
IIRC, on PPC it applies to the whole cacheline.
Ok, fair enough :)

-Pete
Stefan Wallentowitz
2014-04-15 07:59:09 UTC
Hi Stefan,

just one general question from my side: Will or1200 also be extended? If
not, how do you ensure compatibility?
Is it planned to have something like -matomic or so?

Bye,
Stefan
Stefan Kristiansson
2014-04-15 09:06:18 UTC
On Tue, Apr 15, 2014 at 10:59 AM, Stefan Wallentowitz
Post by Stefan Wallentowitz
just one general question from my side: Will or1200 also be extended? If
not, how do you ensure compatibility?
Is it planned to have something like -matomic or so?
I have planned on adding the support into or1200 too at some point,
but it's not a top priority for me right this second.
I think compatibility could be handled on an OS level only, at least
as long as the instructions are emitted with handcrafted asm alone.

Stefan
Stefan Kristiansson
2014-04-21 20:28:12 UTC
Update on this: I've added an 'Atomicity' chapter and the l.lwa and l.swa
instructions to the arch spec, with the things that we discussed in
this thread.
The updated arch spec can be found here for review:
https://www.dropbox.com/s/yqyfelu2yrutzwt/openrisc-arch-1.1-rev0.pdf
or as .odt:
https://www.dropbox.com/s/bzez95ix1cl0g7g/openrisc-arch-1.1-rev0.odt

For convenience, this is the added text in 'plain text' (copied
straight from the .odt):

"7.3 Atomicity

A memory access is atomic if it is always performed in its entirety
with no visible fragmentation. Atomic memory accesses are specifically
required to implement software semaphores and other shared structures
in systems where two different processes on the same processor, or two
different processors in a multiprocessor environment, access the same
memory location with intent to modify it.

The OpenRISC 1000 architecture provides two dedicated instructions
that together perform an atomic read-modify-write operation.

l.lwa rD, I(rA)

l.swa I(rA), rB

Instruction l.lwa loads a single word from memory, creating a
reservation for a subsequent conditional store operation. A special
register, invisible to the programmer, is used to hold the address of
the memory location, which is used in the atomic read-modify-write
operation.

The reservation for a subsequent l.swa is cancelled if another store
to the same memory location occurs, another master writes the same
memory location (snoop hit), another l.swa (to any memory location) is
executed, another l.lwa is executed or a context switch (exception)
occurs.

If a reservation is still valid when the corresponding l.swa is
executed, l.swa stores general-purpose register rB into the memory and
SR[F] is set.

If the reservation was cancelled, l.swa does not perform the store to
memory and SR[F] is cleared."

"l.lwa rD,I(rA)

Description:

The offset is sign-extended and added to the contents of
general-purpose register rA. The sum represents an effective address.
The single word in memory addressed by EA is loaded into the low-order
32 bits of general-purpose register rD. High-order bits of
general-purpose register rD are replaced with zero.

Two internal registers are set, one 1-bit flag register that will be
referred to as atomic_reserve and a 32/64-bit address register that
will be referred to as atomic_address.

The atomic_address register will be set to EA.

32-bit Implementation:

EA ← exts(Immediate) + rA[31:0]
rD[31:0] ← (EA)[31:0]

atomic_reserve ← 1

atomic_address ← EA

64-bit Implementation:

EA ← exts(Immediate) + rA[63:0]
rD[31:0] ← (EA)[31:0]
rD[63:32] ← 0

atomic_reserve ← 1

atomic_address ← EA

Exceptions:

TLB miss
Page fault
Bus error
Alignment"

"l.swa I(rA),rB

Description:

The offset is sign-extended and added to the contents of
general-purpose register rA. The sum represents an effective address.
The low-order 32 bits of general-purpose register rB are conditionally
stored to the memory location addressed by EA. The 'atomic' condition
requires that an atomic reservation on EA is still intact (i.e. that the
atomic_reserve internal register mentioned in the l.lwa instruction is
still set and the internal atomic_address register matches EA).

32-bit Implementation:

EA ← exts(Immediate) + rA[31:0]

if (atomic) (EA)[31:0] ← rB[31:0]

SR[F] ← atomic

64-bit Implementation:

EA ← exts(Immediate) + rA[63:0]
if (atomic) (EA)[31:0] ← rB[31:0]

SR[F] ← atomic

Exceptions:

TLB miss
Page fault
Bus error
Alignment"
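
The 32-bit pseudo-code above can be read as the following Python sketch. It
is a model under stated assumptions, not the or1ksim implementation: the
immediate is assumed to be 16 bits wide (as for l.lwz and l.sw), the
alignment exception is assumed to fire for any EA that is not word aligned,
memory is a dict of words, and all class and method names are invented here.

```python
MASK32 = 0xFFFFFFFF


class AlignmentException(Exception):
    """Raised for an effective address that is not word aligned."""


def exts16(imm):
    """Sign-extend a 16-bit immediate (assumed width)."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


class Or1kAtomicModel:
    def __init__(self):
        self.mem = {}             # word-addressed backing store
        self.sr_f = 0             # models SR[F]
        self.atomic_reserve = 0   # 1-bit internal flag
        self.atomic_address = 0   # internal address register

    def _ea(self, imm, ra):
        # EA <- exts(Immediate) + rA[31:0]
        ea = (exts16(imm) + ra) & MASK32
        if ea & 0x3:
            raise AlignmentException(hex(ea))
        return ea

    def l_lwa(self, imm, ra):
        ea = self._ea(imm, ra)
        self.atomic_reserve = 1
        self.atomic_address = ea
        return self.mem.get(ea, 0)   # rD[31:0] <- (EA)[31:0]

    def l_swa(self, imm, ra, rb):
        ea = self._ea(imm, ra)
        atomic = int(self.atomic_reserve == 1 and self.atomic_address == ea)
        if atomic:
            self.mem[ea] = rb & MASK32   # (EA)[31:0] <- rB[31:0]
        self.atomic_reserve = 0          # any l.swa cancels the reservation
        self.sr_f = atomic               # SR[F] <- atomic
```

For example, l_lwa(0xFFFC, 0x1004) reserves address 0x1000 (the offset
sign-extends to -4), and a following l_swa(0, 0x1000, value) stores the
value and sets sr_f to 1, while a second l_swa without a new l_lwa leaves
memory unchanged and clears sr_f.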
Geert Uytterhoeven
2014-04-22 07:19:04 UTC
Hi Stefan,

On Mon, Apr 21, 2014 at 10:28 PM, Stefan Kristiansson
Post by Stefan Kristiansson
The atomic_address register will be set to EA.
Shouldn't this be the physical address, for systems with MMU?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ***@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Stefan Kristiansson
2014-04-22 07:25:50 UTC
On Tue, Apr 22, 2014 at 10:19 AM, Geert Uytterhoeven
Post by Geert Uytterhoeven
Hi Stefan,
On Mon, Apr 21, 2014 at 10:28 PM, Stefan Kristiansson
Post by Stefan Kristiansson
The atomic_address register will be set to EA.
Shouldn't this be the physical address, for systems with MMU?
Yes, of course (and indeed it is in the implementations I've made).
I'll change the description to reflect that.

Stefan
Stefan Kristiansson
2014-04-28 18:49:59 UTC
On Mon, Apr 21, 2014 at 11:28 PM, Stefan Kristiansson
Post by Stefan Kristiansson
Update on this, I've added 'Atomicity' chapter and the l.lwa and l.swa
instructions to the arch spec, with the things that we discussed in
this thread.
https://www.dropbox.com/s/yqyfelu2yrutzwt/openrisc-arch-1.1-rev0.pdf
https://www.dropbox.com/s/bzez95ix1cl0g7g/openrisc-arch-1.1-rev0.odt
Update again.

I've added the suggested clarification that it's the
physical address that is stored as the "atomic_address".
I've uploaded the changes to the dropbox links above, but below are the
instruction descriptions in plain text:

"l.lwa rD,I(rA)
Description:
The offset is sign-extended and added to the contents of
general-purpose register rA. The sum represents an effective address.
The single word in memory addressed by EA is loaded into the low-order
32 bits of general-purpose register rD. High-order bits of
general-purpose register rD are replaced with zero.
Two internal registers are set, one 1-bit flag register that will be
referred to as atomic_reserve and a 32/64-bit address register that
will be referred to as atomic_address.
The atomic_address register will be set to EA (translated to its
physical counterpart in case the MMU is enabled)."

"l.swa I(rA),rB
Description:
The offset is sign-extended and added to the contents of
general-purpose register rA. The sum represents an effective address.
The low-order 32 bits of general-purpose register rB are conditionally
stored to the memory location addressed by EA. The 'atomic' condition
requires that an atomic reservation on EA is still intact (i.e. that the
atomic_reserve internal register mentioned in the l.lwa instruction is
still set and the internal atomic_address register matches EA). When the
MMU is enabled, the physical translation of EA is compared to the
internal atomic_address."
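
One way to see why the physical address matters: with an MMU, two different
effective addresses can alias the same physical word. The Python sketch
below uses a toy one-level page table; the 8 KiB page size, the page table
layout and the function names are assumptions made only for this example.

```python
PAGE = 8192  # assumption: 8 KiB pages, used only to split the address


def to_phys(page_table, ea):
    """Translate an effective address with a toy one-level page table."""
    vpn, offset = divmod(ea, PAGE)
    return page_table[vpn] * PAGE + offset


def store_breaks_reservation(page_table, reserved_ea, store_ea,
                             compare_physical=True):
    """Would a store to `store_ea` cancel a reservation taken at `reserved_ea`?"""
    if compare_physical:
        return to_phys(page_table, reserved_ea) == to_phys(page_table, store_ea)
    # Comparing effective addresses misses aliases of the same physical word.
    return reserved_ea == store_ea
```

If virtual pages 2 and 5 both map to physical frame 7, a store through page
5 hits the word reserved through page 2. Comparing physical addresses
catches that; comparing effective addresses would let the l.swa succeed
even though the word was modified.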

Stefan
Julius Baxter
2014-04-28 23:28:54 UTC
On Mon, Apr 28, 2014 at 7:49 PM, Stefan Kristiansson
Post by Stefan Kristiansson
On Mon, Apr 21, 2014 at 11:28 PM, Stefan Kristiansson
Post by Stefan Kristiansson
Update on this, I've added 'Atomicity' chapter and the l.lwa and l.swa
instructions to the arch spec, with the things that we discussed in
this thread.
https://www.dropbox.com/s/yqyfelu2yrutzwt/openrisc-arch-1.1-rev0.pdf
https://www.dropbox.com/s/bzez95ix1cl0g7g/openrisc-arch-1.1-rev0.odt
Update again.
I've added the suggested changes about a clarification that it's the
physical address that is stored as the "atomic_address".
I've uploaded the changes to the dropbox links above, but below is
"l.lwa rD,I(rA)
The offset is sign-extended and added to the contents of
general-purpose register rA. The
sum represents an effective address. The single word in memory
addressed by EA is
loaded into the low-order 32 bits of general-purpose register rD.
High-order bits of
general-purpose register rD are replaced with zero.
Two internal registers are set, one 1-bit flag register that will be
referred to as
atomic_reserve and a 32/64-bit address register that will be referred
to as atomic_address.
The atomic_address register will be set to EA (translated to its
physical counterpart in
case the MMU is enabled)."
"l.swa I(rA),rB
The offset is sign-extended and added to the contents of
general-purpose register rA. The
sum represents an effective address. The low-order 32 bits of
general-purpose register rB
are conditionally stored to memory location addressed by EA. The
'atomic' condition
relies on that an atomic reserve to EA is still intact (i.e. that the
atomic_reserve internal
register mentioned in the l.lwa instruction is still set and the
internal atomic_address
register matches EA). When the MMU is enabled, the physical translation of EA is
compared to the internal atomic_address."
Looks good Stefan.

I like the idea of not bothering with a true SPR to store this state,
because you're checking it anyway when l.b[n]f'ing after the l.swa.
It's also a bit of state which you don't need to save during an
exception. My only concern is that it's not visible to a debugger, but
I'm not sure it really matters.

In terms of implementation, have you found the additional
infrastructure required (I presume exposure of the atomic access state
per processor to some snoop logic near the shared memory) to
complicate matters? Do these accesses appear like peripheral accesses
(that is, to volatile memory areas like anything with address bit 31
set, typically) to the caches, so that they are never cached?

Cheers

Julius
Stefan Kristiansson
2014-04-29 06:21:19 UTC
Post by Julius Baxter
I like the idea of not bothering with a true SPR to store this state,
because you're checking it anyway when l.b[n]f'ing after the l.swa.
It's also a bit of state which you don't need to save during an
exception. My only concern is that it's not visible to a debugger, but
I'm not sure it really matters.
I don't think this is generally visible to debuggers on other CPUs, so
I also doubt it's an issue.
Does anyone else have any insight on this?
Post by Julius Baxter
In terms of implementation, have you found the additional
infrastructure required (I presume exposure of the atomic access state
per processor to some snoop logic near the shared memory) to
complicate matters?
AFAIK, you don't need to export the atomic access state, what you need
to do is expose the snoop addresses to the atomic access logic.
In the implementation I've done, snooping isn't done at all (i.e.,
it's only for single core); however, as seen in Stefan
Wallentowitz's message in the other thread, he has added this logic:
https://github.com/wallento/mor1kx/commit/2372e0b69eb1091f377a1ea32072758f219ce294
Post by Julius Baxter
Do these accesses appear like peripheral accesses
(that is, to volatile memory areas like anything with address bit 31
set, typically) to the caches, so that they are never cached?
No, in terms of load/store instructions, they act exactly the same as
l.lwz and l.sw.

Stefan
Jose Teixeira de Sousa
2014-04-29 10:31:26 UTC
Speaking of debug, what will be the strategy for mor1kx:

- debug stub, or
- adbg (this was giving problems the last time I tried, and although it
works with or1200, the adbg RTL needs a lot of rewriting... just ask
Verilator why)

?

We are sort of planning a Bluetooth Low Energy chip using an open source
processor, and LM32 appears to be a serious contender to mor1kx. OR1200 is
out of the question.

On Tue, Apr 29, 2014 at 7:21 AM, Stefan Kristiansson
Post by Stefan Kristiansson
Post by Julius Baxter
I like the idea of not bothering with a true SPR to store this state,
because you're checking it anyway when l.b[n]f'ing after the l.swa.
It's also a bit of state which you don't need to save during an
exception. My only concern is that it's not visible to a debugger, but
I'm not sure it really matters.
I don't think this is generally visible to debuggers on other CPUs, so
I also doubt it's an issue.
Does anyone else have any insight on this?
Post by Julius Baxter
In terms of implementation, have you found the additional
infrastructure required (I presume exposure of the atomic access state
per processor to some snoop logic near the shared memory) to
complicate matters?
AFAIK, you don't need to export the atomic access state, what you need
to do is expose the snoop addresses to the atomic access logic.
In the implementation I've done, snooping isn't done at all (i.e.,
it's only for single core), however as seen in the Stefan
https://github.com/wallento/mor1kx/commit/2372e0b69eb1091f377a1ea32072758f219ce294
Post by Julius Baxter
Do these accesses appear like peripheral accesses
(that is, to volatile memory areas like anything with address bit 31
set, typically) to the caches, so that they are never cached?
No, in terms of load/store instructions, they act exactly the same as
l.lwz and l.sw.
Stefan
--
Jose T. de Sousa, PhD
Office: +351 213 100 213
R. Alves Redol 9
1000-029 Lisboa
Portugal
Stefan Kristiansson
2014-04-30 03:54:55 UTC
On Tue, Apr 29, 2014 at 9:56 AM, Stefan Wallentowitz
Post by Stefan Wallentowitz
Hi Stefan,
* The parts where you describe atomic_reserve and atomic_address might
be implementation-specific. I think LL/SC can also be implemented as
part of the cache (with "linked" cache line flags). Similarly, there
could be more than one atomic_reserve and atomic_address registers to
support several LL/SC operations (which I would prefer for a future
mor1kx implementation). Maybe we can think about a more generic
description like atomic_reserve[EA] and a note that states, that the
number of concurrent atomic_reserve is limited or so. I will also think
about and check other architecture specs.
Good point, I was afraid that my description was a bit too
implementation specific,
but I thought that there weren't many options for how to implement it.
I think your more generic 'atomic_reserve[EA]' might be a better option.
I haven't seen any references to several ll/sc being active at the
same time in any other arch specs, but I might not have looked
carefully enough.
Do you have an example?
Post by Stefan Wallentowitz
* Maybe we should add a note that refers to implementation descriptions
regarding further conditions under which an l.swa might fail (like when
implemented as part of the cache)
hmm, so what would those conditions be?
Post by Stefan Wallentowitz
* Finally, I want to bring up the memory model in this context. I think
the architecture manual lacks proper mentioning of the assumed
consistency. mor1kx implements sequential consistency I suppose and so
might some programs. I think it may be better to go with a more relaxed
consistency model, like total store order in Sparc. Independent of
whether it is changed, the memory model description may be added in this
revision, as the l.lwa and l.swa would be synchronization points in a
relaxed consistency model (as l.msync is then).
Yes, I agree that it needs to be defined better.
Perhaps that is slightly out of the scope of what I'm doing here
though, maybe you are up to defining something more concrete in the
future?
However, the note that l.swa/l.lwa act as memory barriers could be added,
because that's what we want, right?

Stefan
Stefan Kristiansson
2014-05-03 07:13:32 UTC
On Wed, Apr 30, 2014 at 6:54 AM, Stefan Kristiansson
Post by Stefan Kristiansson
On Tue, Apr 29, 2014 at 9:56 AM, Stefan Wallentowitz
Post by Stefan Wallentowitz
Hi Stefan,
* The parts where you describe atomic_reserve and atomic_address might
be implementation-specific. I think LL/SC can also be implemented as
part of the cache (with "linked" cache line flags). Similarly, there
could be more than one atomic_reserve and atomic_address registers to
support several LL/SC operations (which I would prefer for a future
mor1kx implementation). Maybe we can think about a more generic
description like atomic_reserve[EA] and a note that states, that the
number of concurrent atomic_reserve is limited or so. I will also think
about and check other architecture specs.
Good point, I was afraid that my description was a bit to
implementation specific,
but I thought that there wasn't much options how to implement it.
I think your more generic 'atomic_reserve[EA]' might be a better option.
I haven't seen any references to several ll/sc being active at the
same time in any other arch specs, but I might not have looked
carefully enough.
Do you have an example?
I've updated the instruction descriptions with a more generic approach
as follows.

"l.lwa rD,I(rA)
Description:
The offset is sign-extended and added to the contents of
general-purpose register rA. The sum represents an effective address.
The single word in memory addressed by EA is loaded into the low-order
32 bits of general-purpose register rD. High-order bits of
general-purpose register rD are replaced with zero.
An atomic reservation is placed on the address formed from EA. In case an
MMU is enabled, the physical translation of EA is used.

32-bit Implementation:
EA ← exts(Immediate) + rA[31:0]
rD[31:0] ← (EA)[31:0]
atomic_reserve[to_phys(EA)] ← 1

64-bit Implementation:
EA ← exts(Immediate) + rA[63:0]
rD[31:0] ← (EA)[31:0]
rD[63:32] ← 0
atomic_reserve[to_phys(EA)] ← 1

Exceptions:
TLB miss
Page fault
Bus error
Alignment"

"l.swa I(rA),rB
Description:
The offset is sign-extended and added to the contents of
general-purpose register rA. The sum represents an effective address.
The low-order 32 bits of general-purpose register rB are conditionally
stored to the memory location addressed by EA. The 'atomic' condition
requires that an atomic reservation on EA is still intact. When the MMU
is enabled, the physical translation of EA is used to do the address
comparison.

32-bit Implementation:
EA ← exts(Immediate) + rA[31:0]
if (atomic) (EA)[31:0] ← rB[31:0]
SR[F] ← atomic

64-bit Implementation:
EA ← exts(Immediate) + rA[63:0]
if (atomic) (EA)[31:0] ← rB[31:0]
SR[F] ← atomic

Exceptions:
TLB miss
Page fault
Bus error
Alignment "
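
One possible reading of the generic atomic_reserve[to_phys(EA)] notation is
a small bounded table keyed by physical address. The Python sketch below is
only an illustration of that reading: the class name, the capacity
parameter, the oldest-first eviction policy and the choice that a
conditional store consumes only its own entry are all assumptions, since the
thread leaves the behaviour with several concurrent reservations open.

```python
from collections import OrderedDict


class ReservationTable:
    """atomic_reserve[...] as a bounded table keyed by physical address."""

    def __init__(self, capacity=1):
        self.capacity = capacity
        self.entries = OrderedDict()

    def reserve(self, paddr):
        # l.lwa: atomic_reserve[paddr] <- 1, evicting the oldest entry if full.
        self.entries.pop(paddr, None)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[paddr] = 1

    def clear(self, paddr):
        # A store (local or snooped) to paddr cancels that reservation.
        self.entries.pop(paddr, None)

    def clear_all(self):
        # Exception / context switch cancels every reservation.
        self.entries.clear()

    def store_conditional(self, paddr):
        # l.swa: return the 'atomic' condition and consume the reservation.
        return self.entries.pop(paddr, 0) == 1
```

With capacity=1 a second reserve() replaces the first, which matches the
single-register description earlier in the thread; a larger capacity models
an implementation that tracks several links at once.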
Post by Stefan Kristiansson
Post by Stefan Wallentowitz
* Finally, I want to bring up the memory model in this context. I think
the architecture manual lacks proper mentioning of the assumed
consistency. mor1kx implements sequential consistency I suppose and so
might some programs. I think it may be better to go with a more relaxed
consistency model, like total store order in Sparc. Independent of
whether it is changed, the memory model description may be added in this
revision, as the l.lwa and l.swa would be synchronization points in a
relaxed consistency model (as l.msync is then).
Yes, I agree that it needs to be defined better.
Perhaps that is slightly out of the scope of what I'm doing here
though, maybe you are up to defining something more concrete in the
future?
However, the note that l.swa/l.lwa act as memory barriers could be added,
because that's what we want, right?
So, I added the following to the Atomicity chapter:

"In implementations that use a weakly-ordered memory model, l.swa and l.lwa will
serve as synchronization points, similar to l.msync."

Stefan
