_posts/2021-12-21-usage-faq.md

---
layout: post
title: "Usage FAQ"
author: Neil McGlohon and Elkin Cruz
category: faq
---

A quick list of answered questions for people new (and not so new) to ROSS.

- Are there any simple, example ROSS models to study and copy?

  + Absolute minimum template: [template-model](https://github.com/ROSS-org/template-model)
  + A simple mailing model (with people sending letters and delivery cars transferring
    them to their destinations): [ROSS-Mail-Model](https://github.com/nmcglohon/ROSS-Mail-Model)
  + A Game of Life example: [ross-highlife](https://github.com/helq/ross-highlife)

- Can ROSS make use of CUDA to accelerate the execution of a model?

  ROSS doesn't come with CUDA or GPU support. Some optimizations might be possible on a
  per-model basis. ROSS is written in pure C, which makes it easy to link against
  `extern` functions and variables. GPU access can be implemented in a separately
  compiled CUDA object (with no main function) whose functions are exposed to C via
  `extern "C"`.

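  The linkage pattern can be sketched as follows. The CUDA side would live in a separate
  `.cu` file compiled by `nvcc` and linked alongside the C model; here a plain-C stand-in
  takes its place so the sketch is self-contained and runs anywhere. All names
  (`gpu_scale`, the file name) are hypothetical:

  ```c
  #include <stdio.h>

  /* In a real model this declaration would match a function defined in a
   * separate .cu file compiled by nvcc, e.g.:
   *
   *   // gpu_kernels.cu
   *   extern "C" void gpu_scale(float *data, int n, float factor) {
   *       ...launch a CUDA kernel, copy results back...
   *   }
   *
   * Linking that object file with the C model gives the event handlers
   * access to the GPU. */
  extern void gpu_scale(float *data, int n, float factor);

  /* CPU-only stand-in for the CUDA implementation, so this sketch runs
   * without a GPU. */
  void gpu_scale(float *data, int n, float factor) {
      for (int i = 0; i < n; i++) data[i] *= factor;
  }

  int main(void) {
      float state[4] = {1.0f, 2.0f, 3.0f, 4.0f};
      gpu_scale(state, 4, 2.0f);   /* as if called from a C event handler */
      printf("%.1f %.1f %.1f %.1f\n", state[0], state[1], state[2], state[3]);
      return 0;
  }
  ```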
- What is PHOLD precisely? It's the only model present by default in ROSS; it seems to be
  used for benchmarking, but it's pretty empty, so what's the deal?

  As you mentioned, PHOLD is a benchmarking model. It's used for both strong-scaling
  studies (fixed problem size, varying number of workers) and weak-scaling studies
  (problem size grows with the number of workers, keeping the per-worker load fixed).
  [See the Wikipedia paragraph on this.](https://en.wikipedia.org/wiki/Scalability#Weak_versus_strong_scaling)

  PHOLD generates random events to/from random LPs, with some guidance as to how many
  remote (different PE) and local (same PE) events are generated. Local events will never
  trigger out-of-order event errors by themselves, but remote events can, because
  optimistic execution allows PEs to process events at their own pace without strict
  synchronization. If we were to create some sort of optimization for ROSS that improves
  the engine's performance, a simple model like PHOLD is a good choice for evaluating it,
  because we can measure the simulation's efficiency (a metric of how effective the
  simulation was at progressing; more rollbacks means lower efficiency).

- What is an LP? What is a PE?

  An LP (Logical Process) is a single unit of computation, an actor. An LP receives
  messages, processes them, and optionally sends more messages to other LPs or to itself.
  There can be millions or billions of LPs in a simulation. LPs are created at the start
  of the simulation on an assigned PE, they have a unique global ID across the
  simulation, and they are only destroyed once the simulation stops.

  A PE (Processing Element) is a single core/MPI rank in charge of coordinating the
  simulation and communication with other PEs. PEs manage LPs and their mailboxes.

- Cool. What is a KP?

  KPs (Kernel Processes) are really just an organizational structure: each is a container
  of LPs, and KPs are themselves contained in a PE. Most importantly, they determine the
  granularity of rollbacks. When an out-of-order event is detected by a receiving LP ("I
  got an event that is timestamped to have occurred 'before' the last event that I
  processed"), all LPs in the same KP as the receiving LP must roll back to a time before
  the triggering event is set to occur. With too few KPs you'll be rolling back a lot of
  LPs very frequently. Too many KPs is not really a terrible problem, but it could be
  memory inefficient; admittedly, that hasn't really been tested.

- I am getting `Out of events in GVT! Consider increasing --extramem`. What is that?

  The GVT (Global Virtual Time) is a mechanism used to keep all LPs in sync.

  ROSS is built in C and is designed to be as fast as possible, so it allocates a bunch
  of memory up front and creates a "pool" of unused event structs that can be pulled
  from any time the simulation needs an event. It does not resize this pool on the fly
  (that could be expensive; it is arguably better to error out and tell the user to
  allocate more). Because we may need to roll back the simulation, we must 'hold' all
  forward events after they've been processed until we know it is safe to garbage
  collect them. The time when we can know this is at synchronization points: GVT
  calculation. GVT is the minimum point in virtual time that all LPs can agree "has
  happened". Any events with timestamps up to that point can be safely garbage collected
  and their event structs returned to the pool.

  If, however, we create so many events that the pool runs dry, the simulation will
  recognize this, end, and error out telling the user how to rectify the situation.

  Why is memory not allocated dynamically, then? There are two reasons. First, ROSS is
  designed to be as fast as possible, and dynamic allocation takes time. If we abstract
  this responsibility away from users, they will likely stop thinking about it, write
  models that create a TON of events, and have no idea why everything runs so slowly,
  when it's ultimately because the simulator is repeatedly calling malloc during the
  simulation instead of allocating everything up front. Second, and even worse, a model
  can be unstable in its creation of messages/events. It's quite simple to write a
  program that sends two messages for each message it receives. Such a program requires
  an unbounded pool, which is not feasible on any system. If you keep receiving this
  error no matter how much extra memory you allocate for the simulation, you might have
  a model with an unstable message structure.

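  The up-front pool can be pictured as a simple free list. This is an illustrative toy,
  not ROSS's actual implementation; all names are invented:

  ```c
  #include <stdio.h>

  /* A toy fixed-size event pool: all events are allocated up front and
   * recycled through a free list, in the spirit of ROSS's event pool. */
  typedef struct event {
      double timestamp;
      struct event *next;   /* free-list link */
  } event;

  #define POOL_SIZE 4
  static event pool[POOL_SIZE];
  static event *free_list = NULL;

  static void pool_init(void) {
      for (int i = 0; i < POOL_SIZE; i++) {
          pool[i].next = free_list;
          free_list = &pool[i];
      }
  }

  /* Pull an event from the pool; NULL plays the role of the
   * "Out of events" error. */
  static event *event_alloc(void) {
      if (!free_list) return NULL;
      event *e = free_list;
      free_list = e->next;
      return e;
  }

  /* Garbage collection at GVT returns the struct to the pool. */
  static void event_free(event *e) {
      e->next = free_list;
      free_list = e;
  }

  int main(void) {
      pool_init();
      event *held[POOL_SIZE];
      for (int i = 0; i < POOL_SIZE; i++) held[i] = event_alloc();
      if (event_alloc() == NULL)
          printf("Out of events! Consider increasing --extramem\n");
      event_free(held[0]);   /* GVT advanced past this event */
      if (event_alloc() != NULL)
          printf("recycled an event after GVT\n");
      return 0;
  }
  ```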
- I am getting a `Lookahead violation: decrease g_tw_lookahead`. What is that?

  ROSS can make use of "three" modes of execution, or [schedulers](https://ross-org.github.io/feature/schedulers.html).
  The conservative scheduler works by advancing through the simulation in tiny
  synchronized steps. For this, it assumes that within a small window of time, the
  lookahead, no new events will be created by the events being processed. If an event is
  created with an offset smaller than the lookahead, it violates this assumption and
  execution halts.

  This unfortunately means that models with zero-offset events cannot run under the
  conservative scheduler. You can use either the sequential (`--synch=1`) or optimistic
  (`--synch=3`) schedulers instead.

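  The assumption can be pictured with a toy check (invented names and an arbitrary
  lookahead value; ROSS's real check lives inside the engine):

  ```c
  #include <stdio.h>

  /* Toy version of the conservative scheduler's rule: every new event
   * must be offset into the future by at least the lookahead window. */
  static const double g_lookahead = 0.5;   /* arbitrary example value */

  static int send_event(double offset) {
      if (offset < g_lookahead) {
          printf("Lookahead violation: offset %.2f < lookahead %.2f\n",
                 offset, g_lookahead);
          return 0;
      }
      printf("event scheduled %.2f into the future\n", offset);
      return 1;
  }

  int main(void) {
      send_event(1.0);   /* fine: beyond the lookahead window */
      send_event(0.0);   /* zero-offset event: violates the assumption */
      return 0;
  }
  ```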
- What is the "AVL tree" doing? What is its function in ROSS? And why is it so important
  that its size is shown in the statistics after executing a ROSS model?

  The AVL tree is an efficient way of keeping track of existing events that haven't yet
  been confirmed as "committed" to simulation history. In optimistic mode, we can't
  commit any event to simulation history until GVT is greater than the event's
  timestamp; until then, it's entirely possible for that event to be rolled back.
  Tracking the AVL tree's size over the course of the simulation is a way of monitoring
  event fanout (the rate at which events are being created and destroyed).

- Storing whole events seems rather wasteful. Why do we have to keep track of all of the
  memory of an event? Couldn't we just keep track of some signature of the event and
  save some space?

  There are essentially two ways of accomplishing rollbacks:

  1. Reverse computation: usable when we know the exact context of the event and its
     computation is easily reversible. For example, "this was an event of type X, which
     means it incremented the state property `s->y` by one", so we know the opposite is
     to "decrement `s->y` by one". We don't need any more information! But it's not
     always this simple to revert the state.
  2. State swapping: if we instead, at the time of processing the forward event, encode
     the previous state into the message struct, then during rollback we can fetch the
     saved data from the message and place it back into the LP state. This is helpful
     when an event makes a destructive change like "set `s->y` to zero": it is, in
     principle, impossible to recover information that has been destroyed. For all we
     know, `s->y` could have been 23526 prior to our forward event setting it to zero;
     it could have been 1. If we saved that information in the message struct, we can
     just look at it during reverse computation and say "oh, we know `s->y` was 23526
     prior to setting it to zero".

  Note: option 2 requires that we don't overwrite any event data until we know we don't
  need it!

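  The two strategies can be contrasted in a few lines of C. This is a standalone sketch
  with invented names, not ROSS's API:

  ```c
  #include <assert.h>
  #include <stdio.h>

  typedef struct { int y; } lp_state;
  typedef struct { int saved_y; } message;   /* scratch space for rollback */

  /* 1. Reverse computation: the forward action is easily inverted. */
  static void add_forward(lp_state *s) { s->y += 1; }
  static void add_reverse(lp_state *s) { s->y -= 1; }

  /* 2. State swapping: the forward action is destructive, so stash the
   *    old value in the message before clobbering it. */
  static void zero_forward(lp_state *s, message *m) {
      m->saved_y = s->y;   /* save what we are about to destroy */
      s->y = 0;
  }
  static void zero_reverse(lp_state *s, const message *m) {
      s->y = m->saved_y;   /* restore from the message */
  }

  int main(void) {
      lp_state s = { .y = 23526 };
      message m;

      add_forward(&s);
      add_reverse(&s);
      assert(s.y == 23526);   /* reverse computation undid the increment */

      zero_forward(&s, &m);
      assert(s.y == 0);
      zero_reverse(&s, &m);
      assert(s.y == 23526);   /* state swapping recovered the lost value */

      printf("rollbacks restored y = %d\n", s.y);
      return 0;
  }
  ```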
- Can we allocate dynamic memory for an event?

  Yes and no. The gist of it: if you are sending a message, don't allocate memory! If
  you are processing a message, you can, but you have to remember to deallocate it.
  **Every** event processed forward is either rolled back or removed (committed) at some
  point.

  Very important: be very careful about encoding dynamic memory into messages. You might
  have already considered this, but when you call `malloc()` you get a pointer to that
  memory. If we save that data into the message struct as dynamic memory, all that's
  really saved is the pointer. If the message is destined for an LP on a different PE
  (a different processor), the LP that receives it will get the message, and if it tries
  to read the data located at the encoded pointer it will segfault, or worse, it will
  run with invalid data.

  One valid use of dynamic memory is storing the state of an LP during the forward
  computation step, when the state's size is either known only at runtime or too large
  to save in the message struct. The space allocated in the forward computation step
  must later be freed in the reverse event handler or the commit handler. Basically,
  allocating space within message handling should never be used for any purpose other
  than the forward, reverse, and commit handlers, and even then, use caution.

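  A minimal sketch of this pattern, with invented names rather than ROSS's real API: the
  forward handler allocates and saves, and the reverse XOR commit handler frees. Note
  that the `saved` pointer is only meaningful on the PE that allocated it; never
  dereference message pointers that crossed ranks:

  ```c
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  typedef struct { size_t n; double *values; } lp_state;
  typedef struct { double *saved; } message;  /* careful: only a pointer!
                                               * dangling on any other PE */

  static void forward(lp_state *s, message *m) {
      /* State is runtime-sized, so it can't be embedded in the message:
       * allocate a copy before the destructive change. */
      m->saved = malloc(s->n * sizeof *m->saved);
      memcpy(m->saved, s->values, s->n * sizeof *m->saved);
      for (size_t i = 0; i < s->n; i++) s->values[i] = 0.0;
  }

  static void reverse(lp_state *s, message *m) {
      memcpy(s->values, m->saved, s->n * sizeof *m->saved);
      free(m->saved);   /* rolled back: no commit will follow */
      m->saved = NULL;
  }

  static void commit(message *m) {
      free(m->saved);   /* committed: no rollback will follow */
      m->saved = NULL;
  }

  int main(void) {
      double buf[3] = {1.0, 2.0, 3.0};
      lp_state s = { 3, buf };
      message m;

      forward(&s, &m);
      reverse(&s, &m);   /* this event got rolled back */
      printf("restored: %.0f %.0f %.0f\n",
             s.values[0], s.values[1], s.values[2]);

      forward(&s, &m);
      commit(&m);        /* this time GVT passed it: commit frees */
      return 0;
  }
  ```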
- What is the purpose of the `commit_f` function? Is it to deallocate any memory after a
  reverse message is processed and is no longer needed (i.e., behind the GVT)? (See the
  previous question.)

  The `commit_f` function can be tricky and should be handled with care. Yes, `commit_f`
  can be used to free any dynamic memory that was allocated into the message struct
  during a forward event handler. (Such memory should also be freed in the reverse event
  handler: a message will be passed through a commit callback XOR a reverse callback,
  never both.) Doing this will prevent memory leaks.

  However, while you technically have access to LP state during a `commit_f` function,
  you should refrain from touching it.

  The reason comes down to *when* `commit_f` happens. Say you had a ROSS program where
  LPs receive events of type ADD which increment some encoded LP state, and say an LP
  receives an ADD event timestamped at virtual time 5. Eventually ROSS gets around to
  its GVT calculation phase and everyone agrees that GVT is now at 6, but the LP that
  received that ADD event is now at, say, virtual time 8 (because we execute events
  optimistically and LPs aren't in lock step with each other following GVT). When the
  `commit_f` handler processes that ADD event, the LP believes it is at time 8, but the
  event we are committing was timestamped at 5. If we touch LP state now, we are not
  touching it at time 5, we're touching it at time 8. If we were to print `tw_now(lp)`
  inside the commit handler, it would read 8, not 5.

  It's OK if the implications of this are not 100% clear; just be aware that `commit_f`
  should really only be reading data or freeing dynamic memory from events, and that the
  data to be read (unless it's data encoded in the message struct, which is kept safe)
  will be data from virtual time 8, not 5.

- I've noticed that we can have multiple LP definitions, but can we have multiple message
  definitions? What if I want to send messages with different payloads and sizes? Do I
  resort to declaring a fixed-size message and play around with casting it to different
  message structs for different things? What about sending an arbitrarily sized message?

  No, the system operates with only one message type. If you want anything fancy, maybe
  allocate memory on the fly and put a reference to it in the struct; but beware that
  you should never, ever allocate dynamic memory when creating a message (see the two
  previous questions!).

  ROSS expects a single event size so it knows how to allocate the event pool.
  Technically, when we process an event we just get a pointer to the message struct, and
  we can cast it to whatever we want. But the size of transmitted messages must always
  be the maximum size over all the message structs we'd be casting to.

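  One common way to get several payload shapes out of a single fixed-size message type is
  a tagged union; the struct's size is then the size of its largest variant. A standalone
  sketch with invented names:

  ```c
  #include <stdio.h>

  /* One fixed-size message struct with a type tag and a union of payloads.
   * Every transmitted message is sizeof(message) bytes, the size of the
   * largest variant plus the tag. */
  typedef enum { MSG_ADD, MSG_TEXT } msg_kind;

  typedef struct {
      msg_kind kind;
      union {
          int  amount;     /* MSG_ADD payload */
          char text[32];   /* MSG_TEXT payload: the largest variant */
      } body;
  } message;

  static void handle(const message *m) {
      switch (m->kind) {
      case MSG_ADD:  printf("add %d\n", m->body.amount); break;
      case MSG_TEXT: printf("text: %s\n", m->body.text); break;
      }
  }

  int main(void) {
      message a = { .kind = MSG_ADD,  .body.amount = 5 };
      message b = { .kind = MSG_TEXT, .body.text = "hello" };
      handle(&a);
      handle(&b);
      return 0;
  }
  ```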
- ROSS seems kinda limited. I want dynamic memory allocation and a garbage collector,
  arbitrarily sized messages, LP creation on the fly, reallocation of LPs, a dynamic
  message pool, and ...

  With ROSS, most things that make you think "oh, that seems rather limiting" have a
  reason behind them: speed.

  ROSS' claim to fame is being incredibly scalable and fast. Having variable event
  sizes, for example, would complicate the pooling of events, requiring us either to
  dynamically allocate events as they are created or to maintain a separate pool for
  every single event type (challenging to support cleanly and also very inefficient; it
  would lead to more problems than it solves), and so on and on.

documentation.md

### FAQ

Once you are knee-deep in building models, you might have many unanswered questions.
If that is the case, [this FAQ comes to your rescue](./faq/usage-faq.html). (Note: there
are many, many more unanswered questions in the wild. If you find any, please ping us!)

## ROSS Features

These posts describe features of ROSS that model developers can take advantage of.