---
layout: post
title: "Usage FAQ"
author: Neil McGlohon and Elkin Cruz
category: faq
---

A quick list of answered questions for people new (and not so new) to ROSS.

- Are there any simple, example ROSS models to study and copy?

  + Absolute minimum template: [template-model](https://github.com/ROSS-org/template-model)
  + A simple mailing model (with people sending letters and delivery cars transferring
    them to their corresponding destinations): [ROSS-Mail-Model](https://github.com/nmcglohon/ROSS-Mail-Model)
  + A Game of Life example: [ross-highlife](https://github.com/helq/ross-highlife)

- Can ROSS make use of CUDA to accelerate the execution of a model?

  ROSS doesn't come with CUDA or GPU support. Some optimizations might be possible on a
  per-model basis. ROSS is written in pure C, which makes it easy to access `extern`
  variables and functions. Access to the GPU can be implemented in a separate object
  file (with no main function) whose functions are `extern`ed to C.

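  A sketch of that linkage pattern (the `saxpy_gpu` name is hypothetical, and a plain C
  definition stands in for the nvcc-compiled object file so the example compiles
  anywhere):

  ```c
  #include <stdio.h>

  /* In a real build this declaration would match a function defined in a
   * separate .cu file, compiled by nvcc with `extern "C"` linkage.  Here a
   * plain C definition stands in for it so the sketch is self-contained. */
  extern void saxpy_gpu(int n, float a, const float *x, float *y);

  void saxpy_gpu(int n, float a, const float *x, float *y)
  {
      for (int i = 0; i < n; i++)  /* the CUDA version would launch a kernel */
          y[i] = a * x[i] + y[i];
  }

  int main(void)
  {
      /* a model's event handler could call saxpy_gpu() like any other C function */
      float x[4] = {1, 2, 3, 4};
      float y[4] = {1, 1, 1, 1};
      saxpy_gpu(4, 2.0f, x, y);
      printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);
      return 0;
  }
  ```
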
- What is PHOLD precisely? It's the only model present by default in ROSS; it seems to be
  used for benchmarking, but it's pretty empty, so what's the deal?

  As you mentioned, PHOLD is a benchmarking model. It's used for both strong scaling
  (fixed problem size, varied worker count) and weak scaling (problem size grown in
  proportion to worker count).
  [See the Wikipedia paragraph on this.](https://en.wikipedia.org/wiki/Scalability#Weak_versus_strong_scaling)

  It generates random events to/from random LPs with some guidance as to how many
  remote (different PE) and local (same PE) events are generated. Local events won't
  ever trigger out-of-order event errors by themselves, but remote events can, because
  optimistic execution allows PEs to process events at their own pace without strict
  synchronization. If we were to create some sort of optimization for ROSS that
  improves the engine's performance, a simple model like PHOLD is a good choice because
  we can measure the simulation's efficiency (a metric of how effective the simulation
  was at progressing - more rollbacks means lower efficiency).

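  One way such an efficiency metric can be computed is the fraction of processed events
  that were not rolled back (a hedged sketch of the idea; the exact formula ROSS
  reports may differ):

  ```c
  #include <stdio.h>

  /* Event efficiency: the share of processed events that actually stuck.
   * More rollbacks => lower efficiency, as described in the text. */
  static double efficiency(long net_events, long rolled_back)
  {
      return 100.0 * (1.0 - (double)rolled_back / (double)net_events);
  }

  int main(void)
  {
      printf("%.1f%%\n", efficiency(1000000, 0));      /* no rollbacks */
      printf("%.1f%%\n", efficiency(1000000, 250000)); /* heavy rollback */
      return 0;
  }
  ```
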
- What is an LP? What is a PE?

  An LP (Logical Process) is a single unit of computation, an actor. An LP receives
  messages, computes them, and optionally sends more messages to other LPs or itself.
  There can be millions or billions of LPs in a simulation. LPs are created at the start
  of a simulation on an assigned PE, they have a unique global ID across the simulation,
  and they are only destroyed once the simulation stops.

  A PE (Processing Element) is a single core/MPI rank in charge of coordinating the
  simulation and communication with other PEs. PEs handle LPs and their mailboxes.

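  The shape of an LP's forward event handler looks roughly like the sketch below. The
  stub typedefs here stand in for ROSS's real types from `ross.h` so the snippet
  compiles standalone; in a real model the handler is registered through ROSS's
  `tw_lptype` table:

  ```c
  #include <stdio.h>

  /* Hypothetical slimmed-down stand-ins for ROSS types. */
  typedef struct { unsigned long gid; } tw_lp;  /* LP handle with its global ID */
  typedef struct { int unused; } tw_bf;         /* bitfield used during rollback */

  typedef struct { long letters_delivered; } state;  /* per-LP state */
  typedef struct { int type; } message;              /* the one message struct */

  /* Forward event handler: receives the LP's state, a bitfield, the
   * incoming message, and the LP handle, then updates state. */
  static void event_handler(state *s, tw_bf *bf, message *m, tw_lp *lp)
  {
      (void)bf; (void)m;
      s->letters_delivered++;
      printf("LP %lu has delivered %ld letter(s)\n", lp->gid, s->letters_delivered);
  }

  int main(void)
  {
      state s = {0};
      tw_bf bf = {0};
      message m = {0};
      tw_lp lp = {42};
      event_handler(&s, &bf, &m, &lp);  /* ROSS would invoke this per event */
      return 0;
  }
  ```
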
- Cool. What is a KP?

  KPs (Kernel Processes) are really just an organizational structure: a container of
  LPs, itself contained within a PE. Most importantly, they determine the granularity
  of rollbacks. When an out-of-order event is detected by a receiving LP ("I got an
  event that is timestamped to have occurred 'before' the last event that I have
  processed"), all LPs in the same KP as the receiving LP must roll back to the time
  before the triggering event is set to occur. Too few KPs and you'll be rolling back a
  lot of LPs very frequently. Too many KPs is not really a terrible problem but could
  be memory inefficient - though I haven't really tested that.

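  A toy illustration of that granularity (plain C, nothing ROSS-specific): with 16 LPs
  dealt round-robin onto KPs, a single straggler event arriving at one LP forces every
  LP sharing that LP's KP to roll back, so fewer KPs means more collateral rollback:

  ```c
  #include <stdio.h>

  /* n_lps LPs are dealt round-robin onto n_kps KPs.  A straggler arriving
   * at straggler_lp forces every LP in that LP's KP to roll back. */
  static int lps_rolled_back(int n_lps, int n_kps, int straggler_lp)
  {
      int kp = straggler_lp % n_kps;
      int count = 0;
      for (int lp = 0; lp < n_lps; lp++)
          if (lp % n_kps == kp)
              count++;
      return count;
  }

  int main(void)
  {
      printf("1 KP:   %d LPs roll back\n", lps_rolled_back(16, 1, 5));
      printf("4 KPs:  %d LPs roll back\n", lps_rolled_back(16, 4, 5));
      printf("16 KPs: %d LP rolls back\n", lps_rolled_back(16, 16, 5));
      return 0;
  }
  ```
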
- I am getting an `Out of events in GVT! Consider increasing --extramem`. What is that?

  The GVT (Global Virtual Time) is a mechanism used to keep all LPs in sync.

  ROSS is built in C and is designed to be as fast as possible. So it allocates a bunch
  of memory up front and creates a "pool" of unused event structs that can be pulled
  from any time the simulation needs an event. It does not resize this on the fly (that
  could be time expensive; it's arguably better to just error out and tell the user
  that they should allocate more). Because we may need to roll back the simulation, we
  need to 'hold' all forward events after they've been processed until we know that it
  is safe to garbage collect them. The time when we can know this is during
  synchronization points: GVT calculation. GVT is the minimum point in virtual time
  that all LPs can agree "has happened". Any events with timestamps up to that point
  can be safely garbage collected and their event structs returned to the pool.

  If, however, we create so many events that we run out of empty events in the
  pool, the simulation will recognize this, end, and error out telling the user how
  to rectify the situation.

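  A minimal sketch of such a preallocated pool (a free list of fixed-size event
  structs; ROSS's internal implementation is more involved):

  ```c
  #include <stdio.h>
  #include <stdlib.h>

  #define POOL_SIZE 3  /* tiny on purpose, to show exhaustion */

  typedef struct event { double ts; struct event *next; } event;

  static event pool[POOL_SIZE];
  static event *free_list;

  static void pool_init(void)
  {
      for (int i = 0; i < POOL_SIZE - 1; i++)
          pool[i].next = &pool[i + 1];
      pool[POOL_SIZE - 1].next = NULL;
      free_list = &pool[0];
  }

  static event *event_alloc(void)
  {
      if (!free_list) {
          /* This is the situation the ROSS error message describes. */
          fprintf(stderr, "Out of events! Consider increasing --extramem\n");
          exit(1);
      }
      event *e = free_list;
      free_list = e->next;
      return e;
  }

  static void event_free(event *e)  /* called once GVT passes e->ts */
  {
      e->next = free_list;
      free_list = e;
  }

  int main(void)
  {
      pool_init();
      event *a = event_alloc();
      event *b = event_alloc();
      event *c = event_alloc();
      (void)b; (void)c;            /* pool is now empty */
      event_free(a);               /* garbage collection returns one slot */
      event *d = event_alloc();    /* succeeds only because of the free above */
      printf("reused slot: %s\n", d == a ? "yes" : "no");
      return 0;
  }
  ```
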
  Why is memory not allocated dynamically, then? There are two reasons for this. First,
  ROSS is designed to be as fast as possible, and dynamic allocation takes time. If we
  abstract this responsibility away from the user, it's likely that they won't think
  about it anymore; then, when they create models that make a TON of events, they'll
  have no idea why it's running so slowly - when it's ultimately because it's having to
  repeatedly malloc a bunch of events during the simulation instead of all up front.
  Second, and even worse, the model can be unstable in the creation of messages/events.
  It's quite simple to write a program that sends two messages for each message it
  receives. Such a program requires an unbounded pool, which is not feasible on any
  system. If you keep receiving this error no matter how much extra memory you allocate
  for the simulation, you might have a model with an unstable message structure.

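  To see how fast a two-messages-per-message model exhausts any pool, count
  outstanding events per generation - the growth is geometric:

  ```c
  #include <stdio.h>

  int main(void)
  {
      /* Each processed event schedules two new ones, so the number of
       * outstanding events doubles every generation: 2^n growth that
       * will exhaust any fixed-size pool. */
      long outstanding = 1;
      for (int gen = 0; gen <= 20; gen++) {
          if (gen % 10 == 0)
              printf("generation %2d: %7ld outstanding events\n", gen, outstanding);
          outstanding *= 2;
      }
      return 0;
  }
  ```
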
- I am getting a `Lookahead violation: decrease g_tw_lookahead`. What is that?

  ROSS can make use of three modes of execution or [schedulers](https://ross-org.github.io/feature/schedulers.html).
  The conservative scheduler works by advancing the simulation in tiny synchronized
  steps. For this, it is assumed that, within a small window of time called the
  lookahead, no new events will be created by the events being processed. If an event
  is created with an offset smaller than the lookahead, it violates that assumption and
  the execution will halt.

  This means that models with zero-offset events cannot run under the conservative
  scheduler, unfortunately. You can use either the sequential, `--synch=1`, or
  optimistic, `--synch=3`, scheduler instead.

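  The check itself amounts to something like this sketch (`g_tw_lookahead` is the real
  ROSS global's name; the surrounding guard code is illustrative):

  ```c
  #include <stdio.h>
  #include <stdlib.h>

  static double g_tw_lookahead = 0.5;  /* the conservative window */

  /* Sketch of the guard applied when a new event is scheduled `offset`
   * time units into the future under the conservative scheduler. */
  static void schedule_event(double offset)
  {
      if (offset < g_tw_lookahead) {
          fprintf(stderr, "Lookahead violation: decrease g_tw_lookahead\n");
          exit(1);
      }
      printf("event scheduled at offset %g\n", offset);
  }

  int main(void)
  {
      schedule_event(1.0);   /* fine: beyond the lookahead window */
      schedule_event(0.75);  /* fine */
      /* schedule_event(0.1) would abort: closer than the lookahead */
      return 0;
  }
  ```
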
- What is the "AVL Tree" doing? What is its function in ROSS? And why is it so important
  that its size is shown in the statistics after executing a ROSS model?

  The AVL tree is an efficient way of keeping track of existing events that haven't yet
  been confirmed to have been "committed" to simulation history. In optimistic mode, we
  can't "commit" any event to simulation history until GVT is greater than the event's
  timestamp. Until then, it's entirely possible for that event to be rolled back.
  Keeping track of the AVL tree size during the course of the simulation is a rough
  way of monitoring event fanout (the rate at which events are being created/destroyed).

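  The bookkeeping it provides boils down to this (a sorted array stands in for the AVL
  tree here - same idea, worse asymptotics):

  ```c
  #include <stdio.h>

  /* Timestamps of in-flight (uncommitted) events, kept ordered the way
   * the AVL tree keeps them ordered. */
  static double uncommitted[8] = {1.0, 2.5, 3.0, 4.5, 6.0, 7.5};
  static int n_uncommitted = 6;

  /* When GVT advances, everything with ts < gvt is safe to commit and
   * leaves the tree; the rest may still be rolled back. */
  static int commit_up_to(double gvt)
  {
      int committed = 0;
      while (committed < n_uncommitted && uncommitted[committed] < gvt)
          committed++;
      for (int i = committed; i < n_uncommitted; i++)
          uncommitted[i - committed] = uncommitted[i];
      n_uncommitted -= committed;
      return committed;
  }

  int main(void)
  {
      printf("tree size before GVT: %d\n", n_uncommitted);
      int done = commit_up_to(4.0);  /* GVT advances to 4.0 */
      printf("committed %d events, tree size now %d\n", done, n_uncommitted);
      return 0;
  }
  ```
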
- Storing whole events seems rather wasteful. Why do we have to keep track of all of
  the memory of an event? Couldn't we just keep track of some signature of the event and
  save some space?

  There are essentially two ways of accomplishing rollbacks:

  1. Reverse computation - if we know the exact context for the event and its
     computation is easily reversible. For example, "This was an event of type X, which
     means that it incremented the state property `s->y` by one". So we know the
     opposite would be to "decrement `s->y` by one". We don't need any more
     information! But it's not always this simple to revert the state.
  2. State swapping - if we instead, at the time of processing the forward event,
     encode the previous state into the message struct, we can, during rollback, fetch
     that saved data from the message and put it back in the LP state. This is helpful
     if you have an event that makes some destructive change like "set whatever value
     of `s->y` to equal zero". It is, in principle, impossible to recover information
     that has been destroyed. For all we know, `s->y` could have been 23526 prior to
     our forward event that set it to zero. It could have been 1. If we saved that
     information in the message struct, then we can just look at the message struct
     during reverse computation and say "oh, we know that `s->y` was 23526 prior to
     setting it to zero".

  Note: #2 requires that we don't overwrite any event data until we know we don't need
  it!

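  Both techniques in miniature (plain C; the struct layout is illustrative, not
  ROSS's):

  ```c
  #include <assert.h>
  #include <stdio.h>

  typedef struct { int y; } state;
  typedef struct { int saved_y; } message;  /* scratch space for state swapping */

  /* Technique 1: reversible by construction. */
  static void incr_fwd(state *s) { s->y += 1; }
  static void incr_rev(state *s) { s->y -= 1; }

  /* Technique 2: destructive, so stash the old value in the message. */
  static void zero_fwd(state *s, message *m) { m->saved_y = s->y; s->y = 0; }
  static void zero_rev(state *s, message *m) { s->y = m->saved_y; }

  int main(void)
  {
      state s = { .y = 23526 };
      message m;

      incr_fwd(&s);          /* forward: y = 23527 */
      zero_fwd(&s, &m);      /* forward: y = 0, old value saved in the message */

      zero_rev(&s, &m);      /* rollback restores 23527 from the message */
      incr_rev(&s);          /* rollback undoes the increment */

      assert(s.y == 23526);  /* state fully restored */
      printf("restored y = %d\n", s.y);
      return 0;
  }
  ```
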
- Can we allocate dynamic memory for an event?

  Yes and no. The gist of it is: if you are sending a message, don't allocate memory!
  If you are processing a message, yes you can, but you have to remember to deallocate
  it. **Every** event processed forward is either rolled back or committed (removed) at
  some point.

  Very important: be very careful about using dynamic memory encoded into messages. You
  might have already considered this, but when you `malloc()` something, you get a
  pointer to that memory. If we are saving that data into the message struct as dynamic
  memory, all that's really saved is just the pointer to that data. If this message is
  destined for an LP on a different PE (a different processor), the LP that receives it
  will get the message, and if it tries to read the data located at the encoded
  pointer, it will segfault - or worse, it will run with invalid data.

  One legitimate use of dynamic memory is storing the state of an LP during the forward
  computation step, either because the state's size is known only at runtime or because
  it is too large to save in the message struct. Later on, the space allocated in the
  forward computation step must be freed in the reverse event handler or the commit
  handler. Basically, allocating space within message handling should never be used for
  any purpose other than in forward, reverse, and commit handlers - and even then, use
  caution.

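  The pointer pitfall, made concrete (illustrative structs; freeing the sender's buffer
  stands in for the message crossing to a PE where that address means nothing):

  ```c
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define PAYLOAD_MAX 32

  typedef struct {
      char *ptr_payload;                /* BAD across PEs: only the pointer travels */
      char inline_payload[PAYLOAD_MAX]; /* GOOD: the bytes travel with the message */
  } message;

  int main(void)
  {
      char *scratch = malloc(PAYLOAD_MAX);
      strcpy(scratch, "hello LP");

      message m;
      m.ptr_payload = scratch;            /* just an address */
      strcpy(m.inline_payload, scratch);  /* an actual copy */

      /* Simulate the message crossing to another PE: the sender's memory
       * is gone (freed here; a meaningless address on a remote rank). */
      free(scratch);

      /* Dereferencing m.ptr_payload now would be use-after-free locally
       * and garbage on a remote PE.  The inline copy is still valid: */
      printf("inline payload: %s\n", m.inline_payload);
      return 0;
  }
  ```
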
- What is the purpose of the `commit_f` function? Is it to deallocate any memory after a
  reverse message is processed and is no longer needed (i.e., behind the GVT)? (See
  previous question.)

  The `commit_f` function can be tricky and should be handled with care. Yes, the
  `commit_f` function can be used to free any dynamic memory that was allocated during
  a forward event handler into the message struct. (It should also be freed in the
  reverse event handler - a message will be passed through a commit callback XOR a
  reverse callback, never both.) Doing this will prevent memory leaks.

  However, while you technically have access to LP state during a `commit_f` function,
  you should refrain from touching it.

  The reason comes down to when the `commit_f` function happens. Say you had a ROSS
  program where LPs received events of type ADD which increment some encoded LP state.
  And say an LP receives an ADD event timestamped at virtual time 5. Eventually ROSS
  gets around to its GVT calculation phase and everyone agrees that GVT is now at 6.
  But the LP that received that ADD event is now at, say, virtual time 8 (because we
  can optimistically execute events and LPs aren't in lock step with each other
  following GVT). When the `commit_f` handler is processing that ADD event, the LP
  believes that it is at time 8. But the event that we are committing was timestamped
  at 5. If we touch LP state now, we are not touching it at time 5, we're touching it
  at time 8. If we were to print out `tw_now(lp)` inside of the commit handler, it
  would read 8, not 5.

  It's OK if the implications of this are not 100% clear; just be aware that `commit_f`
  should really only be reading data or freeing dynamic memory from events. And the
  data to be read (unless it's data encoded in the message struct, which is kept safe)
  will be data from virtual time 8, not 5.

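  The timing mismatch in miniature (illustrative stand-ins for `tw_lp` and `tw_now`):

  ```c
  #include <stdio.h>

  typedef struct { double now; } tw_lp_stub;  /* stand-in for tw_lp */
  typedef struct { double ts; } event_stub;

  static double tw_now_stub(const tw_lp_stub *lp) { return lp->now; }

  /* A commit handler runs long after the event it commits was processed:
   * the LP has optimistically moved on, so tw_now() != the event's ts. */
  static void commit_handler(tw_lp_stub *lp, const event_stub *ev)
  {
      printf("committing event ts=%g while LP is at tw_now=%g\n",
             ev->ts, tw_now_stub(lp));
  }

  int main(void)
  {
      tw_lp_stub lp = { .now = 8.0 };  /* LP raced ahead to t=8 */
      event_stub add = { .ts = 5.0 };  /* event from t=5; GVT reached 6 */
      commit_handler(&lp, &add);
      return 0;
  }
  ```
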
- I've noticed that we can have multiple LP definitions, but can we have multiple message
  definitions? What if I want to send messages with different payloads and sizes? Do I
  resort to declaring a fixed-size message and play around with casting to different
  message structs for different things? What about sending an arbitrarily sized message?

  No, the system operates under only one message type. If you want anything fancy,
  maybe allocate memory on the fly and put a reference to it in the struct. Beware that
  you should never, ever allocate dynamic memory when creating a message (see the two
  previous questions!).

  ROSS expects a single event size so it knows how to allocate the event pool.
  Technically, when we are processing an event, we just get a pointer to the message
  struct. We can cast that to be whatever we want. But the size of transmitted messages
  must always be the max of any message struct that we'd be casting to.

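  A common way to get several logical message types out of one fixed-size struct is a
  type tag plus a union (illustrative layout; the point is that the one struct's size
  is automatically the max of all payloads):

  ```c
  #include <stdio.h>

  typedef enum { MSG_LETTER, MSG_CAR_ARRIVAL } msg_kind;

  typedef struct { long dest_lp; } letter_payload;
  typedef struct { double eta; int cargo; } car_payload;

  /* One message type for the whole model: the event pool is sized from
   * sizeof(message), which covers the largest payload in the union. */
  typedef struct {
      msg_kind kind;
      union {
          letter_payload letter;
          car_payload car;
      } u;
  } message;

  static void handle(const message *m)
  {
      switch (m->kind) {
      case MSG_LETTER:
          printf("letter for LP %ld\n", m->u.letter.dest_lp);
          break;
      case MSG_CAR_ARRIVAL:
          printf("car arriving at t=%g with %d letter(s)\n",
                 m->u.car.eta, m->u.car.cargo);
          break;
      }
  }

  int main(void)
  {
      message a = { .kind = MSG_LETTER, .u.letter = { .dest_lp = 7 } };
      message b = { .kind = MSG_CAR_ARRIVAL, .u.car = { .eta = 3.5, .cargo = 2 } };
      handle(&a);
      handle(&b);
      return 0;
  }
  ```
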
- ROSS seems kinda limited. I want to allocate dynamic memory and a garbage collector,
  have arbitrarily sized messages, LP creation on the fly, reallocation of LPs, a
  dynamic message pool, and ...

  With ROSS, most things that make you think "oh, that seems rather limiting" have a
  reason behind them: speed.

  ROSS's claim is being incredibly scalable and fast. Having variable event sizes would
  complicate the pooling of events, requiring us to dynamically allocate events as they
  are created or to create several different pools, one for every single event type
  (challenging to support cleanly and also very inefficient - it would lead to more
  problems than it solves), and so on and on.