
Commit d91ccb0

Merge pull request #833 from cderici/handle-allwatcher-task-exceptions
#### Description

This one was a bit tricky. Fixes #829.

The `_all_watcher` task is a coroutine that the AllWatcher runs in the background indefinitely. It contains a while loop controlled manually through flags (asyncio events) such as `_watch_stopping` and `_watch_stopped`. The problem is that when `_all_watcher` raises an exception (or receives one from a call like `get_config()`, as in #829), that exception is thrown somewhere in the event loop and never handled or re-raised. This is because the coroutine is not `await`ed, for good reason: it can never produce a result, since it is supposed to run in the background forever, fetching deltas for us.

As a result, if `_all_watcher` fails, external flags like `_watch_received` are never set, and whoever calls `await self._watch_received.wait()` blocks forever (in this case `_after_connect()`). Similarly, `disconnect()` waits for the `_watch_stopped` flag, which won't be set either, so calling `disconnect()` after the all_watcher has failed also hangs forever.

This change fixes the problem by waiting, at each wait-for-flag spot, for two things: 1) whichever flag we're waiting for, and 2) the `_all_watcher` task becoming `done()`. In the latter case we should expect to see an exception, because that task is not supposed to finish. More importantly, once we see that `_watcher_task.done()` is true, we no longer sit and wait forever for the `_all_watcher` event flags to be set, so we won't hang. A nice side effect is that we should see fewer "Task exception was never retrieved" messages, since we now call `.exception()` on the `_all_watcher` task. We'll probably still get those from tasks like `_pinger` and `_debug_log`, but this is a good first example of how to handle them as well.

#### QA Steps

This should be rigorously tested, as it slightly changes a fundamental mechanism. We need to make sure all the tests pass. For the manual QA, I changed the body of `model.get_config()` to `raise JujuError("FOO")`, artificially inducing an error that appears to come from the API outside of the all_watcher loop. This recreates the exact condition seen in #829. Alternatively, set up two controllers and use pylibjuju while a migration is running in the background, which produces a `migration is in progress` error. Either way, with the error in place, run the following and it should print "Error handled" on stdout:

```python
async def juju_stats():
    m = Model()
    await m.connect()

try:
    asyncio.run(juju_stats())
except JujuError:
    print("Error handled")
```

#### Notes & Discussion

We might also want to get this onto the other branches, after we carefully test and land it on `2.9` as requested in #829.
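The core of the fix (wait on either the flag or the background task, and surface the task's exception) can be sketched with plain `asyncio`. This is a standalone sketch with hypothetical names (`failing_watcher`, `connect_like`, `flag`), not the pylibjuju code, which goes through its `jasyncio` wrapper:

```python
import asyncio

async def failing_watcher(flag: asyncio.Event):
    # Stand-in for _all_watcher: dies before ever setting the flag.
    raise RuntimeError("migration is in progress")

async def connect_like():
    flag = asyncio.Event()  # stand-in for _watch_received
    watcher = asyncio.create_task(failing_watcher(flag))
    waiter = asyncio.create_task(flag.wait())
    # Wait for whichever happens first: the flag is set, or the watcher ends.
    done, pending = await asyncio.wait({waiter, watcher},
                                       return_when=asyncio.FIRST_COMPLETED)
    if watcher in done:
        waiter.cancel()  # don't leak the flag-wait task
        exc = watcher.exception()  # retrieving it also silences the warning
        if exc is not None:
            raise exc
        raise RuntimeError("watcher finished without an exception")

try:
    asyncio.run(connect_like())
except RuntimeError as e:
    print("Error handled:", e)  # prints "Error handled: migration is in progress"
```

Without the `asyncio.wait(..., return_when=FIRST_COMPLETED)` race, `await flag.wait()` would block forever, which is exactly the hang described above.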
2 parents eba8dec + 74fd81d commit d91ccb0

15 files changed

Lines changed: 123 additions & 118 deletions

juju/model.py

Lines changed: 38 additions & 8 deletions
```diff
@@ -759,7 +759,26 @@ async def _after_connect(self):
         # we've received all the model data, which might be
         # a whole load of unneeded data if all the client wants
         # to do is make one RPC call.
-        await self._watch_received.wait()
+        async def watch_received_waiter():
+            await self._watch_received.wait()
+        waiter = jasyncio.create_task(watch_received_waiter())
+
+        # If we just wait for the _watch_received event and the _all_watcher task
+        # fails (e.g. because API fails like migration is in progress), then
+        # we'll hang because the _watch_received will never be set
+        # Instead, we watch for two things, 1) _watch_received, 2) _all_watcher done
+        # If _all_watcher is done before the _watch_received, then we should see
+        # (and raise) an exception coming from the _all_watcher
+        # Otherwise (i.e. _watch_received is set), then we're good to go
+        done, pending = await jasyncio.wait([waiter, self._watcher_task],
+                                            return_when=jasyncio.FIRST_COMPLETED)
+        if self._watcher_task in done:
+            # Cancel the _watch_received.wait
+            waiter.cancel()
+            # If there's no exception, then why did the _all_watcher break its loop?
+            if not self._watcher_task.exception():
+                raise JujuError("AllWatcher task is finished abruptly without an exception.")
+            raise self._watcher_task.exception()
 
         await self.get_info()
         self.uuid = self.info.uuid
@@ -771,6 +790,12 @@ async def disconnect(self):
         if not self._watch_stopped.is_set():
             log.debug('Stopping watcher task')
             self._watch_stopping.set()
+            # If the _all_watcher task is finished,
+            # check to see an exception, if yes, raise,
+            # otherwise we should see the _watch_stopped
+            # flag is set
+            if self._watcher_task.done() and self._watcher_task.exception():
+                raise self._watcher_task.exception()
             await self._watch_stopped.wait()
             self._watch_stopping.clear()
 
@@ -1040,6 +1065,7 @@ def _watch(self):
         See :meth:`add_observer` to register an onchange callback.
 
         """
+
         def _post_step(obj):
             # Once we get the model, ensure we're running in the correct state
             # as a post step.
@@ -1129,7 +1155,7 @@ async def _all_watcher():
         self._watch_received.clear()
         self._watch_stopping.clear()
         self._watch_stopped.clear()
-        jasyncio.ensure_future(_all_watcher())
+        self._watcher_task = jasyncio.create_task(_all_watcher())
 
     async def _notify_observers(self, delta, old_obj, new_obj):
         """Call observing callbacks, notifying them of a change in model state
```
```diff
@@ -2403,7 +2429,7 @@ async def _get_source_api(self, url, controller_name=None):
 
     async def wait_for_idle(self, apps=None, raise_on_error=True, raise_on_blocked=False,
                             wait_for_active=False, timeout=10 * 60, idle_period=15, check_freq=0.5,
-                            status=None, wait_for_units=1, wait_for_exact_units=-1):
+                            status=None, wait_for_units=None, wait_for_exact_units=-1):
         """Wait for applications in the model to settle into an idle state.
 
         :param apps (list[str]): Optional list of specific app names to wait on.
@@ -2452,6 +2478,8 @@ async def wait_for_idle(self, apps=None, raise_on_error=True, raise_on_blocked=F
             warnings.warn("wait_for_active is deprecated; use status", DeprecationWarning)
             status = "active"
 
+        _wait_for_units = wait_for_units if wait_for_units is not None else 1
+
         timeout = timedelta(seconds=timeout) if timeout is not None else None
         idle_period = timedelta(seconds=idle_period)
         start_time = datetime.now()
@@ -2501,12 +2529,14 @@ def _raise_for_status(entities, status):
                                 (wait_for_exact_units, len(app.units)))
                     continue
                 # If we have less # of units then required, then wait a bit more
-                elif len(app.units) < wait_for_units:
+                elif len(app.units) < _wait_for_units:
                     busy.append(app.name + " (not enough units yet - %s/%s)" %
-                                (len(app.units), wait_for_units))
+                                (len(app.units), _wait_for_units))
                     continue
-                elif len(units_ready) >= wait_for_units:
-                    # No need to keep looking, we have the desired number of units ready to go
+                # User wants to see a certain # of units, and we have enough
+                elif wait_for_units and len(units_ready) >= _wait_for_units:
+                    # So no need to keep looking, we have the desired number of units ready to go,
+                    # exit the loop. Don't return, though, we might still have some errors to raise
                     break
                 for unit in app.units:
                     if unit.machine is not None and unit.machine.status == "error":
@@ -2531,7 +2561,7 @@ def _raise_for_status(entities, status):
                         units_ready.add(unit.name)
                     now = datetime.now()
                     idle_start = idle_times.setdefault(unit.name, now)
-                    print(f'unit {unit.name} is waiting since : {idle_start} -- now : {now} -- waiting for : {now - idle_start}')
+
                     if now - idle_start < idle_period:
                         busy.append("{} [{}] {}: {}".format(unit.name,
                                                             unit.agent_status,
```
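The `wait_for_idle` tweak changes the default of `wait_for_units` to `None` so that the early `break` fires only when the caller explicitly asked for a unit count, while the old default of 1 is still used for the "not enough units" check. The decision logic can be isolated like this (a hypothetical standalone helper for illustration, not the pylibjuju API):

```python
def unit_wait_decision(num_units, num_ready, wait_for_units=None):
    """Return 'wait', 'done', or 'check-units' for one app in the idle loop."""
    # Internal count defaults to 1, as in the diff.
    _wait_for_units = wait_for_units if wait_for_units is not None else 1
    if num_units < _wait_for_units:
        return "wait"          # not enough units yet, keep waiting
    if wait_for_units and num_ready >= _wait_for_units:
        return "done"          # caller's requested count is ready; stop looking
    return "check-units"       # fall through to the per-unit status checks

# With the default, we never short-circuit on ready counts:
print(unit_wait_decision(num_units=1, num_ready=1))                    # check-units
# An explicit request short-circuits once satisfied:
print(unit_wait_decision(num_units=3, num_ready=2, wait_for_units=2))  # done
print(unit_wait_decision(num_units=1, num_ready=0, wait_for_units=2))  # wait
```

This preserves the old threshold arithmetic while making the short-circuit opt-in, so callers relying on the full per-unit error checks keep them by default.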

tests/bundle/bundle.yaml

Lines changed: 8 additions & 28 deletions
```diff
@@ -1,32 +1,12 @@
-series: xenial
+series: jammy
 applications:
-  wordpress:
-    charm: "wordpress"
-    series: "xenial"
-    channel: "candidate"
+  grafana:
+    charm: "grafana"
+    channel: stable
     num_units: 1
-    annotations:
-      "gui-x": "339.5"
-      "gui-y": "-171"
-    to:
-      - "0"
-  mysql:
-    charm: "mysql"
-    series: "trusty"
-    channel: "candidate"
+  prometheus:
+    charm: "prometheus"
+    channel: stable
     num_units: 1
-    annotations:
-      "gui-x": "79.5"
-      "gui-y": "-142"
-    to:
-      - "1"
 relations:
-  - - "wordpress:db"
-    - "mysql:db"
-machines:
-  "0":
-    series: xenial
-    constraints: "arch=amd64 cores=1 cpu-power=100 mem=1740 root-disk=8192"
-  "1":
-    series: trusty
-    constraints: "arch=amd64 cores=1 cpu-power=100 mem=1740 root-disk=8192"
+  - ["prometheus:grafana-source", "grafana:grafana-source"]
```

tests/bundle/mini-bundle.yaml

Lines changed: 7 additions & 6 deletions
```diff
@@ -1,11 +1,12 @@
+series: jammy
 applications:
-  ghost:
-    charm: "ghost"
+  grafana:
+    charm: "grafana"
     channel: stable
     num_units: 1
-  mysql:
-    charm: "mysql"
-    channel: candidate
+  prometheus:
+    charm: "prometheus"
+    channel: stable
     num_units: 1
 relations:
-  - ["ghost", "mysql"]
+  - ["prometheus:grafana-source", "grafana:grafana-source"]
```
Lines changed: 11 additions & 13 deletions
```diff
@@ -1,17 +1,15 @@
-series: xenial
 applications:
-  ghost:
-    charm: "ghost"
-    num_units: 1
-  mysql:
-    charm: "mysql"
-    channel: "candidate"
-    series: "trusty"
+  helloa:
+    charm: "hello-juju"
+    name: "helloa"
+    channel: stable
     num_units: 1
     options:
-      max-connections: 2
-      tuning-level: include-base64://config-base64.yaml
+      application-repo: include-base64://config-base64.yaml
+  hellob:
+    charm: "hello-juju"
+    name: "hellob"
+    channel: stable
+    num_units: 1
   test:
-    charm: "../charm"
-relations:
-  - ["ghost", "mysql"]
+    charm: "../charm"
```
Lines changed: 8 additions & 8 deletions
```diff
@@ -1,15 +1,15 @@
 applications:
-  ghost:
-    charm: "ghost"
+  helloa:
+    charm: "hello-juju"
+    name: "helloa"
     channel: stable
     num_units: 1
     options:
       config: include-file://config1.yaml
-  mysql:
-    charm: "mysql"
-    channel: candidate
+  hellob:
+    charm: "hello-juju"
+    name: "hellob"
+    channel: stable
     num_units: 1
   test:
-    charm: "../charm"
-relations:
-  - ["ghost", "mysql"]
+    charm: "../charm"
```
Lines changed: 6 additions & 6 deletions
```diff
@@ -1,14 +1,14 @@
-series: xenial
+series: jammy
 applications:
-  ghost:
-    charm: "ghost"
+  grafana:
+    charm: "grafana"
     channel: stable
     num_units: 1
-  mysql:
-    charm: "mysql"
+  prometheus:
+    charm: "prometheus"
     channel: stable
     num_units: 1
   test:
     charm: "./tests/integration/charm"
 relations:
-  - ["ghost", "mysql"]
+  - ["prometheus:grafana-source", "grafana:grafana-source"]
```
Lines changed: 1 addition & 1 deletion
```diff
@@ -1 +1 @@
-ZmFzdA==
+aHR0cDovL215LWp1anUuY29t
```
Lines changed: 3 additions & 3 deletions
```diff
@@ -1,3 +1,3 @@
-ghost:
-  url: "http://my-ghost.blg"
-  port: 2369
+helloa:
+  application-repo: "http://my-juju.com"
+  port: 666
```

tests/integration/bundle/test-overlays/bundle-with-overlay-multi.yaml

Lines changed: 4 additions & 4 deletions
```diff
@@ -1,14 +1,14 @@
 applications:
   ghost:
-    charm: "ghost"
+    charm: "prometheus"
     channel: stable
     num_units: 1
   mysql:
-    charm: "mysql"
-    channel: candidate
+    charm: "prometheus"
+    channel: stable
     num_units: 1
 relations:
-  - ["ghost", "mysql"]
+  - ["ghost:grafana-source", "mysql:grafana-source"]
 --- # overlay.yaml
 description: Overlay to remove the ghost app and the relation
 applications:
```

tests/integration/bundle/test-overlays/test-multi-overlay.yaml

Lines changed: 2 additions & 2 deletions
```diff
@@ -10,6 +10,6 @@ description: Another overlay for test multi-overlay
 applications:
   memcached:
   mysql:
-    charm: "mysql"
-    channel: candidate
+    charm: "prometheus"
+    channel: stable
     num_units: 1
```
