1 - Application Container Lifecycle

Troubleshooting application container lifecycle

Since FSM injects application pods that are part of the service mesh with a long-running sidecar proxy and sets up traffic redirection rules to route all traffic to and from pods via the sidecar proxy, in some circumstances existing application containers might not start up or shut down as expected.

When the application container depends on network connectivity at startup

Application containers that depend on network connectivity at startup are likely to experience issues once the Pipy sidecar proxy container and the fsm-init init container are injected into the application pod by FSM. This is because upon sidecar injection, all TCP based network traffic from application containers is routed to the sidecar proxy and is subject to service mesh traffic policies. This implies that for application traffic to be routed as it would be without the sidecar proxy container injected, the FSM controller must first program the sidecar proxy on the application pod to allow such traffic. Until the Pipy sidecar proxy has been configured, all traffic from application containers will be dropped.

When FSM is configured with permissive traffic policy mode enabled, FSM will program wildcard traffic policy rules on the Pipy sidecar proxy to allow every pod to access all services that are a part of the mesh. When FSM is configured with SMI traffic policy mode enabled, explicit SMI policies must be configured to enable communication between applications in the mesh.

Regardless of the traffic policy mode, application containers that depend on network connectivity at startup can experience problems starting up if they are not resilient to delays in the network being ready. With the Pipy proxy sidecar injected, the network is deemed ready only when the sidecar proxy has been programmed by the FSM controller to allow application traffic to flow through the network.

It is recommended that application containers be resilient to delays during the initial bootstrapping phase of the Pipy proxy sidecar in the application pod.

It is important to note that the container’s restart policy also influences the startup of application containers. If an application container’s restart policy is set to Never and it depends on network connectivity being ready at startup time, the container may fail to access the network before the Pipy proxy sidecar is ready to allow it network access, causing the application container to exit and never recover from the failed startup. For this reason, it is recommended not to use a container restart policy of Never if your application container depends on network connectivity at startup.
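
As an illustrative sketch (not an official FSM recipe), a pod that needs the network at startup can keep restartPolicy set to Always and use a startup probe so Kubernetes keeps retrying the container while the sidecar is being programmed. The pod name, image, /healthz path, port, and probe timings below are assumptions to adapt to your application:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: myapp                    # hypothetical pod name
  namespace: mynamespace         # hypothetical namespace enrolled in the mesh
spec:
  restartPolicy: Always          # avoid Never for containers that need the network at startup
  containers:
  - name: myapp
    image: mynamespace/myapp:v1.0.0    # hypothetical image
    startupProbe:                # gives the app time while the sidecar is bootstrapped
      httpGet:
        path: /healthz           # assumed health endpoint
        port: 8080               # assumed application port
      failureThreshold: 30       # up to 30 * 5s before the container is restarted
      periodSeconds: 5
EOF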

2 - Error Codes

Troubleshooting control plane error codes

Error Code Descriptions

If error codes are present in the FSM error logs or detected from the FSM error code metrics, the fsm support error-info CLI tool can be used to gain more information about the error code.
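
For example, a quick way (a sketch, not an official FSM command) to surface error codes from the controller logs is to grep for the E-prefixed codes, assuming the control plane runs in the fsm-system namespace:

# Assumes fsm-controller runs in the fsm-system namespace
kubectl logs -n fsm-system -l app=fsm-controller --tail=-1 | grep -oE 'E[0-9]{4}' | sort | uniq -c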

The following table is generated by running fsm support error-info.

+------------+----------------------------------------------------------------------------------+
| ERROR CODE |                                   DESCRIPTION                                    |
+------------+----------------------------------------------------------------------------------+
| E1000      | An invalid command line argument was passed to the application.                  |
+------------+----------------------------------------------------------------------------------+
| E1001      | The specified log level could not be set in the system.                          |
+------------+----------------------------------------------------------------------------------+
| E1002      | The fsm-controller k8s pod resource was not able to be retrieved by the system.  |
+------------+----------------------------------------------------------------------------------+
| E1003      | The fsm-injector k8s pod resource was not able to be retrieved by the system.    |
+------------+----------------------------------------------------------------------------------+
| E1004      | The Ingress client created by the fsm-controller to monitor Ingress resources    |
|            | failed to start.                                                                 |
+------------+----------------------------------------------------------------------------------+
| E1005      | The Reconciler client to monitor updates and deletes to FSM's CRDs and mutating  |
|            | webhook failed to start.                                                         |
+------------+----------------------------------------------------------------------------------+
| E2000      | An error was encountered while attempting to deduplicate traffic matching        |
|            | attributes (destination port, protocol, IP address etc.) used for matching       |
|            | egress traffic. The applied egress policies could be conflicting with each       |
|            | other, and the system was unable to process affected egress policies.            |
+------------+----------------------------------------------------------------------------------+
| E2001      | An error was encountered while attempting to deduplicate upstream clusters       |
|            | associated with the egress destination. The applied egress policies could be     |
|            | conflicting with each other, and the system was unable to process affected       |
|            | egress policies.                                                                 |
+------------+----------------------------------------------------------------------------------+
| E2002      | An invalid IP address range was specified in the egress policy. The IP address   |
|            | range must be specified as a CIDR notation IP address and prefix length, like    |
|            | "192.0.2.0/24", as defined in RFC 4632. The invalid IP address range was ignored |
|            | by the system.                                                                   |
+------------+----------------------------------------------------------------------------------+
| E2003      | An invalid match was specified in the egress policy. The specified match was     |
|            | ignored by the system while applying the egress policy.                          |
+------------+----------------------------------------------------------------------------------+
| E2004      | The SMI HTTPRouteGroup resource specified as a match in an egress policy was not |
|            | found. Please verify that the specified SMI HTTPRouteGroup resource exists in    |
|            | the same namespace as the egress policy referencing it as a match.               |
+------------+----------------------------------------------------------------------------------+
| E2005      | The SMI HTTPRouteGroup resource specified as a match in an SMI TrafficTarget     |
|            | policy was unable to be retrieved by the system. The associated SMI              |
|            | TrafficTarget policy was ignored by the system. Please verify that the matches   |
|            | specified for the TrafficTarget resource exist in the same namespace as the      |
|            | TrafficTarget policy referencing the match.                                      |
+------------+----------------------------------------------------------------------------------+
| E2006      | The SMI HTTPRouteGroup resource is invalid as it does not have any matches       |
|            | specified. The SMI HTTPRouteGroup policy was ignored by the system.              |
+------------+----------------------------------------------------------------------------------+
| E2007      | There are multiple SMI traffic split policies associated with the same           |
|            | apex(root) service specified in the policies. The system does not support        |
|            | this scenario so only the first encountered policy is processed by the system,   |
|            | subsequent policies referring the same apex service are ignored.                 |
+------------+----------------------------------------------------------------------------------+
| E2008      | There was an error adding a route match to an outbound traffic policy            |
|            | representation within the system. The associated route was ignored by the        |
|            | system.                                                                          |
+------------+----------------------------------------------------------------------------------+
| E2009      | The inbound TrafficTargets composed of their routes for a given destination      |
|            | ServiceIdentity could not be configured.                                         |
+------------+----------------------------------------------------------------------------------+
| E2010      | An applied SMI TrafficTarget policy has an invalid destination kind.             |
+------------+----------------------------------------------------------------------------------+
| E2011      | An applied SMI TrafficTarget policy has an invalid source kind.                  |
+------------+----------------------------------------------------------------------------------+
| E3000      | The system found 0 endpoints to be reached when the service's FQDN was resolved. |
+------------+----------------------------------------------------------------------------------+
| E3001      | A Kubernetes resource could not be marshalled.                                   |
+------------+----------------------------------------------------------------------------------+
| E3002      | A Kubernetes resource could not be unmarshalled.                                 |
+------------+----------------------------------------------------------------------------------+
| E4000      | The Kubernetes secret containing the certificate could not be retrieved by the   |
|            | system.                                                                          |
+------------+----------------------------------------------------------------------------------+
| E4001      | The certificate specified by name could not be obtained by key from the secret's |
|            | data.                                                                            |
+------------+----------------------------------------------------------------------------------+
| E4002      | The private key specified by name could not be obtained by key from the secret's |
|            | data.                                                                            |
+------------+----------------------------------------------------------------------------------+
| E4003      | The certificate expiration specified by name could not be obtained by key from   |
|            | the secret's data.                                                               |
+------------+----------------------------------------------------------------------------------+
| E4004      | The certificate expiration obtained from the secret's data by name could not be  |
|            | parsed.                                                                          |
+------------+----------------------------------------------------------------------------------+
| E4005      | The secret containing a certificate could not be created by the system.          |
+------------+----------------------------------------------------------------------------------+
| E4006      | A private key failed to be generated.                                            |
+------------+----------------------------------------------------------------------------------+
| E4007      | The specified private key could not be converted from a DER encoded              |
|            | key to a PEM encoded key.                                                        |
+------------+----------------------------------------------------------------------------------+
| E4008      | The certificate request fails to be created when attempting to issue a           |
|            | certificate.                                                                     |
+------------+----------------------------------------------------------------------------------+
| E4009      | When creating a new certificate authority, the root certificate could not be     |
|            | obtained by the system.                                                          |
+------------+----------------------------------------------------------------------------------+
| E4010      | The specified certificate could not be converted from a DER encoded certificate  |
|            | to a PEM encoded certificate.                                                    |
+------------+----------------------------------------------------------------------------------+
| E4011      | The specified PEM encoded certificate could not be decoded.                      |
+------------+----------------------------------------------------------------------------------+
| E4012      | The specified PEM privateKey for the certificate authority's root certificate    |
|            | could not be decoded.                                                            |
+------------+----------------------------------------------------------------------------------+
| E4013      | An unspecified error occurred when issuing a certificate from the certificate    |
|            | manager.                                                                         |
+------------+----------------------------------------------------------------------------------+
| E4014      | An error occurred when creating a certificate to issue from the certificate      |
|            | manager.                                                                         |
+------------+----------------------------------------------------------------------------------+
| E4015      | The certificate authority provided when issuing a certificate was invalid.       |
+------------+----------------------------------------------------------------------------------+
| E4016      | The specified certificate could not be rotated.                                  |
+------------+----------------------------------------------------------------------------------+
| E4100      | Failed parsing object into PubSub message.                                       |
+------------+----------------------------------------------------------------------------------+
| E4150      | Failed initial cache sync for config.flomesh.io informer.                        |
+------------+----------------------------------------------------------------------------------+
| E4151      | Failed to cast object to MeshConfig.                                             |
+------------+----------------------------------------------------------------------------------+
| E4152      | Failed to fetch MeshConfig from cache with specific key.                         |
+------------+----------------------------------------------------------------------------------+
| E4153      | Failed to marshal MeshConfig into other format.                                  |
+------------+----------------------------------------------------------------------------------+
| E5000      | A XDS resource could not be marshalled.                                          |
+------------+----------------------------------------------------------------------------------+
| E5001      | The XDS certificate common name could not be parsed. The CN should be of the     |
|            | form <proxy-UUID>.<kind>.<proxy-identity>.                                       |
+------------+----------------------------------------------------------------------------------+
| E5002      | The proxy UUID obtained from parsing the XDS certificate's common name did not   |
|            | match the fsm-proxy-uuid label value for any pod. The pod associated with the    |
|            | specified Pipy proxy could not be found.                                        |
+------------+----------------------------------------------------------------------------------+
| E5003      | A pod in the mesh belongs to more than one service. By FSM                       |
|            | convention the number of services a pod can belong to is 1. This is a limitation |
|            | we set in place in order to make the mesh easy to understand and reason about.   |
|            | When a pod belongs to more than one service, XDS will not program the Pipy       |
|            | proxy, leaving it out of the mesh.                                               |
+------------+----------------------------------------------------------------------------------+
| E5004      | The Pipy proxy data structure created by ADS to reference a Pipy proxy           |
|            | sidecar from a pod's fsm-proxy-uuid label could not be configured.               |
+------------+----------------------------------------------------------------------------------+
| E5005      | A GRPC connection failure occurred and the ADS is no longer able to receive      |
|            | DiscoveryRequests.                                                               |
+------------+----------------------------------------------------------------------------------+
| E5006      | The DiscoveryResponse configured by ADS failed to send to the Pipy proxy.       |
+------------+----------------------------------------------------------------------------------+
| E5007      | The resources to be included in the DiscoveryResponse could not be generated.    |
+------------+----------------------------------------------------------------------------------+
| E5008      | The aggregated resources generated for a DiscoveryResponse failed to be          |
|            | configured as a new snapshot in the Pipy xDS Aggregate Discovery Services       |
|            | cache.                                                                           |
+------------+----------------------------------------------------------------------------------+
| E5009      | The Aggregate Discovery Server (ADS) created by the FSM controller failed to     |
|            | start.                                                                           |
+------------+----------------------------------------------------------------------------------+
| E5010      | The ServiceAccount referenced in the NodeID does not match the ServiceAccount    |
|            | specified in the proxy certificate. The proxy was not allowed to be a part of    |
|            | the mesh.                                                                        |
+------------+----------------------------------------------------------------------------------+
| E5011      | The gRPC stream was closed by the proxy and no DiscoveryRequests can be          |
|            | received. The Stream Aggregated Resource server was terminated for the specified |
|            | proxy.                                                                           |
+------------+----------------------------------------------------------------------------------+
| E5012      | The sidecar proxy has not completed the initialization phase and it is not ready |
|            | to receive broadcast updates from control plane related changes. New versions    |
|            | should not be pushed if the first request has not been received. The broadcast   |
|            | update was ignored for that proxy.                                               |
+------------+----------------------------------------------------------------------------------+
| E5013      | The TypeURL of the resource being requested in the DiscoveryRequest is invalid.  |
+------------+----------------------------------------------------------------------------------+
| E5014      | The version of the DiscoveryRequest could not be parsed by ADS.                  |
+------------+----------------------------------------------------------------------------------+
| E5015      | A proxy egress cluster which routes traffic to its original destination could   |
|            | not be configured. When a Host is not specified in the cluster config, the       |
|            | original destination is used.                                                    |
+------------+----------------------------------------------------------------------------------+
| E5016      | A proxy egress cluster that routes traffic based on the specified Host resolved |
|            | using DNS could not be configured.                                               |
+------------+----------------------------------------------------------------------------------+
| E5017      | A proxy cluster that corresponds to a specified upstream service could not be   |
|            | configured.                                                                      |
+------------+----------------------------------------------------------------------------------+
| E5018      | The meshed services corresponding a specified Pipy proxy could not be listed.   |
+------------+----------------------------------------------------------------------------------+
| E5019      | Multiple Pipy clusters with the same name were configured. The duplicate        |
|            | clusters will not be sent to the Pipy proxy in a ClusterDiscovery response.     |
+------------+----------------------------------------------------------------------------------+
| E5020      | The application protocol specified for a port is not supported for ingress       |
|            | traffic. The XDS filter chain for ingress traffic to the port was not created.   |
+------------+----------------------------------------------------------------------------------+
| E5021      | An XDS filter chain could not be constructed for ingress.                        |
+------------+----------------------------------------------------------------------------------+
| E5022      | A traffic policy rule could not be configured as an RBAC rule on the proxy.      |
|            | The corresponding rule was ignored by the system.                                |
+------------+----------------------------------------------------------------------------------+
| E5023      | The SDS certificate resource could not be unmarshalled. The                      |
|            | corresponding certificate resource was ignored by the system.                    |
+------------+----------------------------------------------------------------------------------+
| E5024      | An XDS secret containing a TLS certificate could not be retrieved.               |
|            | The corresponding secret request was ignored by the system.                      |
+------------+----------------------------------------------------------------------------------+
| E5025      | The SDS secret does not correspond to a MeshService.                             |
+------------+----------------------------------------------------------------------------------+
| E5026      | The SDS secret does not correspond to a ServiceAccount.                          |
+------------+----------------------------------------------------------------------------------+
| E5027      | The identity obtained from the SDS certificate request does not match the        |
|            | identity of the proxy. The corresponding secret request was ignored by the       |
|            | system.                                                                          |
+------------+----------------------------------------------------------------------------------+
| E5028      | The SDS secret does not correspond to a MeshService.                             |
+------------+----------------------------------------------------------------------------------+
| E5029      | The SDS secret does not correspond to a ServiceAccount.                          |
+------------+----------------------------------------------------------------------------------+
| E5030      | The identity obtained from the SDS certificate request does not match the        |
|            | identity of the proxy. The corresponding certificate request was ignored         |
|            | by the system.                                                                   |
+------------+----------------------------------------------------------------------------------+
| E6100      | A protobuf ProtoMessage could not be converted into YAML.                        |
+------------+----------------------------------------------------------------------------------+
| E6101      | The mutating webhook certificate could not be parsed.                            |
|            | The mutating webhook HTTP server was not started.                                |
+------------+----------------------------------------------------------------------------------+
| E6102      | The sidecar injection webhook HTTP server failed to start.                       |
+------------+----------------------------------------------------------------------------------+
| E6103      | An AdmissionRequest could not be decoded.                                        |
+------------+----------------------------------------------------------------------------------+
| E6104      | The timeout from an AdmissionRequest could not be parsed.                        |
+------------+----------------------------------------------------------------------------------+
| E6105      | The AdmissionRequest's header was invalid. The content type obtained from the    |
|            | header is not supported.                                                         |
+------------+----------------------------------------------------------------------------------+
| E6106      | The AdmissionResponse could not be written.                                      |
+------------+----------------------------------------------------------------------------------+
| E6107      | The AdmissionRequest was empty.                                                  |
+------------+----------------------------------------------------------------------------------+
| E6108      | It could not be determined if the pod specified in the AdmissionRequest is       |
|            | enabled for sidecar injection.                                                   |
+------------+----------------------------------------------------------------------------------+
| E6109      | It could not be determined if the namespace specified in the                     |
|            | AdmissionRequest is enabled for sidecar injection.                               |
+------------+----------------------------------------------------------------------------------+
| E6110      | The port exclusions for a pod could not be obtained. No                          |
|            | port exclusions are added to the init container's spec.                          |
+------------+----------------------------------------------------------------------------------+
| E6111      | The AdmissionRequest body could not be read.                                     |
+------------+----------------------------------------------------------------------------------+
| E6112      | The AdmissionRequest body was nil.                                               |
+------------+----------------------------------------------------------------------------------+
| E6113      | The MutatingWebhookConfiguration could not be created.                           |
+------------+----------------------------------------------------------------------------------+
| E6114      | The MutatingWebhookConfiguration could not be updated.                           |
+------------+----------------------------------------------------------------------------------+
| E6700      | An error occurred when shutting down the validating webhook HTTP server.         |
+------------+----------------------------------------------------------------------------------+
| E6701      | The validating webhook HTTP server failed to start.                              |
+------------+----------------------------------------------------------------------------------+
| E6702      | The validating webhook certificate could not be parsed.                          |
|            | The validating webhook HTTP server was not started.                              |
+------------+----------------------------------------------------------------------------------+
| E6703      | The ValidatingWebhookConfiguration could not be created.                         |
+------------+----------------------------------------------------------------------------------+
| E7000      | An error occurred while reconciling the updated CRD to its original state.       |
+------------+----------------------------------------------------------------------------------+
| E7001      | An error occurred while reconciling the deleted CRD.                             |
+------------+----------------------------------------------------------------------------------+
| E7002      | An error occurred while reconciling the updated mutating webhook to its original |
|            | state.                                                                           |
+------------+----------------------------------------------------------------------------------+
| E7003      | An error occurred while reconciling the deleted mutating webhook.                |
+------------+----------------------------------------------------------------------------------+
| E7004      | An error occurred while reconciling the updated validating webhook to its        |
|            | original state.                                                                  |
+------------+----------------------------------------------------------------------------------+
| E7005      | An error occurred while reconciling the deleted validating webhook.              |
+------------+----------------------------------------------------------------------------------+

Information for a specific error code can be obtained by running fsm support error-info <error-code>. For example:

fsm support error-info E1000

+------------+-----------------------------------------------------------------+
| ERROR CODE |                           DESCRIPTION                           |
+------------+-----------------------------------------------------------------+
| E1000      |  An invalid command line argument was passed to the             |
|            | application.                                                    |
+------------+-----------------------------------------------------------------+

3 - Prometheus

Troubleshooting Prometheus integration

Prometheus is unreachable

If a Prometheus instance installed with FSM can’t be reached, perform the following steps to identify and resolve any issues.

  1. Verify a Prometheus Pod exists.

    When installed with fsm install --set=fsm.deployPrometheus=true, a Prometheus Pod named something like fsm-prometheus-5794755b9f-rnvlr should exist in the namespace of the other FSM control plane components, which is named fsm-system by default.

    If no such Pod is found, verify the FSM Helm chart was installed with the fsm.deployPrometheus parameter set to true with helm:

    $ helm get values -a <mesh name> -n <FSM namespace>
    

    If the parameter is set to anything but true, reinstall FSM with the --set=fsm.deployPrometheus=true flag on fsm install.
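
    As a convenience (a sketch, not an FSM-specific check), the relevant value can be filtered directly from the helm get values output shown above:

    $ helm get values -a <mesh name> -n <FSM namespace> | grep -i deployPrometheus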

  2. Verify the Prometheus Pod is healthy.

    The Prometheus Pod identified above should be both in a Running state and have all containers ready, as shown in the kubectl get output:

    $ # Assuming FSM is installed in the fsm-system namespace:
    $ kubectl get pods -n fsm-system -l app=fsm-prometheus
    NAME                              READY   STATUS    RESTARTS   AGE
    fsm-prometheus-5794755b9f-67p6r   1/1     Running   0          27m
    

    If the Pod is not showing as Running or its containers ready, use kubectl describe to look for other potential issues:

    $ # Assuming FSM is installed in the fsm-system namespace:
    $ kubectl describe pods -n fsm-system -l app=fsm-prometheus
    

    Once the Prometheus Pod is found to be healthy, Prometheus should be reachable.

Metrics are not showing up in Prometheus

If Prometheus is found not to be scraping metrics for any Pods, perform the following steps to identify and resolve any issues.

  1. Verify application Pods are working as expected.

    If workloads running in the mesh are not functioning properly, metrics scraped from those Pods may not look correct. For example, if metrics showing traffic to Service A from Service B are missing, ensure the services are communicating successfully.

    To help further troubleshoot these kinds of issues, see the traffic troubleshooting guide.

  2. Verify the Pods whose metrics are missing have a Pipy sidecar injected.

    Only Pods with a Pipy sidecar container are expected to have their metrics scraped by Prometheus. Ensure each Pod is running a container from an image with flomesh/pipy in its name:

    $ kubectl get po -n <pod namespace> <pod name> -o jsonpath='{.spec.containers[*].image}'
    mynamespace/myapp:v1.0.0 flomesh/pipy:0.50.0
    
  3. Verify the proxy’s endpoint being scraped by Prometheus is working as expected.

    Each Pipy proxy exposes an HTTP endpoint that shows metrics generated by that proxy and is scraped by Prometheus. Check to see if the expected metrics are shown by making a request to the endpoint directly.

    For each Pod whose metrics are missing, use kubectl to forward the Pipy proxy admin interface port and check the metrics:

    $ kubectl port-forward -n <pod namespace> <pod name> 15000
    

    Go to http://localhost:15000/stats/prometheus in a browser to check the metrics generated by that Pod. If Prometheus does not seem to be accounting for these metrics, move on to the next step to ensure Prometheus is configured properly.
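
    Alternatively, while the port-forward above is running, the same endpoint can be queried from the command line (a sketch assuming curl is installed locally):

    $ curl -s http://localhost:15000/stats/prometheus | head -n 20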

  4. Verify the intended namespaces have been enrolled in metrics collection.

    For each namespace that contains Pods which should have metrics scraped, ensure the namespace is monitored by the intended FSM instance with fsm mesh list.

    Next, check to make sure the namespace is annotated with flomesh.io/metrics: enabled:

    $ # Assuming FSM is installed in the fsm-system namespace:
    $ kubectl get namespace <namespace> -o jsonpath='{.metadata.annotations.flomesh\.io/metrics}'
    enabled
    

    If no such annotation exists on the namespace or it has a different value, fix it with fsm:

    $ fsm metrics enable --namespace <namespace>
    Metrics successfully enabled in namespace [<namespace>]
    
  5. If custom metrics are not being scraped, verify they have been enabled.

    Custom metrics are currently disabled by default and are enabled when the fsm.featureFlags.enableWASMStats parameter is set to true. Verify the current FSM instance has this parameter set for a mesh named <fsm-mesh-name> in the <fsm-namespace> namespace:

    $ helm get values -a <fsm-mesh-name> -n <fsm-namespace>
    

    Note: replace <fsm-mesh-name> with the name of the fsm mesh and <fsm-namespace> with the namespace where fsm was installed.

    If fsm.featureFlags.enableWASMStats is set to a different value, reinstall FSM and pass --set fsm.featureFlags.enableWASMStats to fsm install.
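
    For example, a hedged reinstall sketch using the same flag syntax as the install commands above:

    $ fsm install --set=fsm.featureFlags.enableWASMStats=true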

4 - Grafana

Troubleshooting Grafana integration

Grafana is unreachable

If a Grafana instance installed with FSM can’t be reached, perform the following steps to identify and resolve any issues.

  1. Verify a Grafana Pod exists.

    When installed with fsm install --set=fsm.deployGrafana=true, a Grafana Pod named something like fsm-grafana-7c88b9687d-tlzld should exist in the namespace of the other FSM control plane components, which is named fsm-system by default.

    If no such Pod is found, verify the FSM Helm chart was installed with the fsm.deployGrafana parameter set to true with helm:

    $ helm get values -a <mesh name> -n <FSM namespace>
    

    If the parameter is set to anything but true, reinstall FSM with the --set=fsm.deployGrafana=true flag on fsm install.

  2. Verify the Grafana Pod is healthy.

    The Grafana Pod identified above should be both in a Running state and have all containers ready, as shown in the kubectl get output:

    $ # Assuming FSM is installed in the fsm-system namespace:
    $ kubectl get pods -n fsm-system -l app=fsm-grafana
    NAME                           READY   STATUS    RESTARTS   AGE
    fsm-grafana-7c88b9687d-tlzld   1/1     Running   0          58s
    

    If the Pod is not showing as Running or its containers ready, use kubectl describe to look for other potential issues:

    $ # Assuming FSM is installed in the fsm-system namespace:
    $ kubectl describe pods -n fsm-system -l app=fsm-grafana
    

    Once the Grafana Pod is found to be healthy, Grafana should be reachable.

Dashboards show no data in Grafana

If data appears to be missing from the Grafana dashboards, perform the following steps to identify and resolve any issues.

  1. Verify Prometheus is installed and healthy.

    Because Grafana queries Prometheus for data, ensure Prometheus is working as expected. See the Prometheus troubleshooting guide for more details.

  2. Verify Grafana can communicate with Prometheus.

    Start by opening the Grafana UI in a browser:

    $ fsm dashboard
    [+] Starting Dashboard forwarding
    [+] Issuing open browser http://localhost:3000
    

    Log in (the default username/password is admin/admin) and navigate to the data source settings. For each data source that may not be working, click it to see its configuration. At the bottom of the page is a “Save & Test” button that will verify the settings.

    If an error occurs, verify the Grafana configuration to ensure it is correctly pointing to the intended Prometheus instance. Make changes in the Grafana settings as necessary until the “Save & Test” check shows no errors:

    (Screenshot: successful “Save & Test” verification)

    More details about configuring data sources can be found in Grafana’s docs.

For other possible issues, see Grafana’s troubleshooting documentation.

5 - Uninstall

Troubleshooting FSM uninstall

If, for any reason, fsm uninstall mesh (as documented in the uninstall guide) fails, you may manually delete FSM resources as detailed below.

Set environment variables for your mesh:

export fsm_namespace=fsm-system # Replace fsm-system with the namespace where FSM is installed
export mesh_name=fsm # Replace fsm with the FSM mesh name
export fsm_version=<fsm version>
export fsm_ca_bundle=<fsm ca bundle>

Delete FSM control plane deployments:

kubectl delete deployment -n $fsm_namespace fsm-bootstrap
kubectl delete deployment -n $fsm_namespace fsm-controller
kubectl delete deployment -n $fsm_namespace fsm-injector

If FSM was installed alongside Prometheus, Grafana, or Jaeger, delete those deployments:

kubectl delete deployment -n $fsm_namespace fsm-prometheus
kubectl delete deployment -n $fsm_namespace fsm-grafana
kubectl delete deployment -n $fsm_namespace jaeger

If FSM was installed with the FSM Multicluster Gateway, delete it by running the following:

kubectl delete deployment -n $fsm_namespace fsm-multicluster-gateway

Delete FSM secrets, the meshconfig, and webhook configurations:

Warning: Ensure that no resources in the cluster depend on the following resources before proceeding.

kubectl delete secret -n $fsm_namespace $fsm_ca_bundle mutating-webhook-cert-secret validating-webhook-cert-secret crd-converter-cert-secret
kubectl delete meshconfig -n $fsm_namespace fsm-mesh-config
kubectl delete mutatingwebhookconfiguration -l app.kubernetes.io/name=flomesh.io,app.kubernetes.io/instance=$mesh_name,app.kubernetes.io/version=$fsm_version,app=fsm-injector
kubectl delete validatingwebhookconfiguration -l app.kubernetes.io/name=flomesh.io,app.kubernetes.io/instance=$mesh_name,app.kubernetes.io/version=$fsm_version,app=fsm-controller

To delete FSM and SMI CRDs from the cluster, run the following.

Warning: Deletion of a CRD will cause all custom resources corresponding to that CRD to also be deleted.

kubectl delete crd meshconfigs.config.flomesh.io
kubectl delete crd multiclusterservices.config.flomesh.io
kubectl delete crd egresses.policy.flomesh.io
kubectl delete crd ingressbackends.policy.flomesh.io
kubectl delete crd httproutegroups.specs.smi-spec.io
kubectl delete crd tcproutes.specs.smi-spec.io
kubectl delete crd traffictargets.access.smi-spec.io
kubectl delete crd trafficsplits.split.smi-spec.io
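
Optionally, verify the CRDs were removed (a sketch; the command should produce no output once deletion has completed):

kubectl get crd | grep -E 'flomesh\.io|smi-spec\.io'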

6 - Traffic Troubleshooting

FSM Traffic Troubleshooting Guide

6.1 - Iptables Redirection

Troubleshooting Iptables interception and redirection

When traffic redirection is not working as expected

1. Confirm the pod has the Pipy sidecar container injected

The application pod should be injected with the Pipy proxy sidecar for traffic redirection to work as expected. Confirm this by ensuring the application pod is running and has the Pipy proxy sidecar container in the ready state.

kubectl get pod test-58d4f8ff58-wtz4f -n test
NAME                                READY   STATUS    RESTARTS   AGE
test-58d4f8ff58-wtz4f               2/2     Running   0          32s
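
As an additional check (a sketch), list the container images in the pod; one of them should contain flomesh/pipy:

kubectl get pod test-58d4f8ff58-wtz4f -n test -o jsonpath='{.spec.containers[*].image}'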

2. Confirm FSM’s init container has finished running successfully

FSM’s init container fsm-init is responsible for initializing individual application pods in the service mesh with traffic redirection rules to proxy application traffic via the Pipy proxy sidecar. The traffic redirection rules are set up using a set of iptables commands that run before any application containers in the pod are running.

Confirm FSM’s init container has finished running successfully by running kubectl describe on the application pod, and verifying the fsm-init container has terminated with an exit code of 0. The container’s State property provides this information.

kubectl describe pod test-58d4f8ff58-wtz4f -n test
Name:         test-58d4f8ff58-wtz4f
Namespace:    test
...
...
Init Containers:
  fsm-init:
    Container ID:  containerd://98840f655f2310b2f441e11efe9dfcf894e4c57e4e26b928542ee698159100c0
    Image:         flomesh/init:2c18593efc7a31986a6ae7f412e73b6067e11a57
    Image ID:      docker.io/flomesh/init@sha256:24456a8391bce5d254d5a1d557d0c5e50feee96a48a9fe4c622036f4ab2eaf8e
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      iptables -t nat -N PROXY_INBOUND && iptables -t nat -N PROXY_IN_REDIRECT && iptables -t nat -N PROXY_OUTPUT && iptables -t nat -N PROXY_REDIRECT && iptables -t nat -A PROXY_REDIRECT -p tcp -j REDIRECT --to-port 15001 && iptables -t nat -A PROXY_REDIRECT -p tcp --dport 15000 -j ACCEPT && iptables -t nat -A OUTPUT -p tcp -j PROXY_OUTPUT && iptables -t nat -A PROXY_OUTPUT -m owner --uid-owner 1500 -j RETURN && iptables -t nat -A PROXY_OUTPUT -d 127.0.0.1/32 -j RETURN && iptables -t nat -A PROXY_OUTPUT -j PROXY_REDIRECT && iptables -t nat -A PROXY_IN_REDIRECT -p tcp -j REDIRECT --to-port 15003 && iptables -t nat -A PREROUTING -p tcp -j PROXY_INBOUND && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15010 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15901 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15902 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15903 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp -j PROXY_IN_REDIRECT
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 22 Mar 2021 09:26:14 -0700
      Finished:     Mon, 22 Mar 2021 09:26:14 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from frontend-token-5g488 (ro)
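
Alternatively, a minimal jsonpath sketch to extract just the exit code, assuming the init container is named fsm-init (a value of 0 indicates success):

kubectl get pod test-58d4f8ff58-wtz4f -n test -o jsonpath='{.status.initContainerStatuses[?(@.name=="fsm-init")].state.terminated.exitCode}'
0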

When outbound IP range exclusions are configured

By default, all traffic using TCP as the underlying transport protocol is redirected via the Pipy proxy sidecar container. This means all TCP based outbound traffic from applications is redirected and routed via the Pipy proxy sidecar based on service mesh policies. When outbound IP range exclusions are configured, traffic belonging to these IP ranges will not be proxied to the Pipy sidecar.

If outbound IP ranges are configured to be excluded but are still subject to service mesh policies, verify they are configured as expected.

1. Confirm outbound IP ranges are correctly configured in the fsm-mesh-config MeshConfig resource

Confirm the outbound IP ranges to be excluded are set correctly:

# Assumes FSM is installed in the fsm-system namespace
kubectl get meshconfig fsm-mesh-config -n fsm-system -o jsonpath='{.spec.traffic.outboundIPRangeExclusionList}{"\n"}'
["1.1.1.1/32","2.2.2.2/24"]

The output shows the IP ranges that are excluded from outbound traffic redirection, ["1.1.1.1/32","2.2.2.2/24"] in the example above.
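
If entries are missing, the list can be updated with a patch (a sketch mirroring the MeshConfig patch commands used elsewhere in this guide; the IP ranges shown are the example values above). Note that fsm-init programs these rules at injection time, so existing pods will likely need to be re-created for changes to take effect:

kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"outboundIPRangeExclusionList":["1.1.1.1/32","2.2.2.2/24"]}}}' --type=merge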

2. Confirm outbound IP ranges are included in init container spec

When outbound IP range exclusions are configured, FSM’s fsm-injector service reads this configuration from the fsm-mesh-config MeshConfig resource and programs iptables rules corresponding to these ranges so that they are excluded from outbound traffic redirection via the Pipy sidecar proxy.

Confirm FSM’s fsm-init init container spec has rules corresponding to the configured outbound IP ranges to exclude.

kubectl describe pod test-58d4f8ff58-wtz4f -n test
Name:         test-58d4f8ff58-wtz4f
Namespace:    test
...
...
Init Containers:
  fsm-init:
    Container ID:  containerd://98840f655f2310b2f441e11efe9dfcf894e4c57e4e26b928542ee698159100c0
    Image:         flomesh/init:2c18593efc7a31986a6ae7f412e73b6067e11a57
    Image ID:      docker.io/flomesh/init@sha256:24456a8391bce5d254d5a1d557d0c5e50feee96a48a9fe4c622036f4ab2eaf8e
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      iptables -t nat -N PROXY_INBOUND && iptables -t nat -N PROXY_IN_REDIRECT && iptables -t nat -N PROXY_OUTPUT && iptables -t nat -N PROXY_REDIRECT && iptables -t nat -A PROXY_REDIRECT -p tcp -j REDIRECT --to-port 15001 && iptables -t nat -A PROXY_REDIRECT -p tcp --dport 15000 -j ACCEPT && iptables -t nat -A OUTPUT -p tcp -j PROXY_OUTPUT && iptables -t nat -A PROXY_OUTPUT -m owner --uid-owner 1500 -j RETURN && iptables -t nat -A PROXY_OUTPUT -d 127.0.0.1/32 -j RETURN && iptables -t nat -A PROXY_OUTPUT -j PROXY_REDIRECT && iptables -t nat -A PROXY_IN_REDIRECT -p tcp -j REDIRECT --to-port 15003 && iptables -t nat -A PREROUTING -p tcp -j PROXY_INBOUND && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15010 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15901 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15902 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15903 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp -j PROXY_IN_REDIRECT && iptables -t nat -I PROXY_OUTPUT -d 1.1.1.1/32 -j RETURN && iptables -t nat -I PROXY_OUTPUT -d 2.2.2.2/24 -j RETURN
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 22 Mar 2021 09:26:14 -0700
      Finished:     Mon, 22 Mar 2021 09:26:14 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from frontend-token-5g488 (ro)

In the example above, the following iptables commands are responsible for explicitly ignoring the configured outbound IP ranges (1.1.1.1/32 and 2.2.2.2/24) from being redirected to the Pipy proxy sidecar.

iptables -t nat -I PROXY_OUTPUT -d 1.1.1.1/32 -j RETURN
iptables -t nat -I PROXY_OUTPUT -d 2.2.2.2/24 -j RETURN

When outbound port exclusions are configured

By default, all traffic using TCP as the underlying transport protocol is redirected via the Pipy proxy sidecar container. This means all TCP based outbound traffic from applications is redirected and routed via the Pipy proxy sidecar based on service mesh policies. When outbound port exclusions are configured, traffic belonging to these ports will not be proxied to the Pipy sidecar.

If outbound ports are configured to be excluded but are still subject to service mesh policies, verify they are configured as expected.

1. Confirm global outbound ports are correctly configured in the fsm-mesh-config MeshConfig resource

Confirm the outbound ports to be excluded are set correctly:

# Assumes FSM is installed in the fsm-system namespace
kubectl get meshconfig fsm-mesh-config -n fsm-system -o jsonpath='{.spec.traffic.outboundPortExclusionList}{"\n"}'
[6379,7070]

The output shows the ports that are excluded from outbound traffic redirection, [6379,7070] in the example above.

2. Confirm pod level outbound ports are correctly annotated on the pod

Confirm the outbound ports to be excluded on a pod are set correctly:

kubectl get pod POD_NAME -n POD_NAMESPACE -o jsonpath='{.metadata.annotations}'
map[flomesh.io/outbound-port-exclusion-list:8080]

The output shows the ports that are excluded from outbound traffic redirection on the pod, 8080 in the example above.
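
A narrower jsonpath sketch for just this annotation (mirroring the namespace annotation check in the Prometheus guide) would be:

kubectl get pod POD_NAME -n POD_NAMESPACE -o jsonpath='{.metadata.annotations.flomesh\.io/outbound-port-exclusion-list}'
8080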

3. Confirm outbound ports are included in init container spec

When outbound port exclusions are configured, FSM’s fsm-injector service reads this configuration from the fsm-mesh-config MeshConfig resource and from the annotations on the pod, and programs iptables rules corresponding to these ports so that they are excluded from outbound traffic redirection via the Pipy sidecar proxy.

Confirm FSM’s fsm-init init container spec has rules corresponding to the configured outbound ports to exclude.

kubectl describe pod test-58d4f8ff58-wtz4f -n test
Name:         test-58d4f8ff58-wtz4f
Namespace:    test
...
...
Init Containers:
  fsm-init:
    Container ID:  containerd://98840f655f2310b2f441e11efe9dfcf894e4c57e4e26b928542ee698159100c0
    Image:         flomesh/init:2c18593efc7a31986a6ae7f412e73b6067e11a57
    Image ID:      docker.io/flomesh/init@sha256:24456a8391bce5d254d5a1d557d0c5e50feee96a48a9fe4c622036f4ab2eaf8e
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      iptables-restore --noflush <<EOF
      # FSM sidecar interception rules
      *nat
      :fsm_PROXY_INBOUND - [0:0]
      :fsm_PROXY_IN_REDIRECT - [0:0]
      :fsm_PROXY_OUTBOUND - [0:0]
      :fsm_PROXY_OUT_REDIRECT - [0:0]
      -A fsm_PROXY_IN_REDIRECT -p tcp -j REDIRECT --to-port 15003
      -A PREROUTING -p tcp -j fsm_PROXY_INBOUND
      -A fsm_PROXY_INBOUND -p tcp --dport 15010 -j RETURN
      -A fsm_PROXY_INBOUND -p tcp --dport 15901 -j RETURN
      -A fsm_PROXY_INBOUND -p tcp --dport 15902 -j RETURN
      -A fsm_PROXY_INBOUND -p tcp --dport 15903 -j RETURN
      -A fsm_PROXY_INBOUND -p tcp --dport 15904 -j RETURN
      -A fsm_PROXY_INBOUND -p tcp -j fsm_PROXY_IN_REDIRECT
      -I fsm_PROXY_INBOUND -i net1 -j RETURN
      -I fsm_PROXY_INBOUND -i net2 -j RETURN
      -A fsm_PROXY_OUT_REDIRECT -p tcp -j REDIRECT --to-port 15001
      -A fsm_PROXY_OUT_REDIRECT -p tcp --dport 15000 -j ACCEPT
      -A OUTPUT -p tcp -j fsm_PROXY_OUTBOUND
      -A fsm_PROXY_OUTBOUND -o lo ! -d 127.0.0.1/32 -m owner --uid-owner 1500 -j fsm_PROXY_IN_REDIRECT
      -A fsm_PROXY_OUTBOUND -o lo -m owner ! --uid-owner 1500 -j RETURN
      -A fsm_PROXY_OUTBOUND -m owner --uid-owner 1500 -j RETURN
      -A fsm_PROXY_OUTBOUND -d 127.0.0.1/32 -j RETURN
      -A fsm_PROXY_OUTBOUND -o net1 -j RETURN
      -A fsm_PROXY_OUTBOUND -o net2 -j RETURN
      -A fsm_PROXY_OUTBOUND -j fsm_PROXY_OUT_REDIRECT
      COMMIT
      EOF

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 22 Mar 2021 09:26:14 -0700
      Finished:     Mon, 22 Mar 2021 09:26:14 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from frontend-token-5g488 (ro)

In the example above, the following iptables command is responsible for explicitly ignoring the configured outbound ports (6379, 7070, and 8080) from being redirected to the Pipy proxy sidecar.

iptables -t nat -I PROXY_OUTPUT -p tcp --match multiport --dports 6379,7070,8080 -j RETURN

6.2 - Permissive Traffic Policy Mode

Troubleshooting permissive traffic policy

When permissive traffic policy mode is not working as expected

1. Confirm permissive traffic policy mode is enabled

Confirm permissive traffic policy mode is enabled by verifying the value for the enablePermissiveTrafficPolicyMode key in the fsm-mesh-config custom resource. The fsm-mesh-config MeshConfig resides in the FSM control plane namespace (fsm-system by default).

# Returns true if permissive traffic policy mode is enabled
kubectl get meshconfig fsm-mesh-config -n fsm-system -o jsonpath='{.spec.traffic.enablePermissiveTrafficPolicyMode}{"\n"}'
true

The above command must return a boolean string (true or false) indicating if permissive traffic policy mode is enabled.
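
If the value is false and permissive mode is desired, it can be enabled with a patch (a sketch mirroring the MeshConfig patch commands in the Ingress section below):

# Replace fsm-system with fsm-controller's namespace if using a non default namespace
kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"enablePermissiveTrafficPolicyMode":true}}}' --type=merge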

2. Inspect FSM controller logs for errors

# When fsm-controller is deployed in the fsm-system namespace
kubectl logs -n fsm-system $(kubectl get pod -n fsm-system -l app=fsm-controller -o jsonpath='{.items[0].metadata.name}')

Errors will be logged with the level key in the log message set to error:

{"level":"error","component":"...","time":"...","file":"...","message":"..."}

3. Confirm the Pipy configuration

Use the fsm verify connectivity command to validate that the pods can communicate using a Kubernetes service.

For example, to verify if the pod curl-7bb5845476-zwxbt in the namespace curl can direct traffic to the pod httpbin-69dc7d545c-n7pjb in the httpbin namespace using the httpbin Kubernetes service:

fsm verify connectivity --from-pod curl/curl-7bb5845476-zwxbt --to-pod httpbin/httpbin-69dc7d545c-n7pjb --to-service httpbin
---------------------------------------------
[+] Context: Verify if pod "curl/curl-7bb5845476-zwxbt" can access pod "httpbin/httpbin-69dc7d545c-n7pjb" for service "httpbin/httpbin"
Status: Success

---------------------------------------------

The Status field in the output will indicate Success when the verification succeeds.

6.3 - Ingress

Troubleshooting ingress traffic

When Ingress is not working as expected

1. Confirm global ingress configuration is set as expected.

# Returns true if HTTPS ingress is enabled
kubectl get meshconfig fsm-mesh-config -n fsm-system -o jsonpath='{.spec.traffic.useHTTPSIngress}{"\n"}'
false

If the output of this command is false, HTTP ingress is enabled and HTTPS ingress is disabled. To disable HTTP ingress and enable HTTPS ingress, use the following command:

# Replace fsm-system with fsm-controller's namespace if using a non default namespace
kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"useHTTPSIngress":true}}}'  --type=merge

Likewise, to enable HTTP ingress and disable HTTPS ingress, run:

# Replace fsm-system with fsm-controller's namespace if using a non default namespace
kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"useHTTPSIngress":false}}}'  --type=merge

2. Inspect FSM controller logs for errors

# When fsm-controller is deployed in the fsm-system namespace
kubectl logs -n fsm-system $(kubectl get pod -n fsm-system -l app=fsm-controller -o jsonpath='{.items[0].metadata.name}')

Errors will be logged with the level key in the log message set to error:

{"level":"error","component":"...","time":"...","file":"...","message":"..."}

3. Confirm that the ingress resource has been successfully deployed

kubectl get ingress <ingress-name> -n <ingress-namespace>
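
If the resource exists but ingress traffic still fails, inspecting its details and events can help (a generic Kubernetes check, not FSM-specific):

kubectl describe ingress <ingress-name> -n <ingress-namespace>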

6.4 - Egress Troubleshooting

Egress Troubleshooting Guide

When Egress is not working as expected

1. Confirm egress is enabled

Confirm egress is enabled by verifying the value for the enableEgress key in the fsm-mesh-config MeshConfig custom resource. fsm-mesh-config resides in the FSM control plane namespace (fsm-system by default).

# Returns true if egress is enabled
kubectl get meshconfig fsm-mesh-config -n fsm-system -o jsonpath='{.spec.traffic.enableEgress}{"\n"}'
true

The above command must return a boolean string (true or false) indicating if egress is enabled.
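
If egress is disabled but should be enabled, patch the MeshConfig (a sketch mirroring the patch commands used elsewhere in this guide):

# Replace fsm-system with fsm-controller's namespace if using a non default namespace
kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"enableEgress":true}}}' --type=merge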

2. Inspect FSM controller logs for errors

# When fsm-controller is deployed in the fsm-system namespace
kubectl logs -n fsm-system $(kubectl get pod -n fsm-system -l app=fsm-controller -o jsonpath='{.items[0].metadata.name}')

Errors will be logged with the level key in the log message set to error:

{"level":"error","component":"...","time":"...","file":"...","message":"..."}

3. Confirm the Pipy configuration

Check that egress is enabled in the configuration used by the Pod’s sidecar.

{
  "Spec": {
    "SidecarLogLevel": "error",
    "Traffic": {
      "EnableEgress": true
    }
  }
}