Troubleshooting
1 - Application Container Lifecycle
FSM injects a long-running sidecar proxy into application pods that are a part of the service mesh and sets up traffic redirection rules to route all traffic to and from those pods via the sidecar proxy. As a result, in some circumstances existing application containers might not start up or shut down as expected.
When the application container depends on network connectivity at startup
Application containers that depend on network connectivity at startup are likely to experience issues once the Pipy sidecar proxy container and the fsm-init
init container are injected into the application pod by FSM. This is because upon sidecar injection, all TCP based network traffic from application containers is routed to the sidecar proxy and subject to service mesh traffic policies. This implies that for application traffic to be routed as it would be without the sidecar proxy container injected, the FSM controller must first program the sidecar proxy on the application pod to allow such traffic. Until the Pipy sidecar proxy has been configured, all traffic from application containers will be dropped.
When FSM is configured with permissive traffic policy mode enabled, FSM will program wildcard traffic policy rules on the Pipy sidecar proxy to allow every pod to access all services that are a part of the mesh. When FSM is configured with SMI traffic policy mode enabled, explicit SMI policies must be configured to enable communication between applications in the mesh.
Regardless of the traffic policy mode, application containers that depend on network connectivity at startup can experience problems starting up if they are not resilient to delays in the network being ready. With the Pipy proxy sidecar injected, the network is deemed ready only when the sidecar proxy has been programmed by the FSM controller to allow application traffic to flow through the network.
It is recommended that application containers be resilient to delays during the initial bootstrapping phase of the Pipy proxy sidecar in the application pod.
It is important to note that the container’s restart policy also influences the startup of application containers. If an application container’s restart policy is set to Never
and it depends on network connectivity being ready at startup, the container may fail to access the network until the Pipy proxy sidecar is ready to allow it access, causing the application container to exit and never recover from the failed startup. For this reason, it is recommended not to use a restart policy of Never
if your application container depends on network connectivity at startup.
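Note that in Kubernetes the restart policy is set at the pod level rather than per container. A quick way to check it for a given application pod is a jsonpath query like the following (an illustrative sketch; substitute your own pod name and namespace):
kubectl get pod <pod-name> -n <pod-namespace> -o jsonpath='{.spec.restartPolicy}{"\n"}'
A value of Never combined with a startup-time network dependency is the combination described above that can leave the container permanently failed.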
Related issues (work in progress)
- Kubernetes issue 65502: Support startup dependencies between containers on the same pod
2 - Error Codes
Error Code Descriptions
If error codes are present in the FSM error logs or detected from the FSM error code metrics, the fsm support error-info
CLI tool can be used to gain more information about the error code.
The following table is generated by running fsm support error-info.
+------------+----------------------------------------------------------------------------------+
| ERROR CODE | DESCRIPTION |
+------------+----------------------------------------------------------------------------------+
| E1000 | An invalid command line argument was passed to the application. |
+------------+----------------------------------------------------------------------------------+
| E1001 | The specified log level could not be set in the system. |
+------------+----------------------------------------------------------------------------------+
| E1002 | The fsm-controller k8s pod resource was not able to be retrieved by the system. |
+------------+----------------------------------------------------------------------------------+
| E1003 | The fsm-injector k8s pod resource was not able to be retrieved by the system. |
+------------+----------------------------------------------------------------------------------+
| E1004 | The Ingress client created by the fsm-controller to monitor Ingress resources |
| | failed to start. |
+------------+----------------------------------------------------------------------------------+
| E1005 | The Reconciler client to monitor updates and deletes to FSM's CRDs and mutating |
| | webhook failed to start. |
+------------+----------------------------------------------------------------------------------+
| E2000 | An error was encountered while attempting to deduplicate traffic matching |
| | attributes (destination port, protocol, IP address etc.) used for matching |
| | egress traffic. The applied egress policies could be conflicting with each |
| | other, and the system was unable to process affected egress policies. |
+------------+----------------------------------------------------------------------------------+
| E2001 | An error was encountered while attempting to deduplicate upstream clusters |
| | associated with the egress destination. The applied egress policies could be |
| | conflicting with each other, and the system was unable to process affected |
| | egress policies. |
+------------+----------------------------------------------------------------------------------+
| E2002 | An invalid IP address range was specified in the egress policy. The IP address |
| | range must be specified as a CIDR notation IP address and prefix length, like |
| | "192.0.2.0/24", as defined in RFC 4632. The invalid IP address range was ignored |
| | by the system. |
+------------+----------------------------------------------------------------------------------+
| E2003 | An invalid match was specified in the egress policy. The specified match was |
| | ignored by the system while applying the egress policy. |
+------------+----------------------------------------------------------------------------------+
| E2004 | The SMI HTTPRouteGroup resource specified as a match in an egress policy was not |
| | found. Please verify that the specified SMI HTTPRouteGroup resource exists in |
| | the same namespace as the egress policy referencing it as a match. |
+------------+----------------------------------------------------------------------------------+
| E2005 | The SMI HTTPRouteGroup resource specified as a match in an SMI TrafficTarget |
| | policy was unable to be retrieved by the system. The associated SMI |
| | TrafficTarget policy was ignored by the system. Please verify that the matches |
| | specified for the TrafficTarget resource exist in the same namespace as the |
| | TrafficTarget policy referencing the match. |
+------------+----------------------------------------------------------------------------------+
| E2006 | The SMI HTTPRouteGroup resource is invalid as it does not have any matches |
| | specified. The SMI HTTPRouteGroup policy was ignored by the system. |
+------------+----------------------------------------------------------------------------------+
| E2007 | There are multiple SMI traffic split policies associated with the same |
| | apex(root) service specified in the policies. The system does not support |
| | this scenario so only the first encountered policy is processed by the system, |
| | subsequent policies referring to the same apex service are ignored. |
+------------+----------------------------------------------------------------------------------+
| E2008 | There was an error adding a route match to an outbound traffic policy |
| | representation within the system. The associated route was ignored by the |
| | system. |
+------------+----------------------------------------------------------------------------------+
| E2009 | The inbound TrafficTargets composed of their routes for a given destination |
| | ServiceIdentity could not be configured. |
+------------+----------------------------------------------------------------------------------+
| E2010 | An applied SMI TrafficTarget policy has an invalid destination kind. |
+------------+----------------------------------------------------------------------------------+
| E2011 | An applied SMI TrafficTarget policy has an invalid source kind. |
+------------+----------------------------------------------------------------------------------+
| E3000 | The system found 0 endpoints to be reached when the service's FQDN was resolved. |
+------------+----------------------------------------------------------------------------------+
| E3001 | A Kubernetes resource could not be marshalled. |
+------------+----------------------------------------------------------------------------------+
| E3002 | A Kubernetes resource could not be unmarshalled. |
+------------+----------------------------------------------------------------------------------+
| E4000 | The Kubernetes secret containing the certificate could not be retrieved by the |
| | system. |
+------------+----------------------------------------------------------------------------------+
| E4001 | The certificate specified by name could not be obtained by key from the secret's |
| | data. |
+------------+----------------------------------------------------------------------------------+
| E4002 | The private key specified by name could not be obtained by key from the secret's |
| | data. |
+------------+----------------------------------------------------------------------------------+
| E4003 | The certificate expiration specified by name could not be obtained by key from |
| | the secret's data. |
+------------+----------------------------------------------------------------------------------+
| E4004 | The certificate expiration obtained from the secret's data by name could not be |
| | parsed. |
+------------+----------------------------------------------------------------------------------+
| E4005 | The secret containing a certificate could not be created by the system. |
+------------+----------------------------------------------------------------------------------+
| E4006 | A private key failed to be generated. |
+------------+----------------------------------------------------------------------------------+
| E4007 | The specified private key could not be converted from a DER encoded |
| | key to a PEM encoded key. |
+------------+----------------------------------------------------------------------------------+
| E4008 | The certificate request fails to be created when attempting to issue a |
| | certificate. |
+------------+----------------------------------------------------------------------------------+
| E4009 | When creating a new certificate authority, the root certificate could not be |
| | obtained by the system. |
+------------+----------------------------------------------------------------------------------+
| E4010 | The specified certificate could not be converted from a DER encoded certificate |
| | to a PEM encoded certificate. |
+------------+----------------------------------------------------------------------------------+
| E4011 | The specified PEM encoded certificate could not be decoded. |
+------------+----------------------------------------------------------------------------------+
| E4012 | The specified PEM privateKey for the certificate authority's root certificate |
| | could not be decoded. |
+------------+----------------------------------------------------------------------------------+
| E4013 | An unspecified error occurred when issuing a certificate from the certificate |
| | manager. |
+------------+----------------------------------------------------------------------------------+
| E4014 | An error occurred when creating a certificate to issue from the certificate |
| | manager. |
+------------+----------------------------------------------------------------------------------+
| E4015 | The certificate authority provided when issuing a certificate was invalid. |
+------------+----------------------------------------------------------------------------------+
| E4016 | The specified certificate could not be rotated. |
+------------+----------------------------------------------------------------------------------+
| E4100 | Failed parsing object into PubSub message. |
+------------+----------------------------------------------------------------------------------+
| E4150 | Failed initial cache sync for config.flomesh.io informer. |
+------------+----------------------------------------------------------------------------------+
| E4151 | Failed to cast object to MeshConfig. |
+------------+----------------------------------------------------------------------------------+
| E4152 | Failed to fetch MeshConfig from cache with specific key. |
+------------+----------------------------------------------------------------------------------+
| E4153 | Failed to marshal MeshConfig into other format. |
+------------+----------------------------------------------------------------------------------+
| E5000 | An XDS resource could not be marshalled. |
+------------+----------------------------------------------------------------------------------+
| E5001 | The XDS certificate common name could not be parsed. The CN should be of the |
| | form <proxy-UUID>.<kind>.<proxy-identity>. |
+------------+----------------------------------------------------------------------------------+
| E5002 | The proxy UUID obtained from parsing the XDS certificate's common name did not |
| | match the fsm-proxy-uuid label value for any pod. The pod associated with the |
| | specified Pipy proxy could not be found. |
+------------+----------------------------------------------------------------------------------+
| E5003 | A pod in the mesh belongs to more than one service. By Open Service Mesh |
| | convention the number of services a pod can belong to is 1. This is a limitation |
| | we set in place in order to make the mesh easy to understand and reason about. |
| | When a pod belongs to more than one service XDS will not program the Pipy |
| | proxy, leaving it out of the mesh. |
+------------+----------------------------------------------------------------------------------+
| E5004 | The Pipy proxy data structure created by ADS to reference a Pipy proxy |
| | sidecar from a pod's fsm-proxy-uuid label could not be configured. |
+------------+----------------------------------------------------------------------------------+
| E5005 | A gRPC connection failure occurred and the ADS is no longer able to receive |
| | DiscoveryRequests. |
+------------+----------------------------------------------------------------------------------+
| E5006 | The DiscoveryResponse configured by ADS failed to send to the Pipy proxy. |
+------------+----------------------------------------------------------------------------------+
| E5007 | The resources to be included in the DiscoveryResponse could not be generated. |
+------------+----------------------------------------------------------------------------------+
| E5008 | The aggregated resources generated for a DiscoveryResponse failed to be |
| | configured as a new snapshot in the Pipy xDS Aggregate Discovery Services |
| | cache. |
+------------+----------------------------------------------------------------------------------+
| E5009 | The Aggregate Discovery Server (ADS) created by the FSM controller failed to |
| | start. |
+------------+----------------------------------------------------------------------------------+
| E5010 | The ServiceAccount referenced in the NodeID does not match the ServiceAccount |
| | specified in the proxy certificate. The proxy was not allowed to be a part of |
| | the mesh. |
+------------+----------------------------------------------------------------------------------+
| E5011 | The gRPC stream was closed by the proxy and no DiscoveryRequests can be |
| | received. The Stream Aggregated Resource server was terminated for the specified |
| | proxy. |
+------------+----------------------------------------------------------------------------------+
| E5012 | The sidecar proxy has not completed the initialization phase and it is not ready |
| | to receive broadcast updates from control plane related changes. New versions |
| | should not be pushed if the first request has not been received. The broadcast |
| | update was ignored for that proxy. |
+------------+----------------------------------------------------------------------------------+
| E5013 | The TypeURL of the resource being requested in the DiscoveryRequest is invalid. |
+------------+----------------------------------------------------------------------------------+
| E5014 | The version of the DiscoveryRequest could not be parsed by ADS. |
+------------+----------------------------------------------------------------------------------+
| E5015 | A proxy egress cluster which routes traffic to its original destination could |
| | not be configured. When a Host is not specified in the cluster config, the |
| | original destination is used. |
+------------+----------------------------------------------------------------------------------+
| E5016 | A proxy egress cluster that routes traffic based on the specified Host resolved |
| | using DNS could not be configured. |
+------------+----------------------------------------------------------------------------------+
| E5017 | A proxy cluster that corresponds to a specified upstream service could not be |
| | configured. |
+------------+----------------------------------------------------------------------------------+
| E5018 | The meshed services corresponding to a specified Pipy proxy could not be listed. |
+------------+----------------------------------------------------------------------------------+
| E5019 | Multiple Pipy clusters with the same name were configured. The duplicate |
| | clusters will not be sent to the Pipy proxy in a ClusterDiscovery response. |
+------------+----------------------------------------------------------------------------------+
| E5020 | The application protocol specified for a port is not supported for ingress |
| | traffic. The XDS filter chain for ingress traffic to the port was not created. |
+------------+----------------------------------------------------------------------------------+
| E5021 | An XDS filter chain could not be constructed for ingress. |
+------------+----------------------------------------------------------------------------------+
| E5022 | A traffic policy rule could not be configured as an RBAC rule on the proxy. |
| | The corresponding rule was ignored by the system. |
+------------+----------------------------------------------------------------------------------+
| E5023 | The SDS certificate resource could not be unmarshalled. The |
| | corresponding certificate resource was ignored by the system. |
+------------+----------------------------------------------------------------------------------+
| E5024 | An XDS secret containing a TLS certificate could not be retrieved. |
| | The corresponding secret request was ignored by the system. |
+------------+----------------------------------------------------------------------------------+
| E5025 | The SDS secret does not correspond to a MeshService. |
+------------+----------------------------------------------------------------------------------+
| E5026 | The SDS secret does not correspond to a ServiceAccount. |
+------------+----------------------------------------------------------------------------------+
| E5027 | The identity obtained from the SDS certificate request does not match the |
| | identity of the proxy. The corresponding secret request was ignored by the |
| | system. |
+------------+----------------------------------------------------------------------------------+
| E5028 | The SDS secret does not correspond to a MeshService. |
+------------+----------------------------------------------------------------------------------+
| E5029 | The SDS secret does not correspond to a ServiceAccount. |
+------------+----------------------------------------------------------------------------------+
| E5030 | The identity obtained from the SDS certificate request does not match the |
| | identity of the proxy. The corresponding certificate request was ignored |
| | by the system. |
+------------+----------------------------------------------------------------------------------+
| E6100 | A protobuf ProtoMessage could not be converted into YAML. |
+------------+----------------------------------------------------------------------------------+
| E6101 | The mutating webhook certificate could not be parsed. |
| | The mutating webhook HTTP server was not started. |
+------------+----------------------------------------------------------------------------------+
| E6102 | The sidecar injection webhook HTTP server failed to start. |
+------------+----------------------------------------------------------------------------------+
| E6103 | An AdmissionRequest could not be decoded. |
+------------+----------------------------------------------------------------------------------+
| E6104 | The timeout from an AdmissionRequest could not be parsed. |
+------------+----------------------------------------------------------------------------------+
| E6105 | The AdmissionRequest's header was invalid. The content type obtained from the |
| | header is not supported. |
+------------+----------------------------------------------------------------------------------+
| E6106 | The AdmissionResponse could not be written. |
+------------+----------------------------------------------------------------------------------+
| E6107 | The AdmissionRequest was empty. |
+------------+----------------------------------------------------------------------------------+
| E6108 | It could not be determined if the pod specified in the AdmissionRequest is |
| | enabled for sidecar injection. |
+------------+----------------------------------------------------------------------------------+
| E6109 | It could not be determined if the namespace specified in the |
| | AdmissionRequest is enabled for sidecar injection. |
+------------+----------------------------------------------------------------------------------+
| E6110 | The port exclusions for a pod could not be obtained. No |
| | port exclusions are added to the init container's spec. |
+------------+----------------------------------------------------------------------------------+
| E6111 | The AdmissionRequest body could not be read. |
+------------+----------------------------------------------------------------------------------+
| E6112 | The AdmissionRequest body was nil. |
+------------+----------------------------------------------------------------------------------+
| E6113 | The MutatingWebhookConfiguration could not be created. |
+------------+----------------------------------------------------------------------------------+
| E6114 | The MutatingWebhookConfiguration could not be updated. |
+------------+----------------------------------------------------------------------------------+
| E6700 | An error occurred when shutting down the validating webhook HTTP server. |
+------------+----------------------------------------------------------------------------------+
| E6701 | The validating webhook HTTP server failed to start. |
+------------+----------------------------------------------------------------------------------+
| E6702 | The validating webhook certificate could not be parsed. |
| | The validating webhook HTTP server was not started. |
+------------+----------------------------------------------------------------------------------+
| E6703 | The ValidatingWebhookConfiguration could not be created. |
+------------+----------------------------------------------------------------------------------+
| E7000 | An error occurred while reconciling the updated CRD to its original state. |
+------------+----------------------------------------------------------------------------------+
| E7001 | An error occurred while reconciling the deleted CRD. |
+------------+----------------------------------------------------------------------------------+
| E7002 | An error occurred while reconciling the updated mutating webhook to its original |
| | state. |
+------------+----------------------------------------------------------------------------------+
| E7003 | An error occurred while reconciling the deleted mutating webhook. |
+------------+----------------------------------------------------------------------------------+
| E7004 | An error occurred while reconciling the updated validating webhook to its |
| | original state. |
+------------+----------------------------------------------------------------------------------+
| E7005 | An error occurred while reconciling the deleted validating webhook. |
+------------+----------------------------------------------------------------------------------+
Information for a specific error code can be obtained by running fsm support error-info <error-code>. For example:
fsm support error-info E1000
+------------+-----------------------------------------------------------------+
| ERROR CODE | DESCRIPTION |
+------------+-----------------------------------------------------------------+
| E1000 | An invalid command line argument was passed to the |
| | application. |
+------------+-----------------------------------------------------------------+
3 - Prometheus
Prometheus is unreachable
If a Prometheus instance installed with FSM can’t be reached, perform the following steps to identify and resolve any issues.
Verify a Prometheus Pod exists.
When installed with fsm install --set=fsm.deployPrometheus=true, a Prometheus Pod named something like fsm-prometheus-5794755b9f-rnvlr should exist in the namespace of the other FSM control plane components, which is named fsm-system by default.
If no such Pod is found, verify the FSM Helm chart was installed with the fsm.deployPrometheus parameter set to true with helm:
$ helm get values -a <mesh name> -n <FSM namespace>
If the parameter is set to anything but true, reinstall FSM with the --set=fsm.deployPrometheus=true flag on fsm install.
Verify the Prometheus Pod is healthy.
The Prometheus Pod identified above should be both in a Running state and have all containers ready, as shown in the kubectl get output:
$ # Assuming FSM is installed in the fsm-system namespace:
$ kubectl get pods -n fsm-system -l app=fsm-prometheus
NAME                              READY   STATUS    RESTARTS   AGE
fsm-prometheus-5794755b9f-67p6r   1/1     Running   0          27m
If the Pod is not showing as Running or its containers ready, use kubectl describe to look for other potential issues:
$ # Assuming FSM is installed in the fsm-system namespace:
$ kubectl describe pods -n fsm-system -l app=fsm-prometheus
Once the Prometheus Pod is found to be healthy, Prometheus should be reachable.
Metrics are not showing up in Prometheus
If Prometheus is found not to be scraping metrics for any Pods, perform the following steps to identify and resolve any issues.
Verify application Pods are working as expected.
If workloads running in the mesh are not functioning properly, metrics scraped from those Pods may not look correct. For example, if metrics showing traffic to Service A from Service B are missing, ensure the services are communicating successfully.
To help further troubleshoot these kinds of issues, see the traffic troubleshooting guide.
Verify the Pods whose metrics are missing have a Pipy sidecar injected.
Only Pods with a Pipy sidecar container are expected to have their metrics scraped by Prometheus. Ensure each Pod is running a container from an image with flomesh/pipy in its name:
$ kubectl get po -n <pod namespace> <pod name> -o jsonpath='{.spec.containers[*].image}'
mynamespace/myapp:v1.0.0 flomesh/pipy:0.50.0
Verify the proxy’s endpoint being scraped by Prometheus is working as expected.
Each Pipy proxy exposes an HTTP endpoint that shows metrics generated by that proxy and is scraped by Prometheus. Check to see if the expected metrics are shown by making a request to the endpoint directly.
For each Pod whose metrics are missing, use kubectl to forward the Pipy proxy admin interface port and check the metrics:
$ kubectl port-forward -n <pod namespace> <pod name> 15000
Go to http://localhost:15000/stats/prometheus in a browser to check the metrics generated by that Pod. If Prometheus does not seem to be accounting for these metrics, move on to the next step to ensure Prometheus is configured properly.
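If a browser is not convenient, the same endpoint can be queried from the command line while the port-forward above is running (a simple alternative check):
$ curl -s http://localhost:15000/stats/prometheus | head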
Verify the intended namespaces have been enrolled in metrics collection.
For each namespace that contains Pods which should have metrics scraped, ensure the namespace is monitored by the intended FSM instance with fsm mesh list.
Next, check to make sure the namespace is annotated with flomesh.io/metrics: enabled:
$ # Assuming FSM is installed in the fsm-system namespace:
$ kubectl get namespace <namespace> -o jsonpath='{.metadata.annotations.flomesh\.io/metrics}'
enabled
If no such annotation exists on the namespace or it has a different value, fix it with fsm:
$ fsm metrics enable --namespace <namespace>
Metrics successfully enabled in namespace [<namespace>]
If custom metrics are not being scraped, verify they have been enabled.
Custom metrics are currently disabled by default and enabled when the fsm.featureFlags.enableWASMStats parameter is set to true. Verify the current FSM instance has this parameter set for a mesh named <fsm-mesh-name> in the <fsm-namespace> namespace:
$ helm get values -a <fsm-mesh-name> -n <fsm-namespace>
Note: replace <fsm-mesh-name> with the name of the fsm mesh and <fsm-namespace> with the namespace where fsm was installed.
If fsm.featureFlags.enableWASMStats is set to a different value, reinstall FSM and pass --set fsm.featureFlags.enableWASMStats=true to fsm install.
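To quickly check just this flag in the Helm values output, the same helm command can be filtered (a minimal convenience sketch):
$ helm get values -a <fsm-mesh-name> -n <fsm-namespace> | grep enableWASMStats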
4 - Grafana
Grafana is unreachable
If a Grafana instance installed with FSM can’t be reached, perform the following steps to identify and resolve any issues.
Verify a Grafana Pod exists.
When installed with fsm install --set=fsm.deployGrafana=true, a Grafana Pod named something like fsm-grafana-7c88b9687d-tlzld should exist in the namespace of the other FSM control plane components, which is named fsm-system by default.
If no such Pod is found, verify the FSM Helm chart was installed with the fsm.deployGrafana parameter set to true with helm:
$ helm get values -a <mesh name> -n <FSM namespace>
If the parameter is set to anything but true, reinstall FSM with the --set=fsm.deployGrafana=true flag on fsm install.
Verify the Grafana Pod is healthy.
The Grafana Pod identified above should be both in a Running state and have all containers ready, as shown in the kubectl get output:
$ # Assuming FSM is installed in the fsm-system namespace:
$ kubectl get pods -n fsm-system -l app=fsm-grafana
NAME                           READY   STATUS    RESTARTS   AGE
fsm-grafana-7c88b9687d-tlzld   1/1     Running   0          58s
If the Pod is not showing as Running or its containers ready, use kubectl describe to look for other potential issues:
$ # Assuming FSM is installed in the fsm-system namespace:
$ kubectl describe pods -n fsm-system -l app=fsm-grafana
Once the Grafana Pod is found to be healthy, Grafana should be reachable.
Dashboards show no data in Grafana
If data appears to be missing from the Grafana dashboards, perform the following steps to identify and resolve any issues.
Verify Prometheus is installed and healthy.
Because Grafana queries Prometheus for data, ensure Prometheus is working as expected. See the Prometheus troubleshooting guide for more details.
Verify Grafana can communicate with Prometheus.
Start by opening the Grafana UI in a browser:
$ fsm dashboard
[+] Starting Dashboard forwarding
[+] Issuing open browser http://localhost:3000
Log in (the default username/password is admin/admin) and navigate to the data source settings. For each data source that may not be working, click it to see its configuration. At the bottom of the page is a “Save & Test” button that will verify the settings.
If an error occurs, verify the Grafana configuration to ensure it is correctly pointing to the intended Prometheus instance. Make changes in the Grafana settings as necessary until the “Save & Test” check shows no errors.
More details about configuring data sources can be found in Grafana’s docs.
For other possible issues, see Grafana’s troubleshooting documentation.
5 - Uninstall
If for any reason fsm uninstall mesh
(as documented in the uninstall guide) fails, you may manually delete FSM resources as detailed below.
Set environment variables for your mesh:
export fsm_namespace=fsm-system # Replace fsm-system with the namespace where FSM is installed
export mesh_name=fsm # Replace fsm with the FSM mesh name
export fsm_version=<fsm version>
export fsm_ca_bundle=<fsm ca bundle>
Delete FSM control plane deployments:
kubectl delete deployment -n $fsm_namespace fsm-bootstrap
kubectl delete deployment -n $fsm_namespace fsm-controller
kubectl delete deployment -n $fsm_namespace fsm-injector
If FSM was installed alongside Prometheus, Grafana, or Jaeger, delete those deployments:
kubectl delete deployment -n $fsm_namespace fsm-prometheus
kubectl delete deployment -n $fsm_namespace fsm-grafana
kubectl delete deployment -n $fsm_namespace jaeger
If FSM was installed with the FSM Multicluster Gateway, delete it by running the following:
kubectl delete deployment -n $fsm_namespace fsm-multicluster-gateway
Delete FSM secrets, the meshconfig, and webhook configurations:
Warning: Ensure that no resources in the cluster depend on the following resources before proceeding.
kubectl delete secret -n $fsm_namespace $fsm_ca_bundle mutating-webhook-cert-secret validating-webhook-cert-secret crd-converter-cert-secret
kubectl delete meshconfig -n $fsm_namespace fsm-mesh-config
kubectl delete mutatingwebhookconfiguration -l app.kubernetes.io/name=flomesh.io,app.kubernetes.io/instance=$mesh_name,app.kubernetes.io/version=$fsm_version,app=fsm-injector
kubectl delete validatingwebhookconfiguration -l app.kubernetes.io/name=flomesh.io,app.kubernetes.io/instance=$mesh_name,app.kubernetes.io/version=$fsm_version,app=fsm-controller
To delete FSM and SMI CRDs from the cluster, run the following.
Warning: Deletion of a CRD will cause all custom resources corresponding to that CRD to also be deleted.
kubectl delete crd meshconfigs.config.flomesh.io
kubectl delete crd multiclusterservices.config.flomesh.io
kubectl delete crd egresses.policy.flomesh.io
kubectl delete crd ingressbackends.policy.flomesh.io
kubectl delete crd httproutegroups.specs.smi-spec.io
kubectl delete crd tcproutes.specs.smi-spec.io
kubectl delete crd traffictargets.access.smi-spec.io
kubectl delete crd trafficsplits.split.smi-spec.io
6 - Traffic Troubleshooting
6.1 - Iptables Redirection
When traffic redirection is not working as expected
1. Confirm the pod has the Pipy sidecar container injected
The application pod should be injected with the Pipy proxy sidecar for traffic redirection to work as expected. Confirm this by ensuring the application pod is running and has the Pipy proxy sidecar container in ready state.
kubectl get pod test-58d4f8ff58-wtz4f -n test
NAME READY STATUS RESTARTS AGE
test-58d4f8ff58-wtz4f 2/2 Running 0 32s
2. Confirm FSM’s init container has finished running successfully
FSM’s init container fsm-init
is responsible for initializing individual application pods in the service mesh with traffic redirection rules to proxy application traffic via the Pipy proxy sidecar. The traffic redirection rules are set up using a set of iptables
commands that run before any application containers in the pod are running.
Confirm FSM’s init container has finished running successfully by running kubectl describe
on the application pod, and verifying the fsm-init
container has terminated with an exit code of 0. The container’s State
property provides this information.
kubectl describe pod test-58d4f8ff58-wtz4f -n test
Name: test-58d4f8ff58-wtz4f
Namespace: test
...
...
Init Containers:
fsm-init:
Container ID: containerd://98840f655f2310b2f441e11efe9dfcf894e4c57e4e26b928542ee698159100c0
Image: flomesh/init:2c18593efc7a31986a6ae7f412e73b6067e11a57
Image ID: docker.io/flomesh/init@sha256:24456a8391bce5d254d5a1d557d0c5e50feee96a48a9fe4c622036f4ab2eaf8e
Port: <none>
Host Port: <none>
Command:
/bin/sh
Args:
-c
iptables -t nat -N PROXY_INBOUND && iptables -t nat -N PROXY_IN_REDIRECT && iptables -t nat -N PROXY_OUTPUT && iptables -t nat -N PROXY_REDIRECT && iptables -t nat -A PROXY_REDIRECT -p tcp -j REDIRECT --to-port 15001 && iptables -t nat -A PROXY_REDIRECT -p tcp --dport 15000 -j ACCEPT && iptables -t nat -A OUTPUT -p tcp -j PROXY_OUTPUT && iptables -t nat -A PROXY_OUTPUT -m owner --uid-owner 1500 -j RETURN && iptables -t nat -A PROXY_OUTPUT -d 127.0.0.1/32 -j RETURN && iptables -t nat -A PROXY_OUTPUT -j PROXY_REDIRECT && iptables -t nat -A PROXY_IN_REDIRECT -p tcp -j REDIRECT --to-port 15003 && iptables -t nat -A PREROUTING -p tcp -j PROXY_INBOUND && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15010 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15901 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15902 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15903 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp -j PROXY_IN_REDIRECT
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 22 Mar 2021 09:26:14 -0700
Finished: Mon, 22 Mar 2021 09:26:14 -0700
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from frontend-token-5g488 (ro)
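As an alternative to scanning the full describe output, the init container's exit code can be read directly with a jsonpath query (a minimal sketch, assuming the init container is named fsm-init as shown above):
kubectl get pod test-58d4f8ff58-wtz4f -n test -o jsonpath='{.status.initContainerStatuses[?(@.name=="fsm-init")].state.terminated.exitCode}{"\n"}'
An output of 0 indicates the init container completed successfully.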
When outbound IP range exclusions are configured
By default, all traffic using TCP as the underlying transport protocol is redirected via the Pipy proxy sidecar container. This means all TCP based outbound traffic from applications is redirected and routed via the Pipy proxy sidecar based on service mesh policies. When outbound IP range exclusions are configured, traffic belonging to these IP ranges will not be proxied to the Pipy sidecar.
If outbound IP ranges are configured to be excluded but are still subject to service mesh policies, verify they are configured as expected.
1. Confirm outbound IP ranges are correctly configured in the fsm-mesh-config
MeshConfig resource
Confirm the outbound IP ranges to be excluded are set correctly:
# Assumes FSM is installed in the fsm-system namespace
kubectl get meshconfig fsm-mesh-config -n fsm-system -o jsonpath='{.spec.traffic.outboundIPRangeExclusionList}{"\n"}'
["1.1.1.1/32","2.2.2.2/24"]
The output shows the IP ranges that are excluded from outbound traffic redirection, ["1.1.1.1/32","2.2.2.2/24"]
in the example above.
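If the list does not match what is expected, it can be updated on the MeshConfig with a merge patch (a minimal sketch using the same field path queried above; adjust the ranges to your environment):
# Replace fsm-system with fsm-controller's namespace if using a non default namespace
kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"outboundIPRangeExclusionList":["1.1.1.1/32","2.2.2.2/24"]}}}' --type=merge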
2. Confirm outbound IP ranges are included in init container spec
When outbound IP range exclusions are configured, FSM’s fsm-injector
service reads this configuration from the fsm-mesh-config
MeshConfig
resource and programs iptables
rules corresponding to these ranges so that they are excluded from outbound traffic redirection via the Pipy sidecar proxy.
Confirm FSM’s fsm-init
init container spec has rules corresponding to the configured outbound IP ranges to exclude.
kubectl describe pod test-58d4f8ff58-wtz4f -n test
Name: test-58d4f8ff58-wtz4f
Namespace: test
...
...
Init Containers:
fsm-init:
Container ID: containerd://98840f655f2310b2f441e11efe9dfcf894e4c57e4e26b928542ee698159100c0
Image: flomesh/init:2c18593efc7a31986a6ae7f412e73b6067e11a57
Image ID: docker.io/flomesh/init@sha256:24456a8391bce5d254d5a1d557d0c5e50feee96a48a9fe4c622036f4ab2eaf8e
Port: <none>
Host Port: <none>
Command:
/bin/sh
Args:
-c
iptables -t nat -N PROXY_INBOUND && iptables -t nat -N PROXY_IN_REDIRECT && iptables -t nat -N PROXY_OUTPUT && iptables -t nat -N PROXY_REDIRECT && iptables -t nat -A PROXY_REDIRECT -p tcp -j REDIRECT --to-port 15001 && iptables -t nat -A PROXY_REDIRECT -p tcp --dport 15000 -j ACCEPT && iptables -t nat -A OUTPUT -p tcp -j PROXY_OUTPUT && iptables -t nat -A PROXY_OUTPUT -m owner --uid-owner 1500 -j RETURN && iptables -t nat -A PROXY_OUTPUT -d 127.0.0.1/32 -j RETURN && iptables -t nat -A PROXY_OUTPUT -j PROXY_REDIRECT && iptables -t nat -A PROXY_IN_REDIRECT -p tcp -j REDIRECT --to-port 15003 && iptables -t nat -A PREROUTING -p tcp -j PROXY_INBOUND && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15010 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15901 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15902 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp --dport 15903 -j RETURN && iptables -t nat -A PROXY_INBOUND -p tcp -j PROXY_IN_REDIRECT && iptables -t nat -I PROXY_OUTPUT -d 1.1.1.1/32 -j RETURN && iptables -t nat -I PROXY_OUTPUT -d 2.2.2.2/24 -j RETURN
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 22 Mar 2021 09:26:14 -0700
Finished: Mon, 22 Mar 2021 09:26:14 -0700
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from frontend-token-5g488 (ro)
In the example above, the following iptables
commands are responsible for explicitly ignoring the configured outbound IP ranges (1.1.1.1/32 and 2.2.2.2/24
) from being redirected to the Pipy proxy sidecar.
iptables -t nat -I PROXY_OUTPUT -d 1.1.1.1/32 -j RETURN
iptables -t nat -I PROXY_OUTPUT -d 2.2.2.2/24 -j RETURN
When outbound port exclusions are configured
By default, all traffic using TCP as the underlying transport protocol is redirected via the Pipy proxy sidecar container. This means all TCP based outbound traffic from applications is redirected and routed via the Pipy proxy sidecar based on service mesh policies. When outbound port exclusions are configured, traffic belonging to these ports will not be proxied to the Pipy sidecar.
If outbound ports are configured to be excluded but are still subject to service mesh policies, verify they are configured as expected.
1. Confirm global outbound ports are correctly configured in the fsm-mesh-config
MeshConfig resource
Confirm the outbound ports to be excluded are set correctly:
# Assumes FSM is installed in the fsm-system namespace
kubectl get meshconfig fsm-mesh-config -n fsm-system -o jsonpath='{.spec.traffic.outboundPortExclusionList}{"\n"}'
[6379,7070]
The output shows the ports that are excluded from outbound traffic redirection, [6379,7070]
in the example above.
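If the global list does not match what is expected, it can be updated on the MeshConfig with a merge patch (a minimal sketch using the same field path queried above; adjust the ports to your environment):
# Replace fsm-system with fsm-controller's namespace if using a non default namespace
kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"outboundPortExclusionList":[6379,7070]}}}' --type=merge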
2. Confirm pod level outbound ports are correctly annotated on the pod
Confirm the outbound ports to be excluded on a pod are set correctly:
kubectl get pod POD_NAME -o jsonpath='{.metadata.annotations}' -n POD_NAMESPACE
map[flomesh.io/outbound-port-exclusion-list:8080]
The output shows the ports that are excluded from outbound traffic redirection on the pod, 8080
in the example above.
3. Confirm outbound ports are included in init container spec
When outbound port exclusions are configured, FSM’s fsm-injector
service reads this configuration from the fsm-mesh-config
MeshConfig
resource and from the annotations on the pod, and programs iptables
rules corresponding to these ranges so that they are excluded from outbound traffic redirection via the Pipy sidecar proxy.
Confirm FSM’s fsm-init
init container spec has rules corresponding to the configured outbound ports to exclude.
kubectl describe pod test-58d4f8ff58-wtz4f -n test
Name: test-58d4f8ff58-wtz4f
Namespace: test
...
...
Init Containers:
fsm-init:
Container ID: containerd://98840f655f2310b2f441e11efe9dfcf894e4c57e4e26b928542ee698159100c0
Image: flomesh/init:2c18593efc7a31986a6ae7f412e73b6067e11a57
Image ID: docker.io/flomesh/init@sha256:24456a8391bce5d254d5a1d557d0c5e50feee96a48a9fe4c622036f4ab2eaf8e
Port: <none>
Host Port: <none>
Command:
/bin/sh
Args:
-c
iptables-restore --noflush <<EOF
# FSM sidecar interception rules
*nat
:fsm_PROXY_INBOUND - [0:0]
:fsm_PROXY_IN_REDIRECT - [0:0]
:fsm_PROXY_OUTBOUND - [0:0]
:fsm_PROXY_OUT_REDIRECT - [0:0]
-A fsm_PROXY_IN_REDIRECT -p tcp -j REDIRECT --to-port 15003
-A PREROUTING -p tcp -j fsm_PROXY_INBOUND
-A fsm_PROXY_INBOUND -p tcp --dport 15010 -j RETURN
-A fsm_PROXY_INBOUND -p tcp --dport 15901 -j RETURN
-A fsm_PROXY_INBOUND -p tcp --dport 15902 -j RETURN
-A fsm_PROXY_INBOUND -p tcp --dport 15903 -j RETURN
-A fsm_PROXY_INBOUND -p tcp --dport 15904 -j RETURN
-A fsm_PROXY_INBOUND -p tcp -j fsm_PROXY_IN_REDIRECT
-I fsm_PROXY_INBOUND -i net1 -j RETURN
-I fsm_PROXY_INBOUND -i net2 -j RETURN
-A fsm_PROXY_OUT_REDIRECT -p tcp -j REDIRECT --to-port 15001
-A fsm_PROXY_OUT_REDIRECT -p tcp --dport 15000 -j ACCEPT
-A OUTPUT -p tcp -j fsm_PROXY_OUTBOUND
-A fsm_PROXY_OUTBOUND -o lo ! -d 127.0.0.1/32 -m owner --uid-owner 1500 -j fsm_PROXY_IN_REDIRECT
-A fsm_PROXY_OUTBOUND -o lo -m owner ! --uid-owner 1500 -j RETURN
-A fsm_PROXY_OUTBOUND -m owner --uid-owner 1500 -j RETURN
-A fsm_PROXY_OUTBOUND -d 127.0.0.1/32 -j RETURN
-A fsm_PROXY_OUTBOUND -o net1 -j RETURN
-A fsm_PROXY_OUTBOUND -o net2 -j RETURN
-A fsm_PROXY_OUTBOUND -j fsm_PROXY_OUT_REDIRECT
COMMIT
EOF
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 22 Mar 2021 09:26:14 -0700
Finished: Mon, 22 Mar 2021 09:26:14 -0700
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from frontend-token-5g488 (ro)
The following iptables command is responsible for explicitly ignoring the configured outbound ports (6379, 7070, and 8080) from being redirected to the Pipy proxy sidecar.
iptables -t nat -I PROXY_OUTPUT -p tcp --match multiport --dports 6379,7070,8080 -j RETURN
6.2 - Permissive Traffic Policy Mode
When permissive traffic policy mode is not working as expected
1. Confirm permissive traffic policy mode is enabled
Confirm permissive traffic policy mode is enabled by verifying the value for the enablePermissiveTrafficPolicyMode
key in the fsm-mesh-config
MeshConfig custom resource. The fsm-mesh-config
MeshConfig resides in the FSM control plane namespace (fsm-system
by default).
# Returns true if permissive traffic policy mode is enabled
kubectl get meshconfig fsm-mesh-config -n fsm-system -o jsonpath='{.spec.traffic.enablePermissiveTrafficPolicyMode}{"\n"}'
true
The above command must return a boolean string (true
or false
) indicating if permissive traffic policy mode is enabled.
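If the mode needs to be changed, the MeshConfig can be updated with a merge patch (a sketch using the same field path queried above; set the value to false to disable permissive mode):
# Replace fsm-system with fsm-controller's namespace if using a non default namespace
kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"enablePermissiveTrafficPolicyMode":true}}}' --type=merge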
2. Inspect FSM controller logs for errors
# When fsm-controller is deployed in the fsm-system namespace
kubectl logs -n fsm-system $(kubectl get pod -n fsm-system -l app=fsm-controller -o jsonpath='{.items[0].metadata.name}')
Errors will be logged with the level
key in the log message set to error
:
{"level":"error","component":"...","time":"...","file":"...","message":"..."}
3. Confirm the Pipy configuration
Use the fsm verify connectivity
command to validate that the pods can communicate using a Kubernetes service.
For example, to verify if the pod curl-7bb5845476-zwxbt
in the namespace curl
can direct traffic to the pod httpbin-69dc7d545c-n7pjb
in the httpbin
namespace using the httpbin
Kubernetes service:
fsm verify connectivity --from-pod curl/curl-7bb5845476-zwxbt --to-pod httpbin/httpbin-69dc7d545c-n7pjb --to-service httpbin
---------------------------------------------
[+] Context: Verify if pod "curl/curl-7bb5845476-zwxbt" can access pod "httpbin/httpbin-69dc7d545c-n7pjb" for service "httpbin/httpbin"
Status: Success
---------------------------------------------
The Status
field in the output will indicate Success
when the verification succeeds.
6.3 - Ingress
When Ingress is not working as expected
1. Confirm global ingress configuration is set as expected.
# Returns true if HTTPS ingress is enabled
kubectl get meshconfig fsm-mesh-config -n fsm-system -o jsonpath='{.spec.traffic.useHTTPSIngress}{"\n"}'
false
If the output of this command is false, HTTP ingress is enabled and HTTPS ingress is disabled. To disable HTTP ingress and enable HTTPS ingress, use the following command:
# Replace fsm-system with fsm-controller's namespace if using a non default namespace
kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"useHTTPSIngress":true}}}' --type=merge
Likewise, to enable HTTP ingress and disable HTTPS ingress, run:
# Replace fsm-system with fsm-controller's namespace if using a non default namespace
kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"useHTTPSIngress":false}}}' --type=merge
2. Inspect FSM controller logs for errors
# When fsm-controller is deployed in the fsm-system namespace
kubectl logs -n fsm-system $(kubectl get pod -n fsm-system -l app=fsm-controller -o jsonpath='{.items[0].metadata.name}')
Errors will be logged with the level
key in the log message set to error
:
{"level":"error","component":"...","time":"...","file":"...","message":"..."}
3. Confirm that the ingress resource has been successfully deployed
kubectl get ingress <ingress-name> -n <ingress-namespace>
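If the resource exists, its events and backend configuration can also be inspected for errors reported against it (a simple follow-up check):
kubectl describe ingress <ingress-name> -n <ingress-namespace>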
6.4 - Egress Troubleshooting
When Egress is not working as expected
1. Confirm egress is enabled
Confirm egress is enabled by verifying the value for the enableEgress
key in the fsm-mesh-config
MeshConfig
custom resource. The fsm-mesh-config
resource resides in the FSM control plane namespace (fsm-system
by default).
# Returns true if egress is enabled
kubectl get meshconfig fsm-mesh-config -n fsm-system -o jsonpath='{.spec.traffic.enableEgress}{"\n"}'
true
The above command must return a boolean string (true
or false
) indicating if egress is enabled.
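If egress needs to be enabled or disabled, the MeshConfig can be updated with a merge patch (a sketch using the same field path queried above):
# Replace fsm-system with fsm-controller's namespace if using a non default namespace
kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"enableEgress":true}}}' --type=merge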
2. Inspect FSM controller logs for errors
# When fsm-controller is deployed in the fsm-system namespace
kubectl logs -n fsm-system $(kubectl get pod -n fsm-system -l app=fsm-controller -o jsonpath='{.items[0].metadata.name}')
Errors will be logged with the level
key in the log message set to error
:
{"level":"error","component":"...","time":"...","file":"...","message":"..."}
3. Confirm the Pipy configuration
Check that egress is enabled in the configuration used by the Pod’s sidecar.
{
"Spec": {
"SidecarLogLevel": "error",
"Traffic": {
"EnableEgress": true
}
}
}