infra-networking | ccds-infra

Stats

Actions

Tags

infra-networking | ccds-infra

Infra Networking

Networking is silent until it isn't — and when it breaks, it looks like everyone else's bug. The work is the paths packets take across VPCs, regions, clouds, and the on-prem edge, plus the policy and observability that make them debuggable.

When to reach for this

Designing VPC/VNet layout: CIDR allocation, subnet tiers, AZ spread
Connecting things: peering vs. transit gateway/hub-spoke, VPN, Direct Connect/ExpressRoute, cross-cloud
Reviewing egress/ingress posture, segmentation, or zero-trust policy
Debugging connectivity, DNS resolution, latency, or MTU issues

Principles

Allocate CIDRs for 5+ years. Running out of IP space mid-growth is a 6-month renumbering project. Reserve a /16 per region per environment from a non-overlapping org-wide plan (mind on-prem and partner ranges), carve /24 subnets with AZ affinity, and write the plan down.
Private by default. New services start in private subnets with explicit egress via VPC endpoints or NAT. Public exposure requires an ADR and an ALB/WAF in front — never a public IP on the workload itself.
Prefer endpoints over NAT for cloud-service traffic. Gateway endpoints (S3/DynamoDB) are free; interface endpoints/PrivateLink keep traffic off the NAT path. NAT gateways bill per-GB processed (~$0.045/GB on AWS) — S3-heavy workloads behind NAT are a classic silent cost and a needless dependency.
Segment east-west. Security groups alone are not a segmentation story: per-service egress allowlists, default-deny between namespaces/tiers, and service-mesh mTLS with authz policies where identity-based policy is needed. Adopt a mesh for mTLS + traffic policy at scale, not for one retry knob.
DNS is a debug crime scene. The resolver chain (host → stub resolver → VPC resolver → private zones → external) hides failures. Centralize resolution policy, enable resolver query logging, alert on NXDOMAIN spikes, and keep split-horizon zones documented.
Flow logs are the truth. Enable VPC/NSG flow logs everywhere with retention that covers your incident lookback; during an incident, "which service talked to which, when" must be a query, not archaeology.
Cross-AZ and egress traffic cost real money. AWS cross-AZ runs ~$0.01/GB each direction; internet egress ~$0.09/GB. Map heavy flows and co-locate chatty services deliberately — treat egress cost like a latency budget.

Connectivity decision table

Situation	Use	Why
2–3 VPCs, simple mutual access	VPC peering	cheapest, no transit hop; no transitive routing
Many VPCs / multi-account hub-spoke	Transit gateway / hub VNet	transitive routing, central inspection; per-GB cost
On-prem, latency-tolerant or interim	Site-to-site VPN	fast to stand up; ~1.25 Gbps per tunnel ceiling
On-prem, steady high bandwidth	Direct Connect / ExpressRoute	predictable latency; weeks of lead time — order early
Single service exposed across accounts	PrivateLink	no CIDR coordination, no route sprawl

Connectivity debug order

DNS — does the name resolve, and to what (dig against each resolver in the chain)?
Route — is there a route table entry for the destination (and a return path)?
Policy — security group / NACL / network policy / mesh authz, in both directions
Path — MTU (VPN/overlay tunnels clamp below 1500; blackholed large packets look like hangs), conntrack exhaustion, NAT port limits
Endpoint — is the listener actually up (ss -ltn on the target)?

Pitfalls

Overlapping CIDRs with a future acquisition, partner, or on-prem range — un-fixable without renumbering or NAT hacks
Security groups referencing CIDRs where SG-references would do — rules rot as IPs churn
Asymmetric routing through stateful firewalls/NAT — replies drop and only some flows fail, intermittently
A single NAT gateway (one AZ) as the egress path for a multi-AZ workload — cross-AZ tax plus an availability single point
TLS terminated at the edge with plaintext east-west "because it's internal"
Debugging connectivity without flow logs enabled — turning them on after the incident starts the clock at zero

Related: infra-k8s (in-cluster network policy and CNI), infra-iam (identity is the other half of zero-trust), infra-observability (flow-log pipelines), infra-finops (egress and cross-AZ cost) · domain agent: infra-architect · output/ADR format: playbook-conventions