Tech·Beginner

Sitemaps + robots.txt: How to Tell Google Every Page of Your Portfolio

Key Takeaways

  • sitemap.xml is a map of every page on your site that you hand to Google
  • robots.txt tells crawlers which paths are allowed and where to find your sitemap
  • Next.js App Router lets you generate both files dynamically with sitemap.ts and robots.ts — no manual XML editing
  • Always submit your sitemap manually in Google Search Console — don't wait for Google to find it
  • The Coverage and Crawl Stats reports show you exactly what Googlebot did with each URL

This post is Part 4 of the Can I Make Google Happy? series, the final installment.

Introduction

We've covered metadata, social previews, and structured data. But all of that is useless if Google can't find your pages in the first place.

This blog covers the two files that control how Googlebot crawls your site:

  1. sitemap.xml: a map of all your pages that you hand to Google
  2. robots.txt: a rulebook that tells crawlers what they can and cannot access

Most portfolios get this wrong — either using a static hardcoded file (which gets stale), or skipping it entirely. I'll show you the Next.js App Router way that generates both files dynamically, and how to monitor crawl health in Google Search Console's Coverage and Crawl Stats reports.

Part 1: Sitemap

What Is a Sitemap?

A sitemap is an XML file that lists every URL on your site, along with hints about each page:

  • When it was last updated (lastModified)
  • How often it changes (changeFrequency)
  • How important it is relative to other pages (priority)

Google doesn't *require* a sitemap — it will crawl your site without one. But a sitemap makes crawling faster and more reliable, especially for sites that don't have many inbound links (like a new portfolio).

Static sitemap.xml vs. Dynamic sitemap.ts

The old way (static XML file in `public/`):

xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- public/sitemap.xml (the XML declaration must be the first line of the file) -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourname.dev</loc>
    <lastmod>2026-01-01</lastmod>
  </url>
</urlset>

This works — but you have to manually update the lastmod date every time you change a page. If you forget, Google gets stale information.

The Next.js way (dynamic `src/app/sitemap.ts`):

ts
import type { MetadataRoute } from "next";

export default function sitemap(): MetadataRoute.Sitemap {
  const base = "https://yourname.dev";
  const now = new Date();

  return [
    {
      url: base,
      lastModified: now,
      changeFrequency: "monthly",
      priority: 1,
    },
    {
      url: `${base}/projects`,
      lastModified: now,
      changeFrequency: "monthly",
      priority: 0.8,
    },
    {
      url: `${base}/experience`,
      lastModified: now,
      changeFrequency: "monthly",
      priority: 0.8,
    },
    {
      url: `${base}/education`,
      lastModified: now,
      changeFrequency: "yearly",
      priority: 0.6,
    },
  ];
}

Next.js automatically serves this at https://yourname.dev/sitemap.xml. Every build regenerates it with the current date — no manual updates needed.
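To make the mapping from sitemap.ts to sitemap.xml concrete, here is a rough sketch of how entries like the ones above could be turned into XML. This is an illustration only, not Next.js's actual implementation: SitemapEntry, escapeXml, and toSitemapXml are hypothetical names, and the type is a trimmed-down stand-in for one MetadataRoute.Sitemap entry.

```typescript
// Hypothetical, trimmed-down stand-in for one MetadataRoute.Sitemap entry.
type SitemapEntry = {
  url: string;
  lastModified?: Date;
  changeFrequency?: string;
  priority?: number;
};

// Escape the characters that cannot appear raw inside XML text.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Illustrative serializer: turns entries into a sitemap XML document.
function toSitemapXml(entries: SitemapEntry[]): string {
  const urls = entries.map((entry) => {
    const parts = [`    <loc>${escapeXml(entry.url)}</loc>`];
    if (entry.lastModified) {
      parts.push(`    <lastmod>${entry.lastModified.toISOString()}</lastmod>`);
    }
    if (entry.changeFrequency) {
      parts.push(`    <changefreq>${entry.changeFrequency}</changefreq>`);
    }
    if (entry.priority !== undefined) {
      parts.push(`    <priority>${entry.priority}</priority>`);
    }
    return `  <url>\n${parts.join("\n")}\n  </url>`;
  });

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}
```

Each object in the array becomes one url element, and optional fields that are left out are simply omitted from the output.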

Breaking Down Sitemap Fields

#### url

The full absolute URL of the page. Never use relative paths here.
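If you'd rather not hand-concatenate strings, one option is to let the standard URL constructor resolve each route against your origin. A small sketch, where absoluteUrl is a hypothetical helper name:

```typescript
// Hypothetical helper: resolve a route path against the site origin.
// The URL constructor throws on a malformed base, which surfaces typos early.
function absoluteUrl(base: string, path: string): string {
  return new URL(path, base).toString();
}
```

Note that the constructor normalizes the root, so resolving "/" against https://yourname.dev yields the URL with a trailing slash.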

#### lastModified: now

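new Date() gives the current timestamp when the sitemap is generated, which by default is build time, so every deployment updates it. Google uses lastmod as a re-crawl signal only when it is consistently accurate, so stamping every page with the deploy time works best when a deploy really does change those pages.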
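If you would rather report a date per page than one timestamp for the whole site, a sketch like the following could work. The Page type, the pages array, its dates, and buildEntries are all hypothetical; in a real project the dates might come from frontmatter, a CMS, or git history.

```typescript
// Hypothetical page list with a made-up edit date for each route.
type Page = { path: string; updatedAt: string };

const pages: Page[] = [
  { path: "", updatedAt: "2026-01-15" },
  { path: "/projects", updatedAt: "2026-02-03" },
  { path: "/education", updatedAt: "2025-06-20" },
];

// Build sitemap entries whose lastModified reflects each page's own edit date.
function buildEntries(base: string, list: Page[]) {
  return list.map((page) => ({
    url: `${base}${page.path}`,
    lastModified: new Date(page.updatedAt),
  }));
}
```

Returning buildEntries("https://yourname.dev", pages) from sitemap() would then only bump a page's date when that page actually changed.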

#### changeFrequency

A hint to crawlers about how often this page changes:

| Value | Use Case |
| --- | --- |
| "always" | Pages that change on every load (live data) |
| "hourly" | News/live feeds |
| "daily" | Active blogs |
| "weekly" | Portfolio projects (active updates) |
| "monthly" | Portfolio homepage, experience page |
| "yearly" | Education, static about page |
| "never" | Archived, never changes |

Important: this is a hint at best, not a rule. Google's sitemap documentation says Googlebot ignores changefreq values and relies on its own signals (how often a page actually changes, how important it appears) to decide crawl frequency.

#### priority

A value from 0.0 to 1.0. Default is 0.5.

  • 1.0 → Homepage (most important)
  • 0.8 → Key pages (projects, experience)
  • 0.6 → Secondary pages (education)
  • 0.4 or lower → Very low priority pages

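Again, this is only a hint, and Google's documentation says Googlebot ignores priority as well. Set it if you want the file to document your own view of the site, but don't expect it to change crawl order.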

Part 2: robots.txt

What Is robots.txt?

robots.txt is a plain text file at the root of your domain. Well-behaved crawlers fetch it before crawling anything else on your site. It tells them:

  • Which pages they are allowed to visit
  • Which pages they should skip
  • Where to find your sitemap

It is NOT a security mechanism — it's a courtesy protocol. A malicious bot will ignore it. It's for legitimate crawlers like Googlebot, Bingbot, and others.

The Problem With a Static robots.txt

A static file at public/robots.txt looks like this:

txt
User-agent: *
Allow: /

Sitemap: https://yourname.dev/sitemap.xml

This works. But it has one problem: it's a hardcoded file that lives separately from your Next.js routing system. If you change your domain or add disallow rules, you have to remember to update this file manually.

The Right Way: Dynamic src/app/robots.ts

Next.js App Router supports a robots.ts file that generates robots.txt dynamically — the same pattern as sitemap.ts:

ts
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  return {
    rules: {
      userAgent: "*",
      allow: "/",
    },
    sitemap: "https://yourname.dev/sitemap.xml",
  };
}

Next.js serves this at https://yourname.dev/robots.txt. Delete the old public/robots.txt after creating this file — you don't want two competing robots files.
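One way around the hardcoded-domain problem is to keep the site URL in a single constant that both files build from. A minimal sketch, where SITE_URL, buildRobots, and buildSitemap are hypothetical names and the src/lib/site.ts location is only an assumption:

```typescript
// Hypothetical shared constant, e.g. exported from src/lib/site.ts.
const SITE_URL = "https://yourname.dev";

// Same shape as the robots.ts return value above.
function buildRobots() {
  return {
    rules: { userAgent: "*", allow: "/" },
    sitemap: `${SITE_URL}/sitemap.xml`,
  };
}

// Same shape as the sitemap.ts entries, reduced to the url field.
function buildSitemap() {
  return ["", "/projects", "/experience", "/education"].map((path) => ({
    url: `${SITE_URL}${path}`,
  }));
}
```

With this layout a domain change is a one-line edit, and the sitemap URL in robots.txt can't drift away from the URLs listed in the sitemap itself.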

Advanced robots.ts: Blocking Specific Paths

If your portfolio has any admin routes, API routes, or private pages, you can block them:

ts
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: "*",
        allow: "/",
        // Leave /_next/ crawlable: it serves the JS and CSS Googlebot needs to render pages.
        disallow: ["/api/", "/admin/"],
      },
    ],
    sitemap: "https://yourname.dev/sitemap.xml",
  };
}

  • userAgent: "*" → applies to all crawlers
  • allow: "/" → allow everything by default
  • disallow: ["/api/"] → block the /api/ path and everything under it

For a simple portfolio with no private routes, the basic version (allow everything) is correct. Don't over-engineer it.
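To see why a disallow entry blocks "everything under it", it helps to sketch how a crawler might evaluate a path against one rule group. This is a simplified model that assumes plain prefix rules only (no * or $ wildcards); Rule and isAllowed are hypothetical names, not part of Next.js.

```typescript
type Rule = { allow: string[]; disallow: string[] };

// Simplified matcher: the longest matching prefix wins; on a tie, allow wins.
function isAllowed(path: string, rule: Rule): boolean {
  // Length of the longest prefix that matches the path, or -1 if none match.
  const longest = (prefixes: string[]) =>
    prefixes
      .filter((prefix) => path.startsWith(prefix))
      .reduce((max, prefix) => Math.max(max, prefix.length), -1);

  const allowLen = longest(rule.allow);
  const disallowLen = longest(rule.disallow);

  // No disallow rule matched: crawling is permitted by default.
  if (disallowLen === -1) return true;
  return allowLen >= disallowLen;
}

const rule: Rule = { allow: ["/"], disallow: ["/api/", "/admin/"] };
```

Here /projects is allowed because only the "/" rule matches it, while /api/contact is blocked because "/api/" is a longer, more specific match than "/".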

What User-agent: * Means

code
User-agent: *

The * wildcard applies this rule to ALL crawlers — Googlebot, Bingbot, DuckDuckBot, and every other well-behaved crawler.

You can also write rules for specific crawlers:

ts
rules: [
  {
    userAgent: "Googlebot",
    allow: "/",
  },
  {
    userAgent: "AhrefsBot",     // Block SEO spy tools
    disallow: "/",
  },
],

For a portfolio, sticking with userAgent: "*" is the right call.
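When a robots file has both a "*" group and a named group, a compliant crawler follows the group that names it and falls back to "*" only if no group does. A simplified sketch of that selection, where Group and selectGroup are hypothetical names:

```typescript
type Group = { userAgent: string; allow?: string; disallow?: string };

// Simplified: pick the group whose userAgent matches the crawler's name
// (case-insensitive), falling back to the "*" group if there is one.
function selectGroup(crawler: string, groups: Group[]): Group | undefined {
  const name = crawler.toLowerCase();
  return (
    groups.find((g) => g.userAgent.toLowerCase() === name) ??
    groups.find((g) => g.userAgent === "*")
  );
}

const groups: Group[] = [
  { userAgent: "*", allow: "/" },
  { userAgent: "AhrefsBot", disallow: "/" },
];
```

In this setup AhrefsBot lands on its own group and is blocked, while Googlebot has no named group and gets the "*" rules.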

The Relationship Between sitemap.ts and robots.ts

These two files work as a team:

code
robots.txt
  └── Sitemap: https://yourname.dev/sitemap.xml  ← points to sitemap

sitemap.xml
  ├── https://yourname.dev              (priority: 1.0)
  ├── https://yourname.dev/projects     (priority: 0.8)
  ├── https://yourname.dev/experience   (priority: 0.8)
  └── https://yourname.dev/education    (priority: 0.6)

Flow:

  1. Googlebot arrives at your site
  2. It reads robots.txt and learns it's allowed to crawl everything
  3. It follows the Sitemap: link in robots.txt
  4. It reads sitemap.xml and gets the full list of URLs
  5. It schedules those URLs for crawling, using its own signals (not your priority values) to decide order and timing

This is why both files live in src/app/ — they're part of the same routing system and both get generated fresh on every deployment.
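Step 3 of the flow is simple enough to sketch: finding the Sitemap: lines in a robots.txt body. The function name extractSitemaps is hypothetical, and this is only a rough model of what a crawler does.

```typescript
// Collect every "Sitemap:" URL from a robots.txt body.
// The field name is matched case-insensitively.
function extractSitemaps(robotsTxt: string): string[] {
  return robotsTxt
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => /^sitemap:/i.test(line))
    .map((line) => line.slice("sitemap:".length).trim());
}
```

Running it against your deployed /robots.txt is a quick way to confirm the sitemap URL is exactly the one you expect.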

Google Search Console: Submitting Your Sitemap

Don't wait for Google to find your sitemap on its own. Submit it manually:

Step 1: Open Google Search Console

Step 2: In the left sidebar → Sitemaps

Step 3: In the "Add a new sitemap" field, enter sitemap.xml (GSC prepends your domain automatically)

Step 4: Click Submit

Step 5: After a few minutes, refresh. The panel reports three things — Status: Success (Google read your sitemap), Discovered URLs (how many URLs Google found), and Last read (when Google last fetched it).

Google Search Console: Coverage Report

After submitting the sitemap, the Coverage report tells you what happened to each URL.

Step 1: GSC → Pages (previously called "Coverage")

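Step 2: The classic Coverage report groups URLs into four categories (the current Pages report condenses these into "Indexed" and "Not indexed", with a reason listed for each non-indexed URL):

| Status | Meaning |
| --- | --- |
| Error | Pages Google tried to index but couldn't |
| Valid with warnings | Indexed but has potential issues |
| Valid | Successfully indexed; these appear in search results |
| Excluded | Not indexed, but Google says it's intentional |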

Common "Excluded" reasons for portfolios:

  • "Crawled – currently not indexed" → Google crawled it but chose not to index it yet. Wait a few days.
  • "Discovered – currently not indexed" → Google knows it exists but hasn't crawled it yet. This usually clears on its own; internal links to the page can help, while sitemap priority values do not.
  • "Duplicate, Google chose different canonical" → You have a canonical conflict. Check your alternates.canonical in layout.tsx.

Google Search Console: Crawl Stats Report

The Crawl Stats report shows you exactly how Googlebot is behaving on your site:

Step 1: GSC → Settings → Crawl Stats

What to look for:

  • Total crawl requests — How often Google visits your site. For a new portfolio, this starts low (5–20 requests/day) and increases as you get indexed.
  • Average response time — Should be under 500ms. Higher means slow server, which can reduce crawl budget.
  • Crawl requests by response — All should be 200 OK. If you see many 404 responses, you have broken pages in your sitemap.
  • File types crawled — Confirms Google is reading your robots.txt and sitemap.xml.

Full Checklist: Did You Tell Google Everything?

  • src/app/sitemap.ts created — generates /sitemap.xml dynamically
  • All portfolio pages included with correct priority and changeFrequency
  • lastModified: new Date() — auto-updates on every deployment
  • public/robots.txt deleted — no duplicate robots files
  • src/app/robots.ts created — generates /robots.txt dynamically
  • robots.ts references the correct sitemap URL
  • Verified https://yourdomain.com/sitemap.xml loads correctly in browser
  • Verified https://yourdomain.com/robots.txt loads correctly in browser
  • Sitemap submitted in Google Search Console
  • Coverage report shows pages as "Valid"
  • Crawl Stats show healthy response times and no 404 errors
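For the "loads correctly in browser" items, a small script can do a first pass before you eyeball the file. The sketch below assumes you already have the sitemap XML as a string (for example, pasted or fetched separately); extractLocs and findSuspectUrls are hypothetical helpers, and the regex is a quick check rather than a full XML parser.

```typescript
// Pull every <loc> value out of a sitemap XML string.
function extractLocs(xml: string): string[] {
  const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  const locs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// Return the URLs that are not absolute https URLs on the expected host.
function findSuspectUrls(locs: string[], expectedHost: string): string[] {
  return locs.filter((loc) => {
    try {
      const url = new URL(loc);
      return url.protocol !== "https:" || url.host !== expectedHost;
    } catch {
      return true; // relative or malformed URL
    }
  });
}
```

An empty result from findSuspectUrls(extractLocs(xml), "yourname.dev") means every listed URL is absolute, uses https, and points at your own domain.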

Wrapping Up the Series

Over these four blogs, here's what we built for the portfolio's SEO foundation:

| Blog | What We Did | GSC Report to Check |
| --- | --- | --- |
| 01 | Metadata, canonical, robots config, font loading | URL Inspection |
| 02 | Open Graph image, Twitter Card | URL Inspection → HTML tab |
| 03 | JSON-LD Person schema | Rich Results report |
| 04 | sitemap.ts, robots.ts, crawl setup | Sitemaps, Coverage, Crawl Stats |

Each piece reinforces the others. Your metadata tells Google what you are. Your OG image tells social platforms how to represent you. Your schema tells Google's Knowledge Graph who you are. And your sitemap + robots.txt tells Googlebot exactly where to look and when to come back.

That's how you make Google happy.

FAQ

Do I need both sitemap.ts and robots.ts?

You can have one without the other, but they're stronger together. robots.ts points crawlers to your sitemap, and sitemap.ts lists every URL you want indexed. Without robots.ts, any crawler you haven't submitted the sitemap to directly has to guess where it lives.

What happens if I have both public/robots.txt and src/app/robots.ts?

The public/ file wins because it's served directly as a static asset. Delete it after creating robots.ts — otherwise the dynamic file is ignored.

How long does it take for pages to get indexed?

After submitting your sitemap, expect 3–14 days for initial indexing. Some pages may stay in "Discovered – currently not indexed" for weeks. Building inbound links and consistently updating content can speed this up.