# Site Discovery — Improvement & Growth Plan

> Package: capell-app/site-discovery · Kind: package · Tier: premium · Product group: Capell Search & SEO · Bundle: search-seo · Status: Draft

## 1. Snapshot

Site Discovery is the foundational discovery layer for the Capell Search & SEO bundle. It resolves canonical public URLs (`BuildPublicUrlRegistryAction`, `DiscoverPublicUrlsAction`, `DiscoverPublicPagesAction`), generates per-domain XML sitemaps with chunking + a `<sitemapindex>` and incremental state tracking (`Support/Sitemap/XmlSitemapGenerator.php`, `SitemapStateStore.php`), renders an HTML sitemap as a Capell page type (`Support/Sitemap/SitemapPageType.php`, Livewire `Livewire/Page/Sitemap.php`), runs sitemap quality gates (`ValidateSitemapQualityAction`), and reports generated-output parity in an admin page (`BuildGeneratedOutputParityReportAction`, `Filament/Pages/PublicUrlRegistryPage.php`). It owns no database tables (`capell.json` `database.migrations:false`). Surfaces: admin (Page/Site `Sitemap` actions, `SitemapTool`, Public URL Registry page), frontend (`/sitemap` page, `/sitemap-xml` file output), console (`capell:xml-sitemap`).

Downstream, **seo-suite depends on this package** as a `GeneratedOutputCoverageSource` and consumes the public-URL registry contracts (`PublicUrlContributor`, `DiscoverableUrlSource`, `DiscoveryOutputSource`); seo-suite — not this package — owns and serves `/robots.txt`, `/llms.txt`, `/llms-full.txt`, `/index.md`, `/{url}.md` (confirmed in `BuildSeoSuiteDoctorReportAction::OWNED_ROUTES`). The `DiscoveryOutputSource` contract here lets packages _advertise_ machine-readable outputs (e.g. `llms.txt`) into the registry, but this package serves none of them over HTTP.

Current marketplace `summary` (verbatim): _"Make every published Capell page discoverable — automatic XML sitemaps with index sharding, an HTML sitemap, and the canonical public-URL registry that powers the whole Search & SEO bundle."_ Screenshot reconciliation is closed for buyer-facing UI: `capell.json` declares the extension card plus committed route-backed PNG captures for page/site sitemap actions, the sitemap generation tool, the HTML sitemap page, the Public URL Registry, and the quality report. The raw XML sitemap output capture remains committed in `docs/screenshots.json` as protocol-route evidence, but it is no longer promoted as marketplace media.

## 2. Improvements (existing functionality)

1. **Done/Shipped: implemented the four advertised health checks.** `SiteDiscoveryHealthCheck` now supports all manifest keys (`site-discovery.public-urls`, `.xml-sitemaps`, `.incremental`, `.html-sitemap`) with real probes for URL contributors, XML route/storage/response output, incremental schedule config/state, and HTML sitemap registry/Livewire/page type wiring. Operator-facing output is translated, and focused tests lock in the keyed diagnostics. — `src/Health/SiteDiscoveryHealthCheck.php`, `resources/lang/en/package.php`, `tests/Feature/SiteDiscoveryHealthCheckTest.php` — effort M.
2. **Done/Shipped: scheduled opt-in `capell:xml-sitemap --incremental` regeneration.** The service provider now registers a config-gated incremental sitemap schedule with daily/custom cron options, `withoutOverlapping`, `onOneServer`, and a stable event name so sitemaps can self-heal after missed events, queue failures, or bulk imports. Scheduling is disabled by default and covered by focused tests. — `config/capell-site-discovery.php`, `src/Providers/SiteDiscoveryServiceProvider.php`, `tests/Integration/Sitemap/SitemapScheduleTest.php` — effort S.
3. **Done/Shipped: removed the unused `icamys/php-sitemap-generator` dependency.** Package code generates XML through Site Discovery's own sitemap classes and `DOMDocument`; the package composer manifest no longer requires `icamys/php-sitemap-generator`, package docs no longer advertise it, and `PackageManifestTest` guards against reintroducing it silently. — `composer.json`, `README.md`, `tests/Unit/PackageManifestTest.php` — effort S.
4. **Done/Shipped: removed the legacy `Support/SitemapGenerator`.** The test-only table-scanning support class and its isolated unit test are gone, leaving `Support/Sitemap/XmlSitemapGenerator` as the single sitemap generation path for Site Discovery. — `src/Support/SitemapGenerator.php`, `tests/Unit/Support/SitemapGeneratorTest.php` — effort S.
5. **Done/Shipped: persisted sitemap URL counts replace loader XML parsing.** `SitemapStateStore::save()` now writes `url_count`, legacy state files fall back to `count(urls)`, and `SitemapLoader` reads state metadata instead of loading full sitemap XML with `simplexml_load_string` on every admin pass. — `src/Support/Loader/SitemapLoader.php`, `SitemapStateStore.php` — effort M.
6. **Done/Shipped: `/sitemap-xml` is now an explicit package route.** The package registers `routes/web.php` via the service provider and serves generated sitemap XML through `SitemapXmlController`/`BuildSitemapXmlResponseAction`, including path-prefixed domains, chunk pages, ETag/Last-Modified cache headers, XML content type, and safe attachment filenames derived from the generated sitemap file key. Evidence: `SitemapXmlRouteTest` covers root-domain, path-domain, chunked, missing-file, and 304 responses. — `routes/web.php`, `src/Http/Controllers/SitemapXmlController.php`, `src/Actions/BuildSitemapXmlResponseAction.php` — effort M.
7. **Done/Shipped: page-save sitemap regeneration is debounced per site.** Page saved/deleted listeners now request regeneration through `RequestSiteSitemapRegenerationAction`, which uses a short per-site pending cache key and dispatches one delayed unique incremental job (`RegenerateSiteSitemapIncrementallyJob`) instead of running a full-site incremental pass for every queued page event. URL change notifications still run per page event. — `src/Actions/RequestSiteSitemapRegenerationAction.php`, `src/Jobs/RegenerateSiteSitemapIncrementallyJob.php`, `src/Listeners/Sitemap/RegenerateSitemapsOnPageSaved.php`, `src/Listeners/Sitemap/RegenerateSitemapsOnPageDeleted.php`, `config/capell-site-discovery.php` — effort M.
8. **Done/Shipped: sitemap quality results are surfaced in the admin registry.** `ValidateSitemapQualityAction` powers the Public URL Registry parity rows so editors can see missing sitemap, AI discovery, search, HTML cache, and Agent Delivery coverage directly beside each canonical URL. The remaining `SitemapTool` surface is intentionally narrow: queue regeneration and report queued state, while diagnostics live in the registry page. — `src/Filament/Pages/PublicUrlRegistryPage.php`, `resources/views/filament/pages/public-url-registry.blade.php` — effort M.
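
The `url_count` fallback from item 5 can be illustrated with a minimal sketch. The `url_count` and `urls` keys come from the text above; the function name and the assumption that the state file decodes to a plain array are hypothetical, not the package's actual API.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: resolve the URL count from a decoded sitemap state array.
 * Assumes the state decodes to an array with an optional integer `url_count`
 * and an optional `urls` list (legacy files).
 *
 * @param array<string, mixed> $state
 */
function resolveSitemapUrlCount(array $state): int
{
    // Newer state files carry the count directly, so no XML needs to be parsed.
    if (isset($state['url_count']) && is_int($state['url_count'])) {
        return max(0, $state['url_count']);
    }

    // Legacy state files only list the URLs, so fall back to counting them.
    $urls = $state['urls'] ?? [];

    return is_array($urls) ? count($urls) : 0;
}
```

Reading a small metadata value this way keeps admin requests independent of sitemap size, which is the point of dropping the per-request XML parse.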

## 3. Missing Features (gaps)

Tie-back to `capabilities[]`: `site-discovery-sitemap-quality-gates`, `site-discovery-generated-output-parity`, and `site-discovery-public-url-registry` are present and reachable. None of the items below is claimed in `capabilities[]`; each is now shipped, retired, or explicitly out of scope.

- **Done/Shipped: image, video, and news sitemap extensions.** `SitemapUrlItemData` now accepts typed image/video/news payloads, and `XmlSitemapGenerator::toXml()` emits `image:`, `video:`, and `news:` extension markup with only the namespaces required by the current sitemap. Package contributors can attach rich sitemap metadata at the URL boundary without bypassing Site Discovery's quality gates or sitemap writer. **(Differentiator)**
- **Done/Retired: Google/Bing sitemap ping notifier.** The old unauthenticated sitemap ping workflow is no longer a viable implementation target: Google Search Central says the sitemap ping endpoint deprecation is complete and Bing Webmaster Tools points current sitemap submission toward Webmaster Tools, robots.txt sitemap discovery, and URL submission/IndexNow workflows. Site Discovery already ships the supported live-notification path through opt-in `IndexNowUrlChangeNotifier`; future work should be Search Console/Bing Webmaster API integration only if those authenticated surfaces become explicit product requirements.
- **Out of scope: robots.txt management remains SEO Suite-owned.** Site Discovery carries `robotsDirectives` per registry entry and advertises coverage into the generated-output parity report, but route ownership and file generation stay in `seo-suite` so the Search & SEO bundle has one robots surface.
- **Done/Shipped: per-locale hreflang sitemap annotations.** `SitemapUrlItemData` now accepts typed alternate links, and the XML writer emits `xhtml:link rel="alternate"` annotations with the XHTML namespace only when alternates are present. URL contributors can now provide locale-specific alternates without duplicating sitemap XML rendering. **(Differentiator for multi-locale Capell sites.)**
- **Out of scope: richer `lastmod` signals stay with content-owning packages.** Site Discovery preserves and serializes the last-modified values it receives from page and public URL contributors. Block, media, and related-model edits should update the owning package's public URL contribution rather than forcing Site Discovery to inspect package internals.
- **Out of scope: `llms.txt` generation remains SEO Suite-owned.** The `DiscoveryOutputSource` contract advertises machine-readable outputs into the registry and parity report, while SEO Suite owns `/llms.txt`, `/llms-full.txt`, `/index.md`, and `/{url}.md` route generation.
- **Done/Shipped: chunked sitemap writer avoids duplicate large arrays.** Chunking + `<sitemapindex>` exist (`max_urls_per_file`, default 50000, `-pN.xml` chunks), and `writeItems()` now walks the item list once with a bounded per-file chunk buffer instead of duplicating the full URL set through `array_chunk`. A deeper generator-backed URL source can still reduce the initial registry/build materialisation, but the XML writer no longer creates a second full-size copy for large sites. **(Scale.)**
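
The bounded chunk buffer described in the last item above can be sketched as a generator that walks the URL list once. This is an illustration under stated assumptions, not the body of `writeItems()`: the function name, the string-URL input, and the 1-based page numbering are hypothetical. Only the 50000 default and the single-pass, bounded-buffer idea come from the text.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch: yield one chunk of URLs per sitemap file while
 * holding at most $maxUrlsPerFile items in memory at a time.
 *
 * @param iterable<string> $urls
 * @return Generator<int, list<string>>
 */
function chunkSitemapUrls(iterable $urls, int $maxUrlsPerFile = 50000): Generator
{
    if ($maxUrlsPerFile < 1) {
        throw new InvalidArgumentException('maxUrlsPerFile must be at least 1.');
    }

    $buffer = [];
    $page = 1;

    foreach ($urls as $url) {
        $buffer[] = $url;

        // Flush the buffer as soon as it reaches the per-file limit.
        if (count($buffer) === $maxUrlsPerFile) {
            yield $page => $buffer;
            $buffer = [];
            $page++;
        }
    }

    // Emit the final, partially filled chunk if any URLs remain.
    if ($buffer !== []) {
        yield $page => $buffer;
    }
}
```

Unlike `array_chunk`, this never builds a second full-size array; each yielded chunk can be written to its file and discarded before the next one is filled. Because generator bodies run lazily in PHP, the guard exception is raised on first iteration rather than at call time.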

## 4. Issues / Risks

- **Done/Shipped: advertised health checks are functional.** Site Discovery now fails Diagnostics when public URL contributors, XML sitemap output, incremental sitemap state/scheduling, or HTML sitemap registration are not ready. Source: `capell.json` `healthChecks[]`.
- **Done/Shipped: generated XML public-output safety now has end-to-end coverage.** `SitemapGeneratorTest` proves generated XML includes only public/indexable URLs and excludes draft pages, noindex registry URLs, admin/private URLs, and signed query markers. The registry/data-layer safety tests remain in place. `src/Support/Sitemap/XmlSitemapGenerator.php`, `tests/Integration/Sitemap/SitemapGeneratorTest.php`.
- **Done/Shipped: package route removes silent empty discovery for default Laravel installs.** `SiteDiscoveryServiceProvider` now registers the package web routes, and `SitemapXmlRouteTest` proves `/sitemap-xml`, path-prefixed `/uk/sitemap-xml`, and chunked `?p=2` output are served from generated sitemap files.
- **Done/Shipped: cache-safety invalidation metadata now matches listeners.** `capell.json` `cacheSafety.invalidationSources` records Page `saved`/`deleted` and Site `created`, matching the queued sitemap regeneration listeners registered in `SiteDiscoveryServiceProvider`. Generation still forgets `CacheEnum::sitemapPages` keys (`.public`, `.with-edit-urls`) in `XmlSitemapGenerator::forgetSitemapPageCaches`. The remaining risk is in debounce/coalescing behaviour, which has its own item in this section, not in an empty manifest. — `capell.json`, `src/Providers/SiteDiscoveryServiceProvider.php`, `tests/Unit/PackageManifestTest.php`.
- **Performance budget.** The manifest sets `frontendRenderBudgetMs:20` and `adminQueryBudget:40` (`capell.json` `performance`). The HTML sitemap (`Livewire/Page/Sitemap.php`, 134 LOC) still needs budget coverage, and no budget assertion test exists. The admin sitemap loader no longer parses full XML per request; it reads the persisted `url_count` state metadata, with a legacy `urls` fallback.
- **Done/Shipped: page event sitemap work is coalesced per site.** Saved/deleted page listeners no longer run `processIncremental($site)` directly; they schedule one delayed unique incremental regeneration per site through `RequestSiteSitemapRegenerationAction`/`RegenerateSiteSitemapIncrementallyJob`, with configurable debounce and uniqueness windows.
- **i18n.** Generated XML now supports typed `hreflang` alternates (see §3), and user-facing strings use translation files (`resources/lang/en`).
- **Done/Shipped: CHANGELOG records the current package work.** `CHANGELOG.md` now lists the marketplace screenshot/copy reconciliation, sitemap extension payloads, chunk writer change, ping retirement, and prior health-check work.

Test coverage is otherwise strong (≈140 cases across Unit/Integration/Feature/Arch): incremental state, chunk/index emission, lifecycle listeners, quality gates, IndexNow, registry dedup/normalisation, and public-safety at the data layer are all covered.
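
The per-site coalescing listed in this section can be pictured with a small in-memory sketch. It is a stand-in for illustration only: the class name, the 30-second default, and the explicit `$now` parameter are assumptions, and the real `RequestSiteSitemapRegenerationAction` uses a cache key and a delayed unique job whose details are not shown here.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical in-memory stand-in for a per-site "regeneration pending" marker.
 */
final class SitemapRegenerationDebouncer
{
    /** @var array<int, int> Site ID => Unix timestamp at which the pending marker expires. */
    private array $pendingUntil = [];

    public function __construct(private int $debounceSeconds = 30)
    {
    }

    /**
     * Returns true when the caller should dispatch a regeneration job,
     * and false when one is already pending for the site.
     */
    public function request(int $siteId, int $now): bool
    {
        // A live marker means a job is already queued for this site.
        if (($this->pendingUntil[$siteId] ?? 0) > $now) {
            return false;
        }

        // Otherwise claim the window and let the caller dispatch one job.
        $this->pendingUntil[$siteId] = $now + $this->debounceSeconds;

        return true;
    }
}
```

With this shape, a burst of page saves on one site inside the window produces a single dispatch, while saves on other sites are unaffected. In a multi-process deployment the check-and-set would need to be atomic, for example through a shared cache, which this sketch does not attempt.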

## 5. Marketplace & Selling

**Done/Shipped: marketplace copy and screenshots are buyer-facing.** The marketplace summary, description, Composer description, manifest description, docs intro copy, and route-backed UI screenshot gallery now lead with discoverability outcomes, sitemap sharding, HTML sitemap output, and the canonical Public URL Registry that powers the Search & SEO bundle. Raw XML output remains docs evidence only.

**Improved 1-sentence summary:** "Make every published Capell page discoverable — automatic XML sitemaps with index sharding, an HTML sitemap, and the canonical public-URL registry that powers the whole Search & SEO bundle."

**Improved 3–4 sentence description:** "Site Discovery is the discovery backbone of Capell's Search & SEO bundle. It builds per-domain XML sitemaps (with `<sitemapindex>` sharding and incremental regeneration that skips unchanged pages), publishes a public HTML sitemap, and maintains the canonical Public URL Registry that records canonical state, robots directives, last-modified, and sitemap/AI-discovery eligibility for every public URL. Other packages — SEO Suite, Search, Blog, Events — plug in as URL contributors and read a single, consistent view of what your site intends to publish. Built-in quality gates and a generated-output parity report catch missing or stale URLs before search engines do, and optional IndexNow notifications submit changes the moment a page is saved."

**Screenshot/media status:** the 7 contracted capture targets in `docs/screenshots.json` now have matching committed route-backed PNG captures, including Public URL Registry parity and quality-report visuals. The XML output capture is retained for verification but demoted from buyer-facing marketplace media because it is raw protocol output.

**Pricing/tier/bundle positioning:** `tier: premium`, `bundle: search-seo`. This is a **foundational dependency for seo-suite** (and search/blog/events contribute through it). It should not be sold standalone as a headline product, because its value compounds inside the bundle. Recommendation: use bundle-anchored pricing in which site-discovery is included with seo-suite and positioned as the "required base", so the bundle's value proposition (AI discovery, llms.txt, robots, and redirects in seo-suite) is gated on owning it. Cross-sell paths run through `dependencies.requires` (admin/core/frontend) and Extension Suites: **seo-suite** (consumes registry + parity), **agent-delivery** (page-manifest-resolvable URL parity source), **search** (search-indexable coverage), **html-cache** (cached-URL coverage).

**Differentiators / value props / target buyer:** target buyer = agencies and product teams running multi-domain/multi-locale Capell sites who need reliable indexing without per-package crawlers. Differentiators: single canonical URL registry; incremental, queue-driven regeneration; generated-output parity diagnostics across the whole bundle; IndexNow out of the box; typed sitemap extension payloads for hreflang, image, video, and news metadata. Bundle boundaries are explicit: robots and AI discovery output generation stay with SEO Suite, while content-owning packages supply richer URL metadata through Site Discovery contracts.

**Keywords/tags (8–12):** `sitemap`, `xml-sitemap`, `sitemap-index`, `html-sitemap`, `seo`, `indexnow`, `url-registry`, `discoverability`, `crawl`, `lastmod`, `multi-domain`, `incremental-sitemap`.

## 6. Prioritized Roadmap

| Item                                                                                                                                                                                                                                                                                               | Bucket | Effort | Impact | Section  |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ | ------ | ------ | -------- |
| Done/Shipped: Implement the 4 advertised health checks. Evidence: keyed diagnostics cover public URL contributors, XML sitemap output, incremental schedule/state, and HTML sitemap wiring.                                                                                                        | Done   | M      | High   | §2.1, §4 |
| Done/Shipped: Add test asserting generated XML excludes unpublished/noindex/private URLs end-to-end. Evidence: generated XML excludes draft, noindex, admin/private, and signed URL markers.                                                                                                       | Done   | S      | High   | §4       |
| Done/Shipped: Register a real `/sitemap-xml` route/controller (stop relying on web-server config). Evidence: package web routes serve root, path-prefixed, and chunked sitemap XML with cache headers.                                                                                             | Done   | M      | High   | §2.6, §4 |
| Done/Shipped: Remove unused `icamys/php-sitemap-generator` dependency. Evidence: package composer manifest drops the dependency and `PackageManifestTest` guards against reintroduction.                                                                                                           | Done   | S      | Med    | §2.3     |
| Done/Shipped: Close screenshot mismatch. Evidence: `capell.json` declares the extension card plus styled route-backed UI PNG captures, while all 7 required `docs/screenshots.json` contract entries remain committed and guarded by `PackageManifestTest`.                                        | Done   | S      | Med    | §1, §5   |
| Done/Shipped: Schedule opt-in `capell:xml-sitemap --incremental`. Evidence: config-gated daily/custom cron scheduler wiring with overlap and one-server locks is covered by `SitemapScheduleTest`.                                                                                                 | Done   | S      | Med    | §2.2     |
| Done/Shipped: Debounce/coalesce per-site regeneration in page-save listener. Evidence: saved/deleted page listeners schedule one delayed unique incremental regeneration job per site instead of running the generator per event.                                                                  | Done   | M      | High   | §4       |
| Done/Shipped: Reconcile manifest `cacheSafety.invalidationSources` with real listeners. Evidence: manifest declares Page saved/deleted and Site created invalidation sources, guarded by `PackageManifestTest`.                                                                                    | Done   | S      | Med    | §4       |
| Done/Shipped: Delete legacy `Support/SitemapGenerator`. Evidence: the test-only table-scanning support class and its isolated unit test were removed, leaving the package-owned XML generator as the only sitemap generator.                                                                       | Done   | S      | Med    | §2.4     |
| Done/Shipped: Persist URL count in state store; drop per-request `simplexml_load_string`. Evidence: state files include `url_count`, legacy state falls back to `count(urls)`, and `SitemapLoader` reads state metadata instead of parsing XML.                                                    | Done   | M      | Med    | §2.5, §4 |
| Done/Shipped: Improve marketplace summary/description + parity screenshot. Evidence: manifest/docs/Composer copy use the outcome-led listing and `capell.json` promotes the committed Public URL Registry parity/quality captures.                                                                 | Done   | S      | High   | §5       |
| Done/Shipped: Image/video/news sitemap extensions (`SitemapUrlItemData`). Evidence: typed sitemap extension data objects and XML writer support emit image, video, and news tags with scoped namespaces.                                                                                           | Done   | L      | High   | §3       |
| Done/Shipped: hreflang alternates for per-locale sitemaps. Evidence: typed `SitemapAlternateData` payloads render as `xhtml:link rel="alternate"` annotations with scoped XML namespace output.                                                                                                    | Done   | M      | Med    | §3, §4   |
| Done/Retired: Google/Bing sitemap ping notifier. Evidence: official guidance makes unauthenticated sitemap ping obsolete; Site Discovery keeps opt-in IndexNow as the supported URL-change notifier path and should only add authenticated webmaster integrations as a new scoped product feature. | Done   | M      | Med    | §3       |
| Done/Shipped: Streaming chunk writer for very large sitemaps. Evidence: `XmlSitemapGenerator::writeItems()` writes paginated sitemap files sequentially with a bounded chunk buffer instead of `array_chunk` duplicating the full item list.                                                       | Done   | L      | Med    | §3, §4   |