How to Find All Current and Archived URLs on a Website

There are numerous reasons you might need to find all of the URLs on a website, but your exact goal will determine what you're looking for. For instance, you might want to:

Identify every indexed URL to investigate issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each case, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.

Archive.org
Archive.org is a valuable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the missing export button, use a browser scraping plugin like Dataminer.io. However, these constraints mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't reveal whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
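If you're comfortable with a little scripting, the Wayback Machine's CDX API is another way around the missing export button. Here's a minimal Python sketch, assuming the requests library is installed and using example.com as a placeholder domain; adjust the parameters for your own site.

```python
import requests

# Query the Wayback Machine CDX API for captured URLs on the domain.
# example.com is a placeholder; "collapse": "urlkey" deduplicates repeat captures.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",
        "output": "json",
        "fl": "original",
        "collapse": "urlkey",
        "limit": 50000,
    },
    timeout=120,
)
resp.raise_for_status()

rows = resp.json()
urls = [row[0] for row in rows[1:]]  # the first row is the column header
print(f"Retrieved {len(urls)} URLs")
```

The output still needs the same quality filtering as the UI list (resource files, malformed URLs), but it saves you from scraping the page by hand.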

Moz Pro
Although you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
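For larger sites, a short script against the Moz Links API can pull inbound links in bulk. The sketch below is a rough outline only: the endpoint, parameter names, and response fields are assumptions based on Moz's v2 Links API, so verify them against the current documentation and swap in your own access ID and secret key.

```python
import requests

# Assumed Moz Links API v2 endpoint and parameters; confirm against Moz's docs.
MOZ_ENDPOINT = "https://lsapi.seomoz.com/v2/links"
ACCESS_ID = "your-access-id"      # placeholder credential
SECRET_KEY = "your-secret-key"    # placeholder credential

payload = {
    "target": "example.com/",       # placeholder domain
    "target_scope": "root_domain",  # assumed parameter name
    "limit": 50,
}

resp = requests.post(MOZ_ENDPOINT, json=payload, auth=(ACCESS_ID, SECRET_KEY), timeout=60)
resp.raise_for_status()

# Collect the target URLs on your own site that the inbound links point to.
target_urls = {item.get("target") for item in resp.json().get("results", [])}
print(sorted(target_urls))
```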

Google Search Console
Google Search Console provides several useful sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't carry over to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
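As a rough illustration of the API route, here's a minimal sketch using the official google-api-python-client. The service-account file and the sc-domain:example.com property are placeholders, and it assumes the service account has been granted access to the property in Search Console.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder credentials file and property name; adjust for your setup.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    resp = service.searchanalytics().query(
        siteUrl="sc-domain:example.com",
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,   # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = resp.get("rows", [])
    pages.update(row["keys"][0] for row in rows)
    if len(rows) < 25000:
        break
    start_row += 25000

print(f"{len(pages)} pages with search impressions")
```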

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create distinct URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to your report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
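If the interface exports become limiting, the GA4 Data API can pull page paths programmatically. Below is a minimal sketch using the google-analytics-data Python client, assuming application-default credentials are configured and using a placeholder property ID; it simply lists every pagePath seen in the date range.

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

# "properties/123456789" is a placeholder GA4 property ID.
client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    limit=100000,
)
response = client.run_report(request)

paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} page paths")
```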

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Things to consider:

Data size: Log files can be huge, so many sites only keep the last two months of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process. A small parsing sketch follows this list.
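As a quick illustration, here's a minimal Python sketch that pulls unique request paths out of a standard access log. The file name and the common/combined log format are assumptions, so adapt the regular expression to whatever format your server or CDN actually writes.

```python
import re
from urllib.parse import urlparse

# Matches the quoted request line in common/combined log format, e.g. "GET /blog/ HTTP/1.1"
request_re = re.compile(r'"[A-Z]+ (\S+) HTTP/[^"]*"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        match = request_re.search(line)
        if match:
            # Keep only the path so query-string variants collapse together.
            paths.add(urlparse(match.group(1)).path)

print(f"{len(paths)} unique paths")
```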
Merge, and good luck
Once you've collected URLs from all of these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
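If you go the Jupyter Notebook route, a pandas sketch along these lines handles the normalization and deduplication. The file names and the "url" column are placeholders for whatever your exports actually contain.

```python
import pandas as pd

# Placeholder file names; each export is assumed to have a "url" column.
sources = ["archive_org.csv", "moz.csv", "gsc.csv", "ga4.csv", "logs.csv"]
urls = pd.concat(
    [pd.read_csv(path, usecols=["url"]) for path in sources],
    ignore_index=True,
)["url"]

# Light normalization: trim whitespace, drop fragments, strip trailing slashes.
urls = (
    urls.astype(str)
        .str.strip()
        .str.replace(r"#.*$", "", regex=True)
        .str.rstrip("/")
)

deduped = urls.drop_duplicates().sort_values()
deduped.to_csv("all_urls.csv", index=False)
print(f"{len(deduped)} unique URLs")
```

Depending on your site, you may also want to strip tracking parameters or unify http/https and www/non-www variants before deduplicating.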

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
