Ask HN: Best practices for technical SEO for domains with more than 100k pages?

3 points by brntsllvn 2 months ago

By "technical SEO" I mean SEO problems that require software development.

What's special about domains with 100,000+ pages (e.g. niche search engines and large e-commerce sites) is that most basic SEO advice is necessary but far from sufficient.

I'm on a project with 1,000,000 pages and only about 50,000 are indexed. My "crawl budget" is absolutely small, but relatively large. How do I make sure the "right" content gets indexed? How does Google even define "right"?

I seem to be writing a lot of code to answer basic questions like "which of my 1,000,000 pages are indexed?" and "which pages should I be actively seeking to index and de-index?"
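
One way to answer the "which pages has Google even crawled?" half of that question is to diff the full URL inventory (e.g. from your sitemaps) against Googlebot requests parsed out of server logs. A minimal sketch; the URL set and log lines are made-up sample data, and in production the user-agent check should be backed by a reverse-DNS lookup, since anyone can claim to be Googlebot:

```python
import re

# Made-up sample data: the full URL inventory (e.g. from your sitemaps)
# and raw access-log lines in a common log format.
all_urls = {
    "/listing/1", "/listing/2", "/listing/3", "/search/honda-civic",
}
log_lines = [
    '66.249.66.1 - - [10/Oct/2022] "GET /listing/1 HTTP/1.1" 200 "Googlebot/2.1"',
    '66.249.66.1 - - [10/Oct/2022] "GET /listing/3 HTTP/1.1" 200 "Googlebot/2.1"',
    '203.0.113.9 - - [10/Oct/2022] "GET /listing/2 HTTP/1.1" 200 "Mozilla/5.0"',
]

def googlebot_paths(lines):
    """Request paths from log lines whose user agent claims to be Googlebot.
    (In production, verify the claim via reverse DNS; the UA string alone
    can be spoofed.)"""
    paths = set()
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = re.search(r'"GET (\S+) HTTP', line)
        if m:
            paths.add(m.group(1))
    return paths

crawled = googlebot_paths(log_lines)
never_crawled = all_urls - crawled
print(sorted(never_crawled))  # → ['/listing/2', '/search/honda-civic']
```

Crawled-but-not-indexed is a separate question (GSC's coverage report or URL Inspection API territory), but the never-crawled set is the cheapest signal to compute yourself.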

If you work on "technical SEO" for something like Stack Overflow or Amazon I'd love to learn from you.

And if these are the wrong questions to ask, I'd especially like to learn from you.

vgeek 2 months ago

I'm curious to hear other people's takes on this, too.

How unique is the content on each page? What do the "crawled, not indexed" vs. "discovered, not crawled" ratios look like relative to indexed pages? What does linking to the detail pages look like (category pages, similar items)? How much authority do the category pages have in terms of backlinks? Any page-speed issues? In GSC, can you keep the average download time under 100 ms (it makes a huge difference)?

Noindexing may help, if you initially handle it based on page-content uniqueness and/or the page's presumed MSV (monthly search volume) potential. Otherwise, judging by your description, can you request a quota increase beyond the default 200 publish notifications per day for the Indexing API?
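
For the uniqueness-based triage, a simple starting point is near-duplicate detection over page text. A sketch using word shingles and Jaccard similarity; the page texts are toy data, and at a million pages you would want MinHash/LSH rather than this pairwise comparison:

```python
def shingles(text, k=4):
    """Word k-shingles of a page's visible text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Set-overlap similarity in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy data: listing pages with their visible text.
pages = {
    "/listing/1": "2015 honda civic lx low miles clean title great condition",
    "/listing/2": "2015 honda civic lx low miles clean title great condition",
    "/listing/3": "1987 toyota land cruiser fj60 rebuilt engine new paint",
}

# Flag any page that is a near-duplicate of a page we already kept.
kept, noindex = {}, []
for url, text in pages.items():
    s = shingles(text)
    if any(jaccard(s, other) > 0.8 for other in kept.values()):
        noindex.append(url)
    else:
        kept[url] = s

print(noindex)  # → ['/listing/2']
```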

If you like reading about the topic, maybe head to the BlackHatWorld (BHW) forums (very poor SNR, though) and read about what some of the more unscrupulous people are doing to index millions of pages per month, then extract some of the less-risky strategies (e.g. don't abuse the news/job-posting APIs to get spam content indexed).

As additional commentary, the last 2-3 months have been chaos in terms of algorithm updates. There was essentially a 30-day period where Googlebot all but disappeared for most sites, then the two HCUs (helpful content updates) and other random updates were tossed in for good measure, stacked with desktop infinite scroll and the switch to mobile-first indexing. I can't remember this many changes happening in a single quarter in over 15 years of being in SEO or SEO-adjacent roles.

  • brntsllvn 2 months ago

    We made the (bad?) choice to move everything to AMP based on a suggestion from SEMRush. This appears to have helped because everything loads very, very fast (definitely achievable without AMP, but we wanted to test the migration effort and impact on indexing). The immediate impact was a 20% bump in indexed pages.

    I use SEMRush, GSC, GA and Mouseflow religiously. That is, regularly and with faith. I don't think they necessarily point me in the right direction.

    I'll check out the forums.

135792468 2 months ago

Finally, an SEO question on HN! While I haven't worked on SO, I've worked on a medical equivalent: a site that started with 1,500 of 2.5M pages indexed, and by the time my contract ended it had over 1.4M indexed.

There are a few factors that you need to adjust for: authority, quality, structure.

Authority is mostly out of your hands within a technical SEO scope, but as more content gets indexed, more links and signals will come in. The real culprits are quality and structure. As others have said, low-quality pages need to be addressed: be ruthless about cutting them now; they can always be worked back in later.

Profile pages, thin-content pages, and duplicates all need to go. Noindex them first, then block them in robots.txt eventually; it also doesn't hurt to set canonicals where applicable. If you jump straight to robots.txt, Google can no longer crawl the pages, so it will never pick up the noindex directive.
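
The ordering matters because a robots.txt-blocked URL is never fetched, so a noindex served on it is invisible to Google. A sketch of the two-phase rollout; the page fields and phase logic here are illustrative, not a real API:

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    low_quality: bool
    confirmed_deindexed: bool  # e.g. verified via GSC's coverage report

def robots_header(page):
    """Phase 1: serve noindex so Google sees the directive on its next crawl."""
    if page.low_quality and not page.confirmed_deindexed:
        return {"X-Robots-Tag": "noindex"}
    return {}

def robots_txt_disallow(pages):
    """Phase 2: only block URLs already confirmed out of the index,
    so crawl budget stops being spent on them."""
    return [p.url for p in pages if p.low_quality and p.confirmed_deindexed]

pages = [
    Page("/profile/alice", low_quality=True, confirmed_deindexed=False),
    Page("/profile/bob", low_quality=True, confirmed_deindexed=True),
    Page("/listing/1", low_quality=False, confirmed_deindexed=False),
]
print(robots_header(pages[0]))     # → {'X-Robots-Tag': 'noindex'}
print(robots_txt_disallow(pages))  # → ['/profile/bob']
```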

As for structure: a lot of people think of folder structure as information architecture, but really it's the linking structure. It's important to know where you're placing emphasis by how your internal links are set up. This is also the best way to surface deep pages, since linking to them brings Google deeper into your site.
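
One way to check where your internal links actually place emphasis is to run a simple PageRank over the internal link graph. A sketch with a toy graph; a real site would build `links` from a crawl of its own HTML:

```python
def internal_pagerank(links, damping=0.85, iters=50):
    """Power iteration over an internal link graph {page: [outlinks]}.
    Dangling pages simply leak rank mass, which is fine for a rough sketch."""
    pages = set(links) | {u for outs in links.values() for u in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:
                continue
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank

# Toy graph: home links to two category pages; each category links to listings.
links = {
    "/": ["/cars/honda", "/cars/toyota"],
    "/cars/honda": ["/listing/1", "/listing/2", "/"],
    "/cars/toyota": ["/listing/3", "/"],
    "/listing/1": [], "/listing/2": [], "/listing/3": [],
}
rank = internal_pagerank(links)
# Pages you want indexed should not sit at the bottom of this ranking.
top = sorted(rank, key=rank.get, reverse=True)
print(top[0])  # the home page accumulates the most internal authority
```

Note how `/listing/3` ends up with more rank than `/listing/1`: its category page splits its links fewer ways, which is exactly the "emphasis" effect described above.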

Also, you can try refreshing dates to see if that helps. It's a short-term tactic, but it works plenty well.

I know this is pretty general, but it's hard to say more without seeing your issues specifically. If I can help you get on the right track, let me know; I'll gladly take a look. I also run a small (free) community with some strong technical SEOs in it who like helping. Not sure how best to connect, but let me know if you're interested and we'll figure it out.

  • brntsllvn 2 months ago

    Definitely interested.

    Brent [at] CarGraph [dot] com

Minor49er 2 months ago

A few questions that come to mind off the bat:

Do you have a sitemap.xml file? Does it include accurate last-updated dates or suggest realistic crawl schedules (e.g. only index static pages every month, but frequently updated pages every day)?
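
For reference, a minimal sitemap generator with `lastmod` values; the URLs are placeholders. `lastmod` should reflect real content changes, or Google reportedly learns to distrust it (and it largely ignores `changefreq`/`priority` these days):

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: list of (url, last_modified_date) tuples."""
    urlset = ET.Element("urlset", xmlns=NS)
    for url, lastmod in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        # W3C date format (YYYY-MM-DD) is what the sitemap protocol expects.
        ET.SubElement(node, "lastmod").text = lastmod.isoformat()
    return ET.tostring(urlset, encoding="unicode")

# Placeholder URLs for illustration.
xml = build_sitemap([
    ("https://example.com/listing/1", date(2022, 10, 1)),
    ("https://example.com/listing/2", date(2022, 10, 9)),
])
print(xml)
```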

Have you run your site through an SEO checker (or at least some of the ignored pages) to make sure that they are accessible and not being blocked/ignored by a crawler? Any server responses or canonical links that may signal the wrong thing to a bot?

Do you have any deep links to your more overlooked pages on the homepage (or at least on your more popular pages)?

Do you have backlinks in the wild that point to the pages that you want to have crawled? Are they marked as nofollow? Do you post links to these pages on any social media accounts?

  • brntsllvn 2 months ago

    We have a sitemap index and several dozen sitemaps based on how we segment our pages (each sitemap is capped at 50k URLs).
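
    A sketch of that chunking step, splitting each segment's URL list into files under the 50k-per-sitemap cap; the file-naming scheme is made up, and the limit is shrunk to 2 for illustration:

```python
def chunk_sitemaps(urls_by_segment, limit=50_000):
    """Split each segment's URLs into sitemap files of at most `limit` URLs,
    returning {file_name: urls} for a sitemap index to reference."""
    files = {}
    for segment, urls in urls_by_segment.items():
        for i in range(0, len(urls), limit):
            name = f"sitemap-{segment}-{i // limit + 1}.xml"
            files[name] = urls[i:i + limit]
    return files

# Tiny illustration with a limit of 2 instead of 50,000.
files = chunk_sitemaps(
    {"listings": ["/l/1", "/l/2", "/l/3"], "search": ["/s/honda"]},
    limit=2,
)
print(sorted(files))
# → ['sitemap-listings-1.xml', 'sitemap-listings-2.xml', 'sitemap-search-1.xml']
```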

    We use SEMRush, and that's partly what's driving my questions: their guidance is largely irrelevant for us, since we're a niche search engine.

    We have several thousand backlinks and I wish I had a better strategy here, but our content is broadly unshareable (versus something like Stack Overflow, which is very shareable).

    • Minor49er 2 months ago

      Do you also use something like Google Analytics or Bing Webmaster Tools on your site?

      • brntsllvn 2 months ago

        Definitely. Daily tools include Google Analytics, Google Search Console, Mouseflow, and SEMRush. Bing Webmaster Tools is surprisingly good.

        • Minor49er 2 months ago

          Have you tried Google's Index Coverage report? It may reveal why those pages are not yet crawled, if they are even known about.

          • brntsllvn 2 months ago

            Absolutely, but the reasons are vague at best. The key one seems to boil down to limited crawl budget simply because we don't have reputation. In which case, we need to rethink whether our business can generate shareable content at all. It might not.

            • Minor49er 2 months ago

              What is the business, if you don't mind me asking?

              • brntsllvn 2 months ago

                Aggregated used car listings.

                • Minor49er 2 months ago

                  Just trying out a few searches on Google, I'm seeing a pretty decent number of results from the site. The content could be fleshed out a lot more to get extra value out of the pages, though; the primary focus is clearly just the pricing graphs right now.

                  I noticed that many of the vehicle images didn't load for me. It looks like they're hotlinked from whatever site they're pulled from. It would be a good idea to save thumbnails of the images you find so you can serve them locally and include them for every listing.

                  Since you're scraping car data too, it might be a good idea to get some basic descriptions for each type of vehicle. You might be able to opt into the Kelley Blue Book Developers program or something similar.

                  It would also be good to provide at least some snippets of the descriptions of the vehicles that you're listing. More textual content would give Google something extra to dig into and would keep users from leaving the site and just browsing ebay or Craigslist themselves as often

                  It would be a bigger effort, but having basically a page for each listing locally would be interesting. You could show the original text and posts from the original source. Make the source clear, along with the last time you retrieved the information from it. Then if the original listing gets removed, you could keep it around. Show that it was unlisted or sold, but have a banner at the top that says something like "This listing is no longer available. Click here to look for available [make] [model]s". It could help people to find why one vehicle was cheaper than another (condition, location, etc)

                • RobSm 2 months ago

                  People don't care about such content, and neither does Google.

                  • brntsllvn 2 months ago

                    Firstly, thank you for your time and effort on our behalf; we're grateful for your generosity. I totally agree we should probably have an internal version of the listing info so users stay on our platform instead of bouncing and then having to come back. Good insight.

                    I'll talk with the team about richer data as it furthers our SEO goals. I think the idea here would be that better engagement will improve our ranking? Is that correct?

                    The tricky part is getting impressions and clicks but this could help.

                    • Minor49er 2 months ago

                      > I think the idea here would be that better engagement will improve our ranking? Is that correct?

                      Absolutely. Rather than performing optimizations for SEO, it would be better at this point to add more to the site. Once that's done, post about it in different places like on Reddit. You should start to see more traffic from both web crawlers and people alike

                      • brntsllvn 2 months ago

                        Got it. Really appreciate you taking a look.

                        • Minor49er 2 months ago

                          No problem. Good luck in your endeavor!

sn0w_crash 2 months ago

Hmm, maybe you can structure your internal links so that the high-priority pages you want indexed get 10x more link juice than the rest? That would at least signal to Google which pages you deem more important.

  • brntsllvn 2 months ago

    We've been ruthlessly eliminating low-performing pages (low clicks and CTR) from indexing, but this is manual, and it's unclear whether it matters, since Google may already be doing this for us.
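
    That manual triage can at least be scripted against a GSC performance export. A sketch with made-up CSV data and arbitrary thresholds:

```python
import csv
import io

# Made-up GSC performance export (CSV): page, clicks, impressions.
export = """page,clicks,impressions
/listing/1,120,3000
/listing/2,0,900
/listing/3,2,4000
"""

def noindex_candidates(csv_text, min_clicks=5, min_ctr=0.002):
    """Pages with both low clicks and low CTR despite having impressions.
    Thresholds are arbitrary examples; tune them to your own data."""
    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        clicks = int(row["clicks"])
        imps = int(row["impressions"])
        ctr = clicks / imps if imps else 0.0
        if clicks < min_clicks and ctr < min_ctr:
            out.append(row["page"])
    return out

print(noindex_candidates(export))  # → ['/listing/2', '/listing/3']
```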