136 points by tndibona
16 days ago
Zillow publicizes a higher granularity (~90 MSAs and weekly) of this data at https://www.zillow.com/research/data/
> Median List Price: The median price at which homes across various geographies were listed.
> Median Sale Price: The median price at which homes across various geographies were sold.
> Sale-to-List Ratio (mean/median): Ratio of sale vs. final list price.
> Percent of Sales under/over List: Ratio of sales where Sale Price below/above the final list price; excludes homes sold for exactly the list price.
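For anyone who wants to reproduce those metrics from raw sales, here is a minimal sketch. The input shape (a list of sale price / final list price pairs) and the function name are assumptions for illustration, and the percentages are computed against all sales, which may not match Zillow's exact denominator.

```python
from statistics import mean, median

def sale_to_list_metrics(sales):
    """sales: list of (sale_price, final_list_price) tuples (hypothetical input shape)."""
    ratios = [sale / lst for sale, lst in sales if lst > 0]
    over = sum(1 for sale, lst in sales if sale > lst)
    under = sum(1 for sale, lst in sales if sale < lst)
    total = len(sales)
    return {
        "mean_ratio": mean(ratios),
        "median_ratio": median(ratios),
        # Homes sold exactly at list price count toward neither bucket.
        "pct_over_list": over / total,
        "pct_under_list": under / total,
    }

# Made-up sample values, not real listings.
sample = [(520_000, 500_000), (480_000, 500_000), (500_000, 500_000), (330_000, 300_000)]
metrics = sale_to_list_metrics(sample)
```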
Zillow's data packaging is undoubtedly better than ours (analytics isn't yet very core to our business) but if you'd like to see rent values statically associated to MSAs and ZIP codes on a map, we publish that at https://dwellsy.com/rentmaps
This could really use filters for # rooms and # baths.
Wow, this is really nicely made!
Hello there. OP here! Your site is similar. The UI is better than mine. Can we exchange notes?
Gladly. My email is in my profile.
I have a Colab notebook that will generate a report with property trends for any MSA in Zillow's dataset:
It uses the Zillow data set mentioned above. I would suggest giving it a shot as well, the graphs are rudimentary but they work well enough for me. :)
Hey, thanks for this. I’m looking at it. But I’d still have to download spreadsheets, and I think the data is aggregated by regions bigger than zip codes. I’m hoping my site offers a simple and straightforward experience.
My project will average out the data on recent "sale-price over-or-under the asking-price" for an entire zipcode. This should produce a report showing you how things are trending. I hope this will help people make an informed decision.
1.This works for the USA only.
2.The report should take up to a minute to display as the data is being scraped from Zillow.
3.If the site does go down, it is due to the volume of traffic from reddit. Then I'm going to need to scale up my servers. I will try and stay on top of this.
4.If no results are obtained, Zillow probably didn't have "sales" data for that zip code, or the connection to Zillow failed. I would just retry.
I'm looking for feedback on these points:
1. How do I monetise without adverts? I would hate to display ads and kill the experience.
2. Do you want similar trends? E.g. I could also show "Average days on the market for properties", "Similar zip codes for your budget", etc.
3. Would you be willing to register and voluntarily give up some data? I would like to know where you are searching from. It will help me build a database of "Where are people from and where are they looking to move?"
4. Anything that will help me improve is good.
Can you recommend a zip code to try? I tried several in Northern Colorado and got "We did not get any results".
One way you could monetize would be to make deals with MLSes for integration into their services, either for realtors or for consumers. Can't tell how appropriate it would be for realtors, I assume it's more oriented towards consumers.
If there's some way to get to see Northern Colorado there, I can send it around to some POI in the local MLS and see what they think. Again, you might not be looking for MLS feedback.
Disclaimer: I run coloproperty.com and the MLS. :-)
> I run coloproperty.com and the MLS
How did you find yourself running those? What's your story if you don't mind me asking?
I've known many people who got into real estate (just getting a realtor's license) on the side as a way to make some extra money, but to me the appeal has always been having access to the "raw" MLS information / hearing about properties before anybody else.
Do you have any gut feeling for what percentage of properties never hit the market because of connections like these?
Can we connect? I will look up your info from your website.
I replied by e-mail to the address on Pillr.
Not working for me
Could you try now? I made the Hacker News front page, so my site is getting hit with traffic and the proxy service's data limit maxed out. I’ve just topped up the account.
It works now
You're likely breaking Zillow TOS, so expect a C&D before you monetize
I think scraping websites that don't require authentication is legal in the US nowadays.
Monetizing data acquired through scraping could run afoul of copyrights, so the love letters would probably only arrive once OP attempts monetization.
>I think scraping websites that don't require authentication is legal in the US nowadays.
The LinkedIn lawsuit resolved to that conclusion, though I think LI is appealing again.
It basically says the site can't take legal action against scrapers of public (non-auth protected) info. Still, there's nothing that says they have to make it easy for you, to include deploying anti-scraping measures, rate limits, redesigns and moving data behind auth.
I thought that the USA doesn't allow copyrighting facts, which to me numeric values certainly are.
OP here! Yes, I don't know the law well, but common sense tells me that it's one thing to copyright a brand, logo, movie, or anything creative, and it's another to copyright numbers in a database.
The only argument Zillow can make against me (imo) is that I'm being a burden on their servers at scale.
I would be very happy if my project gains attention from a product manager from Zillow and they say “Why don’t we just let him pay for the data”. I’m unable to get their attention and they’re very cryptic about their GraphQL API spec.
Hell yeah brother, greyhat on.
Alas, you'll likely hear from a lawyer before a PM, but good luck nonetheless.
It's usually the PM who prompts the lawyer, "hey that person is stealing my limelight, please fix it".
Yeah. I feel like Zillow will pretend like all this data somehow belongs to them when it belongs to the public.
IANAL, but just because it is public data doesn't mean that Zillow's aggregation and capture of the data can't be subject to copyright.
Good point. IANAL too, but I can’t imagine copyrights on work that didn’t even belong to Zillow. It’s not creative content like a movie. They are just collecting data off existing sources.
I could also argue that I’m redistributing data computed by my web app, like the average, the median, and the graph. Then I’d have to hide the table of Zillow source results.
Facts by themselves are not copyrightable, but certain arrangements of facts may be.
If this is the case, why not go to the source of the data? (I think it’s because the source of the data is not guaranteed to be free but hitting Zillow’s public infra is)
The problem is the source isn’t one place. The source is a collection of 500 MLSes across North America. They have a vested interest in keeping the data accessible only for licensed realtors. Striking a conversation with 500 MLSes would be a big undertaking, I would need contacts and connections in the real estate world.
Hmm, the sources that Zillow and other realtor companies get their data from are local governments. They all offer tax, ownership info, last sale price, etc. They also charge differing amounts, from free (if you walk into their records office) to a cost per dataset to, at the costly end, a cost per data request. Then there’s the sheer number of locales you must query for this info, typically on a per-county basis. So you head to a data broker who has collected all of this information, but they charge for their time and effort, which is $$$. That’s why I mentioned Zillow’s public servers are cheaper.
Contacts and connections are easy. Money is harder :)
Might be worth noting: There is a DOJ antitrust lawsuit against NAR over MLS practices, though I don't know if your use-case is part of that.
> How do I monetise without adverts?
I found this interesting, and was immediately curious for more: what the data was like going back more than a month, if there are any correlations to number of bedrooms, school districts, or the like, etc. etc. I could imagine you coming up with very interesting auto-generated 40-50 page reports.
I would not pay for this now, but when I was on the real estate market last year I might have. You could target individual buyers or sellers ($5-20 or so for a single report), real estate agents (some sort of subscription), or both -- don't know what the best strategy is.
As a prototype, the data is currently fetched from Zillow and cached. I’d have to start saving data to disk and run queries for what you’re asking. That is possible; I’m trying to buy time to build this out. I also need to attract investment. Thanks for your kind words.
> How do I monetise without adverts? I would hate to display ads and kill the experience.
I'd say go ahead and display ads, just do it without being obnoxious. Limit ads to small static images hosted on your own domain or text links, keep them unobtrusive, and don't use Google or any other evil surveillance capitalism company.
Because the site is driven by zip code you've got a great opportunity to provide ads relevant to the area and not the specific user. Anyone who finds reasonable ads so off-putting that it would drive them away from your service will be using ad blockers anyway.
Personally, I never register for websites if they require an email address unless they have a need to know my identity to accomplish what I want from them. Your site wouldn't meet that requirement so no registration for you. I'll sign up for some sites that need a registration but don't require a legit email address but they don't get my real info (fake names, addresses, etc)
Remember that any information you collect you then have to retain, back up, secure, etc., and it puts you on the hook for reporting data breaches and for gathering and handing all of that information over to governments and other parties in response to court orders. Never collect or store any more data than you absolutely need to and you'll have so much less to worry about.
If you really really want to make money without ads, limit the number of accesses to something sane and charge the people who hammer your site day after day or week after week some fee for the trouble. They obviously find your service useful so maybe a small fee would be worth it to most of them. Maybe you can even offer to automatically push the kinds of data they keep requesting over and over again to them in some way.
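A rough sketch of that "limit free accesses, charge the heavy users" idea. The class name, the per-day limit, and the notion of a client ID are all assumptions for illustration; a real deployment would keep the counters somewhere shared (the OP mentions Redis elsewhere in the thread) rather than in process memory.

```python
from collections import defaultdict

class DailyQuota:
    """Allow a fixed number of free reports per client per day (sketch only)."""

    def __init__(self, free_per_day=5):
        self.free_per_day = free_per_day
        self._counts = defaultdict(int)  # (client_id, day) -> reports served

    def allow(self, client_id, day, is_paying=False):
        if is_paying:
            return True  # paying users are never throttled in this sketch
        key = (client_id, day)
        if self._counts[key] >= self.free_per_day:
            return False
        self._counts[key] += 1
        return True

quota = DailyQuota(free_per_day=2)
results = [quota.allow("client-a", "2022-06-01") for _ in range(3)]
```

Here the third request on the same day is refused, while the counter starts over on the next day.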
>> How do I monetise without adverts?
I think folks who are house hunting will find this useful when looking for comps (they will be getting information for multiple if not all houses in their target zip code in a table; they can easily digest the information in the table). So you could potentially charge a small fee (say $5 - $20) for house hunters to run this query and have the information available to them (say a link) for a month or so. During that period, they can click on the custom generated link and it will give them the most recent result.
Same thing for 'Average days in the market' but this will be targeted at those looking to sell.
The trick would be where to find these folks so they know about your service. Maybe advertise in housing forums or reddit (find the housing related forums)
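The month-long custom link described above could be done with a signed, expiring token, so no per-user account is needed. This is only a sketch: the token layout, `SECRET_KEY`, and the function names are made up for illustration.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; load from configuration in practice

def make_token(zip_code, expires_at):
    """Return 'zip:expiry:signature', where the signature covers zip and expiry."""
    payload = f"{zip_code}:{expires_at}"
    sig = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def verify_token(token, now):
    """Return the zip code if the token is authentic and unexpired, else None."""
    try:
        zip_code, expires_at, sig = token.split(":")
        expiry = int(expires_at)
    except ValueError:
        return None  # wrong number of fields or non-numeric expiry
    expected = hmac.new(
        SECRET_KEY, f"{zip_code}:{expires_at}".encode(), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return zip_code if now < expiry else None

THIRTY_DAYS = 30 * 24 * 3600
token = make_token("19130", expires_at=1_000_000 + THIRTY_DAYS)
```

The token would go into the link as a query parameter, and the report page would call `verify_token` with the current Unix time before rendering.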
The secure link is an awesome idea, thanks
House hunting bot. User specifies area by drawing on a map. Filter criteria like number of bedrooms, driving minutes to any particular address, etc.. Display as a sortable table.
> 1. How do I monetise without adverts? I would hate to display ads and kill the experience.
Well, selling it to Zillow is probably a natural fit. Or some other realtor org, but you'd probably have to change the source of your data.
Hire a sales guy/team to sell this as a tool to realtors and wholesaling companies.
They spend a ton of money on market analysis products because most of them don't have any sort of background that helps with it.
If it helps them get 1 more client or help a listing sell for just a couple % more it has paid for itself.
Button it up a bit, paywall it, run every zipcode as a job and cache the fully rendered html report the end user sees.
You should look at 94941, there's some sort of anomaly.
Thanks for the report. I found the problem. Look here https://www.zillow.com/homedetails/14-Eton-Way-Mill-Valley-C...
If you scroll down you’ll see they made an error listing the property at 389 million instead of 3.89 million. I’ll have to use some upper/lower limits in the graph to filter out outliers like this. It’s hard to pick a number; some distress sale properties get sold for very low prices.
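One way to avoid hand-picking fixed limits is to derive them from the data itself. The sketch below uses the median absolute deviation (a technique swapped in here, not what the site does); the 3.5 threshold is a common rule of thumb and an assumption, and a very cheap distress sale could still be filtered out by it.

```python
from statistics import median

def drop_outliers(prices, threshold=3.5):
    """Drop prices whose modified z-score, based on the median absolute deviation, exceeds threshold."""
    med = median(prices)
    mad = median(abs(p - med) for p in prices)
    if mad == 0:
        return list(prices)  # at least half the values are identical; nothing sensible to scale by
    return [p for p in prices if 0.6745 * abs(p - med) / mad <= threshold]

# Made-up prices with one mis-keyed value (389 million instead of 3.89 million).
list_prices = [3_450_000, 3_890_000, 4_100_000, 2_950_000, 3_600_000, 389_000_000]
cleaned = drop_outliers(list_prices)
```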
re #4 - tiny UX improvement: could you add a retry button? I keep waiting a minute, clicking return to homepage, retyping the zip, etc.
edit: One other thing, it showed "Loading property 6 of undefined..." then back to "We did not get any results" which seems quirky.
Thanks for the feedback. My original plan was to load the data “live” as it is obtained, using web sockets, but that damn socket.io is proving to be unreliable. I’ll end up doing something like you recommended.
As for the “loading property of undefined”, thanks for the report. A code cleanup is long overdue.
> 1.This works for the USA only.
I'm pretty sure the splash image is London, UK. Very confusing!
Fun fact. I used a beautiful image I took of Córdoba, Spain, from the top of the Mezquita as the backdrop. Someone on Reddit called me a Brazilian slum lord :(. So I put up this San Francisco stock image.
And you'd be wrong! It appears to be from San Francisco (unless we are talking about a different picture?)
Right you are!
Monetisation: what about automated appraisal?
Like a Zestimate?
I think they got overloaded or too many queries on their backend. No results for several different zip codes in my big city.
OP here, experiencing the HN kiss of death. Load balancer 502s. I’ve increased the number of Docker containers across the entire stack. Hopefully it copes with the load. My AWS bill will be hiiiigh
It's working for me now 24 hours later. The results are interesting, thanks for putting this up there.
same. I think today, OP learns about caching ;)
Already using Redis for the cache. I’m using a rotating proxy to bypass Zillow captcha, they’re not able to keep up with the load
It'll be sad if they figure out who you and your bots are, and then deny you specifically access. Got backup plans for when you get zero results for every query because of that?
Yes lol. Realtor.com is also scrapable so that’s the immediate backup. Long term I’d have to pull CSV files over FTP from a bunch of archaic MLS integrations. If I need to go that route, I need money.
To me, that seems a much simpler route than screen-scraping (yes, despite problems parsing CSV format).
I've been involved with screen-scraping of req's for <major airplane company> to produce spreadsheets for others in the same company, and I'll take FTPing the data->CSV almost any day.
* Especially when the FAA is the eventual consumer *
The bigger challenge is not the FTP integration per se. It’s the fact that all the housing data is distributed across nearly 500 MLS sites that closely guard their data and make it available only to licensed realtors. I would expect each of these MLS sites to have a different format. So striking up a conversation and getting them to let me use their FTP site is the biggest hurdle. Zillow and the rest of the big guys already have their claws deep into the MLS sites and they probably want to closely guard their pipelines. Small guys like me don’t know where to start.
I do integration work for healthcare; you want an interface engine to pull the data from the ftp site, then translate/convert it into your desired format.
We do that all the time.
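To make the translate/convert step concrete, here is a tiny sketch of mapping differently shaped CSV feeds onto one schema. The feed names and column names are invented for illustration (real MLS exports will differ), and the FTP download itself is left out.

```python
import csv
import io

# Hypothetical per-feed column mappings: canonical field -> column name in that feed.
FEED_MAPPINGS = {
    "mls_a": {"zip": "PostalCode", "list_price": "ListPrice", "sale_price": "ClosePrice"},
    "mls_b": {"zip": "zip_cd", "list_price": "asking", "sale_price": "sold_for"},
}

def normalize(feed_name, csv_text):
    """Convert one feed's CSV rows into a common schema with numeric prices."""
    mapping = FEED_MAPPINGS[feed_name]
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "zip": raw[mapping["zip"]].strip(),
            "list_price": float(raw[mapping["list_price"]]),
            "sale_price": float(raw[mapping["sale_price"]]),
        })
    return rows

# Made-up sample row in the "mls_b" layout.
sample_b = "zip_cd,asking,sold_for\n80525,450000,465000\n"
records = normalize("mls_b", sample_b)
```

Adding a new MLS would then mostly mean adding one more entry to the mapping table, as long as the feed really is plain CSV.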
What’s your favorite tool for this? Just starting to get my feet wet in this world.
But I want to see Realtor.com vs Zillow numbers (vs MLSListings vs Redfin vs the others...). It's not like any of these are ground truth. How much are the discrepancies, where are they and why?
Also, is zipcode the lowest level of granularity (presumably not)? Do any of these services have finer granularity?
Or just get an injunction.
Seems like an uphill argument as long as the data isn't behind a login and you're not DDoSing the site.
> they’re not able to keep up with the load
Zillow can absolutely keep up with the load. Your proxies are being blocked by the WAFs that Zillow utilizes, usually PerimeterX. Also, based on your other comments, you have no idea how to architect a web application that can handle the trickle of HN traffic, and I'd wager any problems are more likely on your side.
I assumed the "they're" referred to the "rotating proxies" not "zillow".
That is correct. I meant the proxy service; from the logs, it’s the one that is slow to respond.
I mean, I think I have some idea. The problem is more on the affordability side. I had to scale up the containers in my cluster to cope with the increase in 502s. I live with the fear of a high AWS bill. Then there is the question of the proxy rotator’s data limit being breached, which happened! I topped it up for now. Now I’d rather spend time asking Zillow to allow me to exist than trying to architect a more efficient proxy mechanism.
This site could run on a $5/mo VM without issue. Proxies that charge for bandwidth can get expensive, sure, but that’s a separate issue. Waiting on a response from Zillow/Proxies requires nearly zero cpu on your servers and it should be a background/async process. There’s only ~40k zip codes and only ~10k that remotely matter for even very small towns.
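A sketch of the background/async idea: while a request is waiting on the network the server does almost no work, so many fetches can share one small machine. `fetch_report` is a stand-in for the real upstream call, and the concurrency cap of 5 is an arbitrary assumption.

```python
import asyncio

async def fetch_report(zip_code):
    """Stand-in for the slow upstream request (hypothetical); real code would call the scraper here."""
    await asyncio.sleep(0.01)  # simulates waiting on the network, which uses almost no CPU
    return {"zip": zip_code, "listings": []}

async def fetch_many(zip_codes, max_concurrent=5):
    """Fetch reports concurrently while capping the number of in-flight requests."""
    limit = asyncio.Semaphore(max_concurrent)

    async def guarded(zip_code):
        async with limit:
            return await fetch_report(zip_code)

    # gather returns results in the same order as the input zip codes.
    return await asyncio.gather(*(guarded(z) for z in zip_codes))

reports = asyncio.run(fetch_many(["19130", "94941", "84058"]))
```

Note this only covers the waiting part; running a headless browser per request is a separate CPU and memory cost.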
Find a shared proxy provider that doesn’t limit bandwidth. It’s usually about $0.50/mo per IP. Not that any of this really matters, you’ll still get blocked by Zillow. It’s not an IP limit anyway, it’s a TLS fingerprinting or JS-based fingerprinting issue that is getting you blocked.
I'm using Puppeteer with Chromium to mimic a browser, with the headers changed to look like a real user. The proxy service only handles the data transport. Running a browser cluster at scale costs me CPU and memory (my peak yesterday on HN was 40-100 requests a second). Direct GET / POST requests to Zillow didn't work. I'm not sure a $5 per month VM will cut it. Do you have the details on a provider that will give me a $5/mo deal for this setup?
Check out residential proxy services. Some of them have apps users install, others may be more... nefarious. They're cheap and you can rotate through them. Some may have clients that are a pain to work with, some may be SOCKS access.
https://www.webscrapingapi.com/top-residential-proxy-provide... obviously SEO'd page, but it has useful info.
Can't speak for Zillow, but some API endpoints don't allow you to cache results you obtain from them, license-wise.
API? You’re funny, I’ve been begging Zillow for their API and no response. I’m scraping using HTML parsers.
Good luck with this. Scraping may be legal, but redistributing the data probably isn't. I expect the real estate cartel protects their MLS data pretty aggressively, otherwise there would be dozens of sites like this already. https://apnews.com/article/technology-lifestyle-business-pro...
Read the whole article. How can I compete with a cartel? I’m doomed. Taking this idea as-is and switching to the British market is also something I’m thinking about.
This sort of data is reasonably accessible in the UK, although perhaps not quite to the same degree as your app. Either way, I suspect the value-add is less than it would be in the US.
Rightmove, the primary property app in the UK, gives you the previous price the properties were purchased for, although perhaps not analytics across an area per se.
Do you think region-wide trends would be well received by the English public? I'm exploring similar needs in other countries where data is more readily available.
Homeowners who are selling don't want their data locked up. This should be a class-action suit. I bet there are a dozen attorneys who'd take on this case. You may not make out personally, but a condition can be the opening of listing data. Its locked nature should be considered problematic at best for fair housing rules.
Thanks for your kind words
The question is, after this superb yak shaving, did you actually buy a house?
Rofl. No! Got priced out after having bid 35k over asking price. Because you know, people have more money… I’m renting in NJ right now. Hoping this side gig leads somewhere
Great Q&A here. Good luck, OP.
"Generating the report for 84058, this will take a while!" contains two independent clauses. They should not be separated by a comma. They should be separated by a period. It should read "Generating the report for 84058. This will take a while!"
Thanks. I’ll fix it.
Very nice! Properties in the area I’m interested in are now selling for 7% below asking price whereas a year ago they were getting 30% above asking price. On a different site, several properties that I saw sell in the past few months have relisted. It’s obvious they overbought or were speculating and are trying to get out now.
I can’t wait for the business study of how much wealth the Property Brothers end up destroying - should be much more interesting than the story of the guy who stopped putting olives on the airline salads.
I’m glad this project helped!
How is this different from Redfin's sale-to-list data, and why do people use Zillow instead of Redfin when superficially at least Zillow appears to be wrong about everything?
I’m hoping the message and presentation make the difference. It was hard for me to follow the multi coloured ratio lines on the graph.
You should add a column to show the closing price difference as a percentage of the asking price and include this in the summary value statement at the top of the page.
I have a toggle button to switch between "dollars" and "percentage". Does it not do what you're saying?
It does but I guess it’d be nice to see both at the same time.
Yeah I see your point. Hard to strike a balance between a simple UI and a bloated dashboard. This one is for the discussion board.
Neat, just tried it. I think the daily trend is too granular. Twice a month or maybe monthly would show a stronger trend.
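A sketch of what the monthly roll-up could look like, assuming the daily series is available as (date, fraction over asking) pairs. That input shape and the function name are assumptions, and the median is just one reasonable choice of aggregate.

```python
from collections import defaultdict
from datetime import date
from statistics import median

def monthly_median(daily_points):
    """daily_points: iterable of (date, pct_over_asking); returns {(year, month): median}."""
    buckets = defaultdict(list)
    for day, value in daily_points:
        buckets[(day.year, day.month)].append(value)
    return {month: median(vals) for month, vals in sorted(buckets.items())}

# Made-up sample values, not real sales.
points = [
    (date(2022, 5, 3), 0.08), (date(2022, 5, 17), 0.12), (date(2022, 5, 29), 0.10),
    (date(2022, 6, 2), 0.03), (date(2022, 6, 20), -0.01),
]
trend = monthly_median(points)
```

Switching to twice a month would only need a different bucket key, for example adding whether the day falls in the first or second half of the month.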
As I already knew, houses in this zip are often selling $300k above list, and some are pushing $800k. Yea inflation! cough
Thanks for your words!
I have reached the Hacker News front page, which is a lifetime achievement for me. I just hope someone from Zillow takes notice and has a constructive conversation with me.
Just an FYI, Zillow is litigious and your entire service goes against their ToS. They will sue if this takes off and doesn't help them somehow.
Context for that: Zillow is not the original owner of the data. The data is owned by the several hundred regional MLSes (note the watermarks in any picture you see on the site).
Data use agreements with those organizations require that anyone using it make efforts to prevent its being scraped.
Source: have signed a DUA with an MLS myself for academic research.
OP here! Good insight. Do you have contacts with the MLS? I'd like to use them instead of relying on Zillow. How can I contact you?
I tried a couple of zip codes and it works pretty well. There were a couple of glitches in rendering the results, but I like it. It’s a useful tool.
One suggestion is that 30 days is not enough for sparsely populated areas.
Maybe add the option for 3 months?
Do you intend to make the dataset available? I've attempted to do something similar, but had difficulty, would love access to the dataset.
Please reach out at “info at pillr.io”. FYI, I’m a single person, not a big company.
Zillow published a sales price for my house that was 33% greater than the actual sale price. I don't trust Zillow's price data.
Probably the broker added the price incorrectly, or the public county data was keyed incorrectly.
That is probably correct. There is another comment in this post about this, and I did some investigation. The 3.89 million dollar property was erroneously entered as 389 million.
Could you add inputting multiple ZIP codes (or just a city)? The data for just one seems too noisy once you start playing with the slider.
Good suggestion. Noted for future work
It seems the only filter is by square footage to narrow results.
Are there plans to filter by more params?
I can do a lot more. But this whole thing is a house of cards because I’m relying on Zillow, who can shut me down anytime. I need one of the Zillow / Redfin / MLS companies to agree to let me formally use their data. Then I can start building the good filters and a fancy dashboard. Getting their attention has been futile so far. Just cold email responses at best, or no response at all.
Looks similar to RealtorStats.org
Yep, site claims 107% over list price. It tracks for Nashville. Love this site.
Can you do by country and zip code? Would like to see Canada there.
Thanks for the request. Could you give me the best sources in Canada similar to Zillow? I can explore whether I can scrape them.
What is the legality of scraping those sites?
Afaik, a US appeals court has ruled, in a similar case involving LinkedIn, that it is not illegal to scrape public data. As for me being a pain for Zillow’s bandwidth - I’ve requested them umpteen times to allow me to pay for their API. They simply ignore me. I’ve found a way around their captcha. In the long term, I have to obtain the data directly from the MLSes.
Would you be willing to share how you get around their captcha?
It's perfectly legal!
That being said, you still need to be resource-polite. A lot of people scrape Zillow through browser automation toolkits like Selenium, Puppeteer etc. because it's a JS heavy website and these tools are really bandwidth intensive. This could, in theory, get you in trouble for DDOS.
Instead, since Zillow is using Next.js for their backend, you can actually retrieve the dataset for any page just by parsing the Next.js cache. This can be done by selecting the data in the <script id="__NEXT_DATA__"> node, which requires minimal resources from both sides, e.g. in Python:
import json
import httpx
from parsel import Selector

response = httpx.get("https://www.zillow.com/b/1625-e-13th-st-brooklyn-ny-5YGKWY/")
# ::text selects the JSON inside the script node rather than the whole <script> element
script_data = Selector(text=response.text).css("script#__NEXT_DATA__::text").get()
script_data = json.loads(script_data)
# all of the property data is here, for example building details:
Outstanding tip. Thank you VERY much.
eager to try, tried a couple places in NYC, got "Oops! We did not get any results" :(
Why not add this as placeholder value directly to your search input field? “Try 19130”. For those who come to the site without reading the comments here.
Good suggestion. Next iteration will surely have this
For the iteration after that: if there are no results, maybe show a similar report that's already cached, so a valid zip never comes back empty? And/or store stale data along with a last-accessed date range.
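A sketch of the stale-data fallback: keep the last good report with its fetch time, and serve it (flagged as stale) when a refresh fails. The in-memory dict stands in for whatever cache the site really uses, and `ReportCache` and its return shape are invented for illustration.

```python
import time

class ReportCache:
    """In-memory cache that can serve stale entries when a refresh fails (sketch only)."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}  # zip_code -> (fetched_at, report)

    def get(self, zip_code, fetch, now=None):
        """Return (report, fetched_at, is_stale); fetch is any callable taking a zip code."""
        now = time.time() if now is None else now
        cached = self._store.get(zip_code)
        if cached and now - cached[0] < self.ttl:
            return cached[1], cached[0], False  # fresh hit
        try:
            report = fetch(zip_code)
        except Exception:
            if cached:
                return cached[1], cached[0], True  # refresh failed, fall back to stale data
            raise
        self._store[zip_code] = (now, report)
        return report, now, False

def failing_fetch(zip_code):
    raise ConnectionError("upstream unavailable")

cache = ReportCache(ttl_seconds=60)
first = cache.get("19130", lambda z: {"zip": z}, now=0)
second = cache.get("19130", failing_fetch, now=120)
```

The `fetched_at` value is what the page could show as "data last updated on ...", so visitors know the report is old.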
I wish you the best of luck getting the blessing of MLS, I had an idea for something similar and couldn't get anything useful.
Also I think Zillow is litigious as part of their agreement to have access to the data (and proprietary reasons obviously)
"over the asking price" is a useless metric, might be more interesting to elevate $/sqft.
I would disagree with this statement. It's really interesting in markets like the bay area from an individual consumer perspective
why though? genuine question.
Sometimes sellers over- or underestimate their home. Over asking price doesn't indicate how a market is doing. It indicates other things: are sellers underestimating their home to start bidding wars? How delusional are some sellers about the price of their home? Etc. These are interesting things to know, but as a buyer, I'm more interested in putting in the right offer at the right time, and there's no better indicator than $/sqft for a specific location. Of course it varies (how many bedrooms? how many bathrooms? other cool amenities?) but this could be a subfilter of this product.
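For illustration, the $/sqft metric with a bedroom subfilter could be as small as this. The record shape (dicts with `sale_price`, `sqft`, `bedrooms`) is an assumption, not the site's actual data model.

```python
from statistics import median

def median_price_per_sqft(sales, bedrooms=None):
    """sales: list of dicts with sale_price, sqft, bedrooms (hypothetical shape)."""
    values = [
        s["sale_price"] / s["sqft"]
        for s in sales
        if s["sqft"] > 0 and (bedrooms is None or s["bedrooms"] == bedrooms)
    ]
    return median(values) if values else None

# Made-up sample sales.
sales = [
    {"sale_price": 600_000, "sqft": 1500, "bedrooms": 3},
    {"sale_price": 450_000, "sqft": 1000, "bedrooms": 2},
    {"sale_price": 880_000, "sqft": 2000, "bedrooms": 3},
]
overall = median_price_per_sqft(sales)
three_bed = median_price_per_sqft(sales, bedrooms=3)
```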
This is super cool
(Saying this in the voice of Austin Powers) - Athankyu
Redfin has this and quite a lot more on their Market Insights feature.
Yep, I’m aware of this, but I’m trying to beat them with a simple UI and straightforward messaging.