One of the features of big data is the accumulation of massive amounts of information that is not well suited to traditional econometric and statistical techniques. I predict that this phenomenon will someday change the way real estate economics is done.
Models of house prices are used in a lot of ways. They are how a lot of cities and counties do “mass appraisals”. They are how house price indexes separate price changes from quality changes. When companies like Zillow generate Zestimates of what a house is worth, these are the models underlying them. They are also how the Bureau of Labor Statistics adjusts housing rents for quality changes and depreciation when computing the CPI.
The general form of these models is to take the log of price as the dependent variable, and housing and neighborhood characteristics as the independent variables. But the list of independent variables that is usually available gives you a relatively limited description of the house. While the exact list varies from data source to data source, it typically looks something like this: building square footage, lot square footage, number of stories, number of bedrooms, number of bathrooms, central air (yes/no), number of fireplaces, garage size, finished basement (yes/no), and of course address. Using this last variable you can get a lot more information about the neighborhood from other data sources.
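To make the general form concrete, here is a minimal sketch of a log-price regression fit by ordinary least squares. Everything in it is an illustrative assumption rather than real data or any particular agency's method: the three characteristics, the coefficient values used to simulate prices, and the helper name `fit_hedonic` are all made up for the example.

```python
import numpy as np

def fit_hedonic(X, log_price):
    """Regress log price on house characteristics with OLS.

    Returns the coefficient vector (intercept first) and the R-squared.
    """
    X1 = np.column_stack([np.ones(len(X)), X])          # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, log_price, rcond=None)
    resid = log_price - X1 @ beta
    ss_res = resid @ resid
    ss_tot = ((log_price - log_price.mean()) ** 2).sum()
    return beta, 1.0 - ss_res / ss_tot

# Synthetic data (assumed, not real sales): three common characteristics.
rng = np.random.default_rng(0)
n = 500
sqft = rng.uniform(800, 3500, n)
bedrooms = rng.integers(1, 6, n)
central_air = rng.integers(0, 2, n)
X = np.column_stack([np.log(sqft), bedrooms, central_air])

# Simulated log prices from made-up coefficients plus noise.
true_beta = np.array([8.0, 0.6, 0.03, 0.05])
log_price = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 0.15, n)

beta, r2 = fit_hedonic(X, log_price)
print("coefficients:", beta)
print("R-squared:", r2)
```

Because the dependent variable is in logs, the coefficient on a yes/no variable like central air reads as an approximate percentage price difference, and the coefficient on log square footage reads as an elasticity.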
These are the basics of what a lot of house price models are dealing with. It might not sound like a lot of information, but in many circumstances you can explain a large share of the variation in house prices using relatively limited information like this. In fact, using just neighborhood indicators and square footage you can usually explain a substantial share of the variation. However, even a model that explains 80% of the variation still leaves a lot unexplained. And intuitively these kinds of variables still leave a lot of information out. This is where I think things will be changing in the future due to what people are calling “smart homes”.
There are a variety of “smart home” products on the market now, but these things are going to get more and more sophisticated soon. The current generation of Roombas just sort of bounce around the room, but other robot vacuums scan the room first and make a plan. The Nest smart thermostat that Google recently acquired pays attention to where you are in your house and when, and remembers this. It’s easy to imagine where this leads in the future: robot vacuums will know the dimensions of every room in the house, the kind of flooring in each, and maybe even the quality and age of the flooring; smart TVs will know which room is the living room; Nest will know which room is the kitchen, which are the bedrooms, and how energy efficient the home is. What’s more, Nest will be able to measure the extent to which the house is fully utilized and therefore how convenient the layout is; in other words, a Walkscore for your home.
And this is just the information that smart homes will bring. There is a whole other set of high-dimensional data coming on neighborhoods. This includes data tracking the paths of bikers and runners, counting potholes, measuring the location and quality of amenities like the real Walkscore measure, and so forth. This provides even wider data, often with both high-frequency spatial and time dimensions. Some of this stuff is being incorporated in analysis already (see for example research from Emily Washington and Eli Dourado on Walkscore). But there is a long way to go in terms of incorporating a lot of high-dimensional property and neighborhood data into house price models.
In short, you can imagine the creation of very, very wide high-dimensional datasets. What do you do with a dataset that “knows” your house with this level of detail? Suddenly variable selection becomes an important part of the challenge, and economic theory may provide little guidance. The use of principal components to create measures of “house quality” from several variables is not new in real estate economics, but that is really just scratching the surface of the machine learning models that could be utilized. Principal components also don’t really capture the potentially highly non-linear and interactive nature of the relationships between these variables.
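For readers who have not seen the principal components approach, here is a minimal sketch of how a single “house quality” index can be built from several correlated variables. The data is simulated and the setup is an assumption for illustration only: four noisy indicators driven by one hidden quality factor, with the helper name `first_principal_component` invented for this example.

```python
import numpy as np

def first_principal_component(X):
    """Return scores, loadings, and explained-variance share of the first PC."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each column
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[0]
    if loadings.sum() < 0:                          # fix the arbitrary sign
        loadings = -loadings
    scores = Z @ loadings                           # one index value per house
    explained = S[0] ** 2 / (S ** 2).sum()
    return scores, loadings, explained

# Simulated data (assumed): a hidden quality factor seen through 4 noisy indicators.
rng = np.random.default_rng(1)
n = 300
quality = rng.normal(size=n)
X = np.column_stack([quality + rng.normal(0, 0.5, n) for _ in range(4)])

scores, loadings, explained = first_principal_component(X)
print("loadings:", loadings)
print("share of variance explained:", explained)
print("correlation with hidden quality:", np.corrcoef(scores, quality)[0, 1])
```

The index is a weighted sum of the standardized inputs, which is exactly why it is limited: it is linear in every variable and has no way to represent interactions between them.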
High-frequency and high-dimensional datasets are going to change a lot of research that until now has been relatively simple. Researchers just starting out their academic careers should definitely be taking a look at machine learning tools in anticipation of the exciting future of data.