
AI in Urban Planning: Making Machines Learn What You Perceive of Your Surroundings


In the future, would robots take over and do urban planning for us? Well, they might or might not; it's still too early to say.


The difficulty of automating tasks in urban planning is the reason we planners do not yet feel the heat of being overtaken by our machine overlords. But can we delegate highly complex tasks to machines while keeping responsibility for decisions and administration in city management?


Can Artificial Intelligence be used as an ally for crunching large datasets and values into numerical scales that we can use to decide future action?


You might not be aware of it, but the whole urban planning landscape has been changing for quite some time. Computer scientists have been very active in applying AI techniques to urban problems, and exciting ideas have been explored and applied in cities across the planet. However, how much work AI can do on its own in planning our cities is debatable. Urban planning is more than mere calculation; it needs socio-cultural and demographic perspectives to provide solutions to urban problems.


Among the many examples of integrating AI in urban planning, one that stands out is the automated mapping of human emotions in local urban surroundings. Let's understand it informally. Assume there is a biped robot (let's name it iRobot, after the movie I, Robot) walking the roads of the city. It starts capturing a visual feed with its camera eyes and an audio feed with its microphone ears. An AI program, something like Skynet's software, analyses the feed and labels the surroundings as safe, boring, lively, calm, or pleasant.


Now imagine an army of such robots moving across the city, analyzing every nook and cranny and mapping perceived emotions for the entire city. You might think the plot is good enough for a Black Mirror-style series but not possible or feasible in the near future. What if I told you that we have already developed the software; we just need an iRobot to install it on?



You must be wondering why you have not heard of “urban perception” as a term or a subject before. I agree the term needs some refinement, or let's say such terminology appears when you leave computer scientists to do urban planners' jobs. The name is a bit of a misnomer; it is better understood as “human perception of urban surroundings.” The idea of mapping perception is not new. In fact, Kevin Lynch, in a way, studied cognitive mapping using the five key elements that make up a city. Sounds familiar? (I guess it does.)


Since the Lynchian era, everything has changed. Lynch and subsequent authors did not have access to extensive urban datasets, especially ones capturing how a city looks from a pedestrian's point of view, until now (looking at you, Google Street View imagery!). Earlier studies of this kind were limited to a few sites; such work can now be expanded to multiple cities in various countries. We are sitting on a bigger treasure trove of data and tools than we realize.


I created my own instance of emotion-capturing software and became that robot.

If you have searched the term “urban perception” on the internet by now, you will have come across a plethora of studies on mapping perceptions. In most of them, researchers use Google Street View imagery because, you know, it is available for thousands of cities and covers every corner of a city. But you will agree that this is not much fun: you would just go online, download a bunch of pictures, and let some AI magic do all the work.


But, lucky you, the Indian government has prohibited Google from capturing street views of Indian cities, so no such data for you. You might then say, so what, you would just capture a bunch of photographs on the streets and get it over with (what's all the fuss about?). Now imagine how many pictures you would have to click to build a dataset comparable to what any decent AI system is trained on. AI requires huge datasets; a platform like GSV makes sense because it provides street views of entire countries. How do we compete with that? With our very own jugaad.


Figure 1 - A Google Street View collection vehicle.

But first, let's review some technicalities of street view data collection. Google uses a vehicle with a 360-degree camera setup and a state-of-the-art localization (GPS) mechanism (Figure 1). The vehicle moves along the streets and takes a panorama every few metres or so. It has no specified time of day at which it captures the images. So you might be wondering what you would have to develop to beat Google's method of data collection.


You have seen the strengths of the platform; now let's review some of the downsides. First, Google captures only photographs, not audio, from the streets. But we are trying to build a robot that not only sees but also hears, so that's a bummer. Second, because the images carry no timestamp, we are never sure of the time of day at which a photograph was captured. Our surroundings change regularly and are transient in nature, so temporally repeated data collection is needed to judge a place. Difficult to follow?


Let's understand this with an example. Take a street market: it is empty in the morning and mostly empty in the afternoon, and as evening approaches, it becomes lively and vibrant. Now say our Google vehicle came in the morning and collected a photograph, which obviously captured nothing other than the empty street. AI software trained on that data would conclude that the place is outright dull and depressing. It would misclassify the place's real character and proudly produce the wrong result.


The takeaway from this discussion is that we need an AI program that produces better inferences about a location than the instance discussed above. For that, we need a dataset of urban streets with high temporal resolution (photographs of the same location at different times of day) and multimodal coverage (sound plus images).
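To make that concrete, here is a minimal sketch in Python of what one record in such a time-stamped, multimodal dataset could look like. The field names and values are my own illustration, not the structure actually used in the study.

```python
from dataclasses import dataclass
from datetime import datetime

# One multimodal street sample; field names are illustrative only.
@dataclass
class StreetSample:
    location_id: str        # a segment or waypoint along the surveyed route
    latitude: float
    longitude: float
    captured_at: datetime   # keeps the temporal dimension GSV imagery lacks
    image_path: str         # street-level photograph at this time and place
    audio_path: str         # co-located audio clip (e.g. a binaural recording)

# The same location appears many times across the day, so grouping by
# location_id yields the temporal sequence needed to judge a place fairly.
morning = StreetSample("market_street_07", 19.07, 72.88,
                       datetime(2019, 5, 10, 8, 30),
                       "img/market_0830.jpg", "audio/market_0830.wav")
evening = StreetSample("market_street_07", 19.07, 72.88,
                       datetime(2019, 5, 10, 18, 30),
                       "img/market_1830.jpg", "audio/market_1830.wav")
```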


Now, let's talk about the jugaad I keep mentioning. There is no second opinion on how different Indian cities are from cities in Western countries. A population with diverse cultural and social habits has led to organic built form over the years. Apart from that, the compact built form of informal settlements and the core areas of cities poses problems for audio and visual data collection. A Google Street View vehicle has no chance of covering these narrow lanes and congested traffic. What's the solution?


You might say, why not mount a camera on my head and walk along the streets? After a bit more thinking, you would realize that, although this is one option for data collection, it is certainly not the fastest. What's the alternative?


A bicycle. Who would have thought, before starting Ph.D. work, that I would use a bicycle to collect datasets? Why not? It is agile and makes little to no sound, which makes it suitable for sound collection. It handles congested roads well and is friendly to pedestrians. And most importantly, I could take my newly purchased BTWIN bike for a spin. The mobile phone camera became the robot's eyes, the audio recorder became its ears, and I became the robot (a cycle-mounted one) (Figure 2).

Figure 2 - Cycle-based audio and visual data collection.

To be honest, data collection with jugaad was not all cool and breezy. The summer sun of May did half the damage; the other half was done by Mumbai traffic. I used to cover a distance of 10 kilometres in roughly an hour (Figure 3). The process was repeated for 12 days, covering street view images and audio clips of the same route from morning to evening. I collected around 18,000 photographs and 700 minutes of audio data. The best part is that the audio was collected through binaural microphones, which capture true stereo sound; when you listen back, you may feel as if you are moving around the streets yourself.



Figure 3 - The selected routes for data collection.

I know what you must be thinking: the robot collects audio and visual data, but where does the magic happen? How does it figure out what to make of that data?


To answer this, we need to dive a bit deeper into the realm of AI models. Be ready for some AI mumbo jumbo. AI algorithms are trained in two broad ways: unsupervised and supervised.


Supervised learning (training) requires pairs of x's and y's, where x stands for features and y for labels; every x has its y. Say you want to build an algorithm to tell cats from dogs. The first step is to collect some images of dogs and some of cats and label them as dog or cat, respectively. Here your x's are the images, and the associated y's are the labels (cat or dog).


Now we just take these pairs and throw them at an AI algorithm to do the task for us. The algorithm learns to associate images with their labels and to work out which features in an image correspond to a dog and which to a cat. The same method also works for other types of complex data, such as time series and audio clips.
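If you like to see things in code, here is a minimal sketch of that "throw the pairs in" step using PyTorch. The tiny network is purely illustrative, and random tensors stand in for real cat and dog photographs; this is not the study's actual model.

```python
import torch
import torch.nn as nn

# x's are images, y's are labels (0 = cat, 1 = dog).
x = torch.randn(16, 3, 64, 64)               # 16 RGB images, 64x64 pixels
y = torch.randint(0, 2, (16,))               # one label per image

model = nn.Sequential(                        # a deliberately tiny CNN
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),                          # two output classes: cat, dog
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):                        # the "throw the pairs in" step
    opt.zero_grad()
    loss = loss_fn(model(x), y)               # compare predictions with labels
    loss.backward()                           # learn which features matter
    opt.step()
```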


We adopt a similar method for our image and audio classification. If you have been reading carefully, you will realize that we have our x's but not our y's. How do we get our y's?


In the earlier example, there were two classes: cat and dog. Models can identify only the classes on which they are trained; the cat-dog model cannot tell you whether an animal is a giraffe. It's like finding some crazy-looking fruit in the market and asking the shopkeeper what it is, because you have only been "trained" on bananas, peaches, mangoes, and other common fruits. I hope you are still with me.


The question remains: what is our y? Our y's are the perceptions on which we want to train our images. Sounds complex? It sure is, but focus on the y and it becomes simple enough to follow. Say we want to teach our model two classes, (a) safe and (b) lively (think of them like cats and dogs). We wish to label each image as either safe or lively. So what do we do?


We sit down and do the manual labour of labelling the images we think look safe and the others that look lively. Then we throw these x's (images) and y's (labels of safe and lively) into the algorithm to obtain our trained model. Now we can take any random street view image, and the trained model will tell us whether it gives a perception of safe or lively. Pretty cool!
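Purely as an illustration, inference with such a trained safe-vs-lively model would look roughly like the sketch below. The untrained network and the random tensor are stand-ins for the real trained model and a real street photograph.

```python
import torch
import torch.nn as nn

PERCEPTIONS = ["safe", "lively"]              # the y's we chose to teach

# Assume `model` is the two-class network trained on the hand-labelled
# images above; a fresh copy of the same architecture stands in for it here.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, len(PERCEPTIONS)),
)
model.eval()

new_street_image = torch.randn(1, 3, 64, 64)  # stand-in for a new street photo
with torch.no_grad():
    scores = model(new_street_image).softmax(dim=1)[0]

label = PERCEPTIONS[int(scores.argmax())]
print(f"predicted perception: {label} ({scores.max().item():.2f} confidence)")
```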


In the full study, six perceptual attributes (six types of labels) are used for images and eight for audio data. The AI brain processes each image and audio clip and produces an output on these 14 perceptions. Our robot, now equipped with this AI brain, has its own thoughts and emotions about the urban surroundings. We can call this one eyeRobot!
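Conceptually, merging the image model's six outputs and the audio model's eight outputs into one 14-value "impression" of a location could look like the sketch below. The attribute names are illustrative placeholders, not the exact labels used in the study.

```python
import torch

# Illustrative attribute names; the study's exact labels may differ.
IMAGE_ATTRS = ["safe", "lively", "boring", "calm", "pleasant", "beautiful"]
AUDIO_ATTRS = ["traffic", "human", "nature", "music",
               "construction", "animals", "quiet", "chaotic"]

def describe_surroundings(image_scores: torch.Tensor,
                          audio_scores: torch.Tensor) -> dict:
    """Merge the image model's 6 outputs and the audio model's 8 outputs
    into one 14-value description of a location."""
    assert image_scores.numel() == len(IMAGE_ATTRS)
    assert audio_scores.numel() == len(AUDIO_ATTRS)
    merged = dict(zip(IMAGE_ATTRS, image_scores.tolist()))
    merged.update(zip(AUDIO_ATTRS, audio_scores.tolist()))
    return merged

# Dummy scores stand in for the two trained models' outputs.
print(describe_surroundings(torch.rand(6), torch.rand(8)))
```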



Figure 4 - CNN models aimed at detecting objects and understanding semantic segmentation.

To be honest, this is a simplistic explanation of what is going on under the hood. The AI models see the data and extract relevant features in different ways (Figure 4). A wide variety of methods is in use, and you will constantly see terms such as “object detection”, “semantic segmentation”, and “image classification”. These terms and images (Figure 4) may sound familiar because the same methods are used in self-driving cars. Interesting! However, we will leave the discussion of detailed outputs and inner workings for another day; the subject is interesting enough to deserve a post of its own.
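If you are curious what semantic segmentation looks like in practice, here is a small sketch using an off-the-shelf DeepLabV3 model from torchvision, not the models used in this study. The image path is a placeholder, and the weights argument assumes a recent torchvision version.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Off-the-shelf DeepLabV3; "street.jpg" is a placeholder street-level photo.
model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("street.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    out = model(batch)["out"][0]              # per-pixel class scores
classes = out.argmax(0)                        # per-pixel predicted class

# This pretrained model uses the 21 Pascal VOC classes; index 15 is "person",
# so the share of "person" pixels is a crude proxy for how busy the street is.
person_fraction = (classes == 15).float().mean().item()
print(f"{person_fraction:.1%} of pixels classified as people")
```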


I am sure you are not convinced yet. There must be a thousand questions in your mind about all these processes. As mere mortals, we often need more context and material to gather our thoughts. You are in luck: at the end of this text, you will find suggested reading, and guess what, these are all research papers on exactly the context we discussed, authored by me.


What is the need for such studies?


It's time I answered the question. So what? Why do we need such perception studies at all? Is this even urban planning?


Individuals' perception of urban spaces is essential for human-centric, bottom-up urban planning practice and research. In recent years, the demand for better management and planning has become even more pressing with the growth in urban population and development.


Perception studies matter for city planning and management because perception-related scientific data can be used as evidence to validate land use and site planning by-laws. Moreover, the methodological approach to measuring, judging, and evaluating urban surroundings can be repeated in different areas of a city or in different cities altogether.


Such studies may help in comparing and benchmarking urban spaces with a view to retrofitting and redevelopment. Convinced yet?


A person's sense of security, belonging, and happiness is affected by the environment in which they dwell, work, and play. Quantifying the environmental and perceptual attributes of our surroundings is, therefore, a building block for understanding these traits.


AI algorithms and ground-up urban data collection still need a solid bridge between them to be understood together. I hope I have succeeded in building a few pillars of that bridge.



Further Reading


Verma, D., Jana, A. & Ramamritham, K. Artificial Intelligence and Human Senses for the Evaluation of Urban Surroundings. in Intelligent Human Systems Integration (eds. Karwowski, W. & Ahram, T.) 722, 852–857 (Springer International Publishing, 2019).


Verma, D., Jana, A. & Ramamritham, K. Quantifying Urban Surroundings Using Deep Learning Techniques: A New Proposal. Urban Sci. 2, 78 (2018).


Verma, D., Jana, A. & Ramamritham, K. Machine-based understanding of manually collected visual and auditory datasets for urban perception studies. Landsc. Urban Plan. 190, 103604 (2019).


Verma, D., Jana, A. & Ramamritham, K. Classification and mapping of sound sources in local urban streets through AudioSet data and Bayesian optimized Neural Networks. Noise Mapp. 6, 52–71 (2019).


 
Deepank Verma, Author

I have recently defended my PhD thesis in Urban Informatics at the Centre for Urban Science and Engineering (CUSE), IIT Bombay. Prior to that, I completed my Master's in City Planning (MCP) at the Department of Architecture and Regional Planning (ARP), IIT Kharagpur, and my undergraduate degree in Planning (B.Plan) at SPA Bhopal. My interests and expertise include urban planning and design, remote sensing, and deep learning for computer vision and audio classification.















