Web Scraping Instagram with Selenium
by Mariya Sha

and that darn NoSuchElementException

Selenium is a very powerful web scraping tool: it can target specific content elements on a webpage and extract them mercilessly!
But great power also leaves room for great errors, and in this short tutorial I will show handy ways to bypass them and automate the entire process of image extraction.

We’ll focus on one task — web scraping a full database of cat images from Instagram. We’ll do it step by step, and we’ll discuss the challenges and the reasoning behind certain commands:

  1. Log in to our personal Instagram account
  2. Handle the pop-up messages by clicking on “not now”
  3. Search for a keyword “#cat”
  4. Scroll down and select all the thumbnails above our scrolling position
  5. Create a new directory on your computer
  6. Save all the images inside the new directory

Install Chrome Driver

Download Chrome Webdriver: https://chromedriver.chromium.org/downloads

  • a quick tip: I highly recommend saving chromedriver.exe in your root path; that way you won’t need to specify the path of your file every time you initialize the driver (please refer to the comment inside the following code cell).
save chromedriver.exe inside the root path

We’ll begin with the most basic Selenium commands — starting the Web Driver and navigating to Instagram’s home page (note: in newer versions of Selenium, the driver path is passed through a Service object instead; this tutorial uses the older style, where the path goes directly to webdriver.Chrome):

from selenium import webdriver

file_path = 'C:/Users/goaim/chromedriver.exe'
driver = webdriver.Chrome(file_path)
# driver = webdriver.Chrome() if your chromedriver.exe is inside the root path

driver.get("http://www.instagram.com")

These commands will result in a new browser window popping up and navigating to instagram.com.

Log in to your Instagram account:

We’ll begin by opening our developer tools (right-click anywhere on the page and then select “inspect”).

select Instagram’s username input field with developer tools

Then we’ll select the inspect tool (denoted by a green square on the image above) and click on the username input field (red rectangle).
Once we do that, the developer tools will present us with the source code of the particular element we’ve selected. In our case, we will focus on the name attribute (yellow rectangle) of this input element, and we’ll note that it equals “username”.

But back in our code, a very curious thing happens when we try to target this element. We get a really annoying error no matter which selector we use, and even though the official Selenium documentation swears by the following commands — these will not work in our case:

username = driver.find_element_by_name('username')
username = driver.find_element(By.XPATH, '//input[@name="username"]')
username = driver.find_element(By.CSS_SELECTOR, "input[name='username']")

Each of these commands will result in the exact same error:

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"input[name='username']"}

So what are we missing here?

We tend to forget that each webpage has its own life cycle, where all the elements load one after the other to assemble the content.
So if the element we are targeting is still being processed and hasn’t loaded on the page yet — Selenium won’t be able to detect it.
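Under the hood, an explicit wait (which we’ll set up with WebDriverWait in a moment) is essentially a polling loop: try to find the element, swallow the failure, pause briefly, and try again until a timeout. Here is a minimal, stdlib-only sketch of that idea — the fake_finder below is my own stand-in for driver.find_element, not part of the tutorial’s code:

```python
import time

def wait_until_found(finder, timeout=10.0, poll=0.5):
    """Poll `finder` until it returns an element or `timeout` expires.
    `finder` should raise an exception while the element is missing."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return finder()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # give up: re-raise the last "not found" error
            time.sleep(poll)

# Fake finder: "fails" twice before the element "loads" on the page.
attempts = {"n": 0}
def fake_finder():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise LookupError("element not loaded yet")
    return "<input name='username'>"

element = wait_until_found(fake_finder, timeout=5.0, poll=0.01)
print(element)
```

This is exactly why waiting fixes the NoSuchElementException: the lookup is retried until the page’s life cycle catches up.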

Consider the following solution to this problem: we will wait until the element becomes clickable, and only then we will select it.
For this, we’ll need to import a few simple helpers with very scary and long import statements: EC (Expected Conditions), By, WebDriverWait and Keys.
We’ll add the following lines to our code (and while we’re already there, we’ll include the password input as well):

from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait

username = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']")))
password = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='password']")))

This way, we allow enough time for the elements to load, and this step is absolutely crucial if you’re looking for a 100% automated process.
Now we are safe to enter our own personal username and password and click on the login button (please refer to the developer tools screenshot below the code snippet):

username.clear()
username.send_keys("my_username")

password.clear()
password.send_keys("my_password")

login_button = WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']"))).click()

Select Instagram’s login button in the developer tools

So we are basically selecting a button element whose type attribute is “submit”. And once we run that, we are presented with 2 sequential pop-up messages, where we’ll need to click on the “Not Now” button to dismiss them.

Handling pop-up messages

Select Instagram’s “Not Now” button in the developer tools

Here we will select the button with the text “Not Now”, and we’ll perform this action twice because there will be another pop-up message following the first:

not_now = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Not Now")]'))).click()
not_now2 = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(), "Not Now")]'))).click()

Next, we are finally presented with our Instagram feed and we can move on with searching keywords.

Search for a keyword

Next, we will enter a hashtag inside the search field, in our case — #cat.
Then, we will press the ENTER key, because Instagram didn’t include a button for us to click on so we need to improvise.
For this, we will also delay the execution of our Python code and adjust it to the loading speed of our webpage. However, instead of using WebDriverWait, we will use time.sleep(seconds):

import time

searchbox = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//input[@placeholder='Search']")))
searchbox.clear()

keyword = "#cat"
searchbox.send_keys(keyword)
time.sleep(5)
searchbox.send_keys(Keys.ENTER)
time.sleep(5)
searchbox.send_keys(Keys.ENTER)
time.sleep(5)

Please note, that we are targeting the input field where the placeholder is equal to ‘Search’ (please refer to the developer tools, the screenshot is below).
And also, note that we are pressing ENTER twice and waiting three times for 5 whole seconds, as Instagram made our lives so much harder by not having a search button to click on.
Since we are waiting for 15 seconds in total, I highly recommend being patient and letting the program run slowly but surely (you may need to adjust the number of seconds, depending on your system).

Select Instagram’s search input field in the developer tools

And once we’ve reached the home of #cat, we’ll scroll down the page and select all the images above our scrolling position.

Scroll and select thumbnails

# scroll
driver.execute_script("window.scrollTo(0, 4000);")

# select images
images = driver.find_elements_by_tag_name('img')
images = [image.get_attribute('src') for image in images]
images = images[:-2]  # slicing off IG logo and profile picture

print('Number of scraped images: ', len(images))

Please note, we first select all the image tags, and then we use a list comprehension to keep only their ‘src’ attribute and dispose of the rest.
Lastly, we slice off the last 2 images, which usually consist of your own personal profile picture and Instagram’s logo.
I also like to print the total number of images I’ll be scraping before I download them, just so I know what to expect.
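A single scrollTo only loads one batch of thumbnails. If you repeat the scroll and re-collect the img tags, many of the same src values come back again, so it helps to accumulate them with an order-preserving dedup. Here is a small sketch of that bookkeeping — the pass_1/pass_2 lists and their example.com URLs are placeholders standing in for repeated find_elements calls, not real Instagram data:

```python
def accumulate_unique(batches):
    """Merge batches of src URLs, keeping first-seen order and dropping repeats."""
    seen = set()
    ordered = []
    for batch in batches:
        for src in batch:
            if src not in seen:
                seen.add(src)
                ordered.append(src)
    return ordered

# Placeholder URLs: overlapping batches, as returned by two scroll passes.
pass_1 = ["https://example.com/cat1.jpg", "https://example.com/cat2.jpg"]
pass_2 = ["https://example.com/cat2.jpg", "https://example.com/cat3.jpg"]

all_images = accumulate_unique([pass_1, pass_2])
print(len(all_images))  # 3
```

Calling accumulate_unique on the batches from each scroll pass gives you one clean list to feed into the download step below.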

Learn How to Extract Full Size Images

Please visit my blog for further information and code updates: https://www.mariyasha.com/post/web-scraping-instagram-thumbnails-with-selenium

Create a new directory for the scraped images

Once we have the URLs of our images in a neat list, we’ll create a brand new folder named after our keyword (minus the hashtag, plus a plural “s”):

import os
import wget

path = os.getcwd()
path = os.path.join(path, keyword[1:] + "s")
os.mkdir(path)

The first command selects the current working directory, the second specifies the folder name we would like to add to this directory (based on our keyword), and the last one creates the directory on our computer
(mkdir = make directory).
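One practical caveat: os.mkdir raises FileExistsError if you run the notebook a second time with the same keyword. A small sketch of the folder-name derivation with a re-run-safe variant, os.makedirs(..., exist_ok=True) — the temporary directory here is only so the example can run anywhere; in the tutorial the base would be os.getcwd():

```python
import os
import tempfile

keyword = "#cat"
folder_name = keyword[1:] + "s"   # strip the hashtag, add a plural "s" -> "cats"

base = tempfile.mkdtemp()         # stand-in for os.getcwd()
path = os.path.join(base, folder_name)

os.makedirs(path, exist_ok=True)  # creates the folder
os.makedirs(path, exist_ok=True)  # safe on a second run: no FileExistsError

print(folder_name)          # cats
print(os.path.isdir(path))  # True
```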

Download web scraped images

Lastly, we will use wget to help us with downloading the images.
We’ll create a variable “counter” which will represent the index of each image, so that the file names we set are formatted as “cat0.jpg”, then “cat1.jpg” and so on, all the way until the last image:

# download images
counter = 0
for image in images:
    save_as = os.path.join(path, keyword[1:] + str(counter) + '.jpg')
    wget.download(image, save_as)
    counter += 1
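One thing to watch out for: if a single URL is broken, wget.download raises and aborts the whole loop. Below is a slightly more defensive sketch of the same naming scheme, using enumerate instead of a manual counter and skipping failures. The download parameter is a hypothetical hook of my own — in the tutorial’s code you would pass wget.download there; with the placeholder URLs below, no real network access happens:

```python
import os

def build_filenames(urls, folder, stem):
    """Reproduce the tutorial's naming scheme: <stem>0.jpg, <stem>1.jpg, ..."""
    return [os.path.join(folder, f"{stem}{i}.jpg") for i, _ in enumerate(urls)]

def download_all(urls, folder, stem, download=None):
    """Download each URL to its numbered filename, skipping any that fail."""
    saved = []
    for url, save_as in zip(urls, build_filenames(urls, folder, stem)):
        try:
            if download is not None:
                download(url, save_as)   # e.g. wget.download in the tutorial
            saved.append(save_as)
        except Exception:
            continue                     # one broken URL shouldn't abort the loop
    return saved

# Placeholder URLs; pass download=wget.download for a real run.
urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
names = download_all(urls, "cats", "cat")
print(names)
```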

And now, we can sit back, relax, press on the “Run All Cells” button and be impressed with our superior coding skills.
Now each time you’ll revise the keyword — you’ll get a brand new image database within seconds!

What can we do with a database of images?

From my point of view, the answer is simple — Machine Learning and image classification! We can train a neural network to learn the difference between cats and dogs and predict whether a specific, never-before-seen image is of a cat or of a dog.
We can do the same with flowers, buildings, landmarks and even faces and when we have free access to such powerful tools as Selenium — the sky is the limit. This technology is very handy when it comes to data science, artificial intelligence and even graphic design, so if you’d like to learn more about web scraping, and Selenium in particular, please have a look at the documentation:
https://selenium-python.readthedocs.io/

Are you more of a video person?

This lecture is also available on Youtube, so you can follow the step by step tutorial and code along with me!
(link to the starter files available in the description of the video)

Automate Instagram with Selenium:

Automate Facebook with Selenium:

Automate LinkedIn with Selenium: