2021.09.07 00:00

When the web page that you want to get information from is not simply accessible through the URL, you can either:

  1. Find out if it’s a form submission, use curl to simulate the POST request, then parse the HTML with beautifulsoup (a sketch of this follows the list).
  2. Use selenium to simulate an actual click in a browser.
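
For the first option, here is a minimal sketch using Python’s requests in place of curl; the endpoint and form field names below are made up, so inspect the page’s form (action URL and input names) in the browser dev tools to find the real ones:

import requests
from bs4 import BeautifulSoup

# Hypothetical form endpoint and field names -- replace with the ones the
# real page actually submits.
resp = requests.post(
    'https://someurl.com/search',
    data={'query': 'keyword', 'page': '1'},
)
soup = BeautifulSoup(resp.text, 'lxml')
results = soup.find_all('div', {'class': 'result'})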

Regardless of what rendered it, a web page displayed in a browser will always be in HTML format (or maybe I’m wrong?). Similar to beautifulsoup, we can trace through the HTML elements to find which button we should click.

To do this, we need selenium and a driver that can drive your installed Chrome browser headlessly. Apparently Python 3 requires beautifulsoup4 instead of beautifulsoup.

pip install selenium beautifulsoup4 lxml chromedriver-binary

The basic code is:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import chromedriver_binary  # importing this adds the bundled chromedriver to PATH

options = Options()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)

url = 'https://someurl.com'
wait = WebDriverWait(driver, 10)  # wait up to 10 seconds for the conditions below
driver.get(url)

# Do clicking stuff here

driver.quit()

We can wait for the page to load by waiting for a certain id or class name:

element = wait.until(EC.element_to_be_clickable((By.ID, "name_of_element_id")))
element = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "name_of_class")))

Then if we want to click it:

element.click()

After clicking, it will load the next page, so after another wait we can do another click, e.g.:

element = wait.until(EC.element_to_be_clickable((By.ID, "name_of_element_id1")))
element.click()
element = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "name_of_class")))
element.click()

We can also check the status of a checkbox:

element = driver.find_element(By.ID, "checkbox_id")
element.get_attribute('checked')
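
For example, a small sketch (the id here is made up) that ticks the box only if it isn’t already checked; get_attribute('checked') returns 'true' when set and None otherwise, and is_selected() is an equivalent built-in check for checkboxes:

# Hypothetical checkbox id: tick it only if it is not already checked.
checkbox = driver.find_element(By.ID, "checkbox_id")
if not checkbox.is_selected():
    checkbox.click()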

Or get the page source to parse it using beautifulsoup:

html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
boxes = soup.find_all('div',{"class":'box'})
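
From there it’s plain beautifulsoup, e.g. dumping the text of each matched box (the 'box' class name is just the placeholder from above):

# Print the visible text of every matched <div class="box">.
for box in boxes:
    print(box.get_text(strip=True))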

It’s much more straightforward than I thought, but tedious to set up if you really want to scrape a lot of data.


