When the web page you want to get information from is not directly accessible through its URL, you can either:
- Find out if it's a form submission and use `curl` to simulate the POST request, then use `beautifulsoup` to parse the HTML (see the sketch after this list).
- Use `selenium` to simulate an actual click in a browser.
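For the first approach, here's a minimal sketch using Python's `requests` library in place of `curl` (same idea: POST the form fields ourselves). The endpoint URL and field names below are hypothetical placeholders; inspect the real form's action URL and input names in the browser's dev tools:
```python
import requests
from bs4 import BeautifulSoup

# Hypothetical form endpoint and fields
resp = requests.post(
    'https://someurl.com/search',
    data={'query': 'foo', 'page': '1'},
)
soup = BeautifulSoup(resp.text, 'lxml')
results = soup.find_all('div', {'class': 'result'})
```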
Regardless of what rendered it, a page displayed in a web browser will always be HTML (or maybe I'm wrong?). Just as with `beautifulsoup`, we can trace through the HTML elements and figure out which button we should click.
To do this, we need `selenium` and a driver that can control your installed Chrome browser headlessly. Apparently Python 3 requires `beautifulsoup4` instead of `beautifulsoup`.
```
pip install selenium beautifulsoup4 lxml chromedriver-binary
```
The basic code is:
```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import chromedriver_binary  # adds the bundled chromedriver to PATH

# Run Chrome without opening a visible window
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

url = 'https://someurl.com'
wait = WebDriverWait(driver, 10)  # wait up to 10 seconds for elements
driver.get(url)

# Do clicking stuff here

driver.quit()
```
We can wait for the page to load by waiting for a certain id or class name:
```python
element = wait.until(EC.element_to_be_clickable((By.ID, "name_of_element_id")))
element = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "name_of_class")))
```
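If the element never appears within the timeout, `wait.until` raises a `TimeoutException`, which we can catch (a minimal sketch; the id is the same placeholder as above):
```python
from selenium.common.exceptions import TimeoutException

try:
    element = wait.until(EC.element_to_be_clickable((By.ID, "name_of_element_id")))
except TimeoutException:
    # Clean up the browser before bailing out
    driver.quit()
    raise SystemExit("Element never became clickable")
```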
Then if we want to click it:
```python
element.click()
```
After clicking, it will load the next page, so after another wait we can click again, e.g.:
```python
element = wait.until(EC.element_to_be_clickable((By.ID, "name_of_element_id1")))
element.click()
element = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "name_of_class")))
element.click()
```
We can also check the status of a checkbox:
```python
element = driver.find_element(By.ID, "checkbox_id")
element.get_attribute('checked')  # 'true' if checked, None otherwise
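# My own follow-up sketch: tick the box only if it isn't already checked
if element.get_attribute('checked') is None:
    element.click()
```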
Or get the page source to parse it using `beautifulsoup`:
```python
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
boxes = soup.find_all('div', {"class": 'box'})
```
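From there it's plain `beautifulsoup`; for instance, pulling the text out of each of the hypothetical `box` divs above:
```python
for box in boxes:
    print(box.get_text(strip=True))
```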
It's much more straightforward than I thought, but tedious to set up if you really want to scrape a lot of data.