Download stock data from Yahoo Finance

by Num3Ilia 05 Nov 2024

Python | Selenium

Overview

The primary objective of this project is to automate data scraping from Yahoo Finance, enabling the extraction of financial data, key statistics, and historical price information for various stock tickers. Leveraging Selenium and Python, we create a tool that navigates Yahoo Finance, downloads reports, and organizes them into directories for future analysis. This setup is ideal for users who require large amounts of financial data but find manual collection too tedious.

The catch is that you need a Yahoo Finance Gold subscription, and if you are not in the USA you also need a VPN so that your connection exits from a US location.

I don't have any affiliation with either service. Based on what I have found, this is the cheapest way to get historical financial data from the web (50 USD + 15 EUR).

Yahoo Finance Gold plan for 1 month
NordVPN for 1 month

Let's get started

To set up the environment, we first need the following pieces:

Selenium for browser automation (installed with pip)
python-dotenv for loading the .env file (installed with pip)
logging for a detailed activity log (standard library)
os and shutil for directory management (standard library)
the ticker JSON file, either company_tickers_exchange.json or company_tickers.json

We begin by defining some variables that will help us along the way, then set up logging to record important actions and errors. The log is written to yahoo_finance_scraper.log and captures every step taken, which is useful later for troubleshooting and identifying potential issues.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os
from pathlib import Path
import shutil
from datetime import datetime, timedelta
import time
import json
import logging
from dotenv import load_dotenv



####################
# Setup variables  #
####################

# Load environment variables from .env file
load_dotenv()

# Retrieve variables from environment
email = os.getenv("EMAIL")

# Setup logging
logging.basicConfig(
    filename='yahoo_finance_scraper.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

# Setup WebDriver
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

# Folder where the browser saves downloaded files
downloads_path = os.path.expanduser("~/Downloads")  # macOS/Linux users
# For Windows: downloads_path = "C:\\Users\\<your-username>\\Downloads"
# The folder that files are moved to (dest_folder) is defined further down; change as needed

# Load the ticker symbols to process
# tickers = ["AAPL", "TSLA", "NFLX", "META", "NVDA", "MSFT", "AMZN", "GOOG"]
# tickers = ["AAPL",]  # for testing the scraper
with open("company_tickers.json", "r") as file:
    data = json.load(file)
# Build the list of ticker symbols whose data will be downloaded
tickers = [record['ticker'] for record in data.values()]

# Yahoo statements
statements = ["financials", "balance-sheet", "cash-flow"]

# Destination folder for download files
dest_folder = Path(os.getenv("DEST_FOLDER"))

Code breakdown:

Import the necessary modules so we can work the magic.
Create a file called .env and put your email and destination folder path in it, or simply replace the email, downloads_path and dest_folder variables in my code.
Initialize logging, which creates a file (yahoo_finance_scraper.log in my case) that records our actions and helps with troubleshooting later on.
WebDriver -> The setup creates the Chrome driver and a WebDriverWait with a 10-second timeout, which lets us wait for elements to load before interacting with them. This is critical for ensuring the page is ready for the next step.
downloads_path -> I use the browser's default download folder. Paths for both Mac and Windows are shown (I have only tested on my Mac).
Company ticker files:
- The 1st file contains two keys:
  - fields - a list of strings giving the field names in order: ["cik", "name", "ticker", "exchange"]
  - data - a list of company records, e.g. [789019, "MICROSOFT CORP", "MSFT", "Nasdaq"]
- In the 2nd file the keys are numbers stored as strings ("0", "1", "2", ...) and each key maps to a dictionary with the fields you are looking for: {..., "1":{"cik_str":789019,"ticker":"MSFT","title":"MICROSOFT CORP"}, ...}
statements -> the Yahoo statement pages that we are going to drive with Selenium's help
dest_folder -> Organizing downloaded files is essential to prevent clutter. Files first land in downloads_path and are then moved into dest_folder, organized by ticker (please define your location here).
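
Because os.getenv returns None for a variable that is not set, Path(os.getenv("DEST_FOLDER")) fails with a TypeError when the .env file is incomplete. Below is a minimal sketch of a guard that fails early with a readable message. The helper name require_env is hypothetical and not part of the original script.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper (not in the original script); it assumes
    load_dotenv() has already been called so .env values are visible.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Missing environment variable {name}; add it to your .env file"
        )
    return value
```

In the setup code this could stand in for the two os.getenv calls, for example email = require_env("EMAIL") and dest_folder = Path(require_env("DEST_FOLDER")).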

Note: It is up to you which ticker file to use, whichever of the two fits your purpose. In my script I used the second one and unpacked the dictionary into a list containing only the ticker symbols (10210 symbols at the time of writing).

# Ticker list
['AAPL', 'MSFT', 'NVDA', 'GOOGL', 'AMZN', 'META', 'BRK-B', 'TSM', 'LLY', 'AVGO', 'TSLA' ...]
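
If you want the script to accept either file, the two layouts can be told apart by their keys. The sketch below is one way to do it; extract_tickers is a hypothetical helper, and the inline samples only reuse the Microsoft record quoted above rather than real file contents.

```python
import json


def extract_tickers(payload):
    """Return ticker symbols from either of the two JSON layouts."""
    if "fields" in payload and "data" in payload:
        # Layout 1: a list of field names plus rows in the same order
        ticker_index = payload["fields"].index("ticker")
        return [row[ticker_index] for row in payload["data"]]
    # Layout 2: string keys "0", "1", ... each mapping to one record
    return [record["ticker"] for record in payload.values()]


# Tiny samples shaped like the two files
layout_1 = json.loads(
    '{"fields": ["cik", "name", "ticker", "exchange"], '
    '"data": [[789019, "MICROSOFT CORP", "MSFT", "Nasdaq"]]}'
)
layout_2 = json.loads(
    '{"1": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"}}'
)

print(extract_tickers(layout_1))  # ['MSFT']
print(extract_tickers(layout_2))  # ['MSFT']
```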

Moving further into the code, we get to the part where we log in to Yahoo Finance (often with two-factor authentication (2FA)). To get through it, the script performs a few actions:

# Open Yahoo Finance and login
logging.info("Opening Yahoo Finance login page")
driver.get("https://login.yahoo.com/")
email_input = wait.until(EC.element_to_be_clickable((By.NAME, "username")))
email_input.send_keys(email)  # EMAIL value loaded from .env
email_input.send_keys(Keys.RETURN)

# Wait for 2FA completion: enter the password and 2FA code manually in the browser window
logging.info("Waiting for 2FA authentication")
WebDriverWait(driver, 25).until(EC.url_matches(r"^https://www\.yahoo\.com/.*"))

Once your username and password (together with 2FA) have been validated, you can work the magic and pull all the data you are looking for.

In my example, the script covers the following pages:

Valuation Measures   Annual | Quarterly | Monthly
Historical Data      Date range MAX
Financials           Income Statement (Annual | Quarterly)
                     Balance Sheet (Annual | Quarterly)
                     Cash Flow (Annual | Quarterly)

To achieve this, I will write some helper functions to aid the process.

####################
# Helper Functions #
####################
def get_latest_file(directory):
    """Return the newest entry in a directory (by os.path.getctime), or None if it is empty."""
    files = [os.path.join(directory, f) for f in os.listdir(directory)]
    return max(files, key=os.path.getctime) if files else None

# Move a downloaded file into a folder named after its ticker
def move_file_to_folder(file_path, ticker):
    """Move the downloaded file to dest_folder/<ticker>."""
    try:
        destination_folder = dest_folder / ticker
        # Create the folder if it doesn't exist
        destination_folder.mkdir(parents=True, exist_ok=True)

        # Move files to the designated folder
        new_location = destination_folder / os.path.basename(file_path)
        shutil.move(file_path, new_location)
        logging.info(f"Moved {os.path.basename(file_path)} to {new_location}")
    except Exception as e:
        logging.error(f"Error moving file for ticker {ticker}: {e}")

Code breakdown:

The get_latest_file and move_file_to_folder functions organize downloaded files by moving them to folders named after each ticker symbol. This structure makes it easy to locate files by company.
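
To try the file-organizing step without opening a browser, the sketch below recreates it in a temporary directory. The two functions mirror get_latest_file and move_file_to_folder but are not the originals: they take the destination root as a parameter and pick the newest file by modification time, both of which are choices made here to keep the example self-contained. The file name AAPL.csv is made up.

```python
import shutil
import tempfile
from pathlib import Path


def latest_file(directory):
    """Return the most recently modified regular file in a directory, or None."""
    paths = [p for p in Path(directory).iterdir() if p.is_file()]
    return max(paths, key=lambda p: p.stat().st_mtime) if paths else None


def move_to_ticker_folder(file_path, dest_root, ticker):
    """Move a file into dest_root/<ticker>/ and return its new location."""
    target_dir = Path(dest_root) / ticker
    target_dir.mkdir(parents=True, exist_ok=True)
    new_location = target_dir / Path(file_path).name
    shutil.move(str(file_path), str(new_location))
    return new_location


with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp) / "Downloads"
    downloads.mkdir()
    # Stand-in for a file the browser has just downloaded
    (downloads / "AAPL.csv").write_text("Date,Close\n")

    moved = move_to_ticker_folder(latest_file(downloads), Path(tmp) / "data", "AAPL")
    print(moved.relative_to(tmp))  # the file now sits under data/AAPL/
```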

Navigating Between Financial Statements

To extract financial data, such as income statements and balance sheets, the script navigates through tabs on each ticker’s page. We define helper functions that switch between “Quarterly” and “Annual” views, making it easier to collect data across different time periods.

def switch_tab(tab_name):
    """Switch between Annual and Quarterly tabs."""
    try:
        tab_button = wait.until(EC.element_to_be_clickable((By.XPATH, f"//button[@id='tab-{tab_name.lower()}']")))
        tab_button.click()
        logging.info(f"Switched to {tab_name} tab")
        time.sleep(3)
    except Exception as e:
        logging.error(f"Error switching to {tab_name} tab: {e}")


def navigate_to_statement(ticker, statement):
    """Navigate to the Income Statement, Balance Sheet, or Cash Flow page."""
    try:
        driver.get(f"https://finance.yahoo.com/quote/{ticker}/{statement}/")
        logging.info(f"Navigated to {statement} for {ticker}")
        time.sleep(3)  # Wait for page to load
    except Exception as e:
        logging.error(f"Error navigating to {statement} for {ticker}: {e}")

def wait_for_table_data():
    """Wait for the historical data table to load with at least one row."""
    try:
        logging.info("Waiting for historical data table to load")
        # 2nd row (first is usually the header)
        table_row = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.XPATH, "//table//tr[2]")))
        logging.info("Table data loaded successfully")
    except Exception as e:
        logging.error(f"Error waiting for table data: {e}")
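
The statements list from the setup section pairs naturally with navigate_to_statement. As a rough sketch, the helper below builds the URLs that function would visit for one ticker, using the same URL pattern as the code above. The name statement_urls is hypothetical.

```python
# Same three page names as the `statements` list in the setup section
STATEMENTS = ("financials", "balance-sheet", "cash-flow")


def statement_urls(ticker, statements=STATEMENTS):
    """Return the Yahoo Finance URL of each statement page for a ticker."""
    return [f"https://finance.yahoo.com/quote/{ticker}/{s}/" for s in statements]


for url in statement_urls("AAPL"):
    print(url)
```

In the scraper itself, the equivalent would be a loop such as for statement in statements: navigate_to_statement(ticker, statement), followed by the download calls for each view.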

Downloading Data from Different Sections

The download_data function clicks the download button available on each page to save data locally. Error handling is essential here, as the function logs any failures in locating elements or initiating downloads.

def download_data(ticker):
    """Click the download button on the current page and move the file to the ticker's folder."""
    try:
        logging.info(f"Attempting to download data for {ticker}")
        time.sleep(3)  # Give the page time to finish loading; the button can be present before the data is
        # Wait for the download button and click it
        download_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[@data-testid='download-link']")))
        download_button.click()
        logging.info("Download initiated")
        time.sleep(2)  # Allow download to complete

        latest_file = get_latest_file(downloads_path)
        if latest_file and not latest_file.endswith(".crdownload"):
            move_file_to_folder(latest_file, ticker)
    except Exception as e:
        logging.error(f"Error downloading data for {ticker}: {e}")


# Switch to a period tab, then download its data
def switch_and_download(period, ticker):
    """Switch to the given period tab (e.g. Annual, Monthly) and download its data."""
    try:
        # Wait until the tab exists on the page
        wait.until(EC.presence_of_element_located((By.ID, f"tab-{period.lower()}")))
        time.sleep(2)  # Give the page time to finish loading
        # Switch to the required tab
        period_button = wait.until(EC.element_to_be_clickable((By.ID, f"tab-{period.lower()}")))
        period_button.click()
        logging.info(f"Switched to {period} tab for {ticker}")
        # Ensure the tab content loads before downloading
        time.sleep(2)
        # Download data after switching tabs
        download_data(ticker)
    except Exception as e:
        logging.error(f"Error switching to {period} tab for {ticker}: {e}")

Each download is followed by a short fixed wait (2 seconds) before the newest file is moved to the designated folder. The script also checks whether that file has a .crdownload extension, which indicates an incomplete download; in that case the file is left in the downloads folder rather than moved.
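
A fixed sleep can be too short for large files or slow connections. One alternative is to poll the downloads folder until a new file without the .crdownload suffix shows up. The sketch below illustrates the idea; wait_for_download and its parameters are hypothetical, and it assumes Chrome's convention of writing partial downloads as <name>.crdownload.

```python
import time
from pathlib import Path


def wait_for_download(directory, before, timeout=30.0, poll=0.5):
    """Wait until a new, fully written file appears in `directory`.

    `before` is the set of file names that existed before the download
    was triggered. Returns the new file's Path, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new_files = [
            p for p in Path(directory).iterdir()
            if p.is_file() and p.name not in before
        ]
        # Ignore partial downloads that Chrome is still writing
        finished = [p for p in new_files if p.suffix != ".crdownload"]
        if finished:
            return max(finished, key=lambda p: p.stat().st_mtime)
        time.sleep(poll)
    return None
```

In download_data this could take the place of the fixed time.sleep(2): record before = set(os.listdir(downloads_path)) just before download_button.click(), then pass the returned path to move_file_to_folder.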

Retrieving Historical Data and Max Date Range

Historical data includes open-high-low-close-volume (OHLCV) data. To get the full price history, the script uses the select_max_date_range function to choose the “Max” range. The date picker button is identified by its title, which is the currently displayed date range, so generate_date_range rebuilds that string first.

def generate_date_range():
    """Generate the current date range string (e.g., 'Oct 26, 2023 - Oct 26, 2024')."""
    today = datetime.today()
    # Range start: same month and day, one year earlier
    start_date = f"{today.strftime('%b')} {today.day}, {today.year - 1}"
    # Range end: today
    end_date = today.strftime("%b %d, %Y")
    logging.info(f"Generating date range {start_date} - {end_date}")
    return f"{start_date} - {end_date}"


def select_max_date_range():
    """Select the 'Max' date range option."""
    try:
        date_range = generate_date_range()
        logging.info(f"Looking for date range: {date_range}")
        # Find the button with the dynamically generated date range
        date_button = wait.until(EC.element_to_be_clickable((By.XPATH, f"//button[@title='{date_range}']")))
        date_button.click()
        max_option = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[@value='MAX']")))
        max_option.click()
        logging.info("Selected 'Max' date range")
        time.sleep(1)
    except Exception as e:
        logging.error(f"Error selecting Max date range: {e}")
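
One detail worth checking: generate_date_range writes the first date with today.day (no leading zero) and the second with %d (zero-padded), so on single-digit days the two halves are formatted differently. Which form the button title really uses is not confirmed here. The sketch below makes the choice explicit so both can be tried; format_date_range and zero_pad_day are hypothetical names, and the date is passed in so the output is reproducible.

```python
from datetime import date


def format_date_range(today, zero_pad_day=False):
    """Build a 'Mon D, YYYY - Mon D, YYYY' string covering the past year."""
    day = f"{today.day:02d}" if zero_pad_day else str(today.day)
    month = today.strftime("%b")
    return f"{month} {day}, {today.year - 1} - {month} {day}, {today.year}"


print(format_date_range(date(2024, 11, 5)))
# Nov 5, 2023 - Nov 5, 2024
print(format_date_range(date(2024, 11, 5), zero_pad_day=True))
# Nov 05, 2023 - Nov 05, 2024
```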

Running the Full Script

The script loops over each ticker symbol, navigating to Yahoo Finance’s data sections for each one. This approach allows for seamless scraping across multiple stocks.

Here’s a glimpse into the main loop:

##########
# Step 2:#
##########

for ticker in tickers:
    try:
        # 1. Download Key Statistics
        driver.get(f"https://finance.yahoo.com/quote/{ticker}/key-statistics/")
        logging.info(f"Downloading Quarterly data for {ticker} (default view)...")
        # Get default page (quarterly).
        download_data(ticker)

        # Switch to Annual and Monthly tabs and download data
        for period in ["Annual", "Monthly"]:
            logging.info(f"Downloading {period} data for {ticker}...")
            switch_and_download(period, ticker)
            time.sleep(2)  # Give time for each download to complete

        # 2. Download historical data (OHLCV)
        driver.get(f"https://finance.yahoo.com/quote/{ticker}/history/")
        time.sleep(3)  # Allow page to be fully loaded, safer
        logging.info(f"Downloading history OHLCV data for {ticker}...")
        select_max_date_range()
        wait_for_table_data()
        download_data(ticker)

        # 3. Download financials - Income Statement
        driver.get(f"https://finance.yahoo.com/quote/{ticker}/financials/")
        logging.info(f"Downloading Annual and Quarterly financials data for {ticker}...")
        download_data(ticker)
        switch_tab("Quarterly")
        download_data(ticker)

        # 4. Download financials - Balance Sheet
        driver.get(f"https://finance.yahoo.com/quote/{ticker}/balance-sheet/")
        logging.info(f"Downloading Annual and Quarterly balance-sheet data for {ticker}...")
        download_data(ticker)
        switch_tab("Quarterly")
        download_data(ticker)

        # 5. Download financials - Cash Flow
        driver.get(f"https://finance.yahoo.com/quote/{ticker}/cash-flow/")
        logging.info(f"Downloading Annual and Quarterly cash-flow data for {ticker}...")
        download_data(ticker)
        switch_tab("Quarterly")
        download_data(ticker)
    except Exception as e:
        logging.error(f"Error processing {ticker}: {e}")

# Close the browser
driver.quit()
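
With roughly ten thousand tickers, a run is likely to be interrupted at some point. A small sketch of making the loop resumable follows. It assumes that a non-empty folder under the destination root means the ticker was already processed, which is a simplification: a ticker that was only partly downloaded would be skipped too. The name pending_tickers is hypothetical.

```python
from pathlib import Path


def pending_tickers(tickers, dest_root):
    """Return the tickers that do not yet have a non-empty folder under dest_root."""
    root = Path(dest_root)
    pending = []
    for ticker in tickers:
        folder = root / ticker
        # Treat a folder with at least one entry as "already processed"
        already_done = folder.is_dir() and any(folder.iterdir())
        if not already_done:
            pending.append(ticker)
    return pending
```

The main loop could then iterate over pending_tickers(tickers, dest_folder) instead of tickers, so a restart picks up where the previous run stopped.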

Wrap up

After processing all tickers, the script closes the browser. Because every error is logged, you can review yahoo_finance_scraper.log to see which tickers or pages failed and handle those issues as they arise. In the end, you’ll have a well-organized dataset ready for further analysis.

References

Selenium with Python

GitHub

scrapYahooFin.py

As always, happy PiPing peeps!