Create a full list of stocks traded on NASDAQ

by Num3Ilia 20 May 2024

Finding a complete and reliable list of stocks traded on NASDAQ can be a challenging task, but after extensive searching, I found an excellent resource that we can leverage: stockanalysis.com.

This site provides all the essential information we need, including "Company Name," "Symbols," and more. This data will form the foundation for creating a detailed stock list, which can later be enriched with additional data sources to develop a comprehensive view of each company from fundamental, technical, and sentiment perspectives.

In this blog post, I will guide you through the process of using Python, BeautifulSoup, and Selenium to scrape this data and output it into a structured spreadsheet. This dataset will serve as a valuable resource for anyone looking to analyze stock performance and make informed investment decisions.

Tools and Libraries

To accomplish this task, we will use the following tools and libraries:

  • Python: Our primary programming language.
  • BeautifulSoup: A library for parsing HTML and extracting data from web pages.
  • Selenium: A tool for automating web browsers, allowing us to handle dynamic content.
  • Pandas: A powerful data manipulation library that will help us create and manage the spreadsheet.

Setting Up the Environment

Before we begin, ensure you have Python installed on your machine. Next, install the necessary libraries using pip:

pip install beautifulsoup4 selenium pandas
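
To confirm the installation worked, a quick sanity check that each library imports and reports its version:

import bs4, selenium, pandas
print(bs4.__version__, selenium.__version__, pandas.__version__)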


Additionally, you need a WebDriver so Selenium can interact with your web browser. For instance, if you are using Chrome, download the ChromeDriver build that matches your Chrome version and place it in a directory included in your system's PATH (recent Selenium releases can also fetch a matching driver automatically via Selenium Manager).
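
Before moving on, a quick smoke test to confirm Selenium can actually drive your browser headlessly (this assumes Chrome; swap in Firefox and geckodriver if that is your setup):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
driver.get("https://stockanalysis.com")
print(driver.title)  # A printed page title means the setup works
driver.quit()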

Beautiful, now let's set sail!

The Scraping Script

Now, let's dive into the script. This script will scrape the stock data from stockanalysis.com, parse the HTML content with BeautifulSoup, and save the data into a CSV file using Pandas.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import pandas as pd

# Set up Selenium WebDriver with the headless (invisible) option for scraping
options = webdriver.ChromeOptions()
options.add_argument('--headless')

# Initialize the WebDriver (Chrome, Firefox, etc. can be used)
driver = webdriver.Chrome(options=options)

# Page to be scraped
url = "https://stockanalysis.com/list/nasdaq-stocks/"

# Number of subpages the site currently uses to present all NASDAQ symbols
pages = 7

# List to store all table rows, one by one
all_rows = []

# Navigate to the URL
driver.get(url)

for i in range(1, pages + 1):
    # Wait for the page and elements to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "tbody"))
    )

    # Pull the content of the page into a variable
    html_content = driver.page_source
    
# Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Extract data, skipping the first 'td' using slicing inside the loop
    tbody = soup.find('tbody')    
    for tr in tbody.find_all('tr'):
        row = [td.text for td in tr.find_all('td')[1:]]  # Skipping the first td here
        all_rows.append(row)

    # Handle 'Next' button click if not last page
    if i < pages:
        next_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "//button[contains(.,'Next')]"))
        )
        next_button.click()
        time.sleep(1)  # Wait a moment for the page to load

# Close the driver
driver.quit()

# Create DataFrame after collecting all rows
stocks = pd.DataFrame(all_rows, columns=["Symbol","CompanyName","MarketCap","StockPrice","%Change","Revenue"])

#print(stocks)

# Output the dataframe into a CSV without index
stocks.to_csv("nasdaqStocks.csv", index=False)


Explanation of the Script

  1. Set Up Selenium WebDriver: We configure Selenium to use Chrome in headless mode, which means the browser window will not be displayed.
  2. Load the Pages: Selenium navigates to the target URL, waits for the table to load, and clicks the 'Next' button to step through each of the 7 subpages (see the sketch below for a way to avoid hardcoding that count).
  3. Parse the HTML Content: BeautifulSoup parses the page source fetched by Selenium.
  4. Extract Data: We locate the table body containing the stock data and extract each row, skipping the first cell (a row counter).
  5. Create DataFrame: Using Pandas, we create a DataFrame from the extracted rows with hardcoded column names.
  6. Save to CSV: Finally, we save the DataFrame to a CSV file for easy access and further analysis.
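
One fragile spot is the hardcoded pages = 7: if stockanalysis.com changes how many subpages it splits the list into, the loop will silently miss data. Here is a sketch of a drop-in replacement for the for-loop above (reusing the same imports and driver), under the assumption that the site removes or disables the 'Next' button on the last page; verify that against the live markup before relying on it:

all_rows = []
driver.get(url)

while True:
    # Wait for the table body to be present
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "tbody"))
    )

    # Parse the current page and collect its rows
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for tr in soup.find('tbody').find_all('tr'):
        all_rows.append([td.text for td in tr.find_all('td')[1:]])

    # Stop when 'Next' is missing or disabled (an assumption about the site's markup)
    next_buttons = driver.find_elements(By.XPATH, "//button[contains(.,'Next')]")
    if not next_buttons or not next_buttons[0].is_enabled():
        break
    next_buttons[0].click()
    time.sleep(1)  # Give the table a moment to refresh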

This is even more fun if we play a little with the scraped data, so let's make some changes:

# Let's have some fun with the output
# I want a list of stocks whose revenue is in billions (B), sorted in descending order
stocks['BMK'] = stocks['Revenue'].str.extract(r'([BMK])$')

# Replace the NaN values (which come from '-' in the Revenue column) with 'N/A'
stocks['BMK'] = stocks['BMK'].fillna('N/A')

# Function to strip 'B', 'M', or 'K' from each string in the Revenue column
def remove_bmk_suffix(value):
    if value == '-':
        return '0'  # Replace placeholder '-' with '0'
    return value.rstrip('BMK')  # Strip 'B', 'M', or 'K' from the right


# Apply the function to remove the suffixes from the 'Revenue' column
stocks['Revenue'] = stocks['Revenue'].apply(remove_bmk_suffix)

# Convert the 'Revenue' column to a numeric type so we can compare and sort it
stocks['Revenue'] = pd.to_numeric(stocks['Revenue'], errors='coerce')

# Filter rows where 'BMK' is 'B' and Revenue is greater than 10 (i.e., more than $10B)
billions_df = stocks[(stocks['BMK'] == 'B') & (stocks['Revenue'] > 10)]

# Sort rows by 'Revenue' in descending order
sorted_billions_df = billions_df.sort_values(by='Revenue', ascending=False)

# Capture the result in a dataframe for further use
result = sorted_billions_df[['Symbol', 'CompanyName', 'Revenue']]


Explanation of the Script

  • Data Processing:
    • I created another column, BMK, which extracts the trailing letter (B, M, or K) from the Revenue column.
    • Some rows have no suffix at all, so the resulting NaN values in BMK are filled with "N/A".
    • A small function replaces the '-' placeholder with 0 and strips a trailing B, M, or K.
    • Applying that function cleans the Revenue column so only numbers remain, which lets us sort it in ascending or descending order.
    • Finally, the Revenue column is converted from object to numeric.
  • Filter and Sort Data:
    • Filter: select rows where the BMK column is 'B' (billions) and Revenue is greater than 10.
    • Sort: sort the filtered DataFrame by Revenue in descending order.
    • Capture Result: create a new DataFrame containing only the columns of interest (Symbol, CompanyName, Revenue).
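
A side note before we look at the output: stripping the suffix makes Revenue values with different units incomparable (236.58 came from "236.58B" while 65.36 came from "65.36M"). If you would rather normalize everything to one scale than filter by BMK, here is a small sketch that expresses all revenues in billions (the RevenueB column name is just my choice for this example):

# Map each suffix to a multiplier relative to billions
multipliers = {'B': 1, 'M': 1e-3, 'K': 1e-6, 'N/A': 0}

# Scale each revenue by its suffix so every row is expressed in billions
stocks['RevenueB'] = stocks['Revenue'] * stocks['BMK'].map(multipliers)

# A plain sort now works across all rows, regardless of the original suffix
top10 = stocks.sort_values(by='RevenueB', ascending=False).head(10)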

The output of our work looks like this:

# Symbol CompanyName MarketCap StockPrice %Change Revenue BMK
0 MSFT Microsoft Corporation 3,145.65B 423.13 1.76% 236.58 B
1 AAPL Apple Inc 2,914.02B 190.04 1.39% 381.62 B
2 NVDA NVIDIA Corporation 2,367.63B 946.50 3.61% 60.92 B
3 GOOGL Alphabet Inc. 2,138.36B 172.09 1.02% 318.15 B
4 GOOG Alphabet Inc. 2,136.05B 173.59 0.96% 318.15 B
... ... ... ... ... ... ... ...
3378 SMX SMX (Security Matters) PLC 685.80K 0.12 -1.31% 0.00 N/A
3379 CETX Cemtrex, Inc. 483.35K 0.30 1.47% 65.36 M
3380 JFBR Jeffs' Brands Ltd 407.65K 0.29 27.44% 10.00 M
3381 BDRX Biodexa Pharmaceuticals Plc 262.53K 1.11 0.91% 482.28 K
3382 AIMAU Aimfinity Investment Corp. I 115.44K 11.22 -0.36% 0.00 N/A
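
If you want to keep the filtered billion-revenue list around for later, it can be written out the same way as before (the file name here is just an example):

result.to_csv("nasdaqBillionRevenue.csv", index=False)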


Wrap Up

With this script, you can now scrape and compile a comprehensive list of NASDAQ-traded stocks. This dataset will be a valuable asset for conducting various analyses and making informed investment decisions. In future posts, we can explore how to enhance this dataset with additional fundamental, technical, and sentiment data using APIs and further scraping techniques.


GitHub

scrapNasdaqStocks.py


Happy PiPing, folks!