Finding a complete and reliable list of stocks traded on NASDAQ can be a challenging task, but after extensive searching, I found an excellent resource that we can leverage: stockanalysis.com.
This site provides all the essential information we need, including "Company Name," "Symbols," and more. This data will form the foundation for creating a detailed stock list, which can later be enriched with additional data sources to develop a comprehensive view of each company from fundamental, technical, and sentiment perspectives.
In this blog post, I will guide you through the process of using Python, BeautifulSoup, and Selenium to scrape this data and output it into a structured spreadsheet. This dataset will serve as a valuable resource for anyone looking to analyze stock performance and make informed investment decisions.
To accomplish this task, we will use the following tools and libraries: Selenium (to drive the browser), BeautifulSoup (to parse the HTML), and Pandas (to structure and export the data).
Before we begin, ensure you have Python installed on your machine. Next, install the necessary libraries using pip:
pip install beautifulsoup4 selenium pandas
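If you want to confirm everything installed correctly, a quick import check does the job:

# Quick sanity check: the imports resolve and the installed versions print
import selenium, bs4, pandas
print(selenium.__version__, bs4.__version__, pandas.__version__)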
Additionally, you need a WebDriver for Selenium to interact with your web browser. For instance, if you are using Chrome, download ChromeDriver from the official ChromeDriver download page and place it in a directory included in your system's PATH (recent Selenium versions, 4.6 and later, can also fetch a matching driver automatically via Selenium Manager).
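Before wiring up the full scraper, it is worth confirming that Selenium can actually drive a headless Chrome. The snippet below is just a sanity check: it opens the target page and prints its title.

from selenium import webdriver

# Start headless Chrome, load the target page, and print its title
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get("https://stockanalysis.com/list/nasdaq-stocks/")
print(driver.title)  # A printed title means Selenium and the driver are wired up correctly
driver.quit()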
Beautiful, now let's set sail!
Now, let's dive into the script. It will scrape the stock data from stockanalysis.com, parse the HTML content with BeautifulSoup, and save the data into a CSV file using Pandas.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import pandas as pd

# Set up the Selenium WebDriver with the headless (invisible) option for scraping
options = webdriver.ChromeOptions()
options.add_argument('--headless')

# Initialize the webdriver (Chrome, Firefox, etc. can be used)
driver = webdriver.Chrome(options=options)

# Page which will be scraped
url = "https://stockanalysis.com/list/nasdaq-stocks/"

# Number of subpages the site currently uses to present all NASDAQ symbols
pages = 7

# List to store all table rows, one by one
all_rows = []

# Navigate to the URL
driver.get(url)

for i in range(1, pages + 1):
    # Wait for the page and elements to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "tbody"))
    )

    # Pull the content of the page into a variable
    html_content = driver.page_source

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract data, skipping the first 'td' (the row number) in each row
    tbody = soup.find('tbody')
    for tr in tbody.find_all('tr'):
        row = [td.text for td in tr.find_all('td')[1:]]  # Skipping the first td here
        all_rows.append(row)

    # Handle the 'Next' button click if this is not the last page
    if i < pages:
        next_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "//button[contains(.,'Next')]"))
        )
        next_button.click()
        time.sleep(1)  # Wait a moment for the next page to load

# Close the driver
driver.quit()

# Create the DataFrame after collecting all rows
stocks = pd.DataFrame(all_rows, columns=["Symbol", "CompanyName", "MarketCap", "StockPrice", "%Change", "Revenue"])
# print(stocks)  # Uncomment to inspect the DataFrame before saving

# Output the DataFrame into a CSV without the index
stocks.to_csv("nasdaqStocks.csv", index=False)
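One caveat: pages = 7 is hardcoded, so the script will silently miss data once the listing grows beyond seven pages. As a sketch of a more robust alternative, the loop below keeps clicking 'Next' until the button is missing or disabled; note that the disabled-attribute check is my assumption about the site's markup, so verify it in the page source before relying on it.

# Sketch: replace the fixed-count loop above with one that stops when the
# 'Next' button is gone or disabled (an assumption about the site's markup)
while True:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "tbody"))
    )
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for tr in soup.find('tbody').find_all('tr'):
        all_rows.append([td.text for td in tr.find_all('td')[1:]])
    next_buttons = driver.find_elements(By.XPATH, "//button[contains(.,'Next')]")
    if not next_buttons or next_buttons[0].get_attribute("disabled") is not None:
        break
    next_buttons[0].click()
    time.sleep(1)  # Give the next page a moment to render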
This is even more fun if we play a little with the scraped data, so let's make some changes:
# Let's have some fun with the output
# Goal: a list of stocks whose revenue is in the billions, sorted in descending order
stocks['BMK'] = stocks['Revenue'].str.extract(r'([BMK])$')

# Replace the NaN values (which come from '-' in the Revenue column) with 'N/A'
stocks['BMK'] = stocks['BMK'].fillna('N/A')

# Function to strip the 'B', 'M', or 'K' suffix from each string in the Revenue column
def remove_bmk_suffix(value):
    if value == '-':
        return '0'  # Replace the placeholder '-' with '0'
    return value.rstrip('BMK')  # Strip 'B', 'M', or 'K' from the right

# Apply the function to remove the suffixes from the 'Revenue' column
stocks['Revenue'] = stocks['Revenue'].apply(remove_bmk_suffix)

# Convert the 'Revenue' column to a numeric type
stocks['Revenue'] = pd.to_numeric(stocks['Revenue'], errors='coerce')

# Filter rows where 'BMK' is 'B' and Revenue is greater than 10
billions_df = stocks[(stocks['BMK'] == 'B') & (stocks['Revenue'] > 10)]

# Sort rows by 'Revenue' in descending order
sorted_billions_df = billions_df.sort_values(by='Revenue', ascending=False)

# Capture the result in a DataFrame for further use
result = sorted_billions_df[['Symbol', 'CompanyName', 'Revenue']]
To recap what the code above does:

- Added a new BMK column, which extracts the trailing letter B, M or K from the Revenue column.
- Where BMK has no value (because Revenue was '-'), filled it with "N/A".
- Stripped the suffix from the Revenue column, allowing only numbers to be present, so we can juggle with descending/ascending order.
- Filtered rows where the BMK column is 'B' (Billions) and Revenue is greater than 10.
- Sorted by Revenue in descending order.

The output of our work looks like this:
| # | Symbol | CompanyName | MarketCap | StockPrice | %Change | Revenue | BMK |
|---|---|---|---|---|---|---|---|
| 0 | MSFT | Microsoft Corporation | 3,145.65B | 423.13 | 1.76% | 236.58 | B |
| 1 | AAPL | Apple Inc | 2,914.02B | 190.04 | 1.39% | 381.62 | B |
| 2 | NVDA | NVIDIA Corporation | 2,367.63B | 946.50 | 3.61% | 60.92 | B |
| 3 | GOOGL | Alphabet Inc. | 2,138.36B | 172.09 | 1.02% | 318.15 | B |
| 4 | GOOG | Alphabet Inc. | 2,136.05B | 173.59 | 0.96% | 318.15 | B |
| ... | ... | ... | ... | ... | ... | ... | |
| 3378 | SMX | SMX (Security Matters) PLC | 685.80K | 0.12 | -1.31% | 0.00 | N/A |
| 3379 | CETX | Cemtrex, Inc. | 483.35K | 0.30 | 1.47% | 65.36 | M |
| 3380 | JFBR | Jeffs' Brands Ltd | 407.65K | 0.29 | 27.44% | 10.00 | M |
| 3381 | BDRX | Biodexa Pharmaceuticals Plc | 262.53K | 1.11 | 0.91% | 482.28 | K |
| 3382 | AIMAU | Aimfinity Investment Corp. I | 115.44K | 11.22 | -0.36% | 0.00 | N/A |
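As a small follow-up idea (my own addition, not part of the script above): since the suffix now lives in its own BMK column, we can normalize every revenue figure onto one numeric scale and compare all tickers directly, assuming the site reports revenue in plain B/M/K units as shown:

# Map each suffix to a multiplier and express Revenue as an absolute number;
# 'N/A' is not in the dict, so map() yields NaN and fillna(0) turns it into 0
multipliers = {'B': 1e9, 'M': 1e6, 'K': 1e3}
stocks['RevenueAbs'] = stocks['Revenue'] * stocks['BMK'].map(multipliers).fillna(0)
top10 = stocks.sort_values(by='RevenueAbs', ascending=False).head(10)
print(top10[['Symbol', 'CompanyName', 'RevenueAbs']])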
With this script, you can now scrape and compile a comprehensive list of NASDAQ-traded stocks. This dataset will be a valuable asset for conducting various analyses and making informed investment decisions. In future posts, we can explore how to enhance this dataset with additional fundamental, technical, and sentiment data using APIs and further scraping techniques.
Happy PiPing, folks!