I am a trained accountant, and one of my greatest challenges in my transition to being an academic was my lack of coding ability and education.
I learned the below with the help of many friends, YouTube videos, and really dumb mistakes. I hope that whatever is documented here helps those in a similar situation and makes their transition more bearable. A lot of the below is also documentation for myself to refer back to.
If you're reading this and thinking, "wow, this is really obvious stuff," you are a much better coder than I am.
Admittedly, I used ChatGPT to help me with my code when starting out. Rather than wading through a dark fog, it helps to have a flashlight. That said, it doesn't get you all of the way there, and you ultimately have to know what you're doing.
I started off using R, but I ultimately found Python, through Jupyter Notebooks and VS Code, more user-friendly for data manipulation. I use Stata for my actual regression analyses, as that seems to be the standard in my discipline.
For some reason, whenever I tried connecting to WRDS through Python (specifically Jupyter Notebooks), it would always prompt me to create a .pgpass file, but it would never actually save my username and password for later use.
I then realized that calling wrds.Connection does not actually create the .pgpass file, and you need to generate it yourself with create_pgpass_file().
## Module Set up
import wrds
## Connect to WRDS
conn = wrds.Connection(wrds_username='username')
conn.create_pgpass_file()
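If the prompt keeps coming back, it can help to know what the file is supposed to look like. The sketch below is not part of the `wrds` package: `pgpass_entry` and `check_pgpass` are hypothetical helper names, and the host, port, and database defaults are assumptions based on the usual WRDS PostgreSQL settings. It relies only on the standard PostgreSQL password-file format, one `host:port:database:username:password` line per server, and on the rule that on Linux and macOS the file must not be readable by other users.

```python
import os
import stat
from pathlib import Path


def pgpass_entry(username, password,
                 host="wrds-pgdata.wharton.upenn.edu",  # assumed WRDS host
                 port=9737,                             # assumed WRDS port
                 database="wrds"):                      # assumed database name
    """Build one .pgpass line: host:port:database:username:password."""
    def esc(field):
        # PostgreSQL requires literal backslashes and colons to be escaped.
        return str(field).replace("\\", "\\\\").replace(":", "\\:")
    return ":".join(esc(f) for f in (host, port, database, username, password))


def check_pgpass(path):
    """Return 'missing', 'too permissive', or 'ok' for a password file."""
    path = Path(path)
    if not path.exists():
        return "missing"
    # On Unix, PostgreSQL ignores the file if group/others have any access.
    if os.name != "nt" and stat.S_IMODE(path.stat().st_mode) & 0o077:
        return "too permissive"
    return "ok"


# Example (does not touch your real file unless you point it there):
# print(check_pgpass(Path.home() / ".pgpass"))
```

On Linux and macOS the file normally lives at `~/.pgpass`; on Windows PostgreSQL looks for `%APPDATA%\postgresql\pgpass.conf` instead, which is worth checking if the credentials never seem to stick.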
The Python API for WRDS retrieves data by sending an SQL query to the WRDS database, and it took me forever to understand how the tables are named in that query.
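One way to take some of the guesswork out of table names is to ask the connection what a library contains. The sketch below assumes the `list_tables` and `describe_table` methods of the `wrds` package; `find_tables` and `split_table_name` are hypothetical helper names made up for illustration, not part of the package.

```python
def find_tables(conn, library, keyword=""):
    """Return 'library.table' names in `library` whose table name contains `keyword`."""
    tables = conn.list_tables(library=library)
    return [f"{library}.{t}" for t in sorted(tables) if keyword.lower() in t.lower()]


def split_table_name(qualified):
    """Split 'crsp.msf' into ('crsp', 'msf'): the library, then the table."""
    library, _, table = qualified.partition(".")
    return library, table


# Usage with a live connection (needs a WRDS login, so it is left commented out):
# conn = wrds.Connection(wrds_username='username')
# print(find_tables(conn, "crsp", "msf"))
# library, table = split_table_name("crsp.msf")
# print(conn.describe_table(library=library, table=table))  # lists the columns
```

The part before the dot is the library and the part after it is the table, which is the same `crsp.xxx` form that goes after `from` in the query.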
msf = conn.raw_sql("""
select a.permno, a.date, a.prc -- <---- These are your variables. I find them by going to WRDS and looking at what I want to pull.
from crsp.msf as a -- <---- This is the infuriating part of CRSP, trying to find the library you want to pull from. Go to the online portal where you want to pull data, then go to Data Preview. The table name there reads library.table (here crsp.msf). In Python, you pull it as crsp.xxx, where xxx is the part after the dot, and 'as a' gives the table the short alias a.
where a.date >= '2014-01-01'
and a.date < '2021-12-31'
-- indfmt, datafmt, popsrc, consol and datadate are Compustat columns (comp.funda). They do not exist in crsp.msf, so they are not used as filters here.
""", date_cols=['date'])