Gender Influence in Code Generation

Examining Prompting Styles and Bias in Large Language Models

Master's Thesis Research Project — An empirical study investigating whether and how gender shapes the way people prompt AI coding assistants, and whether these differences translate into measurable differences in generated code quality.

Overview

This repository contains the full analytical pipeline for a study in which participants solved real-world programming tasks using LLMs (ChatGPT, Claude). Their conversations were collected via an online survey, stored in a structured database, and analysed across three dimensions:

| Dimension | What is measured |
|---|---|
| Prompt Linguistics | Writing style, tone, length, grammar, politeness, n-grams, request type, sentiment |
| Code Quality | Pylint score, Radon cyclomatic complexity, maintainability index |
| Gender Prediction | Logistic Regression, Support Vector Machine, Fine-tuned RoBERTa classifier trained on user prompts |

Research Questions

1. Prompting Style: Do cisgender men and women differ in how they write prompts to LLMs (length, formality, politeness markers, sentence structure, request framing)?
2. Code Quality: Does the gender of the prompter correlate with the quality of the LLM-generated code?
3. Gender Predictability: Can a machine learning model reliably predict a user's gender from their prompts alone?

Data Pipeline

Online Survey (LimeSurvey)
        │
        ▼
Playwright Scraper ──► Raw chat HTML from ChatGPT / Claude share links
        │
        ▼
Importer ──► SQLite (giicg.db)
        │
        ├── Language Detection  (xlm-roberta-base-language-detection)
        ├── Translation DE/IT → EN  (HuggingFace Helsinki-NLP)
        ├── Spelling Correction  (oliverguhr/spelling-correction-english-base)
        └── Contraction Expansion
        │
        ▼
Prompt Parser (GPT-4o via LangChain)
    Segments each user message into:
      conversational | code | other
        │
        ▼
Analysis Notebooks
    ├── Linguistic analyses (spaCy, statsmodels, scipy, pingouin)
    ├── Code quality  (Pylint, Radon)
    └── Gender prediction  (RoBERTa fine-tune, LIME)
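
The importer step in the pipeline above writes scraped conversations to SQLite. The schema of `giicg.db` is not documented in this README, so the table and column names below (`conversations`, `messages`, `participant_id`, `position`, `role`) are hypothetical. This is only a minimal sketch of how one conversation might be stored and read back with the standard-library `sqlite3` module, not the project's actual importer.

```python
import sqlite3

# Hypothetical schema: the real layout of giicg.db may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS conversations (
    id INTEGER PRIMARY KEY,
    participant_id TEXT NOT NULL,
    source TEXT NOT NULL            -- e.g. 'chatgpt' or 'claude'
);
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY,
    conversation_id INTEGER NOT NULL REFERENCES conversations(id),
    position INTEGER NOT NULL,      -- order of the message in the chat
    role TEXT NOT NULL,             -- 'user' or 'assistant'
    text TEXT NOT NULL
);
"""


def import_conversation(conn, participant_id, source, turns):
    """Store one conversation; `turns` is a list of (role, text) pairs."""
    cur = conn.execute(
        "INSERT INTO conversations (participant_id, source) VALUES (?, ?)",
        (participant_id, source),
    )
    conversation_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO messages (conversation_id, position, role, text) "
        "VALUES (?, ?, ?, ?)",
        [(conversation_id, i, role, text) for i, (role, text) in enumerate(turns)],
    )
    conn.commit()
    return conversation_id


def user_prompts(conn, conversation_id):
    """Return the user messages of a conversation in chat order."""
    rows = conn.execute(
        "SELECT text FROM messages "
        "WHERE conversation_id = ? AND role = 'user' ORDER BY position",
        (conversation_id,),
    )
    return [text for (text,) in rows]


# In-memory database for the example; the project uses a file (giicg.db).
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
cid = import_conversation(
    conn, "p001", "chatgpt",
    [("user", "Write a function that reverses a string."),
     ("assistant", "def reverse(s): return s[::-1]"),
     ("user", "Thanks, can you add type hints?")],
)
print(user_prompts(conn, cid))
```

Keeping a `position` column makes it possible to restore the chat order later, which matters when only the user turns are pulled out for the linguistic analyses.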

Statistical Methods

Group comparison: Welch's t-test, Mann-Whitney U, Fisher's exact test
Effect sizes: Cohen's d, odds ratio
Multiple testing correction: Bonferroni
Normality checks: Shapiro-Wilk
Explainability: LIME (Local Interpretable Model-agnostic Explanations)
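
To make the quantities listed above concrete, here is a minimal standard-library sketch of Welch's t statistic, Cohen's d, and the Bonferroni correction. The notebooks rely on `scipy`, `statsmodels`, and `pingouin` for these; the function names and the toy numbers below are invented for illustration and are not study data. Turning the t statistic into a p-value needs the t distribution, which is left out here.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, reject?) for each raw p-value."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p_adj, p_adj <= alpha) for p_adj in adjusted]


# Invented example values (e.g. prompt lengths in tokens for two groups).
group_a = [12, 15, 11, 18, 14]
group_b = [10, 9, 13, 8, 10]

t_stat, dof = welch_t(group_a, group_b)
effect = cohens_d(group_a, group_b)
decisions = bonferroni([0.01, 0.04, 0.2])

print(f"t = {t_stat:.3f}, df = {dof:.2f}, d = {effect:.3f}")
print(decisions)
```

Welch's variant is used instead of Student's t-test because it does not assume equal variances in the two groups. Bonferroni multiplies each p-value by the number of tests, so it becomes conservative as the number of comparisons grows.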

Notebooks

`notebooks/prompt_analysis/` — Linguistic Prompt Analyses

| Notebook | Description |
|---|---|
| `00_Power_Analysis.ipynb` | Sample size and statistical power calculation |
| `00_Mask_Prompts.ipynb` | Anonymisation of prompts for modelling |
| `01_PromptLength_Raw_Prompt.ipynb` | Token & character length analysis (raw prompts) |
| `01_PromptLength_Conversational.ipynb` | Length analysis on conversational prompt segments |
| `02_Top_Used_Ngrams.ipynb` | Most frequent uni-/bi-/trigrams by gender group |
| `03_Grammar_Spelling.ipynb` | Grammatical error rates and spelling correction analysis |
| `03_Punctuation.ipynb` | Punctuation usage patterns |
| `03_Word Count Analyses.ipynb` | Vocabulary richness and word-count statistics |
| `04_Request_Type.ipynb` | Informational vs. involved request classification |
| `04_Sentiment.ipynb` | Sentiment polarity analysis |
| `06_Involved_Informational.ipynb` | Deep-dive into involved/informational language dimensions |
| `10_Communication_Objectives_Quality.ipynb` | LLM-judged communication quality |
| `10_Rating_Communication_Quality.ipynb` | Manual & automated communication quality ratings |
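
As an illustration of the kind of count behind the n-gram analysis listed above, the following is a minimal sketch using only the standard library. It assumes naive lowercase whitespace tokenisation as a stand-in for the spaCy pipeline; the function names, the group labels `A` and `B`, and the example prompts are all invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams_by_group(prompts, n, k=3):
    """prompts: iterable of (group, text). Returns {group: [(ngram, count), ...]}."""
    counts = {}
    for group, text in prompts:
        tokens = text.lower().split()  # naive whitespace tokenisation
        counts.setdefault(group, Counter()).update(ngrams(tokens, n))
    return {group: counter.most_common(k) for group, counter in counts.items()}


# Invented prompts with placeholder group labels.
example_prompts = [
    ("A", "Can you please fix this"),
    ("A", "Can you explain this"),
    ("B", "Fix this bug"),
    ("B", "Fix this now"),
]

top_bigrams = top_ngrams_by_group(example_prompts, n=2, k=1)
print(top_bigrams)
```

Raw counts depend on how much text each group contributed, so comparing groups usually calls for normalising by the total number of n-grams per group.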

`notebooks/code_analysis/` — Code Quality Analyses

| Notebook | Description |
|---|---|
| `01_Select_Prompt_Candidates.ipynb` | Filter & select prompts with extractable code |
| `02_Translate.ipynb` | Translate non-English code comments / docstrings |
| `03_Run_Prompts.ipynb` | Re-run prompts against LLM APIs for reproducibility |
| `04_Parse_code.ipynb` | Extract and store code blocks from messages |
| `05_Satisfaction.ipynb` | User satisfaction ratings analysis |
| `06_Scores_On_All_Codeblocks.ipynb` | Aggregate Pylint + Radon scores |
| `07_Pylint_Radon.ipynb` | Detailed linting and complexity analysis |
| `08_Pylint_Codes.ipynb` | Breakdown of individual Pylint error/warning codes |
| `09_Code_Quality_X_Gender_Request_Type.ipynb` | Code quality by gender × request type |
| `10_LLM_as_a_Judge_CoT.ipynb` | Chain-of-thought LLM evaluation of code quality |
| `11_Code_Quality_Correlations.ipynb` | Correlations between code quality metrics |
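
Cyclomatic complexity, one of the metrics these notebooks obtain from Radon, counts the independent paths through a function. The sketch below is not Radon's implementation. It is a simplified approximation built on the standard-library `ast` module, meant only to show what the metric measures; the function name `approx_complexity` and the sample code are invented, and nested functions are counted towards their enclosing function as well.

```python
import ast

# Node types counted as one extra decision point each (a simplification).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def approx_complexity(source):
    """Approximate cyclomatic complexity per function: 1 + decision points."""
    results = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = 1
            for child in ast.walk(node):
                if isinstance(child, _BRANCH_NODES):
                    score += 1
                elif isinstance(child, ast.BoolOp):
                    # `a and b and c` adds two extra paths
                    score += len(child.values) - 1
            results[node.name] = score
    return results


SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    elif n == 0 or n is None:
        return "zero"
    for d in range(2, n):
        if n % d == 0:
            return "composite"
    return "prime"
'''

scores = approx_complexity(SAMPLE)
print(scores)
```

Higher values indicate more branching and therefore more paths to test. Scores from Radon itself may differ from this approximation for constructs the sketch does not handle.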

`notebooks/prediction/` — Gender Prediction

| Notebook | Description |
|---|---|
| `07_Gender_Prediction.ipynb` | Baseline gender prediction experiments |
| `08_Dataset_for_Roberta.ipynb` | Dataset preparation for RoBERTa fine-tuning |
| `08_Roberta_Per_Prompt_*.ipynb` | Per-prompt RoBERTa fine-tuning (standard, masked, and Colab variants) |
| `08_Roberta_Per_User.ipynb` | Per-user aggregated prediction |
| `08_Roberta_Hyperparam_Search.ipynb` | Hyperparameter optimisation |
| `08_Lime_Explainability*.ipynb` | LIME-based model explainability (masked & unmasked) |
| `09_Test_Roberta.ipynb` | Final model evaluation |
| `Push_Model_To_Hub.ipynb` | Upload fine-tuned model to HuggingFace Hub |

Tech Stack

| Category | Libraries |
|---|---|
| Data & Storage | `pandas`, `numpy`, SQLite |
| Web Scraping | `playwright`, `beautifulsoup4` |
| NLP | `spacy` (`en_core_web_sm`), `transformers`, `wtpsplit`, `contractions` |
| LLM APIs | `openai` (GPT-4o), `anthropic` (Claude), `langchain` |
| ML / Deep Learning | `torch`, `scikit-learn`, `adapters`, `datasets` |
| Statistics | `scipy`, `statsmodels`, `pingouin` |
| Code Analysis | `pylint`, `radon` |
| Explainability | `lime` |
| Visualisation | `matplotlib`, `seaborn` |

⚙️ Setup

1. Install dependencies

pip install -r requirements.txt

2. Download spaCy language model

python -m spacy download en_core_web_sm

3. Configure API keys

Create a `.env` file in the project root:

OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
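
Before running notebooks that call the LLM APIs, it can help to check that both keys are present. This README does not say how the project reads the `.env` file, so the following is only a standard-library sketch under that caveat; `load_env` and `missing_keys` are hypothetical helpers, not functions from this repository.

```python
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def load_env(path):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Demo on a temporary file so the example does not touch a real .env.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "OPENAI_API_KEY=your_openai_key_here\n# ANTHROPIC key not set yet\n",
        encoding="utf-8",
    )
    env = load_env(env_path)
    missing = missing_keys(env)

print(missing)
```

In a real checkout, `load_env(".env")` would be called from the project root, and an empty list from `missing_keys` means both keys are set.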

4. Launch notebooks

jupyter notebook

📄 License

This project is part of a Master's thesis. All code is provided for academic transparency. The survey dataset is not redistributed, to protect participant privacy.
