Expert ChatGPT Prompt for Data Science

ChatGPT is an AI chatbot developed by OpenAI. Built on a large language model, it generates human-like text on a wide variety of topics in response to a prompt, which gives it many potential applications in data science.

Data scientists can use ChatGPT prompts to automate repetitive tasks, accelerate workflows, and unlock creative ideas. However, it is critical to provide clear, specific, and unbiased prompts to get high-quality responses.

Here are some best practices for crafting effective ChatGPT prompts for data science projects:

Provide Context and Details

Give ChatGPT background information on the goal and scope of your project. For example:

I am working on a predictive modeling project to forecast daily sales for an ecommerce company. The dataset contains transaction records with features like product IDs, categories, prices, dates, etc.

Specifying key details helps ChatGPT understand your objective and tailor the response accordingly.

Ask Clear and Direct Questions

Break down broad requests into narrow, answerable questions. For instance:

Can you suggest some feature engineering ideas for the date field in my dataset?

Vague, open-ended questions often lead to generic or irrelevant responses from ChatGPT.

Give Examples to Guide the Response

Providing examples helps ChatGPT infer the type of response you need. Try:

Here is a sample record from my dataset:

{Product ID: 853, Category: Electronics, Price: $599, Date: 2022-12-01}. Can you suggest a way to engineer the date feature based on this example?
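
As a rough illustration of the kind of answer such a prompt might lead to, here is a minimal sketch that derives calendar features from the sample date using only the Python standard library. The function name `date_features` and the choice of features are assumptions made for this example, not part of any actual ChatGPT response.

```python
from datetime import date


def date_features(iso_date: str) -> dict:
    """Derive simple calendar features from an ISO-formatted date string."""
    d = date.fromisoformat(iso_date)
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "day_of_week": d.weekday(),          # Monday = 0, Sunday = 6
        "is_weekend": d.weekday() >= 5,      # Saturday or Sunday
        "quarter": (d.month - 1) // 3 + 1,   # 1 through 4
        "week_of_year": d.isocalendar()[1],  # ISO 8601 week number
    }


# The date from the sample record above.
features = date_features("2022-12-01")
print(features)
```

For a full dataset, the same features would normally come from pandas datetime accessors rather than a per-record function.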

Specify the Expected Output Format

Tell ChatGPT exactly how you want the information formatted, such as markdown syntax for code blocks and lists. For example:

Please provide a bulleted list of 5-10 potential features I could derive from the date field in my dataset. Use markdown formatting.
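
The practices covered so far (context, a direct question, an example, and an output format) can be combined into a single prompt. The sketch below shows one way to assemble them in Python; `build_prompt` is a hypothetical helper written for this illustration and is not part of any ChatGPT tooling.

```python
def build_prompt(context: str, question: str, example: str = "", output_format: str = "") -> str:
    """Join the non-empty parts of a prompt, separated by blank lines."""
    parts = [context, example, question, output_format]
    return "\n\n".join(part.strip() for part in parts if part.strip())


prompt = build_prompt(
    context=(
        "I am working on a predictive modeling project to forecast daily "
        "sales for an ecommerce company."
    ),
    example=(
        "Here is a sample record from my dataset: "
        "{Product ID: 853, Category: Electronics, Price: $599, Date: 2022-12-01}."
    ),
    question="Can you suggest some feature engineering ideas for the date field?",
    output_format="Please provide a bulleted list of 5-10 features. Use markdown formatting.",
)
print(prompt)
```

Keeping the parts separate makes it easy to reuse the same context across several questions.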

Check for Accuracy and Fairness

Review ChatGPT’s responses critically instead of accepting them outright. Look out for incorrect facts, harmful stereotypes, and biased perspectives. When a claim looks unsupported, ask ChatGPT to justify it:

You mentioned the category field could proxy for customer demographics. What’s your basis for making that assumption? Please elaborate.

Refine Prompts through Feedback

If a response is inadequate, provide constructive feedback to ChatGPT:

Your last response made an unsupported generalization about customer demographics. Please suggest feature ideas without making demographic assumptions, focusing only on the available transaction data.

Continuously refining prompts this way helps improve quality over time.
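
When a conversation is scripted rather than typed into the chat window, the feedback typically has to travel together with the earlier turns so the model can see what is being corrected. The sketch below only builds that history as a list of role/content dictionaries, a layout common to chat-style APIs and assumed here; it makes no API call, and `add_turn` is a hypothetical helper.

```python
def add_turn(history: list[dict], role: str, content: str) -> list[dict]:
    """Return a new history with one more message appended."""
    if role not in {"user", "assistant"}:
        raise ValueError(f"unexpected role: {role}")
    return history + [{"role": role, "content": content}]


history: list[dict] = []
history = add_turn(
    history, "user",
    "Can you suggest some feature engineering ideas for the date field in my dataset?",
)
# Placeholder text; in practice this would be the model's actual reply.
history = add_turn(history, "assistant", "(model response goes here)")
history = add_turn(
    history, "user",
    "Your last response made an unsupported generalization about customer "
    "demographics. Please suggest feature ideas using only the transaction data.",
)

for message in history:
    print(f"{message['role']}: {message['content']}")
```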

ChatGPT Prompt Examples for Data Science

Here are some illustrative examples of effective ChatGPT prompts for common data science tasks:

1. Data Exploration

I have a dataset with 1 million rows and 5 columns: ID, Age, Gender, Country, PurchaseAmount. Please suggest useful ways I can explore this data and provide code examples in Python where applicable. Focus on univariate analysis techniques. Use markdown formatting for any lists or code.  

Potential response:

  • Calculate summary statistics such as the mean, median, and quartiles for numeric columns using df.describe(), and the mode using df.mode()
  • Create histograms and box plots for PurchaseAmount using Matplotlib/Seaborn
  • Generate frequency tables and bar charts for categorical features like Country and Gender
  • Identify and handle missing values
  • Check for outliers and anomalies
  • Assess data types and distributions
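
Several of the suggestions above can be sketched in a few lines of pandas. The six-row DataFrame below is made-up sample data standing in for the 1-million-row dataset, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import pandas as pd

# Small made-up sample with the same columns as the dataset in the prompt.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Age": [25, 34, 41, None, 29, 52],
    "Gender": ["F", "M", "F", "M", "F", "M"],
    "Country": ["US", "US", "DE", "FR", "US", "DE"],
    "PurchaseAmount": [20.0, 35.5, 18.0, 22.5, 950.0, 27.0],
})

# Summary statistics for a numeric column (count, mean, std, quartiles).
summary = df["PurchaseAmount"].describe()

# Frequency table for a categorical column.
country_counts = df["Country"].value_counts()

# Number of missing values in each column.
missing = df.isna().sum()

# Flag outliers with the 1.5 * IQR rule.
q1 = df["PurchaseAmount"].quantile(0.25)
q3 = df["PurchaseAmount"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["PurchaseAmount"] < q1 - 1.5 * iqr) | (df["PurchaseAmount"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(summary)
print(country_counts)
print(missing)
print(outliers)
```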

2. Data Cleaning

My dataset has a text column for customer comments. Please suggest 5 data cleaning tasks for preparing this text data for sentiment analysis. Provide Python code examples where relevant. Use markdown formatting.

Potential response:

  • Remove punctuation and special characters
    import re
    text = re.sub(r'[^\w\s]', '', text)
  • Convert text to lowercase
    text = text.lower()
  • Expand contractions (do this before removing punctuation, while the apostrophes are still present)
    import contractions
    text = contractions.fix(text)
  • Remove stop words
  • Correct spelling mistakes
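
The order of these steps matters, so it can help to see them chained together. The sketch below uses only the standard library; the tiny `CONTRACTIONS` and `STOP_WORDS` lookups are illustrative stand-ins for fuller resources, and spelling correction is left out because it usually needs a dedicated library.

```python
import re

# Tiny illustrative lookups; a real project would use much fuller lists.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot", "i'm": "i am"}
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "to", "i", "this"}


def clean_text(text: str) -> list[str]:
    """Lowercase, expand contractions, strip punctuation, and drop stop words."""
    text = text.lower()
    # Expand contractions before removing punctuation, or the apostrophes are lost.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove everything except word characters and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return [word for word in text.split() if word not in STOP_WORDS]


tokens = clean_text("I don't like the battery, but it's a GREAT screen!")
print(tokens)
```

Note that "not" is deliberately kept out of the stop-word list here, since negation carries much of the sentiment signal.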

3. Feature Engineering

I have an ecommerce dataset with columns UserId, Product, Date, PurchaseAmount. Suggest 3-5 ideas for feature engineering that could be useful for customer segmentation. Provide explanations and Python code examples where applicable. Use markdown formatting. 

Potential response:

  • Purchase frequency – Calculate the number of purchases per customer per month. Can indicate loyalty. (Recent pandas releases prefer freq='ME' over freq='M'.)
    purchases_per_month = df.groupby(['UserId', pd.Grouper(key='Date', freq='M')])['PurchaseAmount'].count()
  • Recency – Compute the time since each customer's last purchase, measured from the most recent date in the data. Helpful for predicting churn.
    df['Recency'] = df['Date'].max() - df.groupby('UserId')['Date'].transform('max')
  • Product affinity – Count how often each customer buys each product. Reveals preferences.
    product_affinity = df.groupby(['UserId', 'Product'])['PurchaseAmount'].count()
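
For customer segmentation, ideas like these are often rolled up into one row per customer. The sketch below builds a small recency/frequency/monetary table with pandas; the data is made up, and using the latest purchase date in the data as the reference point for recency is an assumption of this example.

```python
import pandas as pd

# Made-up transactions with the same columns as the dataset in the prompt.
df = pd.DataFrame({
    "UserId": [1, 1, 2, 2, 2, 3],
    "Product": ["A", "B", "A", "A", "C", "B"],
    "Date": pd.to_datetime([
        "2023-01-05", "2023-03-10", "2023-01-20",
        "2023-02-14", "2023-03-28", "2023-02-01",
    ]),
    "PurchaseAmount": [50.0, 20.0, 15.0, 30.0, 45.0, 100.0],
})

# Reference date for recency: the latest purchase in the data (an assumption).
snapshot = df["Date"].max()

# One row per customer: days since last purchase, purchase count, total spend.
rfm = df.groupby("UserId").agg(
    recency_days=("Date", lambda d: (snapshot - d.max()).days),
    frequency=("Date", "count"),
    monetary=("PurchaseAmount", "sum"),
)
print(rfm)
```

The resulting table can be fed to a clustering algorithm or split into simple score bands.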

4. Model Training

I want to predict customer lifetime value (CLV) based on recency, frequency, monetary value features in my retail dataset. Please suggest a suitable machine learning algorithm for this regression task and provide starter code for training and evaluating the model in Python. Use markdown formatting.  

Potential response:

Recency, frequency, and monetary value often relate to CLV in non-linear ways, so a plain linear regression may be too simplistic. A random forest regressor is a reasonable starting point because it can capture non-linear relationships and feature interactions.

Here is starter code:

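from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# df is assumed to already hold the RFM features and the CLV target
X = df[['recency', 'frequency', 'monetary']]
y = df['clv']

# Hold out 20% of the rows so the metrics reflect unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_test)

# Take the square root manually; the squared=False argument is deprecated in recent scikit-learn releases
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)

print(f'RMSE: {rmse}')
print(f'R-squared: {r2}')

This trains a random forest model on 80% of the data and reports RMSE and R-squared on the held-out 20%. Evaluating on the training rows themselves would give overly optimistic scores. Hyperparameter tuning can improve performance further, and cross-validation gives a more reliable estimate of it.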
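
The cross-validation mentioned above can be sketched with scikit-learn's `cross_val_score`. The data here is synthetic, generated only so the example runs on its own; the three columns merely stand in for recency, frequency, and monetary value, and the relationship to the target is invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 customers, 3 features, an invented target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 2.0 * X[:, 1] + 0.5 * X[:, 2] - 0.3 * X[:, 0] + rng.normal(0, 5, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0)

# scikit-learn reports errors as negative scores, so flip the sign.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores

print(f"RMSE per fold: {np.round(rmse_per_fold, 2)}")
print(f"Mean RMSE: {rmse_per_fold.mean():.2f}")
```

Each fold is scored on rows the model did not see during fitting, so the spread across folds gives a sense of how stable the estimate is.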


Using prompts like these examples can help data scientists quickly gather ideas or boilerplate code for various tasks in the machine learning workflow. The key is crafting clear, detailed prompts and providing constructive feedback to ChatGPT when responses miss the mark. With practice, ChatGPT can become an invaluable assistant for data science projects.
