To create a DataFrame from another DataFrame in Pandas, you have several common approaches depending on your specific goal. These methods range from simply duplicating an existing DataFrame to combining multiple DataFrames or extracting specific subsets of data.
Here's an overview of the key methods:
Understanding How to Create a New DataFrame from Existing Data
Creating a new DataFrame from an existing one can involve various operations, such as combining data from different sources, selecting specific parts of a DataFrame, or transforming its contents. Each method serves a distinct purpose, allowing for flexible data manipulation and analysis.
Method | Description | Primary Use Case | Key Pandas Function(s) |
---|---|---|---|
Concatenation | Combining rows or columns from two or more DataFrames into a single new DataFrame. | Merging datasets vertically (stacking) or horizontally (side-by-side). | pd.concat() |
Subsetting/Filtering | Creating a smaller DataFrame by selecting specific rows, columns, or applying conditions. | Focusing on relevant data, analyzing specific segments. | df[], df.loc[], df.iloc[] |
Copying | Creating an independent duplicate of an existing DataFrame. | Preserving original data, making changes without affecting the source. | .copy() |
Transformation | Applying functions or operations to create new columns or modify existing data. | Feature engineering, data cleaning, deriving new insights. | .apply(), .assign(), arithmetic operations, custom functions |
1. Combining DataFrames Using pd.concat()
One way to create a new DataFrame from others is to combine them. To join one DataFrame to another in Pandas, use the pd.concat() function. This is particularly useful when data that logically belongs together is split across multiple DataFrames.
The pd.concat() function takes a list of DataFrames as an argument and returns a new DataFrame with the joined data. It can stack DataFrames either vertically (row-wise) or horizontally (column-wise).
Key Parameters for pd.concat():
- objs: A list of DataFrames to concatenate.
- axis: Determines the axis along which to concatenate.
  - axis=0 (default): Concatenates along rows, stacking DataFrames vertically.
  - axis=1: Concatenates along columns, placing DataFrames side-by-side.
- join: How to handle indexes on the other axis.
  - 'outer' (default): Keeps all index values, filling missing data with NaN.
  - 'inner': Keeps only index values that are common across all DataFrames.
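The effect of the join parameter is easiest to see when the DataFrames do not line up exactly. The following is a minimal sketch, assuming two small made-up DataFrames (df_a and df_b are hypothetical names) that share only column 'B'. When stacking rows with axis=0, the "other axis" is the columns.

```python
import pandas as pd

# Hypothetical sample data: the two frames share only column 'B'
df_a = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df_b = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})

# join='outer' (default): keep the union of columns, fill the gaps with NaN
outer = pd.concat([df_a, df_b], axis=0, join='outer')

# join='inner': keep only the columns present in every DataFrame
inner = pd.concat([df_a, df_b], axis=0, join='inner')

print(list(outer.columns))   # ['A', 'B', 'C']
print(list(inner.columns))   # ['B']
print(inner['B'].tolist())   # [3, 4, 5, 6]
```

Note that the original row labels (0, 1, 0, 1) are carried over unchanged in both results.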
Example: Vertical Concatenation (Stacking Rows)
This is commonly used when you have identical columns but different rows of data in separate DataFrames.
import pandas as pd
# Create two sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']},
                   index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']},
                   index=[2, 3])
print("--- DataFrame 1 ---")
print(df1)
print("\n--- DataFrame 2 ---")
print(df2)
# Concatenate df1 and df2 row-wise (axis=0)
result_df_rows = pd.concat([df1, df2], axis=0)
print("\n--- Concatenated DataFrame (Rows) ---")
print(result_df_rows)
# Output:
#     A   B
# 0  A0  B0
# 1  A1  B1
# 2  A2  B2
# 3  A3  B3
Example: Horizontal Concatenation (Joining Columns)
This is useful when DataFrames share a common index and you want to add columns from one to the other. Rows are aligned by index label, not by position.
# Create two sample DataFrames
df_left = pd.DataFrame({'ID': [1, 2, 3],
                        'Name': ['Alice', 'Bob', 'Charlie']},
                       index=['a', 'b', 'c'])
df_right = pd.DataFrame({'Age': [25, 30, 35],
                         'City': ['NY', 'LA', 'Chicago']},
                        index=['a', 'b', 'c'])
print("\n--- DataFrame Left ---")
print(df_left)
print("\n--- DataFrame Right ---")
print(df_right)
# Concatenate df_left and df_right column-wise (axis=1)
result_df_cols = pd.concat([df_left, df_right], axis=1)
print("\n--- Concatenated DataFrame (Columns) ---")
print(result_df_cols)
# Output:
#    ID     Name  Age     City
# a   1    Alice   25       NY
# b   2      Bob   30       LA
# c   3  Charlie   35  Chicago
For more complex joins based on specific keys, consider using pd.merge(), which offers more sophisticated options for combining DataFrames based on common columns rather than just indexes. You can learn more about pd.merge() in the Pandas documentation.
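As a rough illustration of a key-based join, here is a minimal sketch of pd.merge(). The customers and orders DataFrames and the customer_id column are made-up sample data, not part of the examples above.

```python
import pandas as pd

# Hypothetical sample data sharing a 'customer_id' key column
customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'customer_id': [1, 1, 3],
                       'amount': [50, 20, 75]})

# Inner join on the key column: only customers with at least one order remain,
# and a customer with several orders appears once per order
merged = pd.merge(customers, orders, on='customer_id', how='inner')
print(merged)
```

Unlike pd.concat(), the rows here are matched by the values in customer_id rather than by index labels, so Bob (no orders) is dropped and Alice appears twice.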
2. Subsetting and Filtering DataFrames
One of the most common ways to create a new DataFrame is by extracting a subset of an existing one. This involves selecting specific rows, columns, or applying conditions to filter the data.
a. Selecting Specific Columns
You can create a new DataFrame containing only a subset of the original columns.
import pandas as pd
df_original = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'Price': [1200, 25, 75, 300],
    'Stock': [10, 50, 30, 15]
})
print("--- Original DataFrame ---")
print(df_original)
# Create a new DataFrame with only 'Product' and 'Price' columns
df_subset_cols = df_original[['Product', 'Price']]
print("\n--- DataFrame with Selected Columns ---")
print(df_subset_cols)
# Output:
#     Product  Price
# 0    Laptop   1200
# 1     Mouse     25
# 2  Keyboard     75
# 3   Monitor    300
b. Filtering Rows Based on Conditions
You can create a new DataFrame by selecting rows that meet specific criteria.
# Create a new DataFrame with products priced over $100
df_filtered_price = df_original[df_original['Price'] > 100]
print("\n--- DataFrame Filtered by Price (> $100) ---")
print(df_filtered_price)
# Output:
#    Product  Price  Stock
# 0   Laptop   1200     10
# 3  Monitor    300     15
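The overview table also lists df.loc[] and df.iloc[] for subsetting. The sketch below rebuilds the same df_original so it can run on its own and shows one plausible use of each; the variable names cheap and top_left are chosen only for illustration.

```python
import pandas as pd

df_original = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'Price': [1200, 25, 75, 300],
    'Stock': [10, 50, 30, 15]
})

# .loc is label-based: filter rows by a condition and pick columns in one step
cheap = df_original.loc[df_original['Price'] < 100, ['Product', 'Stock']]

# .iloc is position-based: first two rows, first two columns
top_left = df_original.iloc[:2, :2]

# Chaining .copy() makes the subset explicitly independent of df_original
cheap = cheap.copy()

print(cheap)
print(top_left)
```

Calling .copy() on a subset is optional, but it makes the intent clear when you plan to modify the new DataFrame afterwards.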
3. Creating a Copy of a DataFrame
If you simply want an independent duplicate of an existing DataFrame, use the .copy() method. This matters because plain assignment (for example, df_new = df_source) does not copy anything: both names refer to the same DataFrame object, so changes made through the new name also change the original.
import pandas as pd
df_source = pd.DataFrame({'Item': ['A', 'B'], 'Value': [10, 20]})
print("--- Source DataFrame ---")
print(df_source)
# Create a true copy
df_copy = df_source.copy()
# Modify the copy
df_copy.loc[0, 'Value'] = 100
print("\n--- Copied DataFrame (Modified) ---")
print(df_copy)
# Output:
#   Item  Value
# 0    A    100
# 1    B     20
print("\n--- Source DataFrame (Unchanged) ---")
print(df_source)
# Output:
#   Item  Value
# 0    A     10
# 1    B     20
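To make the difference between plain assignment and .copy() concrete, here is a small sketch. The names alias and independent are chosen only for illustration.

```python
import pandas as pd

df_source = pd.DataFrame({'Item': ['A', 'B'], 'Value': [10, 20]})

alias = df_source               # plain assignment: same object, nothing copied
independent = df_source.copy()  # separate object with its own data

alias.loc[0, 'Value'] = 100        # also visible through df_source
independent.loc[1, 'Value'] = 999  # df_source is not affected

print(alias is df_source)             # True
print(independent is df_source)       # False
print(df_source['Value'].tolist())    # [100, 20]
print(independent['Value'].tolist())  # [10, 999]
```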
4. Applying Functions and Transformations
You can create a new DataFrame by applying functions or calculations to existing columns, often resulting in new derived columns.
import pandas as pd
df_sales = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [10.0, 20.0, 15.0],
    'Quantity': [5, 3, 7]
})
print("--- Original Sales DataFrame ---")
print(df_sales)
# Create a new DataFrame by adding a 'Total_Sale' column
# .assign() returns a new DataFrame and leaves df_sales unchanged
df_sales_with_total = df_sales.assign(Total_Sale=df_sales['Price'] * df_sales['Quantity'])
print("\n--- Sales DataFrame with Total Sale ---")
print(df_sales_with_total)
# Output:
#   Product  Price  Quantity  Total_Sale
# 0       A   10.0         5        50.0
# 1       B   20.0         3        60.0
# 2       C   15.0         7       105.0
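The overview table also mentions .apply() and custom functions. A minimal sketch of that variant follows, assuming a made-up labeling rule; the order_size function, its threshold of 5, and the Order_Size column are hypothetical.

```python
import pandas as pd

df_sales = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [10.0, 20.0, 15.0],
    'Quantity': [5, 3, 7]
})

def order_size(quantity):
    # Hypothetical rule: five or more units counts as a bulk order
    return 'bulk' if quantity >= 5 else 'small'

# Series.apply() calls order_size once per value in the 'Quantity' column;
# .assign() then returns a new DataFrame with the derived column added
df_labeled = df_sales.assign(Order_Size=df_sales['Quantity'].apply(order_size))

print(df_labeled['Order_Size'].tolist())  # ['bulk', 'small', 'bulk']
print('Order_Size' in df_sales.columns)   # False
```

For simple arithmetic, the vectorized form shown with Total_Sale is usually the better choice; .apply() is mainly useful when the logic does not map onto built-in column operations.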
These methods provide a comprehensive toolkit for creating new DataFrames from existing ones in Pandas, covering a wide range of data manipulation needs.