Lec 14 Pandas

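  • Normalizing: mapping/scaling each column onto a common range (here min-max scaling to [0, 1]) so homeworks with different point totals can be compared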
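import pandas as pd

scores = pd.DataFrame({
    'Name': ['Minnie', 'Joe', 'Bob', 'Jeff'],
    'HW1': [4, 6, 5, 6],
    'HW2': [54, 55, 59, 63],
    'HW3': [20, 20, 19, 14],
})

def get_lowest_avg_hw(df: pd.DataFrame) -> str:
    # Count only the raw HW columns so the scaled_/avg columns added below are ignored
    num_hws = sum(col.startswith('HW') for col in df.columns)
    for i in range(1, num_hws + 1):
        # Use lo/hi rather than min/max to avoid shadowing the built-ins
        lo = df[f'HW{i}'].min()
        hi = df[f'HW{i}'].max()
        # Min-max scaling maps each homework onto [0, 1]
        df[f'scaled_HW{i}'] = (df[f'HW{i}'] - lo) / (hi - lo)
    # Average the scaled scores for each student
    df['avg'] = sum(df[f'scaled_HW{i}'] for i in range(1, num_hws + 1)) / num_hws
    # idxmin returns the index label of the lowest average
    return df.loc[df['avg'].idxmin(), 'Name']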
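
The same min-max scaling can also be written without the per-column loop. The following is only a sketch under the assumption that every column other than 'Name' is a homework score; the function name lowest_scaled_avg is made up for illustration and is not from the lecture.

```python
import pandas as pd

scores = pd.DataFrame({
    'Name': ['Minnie', 'Joe', 'Bob', 'Jeff'],
    'HW1': [4, 6, 5, 6],
    'HW2': [54, 55, 59, 63],
    'HW3': [20, 20, 19, 14],
})

def lowest_scaled_avg(df: pd.DataFrame) -> str:
    """Return the name with the lowest average min-max-scaled score (hypothetical helper)."""
    hw = df.drop(columns='Name')                       # assumption: all other columns are homeworks
    scaled = (hw - hw.min()) / (hw.max() - hw.min())   # column-wise scaling onto [0, 1]
    avg = scaled.mean(axis=1)                          # one average per student (row)
    return df.loc[avg.idxmin(), 'Name']

print(lowest_scaled_avg(scores))
```

Unlike the loop version, this sketch leaves the input DataFrame unchanged. A homework where every student has the same score would divide by zero and produce NaN for that column, which mean() then skips by default.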

Merging

  • pd.merge() joins two datasets on the column(s) they have in common
  • What if the two tables share more than one column name?
    • Name the key explicitly: pd.merge(df, other_df, on='column to merge with')
    • Any other column name that appears in both tables is kept twice, with suffixes: {column_name}_x from the first df and {column_name}_y from the second (change them with the suffixes arg)
  • By default, rows whose key appears in only one dataset are dropped; only keys present in both are included
    • The how arg changes this
      • left -> every row of the first df is kept; columns from the 2nd df are NaN where there is no match
      • right -> the opposite: every row of the 2nd df is kept
      • outer -> all rows from both, with NaN wherever one side is missing
      • inner -> only rows whose key is in both (the default)
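# Pyodide only: micropip and top-level await are not available in plain Python
import micropip
await micropip.install("pandas")
import pandas

df_pricing = pandas.DataFrame({
    'Item': [1, 2, 3, 4, 5],
    'Price': [1, 2, 3, 4, 5]
})

df_purchased = pandas.DataFrame({
    'Item': [1, 2, 3]
})
# how='right' keeps every row of df_purchased (the second frame)
print(pandas.merge(df_pricing, df_purchased, how='right'))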
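
To make the how options and the suffix behavior concrete, here is a small sketch. The purchased table below is changed from the one above as an assumption for illustration: it gets its own 'Price' column and an Item 6 that has no entry in the pricing table.

```python
import pandas as pd

df_pricing = pd.DataFrame({'Item': [1, 2, 3, 4, 5], 'Price': [1, 2, 3, 4, 5]})
# Illustrative data: Item 6 is missing from df_pricing, and 'Price' clashes by name
df_purchased = pd.DataFrame({'Item': [1, 2, 6], 'Price': [1, 9, 6]})

# Join on 'Item' only; the clashing 'Price' columns are told apart by suffixes
results = {
    how: pd.merge(df_pricing, df_purchased, on='Item', how=how,
                  suffixes=('_pricing', '_purchased'))
    for how in ('inner', 'left', 'right', 'outer')
}

for how, merged in results.items():
    print(how, len(merged), 'rows')
print(results['outer'])
```

With on='Item', inner keeps only Items 1 and 2, left keeps all five priced items, right keeps the three purchased items, and outer keeps all six distinct items, with NaN wherever one side has no match.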
  • MapReduce:
    • Example: counting the frequency of each word in a piece of text
      • Map phase: split the text into words and emit a (word, 1) key/value pair for each one, so every occurrence starts with a count of 1
      • Reduce phase: group the key/value pairs by word, then add up the counts in each group
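from typing import Dict, List
def mapreduce(documents: List[str]) -> Dict[str, int]:
    # mapper, shuffle and reducer are not defined in these notes; assumed here:
    # shuffle returns {word: [1, 1, ...]} and reducer returns a (word, count) pair
    mapped = [mapper(doc) for doc in documents]
    shuffled = shuffle(mapped)
    reduced = [reducer(key, values) for key, values in shuffled.items()]
    return dict(reduced)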
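
The notes do not define mapper, shuffle, or reducer, so the versions below are one possible single-machine sketch of the word-count example, not the lecture's implementation. The whitespace tokenization, the lowercasing, and the driver name word_count are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def mapper(doc: str) -> List[Tuple[str, int]]:
    # Map: emit a (word, 1) pair for every word (assumed: lowercase, split on whitespace)
    return [(word, 1) for word in doc.lower().split()]

def shuffle(mapped: Iterable[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    # Shuffle: group the emitted values by key, e.g. {'the': [1, 1]}
    groups: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return groups

def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    # Reduce: add up the counts collected for one word
    return key, sum(values)

def word_count(documents: List[str]) -> Dict[str, int]:
    # Hypothetical driver following the map -> shuffle -> reduce steps
    mapped = [mapper(doc) for doc in documents]
    shuffled = shuffle(mapped)
    return dict(reducer(key, values) for key, values in shuffled.items())

counts = word_count(["the cat sat", "the dog sat down"])
print(counts)
```

In a real MapReduce system the map and reduce calls would run in parallel on different machines; here everything runs in one process purely to show the data flow.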