I would like to shuffle a fraction (for example 40%) of the values of a specific column in a Pandas dataframe.
How would you do it? Is there a simple idiomatic way to do that, maybe using np.random, or sklearn.utils.shuffle?
I have searched and only found answers related to shuffling the whole column, or shuffling complete rows in the df, but none related to shuffling only a fraction of a column.
I have apparently managed to do it, but I get a warning. So even though it seems to work in this simple example, I suspect it is not the right way to do it.
Here's what I've done:
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'i': range(20), 'L': [chr(97+i) for i in range(20)]})
    df['L2'] = df['L']
    df.T

Output:

         0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
    i    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
    L    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t
    L2   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t

For now, L2 is simply a copy of column L. I keep L as the original, and I want to shuffle L2, so I can visually compare both. The i column is simply a dummy column. It's there to show that I want to keep all my columns intact, except for a fraction of L2 that I want to shuffle.
    n_rows = len(df)
    n_shuffle = int(n_rows*0.4)
    n_rows, n_shuffle
    # (20, 8)

    pick_rows = np.random.permutation(list(range(n_rows)))[0:n_shuffle]
    pick_rows
    # array([ 3,  0, 11, 16, 14,  4,  8, 12])

    shuffled_values = np.random.permutation(df['L2'][pick_rows])
    shuffled_values
    # array(['l', 'e', 'd', 'q', 'o', 'i', 'm', 'a'], dtype=object)

    df['L2'][pick_rows] = shuffled_values

I get this warning:
    C:\Users\adumont\.conda\envs\fastai-cpu\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame

    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
      """Entry point for launching an IPython kernel.

Then:

    df.T

I get the following, which is what I expected (40% of the values of L2 are now shuffled):
         0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
    i    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
    L    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t
    L2   e   b   c   l   i   f   g   h   m   j   k   d   a   n   o   p   q   r   s   t

You can see the notebook here (it renders better on nbviewer than here): https://nbviewer.jupyter.org/gist/adumont/bc2bac1b6cf7ba547e7ba6a19c01adb6
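For comparison, here is a sketch of the same steps written with a single `.loc` assignment. As far as I understand the documentation linked in the warning, this avoids the chained indexing (`df['L2'][pick_rows] = ...`) that triggers it. The `rng` generator and its seed are my own additions for reproducibility and are not in the notebook. I am not sure this is the idiomatic way, which is what I am asking.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'i': range(20), 'L': [chr(97 + i) for i in range(20)]})
df['L2'] = df['L']

# Assumption: a seeded generator, used only so the run is repeatable
rng = np.random.default_rng(0)

n_shuffle = int(len(df) * 0.4)

# Pick 40% of the index labels, without replacement
pick_rows = rng.choice(df.index.to_numpy(), size=n_shuffle, replace=False)

# Permute the picked values and write them back in one .loc call
# (row labels and column name together, so there is no chained indexing)
shuffled_values = rng.permutation(df.loc[pick_rows, 'L2'].to_numpy())
df.loc[pick_rows, 'L2'] = shuffled_values
```

Rows outside `pick_rows` and the other columns are left untouched. Note that a permutation can leave some picked values in their original position, so fewer than 40% of the rows may actually change.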
Thanks in advance.