pandas - Ingest data once in Python


I have a dataframe in Python that contains my data for binary classification. I ingest the data in two iterations: once for the data of one class, and then for the data of the other class. I then run a randomisation of the rows. The problem I have is that every time I rerun the script, the data frame is recreated and the rows are randomised again, creating unreproducible results.

Should I run the dataframe creation and randomisation in an external file? Are there common practices for data ingestion in model building?

I haven't attempted anything in this regard yet. I am wondering whether this makes sense from a statistical point of view, or whether it is common practice. I could try something such as:

    import data_ingest
    data_ingest.function_data_call()

But again, every time I run my script it calls the external script, which forms the data and randomises it. That is not the solution I am looking for.

I can't show a full example, as I am loading in documents (text files); each document is labelled for binary classification. The structure of the dataframe is the following:

    row | content                | class
    ----+------------------------+------
    1   | sky blue               | 0
    2   | river runs deep purple | 0
    3   | yellow fever           | 0
    4   | red strawberries       | 1
    5   | black orchids nice     | 1

Ingestion code:

    import io
    import os

    import numpy as np
    import pandas as pd

    # path1 and path2 are the directories holding the documents of each class
    # (defined elsewhere)

    # read every non-hidden file of the first class
    data1 = []
    for f in [f for f in os.listdir(path1) if not f.startswith('.')]:
        with io.open(os.path.join(path1, f), "r", encoding="utf-8") as myfile:
            # data1.append(myfile.read().rstrip().replace('-', '').replace('.', '').replace('\n', ''))
            tmp1 = myfile.read().rstrip().replace('-', '').replace('\n', '')
            # collapse runs of whitespace into single spaces
            data1.append(" ".join(tmp1.split()))

    df1 = pd.DataFrame(data1, columns=["content"])
    df1["class"] = "1"

    # read every non-hidden file of the second class
    data2 = []
    for f in [f for f in os.listdir(path2) if not f.startswith('.')]:
        with io.open(os.path.join(path2, f), "r", encoding="utf-8") as myfile:
            # data2.append(myfile.read().rstrip().replace('-', '').replace('.', '').replace('\n', '').replace(' ', ''))
            tmp2 = myfile.read().rstrip().replace('-', '').replace('\n', '')
            data2.append(" ".join(tmp2.split()))

    df2 = pd.DataFrame(data2, columns=["content"])
    df2["class"] = "0"

    ### concatenate the 2 dataframes into 1 and re-index
    emails = pd.concat([df1, df2], ignore_index=True)

    ## randomize rows
    emails = emails.reindex(np.random.permutation(emails.index))

If you want to reproduce the same result after (pseudo-)randomization, you can set the random seed. Each time you use the same seed, you get the same sequence of random numbers.
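As a minimal sketch of the seeding idea, the snippet below shuffles a small made-up frame twice with the same seed and checks that both orderings match. The toy rows, the `SEED` value, and the helper name `shuffle_rows` are assumptions for illustration, not part of the question's code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the concatenated frame; the rows are made up.
emails = pd.DataFrame({
    "content": ["a", "b", "c", "d", "e"],
    "class": ["0", "0", "0", "1", "1"],
})

SEED = 42  # arbitrary choice; any fixed integer gives a repeatable order


def shuffle_rows(df, seed):
    """Return df with its rows in a pseudo-random order fixed by seed."""
    rng = np.random.RandomState(seed)
    return df.reindex(rng.permutation(df.index))


first = shuffle_rows(emails, SEED)
second = shuffle_rows(emails, SEED)

# Both shuffles use the same seed, so the row order is identical.
print(first.equals(second))
```

Calling `np.random.seed(SEED)` once before the original `np.random.permutation(...)` line should have a similar effect, and `emails.sample(frac=1, random_state=SEED)` is another way to get a seeded shuffle.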

Secondly, you can save the intermediate result to a file, either as JSON or as a pickle. You can then check whether the file exists and, if it does not, recreate it.
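The check-then-recreate pattern could look roughly like this, using a pickle file. The function names `build_emails` and `load_or_build` and the file name are hypothetical; `build_emails` is only a placeholder for the real ingestion and shuffling step.

```python
import os
import tempfile

import pandas as pd


def build_emails():
    """Placeholder for the expensive ingestion and shuffling step."""
    df = pd.DataFrame({"content": ["x", "y", "z"], "class": ["1", "0", "0"]})
    return df.sample(frac=1, random_state=0)


def load_or_build(cache_path):
    """Load the cached frame if it exists, otherwise build and save it."""
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    df = build_emails()
    df.to_pickle(cache_path)
    return df


# Demonstration in a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "emails.pkl")  # hypothetical file name
    first = load_or_build(path)   # cache missing: builds and writes it
    second = load_or_build(path)  # cache present: reads it back
    print(first.equals(second))
```

With a cache like this, the shuffle happens only on the first run; deleting the file forces the data to be rebuilt.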

