pandas - Ingest data once in Python
I have a dataframe in Python that contains data for binary classification. I ingest the data in two iterations: once for the data of one class, then for the data of the other class. After that I randomise the rows. The problem is that every time I rerun the script, the rows of the dataframe are recreated and randomised again, which creates unreproducible results.
Should I run the dataframe creation and randomisation in an external file? Are there any common practices for data ingestion in model building?
I haven't attempted anything in this regard yet. I am wondering whether it makes sense from a statistical point of view, or whether it is common practice. I could try something such as:
    import data_ingest
    data_ingest.function_data_call()
But again, every time I run the script it calls the external script, which forms the data and randomises it. That is not the solution I am looking for.
I can't show a real example, because I am loading in documents (text files); the task is binary classification of documents. The structure of the dataframe is the following:
    row | content                | class
    --------------------------------------
    1   | sky blue               | 0
    2   | river runs deep purple | 0
    3   | yellow fever           | 0
    4   | red strawberries       | 1
    5   | black orchids nice     | 1
The ingestion code:

    import io
    import os

    import numpy as np
    import pandas as pd

    # path1 and path2 are the directories holding the documents of each class
    data1 = []
    data2 = []

    # read every non-hidden file of the first class
    for f in [f for f in os.listdir(path1) if not f.startswith('.')]:
        with io.open(path1 + f, "r", encoding="utf-8") as myfile:
            # data1.append(myfile.read().rstrip().replace('-', '').replace('.', '').replace('\n', ''))
            tmp1 = myfile.read().rstrip().replace('-', '').replace('\n', '')
            data1.append(" ".join(tmp1.split()))  # collapse runs of whitespace
    df1 = pd.DataFrame(data1, columns=["content"])
    df1["class"] = "1"

    # read every non-hidden file of the second class
    for f in [f for f in os.listdir(path2) if not f.startswith('.')]:
        with io.open(path2 + f, "r", encoding="utf-8") as myfile:
            # data2.append(myfile.read().rstrip().replace('-', '').replace('.', '').replace('\n', '').replace(' ', ''))
            tmp2 = myfile.read().rstrip().replace('-', '').replace('\n', '')
            data2.append(" ".join(tmp2.split()))
    df2 = pd.DataFrame(data2, columns=["content"])
    df2["class"] = "0"

    ### concatenate the 2 dataframes into 1 and re-index
    emails = pd.concat([df1, df2], ignore_index=True)

    ## randomize rows
    emails = emails.reindex(np.random.permutation(emails.index))
If you want to reproduce the same result after (pseudo-)randomization, you can set the random seed. Each time you use the same seed, you get the same sequence of random numbers.
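As a rough illustration of the seeding idea, the sketch below shuffles a small dataframe with a seeded NumPy generator. The seed value 42, the helper name shuffle_rows and the toy rows are assumptions made for this example only; they are not part of the question's code.

```python
import numpy as np
import pandas as pd

SEED = 42  # arbitrary value chosen for illustration


def shuffle_rows(df, seed=SEED):
    """Return df with its rows permuted; the same seed gives the same order."""
    rng = np.random.RandomState(seed)  # generator seeded independently of global state
    return df.reindex(rng.permutation(df.index))


# toy stand-in for the concatenated dataframe from the question
emails = pd.DataFrame(
    {
        "content": ["sky blue", "yellow fever", "red strawberries", "black orchids nice"],
        "class": ["0", "0", "1", "1"],
    }
)

shuffled = shuffle_rows(emails)
```

Calling np.random.seed with a fixed value just before the np.random.permutation line in the question should have a similar effect. Note that os.listdir returns file names in arbitrary order, so sorting the names before reading them is probably also needed for the rows to line up identically between runs.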
Secondly, you can save the intermediate result to a file, for example as JSON or a pickle. You can then check whether that file exists and, if it does not, recreate it.
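One possible way to sketch the "check if it exists, otherwise recreate it" step is shown below, using a pickle file. The file name emails.pkl and the helper name load_or_build are hypothetical, and the build argument stands in for whatever function performs the ingestion and shuffling.

```python
import os

import pandas as pd

CACHE_PATH = "emails.pkl"  # hypothetical file name for the cached dataframe


def load_or_build(build, cache_path=CACHE_PATH):
    """Load the dataframe from cache_path if present; otherwise build and save it."""
    if os.path.exists(cache_path):
        # cache hit: reuse the rows exactly as they were saved
        return pd.read_pickle(cache_path)
    df = build()  # cache miss: run the (possibly slow) ingestion once
    df.to_pickle(cache_path)
    return df
```

With this in place, a call such as load_or_build(function_data_call) would run the ingestion only on the first execution; deleting the file forces the dataframe to be rebuilt.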