DataFrame/src/DataFrame/DataFrame.class.st at scatterplot · PolyMathOrg/DataFrame

I am a tabular data structure designed for data analysis.

I store data in a table and provide an API for querying and modifying that data. I know row and column names associated with the table of data, which allows you to treat rows as observations and columns as features and reference them by their names. I also know the type of data stored in each column. In general, I am similar to spreadsheets such as Excel or to data frames in other languages, for example pandas (Python) or R.

The efficient data structure that I use to store the data is defined by DataFrameInternal. However, you can think of me as a collection of rows. Every time you interact with one of my rows or columns, it will be an instance of the DataSeries class. I use DataTypeInductor to induce the types of my columns every time they are modified. DataPrettyPrinter allows you to print me as a nicely formatted string table. DataFrameFTData defines a data source based on myself that is used in FastTable to display me in the inspector. I provide aggregation and grouping functionality, which is implemented using the helper class DataFrameGrouped.

Public API and Key Messages

Creating empty data frame (class side):

- new (empty data frame)

- new: point (empty data frame with given dimensions)

- withColumnNames: arrayOfColumnNames (empty data frame with column names)

- withRowNames: arrayOfRowNames (empty data frame with row names)

- withRowNames: arrayOfRowNames columnNames: arrayOfColumnNames (empty data frame with row and column names)

Creating data frame from an array of columns (class side):

- withColumns: arrayOfArrays

- withColumns: arrayOfArrays columnNames: arrayOfColumnNames

- withColumns: arrayOfArrays rowNames: arrayOfRowNames

- withColumns: arrayOfArrays rowNames: arrayOfRowNames columnNames: arrayOfColumnNames

Creating data frame from an array of rows (class side):

- withRows: arrayOfArrays

- withRows: arrayOfArrays columnNames: arrayOfColumnNames

- withRows: arrayOfArrays rowNames: arrayOfRowNames

- withRows: arrayOfArrays rowNames: arrayOfRowNames columnNames: arrayOfColumnNames
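
The class-side constructors listed above can be tried in a playground. The following is a minimal sketch that uses only selectors from this comment; the city data is made up for illustration.

```smalltalk
"Illustrative data only; the values are assumptions, the selectors come from the list above."
| df |
df := DataFrame
	withRows: #( #('Barcelona' 1.609) #('Dubai' 2.789) #('London' 8.788) )
	rowNames: #('A' 'B' 'C')
	columnNames: #('City' 'Population').

"dimensions answers numberOfRows @ numberOfColumns, so 3@2 here"
df dimensions.
```

Building from rows reads naturally when each inner array is one observation; withColumns: takes the same data transposed.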

Converting:

- asArrayOfColumns

- asArrayOfRows

Dimensions:

- numberOfColumns

- numberOfRows

- dimensions (a Point numberOfRows @ numberOfColumns)

Column and row names:

- columnNames

- columnNames: arrayOfNewNames

- rowNames

- rowNames: arrayOfNewNames

Column types:

- columnTypes (classes of values stored in each column)

Getting columns:

- column: columnName

- columnAt: index

- columns: arrayOfColumnNames

- columnsAt: arrayOfIndices

- columnsFrom: firstIndex to: lastIndex

Getting rows:

- row: rowName

- rowAt: index

- rows: arrayOfRowNames

- rowsAt: arrayOfIndices

- rowsFrom: firstIndex to: lastIndex

- at: index (same as rowAt:)

Getting a cell value:

- at: rowIndex at: columnIndex

Setting columns:

- column: columnName put: arrayOrDataSeries

- columnAt: index put: arrayOrDataSeries

Setting rows:

- row: rowName put: arrayOrDataSeries

- rowAt: index put: arrayOrDataSeries

Setting a cell value:

- at: rowIndex at: columnIndex put: value
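
As a rough illustration of the getters and setters listed above, the sketch below reads a column, a row and a cell, then overwrites the cell. The frame and its values are hypothetical.

```smalltalk
"Hypothetical 2x2 frame; only selectors listed in this comment are used."
| df |
df := DataFrame
	withRows: #( #(1 2) #(3 4) )
	columnNames: #('a' 'b').

df column: 'a'.           "a DataSeries holding 1 and 3"
df rowAt: 2.              "a DataSeries holding 3 and 4"
df at: 2 at: 1.           "the cell in row 2, column 1, which is 3"

df at: 2 at: 1 put: 30.
df at: 2 at: 1.           "now 30"
```

Rows and columns come back as DataSeries objects, while at:at: answers the raw cell value.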

Head and tail:

- head (first 5 rows)

- head: numberOfRows

- tail (last 5 rows)

- tail: numberOfRows

Adding columns:

- addColumn: dataSeries

- addColumn: dataSeries atPosition: index

- addColumn: array named: columnName

- addColumn: array named: columnName atPosition: index

- addEmptyColumnNamed: columnName

- addEmptyColumnNamed: columnName atPosition: index

Adding rows:

- addRow: dataSeries

- addRow: dataSeries atPosition: index

- addRow: array named: rowName

- addRow: array named: rowName atPosition: index

- addEmptyRowNamed: rowName

- addEmptyRowNamed: rowName atPosition: index

- add: dataSeries (same as addRow:)

Removing columns:

- removeColumn: columnName

- removeColumnAt: index

Removing rows:

- removeRow: rowName

- removeRowAt: index

- removeFirstRow

- removeLastRow
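
The adding messages above can be combined as in this small sketch. It assumes a frame with explicit row names and uses made-up values; adding a row or column whose name already exists signals an Error, as the corresponding methods show.

```smalltalk
"Made-up values; row and column names must be unique."
| df |
df := DataFrame
	withRows: #( #(1 2) #(3 4) )
	rowNames: #('r1' 'r2')
	columnNames: #('a' 'b').

df addColumn: #(5 6) named: 'c'.      "new last column; one value per existing row"
df addRow: #(7 8 9) named: 'r3'.      "new last row; one value per column"

df dimensions.                         "expected to be 3@3 after both additions"
```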

Enumerating (over rows):

- collect: block

- do: block

- select: block

- withKeyDo: block

Aggregating and grouping:

- groupBy: columnName (returns an instance of DataFrameGrouped)

- groupBy: columnName aggregate: selector (groups data and aggregates it with a given function)

- group: columnNameOrArrayOfColumnNames by: columnName (groups part of data frame)

Applying:

- applyElementwise: block (to all columns)

- toColumn: columnName applyElementwise: block

- toColumnAt: index applyElementwise: block

- toColumns: arrayOfColumnNames applyElementwise: block

- toColumnsAt: arrayOfIndices applyElementwise: block

Sorting:

- sortBy: columnName

- sortDescendingBy: columnName

- sortBy: columnName using: block

Statistical functions (applied to quantitative columns):

- min

- max

- range (max minus min)

- average

- mean

- mode

- median (second quartile)

- firstQuartile

- thirdQuartile

- interquartileRange (third quartile minus first quartile)

- stdev (standard deviation)

- variance
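
Each statistical message is sent to every column and the answers are collected into a DataSeries keyed by column name (see applyToAllColumns:). A minimal sketch with an all-numeric, made-up frame:

```smalltalk
"Illustrative numbers only; every column is numeric, so each one can answer #average."
| df averages |
df := DataFrame
	withColumns: #( #(1 2 3) #(10 20 30) )
	columnNames: #('x' 'y').

averages := df average.    "a DataSeries named #average with keys 'x' and 'y'"
averages at: 'x'.          "the average of 1, 2 and 3, which is 2"
```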

Internal Representation and Key Implementation Points.

DataFrameInternal defines how data is stored inside me.

Class {

#name : #DataFrame,

#superclass : #Collection,

#instVars : [

'contents',

'rowNames',

'columnNames',

'dataTypes'

],

#category : #'DataFrame-Core'

}

{ #category : #'instance creation' }

DataFrame class >> new: aPoint [

^ super new initialize: aPoint

]

{ #category : #'instance creation' }

DataFrame class >> withColumnNames: anArrayOfColumnNames [

"Create an empty data frame with given column names"

| numberOfColumns df |

numberOfColumns := anArrayOfColumnNames size.

df := self new: 0 @ numberOfColumns.

df columnNames: anArrayOfColumnNames.

^ df

]

{ #category : #'instance creation' }

DataFrame class >> withColumnNames: anArrayOfColumnNames withRowNames: anArrayOfRowNames [

"Create an empty data frame with given column and row names"

| numberOfColumns numberOfRows df |

numberOfColumns := anArrayOfColumnNames size.

numberOfRows := anArrayOfRowNames size.

df := self new: numberOfRows @ numberOfColumns.

df columnNames: anArrayOfColumnNames.

df rowNames: anArrayOfRowNames.

^ df

]

{ #category : #'instance creation' }

DataFrame class >> withColumns: anArrayOfArrays [

^ self new initializeColumns: anArrayOfArrays

]

{ #category : #'instance creation' }

DataFrame class >> withColumns: anArrayOfArrays columnNames: anArrayOfColumnNames [

| df |

df := self withColumns: anArrayOfArrays.

df columnNames: anArrayOfColumnNames.

^ df

]

{ #category : #'instance creation' }

DataFrame class >> withColumns: anArrayOfArrays rowNames: anArrayOfRowNames [

^ anArrayOfArrays

ifNotEmpty: [ (self withColumns: anArrayOfArrays)

rowNames: anArrayOfRowNames;

yourself ]

ifEmpty: [ self withRowNames: anArrayOfRowNames ]

]

{ #category : #'instance creation' }

DataFrame class >> withColumns: anArrayOfArrays rowNames: anArrayOfRowNames columnNames: anArrayOfColumnNames [

^ anArrayOfArrays

ifNotEmpty: [ (self withColumns: anArrayOfArrays)

rowNames: anArrayOfRowNames;

columnNames: anArrayOfColumnNames;

yourself ]

ifEmpty: [ self withRowNames: anArrayOfRowNames ]

]

{ #category : #'instance creation' }

DataFrame class >> withDataFrameInternal: aDataFrameInternal rowNames: rows columnNames: columns [

^ self new

initializeContents: aDataFrameInternal

rowNames: rows

columnNames: columns

]

{ #category : #'instance creation' }

DataFrame class >> withRowNames: anArrayOfRowNames [

"Create an empty data frame with given row names"

| numberOfRows df |

numberOfRows := anArrayOfRowNames size.

df := self new: numberOfRows @ 0.

df rowNames: anArrayOfRowNames.

^ df

]

{ #category : #'instance creation' }

DataFrame class >> withRowNames: anArrayOfRowNames columnNames: anArrayOfColumnNames [

"Create an empty data frame with given row and column names"

| numberOfRows numberOfColumns df |

numberOfRows := anArrayOfRowNames size.

numberOfColumns := anArrayOfColumnNames size.

df := self new: numberOfRows @ numberOfColumns.

df rowNames: anArrayOfRowNames.

df columnNames: anArrayOfColumnNames.

^ df

]

{ #category : #'instance creation' }

DataFrame class >> withRows: anArrayOfArrays [

^ self new initializeRows: anArrayOfArrays

]

{ #category : #'instance creation' }

DataFrame class >> withRows: anArrayOfArrays columnNames: anArrayOfColumnNames [

^ anArrayOfArrays

ifNotEmpty: [ (self withRows: anArrayOfArrays)

columnNames: anArrayOfColumnNames;

yourself ]

ifEmpty: [ self withColumnNames: anArrayOfColumnNames ]

]

{ #category : #'instance creation' }

DataFrame class >> withRows: anArrayOfArrays rowNames: anArrayOfRowNames [

| df |

df := self withRows: anArrayOfArrays.

df rowNames: anArrayOfRowNames.

^ df

]

{ #category : #'instance creation' }

DataFrame class >> withRows: anArrayOfArrays rowNames: anArrayOfRowNames columnNames: anArrayOfColumnNames [

^ anArrayOfArrays

ifNotEmpty: [ (self withRows: anArrayOfArrays)

rowNames: anArrayOfRowNames;

columnNames: anArrayOfColumnNames;

yourself ]

ifEmpty: [ self withColumnNames: anArrayOfColumnNames ]

]

{ #category : #comparing }

DataFrame >> = aDataFrame [

"Most objects will fail here"

aDataFrame species = self species

ifFalse: [ ^ false ].

"This is the fastest way for two data frames with different dimensions"

aDataFrame dimensions = self dimensions

ifFalse: [ ^ false ].

"If the names are different we don't need to iterate through values"

(aDataFrame rowNames = self rowNames

and: [ aDataFrame columnNames = self columnNames ])

ifFalse: [ ^ false ].

^ aDataFrame contents = self contents

]

{ #category : #adding }

DataFrame >> add: aDataSeries [

"Add DataSeries as a new row at the end"

self flag:

'This method name is not correct. It is misleading. We should think if we should delete it or keep it'.

self addRow: aDataSeries

]

{ #category : #adding }

DataFrame >> addColumn: aDataSeries [

"Add DataSeries as a new column at the end"

self addColumn: aDataSeries named: aDataSeries name.

(self dataTypes ) at: aDataSeries name put: aDataSeries calculateDataType

]

{ #category : #adding }

DataFrame >> addColumn: aDataSeries atPosition: aNumber [

"Add DataSeries as a new column at the given position"

self addColumn: aDataSeries asArray named: aDataSeries name atPosition: aNumber

]

{ #category : #adding }

DataFrame >> addColumn: anArray named: aString [

"Add a new column at the end"

self addColumn: anArray named: aString atPosition: self numberOfColumns + 1

]

{ #category : #adding }

DataFrame >> addColumn: anArray named: aString atPosition: aNumber [

"Add a new column at the given position"

(self columnNames includes: aString)

ifTrue: [ Error signal: 'A column with that name already exists' ].

contents addColumn: anArray asArray atPosition: aNumber.

columnNames add: aString afterIndex: aNumber - 1.

dataTypes at: aString put: (anArray asDataSeries calculateDataType)

]

{ #category : #adding }

DataFrame >> addEmptyColumnNamed: aString [

"Add an empty column at the end"

self addEmptyColumnNamed: aString atPosition: self numberOfColumns + 1

]

{ #category : #adding }

DataFrame >> addEmptyColumnNamed: aString atPosition: aNumber [

"Add an empty column at the given position"

self addColumn: (Array new: self numberOfRows) named: aString atPosition: aNumber

]

{ #category : #adding }

DataFrame >> addEmptyRowNamed: aString [

"Add an empty row at the end"

self addEmptyRowNamed: aString atPosition: self numberOfRows + 1

]

{ #category : #adding }

DataFrame >> addEmptyRowNamed: aString atPosition: aNumber [

"Add an empty row at the given position"

self addRow: (Array new: self numberOfColumns) named: aString atPosition: aNumber

]

{ #category : #adding }

DataFrame >> addRow: aDataSeries [

"Add DataSeries as a new row at the end"

self addRow: aDataSeries asArray named: aDataSeries name

]

{ #category : #adding }

DataFrame >> addRow: aDataSeries atPosition: aNumber [

"Add DataSeries as a new row at the given position"

self addRow: aDataSeries named: aDataSeries name atPosition: aNumber

]

{ #category : #adding }

DataFrame >> addRow: anArray named: aString [

"Add a new row at the end"

self addRow: anArray named: aString atPosition: self numberOfRows + 1

]

{ #category : #adding }

DataFrame >> addRow: anArray named: aString atPosition: aNumber [

"Add a new row at the given position"

(self rowNames includes: aString)

ifTrue: [ Error signal: 'A row with that name already exists' ].

contents addRow: anArray atPosition: aNumber.

rowNames add: aString afterIndex: aNumber - 1

]

{ #category : #plot }

DataFrame >> addScatterPlotShapeToChart: aRSChart [

"Assume the first two columns of the receiver hold the x and y coordinates.

Requires Roassal 3. Add a scatter plot of those two columns to aRSChart and answer the scatter plot shape"

| scatterPlot x y |

x := (self columnAt: 1) asOrderedCollection.

y := (self columnAt: 2) asOrderedCollection.

scatterPlot := RSScatterPlot new x: x y: y.

aRSChart addPlot: scatterPlot.

^ scatterPlot

]

{ #category : #applying }

DataFrame >> applyElementwise: aBlock [

"Applies a given block to all columns of a data frame"

self toColumns: self columnNames applyElementwise: aBlock

]

{ #category : #enumerating }

DataFrame >> applySize [

"Answer a new instance of the receiver with the size of each element at each element position"

^ self collectWithIndex: [ :r :i |

DataSeries

withValues: (r values collect: [ : e | e ifNil: [ 0 ] ifNotNil: [ e size ]])

name: i ]

]

{ #category : #private }

DataFrame >> applyToAllColumns: aSymbol [

"Sends the unary selector, aSymbol, to all columns of DataFrame and collects the result into a DataSeries object. Used by statistical functions of DataFrame"

| series column |

series := DataSeries withValues:

(self columnNames collect: [ :colName |

column := self column: colName.

column perform: aSymbol ]).

series name: aSymbol.

series keys: self columnNames.

^ series

]

{ #category : #converting }

DataFrame >> asArray [

"Converts DataFrame to the array of rows"

^ self asArrayOfRows

]

{ #category : #converting }

DataFrame >> asArrayOfColumns [

"Converts DataFrame to the array of columns"

^ contents asArrayOfColumns.

]

{ #category : #converting }

DataFrame >> asArrayOfRows [

"Converts DataFrame to the array of rows"

^ contents asArrayOfRows

]

{ #category : #converting }

DataFrame >> asArrayOfRowsWithName [

"Answer an OrderedCollection where each item is an Array with:

- the name of that row, in first place,

- the contents of that row."

^ self rowNames withIndexCollect: [ :name :index |

Array streamContents: [ :stream |

stream nextPut: name;

nextPutAll: (self at: index) ] ]

]

{ #category : #accessing }

DataFrame >> at: aNumber [

"Returns the row of a DataFrame at row index aNumber"

^ self rowAt: aNumber

]

{ #category : #accessing }

DataFrame >> at: rowNumber at: columnNumber [

"Returns the value whose row index is rowNumber and column index is columnNumber"

^ contents at: rowNumber at: columnNumber

]

{ #category : #accessing }

DataFrame >> at: rowNumber at: columnNumber put: value [

"Replaces the original value of a DataFrame at row index rowNumber and column index columnNumber with a given value"

contents at: rowNumber at: columnNumber put: value

]

{ #category : #accessing }

DataFrame >> at: rowIndex at: columnIndex transform: aBlock [

"Evaluate aBlock on the value at the intersection of rowIndex and columnIndex and replace that value with the result"

| value |

value := self at: rowIndex at: columnIndex.

self at: rowIndex at: columnIndex put: (aBlock value: value)

]

{ #category : #accessing }

DataFrame >> at: aNumber transform: aBlock [

"Evaluate aBlock on the row at aNumber and replace that row with the result"

^ self rowAt: aNumber transform: aBlock

]

{ #category : #accessing }

DataFrame >> atAll: indexes [

"For polymorphism with other collections."

^ self rowsAt: indexes

]

{ #category : #statistics }

DataFrame >> average [

"Average is the ratio of sum of values in a set to the number of values in the set"

^ self applyToAllColumns: #average

]

{ #category : #'data-types' }

DataFrame >> calculateDataTypes [

self columns doWithIndex: [ :column :i | self dataTypes at: (self columnNames at: i) put: column calculateDataType ]

]

{ #category : #comparing }

DataFrame >> closeTo: aDataFrame [

aDataFrame species = self species

ifFalse: [ ^ false ].

aDataFrame dimensions = self dimensions

ifFalse: [ ^ false ].

(aDataFrame rowNames = self rowNames

and: [ aDataFrame columnNames = self columnNames ])

ifFalse: [ ^ false ].

1 to: self numberOfRows do: [ :i |

1 to: self numberOfColumns do: [ :j |

((self at: i at: j) closeTo: (aDataFrame at: i at: j))

ifFalse: [ ^ false ] ] ].

^ true

]

{ #category : #enumerating }

DataFrame >> collect: aBlock [

"Overrides the Collection>>collect to create DataFrame with the same number of columns as values in the first row"

| firstRow newDataFrame |

firstRow := aBlock value: (self rowAt: 1) copy.

newDataFrame := self class new: 0@firstRow size.

newDataFrame columnNames: firstRow keys.

self do: [:each | newDataFrame add: (aBlock value: each copy)].

^ newDataFrame

]

{ #category : #enumerating }

DataFrame >> collectWithIndex: aBlock [

"Like collect:, but aBlock also receives the index of each row. Creates a DataFrame with the same number of columns as values in the first resulting row"

| firstRow newDataFrame |

firstRow := aBlock value: (self rowAt: 1) copy value: 1.

newDataFrame := self class new: 0@firstRow size.

newDataFrame columnNames: firstRow keys.

self doWithIndex: [ : each : index | newDataFrame add: (aBlock value: each copy value: index) ].

^ newDataFrame

]

{ #category : #accessing }

DataFrame >> column: columnName [

"Answer the column with columnName as a DataSeries or signal an exception if a column with that name was not found"

| index |

index := self indexOfColumnNamed: columnName.

^ self columnAt: index

]

{ #category : #accessing }

DataFrame >> column: columnName ifAbsent: exceptionBlock [

"Answer the column with columnName as a DataSeries or evaluate exception block if a column with that name was not found"

| index |

index := self

indexOfColumnNamed: columnName

ifAbsent: [ ^ exceptionBlock value ].

^ self columnAt: index

]

{ #category : #accessing }

DataFrame >> column: columnName put: anArray [

"Replace the current values of column with columnName with anArray or signal an exception if a column with that name was not found"

| index |

index := self indexOfColumnNamed: columnName.

^ self columnAt: index put: anArray

]

{ #category : #accessing }

DataFrame >> column: columnName put: anArray ifAbsent: exceptionBlock [

"Replace the current values of column with columnName with anArray or evaluate exception block if a column with that name was not found"

| index |

index := self

indexOfColumnNamed: columnName

ifAbsent: [ ^ exceptionBlock value ].

^ self columnAt: index put: anArray

]

{ #category : #accessing }

DataFrame >> column: columnName transform: aBlock [

"Evaluate aBlock on the column with columnName and replace column with the result. Signal an exception if columnName was not found"

| column |

column := self column: columnName.

self column: columnName put: (aBlock value: column) asArray

]

{ #category : #accessing }

DataFrame >> column: columnName transform: aBlock ifAbsent: exceptionBlock [

"Evaluate aBlock on the column with columnName and replace column with the result. Evaluate exceptionBlock if columnName was not found"

| column |

column := self column: columnName ifAbsent: [ ^ exceptionBlock value ].

self column: columnName put: (aBlock value: column)

]

{ #category : #accessing }

DataFrame >> columnAt: aNumber [

"Returns the column of a DataFrame at column index aNumber"

^ (DataSeries withKeys: self rowNames values: (contents columnAt: aNumber))

name: (self columnNames at: aNumber);

yourself

]

{ #category : #accessing }

DataFrame >> columnAt: aNumber put: anArray [

"Replaces the column at column index aNumber with contents of the array anArray"

anArray size = self numberOfRows

ifFalse: [ SizeMismatch signal ].

contents columnAt: aNumber put: anArray

]

{ #category : #accessing }

DataFrame >> columnAt: aNumber transform: aBlock [

"Evaluate aBlock on the column at aNumber and replace that column with the result"

| column |

column := self columnAt: aNumber.

self columnAt: aNumber put: (aBlock value: column) asArray

]

{ #category : #accessing }

DataFrame >> columnNames [

"Returns the column names of a DataFrame"

^ columnNames

]

{ #category : #accessing }

DataFrame >> columnNames: aCollection [

"Sets the column names of a DataFrame with contents of the collection aCollection"

| type |

aCollection size = self numberOfColumns

ifFalse: [ SizeMismatch signal: 'Wrong number of column names' ].

aCollection asSet size = aCollection size

ifFalse: [ Error signal: 'All column names must be distinct' ].

self columnNames ifNotNil: [

self columnNames withIndexDo: [ :currentColumnName :i |

type := dataTypes at: currentColumnName.

dataTypes removeKey: currentColumnName.

dataTypes at: (aCollection at: i) put: type ] ].

columnNames := aCollection asOrderedCollection

]

{ #category : #accessing }

DataFrame >> columns [

"Returns a collection of all columns"

^ self asArrayOfColumns

]

{ #category : #accessing }

DataFrame >> columns: anArrayOfNames [

"Returns a collection of columns whose column names are present in the array anArrayOfNames"

| anArrayOfNumbers |

anArrayOfNumbers := anArrayOfNames

collect: [ :name |

self indexOfColumnNamed: name ].

^ self columnsAt: anArrayOfNumbers

]

{ #category : #accessing }

DataFrame >> columns: anArrayOfColumnNames put: anArrayOfArrays [

"Replaces the columns whose column names are present in the array anArrayOfColumnNames with the contents of the array of arrays anArrayOfArrays"

anArrayOfArrays size = anArrayOfColumnNames size

ifFalse: [ SizeMismatch signal ].

anArrayOfColumnNames with: anArrayOfArrays do: [ :name :array |

self column: name put: array ]

]

{ #category : #accessing }

DataFrame >> columnsAt: anArrayOfNumbers [

"Returns a collection of columns whose column indices are present in the array anArrayOfNumbers"

| newColumnNames |

newColumnNames := (anArrayOfNumbers collect: [ :i |

self columnNames at: i ]).

^ DataFrame

withDataFrameInternal: (self contents columnsAt: anArrayOfNumbers)

rowNames: self rowNames

columnNames: newColumnNames

]

{ #category : #accessing }

DataFrame >> columnsAt: anArrayOfNumbers put: anArrayOfArrays [

"Replaces the columns whose column indices are present in the array anArrayOfNumbers with the contents of the array of arrays anArrayOfArrays"

anArrayOfArrays size = anArrayOfNumbers size

ifFalse: [ SizeMismatch signal ].

anArrayOfNumbers with: anArrayOfArrays do: [ :index :array |

self columnAt: index put: array ]

]

{ #category : #accessing }

DataFrame >> columnsFrom: begin to: end [

"Returns a collection of columns whose column indices are present between begin and end"

| array |

array := begin < end

ifTrue: [ (begin to: end) asArray ]

ifFalse: [ (end to: begin) asArray reverse ].

^ self columnsAt: array

]

{ #category : #accessing }

DataFrame >> columnsFrom: firstNumber to: secondNumber put: anArrayOfArrays [

"Replaces the columns whose column indices are present between firstNumber and secondNumber with the contents of the array of arrays anArrayOfArrays"

| interval |

anArrayOfArrays size = ((firstNumber - secondNumber) abs + 1)

ifFalse: [ SizeMismatch signal ].

interval := secondNumber >= firstNumber

ifTrue: [ (firstNumber to: secondNumber) ]

ifFalse: [ (secondNumber to: firstNumber) reversed ].

interval withIndexDo: [ :columnIndex :i |

self columnAt: columnIndex put: (anArrayOfArrays at: i) ]

]

{ #category : #accessing }

DataFrame >> contents [

"Returns all the values of the DataFrame"

^ contents

]

{ #category : #copying }

DataFrame >> copyReplace: missingValue in2DCollectionBy: arrayOfReplacementValues [

"I am a 2D collection. Answer a copy of me in which every occurrence of missingValue is replaced by the element of arrayOfReplacementValues at the index of the column that contains it.

I am needed for the project pharo-ai/data-imputers. It can work without this method, but replacing the missing values would then take much longer"

| copy |

copy := self copy.

1 to: self numberOfColumns do: [ :columnIndex |

| replacementValue |

replacementValue := arrayOfReplacementValues at: columnIndex.

1 to: self numberOfRows do: [ :rowIndex | (self at: rowIndex at: columnIndex) = missingValue ifTrue: [ copy at: rowIndex at: columnIndex put: replacementValue ] ] ].

^ copy

]

{ #category : #statistics }

DataFrame >> correlationMatrix [

"Calculate a correlation matrix (correlation of every column with every column) using Pearson's correlation coefficient"

^ self correlationMatrixUsing: DataPearsonCorrelationMethod

]

{ #category : #statistics }

DataFrame >> correlationMatrixUsing: aCorrelationCoefficient [

"Calculate a correlation matrix (correlation of every column with every column) using the given correlation coefficient"

| numericalColumnNames correlationMatrix firstColumn secondColumn correlation |

numericalColumnNames := self columnNames select: [ :columnName |

(self column: columnName) isNumerical ].

numericalColumnNames ifEmpty: [

Error signal: 'This data frame does not have any numerical columns' ].

correlationMatrix := self class

withRowNames: numericalColumnNames

columnNames: numericalColumnNames.

1 to: numericalColumnNames size do: [ :i |

1 to: i - 1 do: [ :j |

firstColumn := self column: (numericalColumnNames at: i).

secondColumn := self column: (numericalColumnNames at: j).

correlation := firstColumn correlationWith: secondColumn using: aCorrelationCoefficient.

correlationMatrix at: i at: j put: correlation.

correlationMatrix at: j at: i put: correlation ] ].

1 to: numericalColumnNames size do: [ :i |

correlationMatrix at: i at: i put: 1 ].

^ correlationMatrix

]

{ #category : #accessing }

DataFrame >> crossTabulate: colName1 with: colName2 [

"Returns the cross tabulation of a column named colName1 with the column named colName2 of the DataFrame"

| col1 col2 |

col1 := self column: colName1.

col2 := self column: colName2.

^ col1 crossTabulateWith: col2

]

{ #category : #copying }

DataFrame >> dataPreProcessingEncodeWith: anEncoder [

"This method is here to speed up pharo-ai/data-preprocessing algos without coupling both projects."

| copy cache |

copy := self copy.

cache := IdentityDictionary new.

self columns doWithIndex: [ :dataSerie :columnIndex |

| category |

category := cache at: columnIndex ifAbsentPut: [ ((anEncoder categories at: columnIndex) collectWithIndex: [ :elem :index | elem -> index ]) asDictionary ].

dataSerie doWithIndex: [ :element :rowIndex |

copy at: rowIndex at: columnIndex put: (category at: element ifAbsent: [ AIMissingCategory signalFor: element ]) ] ].

^ copy

]

{ #category : #'data-types' }

DataFrame >> dataTypeOfColumn: aColumnName [

"Given a column name of the DataFrame, it returns the data type of that column"

^ dataTypes at: aColumnName

]

{ #category : #'data-types' }

DataFrame >> dataTypeOfColumn: aColumnName put: aDataType [

"Given a column name and a data type, it replaces the original data type of that column with the data type that was given as a parameter"

dataTypes at: aColumnName put: aDataType

]

{ #category : #'data-types' }

DataFrame >> dataTypeOfColumnAt: aNumber [

"Given a column index of the DataFrame, it returns the data type of that column"

^ self dataTypeOfColumn: (columnNames at: aNumber)

]

{ #category : #'data-types' }

DataFrame >> dataTypeOfColumnAt: aNumber put: aDataType [

"Given a column index and a data type, it replaces the original data type of that column with the data type that was given as a parameter"

^ self dataTypeOfColumn: (columnNames at: aNumber) put: aDataType

]

{ #category : #accessing }

DataFrame >> dataTypes [

"Returns the data types of each column"

^ dataTypes

]

{ #category : #accessing }

DataFrame >> dataTypes: anObject [

dataTypes := anObject

]

{ #category : #accessing }

DataFrame >> defaultHeadTailSize [

^ 5

]

{ #category : #accessing }

DataFrame >> dimensions [

"Returns the number of rows and number of columns in a DataFrame"

^ (self numberOfRows) @ (self numberOfColumns)

]

{ #category : #enumerating }

DataFrame >> do: aBlock [

"We enumerate through the data entries - through rows of a data frame"

| row |

1 to: self numberOfRows do: [ :i |

row := self rowAt: i.

aBlock value: row.

"A hack to allow modification of rows inside do block"

self rowAt: i put: row asArray ]

]

{ #category : #'find-select' }

DataFrame >> findAll: anObject atColumn: columnName [

"Returns rowNames of rows having anObject at columnName"

^ self rowNames select: [ :row | ((self column: columnName) at: row) = anObject ]

]

{ #category : #'find-select' }

DataFrame >> findAllIndicesOf: anObject atColumn: columnName [

"Returns indices of rows having anObject at columnName"

| output |

output := OrderedCollection new.

self rowNames withIndexDo: [ :row :index | ((self column: columnName) at: row) = anObject ifTrue: [ output add: index ]].

^ output

]

{ #category : #accessing }

DataFrame >> first [

"Returns the first row of the DataFrame"

^ self at: 1

]

{ #category : #statistics }

DataFrame >> firstQuartile [

"25% of the values in a set are smaller than or equal to the first Quartile of that set"

^ self applyToAllColumns: #firstQuartile

]

{ #category : #private }

DataFrame >> getJointColumnsWith: aDataFrame [

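"Answer the column names of the receiver followed by those of aDataFrame. Names present in both get the suffix _x (receiver) or _y (aDataFrame)"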

| columnIntersection outputColumns |

columnIntersection := (self columnNames intersection: (aDataFrame columnNames)) asSet.

outputColumns := OrderedCollection new.

self columnNames do: [ :column |

(columnIntersection includes: column)

ifTrue: [ outputColumns add: ('' join: {column, '_x'}) ]

ifFalse: [ outputColumns add: column ] ].

aDataFrame columnNames do: [ :column |

(columnIntersection includes: column)

ifTrue: [ outputColumns add: ('' join: {column, '_y'}) ]

ifFalse: [ outputColumns add: column ] ].

^ outputColumns

]

{ #category : #grouping }

DataFrame >> group: anAggregateColumnName by: aGroupColumnName aggregateUsing: aBlock [

"Group the values of the column named anAggregateColumnName by the unique values of the column named aGroupColumnName and aggregate them using aBlock. The result has the same name as anAggregateColumnName"
